
AI Kernel Generation: What's working, what's not, what's next – Natalie Serrino, Gimlet Labs

5.3K views · Dec 17, 2025 · 19:14 min
Takeaway

AI agents can already deliver double-digit kernel speedups on real workloads by iterating compile-run-profile loops, but they still struggle on the most complex kernels. The approach looks promising for cross-hardware porting.

Summary

  • Natalie Serrino (Gimlet Labs co-founder) is building an agentic inference cloud that splits LLM pipelines across heterogeneous hardware. AI kernel generation is the bridge for porting workloads to new chips when there are not enough human kernel experts to do it by hand.
  • The agent mimics a human kernel engineer's workflow: get the kernel to compile, then execute, then produce correct floating-point output, and only then iteratively optimize using profiling feedback. Watch out for measurement gotchas: warm-ups, cache clearing, kernel-launch vs execution time, and input sizing.
  • Demo: for a PyTorch workload targeting an H100, the agent found a candidate 22% faster than the torch.compile baseline in ~20 minutes.
  • Apple M4 Metal benchmark on KernelBench v0.1 (250 problems): the standalone agent averages a ~24-25% speedup. The sweet spot is moderately complex problems, with performance dropping on harder ones.
  • Examples: the agent wrote a fused C++ kernel for conv+softmax+bias+scale+sigmoid, yielding a 40% speedup, and rewrote PyTorch's average-pool-1d as a more optimized convolution on Metal for an 80% improvement.
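
The average-pool rewrite in the last bullet relies on a simple identity: average pooling is a convolution whose weights are all 1/kernel_size. Below is a minimal pure-Python sketch of that equivalence, not the agent's actual Metal kernel. The function names (`avg_pool1d`, `conv1d`, `avg_pool1d_as_conv`) are hypothetical, and the sketch assumes a single channel with no padding.

```python
import math
from typing import List, Sequence


def avg_pool1d(x: Sequence[float], kernel_size: int, stride: int) -> List[float]:
    """Reference: mean of each window (single channel, no padding)."""
    n_out = (len(x) - kernel_size) // stride + 1
    return [
        sum(x[i * stride : i * stride + kernel_size]) / kernel_size
        for i in range(n_out)
    ]


def conv1d(x: Sequence[float], weights: Sequence[float], stride: int) -> List[float]:
    """Plain strided 1D convolution (cross-correlation, as in deep learning)."""
    k = len(weights)
    n_out = (len(x) - k) // stride + 1
    return [
        sum(w * x[i * stride + j] for j, w in enumerate(weights))
        for i in range(n_out)
    ]


def avg_pool1d_as_conv(x: Sequence[float], kernel_size: int, stride: int) -> List[float]:
    """Average pooling expressed as a convolution with uniform weights 1/k."""
    return conv1d(x, [1.0 / kernel_size] * kernel_size, stride)


def same_result(a: Sequence[float], b: Sequence[float]) -> bool:
    """Compare with a tolerance: the two forms round differently."""
    return len(a) == len(b) and all(math.isclose(p, q, rel_tol=1e-9) for p, q in zip(a, b))
```

For example, `same_result(avg_pool1d(x, 3, 1), avg_pool1d_as_conv(x, 3, 1))` should hold for any list `x` of at least three floats. The rewrite can pay off because a backend's convolution path may be more heavily tuned than its pooling path; whether it does on a given device has to be measured.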
kernels · performance · gpu
Original description
In this talk, we'll discuss how AI-generated kernels can meaningfully speed up custom PyTorch code, without any human effort.

Lots of great frameworks exist to optimize PyTorch with programmatic optimizations, such as Triton and MLX. But the strongest AI performance gains come from hand-written, low-level kernels that are targeted to the exact device and workload. These are tedious and time-consuming to write, especially when supporting multiple platforms. What if we could automate this process with AI?

We'll cover best practices for generating low-level kernels with AI, from how to test and validate the kernels to what types of agents and context are needed to get the best results. We'll also cover our research in which this approach improved PyTorch inference performance on Apple devices.
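
As a rough illustration of the test-and-validate step, here is a minimal CPU-only sketch that checks a candidate against a reference with a floating-point tolerance and only then times both, with warm-up runs. All names (`outputs_match`, `median_runtime`, `evaluate_candidate`) are hypothetical and this is not Gimlet Labs' harness. It also assumes synchronous execution: on a GPU you would additionally need to synchronize the device before reading the clock, and possibly clear caches between runs.

```python
import math
import statistics
import time
from typing import Callable, Optional, Sequence, Tuple


def outputs_match(ref: Sequence[float], out: Sequence[float],
                  rel_tol: float = 1e-5, abs_tol: float = 1e-8) -> bool:
    """Tolerance-based check; optimized kernels are rarely bit-identical."""
    return len(ref) == len(out) and all(
        math.isclose(r, o, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, o in zip(ref, out)
    )


def median_runtime(fn: Callable[[], object], warmup: int = 3, iters: int = 10) -> float:
    """Median wall-clock seconds per call, measured after warm-up runs."""
    for _ in range(warmup):  # absorb one-time costs such as lazy init and cold caches
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def evaluate_candidate(reference: Callable, candidate: Callable,
                       inputs) -> Tuple[bool, Optional[float]]:
    """Return (is_correct, speedup); speedup is None if the candidate is wrong."""
    if not outputs_match(reference(inputs), candidate(inputs)):
        return False, None  # never report timings for an incorrect kernel
    t_ref = median_runtime(lambda: reference(inputs))
    t_cand = median_runtime(lambda: candidate(inputs))
    speedup = t_ref / t_cand if t_cand > 0 else float("inf")
    return True, speedup
```

Checking correctness before timing matters because a candidate that skips work can look fast while producing wrong output. Using realistically sized inputs matters for the same reason: on tiny inputs, fixed overhead dominates the measurement.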

Speaker:  Natalie Serrino  |  Cofounder, Gimlet Labs
https://x.com/nserrino
https://www.linkedin.com/in/natalieserrino/