Luminal - Search-Based Deep Learning Compilers - Joe Fioti

738 views · Jun 03, 2025 · 24:35 min

Takeaway

Reduce deep learning to ~12 primitives and let search-based compilers generate the fast code — a path to vastly simpler ML stacks that still hit peak hardware performance.

Summary

Luminal expresses any deep learning model as a DAG of just 12 primitive ops (exp2, log2, sin, reciprocal, sqrt; +, ×, mod, <; sum-reduce, max-reduce, plus shape tracking); matmul, conv, sub, and div are derived from them.
PyTorch has 1,200+ ops × 15 data types × many devices = multiplicative complexity (~3M LOC); Luminal is <5,000 LOC and emits CUDA directly with no cuDNN/cuBLAS dependency.
Out of the box, LLaMA 7B runs at hours per sentence; the strategy is search-based compilation that transforms primitive graphs into high-performance kernels.
Argues that hardware is getting simpler (CPU → GPU → TPU), so compilers must take on more of the work. Classical handwritten compilers grow superlinearly in complexity as kernels get more complex, which makes search-based generation the answer.
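To make the "derived ops" claim in the summary concrete, here is a minimal sketch in plain Python of how sub, div, exp, and matmul can be written using only a few of the listed primitives (add, multiply, reciprocal, exp2 i.e. 2^x, and sum-reduce). The function names are hypothetical stand-ins that operate on scalars and nested lists; they are not Luminal's API, and the real compiler works on tensors in a graph rather than on Python values.

```python
import math

# Scalar stand-ins for a few primitives (hypothetical names, not Luminal's API).
def add(a, b):
    return a + b

def mul(a, b):
    return a * b

def recip(a):
    return 1.0 / a

def exp2(a):
    return 2.0 ** a

def sum_reduce(values):
    # Fold a sequence with the add primitive.
    total = 0.0
    for v in values:
        total = add(total, v)
    return total

# Derived ops, built only from the primitives above.
def sub(a, b):
    return add(a, mul(b, -1.0))              # a - b = a + (b * -1)

def div(a, b):
    return mul(a, recip(b))                  # a / b = a * (1 / b)

def exp(a):
    return exp2(mul(a, math.log2(math.e)))   # e^a = 2^(a * log2(e))

def matmul(A, B):
    # Elementwise multiply along the shared axis, then sum-reduce it.
    inner = len(B)
    cols = len(B[0])
    return [
        [sum_reduce(mul(row[p], B[p][j]) for p in range(inner)) for j in range(cols)]
        for row in A
    ]
```

Calling `matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` multiplies two 2×2 matrices using nothing but `mul` and `sum_reduce`. The point of the sketch is that a small primitive set is expressive enough; making the result fast is left to the compiler.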

compilers · ml-systems · cuda

Original description

Luminal is a deep learning compiler for CPUs, GPUs, and ASICs that takes a search-first approach to discovering efficient kernels, such as flash attention, automatically.
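As a toy illustration of what a "search-first" approach means, the sketch below enumerates several functionally equivalent matmul kernels (the six possible loop nest orders), checks each against a reference result, times them, and keeps the fastest. This is an assumption-laden miniature written in plain Python, not Luminal's actual search procedure or search space; `make_kernel` and `search_loop_order` are hypothetical names introduced only for this example.

```python
import itertools
import time

def make_kernel(order):
    """Build a matmul kernel whose three loops nest in the given order.

    `order` is a permutation of "ijp": i indexes rows of A, j indexes
    columns of B, and p indexes the shared (reduction) axis.
    """
    def kernel(A, B, n, k, m):
        C = [[0.0] * m for _ in range(n)]
        bounds = {"i": n, "j": m, "p": k}
        idx = {}
        for a in range(bounds[order[0]]):
            idx[order[0]] = a
            for b in range(bounds[order[1]]):
                idx[order[1]] = b
                for c in range(bounds[order[2]]):
                    idx[order[2]] = c
                    i, j, p = idx["i"], idx["j"], idx["p"]
                    C[i][j] += A[i][p] * B[p][j]
        return C
    return kernel

def search_loop_order(A, B, n, k, m, repeats=3):
    """Time every loop order that matches the reference and return the fastest."""
    reference = make_kernel("ijp")(A, B, n, k, m)
    timings = {}
    for perm in itertools.permutations("ijp"):
        order = "".join(perm)
        kernel = make_kernel(order)
        # Only candidates that compute the same result are eligible.
        if kernel(A, B, n, k, m) != reference:
            continue
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(A, B, n, k, m)
            best = min(best, time.perf_counter() - start)
        timings[order] = best
    best_order = min(timings, key=timings.get)
    return best_order, timings
```

The winning order depends on the machine and the problem size, which is the motivation for searching rather than hard-coding a choice. A real compiler would search over a far larger space of rewrites (tiling, fusion, memory layout) and would emit GPU code rather than Python loops.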