Luminal - Search-Based Deep Learning Compilers - Joe Fioti
Takeaway
Reduce deep learning to ~12 primitives and let search-based compilers generate the fast code — a path to vastly simpler ML stacks that still hit peak hardware performance.
Summary
- Luminal expresses any deep learning model as a DAG of just 12 primitive ops: unary (exp2, log2, sin, reciprocal, sqrt), binary (+, ×, mod, <), and reductions (sum-reduce, max-reduce), plus shape tracking. Matmul, conv, sub, and div are derived from these.
- PyTorch has 1,200+ ops × 15 data types × many devices = multiplicative complexity (~3M LOC); Luminal is <5,000 LOC and emits CUDA directly with no cuDNN/cuBLAS dependency.
- Out of the box, with no optimization, LLaMA 7B is hours-per-sentence slow; the strategy for closing that gap is search-based compilation, which transforms primitive graphs into high-performance kernels.
- Argues that hardware is getting simpler (CPU → GPU → TPU), which means compilers must take on more. Classical handwritten compilers, however, scale superlinearly with kernel complexity, making search-based generation the answer.
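
To make the "derived ops" point concrete, here is a minimal sketch of how sub, div, exp, and matmul could be built from a handful of the primitives listed above. It is written in plain Python on floats and nested lists, not against Luminal's actual API; the function names (`recip`, `sum_reduce`, `exp2`, and so on) are illustrative assumptions, and the real library may derive these ops differently.

```python
import math

# Primitive ops: a subset of the 12 named in the summary, modelled on
# plain Python floats. All names here are illustrative, not Luminal's API.
def exp2(x):
    return 2.0 ** x

def recip(x):
    return 1.0 / x

def add(a, b):
    return a + b

def mul(a, b):
    return a * b

def sum_reduce(values):
    total = 0.0
    for v in values:
        total = add(total, v)
    return total

# Derived ops: each one is expressed only in terms of the primitives above.
def sub(a, b):
    # a - b == a + (b * -1)
    return add(a, mul(b, -1.0))

def div(a, b):
    # a / b == a * (1 / b)
    return mul(a, recip(b))

def exp(x):
    # e^x == 2^(x * log2(e)); log2(e) is a compile-time constant.
    return exp2(mul(x, math.log2(math.e)))

def matmul(A, B):
    # Multiply along the shared dimension, then sum-reduce over it.
    rows, inner, cols = len(A), len(B), len(B[0])
    return [
        [sum_reduce(mul(A[i][p], B[p][j]) for p in range(inner))
         for j in range(cols)]
        for i in range(rows)
    ]
```

Calling `matmul([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])` runs nothing but `mul`, `add`, and `sum_reduce` underneath. That is the property the summary relies on: a compiler only has to understand the primitives to reason about every derived op.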
compilers, ml-systems, cuda
Original description
Luminal is a deep learning compiler for CPUs, GPUs, and ASICs that takes a search-first approach to discovering efficient kernels, such as flash attention, automatically.
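
The search-first idea can be illustrated with a toy empirical search: enumerate several semantically equivalent implementations of one operation, time each, and keep the fastest. The sketch below does this for the six loop orderings of a naive matmul in plain Python. It is not Luminal's search procedure, which this page does not describe; `matmul_with_order` and `search_fastest` are hypothetical names, and a real compiler would search over graph rewrites and generated kernels rather than Python loops.

```python
import itertools
import time

def matmul_with_order(A, B, order):
    """Naive matmul whose three loops are nested in the given order.

    `order` is a permutation of ("i", "j", "p"): i indexes rows of A,
    j indexes columns of B, p indexes the shared dimension.
    """
    n, k, m = len(A), len(B), len(B[0])
    extents = {"i": n, "j": m, "p": k}
    C = [[0.0] * m for _ in range(n)]
    idx = {}
    for a in range(extents[order[0]]):
        idx[order[0]] = a
        for b in range(extents[order[1]]):
            idx[order[1]] = b
            for c in range(extents[order[2]]):
                idx[order[2]] = c
                i, j, p = idx["i"], idx["j"], idx["p"]
                C[i][j] += A[i][p] * B[p][j]
    return C

def search_fastest(A, B, repeats=3):
    """Time every loop ordering and return (best_order, best_seconds)."""
    best_order, best_time = None, float("inf")
    for order in itertools.permutations("ijp"):
        start = time.perf_counter()
        for _ in range(repeats):
            matmul_with_order(A, B, order)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_order, best_time = order, elapsed
    return best_order, best_time
```

Every candidate computes the same result, so the search is free to choose purely on measured speed. The winning order depends on the machine and the input sizes, which is why the choice is measured rather than hard-coded.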