⚡ Inference & Serving
Throughput and latency engineering. Continuous batching, paged attention, quantization, speculative decoding, vLLM/TensorRT/SGLang.
The workflow
flowchart LR
A[Request batch] --> B[Tokenize +<br/>continuous batching]
B --> C[KV-cache<br/>paged attention]
C --> D{Optimization}
D -->|Quantize| E[INT8 / FP8 / GPTQ]
D -->|Distill| F[Smaller model]
D -->|Speculative| G[Draft + verify]
E --> H[Throughput +<br/>latency targets]
F --> H
G --> H
The 10× cost-performance gains come from batching, paged attention, and quantization, not from bigger GPUs.
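To make the batching part of that claim concrete, here is a toy sketch (not how vLLM, TensorRT-LLM, or SGLang schedule work). It counts decode steps for static batching, where a batch runs until its longest request finishes, and for continuous batching, where a freed slot is refilled before the next step. The function names and the request lengths are hypothetical, and the model ignores prefill cost, KV-cache limits, and per-step latency.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots from the queue before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; requests that just
        # produced their last token leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical output lengths (in tokens) for eight queued requests.
lengths = [2, 8, 3, 8, 2, 2, 8, 3]
print("static:", static_batching_steps(lengths, batch_size=4))
print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

With mixed output lengths, the continuous scheduler needs fewer steps because short requests no longer hold a slot idle while the longest request in their batch finishes.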
Key takeaways
Videos (28)
Llamafile: bringing AI to the masses with fast CPU inference: Stephen Hood and Justine Tunney
Llamafile turns LLM weights into a portable, hardware-agnostic executable while pushing CPU inference performance close to GPU territory through targeted matmul optimizations.
Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou
Cost-effective LLM serving requires understanding the compute-vs-memory-bound dichotomy of prefill vs decode and sizing GPUs accordingly.
The Small Model Infrastructure Nobody Built (So We Did) — Filip Makraduli, Superlinked
Small-model inference for retrieval is a distinct infra problem from LLM serving, and existing engines underserve it.
TLMs: Tiny LLMs and Agents on Edge Devices with LiteRT-LM — Cormac Brick, Google
For narrow tasks like function calling, summarization or transcription, fine-tuned 100–500M parameter TLMs can be reliable enough to ship in-app while keeping data on device.
What every AI engineer needs to know about GPUs — Charles Frye, Modal
To use GPUs well as an AI engineer, ignore latency and target high-throughput, low-precision matrix-matrix multiplications on tensor cores.
Running LLMs on your iPhone: 40 tok/s Gemma 4 with MLX — Adrien Grondin, Locally AI
Modern iPhones plus MLX make on-device LLMs (Gemma 4 at 40 tok/s) production-viable; quantize to 4-8 bits and use MLX Community weights for the fastest path.
Fun stories from building OpenRouter and where all this is going - Alex Atallah, OpenRouter
Inference is going multi-model, and a normalized cross-provider API/marketplace beats lock-in for both pricing and uptime.
System Design for Next-Gen Frontier Models — Dylan Patel, SemiAnalysis
Frontier-model inference is a memory-bandwidth game — continuous batching, disaggregated prefill, and context caching are the levers that keep cost and latency manageable.
From model weights to API endpoint with TensorRT LLM: Philip Kiely and Pankaj Gupta
TensorRT-LLM is the path to maximum NVIDIA-GPU efficiency for LLM serving, but extracting its 10x throughput wins requires careful per-shape engine building and benchmarking.
Accelerating AI on Edge — Chintan Parikh and Weiyi Wang, Google DeepMind
Gemma 4 brings function calling, structured outputs, and thinking mode to under-2GB on-device models, making private edge agents practical today.
Introduction to LLM serving with SGLang - Philip Kiely and Yineng Zhang, Baseten
SGLang offers vLLM-class throughput with day-0 model support and a strong OSS community, making it a credible production serving stack.
Why MLX — Prince Canuma, Neywa Labs
MLX is reaching the point where production-quality vision, voice, and even hundred-billion-parameter chat run fully on-device on Apple Silicon — replacing cloud subscriptions.
From Mixture of Experts to Mixture of Agents with Super Fast Inference - Daniel Kim & Daria Soboleva
Mixture-of-Agents on fast Cerebras inference lets teams approximate MoE-style specialization at the application layer without pretraining from scratch.
Breaking AI's 1-GHz Barrier: Sunny Madra (Groq)
Token-throughput speedups are not just a UX improvement — fast inference will make the LLM the kernel of new computing platforms.
Running LLMs locally: Practical LLM Performance on DGX Spark — Mozhgan Kabiri Chimeh, NVIDIA
DGX Spark with NVFP4 quantization lets developers run 14B-class models locally at user-facing speeds and scale the same stack to production.
Build enterprise generative AI apps using Llama 3 at 1,000 tokens/s on the SambaNova AI platform
SambaNova bundles its own chip, system, and software into a turnkey enterprise inference stack hitting 1000 tokens/sec on Llama-3 with sovereign-AI privacy.
[Workshop] AI Engineering 201: Inference
Production inference engineering hinges on understanding the tensor-to-tensor bottleneck and trading off proprietary capability vs open-model control and cost.
Optimizing inference for voice models in production - Philip Kiely, Baseten
Apply LLM-stack optimizations (TRT-LLM, FP8, dynamic batching) to TTS LLM backbones, and target time to first byte (TTFB) and concurrency for cheap production voice agents.
A Practical Guide to Efficient AI: Shelby Heinecke
Production AI economics are won on five efficiency axes — and starting with a small model plus quantization beats brute-forcing a frontier giant.
The Rise of Open Models in the Enterprise — Amir Haghighat, Baseten
Enterprise AI's next phase is open-weights models running on customer infra, mirroring the 2010s shift from buying Salesforce to building on Snowflake.
Harnessing the Power of LLMs Locally: Mithun Hunsur
llm.rs offers a Rust-native, library-first alternative to llama.cpp for embedding local quantized LLMs into applications with full control over inference.
Hacking the Inference Pareto Frontier - Kyle Kranen, NVIDIA
Pick your operating point on the latency/cost Pareto frontier per use case, then use Dynamo-style disaggregation to move the whole frontier outward.
Making Open Models 10x faster and better for Modern Application Innovation: Dmytro (Dima) Dzhulgakov
Open-model inference becomes 10x cheaper and faster than frontier APIs when you co-design the serving stack, optimization knobs, and per-tenant LoRA packing.
How fast are LLM inference engines anyway? — Charles Frye, Modal
Self-hosting open models on Modal-style infra is now a serious option — use the public LLM Almanac to pick the right engine/config for your latency/throughput SLO.
Dream Machine: Scaling to 1m users in 4 days — Keegan McCallum, Luma AI
Scaling generative-video inference to 1M users required ditching Triton for a PyTorch-based decoupled queue system with explicit back-pressure and fair scheduling across heterogeneous GPU pools.
Customized, production ready inference with open source models: Dmytro (Dima) Dzhulgakov
Customize and fine-tune open models on a dedicated serving stack to beat frontier models on cost and latency for narrow production workloads.
[Full Workshop] Llama 3 at 1,000 tok/s on the SambaNova AI Platform
SambaNova's custom-chip full stack delivers 1,000+ tok/s Llama 3 inference and pitches expert ensembles as the enterprise AI path.
Foundry Local: Cutting-Edge AI experiences on device with ONNX Runtime/Olive — Emma Ning, Microsoft
Foundry Local makes ONNX-optimized on-device LLM inference a one-CLI experience across Windows/macOS and NPU/GPU/CPU silicon for privacy-bound use cases.
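Several summaries above describe decode as memory-bound and recommend quantization. A rough back-of-envelope sketch of why those two points connect: if every generated token requires reading all model weights once, single-stream decode speed is capped at memory bandwidth divided by weight size. The function name, the 8B model size, and the 1000 GB/s bandwidth figure are illustrative assumptions, not measurements from any of the talks, and the estimate ignores KV-cache reads, activations, and compute limits.

```python
def decode_tokens_per_second(params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound on single-stream decode speed when weight reads dominate.

    Assumes each generated token streams all weights from memory once.
    """
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return bandwidth_gb_s / weight_gb


# Hypothetical 8B-parameter model on a device with 1000 GB/s of bandwidth.
for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    bound = decode_tokens_per_second(8, bytes_per_param, 1000.0)
    print(f"{name}: at most {bound:.1f} tok/s per stream")
```

Under these assumptions, halving the bytes per parameter doubles the ceiling. Batching helps for a related reason: one pass over the weights can serve every request in the batch, so aggregate throughput rises until the workload becomes compute-bound.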