Inference & Serving

Throughput and latency engineering. Continuous batching, paged attention, quantization, speculative decoding, vLLM/TensorRT/SGLang.

28 videos · inference · on-device · gpu · nvidia · gemma · serving

The workflow

flowchart LR
    A[Request batch] --> B[Tokenize +<br/>continuous batching]
    B --> C[KV-cache<br/>paged attention]
    C --> D{Optimization}
    D -->|Quantize| E[INT8 / FP8 / GPTQ]
    D -->|Distill| F[Smaller model]
    D -->|Speculative| G[Draft + verify]
    E --> H[Throughput +<br/>latency targets]
    F --> H
    G --> H

The 10× cost-perf gains come from batching, paged attention, and quantization — not bigger GPUs.
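To make the paged-attention step in the workflow concrete, here is a minimal sketch of the bookkeeping idea: the KV cache is split into fixed-size blocks, and each sequence keeps a table of the blocks it owns, so memory is allocated as tokens arrive rather than reserved up front. The class name `PagedKVCache` and its methods are hypothetical and only track block ids; this is not the vLLM API and it stores no tensors.

```python
class PagedKVCache:
    """Toy block allocator illustrating paged KV-cache bookkeeping (hypothetical)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids not in use
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when the last one is full (or none exists).
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token("a") for _ in range(3)]  # 3 tokens occupy 2 blocks
```

Because a finished sequence hands its blocks straight back to the pool, a continuous-batching scheduler can admit a new request as soon as enough blocks are free, which is where the batching and paging gains compound.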

Key takeaways

Llamafile turns LLM weights into a portable, hardware-agnostic executable while pushing CPU inference performance close to GPU territory through targeted matmul optimizations.
Cost-effective LLM serving requires understanding the compute-vs-memory-bound dichotomy of prefill vs decode and sizing GPUs accordingly.
Small-model inference for retrieval is a distinct infra problem from LLM serving, and existing engines underserve it.
For narrow tasks like function calling, summarization or transcription, fine-tuned 100–500M parameter TLMs can be reliable enough to ship in-app while keeping data on device.
To use GPUs well as an AI engineer, ignore latency and target high-throughput, low-precision matrix-matrix multiplications on tensor cores.
Modern iPhones plus MLX make on-device LLMs (Gemma 4 at 40 tok/s) production-viable; quantize to 4-8 bits and use MLX Community weights for the fastest path.
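The prefill-versus-decode takeaway above can be illustrated with a rough roofline estimate. The sketch below is a simplification under stated assumptions: it counts only dense weight traffic (about 2 FLOPs per weight per token, with each weight read once per forward pass) and ignores attention and KV-cache reads. The function names and the hardware figures (1e15 FLOP/s peak, 3e12 bytes/s bandwidth) are illustrative placeholders, not the specifications of any particular GPU.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_weight: float = 2.0) -> float:
    """Approximate FLOPs per byte of weight traffic for one forward pass.

    Each weight does ~2 FLOPs (multiply + add) per token processed and is
    read from memory once per pass, regardless of how many tokens share it.
    """
    return 2.0 * tokens_per_pass / bytes_per_weight


def bound(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float,
          bytes_per_weight: float = 2.0) -> str:
    """Classify a pass as compute-bound or memory-bound using the roofline ridge."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte where the two limits meet
    ai = arithmetic_intensity(tokens_per_pass, bytes_per_weight)
    return "compute-bound" if ai >= ridge else "memory-bound"


# Illustrative (assumed) hardware: 1e15 FLOP/s peak, 3e12 bytes/s bandwidth.
PEAK, BW = 1e15, 3e12
prefill = bound(2048, PEAK, BW)   # 2048 prompt tokens processed in one pass
decode = bound(1, PEAK, BW)       # one new token for a single sequence
batched = bound(64, PEAK, BW)     # one new token for each of 64 sequences
```

Under these assumptions, prefill lands on the compute side of the ridge while decode stays memory-bound even at a batch of 64, which is why batching more decode requests per weight read tends to improve cost per token.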

Videos (28)

Llamafile: bringing AI to the masses with fast CPU inference: Stephen Hood and Justine Tunney

Llamafile turns LLM weights into a portable, hardware-agnostic executable while pushing CPU inference performance close to GPU territory through targeted matmul optimizations.

48.5K views · Jul 16, 2024

Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou

Cost-effective LLM serving requires understanding the compute-vs-memory-bound dichotomy of prefill vs decode and sizing GPUs accordingly.

43.4K views · Jan 01, 2025

The Small Model Infrastructure Nobody Built (So We Did) — Filip Makraduli, Superlinked

Small-model inference for retrieval is a distinct infra problem from LLM serving, and existing engines underserve it.

25.8K views · May 05, 2026

TLMs: Tiny LLMs and Agents on Edge Devices with LiteRT-LM — Cormac Brick, Google

For narrow tasks like function calling, summarization or transcription, fine-tuned 100–500M parameter TLMs can be reliable enough to ship in-app while keeping data on device.

24.2K views · May 03, 2026

What every AI engineer needs to know about GPUs — Charles Frye, Modal

To use GPUs well as an AI engineer, ignore latency and target high-throughput, low-precision matrix-matrix multiplications on tensor cores.

21.7K views · Jul 20, 2025

Running LLMs on your iPhone: 40 tok/s Gemma 4 with MLX — Adrien Grondin, Locally AI

Modern iPhones plus MLX make on-device LLMs (Gemma 4 at 40 tok/s) production-viable; quantize to 4-8 bits and use MLX Community weights for the fastest path.

9.3K views · Apr 20, 2026

Fun stories from building OpenRouter and where all this is going - Alex Atallah, OpenRouter

Inference is going multi-model, and a normalized cross-provider API/marketplace beats lock-in for both pricing and uptime.

9.2K views · Jun 25, 2025

System Design for Next-Gen Frontier Models — Dylan Patel, SemiAnalysis

Frontier-model inference is a memory-bandwidth game — continuous batching, disaggregated prefill, and context caching are the levers that keep cost and latency manageable.

6.7K views · Feb 11, 2025

From model weights to API endpoint with TensorRT LLM: Philip Kiely and Pankaj Gupta

TensorRT-LLM is the path to maximum NVIDIA-GPU efficiency for LLM serving, but extracting its 10x throughput wins requires careful per-shape engine building and benchmarking.

5.1K views · Sep 13, 2024

Accelerating AI on Edge — Chintan Parikh and Weiyi Wang, Google DeepMind

Gemma 4 brings function calling, structured outputs, and thinking mode to under-2GB on-device models, making private edge agents practical today.

4.8K views · May 05, 2026

Introduction to LLM serving with SGLang - Philip Kiely and Yineng Zhang, Baseten

SGLang offers vLLM-class throughput with day-0 model support and a strong OSS community, making it a credible production serving stack.

4.4K views · Jul 26, 2025

Why MLX — Prince Canuma, Neywa Labs

MLX is reaching the point where production-quality vision, voice, and even hundred-billion-parameter chat run fully on-device on Apple Silicon — replacing cloud subscriptions.

4.2K views · May 11, 2026

From Mixture of Experts to Mixture of Agents with Super Fast Inference - Daniel Kim & Daria Soboleva

Mixture-of-Agents on fast Cerebras inference lets teams approximate MoE-style specialization at the application layer without pretraining from scratch.

4.1K views · Jun 27, 2025

Breaking AI's 1-GHz Barrier: Sunny Madra (Groq)

Token-throughput speedups are not just a UX improvement — fast inference will make the LLM the kernel of new computing platforms.

4.0K views · Oct 10, 2024

Running LLMs locally: Practical LLM Performance on DGX Spark — Mozhgan Kabiri chimeh, NVIDIA

DGX Spark with NVFP4 quantization lets developers run 14B-class models locally at user-facing speeds and scale the same stack to production.

3.8K views · Apr 10, 2026

Build enterprise generative AI apps using Llama 3 at 1,000 tokens/s on the SambaNova AI platform

SambaNova bundles its own chip, system, and software into a turnkey enterprise inference stack hitting 1000 tokens/sec on Llama-3 with sovereign-AI privacy.

3.8K views · Sep 11, 2024

[Workshop] AI Engineering 201: Inference

Production inference engineering hinges on understanding the tensor-to-tensor bottleneck and trading off proprietary capability vs open-model control and cost.

3.1K views · Nov 07, 2023

Optimizing inference for voice models in production - Philip Kiely, Baseten

Apply LLM-stack optimizations (TRT-LLM, FP8, dynamic batching) to TTS LLM backbones and target TTFB + concurrency for cheap production voice agents.

3.1K views · Jul 01, 2025

A Practical Guide to Efficient AI: Shelby Heinecke

Production AI economics are won on five efficiency axes — and starting with a small model plus quantization beats brute-forcing a frontier giant.

2.9K views · Nov 18, 2024

The Rise of Open Models in the Enterprise — Amir Haghighat, Baseten

Enterprise AI's next phase is open-weights models running on customer infra, mirroring the 2010s shift from buying Salesforce to building on Snowflake.

2.8K views · Jul 24, 2025

Harnessing the Power of LLMs Locally: Mithun Hunsur

llm.rs offers a Rust-native, library-first alternative to llama.cpp for embedding local quantized LLMs into applications with full control over inference.

2.4K views · Nov 22, 2023

Hacking the Inference Pareto Frontier - Kyle Kranen, NVIDIA

Pick your operating point on the latency/cost Pareto frontier per use case, then use Dynamo-style disaggregation to move the whole frontier outward.

2.1K views · Aug 01, 2025

Making Open Models 10x faster and better for Modern Application Innovation: Dmytro (Dima) Dzhulgakov

Open-model inference becomes 10x cheaper and faster than frontier APIs when you co-design serving stack, optimization knobs, and per-tenant LoRA packing.

1.8K views · Oct 09, 2024

How fast are LLM inference engines anyway? — Charles Frye, Modal

Self-hosting open models on Modal-style infra is now a serious option — use the public LLM Almanac to pick the right engine/config for your latency/throughput SLO.

1.8K views · Jun 27, 2025

Dream Machine: Scaling to 1m users in 4 days — Keegan McCallum, Luma AI

Scaling generative-video inference to 1M users required ditching Triton for a PyTorch-based decoupled queue system with explicit back-pressure and fair scheduling across heterogeneous GPU pools.

1.7K views · Jul 19, 2025

Customized, production ready inference with open source models: Dmytro (Dima) Dzhulgakov

Customize and fine-tune open models on a dedicated serving stack to beat frontier models on cost and latency for narrow production workloads.

1.6K views · Feb 16, 2025

[Full Workshop] Llama 3 at 1,000 tok/s on the SambaNova AI Platform

SambaNova's custom-chip full stack delivers 1,000+ tok/s Llama 3 inference and pitches expert ensembles as the enterprise AI path.

819 views · Feb 07, 2025

Foundry Local: Cutting-Edge AI experiences on device with ONNX Runtime/Olive — Emma Ning, Microsoft

Foundry Local makes ONNX-optimized on-device LLM inference a one-CLI experience across Windows/macOS and NPU/GPU/CPU silicon for privacy-bound use cases.

595 views · Jun 27, 2025