How fast are LLM inference engines anyway? — Charles Frye, Modal

1.8K views · Jun 27, 2025 · 16:07 min

Takeaway

Self-hosting open models on Modal-style infrastructure is now a serious option. Use the public LLM Almanac to pick the engine and configuration that meet your latency and throughput service-level objective (SLO).
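As a rough illustration of choosing a configuration against an SLO, the sketch below filters benchmark rows by a time-to-first-token ceiling and then takes the highest-throughput row that survives. The rows, field names, and numbers are invented for illustration and are not taken from the LLM Almanac; only the selection logic is the point.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BenchmarkRow:
    """One hypothetical benchmark result (not real Almanac data)."""
    engine: str
    model: str
    ttft_s: float          # time to first token, in seconds
    requests_per_s: float  # sustained throughput per replica


def pick_config(rows: Sequence[BenchmarkRow], max_ttft_s: float) -> Optional[BenchmarkRow]:
    """Return the highest-throughput row whose TTFT is under the ceiling."""
    eligible = [r for r in rows if r.ttft_s < max_ttft_s]
    if not eligible:
        return None  # no configuration meets the latency SLO
    return max(eligible, key=lambda r: r.requests_per_s)


# Placeholder rows: engine and model names are deliberately generic.
ROWS = [
    BenchmarkRow("engine-a", "model-x", ttft_s=0.25, requests_per_s=0.6),
    BenchmarkRow("engine-b", "model-x", ttft_s=0.80, requests_per_s=1.1),
    BenchmarkRow("engine-c", "model-x", ttft_s=1.70, requests_per_s=1.9),
]

if __name__ == "__main__":
    print(pick_config(ROWS, max_ttft_s=1.0))
```

Tightening the ceiling trades throughput for latency: with these placeholder rows, a 1 s ceiling selects engine-b, while a 0.3 s ceiling leaves only the slower engine-a.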

Summary

Modal benchmarked vLLM, SGLang, and TensorRT-LLM across ~10 open models and ~10 context lengths; results live at modal.com/llm-almanac with reproducible uvx commands.
Open weights (Llama, Qwen, DeepSeek) plus mature serving engines now make self-hosting practical for code completion, big-batch enrichment, or air-gapped/regulated use.
Example: Qwen3 MoE on vLLM with 128 input tokens and 1024 output tokens reaches ~1 request/sec/replica with time to first token (TTFT) under 1 second; Gemma-3-27B in BF16 lands at similar throughput despite ~10x smaller weights.
The demo includes filters for first-token latency under 1 s and for the 'Doherty threshold', a 300 ms SLO for interactive use.
TGI is RIP; the remaining engines ship speculative decoding, KV caching, paged attention, and multi-token prediction, features that are too hard to hand-roll.
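To put the Qwen3 example in token terms: at roughly 1 request/sec/replica with 1024 output tokens per request, a replica generates on the order of 1024 output tokens per second. The sketch below does that arithmetic and estimates a replica count for a target load. The 25 requests/sec target is a made-up figure, and the estimate assumes throughput scales linearly with replicas, which real deployments only approximate.

```python
import math


def output_tokens_per_s(requests_per_s: float, output_tokens: int) -> float:
    """Aggregate output tokens generated per second by one replica."""
    return requests_per_s * output_tokens


def replicas_needed(target_rps: float, per_replica_rps: float) -> int:
    """Replicas required for a target request rate, assuming linear scaling."""
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    return math.ceil(target_rps / per_replica_rps)


if __name__ == "__main__":
    # Figures from the summary: ~1 request/sec/replica, 1024 output tokens.
    print(output_tokens_per_s(1.0, 1024))
    # Hypothetical target load of 25 requests/sec.
    print(replicas_needed(25.0, 1.0))
```

The per-replica rate is approximate ("~1 request/sec"), so the result is a starting point for capacity planning and not a guarantee.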

vllm · sglang · benchmarks

Original description

Open weights models and open source inference servers have made massive strides in the year since we last got together at AIE World's Fair.

Where once we had only pirated LLaMA 2 weights and Transformers, we now have an embarrassment of riches. In fact, we have too many choices! What's an AI engineer looking to self-host inference to do?

In this session, we'll share our benchmarking results from hundreds of runs across models, frameworks, and hardware. We'll also share tips and tricks from working with teams deploying LLM inference at scale.

About Charles Frye
Charles teaches people to build data, ML, and AI applications. He got his PhD from the University of California, Berkeley, in 2020 for work on the geometry of neural network optimization. He has since worked as an educator and evangelist for neural network applications at Weights & Biases, Full Stack Deep Learning, and now Modal Labs.

Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter