System Design for Next-Gen Frontier Models — Dylan Patel, SemiAnalysis

6.7K views · Feb 11, 2025 · 18:29 min

Takeaway

Frontier-model inference is a memory-bandwidth game — continuous batching, disaggregated prefill, and context caching are the levers that keep cost and latency manageable.

Summary

Inference splits into prefill (compute-bound, ~1 PFLOP per 2K-token prompt) and decode (memory-bandwidth-bound, because all weights must be re-read for every generated token), which is why output tokens are typically priced 3-4x higher than input tokens.
Serving GPT-4 (~1.8T params) takes tens of TB/s of memory bandwidth: about 60 TB/s for 64 users at 30 tok/sec each, while a single H100 delivers only ~3 TB/s, so decode becomes the systems bottleneck.
Continuous batching (which the talk describes as missing from llama.cpp) is mandatory to avoid 10-100x cost penalties when serving multiple concurrent users.
Disaggregated prefill — dedicating separate accelerators to prefill vs decode — is now used by Google, likely OpenAI/Anthropic, Together and Fireworks to isolate noisy neighbors and protect SLAs.
Context caching (launched by Google) is a major unlock: a large shared prompt is prefilled once and reused across requests, so stuffing the context becomes a cheaper alternative to fine-tuning.
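
The figures in the summary can be sanity-checked with back-of-envelope arithmetic. The sketch below is illustrative only and rests on assumptions that are not stated in the summary: roughly 2 FLOPs per active parameter per token for a forward pass, a hypothetical ~280B active parameters per token, and 1 byte per weight (8-bit). Under those assumptions it lands near ~1 PFLOP for a 2K-token prefill and about 54 TB/s for decode at 30 tok/sec, the same ballpark as the ~60 TB/s quoted above.

```python
import math

# ASSUMPTIONS (not from the talk summary): active parameter count and weight precision.
ASSUMED_ACTIVE_PARAMS = 280e9   # hypothetical active params per token
ASSUMED_BYTES_PER_PARAM = 1.0   # hypothetical 8-bit weights
TOTAL_PARAMS = 1.8e12           # ~1.8T params, from the summary
H100_BANDWIDTH = 3e12           # ~3 TB/s, from the summary


def prefill_flops(active_params: float, prompt_tokens: int) -> float:
    """Rule of thumb: ~2 FLOPs per active parameter per prompt token."""
    return 2.0 * active_params * prompt_tokens


def decode_bandwidth(total_params: float, bytes_per_param: float,
                     tokens_per_s_per_user: float) -> float:
    """Bytes/sec needed if every decode step re-reads all weights once.

    The whole batch shares one weight read per step, so the requirement
    depends on the per-user token rate, not on the number of users.
    """
    return total_params * bytes_per_param * tokens_per_s_per_user


def weight_bytes_per_user_token(total_params: float, bytes_per_param: float,
                                batch_size: int) -> float:
    """Weight traffic attributable to each user's token at a given batch size."""
    return total_params * bytes_per_param / batch_size


def min_gpus_for_bandwidth(required: float, per_gpu: float) -> int:
    """Lower bound on GPU count from memory bandwidth alone."""
    return math.ceil(required / per_gpu)


if __name__ == "__main__":
    need = decode_bandwidth(TOTAL_PARAMS, ASSUMED_BYTES_PER_PARAM, 30)
    print(f"prefill, 2K tokens: {prefill_flops(ASSUMED_ACTIVE_PARAMS, 2048):.3e} FLOPs")
    print(f"decode bandwidth:   {need / 1e12:.0f} TB/s")
    print(f"min H100s:          {min_gpus_for_bandwidth(need, H100_BANDWIDTH)}")
    solo = weight_bytes_per_user_token(TOTAL_PARAMS, ASSUMED_BYTES_PER_PARAM, 1)
    batched = weight_bytes_per_user_token(TOTAL_PARAMS, ASSUMED_BYTES_PER_PARAM, 64)
    print(f"batch 1 vs batch 64 weight traffic per token: {solo / batched:.0f}x")
```

The last line shows why batching matters so much for cost: in this simplified model, a batch of 64 users shares each weight read, so the weight traffic per generated token is 64x lower than at batch size 1, which sits inside the 10-100x range cited above. The model ignores KV-cache reads, interconnect, and compute limits, so it gives only a rough lower bound.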

inference · serving · gpu

Original description

Current and future hardware requirements for next generation frontier models.

Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025

About Dylan
Dylan Patel is the founder and Chief Analyst of SemiAnalysis, a semiconductor and AI research company. SemiAnalysis has analysts across the US, Japan, Taiwan, Singapore, and France covering the industry from production of materials, equipment, process technology, fabs to design IP and fabless to physical infrastructure of datacenters, networking, and AI models.