What every AI engineer needs to know about GPUs — Charles Frye, Modal

21.7K views · Jul 20, 2025 · 19:52 min · Watch on YouTube ↗

Takeaway

To use GPUs well as an AI engineer, ignore latency and target high-throughput, low-precision matrix-matrix multiplications on tensor cores.

Summary

Charles Frye (Modal) argues that AI engineers should learn GPUs the way SQL developers learn indexes: not to build them, but to use them correctly. His one-line thesis: 'use the tensor cores, Luke.'
GPUs embrace high bandwidth, not low latency (latency scaling 'died during the Bush administration'). Within bandwidth, favor math bandwidth over memory bandwidth: aim for workloads limited by arithmetic rather than by data movement.
Concretely: target low-precision matrix-matrix multiplications on tensor cores; matrix-vector ops waste the hardware.
Open-weights serving stacks (vLLM, SGLang, TensorRT-LLM, Dynamo) are improving fast enough that self-hosting now makes economic sense for many workloads.
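
The matrix-matrix versus matrix-vector point can be made concrete with a rough arithmetic-intensity estimate (floating-point operations per byte moved). The sketch below is not from the talk. It assumes an idealized cost model in which each operand is read from memory once and the output is written once, and the function name `arithmetic_intensity` and the 4096-wide shapes are hypothetical choices for illustration.

```python
def arithmetic_intensity(m: int, k: int, n: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) product.

    Assumption: idealized model where A and B are each read once and the
    result is written once. Real kernels move more data than this.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# Hypothetical 4096 x 4096 weight matrix in 16-bit precision (2 bytes/element).
matrix_vector = arithmetic_intensity(4096, 4096, 1)     # one input vector
small_batch = arithmetic_intensity(4096, 4096, 64)      # 64 vectors batched together
matrix_matrix = arithmetic_intensity(4096, 4096, 4096)  # square matrix-matrix product

print(f"matrix-vector:       {matrix_vector:8.2f} FLOPs/byte")
print(f"batch of 64 vectors: {small_batch:8.2f} FLOPs/byte")
print(f"matrix-matrix:       {matrix_matrix:8.2f} FLOPs/byte")
```

Under this model the matrix-vector case does about one operation per byte moved, so the math units mostly wait on memory, while the square matrix-matrix case does over a thousand. Batching more inputs against the same weights moves a workload toward the matrix-matrix end. The exact break-even point depends on the specific GPU.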

gpus · tensor-cores · inference

Original description

Every programmer needs to know a few things about hardware, like processors, memory, and disks. Due to AI systems' extreme demand for mathematical processing power, AI engineers need to know a few things about GPUs -- the world's most popular high-throughput mathematical co-processor.

In this talk, I will explain the fundamental engineering constraints and design decisions that shape GPUs and trace those up to some counter-intuitive facts about the performance characteristics of AI systems, with actionable insights for their deployers and consumers.


---related links---