What every AI engineer needs to know about GPUs — Charles Frye, Modal
Takeaway
To use GPUs well as an AI engineer, ignore latency and target high-throughput, low-precision matrix-matrix multiplications on tensor cores.
Summary
- Charles Frye (Modal) argues AI engineers should learn GPUs the way SQL developers learn indexes: not to build them, but to use them correctly. His one-line thesis: 'use the tensor cores, Luke.'
- GPUs embrace high bandwidth, not low latency (latency scaling 'died during the Bush administration'); optimize for math bandwidth over memory bandwidth.
- Concretely: target low-precision matrix-matrix multiplications on tensor cores; matrix-vector ops waste the hardware.
- Open-weights serving stacks (vLLM, SGLang, TensorRT-LLM, Dynamo) are improving fast enough that self-hosting now makes economic sense for many workloads.
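
The gap between matrix-vector and matrix-matrix work can be made concrete with a rough arithmetic-intensity estimate (FLOPs per byte moved). This is a minimal sketch, not something from the talk: the function name, the 4096 x 4096 weight shape, the batch of 256, and the assumption that each operand and the output cross the memory bus exactly once are all illustrative.

```python
def matmul_arithmetic_intensity(m, k, n, bytes_per_element=2):
    """Estimate FLOPs per byte for an (m x k) @ (k x n) product.

    Assumes (hypothetically) that both inputs and the output are each
    moved through memory exactly once; real kernels will differ.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# Matrix-vector: a single token against a 4096 x 4096 weight matrix (16-bit values).
gemv = matmul_arithmetic_intensity(1, 4096, 4096)

# Matrix-matrix: a batch of 256 tokens against the same weights.
gemm = matmul_arithmetic_intensity(256, 4096, 4096)

# Lower precision: the same batch with 8-bit values moves half the bytes.
gemm_8bit = matmul_arithmetic_intensity(256, 4096, 4096, bytes_per_element=1)

print(f"matrix-vector:         {gemv:.2f} FLOPs/byte")
print(f"matrix-matrix:         {gemm:.2f} FLOPs/byte")
print(f"matrix-matrix (8-bit): {gemm_8bit:.2f} FLOPs/byte")
```

Under these assumptions the matrix-vector case does about 1 FLOP per byte while the batched case does about 228, because the weight matrix is loaded once and reused across the whole batch. Halving the element size doubles the ratio again, which is one way to read the advice to prefer low-precision matrix-matrix work.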
gpus, tensor-cores, inference
Original description
Every programmer needs to know a few things about hardware, like processors, memory, and disks. Due to AI systems' extreme demand for mathematical processing power, AI engineers need to know a few things about GPUs -- the world's most popular high-throughput mathematical co-processor. In this talk, I will explain the fundamental engineering constraints and design decisions that shape GPUs and trace those up to some counter-intuitive facts about the performance characteristics of AI systems, with actionable insights for their deployers and consumers.