Hacking the Inference Pareto Frontier - Kyle Kranen, NVIDIA

2.1K views · Aug 01, 2025 · 20:24 min

Takeaway

Pick your operating point on the latency/cost Pareto frontier per use case, then use Dynamo-style disaggregation to move the whole frontier outward.

Summary

Kyle Kranen of NVIDIA previously ran the largest NVIDIA inference deployment, with quarterly cloud spend in the multiple tens of millions, and now leads NVIDIA Dynamo, an open-source, data-center-scale inference orchestrator.
He frames deployments along three axes (quality, latency, and cost) and visualizes the trade-off as a Pareto frontier between tokens per second (TPS) per GPU, a proxy for cost, and per-user TPS, a proxy for responsiveness.
Each application has a different operating point: inference aimed at curing cancer is cost-insensitive; Cursor-style tab completion demands sub-second responsiveness; async commits need a third profile.
Dynamo uses techniques such as prefill/decode disaggregation to shift the entire Pareto frontier outward, rather than just picking a point on it.
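
To make the idea of a frontier and an operating point concrete, here is a minimal Python sketch. It is not taken from the talk or from Dynamo: the Config type, the function names, and every number in the sample data are hypothetical, chosen only to illustrate how dominated configurations drop out and how a responsiveness floor selects a point on the curve.

```python
from typing import NamedTuple, Optional


class Config(NamedTuple):
    name: str            # hypothetical label for a serving configuration
    user_tps: float      # tokens/s seen by one user (responsiveness)
    tps_per_gpu: float   # tokens/s per GPU across all users (cost efficiency)


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Keep only configs that no other config matches or beats on both axes."""
    frontier = []
    best_tps_per_gpu = float("-inf")
    # Walk from most to least responsive; a config survives only if it is
    # strictly more GPU-efficient than every config at least as responsive.
    for c in sorted(configs, key=lambda c: (-c.user_tps, -c.tps_per_gpu)):
        if c.tps_per_gpu > best_tps_per_gpu:
            frontier.append(c)
            best_tps_per_gpu = c.tps_per_gpu
    return frontier


def pick_operating_point(frontier: list[Config], min_user_tps: float) -> Optional[Config]:
    """Most GPU-efficient frontier point that still meets the responsiveness floor."""
    ok = [c for c in frontier if c.user_tps >= min_user_tps]
    return max(ok, key=lambda c: c.tps_per_gpu) if ok else None


# Made-up measurements, for illustration only.
configs = [
    Config("batch=1", 120, 150),
    Config("batch=8", 80, 600),
    Config("batch=8-untuned", 70, 500),   # dominated by "batch=8"
    Config("batch=64", 30, 2000),
    Config("batch=256", 10, 4000),
]

frontier = pareto_frontier(configs)
interactive = pick_operating_point(frontier, min_user_tps=50)   # e.g. tab completion
background = pick_operating_point(frontier, min_user_tps=5)     # e.g. async work
```

In this toy data the interactive use case lands on "batch=8" and the background one on "batch=256". Picking a point only moves along the existing curve; a technique that shifts the frontier would change the measurements themselves.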

inference nvidia dynamo

Original description

Your model works! It aces the evals! It even passes the vibe check! All that’s required is inference, right? Oops, you’ve just stepped into a minefield:

-Not low-latency enough? Choppy experience. Users churn from your app.
-Not cheap enough? You’re losing money on every query.
-Not high enough output quality? Your system can’t be used for that application.

A model and the inference system around it form a “token factory” associated with a Pareto frontier: a curve representing the best possible trade-offs between cost, throughput, latency, and quality, outside of which your LLM system cannot be applied successfully.

Outside of the Pareto frontier? You’re back to square one.
That is, unless you’re able to change the shape of the Pareto frontier.

In this session, we’ll introduce NVIDIA Dynamo, a datacenter-scale distributed inference framework, as well as the bleeding-edge techniques it enables to hack the Pareto frontier of your inference systems, including:

-Disaggregation - separating phases of LLM generation to make them more efficient
-Speculation - predicting multiple tokens per cycle
-KV routing, storage, and manipulation - ensuring that we don’t redo work that has already been done
-Pipelining improvements for agents - accelerating our workflows using information about the agent
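
As an illustration of the KV routing item above, the following Python sketch sends a request to the worker that can reuse the longest cached token prefix, minus a penalty for how busy that worker is. This is an assumption-laden toy, not Dynamo's API or its actual scoring: the function names, the load_weight parameter, and the cache layout (a list of token sequences per worker) are all hypothetical.

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(request_tokens, worker_caches, worker_load, load_weight=1.0):
    """Pick the worker with the best (reused prefix tokens - load penalty) score.

    worker_caches: worker name -> list of cached token sequences (hypothetical layout)
    worker_load:   worker name -> number of requests currently in flight
    """
    def score(worker):
        # Longest prefix of this request that the worker would not have to recompute.
        reused = max(
            (shared_prefix_len(request_tokens, cached) for cached in worker_caches[worker]),
            default=0,
        )
        return reused - load_weight * worker_load[worker]

    return max(worker_caches, key=score)


# Toy state: w0 holds most of the request's prefix, w1 holds a shorter one.
worker_caches = {"w0": [[1, 2, 3, 4]], "w1": [[1, 2, 9]], "w2": []}
request = [1, 2, 3, 4, 5, 6]

choice_idle = route(request, worker_caches, {"w0": 1, "w1": 0, "w2": 0})
choice_busy = route(request, worker_caches, {"w0": 5, "w1": 0, "w2": 0})
```

With w0 lightly loaded, the request goes to w0 and reuses four cached tokens; once w0 is busy enough, the shorter prefix on w1 wins. A real router would weigh these terms differently and track cache contents at block granularity, which this sketch does not attempt.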

By the end of the talk, we’ll understand how the Pareto frontier limits where models can be applied, the intuition behind how inference techniques can be used to modify it, as well as the mechanics of how these techniques work.

---related links---

https://x.com/kranenkyle
https://www.linkedin.com/in/kyle-kranen/
https://www.nvidia.com/en-us/

Timestamps:

00:00 Introduction to Breaking the Inference Pareto Frontier
00:33 Introduction of Kyle Kranen and NVIDIA Dynamo
01:31 The Three Pillars of Deployment (Quality, Latency, Cost)
02:11 Understanding the Pareto Frontier
03:06 Application-Specific Prioritization of Quality, Latency, and Cost
04:32 Common Techniques to Manipulate the Pareto Frontier (Quantization, RAG, Reasoning)
05:19 Compounding Techniques
06:04 Three Drivers for Modifying the Pareto Frontier (Scale, Structure, Dynamism)
06:20 Scale: Disaggregation
11:02 Scale: Routing
13:00 Structure: Inference Time Scaling
16:14 Structure: KV Manipulation
17:43 Dynamism: Worker Specialization
18:42 Dynamism: Dynamic Load Balancing
19:55 Conclusion and NVIDIA Dynamo Resources