
From model weights to API endpoint with TensorRT LLM: Philip Kiely and Pankaj Gupta

5.1K views · Sep 13, 2024 · 100:01 min
Takeaway

TensorRT-LLM is the path to maximum NVIDIA-GPU efficiency for LLM serving, but extracting its 10x throughput wins requires careful per-shape engine building and benchmarking.

Summary

  • Baseten workshop on TensorRT-LLM: how to take a model (Llama-3 8B is the demo) from weights to a high-throughput production endpoint on NVIDIA GPUs.
  • TensorRT operates on a model's computation graph and provides plugin mechanisms; TensorRT-LLM adds LLM-specific plugins such as flash attention and in-flight batching (roughly 10-20x throughput uplift).
  • Engine building requires optimization profiles covering input/output sequence lengths and batch sizes. TensorRT generates specialized kernels for each size, which is why compile times can run to hours.
  • Trade-offs: production-grade and supported on V100 and newer GPUs (A10/A100/H100), but parts are not fully open source and the learning curve is steep; benchmarking is critical before declaring success.
tensorrt-llm · inference · gpu
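
To make the in-flight batching point in the summary concrete, here is a minimal toy sketch in plain Python. It is not TensorRT-LLM code and calls no TensorRT-LLM API. The function names and the request lengths are hypothetical, and the model is deliberately simplified: every decode step is assumed to cost the same, and prefill and memory limits are ignored. It only counts how many decode steps are needed when a finished request's slot is refilled immediately, compared with waiting for the whole batch to finish.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Decode steps when each batch must fully finish before the next starts."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        batch = output_lengths[start:start + batch_size]
        steps += max(batch)  # the whole batch waits for its longest request
    return steps


def in_flight_batching_steps(output_lengths, batch_size):
    """Decode steps when a freed slot is refilled at the next iteration.

    Assumes every request needs at least one output token.
    """
    queue = deque(output_lengths)
    active = []  # tokens still to generate for each request in the batch
    steps = 0
    while queue or active:
        # Refill free slots from the waiting queue before each decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step yields one token per active request
        # Requests that just produced their last token leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


if __name__ == "__main__":
    lengths = [8, 2, 2, 2, 8, 2, 2, 2]  # hypothetical output tokens per request
    print("static:", static_batching_steps(lengths, batch_size=4))
    print("in-flight:", in_flight_batching_steps(lengths, batch_size=4))
```

With these hypothetical lengths and a batch size of 4, the static schedule takes 16 steps and the in-flight schedule takes 10, because short requests stop holding slots hostage to the longest one. Real gains depend on the traffic mix and hardware, which is why the benchmarking advice above still applies.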
Original description
TensorRT-LLM is the highest-performance model serving framework, but it can have a steep learning curve when you’re just getting started. We run TensorRT and TensorRT-LLM in production and have seen both the incredible performance gains it offers and the hurdles to overcome in getting it up and running. In this workshop, participants will learn how to start using TensorRT-LLM, including selecting a model to optimize, building an engine for it with TensorRT-LLM, setting batch sizes and sequence lengths, and running it on a cloud GPU.

Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025

About Philip
Philip Kiely is a software developer and author based out of Chicago. Originally from Clive, Iowa, he graduated from Grinnell College with honors in Computer Science. Philip joined Baseten in January 2022 and works across documentation, technical content, and developer experience. Outside of work, he's a lifelong martial artist, a voracious reader, and, unfortunately, a Bears fan.

About Pankaj
Pankaj Gupta is a co-founder of Baseten, where he leads model performance. Pankaj has spent his career making systems faster and more efficient, from optimizing data processing libraries at Twitter to search infrastructure at Uber and media processing at Adobe. A graduate of IIT Delhi, Pankaj now lives in the Bay Area, where he enjoys gardening and evening walks around his neighborhood.