[Workshop] AI Engineering 201: Inference
Takeaway
Production inference engineering hinges on understanding the tensor-to-tensor bottleneck and trading off proprietary capability vs open-model control and cost.
Summary
- Charles Frye (Full Stack Deep Learning) breaks inference into tokenizer → tensor-to-tensor neural net → detokenizer; the network step is the engineering bottleneck
- Compares proprietary models (OpenAI and Anthropic top the LMSYS leaderboard) with open models (Llama, Mistral) along the build-vs-buy axis
- Covers self-hosted inference serving, SLAs/SLOs, accelerator choice, and on-device vs cloud execution
- Second half covers application architectures, monitoring, evaluation, and observability patterns emerging in production
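
The three-stage breakdown in the first bullet can be illustrated with a toy pipeline. This is a minimal sketch, not code from the workshop: the vocabulary, `toy_network`, and the greedy decoding loop are all hypothetical stand-ins, and a real system would replace `toy_network` with a forward pass through an actual model on an accelerator.

```python
from typing import List

# Hypothetical six-word vocabulary; real tokenizers use tens of thousands of subword tokens.
VOCAB = ["<eos>", "hello", "world", "inference", "is", "fast"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def tokenize(text: str) -> List[int]:
    """Stage 1: text in, integer token ids out."""
    return [TOKEN_TO_ID[word] for word in text.lower().split()]


def toy_network(token_ids: List[int]) -> List[float]:
    """Stage 2: token ids in, one logit per vocabulary entry out.

    Stand-in for the neural network. It simply favors the vocabulary entry
    after the last token, so the output is deterministic.
    """
    favored = (token_ids[-1] + 1) % len(VOCAB)
    return [1.0 if i == favored else 0.0 for i in range(len(VOCAB))]


def detokenize(token_ids: List[int]) -> str:
    """Stage 3: integer token ids in, text out."""
    return " ".join(VOCAB[i] for i in token_ids)


def generate(prompt: str, max_new_tokens: int = 8) -> str:
    """Greedy decoding: call the network once per new token until <eos>."""
    ids = tokenize(prompt)
    for _ in range(max_new_tokens):
        logits = toy_network(ids)  # the step that dominates cost in real systems
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == TOKEN_TO_ID["<eos>"]:
            break
        ids.append(next_id)
    return detokenize(ids)


if __name__ == "__main__":
    print(generate("hello world"))
```

Tokenizing and detokenizing are cheap table lookups here, as they are in practice. The loop calls the network once per generated token, which is why that middle stage is where serving effort concentrates.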
inference, serving, mlops
Original description
Optional introductory course for AI Engineers, free for all Summit attendees. Advanced knowledge of AI Engineering, led by instructor Charles Frye of the massively popular Full Stack LLM Bootcamp.

Part I: Running Inference
- What is the workload?
- Open vs Proprietary Models
- Execution
- End User Device
- Over a Network
- Serving Inference

Timestamps
0:00:00 Intro & Overview
0:03:52 What is Inference?
0:10:16 Proprietary Models for Inference
0:21:22 Open Models for Inference
0:30:41 Will Open or Proprietary Models Win Long-Term?
0:36:19 Q&A on Models
0:44:12 Inference on End-User Devices
1:04:32 Inference-as-a-Service Providers
1:10:00 Cloud Inference and Serverless GPUs
1:17:46 Rack-and-Stack for Inference
1:20:12 Inference Arithmetic for GPUs
1:27:07 TPUs and Other Custom Silicon for Inference
1:36:11 Containerizing Inference and Inference Services