[Workshop] AI Engineering 201: Inference

3.1K views · Nov 07, 2023 · 103:15 min

Takeaway

Production inference engineering hinges on two things: understanding that the tensor-to-tensor network step is the bottleneck, and weighing the capability of proprietary models against the control and cost of open models.

Summary

Charles Frye (Full Stack Deep Learning) breaks inference into tokenizer → tensor-to-tensor neural net → detokenizer; the network step is the engineering bottleneck
Compares proprietary models (OpenAI and Anthropic, which top the LMSYS leaderboard) with open models (Llama, Mistral) along the build-vs-buy axis
Covers self-serving inference, SLAs/SLOs, accelerator choice, and on-device vs cloud
Second half covers application architectures, monitoring, evaluation, and observability patterns emerging in production
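
The three-stage breakdown in the summary (tokenizer → tensor-to-tensor neural net → detokenizer) can be sketched in a few lines. This is a toy illustration, not code from the workshop: the vocabulary, the `toy_network` stand-in, and the greedy generation loop are all assumptions made for the example. A real model replaces `toy_network` with large matrix multiplications on an accelerator, which is why that stage is the bottleneck.

```python
# Toy illustration only: the vocabulary and the "network" are made up.
VOCAB = ["<eos>", "hello", "world", "again"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def tokenize(text: str) -> list[int]:
    """Text -> integer token ids (whitespace split; real tokenizers use subwords)."""
    return [TOKEN_TO_ID[word] for word in text.split()]


def toy_network(token_ids: list[int]) -> list[float]:
    """Token ids in, one score per vocabulary entry out.

    Hypothetical stand-in for the neural net: it scores the id that follows
    the last input id highest. A real model does heavy tensor math here.
    """
    next_id = (token_ids[-1] + 1) % len(VOCAB)
    return [1.0 if i == next_id else 0.0 for i in range(len(VOCAB))]


def detokenize(token_ids: list[int]) -> str:
    """Integer token ids -> text."""
    return " ".join(VOCAB[i] for i in token_ids)


def generate(prompt: str, max_new_tokens: int = 8) -> str:
    """Tokenize, repeatedly call the network, then detokenize."""
    ids = tokenize(prompt)
    for _ in range(max_new_tokens):
        scores = toy_network(ids)
        # Greedy decoding: pick the highest-scoring token id.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if VOCAB[next_id] == "<eos>":
            break
        ids.append(next_id)
    return detokenize(ids)


if __name__ == "__main__":
    print(generate("hello"))
```

Only the middle function touches tensors; the tokenizer and detokenizer are cheap lookups on either side of it.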

inference · serving · mlops

Original description

Optional introductory course for AI Engineers, free for all Summit attendees. Advanced knowledge of AI Engineering, led by instructor Charles Frye of the massively popular Full Stack LLM Bootcamp.

Part I: Running Inference

What is the workload?
Open vs Proprietary Models
Execution
  End User Device
  Over a Network
Serving Inference

Timestamps 

0:00:00 Intro & Overview
0:03:52 What is Inference?
0:10:16 Proprietary Models for Inference
0:21:22 Open Models for Inference
0:30:41 Will Open or Proprietary Models Win Long-Term?
0:36:19 Q&A on Models
0:44:12 Inference on End-User Devices
1:04:32 Inference-as-a-Service Providers
1:10:00 Cloud Inference and Serverless GPUs
1:17:46 Rack-and-Stack for Inference
1:20:12 Inference Arithmetic for GPUs
1:27:07 TPUs and Other Custom Silicon for Inference
1:36:11 Containerizing Inference and Inference Services