[Evals Workshop] Mastering AI Evaluation: From Playground to Production

16.6K views · Jul 01, 2025 · 85:08 min

Takeaway

Evals are tasks + datasets + scores; cross-validate human judgment against scores to know whether to fix your evals or your app.
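
As a rough illustration of those three pieces, here is a minimal sketch in plain Python. It does not use the Braintrust SDK: the names `task`, `DATASET`, `exact_match`, and `run_eval` are hypothetical, the dataset is made up, and the task is a toy stand-in for a real model call (with a deliberate bug so that one score comes out low).

```python
from statistics import mean
from typing import Callable

# Dataset: inputs paired with expected outputs (hypothetical examples).
DATASET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "3 * 5", "expected": "15"},
    {"input": "10 / 4", "expected": "2.5"},
]


def task(prompt: str) -> str:
    """Stand-in for the app under test; a real task would call a model."""
    left, op, right = prompt.split()
    x, y = int(left), int(right)
    # Deliberate bug: integer division, so "10 / 4" yields "2" instead of "2.5".
    result = {"+": x + y, "*": x * y, "/": x // y}[op]
    return str(result)


def exact_match(output: str, expected: str) -> float:
    """Code-based score in the 0-1 range: 1.0 on an exact match, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_eval(
    task_fn: Callable[[str], str],
    dataset: list[dict],
    score_fn: Callable[[str, str], float],
) -> list[dict]:
    """Run the task on every dataset row and attach its output and score."""
    rows = []
    for example in dataset:
        output = task_fn(example["input"])
        rows.append(
            {**example, "output": output, "score": score_fn(output, example["expected"])}
        )
    return rows


results = run_eval(task, DATASET, exact_match)
average = mean(row["score"] for row in results)

for row in results:
    print(row["input"], "->", row["output"], "score:", row["score"])
print("average score:", round(average, 2))
```

An LLM-as-judge scorer would have the same shape as `exact_match`: it takes the output (and optionally the expected value) and returns a number between 0 and 1, but computes it by prompting a model instead of comparing strings.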

Summary

Braintrust solutions engineers Doug and Carlos walk through evals from playground to production: tasks (input/output), datasets, and scores (0-1, code or LLM-as-judge).
Distinguishes offline evals (development, proactive issue detection) from online evals (production traffic, real-time monitoring, user feedback capture).
Provides a 2x2 diagnostic that compares human-judged quality with the eval score: high human quality + low score = fix the eval; low human quality + high score = also fix the eval; both low = fix the app.
Workshop covers SDK usage, production logging, and human-in-the-loop labeling to grow datasets from synthetic seeds to real production logs.
Recommends starting synthetic but grounding datasets in real logged interactions as the system matures.
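
The 2x2 diagnostic in the summary can be sketched as a small function. This is an illustration, not code from the workshop: the 0.5 cutoff, the `diagnose` name, and the sample rows are assumptions. The summary also does not say what to do when both values are high, so the sketch assumes that cell needs no action.

```python
from collections import Counter

THRESHOLD = 0.5  # assumed cutoff between "low" and "high"; tune per project


def diagnose(human_quality: float, score: float, threshold: float = THRESHOLD) -> str:
    """Map a (human judgment, eval score) pair to a cell of the 2x2."""
    human_high = human_quality >= threshold
    score_high = score >= threshold
    if human_high != score_high:
        # The score disagrees with the human reviewer, in either direction.
        return "fix the eval"
    if not human_high:
        # Both low: the score is right and the output really is bad.
        return "fix the app"
    # Both high: assumed to need no action (not stated in the summary).
    return "no action"


# Hypothetical reviewed rows: a human rating and an automated score, both 0-1.
reviewed = [
    {"id": "a", "human": 0.9, "score": 0.2},
    {"id": "b", "human": 0.1, "score": 0.8},
    {"id": "c", "human": 0.2, "score": 0.1},
    {"id": "d", "human": 0.9, "score": 0.95},
]

verdicts = {row["id"]: diagnose(row["human"], row["score"]) for row in reviewed}
tally = Counter(verdicts.values())

for row_id, verdict in verdicts.items():
    print(row_id, "->", verdict)
```

Tallying the verdicts over a batch of human-reviewed logs gives a quick read on where to spend effort: many "fix the eval" rows suggest the scorer is unreliable, while many "fix the app" rows suggest the scorer can be trusted and the application itself needs work.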

evals · braintrust · llm-as-judge

Original description

This hands-on workshop will guide participants through the complete AI evaluation lifecycle using Braintrust, from initial prompt testing to production monitoring. Attendees will learn to build evaluation frameworks that ensure their AI applications perform reliably in real-world scenarios. Topics covered include both offline and online evaluation strategies, logging and feedback systems, and human review processes.

Recorded at the AI Engineer World's Fair in San Francisco.