Evals 101 — Doug Guthrie, Braintrust

23.5K views · Jun 27, 2025 · 48:31 min

Takeaway

Treat evals as the central flywheel: connect offline test datasets, production traces, and human review so that every prompt or model change is measurably better.

Summary

Doug Guthrie (Braintrust solutions engineer) positions evals as 'offense, not defense' — a way to systematically answer whether a model or prompt change improves the app, beyond unit-test-style regression catching.
An eval requires three ingredients: a task (anything from a single prompt to a full agentic, tool-using workflow), a dataset of real-world inputs, and scores (either LLM-as-judge scorers whose rubric outputs are mapped to 0, 0.5, or 1, or deterministic/heuristic code scorers).
Braintrust offers a Playground 'IDE for LLM outputs' for rapid prompt iteration plus an SDK so local dev evals share data with production observability.
The talk distinguishes offline, pre-production evals from online evals on traced production traffic. Tracing instruments cost, tokens, latency, tool calls, and intermediate steps, and production logs feed back into eval datasets through human review and user feedback.
Guthrie repeats founder Ankur Goyal's advice: don't wait for a golden dataset; establish a baseline eval and iterate from there.
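
To make the three ingredients concrete, here is a minimal sketch in plain Python of a task, a dataset, and two scorers, followed by a baseline-versus-candidate comparison. It is not the Braintrust SDK: every name here (`DATASET`, `run_eval`, `stub_judge`, the rubric letters A/B/C) is hypothetical, the tasks are canned lookups standing in for model calls, and the "judge" is a rule-based stand-in for an LLM call so the example runs offline.

```python
from statistics import mean

# Hypothetical dataset: real-world inputs paired with expected outputs.
DATASET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "opposite of hot", "expected": "cold"},
]

# Rubric choices a judge might return, mapped to numeric scores (assumed letters).
RUBRIC_SCORES = {"A": 1.0, "B": 0.5, "C": 0.0}


def exact_match(output: str, expected: str) -> float:
    """Deterministic code scorer: 1.0 if the strings match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def stub_judge(output: str, expected: str) -> float:
    """Stand-in for an LLM-as-judge call: picks a rubric letter by simple rules."""
    out, exp = output.strip().lower(), expected.strip().lower()
    if out == exp:
        choice = "A"  # fully correct
    elif exp in out:
        choice = "B"  # correct answer buried in extra text
    else:
        choice = "C"  # incorrect
    return RUBRIC_SCORES[choice]


def run_eval(task, dataset, scorers):
    """Run the task on every row and return the mean score per scorer."""
    results = {name: [] for name in scorers}
    for row in dataset:
        output = task(row["input"])
        for name, scorer in scorers.items():
            results[name].append(scorer(output, row["expected"]))
    return {name: mean(values) for name, values in results.items()}


def baseline_task(prompt: str) -> str:
    """Canned answers standing in for the current prompt/model."""
    answers = {"2 + 2": "The answer is 4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


def candidate_task(prompt: str) -> str:
    """Canned answers standing in for a changed prompt/model."""
    answers = {"2 + 2": "4", "capital of France": "Paris", "opposite of hot": "cold"}
    return answers.get(prompt, "unknown")


SCORERS = {"exact_match": exact_match, "judge": stub_judge}
baseline = run_eval(baseline_task, DATASET, SCORERS)
candidate = run_eval(candidate_task, DATASET, SCORERS)

# Per-scorer change relative to the baseline: positive means the change helped.
diff = {name: candidate[name] - baseline[name] for name in SCORERS}
print(baseline, candidate, diff)
```

Even with three rows and rough scorers, the baseline gives later changes something to be measured against, which is the point of the advice not to wait for a golden dataset.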

evals · braintrust · observability

Original description

This hands-on workshop guides participants through the full AI evaluation lifecycle with Braintrust, from initial prompt testing to production monitoring. Attendees will build evaluation frameworks, practice offline and online strategies, and implement logging systems.

About Doug Guthrie
Doug Guthrie is a solutions engineer at Braintrust. Previously, he helped customers deploy data infrastructure at dbt Labs. He is also a proud girl dad.

Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter