Five hard earned lessons about Evals — Ankur Goyal, Braintrust

17.9K views · Aug 23, 2025 · 19:45 min

Takeaway

Treat evals as a continuously engineered system rather than synthetic data plus a generic judge, so that when a new model is released you can see right away whether a feature has flipped from unviable to shippable.

Summary

Three signs that evals are working: you can ship an update for a new model in under 24 hours (as Notion does), you turn user complaints into evals quickly, and you use evals offensively to predict whether a use case is viable.
Evals must be engineered — synthetic data + generic LLM-as-judge scorers don't represent reality; the best teams write custom scorers and treat them as the project's PRD.
Tool definitions and outputs dominate the tokens in agent prompts; switching tool output from JSON to YAML can materially improve the LLM's analysis, even though the two formats carry identical data as far as downstream code is concerned.
Build evals that are 'too ambitious' for current models, so that when a new frontier model drops (Sonnet 4 vs 3.7 vs GPT-4o) and viability suddenly flips, you can ship the new feature immediately.
Optimize the whole system (data + prompt + tools + scorers) together — auto-prompt optimization with full system context dramatically outperforms prompt-only optimization.
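
The points about custom scorers and 'too ambitious' evals can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Braintrust's API or the speaker's code: the `EvalCase` type, `custom_scorer`, `run_eval`, `viability`, the stubbed models, and the 0.8 threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One eval case, e.g. written up from a real user complaint (hypothetical shape)."""
    input: str
    must_mention: List[str]      # facts a good answer has to contain
    must_not_mention: List[str]  # phrases a good answer must never contain


def custom_scorer(output: str, case: EvalCase) -> float:
    """Task-specific scorer: 0.0 if a forbidden phrase appears,
    otherwise the fraction of required facts found in the output."""
    text = output.lower()
    if any(bad.lower() in text for bad in case.must_not_mention):
        return 0.0
    if not case.must_mention:
        return 1.0
    hits = sum(1 for fact in case.must_mention if fact.lower() in text)
    return hits / len(case.must_mention)


def run_eval(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Average custom-scorer result for one model over all cases."""
    scores = [custom_scorer(model(case.input), case) for case in cases]
    return sum(scores) / len(scores)


def viability(
    models: Dict[str, Callable[[str], str]],
    cases: List[EvalCase],
    threshold: float = 0.8,  # assumed bar for "shippable"
) -> Dict[str, bool]:
    """Run the same eval against every model and report which ones clear the bar."""
    return {name: run_eval(fn, cases) >= threshold for name, fn in models.items()}


# Stub "models" standing in for real API calls (assumption for the sketch).
def older_model(prompt: str) -> str:
    return "Refunds are available."


def newer_model(prompt: str) -> str:
    return "Refunds are available within 30 days with a receipt."


CASES = [
    EvalCase(
        input="What is the refund policy?",
        must_mention=["30 days", "receipt"],
        must_not_mention=["no refunds"],
    )
]

RESULT = viability({"older": older_model, "newer": newer_model}, CASES)
print(RESULT)
```

Because the eval and scorer stay fixed while only the model changes, rerunning `viability` on release day is enough to see whether a use case has crossed the threshold.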

evals · llm-as-judge · braintrust

Original description

The main thesis of the video is that building successful AI applications requires a sophisticated engineering approach that goes beyond simply writing good prompts. The speaker argues for the importance of evaluations (evals) as a core component of the development process, highlighting that they should be intentionally engineered to reflect real-world user feedback and drive product improvements. The video also introduces the concept of "context engineering" as the new frontier, where the focus is on optimizing the entire context provided to the model, including tool definitions and their outputs. Ultimately, the speaker advocates for a flexible, model-agnostic architecture that can quickly adapt to the rapidly evolving landscape of AI models.
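
As a rough illustration of optimizing tool outputs as part of the context, the sketch below renders the same tool result as JSON and as YAML-style text. The `to_yaml_like` helper and the sample `TOOL_OUTPUT` are hypothetical; the helper is a simplified emitter that does no quoting or escaping, so a real project would more likely use a proper YAML library. Character count is used here only as a crude stand-in for prompt size, and whether the model's analysis actually improves has to be measured with your own evals.

```python
import json
from typing import Any, List


def _scalar(value: Any) -> str:
    """Format a leaf value (or an empty container) in YAML style."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, dict):
        return "{}"
    if isinstance(value, list):
        return "[]"
    return str(value)


def to_yaml_like(value: Any, indent: int = 0) -> str:
    """Render nested dicts and lists as YAML-style text (simplified sketch)."""
    pad = "  " * indent
    lines: List[str] = []
    if isinstance(value, dict):
        for key, item in value.items():
            if isinstance(item, (dict, list)) and item:
                lines.append(f"{pad}{key}:")
                lines.append(to_yaml_like(item, indent + 1))
            else:
                lines.append(f"{pad}{key}: {_scalar(item)}")
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, (dict, list)) and item:
                # Put the first nested line on the same line as the dash.
                nested = to_yaml_like(item, indent + 1)
                lines.append(f"{pad}- {nested.lstrip()}")
            else:
                lines.append(f"{pad}- {_scalar(item)}")
    else:
        lines.append(f"{pad}{_scalar(value)}")
    return "\n".join(lines)


# Hypothetical tool result; downstream code sees the same data either way.
TOOL_OUTPUT = {
    "order_id": 1042,
    "status": "shipped",
    "items": [{"sku": "A-17", "qty": 2}, {"sku": "B-03", "qty": 1}],
    "gift": False,
}

AS_JSON = json.dumps(TOOL_OUTPUT, indent=2)
AS_YAML = to_yaml_like(TOOL_OUTPUT)

print(AS_YAML)
print(len(AS_JSON), len(AS_YAML))
```

Only the text shown to the model changes; the tool itself can keep returning the same structured object to the rest of the application.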

Timestamps:

00:00 Introduction to 5 Lessons in AI Product Development
00:19 Lesson 1: Effective Evals Speak for Themselves
02:09 Lesson 2: Great Evals Need to Be Intentionally Engineered
04:03 Lesson 3: Context Engineering is the New Prompt Engineering
06:37 Lesson 4: Be Prepared for a New Model to Change Everything
09:09 Lesson 5: Optimize the Entire Evaluation System, Not Just the Prompts
12:21 Recap of the Five Lessons