Five hard earned lessons about Evals — Ankur Goyal, Braintrust
Takeaway
Treat evals as a continuously engineered system rather than synthetic data plus a generic judge, so that when a new model release flips a feature from unviable to shippable, you can see it and ship right away.
Summary
- Three signs evals are working: ship a new-model update in <24h (like Notion), turn user complaints into evals fast, and use evals offensively to predict use-case viability.
- Evals must be engineered — synthetic data + generic LLM-as-judge scorers don't represent reality; the best teams write custom scorers and treat them as the project's PRD.
- Tool definitions and outputs dominate the tokens in agent prompts; switching tool output from JSON to YAML can materially improve the LLM's analysis, even though the data is identical as far as downstream code is concerned.
- Build evals that are 'too ambitious' for current models, so that when a new frontier model drops (e.g. Sonnet 4 vs. 3.7 vs. GPT-4o), viability suddenly flips and you can ship the new feature immediately.
- Optimize the whole system (data + prompt + tools + scorers) together — auto-prompt optimization with full system context dramatically outperforms prompt-only optimization.
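The JSON-to-YAML point in the summary can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `to_yaml`, `_scalar`, and `tool_output` are hypothetical names, and the emitter is a hand-rolled simplification (no quoting or escaping), not a full YAML implementation. Character counts are only a rough proxy; real token counts depend on the model's tokenizer.

```python
import json


def _scalar(value):
    """Render a leaf value (or an empty container) in YAML-style notation."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before anything numeric
        return "true" if value else "false"
    if isinstance(value, dict):
        return "{}"
    if isinstance(value, list):
        return "[]"
    return str(value)


def to_yaml(value, indent=0):
    """Render nested dicts/lists of simple scalars as block-style YAML.

    Simplified on purpose: strings are emitted unquoted, so this is only
    safe for values without YAML special characters.
    """
    pad = "  " * indent
    lines = []
    if isinstance(value, dict):
        for key, item in value.items():
            if isinstance(item, (dict, list)) and item:
                lines.append(f"{pad}{key}:")
                lines.append(to_yaml(item, indent + 1))
            else:
                lines.append(f"{pad}{key}: {_scalar(item)}")
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, (dict, list)) and item:
                # Put the first nested line on the same line as the dash.
                rendered = to_yaml(item, indent + 1).lstrip()
                lines.append(f"{pad}- {rendered}")
            else:
                lines.append(f"{pad}- {_scalar(item)}")
    else:
        lines.append(f"{pad}{_scalar(value)}")
    return "\n".join(lines)


# Hypothetical tool result, as an agent might receive it.
tool_output = {
    "query": "open invoices",
    "count": 2,
    "results": [
        {"id": 101, "customer": "Acme", "overdue": True},
        {"id": 102, "customer": "Globex", "overdue": False},
    ],
}

json_text = json.dumps(tool_output, indent=2)  # what downstream code parses
yaml_text = to_yaml(tool_output)               # what the prompt would contain

print(yaml_text)
print("JSON chars:", len(json_text), "YAML chars:", len(yaml_text))
```

The tool still returns the same structure to application code; only the text rendered into the prompt changes. Whether that helps a given model is something to measure with an eval, not assume.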
Tags: evals, llm-as-judge, braintrust
Original description
The main thesis of the video is that building successful AI applications requires a sophisticated engineering approach that goes beyond simply writing good prompts. The speaker argues for the importance of evaluations (evals) as a core component of the development process, highlighting that they should be intentionally engineered to reflect real-world user feedback and drive product improvements. The video also introduces the concept of "context engineering" as the new frontier, where the focus is on optimizing the entire context provided to the model, including tool definitions and their outputs. Ultimately, the speaker advocates for a flexible, model-agnostic architecture that can quickly adapt to the rapidly evolving landscape of AI models.

Timestamps:
- 00:00 Introduction to 5 Lessons in AI Product Development
- 00:19 Lesson 1: Effective Evals Speak for Themselves
- 02:09 Lesson 2: Great Evals Need to Be Intentionally Engineered
- 04:03 Lesson 3: Context Engineering is the New Prompt Engineering
- 06:37 Lesson 4: Be Prepared for a New Model to Change Everything
- 09:09 Lesson 5: Optimize the Entire Evaluation System, Not Just the Prompts
- 12:21 Recap of the Five Lessons
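One way to picture the model-agnostic, eval-driven setup described above is the following minimal sketch. It assumes each model can be wrapped as a plain function from input text to output text. All names (`EvalCase`, `intent_scorer`, `run_eval`, `viability_report`) are hypothetical, the two "models" are hard-coded stand-ins rather than real API calls, and the cases and threshold are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    input: str      # e.g. a request taken from a real user complaint
    expected: str   # the outcome the product requires


def intent_scorer(output: str, expected: str) -> float:
    """Custom scorer: 1.0 if the output matches after normalization, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(model: Callable[[str], str],
             cases: List[EvalCase],
             scorer: Callable[[str, str], float]) -> float:
    """Average score of one model over all cases."""
    scores = [scorer(model(case.input), case.expected) for case in cases]
    return sum(scores) / len(scores)


def viability_report(models: Dict[str, Callable[[str], str]],
                     cases: List[EvalCase],
                     scorer: Callable[[str, str], float],
                     threshold: float) -> Dict[str, bool]:
    """Mark each model as viable if its average score reaches the threshold."""
    return {name: run_eval(fn, cases, scorer) >= threshold
            for name, fn in models.items()}


# Hypothetical cases, deliberately harder than the older stub can handle.
cases = [
    EvalCase("Refund order 1234", "refund:1234"),
    EvalCase("Cancel my subscription", "cancel_subscription"),
    EvalCase("Where is order 88?", "track:88"),
]


def older_model_stub(text: str) -> str:
    # Stand-in for a weaker model: handles only one case.
    return {"Refund order 1234": "refund:1234"}.get(text, "unknown")


def newer_model_stub(text: str) -> str:
    # Stand-in for a stronger model: handles all cases, with noisy formatting.
    answers = {
        "Refund order 1234": " Refund:1234 ",
        "Cancel my subscription": "cancel_subscription",
        "Where is order 88?": "TRACK:88",
    }
    return answers.get(text, "unknown")


models = {"older-model-stub": older_model_stub,
          "newer-model-stub": newer_model_stub}
report = viability_report(models, cases, intent_scorer, threshold=0.9)
print(report)
```

Because the cases and scorer are fixed and only the model function varies, trying a newly released model amounts to adding one entry to `models` and rerunning the report.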