The Future of Evals - Ankur Goyal, Braintrust

8.4K views · Aug 09, 2025 · 5:13 min

Takeaway

Eval work has been painfully manual; with Claude 4-class models, agents like Braintrust Loop can now autonomously improve prompts, datasets and scorers.

Summary

Braintrust averages 13 evals/day per org, with top customers running 3,000+ evals/day and spending 2+ hours/day in the product.
Announces Loop, an in-Braintrust agent that auto-improves prompts, datasets, and scorers — only viable because Claude 4 was a 6x leap over prior models on these meta-eval tasks.
Loop runs side-by-side with users: every suggested prompt/dataset/scorer edit is reviewable; an autopilot toggle lets it optimize unattended.
Default model is Claude 4, but Loop also supports OpenAI, Gemini, and custom models.
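
To make the three pieces Loop edits concrete, here is a minimal Python sketch of an eval (prompt, dataset, scorer) and a review-gated improvement loop. It is an illustration under stated assumptions, not Braintrust's API or Loop's actual algorithm: `Case`, `make_task`, `run_eval`, `optimize`, and the `approve` callback are hypothetical names, and the "model" is a deterministic stand-in so the example runs without any network call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Case:
    """One dataset row: an input and the output we expect for it."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Scorer: 1.0 when the output equals the expected string, else 0.0."""
    return 1.0 if output == expected else 0.0


def make_task(prompt: str) -> Callable[[str], str]:
    """Hypothetical stand-in for a model call driven by `prompt`.

    A real task would send the prompt and input to an LLM; this toy version
    uppercases the input only when the prompt asks for uppercase.
    """
    def task(text: str) -> str:
        return text.upper() if "uppercase" in prompt.lower() else text
    return task


def run_eval(task: Callable[[str], str], dataset: List[Case],
             scorer: Callable[[str, str], float]) -> float:
    """Run the task over every case and return the mean score."""
    scores = [scorer(task(case.input), case.expected) for case in dataset]
    return sum(scores) / len(scores) if scores else 0.0


def optimize(prompts: List[str], dataset: List[Case],
             scorer: Callable[[str, str], float],
             approve: Callable[[str, float], bool],
             autopilot: bool = False) -> Tuple[str, float]:
    """Try candidate prompts in order and keep the best-scoring accepted one.

    A better-scoring candidate is adopted only if `autopilot` is on or the
    reviewer callback `approve` accepts it (assumed review semantics).
    """
    best_prompt = prompts[0]
    best_score = run_eval(make_task(best_prompt), dataset, scorer)
    for candidate in prompts[1:]:
        score = run_eval(make_task(candidate), dataset, scorer)
        if score > best_score and (autopilot or approve(candidate, score)):
            best_prompt, best_score = candidate, score
    return best_prompt, best_score


DATASET = [Case("hello", "HELLO"), Case("ok", "OK")]
PROMPTS = ["Repeat the input.", "Repeat the input in uppercase."]

if __name__ == "__main__":
    # Autopilot on: the improved prompt is adopted without review.
    print(optimize(PROMPTS, DATASET, exact_match,
                   approve=lambda p, s: False, autopilot=True))
```

With `autopilot=False`, the same call defers to `approve`, which mirrors the side-by-side mode where each suggested edit is reviewed before it is kept.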

evals · braintrust · agents

Original description

About Ankur
Ankur Goyal is the founder & CEO of Braintrust, the developer platform that companies like Zapier, Notion, Instacart, Airtable, and more use to evaluate, log, and ship reliable AI products to millions. He was previously Head of AI Platform at Figma, founder and CEO of Impira, and VP Eng at SingleStore. After Figma acquired Impira, he led the AI team there and saw many of the same blockers to AI development at Impira, Figma, and peer companies, which led him to found Braintrust.

Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter

Timestamps
00:00 Introduction to AI Engineer World's Fair
00:15 Speaker Introduction: Ankur Goyal, CEO of Braintrust
00:22 The Future of Evals
00:30 Increasing Adoption of Evals
01:58 Introducing Loop
04:09 Call to Action: Try Loop and Join the Team