📐 Evals
How to actually measure LLM and agent quality — golden sets, LLM-as-judge, regression gates, production tracing, observability.
The workflow
flowchart LR
A[Production traces] --> B[Sample & label<br/>golden set]
B --> C{Eval type}
C -->|Reference| D[Exact / BLEU /<br/>code-exec]
C -->|Reference-free| E[LLM-as-judge<br/>rubric scored]
C -->|Human| F[Pairwise<br/>preference]
D --> G[Aggregate metric]
E --> G
F --> G
G --> H[Regression gate<br/>in CI]
You cannot ship LLM products without evals. The most-watched talks converge on the same recipe: a golden set, an LLM judge, and a regression gate in CI.
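The reference-based branch of the workflow above can be sketched in a few lines. This is a minimal illustration, not any speaker's implementation: `GoldenExample`, `run_eval`, `regression_gate`, the `toy_model` stand-in, the sample data, and the 0.02 tolerance are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenExample:
    """One labeled item in the golden set (hypothetical schema)."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Reference-based score: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(model: Callable[[str], str], golden_set: list[GoldenExample]) -> float:
    """Aggregate metric: mean exact-match score over the golden set."""
    scores = [exact_match(model(ex.prompt), ex.expected) for ex in golden_set]
    return sum(scores) / len(scores)


def regression_gate(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Pass when the new score is at most `tolerance` below the baseline."""
    return score >= baseline - tolerance


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    answers = {"capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(prompt, "unknown")


GOLDEN_SET = [
    GoldenExample("capital of France?", "paris"),
    GoldenExample("2 + 2?", "4"),
    GoldenExample("largest planet?", "Jupiter"),
]

score = run_eval(toy_model, GOLDEN_SET)          # 2 of 3 examples match
gate_ok = regression_gate(score, baseline=0.65)  # compared against last release
```

In a CI job, the script would typically exit non-zero when `gate_ok` is false so the pipeline blocks the change. An LLM-as-judge or pairwise-preference scorer could replace `exact_match` without changing the gate.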
Key takeaways
Videos (58)
Building and evaluating AI Agents — Sayash Kapoor, AI Snake Oil
Agents fail in the real world primarily because we evaluate them like LLMs; building cost-aware, environment-grounded benchmarks is the gating problem for production agents.
Ensure AI Agents Work: Evaluation Frameworks for Scaling Success — Aparna Dhinakaran, CEO Arize
Evaluate agents at three layers — router decisions, individual skill correctness, and the convergence of the overall path — not just final-answer quality.
Evals 101 — Doug Guthrie, Braintrust
Treat evals as the central flywheel — connect offline test datasets, production traces and human review so every prompt or model change is measurably better.
How to Construct Domain Specific LLM Evaluation Systems: Hamel Husain and Emil Sedgh
Domain-specific eval systems are built bottom-up from unit tests, trace logging and frictionless human review — not from buying generic tools or jumping straight to LLM-as-judge.
Five hard earned lessons about Evals — Ankur Goyal, Braintrust
Treat evals as a continuously engineered system, not synthetic data plus a judge, so a new model release can flip a feature from unviable to shippable.
[Evals Workshop] Mastering AI Evaluation: From Playground to Production
Evals are tasks + datasets + scores; cross-validate human judgment against scores to know whether to fix your evals or your app.
Evals Are Not Unit Tests — Ido Pesok, Vercel v0
Treat application evals as statistical measurements over real user traffic, not unit tests — prompt tweaks alone never close the demo-to-prod gap.
AI Agents, Meet Test Driven Development
Treat AI products like TDD systems with continuous eval datasets and LLM-as-judge metrics, not one-shot prompt engineering.
Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith
Treat benchmarking as a subset of evaluation, and pair performance tools (GuideLLM) with accuracy harnesses (lm-eval-harness, OpenAI Evals) to make production LLM rollouts measurable.
Lessons from the Trenches: Building LLM Evals That Work IRL: Aparna Dhinakaran
Stop benchmarking models and start instrumenting your app's component-level traces with task-specific LLM-as-judge or heuristic evals.
Shipping AI That Works: An Evaluation Framework for PMs – Aman Khan, Arize
AI PMs should treat evals as the new product spec — non-deterministic, data-dependent, and the durable moat for any agentic product.
Why should anyone care about Evals? — Manu Goyal, Braintrust
Evals are the laboratory that lets you iterate offline and turn production traffic into the next training set — without them you ship blind.
How METR measures Long Tasks and Experienced Open Source Dev Productivity - Joel Becker, METR
METR's measurement shows AI tools can slow experienced devs even when they feel faster, and capability growth is tightly coupled to compute scaling.
Evaluating Domain Specific LLMs for Real World Finance — Waseem Alshikh, Writer
General LLMs collapse on noisy real-world finance queries and context; domain-specific models stay robust on the failure modes that matter in production.
Dark Factory: How OpenClaw Ships Faster Than You Can Read the Diff — Vincent Koc
Replace static benchmark evals with adaptive eval pipelines that evolve alongside your agents as you ship faster than you can review diffs.
What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench
Benchmarks all trend up while models still confidently bullshit on nonsense prompts; epistemic pushback is a major remaining gap.
The Future of Evals - Ankur Goyal, Braintrust
Eval work has been painfully manual; with Claude 4-class models, agents like Braintrust Loop can now autonomously improve prompts, datasets and scorers.
Why building eval platforms is hard — Phil Hetzel, Braintrust
Evals platforms are multi-persona, multi-stage systems problems that quickly outgrow spreadsheets and homegrown loops once teams take agent quality seriously.
Why Agent Hype can fall short of reality – Joel Becker, METR
Trust the exponential time-horizon trend for raw capability, but expect a gap between benchmark hype and real-world developer productivity gains.
Judge the Judge: Building LLM Evaluators That Actually Work with GEPA — Mahmoud Mabrouk, Agenta AI
LLM judges only work when calibrated against human annotations per specific error type using prompt optimization like GEPA.
7 Habits of Highly Effective Generative AI Evaluations - Justin Muller
Evals exist to discover problems, not to compute a vanity score — building one is what turns a stuck GenAI prototype into a scalable production workload.
Practical tactics to build reliable AI apps — Dmitry Kuchin, Multinear
Pick a single business-aligned reliability metric for your AI app and iterate the prompt, model and data against it — generic NLP metrics are noise.
2025 is the Year of Evals! Just like 2024, and 2023, and … — John Dickerson, CEO Mozilla AI
Agentic systems taking real actions are finally forcing evaluation tooling out of CIO-only sales and into board-level enterprise budgets in 2025.
Everything You Need To Know About Agent Observability — Danny Gollapalli & Zubin Koticha, Raindrop
Move from offline eval sets to production monitoring with implicit/explicit signals and live experiments — agents fail in ways unit tests can't anticipate.
Engineering Better Evals: Scalable LLM Evaluation Pipelines That Work — Dat Ngo, Aman Khan, Arize
Production eval pipelines mean traces feeding many calibrated LLM judges per trace, with drift monitoring — modeled on what Duolingo and Reddit actually do.
New York Times' Connections: A Case Study on NLP in Word Games — Shafik Quoraishee, NYT Games
Connections is a reproducible NLP benchmark for abstract reasoning where graph-coloring formulations outperform raw semantic similarity.
Shipping complex AI applications — Braintrust & Trainline
Shipping production AI agents requires the same eval-and-observability discipline Trainline applies via Braintrust to keep agentic ticketing reliable.
Coding Evals: From Code Snippets to Codebases – Naman Jain, Cursor
Coding evals are moving from clean snippets to dynamic, real-world, multi-hour codebase tasks with auto-generated tests to combat contamination and brittleness.
Fuzzing in the GenAI Era — Leonard Tang, Haize Labs
Treating GenAI eval as adversarial fuzzing exposes brittleness that static golden-set tests miss, and the judge itself must be evaluated to be trusted.
Agent Optimization with Pydantic AI: GEPA, Evals, Feedback Loops — Samuel Colvin, Pydantic
Treat prompts as managed variables and use GEPA-style genetic optimization with golden-set evals to systematically improve agent reliability.
Iterating on LLM apps at scale Learnings from Discord: Ian Webster
At Discord scale, simple deterministic evals run on every PR like unit tests beat fancy LLM-graded eval pipelines for shipping safely.
Turning Fails into Features: Zapier’s Hard-Won Eval Lessons — Rafal Willinski, Vitor Balocco, Zapier
Treat probabilistic agents like a data flywheel: instrument traces so any run becomes a replayable eval, then mine implicit feedback to drive continuous improvement.
How Zapier Builds AI Products and Features with the Help of Braintrust: Ankur Goyal & Olmo Maldonado
Mature AI products require evals owned jointly by PMs and engineers, run in CI, with tracing across multi-tool agent flows. Zapier's reported 300% accuracy gain is offered as evidence.
Mission-Critical Evals at Scale (Learnings from 100k medical decisions)
Real-time reference-free evals (LLM-as-judge + confidence) prioritize human review where it matters and let mission-critical AI scale beyond what clinicians could ever cover.
How to build world-class AI products — Sarah Sachs (AI lead @ Notion) & Carlos Esteban (Braintrust)
World-class AI products spend the majority of engineering time on evals and observability, not prompts — that's how Notion ships fast at consumer scale.
Your Evals Are Meaningless (And Here's How to Fix Them)
Build evals as a dynamic system with SME-authored datasets and domain-specific LLM-judge criteria — don't rely on framework defaults that drift away from your users' definition of good.
Agent Evals: Finally, With The Map
A complete agent eval program covers both semantic and behavioral dimensions and treats the LLM-judge layer (EvalOps) as a first-class optimization target.
Evaluating AI Search: A Practical Framework for Augmented AI Systems — Quotient AI + Tavily
Evaluating production AI search needs live LLM-judge monitoring on real traffic, not just static benchmarks like SimpleQA, because both the web and user intent keep moving.
Agentic Excellence: Mastering AI Agent Evals w/ Azure AI Evaluation SDK — Cedric Vidal, Microsoft
Use Azure AI Toolkit + Evaluation SDK to spot-check models, then scale to dataset evals, treating evaluation as a layered application-level concern.
Perceptual Evaluations: Evals for Aesthetics — Diego Rodriguez, Krea.ai
Aesthetic evals must center human perception — current metrics like FID and CLIP miss what people actually find broken in generative imagery.
The Build-Operate Divide: Bridging Product Vision and AI Operational Reality
Crossing the V1-to-V2 quality chasm in AI products comes from a fast eval-iteration loop plus disciplined human-in-the-loop review, not better base models.
Break It 'Til You Make It: Building the Self-Improving Stack for AI Agents - Aparna Dhinakaran
Agent evals must span tool calls, trajectories, and full conversations — and the evals themselves need to evolve alongside the agent.
Judging LLMs: Alex Volkov
Production LLM apps must log/trace everything from day one, and use a layered eval stack (programmatic + human + LLM-judge) rather than skipping straight to fine-tuning.
Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize
Eval is just testing for non-deterministic systems — capture traces, look at them, then write code + LLM-judge + meta-evals before tuning prompts or swapping models.
[Full Workshop] Building Metrics that actually work — David Karam, Pi Labs (fmr Google Search)
Eval is a methodology discipline (calibrate metrics with humans, expand to many signals), not a one-shot benchmark. Learn from Google Search's 300-metric setup.
open-rag-eval: RAG Evaluation without "golden" answers — Ofer Mendelevitch, Vectara
open-rag-eval scores RAG quality without golden datasets using nugget-based generation evaluation and UMBRELA retrieval scoring.
Benchmarks Are Memes: How What We Measure Shapes AI—and Us - Alex Duffy, Every.to
Whoever designs the next benchmark shapes what frontier models become—build benchmarks that are multifaceted, generative, evolutionary, and experiential, not just easy to score.
Taming Rogue AI Agents with Observability-Driven Evaluation — Jim Bennett, Galileo
Reliable agents need step-level LLM-as-judge evaluations baked into observability pipelines from day one, not just final-answer scoring.
Mind the Gap (In your Agent Observability) — Amy Boyd & Nitya Narasimhan, Microsoft
Treat observability as a 3-phase loop (evaluate → monitor → optimize) built on OpenTelemetry tracing and agent-specific evaluators — non-determinism makes this a continuous practice, not a one-time eval.
CI in the Era of AI: From Unit Tests to Stochastic Evals — Nathan Sobo, Zed
Shipping AI features inside Zed forced the team from fully-deterministic CI to programmatic stochastic evals where assertions interrogate specific agent steps.
How to run Evals at Scale: Thinking beyond Accuracy or Similarity — Muktesh Mishra, Adobe
Pick the eval metric to match the application type (RAG vs code-gen vs agent) and treat evals as a continuous, data-centric, parallelizable engineering practice.
The Benchmarks Game: Why It's Rigged and How You Can (Really) Win - Darius Emrani
Public benchmarks are rigged by Goodhart's Law and selective comparison — build small domain-specific evals from real production queries instead.
E-Values Evaluating the Values of AI: Sheila Gulati and Nischal Nadhamuni
Evals must become multifaceted, dynamic, and values-aware before agentic systems become fully self-sufficient, or we'll lose the ability to course-correct.
BotDojo Launch: Enhancing AI Assistants with Evaluations and Synthetic Data
Pairing batch-based LLM evaluations with synthetic data generated from real support tickets is a fast path from POC chatbot to production-ready RAG.
Agents reported thousands of bugs, how many were real? - Ian Butler and Nick Gregory
Software-maintenance agents are still poor at bug detection — high false positives, narrow reasoning, and lack of holistic code reading limit real-world reliability today.
Best Practices for Evaluating Large Language Model Applications with llmeval: Niklas Nielsen
llm-eval gives teams a local, Hydra-configurable CLI to test prompt and model changes across providers with stable multi-sample reporting.
Will Agent evaluation via MCP Stabilize Agent Networks? - Ari Heljakka
Exposing evaluators via MCP lets any agent get scored, explained feedback inline — turning evals from a one-off harness into a continuous stabilization loop for agent networks.
How to evaluate a model for your use case: Emmanuel Turlay
Use LLM-as-judge with task-specific rubrics and visualize score distributions — generic NLP benchmarks won't tell you which model fits your application.
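Several talks above (Mabrouk, Tang, Voss, Karam) stress that an LLM judge must itself be checked against human labels before its scores are trusted. One common way to do that, shown here as a sketch under assumptions rather than any speaker's method, is to compute chance-corrected agreement (Cohen's kappa) between human and judge verdicts on the same traces. The `cohens_kappa` helper and the label data are illustrative only.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Fraction of items where the two raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human) | set(judge)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up verdicts on eight traces, for illustration only.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

kappa = cohens_kappa(human_labels, judge_labels)
```

Running the same check separately per error type, as the GEPA talk suggests, only requires filtering both label lists to the traces tagged with that error type before calling `cohens_kappa`.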