📐 Evals

How to actually measure LLM and agent quality — golden sets, LLM-as-judge, regression gates, production tracing, observability.

58 videos · evals · observability · agents · llm-as-judge · benchmarks · braintrust

The workflow

flowchart LR
    A[Production traces] --> B[Sample & label<br/>golden set]
    B --> C{Eval type}
    C -->|Reference| D[Exact / BLEU /<br/>code-exec]
    C -->|Reference-free| E[LLM-as-judge<br/>rubric scored]
    C -->|Human| F[Pairwise<br/>preference]
    D --> G[Aggregate metric]
    E --> G
    F --> G
    G --> H[Regression gate<br/>in CI]

You cannot ship LLM products without evals. The most-watched talks converge on the same recipe: a golden set sampled from production traces, an LLM judge for reference-free cases, and a regression gate in CI.
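As a rough illustration of the last two boxes in the flowchart (aggregate metric, then regression gate), here is a minimal sketch. Everything in it is hypothetical: the four-row `GOLDEN_SET`, the `toy_system` stand-in for the application under test, the `exact_match` scorer, and the 0.02 tolerance are assumptions chosen for the example, not something any of the talks prescribes.

```python
from statistics import mean

# Hypothetical golden set: in practice, sampled and labeled from production traces.
GOLDEN_SET = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3*3", "expected": "9"},
    {"input": "opposite of hot", "expected": "cold"},
]


def toy_system(prompt: str) -> str:
    """Stand-in for the LLM application under test (assumption for this sketch)."""
    answers = {
        "2+2": "4",
        "capital of France": "paris",
        "3*3": "9",
        "opposite of hot": "warm",  # deliberate miss
    }
    return answers.get(prompt, "")


def exact_match(output: str, expected: str) -> float:
    """Reference-based scorer: case-insensitive exact match."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def aggregate(golden_set, system, scorer=exact_match) -> float:
    """Run the system on every golden example and average the scores."""
    return mean(scorer(system(row["input"]), row["expected"]) for row in golden_set)


def regression_gate(candidate: float, baseline: float, max_drop: float = 0.02) -> bool:
    """Pass only if the candidate score is within max_drop of the baseline."""
    return candidate >= baseline - max_drop


score = aggregate(GOLDEN_SET, toy_system)
passed = regression_gate(score, baseline=0.75)
print(f"score={score:.2f} passed={passed}")
```

In a CI job, the script would typically exit with a non-zero status when `passed` is false so that the change is blocked. An LLM-as-judge or pairwise human score could be swapped in for `exact_match` as long as it returns a number per example.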

Key takeaways

Agents fail in the real world primarily because we evaluate them like LLMs; building cost-aware, environment-grounded benchmarks is the gating problem for production agents.

Evaluate agents at three layers — router decisions, individual skill correctness, and the convergence of the overall path — not just final-answer quality.

Treat evals as the central flywheel — connect offline test datasets, production traces and human review so every prompt or model change is measurably better.

Domain-specific eval systems are built bottom-up from unit tests, trace logging and frictionless human review — not from buying generic tools or jumping straight to LLM-as-judge.

Treat evals as a continuously engineered system, not synthetic data plus a judge, so a new model release can flip a feature from unviable to shippable.

Evals are tasks + datasets + scores; cross-validate human judgment against scores to know whether to fix your evals or your app.
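The "tasks + datasets + scores" framing, and the idea of checking scores against human judgment, can be sketched in a few lines. This is an illustrative assumption rather than any vendor's API: the `Eval` dataclass, the `agreement` helper, the 0.5 threshold, and the toy data are all made up for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Eval:
    """An eval as three parts: a task, a dataset, and a scoring function."""
    task: Callable[[str], str]
    dataset: list
    score: Callable[[str, str], float]

    def run(self) -> list:
        """Score the task's output for each dataset row."""
        return [
            self.score(self.task(row["input"]), row["expected"])
            for row in self.dataset
        ]


def agreement(scores, human_labels, threshold: float = 0.5) -> float:
    """Fraction of rows where the automated verdict matches the human verdict."""
    matches = [(s >= threshold) == h for s, h in zip(scores, human_labels)]
    return sum(matches) / len(matches)


# Hypothetical toy eval: the "app" upper-cases its input.
toy_eval = Eval(
    task=lambda text: text.upper(),
    dataset=[
        {"input": "hello", "expected": "HELLO"},
        {"input": "eval", "expected": "EVAL"},
        {"input": "ok", "expected": "OK!"},
    ],
    score=lambda out, exp: 1.0 if out == exp else 0.0,
)

scores = toy_eval.run()
# Hypothetical reviewer verdicts: the human considers all three outputs acceptable.
human_labels = [True, True, True]
rate = agreement(scores, human_labels)
print(scores, round(rate, 2))
```

One way to read the result: when agreement is low, the scorer disagrees with reviewers and the eval itself probably needs fixing (here, the exact-match scorer rejects an output the human accepted). When agreement is high but scores are low, the app is the more likely culprit.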

Videos (58)

Building and evaluating AI Agents — Sayash Kapoor, AI Snake Oil

Agents fail in the real world primarily because we evaluate them like LLMs; building cost-aware, environment-grounded benchmarks is the gating problem for production agents.

229.3K views · Apr 17, 2025

Ensure AI Agents Work: Evaluation Frameworks for Scaling Success — Aparna Dhinakaran, CEO Arize

Evaluate agents at three layers — router decisions, individual skill correctness, and the convergence of the overall path — not just final-answer quality.

32.5K views · Apr 23, 2025

Evals 101 — Doug Guthrie, Braintrust

Treat evals as the central flywheel — connect offline test datasets, production traces and human review so every prompt or model change is measurably better.

23.5K views · Jun 27, 2025

How to Construct Domain Specific LLM Evaluation Systems: Hamel Husain and Emil Sedgh

Domain-specific eval systems are built bottom-up from unit tests, trace logging and frictionless human review — not from buying generic tools or jumping straight to LLM-as-judge.

19.9K views · Sep 19, 2024

Five hard earned lessons about Evals — Ankur Goyal, Braintrust

Treat evals as a continuously engineered system, not synthetic data plus a judge, so a new model release can flip a feature from unviable to shippable.

17.9K views · Aug 23, 2025

[Evals Workshop] Mastering AI Evaluation: From Playground to Production

Evals are tasks + datasets + scores; cross-validate human judgment against scores to know whether to fix your evals or your app.

16.6K views · Jul 01, 2025

Evals Are Not Unit Tests — Ido Pesok, Vercel v0

Treat application evals as statistical measurements over real user traffic, not unit tests — prompt tweaks alone never close the demo-to-prod gap.

13.9K views · Aug 06, 2025

AI Agents, Meet Test Driven Development

Treat AI products like TDD systems with continuous eval datasets and LLM-as-judge metrics, not one-shot prompt engineering.

13.4K views · Feb 22, 2025

Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith

Treat benchmarking as a subset of evaluation, and pair performance tools (GuideLLM) with accuracy harnesses (lm-eval-harness, OpenAI Evals) to make production LLM rollouts measurable.

13.4K views · Jul 27, 2025

Lessons from the Trenches: Building LLM Evals That Work IRL: Aparna Dhinakaran

Stop benchmarking models and start instrumenting your app's component-level traces with task-specific LLM-as-judge or heuristic evals.

13.1K views · Feb 06, 2025

Shipping AI That Works: An Evaluation Framework for PMs – Aman Khan, Arize

AI PMs should treat evals as the new product spec — non-deterministic, data-dependent, and the durable moat for any agentic product.

13.1K views · Dec 26, 2025

Why should anyone care about Evals? — Manu Goyal, Braintrust

Evals are the laboratory that lets you iterate offline and turn production traffic into the next training set — without them you ship blind.

13.1K views · Jun 27, 2025

How METR measures Long Tasks and Experienced Open Source Dev Productivity - Joel Becker, METR

METR's measurement shows AI tools can slow experienced devs even when they feel faster, and capability growth is tightly coupled to compute scaling.

10.6K views · Jan 19, 2026

Evaluating Domain Specific LLMs for Real World Finance — Waseem Alshikh, Writer

General LLMs collapse on noisy real-world finance queries and context; domain-specific models stay robust on the failure modes that matter in production.

9.5K views · Apr 22, 2025

Dark Factory: How OpenClaw Ships Faster Than You Can Read the Diff — Vincent Koc

Replace static benchmark evals with adaptive eval pipelines that evolve alongside your agents as you ship faster than you can review diffs.

9.3K views · May 12, 2026

What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench

Benchmarks all trend up while models still confidently bullshit on nonsense prompts; epistemic pushback is a major remaining gap.

8.8K views · Apr 24, 2026

The Future of Evals - Ankur Goyal, Braintrust

Eval work has been painfully manual; with Claude 4-class models, agents like Braintrust Loop can now autonomously improve prompts, datasets and scorers.

8.4K views · Aug 09, 2025

Why building eval platforms is hard — Phil Hetzel, Braintrust

Evals platforms are multi-persona, multi-stage systems problems that quickly outgrow spreadsheets and homegrown loops once teams take agent quality seriously.

8.0K views · Apr 28, 2026

Why Agent Hype can fall short of reality – Joel Becker, METR

Trust the exponential time-horizon trend for raw capability, but expect a gap between benchmark hype and real-world developer productivity gains.

7.8K views · Dec 24, 2025

Judge the Judge: Building LLM Evaluators That Actually Work with GEPA — Mahmoud Mabrouk, Agenta AI

LLM judges only work when calibrated against human annotations per specific error type using prompt optimization like GEPA.

6.3K views · Apr 10, 2026

7 Habits of Highly Effective Generative AI Evaluations - Justin Muller

Evals exist to discover problems, not to compute a vanity score — building one is what turns a stuck GenAI prototype into a scalable production workload.

5.9K views · Jun 03, 2025

Practical tactics to build reliable AI apps — Dmitry Kuchin, Multinear

Pick a single business-aligned reliability metric for your AI app and iterate the prompt, model and data against it — generic NLP metrics are noise.

5.4K views · Aug 03, 2025

2025 is the Year of Evals! Just like 2024, and 2023, and … — John Dickerson, CEO Mozilla AI

Agentic systems taking real actions are finally forcing evaluation tooling out of CIO-only sales and into board-level enterprise budgets in 2025.

5.0K views · Aug 06, 2025

Everything You Need To Know About Agent Observability — Danny Gollapalli & Zubin Koticha, Raindrop

Move from offline eval sets to production monitoring with implicit/explicit signals and live experiments — agents fail in ways unit tests can't anticipate.

4.9K views · May 07, 2026

Engineering Better Evals: Scalable LLM Evaluation Pipelines That Work — Dat Ngo, Aman Khan, Arize

Production eval pipelines mean traces feeding many calibrated LLM judges per trace, with drift monitoring — modeled on what Duolingo and Reddit actually do.

4.9K views · Jun 27, 2025

New York Times' Connections: A Case Study on NLP in Word Games — Shafik Quoraishee, NYT Games

Connections is a reproducible NLP benchmark for abstract reasoning where graph-coloring formulations outperform raw semantic similarity.

4.8K views · Jul 05, 2025

Shipping complex AI applications — Braintrust & Trainline

Shipping production AI agents requires the same eval-and-observability discipline Trainline applies via Braintrust to keep agentic ticketing reliable.

4.5K views · May 01, 2026

Coding Evals: From Code Snippets to Codebases – Naman Jain, Cursor

Coding evals are moving from clean snippets to dynamic, real-world, multi-hour codebase tasks with auto-generated tests to combat contamination and brittleness.

4.1K views · Dec 15, 2025

Fuzzing in the GenAI Era — Leonard Tang, Haize Labs

Treating GenAI eval as adversarial fuzzing exposes brittleness that static golden-set tests miss, and the judge itself must be evaluated to be trusted.

3.9K views · Aug 22, 2025

Agent Optimization with Pydantic AI: GEPA, Evals, Feedback Loops — Samuel Colvin, Pydantic

Treat prompts as managed variables and use GEPA-style genetic optimization with golden-set evals to systematically improve agent reliability.

3.9K views · May 07, 2026

Iterating on LLM apps at scale Learnings from Discord: Ian Webster

At Discord scale, simple deterministic evals run on every PR like unit tests beat fancy LLM-graded eval pipelines for shipping safely.

3.8K views · Nov 22, 2024

Turning Fails into Features: Zapier’s Hard-Won Eval Lessons — Rafal Willinski, Vitor Balocco, Zapier

Treat probabilistic agents like a data flywheel: instrument traces so any run becomes a replayable eval, then mine implicit feedback to drive continuous improvement.

3.8K views · Jun 30, 2025

How Zapier Builds AI Products and Features with the Help of Braintrust: Ankur Goyal & Olmo Maldonado

Mature AI products require evals owned jointly by PMs and engineers, run in CI, with tracing across multi-tool agent flows — Zapier's 300% accuracy gain proves it.

3.5K views · Nov 07, 2024

Mission-Critical Evals at Scale (Learnings from 100k medical decisions)

Real-time reference-free evals (LLM-as-judge + confidence) prioritize human review where it matters and let mission-critical AI scale beyond what clinicians could ever cover.

3.4K views · Feb 22, 2025

How to build world-class AI products — Sarah Sachs (AI lead @ Notion) & Carlos Esteban (Braintrust)

World-class AI products spend the majority of engineering time on evals and observability, not prompts — that's how Notion ships fast at consumer scale.

3.4K views · Jun 27, 2025

Your Evals Are Meaningless (And Here's How to Fix Them)

Build evals as a dynamic system with SME-authored datasets and domain-specific LLM-judge criteria — don't rely on framework defaults that drift away from your users' definition of good.

3.2K views · Feb 22, 2025

Agent Evals: Finally, With The Map

A complete agent eval program covers both semantic and behavioral dimensions and treats the LLM-judge layer (EvalOps) as a first-class optimization target.

3.1K views · Feb 22, 2025

Evaluating AI Search: A Practical Framework for Augmented AI Systems — Quotient AI + Tavily

Evaluating production AI search needs live LLM-judge monitoring on real traffic, not just static benchmarks like SimpleQA, because both the web and user intent keep moving.

3.1K views · Jul 29, 2025

Agentic Excellence: Mastering AI Agent Evals w/ Azure AI Evaluation SDK — Cedric Vidal, Microsoft

Use Azure AI Toolkit + Evaluation SDK to spot-check models, then scale to dataset evals, treating evaluation as a layered application-level concern.

3.0K views · Jun 27, 2025

Perceptual Evaluations: Evals for Aesthetics — Diego Rodriguez, Krea.ai

Aesthetic evals must center human perception — current metrics like FID and CLIP miss what people actually find broken in generative imagery.

2.9K views · Aug 23, 2025

The Build-Operate Divide: Bridging Product Vision and AI Operational Reality

Crossing the V1-to-V2 quality chasm in AI products comes from a fast eval-iteration loop plus disciplined human-in-the-loop review, not better base models.

2.7K views · Jul 02, 2025

Break It 'Til You Make It: Building the Self-Improving Stack for AI Agents - Aparna Dhinakaran

Agent evals must span tool calls, trajectories, and full conversations — and the evals themselves need to evolve alongside the agent.

2.4K views · Jun 10, 2025

Judging LLMs: Alex Volkov

Production LLM apps must log/trace everything from day one, and use a layered eval stack (programmatic + human + LLM-judge) rather than skipping straight to fine-tuning.

2.3K views · Sep 09, 2024

Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize

Eval is just testing for non-deterministic systems — capture traces, look at them, then write code + LLM-judge + meta-evals before tuning prompts or swapping models.

2.3K views · May 14, 2026

[Full Workshop] Building Metrics that actually work — David Karam, Pi Labs (fmr Google Search)

Eval is a methodology discipline (calibrate metrics with humans, expand to many signals) not a one-shot benchmark — learn from Google Search's 300-metric setup.

1.9K views · Jul 29, 2025

open-rag-eval: RAG Evaluation without "golden" answers — Ofer Mendelevitch, Vectara

open-rag-eval scores RAG quality without golden datasets using nugget-based generation evaluation and UMBRELA retrieval scoring.

1.8K views · Jun 03, 2025

Benchmarks Are Memes: How What We Measure Shapes AI—and Us - Alex Duffy, Every.to

Whoever designs the next benchmark shapes what frontier models become—build benchmarks that are multifaceted, generative, evolutionary, and experiential, not just easy to score.

1.6K views · Jul 15, 2025

Taming Rogue AI Agents with Observability-Driven Evaluation — Jim Bennett, Galileo

Reliable agents need step-level LLM-as-judge evaluations baked into observability pipelines from day one, not just final-answer scoring.

1.6K views · Jun 27, 2025

Mind the Gap (In your Agent Observability) — Amy Boyd & Nitya Narasimhan, Microsoft

Treat observability as a 3-phase loop (evaluate → monitor → optimize) built on OpenTelemetry tracing and agent-specific evaluators — non-determinism makes this a continuous practice, not a one-time eval.

1.3K views · May 14, 2026

CI in the Era of AI: From Unit Tests to Stochastic Evals — Nathan Sobo, Zed

Shipping AI features inside Zed forced the team from fully-deterministic CI to programmatic stochastic evals where assertions interrogate specific agent steps.

809 views · Jun 27, 2025

How to run Evals at Scale: Thinking beyond Accuracy or Similarity — Muktesh Mishra, Adobe

Pick the eval metric to match the application type (RAG vs code-gen vs agent) and treat evals as a continuous, data-centric, parallelizable engineering practice.

787 views · Jul 22, 2025

The Benchmarks Game: Why It's Rigged and How You Can (Really) Win - Darius Emrani

Public benchmarks are rigged by Goodhart's Law and selective comparison — build small domain-specific evals from real production queries instead.

774 views · Jun 03, 2025

E-Values Evaluating the Values of AI: Sheila Gulati and Nischal Nadhamuni

Evals must become multifaceted, dynamic, and values-aware before agentic systems become fully self-sufficient, or we'll lose the ability to course-correct.

727 views · Dec 31, 2024

BotDojo Launch: Enhancing AI Assistants with Evaluations and Synthetic Data

Pairing batch-based LLM evaluations with synthetic data generated from real support tickets is a fast path from POC chatbot to production-ready RAG.

724 views · Feb 05, 2025

Agents reported thousands of bugs, how many were real? - Ian Butler and Nick Gregory

Software-maintenance agents are still poor at bug detection — high false positives, narrow reasoning, and lack of holistic code reading limit real-world reliability today.

666 views · Jun 03, 2025

Best Practices for Evaluating Large Language Model Applications with llmeval: Niklas Nielsen

llm-eval gives teams a local, Hydra-configurable CLI to test prompt and model changes across providers with stable multi-sample reporting.

462 views · Feb 05, 2025

Will Agent evaluation via MCP Stabilize Agent Networks? - Ari Heljakka

Exposing evaluators via MCP lets any agent get scored, explained feedback inline — turning evals from a one-off harness into a continuous stabilization loop for agent networks.

457 views · Jun 03, 2025

How to evaluate a model for your use case: Emmanuel Turlay

Use LLM-as-judge with task-specific rubrics and visualize score distributions — generic NLP benchmarks won't tell you which model fits your application.

248 views · Feb 05, 2025