Judge the Judge: Building LLM Evaluators That Actually Work with GEPA — Mahmoud Mabrouk, Agenta AI

6.3K views · Apr 10, 2026 · 40:51 min
Takeaway

LLM judges only work when each one is calibrated against human annotations for a specific error type, using a prompt optimizer such as GEPA.

Summary

  • Generic LLM-as-judge prompts (e.g., 'rate hallucination') are not calibrated against human annotations, so they produce little usable signal in observability.
  • Mabrouk, Agenta's CEO, advocates calibrating judges against subject-matter-expert annotations with GEPA prompt optimization, which speeds up the eval-iteration loop.
  • The practical case study is the airline agent from Sierra's τ-Bench: 599 traces, split 62% compliant / 38% non-compliant.
  • Four-step workflow: (1) design metrics from the clusters found in real error analysis (policy, response style, info delivery, tool use), (2) annotate data, (3) optimize the judge with GEPA, (4) validate.
  • Build one judge per error type that returns a binary pass/fail plus reasoning, rather than a 1–5 scale or a generic 'success' judge; a narrow scope is easier to calibrate.
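
The validation step in the workflow above can be sketched as a comparison between a judge's binary verdicts and human labels for one error type. This is a minimal illustration in plain Python, not code from the talk or the workshop repo: the `Verdict` class, the `agreement` function, the identifiers in `ERROR_TYPES`, and the metric choices (accuracy, per-class recall, Cohen's kappa) are all assumptions made for this sketch. The labels in the usage example are synthetic; they only borrow the 62/38 compliant/non-compliant ratio quoted above.

```python
from dataclasses import dataclass

# The four error clusters named in the summary; identifiers are illustrative.
ERROR_TYPES = ("policy", "response_style", "info_delivery", "tool_use")


@dataclass(frozen=True)
class Verdict:
    """One binary judgment for one error type, with the judge's reasoning."""
    passed: bool
    reasoning: str


def agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare judge verdicts to human labels (True = pass) for one error type."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    tp = sum(h and j for h, j in zip(human, judge))              # both say pass
    tn = sum((not h) and (not j) for h, j in zip(human, judge))  # both say fail
    pos, judge_pos = sum(human), sum(judge)
    neg = n - pos
    accuracy = (tp + tn) / n
    # Agreement expected by chance, given each rater's pass rate (Cohen's kappa).
    p_e = (pos * judge_pos + neg * (n - judge_pos)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return {
        "accuracy": accuracy,
        "pass_recall": tp / pos if pos else float("nan"),
        "fail_recall": tn / neg if neg else float("nan"),
        "kappa": kappa,
    }


# Synthetic example: 100 traces with a 62/38 pass/fail split, scored by a
# degenerate judge that always answers "pass".
human_labels = [True] * 62 + [False] * 38
always_pass = agreement(human_labels, [True] * 100)
print(always_pass)
```

On this synthetic data the always-pass judge reaches 62% accuracy while catching none of the failing traces, and its kappa is zero. That is one reason to report per-class recall or a chance-corrected score alongside raw accuracy when the labels are imbalanced.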
evals · llm-as-judge · gepa
Original description
Miscalibrated evals are worse than no evals. They give false confidence while being, at best, useless. This workshop walks you through building a calibrated LLM-as-a-judge, from capturing ground truth to optimizing with GEPA and assessing the judge. You will leave with an LLM-as-a-judge you can trust to actually improve your app.

Mahmoud Mabrouk - Co-founder and CEO, Agenta AI

Mahmoud Mabrouk is the cofounder and CEO of Agenta, an open-source LLMOps platform for building and evaluating LLM applications. He has spent the past 15 years working in machine learning and holds a PhD in applied machine learning for computational biology.

Resources:
- Workshop repo: https://github.com/Agenta-AI/judge-the-judge-talk-2026
- GEPA repository: https://github.com/gepa-ai/gepa
- GEPA paper: https://arxiv.org/abs/2507.19457
- Hamel’s guide for error analysis: https://hamel.dev/blog/posts/field-guide/

Socials:
https://x.com/mmabrouk_
https://www.linkedin.com/in/mmabrouk2/
https://agenta.ai
https://github.com/agenta-ai/agenta