Judge the Judge: Building LLM Evaluators That Actually Work with GEPA — Mahmoud Mabrouk, Agenta AI
Takeaway
LLM judges are only trustworthy when each one is calibrated against human annotations for a specific error type, for example with a prompt optimizer such as GEPA.
Summary
- Generic LLM-as-judge prompts (e.g., 'rate hallucination') are not calibrated against human annotations, so they produce useless signal in observability.
- Mabrouk, Agenta's CEO, advocates calibrating judges against subject-matter-expert annotations with GEPA prompt optimization to speed up the eval-iteration loop.
- The practical case study is Sierra's τ-Bench airline agent: 599 traces, split 62% compliant / 38% non-compliant.
- Four-step workflow: design metrics from real error analysis clusters (policy, response style, info delivery, tool use), annotate data, optimize judge via GEPA, validate.
- Build one judge per error type with binary pass/fail + reasoning, not 1–5 scales or generic 'success' judges — narrow scope is easier to calibrate.
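The validation step can be pictured as comparing each narrow judge's binary verdicts with the human labels for the same traces. The following is a minimal sketch under stated assumptions, not the workshop's code: the `Verdict` record, the `ERROR_TYPES` identifiers, the `agreement` helper, and the toy labels are all hypothetical names and data chosen for illustration. It does not call GEPA or any LLM.

```python
from dataclasses import dataclass

# Error clusters named in the summary; these identifiers are hypothetical.
ERROR_TYPES = ("policy", "response_style", "info_delivery", "tool_use")


@dataclass(frozen=True)
class Verdict:
    """One judge's output for one trace: binary pass/fail plus reasoning."""
    trace_id: str
    error_type: str
    passed: bool
    reasoning: str


def agreement(human: dict[str, bool], judge: dict[str, bool]) -> dict[str, float]:
    """Compare judge verdicts with human labels on the traces both cover.

    True means "pass". TPR is the share of human-passed traces the judge
    also passes; TNR is the share of human-failed traces it also fails.
    """
    shared = sorted(human.keys() & judge.keys())
    if not shared:
        raise ValueError("no overlapping traces to compare")
    tp = sum(human[t] and judge[t] for t in shared)
    tn = sum((not human[t]) and (not judge[t]) for t in shared)
    pos = sum(human[t] for t in shared)
    neg = len(shared) - pos
    return {
        "accuracy": (tp + tn) / len(shared),
        "tpr": tp / pos if pos else float("nan"),
        "tnr": tn / neg if neg else float("nan"),
    }


# Toy data (made up): human annotations for the "policy" error type.
human_policy = {"t1": True, "t2": True, "t3": True, "t4": False, "t5": False}

# Toy judge output for the same traces, stored as Verdict records.
verdicts = [
    Verdict("t1", "policy", True, "Followed the change-fee policy."),
    Verdict("t2", "policy", True, "Refund was within policy."),
    Verdict("t3", "policy", False, "Judge flagged a policy breach."),
    Verdict("t4", "policy", False, "Cancelled a non-refundable fare."),
    Verdict("t5", "policy", False, "Skipped required confirmation."),
]
judge_policy = {v.trace_id: v.passed for v in verdicts if v.error_type == "policy"}

metrics = agreement(human_policy, judge_policy)
print(metrics)
```

Reporting TPR and TNR separately, rather than accuracy alone, matters on an imbalanced set: with a 62/38 split like the one above, a judge that always answers "pass" would score 62% accuracy while catching no failures at all.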
evals, llm-as-judge, gepa
Original description
Miscalibrated evals are worse than no evals. They give false confidence while being, at best, useless. This workshop walks you through building a calibrated LLM-as-a-judge, from capturing ground truth to optimizing with GEPA and assessing the judge. You will leave with an LLM-as-a-judge you can trust to actually improve your app.

Mahmoud Mabrouk - Co-founder and CEO, Agenta AI
Mahmoud Mabrouk is the cofounder and CEO of Agenta, an open-source LLMOps platform for building and evaluating LLM applications. He has spent the past 15 years working in machine learning and holds a PhD in applied machine learning for computational biology.

Resources:
- Workshop repo: https://github.com/Agenta-AI/judge-the-judge-talk-2026
- GEPA repository: https://github.com/gepa-ai/gepa
- GEPA paper: https://arxiv.org/abs/2507.19457
- Hamel’s guide for error analysis: https://hamel.dev/blog/posts/field-guide/

Socials:
- https://x.com/mmabrouk_
- https://www.linkedin.com/in/mmabrouk2/
- https://agenta.ai
- https://github.com/agenta-ai/agenta