Taming Rogue AI Agents with Observability-Driven Evaluation — Jim Bennett, Galileo

1.6K views · Jun 27, 2025 · 16:14 min

Takeaway

Reliable agents need step-level LLM-as-judge evaluations baked into observability pipelines from day one, not just final-answer scoring.

Summary

Real-world AI failures (the Chicago Sun-Times' hallucinated book list, Butler Snow's hallucinated case law, and Air Canada being held to a refund its chatbot promised) motivate observability-driven evaluation.
Agents cannot be unit-tested in the conventional sense: their non-deterministic, multi-step flows require granular metrics at every step (tool call success, RAG retrieval relevance, hallucination scores).
'Set a thief to catch a thief': use an LLM as a judge (preferably a stronger or custom-trained model) to evaluate cheaper production models, scoring a sample of traffic rather than all of it, e.g. 10k of 1M daily traces.
Galileo ships a small custom-trained eval LM optimized for grading agent traces.
Best time to add evals is during prompt engineering / model selection; second best is now — bake them into CI/CD and production observability.
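The points above (step-level metrics, a judge model, and sampling a fraction of traces) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Galileo's API or the method shown in the talk: `Step`, `Trace`, `sample_traces`, `evaluate_trace`, and `stub_judge` are hypothetical names, and the judge here is a keyword heuristic standing in for a real LLM call.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    kind: str    # hypothetical step types: "tool_call", "retrieval", "generation"
    input: str
    output: str


@dataclass
class Trace:
    trace_id: str
    steps: list[Step]


# A judge maps one step to a score in [0, 1]. In a real system this would
# wrap a call to a stronger or custom-trained evaluation model.
Judge = Callable[[Step], float]


def sample_traces(traces: list[Trace], rate: float, seed: int = 0) -> list[Trace]:
    """Pick a random subset of traces to evaluate (e.g. rate=0.01 for 10k of 1M)."""
    if not traces:
        return []
    k = min(len(traces), max(1, round(len(traces) * rate)))
    return random.Random(seed).sample(traces, k)


def evaluate_trace(trace: Trace, judge: Judge, threshold: float = 0.5) -> dict:
    """Score every step, not just the final answer, and flag steps below threshold."""
    scores = [judge(step) for step in trace.steps]
    failing = [i for i, score in enumerate(scores) if score < threshold]
    return {"trace_id": trace.trace_id, "scores": scores, "failing_steps": failing}


def stub_judge(step: Step) -> float:
    """Placeholder heuristic (an assumption for this sketch, not a real judge)."""
    if step.kind == "tool_call":
        return 0.0 if "error" in step.output.lower() else 1.0
    return 1.0 if step.output.strip() else 0.0


if __name__ == "__main__":
    trace = Trace(
        trace_id="t-1",
        steps=[
            Step("retrieval", "refund policy", "Policy document, section 3"),
            Step("tool_call", "lookup_booking(123)", "Error: booking not found"),
            Step("generation", "draft reply", "Your refund has been approved."),
        ],
    )
    print(evaluate_trace(trace, stub_judge))
```

Final-answer scoring alone would pass this trace, since the last step produced a fluent reply; the per-step scores show that the tool call before it failed, which is the kind of signal the talk argues for capturing.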

galileo · llm-as-judge · observability

Original description

LLM agents often drift into failure when prompts, retrieval, external data, and policies interact in unpredictable ways. This session introduces a repeatable, metric-driven framework for detecting, diagnosing, and correcting these undesirable behaviors in agentic systems at production scale.

About Jim Bennett
Jim is the world's most energetic dev rel, and a Principal Developer Advocate at Galileo, focusing on enabling AI developers to be more productive by monitoring and evaluating LLMs and AI agents. He’s British, so sounds way smarter than he actually is, and lives in the Pacific Northwest of the USA. In the past he’s lived on 4 continents working as a developer in the mobile, desktop, and scientific space. He's spoken at conferences and events all around the globe, organised meetup groups and communities, and written books on mobile development and IoT. He is currently a Microsoft MVP for AI and Developer Tools.

He also hates and is allergic to cats, but has a 12-year-old who loves cats, so he has 2 cats.

Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter