Break It 'Til You Make It: Building the Self-Improving Stack for AI Agents - Aparna Dhinakaran

2.4K views · Jun 10, 2025 · 14:25 min

Takeaway

Agent evals must span tool calls, trajectories, and full conversations — and the evals themselves need to evolve alongside the agent.

Summary

Arize founder walks through layered agent evals: tool-call (did it call the right tool and pass the right arguments), trajectory (did it call tools in the right order), multi-turn conversation, and self-improving evals.
The demo uses traces from Arize's own copilot: in the high-level path view, the search Q&A correctness eval flags arguments that are consistently incorrect even though the correct tool was called, which is useful for pinpointing argument-level bottlenecks.
Trajectory evals matter because divergent tool orderings burn tokens and often produce wrong outputs even when each call is individually correct.
Closing the loop with evals that are themselves refined, so that the agent and its evals improve together, is the key to a 'self-improving stack'.
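
The tool-call and trajectory layers described above can be sketched in a few lines. This is a minimal illustration, not the talk's or Arize's implementation: the `ToolCall` type, the function names, and the exact-match comparison are all assumptions made for the example. A production eval would likely use fuzzier matching or an LLM judge.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation pulled from a trace."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def eval_tool_call(actual: ToolCall, expected: ToolCall) -> dict[str, bool]:
    """Tool-call eval: was the right tool called, with the right arguments?"""
    tool_ok = actual.name == expected.name
    # Arguments are only judged when the tool itself is correct, so a
    # "right tool, wrong arguments" failure shows up as its own case.
    args_ok = tool_ok and actual.args == expected.args
    return {"tool_correct": tool_ok, "args_correct": args_ok}


def eval_trajectory(actual: list[ToolCall], expected_order: list[str]) -> dict[str, Any]:
    """Trajectory eval: were the tools called in the expected order?"""
    names = [call.name for call in actual]
    return {
        "order_correct": names == expected_order,
        # Calls beyond the expected count are a rough proxy for wasted tokens.
        "extra_calls": max(0, len(names) - len(expected_order)),
    }


# Example trace: each call is plausible on its own, but the repeated
# search makes the trajectory diverge from the expected path.
trace = [
    ToolCall("search", {"query": "pricing"}),
    ToolCall("search", {"query": "pricing page"}),
    ToolCall("answer", {"source": "doc-1"}),
]
print(eval_tool_call(trace[0], ToolCall("search", {"query": "pricing tiers"})))
print(eval_trajectory(trace, ["search", "answer"]))
```

Keeping the two checks separate mirrors the layering in the summary: a trace can pass every tool-call check and still fail the trajectory check.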

agent-evals · observability · arize

Original description

Building and shipping an AI agent is just the beginning. In real-world systems, the real work starts after deployment — when agents drift, fail silently, or underperform in edge cases no one anticipated.

This talk is about building the full monitoring and improvement stack that keeps agents reliable, efficient, and improving over time. We’ll walk through how to connect evals, tracing, observability, experimentation, and optimization into a virtuous cycle — one where agents not only perform, but learn and adapt in production.

Drawing on real-world deployments, I’ll cover:

- Composing evaluation layers that surface meaningful failure modes
- Tracing and instrumentation for deep visibility into agent behavior
- Running experiments that actually improve outcomes
- Closing the loop with feedback-driven optimization
- People know to improve the agent application, but do they also know they need to improve their evals in tandem?

If you’re scaling agents beyond the prototype phase, this is the talk that helps you move from working once to working continuously.