Ensure AI Agents Work: Evaluation Frameworks for Scaling Success, by Aparna Dhinakaran (Arize)

Original: Ensure AI Agents Work: Evaluation Frameworks for Scaling Success — Aparna Dhinkaran, CEO Arize

32.5K views · Apr 23, 2025 · 15:27 min

Takeaway

Evaluate agents at three layers — router decisions, individual skill correctness, and the convergence of the overall path — not just final-answer quality.

Summary

Aparna Dhinakaran of Arize decomposes agents into a router, skills, and memory, each of which requires its own evaluation.
Router evals check whether the right skill was called with the right parameters; skill evals (e.g., RAG relevance, answer correctness) use LLM-as-judge or code-based scoring.
Path/convergence evaluation is where teams struggle most: did the agent take a consistent ~5-step trajectory for similar inputs, or wildly diverge?
Voice agents (Priceline Penny, 1B+ call-center calls) need additional eval dimensions on top of text-agent evals.
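
The router and path checks described above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's or Arize's implementation: the `RouterCase` fields and both function names are hypothetical, and the convergence formula (shortest observed path length divided by each run's path length, averaged over runs) is one assumed way to turn "consistent trajectory" into a number.

```python
from dataclasses import dataclass


@dataclass
class RouterCase:
    """One labeled routing decision. All field names are hypothetical."""
    expected_skill: str
    expected_params: dict
    actual_skill: str
    actual_params: dict


def router_accuracy(cases: list[RouterCase]) -> dict[str, float]:
    """Share of cases with the right skill, and with the right skill AND parameters."""
    skill_hits = [c.actual_skill == c.expected_skill for c in cases]
    full_hits = [
        hit and c.actual_params == c.expected_params
        for hit, c in zip(skill_hits, cases)
    ]
    return {
        "skill": sum(skill_hits) / len(cases),
        "skill_and_params": sum(full_hits) / len(cases),
    }


def convergence_score(path_lengths: list[int]) -> float:
    """Assumed metric: mean of (shortest observed path / each run's path).

    A score of 1.0 means every run for this group of similar inputs was as
    short as the best one; lower scores mean some runs took extra steps.
    """
    optimal = min(path_lengths)
    return sum(optimal / n for n in path_lengths) / len(path_lengths)
```

For example, `convergence_score([5, 5, 5])` is 1.0, while `convergence_score([5, 10])` is 0.75 because the second run took twice as many steps as the first. Skill-level checks such as RAG relevance would sit alongside these, typically scored by an LLM judge rather than by exact comparison.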

agents · evals · observability

Original description

Turning AI agents into reliable, production-ready tools that deliver tangible business results requires more than just great models. It demands robust evaluation frameworks that ensure agents perform at scale, align with organizational objectives, and continuously improve in dynamic environments.

This session provides an executive-level perspective on evaluating AI agents at scale. We’ll explore practical strategies for designing evaluation processes that drive measurable impact, identifying and mitigating performance bottlenecks, and implementing observability practices to maintain reliability over time. Through insights from real-world deployments, we’ll highlight common pitfalls, share best practices for iterative improvement, and demonstrate how effective evaluation frameworks can transform experimental agents into enterprise-grade solutions.

Whether you're shaping your organization’s GenAI strategy or looking to unlock the full potential of AI agents, this talk offers actionable insights to ensure your agents work—and scale—successfully.

Recorded live at the Leadership Track Session Day from the AI Engineer Summit 2025 in New York. Learn more at https://ai.engineer and purchase tickets to our next event, the AI Engineer World's Fair, in SF June 3 - 5 here: https://ti.to/software-3/ai-engineer-worlds-fair-2025

About Aparna

Aparna Dhinakaran is the Co-Founder and Chief Product Officer at Arize AI, a pioneer and early leader in machine learning (ML) observability. A frequent speaker at top conferences and thought leader in the space, Dhinakaran was recently named to the Forbes 30 Under 30. Before Arize, Dhinakaran was an ML engineer and leader at Uber, Apple, and TubeMogul (acquired by Adobe). During her time at Uber, she built several core ML infrastructure platforms, including Michelangelo. She has a bachelor’s from Berkeley's Electrical Engineering and Computer Science program, where she published research with Berkeley's AI Research group. She is on a leave of absence from the Computer Vision Ph.D. program at Cornell University.