Shipping AI That Works: An Evaluation Framework for PMs – Aman Khan, Arize

13.1K views · Dec 26, 2025 · 86:16 min

Takeaway

AI PMs should treat evals as the new product spec: they are how you pin down behavior that is non-deterministic and data-dependent, and they are the durable moat for any agentic product.

Summary

Aman Khan (ex-Cruise self-driving, ex-Spotify ML) lays out an evaluation framework aimed specifically at AI product managers building agentic features.
He points out that OpenAI's Kevin Weil, Anthropic's Mike Krieger, and Greg Brockman have all said publicly that evals are the moat; when the model vendors themselves say their models hallucinate, it is worth listening.
Evals differ from unit tests in three ways: LLMs are non-deterministic and manipulable, agents can take multiple valid paths to the same goal, and agents depend on your data (not just your code).
He then walks through building and evaluating a multi-agent trip planner, using LLM-as-a-judge evaluation.
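To make the LLM-as-a-judge idea concrete, here is a minimal sketch in Python. It is not the code from the talk and not any vendor's API: the prompt wording, `judge`, `pass_rate`, `make_fake_planner`, and `fake_judge` are all hypothetical names invented for illustration, and the model calls are replaced by stubs so the example runs offline. Because outputs are non-deterministic, the sketch scores several runs and reports a pass rate rather than asserting on a single output, which is one way an eval differs from a unit test.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Callable

# Hypothetical judge prompt; real wording would be tuned against human labels.
JUDGE_PROMPT = """You are evaluating a trip-planning assistant.
User request: {request}
Assistant itinerary: {response}
Does the itinerary respect the user's stated budget and destination?
Answer with exactly one word: "pass" or "fail"."""


@dataclass
class EvalResult:
    label: str  # "pass", "fail", or "unparseable"
    raw: str    # the judge's unmodified answer, kept for debugging


def judge(request: str, response: str, call_llm: Callable[[str], str]) -> EvalResult:
    """Ask a judge model to grade one response and normalize its answer."""
    raw = call_llm(JUDGE_PROMPT.format(request=request, response=response))
    label = raw.strip().lower().rstrip(".")
    if label not in {"pass", "fail"}:
        label = "unparseable"  # never silently count a malformed answer
    return EvalResult(label=label, raw=raw)


def pass_rate(request: str,
              generate: Callable[[str], str],
              call_llm: Callable[[str], str],
              n_runs: int = 5) -> float:
    """Run the (non-deterministic) system several times and score each run."""
    results = [judge(request, generate(request), call_llm) for _ in range(n_runs)]
    return sum(r.label == "pass" for r in results) / n_runs


def make_fake_planner() -> Callable[[str], str]:
    """Stub standing in for the trip-planner agent: alternates two answers."""
    answers = cycle([
        "3 days in Lisbon, total $900",
        "3 days in Lisbon, total $2500 (over budget)",
    ])
    return lambda request: next(answers)


def fake_judge(prompt: str) -> str:
    """Stub standing in for the judge model (a real one would be an LLM call)."""
    return "fail" if "over budget" in prompt else "Pass."


if __name__ == "__main__":
    rate = pass_rate("Plan a 3-day trip to Lisbon under $1000",
                     make_fake_planner(), fake_judge, n_runs=4)
    print(f"pass rate: {rate:.2f}")
```

In practice the two stubs would be swapped for the real agent and a real judge model, and the judge's labels would themselves be spot-checked against human judgments before being trusted.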

evals · product-management · agents

Original description

GenAI is reshaping the product landscape, creating huge opportunities (along with new expectations) for product managers. Yet while prompt engineering and model tuning get the spotlight, one critical skill can get overlooked: rigorous evaluation.

This talk will help PMs move beyond gut-feel “vibe checks” to adopt concrete, repeatable evaluation strategies for LLM-powered products. I'll break down essential eval methodologies, from human feedback and code-based checks to cutting-edge LLM-based evaluations. Drawing on real-world examples, I'll share a practical framework PMs can use to:

- Confidently evaluate AI-driven features
- Ground decisions in real, repeatable data
- Build trust and delight through consistent quality