7 Habits of Highly Effective Generative AI Evaluations - Justin Muller

5.9K views · Jun 03, 2025 · 25:38 min

Takeaway

Evals exist to discover problems, not to compute a vanity score — building one is what turns a stuck GenAI prototype into a scalable production workload.

Summary

Justin Muller, a Principal Applied AI Architect at AWS, argues that evaluations are the number-one missing piece blocking GenAI projects from scaling; the absence of evals is also his sharpest filter for telling a science project from a real production project.
Case study: a document-processing customer with 6-8 engineers and about six months of work behind them was stuck at 22% accuracy; after building an eval framework they reached 92% within six months and became AWS's largest document-processing workload in North America.
He reframes the purpose of evals: the primary goal is discovering *where* the problems are, with the LLM's reasoning suggesting fixes, while measuring quality is a distant third. That ordering changes how you design the framework.
Free-text outputs are not unmeasurable: humans have graded essays for centuries. Good evaluators, like good professors, give reasons and not just scores, and they evaluate the reasoning or method as well as the final output.
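
As a rough illustration of the "discover where the problems are" idea in the summary, here is a minimal Python sketch of an eval loop that stores a reason with every verdict and groups failures by category. It is not taken from the talk: the names `EvalResult`, `exact_match_grader`, `run_eval`, and `failures_by_category`, the `category` field, and the toy data are all hypothetical, and the exact-match grader is only a stand-in for whatever judge (human or LLM) a real framework would use.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalResult:
    case_id: str
    category: str  # hypothetical label for the kind of input
    passed: bool
    reason: str    # why the grader passed or failed this case


def exact_match_grader(expected: str, actual: str) -> tuple[bool, str]:
    """Stand-in grader (an assumption); returns a verdict plus a reason."""
    if expected.strip().lower() == actual.strip().lower():
        return True, "output matches the reference"
    return False, f"expected {expected!r} but got {actual!r}"


def run_eval(cases, generate, grader=exact_match_grader):
    """Run every case through `generate` and keep the grader's reasoning."""
    results = []
    for case in cases:
        actual = generate(case["input"])
        passed, reason = grader(case["expected"], actual)
        results.append(EvalResult(case["id"], case["category"], passed, reason))
    return results


def failures_by_category(results):
    """Count failures per category to show where the problems cluster."""
    return Counter(r.category for r in results if not r.passed)


def accuracy(results):
    """Overall score, kept deliberately secondary to the failure breakdown."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


# Toy data and a toy "model", both invented for illustration.
cases = [
    {"id": "c1", "category": "invoice", "input": "total: 40", "expected": "40"},
    {"id": "c2", "category": "invoice", "input": "total: 12", "expected": "12"},
    {"id": "c3", "category": "handwritten", "input": "totl 7", "expected": "7"},
]
results = run_eval(cases, lambda text: text.split(": ")[-1])
print(failures_by_category(results))
for r in results:
    if not r.passed:
        print(r.case_id, r.category, r.reason)
```

The failure breakdown and the per-case reasons are printed before any aggregate score is consulted, which mirrors the ordering of goals described above.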

evals · aws · accuracy

Original description

Evaluations are the single most reliable indicator of the health and long-term viability of any gen AI project. As a Principal Applied AI Architect for AWS, I've had the opportunity to look at over 100 different attempts at evaluation frameworks over the last few years.
In this talk I share some stories about the best and worst, and then distill the 7 most common elements I've seen in successful evaluations.

Slides at https://d2ot4ns4zf41bm.cloudfront.net/slides/7+Habits+AI+World's+Fair.pptx