7 Habits of Highly Effective Generative AI Evaluations - Justin Muller

5.9K views · Jun 03, 2025 · 25:38 min
Takeaway

Evals exist to discover problems, not to compute a vanity score — building one is what turns a stuck GenAI prototype into a scalable production workload.

Summary

  • Justin Muller, a Principal Applied AI Architect at AWS, argues that evaluations are the #1 missing piece blocking GenAI projects from scaling; whether a project has evals is also his sharpest filter for telling a science project from a real production project.
  • Case study: a document-processing customer with 6-8 engineers and about six months of work was stuck at 22% accuracy; after building an eval framework, they hit 92% in six months and became AWS's largest document-processing workload in North America.
  • Reframes the purpose of evals: the primary goal is discovering *where* the problems are (with LLM reasoning suggesting fixes), and measuring quality is a distant third; this changes how you design the framework.
  • Free-text outputs are not unmeasurable: humans have graded essays for centuries. Good evaluators, like good professors, give reasons and not just scores, and they evaluate the reasoning and method as well as the final output.
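
As an illustration of the "discover where the problems are" framing, here is a minimal sketch of an eval loop that keeps the judge's reasoning alongside each verdict and groups failures by slice. It is not code from the talk. The names `Case`, `Verdict`, `keyword_judge`, `fake_model`, and `run_eval` are hypothetical, the sample data is made up, and the keyword check merely stands in for an LLM judge so the example runs offline.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Case:
    category: str   # hypothetical slice label, e.g. a document type
    prompt: str
    expected: str


@dataclass
class Verdict:
    passed: bool
    reasoning: str  # why the judge passed or failed the output


def keyword_judge(output: str, case: Case) -> Verdict:
    """Stand-in for an LLM judge: a deterministic check that still explains itself."""
    if case.expected.lower() in output.lower():
        return Verdict(True, f"output contains expected value '{case.expected}'")
    return Verdict(False, f"expected '{case.expected}' but got '{output}'")


def run_eval(cases, generate, judge):
    """Return the pass rate per category and the judge's reasoning for each failure."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    failures = defaultdict(list)
    for case in cases:
        verdict = judge(generate(case.prompt), case)
        totals[case.category] += 1
        if verdict.passed:
            passes[case.category] += 1
        else:
            failures[case.category].append(verdict.reasoning)
    rates = {category: passes[category] / totals[category] for category in totals}
    return rates, dict(failures)


# Made-up data: a fake "model" that handles invoices but not receipts.
CANNED = {
    "total of invoice A": "The total is 120.00",
    "total of invoice B": "The total is 75.50",
    "date of receipt C": "unknown",
}


def fake_model(prompt: str) -> str:
    return CANNED[prompt]


cases = [
    Case("invoice", "total of invoice A", "120.00"),
    Case("invoice", "total of invoice B", "75.50"),
    Case("receipt", "date of receipt C", "2024-01-05"),
]

rates, failures = run_eval(cases, fake_model, keyword_judge)
print(rates)     # pass rate per category shows where the problems cluster
print(failures)  # the reasoning attached to each failure hints at the fix
```

The per-category breakdown and the stored reasoning are the point of the sketch; a single overall accuracy number would hide that every failure here sits in one slice.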
evals · aws · accuracy
Original description
Evaluations are the single most reliable indicator of the health and long-term viability of any gen AI project. As a Principal Applied AI Architect for AWS, I've had the opportunity to look at over 100 different attempts at evaluation frameworks over the last few years.
In this talk I share some stories about the best and worst, and then distill the 7 most common elements I've seen in successful evaluations.

Slides at https://d2ot4ns4zf41bm.cloudfront.net/slides/7+Habits+AI+World's+Fair.pptx