7 Habits of Highly Effective Generative AI Evaluations - Justin Muller
Takeaway
Evals exist to discover problems, not to compute a vanity score. Building one is what turns a stuck GenAI prototype into a scalable production workload.
Summary
- Muller, a Principal Applied AI Architect at AWS, argues that evaluations are the #1 missing piece blocking GenAI at scale; their absence is also his sharpest filter for telling a science project from a real production project.
- Case study: a document-processing customer with 6-8 engineers and about six months of work was stuck at 22% accuracy; after building an eval framework they reached 92% in six months and became AWS's largest document-processing workload in North America.
- Reframes the purpose of evals: the primary goal is discovering *where* the problems are (with LLM reasoning suggesting fixes), and measuring quality is a distant third. This changes how you design the framework.
- Free-text outputs aren't unmeasurable: humans have graded essays for centuries. Good evaluators (like good professors) give reasons, not just scores, and they evaluate the reasoning or method as well as the final output.
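
As a rough illustration of the "find where the problems are" framing, here is a minimal sketch of an eval loop that keeps a reason and a failure category next to every score. It is not taken from the talk: `Verdict`, `judge`, `run_eval`, the category labels, and the sample cases are all hypothetical, and the rule-based `judge` is only a stand-in for the LLM or human grader a real framework would presumably use.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Verdict:
    passed: bool
    category: str  # hypothetical failure label, e.g. "wrong_value"
    reason: str    # the grader's explanation, kept alongside the score


def judge(expected: str, actual: str) -> Verdict:
    """Stand-in grader (assumption): a real setup would likely call an LLM
    or a human here. This rule-based stub exists so the sketch runs offline."""
    if not actual.strip():
        return Verdict(False, "empty_output", "model returned nothing")
    if expected.lower() not in actual.lower():
        return Verdict(False, "wrong_value",
                       f"expected {expected!r} not found in output")
    return Verdict(True, "ok", "expected value present")


def run_eval(cases):
    """Grade every case and group failures by category, so the report shows
    where the problems are and not only the aggregate score."""
    verdicts = [judge(c["expected"], c["actual"]) for c in cases]
    failures = Counter(v.category for v in verdicts if not v.passed)
    reasons = [v.reason for v in verdicts if not v.passed]
    accuracy = sum(v.passed for v in verdicts) / len(verdicts)
    return {"accuracy": accuracy, "failures": failures, "reasons": reasons}


# Made-up document-extraction cases, for illustration only.
cases = [
    {"expected": "2024-01-15", "actual": "Invoice date: 2024-01-15"},
    {"expected": "ACME Corp", "actual": "Vendor: Acme Corp."},
    {"expected": "$1,200", "actual": ""},
    {"expected": "NET 30", "actual": "Payment terms: NET 60"},
]
report = run_eval(cases)
print(report["accuracy"])
print(report["failures"].most_common())
for reason in report["reasons"]:
    print("-", reason)
```

The aggregate accuracy is still computed, but the per-category counts and the collected reasons are what point to the next fix.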
evals, aws, accuracy
Original description
Evaluations are the single most reliable indicator of the health and long term viability of any gen AI project. As a Principal Applied AI Architect for AWS, I've had the opportunity to look at over 100 different attempts at evaluation frameworks over the last few years. In this talk I share some stories about the best and worst, and then distill the 7 most common elements I've seen in successful evaluations. Slides at https://d2ot4ns4zf41bm.cloudfront.net/slides/7+Habits+AI+World's+Fair.pptx