Your Evals Are Meaningless (And Here's How to Fix Them)


3.2K views · Feb 22, 2025 · 18:50 min

Takeaway

Build evals as a dynamic system with SME-authored datasets and domain-specific LLM-judge criteria. Don't rely on framework defaults, which drift away from your users' definition of good.

Summary

HoneyHive co-founder describes recurring eval failure patterns across hundreds of teams: cookie-cutter test cases written by devs, not domain experts, and over-reliance on framework defaults.
Three components every eval needs: an 'agent' (system under test), a dataset (inputs + ideal outputs covering edge cases, written by SMEs), and evaluators (human, code-based, or LLM-as-judge).
LLM-as-judge is roughly 10x cheaper than human rating ($3–$120 vs. hundreds of dollars for 1,000 ratings via Mechanical Turk), 8–10x faster, and ~80% consistent with human raters, which matches inter-human agreement.
Two hidden killers: criteria drift, where Ragas/PromptFlow default criteria optimize for generalizability rather than for your use case (e.g. e-commerce relevance) and so miss real user complaints; and evaluator instability, where results shift when the underlying judge model version changes. Berkeley's EvalGen paper covers this.
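The three components and the human-agreement check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's or HoneyHive's implementation: every name here (`Example`, `exact_match`, `run_eval`, `agreement_rate`) is hypothetical, and a simple code-based evaluator stands in for an LLM judge, which would be another function with the same signature.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names below are hypothetical and exist only for this illustration.

@dataclass(frozen=True)
class Example:
    query: str   # input sent to the agent
    ideal: str   # ideal output, assumed to be written by a domain expert

# An agent maps a query to an output; an evaluator scores one output.
Agent = Callable[[str], str]
Evaluator = Callable[[Example, str], bool]

def exact_match(example: Example, output: str) -> bool:
    """Code-based evaluator: case-insensitive match against the ideal."""
    return output.strip().lower() == example.ideal.strip().lower()

def run_eval(agent: Agent, dataset: List[Example],
             evaluators: Dict[str, Evaluator]) -> Dict[str, float]:
    """Run the agent on every example and return each evaluator's pass rate."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    passes = {name: 0 for name in evaluators}
    for example in dataset:
        output = agent(example.query)
        for name, evaluate in evaluators.items():
            passes[name] += int(evaluate(example, output))
    return {name: count / len(dataset) for name, count in passes.items()}

def agreement_rate(judge_verdicts: List[bool],
                   human_verdicts: List[bool]) -> float:
    """Fraction of items on which the judge and the human label agree."""
    if not human_verdicts or len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be non-empty and equal in length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(human_verdicts)

# Toy usage: a two-example dataset and an agent that always gives one answer.
dataset = [
    Example("What is the return window for shoes?", "30 days"),
    Example("What is the free shipping threshold?", "$50"),
]
scores = run_eval(lambda query: "30 days", dataset, {"exact_match": exact_match})
print(scores)  # {'exact_match': 0.5}

# Judge vs. human labels on five items: they agree on four of them.
print(agreement_rate([True, False, True, True, False],
                     [True, False, False, True, False]))  # 0.8
```

One way to use `agreement_rate` against evaluator instability is to keep a small human-labeled set and recompute the rate whenever the judge model version changes; a drop would suggest the judge no longer reflects what your raters consider good.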

evals · llm-as-judge · criteria-drift

Original description

After working with hundreds of AI teams, I discovered a concerning pattern: most real-world evals are practically meaningless. Drawing from my experience at HoneyHive, I'll reveal why popular evaluation methods are failing us, why traditional testing methods don't work for agents, and most importantly, how to fix it. Finally, I'll share practical strategies I've seen work across startups and Fortune 100 companies to build evaluation systems that actually map to real-world performance.