The Benchmarks Game: Why It's Rigged and How You Can (Really) Win - Darius Emrani

774 views · Jun 03, 2025 · 11:20 min

Takeaway

Public benchmarks are rigged by Goodhart's Law and selective comparison — build small domain-specific evals from real production queries instead.

Summary

Three benchmark-gaming tricks: apples-vs-oranges (xAI Grok 3 best-config vs others' standard config), privileged test access (OpenAI funded FrontierMath then announced 25% score), style-over-substance (Meta entered 27 Llama 4 Maverick variants in LMArena tuned for charm).
39% of SAT essay-score variance is just length — LMArena's style-controlled rankings flip the leaderboard (Claude 3.5 Sonnet jumps to tied-first).
Quotes Karpathy ('evaluation crisis'), SWE-bench creator John Yang ('we kind of just made these up'), CMU's Maarten Sap ('yardsticks fundamentally broken').
Recommendation: skip public benchmarks for your domain — gather 5 real production queries, pick quality/cost/latency metrics, test top 5 models on YOUR data, iterate continuously.
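
The recommendation above can be sketched as a small harness. This is a minimal illustration, not the speaker's implementation: the `Case` and `Result` types, the keyword-based `score_quality` check, the stub models, and the per-call prices are all hypothetical placeholders. In practice the stubs would be replaced by real model clients and the quality check by whatever grading fits the domain.

```python
import time
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Case:
    query: str               # a real production query
    must_contain: list[str]  # hypothetical quality check: required phrases


@dataclass
class Result:
    model: str
    quality: float    # mean fraction of required phrases found
    cost_usd: float   # total assumed cost over all cases
    latency_s: float  # mean wall-clock seconds per call


def score_quality(answer: str, case: Case) -> float:
    """Fraction of required phrases present in the answer (case-insensitive)."""
    if not case.must_contain:
        return 1.0
    text = answer.lower()
    hits = sum(1 for phrase in case.must_contain if phrase.lower() in text)
    return hits / len(case.must_contain)


def evaluate(
    models: dict[str, Callable[[str], str]],
    cases: list[Case],
    cost_per_call: dict[str, float],
) -> list[Result]:
    """Run every model on every case; rank by quality, then cost, then latency."""
    results = []
    for name, ask in models.items():
        qualities, latencies = [], []
        for case in cases:
            start = time.perf_counter()
            answer = ask(case.query)
            latencies.append(time.perf_counter() - start)
            qualities.append(score_quality(answer, case))
        results.append(
            Result(name, mean(qualities), cost_per_call[name] * len(cases), mean(latencies))
        )
    return sorted(results, key=lambda r: (-r.quality, r.cost_usd, r.latency_s))


# Stub "models" standing in for real API clients (assumption for illustration).
def stub_a(query: str) -> str:
    return "Refunds take 5 business days; you can edit your shipping address in settings."


def stub_b(query: str) -> str:
    return "Please contact support about your refund."


CASES = [
    Case("How long do refunds take?", ["refund", "5 business days"]),
    Case("Can I change my shipping address?", ["shipping address"]),
]
PRICES = {"stub_a": 0.002, "stub_b": 0.001}  # made-up per-call prices in USD

ranking = evaluate({"stub_a": stub_a, "stub_b": stub_b}, CASES, PRICES)
for r in ranking:
    print(f"{r.model}: quality={r.quality:.2f} cost=${r.cost_usd:.4f} latency={r.latency_s:.6f}s")
```

Re-running the same harness whenever queries, prompts, or candidate models change is one way to approximate the "iterate continuously" step.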

benchmarks, evals, goodharts-law

Original description

AI benchmarks control billions in investment and shape entire markets - but the game is rigged. In this talk, I'll expose the three "cheat codes" companies use to game benchmarks:

* Cherry-picking comparisons (xAI's selective Grok-3 graphs)
* Buying privileged access (OpenAI's FrontierMath funding)
* Optimizing for style over substance (Meta's 27 Llama-4 variants on LM Arena)

When Andrej Karpathy says "I don't really know what metrics to look at right now," we have a crisis. I'll show you why Goodhart's Law guarantees benchmarks fail when billions are at stake, and more importantly, what to do about it.

You'll learn:

* How to spot benchmark manipulation (with real examples)
* Why 39% of score variance is just writing style
* A 5-step framework to build evaluations that actually matter for YOUR use case
* How pre-deployment evaluation loops separate reliable AI from constant firefighting

Drawing from my experience building evaluation systems at Waymo, Uber ATG, and SpaceX (where bad evals literally crash), I'll show you how to stop playing the rigged benchmarks game and start measuring what actually matters.