Why Agent Hype can fall short of reality – Joel Becker, METR
Takeaway
Trust the exponential time-horizon trend for raw capability, but expect a gap between benchmark hype and real-world developer productivity gains.
Summary
- METR's 'time horizon' benchmark fits a curve of model success against human time-to-complete, then reports the task duration at which a model reaches 50% success: Claude 3 Opus ~4 min, o1-preview ~15 min, GPT-5.1 Codex Max far higher.
- The exponential 'doubling every 6-7 months' line has held remarkably straight across models.
- Companion RCT 'measuring how allowing AI affects developer productivity' shows that economic-style evidence diverges from the benchmark trend: real-world developer speedups are smaller than benchmark capability gains suggest.
- Benchmark limitations: human baselines are low-context experts, tasks aren't messy, and even time-horizon may eventually saturate.
- Talk reconciles the gap between rapid benchmark progress and slower observed economic productivity gains.
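
To make the time-horizon idea concrete, here is a minimal sketch of the two calculations the summary describes: fitting a success curve against human time-to-complete and reading off the 50% point, then projecting a horizon forward under a fixed doubling period. It assumes a logistic curve in log2 of human minutes, which this summary does not confirm as METR's exact method. The task data, function names, and the 6.5-month default are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy data: human time-to-complete (minutes) per task,
# and whether a model succeeded (1) or failed (0). Not real METR results.
HUMAN_MINUTES = [1, 2, 4, 8, 16, 32, 64, 128]
SUCCESSES = [1, 1, 1, 0, 1, 0, 0, 0]


def fit_time_horizon(human_minutes, successes, lr=0.1, steps=5000):
    """Fit P(success) = sigmoid(a + b * (log2(minutes) - center)).

    Uses plain gradient ascent on the log-likelihood and returns
    (horizon_minutes, a, b), where horizon_minutes is the task length
    at which the fitted success probability equals 50%.
    """
    xs = [math.log2(m) for m in human_minutes]
    center = sum(xs) / len(xs)           # centering keeps the fit well-conditioned
    xc = [x - center for x in xs]
    n = len(xc)
    a = b = 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xc, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    # The curve crosses 50% where a + b * (x - center) == 0.
    horizon_log2 = center - a / b
    return 2.0 ** horizon_log2, a, b


def extrapolate_horizon(horizon_minutes, months_ahead, doubling_months=6.5):
    """Project a horizon forward assuming it doubles every `doubling_months`."""
    return horizon_minutes * 2.0 ** (months_ahead / doubling_months)


horizon, a, b = fit_time_horizon(HUMAN_MINUTES, SUCCESSES)
print(f"50% time horizon: {horizon:.1f} minutes")
print(f"Projected 13 months out: {extrapolate_horizon(horizon, 13):.1f} minutes")
```

On this symmetric toy data the fitted 50% horizon lands at about 11.3 minutes. The projection is pure arithmetic on the assumed doubling period, and it describes benchmark capability only, not the real-world productivity the RCT measures.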
evals, benchmarks, metr
Original description
AI models are crushing benchmarks. SWE-bench scores are climbing, and METR's measured time horizons are rising rapidly. Yet when we deployed these same models in a field study with experienced developers, they didn't speed up work. What's going on? Are benchmarks misleading us about AI capabilities? Are we missing something about how AI performs in the real world?

In this talk, we'll reconcile lab and field evidence on AI capabilities. Drawing from METR's time horizon measurements and developer productivity RCT, we'll explore why impressive benchmark performance doesn't always translate to real-world impact. We'll examine potential explanations (from reliability requirements to task distribution to capability elicitation) and discuss what this means for automated AI R&D.

Speaker: Joel Becker | Researcher, METR
https://x.com/joel_bkr
https://www.linkedin.com/in/joel-becker/
https://github.com/joel-becker

**Timestamps:**
00:00 Introduction to METR & The Capability Gap
01:49 The Problem with Current Benchmarks (Saturation & Interpretation)
03:19 METR's New Methodology: Human Time Horizons
04:52 Empirical Results: Fitting Capability Curves
06:19 Time Horizon Trends: Claude 3 Opus vs. o1-preview
17:43 Randomized Controlled Trial (RCT) Discussion
18:18 Reconciling the Gap: Why High Benchmarks Don't Mean High Productivity
19:18 Explaining the Discrepancy: Context, Reliability, and Task Interdependence
20:22 Future Work & Hiring at METR