How METR measures Long Tasks and Experienced Open Source Dev Productivity - Joel Becker, METR

10.6K views · Jan 19, 2026 · 75:51 min

Takeaway

METR's measurement shows AI tools can slow experienced devs even when they feel faster, and capability growth is tightly coupled to compute scaling.

Summary

METR's measured time horizon doubles roughly every 7 months, a straight line on a log scale; the talk argues this growth is tied causally and proportionally to compute growth under standard economic assumptions.
Slowing compute (power/dollar constraints kicking in post-2030) would proportionally slow capability growth; software-only singularity is the main upside risk.
RCT of experienced open-source devs found AI tools made them slower while devs perceived they were faster — perceived productivity doesn't match measured productivity.
Meta's internal data shows a J-curve: ~3–6 month productivity dip when devs first adopt agents before recovery.
Surveys can't reliably ask 'how long did task X take' because people's self-reported time estimates are systematically wrong; perception still matters for hype and adoption.
Future work explores reliability at shorter time horizons rather than ever-longer ones.
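
The doubling-time claim in the summary can be turned into simple arithmetic. The sketch below is illustrative only and is not METR's code: the function names, the 60-minute starting horizon, and the 0.5 compute-growth factor are assumptions chosen for the example. The only inputs taken from the summary are the ~7-month doubling time and the idea that capability growth slows in proportion to compute growth.

```python
import math


def horizon_after(months: float, start_minutes: float,
                  doubling_months: float = 7.0) -> float:
    """Time horizon after `months`, assuming a constant doubling time."""
    return start_minutes * 2 ** (months / doubling_months)


def months_to_reach(target_minutes: float, start_minutes: float,
                    doubling_months: float = 7.0) -> float:
    """Months needed to grow from `start_minutes` to `target_minutes`."""
    return doubling_months * math.log2(target_minutes / start_minutes)


def slowed_doubling_months(doubling_months: float,
                           compute_growth_factor: float) -> float:
    """Doubling time if compute growth is scaled by `compute_growth_factor`.

    Assumes horizon growth is strictly proportional to compute growth,
    so halving compute growth (factor 0.5) doubles the doubling time.
    """
    return doubling_months / compute_growth_factor


if __name__ == "__main__":
    start = 60.0  # hypothetical starting horizon in minutes, not a METR figure
    print(horizon_after(14, start))           # horizon after two doublings
    print(months_to_reach(480, start))        # months to reach an 8-hour horizon
    print(slowed_doubling_months(7.0, 0.5))   # doubling time if compute growth halves
```

Under these assumptions, three doublings (60 minutes to 8 hours) take 21 months at the current rate, and the same growth would take twice as long if compute growth were cut in half.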

metr · developer-productivity · evals

Original description

AI models are crushing benchmarks. SWE-bench scores are climbing, and METR's measured time horizons are rising rapidly. Yet when we deployed these same models in a field study with experienced developers, they didn't speed up work. What's going on? Are benchmarks misleading us about AI capabilities? Are we missing something about how AI performs in the real world? In this talk, we'll reconcile lab and field evidence on AI capabilities. Drawing from METR's time horizon measurements and developer productivity RCT, we'll explore why impressive benchmark performance doesn't always translate to real-world impact. We'll examine potential explanations—from reliability requirements to task distribution to capability elicitation—and discuss what this means for automated AI R&D.

https://x.com/joel_bkr

Timestamps

00:00 The Compute-Time Horizon Argument

01:43 Potential Constraints on AI Scaling (Power & Dollars)

04:23 The Problem of Eclipsing Evaluation Time

06:52 Meta's "J-Curve" of Developer Productivity

09:12 Unreliability of Self-Reported Time Estimates

11:43 Personal Experiences with AI Tools (Cursor) & Learning Curves

14:10 METR Study Deep Dive: Scatter Plots & Variance

16:48 The Controversy of "Conservative" Usage Estimates

21:41 Unpublished Hackathon Results (AI Allowed vs. Disallowed)

25:28 Why AI Struggles with Data Science & Messy Enterprise Data

30:35 Example of AI Failure on Complex Deployment Metrics

38:29 Quantifying Speed-Up: The Methodological Challenges

46:30 Future Metrics: "Watched" vs. "Unwatched" Time Horizons

52:52 Moving Beyond Benchmarks: "In the Wild" Transcripts

56:12 The "Agent Village" & Fuzzy Goal Measurement

58:53 The "Neurodivergent AI" Hypothesis & Interface Mismatch

01:06:31 Software-Only Singularity vs. Hardware Constraints

01:13:53 AI Applications in Chip Fabrication & Yield Improvement