Can you prove AI ROI in Software Eng? (Stanford 120k Devs Study) – Yegor Denisov-Blanch, Stanford
Takeaway
AI productivity gains in software engineering are real but modest at the median (~10%) and bimodal: top adopters pull away, while heavy token spenders without sound practices regress.
Summary
- Stanford studied 120k devs across 46 AI-using teams matched against 46 non-AI teams using a panel-of-experts ML model trained on millions of code-commit evaluations across implementation time, maintainability, and complexity.
- Median net productivity gain from AI is ~10% as of July 2025, but variance is widening: top performers compound their gains while bottom performers stall, suggesting a 'rich get richer' divergence.
- Token spend correlates only loosely with productivity (r≈0.20) and shows a 'death valley' around 10M tokens/engineer/month where heavy users underperform moderate users.
- Without measuring cohort placement, leaders can't course-correct on AI investment.
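The three statistics in the summary (median gain across matched teams, the correlation between token spend and productivity, and the dip among heavy token users) can be sketched in a few lines of Python. This is only an illustration, not the study's method: the function names, the band edges, and all numbers in the usage example are hypothetical, and the "output" values stand in for whatever productivity score a team already measures.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean, median


def median_gain(ai_output, control_output):
    """Median relative gain of AI teams over their matched control teams.

    Both lists are assumed to be aligned pair by pair (hypothetical data layout).
    """
    gains = [(a - c) / c for a, c in zip(ai_output, control_output)]
    return median(gains)


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def median_output_by_band(tokens, outputs, edges):
    """Median output per token-spend band.

    `edges` are ascending band boundaries (an assumption for this sketch);
    band 0 is below the first edge, band 1 between the first and second, etc.
    A dip in a middle or upper band is what a 'valley' would look like.
    """
    bands = defaultdict(list)
    for t, o in zip(tokens, outputs):
        bands[bisect_right(edges, t)].append(o)
    return {band: median(vals) for band, vals in sorted(bands.items())}


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    tokens = [1e6, 2e6, 9e6, 11e6, 30e6, 40e6]   # tokens/engineer/month
    outputs = [1.0, 1.1, 0.8, 0.9, 1.3, 1.2]     # relative productivity score
    print(median_gain([110, 120, 100], [100, 100, 100]))
    print(pearson_r(tokens, outputs))
    print(median_output_by_band(tokens, outputs, edges=[5e6, 20e6]))
```

Banding by token spend matters because a single correlation coefficient cannot reveal a non-monotonic relationship: a weak overall r can coexist with a pronounced dip in one band.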
productivity evals ai-business
Original description
You’re investing millions in AI for software engineering. Can you prove it’s paying off? Benchmarks show models can write code, but in enterprise deployments ROI is hard to measure, easy to bias, and often distorted by activity metrics (PR counts, DORA) that say “more” without proving “better.”

Drawing on field data from 120k+ developers across 600+ companies, I’ll show exactly where AI helps the most and how to measure the ROI of your software engineering AI deployment. We’ll unpack why identical tools deliver ~0% lift in some orgs and 25%+ in others. You’ll leave with a step-by-step ROI playbook: what to track, the traps to avoid, and the habits top-quartile teams use to make the most from AI.

Speaker: Yegor Denisov-Blanch | Researcher, Stanford
https://x.com/yegordb
https://www.linkedin.com/in/ydenisov/

Timestamps
00:00 Introduction & Methodology: ML Panels of Experts
00:21 The Research Approach: Time Series & Cross-Sectional Data
01:38 Four Key Topics Overview
02:01 Case Study: 10% Productivity Gain & The Widening Gap
03:16 Factors Driving Performance: Usage vs. Quality
04:02 The Environment Cleanliness Index
05:30 Managing Codebase Entropy & AI Trust
06:17 AI Engineering Practices Benchmark & Fingerprinting
07:38 Case Study: Unequal Adoption Across Business Units
08:31 Challenges in Measuring AI ROI via Business Outcomes
10:28 Proposed Measurement Framework: Usage & Outcomes
11:59 Metric Framework: Primary Output vs. Guardrails
12:54 Case Study: AI Adoption's Negative Impact on Quality
14:04 Rework, Refactoring, and Effective Output Analysis
15:43 Conclusion & Call for Research Participation