
Coding Evals: From Code Snippets to Codebases – Naman Jain, Cursor

4.1K views · Dec 15, 2025 · 18:08 min
Takeaway

Coding evals are moving from clean snippets to dynamic, real-world, multi-hour codebase tasks with auto-generated tests to combat contamination and brittleness.

Summary

  • Cursor researcher traces coding evals from single-line pandas completions to whole-codebase tasks taking hours.
  • LiveCodeBench pioneered dynamic evals: problem sets are updated periodically to combat contamination, and the accuracy of models such as DeepSeek visibly drops on problems released after their training-cutoff dates.
  • LLM-driven test generators automatically produce 30–50 fuzzed inputs per problem to catch brittle solutions (e.g., one that returns a set's elements without sorting them).
  • A new software-optimization benchmark mines performance-related commits from real repositories such as llama.cpp and asks models to optimize the same code (e.g., quantization routines) to match the human-written patches.
  • Construct validity is emphasized: tasks must be drawn from a real-world distribution and graded reliably for benchmark results to translate into real performance gains.
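
The contamination check behind dynamic evals comes down to comparing accuracy on problems released before and after a model's training cutoff. The sketch below is a minimal illustration of that split, not LiveCodeBench's code: the `Result` record, the `accuracy_by_cutoff` helper, the cutoff date, and the toy data are all invented for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    """One model attempt on one problem (hypothetical record layout)."""
    problem_id: str
    released: date   # date the problem was first published
    passed: bool     # whether the model's solution passed all tests


def accuracy_by_cutoff(results, cutoff):
    """Return (accuracy before-or-on cutoff, accuracy after cutoff).

    A side with no problems yields None instead of a misleading 0.0.
    """
    before = [r.passed for r in results if r.released <= cutoff]
    after = [r.passed for r in results if r.released > cutoff]

    def rate(flags):
        return sum(flags) / len(flags) if flags else None

    return rate(before), rate(after)


# Invented toy data: the model does well on older problems only.
results = [
    Result("p1", date(2023, 5, 1), True),
    Result("p2", date(2023, 7, 1), True),
    Result("p3", date(2023, 8, 15), True),
    Result("p4", date(2023, 10, 1), False),
    Result("p5", date(2023, 11, 1), True),
    Result("p6", date(2023, 12, 1), False),
]

before, after = accuracy_by_cutoff(results, cutoff=date(2023, 9, 1))
print(f"before cutoff: {before:.2f}, after cutoff: {after:.2f}")
```

A large gap between the two numbers suggests the older problems may have leaked into training data, which is why a periodically refreshed problem set gives a cleaner signal.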
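
The brittle-solution example from the test-generation bullet can be sketched in a few lines. This is an illustration under stated assumptions, not the talk's generator: a seeded random generator stands in for the LLM-written one, and the toy problem ("return the distinct values in ascending order"), the function names, and the input count of 40 are all hypothetical.

```python
import random


def reference_solution(xs):
    """Trusted solution: distinct values in ascending order."""
    return sorted(set(xs))


def brittle_solution(xs):
    """Forgets to sort and relies on set iteration order instead."""
    return list(set(xs))


def generate_inputs(n_inputs=40, seed=0):
    """Hypothetical stand-in for an LLM-written fuzzing input generator."""
    rng = random.Random(seed)
    inputs = []
    for _ in range(n_inputs):
        length = rng.randint(0, 20)
        inputs.append([rng.randint(-1000, 1000) for _ in range(length)])
    return inputs


def failing_inputs(candidate, reference, inputs):
    """Return every input on which the candidate disagrees with the reference."""
    return [xs for xs in inputs if candidate(list(xs)) != reference(list(xs))]


# Two hand-written tests with tiny non-negative ints. On CPython these happen
# to iterate in sorted order, so the brittle solution slips through.
handwritten = [[1, 2, 3], [2, 2, 1]]
missed = failing_inputs(brittle_solution, reference_solution, handwritten)

# Fuzzed inputs with wider value ranges expose the missing sort.
fuzzed = generate_inputs()
caught = failing_inputs(brittle_solution, reference_solution, fuzzed)

print(f"hand-written failures: {len(missed)}, fuzzed failures: {len(caught)}")
```

The design point is that a handful of small hand-picked cases can pass by accident, while a few dozen generated inputs with varied sizes and values make such accidents very unlikely.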
evals · code-generation · benchmarks
Original description
AI coding capabilities have leapt from generating one-line snippets to completing entire codebases with agentic workflows. I’ll trace that arc, focusing on the lessons and challenges at each stage. I will start with early testable coding benchmarks, distilling lessons about contamination and distributional overfitting. Next, moving beyond isolated programming problems, I will talk about repository-grounded coding problems, from SWE-bench-style bug fixing to R2E’s automated function-completion setting. We’ll then move beyond isolated functions to longer-horizon tasks such as runtime optimization (GSO), translation (Syzygy), and refactoring, highlighting challenges like test hacking, code quality, and idiomaticity. Finally, beyond code generation, I will talk about human-preference evaluation in chat (LMArena RepoChat) and developer-preference signals in the IDE via Copilot Arena.

Speaker:  Naman Jain  |  Engineering, Cursor
https://www.linkedin.com/in/naman1205jain/
https://x.com/StringChaos