Agents reported thousands of bugs, how many were real? - Ian Butler and Nick Gregory

666 views · Jun 03, 2025 · 18:38 min

Takeaway

Software-maintenance agents are still poor at bug detection: high false-positive rates, narrow reasoning, and a lack of holistic code reading limit their real-world reliability today.

Summary

Bismuth's SM-100 benchmark covers 100 hand-curated, validated bugs across 84 public repos in Python, TypeScript, JavaScript, and Go.
The benchmark focuses on objective bugs (security issues, logic errors causing data loss or crashes) rather than style or design opinions, which reduces ambiguity and enables reproducible evaluation.
Each bug is annotated with severity, required domain knowledge, find difficulty, and impact (data loss, crash, or exploit) to characterize agent capability levels.
Four metrics per system: needle-in-haystack discovery, false positive rate, find-at-introduction (given the offending commit), and remediation correctness.
Findings: most agents miss bugs humans catch immediately and surface bugs humans would discard; reasoning models help but are still narrow; agents don't holistically read files.
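
To make the annotation fields and the first two metrics concrete, here is a minimal Python sketch. It is not SM-100's actual schema or scoring code: the class and field names, the string-valued severity levels, and the exact definitions of the two rates are all assumptions made for illustration. In particular, the false positive rate is assumed here to be the share of an agent's reports that do not correspond to a valid bug.

```python
from dataclasses import dataclass
from enum import Enum


class Impact(Enum):
    # Impact categories named in the summary above.
    DATA_LOSS = "data loss"
    CRASH = "crash"
    EXPLOIT = "exploit"


@dataclass(frozen=True)
class BugAnnotation:
    # Hypothetical record layout; the real benchmark's schema may differ.
    bug_id: str
    severity: str
    domain_knowledge: str
    find_difficulty: str
    impact: Impact


def discovery_rate(known_bug_ids: set[str], matched_bug_ids: set[str]) -> float:
    """Fraction of known benchmark bugs that the agent's reports matched."""
    if not known_bug_ids:
        return 0.0
    return len(known_bug_ids & matched_bug_ids) / len(known_bug_ids)


def false_positive_rate(total_reports: int, valid_reports: int) -> float:
    """Fraction of reports that are not valid bugs (assumed definition)."""
    if total_reports == 0:
        return 0.0
    return (total_reports - valid_reports) / total_reports


if __name__ == "__main__":
    # Illustrative values only, not results from the talk.
    example = BugAnnotation("bug-001", "high", "low", "medium", Impact.CRASH)
    known = {"bug-001", "bug-002", "bug-003", "bug-004"}
    print(example.impact.value)
    print(discovery_rate(known, {"bug-001"}))
    print(false_positive_rate(total_reports=10, valid_reports=1))
```

Keeping the two rates separate reflects the point in the findings: an agent can match some known bugs while still burying them under reports a human would discard.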

bug-detection · benchmark · bismuth

Original description

Ever had an AI-generated tweak unexpectedly break your entire project? Agentic software development has impressive promise, but the reality still falls short. In this talk we introduce SM-100, a groundbreaking benchmark designed specifically to evaluate autonomous agents on software maintenance tasks.

We're also excited to announce Bismuth, a generalist software agent with strong performance on such maintenance tasks.

https://bismuth.sh & https://sm100bench.com