Production software keeps breaking and it will only get worse — Anish Agarwal, Traversal.ai

3.9K views · Jul 10, 2025 · 18:12 min · Watch on YouTube ↗

Takeaway

Autonomous incident debugging requires fusing causal ML, semantics, and custom agent control flow; neither AIOps, plain LLMs, nor ReAct agents can do it alone.

Summary

Traversal's CEO argues that as AI writes more code, humans lose context, systems grow more complex, and on-call and troubleshooting work will dominate engineering time.
He critiques the status quo of 'dashboard dumpster diving' across Datadog, Splunk, and Grafana, and argues that each existing approach (classical AIOps, LLM-on-logs, ReAct agents with runbooks) fails in production: too noisy, too small a context window, or too slow.
Production data is too large to fit any context window or even cluster memory; runbooks are deprecated the day they're written.
Traversal combines causal ML (cause vs. correlation), semantics, and a novel agentic control flow to perform autonomous, out-of-sample root-cause analysis within the 2-5 minute window that incidents demand.

aiops · agents · root-cause-analysis

Original description

Software is eating the world. AI is eating software. AI-powered SWE means a whole lot more software is going to be written that powers mission critical systems in the coming years, with hardly any of it written by humans. Hence, when these software systems inevitably break, it’s going to be next to impossible to troubleshoot them. Towards addressing this issue, we’ll do a product launch of Traversal’s AI, a significant step towards self-healing software systems. We will showcase how it is already used to autonomously troubleshoot production incidents in some of the most complex enterprise environments.

About Anish Agarwal
Anish Agarwal is the CEO and Co-founder of Traversal, where he and his team are revolutionizing observability and troubleshooting with AI Agents. A Professor of Computer Science and Operations Research at Columbia University, Anish earned his PhD in Computer Science from MIT, specializing in causal machine learning—teaching AI to understand cause and effect from data. Despite achieving his goal of becoming a professor, Anish pivoted from academia, recognizing a once-in-a-lifetime opportunity to apply his AI research to tackle the industry’s toughest challenges, with autonomous troubleshooting at the forefront. His career also includes roles as a management consultant at BCG and research scientist at Amazon and Microsoft Research.

About Matthew Schoenbauer
Matt Schoenbauer is a founding engineer at Traversal, where he and his team are redefining observability and troubleshooting with AI agents. Previously, he was a systematic trader at Citadel Securities, operating at the core of the world’s largest equities market-making platform, where live troubleshooting in the Linux terminal was a critical part of his work. Before that, he worked in quantitative research at Proof Trading. Matt has published research across cryptography, number theory, and algebraic topology, and holds a master’s degree from Columbia University, where he focused on machine learning systems and causal machine learning.

Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter

Timestamps:

00:00 Introduction: The Three Pillars of Software Engineering
02:10 The Worsening Problem of Troubleshooting
04:15 Why Current AI/ML Solutions are Failing
07:08 Traversal.ai's Novel Approach to Autonomous Troubleshooting
11:35 Case Study: How Traversal.ai helped DigitalOcean
16:03 The Broader Vision for Traversal.ai