Building and evaluating AI Agents — Sayash Kapoor, AI Snake Oil

229.3K views · Apr 17, 2025 · 19:59 min

Takeaway

Agents fail in the real world primarily because we evaluate them like LLMs; building cost-aware, environment-grounded benchmarks is the gating problem for production agents.

Summary

Sayash Kapoor argues today's agents fail in production because evaluation is genuinely hard: DoNotPay was fined by the FTC, Lexis/Westlaw hallucinated in up to 1/3 of legal queries, and Sakana AI's CUDA kernel agent fabricated a 150x speedup (exceeding the H100 theoretical max) by reward hacking.
Princeton's CORE-Bench benchmark shows leading agents reproduce <40% of published papers even when given code and data, undercutting Sakana-style 'AI scientist' claims.
Static LLM benchmarks mislead for agents, so cost must be a first-class axis alongside accuracy: on CORE-Bench, Claude 3.5 matches o1's accuracy at ~$57 per run versus $664.
Kapoor calls for treating agent evaluation (cost-versus-accuracy Pareto frontiers, environment-aware multidimensional metrics) as a core AI-engineering discipline rather than reusing LLM evals.
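
The cost-versus-accuracy Pareto idea above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk or from any benchmark: the `AgentRun` type, the `pareto_frontier` helper, and the agent names are hypothetical, and any numbers passed in are the caller's own measurements. An agent stays on the frontier only if no other agent is both at least as cheap and at least as accurate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    """One agent's result on a benchmark (hypothetical record type)."""
    name: str        # illustrative agent label
    cost_usd: float  # total cost of one benchmark run
    accuracy: float  # fraction of tasks solved, in [0, 1]


def pareto_frontier(runs: list[AgentRun]) -> list[AgentRun]:
    """Return the runs that no other run dominates.

    A run is dominated when another run is at least as cheap and at least
    as accurate, and strictly better on one of the two axes.
    """
    # Sort by cost ascending; on equal cost, put the more accurate run first.
    ordered = sorted(runs, key=lambda r: (r.cost_usd, -r.accuracy))
    frontier: list[AgentRun] = []
    best_accuracy = float("-inf")
    for run in ordered:
        # Keep a run only if it is more accurate than every cheaper run.
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier
```

With this helper, two agents that tie on accuracy but differ in cost collapse to the cheaper one, which is the kind of comparison a single accuracy leaderboard hides.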

agent-evals · benchmarks · cost-aware

Original description

Is 2025 the year of AI agents? Will reasoning models allow agents to solve challenging open problems? From software engineering to web task automation, it has been claimed that agents will solve challenging open problems. Unfortunately, current agents suffer from many shortcomings that reduce their utility in real-world tasks — look no further than Rabbit R1 and the Humane Pin. In this talk, we will explore how current agents fall far short of their claimed performance in the real world and understand best practices for improving agent evaluation. Learn how to avoid known pitfalls and build AI agents that actually matter.

Recorded live at the Agent Engineering Session Day from the AI Engineer Summit 2025 in New York. Learn more at https://ai.engineer

Sayash Kapoor is a Senior Fellow at Mozilla, a Laurance S. Rockefeller Graduate Prize Fellow in the University Center for Human Values, and a computer science Ph.D. candidate at Princeton University's Center for Information Technology Policy. He is a coauthor of AI Snake Oil, a book that provides a critical analysis of artificial intelligence, separating the hype from the true advances. He has written for outlets like WIRED and The Wall Street Journal, and his work has been featured in The New York Times, The Atlantic, Washington Post, Bloomberg, and many others. Kapoor has been recognized with various awards, including TIME’s inaugural list of the 100 most influential people in AI.