
AI Agents, Meet Test Driven Development

13.4K views · Feb 22, 2025 · 29:10 min
Takeaway

Treat AI products like TDD systems with continuous eval datasets and LLM-as-judge metrics, not one-shot prompt engineering.

Summary

  • Anita (Vellum) argues teams shipping reliable agentic workflows follow a TDD-style loop: experiment, evaluate, deploy, then continuously monitor and improve.
  • Notes that benchmark gains from scaling alone have slowed, but reasoning models such as DeepSeek-R1 (described as trained via pure RL without labeled data, and reportedly similar to OpenAI's o1/o3) reopened progress; the Humanity's Last Exam benchmark shows that even frontier models still struggle.
  • Experimentation stage: try few-shot, chain-of-thought, prompt chaining, ReAct; involve domain experts (not just engineers) and stay model-agnostic (e.g., Gemini 2.0 Flash for OCR).
  • Evaluation stage: build datasets of hundreds of examples; balance quality, cost, latency, and privacy tradeoffs; use ground-truth data when possible and LLM-as-judge otherwise; and pick an evaluation framework that is customizable in Python or TypeScript and supports per-node guardrails.
  • Production stage: never stop iterating; capture production responses and feed them back into the eval datasets.
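
The evaluate-then-feed-back loop summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the framework shown in the talk: `EvalCase`, `exact_match`, `stub_judge`, `run_evals`, `capture_production`, and `toy_workflow` are all hypothetical names, and the judge is a stand-in where a real LLM-as-judge call would go.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EvalCase:
    """One dataset row (hypothetical shape): a prompt plus optional ground truth."""
    prompt: str
    expected: Optional[str] = None


def exact_match(output: str, expected: str) -> float:
    """Ground-truth metric: 1.0 when the normalized strings match, else 0.0."""
    return float(output.strip().lower() == expected.strip().lower())


def stub_judge(prompt: str, output: str) -> float:
    """Stand-in for an LLM-as-judge call (assumption: a real judge would
    prompt a model with a rubric). Here it only rewards non-empty output."""
    return 1.0 if output.strip() else 0.0


def run_evals(
    workflow: Callable[[str], str],
    dataset: List[EvalCase],
    judge: Callable[[str, str], float] = stub_judge,
) -> float:
    """Score every case: ground truth when available, the judge otherwise."""
    scores = []
    for case in dataset:
        output = workflow(case.prompt)
        if case.expected is not None:
            scores.append(exact_match(output, case.expected))
        else:
            scores.append(judge(case.prompt, output))
    return sum(scores) / len(scores) if scores else 0.0


def capture_production(
    dataset: List[EvalCase], prompt: str, reviewed_answer: Optional[str] = None
) -> None:
    """Feed a production request back into the eval dataset."""
    dataset.append(EvalCase(prompt, reviewed_answer))


def toy_workflow(prompt: str) -> str:
    """Toy stand-in for the agentic workflow under test."""
    answers = {"capital of france?": "Paris"}
    return answers.get(prompt.lower(), "")
```

Calling `run_evals(toy_workflow, dataset)` before and after `capture_production(...)` shows how the aggregate score moves as production cases are added, which is the signal a team would watch when iterating on prompts or models.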
Tags: evals, agents, tdd
Original description
Deploying agentic workflows in production is tough—bugs, hallucinations, and unexpected behavior can quickly turn a promising system into a support nightmare. But there’s a pattern we’ve seen across hundreds of companies: teams that embrace test-driven development (TDD) build stronger, more reliable AI systems.

In this talk, Anita from Vellum will break down how TDD can be applied to AI agents, sharing real-world strategies for testing and improving reliability. She’ll also explore different types of agentic behavior, what’s possible to build today, and where the innovation is heading. To bring it all together, Anita will demo her own SEO agent—an agentic workflow that automates a big chunk of her content-writing process.

If you're building AI-powered workflows and want them to actually work, this session is for you!

Related links:

DeepSeek-R1 training process: https://www.vellum.ai/blog/the-training-of-deepseek-r1-and-ways-to-use-it
Agentic Workflows: Emerging architectures: https://www.vellum.ai/blog/agentic-workflows-emerging-architectures-and-design-patterns
Four pillars of building AI systems in production: https://www.vellum.ai/blog/the-four-pillars-of-building-a-production-grade-ai-application
Everything you need to know on Chain of Thought prompting: https://www.vellum.ai/blog/chain-of-thought-prompting-cot-everything-you-need-to-know
Reasoning models are indecisive parrots: https://www.vellum.ai/reasoning-models