Voice Agents: the good, the bad, and the ugly

4.7K views · Feb 22, 2025 · 18:47 min

Takeaway

Production voice agents need decomposed prompts, tool-based state transitions, and a separate text-LLM supervisor to keep conversations on rails.

Summary

Eddie Siegel (Fractional AI CTO) builds an AI consulting-interview agent on the OpenAI Realtime API. It replaces human consultants who conduct qualitative research interviews with Fortune 500 employees, and it scales to hundreds of parallel interviews with automatic transcription.
The team started with a monolithic prompt, then refactored to feed the model one question at a time, plus a tool the LLM calls to advance to the next question. The refactor was needed to support a clickable roadmap UI and skipping around between questions.
Added a 'drift detector' background agent: a separate text LLM continuously reads the transcript and forces the move-on tool when the conversation goes down a rabbit hole, balancing improvisation against staying on topic.
Other practical lessons: streaming audio makes everything harder (transcription, latency, evaluation), and fluid conversation has no objective metric, so the team built custom conversation-quality evals.
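
The question-at-a-time design and the drift detector described above can be sketched roughly as follows. This is a minimal illustration, not the talk's actual code: `Interview`, `next_question`, `jump_to`, `check_drift`, and `keyword_judge` are all hypothetical names, the keyword-overlap judge is a stand-in for the separate text LLM, and the Realtime API connection is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Interview:
    """Hypothetical interview state: the voice model only ever sees one question."""
    questions: List[str]
    index: int = 0
    transcript: List[str] = field(default_factory=list)

    def current_prompt(self) -> str:
        # Prompt fragment for the voice model, scoped to the active question.
        return f"Ask about and explore only this question: {self.questions[self.index]}"

    def next_question(self) -> str:
        # The "move on" tool: callable by the voice model or forced by a supervisor.
        if self.index < len(self.questions) - 1:
            self.index += 1
        return self.current_prompt()

    def jump_to(self, i: int) -> str:
        # Supports a clickable roadmap: the user can skip to any question.
        if not 0 <= i < len(self.questions):
            raise IndexError(f"no question at position {i}")
        self.index = i
        return self.current_prompt()


def _words(text: str) -> set:
    # Lowercased words with surrounding punctuation stripped.
    return {w.strip("?.,!").lower() for w in text.split()}


def keyword_judge(question: str, recent_turns: List[str]) -> bool:
    """Stand-in for the text-LLM judge (an assumption, not the real method).

    Returns True (off topic) when no recent turn shares a word of five or
    more letters with the question.
    """
    keywords = {w for w in _words(question) if len(w) >= 5}
    return not any(keywords & _words(turn) for turn in recent_turns)


def check_drift(
    interview: Interview,
    judge: Callable[[str, List[str]], bool],
    window: int = 3,
) -> bool:
    """Supervisor step: force the move-on tool if the last `window` turns drifted."""
    recent = interview.transcript[-window:]
    if len(recent) < window:
        return False  # not enough conversation yet to judge
    if judge(interview.questions[interview.index], recent):
        interview.next_question()
        return True
    return False
```

In a real system, `check_drift` would presumably run in the background on each new transcript turn, with `judge` wrapping a call to a text model. Keeping the judge behind a plain callable makes the improvisation-versus-focus threshold something that can be tuned or swapped without touching the interview state.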

voice-agents · realtime-api · drift-detection

Original description

AI voice agents seem to be everywhere. But what does it actually take to move from proof of concept to production? This talk walks you through the process of building an AI voice agent that's now conducting hundreds of consulting-style research interviews.

What once required hours of consultant time and weeks of scheduling can now be done in a single day with the voice agent.

But getting there wasn’t exactly smooth sailing. We’ll skip the hype and dive into the real challenges: wrangling hallucinations, designing evaluation metrics that actually matter, determining the right human/AI handoff, and troubleshooting unexpected surprises (like the agent randomly switching from English to Korean).

You’ll leave with a no-nonsense playbook for getting voice agents into production.