
Building Effective Voice Agents — Toki Sherbakov + Anoop Kotha, OpenAI

11.2K views · Jul 20, 2025 · 17:17 min
Takeaway

Pick speech-to-speech (Realtime API) when latency and expressivity matter; pick chained when accuracy and determinism dominate.

Summary

  • Speech-to-speech voice models have moved from slow and robotic six months ago to fast, expressive, and interruptible today, as demoed with the OpenAI Realtime API.
  • Two architectures: chained (STT -> LLM -> TTS), which is slow and lossy because the transcription step discards tone and emphasis, versus unified speech-to-speech via the Realtime API, which preserves semantics at low latency.
  • Five trade-off axes: latency, cost, accuracy/intelligence, UX, and integrations/tooling. Consumer apps optimize for latency and UX, while customer service prioritizes accuracy and SIP/Twilio integrations.
  • Voice agents need a different design from text agents: prompts, voice customization, tools, and evals/guardrails all differ. Small triage models can delegate hard tasks to o3 or GPT-4 mini.
  • The Realtime API may not fit deterministic call-center work, where the chained architecture still wins.
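
As a rough illustration of the chained architecture and the triage-then-delegate idea from the summary, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the stage functions are stubs rather than real API calls, the latency figures are invented rather than measured, and the keyword-based `needs_delegation` rule is a hypothetical stand-in for whatever routing a real triage model would do. It is not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    output: str
    latency_ms: int  # invented figure for illustration, not a measurement


@dataclass
class Turn:
    audio_out: str
    route: str
    total_latency_ms: int


def speech_to_text(audio: bytes) -> StageResult:
    # Stub: a real system would call a transcription model here.
    # Only the words survive this step; tone, pauses, and emphasis are
    # dropped, which is the "lossy" part of the chained design.
    return StageResult(audio.decode("utf-8"), 300)


# Hypothetical routing rule standing in for a triage model's judgment.
HARD_KEYWORDS = ("refund", "dispute", "cancel")


def needs_delegation(transcript: str) -> bool:
    lowered = transcript.lower()
    return any(keyword in lowered for keyword in HARD_KEYWORDS)


def small_model_reply(transcript: str) -> StageResult:
    # Stub for a fast, cheap model that handles routine requests.
    return StageResult(f"Quick answer to: {transcript}", 200)


def large_model_reply(transcript: str) -> StageResult:
    # Stub for a slower, more capable model used for hard requests.
    return StageResult(f"Careful answer to: {transcript}", 1500)


def text_to_speech(text: str) -> StageResult:
    # Stub: a real system would synthesize audio here.
    return StageResult(f"<audio:{text}>", 250)


def chained_turn(audio: bytes) -> Turn:
    """Run one STT -> LLM -> TTS turn and add up the per-stage latency."""
    stt = speech_to_text(audio)
    if needs_delegation(stt.output):
        route, llm = "delegated", large_model_reply(stt.output)
    else:
        route, llm = "small", small_model_reply(stt.output)
    tts = text_to_speech(llm.output)
    total = stt.latency_ms + llm.latency_ms + tts.latency_ms
    return Turn(tts.output, route, total)


if __name__ == "__main__":
    for utterance in (b"What are your opening hours?", b"I want a refund"):
        turn = chained_turn(utterance)
        print(turn.route, turn.total_latency_ms)
```

Because the stages run one after another, their latencies add up, and delegating a hard request makes that turn slower still. A speech-to-speech model replaces the three hops with a single one, which is where its latency advantage in the summary comes from.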
voice · realtime-api · openai
Original description
How to build production voice applications and learnings from working with customers along the way!

https://x.com/tokisherbakov
https://www.linkedin.com/in/akotha7/