Building Effective Voice Agents — Toki Sherbakov + Anoop Kotha, OpenAI
Takeaway
Pick speech-to-speech (Realtime API) when latency and expressivity matter; pick chained when accuracy and determinism dominate.
Summary
- Speech-to-speech voice models have gone from slow and robotic six months ago to fast, expressive, and interruptible today, as demoed with the OpenAI Realtime API.
- Two architectures: chained (STT -> LLM -> TTS), which is lossy (the transcript drops tone and emphasis) and slow (each hop adds latency), vs. unified speech-to-speech via the Realtime API, which preserves semantics with low latency.
- Five trade-off axes: latency, cost, accuracy/intelligence, UX, and integrations/tooling. Consumer apps optimize for latency and UX, while customer service prioritizes accuracy and SIP/Twilio integrations.
- Voice agents need a different design from text agents: prompts, voice customization, tools, and evals/guardrails all differ. Small triage models can delegate hard tasks to o3 or GPT-4 mini.
- The Realtime API may not fit deterministic call-center work, where the chained architecture still wins.
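The chained architecture and the triage-and-delegate pattern from the bullets above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the speakers' implementation: every function name here (`stub_stt`, `stub_llm`, `stub_tts`, `run_chained`, `triage`) is hypothetical, the stages are stubs that call no real API, the latency figures are placeholders rather than measurements, and the keyword rule only stands in for whatever logic a real triage model would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StageResult:
    output: str
    latency_ms: float  # placeholder value, not a measurement


Stage = Callable[[str], StageResult]


def stub_stt(audio_label: str) -> StageResult:
    # Hypothetical transcription step: only the words survive,
    # so tone and emphasis in the audio are lost here.
    return StageResult("where is my order", 300.0)


def stub_llm(transcript: str) -> StageResult:
    # Hypothetical text-only reasoning step.
    return StageResult(f"reply to: {transcript}", 500.0)


def stub_tts(text: str) -> StageResult:
    # Hypothetical speech synthesis step.
    return StageResult(f"<audio:{text}>", 200.0)


def run_chained(audio_label: str, stages: List[Stage]) -> StageResult:
    """Run stages in sequence; each hop's latency adds to the total."""
    data = audio_label
    total_ms = 0.0
    for stage in stages:
        result = stage(data)
        data = result.output
        total_ms += result.latency_ms
    return StageResult(data, total_ms)


# Assumed examples of requests a small model should hand off.
HARD_KEYWORDS = ("refund", "dispute", "cancel")


def triage(utterance: str) -> str:
    """Return 'delegate' for hard requests, 'realtime' otherwise."""
    text = utterance.lower()
    if any(keyword in text for keyword in HARD_KEYWORDS):
        return "delegate"
    return "realtime"


if __name__ == "__main__":
    result = run_chained("caller-audio", [stub_stt, stub_llm, stub_tts])
    print(result.output, result.latency_ms)
    print(triage("I want a refund"))
```

Because `run_chained` sums per-stage latency, the sketch makes the cost of each extra hop visible. A speech-to-speech model would replace the three stages with a single call, and `triage` marks where a fast voice model might hand a hard request to a slower, more capable one.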
voice, realtime-api, openai
Original description
How to build production voice applications and learnings from working with customers along the way! https://x.com/tokisherbakov https://www.linkedin.com/in/akotha7/