🎙️ Voice
Real-time voice AI: ASR (Whisper), TTS, turn detection, latency, and voice agents for phones, support, and accessibility.
The workflow
flowchart LR
A[Mic audio] --> B[VAD<br/>turn detection]
B --> C[ASR<br/>Whisper / streaming]
C --> D[LLM reasoning]
D --> E[TTS<br/>low-latency stream]
E --> F[Speaker out]
F --> G[Barge-in<br/>interrupt handling]
G --> A
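Sub-300 ms end-to-end latency is the bar. Below that, conversations feel human. The talks listed here quote looser production targets, from sub-500 ms to roughly 800 ms voice-to-voice.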
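One way to reason about that bar is a per-stage latency budget over the workflow above. The sketch below is illustrative only: the stage names and millisecond values are hypothetical placeholders, not measurements from any of the talks, and it assumes every stage sits sequentially on the critical path (streaming overlap between stages would change the accounting).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float  # time until this stage emits its first useful output


def voice_to_voice_ms(stages: List[Stage]) -> float:
    """Sum first-output latencies along a strictly sequential pipeline."""
    return sum(s.latency_ms for s in stages)


def over_budget_ms(stages: List[Stage], budget_ms: float = 300.0) -> float:
    """Positive result: over budget by that many ms. Negative: headroom."""
    return voice_to_voice_ms(stages) - budget_ms


def slowest_stage(stages: List[Stage]) -> Stage:
    """The stage contributing the most latency, i.e. the first place to look."""
    return max(stages, key=lambda s: s.latency_ms)


# Hypothetical placeholder numbers, chosen only to make the arithmetic visible.
PIPELINE = [
    Stage("vad_end_of_turn", 100.0),
    Stage("asr_final_transcript", 60.0),
    Stage("llm_first_token", 90.0),
    Stage("tts_first_audio", 40.0),
    Stage("network_and_playout", 30.0),
]
```

With these placeholder values the pipeline totals 320 ms, 20 ms over a 300 ms budget, and the end-of-turn wait is the largest single contributor. Swapping in measured values per stage shows where a real system spends its time.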
Key takeaways
Videos (26)
Building voice agents with OpenAI — Dominik Kundel, OpenAI
Use the new Agents SDK to combine fast speech-to-speech frontends with reasoning-model backends for production voice agents.
Full Workshop: Realtime Voice AI — Mark Backman, Daily
Pipecat orchestrates pluggable voice agent pipelines and now wraps native speech-to-speech models like Gemini Live for low-latency conversational apps.
Building Effective Voice Agents — Toki Sherbakov + Anoop Kotha, OpenAI
Pick speech-to-speech (Realtime API) when latency and expressivity matter; pick chained when accuracy and determinism dominate.
Voice Agent Engineering — Nik Caryotakis, SuperDial
Voice AI in 2025 wins on vertical integrations and conversational design, not speech-to-speech realism — and the voice AI engineer is a distinct multimodal role.
Why TTS Models Now Look Like LLMs — Samuel Humeau, Mistral
Modern TTS converges on LLM-style autoregressive decoders over neural-codec audio tokens, optimized for streaming agent latency and voice cloning.
Serving Voice AI at $1/hr: Open-source, LoRAs, Latency, Load Balancing - Neil Dwyer, Gabber
Self-hosted Orpheus + LoRA voice clones on L40S can deliver real-time consumer-grade voice AI at $1/hr if you fine-tune away head-of-line silence.
Pipecat Cloud: Enterprise Voice Agents Built On Open Source - Kwindla Hultman Kramer, Daily
Open-source Pipecat plus Daily's WebRTC infrastructure can hit the ~800ms voice-to-voice latency that real conversational AI requires.
Building Conversational Agents — Thor Schaeff and Philipp Schmid, Google DeepMind
Gemini's free-tier API plus AI Studio is the fastest path to a multilingual conversational agent prototype without a credit card.
Voice Agents: the good, the bad, and the ugly
Production voice agents need decomposed prompts, tool-based state transitions, and a separate text-LLM supervisor to keep conversations on rails.
How to build the world's fastest voice bot: Kwindla Hultman Kramer
For voice bots, aim for sub-500ms latency end-to-end; the orchestration map (media → STT → LLM → TTS plus tool calls) is the work, not the model itself.
[Full Workshop] Building Conversational AI Agents - Thor Schaeff, ElevenLabs
ElevenLabs' conversational stack favors text-mediated pipelines for transparency and supports 99 languages via colocated ASR/LLM/TTS components.
Shipping an Enterprise Voice AI Agent in 100 Days - Peter Bar, Intercom Fin
Voice agents ship fast by reusing chat-side RAG and starting with the out-of-hours voicemail-replacement wedge, but conversation design must be re-thought for sub-second latency.
Why ChatGPT Keeps Interrupting You — Dr. Tom Shapland, LiveKit
To stop voice AI from interrupting, augment VAD with semantic, syntactic and prosodic models that predict (not just detect) end-of-turn the way humans do.
Give Your Chat Agent a Voice — Luke Harries, Head of Growth, ElevenLabs
Voice Engine lets you slap voice onto any existing chat agent in one prompt instead of replacing the whole stack.
Building and Scaling an AI Agent Swarm of low latency real time voice bots: Damien Murphy
Modern voice agents should use unified audio-in/audio-out models on call-center-tuned acoustics for low-latency real-time deployments.
Voice AI: when is the "Her" moment? — Neil Zeghidour, CEO, Gradium AI
Real-time conversation needs full-duplex speech-to-speech models plus filler-based tool-call hiding — current half-duplex stacks aren't there yet.
Milliseconds to Magic: Real-Time Workflows using the Gemini Live API and Pipecat
Voice AI's hard problems migrate down the stack over time, so build orchestration code now while leaving room for API/model improvements to absorb tomorrow's work.
Your realtime AI is ngmi — Sean DuBois (OpenAI), Kwindla Kramer (Daily)
For sub-second voice agents, use WebRTC for edge audio and reserve WebSockets for server-to-server traffic and small structured data.
Realtime Conversational Video with Pipecat and Tavus — Chad Bailey and Brian Johnson, Daily & Tavus
Production real-time voice/video AI needs a dedicated orchestration layer (Pipecat) on top of voice or video-generation models like Tavus replicas.
Giving a Voice to AI Agents: Scott Stephenson, CEO, Deepgram
The leap to human-feeling voice agents comes not from speed or accuracy but from passing rich context (tone, history, modality) through every stage of the STT-LLM-TTS pipeline.
Building an AI assistant that makes phone calls [Convex Workshop]
Reactive databases like Convex make it straightforward to glue STT, LLM and TTS streams into real-time voice agents that hold actual phone conversations.
From Self-driving to Autonomous Voice Agents — Brooke Hopkins, Coval
Borrow self-driving's large-scale probabilistic simulation playbook to escape voice-agent POC purgatory and ship reliable conversational systems.
Serving Voice AI at Scale — Arjun Desai (Cartesia) & Rohit Talluri (AWS)
State-space models give voice AI the millisecond latency it needs without giving up quality — transformers' quadratic scaling makes them unsuitable for sub-second voice agents.
Contact Center Voice AI: Low-Latency Intelligence Extraction from Messy Audio Streams — Dippu Singh
Real contact-center voice AI ROI comes from automating the 1:1 after-call work via a low-latency STT + structured-output LLM + CRM-schema-mapper pipeline with a human verification step.
The Voice-First AI Overlay: Designing Conversational Co-Pilots - Gregory Bruss
Ambient conversational overlays solve a different latency/UX problem than voice agents — timing and attention budget matter more than raw speed.
The End of Awkward AI Transcriptions - Travis Bartley and Myungjong Kim
NVIDIA Riva trades a single 'one model' approach for a Fast Conformer-based toolkit (Parakeet for streaming, Canary for accuracy) plus Sortformer diarization, and tops the Open ASR leaderboard.
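The turn-detection takeaway in the list above (augment VAD with models that predict end-of-turn) can be sketched as a silence threshold that adapts to how complete the transcript looks. This is a toy illustration, not LiveKit's method: the word list, the `looks_complete` heuristic, and both thresholds are assumptions standing in for a trained semantic or syntactic model, and prosody is not modeled at all.

```python
# Hypothetical list of words that rarely end a finished English utterance.
INCOMPLETE_ENDINGS = {"and", "but", "so", "because", "um", "uh", "the", "to", "my"}


def looks_complete(transcript: str) -> bool:
    """Crude stand-in for a semantic/syntactic end-of-turn model."""
    words = transcript.lower().split()
    if not words:
        return False
    last = words[-1].strip(",.?!")
    return last not in INCOMPLETE_ENDINGS


def end_of_turn(
    transcript: str,
    silence_ms: float,
    short_ms: float = 200.0,   # assumed wait when the utterance looks finished
    long_ms: float = 1200.0,   # assumed wait when the speaker seems mid-thought
) -> bool:
    """Plain VAD uses one fixed silence threshold; here the threshold adapts."""
    threshold = short_ms if looks_complete(transcript) else long_ms
    return silence_ms >= threshold
```

For example, after 400 ms of silence, `end_of_turn("I want to book a flight to", 400)` keeps waiting because the utterance ends on "to", while `end_of_turn("I want to book a flight.", 400)` hands the turn to the agent.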