🎙️ Voice

Real-time voice AI. ASR (Whisper), TTS, turn detection, latency, voice agents for phones, support, accessibility.

26 videos · voice · pipecat · agents · voice-ai · openai · gemini-live

The workflow

flowchart LR
    A[Mic audio] --> B[VAD<br/>turn detection]
    B --> C[ASR<br/>Whisper / streaming]
    C --> D[LLM reasoning]
    D --> E[TTS<br/>low-latency stream]
    E --> F[Speaker out]
    F --> G[Barge-in<br/>interrupt handling]
    G --> A

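Sub-300ms end-to-end latency is the aspirational bar. Below that, conversations feel human. The talks listed below cite practical targets ranging from sub-500ms to roughly 800ms voice-to-voice.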
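
One way to reason about the latency bar is as a per-turn budget: each stage of the loop above spends part of it, and the sum is what the caller hears as the pause before a reply. The sketch below is a minimal illustration, not a benchmark. The `Stage` type, the `report` helper, and every millisecond figure are hypothetical placeholders chosen for the arithmetic; real numbers have to come from measuring your own stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float  # hypothetical figure, NOT a measurement


def voice_to_voice_ms(stages):
    """Sum the latencies on the critical path of a single turn."""
    return sum(stage.latency_ms for stage in stages)


def report(stages, budget_ms=300.0):
    """Compare the summed latency against a budget and name the slowest stage."""
    total = voice_to_voice_ms(stages)
    slowest = max(stages, key=lambda stage: stage.latency_ms)
    return {
        "total_ms": total,
        "over_by_ms": max(0.0, total - budget_ms),
        "slowest": slowest.name,
    }


# Illustrative numbers only: assumed for the example, not taken from any talk.
PIPELINE = [
    Stage("VAD / turn detection", 60),
    Stage("ASR (final transcript)", 80),
    Stage("LLM (time to first token)", 120),
    Stage("TTS (time to first audio)", 70),
    Stage("Network + playout", 40),
]

if __name__ == "__main__":
    print(report(PIPELINE))
```

With these assumed figures the turn totals 370 ms, which is 70 ms over a 300 ms budget, and the LLM's time to first token is the largest single contributor. Because the stages run in series, streaming each stage into the next (partial transcripts, token-by-token TTS) is the usual way to shrink the total without swapping out a model.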

Key takeaways

Use the new Agents SDK to combine fast speech-to-speech frontends with reasoning-model backends for production voice agents.

Pipecat orchestrates pluggable voice agent pipelines and now wraps native speech-to-speech models like Gemini Live for low-latency conversational apps.

Pick speech-to-speech (Realtime API) when latency and expressivity matter; pick chained when accuracy and determinism dominate.

Voice AI in 2025 wins on vertical integrations and conversational design, not speech-to-speech realism — and the voice AI engineer is a distinct multimodal role.

Modern TTS converges on LLM-style autoregressive decoders over neural-codec audio tokens, optimized for streaming agent latency and voice cloning.

Self-hosted Orpheus + LoRA voice clones on L40S can deliver real-time consumer-grade voice AI at $1/hr if you fine-tune away head-of-line silence.

Videos (26)

Building voice agents with OpenAI — Dominik Kundel, OpenAI

Use the new Agents SDK to combine fast speech-to-speech frontends with reasoning-model backends for production voice agents.

27.1K views · Jun 29, 2025

Full Workshop: Realtime Voice AI — Mark Backman, Daily

Pipecat orchestrates pluggable voice agent pipelines and now wraps native speech-to-speech models like Gemini Live for low-latency conversational apps.

14.2K views · Aug 03, 2025

Building Effective Voice Agents — Toki Sherbakov + Anoop Kotha, OpenAI

Pick speech-to-speech (Realtime API) when latency and expressivity matter; pick chained when accuracy and determinism dominate.

11.2K views · Jul 20, 2025

Voice Agent Engineering — Nik Caryotakis, SuperDial

Voice AI in 2025 wins on vertical integrations and conversational design, not speech-to-speech realism — and the voice AI engineer is a distinct multimodal role.

9.8K views · Apr 18, 2025

Why TTS Models Now Look Like LLMs — Samuel Humeau, Mistral

Modern TTS converges on LLM-style autoregressive decoders over neural-codec audio tokens, optimized for streaming agent latency and voice cloning.

8.9K views · May 09, 2026

Serving Voice AI at $1/hr: Open-source, LoRAs, Latency, Load Balancing - Neil Dwyer, Gabber

Self-hosted Orpheus + LoRA voice clones on L40S can deliver real-time consumer-grade voice AI at $1/hr if you fine-tune away head-of-line silence.

6.8K views · Jul 31, 2025

Pipecat Cloud: Enterprise Voice Agents Built On Open Source - Kwindla Hultman Kramer, Daily

Open-source Pipecat plus Daily's WebRTC infrastructure can hit the ~800ms voice-to-voice latency that real conversational AI requires.

6.1K views · Jul 31, 2025

Building Conversational Agents — Thor Schaeff and Philipp Schmid, Google DeepMind

Gemini's free-tier API plus AI Studio is the fastest path to a multilingual conversational agent prototype without a credit card.

5.7K views · Apr 30, 2026

Voice Agents: the good, the bad, and the ugly

Production voice agents need decomposed prompts, tool-based state transitions, and a separate text-LLM supervisor to keep conversations on rails.

4.7K views · Feb 22, 2025

How to build the world's fastest voice bot: Kwindla Hultman Kramer

For voice bots, aim for sub-500ms latency end-to-end; the orchestration map (media → STT → LLM → TTS plus tool calls) is the work, not the model itself.

4.5K views · Feb 10, 2025

[Full Workshop] Building Conversational AI Agents - Thor Schaeff, ElevenLabs

ElevenLabs' conversational stack favors text-mediated pipelines for transparency and supports 99 languages via colocated ASR/LLM/TTS components.

4.4K views · Jul 31, 2025

Shipping an Enterprise Voice AI Agent in 100 Days - Peter Bar, Intercom Fin

Voice agents ship fast by reusing chat-side RAG and starting with the out-of-hours voicemail-replacement wedge, but conversation design must be re-thought for sub-second latency.

4.2K views · Jul 18, 2025

Why ChatGPT Keeps Interrupting You — Dr. Tom Shapland, LiveKit

To stop voice AI from interrupting, augment VAD with semantic, syntactic and prosodic models that predict (not just detect) end-of-turn the way humans do.

3.6K views · Jul 31, 2025

Give Your Chat Agent a Voice — Luke Harries, Head of Growth, ElevenLabs

Voice Engine lets you slap voice onto any existing chat agent in one prompt instead of replacing the whole stack.

3.2K views · May 09, 2026

Building and Scaling an AI Agent Swarm of low latency real time voice bots: Damien Murphy

Modern voice agents should use unified audio-in/audio-out models tuned to call-center acoustics for low-latency real-time deployments.

3.2K views · Oct 08, 2024

Voice AI: when is the "Her" moment? — Neil Zeghidour, CEO, Gradium AI

Real-time conversation needs full-duplex speech-to-speech models plus filler-based tool-call hiding — current half-duplex stacks aren't there yet.

2.9K views · May 09, 2026

Milliseconds to Magic: Real-Time Workflows using the Gemini Live API and Pipecat

Voice AI's hard problems migrate down the stack over time, so build orchestration code now while leaving room for API/model improvements to absorb tomorrow's work.

2.6K views · Jun 27, 2025

Your realtime AI is ngmi — Sean DuBois (OpenAI), Kwindla Kramer (Daily)

For sub-second voice agents, use WebRTC for edge audio and reserve websockets for server-to-server and small structured data.

2.2K views · Jul 31, 2025

Realtime Conversational Video with Pipecat and Tavus — Chad Bailey and Brian Johnson, Daily & Tavus

Production real-time voice/video AI needs a dedicated orchestration layer (Pipecat) on top of voice or video-generation models like Tavus replicas.

2.2K views · Jun 27, 2025

Giving a Voice to AI Agents: Scott Stephenson, CEO, Deepgram

The leap to human-feeling voice agents comes not from speed or accuracy but from passing rich context (tone, history, modality) through every stage of the STT-LLM-TTS pipeline.

1.8K views · Feb 10, 2025

Building an AI assistant that makes phone calls [Convex Workshop]

Reactive databases like Convex make it straightforward to glue STT, LLM and TTS streams into real-time voice agents that hold actual phone conversations.

1.8K views · Feb 09, 2025

From Self-driving to Autonomous Voice Agents — Brooke Hopkins, Coval

Borrow self-driving's large-scale probabilistic simulation playbook to escape voice-agent POC purgatory and ship reliable conversational systems.

1.8K views · Jul 31, 2025

Serving Voice AI at Scale — Arjun Desai (Cartesia) & Rohit Talluri (AWS)

State-space models give voice AI the millisecond latency it needs without giving up quality — transformers' quadratic scaling makes them unsuitable for sub-second voice agents.

1.8K views · Jun 27, 2025

Contact Center Voice AI: Low-Latency Intelligence Extraction from Messy Audio Streams — Dippu Singh

Real contact-center voice AI ROI comes from automating the 1:1 after-call work via a low-latency STT + structured-output LLM + CRM-schema-mapper pipeline with a human verification step.

1.6K views · Apr 08, 2026

The Voice-First AI Overlay: Designing Conversational Co-Pilots - Gregory Bruss

Ambient conversational overlays solve a different latency/UX problem than voice agents — timing and attention budget matter more than raw speed.

1.3K views · Jun 03, 2025

The End of Awkward AI Transcriptions - Travis Bartley and Myungjong Kim

NVIDIA Riva trades a single 'one model' approach for a Fast Conformer-based toolkit (Parakeet for streaming, Canary for accuracy) plus Sortformer diarization, and tops the Open ASR Leaderboard.

550 views · Jun 03, 2025