How to build the world's fastest voice bot: Kwindla Hultman Kramer

4.5K views · Feb 10, 2025 · 20:38 min

Takeaway

For voice bots, aim for sub-500ms latency end-to-end; the orchestration map (media → STT → LLM → TTS plus tool calls) is the work, not the model itself.
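To make the orchestration idea concrete, here is a minimal sketch of the STT → LLM → TTS chain as three concurrent stages connected by queues. It is not Pipecat's API: the function names (`stt`, `llm`, `tts`, `run_pipeline`) are hypothetical, and each stage is a stub that transforms strings instead of calling a real service. The point is only that every stage starts work as soon as the previous one produces output.

```python
import asyncio


async def stt(audio_in: asyncio.Queue, text_out: asyncio.Queue) -> None:
    # Stub transcription: in this sketch each "audio chunk" is already a string.
    while (chunk := await audio_in.get()) is not None:
        await text_out.put(chunk.lower())
    await text_out.put(None)  # propagate end-of-stream


async def llm(text_in: asyncio.Queue, reply_out: asyncio.Queue) -> None:
    # Stub model: a real stage would stream tokens from an inference server.
    while (text := await text_in.get()) is not None:
        await reply_out.put(f"reply to: {text}")
    await reply_out.put(None)


async def tts(reply_in: asyncio.Queue, spoken: list) -> None:
    # Stub synthesis: a real stage would emit audio frames to the transport.
    while (reply := await reply_in.get()) is not None:
        spoken.append(f"<audio:{reply}>")


async def run_pipeline(chunks: list) -> list:
    audio_q, text_q, reply_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    spoken: list = []
    for chunk in chunks:
        audio_q.put_nowait(chunk)
    audio_q.put_nowait(None)  # sentinel marks the end of the input audio
    # All three stages run concurrently; none waits for the whole input.
    await asyncio.gather(
        stt(audio_q, text_q),
        llm(text_q, reply_q),
        tts(reply_q, spoken),
    )
    return spoken


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["Hello", "Any allergies?"])))
```

A production pipeline would add tool calls, interruption handling, and echo cancellation on top of this; those are left out to keep the sketch short.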

Summary

Kwindla Hultman Kramer (Daily, real-time audio/video infra, creators of Pipecat) lays out the architecture for production voice bots: media transport, transcription, buffering, model swapping, phrase endpointing, interruption handling, echo cancellation, fast TTS.
Target latency: ~500ms total voice-to-voice response time, because humans typically reply within 200-300ms. Gemini Pro alone has ~900ms time-to-first-token, so the budget is blown before networking is even counted.
Throughput improves an order of magnitude every few years but latency improvements are linear — latency is the binding constraint.
Healthcare demo: voice agent calls a patient and replaces forms (intake questions about prescriptions and allergies), using real tool calling with mocked EHR endpoints.
Echo cancellation cannot be sidestepped by assuming users wear headphones; it must be solved natively in the pipeline.
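The latency point in the summary comes down to simple arithmetic, sketched below. Only the ~500ms target and the ~900ms time-to-first-token figure come from the talk summary; the other per-stage numbers and the names `Stage`, `total_latency_ms`, and `over_budget_ms` are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass

TARGET_MS = 500  # voice-to-voice target cited in the summary


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: int


def total_latency_ms(stages: list) -> int:
    # Stages are treated as strictly sequential, which is the worst case.
    return sum(stage.latency_ms for stage in stages)


def over_budget_ms(stages: list, target_ms: int = TARGET_MS) -> int:
    # Positive result: milliseconds over the target. Negative: headroom left.
    return total_latency_ms(stages) - target_ms


# Assumed numbers, except the ~900 ms time-to-first-token from the summary.
hosted_llm = [
    Stage("network + transport", 50),  # assumption
    Stage("speech-to-text", 100),      # assumption
    Stage("LLM time-to-first-token", 900),
    Stage("text-to-speech first byte", 100),  # assumption
]

if __name__ == "__main__":
    print(over_budget_ms(hosted_llm))
```

Under these assumptions the pipeline lands 650ms over the target, and the LLM stage on its own is already 400ms over, which is why the summary calls the budget blown before networking.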

voice · latency · pipecat

Original description

How to build the world's fastest voice AI bot:

Self-host speech-to-text, LLM inference, and text-to-speech all together in the same container/cluster.
Route audio over the internet using WebRTC and edge networking.
Configure timings for voice activity detection, phrase endpointing, and other parts of the pipeline to optimize for latency. (There are trade-offs to doing this!)
Here's a Llama 3 voice bot that has voice-to-voice response times of ~500ms.
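The trade-off behind the timing configuration can be illustrated with a small phrase-endpointing sketch. It assumes a voice activity detector that labels fixed-size frames as speech or silence; `EndpointConfig`, `find_endpoint`, and the default values are hypothetical and are not the settings used in the bot described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndpointConfig:
    frame_ms: int = 20   # duration of one VAD frame (assumption)
    start_ms: int = 60   # continuous speech required before the user counts as talking
    stop_ms: int = 300   # continuous silence required before the phrase counts as finished


def find_endpoint(frames: list, cfg: EndpointConfig) -> Optional[int]:
    """Return the index of the frame at which the phrase is declared finished."""
    speech_run = 0
    silence_run = 0
    speaking = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            speech_run += cfg.frame_ms
            silence_run = 0
            if not speaking and speech_run >= cfg.start_ms:
                speaking = True
        else:
            silence_run += cfg.frame_ms
            speech_run = 0
            if speaking and silence_run >= cfg.stop_ms:
                return i
    return None  # the phrase has not ended within these frames


# 200 ms of speech, a 100 ms pause, 200 ms more speech, then silence.
utterance = [True] * 10 + [False] * 5 + [True] * 10 + [False] * 30

if __name__ == "__main__":
    print(find_endpoint(utterance, EndpointConfig(stop_ms=300)))
    print(find_endpoint(utterance, EndpointConfig(stop_ms=80)))
```

With `stop_ms=300` the endpoint fires after the full utterance, but the bot waits 300ms of silence before it can start responding. With `stop_ms=80` the endpoint fires inside the 100ms pause and cuts the speaker off mid-phrase. The silence threshold adds directly to response latency, so lowering it buys speed at the cost of more false endpoints.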

We used @DeepgramAI's STT and TTS for this bot, and everything is hosted on @cerebriumai's serverless GPU infrastructure.

https://x.com/kwindla/status/1806129490411900940

Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025

About Kwindla 
I am one of the founders of Daily and I serve as the company's CEO. We develop infrastructure and SDKs for video and audio. If you are creating a product or an app that has video or audio features, we can probably help. The Internet is increasingly a video-first medium, and we think of ourselves as building the infrastructure for the future of our collective digital experience.