Giving a Voice to AI Agents: Scott Stephenson, CEO, Deepgram
Takeaway
The leap to human-feeling voice agents comes not from speed or accuracy but from passing rich context (tone, history, modality) through every stage of the STT-LLM-TTS pipeline.
Summary
- Deepgram CEO Scott Stephenson: Voice AI 2.0 chains STT→LLM→TTS, and full round-trips under 500ms have already been achieved (Daily + Deepgram + Llama), which matches the 400-600ms gaps of human turn-taking.
- STT accuracy rose from ~75% to 90%+ and latency fell from 2-5s to 100-200ms over the last few years; co-located on-prem deployments make real-time interaction possible.
- Next-gen architecture (Deepgram's 'Contextual AI') passes prompt-like context (audio, images, embeddings, previous turns) between models: STT emits context along with its output, and TTS emits context describing how it spoke.
- Stephenson argues that melded speech-to-speech models lose controllability, and that modular composition with rich context will produce the first truly human-feeling agents within a year.
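The modular, context-passing design described in the summary can be sketched roughly as follows. This is a minimal illustration, not Deepgram's API or the architecture shown in the talk: the `Context` schema, the stage functions `stt`, `llm`, and `tts`, and the tone heuristic are all hypothetical stand-ins, and each stage is a stub where a real model would run.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Prompt-like context handed from stage to stage (hypothetical schema)."""
    history: list[str] = field(default_factory=list)      # previous turns
    notes: dict[str, str] = field(default_factory=dict)   # e.g. tone, speaking style


def stt(audio: bytes, ctx: Context) -> tuple[str, Context]:
    # Stub: a real STT model would decode audio; here the bytes are the text.
    transcript = audio.decode("utf-8")
    # Assumption: the STT stage also reports how the user sounded.
    ctx.notes["user_tone"] = "frustrated" if transcript.endswith("!") else "neutral"
    return transcript, ctx


def llm(transcript: str, ctx: Context) -> tuple[str, Context]:
    # Stub: the reply is conditioned on the tone reported by the STT stage.
    prefix = "Sorry about that. " if ctx.notes.get("user_tone") == "frustrated" else ""
    reply = f"{prefix}You said: {transcript}"
    ctx.history += [f"user: {transcript}", f"agent: {reply}"]
    return reply, ctx


def tts(text: str, ctx: Context) -> tuple[bytes, Context]:
    # Stub: pick a speaking style, then record how the reply was spoken
    # so that later turns can take it into account.
    style = "calm" if ctx.notes.get("user_tone") == "frustrated" else "friendly"
    ctx.notes["agent_style"] = style
    return text.encode("utf-8"), ctx


def run_turn(audio: bytes, ctx: Context) -> tuple[bytes, Context]:
    """One STT -> LLM -> TTS turn, threading the same context through each stage."""
    transcript, ctx = stt(audio, ctx)
    reply, ctx = llm(transcript, ctx)
    return tts(reply, ctx)
```

Calling `run_turn(b"My order is late!", Context())` returns the stubbed audio plus a context whose `notes` record both the detected user tone and the style the agent spoke in. Because every stage stays a separate function, any one of them can be swapped or inspected on its own, which is the controllability argument the summary attributes to the modular approach.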
voice, deepgram, speech-recognition
Original description
Voice AI technology has evolved significantly in recent years, transitioning from simple command-response systems to more sophisticated natural conversational agents powered by Large Language Models (LLMs). This progression in voice AI is being driven by advances in core technologies such as foundational AI models, dramatically transforming interactions between humans and machines. Notable improvements include advanced automatic speech recognition and breakthroughs in human-like speech synthesis, all integrated with the deep language comprehension provided by LLMs. These developments have culminated in powerful, autonomous systems that interact through spoken language exclusively.
During the session, Scott Stephenson, Founder and CEO of Deepgram, will explore the fundamental principles and best practices for crafting responsive, realistic, and captivating AI agents. He will delve into topics such as natural language processing and the design of multimodal interactions. Attendees will gain insights into the principal design challenges and key considerations involved in developing voice agents capable of managing complex conversations and providing context-sensitive responses on par with human speakers.
Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025
About Scott
Scott Stephenson is a dark matter physicist turned Deep Learning entrepreneur. He earned a PhD in particle physics from University of Michigan where his research involved building a lab two miles underground to detect dark matter. Scott left his physics post-doc research position to found Deepgram.