Why ChatGPT Keeps Interrupting You — Dr. Tom Shapland, LiveKit

3.6K views · Jul 31, 2025 · 27:02 min

Takeaway

To stop voice AI from interrupting, augment VAD with semantic, syntactic and prosodic models that predict (not just detect) end-of-turn the way humans do.

Summary

  • LiveKit's Tom Shapland frames interruptions as the biggest unsolved problem in voice AI: VAD plus a silence timeout (>500 ms) is far cruder than human turn-taking, whose timing also varies by language (Japanese listeners respond near-instantly, Danes are the slowest).
  • Humans predict end-of-turn from semantics, then syntax, then prosody, and start generating their own speech before the speaker finishes: comprehension and production run in full duplex.
  • Current cascading voice pipelines (STT → VAD → LLM → TTS) are unidirectional and look only backwards; new approaches augment VAD with semantic/syntax/prosody models.
  • LiveKit's transformer-based semantic end-of-utterance model takes the last four conversational turns and predicts an end-of-utterance token.
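
The contrast between a fixed silence timeout and a semantically informed one can be sketched as follows. This is a minimal illustration, not LiveKit's implementation: it assumes 20 ms audio frames with one boolean VAD decision per frame, and the function names, the 200 ms / 1500 ms timeouts, and the 0.5 threshold are all hypothetical. The `eou_probability` argument stands in for the output of a semantic end-of-utterance model.

```python
from typing import Iterable, Optional

FRAME_MS = 20  # assumed frame duration; not a value from the talk


def end_of_turn_frame(
    speech_flags: Iterable[bool],
    silence_timeout_ms: int = 500,
    frame_ms: int = FRAME_MS,
) -> Optional[int]:
    """Baseline endpointing: return the index of the frame at which
    trailing silence first reaches the timeout, or None if it never does."""
    silence_ms = 0
    heard_speech = False
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence counter
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= silence_timeout_ms:
                return i
    return None


def adaptive_timeout_ms(
    eou_probability: float,
    short_ms: int = 200,    # hypothetical: respond quickly when the turn looks finished
    long_ms: int = 1500,    # hypothetical: wait longer when the user seems mid-thought
    threshold: float = 0.5,  # hypothetical decision threshold
) -> int:
    """Choose a silence timeout from a (hypothetical) end-of-utterance probability."""
    return short_ms if eou_probability >= threshold else long_ms


# A user speaks, pauses for 600 ms mid-sentence, then continues.
paused = [True] * 5 + [False] * 30 + [True] * 5

# The fixed 500 ms timeout fires during the pause, so the agent interrupts.
fixed = end_of_turn_frame(paused, silence_timeout_ms=500)

# If a semantic model judged the utterance unfinished (low probability),
# the longer timeout rides out the pause and no end of turn is declared.
patient = end_of_turn_frame(paused, silence_timeout_ms=adaptive_timeout_ms(0.1))
```

Here `fixed` is frame 29 (the agent cuts in mid-pause) while `patient` is `None`. The design choice in this sketch is that the semantic signal only adjusts how long VAD waits, so VAD still decides when silence has occurred.
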
voice-ai · turn-taking · livekit

Original description
ChatGPT Advanced Voice Mode isn’t interrupting just you. Interruptions, and turn-taking in general, are unsolved problems for all Voice AI agents. Nobody likes being cut short – and people have much less patience for machines than they do for other humans. Turn-taking failures take many forms (e.g., the agent interrupts the user, the agent mistakes a cough for an interruption), and all of them lead to users immediately hanging up the phone.

In this talk, we use human conversation as a framework for understanding both today’s approaches to turn detection and where the field is headed. You’ll learn about how linguists think about turn detection in human dialogue, what’s working (and what’s broken) in current methods, and how we might build Voice AIs that interrupt you less than your human brother.

About Tom Shapland
Tom Shapland, PhD, is a Product Manager at LiveKit. LiveKit is an open source platform for building, deploying, and scaling realtime multimodal agents. He's passionate about the multimodal future of human-computer interfaces. Before LiveKit, he was the cofounder of a Voice AI observability platform (Canonical AI) and an agriculture technology startup (Tule, YC S14). He lives in the East Bay and coaches lacrosse for his two kids.

Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter