
Voice AI: when is the "Her" moment? — Neil Zeghidour, CEO, Gradium AI

2.9K views · May 09, 2026 · 19:26 min
Takeaway

Real-time conversation needs full-duplex speech-to-speech models plus filler-based tool-call hiding — current half-duplex stacks aren't there yet.

Summary

  • Gradium spun out of a non-profit lab (Moshi creators, funded by Schmidt, Niel, Saadé) to be the voice-AI building-blocks provider.
  • Even the best voice agents have ~200ms of TTS latency, which by itself matches the ~200ms in which humans respond, leaving no budget for the speech-to-text + LLM + tool-call round trip needed for human-pace conversation.
  • Tool calls (500ms–4s via OpenRouter) are now the dominant bottleneck; the proposed fix is fillers, where the LLM keeps talking while it waits for tool results.
  • Speech-to-speech (one end-to-end model) is faster than cascaded systems, but every model except Moshi is half-duplex: it can't handle back-channeling, overlaps, or an 'mhm', all of which are essential to human (especially Japanese) conversation.
  • Full duplex is the real Her threshold — humans overlap up to 20% of conversation time.
voice · full-duplex · latency
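
The filler idea from the summary can be sketched roughly as follows. This is a minimal illustration, not Gradium's implementation: `slow_tool_call`, `respond`, the filler phrase, and the 0.2 s threshold are all hypothetical names and values chosen for the example. The agent starts the tool call, and if no result has arrived within the threshold, it says a filler before delivering the answer.

```python
import asyncio

# Assumed threshold: roughly the ~200ms human response gap cited in the talk.
FILLER_AFTER_S = 0.2


async def slow_tool_call(delay_s: float) -> str:
    """Hypothetical stand-in for a real tool call; it just sleeps."""
    await asyncio.sleep(delay_s)
    return "It is 18 degrees in Paris."


async def respond(tool_delay_s: float, filler_after_s: float = FILLER_AFTER_S) -> list[str]:
    """Return the utterances the agent would speak, in order."""
    spoken: list[str] = []
    # Start the tool call in the background so the agent is free to talk.
    task = asyncio.create_task(slow_tool_call(tool_delay_s))
    # Wait up to the threshold; asyncio.wait does not cancel the task on timeout.
    done, _pending = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        # The tool is slow: cover the silence with a filler.
        spoken.append("Let me check that for you...")
    # Deliver the real answer once the tool result is available.
    spoken.append(await task)
    return spoken


if __name__ == "__main__":
    print(asyncio.run(respond(tool_delay_s=0.5)))
```

In a real agent the filler would be streamed to the TTS stage instead of appended to a list, and the threshold would be tuned to the conversation.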
Original description
The "Her" moment has been promised so many times it's become a joke. Every new demo, every smooth-sounding voice agent gets called it. Neil Zeghidour, CEO of Gradium AI and one of the researchers behind Moshi — the first full-duplex voice model — uses this talk to be honest about where the gap actually is and why it keeps not closing.

The core tension: cascaded systems (speech-to-text, LLM, text-to-speech) are practical and getting smarter, but they're architecturally incapable of feeling like a real conversation. Latency from tool calls alone can be 500ms to 4 seconds — while humans process and respond in around 200ms total. Speech-to-speech models solve some of that but trade it for a different problem: they're still half-duplex, meaning they're either listening or talking but never both, which makes backchanneling impossible and the interaction feel robotic in a different way. Moshi showed that full-duplex is solvable. What it didn't solve was making the model useful. And cost is a wall hiding behind the latency problem — TTS at scale is expensive enough that some teams burn through their fundraising before they can grow a user base.
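
As a back-of-the-envelope check on those numbers, the cascade's latency can be added up against the human response time. The sketch below makes two simplifying assumptions that are not from the talk: the stages run strictly one after another, and speech-to-text and the LLM cost 0 ms (an optimistic lower bound, since the description gives no figures for them). The 200 ms TTS figure and the 500 ms to 4 s tool-call range are the ones quoted on this page.

```python
HUMAN_RESPONSE_MS = 200          # human response time quoted in the description
TTS_MS = 200                     # TTS latency quoted in the summary
TOOL_CALL_RANGE_MS = (500, 4000) # tool-call latency range quoted in the description


def cascade_latency_ms(stt_ms: int, llm_ms: int, tool_ms: int, tts_ms: int = TTS_MS) -> int:
    """Total latency, assuming the stages run sequentially with no streaming overlap."""
    return stt_ms + llm_ms + tool_ms + tts_ms


def over_budget_ms(total_ms: int, budget_ms: int = HUMAN_RESPONSE_MS) -> int:
    """How far the total exceeds the human response time (0 if within it)."""
    return max(0, total_ms - budget_ms)


# Optimistic lower bound: STT and LLM are assumed to be free (0 ms each).
best = cascade_latency_ms(stt_ms=0, llm_ms=0, tool_ms=TOOL_CALL_RANGE_MS[0])
worst = cascade_latency_ms(stt_ms=0, llm_ms=0, tool_ms=TOOL_CALL_RANGE_MS[1])

print(best, over_budget_ms(best))    # 700 500
print(worst, over_budget_ms(worst))  # 4200 4000
```

Even with free transcription and generation, the fastest tool call puts the cascade 500 ms past the human response time.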

The most underrated thread in the talk is paralinguistic understanding: voice carries tone, hesitation, discomfort, and cultural signals that get entirely stripped out the moment you transcribe to text. Getting to Her means building models that don't just produce natural-sounding speech but actually understand what the voice is carrying — and that's a science problem, not a prompt engineering one.

Speaker info:
- https://x.com/neilzegh
- https://www.linkedin.com/in/neil-zeghidour-a838aaa7/

Timestamps:

0:14 Introduction and mission of Gradium AI
1:16 Demonstration of voice cloning technology
2:42 The "Her" movie analogy and current limitations of Voice AI
5:42 Challenges of cascaded systems (Speech-to-Text, LLM, Text-to-Speech)
6:37 The difficulty of latency in tool calling
9:08 Explanation of Speech-to-Speech vs. cascaded architectures
9:34 The necessity of full-duplex systems and backchanneling
11:53 Demonstration of the full-duplex Moshi model
12:59 The importance of paralinguistic understanding
14:29 Scalability and the high cost of current Voice AI
16:38 Introducing Phoneon: on-device, local TTS for privacy and cost efficiency
18:29 Conclusion and path forward for Voice AI