Serving Voice AI at Scale — Arjun Desai (Cartesia) & Rohit Talluri (AWS)

1.8K views · Jun 27, 2025 · 17:04 min

Takeaway

State-space models give voice AI the millisecond latency it needs without giving up quality — transformers' quadratic scaling makes them unsuitable for sub-second voice agents.

Summary

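Cartesia, in a fireside chat hosted by AWS, discusses its Sonic 2 voice model and the state-space-model (SSM) architecture it pioneered as a transformer alternative: an SSM carries a fixed-size state, so each generation step at inference costs O(1) time and memory, whereas a transformer's attention cost scales quadratically with sequence length.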
Voice AI has three first-principles requirements: quality (naturalness), latency (time to first audio in milliseconds, not seconds), and controllability (brand-matching, accents, background sounds, voice cloning).
Most of the production latency budget should go to the LLM, so speech-to-text (STT) and text-to-speech (TTS) need to be near-instant. Cartesia claims the fastest TTS in the world. Its biggest cited customer markets are healthcare, customer support, and real-time gaming NPCs.
Notable design choice: Cartesia keeps faint phone-call artifacts in agent voices because perfectly clean audio lands in the uncanny valley for users. Its voice marketplace gives actors licensing income rather than replacing them.
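
The O(1)-versus-quadratic contrast in the summary can be made concrete with a toy comparison. The sketch below is not Cartesia's Sonic architecture or any real library's API. `DiagonalSSM`, `CachedAttention`, and `memory_after` are hypothetical names, and the coefficients are arbitrary. It only illustrates the general idea: a linear state-space recurrence keeps a fixed-size state, while cached attention must revisit a cache that grows by one entry per generated step.

```python
import math


class DiagonalSSM:
    """Toy linear state-space recurrence with a diagonal state matrix.

    h_t = a * h_{t-1} + b * x_t   (elementwise)
    y_t = sum(c * h_t)
    """

    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c
        self.h = [0.0] * len(a)  # fixed-size state: never grows with sequence length

    def step(self, x):
        # Work per step depends only on the state size, not on how many steps came before.
        self.h = [ai * hi + bi * x for ai, hi, bi in zip(self.a, self.h, self.b)]
        return sum(ci * hi for ci, hi in zip(self.c, self.h))


class CachedAttention:
    """Toy single-head attention over scalar inputs, using a key/value cache."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        # The cache gains one entry per step, so step t touches t entries.
        self.keys.append(x)
        self.values.append(x)
        scores = [x * k for k in self.keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        return sum(w * v for w, v in zip(weights, self.values)) / sum(weights)


def memory_after(n_steps):
    """Return (SSM state size, attention cache size) after n_steps inputs."""
    ssm = DiagonalSSM(a=[0.9, 0.5], b=[1.0, 1.0], c=[0.5, 0.5])  # arbitrary toy values
    attn = CachedAttention()
    for t in range(n_steps):
        x = math.sin(t)
        ssm.step(x)
        attn.step(x)
    return len(ssm.h), len(attn.keys)


print(memory_after(10))    # (2, 10)
print(memory_after(1000))  # (2, 1000)
```

In this toy, the attention path touches 1 + 2 + ... + n = n(n+1)/2 cache entries over n steps, which is where the quadratic total comes from, while the recurrence does the same small amount of work at every step.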

voice-ai · state-space-models · cartesia

Original description

Real-Time Voice AI applications demand the lowest possible latencies to enhance user experiences with more advanced reasoning and agentic capabilities. AWS is hosting Arjun Desai, co-founder of Cartesia, in a fireside chat for a technical deep dive into learnings and best practices for building a state-of-the-art inference stack that serves global enterprise customers.

About Arjun Desai
Cofounder @ Cartesia | Prev. Stanford ML PhD

About Rohit Talluri
Amazon Web Services (AWS) Generative AI ML Frameworks, focusing on Foundation Model Training & Inference.

Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter