The End of Awkward AI Transcriptions - Travis Bartley and Myungjong Kim
Takeaway
NVIDIA Riva replaces the one-model-fits-all approach with a Fast Conformer-based toolkit (Parakeet for streaming, Canary for accuracy) plus Sortformer diarization, and its models lead the Hugging Face Open ASR leaderboard.
Summary
- The NVIDIA Riva speech AI stack builds on a Fast Conformer backbone whose extra subsampling compresses audio into 80 ms time steps, enabling fast convergence and small audio inputs.
- Two model families: Riva Parakeet (CTC and TDT models for streaming, ASR, translation, and target-speaker use cases) and Riva Canary (attention encoder-decoder for maximum accuracy and multitask use).
- Sortformer (built on an arrival-time ordering principle) bridges diarization speaker tokens with ASR encoder embeddings to solve who-spoke-what-when in a single fine-tunable architecture.
- Accessory models: VAD (MarbleNet), n-gram LMs, WFST-based text normalization and inverse text normalization (ITN), and BERT-based punctuation and capitalization.
- Riva models dominate the top 5 of the Hugging Face Open ASR leaderboard, which the speakers attribute to per-customer customization rather than a one-model-fits-all approach.
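The 80 ms figure in the first bullet can be illustrated with a little arithmetic. The sketch below is not taken from the talk: it assumes a 10 ms feature hop and an 8x subsampling factor (common values that multiply out to 80 ms), and the function names `step_ms` and `encoder_steps` are hypothetical.

```python
import math


def step_ms(hop_ms: float = 10.0, subsampling: int = 8) -> float:
    """Duration covered by one encoder time step, in milliseconds.

    Assumed defaults: 10 ms feature hop, 8x subsampling (not stated in the summary).
    """
    return hop_ms * subsampling


def encoder_steps(duration_s: float, hop_ms: float = 10.0, subsampling: int = 8) -> int:
    """Approximate number of encoder time steps for a clip of the given length."""
    # Number of acoustic feature frames produced before subsampling.
    feature_frames = math.ceil(duration_s * 1000.0 / hop_ms)
    # Subsampling divides the sequence length the encoder layers must process.
    return math.ceil(feature_frames / subsampling)


if __name__ == "__main__":
    print(step_ms())                            # time step with the assumed defaults
    print(encoder_steps(10.0))                  # 10 s clip at 8x subsampling
    print(encoder_steps(10.0, subsampling=4))   # same clip at 4x, for comparison
```

Under these assumptions, doubling the subsampling factor halves the sequence length the encoder sees, which is one plausible reading of why the extra subsampling helps speed.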
nvidia, speech-recognition, fast-conformer
Original description
NVIDIA is setting the new global standard for speech AI—with 6 top-ten models on the Hugging Face ASR leaderboard and blazing a trail with models like Parakeet2. In this talk, we’ll pull back the curtain on what it takes to build the world’s fastest, most accurate conversational AI, from open-source research to enterprise-ready NIM microservices that scale across any infrastructure. We hear you, developers: Whether you’re building call center agents, video dubbing tools, or digital humans, NVIDIA’s ecosystem is designed for you. With Python-first frameworks, intuitive configurators, and a thriving open-source community, we’re making rapid iteration and seamless integration a reality—so you can launch faster, cut costs, and innovate boldly. Real-world impact is already here. Enterprises are deploying multilingual, noise-robust, and highly customizable voice agents at scale, while our digital human blueprint lets you create interactive avatars. But the real story is the underlying conversational AI stack that’s transforming customer experience, accessibility, and global communication. Join us to see why developers and industry leaders alike are calling NVIDIA’s speech AI “a game-changer”—and how you can be part of the next wave of conversational intelligence.