Serving Voice AI at $1/hr: Open-source, LoRAs, Latency, Load Balancing - Neil Dwyer, Gabber
Takeaway
Self-hosted Orpheus + LoRA voice clones on L40S can deliver real-time consumer-grade voice AI at $1/hr if you fine-tune away head-of-line silence.
Summary
- Neil Dwyer (Gabber CTO, ex-LiveKit agents platform) describes self-hosting Orpheus voice TTS to hit $1/hr — needed for consumer use cases (AI girlfriends, NPCs, kids' toys) where $5/hr platforms don't work.
- Orpheus is a Llama 3B model pre-trained on 100K hours of voice and text data. It outputs SNAC audio codec tokens at 24 kHz and needs ~85-100 tokens/sec to keep up with real-time playback; Gabber hosts it on L40S GPUs.
- One-shot cloning fails on Orpheus (too few pretraining hours), so they use LoRA fine-tunes instead (rank 16, alpha 32, all projections). About 10 min of source audio works and 30 min is better; even overfit clones sound usable and emotive.
- Biggest latency win: removing "head-of-line silence". Orpheus's training data had about 600 ms of silence baked into the start of the voices; fine-tuning that silence away drops the P50 to ~100 ms, which is essentially half a second of latency for free.
- Latency budget combines time-to-first-token, tokens/sec, network latency, and head-of-line silence; matters because of end-of-turn detection snooze periods.
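The latency budget in the last bullet is simple addition, sketched below. This is a minimal illustration, not Gabber's implementation. The 600 ms and ~100 ms head-of-line silence figures and the ~85 tokens/sec real-time floor come from the summary above (reading the ~100 ms as the silence left after fine-tuning). The time-to-first-token, first-chunk size, and network values are placeholder assumptions.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    """Components of the delay before the listener hears the first sound."""
    ttft_ms: float            # time to first token (assumed value in the example)
    first_chunk_tokens: int   # tokens needed before audio can be decoded (assumed)
    tokens_per_sec: float     # sustained generation speed
    network_ms: float         # transport latency to the client (assumed)
    head_silence_ms: float    # silence at the start of the generated audio

    def time_to_first_sound_ms(self) -> float:
        # Time spent generating the first decodable chunk of audio tokens.
        decode_ms = 1000.0 * self.first_chunk_tokens / self.tokens_per_sec
        return self.ttft_ms + decode_ms + self.network_ms + self.head_silence_ms


def keeps_up_with_playback(tokens_per_sec: float,
                           realtime_tokens_per_sec: float = 85.0) -> bool:
    """True if generation is at least as fast as audio playback consumes tokens."""
    return tokens_per_sec >= realtime_tokens_per_sec


# Same hypothetical deployment, before and after fine-tuning the silence away.
before = LatencyBudget(ttft_ms=100, first_chunk_tokens=28, tokens_per_sec=100,
                       network_ms=50, head_silence_ms=600)
after = LatencyBudget(ttft_ms=100, first_chunk_tokens=28, tokens_per_sec=100,
                      network_ms=50, head_silence_ms=100)

saved_ms = before.time_to_first_sound_ms() - after.time_to_first_sound_ms()
print(f"before: {before.time_to_first_sound_ms():.0f} ms, "
      f"after: {after.time_to_first_sound_ms():.0f} ms, saved: {saved_ms:.0f} ms")
```

With these placeholder numbers, head-of-line silence is the largest single term in the budget, which is why trimming it helps more than shaving a few milliseconds off any other component.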
voice, orpheus, lora
Original description
This is a talk that goes over our experience deploying Orpheus (Emotive, Realtime TTS) to production. It will cover topics:
- Latency and optimizations
- High fidelity voice clones w/ examples
- Load balancing w/ multiple GPUs and multiple LoRas
About Neil Dwyer
Spent a lot of my career building real-time applications. First at a company called Bebo circa 2018 where I built a live streaming + computer vision pipeline that watched people play Fortnite. More recently at a company called LiveKit where I worked on the Agents platform along with some amazing people. And now at my own startup, Gabber, where we are making it easier (and cheaper!) to make real-time, multi-modal consumer apps.
Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter
Timestamps
00:00 Introduction to Gabber and Real-Time AI
02:15 Gabber's Mission for Consumer AI
04:17 The Orpheus Voice Model
05:43 Challenges in Voice Cloning
07:44 Latency Management and "Head of Line Silence"
11:07 Infrastructure for Batch Inference
11:36 Leveraging vLLM and Dynamic Quantization
13:21 Load Balancing with a Consistent Hash Ring
14:17 System Architecture Overview
15:07 Conclusion and Open Source Shout-outs
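The description mentions load balancing across multiple GPUs and multiple LoRAs with a consistent hash ring. The sketch below shows the general technique only; it is not Gabber's code. The node names, the voice IDs, the use of MD5, and the 64 virtual nodes per GPU are all illustrative assumptions. The likely motivation (also an assumption) is that requests for one voice keep landing on the same GPU, so its adapter tends to stay loaded there.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit position on the ring (MD5 is used only for spreading keys).
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hash ring mapping keys (e.g. LoRA voice IDs) to nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # virtual nodes per physical node, smooths the load
        self._points = []      # sorted ring positions
        self._owners = {}      # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            if point not in self._owners:
                bisect.insort(self._points, point)
            self._owners[point] = node

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owners[p] != node]
        self._owners = {p: n for p, n in self._owners.items() if n != node}

    def lookup(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring is empty")
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owners[self._points[idx]]


# Hypothetical GPU workers and voice-clone IDs.
ring = HashRing(["gpu-0", "gpu-1", "gpu-2"])
voices = [f"voice-{i}" for i in range(200)]
placement = {v: ring.lookup(v) for v in voices}

# Removing one worker only remaps the voices that lived on it.
ring.remove("gpu-1")
moved = [v for v in voices if ring.lookup(v) != placement[v]]
print(f"{len(moved)} of {len(voices)} voices moved after removing gpu-1")
```

When a worker leaves the ring, only the keys it owned are reassigned; every other voice keeps its GPU. A plain `hash(key) % num_gpus` scheme would reshuffle most keys instead.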