Serving Voice AI at $1/hr: Open-source, LoRAs, Latency, Load Balancing - Neil Dwyer, Gabber

6.8K views · Jul 31, 2025 · 16:09 min
Takeaway

Self-hosted Orpheus + LoRA voice clones on L40S can deliver real-time consumer-grade voice AI at $1/hr if you fine-tune away head-of-line silence.

Summary

  • Neil Dwyer (Gabber CTO, ex-LiveKit agents platform) describes self-hosting Orpheus voice TTS to hit $1/hr — needed for consumer use cases (AI girlfriends, NPCs, kids' toys) where $5/hr platforms don't work.
  • Orpheus is a 3B-parameter Llama model pre-trained on 100K hours of voice + text. It outputs SNAC audio codec tokens at 24kHz and needs ~85-100 tokens/sec to keep up with real-time playback; Gabber hosts it on L40S GPUs.
  • Cloning via one-shot fails on Orpheus (low pretrain hours); they use LoRA fine-tunes (rank 16, alpha 32, all projections) — ~10 min of source audio works, 30 min is better; even overfit clones sound usable and emotive.
  • Biggest latency win: 'head-of-line silence'. Orpheus's training data had 600ms of silence baked into the start of its voices; they fine-tune the silence away and drop P50 to ~100ms (roughly half a second saved for free).
  • Latency budget combines time-to-first-token, tokens/sec, network latency, and head-of-line silence; matters because of end-of-turn detection snooze periods.
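The latency budget in the last two bullets can be written out as simple arithmetic. The sketch below is illustrative only: the field names, the 28-token first chunk, and the TTFT and network figures are assumptions made for the example, not numbers from the talk. Only the 600ms and ~100ms silence values and the ~85 tokens/sec real-time floor come from the summary above.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    ttft_ms: float            # time to first token from the inference server
    tokens_per_sec: float     # sustained decode speed for one stream
    tokens_before_audio: int  # tokens needed before the first audio chunk can be decoded (assumed)
    network_ms: float         # network latency to the listener
    head_silence_ms: float    # silence at the start of the generated audio

    def time_to_first_sound_ms(self) -> float:
        """Sum of the four components: TTFT, decode time, network, leading silence."""
        decode_ms = 1000.0 * self.tokens_before_audio / self.tokens_per_sec
        return self.ttft_ms + decode_ms + self.network_ms + self.head_silence_ms


def keeps_up_with_playback(tokens_per_sec: float, realtime_floor: float = 85.0) -> bool:
    """True if generation is at least as fast as playback consumes tokens."""
    return tokens_per_sec >= realtime_floor


# Hypothetical stream: only head_silence_ms differs between the two cases.
before = LatencyBudget(ttft_ms=150, tokens_per_sec=100, tokens_before_audio=28,
                       network_ms=50, head_silence_ms=600)
after = LatencyBudget(ttft_ms=150, tokens_per_sec=100, tokens_before_audio=28,
                      network_ms=50, head_silence_ms=100)

saved_ms = before.time_to_first_sound_ms() - after.time_to_first_sound_ms()
```

With every other term held fixed, trimming the leading silence from 600ms to 100ms removes 500ms from the time to first audible sound, which is why the summary treats it as the largest single win.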
voice · orpheus · lora
Original description
This is a talk that goes over our experience deploying Orpheus (Emotive, Realtime TTS) to production. It will cover topics:

- Latency and optimizations
- High fidelity voice clones w/ examples
- Load balancing w/ multiple GPUs and multiple LoRAs

About Neil Dwyer
Spent a lot of my career building real-time applications. First at a company called Bebo circa 2018 where I built a live streaming + computer vision pipeline that watched people play Fortnite. More recently at a company called LiveKit where I worked on the Agents platform along with some amazing people. And now at my own startup, Gabber, where we are making it easier (and cheaper!) to make real-time, multi-modal consumer apps.

Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter

Timestamps

00:00 Introduction to Gabber and Real-Time AI
02:15 Gabber's Mission for Consumer AI
04:17 The Orpheus Voice Model
05:43 Challenges in Voice Cloning
07:44 Latency Management and "Head of Line Silence"
11:07 Infrastructure for Batch Inference
11:36 Leveraging vLLM and Dynamic Quantization
13:21 Load Balancing with a Consistent Hash Ring
14:17 System Architecture Overview
15:07 Conclusion and Open Source Shout-outs
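
The 13:21 segment mentions load balancing with a consistent hash ring. The sketch below shows the general technique, not Gabber's implementation: the `HashRing` class, the `gpu-N` worker names, and the idea of keying on a voice (LoRA) ID so that requests for the same clone tend to land on the same GPU are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is randomized per process)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 64):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (point_hash, node_name)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node owns several points on the ring to even out the load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First point at or after the key's hash, wrapping around the ring.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["gpu-0", "gpu-1", "gpu-2"])  # hypothetical worker names
voice_ids = [f"voice-{i}" for i in range(200)]
placement_before = {v: ring.lookup(v) for v in voice_ids}

ring.remove("gpu-1")  # simulate a worker leaving the pool
placement_after = {v: ring.lookup(v) for v in voice_ids}
```

The property that makes this attractive for adapter routing is that removing a worker only remaps the keys that worker owned; every other voice ID keeps its placement, so adapters already resident on the surviving GPUs would not need to be reloaded.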