Text-to-Speech Data Preparation and Fine-tuning Workshop - Ronan McGovern
Takeaway
You can fine-tune a token-based TTS model like Sesame CSM-1B to mimic a target voice using about 50 YouTube clips of ~30 seconds each and a free Colab GPU.
Summary
- Workshop fine-tunes Sesame CSM-1B (token-based TTS) on a single speaker's voice using a free Colab T4 GPU and Unsloth.
- Pipeline: download YouTube audio with yt-dlp, transcribe with Whisper turbo, chunk into ~30-second snippets paired with their transcript text; ~50 segments is enough to affect quality.
- Token-based TTS background: an encoder-decoder with vector quantization (VQ) turns audio into discrete tokens, each indexing a vector in a learned codebook; Sesame uses 32 hierarchical tokens per time step, with a main 1B transformer predicting token 0 and a smaller second transformer decoding tokens 1–31.
- Recommends single-speaker videos to avoid diarization; suggests find-replace fixes on Whisper transcript errors before training.
- Compares with Canopy Labs' Orpheus as another token-based TTS option.
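
The chunking and find-replace steps in the pipeline can be sketched roughly as follows. This is a minimal illustration, not the workshop's actual code: it assumes Whisper-style segments (dicts with `start`, `end`, and `text` keys), and the helper names `apply_fixes` and `chunk_segments`, the sample segments, and the "Sesamy" misspelling are all hypothetical.

```python
from typing import Dict, List, Optional


def apply_fixes(text: str, fixes: Dict[str, str]) -> str:
    """Apply plain find-replace corrections for recurring transcript errors."""
    for wrong, right in fixes.items():
        text = text.replace(wrong, right)
    return text


def chunk_segments(
    segments: List[dict],
    max_seconds: float = 30.0,
    fixes: Optional[Dict[str, str]] = None,
) -> List[dict]:
    """Greedily merge consecutive segments into chunks of at most max_seconds.

    A single segment that is already longer than max_seconds is kept as its
    own (over-long) chunk rather than being split mid-sentence.
    """
    fixes = fixes or {}
    chunks: List[dict] = []
    current: Optional[dict] = None
    for seg in segments:
        text = apply_fixes(seg["text"].strip(), fixes)
        if current is not None and seg["end"] - current["start"] <= max_seconds:
            # Segment still fits in the current chunk: extend it.
            current["end"] = seg["end"]
            current["text"] += " " + text
        else:
            # Close the current chunk (if any) and start a new one.
            if current is not None:
                chunks.append(current)
            current = {"start": seg["start"], "end": seg["end"], "text": text}
    if current is not None:
        chunks.append(current)
    return chunks


# Hypothetical transcript segments, shaped like Whisper's output.
segments = [
    {"start": 0.0, "end": 12.0, "text": " Welcome to the workshop on Sesamy CSM."},
    {"start": 12.0, "end": 26.0, "text": " We start with data preparation."},
    {"start": 26.0, "end": 41.0, "text": " Then we fine-tune."},
    {"start": 41.0, "end": 55.0, "text": " Finally we evaluate."},
]
chunks = chunk_segments(segments, max_seconds=30.0, fixes={"Sesamy": "Sesame"})
for c in chunks:
    print(f'{c["start"]:.1f}-{c["end"]:.1f}s: {c["text"]}')
```

Each resulting chunk carries the start and end times needed to cut the matching audio snippet, plus the corrected text to pair with it for training.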
tts, fine-tuning, sesame
Original description
By the end of this workshop, you'll have trained Sesame's CSM-1B text-to-speech model on a voice from a YouTube video. The workshop will cover data preparation, fine-tuning, and evaluation.