Text-to-Speech Data Preparation and Fine-tuning Workshop - Ronan McGovern

1.5K views · Jun 03, 2025 · 34:00 min

Takeaway

You can fine-tune a token-based TTS model like Sesame CSM-1B to mimic a target voice with about 50 thirty-second clips from YouTube and a free Colab GPU.

Summary

Workshop fine-tunes Sesame CSM-1B (token-based TTS) on a single speaker's voice using a free Colab T4 GPU and Unsloth.
Pipeline: download YouTube audio with yt-dlp, transcribe with Whisper turbo, and chunk into ~30-second snippets paired with their text; about 50 segments are enough to affect quality.
Token-based TTS background: an encoder-decoder vector quantizer (VQ) maps audio to indices into a codebook of vectors; Sesame uses 32 hierarchical tokens per time step, with a main 1B transformer predicting token 0 and a smaller second transformer decoding tokens 1–31.
Recommends single-speaker videos to avoid diarization; suggests find-replace fixes on Whisper transcript errors before training.
Compares with Canopy Labs' Orpheus as another token-based TTS option.
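
The chunking and find-replace steps in the pipeline above can be sketched roughly as follows. This is a minimal illustration, not the workshop's actual code: it assumes the transcriber returns segments with start, end, and text fields (as Whisper-style tools commonly do), and the names Segment, apply_fixes, and chunk_segments are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    text: str


def apply_fixes(text, fixes):
    """Plain find-replace for known transcript errors (hypothetical helper)."""
    for wrong, right in fixes.items():
        text = text.replace(wrong, right)
    return text


def _merge(group, fixes):
    """Join consecutive segments into one snippet and correct its text."""
    text = " ".join(s.text.strip() for s in group)
    return Segment(group[0].start, group[-1].end, apply_fixes(text, fixes))


def chunk_segments(segments, max_seconds=30.0, fixes=None):
    """Greedily group consecutive segments into snippets of at most max_seconds.

    A single segment longer than max_seconds is kept as its own snippet.
    """
    fixes = fixes or {}
    chunks, current = [], []
    for seg in segments:
        # Close the current snippet if adding this segment would exceed the limit.
        if current and seg.end - current[0].start > max_seconds:
            chunks.append(_merge(current, fixes))
            current = []
        current.append(seg)
    if current:
        chunks.append(_merge(current, fixes))
    return chunks


# Seven made-up 10-second segments, one with a transcription error.
demo = [Segment(10.0 * i, 10.0 * (i + 1), f" part {i} ") for i in range(7)]
demo[0].text = " Sesamee model "
snippets = chunk_segments(demo, max_seconds=30.0, fixes={"Sesamee": "Sesame"})
for s in snippets:
    print(f"{s.start:6.1f} - {s.end:6.1f}  {s.text}")
```

Each resulting snippet carries the start and end times needed to cut the audio, plus the corrected text to pair with it during training.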
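
To make the "hierarchical tokens" idea more concrete, here is a toy sketch of residual vector quantization, where each level quantizes whatever the previous levels left unexplained. Treating the hierarchy as residual is an assumption made for illustration, and the two tiny hand-written codebooks stand in for the 32 learned ones mentioned above; none of this is Sesame's actual code.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(codebooks, vec):
    """Return one token per level; each level quantizes the remaining residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(codebooks, tokens):
    """Reconstruct a vector by summing the chosen entry from every level."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks (assumed values): level 0 is coarse, level 1 refines it.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
tokens = rvq_encode(codebooks, [1.2, 0.8])
approx = rvq_decode(codebooks, tokens)
print(tokens, approx)
```

In this picture, token 0 carries the coarse content and later tokens add detail, which is one way to read why a large model predicts token 0 while a smaller one fills in tokens 1–31.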

tts · fine-tuning · sesame

Original description

By the end of this workshop, you'll have trained Sesame's CSM-1B text-to-speech model on a voice from a YouTube video. The workshop will cover data preparation, fine-tuning, and evaluation.