Training an LLM from Scratch, Locally — Angelos Perivolaropoulos, ElevenLabs

18.6K views · May 04, 2026 · 81:26 min

Takeaway

A small GPT-2-style model can be trained from scratch on a laptop, and post-training — not architecture — is where modern frontier gains come from.

Summary

ElevenLabs' Scribe V2 lead runs a hands-on workshop on training a GPT-2-style decoder-only causal LM from scratch in PyTorch, inspired by Karpathy's nanoGPT.
Trains on the Shakespeare dataset with a character-level tokenizer (65 tokens); runs on 16 GB laptops, Apple Silicon MPS, CUDA, or Google Colab free GPUs.
Four building blocks: tokenizer, model architecture (causal self-attention + MLP + layer norm), training loop, and inference.
Emphasizes that tokenizer choice is the most consequential decision: voice teams spend 6+ months on it before touching architecture.
Argues pretraining recipes haven't changed much; the real differentiator between GPT-4, GPT-4o, GPT-5, Gemini 3, and Gemini 3.1 is post-training/fine-tuning techniques.
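
To make the character-level tokenizer and the next-token objective from the summary concrete, here is a minimal sketch in plain Python. It is not the workshop's code: the sample string and the names `stoi`, `itos`, `encode`, and `decode` are illustrative assumptions. On the full Shakespeare dataset the same construction would give the 65-token vocabulary mentioned above; on this short sample the vocabulary is much smaller.

```python
# Minimal character-level tokenizer sketch (illustrative, not the workshop's code).
# Assumption: a short sample string stands in for the Shakespeare dataset.
text = "To be, or not to be: that is the question."

# The vocabulary is every distinct character, in a stable sorted order.
chars = sorted(set(text))
vocab_size = len(chars)

# Lookup tables: character -> integer id, and integer id -> character.
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def encode(s: str) -> list[int]:
    """Map each character to its integer id."""
    return [stoi[ch] for ch in s]


def decode(ids: list[int]) -> str:
    """Map integer ids back to the original string."""
    return "".join(itos[i] for i in ids)


ids = encode(text)

# Next-token prediction: the target at each position is the following token,
# so inputs and targets are the same sequence shifted by one.
x = ids[:-1]  # model inputs
y = ids[1:]   # targets the model is trained to predict
```

A character-level vocabulary keeps the embedding table tiny and never hits an unknown token, at the cost of longer sequences than a subword tokenizer would produce.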

llm · training · fine-tuning

Original description

Training an LLM from scratch on a local machine sounds unreasonable, until it isn't. In this workshop, Angelos Perivolaropoulos from ElevenLabs walks through what it actually takes to train a language model locally, with a practical focus on the tooling, constraints, and engineering tradeoffs involved.

If you want a hands-on look at small-scale LLM training beyond the cloud-heavy default, this is a useful deep dive.

Speaker info:
- https://www.linkedin.com/in/angelos-perivolaropoulos/
- https://github.com/angelos-p

Timestamp:
0:00 Introduction and background of the speaker
1:21 Overview of the workshop objectives
3:12 Inspiration from Andrej Karpathy's nanoGPT
4:37 The four fundamental building blocks of an LLM
7:08 Prerequisites and setup tools (UV, Python, hardware requirements)
9:06 Part 1: The Tokenizer (character-level tokenization explained)
24:29 Model architecture and parameters (vocab size, layers, embeddings)
30:13 The GPT class structure and transformer blocks
37:52 Parameter count and model sizing
40:54 The training loop: objectives and next-token prediction
44:44 Optimization and learning rate strategies (warm-up and cosine decay)
47:56 Validation and monitoring loss
53:07 Part 3: Text generation and inference strategies
56:30 Putting it all together (project file structure)
58:46 Monitoring training and debugging common issues
1:00:27 Workshop challenge and competition details
1:05:24 Q&A: Differences between base models and reasoning models
1:11:31 Q&A: Applying these concepts to audio and multimodal models
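
The learning rate strategy named at 44:44 (warm-up followed by cosine decay) can be sketched as a pure function of the training step. This is a common formulation, not necessarily the exact one used in the workshop; the function name `lr_at` and every default value below are assumptions chosen for illustration.

```python
import math


def lr_at(step: int,
          max_lr: float = 3e-4,      # assumed peak learning rate
          min_lr: float = 3e-5,      # assumed floor after decay
          warmup_steps: int = 100,   # assumed warm-up length
          max_steps: int = 1000) -> float:  # assumed total training steps
    """Linear warm-up to max_lr, then cosine decay down to min_lr."""
    # Warm-up: ramp linearly so early, noisy gradients take small steps.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # After the schedule ends, hold the learning rate at the floor.
    if step >= max_steps:
        return min_lr
    # Cosine decay: progress runs from 0 to 1 over the remaining steps.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)
```

In a PyTorch training loop, one way to apply this is to write `lr_at(step)` into each entry of `optimizer.param_groups` under the `"lr"` key before calling `optimizer.step()`.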