Training an LLM from Scratch, Locally — Angelos Perivolaropoulos, ElevenLabs
Takeaway
A small GPT-2-style model can be trained from scratch on a laptop, and the speaker argues that post-training, not architecture, is where modern frontier gains come from.
Summary
- The lead of ElevenLabs' Scribe V2 runs a hands-on workshop on training a GPT-2-style decoder-only causal LM from scratch in PyTorch, inspired by Karpathy's nanoGPT.
- The model trains on the Shakespeare dataset with a character-level tokenizer (65 tokens); it runs on 16 GB laptops, Apple Silicon (MPS), CUDA, or free Google Colab GPUs.
- Four building blocks: tokenizer, model architecture (causal self-attention + MLP + layer norm), training loop, and inference.
- Emphasizes that tokenizer choice is the most consequential decision: voice teams spend 6+ months on it before turning to architecture.
- Argues that pretraining recipes haven't changed much; the real differentiator between models such as GPT-4, GPT-4o, GPT-5, Gemini 3, and Gemini 3.1 is post-training and fine-tuning technique.
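The character-level tokenizer mentioned in the summary can be sketched in a few lines of plain Python. This is a minimal illustration in the spirit of nanoGPT, not the workshop's actual code: the sample string and the names `stoi`, `itos`, `encode`, and `decode` are assumptions made for the example. On the full Shakespeare corpus the same construction would yield the 65-token vocabulary quoted above.

```python
# Minimal character-level tokenizer sketch (names and sample text are hypothetical).
# The workshop uses the Shakespeare dataset; a short string stands in for it here.
text = "To be, or not to be: that is the question."

# The vocabulary is simply the sorted set of distinct characters in the corpus.
chars = sorted(set(text))
vocab_size = len(chars)

# Lookup tables: character -> integer id, and integer id -> character.
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def encode(s):
    """Map a string to a list of token ids, one id per character."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of token ids back to the original string."""
    return "".join(itos[i] for i in ids)


ids = encode("to be")
roundtrip = decode(ids)
```

Because every character is its own token, encoding is lossless for any text drawn from the training alphabet, at the cost of much longer sequences than a subword tokenizer would produce.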
Tags: llm, training, fine-tuning
Original description
Training an LLM from scratch on a local machine sounds unreasonable, until it isn't. In this workshop, Angelos Perivolaropoulos from ElevenLabs walks through what it actually takes to train a language model locally, with a practical focus on the tooling, constraints, and engineering tradeoffs involved. If you want a hands-on look at small-scale LLM training beyond the cloud-heavy default, this is a useful deep dive.

Speaker info:
- https://www.linkedin.com/in/angelos-perivolaropoulos/
- https://github.com/angelos-p

Timestamps:
0:00 Introduction and background of the speaker
1:21 Overview of the workshop objectives
3:12 Inspiration from Andrej Karpathy's nanoGPT
4:37 The four fundamental building blocks of an LLM
7:08 Prerequisites and setup tools (uv, Python, hardware requirements)
9:06 Part 1: The Tokenizer (character-level tokenization explained)
24:29 Model architecture and parameters (vocab size, layers, embeddings)
30:13 The GPT class structure and transformer blocks
37:52 Parameter count and model sizing
40:54 The training loop: objectives and next-token prediction
44:44 Optimization and learning rate strategies (warm-up and cosine decay)
47:56 Validation and monitoring loss
53:07 Part 3: Text generation and inference strategies
56:30 Putting it all together (project file structure)
58:46 Monitoring training and debugging common issues
1:00:27 Workshop challenge and competition details
1:05:24 Q&A: Differences between base models and reasoning models
1:11:31 Q&A: Applying these concepts to audio and multimodal models
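The learning rate strategy listed at 44:44 (warm-up followed by cosine decay) can be sketched as a pure function of the step number. This is an illustrative version under assumed hyperparameters: the function name `lr_at` and the values of `max_lr`, `min_lr`, `warmup_steps`, and `max_steps` are hypothetical and are not taken from the workshop.

```python
import math


def lr_at(step, max_lr=1e-3, min_lr=1e-4, warmup_steps=100, max_steps=1000):
    """Learning rate at a given step: linear warm-up, then cosine decay.

    All default values are placeholders chosen for illustration only.
    """
    # Phase 1: ramp linearly from max_lr / warmup_steps up to max_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Phase 3: after the schedule ends, hold at the floor value.
    if step >= max_steps:
        return min_lr
    # Phase 2: cosine decay from max_lr down to min_lr.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)


# Sample the schedule at a few points to see its shape.
schedule = [lr_at(s) for s in (0, 99, 100, 550, 1000)]
```

In a PyTorch training loop, a function like this would typically be called once per step and its result written into each of the optimizer's parameter groups before the optimizer step. The warm-up avoids large, noisy updates while the weights are still near their random initialization, and the decay lets the model settle as training ends.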