[Full Workshop] Reinforcement Learning, Kernels, Reasoning, Quantization & Agents — Daniel Han
Takeaway
Unsloth's open-source stack lets you fine-tune, quantize and RL-train modern LLMs on free or consumer GPUs, using open models whose bugs the team has helped fix.
Summary
- Daniel Han (Unsloth) walks through fine-tuning, RL, GPU kernels, quantization and agents, drawing on his team's bug fixes for Gemma, Llama, Mistral, Phi and Qwen; Unsloth reports 10M+ monthly downloads on Hugging Face.
- Demonstrates how 1.58-bit dynamic quants (e.g. DeepSeek R1 0528) preserve accuracy while running on consumer-grade VRAM.
- Covers async offloaded gradient checkpointing and gradient accumulation bug fixes that materially improved open-source training accuracy.
- Walks through using free Colab/Kaggle GPUs (30 hrs/week on Kaggle) for SFT, reasoning fine-tunes and continued pre-training via Unsloth notebooks.
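The summary does not spell out the gradient accumulation bug, so the following is only a minimal sketch of the issue as it is commonly described: averaging per-micro-batch mean losses weights every micro-batch equally, which differs from the full-batch loss when micro-batches contain different numbers of tokens. The function names and the toy per-token losses are hypothetical and are not Unsloth's API.

```python
def mean_loss(token_losses):
    """Mean per-token loss over one batch (list of per-token losses)."""
    return sum(token_losses) / len(token_losses)


def naive_accumulated_loss(micro_batches):
    """Average the per-micro-batch means.

    Every micro-batch gets equal weight, regardless of how many
    tokens it contains.
    """
    return sum(mean_loss(mb) for mb in micro_batches) / len(micro_batches)


def token_normalized_loss(micro_batches):
    """Sum all token losses, then divide by the total token count.

    This matches the loss of one large batch holding the same tokens.
    """
    total = sum(sum(mb) for mb in micro_batches)
    count = sum(len(mb) for mb in micro_batches)
    return total / count


# Toy data (assumed): a short sequence with high loss and a longer one with low loss.
micro_batches = [[2.0, 2.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
full_batch = [loss for mb in micro_batches for loss in mb]

print(naive_accumulated_loss(micro_batches))  # 1.5
print(token_normalized_loss(micro_batches))   # 1.25
print(mean_loss(full_batch))                  # 1.25
```

In this toy case the naive average overweights the short sequence, while normalizing by the total token count reproduces the full-batch value.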
fine-tuning, quantization, unsloth
Original description
Why is Reinforcement Learning (RL) suddenly everywhere, and is it truly effective? Have LLMs hit a plateau in terms of intelligence and capabilities, or is RL the breakthrough they need? In this workshop, we'll dive into the fundamentals of RL, what makes a good reward function, and how RL can help create agents. We'll also talk about kernels, are they still worth your time and what you should focus on. And finally, we’ll explore how LLMs like DeepSeek-R1 can be quantized down to 1.58-bits and still perform well, along with techniques to maintain accuracy.
About Daniel Han
I'm building Unsloth and we're an open-source startup trying to make AI more accessible and accurate for everyone! We have 40K GitHub stars, 10M monthly downloads on Hugging Face and worked with Google, Meta, Hugging Face teams to fix bugs in open-source models like Llama, Phi & Gemma models. I was previously working at NVIDIA making TSNE 2000x faster.
Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter
Timestamps
00:00 Introduction and Unsloth's Contributions
03:25 The Evolution of Large Language Models (LLMs)
09:47 LLM Training Stages and Yann LeCun's Cake Analogy
16:56 Agents and Reinforcement Learning Principles
23:17 PPO and the Introduction of GRPO
48:12 Reward Model vs. Reward Function
51:22 The Math Behind the Reinforce Algorithm
01:08:50 PPO Formula Breakdown
01:16:29 GRPO Deep Dive
02:00:20 Practical Implementation and Demo with Unsloth
02:33:07 Quantization and the Future of GPUs
02:41:59 Conclusion and Call to Action