Let LLMs Wander: Engineering RL Environments — Stefano Fiorucci

5.8K views · Apr 08, 2026 · 40:34 min

Takeaway

RL environments built with libraries like Verifiers let small models discover strategies beyond what SFT examples can teach.

Summary

A deepset/Haystack engineer surveys RL environments for LLMs, citing DeepSeek and MiniMax reports in which thousands of environments scale intelligence beyond SFT.
Maps classic RL concepts (agent, environment, state, action, reward, trajectory) to LLM training; RLVR (RL with verifiable rewards) replaces curated SFT data with outcomes that can be checked automatically.
DeepSeek R1 used GRPO, a lighter alternative to PPO that drops the separate value model, to teach reasoning purely from verifiable rewards.
Introduces the Verifiers library by Prime Intellect: modular Python packages for single-turn, multi-turn, and tool environments with OpenAI-compatible endpoints.
Walks through a single-turn reverse-text environment and ends by using RL to take a small model from a random baseline to mastering tic-tac-toe.
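
To make the RLVR and GRPO points concrete, here is a minimal sketch in plain Python. It does not use the Verifiers API and is not the code from the talk. The function names, the similarity-based reward, and the sample completions are assumptions chosen for illustration. It scores completions for a reverse-text task with an automatically checkable reward, then turns a group of rewards into GRPO-style advantages by standardizing each reward against its own group, which is what lets GRPO skip a learned value model.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def reverse_text_reward(prompt_text: str, completion: str) -> float:
    """Verifiable reward (hypothetical): similarity to the true reversal, in [0, 1]."""
    target = prompt_text[::-1]
    return SequenceMatcher(None, completion, target).ratio()


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: (reward - group mean) / (group std + eps).

    Assumption: population standard deviation; real trainers may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, a group of sampled completions (made-up examples).
prompt_text = "hello world"
completions = ["dlrow olleh", "dlrow", "hello world", "olleh dlrow"]

rewards = [reverse_text_reward(prompt_text, c) for c in completions]
advantages = group_relative_advantages(rewards)

for completion, reward, advantage in zip(completions, rewards, advantages):
    print(f"{completion!r}: reward={reward:.3f} advantage={advantage:+.3f}")
```

Completions that beat the group average get a positive advantage and are reinforced; those below it are pushed down. No hand-written target answer is needed beyond the checkable rule itself.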

rl · training · verifiers

Original description

Reasoning models like DeepSeek R1 have demonstrated that learning from interaction is just as critical as learning from examples. To build these capabilities ourselves, we need to move beyond static datasets and start building Reinforcement Learning Environments: little worlds where models can act, get rewards, and learn.

In this talk, I will walk you through my journey exploring this space from a practical software engineering perspective.

We will cover:
- How classic Reinforcement Learning concepts translate to Language Models
- Verifiers, an open-source library to build Environments as software artifacts
- Concrete examples of environments, from single-turn tasks to multi-turn games and tool-using agents
- How to use these environments for both evaluating and training Small Language Models.
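
As a rough illustration of how the classic RL vocabulary maps onto a multi-turn game, the sketch below implements a tiny tic-tac-toe environment in plain Python. It is not taken from the talk or from Verifiers; every name, the reward values, and the random opponent are assumptions. The board is the state, the chosen cell is the action, the game outcome is the reward, and the recorded (state, action) pairs form the trajectory. In an LLM setting the policy would be a model whose text output is parsed into a move; here a random policy stands in as the baseline.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell == " "]


def play_episode(policy, rng):
    """Roll out one game: the policy plays X, a random opponent plays O."""
    board = [" "] * 9                      # state
    trajectory = []                        # (state, action) pairs
    while True:
        action = policy(board, rng)        # agent acts
        trajectory.append(("".join(board), action))
        if action not in legal_moves(board):
            return trajectory, -1.0        # illegal move: penalize, end episode
        board[action] = "X"
        if winner(board) == "X":
            return trajectory, 1.0         # win
        if not legal_moves(board):
            return trajectory, 0.0         # draw
        board[rng.choice(legal_moves(board))] = "O"   # environment responds
        if winner(board) == "O":
            return trajectory, -1.0        # loss


def random_policy(board, rng):
    """Baseline agent: picks any empty cell uniformly at random."""
    return rng.choice(legal_moves(board))


rng = random.Random(0)
results = [play_episode(random_policy, rng) for _ in range(200)]
mean_reward = sum(reward for _, reward in results) / len(results)
print(f"random baseline mean reward: {mean_reward:+.2f}")
```

Running the same loop with a trained policy in place of `random_policy` gives an evaluation; feeding the episode rewards to an RL trainer turns it into training.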

Join me to learn how to move from prompting models to building the gyms where they learn.

Stefano Fiorucci - AI/SW Engineer/Explorer, deepset

Stefano is an AI/Software Engineer and explorer.

He currently works on AI Orchestration at deepset, where he contributes to and maintains Haystack, a widely used open-source framework for building LLM applications.

He loves experimenting with Small Language Models, Post-Training and Reinforcement Learning, and shares his learning through code, writing, and talks.

LLM RL Environments Lil Course: https://github.com/anakin87/llm-rl-environments-lil-course

Socials:
https://twitter.com/theanakin87
https://www.linkedin.com/in/stefano-fiorucci/
https://github.com/anakin87
https://huggingface.co/anakin87

Slides:
https://drive.google.com/file/d/116PKThwtyTxeH1GmZQ7bL3HPYM6KCgHa/view?usp=drive_link