Training Agentic Reasoners — Will Brown, Prime Intellect

21.8K views · Jul 07, 2025 · 19:16 min

Takeaway

Agentic reasoning isn't a separate research thread — it's the same RL-on-tool-use scaling recipe that powers o3 and DeepSeek, and it's becoming accessible outside frontier labs.

Summary

Will Brown argues reasoning and agents are the same problem: RL is the trick that makes brittle agent loops reliable at scale, and DeepSeek showed it works with surprisingly few tweaks.
Frames OpenAI's o3 as the prototypical agentic reasoner — its selling point is tool-use across long horizons, not raw IQ — and notes OpenAI stopped serving GPT-4.5 to double down on RL compute.
Critiques Verl/GRPO complexity as a barrier; Prime Intellect aims to make RL on top of agent setups accessible to non-lab teams.
Long-horizon agentic tasks break vanilla LLM APIs after N steps; RL fine-tuning is the practical way to harden them.
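
For readers unfamiliar with the GRPO mentioned above, here is a minimal sketch of its central step: scoring each sampled rollout relative to the other rollouts drawn for the same prompt, rather than against a learned value function as PPO does. This is not code from the talk. The function name, the `eps` constant, and the example rewards are illustrative assumptions, and real implementations differ in details such as which standard deviation estimator they use.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    `rewards` holds one scalar reward per rollout, all sampled for the
    same prompt. `eps` (an assumed constant) avoids division by zero
    when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some libraries use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts: two solved the task, two did not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group's average get a positive advantage and are reinforced; those below it are pushed down. When every rollout in a group scores the same, all advantages are zero and that prompt contributes no learning signal.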

rl · agents · reasoning

Original description

This talk will be a technical deep dive into RL for agentic reasoning via multi-turn tool calling, similar to OpenAI's o3 and Deep Research. In particular, we'll cover:

- When, why, and how
- GRPO vs PPO vs etc
- Designing environments and rewards
- Survey of recent research highlights
- Results on example tasks
- Overview of open-source ecosystem (libraries, compute requirements, tradeoffs, etc.)

About Will Brown
Will Brown is a Research Engineering Lead at Prime Intellect, focusing on RL for reasoning and agents. He previously held research roles at Morgan Stanley and AWS, and completed his PhD in Computer Science at Columbia University.

Recorded at the AI Engineer World's Fair in San Francisco.

Timestamps
[00:00] Introduction to the idea that reasoning and agents are similar.
[01:05] The growing effectiveness of Reinforcement Learning (RL) in AI.
[03:04] The complexities and challenges of implementing RL.
[04:41] The connection between popular AI products (agents) and RL fine-tuning.
[07:18] The core process of Reinforcement Learning.
[10:21] The importance of tools and real-world tasks for agents.
[12:13] The problem of "reward hacking" and how to design better evaluations.
[14:51] Future directions for agentic systems and a practical toolkit for implementation.