Reinforcement Learning for Agents - Will Brown, ML Researcher at Morgan Stanley

113.6K views · Mar 07, 2025 · 18:17 min

Takeaway

Reinforcement learning, not just bigger models, is the missing piece for taking agents from pipelines that work roughly 70% of the time to reliable, long-horizon autonomous systems.

Summary

Will Brown argues current agents are pipelines of chained chatbot calls and the path to true multi-step autonomy runs through RL, not just better base models.
DeepSeek R1 showed that long chain-of-thought emerges as a byproduct of pure RL with verifiable rewards using GRPO: sample N completions per prompt, score each one, and reinforce those that score above the group average.
OpenAI's Deep Research is the existence proof that end-to-end RL works for ~100-call tool-using agents; the open question is scaling this beyond research tasks.
Pre-training and RLHF show diminishing returns; RL with verifiers and rejection sampling is what unlocks test-time scaling and further gains in agent capability.
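
To make the "sample N completions, reinforce the high-scoring ones" idea concrete, here is a minimal Python sketch of the scoring step only. It is not code from the talk or from DeepSeek: the function names, the exact-match verifier, and the example completions are all hypothetical, and the policy-gradient update that would consume these advantages is omitted. It assumes the common GRPO formulation in which each reward is normalized against the mean and standard deviation of its own group.

```python
from statistics import mean, pstdev
from typing import List


def exact_match_verifier(completion: str, expected: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the last line equals the expected answer."""
    lines = completion.strip().splitlines()
    return 1.0 if lines and lines[-1].strip() == expected else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Completions above the group mean get a positive advantage (reinforced),
    those below get a negative one. eps avoids division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rejection_sample(completions: List[str], rewards: List[float],
                     threshold: float = 1.0) -> List[str]:
    """Keep only completions whose reward meets the threshold (simple filtering)."""
    return [c for c, r in zip(completions, rewards) if r >= threshold]


if __name__ == "__main__":
    # Four made-up completions for the prompt "What is 6 * 7?"
    completions = [
        "6 * 7 is six sevens.\n42",
        "I think it is\n36",
        "Adding 7 six times gives\n42",
        "48",
    ]
    rewards = [exact_match_verifier(c, "42") for c in completions]
    advantages = group_relative_advantages(rewards)
    for c, r, a in zip(completions, rewards, advantages):
        print(f"reward={r:.1f} advantage={a:+.3f} completion={c!r}")
    print("kept by rejection sampling:", rejection_sample(completions, rewards))
```

Because advantages are computed relative to the group, a prompt on which every sample succeeds (or every sample fails) yields advantages of zero and contributes no learning signal; only prompts with mixed outcomes move the policy.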

agents · reinforcement-learning · grpo

Original description

Recorded live at the Agent Engineering Session Day from the AI Engineer Summit 2025 in New York. Learn more at https://ai.engineer and purchase tickets to our next event, the AI Engineer World's Fair, in SF June 3 - 5 here: https://ti.to/software-3/ai-engineer-worlds-fair-2025

About Will
Hi! I’m a machine learning researcher based in New York City.

I am a member of Morgan Stanley’s Machine Learning Research group, where I have been primarily working on projects related to language models and sequential prediction. I completed my PhD (CS) at Columbia, where I was fortunate to be co-advised by Christos Papadimitriou and Tim Roughgarden.

Before that, I was an undergrad (CS + philosophy) and master’s (DS) student at Penn, and I’ve spent time in research and engineering roles at AWS, Two Sigma, MongoDB, and AmFam.