Reinforcement Learning for Agents - Will Brown, ML Researcher at Morgan Stanley
Takeaway
Reinforcement learning, not just bigger models, is the missing piece for taking agents from pipelines that work roughly 70% of the time to reliable, long-horizon autonomous systems.
Summary
- Will Brown argues current agents are pipelines of chained chatbot calls and the path to true multi-step autonomy runs through RL, not just better base models.
- DeepSeek R1 showed that long chain-of-thought emerges as a byproduct of pure RL with verifiable rewards using GRPO: sample N completions per prompt, score them, and reinforce the high-scoring ones.
- OpenAI's Deep Research is the existence proof that end-to-end RL works for ~100-call tool-using agents; the open question is scaling this beyond research tasks.
- Pre-training and RLHF show diminishing returns; RL with verifiers and rejection sampling is the unlock for test-time scaling and agent capability gains.
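
The GRPO step mentioned above (sample N completions, reinforce the high-scoring ones) can be sketched roughly as follows. This is a minimal illustration, not the talk's or DeepSeek's implementation: it only shows the group-relative scoring, where each completion's reward is normalized against the mean and standard deviation of its own sampling group. The `verifier` function, the sample completions, and the `eps` constant are hypothetical stand-ins, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def verifier(completion: str, expected: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the completion ends with the expected answer."""
    return 1.0 if completion.strip().endswith(expected) else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group of N samples.

    Completions scoring above the group mean get a positive advantage
    (reinforced); those below get a negative one. eps avoids division
    by zero when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Stand-in for N = 4 completions sampled from a policy for one prompt.
completions = [
    "3 * 4 = 12, so the answer is 12",
    "3 + 4 = 7, so the answer is 7",
    "I think it is 12",
    "no idea",
]
rewards = [verifier(c, "12") for c in completions]
advantages = group_relative_advantages(rewards)

for completion, adv in zip(completions, advantages):
    print(f"{adv:+.2f}  {completion}")
```

In a full training loop these advantages would weight the policy-gradient update on each completion's tokens; that part is omitted here. Note that a group where every completion scores the same yields all-zero advantages, so such a prompt contributes no learning signal in this sketch.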
agents, reinforcement-learning, grpo
Original description
Recorded live at the Agent Engineering Session Day from the AI Engineer Summit 2025 in New York. Learn more at https://ai.engineer and purchase tickets to our next event, the AI Engineer World's Fair, in SF June 3 - 5 here: https://ti.to/software-3/ai-engineer-worlds-fair-2025

About Will

Hi! I’m a machine learning researcher based in New York City. I am a member of Morgan Stanley’s Machine Learning Research group, where I have been primarily working on projects related to language models and sequential prediction. I completed my PhD (CS) at Columbia, where I was fortunate to be co-advised by Christos Papadimitriou and Tim Roughgarden. Before that, I was an undergrad (CS + philosophy) and masters (DS) student at Penn, and I’ve spent time in research and engineering roles at AWS, Two Sigma, MongoDB, and AmFam