Lessons from Trillion Token Deployments at Fortune 500s — Alessandro Cappelli, Adaptive ML
Takeaway
Treat reinforcement learning as the missing 'last mile' for enterprise agents: it gives you data ownership, small models that fit tight latency budgets, and a synthetic-data flywheel that SFT and prompting can't provide.
Summary
- Adaptive ML (Falcon co-founders) runs an RLOps platform for customers including AT&T, Manulife, and CVS; Cappelli argues that 95% of GenAI pilots fail because teams stop at MVPs built on proprietary models or instruction fine-tuning
- RL is far more sample-efficient than SFT or prompting at steering model behavior, so much smaller models can reach the same quality
- Smaller RL-trained models unlock production tokenomics (AT&T summarizes every customer-agent transcript, which currently costs millions), latency (under 0.33 s for voice agents, described as impossible with frontier LLMs), and data ownership
- For agents, you can either plug models into the existing agent workflow (Manulife) or train against mocked tools plus a mock LLM-driven user; the reward is a business KPI or an LLM-as-judge score
- RL environment doubles as a synthetic data pipeline via rejection sampling, solving the cold-start data problem for agent training
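The last two bullets can be sketched together: roll out a policy in a mocked environment, score each episode with a reward, and keep only the high-reward episodes as synthetic training data (rejection sampling). This is a minimal illustration, not Adaptive ML's pipeline. `mock_policy`, `mock_reward`, the refund task, and the 0.9 threshold are all hypothetical stand-ins; in practice the policy would be an LLM and the reward a business KPI or an LLM judge.

```python
import random
from dataclasses import dataclass


@dataclass
class Episode:
    prompt: str
    response: str
    reward: float


# Hypothetical canned replies standing in for sampled model outputs.
CANNED_REPLIES = [
    "I have issued the refund and emailed a confirmation.",
    "Please call back later.",
    "Refund issued.",
]


def mock_policy(prompt: str, rng: random.Random) -> str:
    """Stand-in for a model call: picks one canned reply at random."""
    return rng.choice(CANNED_REPLIES)


def mock_reward(prompt: str, response: str) -> float:
    """Toy reward in [0, 1]; stands in for a business KPI or an LLM judge."""
    text = response.lower()
    score = 0.0
    if "refund" in text:
        score += 0.6
    if "confirmation" in text:
        score += 0.4
    return score


def rejection_sample(prompts, n_samples=8, threshold=0.9, seed=0):
    """Sample n_samples responses per prompt; keep those scoring >= threshold."""
    rng = random.Random(seed)
    kept = []
    for prompt in prompts:
        for _ in range(n_samples):
            response = mock_policy(prompt, rng)
            reward = mock_reward(prompt, response)
            if reward >= threshold:
                kept.append(Episode(prompt, response, reward))
    return kept


if __name__ == "__main__":
    dataset = rejection_sample(["Customer asks for a refund on order 123."])
    print(f"kept {len(dataset)} of 8 sampled episodes")
```

The kept episodes can then serve as supervised examples or as a warm start for RL, which is how an environment can address the cold-start problem without a pre-existing labeled dataset.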
reinforcement-learning, fine-tuning, enterprise
Original description
95% of GenAI pilots fail to reach production. Alessandro Cappelli's argument is that this isn't a deployment problem or a prompt engineering problem; it's a feedback integration problem. Instruction fine-tuning and proprietary models give you a demo. Only reinforcement learning gives you a systematic way to incorporate defects, business metrics, and production signals and keep improving. This talk covers what a production-grade RL pipeline looks like at Fortune 500 scale: synthetic data as a byproduct of environment training rather than a prerequisite, mock environments where agents can fail safely before touching real systems, and LLM judges that replace expensive annotation campaigns with a rubric-definition exercise that takes hours rather than weeks. The throughline is that agents raise the stakes on all of this (more tokens, less tolerance for errors, direct access to live databases), and RL was designed for exactly that problem.
Speaker info: https://www.linkedin.com/in/alessandro-cappelli-aa8060172
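The "rubric-definition exercise" mentioned in the description can be pictured as a list of weighted criteria that a judge checks one at a time, with the reward being the weighted fraction satisfied. The sketch below is an assumption about how such a rubric might be structured, not the talk's implementation. `RUBRIC`, `Criterion`, and `stub_judge` are hypothetical, and `stub_judge` uses keyword checks only so the example runs without a model; a real judge would send each question and the response to an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    name: str
    question: str  # what the judge is asked about the response
    weight: float


# Hypothetical rubric for a transcript-summarization task.
RUBRIC = [
    Criterion("covers_issue", "Does the summary state the customer's issue?", 2.0),
    Criterion("states_outcome", "Does the summary state the resolution?", 2.0),
    Criterion("concise", "Is the summary under 40 words?", 1.0),
]

Judge = Callable[[Criterion, str], bool]


def stub_judge(criterion: Criterion, response: str) -> bool:
    """Keyword-based stand-in for an LLM judge call (illustrative only)."""
    text = response.lower()
    if criterion.name == "covers_issue":
        return "issue" in text or "problem" in text
    if criterion.name == "states_outcome":
        return "resolved" in text or "refund" in text
    if criterion.name == "concise":
        return len(response.split()) < 40
    return False


def rubric_reward(response: str, rubric: List[Criterion], judge: Judge) -> float:
    """Weighted fraction of rubric criteria the judge marks as satisfied."""
    total = sum(c.weight for c in rubric)
    passed = sum(c.weight for c in rubric if judge(c, response))
    return passed / total


if __name__ == "__main__":
    summary = "Customer reported a billing problem; agent resolved it with a refund."
    print(rubric_reward(summary, RUBRIC, stub_judge))
```

Keeping the rubric as data rather than code means domain experts can revise criteria and weights without touching the training loop.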