
Lessons from Trillion Token Deployments at Fortune 500s — Alessandro Cappelli, Adaptive ML

4.3K views · May 12, 2026 · 18:34 min
Takeaway

Treat reinforcement learning as the missing 'last mile' for enterprise agents — it gives you ownership, latency-friendly small models, and a synthetic-data flywheel that SFT and prompting can't.

Summary

  • Adaptive ML (Falcon co-founders) runs an RLOps platform for AT&T, Manulife, and CVS; argues that 95% of GenAI pilots fail because teams stop at MVPs built on proprietary models or instruction fine-tuning
  • RL is far more sample-efficient than SFT or prompting at steering model behavior, so much smaller models can reach the same quality
  • Smaller RL-trained models improve production token economics (AT&T summarizes every customer-agent transcript, which currently costs millions), latency (<0.33s for voice agents, not achievable with frontier LLMs), and data ownership
  • For agents, either plug the model into the existing agent workflow (as at Manulife) or train against mocked tools and a mock LLM-driven user, with the reward set to a business KPI or an LLM-as-judge score
  • The RL environment doubles as a synthetic data pipeline via rejection sampling, which solves the cold-start data problem for agent training
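
The rejection-sampling idea in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the pipeline described in the talk: `mock_policy`, `mock_reward`, the canned replies, and the 0.9 threshold are all hypothetical stand-ins for a real model, a real KPI or LLM judge, and a tuned cutoff. For each prompt, several rollouts are sampled in the mock environment, each is scored, and only the best rollout is kept if it clears the threshold.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    response: str
    reward: float


def mock_policy(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a model: picks one of a few canned replies."""
    candidates = [
        "I have issued the refund and emailed a confirmation.",
        "Please call back later.",
        "Refund issued.",
    ]
    return rng.choice(candidates)


def mock_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a business KPI or LLM judge; score in [0, 1]."""
    text = response.lower()
    score = 0.0
    if "refund" in text:
        score += 0.6
    if "confirmation" in text:
        score += 0.4
    return score


def rejection_sample(prompts, n_samples=8, threshold=0.9, seed=0):
    """Keep the best of n_samples rollouts per prompt if it clears the threshold."""
    rng = random.Random(seed)
    kept = []
    for prompt in prompts:
        rollouts = []
        for _ in range(n_samples):
            response = mock_policy(prompt, rng)
            rollouts.append(Rollout(prompt, response, mock_reward(prompt, response)))
        best = max(rollouts, key=lambda r: r.reward)
        if best.reward >= threshold:  # reject prompts with no good rollout
            kept.append(best)
    return kept


if __name__ == "__main__":
    dataset = rejection_sample(["Customer asks for a refund on order 123."])
    for row in dataset:
        print(row.reward, row.response)
```

The surviving rollouts form a synthetic dataset that could seed later training, which is one way to read the cold-start claim: the environment and reward have to exist first, and the data falls out of running them.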
reinforcement-learning · fine-tuning · enterprise
Original description
95% of GenAI pilots fail to reach production. Alessandro Cappelli's argument is that this isn't a deployment problem or a prompt engineering problem — it's a feedback integration problem. Instruction fine-tuning and proprietary models give you a demo. Only reinforcement learning gives you a systematic way to incorporate defects, business metrics, and production signals and keep improving.

This talk covers what a production-grade RL pipeline looks like at Fortune 500 scale: synthetic data as a byproduct of environment training rather than a prerequisite, mock environments where agents can fail safely before touching real systems, and LLM judges that replace expensive annotation campaigns with a rubric-definition exercise that takes hours rather than weeks. The throughline is that agents raise the stakes on all of this — more tokens, less tolerance for errors, direct access to live databases — and RL was designed for exactly that problem.

Speaker info:
- https://www.linkedin.com/in/alessandro-cappelli-aa8060172