
OpenThoughts: Data Recipes for Reasoning Models — Ryan Marten, Bespoke Labs

3.5K views · Jul 19, 2025 · 19:58 min
Takeaway

Data recipes for post-training reasoning models invert pre-training intuition: a few high-quality synthetic question sources, multiple sampled answers per question, and a weaker-but-clearer teacher give the best results.

Summary

  • Bespoke Labs' OpenThoughts 3 is the SOTA open-source reasoning SFT dataset. DeepSeek-R1 itself was ultimately an SFT model fine-tuned on 800K traces (600K of them reasoning), and OpenThoughts reverse-engineers the missing data recipe.
  • Pipeline: source questions → mix sources → filter questions → generate answers with teacher → filter answers → pick best teacher. The team created ~5K datasets and ~3K models across ~1K experiments on Hugging Face.
  • Surprises: sampling 16 traces per question from 1/16 as many questions performs on par with using 16x more distinct questions; QwQ-32B beat DeepSeek-R1 as a teacher (a great researcher is not necessarily a great lecturer); synthetic questions outperform scraped or human-written ones.
  • Question filtering via LLM-rated difficulty (and answer length as proxy) beats embedding/fastText filters — opposite of pre-training filtering best practices.
reasoning · sft · data-recipes
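The question-filtering and multi-sample steps in the summary can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the OpenThoughts code: `Example`, `Teacher`, `select_hardest`, `build_sft_set`, and `toy_teacher` are hypothetical names, the word count of one probe answer stands in for the answer-length difficulty proxy, and the toy teacher replaces a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    question: str
    answer: str


# Hypothetical teacher signature: (question, sample_index) -> reasoning trace.
Teacher = Callable[[str, int], str]


def select_hardest(questions: List[str], teacher: Teacher, keep: int) -> List[str]:
    """Keep the `keep` questions whose probe answer is longest.

    Assumption: answer length (word count here) is a proxy for difficulty.
    """
    def probe_length(question: str) -> int:
        return len(teacher(question, 0).split())

    return sorted(questions, key=probe_length, reverse=True)[:keep]


def build_sft_set(
    questions: List[str],
    teacher: Teacher,
    budget: int,
    samples_per_question: int,
) -> List[Example]:
    """Spend a fixed example budget on fewer questions with several traces each."""
    if samples_per_question < 1:
        raise ValueError("samples_per_question must be at least 1")
    keep = budget // samples_per_question  # e.g. 16 samples -> 1/16 the questions
    chosen = select_hardest(questions, teacher, keep)
    return [
        Example(question, teacher(question, i))
        for question in chosen
        for i in range(samples_per_question)
    ]


def toy_teacher(question: str, sample_index: int) -> str:
    """Stand-in for a real teacher model: longer questions get longer traces."""
    return f"trace {sample_index}: " + " ".join(["step"] * len(question))


if __name__ == "__main__":
    pool = ["1+1?", "Prove that sqrt(2) is irrational.", "Integrate x*exp(x) dx."]
    for example in build_sft_set(pool, toy_teacher, budget=4, samples_per_question=2):
        print(example.question, "|", len(example.answer.split()), "words")
```

With a real teacher, `sample_index` would typically map to a sampling seed or to repeated calls at non-zero temperature, and the probe answer could be reused as the first sample instead of being generated twice.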
Original description
Peel back the curtain on state of the art model post-training through the story of OpenThinker, a SOTA small reasoning model (outperforming DeepSeek distill), built in the open. Learn about the dataset recipe used to build the strongest reasoning models which you can apply to your own domain-specific specialized reasoning models. Hear about the strategies that scale (and that don't) based on our rigorous experimentation on the journey from thousands of data points (Bespoke-Stratos) to millions of data (OpenThinker3). Build upon our open source engineering solutions for large-scale synthetic data generation, training on multiple supercomputing clusters, and building out fast reliable evaluations.

About Ryan Marten
Ryan Marten is co-lead of OpenThinker collaboration and a founding engineer at Bespoke Labs, working on data curation and model post-training. Previously, Ryan has been an AI researcher at the University of Illinois Urbana-Champaign, University of Toronto, University of Oxford, AI2, and Vector Institute. When he's not at the lab, he's probably out surfing.

Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter

Timestamps:

0:00 - Introduction to the problem of open-source reasoning in AI models.

1:09 - The effectiveness of Supervised Fine-Tuning (SFT) for reasoning.

3:38 - Introduction to OpenThoughts 3 and its performance.

7:52 - Key learnings from the data recipe development.

11:34 - Guidance on adapting the dataset recipe to specific domains.

15:15 - Call for open collaboration and where to find the project's resources.