How 800,000 traces gave small models o1-class reasoning
The DeepSeek-R1 distillation recipe was almost embarrassingly simple — and it rewrote what we thought small open models could do. A look at why it worked.
In January 2025, DeepSeek released something quieter than its flagship reasoning model but arguably more consequential: a set of distilled models that took the reasoning ability of a giant and poured it into bases small enough to run on a laptop. The recipe was so simple it embarrassed a lot of assumptions.
The recipe, in three steps
- Take the full, reinforcement-learning-trained R1 as the teacher.
- Have it generate roughly 800,000 reasoning traces — about 600K chains of thought via rejection sampling, plus 200K general samples.
- Train six ordinary open base models on those traces with plain supervised fine-tuning. No reinforcement learning. No reward model. No RL infrastructure at all.
The students were nothing exotic: Qwen2.5 checkpoints at 1.5B, 7B, 14B, and 32B (the two smallest being the Qwen2.5-Math variants), Llama-3.1-8B, and Llama-3.3-70B-Instruct. The result that made everyone sit up: R1-Distill-Qwen-32B beat OpenAI's o1-mini on several reasoning benchmarks, including AIME 2024 and MATH-500.
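The three steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DeepSeek's pipeline: `Trace`, `toy_teacher`, `exact_match`, the one-trace-per-prompt rule, and the `<think>` template are all hypothetical stand-ins, and the toy teacher simply does addition so the script runs without a model. The fine-tuning step itself is left to whatever SFT trainer you already use.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Trace:
    prompt: str
    reasoning: str  # the chain of thought
    answer: str     # the final answer extracted from the trace


def rejection_sample(
    problems: Iterable[tuple[str, str]],
    generate: Callable[[str, int], list[Trace]],
    is_correct: Callable[[str, str], bool],
    samples_per_prompt: int = 4,
) -> list[Trace]:
    """Sample several traces per prompt and keep the first one that checks out."""
    kept = []
    for prompt, reference in problems:
        for trace in generate(prompt, samples_per_prompt):
            if is_correct(trace.answer, reference):
                kept.append(trace)
                break  # assumption: at most one accepted trace per prompt
    return kept


def to_sft_example(trace: Trace) -> dict[str, str]:
    """Turn an accepted trace into a prompt/completion pair for plain SFT."""
    # Assumption: reasoning is wrapped in <think> tags ahead of the answer.
    completion = f"<think>\n{trace.reasoning}\n</think>\n{trace.answer}"
    return {"prompt": trace.prompt, "completion": completion}


def toy_teacher(prompt: str, n: int) -> list[Trace]:
    """Hypothetical stand-in for the teacher: adds two integers given as 'a+b'."""
    a, b = (int(part) for part in prompt.split("+"))
    # The first sample is deliberately off by one so the filter has work to do.
    offsets = [1] + [0] * (n - 1)
    return [
        Trace(prompt, f"{a} plus {b} is {a + b + off}.", str(a + b + off))
        for off in offsets
    ]


def exact_match(answer: str, reference: str) -> bool:
    """Hypothetical verifier: string equality after trimming whitespace."""
    return answer.strip() == reference.strip()


problems = [("2+3", "5"), ("10+4", "14")]
accepted = rejection_sample(problems, toy_teacher, exact_match)
dataset = [to_sft_example(trace) for trace in accepted]
print(dataset[0]["completion"])
```

In a real run, `generate` would call the teacher model and `is_correct` would be a math checker or a unit-test harness. The resulting `dataset` is what the student is fine-tuned on, with no reward model anywhere in the loop.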
Why it worked
The deep reason is that reasoning is unusually distillable. A chain of thought is a dense, explicit demonstration of process — not just "here's the answer" but "here's every step to get there." Train a model on enough of those demonstrations and it doesn't merely memorize answers; it picks up the procedure.
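One way to see this is through the training objective itself. As a sketch, assume each training example pairs a prompt x with a teacher trace made of reasoning tokens r followed by a final answer a (the notation is introduced here for illustration, not taken from the DeepSeek paper). Plain SFT then minimizes the next-token loss over the whole target:

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = - \sum_{(x,\, y) \in \mathcal{D}} \; \sum_{t=1}^{|y|}
      \log p_\theta\!\left(y_t \mid x,\, y_{<t}\right),
\qquad y = r \oplus a
```

Every reasoning token contributes its own term to the sum. Because a trace is usually far longer than the answer it ends in, most of the supervision the student receives is about the process rather than the result.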
This had been hinted at before — Google's Distilling Step-by-Step showed a 770M model beating a few-shot 540B model by learning rationales. But R1 did it at scale, in the open, with a recipe anyone could copy. And copy they did: OpenThoughts, Bespoke-Stratos, S1, and a long tail of "distill-from-R1" fine-tunes followed within weeks.
The two lessons that stuck
Reasoning transfers cheaply. You don't need the teacher's weights or a giant RL run. You need its traces and a fine-tuning loop. That collapses the cost of building a strong reasoning model by orders of magnitude.
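Distilling from a strong reasoner beats running RL on a small model directly. It is often easier to inherit reasoning than to grow it from scratch. The small model doesn't have to discover good reasoning through trial and error; it can be shown. DeepSeek's own comparison pointed the same way: a Qwen 32B base trained with large-scale RL alone landed well short of the same-size model distilled from R1.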
The honest asterisks
It's not magic, and pretending otherwise does the field a disservice:
- The ceiling is task-dependent. Math and code — densely represented as traces — transfer beautifully. Long-tail world knowledge doesn't. A student generally won't exceed its teacher on the distilled distribution.
- Pure trace-SFT compounds errors. The student only ever trains on contexts the teacher visited, so it learns nothing about recovering from its own mistakes, and at inference one early slip can push it into states it has never seen. That's exactly the gap on-policy distillation closes, and why much of the work in 2025–2026 moved toward it.
- Benchmarks can leak. Teacher traces can smuggle benchmark-derived content into the student, inflating scores. Evaluate on fresh problems; report pass@k, not just pass@1.
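For reporting pass@k, a common choice is the unbiased estimator introduced with OpenAI's Codex evaluation: draw n samples per problem, count the c that are correct, and compute 1 - C(n-c, k) / C(n, k). The sketch below assumes the n samples are independent and that you average the per-problem values yourself; the function name and the example numbers are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct.

    n: samples drawn for the problem, c: how many were correct, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the k-subsets that contain no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# One problem, 10 samples, 3 of them correct.
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
print(round(pass_at_k(10, 3, 5), 4))  # 0.9167
```

The gap between those two numbers is the point of the bullet above: pass@1 shows how reliable a single attempt is, while pass@k shows whether the ability is there at all.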
Why it matters
The DeepSeek-R1 distilled models are the clearest proof yet of this site's premise: the gap between "what the best AI can do" and "what you can run yourself" is closing, and distillation is the tool closing it. A recipe that fits in three bullet points took reasoning that everyone assumed required frontier scale and made it a download.
Want the full mechanics? Read Reasoning distillation: teaching small models to think.