How 800,000 traces gave small models o1-class reasoning

In January 2025, DeepSeek released something quieter than its flagship reasoning model but arguably more consequential: a set of distilled models that took the reasoning ability of a giant and poured it into bases small enough to run on a laptop. The recipe was so simple it embarrassed a lot of assumptions.

The recipe, in three steps

Take the full, reinforcement-learning-trained R1 as the teacher.
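Have it generate roughly 800,000 traces: about 600K reasoning chains of thought kept via rejection sampling (only samples judged correct survive), plus about 200K general, non-reasoning samples such as writing, factual QA, and translation.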
Train six ordinary open base models on those traces with plain supervised fine-tuning. No reinforcement learning. No reward model. No RL infrastructure at all.
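
The data-building half of the recipe can be sketched in a few lines. This is a minimal illustration, not DeepSeek's pipeline: `Problem`, `sample_fn` (a stand-in for a call to the teacher), and the `\boxed{...}` answer convention are all assumptions made for the example, and an exact-match check stands in for whatever correctness judging the real filter used.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class Problem:
    prompt: str
    reference_answer: str  # assumed to be available for verifiable tasks


def extract_final_answer(trace: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a trace, if any.

    The boxed-answer convention is an assumption of this sketch.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", trace)
    return matches[-1].strip() if matches else None


def rejection_sample(
    problems: Iterable[Problem],
    sample_fn: Callable[[str], str],  # hypothetical teacher call: prompt -> trace
    samples_per_problem: int = 4,
) -> List[Dict[str, str]]:
    """Sample several traces per problem and keep only the correct ones."""
    kept = []
    for problem in problems:
        for _ in range(samples_per_problem):
            trace = sample_fn(problem.prompt)
            # Reject traces with a wrong or missing final answer.
            if extract_final_answer(trace) == problem.reference_answer:
                kept.append({"prompt": problem.prompt, "completion": trace})
    return kept
```

Each kept record is an ordinary prompt/completion pair, which is why step three needs nothing beyond a standard supervised fine-tuning setup.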

The students were nothing exotic: Qwen2.5 checkpoints at 1.5B, 7B, 14B, and 32B (the two smallest from the Qwen2.5-Math line), Llama-3.1-8B, and Llama-3.3-70B-Instruct. The result that made everyone sit up: R1-Distill-Qwen-32B beat OpenAI's o1-mini on several reasoning benchmarks, including AIME 2024 and MATH-500.

Why it worked

The deep reason is that reasoning is unusually distillable. A chain of thought is a dense, explicit demonstration of process — not just "here's the answer" but "here's every step to get there." Train a model on enough of those demonstrations and it doesn't merely memorize answers; it picks up the procedure.

This had been hinted at before: Google's Distilling Step-by-Step showed a 770M model beating a few-shot 540B model by learning rationales. But R1 did it at scale, in the open, with a recipe anyone could copy. And copy they did: OpenThoughts, Bespoke-Stratos, s1, and a long tail of "distill-from-R1" fine-tunes followed within weeks.

The two lessons that stuck

Reasoning transfers cheaply. You don't need the teacher's weights or a giant RL run. You need its traces and a fine-tuning loop. That collapses the cost of building a strong reasoning model by orders of magnitude.

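Distilling from a strong reasoner beats running RL on a small model directly. DeepSeek tested this head-on: large-scale RL applied to Qwen-32B-Base landed well short of the 32B model distilled from R1. It is often easier to inherit reasoning than to grow it from scratch. The small model doesn't have to discover good reasoning through trial and error; it can be shown.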
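
The "fine-tuning loop" in the first lesson boils down to next-token cross-entropy on the teacher's trace. The sketch below shows that objective in plain Python. Masking the prompt tokens so that only trace tokens contribute is a common SFT convention assumed here, not something the source confirms DeepSeek did; `build_example` and `masked_nll` are illustrative names.

```python
from typing import List, Tuple

# Sentinel for positions that should not contribute to the loss.
# -100 matches the default ignore_index of PyTorch's CrossEntropyLoss.
IGNORE_INDEX = -100


def build_example(prompt_ids: List[int], trace_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and trace; supervise only the trace tokens."""
    input_ids = prompt_ids + trace_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + trace_ids
    return input_ids, labels


def masked_nll(token_logprobs: List[float], labels: List[int]) -> float:
    """Mean negative log-likelihood over supervised positions.

    token_logprobs[i] is assumed to be the log-probability the student
    assigned to the token at position i, given everything before it.
    """
    supervised = [
        -logprob
        for logprob, label in zip(token_logprobs, labels)
        if label != IGNORE_INDEX
    ]
    if not supervised:
        raise ValueError("no supervised tokens in this example")
    return sum(supervised) / len(supervised)
```

Nothing in this loss refers to a reward or a rollout: the student is simply pushed to assign high probability to each step the teacher wrote down.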

The honest asterisks

It's not magic, and pretending otherwise does the field a disservice:

The ceiling is task-dependent. Math and code — densely represented as traces — transfer beautifully. Long-tail world knowledge doesn't. A student generally won't exceed its teacher on the distilled distribution.
Pure trace-SFT compounds errors. The student only sees contexts the teacher visits, so it learns nothing about recovering from its own mistakes. That's exactly the gap on-policy distillation closes, by training the student on its own generations with the teacher supplying the feedback, and it is why much of the work in 2025–2026 moved toward it.
Benchmarks can leak. Teacher traces can smuggle benchmark-derived content into the student, inflating scores. Evaluate on fresh problems; report pass@k, not just pass@1.
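
For the pass@k advice in the last point, one widely used choice is the unbiased estimator popularized by the Codex evaluation paper: draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k randomly chosen samples is correct as 1 - C(n-c, k) / C(n, k). The function name and argument checks below are choices made for this sketch.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples drawn, c: number of correct samples, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that all k chosen samples are incorrect is
    # C(n - c, k) / C(n, k); comb returns 0 when n - c < k.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark score is then the mean of `pass_at_k` across problems. Reporting it at several values of k alongside pass@1 shows whether a model is reliably right or only occasionally lucky.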

Why it matters

The DeepSeek-R1 distilled models are the clearest proof yet of this site's premise: the gap between "what the best AI can do" and "what you can run yourself" is closing, and distillation is the tool closing it. A recipe that fits in three bullet points took reasoning that everyone assumed required frontier scale and made it a download.

Want the full mechanics? Read Reasoning distillation: teaching small models to think.