Reasoning distillation: teaching small models to think
How chain-of-thought traces turned distillation from a compression trick into a way to transfer reasoning itself — the DeepSeek-R1 recipe and why it changed the field.
Reasoning distillation trains a small model on a large model's chain-of-thought traces, so the student learns to reason rather than just to answer. DeepSeek-R1 proved it at scale in 2025: ~800,000 reasoning traces, plain supervised fine-tuning, no reinforcement learning — and the resulting 32B student beat OpenAI's o1-mini on several benchmarks. The lesson: reasoning transfers cheaply, and distilling from a strong reasoner often beats running RL on a small model directly.
For most of its history, distillation transferred answers. The breakthrough of the last two years is that you can transfer reasoning: the step-by-step thinking a model does before it answers. This single shift is why small open models can now solve competition math and hard coding problems that were widely assumed to require frontier scale.
The idea: distill the trace, not just the label
A reasoning model doesn't jump to an answer — it produces a long chain of thought (CoT) first. The insight behind reasoning distillation is simple and powerful:
If you train a small model on the teacher's reasoning traces — not just its final answers — the student learns to reason, not merely to recall.
An early, clean demonstration was Google's Distilling Step-by-Step (Hsieh et al., 2023): extract natural-language rationales from the teacher and train the student to predict both the label and the rationale. A 770M-parameter student reportedly beat a few-shot 540B PaLM on a benchmark — using a fraction of the data. Reasoning, it turned out, is unusually distillable.
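To make "predict both the label and the rationale" concrete, here is a minimal sketch of how one teacher-annotated item could become two student training examples. It assumes a text-to-text student and a task-prefix scheme in the spirit of Distilling Step-by-Step. The names `Seq2SeqExample`, `build_multitask_examples`, and `multitask_loss`, the exact prefix strings, and the weight `lam` are illustrative, not the paper's code.

```python
from dataclasses import dataclass


@dataclass
class Seq2SeqExample:
    source: str  # text fed to the student
    target: str  # text the student is trained to produce


def build_multitask_examples(question: str, rationale: str, label: str) -> list[Seq2SeqExample]:
    """Turn one teacher-annotated item into two student training examples.

    The prefix tells the student which output is wanted, so the rationale is
    only needed at training time and the label can be predicted alone later.
    """
    return [
        Seq2SeqExample(source=f"[label] {question}", target=label),
        Seq2SeqExample(source=f"[rationale] {question}", target=rationale),
    ]


def multitask_loss(label_loss: float, rationale_loss: float, lam: float = 1.0) -> float:
    """Combine the two objectives; lam (hypothetical default) weights the rationale term."""
    return label_loss + lam * rationale_loss


examples = build_multitask_examples(
    question="Is 91 prime?",
    rationale="91 = 7 * 13, so it has divisors other than 1 and itself.",
    label="no",
)
```

A label-only baseline would keep just the first of the two examples. The second is what makes the student practice the reasoning as well as the answer.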
The watershed: DeepSeek-R1 (January 2025)
The moment the field changed was the release of the DeepSeek-R1 distilled models. The recipe was striking in its simplicity:
- Take the full, RL-trained R1 reasoning model as the teacher.
- Have it generate ~800,000 reasoning traces (roughly 600K chain-of-thought examples via rejection sampling, plus 200K general samples).
- Train six smaller dense models on those traces with pure supervised fine-tuning — no reinforcement learning.
The students were ordinary open bases: Qwen2.5 at 1.5B / 7B / 14B / 32B, Llama-3.1-8B, and Llama-3.3-70B. The headline result: R1-Distill-Qwen-32B beat OpenAI's o1-mini on several reasoning benchmarks.
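The data-collection half of the recipe above can be sketched as a rejection-sampling loop: sample several traces per problem, keep only those whose final answer checks out, and store them as prompt/completion pairs for SFT. Everything here is an assumption for illustration: the `teacher` callable, the `Answer:` convention, and the plain string-match check. The filters actually used in the R1 pipeline are not reproduced.

```python
import re
from typing import Callable, Optional


def extract_final_answer(trace: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", trace)
    return matches[-1].strip() if matches else None


def rejection_sample(
    problems: list[dict],
    teacher: Callable[[str], str],
    samples_per_problem: int = 4,
) -> list[dict]:
    """Keep only teacher traces whose final answer matches the reference."""
    sft_rows = []
    for problem in problems:
        for _ in range(samples_per_problem):
            trace = teacher(problem["question"])
            if extract_final_answer(trace) == problem["reference"]:
                # The whole trace, not just the answer, becomes the SFT target.
                sft_rows.append({"prompt": problem["question"], "completion": trace})
    return sft_rows


def make_toy_teacher() -> Callable[[str], str]:
    """Stand-in for a real reasoning model: cycles through canned traces."""
    canned = {
        "What is 17 + 25?": [
            "17 + 20 = 37, then 37 + 5 = 43.\nAnswer: 43",  # wrong, gets rejected
            "Tens: 10 + 20 = 30. Ones: 7 + 5 = 12. 30 + 12 = 42.\nAnswer: 42",
        ],
    }
    calls: dict[str, int] = {}

    def teacher(question: str) -> str:
        i = calls.get(question, 0)
        calls[question] = i + 1
        options = canned[question]
        return options[i % len(options)]

    return teacher


problems = [{"question": "What is 17 + 25?", "reference": "42"}]
rows = rejection_sample(problems, make_toy_teacher(), samples_per_problem=2)
```

The surviving rows are ordinary supervised examples, so the training half of the recipe needs nothing beyond a standard fine-tuning loop over `prompt` and `completion`.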
Two lessons landed hard:
- Reasoning transfers cheaply. You don't need the teacher's weights or a giant RL run — you need its traces and a fine-tuning loop.
- Distilling from a strong reasoner can beat running RL on a small model directly. It's often easier to inherit reasoning than to grow it.
A wave of open replications followed (OpenThoughts, Bespoke-Stratos, s1), all variations on "harvest long traces, SFT a smaller model."
The next step: on-policy distillation
Trace-based SFT has a weakness. The student only ever trains on contexts the teacher visits. When the student later makes a mistake the teacher never would, it has never seen how to recover, and errors compound. The 2025–2026 frontier addresses this with on-policy distillation: let the student generate its own attempts and have the teacher grade them token by token. Qwen3 reportedly used this to give an 8B student a 32B teacher's reasoning at roughly a tenth of the GPU hours that reinforcement learning on the student would have needed.
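A toy sketch of the on-policy loop, assuming the "grade" is a per-token reverse KL between student and teacher distributions. That is one common choice, not necessarily the exact objective any particular lab used. The models are hypothetical stand-ins that map a token prefix to logits over a three-token vocabulary. In a real setup the loss would be backpropagated through the student's logits.

```python
import math
import random
from typing import Callable, Sequence

Model = Callable[[Sequence[int]], list[float]]  # prefix -> logits over the vocab


def log_softmax(logits: Sequence[float]) -> list[float]:
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits: Sequence[float], teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) at one token position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def on_policy_step(student: Model, teacher: Model, prompt: Sequence[int],
                   max_new_tokens: int, rng: random.Random) -> tuple[list[int], float]:
    """Sample a continuation from the student and score each position with the teacher."""
    prefix = list(prompt)
    per_token_kl = []
    for _ in range(max_new_tokens):
        s_logits = student(prefix)
        t_logits = teacher(prefix)  # the teacher is queried on the student's own prefix
        per_token_kl.append(reverse_kl(s_logits, t_logits))
        probs = [math.exp(lp) for lp in log_softmax(s_logits)]
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        prefix.append(next_token)  # the next context comes from the student, not the teacher
    return prefix, sum(per_token_kl) / len(per_token_kl)


def toy_teacher(prefix: Sequence[int]) -> list[float]:
    return [0.0, 2.0, 0.0]  # strongly prefers token 1


def toy_student(prefix: Sequence[int]) -> list[float]:
    return [1.0, 1.0, 1.0]  # undecided


sequence, loss = on_policy_step(toy_student, toy_teacher, prompt=[0],
                                max_new_tokens=5, rng=random.Random(0))
```

The difference from trace-based SFT is in the two commented lines: the contexts being scored are ones the student actually reached, so its own mistakes show up in training and get corrected.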
The honest caveats
Reasoning distillation is powerful but not magic:
- The ceiling is task-dependent. Densely represented skills like math and code transfer well through traces. Long-tail world knowledge does not, and a student generally can't exceed its teacher on the distilled distribution. (This qualifies the 2023 "False Promise of Imitating Proprietary LLMs" result, which was about shallow style cloning; reasoning-trace distillation is a different and more effective technique.)
- Diversity can collapse. Mode-seeking objectives can make a student great at pass@1 but worse at pass@k, which is bad if you rely on sampling many attempts. Watch for it.
- Benchmarks can leak. Teacher traces can carry benchmark-derived content into the student, inflating scores. Evaluate on fresh, post-cutoff problems and report pass@k, not just pass@1.
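To see why pass@1 alone can hide a diversity collapse, here is a small sketch using the standard unbiased pass@k estimator (the combinatorial form introduced with the Codex evaluation). The per-problem counts are made-up numbers for two hypothetical students, not measurements.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct counts out of n = 10 samples on two problems.
collapsed = [10, 0]  # always right on one problem, never right on the other
diverse = [4, 4]     # sometimes right on both

n = 10
collapsed_p1 = mean_pass_at_k(collapsed, n, k=1)
diverse_p1 = mean_pass_at_k(diverse, n, k=1)
collapsed_p10 = mean_pass_at_k(collapsed, n, k=10)
diverse_p10 = mean_pass_at_k(diverse, n, k=10)
```

With these toy counts the collapsed student wins on pass@1 (0.5 against 0.4) but loses on pass@10 (0.5 against 1.0), which is exactly the pattern a pass@1-only report would miss.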
Why it matters here
Reasoning distillation is the reason this site exists now rather than in five years. It turned distillation from a quiet compression trick into the mechanism by which frontier capability — not just frontier answers — becomes something you can run on your own machine. The craft of doing it well is still being invented. That's the fun part.