Reasoning distillation: teaching small models to think

How chain-of-thought traces turned distillation from a compression trick into a way to transfer reasoning itself — the DeepSeek-R1 recipe and why it changed the field.

In short

Reasoning distillation trains a small model on a large model's chain-of-thought traces, so the student learns to reason rather than just to answer. DeepSeek-R1 proved it at scale in 2025: ~800,000 reasoning traces, plain supervised fine-tuning, no reinforcement learning — and the resulting 32B student beat OpenAI's o1-mini on several benchmarks. The lesson: reasoning transfers cheaply, and distilling from a strong reasoner often beats running RL on a small model directly.

For most of its history, distillation transferred answers: the student learned to match the teacher's outputs. The more recent shift is that you can transfer reasoning, the step-by-step thinking a model does before it answers. That shift is a large part of why small open models can now solve competition math and hard coding problems that were widely assumed to require frontier scale.

The idea: distill the trace, not just the label

A reasoning model doesn't jump to an answer — it produces a long chain of thought (CoT) first. The insight behind reasoning distillation is simple and powerful:

If you train a small model on the teacher's reasoning traces — not just its final answers — the student learns to reason, not merely to recall.

An early, clean demonstration was Google's Distilling Step-by-Step (Hsieh et al., 2023): extract natural-language rationales from the teacher and train the student to predict both the label and the rationale. A 770M-parameter T5 student reportedly beat few-shot-prompted 540B PaLM on one benchmark while training on only about 80% of the available data. Reasoning, it turned out, is unusually distillable.
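
A minimal sketch of that multi-task setup, assuming each teacher-annotated example becomes two training pairs that share a question but differ in a task prefix. The `Example` type, the exact prefix strings, and the weight `lam` are illustrative names chosen here, not a reproduction of the authors' code.

```python
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    rationale: str  # reasoning text written by the teacher
    label: str      # final answer


def to_multitask_pairs(ex: Example) -> list[tuple[str, str]]:
    """Turn one annotated example into two (input, target) training pairs."""
    return [
        ("[label] " + ex.question, ex.label),          # task 1: predict the answer
        ("[rationale] " + ex.question, ex.rationale),  # task 2: predict the reasoning
    ]


def combined_loss(label_loss: float, rationale_loss: float, lam: float = 1.0) -> float:
    """Weighted sum of the two task losses (lam is an assumed hyperparameter)."""
    return label_loss + lam * rationale_loss
```

At inference time only the label task is needed, so in this setup the student does not pay for generating a rationale on every query.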

The watershed: DeepSeek-R1 (January 2025)

The moment the field changed was the release of the DeepSeek-R1 distilled models. The recipe was striking in its simplicity:

  1. Take the full, RL-trained R1 reasoning model as the teacher.
  2. Have it generate ~800,000 training samples (roughly 600K chain-of-thought examples filtered by rejection sampling, plus roughly 200K general, non-reasoning samples).
  3. Train six smaller dense models on those traces with pure supervised fine-tuning — no reinforcement learning.

The students were ordinary open bases: Qwen2.5 at 1.5B / 7B / 14B / 32B, Llama-3.1-8B, and Llama-3.3-70B. The headline result: R1-Distill-Qwen-32B beat OpenAI's o1-mini on several reasoning benchmarks.
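
The data-generation step of the recipe can be sketched as a rejection-sampling loop: sample several traces per problem and keep only those whose final answer checks out. This is a sketch under stated assumptions, not DeepSeek's pipeline. `teacher` is a hypothetical callable that returns one trace per call, the `Answer:` marker and the prompt/completion record layout are conventions invented for this example, and keeping one trace per problem is a design choice made here.

```python
import re
from typing import Callable, Optional


def extract_answer(trace: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", trace)
    return matches[-1].strip() if matches else None


def rejection_sample(
    problems: list[dict],
    teacher: Callable[[str], str],  # hypothetical: question -> reasoning trace
    samples_per_problem: int = 4,
) -> list[dict]:
    """Build an SFT dataset from teacher traces that reach the known answer."""
    dataset = []
    for prob in problems:
        for _ in range(samples_per_problem):
            trace = teacher(prob["question"])
            if extract_answer(trace) == prob["answer"]:
                # Keep the full trace, not just the answer, as the SFT target.
                dataset.append({"prompt": prob["question"], "completion": trace})
                break  # assumed choice: one verified trace per problem
    return dataset
```

Exact string matching only works for tasks with checkable answers such as math. For code, the equivalent filter would be running tests against the generated program.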

Two lessons landed hard:

  • Reasoning transfers cheaply. You don't need the teacher's weights or a giant RL run — you need its traces and a fine-tuning loop.
  • Distilling from a strong reasoner often beats running RL on a small model directly. Inheriting reasoning tends to be easier than growing it.
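
The "fine-tuning loop" in the first lesson optimizes an ordinary next-token objective on the teacher's trace. A minimal sketch follows, assuming the loss is averaged over completion tokens only while prompt tokens are masked out. That masking is a common practice assumed here, not something the source states about the R1 recipe, and `sft_loss` is a name made up for the example.

```python
def sft_loss(token_logprobs: list[float], is_completion: list[bool]) -> float:
    """Mean negative log-likelihood over completion (trace) tokens only.

    token_logprobs[i] is the student's log-probability of the i-th target token;
    is_completion[i] is False for prompt tokens, which are excluded from the loss.
    """
    picked = [-lp for lp, keep in zip(token_logprobs, is_completion) if keep]
    if not picked:
        raise ValueError("no completion tokens to train on")
    return sum(picked) / len(picked)
```

In a real training run a framework would compute the log-probabilities and gradients. The point of the sketch is that nothing beyond standard supervised fine-tuning is involved.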

A wave of open replications followed, including OpenThoughts, Bespoke-Stratos, and s1, all variations on "harvest long traces, SFT a smaller model."

The next step: on-policy distillation

Trace-based SFT has a weakness. The student trains only on contexts the teacher visits, so when it later makes a mistake the teacher never would, it has never seen how to recover, and errors compound. The 2025–2026 frontier addresses this with on-policy distillation: let the student generate its own attempts and have the teacher score them token by token. Qwen3 reportedly used this to give an 8B student a 32B teacher's reasoning at roughly a tenth of the GPU hours that training the student with reinforcement learning would have taken.
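
One way to picture the token-by-token grading is a per-position divergence between the student's and the teacher's next-token distributions, evaluated along a sequence the student sampled itself. The sketch below uses reverse KL, KL(student || teacher), which is one common choice and an assumption here rather than a description of any specific lab's implementation. Logits are plain Python lists, and all function names are illustrative.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits: list[float], teacher_logits: list[float]) -> float:
    """KL(student || teacher) at a single token position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def on_policy_distill_loss(
    student_steps: list[list[float]],
    teacher_steps: list[list[float]],
) -> float:
    """Average per-token reverse KL along a student-generated sequence.

    Both arguments hold one logit vector per position; the teacher's logits
    are computed on the student's own tokens, not on a teacher-written trace.
    """
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)
```

Because the positions come from the student's own samples, the loss is largest exactly where the student has drifted into states the teacher would handle differently, which is the recovery signal trace-based SFT lacks.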

The honest caveats

Reasoning distillation is powerful but not magic:

  • The ceiling is task-dependent. Densely represented skills like math and code transfer well through traces. Long-tail world knowledge does not, and a student generally can't exceed its teacher on the distilled distribution. (This qualifies the 2023 "False Promise of Imitating Proprietary LLMs" result, which concerned shallow style cloning; reasoning-trace distillation is a different and more effective technique.)
  • Diversity can collapse. Mode-seeking objectives can make a student great at pass@1 but worse at pass@k, which hurts if you rely on sampling many attempts. Track pass@k alongside pass@1 to catch it.
  • Benchmarks can leak. Teacher traces can carry benchmark-derived content into the student, inflating scores. Evaluate on fresh, post-cutoff problems and report pass@k, not just pass@1.
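
For reporting pass@k, a widely used approach is the unbiased estimator introduced alongside the HumanEval benchmark: draw n samples per problem, count the c correct ones, and compute the probability that at least one of k randomly chosen samples is correct. The sketch below implements that formula for a single problem. The function name and argument checks are choices made here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: samples that passed, k: attempt budget (1 <= k <= n).
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k slots, so some pick must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this value over all problems gives the benchmark score. Comparing the curve across several k for the student and its base model is one way to spot the diversity collapse described above.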

Why it matters here

Reasoning distillation is the reason this site exists now rather than in five years. It turned distillation from a quiet compression trick into the mechanism by which frontier capability — not just frontier answers — becomes something you can run on your own machine. The craft of doing it well is still being invented. That's the fun part.