On-policy distillation: learning from your own mistakes
Why letting the student generate its own attempts and having the teacher grade them — rather than imitating fixed teacher data — became the dominant post-training paradigm of 2025–2026.
If reasoning distillation is what the frontier transfers, on-policy distillation is how it transfers best. It's the technique a 2026 survey called "the indispensable post-training paradigm for scaling reasoning," and it's worth understanding precisely.
The problem with imitation
Standard ("off-policy") distillation trains the student on a fixed corpus the teacher generated. This is sequence-level KD, and it works — but it has a structural flaw called exposure bias or compounding error:
- The student only ever sees contexts the teacher would visit.
- At inference, the student inevitably wanders into states the teacher would never have visited, because it's not as good.
- It was never trained on those states, so a small early error snowballs into a wrong answer.
You trained it on a perfect driver's route; the moment it drifts off that route, it has never seen the shoulder.
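To put rough numbers on the snowballing, here is a back-of-the-envelope sketch. It rests on a deliberately crude assumption that is mine, not taken from any cited result: each generated token independently has a small probability `eps` of leaving the states the teacher data covers, and the student never recovers once it has drifted.

```python
def prob_still_on_teacher_path(eps: float, num_tokens: int) -> float:
    """Chance that a rollout never leaves teacher-covered states.

    Toy model (an assumption of this sketch): every token independently
    drifts off the covered path with probability eps, and a drifted
    rollout never finds its way back.
    """
    return (1.0 - eps) ** num_tokens


# Illustrative drift rate of 1% per token; the value is made up.
for num_tokens in (10, 100, 1000):
    p = prob_still_on_teacher_path(eps=0.01, num_tokens=num_tokens)
    print(f"{num_tokens:>5} tokens: P(still on path) = {p:.5f}")
```

In this toy model a 1,000-token reasoning trace stays on the covered path well under 1% of the time, which is why long outputs are where off-policy training hurts most.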
The fix: train on the student's own trajectories
On-policy distillation flips the data source. The student generates its own outputs, and the teacher scores them — token by token — typically with a mode-seeking reverse-KL objective. Now:
- The student trains on its own mistakes, learning to recover from the states it actually reaches.
- Feedback is dense (a signal on every token) rather than sparse like RL's single final reward.
- It's hard to reward-hack: a low reverse KL essentially means the student is behaving like the teacher.
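The loop described above can be sketched in plain Python. Everything here is a toy stand-in: `teacher_probs` and `student_probs` are hypothetical hand-written distributions over a three-token vocabulary, not real language models. The sketch only computes the per-token signal. A real trainer would backpropagate that signal through the student.

```python
import math
import random

VOCAB = ["step", "guess", "<eos>"]
MAX_LEN = 8


def teacher_probs(prefix):
    """Hypothetical teacher: prefers careful 'step' tokens, then stops."""
    if len(prefix) >= 3:
        return [0.05, 0.05, 0.90]
    return [0.85, 0.05, 0.10]


def student_probs(prefix):
    """Hypothetical student: guesses too often, especially after a guess."""
    if prefix and prefix[-1] == "guess":
        return [0.30, 0.50, 0.20]
    if len(prefix) >= 3:
        return [0.20, 0.20, 0.60]
    return [0.60, 0.30, 0.10]


def reverse_kl(student, teacher):
    """KL(student || teacher) over the vocabulary at one position."""
    return sum(q * math.log(q / p) for q, p in zip(student, teacher) if q > 0)


def on_policy_token_losses(rng):
    """Sample one rollout from the student and grade every position.

    Returns the sampled tokens and one reverse-KL value per token, each
    computed in the state the student itself reached.
    """
    prefix, losses = [], []
    while len(prefix) < MAX_LEN:
        q = student_probs(prefix)   # student's next-token distribution
        p = teacher_probs(prefix)   # teacher's view of the same state
        losses.append(reverse_kl(q, p))
        token = rng.choices(VOCAB, weights=q, k=1)[0]  # sample from the student
        prefix.append(token)
        if token == "<eos>":
            break
    return prefix, losses


tokens, losses = on_policy_token_losses(random.Random(0))
for token, loss in zip(tokens, losses):
    print(f"{token:>6}  reverse-KL = {loss:.3f}")
```

Note that the teacher is queried on prefixes the student produced, including ones that contain a bad "guess", so every position yields a grade even after the rollout has gone wrong.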
The unifying framework: GKD
The clean formalization is Generalized Knowledge Distillation (GKD) (Agarwal et al., Google DeepMind, 2023). GKD generalizes distillation along two dials:
- Data source — interpolate between fixed teacher data (off-policy) and the student's own self-generated sequences (on-policy).
- Divergence — forward KL (mode-covering), reverse KL (mode-seeking), or a generalized JSD in between.
This framework contains older methods as special cases: SeqKD, token-level KD, and MiniLLM (which introduced reverse-KL distillation for LLMs) all fall out of it. In practice you reach for it through Hugging Face TRL's GKDTrainer, where lmbda sets the fraction of on-policy (student-generated) data and beta interpolates the generalized JSD between forward KL (beta = 0) and reverse KL (beta = 1).
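To make the two dials concrete, here is a minimal stdlib-only sketch. The names `lmbda` and `beta` are borrowed from the text above, but the functions `generalized_jsd` and `pick_data_source` are hypothetical illustrations, not TRL's code. Treating the endpoints beta = 0 and beta = 1 as plain forward and reverse KL is a convention this sketch assumes, and all probabilities are assumed strictly positive.

```python
import math
import random


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def generalized_jsd(teacher, student, beta):
    """Divergence dial: beta=0 gives forward KL, beta=1 gives reverse KL.

    For 0 < beta < 1, both distributions are compared against the
    mixture beta * teacher + (1 - beta) * student.
    """
    if beta == 0.0:
        return kl(teacher, student)  # forward KL, mode-covering
    if beta == 1.0:
        return kl(student, teacher)  # reverse KL, mode-seeking
    mixture = [beta * t + (1.0 - beta) * s for t, s in zip(teacher, student)]
    return beta * kl(teacher, mixture) + (1.0 - beta) * kl(student, mixture)


def pick_data_source(lmbda, rng):
    """Data-source dial: on-policy with probability lmbda, else fixed data."""
    return "student_rollout" if rng.random() < lmbda else "fixed_teacher_data"


teacher = [0.70, 0.20, 0.10]  # made-up next-token distributions
student = [0.40, 0.40, 0.20]
for beta in (0.0, 0.5, 1.0):
    print(f"beta={beta}: divergence = {generalized_jsd(teacher, student, beta):.4f}")

rng = random.Random(0)
print([pick_data_source(0.5, rng) for _ in range(6)])
```

Setting `lmbda` to 0 with forward KL recovers ordinary supervised distillation on teacher data. Setting `lmbda` to 1 with reverse KL gives the fully on-policy, mode-seeking variant discussed above.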
Why it won in 2025–2026
The economics are decisive. Reported results:
- Thinking Machines Lab (Oct 2025) hit ~70–74% on AIME'24 at roughly one-tenth the cost of reinforcement learning, and 9–30× cheaper than off-policy distillation — while also recovering instruction-following lost to domain fine-tuning (a cure for catastrophic forgetting).
- Qwen3 distilled a 32B teacher's reasoning into an 8B student at ~1/10 the GPU hours of reinforcement learning.
- On-policy distillation is now reported across DeepSeek-V4, Qwen3, Gemma 2, Nemotron, and MiMo.
The catch: diversity collapse
Reverse KL is mode-seeking by design, and that's a double-edged sword. Push it too hard and the student drops the high-entropy "branch points" where reasoning legitimately has multiple valid paths. One 2026 study found standard on-policy distillation retained only 6.8% of high-entropy tokens versus 18.5% in the teacher, which shows up as pass@1 improving while pass@k degrades. That's bad news if you rely on best-of-N sampling or inference-time scaling. The emerging fix is to blend mode-seeking reverse KL with mode-covering forward KL.
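A toy branch point shows why the blend helps. The distributions and the mixing weight `alpha` below are invented for illustration, and the simple weighted sum in `blended_loss` is one assumed way to combine the two terms, not the formulation of any particular paper.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def blended_loss(student, teacher, alpha):
    """alpha * reverse KL + (1 - alpha) * forward KL (alpha is hypothetical)."""
    return alpha * kl(student, teacher) + (1.0 - alpha) * kl(teacher, student)


# A branch point: the teacher treats two continuations as equally valid.
teacher = [0.49, 0.49, 0.02]
collapsed = [0.98, 0.01, 0.01]  # student that committed to one branch
diverse = [0.60, 0.38, 0.02]    # student that kept both branches alive

for name, student in (("collapsed", collapsed), ("diverse", diverse)):
    print(
        f"{name:>9}",
        f"entropy={entropy(student):.3f}",
        f"reverse={kl(student, teacher):.3f}",
        f"forward={kl(teacher, student):.3f}",
        f"blend={blended_loss(student, teacher, alpha=0.5):.3f}",
    )
```

In this toy case the collapsed student's forward KL is more than twice its reverse KL, so adding a forward-KL term raises the price of abandoning a valid branch while the diverse student is barely penalized either way.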
The takeaway
Off-policy distillation teaches the student the teacher's answers. On-policy distillation teaches the student to be the teacher in the situations the student actually gets itself into. That difference — training on your own mistakes rather than someone else's successes — is why it became the default. Just keep an eye on diversity while you do it.