On-policy distillation: learning from your own mistakes

If reasoning distillation is what the frontier transfers, on-policy distillation is how it transfers best. It's the technique a 2026 survey called "the indispensable post-training paradigm for scaling reasoning," and it's worth understanding precisely.

The problem with imitation

Standard ("off-policy") distillation trains the student on a fixed corpus the teacher generated. This is sequence-level KD, and it works, but it has a structural flaw known as exposure bias (compounding error, in imitation-learning terms):

The student only ever sees contexts the teacher would visit.
At inference, the student inevitably wanders into states the teacher never would, because it is a weaker model and makes errors the teacher does not.
It was never trained on those states, so a small early error snowballs into a wrong answer.

You trained it on a perfect driver's route; the moment it drifts off that route, it has never seen the shoulder.
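A back-of-the-envelope calculation shows why small errors matter. As a simplifying assumption (not a claim from any of the work cited here), suppose the student leaves the teacher's route independently with a fixed probability at each token; the chance that a whole trajectory stays on the route then shrinks geometrically with its length. The function name and the 1% error rate below are illustrative.

```python
def prob_stays_on_route(per_token_error: float, num_tokens: int) -> float:
    """Probability that every one of num_tokens steps stays on the teacher's route.

    Assumes, for illustration only, independent errors with a fixed per-token rate.
    """
    return (1.0 - per_token_error) ** num_tokens


# Compare a short answer, a paragraph, and a long reasoning trace.
for num_tokens in (10, 100, 1000):
    p = prob_stays_on_route(0.01, num_tokens)
    print(f"{num_tokens:>5} tokens: P(still on route) = {p:.4f}")
```

Under this toy model, a 1% per-token error rate leaves a little over a third of 100-token trajectories on the route, and a 1000-token reasoning trace almost never stays on it.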

The fix: train on the student's own trajectories

On-policy distillation flips the data source. The student generates its own outputs, and the teacher scores them token by token, typically with a mode-seeking reverse-KL objective, KL(student || teacher), evaluated on the student's own samples. Now:

The student trains on its own mistakes, learning to recover from the states it actually reaches.
Feedback is dense (a signal on every token) rather than sparse like RL's single final reward.
It's hard to "hack": reverse KL is low only when the student puts its probability on tokens the teacher also rates as likely, which essentially means behaving like the teacher.
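The loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not any lab's implementation: `toy_student` and `toy_teacher` are hypothetical stand-ins that return a next-token distribution over a three-token vocabulary, and a real system would run two language models and backpropagate through the student.

```python
import math
import random

VOCAB = ["a", "b", "c"]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one next-token distribution."""
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )


def toy_student(context):
    # Hypothetical student: the same next-token distribution in every context.
    return [0.6, 0.3, 0.1]


def toy_teacher(context):
    # Hypothetical teacher: prefers "b" right after an "a", otherwise prefers "a".
    if context and context[-1] == "a":
        return [0.1, 0.8, 0.1]
    return [0.7, 0.2, 0.1]


def on_policy_token_losses(student, teacher, length, rng):
    """Sample a trajectory from the student and score every position with the teacher."""
    context, losses = [], []
    for _ in range(length):
        s_probs = student(context)
        t_probs = teacher(context)  # the teacher sees the student's own context
        losses.append(reverse_kl(s_probs, t_probs))  # dense: one signal per token
        token = rng.choices(VOCAB, weights=s_probs, k=1)[0]
        context.append(token)  # the next state comes from the student, not the teacher
    return context, losses


trajectory, per_token = on_policy_token_losses(
    toy_student, toy_teacher, length=5, rng=random.Random(0)
)
mean_loss = sum(per_token) / len(per_token)
print(trajectory, [round(x, 3) for x in per_token], round(mean_loss, 3))
```

The key detail is that `context` is built from the student's samples, so every loss term is computed in a state the student actually reached.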

The unifying framework: GKD

The clean formalization is Generalized Knowledge Distillation (GKD) (Agarwal et al., Google DeepMind, 2023). GKD generalizes distillation along two dials:

Data source — interpolate between fixed teacher data (off-policy) and the student's own self-generated sequences (on-policy).
Divergence — forward KL (mode-covering), reverse KL (mode-seeking), or a generalized JSD in between.
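The divergence dial can be sketched as follows. This assumes one common form of the generalized JSD, in which a weight `beta` mixes the teacher and student distributions; the endpoints are special-cased to forward and reverse KL because the interpolated expression itself tends to zero there. The function names are illustrative, and the code works on plain probability lists rather than model logits.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def generalized_jsd(teacher, student, beta):
    """Interpolate between forward KL (beta=0) and reverse KL (beta=1)."""
    if beta == 0.0:
        return kl(teacher, student)  # forward KL: mode-covering
    if beta == 1.0:
        return kl(student, teacher)  # reverse KL: mode-seeking
    # Mixture of the two distributions, weighted toward the teacher by beta.
    mixture = [beta * t + (1.0 - beta) * s for t, s in zip(teacher, student)]
    return beta * kl(teacher, mixture) + (1.0 - beta) * kl(student, mixture)


teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]
for beta in (0.0, 0.5, 1.0):
    print(beta, round(generalized_jsd(teacher, student, beta), 4))
```

At `beta = 0.5` the expression reduces to the ordinary symmetric Jensen-Shannon divergence.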

This framework contains older methods as special cases — SeqKD, token-level KD, and MiniLLM (which introduced reverse-KL distillation for LLMs) all fall out of it. In practice you reach for it through Hugging Face TRL's GKDTrainer, where lmbda sets the fraction of on-policy data and beta selects the divergence.
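The data-source dial is simple enough to sketch directly. The helper below is hypothetical and is not TRL's API; it only illustrates what "fraction of on-policy data" means, reusing the name `lmbda` from the text. `toy_student_generate` stands in for sampling from a real student model.

```python
import random


def pick_training_sequence(prompt, teacher_completion, student_generate, lmbda, rng):
    """Return (sequence, source) for one training step.

    With probability lmbda the student writes the sequence (on-policy);
    otherwise the fixed teacher-written completion is used (off-policy).
    """
    if rng.random() < lmbda:
        return student_generate(prompt), "on-policy"
    return teacher_completion, "off-policy"


def toy_student_generate(prompt):
    # Hypothetical stand-in for sampling a completion from the student.
    return prompt + " <student draft>"


rng = random.Random(0)
sources = [
    pick_training_sequence("2+2=", "2+2= 4", toy_student_generate, 0.5, rng)[1]
    for _ in range(1000)
]
on_policy_fraction = sources.count("on-policy") / len(sources)
print(f"on-policy fraction: {on_policy_fraction:.2f}")
```

Setting `lmbda` to 0 recovers purely off-policy training on the fixed corpus, and setting it to 1 trains only on the student's own samples.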

Why it won in 2025–2026

The economics are decisive. Reported results:

Thinking Machines Lab (Oct 2025) hit ~70–74% on AIME'24 at roughly one-tenth the cost of reinforcement learning, and 9–30× cheaper than off-policy distillation — while also recovering instruction-following lost to domain fine-tuning (a cure for catastrophic forgetting).
Qwen3 distilled a 32B teacher's reasoning into an 8B student at ~1/10 the GPU hours of reinforcement learning.
On-policy distillation is now reported across DeepSeek-V4, Qwen3, Gemma 2, Nemotron, and MiMo.

The catch: diversity collapse

Reverse-KL is mode-seeking by design, and that's a double-edged sword. Push it too hard and the student drops the high-entropy "branch points" where reasoning legitimately has multiple valid paths. One 2026 study found that standard on-policy distillation retained only 6.8% of high-entropy tokens versus 18.5% in the teacher, which shows up as pass@1 improving while pass@k degrades. That's bad news if you rely on best-of-N sampling or inference-time scaling. The emerging fix is to blend mode-seeking reverse-KL with mode-covering forward-KL.
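A toy branch point makes the trade-off concrete. The sketch below is an assumption-laden illustration rather than the formulation of any particular paper: `alpha` is a hypothetical mixing weight, and the two-token distributions stand in for a position where the teacher treats two continuations as equally valid while the student has nearly collapsed onto one of them.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def entropy(p):
    """Shannon entropy in nats; high values mark branch points."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def blended_kl(teacher, student, alpha):
    """alpha * reverse KL + (1 - alpha) * forward KL (alpha is a hypothetical weight)."""
    return alpha * kl(student, teacher) + (1.0 - alpha) * kl(teacher, student)


# A branch point: the teacher considers two continuations equally valid,
# but the student has put almost all of its probability on the first one.
teacher = [0.5, 0.5]
collapsed_student = [0.99, 0.01]

reverse = blended_kl(teacher, collapsed_student, alpha=1.0)  # pure reverse KL
forward = blended_kl(teacher, collapsed_student, alpha=0.0)  # pure forward KL
mixed = blended_kl(teacher, collapsed_student, alpha=0.5)

print(f"teacher entropy: {entropy(teacher):.3f}")
print(f"student entropy: {entropy(collapsed_student):.3f}")
print(f"reverse={reverse:.3f} forward={forward:.3f} mixed={mixed:.3f}")
```

In this toy case the forward term penalizes the collapsed student more heavily than the reverse term does, so blending it in pushes probability back onto the dropped branch.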

The takeaway

Off-policy distillation teaches the student the teacher's answers. On-policy distillation teaches the student to be the teacher in the situations the student actually gets itself into. That difference — training on your own mistakes rather than someone else's successes — is why it became the default. Just keep an eye on diversity while you do it.