The distillation glossary
Every term a newcomer to model distillation needs — soft labels, dark knowledge, reverse KL, GGUF, on-policy distillation, the capacity gap, and more — each in one sentence.
A fast reference for the vocabulary of model distillation. Skim it once; come back when a paper throws a term at you.
Core concepts
Knowledge distillation (KD) — Training a small "student" model to reproduce the behavior of a larger "teacher" model.
Teacher model — The large, capable source model whose knowledge is being transferred.
Student model — The smaller target model trained to mimic the teacher.
Soft targets / soft labels — The teacher's full probability distribution over outputs, a far richer signal than a single correct answer.
Hard labels — Ground-truth one-hot targets: the correct answer and nothing else.
Logits — The raw, pre-softmax scores a model produces for each class or token.
Temperature (T) — A divisor applied to the logits before softmax; T > 1 flattens the output distribution so the teacher's relative preferences among non-top outputs become visible.
Dark knowledge — The information hidden in the relative probabilities the teacher assigns to incorrect answers.
KL divergence — The asymmetric measure of difference between two probability distributions used as the core distillation loss.
Forward KL (mode-covering) — Minimizing KL(teacher ‖ student), which pushes the student to put probability everywhere the teacher does; a low-capacity student can end up smearing mass onto regions the teacher considers unlikely.
Reverse KL (mode-seeking) — A divergence that pushes the student to concentrate on the teacher's dominant modes; better for limited-capacity students.
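
The temperature, soft-target, and KL entries above can be made concrete with a few lines of arithmetic. The following is a minimal sketch in plain Python, not a training loop: the logits are made-up numbers, and the helper names `softmax` and `kl` are illustrative rather than taken from any library.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) in nats; assumes q has no exact zeros where p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for a three-class problem.
teacher_logits = [6.0, 3.0, 1.0]
student_logits = [4.0, 1.0, 2.0]

# Raising T flattens the teacher's distribution, so the "wrong" classes
# receive visibly more probability (the dark knowledge).
sharp = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)

# Soft targets at a shared temperature, then both directions of KL.
T = 4.0
p_teacher = softmax(teacher_logits, T)
q_student = softmax(student_logits, T)
forward_kl = kl(p_teacher, q_student)  # expectation under the teacher
reverse_kl = kl(q_student, p_teacher)  # expectation under the student

print("T=1:", [round(x, 3) for x in sharp])
print("T=4:", [round(x, 3) for x in soft])
print("forward KL:", round(forward_kl, 4), "reverse KL:", round(reverse_kl, 4))
```

The two KL values differ because the expectation is taken under a different distribution in each case, which is the asymmetry the forward/reverse entries describe.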
Kinds of distillation
Response-based distillation — Transferring knowledge from the teacher's final output layer (soft targets / logits).
Feature-based distillation — Transferring knowledge from the teacher's intermediate-layer activations (e.g. FitNets).
Relation-based distillation — Transferring the relationships between layers or between data samples (e.g. RKD, FSP).
Self-distillation — Distilling into a same-architecture student to improve rather than compress (e.g. Born-Again Networks).
Online vs. offline distillation — Whether teacher and student train simultaneously (online) or the teacher is pre-trained and frozen (offline).
Sequence-level KD (SeqKD) — Training the student on whole sequences the teacher generates, rather than per-token distributions.
On-policy distillation — The student generates its own outputs and the teacher scores them token-by-token, reducing the mismatch between the sequences seen in training and those the student produces at inference.
GKD (Generalized Knowledge Distillation) — A framework unifying on-policy data sourcing with a choice of divergence (forward/reverse KL, JSD).
Reasoning / CoT distillation — Transferring a teacher's chain-of-thought reasoning traces into a smaller model.
Synthetic-data distillation — Using a teacher to generate the training corpus, then doing standard fine-tuning on it.
Preference / DPO distillation — Distilling alignment or preference signals rather than just token distributions.
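
To illustrate the "choice of divergence" in the GKD entry above, here is a hedged sketch of a β-weighted Jensen-Shannon divergence averaged over token positions. The weighting shown is the commonly described generalized JSD form and should be treated as an assumption rather than a reference implementation; the per-token distributions are hand-written stand-ins for what a teacher and student might assign along a student-generated sequence, and all function names are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def generalized_jsd(p, q, beta=0.5):
    """beta * KL(p || m) + (1 - beta) * KL(q || m), with m = beta*p + (1-beta)*q."""
    m = [beta * pi + (1 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, m) + (1 - beta) * kl(q, m)

def token_level_loss(teacher_dists, student_dists, divergence):
    """Average a per-token divergence over all positions in one sequence."""
    per_token = [divergence(p, q) for p, q in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token)

# Toy next-token distributions over a 3-token vocabulary at two positions.
teacher_dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student_dists = [[0.5, 0.3, 0.2], [0.3, 0.4, 0.3]]

loss_forward = token_level_loss(teacher_dists, student_dists, kl)
loss_reverse = token_level_loss(
    teacher_dists, student_dists, lambda p, q: kl(q, p)
)
loss_jsd = token_level_loss(teacher_dists, student_dists, generalized_jsd)

print(round(loss_forward, 4), round(loss_reverse, 4), round(loss_jsd, 4))
```

Swapping the `divergence` argument is the only change needed to move between objectives, which is the sense in which the divergence is a free choice once the data source is fixed.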
Compression neighbors
Quantization — Reducing the numeric precision of a model's weights/activations (e.g. FP16 → INT4) to shrink it.
GGUF — The de-facto file format for distributing quantized models for llama.cpp and Ollama.
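Q4_K_M — A popular GGUF k-quant tier that averages roughly 4.8 bits per weight (mostly 4-bit blocks, with some tensors kept at higher precision), balancing size and quality.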
Pruning — Removing weights, neurons, heads, or whole layers from a trained model.
Structured vs. unstructured pruning — Removing whole groups (hardware-friendly) vs. individual weights (high compression, irregular).
QAT (quantization-aware training) — Training that simulates quantization, often using the full-precision model as a distillation target.
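
The quantization and pruning entries above can be sketched on a toy weight vector. This is a minimal illustration assuming simple per-tensor symmetric round-to-nearest quantization and unstructured magnitude pruning; it is not the block-wise scheme used by GGUF k-quants, and the function names and weight values are hypothetical.

```python
def quantize_symmetric(weights, bits=4):
    """Map floats to signed integer codes using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]

def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero the smallest-magnitude fraction of weights."""
    k = int(len(weights) * sparsity)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

weights = [0.82, -0.41, 0.05, -0.97, 0.33, 0.10]

codes, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

pruned = magnitude_prune(weights, sparsity=0.5)

print("codes:", codes)
print("max round-trip error:", round(max_error, 4), "scale:", round(scale, 4))
print("pruned:", pruned)
```

Round-to-nearest keeps each weight within half a quantization step of its original value. QAT runs a quantize-then-dequantize step like this inside the forward pass so the model learns to tolerate that error.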
Limits & pitfalls
Capacity gap — The phenomenon where a teacher too far above the student's capacity yields a worse distilled student.
Model collapse — Degradation from recursively training on AI-generated data, losing the distribution's tails (distinct from curated distillation).
Mode / diversity collapse — Loss of output variety from mode-seeking objectives, often showing up as pass@1 improving while pass@k degrades.
SLM (Small Language Model) — A compact model (roughly under 15B, often under 4B parameters) built for efficiency and on-device use, frequently distilled.
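
The pass@1 versus pass@k pattern in the mode / diversity collapse entry can be shown with the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The correct-answer counts below are invented to show the effect and do not come from any real evaluation.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples (drawn from n) is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

n = 10  # samples drawn per problem

# Hypothetical counts of correct samples on two problems.
diverse = [3, 3]      # sometimes right on both problems
collapsed = [10, 0]   # always right on one, never right on the other

for name, counts in [("diverse", diverse), ("collapsed", collapsed)]:
    p1 = mean_pass_at_k(counts, n, k=1)
    p5 = mean_pass_at_k(counts, n, k=5)
    print(name, "pass@1 =", round(p1, 3), "pass@5 =", round(p5, 3))
```

In this toy case the collapsed model has the higher pass@1 (0.5 against 0.3) but the lower pass@5 (0.5 against about 0.917), because it never samples a correct answer for the second problem.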