The distillation glossary

A fast reference for the vocabulary of model distillation. Skim it once; come back when a paper throws a term at you.

Core concepts

Knowledge distillation (KD) — Training a small "student" model to reproduce the behavior of a larger "teacher" model.

Teacher model — The large, capable source model whose knowledge is being transferred.

Student model — The smaller target model trained to mimic the teacher.

Soft targets / soft labels — The teacher's full probability distribution over outputs, a far richer signal than a single correct answer.

Hard labels — Ground-truth one-hot targets: the correct answer and nothing else.

Logits — The raw, pre-softmax scores a model produces for each class or token.

Temperature (T) — A scalar that divides the logits before the softmax. T > 1 "softens" the output distribution, exposing which wrong answers the teacher considers similar to the right one; T = 1 leaves it unchanged.
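
A minimal sketch of temperature scaling, using only the Python standard library. The logits and class names are made up for illustration; the function name is hypothetical, not taken from any library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes, e.g. cat, dog, truck.
teacher_logits = [6.0, 4.0, -2.0]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

print("T=1:", [round(p, 4) for p in sharp])
print("T=4:", [round(p, 4) for p in soft])
```

At T = 4 the top class keeps the largest share, but the runner-up classes receive visibly more probability than at T = 1. That extra mass on the "wrong" answers is what the student learns from.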

Dark knowledge — The information hidden in the relative probabilities the teacher assigns to incorrect answers.

KL divergence — An asymmetric measure of how one probability distribution differs from another; the most common core distillation loss.
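
To make the entries above concrete, here is a rough sketch of the classic response-based loss: a weighted sum of cross-entropy on the hard label and temperature-scaled KL against the teacher's soft targets. The T squared factor follows the usual convention for keeping gradient magnitudes comparable across temperatures. The function names, the alpha weighting, and the example logits are assumptions for illustration, not a specific library's API.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    # Soft term: KL(teacher || student) at temperature T, scaled by T^2.
    p_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    soft_term = temperature ** 2 * kl_divergence(p_teacher, q_student)
    # Hard term: ordinary cross-entropy against the ground-truth class at T = 1.
    hard_term = -math.log(softmax(student_logits)[label])
    return alpha * hard_term + (1 - alpha) * soft_term

# Hypothetical logits; class 0 is the correct label.
teacher_logits = [6.0, 4.0, -2.0]
good_student = [5.0, 3.5, -1.0]
bad_student = [-1.0, 0.0, 4.0]

print(distillation_loss(good_student, teacher_logits, label=0))
print(distillation_loss(bad_student, teacher_logits, label=0))
```

A student whose logits track the teacher's gets a much smaller loss than one that ranks the classes the other way round.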

Forward KL (mode-covering) — KL(teacher ‖ student). Pushes the student to put probability everywhere the teacher does; a low-capacity student can end up spreading mass over regions the teacher considers unlikely.

Reverse KL (mode-seeking) — KL(student ‖ teacher). Pushes the student to concentrate on the teacher's dominant modes; often preferred for limited-capacity students, at some cost in output diversity.
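
The covering/seeking distinction is easiest to see with numbers. The sketch below assumes a toy bimodal teacher over three outcomes and two candidate students that cannot match it exactly: one broad, one peaked on a single mode. All distributions are invented for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy teacher with two modes and a near-empty valley between them.
teacher = [0.495, 0.01, 0.495]

# Two imperfect students (hypothetical):
covering = [1 / 3, 1 / 3, 1 / 3]   # spreads mass everywhere, valley included
seeking = [0.98, 0.01, 0.01]       # commits to one mode, ignores the other

forward = {name: kl(teacher, q) for name, q in
           [("covering", covering), ("seeking", seeking)]}
reverse = {name: kl(q, teacher) for name, q in
           [("covering", covering), ("seeking", seeking)]}

print("forward KL(teacher || student):", forward)
print("reverse KL(student || teacher):", reverse)
```

With these toy numbers, forward KL gives the broad student the lower loss, while reverse KL gives the peaked student the lower loss. The same pair of students is ranked in opposite order depending on the direction of the divergence.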

Kinds of distillation

Response-based distillation — Transferring knowledge from the teacher's final output layer (soft targets / logits).

Feature-based distillation — Transferring knowledge from the teacher's intermediate-layer activations (e.g. FitNets).

Relation-based distillation — Transferring the relationships between layers or between data samples (e.g. RKD, FSP).

Self-distillation — Distilling into a same-architecture student to improve rather than compress (e.g. Born-Again Networks).

Online vs. offline distillation — Whether teacher and student train simultaneously (online) or the teacher is pre-trained and frozen (offline).

Sequence-level KD (SeqKD) — Training the student on whole sequences the teacher generates, rather than per-token distributions.

On-policy distillation — The student generates its own outputs and the teacher scores them token by token, reducing the mismatch between training and inference (exposure bias).
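
A toy sketch of the on-policy loop, under heavy simplifying assumptions: both "models" are hypothetical lookup tables that condition only on the previous token, and the per-token loss is reverse KL (one common choice, not the only one). A real setup would backpropagate this loss through the student; here it is only evaluated.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]

# Hypothetical next-token distributions over VOCAB, keyed by the previous token.
TEACHER = {
    "<bos>": [0.7, 0.2, 0.1],
    "a": [0.1, 0.6, 0.3],
    "b": [0.2, 0.1, 0.7],
}
STUDENT = {
    "<bos>": [0.4, 0.4, 0.2],
    "a": [0.3, 0.3, 0.4],
    "b": [0.3, 0.3, 0.4],
}

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sample_from_student(rng, max_len=8):
    """Roll out a sequence from the student's own distribution."""
    prev, tokens = "<bos>", []
    for _ in range(max_len):
        tok = rng.choices(VOCAB, weights=STUDENT[prev])[0]
        tokens.append(tok)
        if tok == "<eos>":
            break
        prev = tok
    return tokens

def on_policy_loss(tokens):
    """Mean per-token KL(student || teacher) along a student rollout."""
    prev, total = "<bos>", 0.0
    for tok in tokens:
        # The teacher scores the contexts the student actually visited.
        total += kl(STUDENT[prev], TEACHER[prev])
        prev = tok
    return total / len(tokens)

rng = random.Random(0)
rollout = sample_from_student(rng)
print(rollout, on_policy_loss(rollout))
```

The key point is where the contexts come from: the teacher is queried on prefixes sampled from the student, not on prefixes from a fixed dataset.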

GKD (Generalized Knowledge Distillation) — A framework unifying on-policy data sourcing with a choice of divergence (forward/reverse KL, JSD).
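
A small sketch of the two knobs GKD exposes, as described above: where the training sequences come from, and which divergence compares student and teacher. The generalized JSD below interpolates with a weight beta between the two distributions; the function names and the string labels for the data sources are hypothetical.

```python
import math
import random

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def generalized_jsd(p, q, beta=0.5):
    """beta * KL(p || m) + (1 - beta) * KL(q || m), with m = beta*p + (1-beta)*q.

    Assumes 0 < beta < 1 so the mixture m is positive wherever p or q is.
    """
    m = [beta * pi + (1 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, m) + (1 - beta) * kl(q, m)

def choose_source(rng, on_policy_fraction):
    """Pick where the next training sequence comes from."""
    if rng.random() < on_policy_fraction:
        return "student_rollout"   # on-policy: sampled from the student
    return "fixed_dataset"         # off-policy: ground truth or teacher output

teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]
rng = random.Random(0)

print(generalized_jsd(teacher, student, beta=0.5))
print([choose_source(rng, on_policy_fraction=0.5) for _ in range(4)])
```

Setting the on-policy fraction to 0 recovers purely offline training on a fixed dataset; setting it to 1 trains only on student rollouts.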

Reasoning / CoT distillation — Transferring a teacher's chain-of-thought reasoning traces into a smaller model.

Synthetic-data distillation — Using a teacher to generate the training corpus, then doing standard fine-tuning on it.

Preference / DPO distillation — Distilling alignment or preference signals rather than just token distributions.

Compression neighbors

Quantization — Reducing the numeric precision of a model's weights/activations (e.g. FP16 → INT4) to shrink it.
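
A deliberately simplified sketch of what "FP16 → INT4" means: symmetric round-to-nearest quantization with a single scale for the whole tensor. Practical schemes typically use per-block scales and more elaborate encodings, so treat this as an illustration of the idea only. The weights are made up.

```python
def quantize_int4(weights):
    """Map floats to signed 4-bit integer codes in [-8, 7] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

# Hypothetical weights from one small tensor.
weights = [0.82, -0.31, 0.05, -1.40, 0.67]

codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("scale:", scale)
print("max abs error:", max_error)
```

Each weight now needs 4 bits plus a share of the scale, and the worst-case rounding error is about half a quantization step.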

GGUF — The de-facto file format for distributing quantized models for llama.cpp and Ollama.

Q4_K_M — A popular GGUF k-quant tier at roughly 4.5 to 5 bits per weight, balancing size and quality.

Pruning — Removing weights, neurons, heads, or whole layers from a trained model.

Structured vs. unstructured pruning — Removing whole groups (hardware-friendly) vs. individual weights (high compression, irregular).
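
The difference between the two pruning styles, sketched on a tiny weight matrix. Magnitude and row-norm criteria are assumed here because they are simple and common; other importance scores exist. The matrix and function names are illustrative.

```python
def prune_unstructured(matrix, sparsity):
    """Zero out the smallest-magnitude individual weights.

    Ties at the threshold may push the zero count slightly above the target.
    """
    flat = sorted(abs(w) for row in matrix for w in row)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return [row[:] for row in matrix]
    threshold = flat[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]

def prune_structured_rows(matrix, n_rows):
    """Drop the n_rows rows (e.g. neurons) with the smallest squared L2 norm."""
    norms = [sum(w * w for w in row) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[: len(matrix) - n_rows])  # preserve original row order
    return [matrix[i][:] for i in keep]

# Hypothetical 3x3 weight matrix; each row stands for one neuron.
W = [
    [0.90, -0.10, 0.40],
    [0.05, 0.02, -0.03],
    [-0.70, 0.60, 0.20],
]

sparse = prune_unstructured(W, sparsity=0.5)
smaller = prune_structured_rows(W, n_rows=1)

print(sparse)   # same shape, scattered zeros
print(smaller)  # one fewer row, still dense
```

The unstructured result keeps its shape and needs sparse kernels to run faster; the structured result is simply a smaller dense matrix, which is why it maps well onto ordinary hardware.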

QAT (quantization-aware training) — Training that simulates quantization, often using the full-precision model as a distillation target.

Limits & pitfalls

Capacity gap — The phenomenon where a teacher too far above the student's capacity yields a worse distilled student.

Model collapse — Degradation from recursively training on AI-generated data, losing the distribution's tails (distinct from curated distillation).

Mode / diversity collapse — Loss of output variety from mode-seeking objectives, often showing up as pass@1 improving while pass@k degrades.
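
The pass@1 versus pass@k symptom can be reproduced with the commonly used unbiased pass@k estimator: given n samples per problem of which c are correct, pass@k = 1 - C(n - c, k) / C(n, k). The two-problem benchmark below is invented purely to show the effect.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given each problem's correct-sample count."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N = 10  # samples drawn per problem

# Hypothetical results on two problems (number of correct samples out of 10).
diverse = [3, 3]      # varied outputs: sometimes right on both problems
collapsed = [10, 0]   # one answer pattern: always right on one, never on the other

for name, counts in [("diverse", diverse), ("collapsed", collapsed)]:
    print(name,
          "pass@1 =", round(mean_pass_at_k(counts, N, 1), 3),
          "pass@10 =", round(mean_pass_at_k(counts, N, 10), 3))
```

In this toy case the collapsed model raises mean pass@1 from 0.3 to 0.5 while mean pass@10 drops from 1.0 to 0.5: it is more reliable on what it can do, but it has stopped exploring alternatives that would have solved the second problem.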

SLM (Small Language Model) — A compact model (roughly under 15B, often under 4B parameters) built for efficiency and on-device use, frequently distilled.