Foundations · 3 min read

Distillation vs. quantization vs. pruning

Three different ways to shrink a model — knowledge transfer, precision reduction, and removal — what each one actually changes, and how to stack them into one local-ready pipeline.

In short

Distillation trains a new, smaller model to mimic a larger one (knowledge transfer); quantization stores the same model's weights at lower precision like FP16→INT4 (precision reduction); pruning removes weights, neurons, or layers (removal). They are orthogonal and stack: a common pipeline is prune → distill to recover accuracy → quantize to a Q4_K_M GGUF for local use.

People use "compression," "distillation," and "quantization" almost interchangeably. They're not the same thing — they change different parts of a model, and the best local models often use all three. Getting the distinctions straight is the fastest way to stop being confused by model names and release notes.

The one-line difference

| Technique | What it changes | What you get |
| --- | --- | --- |
| Distillation | Trains a new, smaller architecture to mimic a teacher | Fewer parameters, retained capability |
| Pruning | Removes weights, neurons, heads, or layers | A sparser or structurally smaller model |
| Quantization | Reduces numeric precision of weights (e.g. FP16 → INT4) | Same parameters, fewer bits each |

Said even more briefly: distillation = knowledge transfer. Pruning = removal. Quantization = precision reduction.

Quantization, a little deeper

Quantization is the cheapest and most common way to make a model runnable locally, because it changes only how the weights are stored, not the architecture or the parameter count. Each 16-bit weight becomes a 4-bit approximation, so the weights shrink by a little under 4× once per-block scale factors are counted, usually at a small but measurable quality cost.
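To make "a 4-bit approximation" concrete, here is a minimal sketch of symmetric round-to-nearest quantization on a single block of weights. It is an illustration only: the function names and the sample block are hypothetical, and real formats such as the k-quants use more elaborate schemes than this.

```python
def quantize_block(weights, bits=4):
    """Symmetric round-to-nearest quantization of one block of floats."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit integers
    max_abs = max(abs(w) for w in weights)
    # One shared scale per block maps the largest weight onto qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round, then clamp to the signed range [-qmax - 1, qmax].
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_block(q, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [qi * scale for qi in q]


# Hypothetical block of weights, chosen only for illustration.
block = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88, 0.02, 0.45]
q, scale = quantize_block(block)
restored = dequantize_block(q, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(q, scale, max_error)
```

Because every weight is rounded to the nearest multiple of the shared scale, the per-weight error is at most half a scale step. That is why a single outlier weight hurts: it inflates the scale for the whole block.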

You'll meet these names constantly:

  • GGUF: the de facto file format for distributing quantized models for llama.cpp and Ollama. The label Q4_K_M decodes as: Q4 ≈ 4-bit, _K = k-quant (smarter per-block scaling), _M = the medium mixed-precision tier. Q4_K blocks cost about 4.5 bits/weight, and Q4_K_M averages slightly more because it keeps some tensors at higher precision; it is the community's usual quality-vs-size sweet spot.
  • GPTQ / AWQ: GPU-oriented post-training quantizers. AWQ ("activation-aware") protects the ~1% of weight channels that matter most, which helps reasoning and code hold up at low bit widths.
  • FP8 / NVFP4: floating-point formats that preserve dynamic range, which helps quality stay close to FP16. FP8 runs fastest on GPUs with native FP8 support, and 4-bit float (NVFP4) is now reaching consumer hardware.
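The "about 4.5 bits/weight" figure is simple arithmetic: the 4-bit weights plus the scale metadata shared by each block. The sketch below assumes the simplest layout, blocks of 32 weights sharing one 16-bit scale, and a hypothetical 8-billion-parameter model. The helper names are illustrative, and real files also carry metadata and some higher-precision tensors, so treat the result as a rough estimate.

```python
def bits_per_weight(block_size, weight_bits, scale_bits):
    """Average storage cost when each block of weights shares one scale."""
    return (block_size * weight_bits + scale_bits) / block_size


def model_size_gb(n_params, bpw):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return n_params * bpw / 8 / 1e9


# Assumed layout: 32 weights per block, one 16-bit scale per block.
bpw_4bit = bits_per_weight(block_size=32, weight_bits=4, scale_bits=16)

# Hypothetical 8-billion-parameter model at FP16 versus the 4-bit layout.
fp16_gb = model_size_gb(8e9, 16)
q4_gb = model_size_gb(8e9, bpw_4bit)
print(f"{bpw_4bit} bits/weight: {fp16_gb:.1f} GB -> {q4_gb:.1f} GB")
```

Under these assumptions the weights go from 16 GB to 4.5 GB, a reduction of roughly 3.6× rather than a clean 4×.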

Pruning, a little deeper

Pruning removes parts of the network judged unimportant.

  • Unstructured pruning zeros individual weights: high compression, but the irregular sparsity is hard for hardware to exploit.
  • Structured pruning removes whole groups (neurons, attention heads, channels, or entire layers): friendlier to hardware, at some accuracy cost.
  • Semi-structured pruning (e.g. 2:4) keeps at most 2 nonzero weights in every group of 4, a regular pattern that sparse tensor cores on recent NVIDIA GPUs can accelerate, with up to ~2× matrix-multiply throughput.
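The difference between unstructured and 2:4 sparsity is easiest to see on one small row of weights. This is a minimal sketch that uses weight magnitude as the importance score, which is one common heuristic and an assumption here. The function names and the sample row are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured: zero the smallest-magnitude fraction of the weights."""
    n_drop = int(len(weights) * sparsity)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:n_drop])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def prune_2_of_4(weights):
    """Semi-structured 2:4: keep the 2 largest-magnitude weights per group of 4."""
    assert len(weights) % 4 == 0, "length must be a multiple of 4"
    out = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        ranked = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:2])
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out


# Hypothetical row: large weights in the first half, small ones in the second.
row = [0.9, -0.8, 0.7, -0.6, 0.05, 0.3, -0.02, 0.01]
unstructured = magnitude_prune(row, sparsity=0.5)
semi_structured = prune_2_of_4(row)
print(unstructured)
print(semi_structured)
```

Both results are 50% sparse, but the unstructured version keeps all four large weights and wipes out the second group entirely, while 2:4 must keep exactly two per group. That constraint is the accuracy price paid for a pattern hardware can exploit.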

The punchline: they stack

These aren't competitors — they're stages of a pipeline. Two canonical recipes:

  1. Prune → distill → quantize. NVIDIA's Minitron does exactly the first two: prune a Llama-3.1-8B down, then use distillation against the unpruned model to recover the lost accuracy — reaching a strong 4B with up to 40× fewer training tokens than training from scratch. Finish by quantizing to a GGUF and you have a small, cheap, capable local model.
  2. Distill, then quantize. Train a small student from a big teacher, then ship it as a Q4_K_M GGUF for laptops.

Distillation even shows up inside quantization: quantization-aware training (QAT) can use the full-precision model's outputs as distillation targets so the quantized version stays close to the original. Gemma 3 ships QAT checkpoints built this way.
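The "mimic the teacher" objective behind both recipes can be sketched as a loss between output distributions. The version below is the common temperature-softened KL divergence with the conventional T² scaling. It is an assumption about the general technique, not the specific loss used by Minitron or Gemma 3, and the logits are made-up values.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Conventional T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


# Hypothetical logits over a 3-token vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 1.0, 4.0]

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(loss_close, loss_far)
```

A student whose logits track the teacher's gets a loss near zero, and one that ranks the tokens in the opposite order is penalized heavily. In a QAT setting the "student" would be the quantized model and the "teacher" its full-precision original.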

Which do you reach for?

  • Just want to run an existing good model on your hardware? Quantize (or download a GGUF someone already quantized).
  • Want a smaller model that's genuinely good at your task, not just a compressed giant? Distill.
  • Need to squeeze a specific architecture for a specific device? Prune, then distill to recover, then quantize.

The art of building a great local model is knowing how to combine all three. The rest of the knowledge base is about doing exactly that.