The best open models you can run yourself.
A curated directory of top open-source and distilled models, built to answer one question: what’s the best model I can run on my hardware for my task? Filter by the RAM you have and the job you need done.
Sizes are approximate Q4_K_M GGUF footprints — budget KV-cache on top. Curated list, hand-reviewed; trending strip auto-refreshed from Hugging Face.
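To make "budget KV-cache on top" concrete, here is a minimal sketch of the usual back-of-envelope estimate: one key tensor and one value tensor per layer, each sized by KV heads × head dimension × context length. The function names, the 1 GiB headroom, the 4.6 GiB weight figure, and the architecture numbers (32 layers, 8 KV heads, head dimension 128, typical of an 8B-class model with grouped-query attention) are illustrative assumptions, not values taken from this directory. Read the real numbers from the model's config before relying on the result.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Rough KV-cache size: keys plus values for every layer and token position.

    bytes_per_value=2 assumes an fp16 cache; a quantized cache would be smaller.
    """
    # The leading 2 counts one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


def fits(weights_gib, kv_bytes, ram_gib, headroom_gib=1.0):
    """True if weights + KV cache + a safety margin fit in the given memory.

    headroom_gib is a hypothetical allowance for runtime buffers and the OS.
    """
    return weights_gib + kv_bytes / GIB + headroom_gib <= ram_gib


# Assumed 8B-class architecture at an 8k context with an fp16 cache.
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
print(f"KV cache: {kv / GIB:.2f} GiB")

# 4.6 GiB is a placeholder weight footprint, not a figure from the directory.
print("fits in 8 GiB:", fits(4.6, kv, ram_gib=8))
print("fits in 6 GiB:", fits(4.6, kv, ram_gib=6))
```

The cache grows linearly with context length, so doubling the context doubles this term while the weight footprint stays fixed.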
31 models
★Qwen3-Embedding-0.6B
Best small multilingual embedding model for RAG — tiny, fast, with an official GGUF.
★Qwen3-4B
The best small brain that fits in 4GB — toggles a thinking mode on for hard problems, off for fast triage.
★Qwen3-8B
The best balanced default at this size — strong tool-use and a toggleable thinking mode, all under Apache-2.0.
★Qwen3-30B-A3B-2507
Reasons like a 30B, decodes like a 3B — the best local agent brain. Fits 16GB at Q3, 24GB comfortably at Q4.
★Qwen2.5-Coder-32B
The mature dense code workhorse for a single 24GB GPU — fill-in-the-middle, long context, rock-solid.
★R1-Distill-Qwen-32B
The top open dense reasoner you can run on one 24GB card — the model that beat o1-mini on several benchmarks.
★Llama 3.3 70B
The stable general-purpose 70B — runs on 2×24GB or a 48GB Mac, with broad ecosystem support.
★gpt-oss-120b
o4-mini-class reasoning you can self-host — Apache-2.0, MoE-fast, ~63GB at every quant.
Nomic Embed v1.5
Tiny, fast English embeddings with 8k context — the lightweight default for local RAG.
BGE-M3
Hybrid dense+sparse+ColBERT retrieval in one model, great on long documents. MIT-licensed.
moondream2
A tiny, fast VLM for edge devices — image Q&A and captioning in under 2GB.
Llama 3.2 3B
Tight, fast, and maximally compatible — a dependable default for classification and lightweight chat.
SmolLM3-3B
The strongest fully-open 3B — open data, open recipe, dual think/no-think. A distillation-community favorite.
Gemma 3 4B
Vision-capable, 128k context, 140+ languages — a versatile small multimodal chat/summarizer.
Phi-4-mini
Microsoft's synthetic-data thesis in miniature — punches well above its size on math and structured tasks.
R1-Distill-Qwen-7B
A math-leaning reasoner — R1 distilled onto a Qwen math base. Strong AIME-style performance for 7B.
R1-Distill-Llama-8B
Chain-of-thought reasoning in an 8B — DeepSeek-R1's traces distilled into a Llama base. The viral 2025 recipe.
Llama 3.1 8B
The most battle-tested 8B — maximum ecosystem and tooling compatibility when you want a known-good base.
Qwen2.5-VL-7B
The go-to small vision-language model — OCR, documents, video, and GUI grounding, Apache-licensed.
Gemma 3 12B
Multimodal, 128k context, 140+ languages — and itself trained with knowledge distillation.
Qwen3-14B
A fully-on-GPU dense option with big KV-cache headroom — the safe 16GB workhorse.
Phi-4-reasoning-plus
A 14B reasoner that rivals much larger distills on AIME — CoT supervised fine-tuning plus RL.
R1-Distill-Qwen-14B
The mid-size sweet spot for distilled reasoning — most of R1-32B's ability in a 16GB-friendly model.
gpt-oss-20b
OpenAI's open MoE: o3-mini-class reasoning, Apache-2.0, and a fixed ~12GB footprint at every quant.
Mistral Small 3.2 24B
The best large *Apache* chat model — multimodal, strong tool-use, comfortable on a 24GB card.
Gemma 3 27B
Gemini-1.5-Pro-class on benchmarks, fully local — ships official distilled QAT int4 checkpoints.
Qwen3-Coder-30B-A3B
The best local agentic coder for 16–24GB — native tool-call format for Cline/Qwen Code, MoE-fast.
Qwen3-32B
The strongest dense Qwen3 generalist — a do-everything model for a 24GB card.
QwQ-32B
Qwen's dedicated reasoning model with the cleanest license in its class — Apache-2.0 throughout.
R1-Distill-Llama-70B
Heavy local chain-of-thought — R1's reasoning distilled into a 70B Llama for serious work.
Who learned from whom.
The thing that makes these models small is that they were taught by something larger. Here’s the teacher→student map behind the directory.
People worth following.
The quantizers, fine-tuners, and distillers turning frontier models into things you can actually run.
@Jackrong
The standout viral distiller of the moment: the “Qwopus” series distills Claude Opus reasoning into Qwen students, with speculative-decoding GGUF builds.
@bartowski
The most-trusted GGUF quantizer, with first-day imatrix quants of nearly every major release.
@unsloth
2–5× faster fine-tuning, plus the “Dynamic” GGUFs that benchmark ahead of standard quants.
@mradermacher
The most prolific quantizer, blanketing new releases with static and imatrix quants.
@mlabonne
Author of the LLM Course; popularized abliteration and accessible merging/distillation tooling.
@NousResearch
The Hermes family: hybrid reasoning and function-calling models built on open bases.
@arcee-ai
DistillKit and the SuperNova line: serious cross-architecture logit distillation in the open.
@deepseek-ai
The R1-Distill series that kicked off the open reasoning-distillation wave.