The Stillhouse

The best open models you can run yourself.

A curated directory of top open-source and distilled models, built to answer one question: what’s the best model I can run on my hardware for my task? Filter by the RAM you have and the job you need done.

Sizes are approximate Q4_K_M GGUF footprints — budget KV-cache on top. Curated list, hand-reviewed; trending strip auto-refreshed from Hugging Face.
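As a rough sketch of what "budget KV-cache on top" means in practice, the snippet below estimates the cache for one sequence and checks whether weights plus cache fit in a given amount of memory. The function names, the 1GB runtime overhead, and the 8B-class model shape (36 layers, 8 KV heads, head dimension 128) are illustrative assumptions, not values from this directory; read the real numbers from the model's config before relying on them. It also treats GB and GiB as interchangeable, which is within the error of the approximate sizes listed here.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_elem=2):
    """Approximate KV-cache size in GiB for one sequence.

    The leading 2 covers the separate key and value tensors stored per layer.
    bytes_per_elem=2 assumes an fp16 cache; a quantized cache would be smaller.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem
    return total_bytes / 1024**3


def fits(weights_gb, kv_gb, ram_gb, overhead_gb=1.0):
    """True if weights + KV-cache + a hypothetical runtime overhead fit in ram_gb."""
    return weights_gb + kv_gb + overhead_gb <= ram_gb


# Assumed shape for an 8B-class model with grouped-query attention
# (36 layers, 8 KV heads, head dim 128); check the model's config.json.
kv = kv_cache_gb(n_layers=36, n_kv_heads=8, head_dim=128, ctx_tokens=32 * 1024)
print(f"KV-cache at 32k ctx: {kv:.1f} GiB")
print("fits in 16GB:", fits(weights_gb=5.0, kv_gb=kv, ram_gb=16))
print("fits in 8GB: ", fits(weights_gb=5.0, kv_gb=kv, ram_gb=8))
```

Under these assumptions, a full 32k context costs about as much memory as the ~5GB of Q4_K_M weights themselves, which is why a model that "fits" on paper can still run out of room at long context.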

▲ Trending on Hugging Face · updated 2026-06-18

openai/gpt-oss-120b · Qwen/Qwen3-235B-A22B-Instruct-2507 · deepseek-ai/DeepSeek-R1-Distill-Qwen-32B · google/gemma-3-27b-it · Qwen/Qwen3-30B-A3B-Instruct-2507 · mistralai/Mistral-Small-3.2-24B-Instruct-2506 · openai/gpt-oss-20b · Qwen/Qwen3-8B


31 models

★Qwen3-Embedding-0.6B

0.6B·dense

Apache-2.0

Best small multilingual embedding model for RAG — tiny, fast, with an official GGUF.

~0.6GB · Q4_K_M · 32k ctx

★Qwen3-4B

4B·dense·hybrid-reason

Apache-2.0

The best small brain that fits in 4GB — toggles a thinking mode on for hard problems, off for fast triage.

~2.5GB · Q4_K_M · 32k ctx

🤗 model GGUF

★Qwen3-8B

8.2B·dense·hybrid-reason

Apache-2.0

The best balanced default at this size — strong tool-use and a toggleable thinking mode, all under Apache-2.0.

~5GB · Q4_K_M · 32k ctx

🤗 model GGUF

★Qwen3-30B-A3B-2507

30B·MoE · 3B active·hybrid-reason

Apache-2.0

Reasons like a 30B, decodes like a 3B — the best local agent brain. Fits 16GB at Q3, 24GB comfortably at Q4.

~18.6GB · Q4_K_M · 256k ctx

🤗 model GGUF

★Qwen2.5-Coder-32B

32B·dense

Apache-2.0

The mature dense code workhorse for a single 24GB GPU — fill-in-the-middle, long context, rock-solid.

~19.9GB · Q4_K_M · 128k ctx

🤗 model GGUF

★R1-Distill-Qwen-32B

32B·dense·reasoning

MIT

distilled from DeepSeek-R1

The top open dense reasoner you can run on one 24GB card — the model that beat o1-mini on several benchmarks.

~20GB · Q4_K_M · 128k ctx

🤗 model GGUF

★Llama 3.3 70B

70B·dense

Llama

The stable general-purpose 70B — runs on 2×24GB or a 48GB Mac, with broad ecosystem support.

~42.5GB · Q4_K_M · 128k ctx

🤗 model GGUF

★gpt-oss-120b

117B·MoE · 5.1B active·reasoning

Apache-2.0

o4-mini-class reasoning you can self-host — Apache-2.0, MoE-fast, ~63GB at every quant.

~63GB · Q4_K_M · 128k ctx

🤗 model GGUF

Nomic Embed v1.5

0.1B·dense

Apache-2.0

Tiny, fast English embeddings with 8k context — the lightweight default for local RAG.

~0.3GB · Q4_K_M · 8k ctx

BGE-M3

0.6B·dense

MIT

Hybrid dense+sparse+ColBERT retrieval in one model, great on long documents. MIT-licensed.

~1.2GB · Q4_K_M · 8k ctx

moondream2

~2B·dense

Apache-2.0

A tiny, fast VLM for edge devices — image Q&A and captioning in under 2GB.

~1.8GB · Q4_K_M · 2k ctx

Llama 3.2 3B

3B·dense

Llama

Tight, fast, and maximally compatible — a dependable default for classification and lightweight chat.

~2GB · Q4_K_M · 128k ctx

🤗 model GGUF

SmolLM3-3B

3B·dense·hybrid-reason

Apache-2.0

The strongest fully-open 3B — open data, open recipe, dual think/no-think. A distillation-community favorite.

~2GB · Q4_K_M · 128k ctx

Gemma 3 4B

4B·dense

Gemma

Vision-capable, 128k context, 140+ languages — a versatile small multimodal chat/summarizer.

~2.5GB · Q4_K_M · 128k ctx

🤗 model GGUF

Phi-4-mini

3.8B·dense

MIT

Microsoft's synthetic-data thesis in miniature — punches well above its size on math and structured tasks.

~2.5GB · Q4_K_M · 128k ctx

🤗 model GGUF

R1-Distill-Qwen-7B

7B·dense·reasoning

MIT

distilled from DeepSeek-R1

A math-leaning reasoner — R1 distilled onto a Qwen math base. Strong AIME-style performance for 7B.

~4.7GB · Q4_K_M · 128k ctx

🤗 model GGUF

R1-Distill-Llama-8B

8B·dense·reasoning

MIT

distilled from DeepSeek-R1

Chain-of-thought reasoning in an 8B — DeepSeek-R1's traces distilled into a Llama base. The viral 2025 recipe.

~4.9GB · Q4_K_M · 128k ctx

🤗 model GGUF

Llama 3.1 8B

8B·dense

Llama

The most battle-tested 8B — maximum ecosystem and tooling compatibility when you want a known-good base.

~4.9GB · Q4_K_M · 128k ctx

🤗 model GGUF

Qwen2.5-VL-7B

7B·dense

Apache-2.0

The go-to small vision-language model — OCR, documents, video, and GUI grounding, Apache-licensed.

~6GB · Q4_K_M · 128k ctx

Gemma 3 12B

12B·dense

Gemma

Multimodal, 128k context, 140+ languages — and itself trained with knowledge distillation.

~7.3GB · Q4_K_M · 128k ctx

🤗 model GGUF

Qwen3-14B

14B·dense·hybrid-reason

Apache-2.0

A fully-on-GPU dense option with big KV-cache headroom — the safe 16GB workhorse.

~9GB · Q4_K_M · 32k ctx

🤗 model GGUF

Phi-4-reasoning-plus

14B·dense·reasoning

MIT

distilled from o3-mini (traces)

A 14B reasoner that rivals much larger distills on AIME — CoT supervised fine-tuning plus RL.

~9GB · Q4_K_M · 32k ctx

🤗 model GGUF

R1-Distill-Qwen-14B

14B·dense·reasoning

MIT

distilled from DeepSeek-R1

The mid-size sweet spot for distilled reasoning — most of R1-32B's ability in a 16GB-friendly model.

~9GB · Q4_K_M · 128k ctx

🤗 model GGUF

gpt-oss-20b

21B·MoE · 3.6B active·reasoning

Apache-2.0

OpenAI's open MoE: o3-mini-class reasoning, Apache-2.0, and a fixed ~12GB footprint at every quant.

~12.1GB · Q4_K_M · 128k ctx

🤗 model GGUF

Mistral Small 3.2 24B

24B·dense

Apache-2.0

The best large *Apache* chat model — multimodal, strong tool-use, comfortable on a 24GB card.

~14GB · Q4_K_M · 128k ctx

🤗 model GGUF

Gemma 3 27B

27B·dense

Gemma

Gemini-1.5-Pro-class on benchmarks, fully local — ships official distilled QAT int4 checkpoints.

~16.5GB · Q4_K_M · 128k ctx

🤗 model GGUF

Qwen3-Coder-30B-A3B

30B·MoE · 3B active

Apache-2.0

The best local agentic coder for 16–24GB — native tool-call format for Cline/Qwen Code, MoE-fast.

~18.6GB · Q4_K_M · 256k ctx

🤗 model GGUF

Qwen3-32B

32B·dense·hybrid-reason

Apache-2.0

The strongest dense Qwen3 generalist — a do-everything model for a 24GB card.

~19.8GB · Q4_K_M · 32k ctx

🤗 model GGUF

QwQ-32B

32B·dense·reasoning

Apache-2.0

Qwen's dedicated reasoning model with the cleanest license in its class — Apache-2.0 throughout.

~19.8GB · Q4_K_M · 128k ctx

R1-Distill-Llama-70B

70B·dense·reasoning

MIT

distilled from DeepSeek-R1

Heavy local chain-of-thought — R1's reasoning distilled into a 70B Llama for serious work.

~42.5GB · Q4_K_M · 128k ctx

🤗 model GGUF

Qwen3-235B-A22B-2507

235B·MoE · 22B active·hybrid-reason

Apache-2.0

A frontier open MoE — runs at low quant on 128GB+ unified memory or a multi-GPU rig.

~142GB · Q4_K_M · 256k ctx

🤗 model GGUF

Distillation lineage

Who learned from whom.

The thing that makes these models small is that they were taught by something larger. Here’s the teacher→student map behind the directory.

DeepSeek-R1 · teacher

💧 R1-Distill-Llama-8B
💧 R1-Distill-Qwen-7B
💧 R1-Distill-Qwen-14B
💧 R1-Distill-Qwen-32B
💧 R1-Distill-Llama-70B

o3-mini (traces) · teacher

💧 Phi-4-reasoning-plus

Claude Opus · teacher

💧 Jackrong's “Qwopus” series (Opus reasoning → Qwen)

Llama-3.1-405B + Qwen-72B · teacher

💧 Arcee SuperNova (cross-architecture logit distill)
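For readers who want to work with this map programmatically, here is a minimal sketch of the teacher→student lineage as a plain dictionary, with a reverse lookup from student to teacher. The names `LINEAGE`, `TEACHER_OF`, and `teacher_of` are hypothetical; the entries are only the ones listed above.

```python
# Teacher -> students, copied from the lineage map above.
LINEAGE = {
    "DeepSeek-R1": [
        "R1-Distill-Llama-8B",
        "R1-Distill-Qwen-7B",
        "R1-Distill-Qwen-14B",
        "R1-Distill-Qwen-32B",
        "R1-Distill-Llama-70B",
    ],
    "o3-mini (traces)": ["Phi-4-reasoning-plus"],
    "Claude Opus": ["Qwopus series"],
    "Llama-3.1-405B + Qwen-72B": ["Arcee SuperNova"],
}

# Inverted index: student -> teacher.
TEACHER_OF = {
    student: teacher
    for teacher, students in LINEAGE.items()
    for student in students
}


def teacher_of(student):
    """Return the listed teacher for a student, or None if it is not in the map."""
    return TEACHER_OF.get(student)


print(teacher_of("R1-Distill-Qwen-14B"))
print(teacher_of("Qwen3-8B"))
```

A model that is absent from the map, such as Qwen3-8B here, returns `None`; that only means no teacher is listed in this section, not that the model was trained without distillation.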

The distillers

People worth following.

The quantizers, fine-tuners, and distillers turning frontier models into things you can actually run.

Individual distiller

The standout viral distiller of the moment — the “Qwopus” series distills Claude Opus reasoning into Qwen students, with speculative-decoding GGUF builds.

Quantizer (GGUF)

The most-trusted GGUF quantizer — first-day imatrix quants of nearly every major release.

Fine-tuning + quants

2–5× faster fine-tuning, plus the “Dynamic” GGUFs that benchmark ahead of standard quants.

Quantizer (GGUF)

The most prolific quantizer — blankets new releases with static and imatrix quants.

@mradermacher

Distillation + merging

Author of the LLM Course; popularized abliteration and accessible merging/distillation tooling.

Open research lab

The Hermes family — hybrid reasoning and function-calling models built on open bases.

@NousResearch

Distillation engineering

DistillKit and the SuperNova line — serious cross-architecture logit distillation in the open.

The R1-Distill series that kicked off the open reasoning-distillation wave.

@deepseek-ai