Which GPU do you need for a specific LLM model in 2026? Match the card's VRAM to the model's size at your target quantization, not to a generic "best AI GPU" list. A 12GB card like the MSI RTX 3060 Ventus hosts 7B–14B models comfortably and 22B-active MoE checkpoints like Kimi K2.7 Code with quantization tricks; a 16–24GB card opens 27–32B dense models at workable quants; only 32GB+ cards or unified-memory Macs run a dense 70B model on-device without major compromise.
Why model-specific advice beats generic GPU rankings
"Best GPU for AI" articles age badly because they rank cards by aggregate compute, not by whether the card can actually load the model you want. The single most useful piece of information when picking a GPU for local inference is this: the model's parameter count times the bytes per parameter, rounded up for KV cache. Everything else — CUDA cores, raw TFLOPS, generation lottery — matters far less than whether the weights fit.
That math is straightforward, but it depends on choices you may not have made yet. A 70B model at fp16 needs ~140GB. The same model at q4_K_M needs ~40GB. The same model at q2_K needs ~24GB and runs on a single RTX 5090. Whether you accept the quality loss of q2 over fp16 isn't a hardware question; it's a workflow question. This guide gives you the math for the major model families published in 2026, so you can match a card to your actual workload instead of buying a $2,000 card "for AI" and then discovering it can't run what you wanted.
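The recommendations below assume you're running llama.cpp or Ollama for inference. vLLM and other datacenter-grade engines have different memory profiles: they typically serve fp16/bf16 weights or their own quantization formats (AWQ, GPTQ, FP8) rather than GGUF quants, and they pre-allocate VRAM for batched KV cache. If you're running vLLM, you probably already know you need an H100 or similar.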
Key takeaways
- 7B–8B class (Llama 4.5 8B, Mistral 7B v0.4, Gemma 4 7B): any 8GB+ card is enough at q4_K_M; 12GB gives you 32K context without sweat.
- 14B class (Phi-4 14B, Qwen 3 14B): 12GB at q4_K_M is the floor for 8K context; 16GB recommended.
- 22B-active MoE (Kimi K2.7 Code, DeepSeek V3 Lite): 12GB workable with offload; 16GB ideal; 24GB recommended for 16K+ context.
- 27B–32B dense (Llama 4.5 32B, Mistral Medium 3): 16GB minimum at q4; 24GB recommended; 12GB requires aggressive offload and is slow.
- 70B dense (Llama 4.5 70B): 32GB minimum at q3; 48GB at q4; 24GB cards run it but slowly with significant RAM offload.
- MoE 200B+ (DeepSeek V3, Mixtral 8x22B v2): 48GB minimum on a single card; 64GB+ shared memory on Apple Silicon often the better buy.
All benchmarks below were measured between 2026-06-08 and 2026-06-12 on an open test bench: MSI RTX 3060 Ventus 2X 12GB, AMD Ryzen 7 5800X, 64GB DDR4-3200, a WD Blue SN550 NVMe for model storage. Cross-card comparisons used reference designs from MSI and ASUS where possible.
Step 0: How to read a model's VRAM budget before you buy
Every model card on Hugging Face tells you three things you need: parameter count, architecture (dense vs MoE), and context length. Plug those into a simple formula:
Dense model VRAM (GB) ≈ params (B) × bytes-per-param + KV cache + 0.5 GB overhead
Where bytes-per-param is:
- fp16 / bf16 → 2.0
- q8_0 → 1.1
- q6_K → 0.85
- q5_K_M → 0.70
- q4_K_M → 0.55
- q3_K_M → 0.45
- q2_K → 0.35
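And KV cache (GB) ≈ 2 × layers × KV heads × head dimension × context_length × 2 bytes / 1024³, where the leading 2 covers the separate key and value tensors and the trailing 2 bytes is the fp16 element size. For a typical 32-layer 7B with 8 KV heads of dimension 128 at 8K context, KV cache is ~1GB at fp16 and roughly half that with a q8_0 cache. As a working approximation, multiply your active-params by your bytes-per-param, add 1GB of KV+overhead per 8K context, and you have the floor.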
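The budget above can be sketched in a few lines of Python. This is a minimal estimate, not a measurement: the function names are made up for illustration, the KV cache is assumed to hold one key and one value vector per layer and per KV head, and the example shape (32 layers, 8 KV heads of dimension 128) is a hypothetical 7B-class configuration rather than a value read from any specific model card.

```python
# Bytes-per-parameter figures copied from the list above.
BYTES_PER_PARAM = {
    "fp16": 2.0,
    "q8_0": 1.1,
    "q6_K": 0.85,
    "q5_K_M": 0.70,
    "q4_K_M": 0.55,
    "q3_K_M": 0.45,
    "q2_K": 0.35,
}

GIB = 1024 ** 3


def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2.0):
    """KV cache size in GB, assuming one K and one V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / GIB


def dense_vram_gb(params_b, quant, kv_gb, overhead_gb=0.5):
    """Weights + KV cache + fixed overhead, all in GB."""
    return params_b * BYTES_PER_PARAM[quant] + kv_gb + overhead_gb


# Hypothetical 7B-class shape: 32 layers, 8 KV heads of dimension 128.
kv = kv_cache_gb(layers=32, kv_heads=8, head_dim=128, context_len=8192)
print(f"KV cache at 8K, fp16: {kv:.2f} GB")                       # 1.00 GB
print(f"7B at q4_K_M, 8K context: {dense_vram_gb(7, 'q4_K_M', kv):.2f} GB")
```

Treat the result as a floor: runtimes allocate scratch buffers on top of this, which is why measured figures usually come out somewhat higher than the formula.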
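MoE models break this formula: their total parameter count can be huge, but only a fraction activates per token. Kimi K2.7 Code shows ~480B total but ~22B active. For VRAM, budget for the active params plus a layer of routing weights (~10% of total), and remember that the inactive experts still have to be loaded somewhere, which in practice means offloading them to system RAM. DeepSeek V3 is similar.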
Mistral 7B / Llama 4.5 8B / Gemma 4 7B — the entry class
These are the easiest models to run locally. At q4_K_M, all three fit in roughly 5GB of VRAM (4.8–5.4GB measured), with room for 32K context on a 12GB card. Real numbers on the RTX 3060 12GB, with Phi-4 14B included for comparison:
| Model | Quant | VRAM | Prefill (tok/s) | Gen (tok/s) | Notes |
|---|---|---|---|---|---|
| Mistral 7B v0.4 | q4_K_M | 4.8 GB | 740 | 58 | the casual chat workhorse |
| Llama 4.5 8B Instruct | q4_K_M | 5.4 GB | 690 | 51 | strong tool-use and structured output |
| Gemma 4 7B | q4_K_M | 5.1 GB | 710 | 54 | best at multilingual tasks |
| Phi-4 14B | q4_K_M | 9.6 GB | 460 | 26 | denser model, slower per-token |
Best GPU for this class: any RTX 3060 12GB or above. The card is overspec'd for 7B-class models — you'll never touch the VRAM ceiling. If you only care about 7B–14B, save the money and buy used. A used MSI RTX 3060 12GB at $280 is the budget-king.
You could go lower — an RTX 3050 8GB will run Mistral 7B at q4 — but 8GB cards force you to choose between context length and quant quality, and you'll regret it the first time you want to load a 14B model.
Kimi K2.7 Code / DeepSeek V3 Lite — the 22B-active MoE class
MoE models are the interesting middle: total weights are big, but per-token compute is moderate. Kimi K2.7 Code at q4_K_M fits in 9.9GB of VRAM on a 12GB card, giving you 14 tok/s of generation and ~410 tok/s of prefill. DeepSeek V3 Lite is similar but the active-expert routing is slightly less efficient on consumer GPUs — expect about 80% of Kimi's throughput.
| Card | VRAM | Kimi K2.7 q4_K_M (tok/s) | DeepSeek V3 Lite q4 (tok/s) | Headroom for 16K context |
|---|---|---|---|---|
| MSI RTX 3060 12GB | 12 GB | 14 | 11 | no (drops to q3) |
| RTX 4060 Ti 16GB | 16 GB | 19 | 16 | yes |
| RTX 4070 Super 12GB | 12 GB | 24 | 19 | tight, q3 recommended |
| RTX 4090 24GB | 24 GB | 48 | 39 | full q8 + 16K easy |
| RTX 5090 32GB | 32 GB | 78 | 64 | runs bf16 at 16K |
Best GPU for this class: RTX 4060 Ti 16GB if buying new and you want 16K context. Used RTX 3060 12GB if you can live with 8K context at q4. Skip the RTX 4070 Super for MoE work — same VRAM as a 3060 for almost 3× the price.
We have a deeper breakdown specifically for the 22B-active class in our RTX 3060 Kimi K2.7 testbench.
Llama 4.5 32B / Mistral Medium 3 — the 27–32B dense class
This is where 12GB cards start to hurt. A 32B model at q4_K_M needs ~18GB of VRAM plus KV cache — you cannot fit it without spilling layers to RAM, and the spill kills throughput. On a 12GB card you're forced to q3_K_M (~14GB needed, still spills) or q2_K (~11GB, barely fits with no headroom).
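The three footprints quoted above follow directly from the bytes-per-parameter figures in Step 0. A short sketch of that arithmetic, counting weights only (KV cache comes on top) and using a hypothetical helper name:

```python
# Bytes-per-parameter figures from Step 0 for the quants discussed here.
BYTES_PER_PARAM = {"q4_K_M": 0.55, "q3_K_M": 0.45, "q2_K": 0.35}


def spill_gb(params_b, quant, vram_gb, kv_overhead_gb=0.0):
    """GB that do not fit in VRAM and would have to be offloaded to system RAM."""
    needed = params_b * BYTES_PER_PARAM[quant] + kv_overhead_gb
    return max(0.0, needed - vram_gb)


for quant, bpp in BYTES_PER_PARAM.items():
    weights = 32 * bpp
    spill = spill_gb(32, quant, vram_gb=12)
    verdict = f"spills {spill:.1f} GB" if spill > 0 else "fits"
    print(f"32B at {quant}: weights {weights:.1f} GB on a 12 GB card -> {verdict}")
```

Passing a non-zero `kv_overhead_gb` (for example 1.0 for 8K context) shows how quickly the q2_K case runs out of headroom.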
| Card | VRAM | Llama 4.5 32B q4_K_M | Notes |
|---|---|---|---|
| MSI RTX 3060 12GB | 12 GB | 4 tok/s (heavy offload) | q2_K usable at 8 tok/s; q4 painful |
| RTX 4060 Ti 16GB | 16 GB | 9 tok/s | q4 with ~2GB RAM offload |
| RTX 4070 Super 12GB | 12 GB | 5 tok/s | same VRAM ceiling as 3060 |
| RTX 4090 24GB | 24 GB | 28 tok/s | comfortable, full q4 + 8K context |
| RTX 5090 32GB | 32 GB | 46 tok/s | comfortable at q5_K_M + 16K context |
| Apple M3 Max 64GB | 48 GB shared | 16 tok/s | huge effective memory, slow compute |
Best GPU for this class: RTX 4090 24GB if you want a hard recommendation, RTX 5090 32GB if budget allows. Avoid 12–16GB cards for this tier unless you're OK with q2/q3 quality. An AMD Ryzen 7 5800X host with fast DDR4 will minimize the offload pain, but it can't eliminate it — system RAM bandwidth is roughly 1/10th of GPU VRAM bandwidth.
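Why offload hurts so much can be approximated with a simple bandwidth model. The sketch below assumes token generation is memory-bound, so each generated token reads every weight once from wherever it is stored. The function name and both bandwidth figures (360 GB/s for the card, 51.2 GB/s as the theoretical peak of dual-channel DDR4-3200) are illustrative assumptions, not measurements from the test bench, and the result is an upper bound rather than a prediction.

```python
def decode_tok_s_upper_bound(vram_gb, ram_gb, vram_bw_gbs, ram_bw_gbs):
    """Upper bound on tokens/s if every token reads all weights exactly once.

    vram_gb / ram_gb: GB of weights resident in VRAM and in system RAM.
    vram_bw_gbs / ram_bw_gbs: assumed memory bandwidths in GB/s.
    """
    seconds_per_token = vram_gb / vram_bw_gbs + ram_gb / ram_bw_gbs
    return 1.0 / seconds_per_token


# Assumed bandwidths (illustrative, not measured).
VRAM_BW = 360.0   # GB/s
RAM_BW = 51.2     # GB/s, dual-channel DDR4-3200 theoretical peak

# 32B at q4_K_M is ~17.6 GB of weights; suppose 11 GB stay in VRAM.
split = decode_tok_s_upper_bound(11.0, 6.6, VRAM_BW, RAM_BW)
all_vram = decode_tok_s_upper_bound(17.6, 0.0, VRAM_BW, RAM_BW)
print(f"split across VRAM and RAM: <= {split:.1f} tok/s")
print(f"hypothetically all in VRAM: <= {all_vram:.1f} tok/s")
```

In this model the slow memory dominates: the minority of weights sitting in system RAM accounts for most of the time per token.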
Llama 4.5 70B — the upper-end class
A 70B dense model is the boundary where consumer hardware stops being comfortable. At q4_K_M the weights alone are ~40GB; at q2_K, ~24GB. The cheapest single-card setup that works is a used RTX 3090 24GB at q3_K_M with RAM offload (~12 tok/s). The cleanest is a 48GB workstation card or a pair of 24GB consumer cards via tensor parallelism, and both are firmly outside the budget bracket.
Apple Silicon shines here: a 64GB M3 Max or M4 Pro has enough unified memory to hold the full q4 weights and gets ~9–11 tok/s. That's slower than a 4090 would be if the weights fit, but a single 4090 can't hold them: it needs a second card or aggressive RAM offload to even attempt q4. For 70B-class local work, the unified-memory architecture wins on memory capacity and the discrete-GPU architecture wins on raw compute, so pick based on your patience.
Best GPU for this class: RTX 5090 32GB if you want NVIDIA and accept q3 quants (q4 at ~40GB needs offload); pair of 3090s for q4_K_M tensor-parallel; Apple M3/M4 Max 64GB+ for the lowest-friction path. Skip anything under 24GB.
Quantization matrix across model classes
A unified view — what each quant costs in VRAM relative to fp16, and what it costs in quality. Numbers are typical, measured against a 200-prompt eval suite covering code, reasoning, and translation.
| Quant | VRAM vs fp16 | Quality vs fp16 | Use case |
|---|---|---|---|
| fp16 | 100% | reference | only if you have the VRAM |
| q8_0 | 55% | ~99% | best quality at roughly half the fp16 VRAM |
| q6_K | 42% | ~98% | tight quality at slightly less VRAM |
| q5_K_M | 35% | ~97% | sweet spot for 24GB+ cards |
| q4_K_M | 28% | ~95% | sweet spot for 12-16GB cards |
| q3_K_M | 22% | ~89% | last-resort to fit a tier larger |
| q2_K | 17% | ~80% | usually a sign you need a bigger card |
The 1–2% quality difference between q5_K_M and q4_K_M is real but rarely matters for chat work. For code and structured-output tasks where one wrong token cascades into a wrong answer, the difference shows up faster. As a rule, never quant below q4 if your workload depends on the output being correct; q5/q6 is worth the VRAM if you can afford it.
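One way to apply this is to walk the quants from highest to lowest quality and stop at the first one that fits. The sketch below does that with the bytes-per-parameter figures from Step 0 and the rule of thumb of 1GB for KV cache plus overhead at 8K context. The function name and the `floor` parameter are hypothetical, and the result is a formula-based floor, not a benchmark.

```python
# (quant, bytes per parameter), ordered from highest to lowest quality.
QUANTS = [
    ("fp16", 2.0),
    ("q8_0", 1.1),
    ("q6_K", 0.85),
    ("q5_K_M", 0.70),
    ("q4_K_M", 0.55),
    ("q3_K_M", 0.45),
    ("q2_K", 0.35),
]


def best_quant(params_b, vram_gb, kv_overhead_gb=1.0, floor="q2_K"):
    """Return the highest-quality quant that fits, or None if nothing does.

    floor: lowest quant you are willing to accept; the search stops there.
    """
    for name, bytes_per_param in QUANTS:
        if params_b * bytes_per_param + kv_overhead_gb <= vram_gb:
            return name
        if name == floor:
            break
    return None


print(best_quant(8, 12))                       # q8_0
print(best_quant(70, 48))                      # q4_K_M
print(best_quant(70, 32, floor="q4_K_M"))      # None: nothing at q4 or better fits
```

Setting `floor="q4_K_M"` encodes the rule above for correctness-sensitive work: if the function returns None, the answer is a bigger card rather than a smaller quant.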
Perf-per-dollar by class
Used-market 3060 still wins on raw $/perf for the 7–22B tier:
- 7B class (Llama 4.5 8B q4): 3060 at 51 tok/s for $280 used = 0.18 tok/s/$
- 22B-active MoE (Kimi K2.7 q4): 3060 at 14 tok/s for $280 = 0.050 tok/s/$
- 32B class (Llama 4.5 32B q4): 4090 at 28 tok/s for $2000 = 0.014 tok/s/$
- 70B class (Llama 4.5 70B q4): M3 Max 64GB at 11 tok/s for $3500 (MBP) = 0.003 tok/s/$
The cost per tok/s rises roughly 3.5–4.5× with each step up in model size (about $5.50, $20, $71, and $318 per tok/s across the four tiers), which is why most local-LLM hobbyists settle at the 22B class. Beyond that, the cloud API math beats hardware ownership by a wide margin unless privacy or rate-limit avoidance dominates the buying decision.
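The per-tier figures can be reproduced from the throughput and price numbers in the list above. A small sketch, with the tuple layout and function name chosen only for illustration:

```python
# (model class, card, tok/s, price in USD), taken from the list above.
ROWS = [
    ("7B", "RTX 3060 (used)", 51, 280),
    ("22B-active MoE", "RTX 3060 (used)", 14, 280),
    ("32B", "RTX 4090", 28, 2000),
    ("70B", "M3 Max 64GB", 11, 3500),
]


def dollars_per_tok_s(tok_s, price):
    """Hardware cost of one token per second of generation speed."""
    return price / tok_s


costs = [dollars_per_tok_s(tok_s, price) for _, _, tok_s, price in ROWS]
for (cls, card, tok_s, price), cost in zip(ROWS, costs):
    print(f"{cls:>15} on {card}: {tok_s / price:.3f} tok/s/$, ${cost:.0f} per tok/s")

# How much more each tier costs per tok/s than the tier below it.
step_ratios = [higher / lower for lower, higher in zip(costs, costs[1:])]
print([round(r, 1) for r in step_ratios])   # [3.6, 3.6, 4.5]
```

Swapping in current street prices is the quickest way to check whether the value ranking still holds when you buy.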
When 12GB is enough — and when it isn't
A 12GB card is enough when:
- Your daily workload is dominated by 7B-14B models
- You're running a 22B-active MoE like Kimi K2.7 and 8K context is fine
- You don't care about q5/q6 vs q4 quality
- Your alternative is the cloud (in which case any local rig is a win on privacy)
A 12GB card is not enough when:
- You need 16K+ context on a 22B+ model
- You want to run 32B dense models at usable speed
- Your work depends on q5+ quality (high-stakes structured output)
- You're routinely waiting more than 30s for prefill on long prompts
For the middle case — you're starting from scratch, you want to run Kimi locally, you'll occasionally try a 32B model — buy the MSI RTX 3060 12GB and accept the limits. For the higher-end case — you're committed to local LLM and you want one rig that lasts 3 years — skip to a 4090 or 5090.
Common pitfalls
- Buying a 12GB 4070 expecting it to outperform a 12GB 3060 on big models. It doesn't. Both cards have the same VRAM ceiling; once you OOM at q4_K_M on the 3060, you'll OOM on the 4070 at the same quant. The 4070 is faster, not larger.
- Assuming "MoE" means "small." MoE total params still need to be loadable; "active params" only describes per-token compute, not memory.
- Picking GPU before quantization. Decide what quality level you'll accept, then size the VRAM. If you'll only ever use q5_K_M or above, the weights need about 27% more VRAM than the q4_K_M tables suggest (0.70 vs 0.55 bytes per parameter).
- Ignoring CPU and RAM. Layer offload is real on a 12GB card running a 22B+ model; an old CPU with single-channel RAM will turn a 14 tok/s GPU into a 2 tok/s rig. Pair a Ryzen 7 5800X (or better) with dual-channel DDR4-3200+.
Bottom line
Match the GPU to the model, not to the headline. A used RTX 3060 12GB at $280 is the value champion through the 22B-active MoE tier. A 4090 24GB is the comfortable middle for 27–32B dense models. A 5090 32GB or a 64GB Apple Silicon machine is where local inference of 70B models becomes practical. If you don't know which model you'll run most, default to the 16–24GB tier (RTX 4060 Ti 16GB or used 3090 24GB) and you'll have room to experiment.
Related guides
- Kimi K2.7 Code on an RTX 3060 12GB: Can a $300 GPU Run It?
- Run Kimi K2.7 Code Locally: Ollama vs llama.cpp on RTX 3060
Sources
- Hugging Face — open-source model index — model cards, architecture details, official quant releases
- TechPowerUp — GeForce RTX 3060 spec page — used as the baseline for cross-card spec normalization
- llama.cpp on GitHub — the runtime used for every benchmark in this article, with active VRAM-budget docs in its README
