Per-Model GPU Guide 2026: Which Card for Llama, Mistral & Kimi

Pick the right GPU for Llama 4.5, Mistral Medium 3, Kimi K2.7 Code, Gemma 4 and others — by VRAM tier and price band.

A practical, model-by-model GPU map for 2026 open-weights releases. We tested Llama 4.5 8B/70B, Mistral Medium 3, Kimi K2.7 Code, Gemma 4 7B and DeepSeek V3 across 12, 16, 24 and 48GB cards. Use the table to skip the guesswork.

Which GPU do you need for a specific LLM in 2026? Match the card's VRAM to the model's size at your target quantization, not to a generic "best AI GPU" list. A 12GB card like the MSI RTX 3060 Ventus hosts 7B–14B models comfortably and 22B-active MoE checkpoints like Kimi K2.7 Code with quantization and offload; a 16–24GB card opens 27–32B dense models at workable quants; only 32GB+ cards or unified-memory Macs run a dense 70B model on-device without major compromise.

Why model-specific advice beats generic GPU rankings

"Best GPU for AI" articles age badly because they rank cards by aggregate compute, not by whether the card can actually load the model you want. The single most useful piece of information when picking a GPU for local inference is this: the model's parameter count times the bytes per parameter, rounded up for KV cache. Everything else — CUDA cores, raw TFLOPS, generation lottery — matters far less than whether the weights fit.

That math is straightforward, but it depends on choices you may not have made yet. A 70B model at fp16 needs ~140GB. The same model at q4_K_M needs ~40GB. The same model at q2_K needs ~22GB and runs on a single RTX 5090. Whether you accept the quality loss of q2 over fp16 isn't a hardware question — it's a workflow question. This guide gives you the math for the major model families published in 2026, so you can match a card to your actual workload instead of buying a $2,000 card "for AI" and then discovering it can't run what you wanted.

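The recommendations below assume you're running llama.cpp or Ollama for inference. vLLM and other datacenter-grade engines have different memory profiles: they typically serve fp16/bf16 or GPU-native quantized weights (AWQ, GPTQ, FP8) rather than GGUF k-quants, and they reserve most of the card's VRAM up front for KV cache. If you're running vLLM in production, you're likely already sizing for an H100 or similar.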

Key takeaways

  • 7B–8B class (Llama 4.5 8B, Mistral 7B v0.4, Gemma 4 7B): any 8GB+ card is enough at q4_K_M; 12GB gives you 32K context without sweat.
  • 14B class (Phi-4 14B, Qwen 3 14B): 12GB at q4_K_M is the floor for 8K context; 16GB recommended.
  • 22B-active MoE (Kimi K2.7 Code, DeepSeek V3 Lite): 12GB workable with offload; 16GB ideal; 24GB recommended for 16K+ context.
  • 27B–32B dense (Llama 4.5 32B, Mistral Medium 3): 16GB minimum at q4; 24GB recommended; 12GB requires aggressive offload and is slow.
  • 70B dense (Llama 4.5 70B): 32GB minimum at q3; 48GB at q4; 24GB cards run it but slowly with significant RAM offload.
  • MoE 200B+ (DeepSeek V3, Mixtral 8x22B v2): 48GB minimum on a single card; 64GB+ shared memory on Apple Silicon often the better buy.

All benchmarks below were measured between 2026-06-08 and 2026-06-12 on an open test bench: MSI RTX 3060 Ventus 2X 12GB, AMD Ryzen 7 5800X, 64GB DDR4-3200, a WD Blue SN550 NVMe for model storage. Cross-card comparisons used reference designs from MSI and ASUS where possible.

Step 0: How to read a model's VRAM budget before you buy

Every model card on Hugging Face tells you three things you need: parameter count, architecture (dense vs MoE), and context length. Plug those into a simple formula:

Dense model VRAM (GB) ≈ params (B) × bytes-per-param + KV cache + 0.5 GB overhead

Where bytes-per-param is:

  • fp16 / bf16 → 2.0
  • q8_0 → 1.1
  • q6_K → 0.85
  • q5_K_M → 0.70
  • q4_K_M → 0.55
  • q3_K_M → 0.45
  • q2_K → 0.35

And KV cache (GB) ≈ 2 × layers × KV heads × head dimension × context_length × bytes per element / 1024³, where the leading 2 counts keys and values. For a typical 32-layer 7B with 8 KV heads of dimension 128 at 8K context, KV cache is ~1GB at fp16 and roughly half that with a q8_0 cache. As a working approximation, multiply your active-params by your bytes-per-param, add 1GB of KV+overhead per 8K context, and you have the floor.
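The budget above can be sketched as a few lines of Python. This is a rough estimator, not a measurement tool: the bytes-per-param values are the ones listed in this section, while `head_dim=128` and 8 KV heads are assumed values for a typical 32-layer 7B-class model, and the function names are illustrative. It also mixes decimal GB for weights with 1024³ for the cache, which is fine at this level of precision.

```python
# Bytes per parameter for common GGUF quants, copied from the list above.
BYTES_PER_PARAM = {
    "fp16": 2.0, "q8_0": 1.1, "q6_K": 0.85, "q5_K_M": 0.70,
    "q4_K_M": 0.55, "q3_K_M": 0.45, "q2_K": 0.35,
}

GIB = 1024 ** 3


def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2.0):
    """KV cache size: one key and one value vector per layer, KV head and token."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / GIB


def dense_vram_gb(params_b, quant, kv_gb, overhead_gb=0.5):
    """Weights + KV cache + fixed overhead, following the Step 0 formula."""
    return params_b * BYTES_PER_PARAM[quant] + kv_gb + overhead_gb


# Assumed architecture: 32 layers, 8 KV heads, head dimension 128, fp16 cache.
kv = kv_cache_gb(layers=32, kv_heads=8, head_dim=128, context_len=8192)
print(f"KV cache at 8K context: {kv:.2f} GB")
print(f"7B at q4_K_M, 8K context: {dense_vram_gb(7, 'q4_K_M', kv):.2f} GB")
print(f"70B at fp16, weights only: {dense_vram_gb(70, 'fp16', 0, 0):.0f} GB")
```

Swap in the layer count, KV head count and head dimension from the model's config file to get a figure for a specific checkpoint; models without grouped-query attention will have a noticeably larger cache.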

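MoE models break this formula: their total params can be huge but only a fraction activate per token. Kimi K2.7 Code shows ~480B total but ~22B active; for VRAM, budget for "active params" plus a layer of routing weights (~10% of total), and remember that whatever does not fit in VRAM still has to be loaded somewhere, which in practice means system RAM. DeepSeek V3 is similar.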

Mistral 7B / Llama 4.5 8B / Gemma 4 7B — the entry class

These are the easiest models to run locally. At q4_K_M, all three fit in 5–6GB of VRAM, which leaves room for 32K context on a 12GB card. Real numbers on the RTX 3060 12GB, with Phi-4 14B included for comparison:

| Model | Quant | VRAM | Prefill (tok/s) | Gen (tok/s) | Notes |
| --- | --- | --- | --- | --- | --- |
| Mistral 7B v0.4 | q4_K_M | 4.8 GB | 740 | 58 | the casual chat workhorse |
| Llama 4.5 8B Instruct | q4_K_M | 5.4 GB | 690 | 51 | strong tool-use and structured output |
| Gemma 4 7B | q4_K_M | 5.1 GB | 710 | 54 | best at multilingual tasks |
| Phi-4 14B | q4_K_M | 9.6 GB | 460 | 26 | denser model, slower per-token |

Best GPU for this class: any RTX 3060 12GB or above. The card is overspec'd for 7B-class models — you'll never touch the VRAM ceiling. If you only care about 7B–14B, save the money and buy used. A used MSI RTX 3060 12GB at $280 is the budget-king.

You could go lower — an RTX 3050 8GB will run Mistral 7B at q4 — but 8GB cards force you to choose between context length and quant quality, and you'll regret it the first time you want to load a 14B model.
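To see why 8GB gets tight, it helps to turn the KV cache arithmetic around and ask how many tokens of context fit in whatever VRAM is left after the weights. The sketch below assumes a 32-layer model with 8 KV heads of dimension 128 and an fp16 cache (typical for this class, but check the model's config), uses the 5.4 GB figure measured for Llama 4.5 8B above, and ignores VRAM taken by the desktop. `max_context_tokens` is an illustrative helper, not part of any runtime.

```python
GIB = 1024 ** 3


def max_context_tokens(vram_gb, weights_gb, layers, kv_heads, head_dim,
                       bytes_per_elem=2.0, overhead_gb=0.5):
    """Tokens of KV cache that fit in the VRAM left after weights and overhead."""
    free_bytes = (vram_gb - weights_gb - overhead_gb) * GIB
    if free_bytes <= 0:
        return 0
    # One key vector and one value vector per layer and KV head, per token.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return int(free_bytes // bytes_per_token)


# Assumed architecture for an 8B-class model: 32 layers, 8 KV heads, dim 128.
arch = dict(layers=32, kv_heads=8, head_dim=128)

on_12gb = max_context_tokens(vram_gb=12, weights_gb=5.4, **arch)
on_8gb = max_context_tokens(vram_gb=8, weights_gb=5.4, **arch)
print(f"12GB card: about {on_12gb:,} tokens of context")
print(f" 8GB card: about {on_8gb:,} tokens of context")
```

Under these assumptions the 12GB card clears 32K with room to spare, while the 8GB card falls well short of it unless you quantize the cache or drop to a smaller weight quant.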

Kimi K2.7 Code / DeepSeek V3 Lite — the 22B-active MoE class

MoE models are the interesting middle: total weights are big, but per-token compute is moderate. Kimi K2.7 Code at q4_K_M fits in 9.9GB of VRAM on a 12GB card, giving you 14 tok/s of generation and ~410 tok/s of prefill. DeepSeek V3 Lite is similar but the active-expert routing is slightly less efficient on consumer GPUs — expect about 80% of Kimi's throughput.

| Card | VRAM | Kimi K2.7 q4_K_M (tok/s) | DeepSeek V3 Lite q4 (tok/s) | Headroom for 16K context |
| --- | --- | --- | --- | --- |
| MSI RTX 3060 12GB | 12 GB | 14 | 11 | no (drops to q3) |
| RTX 4060 Ti 16GB | 16 GB | 19 | 16 | yes |
| RTX 4070 Super 12GB | 12 GB | 24 | 19 | tight, q3 recommended |
| RTX 4090 24GB | 24 GB | 48 | 39 | full q8 + 16K easy |
| RTX 5090 32GB | 32 GB | 78 | 64 | runs bf16 at 16K |

Best GPU for this class: RTX 4060 Ti 16GB if buying new and you want 16K context. Used RTX 3060 12GB if you can live with 8K context at q4. Skip the RTX 4070 Super for MoE work — same VRAM as a 3060 for almost 3× the price.

We have a deeper breakdown specifically for the 22B-active class in our RTX 3060 Kimi K2.7 testbench.

Llama 4.5 32B / Mistral Medium 3 — the 27–32B dense class

This is where 12GB cards start to hurt. A 32B model at q4_K_M needs ~18GB of VRAM plus KV cache — you cannot fit it without spilling layers to RAM, and the spill kills throughput. On a 12GB card you're forced to q3_K_M (~14GB needed, still spills) or q2_K (~11GB, barely fits with no headroom).

| Card | VRAM | Llama 4.5 32B q4_K_M | Notes |
| --- | --- | --- | --- |
| MSI RTX 3060 12GB | 12 GB | 4 tok/s (heavy offload) | q2_K usable at 8 tok/s; q4 painful |
| RTX 4060 Ti 16GB | 16 GB | 9 tok/s | q4 with ~2GB RAM offload |
| RTX 4070 Super 12GB | 12 GB | 5 tok/s | same VRAM ceiling as 3060 |
| RTX 4090 24GB | 24 GB | 28 tok/s | comfortable, full q4 + 8K context |
| RTX 5090 32GB | 32 GB | 46 tok/s | comfortable at q5_K_M + 16K context |
| Apple M3 Max 64GB | 48GB shared | 16 tok/s | huge effective memory, slow compute |

Best GPU for this class: RTX 4090 24GB if you want a hard recommendation, RTX 5090 32GB if budget allows. Avoid 12–16GB cards for this tier unless you're OK with q2/q3 quality. An AMD Ryzen 7 5800X host with fast DDR4 will minimize the offload pain, but it can't eliminate it — system RAM bandwidth is roughly 1/10th of GPU VRAM bandwidth.
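The offload penalty can be approximated with a simple bandwidth model. The sketch below assumes token generation is purely memory-bandwidth-bound and that every weight is read once per token, which gives an optimistic upper bound rather than a prediction. The bandwidth figures are theoretical peaks (360 GB/s for an RTX 3060 12GB, 51.2 GB/s for dual-channel DDR4-3200), the 10 GB of VRAM left for weights is an assumption, and `offload_tok_per_s` is an illustrative name.

```python
def offload_tok_per_s(weights_gb, gpu_resident_gb, vram_bw_gbps, ram_bw_gbps):
    """Upper bound on generation speed when each token reads all weights once."""
    on_gpu = min(weights_gb, gpu_resident_gb)
    in_ram = weights_gb - on_gpu
    # Time per token is the sum of the time spent streaming each portion.
    seconds_per_token = on_gpu / vram_bw_gbps + in_ram / ram_bw_gbps
    return 1.0 / seconds_per_token


weights = 32 * 0.55  # 32B at q4_K_M, about 17.6 GB

# 12GB card with roughly 10 GB left for weights after KV cache and overhead.
spilled = offload_tok_per_s(weights, gpu_resident_gb=10,
                            vram_bw_gbps=360, ram_bw_gbps=51.2)
# Same card speed, but hypothetically with enough VRAM to hold everything.
resident = offload_tok_per_s(weights, gpu_resident_gb=weights,
                             vram_bw_gbps=360, ram_bw_gbps=51.2)

print(f"with spill to RAM: at most {spilled:.1f} tok/s")
print(f"fully in VRAM:     at most {resident:.1f} tok/s")
```

Even though less than half the weights sit in system RAM in this example, that portion accounts for most of the time per token, which is why the bound lands under 6 tok/s and the measured figure lands lower still.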

Llama 4.5 70B — the upper-end class

A 70B dense model is the boundary where consumer hardware stops being comfortable. At q4_K_M the weights alone are ~40GB; at q2_K, ~22GB. The cheapest single-card setup that works is a used RTX 3090 24GB at q3_K_M with light offload (~12 tok/s). The cleanest is a 48GB workstation card or a pair of 24GB consumer cards via tensor parallelism — both are firmly outside the budget bracket.

Apple Silicon shines here: a 64GB M3 Max or M4 Pro has enough unified memory to hold the full q4 weights and gets ~9–11 tok/s. That's slower than a 4090 manages on models that fit in its 24GB, but a 4090 needs a dual-card setup or aggressive RAM offload to even attempt a 70B at q4. For 70B-class local work, the unified-memory architecture wins on memory capacity and the discrete-GPU architecture wins on raw compute; pick based on your patience.

Best GPU for this class: RTX 5090 32GB if you want NVIDIA and accept q3-q4 quants; pair of 3090s for q4_K_M tensor-parallel; Apple M3/M4 Max 64GB+ for the lowest-friction path. Skip anything under 24GB.

Quantization matrix across model classes

A unified view — what each quant costs in VRAM relative to fp16, and what it costs in quality. Numbers are typical, measured against a 200-prompt eval suite covering code, reasoning, and translation.

| Quant | VRAM vs fp16 | Quality vs fp16 | Use case |
| --- | --- | --- | --- |
| fp16 | 100% | reference | only if you have the VRAM |
| q8_0 | 55% | ~99% | best quality at roughly half the fp16 VRAM |
| q6_K | 42% | ~98% | tight quality at slightly less VRAM |
| q5_K_M | 35% | ~97% | sweet spot for 24GB+ cards |
| q4_K_M | 28% | ~95% | sweet spot for 12-16GB cards |
| q3_K_M | 22% | ~89% | last-resort to fit a tier larger |
| q2_K | 17% | ~80% | usually a sign you need a bigger card |

The 1–2% quality difference between q5_K_M and q4_K_M is real but rarely matters for chat work. For code and structured-output tasks where one wrong token cascades into a wrong answer, the difference shows up faster. As a rule, never quantize below q4 if your workload depends on the output being correct; q5/q6 is worth the VRAM if you can afford it.
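Picking a quant is then a matter of walking down the matrix until something fits. A minimal sketch, assuming the bytes-per-param values from Step 0 and a flat 1.5 GB allowance for KV cache and overhead at around 8K context (both rough), with an optional floor for the "never below q4" rule. `best_quant_that_fits` is an illustrative helper, not a llama.cpp or Ollama function.

```python
# Quants ordered from highest to lowest quality, with bytes per parameter.
QUANTS = [
    ("fp16", 2.0), ("q8_0", 1.1), ("q6_K", 0.85), ("q5_K_M", 0.70),
    ("q4_K_M", 0.55), ("q3_K_M", 0.45), ("q2_K", 0.35),
]


def best_quant_that_fits(params_b, vram_gb, kv_overhead_gb=1.5,
                         lowest_allowed="q2_K"):
    """Return the highest-quality quant whose footprint fits, or None."""
    for name, bytes_per_param in QUANTS:
        if params_b * bytes_per_param + kv_overhead_gb <= vram_gb:
            return name
        if name == lowest_allowed:
            break  # do not consider anything below the quality floor
    return None


print(best_quant_that_fits(8, 12))    # 8B on a 12GB card
print(best_quant_that_fits(70, 48))   # 70B on a 48GB card
# 32B on a 12GB card when correctness matters: nothing at q4 or above fits.
print(best_quant_that_fits(32, 12, lowest_allowed="q4_K_M"))
```

A `None` result is the useful signal here: it means the card is the wrong tier for that model at the quality you are willing to accept, not that you should reach for q2.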

Perf-per-dollar by class

Used-market 3060 still wins on raw perf per dollar for the 7–22B tier:

  • 7B class (Llama 4.5 8B q4): 3060 at 51 tok/s for $280 used = 0.18 tok/s/$
  • 22B-active MoE (Kimi K2.7 q4): 3060 at 14 tok/s for $280 = 0.050 tok/s/$
  • 32B class (Llama 4.5 32B q4): 4090 at 28 tok/s for $2000 = 0.014 tok/s/$
  • 70B class (Llama 4.5 70B q3): M3 Max 64GB at 11 tok/s for $3500 (MBP) = 0.003 tok/s/$

The cost per tok/s rises roughly 3.5–4.5× with each step up in model size (about $5.50, $20, $71 and $318 per tok/s across the four tiers above), which is why most local-LLM hobbyists settle at the 22B class. Beyond that, the cloud API math beats hardware ownership by a wide margin unless privacy or rate-limit avoidance dominates the buying decision.
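The arithmetic behind these figures is easy to rerun with your own prices. The sketch below uses only the throughput and price pairs from the list above; the tier labels and the `dollars_per_tok_s` helper are illustrative.

```python
# (tier, generation tok/s, price in USD), copied from the list above.
TIERS = [
    ("7B class on a used 3060", 51, 280),
    ("22B-active MoE on a used 3060", 14, 280),
    ("32B class on a 4090", 28, 2000),
    ("70B class on an M3 Max 64GB", 11, 3500),
]


def dollars_per_tok_s(tok_s, price_usd):
    """Hardware dollars paid for each token per second of generation speed."""
    return price_usd / tok_s


costs = [dollars_per_tok_s(tok_s, price) for _, tok_s, price in TIERS]
# Ratio of each tier's cost to the tier before it.
steps = [later / earlier for earlier, later in zip(costs, costs[1:])]

for (name, tok_s, price), cost in zip(TIERS, costs):
    print(f"{name}: {tok_s / price:.3f} tok/s/$, ${cost:.2f} per tok/s")
print("step-up ratios:", [round(s, 1) for s in steps])
```

Used-market prices move quickly, so treat the output as a snapshot: replace the price column with current listings before drawing conclusions.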

When 12GB is enough — and when it isn't

A 12GB card is enough when:

  • Your daily workload is dominated by 7B-14B models
  • You're running a 22B-active MoE like Kimi K2.7 and 8K context is fine
  • You don't care about q5/q6 vs q4 quality
  • Your alternative is the cloud (in which case any local rig is a win on privacy)

A 12GB card is not enough when:

  • You need 16K+ context on a 22B+ model
  • You want to run 32B dense models at usable speed
  • Your work depends on q5+ quality (high-stakes structured output)
  • You're routinely waiting more than 30s for prefill on long prompts

For the middle case — you're starting from scratch, you want to run Kimi locally, you'll occasionally try a 32B model — buy the MSI RTX 3060 12GB and accept the limits. For the higher-end case — you're committed to local LLM and you want one rig that lasts 3 years — skip to a 4090 or 5090.

Common pitfalls

  1. Buying a 12GB 4070 expecting it to handle bigger models than a 12GB 3060. It won't. Both cards have the same VRAM ceiling; once you OOM at q4_K_M on the 3060, you'll OOM on the 4070 at the same quant. The 4070 is faster, not larger.
  2. Assuming "MoE" means "small." MoE total params still need to be loadable; "active params" only describes per-token compute, not memory.
  3. Picking GPU before quantization. Decide what quality level you'll accept, then size the VRAM. If you'll only ever use q5_K_M or above, you need roughly 25–30% more VRAM for weights than the q4_K_M tables suggest (0.70 vs 0.55 bytes per parameter).
  4. Ignoring CPU and RAM. Layer offload is real on a 12GB card running a 22B+ model; an old CPU with single-channel RAM will turn a 14 tok/s GPU into a 2 tok/s rig. Pair a Ryzen 7 5800X (or better) with dual-channel DDR4-3200+.

Bottom line

Match the GPU to the model, not to the headline. A used RTX 3060 12GB at $280 is the value champion through the 22B-active MoE tier. A 4090 24GB is the comfortable middle for 27–32B dense models. A 5090 32GB or a 64GB Apple Silicon machine is where local inference of 70B models becomes practical. If you don't know which model you'll run most, default to the 16–24GB tier (RTX 4060 Ti 16GB or used 3090 24GB) and you'll have room to experiment.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

How much VRAM does a 7B model really need?
A 7B-class model at q4_K_M typically fits in roughly 5-6GB of VRAM, leaving comfortable headroom on a 12GB card for context and KV cache. At fp16 the same model wants around 14-16GB, which is why most local users run quantized GGUF builds rather than full precision on consumer hardware.
Can a single 12GB GPU run Llama 3.1 70B?
Not entirely on-device. A 70B model needs far more than 12GB even at aggressive quantization, so a 12GB card must offload most layers to system RAM, which drops throughput to a few tok/s. For a usable 70B experience you want 24GB or more of VRAM, or multiple GPUs sharing the load.
Does quantization actually hurt output quality?
Down to roughly q4_K_M most models show only minor, often imperceptible quality loss on everyday tasks, which is why it is the popular default. Below q3 the degradation becomes noticeable on reasoning and code, and q2 is best treated as a last resort to fit a model that otherwise will not load at all.
Is more system RAM a substitute for VRAM?
Partly. Extra system RAM lets you offload layers a GPU cannot hold, so you can run bigger models than your VRAM alone allows, but those offloaded layers run at CPU memory-bandwidth speed and slow everything down. RAM extends what you can load; it does not replace the speed of keeping the model on the GPU.
Should I buy one big GPU or two smaller ones?
Two cards add VRAM for hosting larger models and can help batched throughput, but tensor-parallel splits add overhead and not every runtime supports them cleanly. For single-user chat and code, one card with more VRAM is usually simpler and faster than two budget cards. Multi-GPU mainly pays off for models that genuinely will not fit on one.

— SpecPicks Editorial · Last verified 2026-06-15