Which GPU Does Each Popular LLM Actually Need in 2026?

A practical, model-by-model VRAM and hardware-tier reference for the LLM lineup you actually care about.

Llama, Qwen, DeepSeek, Mistral, Gemma, Phi — each model has a real VRAM floor that quantization can bend but not break. Here's the actual hardware each one needs in 2026.

Picking a GPU for local LLMs starts with the model, not the card. As a working rule, multiply parameters by your quantization's bytes-per-weight, add roughly 15-25% for KV cache and overhead, then map the total to a VRAM tier: 7B-8B models at q4 fit a 12 GB MSI GeForce RTX 3060 Ventus 2X 12G; 32B q4 wants 24 GB; 70B q4 demands 40-48 GB; 405B and frontier MoE models need datacenter cards or multi-GPU rigs.

Why model-specific hardware matching matters and the cost of guessing wrong

The local LLM scene in 2026 has split into roughly six distinct hardware tiers, and the gap between them is brutal. Buying a 12 GB GPU expecting to run Llama 4 70B at usable speeds, or assuming a Raspberry Pi can host a coding assistant, leads to the same outcome — a half-loaded model swapping to system RAM and generating one token per second when the demo videos promised forty.

Per the Hugging Face model card conventions, the VRAM number printed in a release blog post almost always assumes empty context, bf16 or fp16 weights, and no KV cache pressure. Real workloads add system prompts, multi-turn history, retrieval chunks, and tool-call traces — each of which lives in the KV cache and grows linearly with context length. Community measurements on r/LocalLLaMA repeatedly show that a model advertised as "fits in 24 GB" needs 30 GB once a 16K-token agent loop is running.

The cost of guessing wrong is not symmetric. Undershoot and your card cannot load the weights at all, or it offloads layers to CPU and crashes from 60 tok/s down to 3 tok/s. Overshoot and you spent two or three times what was necessary on a card whose extra VRAM sits idle. The matching exercise — model first, quantization second, hardware third — is the single highest-leverage decision in a local AI build, and it is the framing this guide follows throughout.

For storage, a fast NVMe SSD like the Western Digital 1TB WD Blue SN550 NVMe matters more than people expect: a 70B q4 GGUF weighs around 40 GB, and cold-loading that from a SATA SSD adds seconds to every model swap.
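As a rough illustration of why storage speed matters, best-case load time is approximately file size divided by sequential read throughput. The sketch below uses the ~40 GB file size quoted above; the throughput figures (about 0.55 GB/s for a SATA SSD, about 2.4 GB/s for a mid-range NVMe drive) are ballpark assumptions rather than measurements, and `load_seconds` is a hypothetical helper name.

```python
def load_seconds(file_gb: float, read_gb_per_s: float) -> float:
    """Best-case cold-load time: file size divided by sequential read speed."""
    return file_gb / read_gb_per_s


if __name__ == "__main__":
    gguf_gb = 40.0  # 70B q4 GGUF, size quoted in the text
    # Assumed ballpark sequential read speeds, not measured values.
    drives = {"SATA SSD": 0.55, "NVMe SSD": 2.4}
    for name, speed in drives.items():
        print(f"{name}: ~{load_seconds(gguf_gb, speed):.0f} s to read {gguf_gb:.0f} GB")
```

Real loads are usually slower than this bound because of filesystem overhead and the copy into VRAM, so treat the result as a floor.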

Step 0 diagnostic: pick your model first, then size the GPU — not the reverse

Before opening a single product listing, write down what you actually want the model to do. The realistic decision tree splits four ways:

  • Chat and writing assistant — an 8B-14B class model (Llama 3.5 8B, Qwen 3 14B, Gemma 3 9B) is the sweet spot, and 12 GB of VRAM covers it.
  • Coding copilot — DeepSeek Coder V2, Qwen 3 Coder, or CodeLlama 34B are the targets; 24 GB starts to feel comfortable.
  • Agent loops, long-context RAG, structured tool use — 32B-70B class models with 16K-32K context; budget 48 GB or dual-GPU.
  • Frontier research, MoE experimentation, fine-tuning — Llama 4 405B, DeepSeek V3, Mixtral 8x22B; datacenter cards or rented compute.

Skipping this step is the most common building mistake. Per community write-ups on the Puget Systems and Phoronix forums, a meaningful share of regret posts trace back to buying a 24 GB card for a chat use case that a 12 GB card would have handled identically, or buying a 12 GB card for a coding workflow that needed 24 GB.

Key takeaways

  • VRAM ≈ parameters × bytes-per-weight × 1.1-1.25 for runtime overhead, plus the KV cache for your context length. Memorize the formula.
  • 7B-8B q4 fits 8-12 GB; 13B-14B q4 fits 12 GB; 32B q4 needs 24 GB; 70B q4 needs 40-48 GB; 405B needs 200 GB+.
  • Context length is not free — 32K context can double the VRAM cost of a 14B model.
  • A 12 GB ZOTAC Gaming GeForce RTX 3060 Twin Edge remains the entry-point hero card for local LLMs in 2026.
  • Raspberry Pi-class hardware (the Raspberry Pi 4 Computer Model B 8GB) runs only the smallest 1B-3B quantized models at single-digit tokens per second.
  • Two consumer GPUs can host larger models than one, but introduce setup complexity that single cards avoid.
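The takeaways above amount to a lookup from required VRAM to a hardware tier. A minimal sketch of that lookup, assuming the 12 / 24 / 48 GB thresholds named in this guide; the tier labels and the function name `smallest_tier` are illustrative, not an established convention.

```python
# (VRAM ceiling in GB, tier label), smallest first. Thresholds follow the
# tiers named in this guide; the labels are illustrative.
TIERS = [
    (12, "12 GB card (RTX 3060 class)"),
    (24, "24 GB card (RTX 3090 / 4090 class)"),
    (48, "48 GB card or dual 24 GB GPUs"),
]


def smallest_tier(required_gb: float) -> str:
    """Return the smallest tier whose VRAM covers the requirement."""
    for ceiling, label in TIERS:
        if required_gb <= ceiling:
            return label
    return "datacenter cards or a multi-GPU rig"


if __name__ == "__main__":
    for need in (6, 22, 43, 240):
        print(f"{need:>3} GB needed -> {smallest_tier(need)}")
```

The lookup is only as good as the requirement fed into it, so include KV cache and overhead in `required_gb` before calling it.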

How much VRAM does a model really need? (params x bits-per-weight + context + KV cache math)

The first-order approximation is the parameter count multiplied by the bytes-per-weight implied by your quantization. Per the llama.cpp project documentation, real GGUF quantizations use mixed-precision blocks, so the effective bits-per-weight is what matters:

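| Quantization | Effective bits/weight | Bytes/param | VRAM for 7B | VRAM for 70B |
|---|---|---|---|---|
| fp16 / bf16 | 16.0 | 2.00 | 14 GB | 140 GB |
| q8_0 | 8.5 | 1.06 | 7.4 GB | 74 GB |
| q6_K | 6.6 | 0.83 | 5.8 GB | 58 GB |
| q5_K_M | 5.7 | 0.71 | 5.0 GB | 50 GB |
| q4_K_M | 4.85 | 0.61 | 4.3 GB | 43 GB |
| q3_K_M | 3.9 | 0.49 | 3.4 GB | 34 GB |
| q2_K | 3.35 | 0.42 | 2.9 GB | 29 GB |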

To that base, add roughly 10-25% for runtime overhead (CUDA context, activation buffers, the model graph) and a context-dependent KV cache. The KV cache cost is approximately 2 × n_layers × n_kv_heads × head_dim × context_length × bytes_per_value. For an older 7B design with 32 layers and full multi-head attention (32 KV heads, head_dim 128), the cache is about 0.5 MB per token at fp16, so 8K context costs ~4 GB on top of weights and 32K context costs ~16 GB. Models that use grouped-query attention with 8 KV heads need a quarter of that. KV cache quantization (q8 or q4 KV) cuts the cost proportionally.
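The weight-sizing arithmetic can be sketched in a few lines of Python. This is a planning sketch, not a measurement: the bits-per-weight values are copied from the table above, while the function name `estimate_weights_gb` and the 15% overhead used in the demo are illustrative assumptions. The KV cache is deliberately left out.

```python
# Effective bits per weight, copied from the quantization table above.
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "q8_0": 8.5,
    "q6_K": 6.6,
    "q5_K_M": 5.7,
    "q4_K_M": 4.85,
    "q3_K_M": 3.9,
    "q2_K": 3.35,
}


def estimate_weights_gb(params_billions: float, quant: str,
                        overhead: float = 0.0) -> float:
    """Estimate memory for model weights in GB (1 GB = 1e9 bytes).

    overhead is a fraction (0.15 means 15%) covering CUDA context and
    activation buffers. The KV cache is NOT included.
    """
    bytes_per_param = BITS_PER_WEIGHT[quant] / 8
    return params_billions * bytes_per_param * (1 + overhead)


if __name__ == "__main__":
    for size in (7, 14, 70):
        bare = estimate_weights_gb(size, "q4_K_M")
        padded = estimate_weights_gb(size, "q4_K_M", overhead=0.15)
        print(f"{size}B q4_K_M: {bare:.1f} GB weights, "
              f"{padded:.1f} GB with 15% overhead")
```

Results land within a few percent of the table, which rounds bytes-per-parameter to two decimals. Real GGUF files also vary slightly by architecture because some tensors are kept at higher precision.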

Compatibility table: model family → min VRAM at q4 → recommended GPU tier

The matrix below collapses publicly cited Hugging Face model cards and the Artificial Analysis benchmark hub into a single planning view. All numbers assume q4_K_M weights, 8K context, and a 10-15% overhead margin.

| Model family | Params | Min VRAM (q4) | VRAM (fp16) | Recommended GPU tier |
|---|---|---|---|---|
| Gemma 3 2B | 2B | 2.5 GB | 5 GB | Pi 5 / GTX 1660 |
| Llama 3.5 3B | 3B | 3 GB | 7 GB | RTX 3050 6 GB |
| Phi-4 mini 4B | 4B | 3.5 GB | 9 GB | RTX 3050 6 GB |
| Llama 3.5 8B | 8B | 6 GB | 16 GB | RTX 3060 12 GB |
| Qwen 3 7B | 7B | 5.5 GB | 14 GB | RTX 3060 12 GB |
| Mistral 7B v0.3 | 7B | 5.5 GB | 14 GB | RTX 3060 12 GB |
| Gemma 3 9B | 9B | 7 GB | 18 GB | RTX 3060 12 GB |
| DeepSeek R1 Distill 8B | 8B | 6 GB | 16 GB | RTX 3060 12 GB |
| Phi-4 14B | 14B | 10 GB | 28 GB | RTX 3060 12 GB / 4070 |
| Qwen 3 14B | 14B | 10 GB | 28 GB | RTX 3060 12 GB / 4070 |
| Gemma 3 27B | 27B | 19 GB | 54 GB | RTX 3090 / 4090 |
| Qwen 3 32B | 32B | 22 GB | 64 GB | RTX 3090 / 4090 |
| DeepSeek Coder V2 33B | 33B | 23 GB | 66 GB | RTX 3090 / 4090 |
| CodeLlama 34B | 34B | 23 GB | 68 GB | RTX 3090 / 4090 |
| Mixtral 8x7B (MoE) | 47B total / 13B active | 28 GB | 90 GB | RTX 4090 + offload / A6000 |
| Llama 3.5 70B | 70B | 43 GB | 140 GB | Dual 4090 / RTX 6000 Ada |
| Llama 4 70B | 70B | 44 GB | 140 GB | Dual 4090 / RTX 6000 Ada |
| Qwen 3 72B | 72B | 45 GB | 144 GB | Dual 4090 / RTX 6000 Ada |
| Mistral Large 2 123B | 123B | 75 GB | 246 GB | Dual A100 / H100 |
| Mixtral 8x22B (MoE) | 141B total / 39B active | 86 GB | 282 GB | Dual A100 / H100 |
| DeepSeek V3 671B (MoE) | 671B total / 37B active | 380 GB | 1.3 TB | 8x H100 / 4x MI300X |
| Llama 4 405B | 405B | 240 GB | 810 GB | 4x H100 / 2x MI300X |

These figures are planning minimums, not worst-case budgets. Per community benchmarks shared on r/LocalLLaMA and Phoronix, real-world deployments often need 5-10% more headroom once long context, speculative decoding, or batch generation is enabled.

What fits on a 12 GB RTX 3060 vs 24 GB+ cards? (concrete model lists)

The 12 GB RTX 3060 — typified by the MSI GeForce RTX 3060 Ventus 2X 12G and the ZOTAC Gaming GeForce RTX 3060 Twin Edge — is the price-performance hero of the 2026 local LLM scene. Per TechPowerUp's RTX 3060 specs page, the card carries 12 GB GDDR6 on a 192-bit bus delivering 360 GB/s of bandwidth, which is the actual rate-limiter for token generation on memory-bound workloads.

What fits comfortably on 12 GB at usable 8K context:

  • Llama 3.5 8B q4_K_M, q5_K_M, even q6_K
  • Qwen 3 7B and Qwen 3 14B at q4
  • Mistral 7B and Mistral Nemo 12B at q4
  • Gemma 3 9B at q5
  • Phi-4 14B at q4 (tight but workable)
  • DeepSeek R1 distill 8B and 14B at q4
  • CodeLlama 13B at q4

What needs creativity (heavy quantization, partial offload, short context):

  • Gemma 3 27B at q3 with CPU offload
  • Qwen 3 32B at q3 (degraded quality)
  • Mixtral 8x7B with most layers offloaded to CPU

What does not fit usefully:

  • Any 70B model at any quantization
  • Mixtral 8x22B
  • Mistral Large, Llama 4 405B

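Step up to 24 GB (RTX 3090, RTX 4090) or the 32 GB RTX 5090 and the picture changes dramatically. The 27B-32B class runs at q4 with the weights fully on the GPU, though 32K context on a 32B model is tight at 24 GB and may need a quantized KV cache. The 70B class still needs partial CPU offload on 24 GB, since even q2_K weights come to about 29 GB, and Mixtral 8x7B needs q3 or partial offload because its q4 footprint is roughly 28 GB. 48 GB cards such as the RTX 6000 Ada bring 70B q4 with 8K-class context into single-card territory, and the 96 GB RTX PRO 6000 Blackwell adds room for long context. Datacenter 80 GB H100s host 70B at fp8 or 123B at q4.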

Quantization matrix: q2/q3/q4/q5/q6/q8/fp16 for a 14B model

Using Qwen 3 14B as the reference, the trade-off curve looks like this (figures synthesized from publicly cited llama.cpp benchmark threads and Hugging Face quant repos):

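| Quant | VRAM (8K ctx) | Tok/s on RTX 3060 12 GB | Tok/s on RTX 4090 | Quality vs fp16 |
|---|---|---|---|---|
| q2_K | 5.5 GB | 38 | 145 | Noticeably degraded |
| q3_K_M | 6.6 GB | 34 | 130 | Mildly degraded |
| q4_K_S | 8.0 GB | 31 | 120 | Near-parity, minor |
| q4_K_M | 8.8 GB | 29 | 115 | Sweet spot |
| q5_K_M | 10.2 GB | 25 | 100 | Indistinguishable |
| q6_K | 11.8 GB | 21 | 88 | Indistinguishable |
| q8_0 | 14.8 GB | OOM | 72 | Effectively fp16 |
| fp16 | 28 GB | OOM | 38 | Reference |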

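The community consensus, repeated across r/LocalLLaMA threads, is that q4_K_M is the dominant choice for 7B-32B models and q5_K_M is worth it whenever VRAM allows. Below q3, quality loss becomes obvious on reasoning and code tasks. One caveat on the table: the fp16 row's 28 GB footprint exceeds the RTX 4090's 24 GB of VRAM, so that configuration cannot run entirely on the GPU and its throughput figure should be treated with caution.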
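Choosing a quantization from the table reduces to "take the highest-precision row that fits the budget." A minimal sketch of that rule, using the 8K-context VRAM column for Qwen 3 14B; the 1 GB default headroom and the function name `best_quant_that_fits` are assumptions made for illustration.

```python
from typing import Optional

# VRAM at 8K context for Qwen 3 14B, copied from the table above (GB),
# ordered from highest to lowest precision.
QUANT_VRAM_GB = [
    ("fp16", 28.0),
    ("q8_0", 14.8),
    ("q6_K", 11.8),
    ("q5_K_M", 10.2),
    ("q4_K_M", 8.8),
    ("q4_K_S", 8.0),
    ("q3_K_M", 6.6),
    ("q2_K", 5.5),
]


def best_quant_that_fits(vram_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the highest-precision quant whose footprint fits the budget.

    headroom_gb is VRAM held back for the desktop, driver, and spikes
    (an assumed default). Returns None when nothing fits.
    """
    budget = vram_gb - headroom_gb
    for name, need_gb in QUANT_VRAM_GB:
        if need_gb <= budget:
            return name
    return None


if __name__ == "__main__":
    for card_gb in (6, 8, 12, 24):
        print(f"{card_gb} GB card -> {best_quant_that_fits(card_gb)}")
```

With 1 GB of headroom a 12 GB card lands on q5_K_M instead of q6_K, which matches the "tight" feel of the 11.8 GB row in the table.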

Prefill vs generation across model sizes

LLM inference has two phases with different bottlenecks. Prefill (processing the prompt) is compute-bound and benefits from tensor cores and high FP16/FP8 throughput. Generation (producing tokens one at a time) is memory-bandwidth-bound, which is why the 360 GB/s bandwidth of the RTX 3060 caps it around 30-40 tok/s on a 14B model regardless of the underlying compute.

Practically: bigger models hurt generation speed roughly linearly with parameter count, but prefill time grows with prompt length on top of model size. A 70B q4 model split across two 4090s might process 2K of prompt in 1.5 seconds (prefill) and then generate at 12 tok/s. A 7B model on a single 4090 processes the same prompt in 80 ms and generates at 95 tok/s. For agent loops with long prompts, prefill latency dominates user-perceived speed.
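The bandwidth limit on generation can be estimated directly: if every generated token requires streaming all the weights from VRAM once, tokens per second cannot exceed bandwidth divided by weight size. The sketch below applies that bound to the RTX 3060's 360 GB/s and a 14B model at q4_K_M (4.85 bits per weight); `decode_ceiling_tok_s` is a hypothetical helper, and the bound ignores KV cache reads and kernel overhead.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s when each token streams all weights once."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    rtx3060_bandwidth = 360.0        # GB/s, figure quoted in this guide
    weights_14b_q4 = 14 * 4.85 / 8   # about 8.5 GB at q4_K_M
    ceiling = decode_ceiling_tok_s(rtx3060_bandwidth, weights_14b_q4)
    print(f"14B q4_K_M on RTX 3060: at most ~{ceiling:.0f} tok/s")
```

Measured speeds sit below this ceiling, which is consistent with the 30-40 tok/s range quoted above.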

Context-length impact analysis: how 8K vs 32K context changes the VRAM budget

The KV cache scales linearly with sequence length and is roughly:

KV bytes ≈ 2 × n_layers × n_kv_heads × head_dim × sequence_length × bytes_per_kv_value

For Llama 3.5 8B (32 layers, 8 KV heads, head_dim 128) at fp16 KV, that works out to approximately 0.13 MB per token, or about 131 MB per 1,000 tokens. For Qwen 3 14B (40 layers, 8 KV heads, head_dim 128), about 164 MB per 1,000 tokens. For Llama 3.5 70B (80 layers, 8 KV heads, head_dim 128), about 328 MB per 1,000 tokens.

The numbers compound fast. A 70B model at q4 with weights consuming 43 GB needs roughly 0.7 GB more for a 2K context and about 10.7 GB more for a 32K context, for a total near 54 GB. Architectures without grouped-query attention (64 KV heads instead of 8) need eight times as much cache. This is why long-context agent workloads push even 48 GB cards into multi-GPU territory and why KV-cache quantization (8-bit or 4-bit KV) has become a routine deployment knob in 2026.
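The KV formula is easy to evaluate directly, which is the safest way to check a budget. A minimal sketch, assuming the layer counts, KV head counts, and head dimensions quoted above; `kv_cache_bytes` is a hypothetical helper name and sizes use 1 GB = 1e9 bytes.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: float = 2.0) -> float:
    """KV cache size in bytes: one K and one V vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Shapes as quoted in the text: (layers, KV heads, head_dim).
MODELS = {
    "Llama 3.5 8B": (32, 8, 128),
    "Qwen 3 14B": (40, 8, 128),
    "Llama 3.5 70B": (80, 8, 128),
}

if __name__ == "__main__":
    for name, (layers, kv_heads, head_dim) in MODELS.items():
        for ctx in (2048, 8192, 32768):
            gb = kv_cache_bytes(layers, kv_heads, head_dim, ctx) / 1e9
            print(f"{name:14s} ctx={ctx:6d}: {gb:5.2f} GB at fp16 KV")
```

Passing `bytes_per_value=1.0` approximates an 8-bit KV cache and halves every figure. Check the model's config for the real `n_kv_heads` before trusting the result, since that one number changes the answer by up to 8x.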

Multi-GPU scaling: when two cards beat one bigger card

Two 24 GB cards (e.g., a pair of used RTX 3090s) provide 48 GB of combined VRAM for well under the price of a single RTX 6000 Ada. Frameworks that support tensor-parallelism or pipeline-parallelism (vLLM, exllamav2, llama.cpp with --tensor-split) can host a 70B q4 model across both cards.

The trade-offs:

  • Tensor-parallelism is fast but needs high PCIe bandwidth between the cards (ideally x8 each on Gen4) and benefits from NVLink on Ampere.
  • Pipeline-parallelism is more bandwidth-tolerant but adds latency.
  • Mixture-of-Experts models (Mixtral 8x7B, DeepSeek V3) generally split poorly across consumer cards because expert routing creates uneven memory pressure.
  • Single-card simplicity matters: drivers, power, cooling, framework support, and software bugs all multiply with a second GPU.

The Puget Systems write-ups on multi-GPU AI workstations capture the practical wisdom: dual cards are worth it when you already own one and the target model genuinely does not fit on the larger single card you would otherwise buy.
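Before buying a second card, it helps to check whether the target model fits in the combined VRAM at all and what split to request. The sketch below is a rough planning aid under stated assumptions: it formats VRAM-weighted proportions in the comma-separated form that llama.cpp's --tensor-split flag accepts, and it reserves 1 GB per GPU for runtime overhead (an assumed figure). The helper names are hypothetical, and the 2.7 GB KV figure is the fp16 cache for the 70B shape in this guide at 8K context.

```python
def tensor_split_arg(gpu_vram_gb: list[float]) -> str:
    """Comma-separated proportions, one per GPU, weighted by VRAM."""
    total = sum(gpu_vram_gb)
    return ",".join(f"{v / total:.2f}" for v in gpu_vram_gb)


def fits_across(weights_gb: float, kv_gb: float, gpu_vram_gb: list[float],
                reserve_per_gpu_gb: float = 1.0) -> bool:
    """True if weights plus KV cache fit in the combined usable VRAM.

    reserve_per_gpu_gb is an assumed allowance for CUDA context and
    scratch buffers on each card.
    """
    usable = sum(max(v - reserve_per_gpu_gb, 0.0) for v in gpu_vram_gb)
    return weights_gb + kv_gb <= usable


if __name__ == "__main__":
    gpus = [24.0, 24.0]  # e.g. a pair of RTX 3090s
    print("--tensor-split", tensor_split_arg(gpus))
    print("70B q4 at 8K context fits:", fits_across(43.0, 2.7, gpus))
    print("70B q4 at 32K context fits:", fits_across(43.0, 10.7, gpus))
```

The check treats VRAM as one pool, which is optimistic: layers do not divide perfectly and each card needs its own buffers, so a result that only just fits deserves a real test before purchase.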

Perf-per-dollar tier recommendations (entry / mainstream / prosumer)

Entry tier: under $400 total GPU spend. A 12 GB RTX 3060 remains the unmatched value pick for 7B-14B chat models. It handles q4 models across that range at roughly 25-35 tok/s, including Llama 3.5 8B, Qwen 3 7B/14B, and Phi-4 14B at the upper edge.

Mainstream tier: $700-$1,000. RTX 4070 Ti Super 16 GB or used RTX 3090 24 GB. The 3090's 24 GB unlocks 32B q4 territory, with 70B reachable only through q2 quantization plus partial CPU offload; the 4070 Ti Super offers better efficiency at 14B-class workloads.

Prosumer tier: $1,500-$3,000. RTX 4090 24 GB or dual RTX 3090. A single card covers 32B q4; the dual-GPU configuration adds 70B q3-q4 and Mixtral 8x7B at q4 without offload.

Workstation: $5,000+. RTX 6000 Ada 48 GB or RTX PRO 6000 Blackwell 96 GB. 70B q4 on a single card, with room for fine-tuning small models.

Across every tier, a fast NVMe like the Western Digital 1TB WD Blue SN550 NVMe keeps model swap times measured in seconds rather than minutes — particularly valuable when juggling 30-40 GB GGUFs.

Verdict matrix: Buy a 12 GB card if… / Step up to 24 GB if…

Buy a 12 GB RTX 3060 if:

  • Your primary use is chat, writing, summarization, or general assistance.
  • You plan to live in the 7B-14B model range.
  • Your contexts are typically under 16K tokens.
  • You are price-sensitive and your alternative is "wait six months."

Step up to a 24 GB card if:

  • You want to run 27B-32B class models comfortably.
  • Coding workloads (DeepSeek Coder, CodeLlama 34B) are the focus.
  • You routinely use 32K+ context for RAG or agents.
  • You want headroom to experiment with Mixtral 8x7B.

Step up to 48 GB or dual-GPU if:

  • 70B models are the target.
  • You need full-context long-document workflows on large models.
  • You plan to fine-tune anything larger than 7B.

Step up to datacenter cards if:

  • Frontier-scale models (DeepSeek V3, Llama 4 405B) are the target.
  • Latency-sensitive multi-tenant serving is the use case.

Bottom line

Model first, quantization second, hardware third. The math is simple — params × bytes-per-weight × overhead — and the consequences of ignoring it are expensive in both directions. For the majority of 2026 local-LLM use cases, a 12 GB RTX 3060 still wins on dollars-per-token; it stops winning the moment 32B or 70B models enter the requirements list, at which point 24 GB and beyond become non-negotiable. Treat the compatibility table above as a planning document, not a guarantee, and measure your actual KV-cache footprint with your real prompts before committing to a card.

Frequently asked questions

How do I calculate the VRAM a model needs?

Multiply the parameter count by bytes-per-weight for your quantization, then add overhead for the KV cache and context. A 7B model at 4-bit needs roughly 4-5GB for weights plus a context-dependent KV cache. A 70B model at 4-bit needs around 40GB for weights alone, which is why it requires multiple GPUs or a high-VRAM workstation card rather than a single consumer 12GB board.

What LLMs fit on a 12GB RTX 3060?

Models up to roughly 14B parameters at q4_K_M fit on a 12GB RTX 3060 with usable context. That covers most popular small chat and coding models. Larger 27B-32B models need heavier quantization or partial CPU offload, which sharply reduces tokens-per-second. For models above 32B, a 12GB card is not the right tool and a 24GB-plus GPU is recommended.

Can a Raspberry Pi 4 run any LLM usefully?

A Raspberry Pi 4 8GB can run very small quantized models in the 1B-3B range at low single-digit tokens-per-second. It is useful for learning, lightweight classification, or offline edge tasks where speed is not critical. It is not suitable for interactive coding or chat with larger models; for that you need a discrete GPU with several gigabytes of dedicated VRAM.

Does context length change the GPU I should buy?

Yes. Longer context inflates the KV cache, which consumes VRAM on top of model weights. A model that fits at 4k context may overflow 12GB at 32k context. If your workload needs long documents or extended agent loops, budget extra VRAM headroom or step up a tier, because hitting the limit forces offload that cripples generation speed.

Are two 12GB cards better than one 24GB card?

It depends on the runtime. Some frameworks split a model cleanly across two GPUs to host larger models, but multi-GPU adds communication overhead and complexity, and not every model or quantization splits efficiently. A single 24GB card avoids that overhead and is simpler. Dual 12GB cards make sense mainly when you already own one and want to extend capacity cheaply.

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

— SpecPicks Editorial · Last verified 2026-06-15
