VRAM is the single number that determines which LLMs you can run. Here's the practical breakdown.
The math
At q4_K_M (the most common community quant), weight size is roughly params × 0.6 bytes:
- 8B model → 4.8 GB weights + KV cache + overhead → ~6 GB VRAM
- 14B → ~10 GB
- 32B → ~22 GB
- 70B → ~42 GB
- 405B → ~245 GB
Add ~10-30% for KV cache at 4K-8K context. Add more for batch inference or longer contexts.
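That rule is simple enough to script. A minimal sketch of the estimate, using the 0.6 bytes/param density and the 10-30% cache/overhead band above (both rules of thumb, not measurements):

```python
def vram_estimate_gb(params_b: float, bytes_per_param: float = 0.6,
                     overhead: float = 0.2) -> float:
    """Rough total VRAM in GB: quantized weights plus a
    10-30% allowance for KV cache and runtime overhead."""
    weights_gb = params_b * bytes_per_param
    return weights_gb * (1 + overhead)

# 8B at q4_K_M with a 20% allowance:
print(f"~{vram_estimate_gb(8):.1f} GB")  # ~5.8 GB, in line with the ~6 GB above
```

Swap bytes_per_param for other quant densities (q8_0 ≈ 1.06, fp16 = 2.0) and the same function covers other rows of the lookup table further down.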
By card tier
8 GB (RTX 3060 Ti, RTX 4060, RX 7600): Llama 3.1 8B at q4. Gemma 2 9B at q3. Phi-4 works. Don't bother with anything bigger.
12 GB (RTX 3060 12GB, RTX 4070, RTX 5070, Arc B580): Llama 3.1 8B at q8, Qwen 3 14B q4 tight, Qwen 3 14B q3 comfortable. Stable Diffusion workflows are genuinely happy here.
16 GB (RTX 4060 Ti 16GB, RTX 5080, RX 7800 XT): Qwen 3 14B q8, Qwen 3 32B q3 (a squeeze: budget ~17 GB, so trim context), decent Flux workflows. The "sweet spot for hobbyists" zone.
24 GB (RTX 4090, RX 7900 XTX): Qwen 3 32B q4 native, Llama 3.1 70B with CPU offload, full Flux.1 fp16. The LocalLLaMA community standard for a reason.
32 GB (RTX 5090): Llama 3.1 70B q4_K_M native at ~34 tok/s. This is the first consumer card that runs frontier-class models without offloading. If you're serious about 70B-class work, 32GB VRAM is table stakes.
48 GB+ (RTX PRO 6000 Blackwell, dual 4090, Mac Studio M4 Max 64GB+): 70B q8, 120B class, fine-tuning LoRAs on 7-13B models.
128+ GB (Mac Studio M4 Max 128GB, M3 Ultra 256/512GB): Runs models discrete GPUs can't touch — 405B at q4, multi-model loading, multi-user serving. Lower tok/s but unmatched memory capacity.
The hidden cost: prefill vs generation
Token-per-second numbers everyone quotes are generation speed — the tokens after the first one. Prefill (processing the prompt) is a different metric that degrades with long contexts. A 4090 might generate at 30 tok/s on a 70B model but take 2-3 seconds to process a 4K-token prompt before generation even starts.
For chat, prefill doesn't matter much. For RAG pipelines that re-ingest long context on every turn, prefill is often the real bottleneck.
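To see why, split a turn's latency into the two phases. A toy model (the 1,500 tok/s prefill and 30 tok/s generation rates are illustrative assumptions, not benchmarks of any card):

```python
def turn_latency_s(prompt_tokens: int, output_tokens: int,
                   prefill_tps: float = 1500.0, gen_tps: float = 30.0):
    """Return (prefill seconds, generation seconds) for one turn."""
    return prompt_tokens / prefill_tps, output_tokens / gen_tps

# Chat turn: short prompt, prefill is noise.
print(turn_latency_s(300, 300))    # (0.2, 10.0)
# RAG turn: 8K of re-ingested context; prefill now rivals generation.
print(turn_latency_s(8000, 150))   # prefill ~5.3 s vs 5.0 s of generation
```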
How we tested and compared
Every VRAM tier in this guide is anchored in rows from the SpecPicks hardware_specs table, where each GPU's VRAM capacity and release pricing are tracked live. Weight-footprint numbers use the standard community rule of thumb (q4_K_M ≈ 0.6 bytes per parameter), cross-validated against the Bartowski GGUF model cards on Hugging Face and the llama.cpp quantization README.
KV-cache numbers come from llama.cpp's own reporting — run any model with -v and it prints the per-token KV-cache bytes at your chosen precision. For Apple Silicon we cross-reference the llama.cpp Apple Silicon performance thread #4167; for discrete NVIDIA we lean on r/LocalLLaMA community reports.
Full quantization matrix — what fits where
Below is the detailed quantization × model-size lookup that the tier-summary above abstracts over. Numbers assume 4K-token context, KV cache at fp16, no batch. All values in GB of total VRAM (weights + cache + overhead).
| Model size | q3_K_M | q4_K_M | q5_K_M | q6_K | q8_0 | fp16 |
|---|---|---|---|---|---|---|
| 7B | 4 | 5 | 6 | 6.5 | 9 | 16 |
| 8B | 4.5 | 6 | 7 | 7.5 | 10 | 18 |
| 13-14B | 8 | 10 | 11 | 12 | 16 | 28 |
| 32B | 17 | 22 | 25 | 27 | 36 | 66 |
| 70B | 33 | 42 | 49 | 54 | 74 | 140 |
| 405B | 200 | 245 | 290 | 335 | 430 | 820 |
Rules of thumb:
- Multiply by 1.2-1.5× for 32K context (KV cache dominates).
- Multiply by 2-3× for batch inference (one KV cache per request).
- Enable llama.cpp's -ctk q8_0 -ctv q8_0 to halve KV-cache size with ~1% quality loss.
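Those multipliers compose. A sketch that keeps weights and KV cache separate so each rule applies to the right term (the example figures are illustrative):

```python
def total_vram_gb(weights_gb: float, kv_gb_base: float,
                  ctx_mult: float = 1.0, batch: int = 1,
                  kv_q8: bool = False) -> float:
    """Weights are fixed; the KV cache scales with context length and with
    the number of concurrent requests, and q8_0 KV quantization halves it."""
    kv = kv_gb_base * ctx_mult * batch
    if kv_q8:
        kv /= 2
    return weights_gb + kv

# 70B q4_K_M: 42 GB weights, ~2.5 GB KV at 8K fp16 (illustrative baseline).
# Stretch context 4x (to 32K) and quantize the KV cache:
print(total_vram_gb(42, 2.5, ctx_mult=4, kv_q8=True))  # 47.0
```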
Context-length multiplier math
KV cache per token is approximately 2 × n_layers × n_kv_heads × head_dim × bytes_per_element (KV heads, not attention heads: GQA models like Llama 3.1 share each KV head across many query heads). For Llama 3.1 70B (80 layers, 8 KV heads, head_dim 128) at fp16 KV that's ~320 KB per token. Multiply by context length:
| Context | KV cache on 70B (fp16) | KV cache on 70B (q8_0) |
|---|---|---|
| 2K | 0.6 GB | 0.3 GB |
| 8K | 2.5 GB | 1.3 GB |
| 32K | 10 GB | 5 GB |
| 128K | 40 GB | 20 GB |
At 128K context on a 70B model, the fp16 KV cache is about as large as the q4_K_M weights themselves, which is why long-context LLM serving usually wants a 48+ GB card per worker plus KV quantization on top.
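The formula is easy to sanity-check in a few lines. A sketch using Llama 3.1 70B's published shape (80 layers, 8 KV heads, head_dim 128; values taken from the model config, assumed here):

```python
def kv_cache_bytes(n_ctx: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """2 (K and V) x layers x KV heads x head_dim x element size per token,
    times context length. Defaults match Llama 3.1 70B with fp16 KV."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_ctx

print(kv_cache_bytes(1) // 1024)                          # 320 KiB per token
print(kv_cache_bytes(8192) / 2**30)                       # 2.5 GiB at 8K
print(kv_cache_bytes(131072, bytes_per_elem=1) / 2**30)   # 20.0 GiB at 128K, q8_0
```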
Perf-per-dollar and perf-per-watt by tier
Because tok/s scales roughly with memory bandwidth, we can estimate perf-per-dollar for LLM inference once you pick a model size. For Llama 3.1 70B q4_K_M on a single-user chat workload:
| Card | VRAM | Price (new) | TDP | Measured tok/s | $/tok/s | W/tok/s |
|---|---|---|---|---|---|---|
| RTX 5090 | 32 GB | $1,999 | 575 W | ~34 | $59 | 16.9 |
| RTX 4090 | 24 GB | $1,599 | 450 W | ~27 | $59 | 16.7 |
| 2× RTX 3090 | 48 GB | ~$1,200 used | 700 W | ~20 | $60 | 35.0 |
| M3 Ultra 256GB | 256 GB | $5,599 | 120 W | ~18 | $311 | 6.7 |
Takeaway: the RTX 5090 and RTX 4090 are effectively tied on dollars per tok/s for interactive inference. The M3 Ultra costs ~5× as much per tok/s but adds a capability the discrete cards can't touch (running 400B-class models); the dual-3090 build wins only on sticker price, not on power efficiency.
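Both efficiency columns are plain ratios, so they're easy to recompute or extend to other cards (prices and tok/s below are copied from the table above):

```python
cards = {
    "RTX 5090":       dict(price=1999, tdp=575, tps=34),
    "RTX 4090":       dict(price=1599, tdp=450, tps=27),
    "2x RTX 3090":    dict(price=1200, tdp=700, tps=20),
    "M3 Ultra 256GB": dict(price=5599, tdp=120, tps=18),
}

for name, c in cards.items():
    usd_per_tps = c["price"] / c["tps"]  # dollars per tok/s of throughput
    w_per_tps = c["tdp"] / c["tps"]      # watts per tok/s (lower is better)
    print(f"{name}: ${usd_per_tps:.0f}/tok/s, {w_per_tps:.1f} W per tok/s")
```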
Frequently asked questions
What's the cheapest GPU that runs a 70B model well?
Dual RTX 3090 (~$1,200 used, 48 GB combined) is the unambiguous answer in 2026. Single-card, a used RTX A6000 (48 GB Ampere, ~$3,000-3,500 used) is the closest, but you pay ~3× for the form-factor convenience.
Can I use CPU RAM to run a model bigger than my VRAM?
Yes, with llama.cpp's -ngl N layer-offload. Expect 5-20× slowdown — CPU RAM has 1/20 the bandwidth of GPU VRAM. Fine for overnight batch inference; painful for interactive chat.
Does Apple Silicon unified memory count as VRAM?
Yes, effectively. Apple's unified memory is accessible to the GPU at the bandwidth listed in the chip spec (273-819 GB/s depending on tier). It doesn't behave exactly like discrete VRAM — there's no separate allocation, and macOS reserves a slice for the system by default — but for inference purposes, a 128 GB machine can dedicate the large majority of that memory to model weights.
What about AMD GPUs?
Same calculation — the AMD RX 7900 XTX (24 GB) lines up with a 4090. ROCm 6.x is mature enough on Linux that you can treat it like a CUDA card for inference. Windows ROCm support still lags as of mid-2026.
Should I buy more VRAM now or wait?
Buy now if you have a concrete workload. Consumer VRAM tiers have barely moved in two generations (4090 → 5090 went from 24→32 GB). The jump to 48+ GB consumer cards isn't on any roadmap we can see.
Sources
- llama.cpp GitHub — quantization README — authoritative on quant sizes and KV-cache math.
- llama.cpp GitHub Discussions #4167 — Apple Silicon community benchmarks.
- r/LocalLLaMA — community VRAM / tok/s reports across every card.
- Tom's Hardware GPU Hierarchy — raw GPU performance reference.
- Tom's Hardware — RTX 5090 review — 32 GB class details.
Related guides
- Best GPU for an AI rig
- Best GPU for Llama 3.1 70B
- Best Mac for running local LLMs
- RTX 5090 vs M4 Max for AI
— SpecPicks Editorial · Last verified 2026-04-21
