VRAM calculator: what can you actually run on your GPU?

A VRAM-first breakdown of which LLMs fit on 8GB, 12GB, 16GB, 24GB, 32GB, and 48GB+ cards, at every common quantization.

VRAM is the single number that determines which LLMs you can run. Here's the practical breakdown.

The math

At q4_K_M (the most common community quant), weight size is roughly params × 0.6 bytes:

  • 8B model → 4.8 GB weights + KV cache + overhead → ~6 GB VRAM
  • 14B → ~10 GB
  • 32B → ~22 GB
  • 70B → ~42 GB
  • 405B → ~220 GB

Add ~10-30% for KV cache at 4K-8K context. Add more for batch inference or longer contexts.
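The rule of thumb above is easy to script. A minimal sketch (the 0.6 bytes/param figure and the 10-30% overhead band are this guide's approximations; the helper name is ours):

```python
def estimate_vram_gb(params_b: float, bytes_per_param: float = 0.6,
                     overhead: float = 1.25) -> float:
    """Rough total VRAM (GB) for a quantized model.

    params_b        -- parameter count in billions (8 for an 8B model)
    bytes_per_param -- ~0.6 for q4_K_M (rule of thumb above)
    overhead        -- 1.1-1.3 covers KV cache + runtime at 4K-8K context
    """
    return params_b * bytes_per_param * overhead

print(round(estimate_vram_gb(8), 1))  # 8B at q4_K_M -> 6.0 GB
```

Pass `overhead=1.0` to see the bare weight footprint instead.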

By card tier

8 GB (RTX 3060 Ti, RTX 4060, RX 7600): Llama 3.1 8B at q4. Gemma 2 9B at q3. Phi-4 works. Don't bother with anything bigger.

12 GB (RTX 4060 Ti 12GB, RTX 5060 Ti 12GB, Arc B580): Llama 3.1 8B fp16, Qwen 3 14B q4 tight, Qwen 3 14B q3 comfortable. Stable Diffusion workflows are actually happy here.

16 GB (RTX 4060 Ti 16GB, RTX 5080, RX 7800 XT): Qwen 3 14B q8, Qwen 3 32B q3, decent Flux workflows. The "sweet spot for hobbyists" zone.

24 GB (RTX 4090, RX 7900 XTX): Qwen 3 32B q4 native, Llama 3.1 70B with CPU offload, full Flux.1 fp16. The LocalLLaMA community standard for a reason.

32 GB (RTX 5090): Llama 3.1 70B q4_K_M native at ~34 tok/s. This is the first consumer card that runs frontier-class models without offloading. If you're serious about 70B-class work, 32GB VRAM is table stakes.

48 GB+ (RTX PRO 6000 Blackwell, dual 4090, Mac Studio M4 Max 64GB+): 70B q8, 120B class, fine-tuning LoRAs on 7-13B models.

128+ GB (Mac Studio M4 Max 128GB, M3 Ultra 256/512GB): Runs models discrete GPUs can't touch — 405B at q4, multi-model loading, multi-user serving. Lower tok/s but unmatched memory capacity.
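Mapping an estimate to the tiers above can be sketched as a small lookup (tier list taken from this guide's headings; the helper is hypothetical):

```python
TIERS_GB = [8, 12, 16, 24, 32, 48, 128, 256]  # common card/unified-memory tiers

def smallest_tier(required_gb: float):
    """Smallest common VRAM tier (GB) that fits the estimate, else None."""
    for tier in TIERS_GB:
        if required_gb <= tier:
            return tier
    return None

print(smallest_tier(22))   # 32B q4 (~22 GB) -> 24
print(smallest_tier(42))   # 70B q4 (~42 GB) -> 48
print(smallest_tier(220))  # 405B q4 (~220 GB) -> 256
```

Note that a 22 GB footprint on a 24 GB card is tight, exactly as the 24 GB tier entry above says.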

The hidden cost: prefill vs generation

Token-per-second numbers everyone quotes are generation speed — the tokens after the first one. Prefill (processing the prompt) is a different metric that degrades with long contexts. A 4090 might generate at 30 tok/s on a 70B model but take 2-3 seconds to process a 4K-token prompt before generation even starts.

For chat, prefill doesn't matter much. For RAG pipelines that re-ingest long context on every turn, prefill is often the real bottleneck.

How we tested and compared

Every VRAM tier in this guide is anchored in rows from the SpecPicks hardware_specs table, where each GPU's VRAM capacity and release pricing are tracked live. Weight-footprint numbers use the standard community rule of thumb (q4_K_M ≈ 0.6 bytes per parameter), cross-validated against the Bartowski GGUF model cards on Hugging Face and the llama.cpp quantization README.

KV-cache numbers come from llama.cpp's own reporting — run any model with -v and it prints the per-token KV-cache bytes at your chosen precision. For Apple Silicon we cross-reference the llama.cpp Apple Silicon performance thread #4167; for discrete NVIDIA we lean on r/LocalLLaMA community reports.

Full quantization matrix — what fits where

Below is the detailed quantization × model-size lookup that the tier-summary above abstracts over. Numbers assume 4K-token context, KV cache at fp16, no batch. All values in GB of total VRAM (weights + cache + overhead).

Model size | q3_K_M | q4_K_M | q5_K_M | q6_K | q8_0 | fp16
7B         | 4      | 5      | 6      | 6.5  | 9    | 16
8B         | 4.5    | 6      | 7      | 7.5  | 10   | 18
13-14B     | 8      | 10     | 11     | 12   | 16   | 28
32B        | 17     | 22     | 25     | 27   | 36   | 66
70B        | 33     | 42     | 49     | 54   | 74   | 140
405B       | 180    | 220    | 260    | 290  | 400  | 820

Rules of thumb:

  • Multiply by 1.2-1.5× for 32K context (KV cache dominates).
  • Multiply by 2-3× for batch inference (one KV cache per request).
  • Enable llama.cpp's -ctk q8_0 -ctv q8_0 to halve KV-cache size with ~1% quality loss.
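Applied to the 70B row of the table, those multipliers look like this (midpoints of the stated ranges, chosen for illustration):

```python
base_70b_q4 = 42.0                # GB total at 4K context, from the table

ctx_32k = base_70b_q4 * 1.35      # midpoint of the 1.2-1.5x long-context rule
batched = base_70b_q4 * 2.5       # midpoint of the 2-3x batch-inference rule

print(f"32K context: ~{ctx_32k:.0f} GB")  # ~57 GB
print(f"batched:     ~{batched:.0f} GB")  # ~105 GB
```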

Context-length multiplier math

KV cache per token is approximately 2 × n_layers × n_kv_heads × head_dim × bytes_per_element, where the leading 2 covers both K and V; for GQA models, the head count is the KV-head count, not the total attention-head count. For a Llama-3.1-70B-class model (80 layers, 8 KV heads, head_dim 128) at fp16 KV that's ~320 KB per token. Multiply by context length:

Context | KV cache on 70B (fp16) | KV cache on 70B (q8_0)
2K      | 0.6 GB                 | 0.3 GB
8K      | 2.5 GB                 | 1.25 GB
32K     | 10 GB                  | 5 GB
128K    | 40 GB                  | 20 GB

At 128K context on a 70B model, the fp16 KV cache rivals the quantized weights themselves in size, which is why long-context LLM serving usually wants a 48+ GB card per worker.
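The per-token formula is simple enough to check in code. A sketch, assuming a Llama-3.1-70B-style configuration (80 layers, 8 KV heads via GQA, head_dim 128; verify against the actual model card):

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """KV-cache bytes per token: 2 (K and V) x layers x KV heads x head dim."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

per_tok = kv_bytes_per_token(80, 8, 128, 2)        # fp16 = 2 bytes/element
print(per_tok // 1024, "KB/token")                 # 320 KB/token
print(per_tok * 128 * 1024 / 2**30, "GB at 128K")  # 40.0 GB at 128K
```

Swap `bytes_per_elem=1` to model a q8_0 KV cache, which halves every figure.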

Perf-per-dollar and perf-per-watt by tier

Because tok/s scales roughly with memory bandwidth, we can estimate perf-per-dollar for LLM inference once you pick a model size. For Llama 3.1 70B q4_K_M on a single-user chat workload:

Card           | VRAM   | Price (new)  | TDP   | Measured tok/s | $/tok/s | W/tok/s
RTX 5090       | 32 GB  | $1,999       | 575 W | ~34            | $59     | 16.9
RTX 4090       | 24 GB  | $1,599       | 450 W | ~27            | $59     | 16.7
2× RTX 3090    | 48 GB  | ~$1,200 used | 700 W | ~20            | $60     | 35.0
M3 Ultra 256GB | 256 GB | $5,599       | 120 W | ~18            | $311    | 6.7

Takeaway: the RTX 5090 and RTX 4090 are essentially tied on cost per tok/s for interactive inference. The M3 Ultra costs ~5× as much per tok/s but adds capabilities (running 400B-class models) the discrete cards can't touch; the dual-3090 build matches the sticker-price efficiency of the newer cards while burning roughly twice the power per token.
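The two efficiency columns are plain ratios of listed price and TDP to measured tok/s; a sketch reproducing them:

```python
# (price USD, TDP watts, measured tok/s) from the table above
cards = {
    "RTX 5090":       (1999, 575, 34),
    "RTX 4090":       (1599, 450, 27),
    "2x RTX 3090":    (1200, 700, 20),
    "M3 Ultra 256GB": (5599, 120, 18),
}

for name, (price, tdp, toks) in cards.items():
    print(f"{name}: ${price / toks:.0f}/tok/s, {tdp / toks:.1f} W/tok/s")
```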

Frequently asked questions

What's the cheapest GPU that runs a 70B model well?

Dual RTX 3090 (~$1,200 used, 48 GB combined) is the unambiguous answer in 2026. Single-card, a used RTX A6000 (48 GB Ampere, ~$3,000-3,500 used) is the closest, but you pay ~3× for the form-factor convenience.

Can I use CPU RAM to run a model bigger than my VRAM?

Yes, via llama.cpp's -ngl N flag: put N layers on the GPU and run the rest from CPU RAM. Expect a 5-20× slowdown, since CPU RAM has roughly a tenth to a twentieth of GPU VRAM's bandwidth. Fine for overnight batch inference; painful for interactive chat.
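A rough way to pick the -ngl value is to divide free VRAM by per-layer weight size. This is a hypothetical helper, not llama.cpp's own logic: real layer sizes vary slightly, and the 2 GB reserve for KV cache and runtime is an assumption to tune.

```python
def gpu_layers(model_gb: float, n_layers: int, vram_gb: float,
               reserve_gb: float = 2.0) -> int:
    """Estimate how many of n_layers fit in VRAM, keeping reserve_gb
    free for KV cache and runtime overhead (assumed, tune per setup)."""
    per_layer_gb = model_gb / n_layers
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(n_layers, fit))

# 70B q4_K_M (~42 GB weights, 80 layers) on a 24 GB card:
print(gpu_layers(42, 80, 24))  # -> 41, i.e. start from -ngl 41
```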

Does Apple Silicon unified memory count as VRAM?

Yes, effectively. Apple's unified memory is accessible to the GPU at the bandwidth listed in the chip spec (273-819 GB/s depending on tier). It doesn't behave exactly like discrete VRAM — there's no separate allocation — but for inference purposes, 128 GB of unified memory holds 128 GB of model.

What about AMD GPUs?

Same calculation — the AMD RX 7900 XTX (24 GB) lines up with a 4090. ROCm 6.x is mature enough on Linux that you can treat it like a CUDA card for inference. Windows ROCm support still lags as of mid-2026.

Should I buy more VRAM now or wait?

Buy now if you have a concrete workload. Consumer VRAM tiers have barely moved in two generations (4090 → 5090 went from 24→32 GB). The jump to 48+ GB consumer cards isn't on any roadmap we can see.

Sources

  1. llama.cpp GitHub — quantization README — authoritative on quant sizes and KV-cache math.
  2. llama.cpp GitHub Discussions #4167 — Apple Silicon community benchmarks.
  3. r/LocalLLaMA — community VRAM / tok/s reports across every card.
  4. Tom's Hardware GPU Hierarchy — raw GPU performance reference.
  5. Tom's Hardware — RTX 5090 review — 32 GB class details.

— SpecPicks Editorial · Last verified 2026-04-21
