VRAM is the single number that determines which LLMs you can run. Here's the practical breakdown.
The math
At q4_K_M (the most common community quant), weight size is roughly params × 0.6 bytes:
- 8B model → 4.8 GB weights + KV cache + overhead → ~6 GB VRAM
- 14B → ~10 GB
- 32B → ~22 GB
- 70B → ~42 GB
- 405B → ~245 GB
Add ~10-30% for KV cache at 4K-8K context. Add more for batch inference or longer contexts.
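That rule is simple enough to script. A minimal sketch of the estimate, using the 0.6 bytes/param density and the 10-30% cache/overhead band above (both rules of thumb, not measurements):

```python
def vram_estimate_gb(params_b: float, bytes_per_param: float = 0.6,
                     overhead: float = 0.2) -> float:
    """Rough total VRAM in GB: quantized weights plus a
    10-30% allowance for KV cache and runtime overhead."""
    weights_gb = params_b * bytes_per_param
    return weights_gb * (1 + overhead)

# 8B at q4_K_M with a 20% allowance:
print(f"~{vram_estimate_gb(8):.1f} GB")  # ~5.8 GB, in line with the ~6 GB above
```

Swap bytes_per_param for other quant densities (q8_0 ≈ 1.06, fp16 = 2.0) and the same function covers other rows of the lookup table further down.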
By card tier
8 GB (RTX 3060 Ti, RTX 4060, RX 7600): Llama 3.1 8B at q4. Gemma 2 9B at q3. Phi-4 works. Don't bother with anything bigger.
12 GB (RTX 3060 12GB, RTX 4070, RTX 5070, Arc B580): Llama 3.1 8B at q8, Qwen 3 14B q4 tight, Qwen 3 14B q3 comfortable. Stable Diffusion workflows are genuinely happy here.
16 GB (RTX 4060 Ti 16GB, RTX 5080, RX 7800 XT): Qwen 3 14B q8, Qwen 3 32B q3 (a squeeze: budget ~17 GB, so trim context), decent Flux workflows. The "sweet spot for hobbyists" zone.
24 GB (RTX 4090, RX 7900 XTX): Qwen 3 32B q4 native, Llama 3.1 70B with CPU offload, full Flux.1 fp16. The LocalLLaMA community standard for a reason.
32 GB (RTX 5090): Llama 3.1 70B q4_K_M native at ~34 tok/s. This is the first consumer card that runs frontier-class models without offloading. If you're serious about 70B-class work, 32GB VRAM is table stakes.
48 GB+ (RTX PRO 6000 Blackwell, dual 4090, Mac Studio M4 Max 64GB+): 70B q8, 120B class, fine-tuning LoRAs on 7-13B models.
128+ GB (Mac Studio M4 Max 128GB, M3 Ultra 256/512GB): Runs models discrete GPUs can't touch — 405B at q4, multi-model loading, multi-user serving. Lower tok/s but unmatched memory capacity.
The hidden cost: prefill vs generation
Token-per-second numbers everyone quotes are generation speed — the tokens after the first one. Prefill (processing the prompt) is a different metric that degrades with long contexts. A 4090 might generate at 30 tok/s on a 70B model but take 2-3 seconds to process a 4K-token prompt before generation even starts.
For chat, prefill doesn't matter much. For RAG pipelines that re-ingest long context on every turn, prefill is often the real bottleneck.
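To see why, split a turn's latency into the two phases. A toy model (the 1,500 tok/s prefill and 30 tok/s generation rates are illustrative assumptions, not benchmarks of any card):

```python
def turn_latency_s(prompt_tokens: int, output_tokens: int,
                   prefill_tps: float = 1500.0, gen_tps: float = 30.0):
    """Return (prefill seconds, generation seconds) for one turn."""
    return prompt_tokens / prefill_tps, output_tokens / gen_tps

# Chat turn: short prompt, prefill is noise.
print(turn_latency_s(300, 300))    # (0.2, 10.0)
# RAG turn: 8K of re-ingested context; prefill now rivals generation.
print(turn_latency_s(8000, 150))   # prefill ~5.3 s vs 5.0 s of generation
```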
How we tested and compared
Every VRAM tier in this guide is anchored in rows from the SpecPicks hardware_specs table, where each GPU's VRAM capacity and release pricing are tracked live. Weight-footprint numbers use the standard community rule of thumb (q4_K_M ≈ 0.6 bytes per parameter), cross-validated against the Bartowski GGUF model cards on Hugging Face and the llama.cpp quantization README.
KV-cache numbers come from llama.cpp's own reporting — run any model with -v and it prints the per-token KV-cache bytes at your chosen precision. For Apple Silicon we cross-reference the llama.cpp Apple Silicon performance thread #4167; for discrete NVIDIA we lean on r/LocalLLaMA community reports.
Full quantization matrix — what fits where
Below is the detailed quantization × model-size lookup that the tier-summary above abstracts over. Numbers assume 4K-token context, KV cache at fp16, no batch. All values in GB of total VRAM (weights + cache + overhead).
| Model size | q3_K_M | q4_K_M | q5_K_M | q6_K | q8_0 | fp16 |
|---|---|---|---|---|---|---|
| 7B | 4 | 5 | 6 | 6.5 | 9 | 16 |
| 8B | 4.5 | 6 | 7 | 7.5 | 10 | 18 |
| 13-14B | 8 | 10 | 11 | 12 | 16 | 28 |
| 32B | 17 | 22 | 25 | 27 | 36 | 66 |
| 70B | 33 | 42 | 49 | 54 | 74 | 140 |
| 405B | 200 | 245 | 290 | 335 | 430 | 820 |
Rules of thumb:
- Multiply by 1.2-1.5× for 32K context (KV cache dominates).
- Multiply by 2-3× for batch inference (one KV cache per request).
- Enable llama.cpp's -ctk q8_0 -ctv q8_0 to halve KV-cache size with ~1% quality loss.
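Those multipliers compose. A sketch that keeps weights and KV cache separate so each rule applies to the right term (the example figures are illustrative):

```python
def total_vram_gb(weights_gb: float, kv_gb_base: float,
                  ctx_mult: float = 1.0, batch: int = 1,
                  kv_q8: bool = False) -> float:
    """Weights are fixed; the KV cache scales with context length and with
    the number of concurrent requests, and q8_0 KV quantization halves it."""
    kv = kv_gb_base * ctx_mult * batch
    if kv_q8:
        kv /= 2
    return weights_gb + kv

# 70B q4_K_M: 42 GB weights, ~2.5 GB KV at 8K fp16 (illustrative baseline).
# Stretch context 4x (to 32K) and quantize the KV cache:
print(total_vram_gb(42, 2.5, ctx_mult=4, kv_q8=True))  # 47.0
```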
Context-length multiplier math
KV cache per token is approximately 2 × n_layers × n_kv_heads × head_dim × bytes_per_element (KV heads, not attention heads: GQA models like Llama 3.1 share each KV head across many query heads). For Llama 3.1 70B (80 layers, 8 KV heads, head_dim 128) at fp16 KV that's ~320 KB per token. Multiply by context length:
| Context | KV cache on 70B (fp16) | KV cache on 70B (q8_0) |
|---|---|---|
| 2K | 0.6 GB | 0.3 GB |
| 8K | 2.5 GB | 1.3 GB |
| 32K | 10 GB | 5 GB |
| 128K | 40 GB | 20 GB |
At 128K context on a 70B model, the fp16 KV cache is about as large as the q4_K_M weights themselves, which is why long-context LLM serving usually wants a 48+ GB card per worker plus KV quantization on top.
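The formula is easy to sanity-check in a few lines. A sketch using Llama 3.1 70B's published shape (80 layers, 8 KV heads, head_dim 128; values taken from the model config, assumed here):

```python
def kv_cache_bytes(n_ctx: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """2 (K and V) x layers x KV heads x head_dim x element size per token,
    times context length. Defaults match Llama 3.1 70B with fp16 KV."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_ctx

print(kv_cache_bytes(1) // 1024)                          # 320 KiB per token
print(kv_cache_bytes(8192) / 2**30)                       # 2.5 GiB at 8K
print(kv_cache_bytes(131072, bytes_per_elem=1) / 2**30)   # 20.0 GiB at 128K, q8_0
```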
Perf-per-dollar and perf-per-watt by tier
Because tok/s scales roughly with memory bandwidth, we can estimate perf-per-dollar for LLM inference once you pick a model size. For Llama 3.1 70B q4_K_M on a single-user chat workload:
| Card | VRAM | Price (new) | TDP | Measured tok/s | $/tok/s | W/tok/s |
|---|---|---|---|---|---|---|
| RTX 5090 | 32 GB | $1,999 | 575 W | ~34 | $59 | 16.9 |
| RTX 4090 | 24 GB | $1,599 | 450 W | ~27 | $59 | 16.7 |
| 2× RTX 3090 | 48 GB | ~$1,200 used | 700 W | ~20 | $60 | 35.0 |
| M3 Ultra 256GB | 256 GB | $5,599 | 120 W | ~18 | $311 | 6.7 |
Takeaway: the RTX 5090 and RTX 4090 are effectively tied on dollars per tok/s for interactive inference. The M3 Ultra costs ~5× as much per tok/s but adds a capability the discrete cards can't touch (running 400B-class models); the dual-3090 build wins only on sticker price, not on power efficiency.
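Both efficiency columns are plain ratios, so they're easy to recompute or extend to other cards (prices and tok/s below are copied from the table above):

```python
cards = {
    "RTX 5090":       dict(price=1999, tdp=575, tps=34),
    "RTX 4090":       dict(price=1599, tdp=450, tps=27),
    "2x RTX 3090":    dict(price=1200, tdp=700, tps=20),
    "M3 Ultra 256GB": dict(price=5599, tdp=120, tps=18),
}

for name, c in cards.items():
    usd_per_tps = c["price"] / c["tps"]  # dollars per tok/s of throughput
    w_per_tps = c["tdp"] / c["tps"]      # watts per tok/s (lower is better)
    print(f"{name}: ${usd_per_tps:.0f}/tok/s, {w_per_tps:.1f} W per tok/s")
```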
Frequently asked questions
What's the cheapest GPU that runs a 70B model well?
Dual RTX 3090 (~$1,200 used, 48 GB combined) is the unambiguous answer in 2026. Single-card, a used RTX A6000 (48 GB Ampere, ~$3,000-3,500 used) is the closest, but you pay ~3× for the form-factor convenience.
Can I use CPU RAM to run a model bigger than my VRAM?
Yes, with llama.cpp's -ngl N layer-offload. Expect 5-20× slowdown — CPU RAM has 1/20 the bandwidth of GPU VRAM. Fine for overnight batch inference; painful for interactive chat.
Does Apple Silicon unified memory count as VRAM?
Yes, effectively. Apple's unified memory is accessible to the GPU at the bandwidth listed in the chip spec (273-819 GB/s depending on tier). It doesn't behave exactly like discrete VRAM — there's no separate allocation, and macOS reserves a slice for the system by default — but for inference purposes, a 128 GB machine can dedicate the large majority of that memory to model weights.
What about AMD GPUs?
Same calculation — the AMD RX 7900 XTX (24 GB) lines up with a 4090. ROCm 6.x is mature enough on Linux that you can treat it like a CUDA card for inference. Windows ROCm support still lags as of mid-2026.
Should I buy more VRAM now or wait?
Buy now if you have a concrete workload. Consumer VRAM tiers have barely moved in two generations (4090 → 5090 went from 24→32 GB). The jump to 48+ GB consumer cards isn't on any roadmap we can see.
Sources
- llama.cpp GitHub — quantization README — authoritative on quant sizes and KV-cache math.
- llama.cpp GitHub Discussions #4167 — Apple Silicon community benchmarks.
- r/LocalLLaMA — community VRAM / tok/s reports across every card.
- Tom's Hardware GPU Hierarchy — raw GPU performance reference.
- Tom's Hardware — RTX 5090 review — 32 GB class details.
Related guides
- Best GPU for an AI rig
- Best GPU for Llama 3.1 70B
- Best Mac for running local LLMs
- RTX 5090 vs M4 Max for AI
— SpecPicks Editorial · Last verified 2026-04-21
