Which GPU for Which LLM? A Per-Model Hardware Cheat Sheet

Item: ZOTAC Gaming GeForce RTX 3060 Twin Edge OC 12GB GDDR6 192-bit 15 Gbps PCIE 4.0 Gaming Graphics Card, IceStorm 2.0 Cooling, Active Fan Control, Freeze Fan Stop ZT-A30600H-10M
Author: Mike Perry

A model-by-model VRAM, quantization, and tokens-per-second cheat sheet for building a local LLM rig in 2026.

By Mike Perry · Published 2026-06-13 · Last verified 2026-07-25 · 14 min read

Match GPU to LLM with real VRAM math, quantization trade-offs, and tokens/sec benchmarks across 7B, 13B, 34B, and 70B on a 3060 and up.

Match the GPU to the model, not the marketing tier. As of 2026, a ZOTAC RTX 3060 12GB runs 7B and 13B weights at q4 fully in VRAM at 25-45 tokens/sec. A 34B at q4 needs about 17 GB for weights alone, so a 24GB card is the floor for running it without offload; a 16GB card holds it only at q3. A 70B at q4 needs about 35 GB for weights, so even a 24GB card runs it with partial offload. Anything smaller than the model's actual VRAM footprint spills to system RAM and collapses to single-digit tokens/sec. Everything below expands that mapping into a per-model cheat sheet with real numbers.

Why model-specific hardware matching beats generic advice

Most "best GPU for AI" articles stop at "get more VRAM." That's not wrong, but it isn't actionable when you're staring at a 7B Mistral checkpoint on Hugging Face and trying to decide whether the $290 MSI RTX 3060 12GB on your shelf will run it. The answer isn't a card tier — it's a math problem tied to the specific model you're running, the quantization you're willing to accept, and the context window you actually need.

Local LLM performance is dominated by one hard constraint: does the model fit in VRAM? When it fits, you get GPU throughput (tens of tokens per second on a 3060, considerably more on a 4090). When it spills to system RAM, even by one gigabyte, the offloaded layers execute on CPU cores, and total throughput can drop by close to an order of magnitude no matter how fast the GPU is. The transition isn't gradual: a 12.1GB model on a 12GB card runs 5-8× slower than an 11.9GB model on the same card. That cliff is why per-model matching matters more than raw VRAM budgeting.

The other half of the story is the KV cache. Model weights are static; the KV cache grows with every token you feed the model. On Llama 3 8B, which uses grouped-query attention with 8 KV heads, an fp16 cache costs about 0.25 GB at 2K context and about 4 GB at 32K context, roughly the size of the q4 weights themselves. Models without grouped-query attention pay several times more per token. Long-context workloads (RAG over PDFs, extended chat, agent tool loops) can double the VRAM budget you'd guess from the model card. That's why a 13B model that "fits" a 12GB card at 4K context OOMs the same card at 16K context, a surprise that a generic "12GB is enough for 13B" recommendation completely misses.

This guide walks the math, then maps it to the four SKUs we stock most heavily for AI rigs: the ZOTAC RTX 3060 12GB, the MSI RTX 3060 12GB, the AMD Ryzen 7 5800X for offload-heavy or CPU-only paths, and the AMD Ryzen 5 5600G for a budget starter box with iGPU display so all the discrete VRAM stays free for the model.

Key takeaways

A 12GB RTX 3060 comfortably runs 7B and 13B models at q4, hits its ceiling around 20B at q4, and needs CPU offload for anything larger.
VRAM required = (params × bytes-per-param) + KV cache + ~500MB overhead. Do this math before buying, not after.
Quantization from fp16 to q4 cuts weights to 25% of size with typically <2% quality loss on standard benchmarks — the single highest-leverage optimization.
KV cache scales linearly with context length. Long-context work needs 30-100% more VRAM than the weight math alone.
A strong CPU (Ryzen 7 5800X class) rescues offloaded models; a weaker chip turns them into a slideshow.
Two mid-tier GPUs rarely beat one properly-sized GPU for single-user inference.

How do you calculate the VRAM a model needs?

The formula is straightforward once you separate weights from cache.

Weights. Multiply the parameter count by bytes-per-parameter for the quantization you plan to run. A rough table for Llama-family 7B, 13B, and 70B models:

Quantization	Bytes/param	7B weight size	13B weight size	70B weight size
fp16	2.0	14.0 GB	26.0 GB	140.0 GB
q8	1.0	7.0 GB	13.0 GB	70.0 GB
q6_K	0.68	4.8 GB	8.8 GB	47.6 GB
q5_K_M	0.60	4.2 GB	7.8 GB	42.0 GB
q4_K_M	0.50	3.5 GB	6.5 GB	35.0 GB
q3_K_M	0.42	2.9 GB	5.4 GB	29.4 GB
q2_K	0.34	2.4 GB	4.4 GB	23.8 GB
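
The table is parameter count times bytes per parameter, nothing more. A minimal Python sketch that reproduces it, using the nominal bytes-per-parameter values listed above; the dictionary and function names are illustrative, and real quantized files tend to run somewhat larger than these nominal figures:

```python
# Nominal bytes per parameter, copied from the table above.
BYTES_PER_PARAM = {
    "fp16": 2.0, "q8": 1.0, "q6_K": 0.68, "q5_K_M": 0.60,
    "q4_K_M": 0.50, "q3_K_M": 0.42, "q2_K": 0.34,
}

def weight_gb(params_billions: float, quant: str) -> float:
    """Approximate weight size in GB (billions of params x bytes per param)."""
    return params_billions * BYTES_PER_PARAM[quant]

# Print one row per model size, one column per quantization.
for size in (7, 13, 70):
    row = ", ".join(f"{q}={weight_gb(size, q):.1f}" for q in BYTES_PER_PARAM)
    print(f"{size}B: {row}")
```

Swapping in a different parameter count (for example 34) gives the weight budget for model sizes the table does not list.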

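KV cache. For a transformer with L layers, H key/value heads, D head dimension, and C context tokens, KV cache = 2 × L × H × D × C × bytes-per-KV-element. H counts KV heads, not query heads; the two differ on grouped-query-attention (GQA) models. For a model with 32 layers, 32 KV heads, and 128 head dim (Llama 2 7B, no GQA) at fp16 KV, that's about 0.5 MB per token, or roughly 2 GB at 4K context and 8 GB at 16K context. Llama 3 8B has the same layer count and head dim but only 8 KV heads, so it needs a quarter of that: about 0.125 MB per token, or 0.5 GB at 4K and 2 GB at 16K. Most inference stacks let you quantize the KV cache to q8 or q4 as well, cutting this in half or by 4×.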
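
The KV cache formula translates directly into a small function. This is a sketch with illustrative names; the head count passed in is the number of key/value heads, which on grouped-query-attention models is smaller than the number of query heads:

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_elem: float = 2.0) -> float:
    """2 (keys and values) x layers x KV heads x head dim x tokens x bytes."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem

# 32 layers, 32 KV heads, 128 head dim, fp16 cache: cost of a single token.
print(kv_cache_bytes(32, 32, 128, 1) / MIB, "MiB per token")

# The same shape at 4K context, then with only 8 KV heads (GQA).
print(kv_cache_bytes(32, 32, 128, 4096) / GIB, "GiB with 32 KV heads")
print(kv_cache_bytes(32, 8, 128, 4096) / GIB, "GiB with 8 KV heads")
```

Passing `bytes_per_elem=1.0` or `0.5` models a q8 or q4 cache in the same nominal way the text describes.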

Overhead. Reserve 300-500 MB for the CUDA context, activations, and small runtime buffers.

Worked example. Mistral 7B (32 layers, 8 KV heads, 128 head dim) at q4_K_M, 8K context, fp16 KV cache on a 3060:

Weights: 3.9 GB
KV cache: ~1 GB
Overhead: 0.5 GB
Total: 5.4 GB, which fits comfortably in 12 GB with headroom for a 4K response and a modest batch.

Same model at 32K context: weights unchanged (3.9 GB), KV cache grows to ~4 GB, overhead 0.5 GB, total 8.4 GB. That still fits, thanks to GQA. A model of the same depth without GQA (32 KV heads) would need ~16 GB of fp16 cache at 32K and land far over the line; there you either drop context, quantize the KV cache, or step up to a larger card.
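
The three-term sum can be wrapped in a helper that answers the only question that matters: does the total stay under the card's VRAM? A minimal sketch with illustrative function names, sweeping a few KV cache sizes against the 3.9 GB weights used in the worked example:

```python
def total_vram_gb(weights_gb: float, kv_cache_gb: float,
                  overhead_gb: float = 0.5) -> float:
    """Weights + KV cache + runtime overhead, all in GB."""
    return weights_gb + kv_cache_gb + overhead_gb

def fits(total_gb: float, card_gb: float = 12.0) -> bool:
    """True when the whole footprint stays inside the card's VRAM."""
    return total_gb <= card_gb

# Fixed weights, growing KV cache: watch where the 12 GB line is crossed.
for kv_gb in (1.0, 2.0, 4.0, 8.0):
    total = total_vram_gb(3.9, kv_gb)
    verdict = "fits" if fits(total) else "spills"
    print(f"KV {kv_gb:>4.1f} GB -> total {total:.1f} GB ({verdict} on 12 GB)")
```

Treating the card as a hard 12.0 GB limit is itself a simplification; leaving a little slack below the nominal capacity is the safer habit.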

What can a 12GB RTX 3060 actually run?

The 3060 12GB is the current sweet spot for local LLM starters. NVIDIA underclocked it and gave it more VRAM than the 3070, and it sells for about $290 street. That combination (12 GB GDDR6 on a 192-bit bus, 360 GB/s bandwidth, PCIe 4.0) has kept it in benchmarks two full generations after launch. Detailed spec sheet: TechPowerUp: RTX 3060 spec.

Measured throughput on a stock ZOTAC RTX 3060 12GB running a llama.cpp CUDA build, 4K context, fp16 KV, single request, warm cache:

Model	Params	Quant	VRAM used	Tokens/sec	Runs cleanly?
TinyLlama 1.1B	1.1B	q4_K_M	1.2 GB	118	Yes
Phi-3 Mini	3.8B	q4_K_M	2.6 GB	68	Yes
Mistral 7B	7B	q4_K_M	5.1 GB	44	Yes
Llama 3 8B	8B	q4_K_M	5.6 GB	41	Yes
Llama 3 8B	8B	q6_K	7.2 GB	34	Yes
Llama 3 8B	8B	q8_0	9.1 GB	27	Yes
CodeLlama 13B	13B	q4_K_M	8.3 GB	26	Yes
Llama 2 13B	13B	q5_K_M	9.7 GB	22	Yes
Mixtral 8x7B	47B (MoE)	q3_K_M	20.4 GB	6	Partial offload
Yi 34B	34B	q4_K_M	20.1 GB	4	Partial offload
Llama 3 70B	70B	q4_K_M	40.5 GB	1.8	Heavy offload

Everything through 13B at q4 lives entirely on the GPU. Above that, the 3060 becomes a coprocessor to system RAM, and the AMD Ryzen 7 5800X is what keeps offload throughput from cratering. Pair a 3060 with a weak quad-core and the Mixtral 8x7B run that hits 6 tok/s in the table above will drop to about 2 tok/s.
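
When a model does not fit, the practical question is how many layers to keep on the GPU (llama.cpp exposes this as `--n-gpu-layers`). A rough sketch for estimating that number, assuming every transformer layer is the same size and ignoring embedding and output tensors; the function name and the 1.25 GB KV reserve are illustrative assumptions, not measured values:

```python
import math

def gpu_layers(weights_gb: float, n_layers: int, card_gb: float,
               kv_reserve_gb: float, overhead_gb: float = 0.5) -> int:
    """Estimate how many equal-sized layers fit after reserving KV and overhead."""
    per_layer_gb = weights_gb / n_layers
    budget_gb = card_gb - kv_reserve_gb - overhead_gb
    if budget_gb <= 0:
        return 0
    # Never report more layers than the model has.
    return min(n_layers, math.floor(budget_gb / per_layer_gb))

# 70B at q4 (35 GB nominal weights, 80 layers) on a 12 GB card.
n = gpu_layers(35.0, 80, 12.0, kv_reserve_gb=1.25)
print(f"{n} of 80 layers on the GPU, the rest on the CPU")
```

The fraction of layers left on the CPU is a quick indicator of how much the CPU and system RAM will dominate throughput.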

Quantization matrix: same model across q2 through fp16

Quantization is the highest-leverage optimization you have. It shrinks the weights, the KV cache (if you quantize that too), and inference memory traffic. The classic trade-off table for a Llama-family 13B model, with quality measured as perplexity delta from the fp16 baseline (lower is better):

Quant	Bytes/param	13B weight size	VRAM at 4K ctx	Perplexity delta	Notes
fp16	2.0	26.0 GB	~29 GB	0.00 (baseline)	Reference; needs 32 GB card
q8_0	1.0	13.0 GB	~15 GB	+0.01	Near-lossless; 16 GB card
q6_K	0.68	8.8 GB	~11 GB	+0.05	Excellent on 12 GB card
q5_K_M	0.60	7.8 GB	~10 GB	+0.10	Sweet spot for quality/size
q4_K_M	0.50	6.5 GB	~8.5 GB	+0.20	Default recommendation
q3_K_M	0.42	5.4 GB	~7.5 GB	+0.60	Noticeable quality drop
q2_K	0.34	4.4 GB	~6.5 GB	+2.00	Emergency only; watch for hallucination

The llama.cpp quantization scheme is the reference implementation used across most local runners (Ollama, LM Studio, KoboldCpp); the format spec and quality-vs-size trade-offs are documented at llama.cpp: quantize README. The K-quants (q4_K_M, q5_K_M, q6_K) apply per-block scaling to preserve outlier weights and consistently beat the legacy q4_0 / q4_1 formats at equal size. If you're on a 12GB card, q5_K_M for a 7B and q4_K_M for a 13B are the default picks; you won't feel the quality delta versus fp16 in typical chat or coding tasks.
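
Picking a quantization from the matrix is a simple search: walk from highest quality to lowest and take the first one that fits. A minimal sketch using the 13B "VRAM at 4K ctx" figures from the table above; the function name and the optional `headroom_gb` parameter are illustrative assumptions:

```python
# (quant, approx VRAM in GB at 4K context) for a 13B, best quality first,
# copied from the quantization matrix above.
QUANTS_13B = [
    ("fp16", 29.0), ("q8_0", 15.0), ("q6_K", 11.0), ("q5_K_M", 10.0),
    ("q4_K_M", 8.5), ("q3_K_M", 7.5), ("q2_K", 6.5),
]

def pick_quant(card_gb: float, headroom_gb: float = 0.0):
    """Return the highest-quality quant whose footprint plus headroom fits."""
    for quant, vram_gb in QUANTS_13B:
        if vram_gb + headroom_gb <= card_gb:
            return quant
    return None  # nothing fits, even at q2_K

for card in (8, 12, 16, 24, 32):
    print(f"{card} GB card -> {pick_quant(card)}")
```

Raising `headroom_gb` models a longer context or a desktop session sharing the card, and pushes the pick toward the smaller quants.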

When does a Ryzen 5800X CPU-only path make sense versus a 3060?

CPU-only inference used to be uninteresting; it's now a legitimate option for small models. An AMD Ryzen 7 5800X with eight Zen 3 cores and 32MB L3, paired with dual-channel DDR4-3600, delivers roughly 40 GB/s of memory bandwidth and posts the following numbers against the 3060:

Model	Quant	CPU tok/s (5800X)	3060 tok/s	GPU speedup
Phi-3 Mini 3.8B	q4_K_M	12	68	5.7×
Mistral 7B	q4_K_M	6.8	44	6.5×
Llama 3 8B	q4_K_M	5.4	41	7.6×
CodeLlama 13B	q4_K_M	2.9	26	9.0×
Yi 34B	q4_K_M	1.1	4.0 (offload)	3.6×

For interactive chat, anything under 8 tok/s starts to feel slow — you'll wait for paragraphs. The 5800X is fine for Phi-3 sized models and background summarization jobs where latency doesn't matter, but a 3060 is worth the $290 the moment you use a 7B model daily. The Ryzen platform details and per-generation IPC gains are documented on AMD — Ryzen desktop lineup; the 5000-series remains the price-performance leader on the AM4 socket into 2026.
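
The gap between the two columns tracks memory bandwidth. A common rule of thumb, used here as an assumption rather than something the table measures, is that single-stream decoding reads every weight once per generated token, so tokens/sec cannot exceed bandwidth divided by weight size. A sketch comparing that ceiling with the measured Mistral 7B figures above (the function name is illustrative):

```python
def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/sec if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb

WEIGHTS_GB = 3.5  # 7B at q4_K_M, nominal size from the weights table

# (device, memory bandwidth in GB/s, measured tok/s from the table above)
rows = [
    ("Ryzen 7 5800X", 40.0, 6.8),
    ("RTX 3060 12GB", 360.0, 44.0),
]

for name, bandwidth, measured in rows:
    ceiling = decode_ceiling_tps(bandwidth, WEIGHTS_GB)
    print(f"{name}: ceiling {ceiling:.1f} tok/s, measured {measured} tok/s "
          f"({measured / ceiling:.0%} of ceiling)")
```

Real throughput always lands below the ceiling, but the ratio between devices is a useful sanity check before trusting any benchmark number.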

The AMD Ryzen 5 5600G is the interesting budget play: six cores, integrated Radeon graphics that drive your display so the entire 12GB of a paired 3060 stays available for the model, and street pricing around $130. If you're building a first LLM rig and can't stretch to a 5800X, the 5600G plus a 3060 is the cheapest path to real 7B/13B throughput.

How does context length blow up the KV cache budget?

The KV cache is the sleeper VRAM cost. Every token you push through the model adds a keys-and-values entry to the cache, and that cache stays resident until the request completes. For Llama 3 8B (32 layers, 8 KV heads after GQA, 128 head dim, fp16 KV), with q4_K_M weights plus overhead at about 5.6 GB:

Context length	KV cache size	Weight + KV total (q4_K_M)	Fits 12 GB?
2,048 tokens	0.25 GB	5.9 GB	Yes
4,096 tokens	0.5 GB	6.1 GB	Yes
8,192 tokens	1.0 GB	6.6 GB	Yes
16,384 tokens	2.0 GB	7.6 GB	Yes
32,768 tokens	4.0 GB	9.6 GB	Yes

GQA is what keeps this table comfortable. A model with the same depth but 32 KV heads pays four times as much per token: 8 GB at 16K and 16 GB at 32K, which spills a 12GB card at 16K.

Three levers push the KV budget down: (1) quantize the KV cache; most runners support q8 KV (halves cost) and q4 KV (quarters cost) with barely-measurable quality impact for chat; (2) use grouped-query-attention models (Llama 3, Mistral), which cut the KV heads to a fraction of query heads; (3) turn on sliding-window attention when the runner supports it (Mistral) to cap the effective KV size. On a 12GB card running Llama 3 8B, a q4 KV cache shrinks the 32K-context cache from 4 GB to 1 GB, for a total of about 6.6 GB.
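
The first two levers are easy to quantify. A minimal sketch for a 32-layer, 128-head-dim model at 16K context, comparing 32 KV heads (no GQA) with 8 KV heads (GQA) under fp16, q8, and q4 caches. The bytes-per-element values are nominal (real quantized cache formats carry a small scaling overhead), and the names are illustrative:

```python
GIB = 1024 ** 3

# Nominal bytes per cached element for each KV cache format.
KV_BYTES = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def kv_gib(layers: int, kv_heads: int, head_dim: int,
           context: int, kv_format: str) -> float:
    """KV cache size in GiB for one sequence of `context` tokens."""
    elems = 2 * layers * kv_heads * head_dim * context  # keys + values
    return elems * KV_BYTES[kv_format] / GIB

CONTEXT = 16_384
for kv_heads in (32, 8):
    sizes = ", ".join(
        f"{fmt}={kv_gib(32, kv_heads, 128, CONTEXT, fmt):.2f} GiB"
        for fmt in KV_BYTES
    )
    print(f"{kv_heads} KV heads at 16K: {sizes}")
```

The two levers multiply: GQA with 8 KV heads plus a q4 cache needs one sixteenth of what 32 KV heads at fp16 would.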

For a coding assistant that ingests a 50KB source file (roughly 12K tokens), plan the KV cache before you plan the model. A 13B at q4 with fp16 KV won't survive that ingestion on a 3060; the same 13B at q4 with q4 KV will.
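
That claim can be checked end to end. The sketch below assumes a Llama 2 13B-style shape (40 layers, 40 KV heads, 128 head dim, no GQA), 6.5 GB of q4 weights, 0.5 GB overhead, and roughly 4 bytes of source per token, which is what "50KB is roughly 12K tokens" implies. All of these are illustrative assumptions, as are the function names:

```python
def tokens_from_bytes(n_bytes: int, bytes_per_token: float = 4.0) -> int:
    """Very rough token estimate for source code (assumed ~4 bytes/token)."""
    return int(n_bytes / bytes_per_token)

def kv_gb(layers: int, kv_heads: int, head_dim: int,
          tokens: int, bytes_per_elem: float) -> float:
    """KV cache in GB (1e9 bytes) for keys and values across all layers."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 1e9

WEIGHTS_GB, OVERHEAD_GB, CARD_GB = 6.5, 0.5, 12.0
tokens = tokens_from_bytes(50 * 1024)

for label, bytes_per_elem in (("fp16 KV", 2.0), ("q4 KV", 0.5)):
    total = WEIGHTS_GB + kv_gb(40, 40, 128, tokens, bytes_per_elem) + OVERHEAD_GB
    verdict = "fits" if total <= CARD_GB else "spills"
    print(f"{label}: {tokens} tokens -> {total:.1f} GB total ({verdict})")
```

Running the same check with your own file size and model shape before loading anything is cheaper than discovering the overflow mid-ingestion.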

Perf-per-dollar tiers: which card per model class

Street pricing as of Q2 2026, tokens/sec on 4K context, single request, llama.cpp CUDA build:

Card	VRAM	Street $	Max in-VRAM model (q4 unless noted)	Best tok/s	$/tok/s
RTX 3050 8GB	8 GB	$210	Mistral 7B	24	$8.75
RTX 3060 12GB	12 GB	$290	Llama 2 13B	26	$11.15
RTX 4060 Ti 16GB	16 GB	$470	Yi 34B (q3, tight)	48	$9.79
RTX 4070 12GB	12 GB	$560	Llama 2 13B	62	$9.03
RTX 4070 Ti Super 16GB	16 GB	$780	Yi 34B (q3, tight)	88	$8.86
RTX 3090 24GB (used)	24 GB	$700-900	Yi 34B	42	$19.05
RTX 4090 24GB	24 GB	$1,650	Yi 34B	76	$21.71

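The RTX 3060 12GB is the cheapest card here that holds a 13B at q4 entirely in VRAM, which is why it anchors 13B-class builds even though its $/tok/s is not the lowest in the table. The used RTX 3090 is the cheapest route to 24 GB: that is enough for a 34B at q4 fully in VRAM and keeps the offloaded share of a 70B as small as a single consumer card allows; no new NVIDIA card at 24GB touches it under $1,500. The 4060 Ti 16GB is the "I want to grow past 13B" choice at $470: 20B-class models at q4, or a 34B at q3. The 4070 Ti Super 16GB is the strongest single-card option if 12-13B throughput matters more than raw model size.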
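
The $/tok/s column is street price divided by best tokens/sec. A short sketch that recomputes it from the table's own figures; the used RTX 3090 is entered at $800, the midpoint of its $700-900 range, which is the value that reproduces the table's $19.05:

```python
# (card, street price in USD, best tok/s) taken from the table above.
CARDS = [
    ("RTX 3050 8GB", 210, 24),
    ("RTX 3060 12GB", 290, 26),
    ("RTX 4060 Ti 16GB", 470, 48),
    ("RTX 4070 12GB", 560, 62),
    ("RTX 4070 Ti Super 16GB", 780, 88),
    ("RTX 3090 24GB (used)", 800, 42),  # midpoint of $700-900
    ("RTX 4090 24GB", 1650, 76),
]

def dollars_per_tps(price_usd: float, tok_per_sec: float) -> float:
    """Cost of each token/sec of throughput; lower is better."""
    return price_usd / tok_per_sec

# Cheapest throughput first.
for name, price, tps in sorted(CARDS, key=lambda c: dollars_per_tps(c[1], c[2])):
    print(f"{name:<24} ${dollars_per_tps(price, tps):.2f} per tok/s")
```

This metric says nothing about which models fit, so read it alongside the VRAM column rather than on its own.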

What to buy: spec delta and verdict matrix

The four SKUs we cover most in AI-rig builds, with the actual buy criteria:

SKU	VRAM	Price band	Best for
ZOTAC RTX 3060 12GB	12 GB	$280-310	First LLM rig; 7B/13B at q4 or q5
MSI RTX 3060 12GB	12 GB	$285-320	Same as above; slightly quieter dual-fan
AMD Ryzen 7 5800X	N/A	$180-220	CPU-offload rescue; strong prompt eval; small-model CPU inference
AMD Ryzen 5 5600G	N/A	$120-140	Budget starter with iGPU display, freeing 3060 VRAM

Verdict matrix.

Buy the 3060 if your target is 7B-13B models at q4 or q5 with 4K-16K context (quantize the KV cache at the upper end of that range for a 13B). It is the correct answer 80% of the time.
Step up to a 16GB card (4060 Ti or 4070 Ti Super) if you're routinely running 20B-class models or a 34B at q3, or need 32K+ context on a 13B.
Step up to a 24GB card (used 3090 or 4090) if you're running a 34B at q4 fully in VRAM, or 70B models locally as a daily driver. A 70B at q4 needs about 35 GB for weights alone, so even 24 GB means partial offload, and anything smaller becomes a compromise you'll re-buy your way out of within a year.
Pair whichever GPU you pick with the 5800X if offload will happen at all. Pair with the 5600G if you're building the cheapest possible first rig and only ever running fully-on-GPU 7B-13B work.

Bottom line

Match the GPU to the specific model tier and context length you'll use, not to a marketing tier. A ZOTAC RTX 3060 12GB or MSI RTX 3060 12GB at $290 is the strongest 2026 entry into local LLMs for anyone whose target lives in the 7B-13B class, and that covers Mistral 7B, Llama 3 8B, CodeLlama 13B, and their fine-tunes. Add an AMD Ryzen 7 5800X if you'll ever push into offload territory, or an AMD Ryzen 5 5600G if the budget is tight and the display needs to come from the CPU so the GPU's 12GB stays untouched. Do the VRAM math for your specific model and context length before you order. That single habit (params × bytes per param + KV cache per token × context length + overhead) is worth more than any generic tier recommendation.

FAQ

How do I calculate how much VRAM a model needs?

Start with the parameter count times bytes-per-parameter for your quantization — roughly 0.5 byte at q4, 1 byte at q8, 2 bytes at fp16. Then add the KV cache, which grows with context length and model layers and can rival the weights at long context. Sum the two and add a small runtime overhead. That total must fit in VRAM to avoid offload, which is the single biggest throughput killer.

What is the largest model a 12GB RTX 3060 can run well?

At q4 a 3060 comfortably runs models up to roughly 13B parameters fully in VRAM with a modest context window, and can stretch to larger models with CPU offload at a throughput penalty. The practical ceiling for a smooth all-on-GPU experience is the 7B-13B class. Beyond that you are trading speed for size, and a card with more VRAM becomes the better tool.

Does the CPU matter for local LLM inference if I have a GPU?

When the model fits entirely in VRAM the CPU mostly handles orchestration and matters little. The CPU becomes critical the moment layers offload to system RAM, because those layers run on CPU cores — a Ryzen 7 5800X with eight strong cores meaningfully outpaces weaker chips on offloaded layers. A capable CPU also speeds prompt tokenization and any pure-CPU fallback for small models.

Why does context length affect my hardware choice so much?

The KV cache scales linearly with context length, so doubling your context roughly doubles that memory cost. A model that fits at 4K context can overflow VRAM at 32K context even though the weights never changed. If your workload involves long documents or extended chat history, budget VRAM for the KV cache explicitly rather than sizing only for the model weights.

Should I buy two cheaper GPUs or one bigger GPU?

For single-user inference one larger card is usually simpler and faster because splitting a model across GPUs adds communication overhead and runtime complexity. Multi-GPU shines when you need aggregate VRAM that no single affordable card offers, or when serving multiple users in parallel. For most home users matching one appropriately sized GPU to their target model class is the cleaner, cheaper path.
