Which GPU Runs Which LLM? A 2026 VRAM-to-Model Map

VRAM is the gating spec — here is the honest tier ladder for local inference in 2026.

By Mike Perry · Published 2026-06-12 · Last verified 2026-07-27 · 10 min read

A 2026 VRAM-to-model map for local LLM hardware. Which GPUs fit 7B, 13B, 32B, and 70B models — and at what tokens per second.

If you want to run Llama 3.1 70B fully on a single GPU in 2026, you need roughly 40 GB of VRAM for the weights at q4_K_M, which means a 48 GB workstation card such as the RTX A6000 or an 80 GB enterprise card such as the H100. A 12 GB consumer card like the MSI GeForce RTX 3060 Ventus 2X 12G is sized for the 7B–13B class. Pick by the model you want to load, then check tok/s, not the other way around.

Why VRAM, not TFLOPs, is the gating spec for local inference

For local language models, the question that decides whether a card is usable at all is "do the weights fit in VRAM?" Everything else — raw compute, memory bandwidth, PCIe lane count — is a secondary tuning knob.

The reason is simple: when a model does not fit in GPU memory, inference frameworks like llama.cpp offload layers to system RAM and shuttle activations across the PCIe bus on every token. PCIe 4.0 x16 tops out near 32 GB/s; GDDR6 on an RTX 3060 runs at 360 GB/s; HBM3 on an H100 SXM runs above 3 TB/s. A model that has to bounce between system RAM and the GPU has effectively traded its VRAM bandwidth for the slowest link in that chain, and tokens per second collapse accordingly.
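As a rough illustration of why the slowest link dominates, here is a back-of-the-envelope sketch. It assumes generation is purely bandwidth-bound, so every token requires streaming all the weights once. The function name is hypothetical, the 7.5 GB figure is the 13B q4_K_M estimate from the VRAM table later in this article, and the result is an upper bound that ignores KV-cache reads, compute, and framework overhead.

```python
def decode_ceiling_tok_s(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if every token streams all weights once."""
    if weights_gb <= 0 or bandwidth_gb_s <= 0:
        raise ValueError("sizes and bandwidths must be positive")
    return bandwidth_gb_s / weights_gb


WEIGHTS_13B_Q4_GB = 7.5  # assumption: 13B at q4_K_M, per the article's table
LINKS_GB_S = {"RTX 3060 GDDR6": 360.0, "PCIe 4.0 x16": 32.0}

for name, bandwidth in LINKS_GB_S.items():
    ceiling = decode_ceiling_tok_s(WEIGHTS_13B_Q4_GB, bandwidth)
    print(f"{name}: at most {ceiling:.1f} tok/s")
```

Under these assumptions the ceiling is about 48 tok/s when the weights stream from VRAM and about 4 tok/s when they have to cross PCIe, which is in line with the 30–45 tok/s and single-digit figures quoted for the RTX 3060 in this article.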

So as of 2026, the practical map looks like this: 8 GB cards run 7B models at low context lengths, 12 GB cards run 7B comfortably and 13B usably, 16 GB cards reach 14B with room for context, 24 GB cards handle 32B-class models well, and you need 48 GB or more to keep a 70B model entirely on-GPU at a useful quantization. Anything outside those tiers either does not load or runs so slowly that you stop using it. This article walks each tier and names the models you can realistically expect to run.

Key takeaways

VRAM capacity decides what fits; memory bandwidth decides how fast it generates once loaded — in that order.
A 12 GB RTX 3060 is the practical floor for serious local LLM work: 7B at q8_0 or 13B at q4_K_M with headroom for context.
24 GB consumer cards (4090-class) host 32B models comfortably and 70B with offload; 48 GB workstation cards run 70B fully on-GPU.
Quantization is the dominant lever: going from fp16 to q4_K_M cuts VRAM use to roughly 30% of the original with a small quality hit.
Context length quietly eats VRAM: the KV cache grows linearly with context, so the ~0.8 GB a 13B model needs at 4k becomes more than 6 GB at 32k.
Used 12 GB cards remain the perf-per-dollar pick for single-user local chat in 2026.

How much VRAM does each model class actually need? (7B / 13B / 32B / 70B at q4_K_M)

These are working approximations at q4_K_M, the community's default sweet-spot quantization in the GGUF format, with a modest 4k context. Real numbers vary by about ±10% depending on the tokenizer (vocabulary size), the framework, and how much KV cache you allocate.

Model size	Weights at q4_K_M	+ 4k KV cache	Total VRAM needed	Smallest GPU that fits
7B	~4.3 GB	~0.5 GB	~5 GB	8 GB
13B	~7.5 GB	~0.8 GB	~8.5 GB	12 GB
32B	~19 GB	~1.5 GB	~21 GB	24 GB
70B	~40 GB	~3 GB	~43 GB	48 GB
405B	~225 GB	~6 GB	~230 GB	multi-GPU or CPU-offload only

Bump the quantization down to q3_K_S to gain roughly 25% VRAM headroom at a noticeable quality cost; bump up to q5_K_M or q6_K for better fidelity if the model still fits.
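The weight figures in the table follow from simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below is an illustration rather than a measurement. The function names are hypothetical, and it works backwards from the table to show that those figures imply roughly 4.6 to 4.9 effective bits per weight at q4_K_M.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8


def implied_bits_per_weight(params_billion: float, size_gb: float) -> float:
    """Invert the estimate: effective bits per weight for a given size."""
    return size_gb * 8 / params_billion


# (parameters in billions, q4_K_M weights in GB), copied from the table above
TABLE = [(7, 4.3), (13, 7.5), (32, 19.0), (70, 40.0)]

for params, size in TABLE:
    bpw = implied_bits_per_weight(params, size)
    print(f"{params}B: {size} GB -> ~{bpw:.2f} bits per weight")

# fp16 stores 16 bits per weight, so 7B needs about 14 GB for weights alone
print(f"7B at fp16: {weights_gb(7, 16):.1f} GB")
```

The same function gives a quick sanity check for any other size or quantization before you download a multi-gigabyte file.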

What can a 12 GB RTX 3060 run, and at what tok/s?

The RTX 3060 12 GB sits in the sweet spot for hobbyist local LLM work because it is among the cheapest NVIDIA cards with enough VRAM to host a 13B model entirely on-GPU. Community benchmarks on Hugging Face and the llama.cpp discussion threads consistently show:

7B at q4_K_M — 60–80 tok/s, interactive-feeling for any chat or code task.
13B at q4_K_M — 30–45 tok/s, still responsive for single-user use.
14B at q4_K_S — 20–30 tok/s, the upper edge of what fits with room for context.
20B+ — only with CPU offload; tok/s drops into single digits.

If you pair the card with an 8-core CPU like the AMD Ryzen 7 5800X and 32 GB of system RAM, prefill on long prompts stays smooth and any offloaded layers do not bottleneck the GPU on warm-up. Our dedicated guide, "Is the RTX 3060 12GB Still Worth It in 2026?", walks through the broader value case.

GPU tier spec table

A snapshot of the cards that matter for local LLM work in 2026, ordered by VRAM:

GPU	VRAM	Mem bandwidth	Approx MSRP / used	Largest comfortable model
RTX 3060 12 GB	12 GB GDDR6	360 GB/s	$260–320 used	13B q4_K_M
RTX 4060 Ti 16 GB	16 GB GDDR6	288 GB/s	$450 new	14B q4_K_M with room
RTX 3090	24 GB GDDR6X	936 GB/s	$700–900 used	32B q4_K_M
RTX 4090	24 GB GDDR6X	1008 GB/s	$1600+ used	32B q4_K_M, 70B with offload
RTX A6000	48 GB GDDR6 ECC	768 GB/s	$4000+ used	70B fully on-GPU
H100 80 GB	80 GB HBM3	3 TB/s	datacenter	70B at higher quant + long context

For the single-user enthusiast, the 12 GB and 24 GB tiers do the heavy lifting; the 16 GB tier in between sits at an awkward price point, and the workstation cards above are rarely worth the markup at home.

Quantization matrix: what q-level should you use?

Picking the right quantization is where most local-LLM beginners lose performance. Here is the practical matrix for an 8B model on the RTX 3060 12 GB and a 32B model on a 24 GB card.

Quant	VRAM for 8B	Tok/s on 12 GB	Quality loss	VRAM for 32B	Notes
q2_K	3.2 GB	90+	noticeable	11 GB	last resort; coherence drops
q3_K_M	4.0 GB	80	small but visible	14 GB	aggressive but usable
q4_K_S	4.5 GB	75	very small	17 GB	the speed pick
q4_K_M	4.9 GB	70	minimal	19 GB	the default sweet spot
q5_K_M	5.7 GB	60	imperceptible for chat	22 GB	the quality pick
q6_K	6.6 GB	55	none in blind tests	26 GB	32B doesn't fit a 24 GB card
q8_0	8.5 GB	45	none	34 GB	tight on 12 GB for 8B; reference quality
fp16	16 GB	n/a	none	64 GB	8B doesn't fit 12 GB; 32B doesn't fit any consumer card

For most users, q4_K_M is the answer. Drop to q4_K_S if the model is a marginal fit or you want a little more speed; bump up to q5_K_M if it fits with room to spare and you want maximum quality.
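One way to turn the matrix into a decision is to pick the highest-quality quantization that still leaves room for context. The sketch below does that with the VRAM figures copied from the matrix. The function name is hypothetical, and the default 3 GB headroom is an assumption borrowed from the sizing advice elsewhere in this article.

```python
# (quant, VRAM in GB for 8B, VRAM in GB for 32B), lowest to highest quality;
# values copied from the matrix above
QUANTS = [
    ("q2_K", 3.2, 11.0),
    ("q3_K_M", 4.0, 14.0),
    ("q4_K_S", 4.5, 17.0),
    ("q4_K_M", 4.9, 19.0),
    ("q5_K_M", 5.7, 22.0),
    ("q6_K", 6.6, 26.0),
    ("q8_0", 8.5, 34.0),
    ("fp16", 16.0, 64.0),
]


def best_quant(card_gb, model="8B", headroom_gb=3.0):
    """Highest-quality quant whose weights fit in card_gb minus headroom."""
    column = {"8B": 1, "32B": 2}[model]
    budget = card_gb - headroom_gb
    fitting = [row[0] for row in QUANTS if row[column] <= budget]
    return fitting[-1] if fitting else None


print(best_quant(12, "8B"))   # 8B on a 12 GB card
print(best_quant(24, "32B"))  # 32B on a 24 GB card
print(best_quant(12, "32B"))  # 32B on a 12 GB card: nothing fits
```

This picks for quality alone. If speed matters more, the tok/s column argues for staying at q4_K_M or q4_K_S even when a larger quantization fits.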

Prefill vs generation: why context length quietly eats VRAM

Two phases of inference behave very differently. Prefill (reading your prompt) is compute-bound and scales with the number of input tokens. Generation (emitting one token at a time) is memory-bandwidth-bound and scales with model size, not prompt length. Both need VRAM for the KV cache, which grows with context length × number of layers × key/value width per layer × 2 (one key and one value) × bytes per element.

In practice, a 13B model with a 4k context holds about 0.8 GB of KV cache; bump that to 32k and the cache balloons past 6 GB, which is enough to push a 13B q4 model out of a 12 GB card. If you plan to feed large documents into the model, allocate VRAM accordingly or use a framework that offloads the KV cache to system RAM (vLLM and llama.cpp both support this, with tok/s consequences).
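The cache arithmetic can be sketched in a few lines. The layer count, KV-head count, and head dimension below are hypothetical values for a 13B–14B-class model with grouped-query attention and an fp16 cache; they are assumptions chosen for illustration, not figures from this article, so substitute the values from your model's config.

```python
def kv_cache_gb(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in GB: one key and one value vector per layer per token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_tokens / 1e9


# Hypothetical 13B-14B-class configuration (assumption, not from the article)
CFG = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for ctx in (4096, 32768):
    print(f"{ctx} tokens: {kv_cache_gb(ctx, **CFG):.2f} GB")
```

With these assumed parameters the cache comes to about 0.8 GB at 4k and about 6.4 GB at 32k, matching the growth described above. Models without grouped-query attention, or with a wider cache, will need more.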

When do you need two GPUs or a CPU-offload fallback for 70B?

For a 70B model at any usable quality, the honest answer is: get a 48 GB card, or pair two 24 GB cards via tensor parallelism in vLLM or pipeline parallelism in llama.cpp. Two RTX 3090s give 48 GB combined and can run 70B at q4_K_M with a small PCIe-transfer tax (sub-linear scaling, roughly 1.6× a single card's throughput).

CPU offload is the fallback when you cannot afford a second GPU. With a fast desktop CPU and 64 GB of system RAM, llama.cpp will run a 70B q4 model at 1–4 tok/s — slow enough that you start avoiding it, but fast enough to validate that your prompt template works before you provision real hardware.
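To estimate how much of a 70B model a given card can hold before the rest spills to the CPU, a minimal sketch follows. It assumes 80 equally sized transformer layers and the ~40 GB q4_K_M weight figure from this article, ignores the embedding and output tensors, and reserves a hypothetical 3 GB for KV cache and buffers. The result is the kind of number you would try first with llama.cpp's `--n-gpu-layers` option, then adjust by trial.

```python
import math


def gpu_layers_that_fit(weights_gb, n_layers, card_gb, reserve_gb=3.0):
    """How many equally sized layers fit on the card after reserving VRAM."""
    per_layer_gb = weights_gb / n_layers
    usable_gb = max(card_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Assumptions: 70B at q4_K_M is ~40 GB spread over 80 layers
for card_gb in (12, 24, 48):
    layers = gpu_layers_that_fit(40.0, 80, card_gb)
    print(f"{card_gb} GB card: about {layers} of 80 layers on-GPU")
```

On a 12 GB card fewer than a quarter of the layers stay on the GPU under these assumptions, which is why throughput falls to the 1–4 tok/s range described above.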

Perf-per-dollar and perf-per-watt: is a used 12 GB card still the value pick?

For single-user chat, code help, and summarization with 7B–13B models, the RTX 3060 12 GB at $260–320 used is hard to beat in 2026. A used 3090 doubles your VRAM but roughly triples the price and doubles the board power (350 W versus 170 W), which only pays off if you actually need 32B-class models day to day.

For the perf-per-watt crowd, the 3060 also wins on idle: it draws under 15 W when not generating, which matters if the box runs 24/7 as part of a homelab. A 4090 draws over 30 W at idle and climbs to several hundred watts during inference.
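For a box that runs around the clock, the idle figures translate into annual energy as sketched below. The 15 W and 30 W inputs are the idle numbers quoted above; the electricity price is a hypothetical placeholder, so plug in your own tariff.

```python
HOURS_PER_YEAR = 24 * 365


def annual_idle_kwh(idle_watts):
    """Energy used in a year by a card idling continuously."""
    return idle_watts * HOURS_PER_YEAR / 1000


def annual_idle_cost(idle_watts, price_per_kwh):
    """Yearly idle cost in whatever currency price_per_kwh is expressed in."""
    return annual_idle_kwh(idle_watts) * price_per_kwh


PRICE_PER_KWH = 0.15  # hypothetical tariff, not from the article

for name, watts in (("RTX 3060", 15), ("RTX 4090", 30)):
    kwh = annual_idle_kwh(watts)
    cost = annual_idle_cost(watts, PRICE_PER_KWH)
    print(f"{name}: {kwh:.1f} kWh/year idle, about {cost:.2f} at the assumed price")
```

The gap is modest in absolute terms; load power during long generation sessions will usually matter more than idle draw.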

Bottom line: the smallest card that fits your target model

Just trying things out, 7B is fine — any 8 GB current card.
Coding help and chat, 13B is the target — 12 GB RTX 3060.
Reasoning and long-context, 32B class — 24 GB RTX 3090 or 4090.
70B fully local at usable speed — 48 GB workstation card or dual 24 GB consumer cards.
Frontier-scale models — rent cloud time, or wait for the next consumer VRAM bump.

Pick the model first, find the smallest card that fits it at q4_K_M with room for context, and only then optimize for bandwidth and price.
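That selection rule can be sketched as a small lookup. The weight and 4k KV-cache figures are copied from the VRAM table earlier in this article, the card tiers are the capacities it discusses, and the function name and the `extra_context_gb` parameter are hypothetical.

```python
CARD_TIERS_GB = [8, 12, 16, 24, 48, 80]

# model class -> (q4_K_M weights in GB, 4k KV cache in GB), from the VRAM table
MODELS = {
    "7B": (4.3, 0.5),
    "13B": (7.5, 0.8),
    "32B": (19.0, 1.5),
    "70B": (40.0, 3.0),
    "405B": (225.0, 6.0),
}


def smallest_card_gb(model, extra_context_gb=0.0):
    """Smallest single-card tier that holds weights plus cache, else None."""
    weights_gb, kv_gb = MODELS[model]
    need_gb = weights_gb + kv_gb + extra_context_gb
    for tier in CARD_TIERS_GB:
        if tier >= need_gb:
            return tier
    return None  # no single card in the list is large enough


for model in MODELS:
    print(model, "->", smallest_card_gb(model))

# A 13B model with about 5.6 GB of extra cache for a 32k context
print("13B at 32k ->", smallest_card_gb("13B", extra_context_gb=5.6))
```

With the table's figures this reproduces the tiers above, and it shows how a long context pushes a 13B model from the 12 GB tier to the 16 GB tier.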

Common pitfalls when sizing a GPU for local LLMs

Three mistakes show up over and over in community threads.

The first is buying for parameter count, not for fit. "I want to run a 70B model" is a great target — but if your budget puts a 12 GB card in your hands, you will not run a 70B model well, and chasing it via offload means you will use the model less. Pick the model size that fits in your card at q4_K_M with ~3 GB of headroom for KV cache, and you'll actually use what you bought.

The second is ignoring memory bandwidth on otherwise-similar cards. A 16 GB card sounds strictly better than a 12 GB card until you notice that some 16 GB SKUs use a narrower memory bus and end up with less bandwidth than the 12 GB card they were supposed to replace. The RTX 4060 Ti 16 GB is the classic example: it sits between the 3060 12 GB and the 3090 24 GB on capacity, but its 288 GB/s trails the 3060's 360 GB/s and is less than a third of the 3090's 936 GB/s. Read the spec sheet, not the headline GB number.

The third is not budgeting for the rest of the system. A 4090 in a box with 16 GB of DDR4-2400 and an older 6-core CPU will idle fine but bottleneck on prefill for long prompts, because tokenization and any offloaded layers want CPU and RAM bandwidth. Match the host to the card: an 8-core CPU and 32 GB of DDR4-3200 (or DDR5) is the practical floor for a 12 GB card; 64 GB and a current 8–12-core chip is the right pairing for 24 GB and above.

Real-world numbers: what your stack will actually look like

A few representative builds we've documented for local LLM work:

Build	Card	Cost (used)	Best for	Approx tok/s at q4 (model)
Entry hobby	RTX 3060 12 GB	$280	learning, 7B–13B chat	30–45 (on 13B)
Serious hobby	RTX 3090	$800	32B work	70–90 (on 13B)
Power user	RTX 4090	$1700	32B + DirectStorage gaming	100+ (on 13B)
Workstation	RTX A6000 48 GB	$4000+	70B on a single card	30–40 (on 70B)

For most home users, the entry-hobby build is what you actually need. A serious-hobby build is the right step up when 13B is no longer enough and you have a clear 32B use case. The power-user and workstation tiers buy you headroom for the future, not necessarily a transformative experience today.

When NOT to invest in local LLM hardware

Three honest cases where renting time on a hosted endpoint beats building a local rig in 2026: (1) your workload is occasional — a few hundred queries per month is cheaper on a hosted API than on hardware that draws power 24/7, (2) you specifically need a frontier model and no smaller open-weight model will do the job, (3) you're not going to actually use the hardware enough to amortize it — if the box sits idle, the cloud's pay-per-token model wins.

If none of those apply, the local rig pays for itself in months and pays you back with privacy, no rate limits, and the slow accumulation of a workflow that is genuinely yours.

Frequently asked questions

Can an RTX 3060 12GB run Llama 3.1 70B locally?

Not comfortably on a single card. A 70B model at q4_K_M needs roughly 40GB of VRAM, so a 12GB RTX 3060 must offload most layers to system RAM, dropping throughput into the low single-digit tok/s range. The 3060 is sized for 7B-13B models, where it stays fully on-GPU.

What is the largest model an RTX 3060 12GB runs well?

A 13B model at q4_K_M fits comfortably with room for context, and a 14B model fits at a tighter quantization. Public community measurements put a 13B at q4 in the 30-45 tok/s range on the 3060, which is responsive enough for single-user chat without offload penalties.

Does memory bandwidth or VRAM capacity matter more for local LLMs?

Capacity decides whether a model loads at all; bandwidth decides how fast it generates once loaded. For local inference the order is: first fit the weights in VRAM (capacity), then chase tok/s (bandwidth). A slower card that fits the model beats a faster card that has to offload to system RAM.

Will adding a second RTX 3060 let me run a 70B model?

Two 12GB cards give 24GB combined. That is too tight for a 70B model even at q2_K once the KV cache is counted, so a 70B still needs partial offload to system RAM on top of the quality loss; a 32B–34B-class model at q4 is the realistic target. Multi-GPU splitting via llama.cpp or vLLM works but adds PCIe-transfer overhead, so scaling is sub-linear rather than a clean 2x.

How much does context length affect the VRAM I need?

A lot. The KV cache grows with context length and model size, so a long 32K-token context can add several gigabytes on top of the weights. Plan headroom above the raw model size — a model that just fits at 4K context may overflow VRAM and start offloading once you push the window wider.
