LLM VRAM Requirements by Model in 2026: What Fits 12GB, 24GB, 48GB

Author: Mike Perry

Pick the model first, then the card — the per-tier math that decides every local AI build.

By Mike Perry · Published 2026-06-17 · Last verified 2026-07-29 · 11 min read

How much VRAM you actually need for local LLMs in 2026, broken down by model size, quantization level, and context length — with concrete numbers for 12GB, 24GB, and 48GB cards.

For local LLM use in 2026, the practical answer is: a 12GB card runs 7B-14B class models comfortably at q4_K_M, a 24GB card runs 27B-32B class models at q4, and 48GB starts to make 70B accessible without aggressive offload. The model size, the quantization level, and the context length each move VRAM independently, so picking a card without first picking a target model is how most builds end up too small.

Why per-model VRAM math beats GPU marketing

Most GPU buying advice for AI treats VRAM as one number on a spec sheet: buy more, get faster. That framing collapses the moment you sit down with a real model. A 14B-parameter model at full fp16 precision needs about 28GB just to hold the weights; the same model at q4_K_M takes around 8GB and runs comfortably on a 12GB card like the ZOTAC Gaming GeForce RTX 3060 12GB. The 5090's 32GB is genuinely useful, but only because it unlocks a specific tier of models at specific quantization levels you couldn't run before. Without the model-first frame, you end up either underbuying (a card too small for the model you actually wanted) or overbuying (a 24GB card you only ever feed 7B models, leaving most of its memory idle, or bandwidth you never touch).

This article walks the math from the bottom up. We cover the three meaningful VRAM tiers (12GB, 24GB, and 48GB) and what each one actually unlocks. We map the quantization grid (q2 through fp16) onto common model sizes (8B, 14B, 32B, 70B). We show how KV-cache growth from long contexts eats into the same budget. And we cover when CPU offload to system RAM via a chip like the AMD Ryzen 7 5800X is worth doing versus when you should just pick a smaller model. The goal is to get you to a build where, on day one, your target models fit cleanly in VRAM and run at usable tokens per second.

The frame to hold throughout: VRAM capacity decides what you can run at all; VRAM bandwidth decides how fast it runs once it fits. Both matter, but only after you've chosen a target. Per the TechPowerUp database, the RTX 3060 12GB has 360GB/s of memory bandwidth — slow by 2026 flagship standards, but more than enough to keep 14B-class quantized models north of 25 tok/s for single-user chat.

Key takeaways

12GB is the budget local-LLM floor. It cleanly runs 7B-14B-class models at q4_K_M with room for moderate context.
24GB is the practical sweet spot. It opens 27B-32B models at q4 and 14B models at q8 with long context.
48GB starts to make 70B viable. Below 48GB, 70B-class models need aggressive quantization or offload.
Quantization is the single biggest VRAM lever. Moving from fp16 to q4_K_M cuts weight memory by roughly 70% (about 16GB down to under 5GB for an 8B model) with minimal practical quality loss.
Context length is a hidden tax. A 32k context can add 2-4GB of KV-cache on a 7B-8B model and 6GB or more on a 14B, often pushing a borderline model out of VRAM.
CPU offload is a graceful degrade, not a free upgrade. Speed drops sharply once layers move to system RAM.

How much VRAM does a 7-9B model actually need?

A 7-8B model with full fp16 weights is roughly 14-16GB. At q8 (8-bit) it falls to 7-8GB. At q4_K_M (the common 4-bit format) it sits at 4-5GB. At q3 you can get under 4GB. The pattern: weight memory scales with bits per weight, so each halving of precision (16-bit to 8-bit, 8-bit to 4-bit) roughly halves the footprint, with most quality loss concentrated at q3 and below. Per the Hugging Face quantization docs, GPTQ and AWQ methods at 4-bit typically retain near-fp16 quality on broad benchmarks for 7B-class models, while community measurements at q3 show visible degradation in code and reasoning tasks.

For a 12GB card running a 7B model at q4, you have plenty of headroom: ~4GB for weights, ~1GB for the runtime, leaving 6-7GB for KV-cache. That comfortably supports 16k context or more. The same 12GB card with a 14B model at q4 leaves much less room — ~8GB for weights, leaving 3GB for cache. Usable, but 4k-8k context is the realistic working budget.
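The budget arithmetic above can be sketched in a few lines. This is a rough estimator, not a measurement: the effective bits-per-weight values are assumptions back-solved from the rounded sizes in this article, the 1GB runtime figure is the one used in the paragraph above, and the function names are illustrative.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight-only memory in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


def cache_headroom_gb(vram_gb: float, weights_gb: float, runtime_gb: float = 1.0) -> float:
    """VRAM left for KV-cache after weights and runtime overhead."""
    return vram_gb - weights_gb - runtime_gb


if __name__ == "__main__":
    # Effective bits per weight are assumptions, back-solved from the
    # article's rounded sizes for an 8B model (~16 / ~8 / ~4.8 GB).
    for label, bpw in [("fp16", 16.0), ("q8_0", 8.0), ("q4_K_M", 4.8)]:
        print(f"8B {label}: ~{weight_gb(8, bpw):.1f} GB weights")

    # The two 12GB examples from the text, using its rounded weight sizes.
    print("7B q4 headroom:", cache_headroom_gb(12, 4), "GB")
    print("14B q4 headroom:", cache_headroom_gb(12, 8), "GB")
```

Swapping in your own card size and target model gives a quick first answer to how much room is left for context before you download anything.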

What fits on a 12GB card today?

The ZOTAC RTX 3060 12GB and the MSI GeForce RTX 3060 Ventus 2X 12G are the canonical 12GB AI rigs as of 2026. Real-world reporting from places like Artificial Analysis and the r/LocalLLaMA community converges on a clear ladder:

8B-class (Llama 3.x 8B, Qwen 2.5 7B) at q5_K_M or q6_K → fits with 16k context. Generation speed 35-50 tok/s.
14B-class (Qwen 2.5 14B, Phi-4 14B) at q4_K_M → fits with 4k-8k context. Generation 22-30 tok/s.
27B-32B-class at q4 → does NOT fully fit. You can run with partial offload, accepting 3-8 tok/s.
70B-class → not practical on 12GB even at q2; pick a smaller model.

That ladder is the single most useful thing to internalize: 12GB is a 14B card, not a 32B card.

Quantization matrix: weights size by model and bit depth

Numbers below are weight-only memory, rounded for the most common GGUF-style mixed quantizations. Real VRAM usage adds 1-2GB of runtime plus KV-cache.

Quant level	8B model	14B model	32B model	70B model
fp16	~16 GB	~28 GB	~64 GB	~140 GB
q8_0	~8 GB	~14 GB	~32 GB	~70 GB
q6_K	~6.5 GB	~11 GB	~26 GB	~57 GB
q5_K_M	~5.5 GB	~10 GB	~22 GB	~48 GB
q4_K_M	~4.8 GB	~8.5 GB	~19 GB	~42 GB
q3_K_M	~3.8 GB	~6.8 GB	~15 GB	~33 GB
q2_K	~3.0 GB	~5.5 GB	~12 GB	~26 GB

Read down a column to see what each quantization step buys for one model size; read across a row to see which model sizes fit a given card at one quantization level. A 24GB card at q4_K_M comfortably runs 32B; a 12GB card cannot. Two 24GB cards (48GB total via tensor parallel) put 70B at q4 within reach.
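As a sketch of how the matrix can be used mechanically, the lookup below returns the highest-precision quantization whose weights fit a given card. The sizes are copied from the table; the 3GB reserve (runtime overhead plus a short-context KV-cache) is an assumption you should adjust, and the function name is hypothetical.

```python
# Weight-only sizes in GB, copied from the quantization matrix.
WEIGHTS_GB = {
    "fp16":   {"8B": 16.0, "14B": 28.0, "32B": 64.0, "70B": 140.0},
    "q8_0":   {"8B": 8.0,  "14B": 14.0, "32B": 32.0, "70B": 70.0},
    "q6_K":   {"8B": 6.5,  "14B": 11.0, "32B": 26.0, "70B": 57.0},
    "q5_K_M": {"8B": 5.5,  "14B": 10.0, "32B": 22.0, "70B": 48.0},
    "q4_K_M": {"8B": 4.8,  "14B": 8.5,  "32B": 19.0, "70B": 42.0},
    "q3_K_M": {"8B": 3.8,  "14B": 6.8,  "32B": 15.0, "70B": 33.0},
    "q2_K":   {"8B": 3.0,  "14B": 5.5,  "32B": 12.0, "70B": 26.0},
}

# Highest precision first, so the first match is the best quant that fits.
QUANTS_BEST_FIRST = ["fp16", "q8_0", "q6_K", "q5_K_M", "q4_K_M", "q3_K_M", "q2_K"]


def best_quant_that_fits(model: str, vram_gb: float, reserve_gb: float = 3.0):
    """Return the highest-precision quant whose weights fit, or None.

    reserve_gb is an assumed allowance for runtime overhead plus a
    short-context KV-cache; raise it for long contexts.
    """
    budget = vram_gb - reserve_gb
    for quant in QUANTS_BEST_FIRST:
        if WEIGHTS_GB[quant][model] <= budget:
            return quant
    return None


if __name__ == "__main__":
    for model, vram in [("14B", 12), ("32B", 12), ("32B", 24), ("70B", 48)]:
        print(f"{model} on {vram}GB ->", best_quant_that_fits(model, vram))
```

A result of None means no quantization in the table fits fully in VRAM under the chosen reserve, which is the point where offload or a smaller model enters the picture.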

Spec/benchmark table: RTX 3060 12GB vs 24GB-class cards

The 3060 stays in this comparison because, as of 2026, it remains the cheapest path to 12GB of GDDR6. The 24GB tier is dominated by the 3090 and 4090 on the used market; the newer 5090, with 32GB, sits just above that tier at retail. Per TechPowerUp and community measurements:

GPU	VRAM	Bandwidth	Approx. tok/s on 14B q4	Notes
RTX 3060 12GB	12 GB	360 GB/s	25-30	Budget AI floor
RTX 3090 24GB	24 GB	936 GB/s	70-90	Used-market sweet spot
RTX 4090 24GB	24 GB	1008 GB/s	90-110	Current consumer 24GB peak
RTX 5090 32GB	32 GB	1792 GB/s	130-160	Opens 32B at q5+ cleanly

Tok/s figures are rough community midpoints for single-user chat at a 4k context window; your model, runtime, and prompt mix will shift them. The point of the table is the ratio of bandwidth to throughput, not absolute numbers: a roughly 2.6× bandwidth jump from 3060 to 3090 lines up with roughly a 3× tok/s jump, which is what a memory-bandwidth-bound workload predicts. The small excess over 2.6× is within the spread of the community figures.
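One way to sanity-check the table is a simple bandwidth ceiling: if every generated token requires reading all the weights once, tokens per second cannot exceed bandwidth divided by weight size. The sketch below applies that to the cards above. It is an upper bound under that single assumption and ignores KV-cache reads, compute, and runtime overhead, so real numbers land below it.

```python
# Bandwidth figures (GB/s) from the spec table.
BANDWIDTH_GB_S = {
    "RTX 3060 12GB": 360,
    "RTX 3090 24GB": 936,
    "RTX 4090 24GB": 1008,
    "RTX 5090 32GB": 1792,
}

WEIGHTS_14B_Q4_GB = 8.5  # 14B at q4_K_M, from the quantization matrix


def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if each token needs one full read of the weights."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    for gpu, bw in BANDWIDTH_GB_S.items():
        ceiling = decode_ceiling_tok_s(bw, WEIGHTS_14B_Q4_GB)
        print(f"{gpu}: ceiling ~{ceiling:.0f} tok/s on 14B q4")
```

The community figures in the table sit below these ceilings, which is consistent with generation being limited mainly by memory bandwidth.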

How does context length change the VRAM budget?

The KV-cache stores attention key and value tensors per token, per layer, per attention head. It grows linearly with context length and is, for most modern open-weights models, larger than people expect.

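A rough rule for Llama-style models: each token of context costs roughly 2 × n_layers × n_kv_heads × head_dim × 2 bytes (the leading 2 for K and V, the trailing 2 for the bytes per fp16 cache entry). n_kv_heads is the number of key/value heads, which is smaller than the attention head count in models that use grouped-query attention. For a 14B Llama-style model with grouped-query attention that's roughly 200-300KB per token. At 8k context, that's 1.6-2.4GB of cache; at 32k, 6.4-10GB.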

On a 12GB card running a 14B model at q4_K_M, you have about 3GB of headroom for cache after weights and runtime — enough for 8k tokens of context, but 32k will not fit unless you also quantize the KV-cache or pick a smaller model. This is why "32k context" feature labels can be misleading: the model architecture supports it; your card may not.
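The per-token rule can be turned into a small calculator. The layer count, key/value head count, and head dimension below are illustrative assumptions for a 14B-class model with grouped-query attention, not values read from any specific model card; for a model without grouped-query attention, the key/value head count equals the attention head count. Check your model's config before relying on the result.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV-cache per token: K and V (x2) for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Total KV-cache size in GB (1 GB = 1e9 bytes) for one sequence."""
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem)
    return context_tokens * per_token / 1e9


if __name__ == "__main__":
    # Hypothetical 14B-class shape: 48 layers, 8 KV heads, head_dim 128.
    shape = dict(n_layers=48, n_kv_heads=8, head_dim=128)
    print("bytes per token:", kv_bytes_per_token(**shape))
    for ctx in (8192, 32768):
        fp16 = kv_cache_gb(ctx, **shape)
        q8 = kv_cache_gb(ctx, **shape, bytes_per_elem=1)  # 8-bit KV-cache
        print(f"{ctx} tokens: {fp16:.2f} GB fp16, {q8:.2f} GB at 8-bit")
```

Setting bytes_per_elem to 1 models an 8-bit KV-cache, which is the lever mentioned above for squeezing a longer context into the same card.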

When CPU offload to system RAM makes sense

Some inference runtimes, llama.cpp most prominently, let you split layers across GPU and CPU. The GPU runs the layers it has VRAM for; the CPU handles the rest using system RAM. The cost: each token's generation now requires a CPU pass that's bound by RAM bandwidth and core count, not GPU throughput.

On a build with a Ryzen 7 5800X and dual-channel DDR4-3200, you can expect roughly 3-8 tok/s for a partially offloaded 32B q4 model, versus 25+ tok/s for a smaller 14B at the same quantization running fully in VRAM on the same 12GB card. That tradeoff (running the bigger model slowly versus the smaller model quickly) is the real choice when you're VRAM-bound.
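A back-of-the-envelope model shows why the offloaded figure lands in single digits. It assumes each token reads every weight once, with the GPU-resident share read at GPU bandwidth and the rest at system RAM bandwidth. The 51.2 GB/s figure is the theoretical peak of dual-channel DDR4-3200, and the 9.5GB of GPU-resident weights is an assumed split after reserving room for runtime and cache. Real throughput will be lower than this bound.

```python
def split_tok_s(weights_gb: float, gpu_resident_gb: float,
                gpu_bw_gb_s: float, cpu_bw_gb_s: float) -> float:
    """Bandwidth-only estimate of tokens/s for a GPU/CPU layer split."""
    gpu_part = min(weights_gb, gpu_resident_gb)   # weights held in VRAM
    cpu_part = weights_gb - gpu_part              # weights left in system RAM
    seconds_per_token = gpu_part / gpu_bw_gb_s + cpu_part / cpu_bw_gb_s
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    DDR4_3200_DUAL_GB_S = 51.2  # 2 channels x 8 bytes x 3200 MT/s, theoretical peak
    RTX_3060_GB_S = 360.0

    # 32B q4 (~19 GB) with an assumed 9.5 GB of weights kept on the 12GB card.
    offloaded = split_tok_s(19.0, 9.5, RTX_3060_GB_S, DDR4_3200_DUAL_GB_S)
    # 14B q4 (~8.5 GB) fully in VRAM on the same card.
    in_vram = split_tok_s(8.5, 12.0, RTX_3060_GB_S, DDR4_3200_DUAL_GB_S)
    print(f"32B q4, half offloaded: ~{offloaded:.1f} tok/s ceiling")
    print(f"14B q4, fully in VRAM:  ~{in_vram:.1f} tok/s ceiling")
```

The CPU term dominates the sum as soon as a meaningful share of the weights sits in system RAM, which is why a faster GPU does little for a heavily offloaded model.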

Offload makes sense when:

The bigger model materially changes what you can do (coding, long-form reasoning, agentic loops).
You're patient with 3-8 tok/s.
Your prompts are short, so prefill doesn't dominate.

Offload is a bad call when:

You're doing interactive chat where latency matters per token.
The smaller model is within a few percentage points on your evaluation set.
Your CPU is older than Zen 3 or you're stuck on single-channel RAM.

Prefill vs generation: where VRAM pressure actually lands

A request has two phases. Prefill processes your prompt in parallel and is compute-bound: the GPU computes attention over every input token at once. Generation produces output tokens sequentially and is memory-bandwidth-bound: every new token reads the model weights and the entire KV-cache once.

For local interactive use, generation is what you feel. A short 100-token prompt and a 500-token response is 99% generation time. That's why community benchmarks usually report tok/s on the generation side, and why memory bandwidth (not raw FLOPs) is the dominant performance lever for chat. The 3090-to-4090 step shows this clearly: same VRAM, a much larger compute budget on the 4090, but only about 8% more bandwidth (1008 vs 936 GB/s), and the generation-speed gain stays modest at around 25%.
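The 99% figure is easy to check with a two-term time model. The generation rate below is taken from the 3060 range quoted in this article; the prefill rate of 1,000 tok/s is purely an assumed placeholder, since prefill speed varies widely by card and runtime.

```python
def generation_share(prompt_tokens: int, output_tokens: int,
                     prefill_tok_s: float, decode_tok_s: float) -> float:
    """Fraction of total request time spent generating output tokens."""
    prefill_s = prompt_tokens / prefill_tok_s
    decode_s = output_tokens / decode_tok_s
    return decode_s / (prefill_s + decode_s)


if __name__ == "__main__":
    PREFILL_TOK_S = 1000.0  # assumption, not a measured figure
    DECODE_TOK_S = 27.0     # within the 25-30 tok/s range quoted for the 3060

    short = generation_share(100, 500, PREFILL_TOK_S, DECODE_TOK_S)
    long_prompt = generation_share(8000, 500, PREFILL_TOK_S, DECODE_TOK_S)
    print(f"100-token prompt:  {short:.1%} of time in generation")
    print(f"8000-token prompt: {long_prompt:.1%} of time in generation")
```

The second case shows the flip side: with long prompts, as in RAG or document chat, prefill takes a much larger share and compute starts to matter again.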

Reasoning models like GLM-5.2 or DeepSeek-R1 push this further: they generate huge "reasoning trace" outputs before the user-visible answer, so generation latency dominates even more. On 12GB, prefer non-reasoning models if latency matters.

Perf-per-dollar: cheapest path to 14B-class local

If your target is "run 14B-class models at usable speed," the math in 2026 looks like:

RTX 3060 12GB, ~$280 used: 14B q4 at 25-30 tok/s. Best $/tok-on-day-one.
RTX 4060 Ti 16GB, ~$450 new: 14B q4 at ~40 tok/s plus room for q5/q6 or longer context.
RTX 3090 24GB, ~$700-900 used: 14B q5_K_M with 32k context; or 32B q4. Best $/capability.
RTX 4090 24GB, ~$1500+ used: Same VRAM ceiling as 3090, with ~25% more throughput. Diminishing returns for inference-only.

If you only want to run a 14B model, the 3060 12GB is still the rational pick in 2026: roughly 30-40% of the cost of a used 3090 for ~40% of the throughput on the workloads it can run. The 3090 wins the moment you want headroom (longer context, bigger model, fine-tuning), which is the case for most users planning their second build.
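To compare the options on one axis, the sketch below divides each price by the midpoint of its quoted 14B q4 throughput. Prices come from the list above, with the 3090 taken at the middle of its range and the 4090's "$1500+" treated as a floor; throughput for the 3090 and 4090 comes from the spec table. Treat the output as a rough ranking, since used prices move quickly.

```python
# (name, price in USD, (low, high) tok/s on 14B q4) from this article's figures.
CARDS = [
    ("RTX 3060 12GB", 280, (25, 30)),
    ("RTX 4060 Ti 16GB", 450, (40, 40)),
    ("RTX 3090 24GB", 800, (70, 90)),     # midpoint of the $700-900 range
    ("RTX 4090 24GB", 1500, (90, 110)),   # "$1500+" treated as a floor
]


def dollars_per_tok_s(price_usd: float, tok_s_range: tuple) -> float:
    """Price divided by the midpoint of the quoted tokens/s range."""
    midpoint = sum(tok_s_range) / 2
    return price_usd / midpoint


if __name__ == "__main__":
    for name, price, tok_range in CARDS:
        print(f"{name}: ${dollars_per_tok_s(price, tok_range):.2f} per tok/s")
```

This metric only covers speed on a 14B model. It says nothing about what the extra VRAM on the larger cards unlocks, which is the capability argument made above.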

Common pitfalls

Most VRAM-mismatch problems we see in the community trace back to one of these:

Picking a card for "AI" without picking a target model. Buy a 4090 to run 7B chat → wasted memory.
Ignoring KV-cache growth. A 12GB build that runs a 14B q4 at 4k context dies at 16k.
Believing the model card's "context length" implies you can use it. What the architecture supports and what your memory budget allows are independent.
Trying to run a model that doesn't fit instead of stepping down a tier. A 14B at q4 fully in VRAM beats a 32B at q4 with half its layers on CPU for nearly all interactive workloads.
Not accounting for runtime overhead. llama.cpp, vLLM, and TGI each have a 1-2GB runtime cost separate from weights and cache.

When NOT to build a 12GB local rig at all

Skip 12GB and either go to 24GB or stay on cloud APIs if:

Your target is 32B-class models or bigger, period.
You need long contexts (32k+) on anything bigger than 8B.
You want to fine-tune locally — even LoRA on a 14B benefits from 24GB.
You're doing batch or RAG workloads where larger batches matter.

In all those cases the 3060 will frustrate you within a week. The 12GB tier is for single-user chat and small agentic loops on 7B-14B models, full stop.

Bottom line

Pick the model first. Quantize aggressively but stop at q4_K_M unless you have a measured reason. Confirm the KV-cache fits your real context length. Then size the card. A 12GB build like an RTX 3060 paired with a Ryzen 7 5800X and dual-channel DDR4 is the cheapest legitimate AI rig in 2026 — it runs 14B-class q4 cleanly, takes a 32B at offloaded speeds when you need to, and leaves room for fast NVMe like the WD Blue SN550 1TB NVMe for model storage. If your target is bigger than 14B, save up for 24GB; if it's bigger than 32B, save for 48GB or use the cloud.

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

Frequently asked questions

Can a 12GB RTX 3060 run a 32B model?

Yes, but only with partial offload. A 32B model at q4_K_M needs roughly 19-20GB, which exceeds 12GB, so part of the model offloads to system RAM and tok/s drops sharply. Even q2_K weights come to about 12GB before runtime and KV-cache, so more aggressive quantization does not make it fit. For fully-in-VRAM speed on a 3060, 14B-class models at q4 are the practical ceiling; 32B is usable but slow.

How much does quantization actually cost in quality?

Community measurements generally show q8 is near-lossless versus fp16, q5_K_M and q4_K_M lose only a few percent on most benchmarks, while q3 and q2 show visible degradation in reasoning and code tasks. The sweet spot for 12GB cards is q4_K_M, which cuts weight memory by roughly 70% versus fp16 with minimal practical quality loss.

Does context length really change my VRAM needs?

Significantly. The KV-cache grows linearly with context length and number of layers, so a long 32k-token context can add several gigabytes on top of the model weights. On a 12GB card you often have to choose between a larger model and a longer context window; both compete for the same memory budget.

Will adding a Ryzen 7 5800X and more RAM speed up offloaded models?

It helps but doesn't replace VRAM. When layers offload to system memory, generation speed becomes bound by CPU compute and memory bandwidth. A fast 8-core CPU like the Ryzen 7 5800X and dual-channel RAM raise the floor for partially-offloaded models, but anything fully in VRAM will always be dramatically faster.

Is more VRAM or faster VRAM more important for inference?

Capacity decides what you can run at all; bandwidth decides how fast it runs once it fits. For local LLMs, fitting the model entirely in VRAM is the dominant factor — a slower card with enough memory beats a faster card that has to offload. Prioritize capacity until your target models fit, then optimize for bandwidth.
