For local LLM use in 2026, the practical answer is: a 12GB card runs 7B-14B class models comfortably at q4_K_M, a 24GB card runs 27B-32B class models at q4, and 48GB starts to make 70B accessible without aggressive offload. The model size, the quantization level, and the context length each move VRAM independently, so picking a card without first picking a target model is how most builds end up too small.
Why per-model VRAM math beats GPU marketing
Most GPU buying advice for AI talks about VRAM as one number on a spec sheet — buy more, get faster. That framing collapses the moment you sit down with a real model. A 14B-parameter model at full fp16 precision needs about 28GB just to hold the weights; the same model at q4_K_M takes around 8GB and runs comfortably on a 12GB card like the ZOTAC Gaming GeForce RTX 3060 12GB. The 5090's 32GB is genuinely useful, but only because it unlocks a specific tier of models at specific quantization levels you couldn't run before. Without the model-first frame, you end up either underbuying (a 24GB card you only ever feed 7B models, wasting half its memory) or overspending on bandwidth you never touch.
This article walks the math from the bottom up. We cover the three meaningful VRAM tiers (12GB, 24GB, and 48GB) and what each one actually unlocks. We map the quantization grid (q2 through fp16) onto common model sizes (8B, 14B, 32B, 70B). We show how KV-cache growth from long contexts eats into the same budget. And we cover when CPU offload to system RAM via a chip like the AMD Ryzen 7 5800X is worth doing versus when you should just pick a smaller model. The goal is to get you to a build where, on day one, your target models fit cleanly in VRAM and run at usable tokens per second.
The frame to hold throughout: VRAM capacity decides what you can run at all; VRAM bandwidth decides how fast it runs once it fits. Both matter, but only after you've chosen a target. Per the TechPowerUp database, the RTX 3060 12GB has 360GB/s of memory bandwidth — slow by 2026 flagship standards, but more than enough to keep 14B-class quantized models north of 25 tok/s for single-user chat.
Key takeaways
- 12GB is the budget local-LLM floor. It cleanly runs 7B-14B-class models at q4_K_M with room for moderate context.
- 24GB is the practical sweet spot. It opens 27B-32B models at q4 and 14B models at q8 with long context.
- 48GB starts to make 70B viable. Below 48GB, 70B-class models need aggressive quantization or offload.
- Quantization is the single biggest VRAM lever. Moving from fp16 to q4_K_M cuts weight memory by roughly 70% (about 16GB down to under 5GB for an 8B model) with minimal practical quality loss.
- Context length is a hidden tax. KV-cache grows linearly with context: at 32k tokens an fp16 cache can reach 6-10GB on a 14B model, often pushing a borderline model out of VRAM.
- CPU offload is a graceful degrade, not a free upgrade. Speed drops sharply once layers move to system RAM.
How much VRAM does a 7-9B model actually need?
A 7-8B model with full fp16 weights is roughly 14-16GB. At q8 (8-bit) it falls to 7-8GB. At q4_K_M (the common 4-bit format) it sits at 4-5GB. At q3 you can get under 4GB. The pattern: weight memory scales with bits per weight, so halving the bit width roughly halves the footprint, with most quality loss concentrated at q3 and below. Per the Hugging Face quantization docs, GPTQ and AWQ methods at 4-bit typically retain near-fp16 quality on broad benchmarks for 7B-class models, while community measurements at q3 show visible degradation in code and reasoning tasks.
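The relationship between bit width and weight size can be sketched in a few lines. This is a minimal estimate that assumes weights dominate and ignores per-block scale metadata; `weight_gb` is a hypothetical helper name, and the 4.8 bits-per-weight figure for q4_K_M is an assumed average rather than a format specification.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-only footprint in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal bit widths for a 7B model, plus an assumed ~4.8 bits/weight
# average for a mixed 4-bit format such as q4_K_M.
for label, bits in [("fp16", 16), ("q8", 8), ("q4 nominal", 4), ("q4_K_M (assumed)", 4.8)]:
    print(f"7B at {label}: ~{weight_gb(7, bits):.1f} GB")
```

The gap between the nominal 4-bit figure and the mixed-format figure is why q4_K_M files land at 4-5GB rather than exactly a quarter of fp16.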
For a 12GB card running a 7B model at q4, you have plenty of headroom: ~4GB for weights, ~1GB for the runtime, leaving 6-7GB for KV-cache. That comfortably supports 16k context or more. The same 12GB card with a 14B model at q4 leaves much less room — ~8GB for weights, leaving 3GB for cache. Usable, but 4k-8k context is the realistic working budget.
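The headroom arithmetic above is simple enough to script. The sketch below uses the same round numbers as the paragraph (about 1GB of runtime overhead, 4GB and 8GB of weights); `cache_headroom_gb` is a hypothetical helper, and real overhead varies by runtime.

```python
def cache_headroom_gb(vram_gb: float, weights_gb: float, runtime_gb: float = 1.0) -> float:
    """VRAM left for KV-cache after weights and runtime overhead.

    A negative result means the model does not fit at all.
    """
    return vram_gb - weights_gb - runtime_gb


for name, weights in [("7B q4", 4.0), ("14B q4", 8.0)]:
    room = cache_headroom_gb(12.0, weights)
    print(f"{name} on a 12GB card: ~{room:.0f} GB left for KV-cache")
```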
What fits on a 12GB card today?
The ZOTAC RTX 3060 12GB and the MSI GeForce RTX 3060 Ventus 2X 12G are the canonical 12GB cards for budget AI builds as of 2026. Real-world reporting from sources like Artificial Analysis and the r/LocalLLaMA community converges on a clear ladder:
- 8B-class (Llama 3.x 8B, Qwen 2.5 7B) at q5_K_M or q6_K → fits with 16k context. Generation speed 35-50 tok/s.
- 14B-class (Qwen 2.5 14B, Phi-4 14B) at q4_K_M → fits with 4k-8k context. Generation 22-30 tok/s.
- 27B-32B-class at q4 → does NOT fully fit. You can run with partial offload, accepting 3-8 tok/s.
- 70B-class → not practical on 12GB even at q2; pick a smaller model.
That ladder is the single most useful thing to internalize: 12GB is a 14B card, not a 32B card.
Quantization matrix: weights size by model and bit depth
Numbers below are weight-only memory, rounded for the most common GGUF-style mixed quantizations. Real VRAM usage adds 1-2GB of runtime plus KV-cache.
| Quant level | 8B model | 14B model | 32B model | 70B model |
|---|---|---|---|---|
| fp16 | ~16 GB | ~28 GB | ~64 GB | ~140 GB |
| q8_0 | ~8 GB | ~14 GB | ~32 GB | ~70 GB |
| q6_K | ~6.5 GB | ~11 GB | ~26 GB | ~57 GB |
| q5_K_M | ~5.5 GB | ~10 GB | ~22 GB | ~48 GB |
| q4_K_M | ~4.8 GB | ~8.5 GB | ~19 GB | ~42 GB |
| q3_K_M | ~3.8 GB | ~6.8 GB | ~15 GB | ~33 GB |
| q2_K | ~3.0 GB | ~5.5 GB | ~12 GB | ~26 GB |
Read down a column to see what each quantization step buys for a given model; read across a row to see which model sizes fit a card at a given quantization. A 24GB card at q4_K_M runs 32B with a few GB left for runtime and cache; a 12GB card cannot. Two 24GB cards (48GB total via tensor parallel) put 70B at q4 within reach.
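To turn the matrix into a quick fit check, one option is to encode it and ask for the highest-precision quantization that fits a given card. The sketch below copies the table's weight sizes and assumes a flat 3GB allowance for runtime plus a modest KV-cache; that allowance and the `best_quant` helper are illustrative assumptions, not measured values.

```python
QUANTS = ["fp16", "q8_0", "q6_K", "q5_K_M", "q4_K_M", "q3_K_M", "q2_K"]  # high to low precision
MODELS = ["8B", "14B", "32B", "70B"]

# Weight-only sizes in GB, copied row by row from the matrix above.
TABLE = [
    [16.0, 28.0, 64.0, 140.0],
    [8.0, 14.0, 32.0, 70.0],
    [6.5, 11.0, 26.0, 57.0],
    [5.5, 10.0, 22.0, 48.0],
    [4.8, 8.5, 19.0, 42.0],
    [3.8, 6.8, 15.0, 33.0],
    [3.0, 5.5, 12.0, 26.0],
]


def best_quant(vram_gb: float, overhead_gb: float = 3.0) -> dict:
    """For each model size, the highest-precision quant that fits, or None."""
    result = {}
    for col, model in enumerate(MODELS):
        result[model] = None
        for row, quant in enumerate(QUANTS):
            if TABLE[row][col] + overhead_gb <= vram_gb:
                result[model] = quant
                break  # rows are ordered high to low precision
    return result


for vram in (12, 24, 48):
    print(f"{vram} GB: {best_quant(vram)}")
```

Raising `overhead_gb` to reflect a longer context shifts every answer toward lower precision, which mirrors the context-length tradeoff discussed later in the article.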
Spec/benchmark table: RTX 3060 12GB vs 24GB-class cards
The 3060 stays in this comparison because, as of 2026, it remains the cheapest path to 12GB of GDDR6. The 24GB tier is dominated by the 3090 and 4090, both mostly bought on the used market; the 32GB 5090 is included as the next step up. Per TechPowerUp and community measurements:
| GPU | VRAM | Bandwidth | Approx. tok/s on 14B q4 | Notes |
|---|---|---|---|---|
| RTX 3060 12GB | 12 GB | 360 GB/s | 25-30 | Budget AI floor |
| RTX 3090 24GB | 24 GB | 936 GB/s | 70-90 | Used-market sweet spot |
| RTX 4090 24GB | 24 GB | 1008 GB/s | 90-110 | Current consumer 24GB peak |
| RTX 5090 32GB | 32 GB | 1792 GB/s | 130-160 | Opens 32B at q5+ cleanly |
Tok/s figures are rough community ranges for single-user chat at a 4k context window; your model, runtime, and prompt mix will shift them. The point of the table is the ratio of bandwidth to throughput, not absolute numbers: a roughly 2.6× bandwidth jump from 3060 to 3090 lines up with a roughly 3× tok/s jump, so generation speed tracks memory bandwidth to a first approximation.
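One way to sanity-check figures like these is a bandwidth ceiling: if every generated token has to stream all the weights from VRAM once, tokens per second cannot exceed bandwidth divided by weight size. The sketch below applies that simplified model to the table's numbers, using the ~8.5GB figure for 14B at q4_K_M from the matrix above; it ignores KV-cache reads and compute, so it is an upper bound, not a prediction.

```python
WEIGHTS_14B_Q4_GB = 8.5  # from the quantization matrix above

# name -> (memory bandwidth in GB/s, reported tok/s range from the table)
CARDS = {
    "RTX 3060 12GB": (360, (25, 30)),
    "RTX 3090 24GB": (936, (70, 90)),
    "RTX 4090 24GB": (1008, (90, 110)),
    "RTX 5090 32GB": (1792, (130, 160)),
}


def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if each token streams all weights once."""
    return bandwidth_gb_s / weights_gb


for name, (bandwidth, (low, high)) in CARDS.items():
    ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_14B_Q4_GB)
    print(f"{name}: ceiling ~{ceiling:.0f} tok/s, reported {low}-{high}")
```

Every reported range sits below its ceiling, which is what you would expect if generation is bandwidth-bound with some overhead on top.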
How does context length change the VRAM budget?
The KV-cache stores attention key and value tensors per token, per layer, per attention head. It grows linearly with context length and is, for most modern open-weights models, larger than people expect.
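A rough rule for Llama-style models: each token of context costs roughly 2 × n_layers × n_kv_heads × head_dim × 2 bytes (the leading 2 for K and V, the trailing 2 for the bytes in an fp16 cache entry). Here n_kv_heads is the number of key/value heads, which is smaller than the query-head count in models that use grouped-query attention. For a 14B Llama-style model that's roughly 200-300KB per token. At 8k context, that's 1.6-2.4GB of cache; at 32k, 6.4-10GB.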
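The per-token rule can be written out directly. In the sketch below the head count is the number of key/value heads, and the shape used (48 layers, 8 key/value heads, head dimension 128) is an assumed example of a 14B-class configuration, not a value taken from any specific model card; check your model's config before relying on it.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV-cache per token: K and V, per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_gb(context_tokens: int, per_token_bytes: int) -> float:
    """Total cache size in GB (1 GB = 1e9 bytes) for a given context."""
    return context_tokens * per_token_bytes / 1e9


# Assumed 14B-class shape, fp16 cache.
per_token = kv_bytes_per_token(n_layers=48, n_kv_heads=8, head_dim=128)
print(f"per token: {per_token / 1024:.0f} KiB")
for context in (8192, 32768):
    print(f"{context} tokens: ~{kv_cache_gb(context, per_token):.2f} GB")
```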
On a 12GB card running a 14B model at q4_K_M, you have about 3GB of headroom for cache after weights and runtime — enough for 8k tokens of context, but 32k will not fit unless you also quantize the KV-cache or pick a smaller model. This is why "32k context" feature labels can be misleading: the model architecture supports it; your card may not.
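Inverting the same arithmetic gives the longest context a given headroom supports. The sketch below takes the 3GB headroom and the 200-300KB per-token range from the text, and models cache quantization as a plain reduction in bytes per element; that ignores the scale metadata real quantized caches carry, so treat the quantized figures as optimistic. `max_context_tokens` is a hypothetical helper.

```python
def max_context_tokens(headroom_gb: float, fp16_kv_bytes_per_token: float,
                       cache_bytes_per_elem: float = 2.0) -> int:
    """Largest context whose KV-cache fits in the given headroom."""
    per_token = fp16_kv_bytes_per_token * cache_bytes_per_elem / 2.0
    return int(headroom_gb * 1e9 // per_token)


HEADROOM_GB = 3.0  # 12GB card, 14B at q4_K_M, after weights and runtime

for per_token in (200_000, 300_000):
    for label, elem_bytes in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
        tokens = max_context_tokens(HEADROOM_GB, per_token, elem_bytes)
        print(f"{per_token // 1000} KB/token, {label} cache: up to {tokens} tokens")
```

Under these assumptions an fp16 cache tops out between 10k and 15k tokens, which is consistent with 8k fitting and 32k not fitting.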
When CPU offload to system RAM makes sense
Some inference runtimes, llama.cpp most prominently, let you split layers across GPU and CPU; support in other runtimes varies, so check before you plan around it. The GPU runs the layers it has VRAM for; the CPU handles the rest using system RAM. The cost: each token's generation now requires a CPU pass that's bound by RAM bandwidth and core count, not GPU throughput.
On a build with a Ryzen 7 5800X and dual-channel DDR4-3200, you can expect roughly 3-8 tok/s for a 32B q4 model with part of its layers offloaded to system RAM, versus 25+ tok/s for a smaller 14B at the same quantization held fully in VRAM on the same 12GB card. That tradeoff (running the bigger model slowly versus the smaller model quickly) is the real choice when you're VRAM-bound.
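A rough way to see why offload is so costly is to add up the time each token spends streaming weights from each memory pool. The sketch below assumes a 19GB 32B q4 model split as 9GB in VRAM and 10GB in system RAM (an illustrative split), the 3060's 360 GB/s, and the 51.2 GB/s theoretical peak of dual-channel DDR4-3200; real throughput will be lower because neither pool reaches its peak.

```python
def split_decode_tok_s(gpu_gb: float, cpu_gb: float,
                       gpu_bw_gb_s: float, cpu_bw_gb_s: float) -> float:
    """Bandwidth-bound tokens/s when weights are split across GPU and CPU."""
    seconds_per_token = gpu_gb / gpu_bw_gb_s + cpu_gb / cpu_bw_gb_s
    return 1.0 / seconds_per_token


GPU_BW = 360.0   # RTX 3060 12GB
CPU_BW = 51.2    # dual-channel DDR4-3200: 3200 MT/s * 8 bytes * 2 channels

offloaded = split_decode_tok_s(gpu_gb=9.0, cpu_gb=10.0,
                               gpu_bw_gb_s=GPU_BW, cpu_bw_gb_s=CPU_BW)
print(f"32B q4, partly offloaded: ~{offloaded:.1f} tok/s at best")
```

Nearly all of the per-token time comes from the system-RAM portion, so moving even a few more layers into VRAM helps more than a faster GPU would.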
Offload makes sense when:
- The bigger model materially changes what you can do (coding, long-form reasoning, agentic loops).
- You're patient with 3-8 tok/s.
- Your prompts are short, so prefill doesn't dominate.
Offload is a bad call when:
- You're doing interactive chat where latency matters per token.
- The smaller model is within a few percentage points on your evaluation set.
- Your CPU is older than Zen 3 or you're stuck on single-channel RAM.
Prefill vs generation: where VRAM pressure actually lands
A request has two phases. Prefill processes your prompt in parallel and is compute-bound: the GPU computes attention over every input token at once. Generation produces output tokens sequentially and is memory-bandwidth-bound: every new token reads the entire KV-cache and the model weights once.
For local interactive use, generation is what you feel. A short 100-token prompt and a 500-token response is about 99% generation time. That's why community benchmarks usually report tok/s on the generation side, and why memory bandwidth (not raw FLOPs) is the dominant performance lever for chat. The 3090-to-4090 jump shows this clearly: same VRAM, far more compute on paper, but only about 8% more memory bandwidth (936 vs 1008 GB/s), and the generation-speed gain in the table above stays at roughly 25% rather than scaling with compute.
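The split between the two phases is easy to estimate once you have a rate for each. The sketch below assumes a prefill rate of 1,000 tok/s and a generation rate of 25 tok/s; both are illustrative placeholders rather than measurements, and `generation_share` is a hypothetical helper.

```python
def generation_share(prompt_tokens: int, output_tokens: int,
                     prefill_tok_s: float, decode_tok_s: float) -> float:
    """Fraction of total request time spent generating output tokens."""
    prefill_seconds = prompt_tokens / prefill_tok_s
    decode_seconds = output_tokens / decode_tok_s
    return decode_seconds / (prefill_seconds + decode_seconds)


# Short chat turn versus a long-prompt request (assumed rates).
chat = generation_share(100, 500, prefill_tok_s=1000.0, decode_tok_s=25.0)
long_prompt = generation_share(8000, 200, prefill_tok_s=1000.0, decode_tok_s=25.0)
print(f"chat turn: {chat:.1%} of time in generation")
print(f"long prompt: {long_prompt:.1%} of time in generation")
```

With a long prompt and a short answer, prefill takes a much larger share, which is why RAG-style workloads feel different from chat on the same card.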
Reasoning models like GLM-5.2 or DeepSeek-R1 amplify this: they generate long "reasoning trace" outputs before the user-visible answer, so generation latency dominates even more. On 12GB, prefer non-reasoning models if latency matters.
Perf-per-dollar: cheapest path to 14B-class local
If your target is "run 14B-class models at usable speed," the math in 2026 looks like:
- RTX 3060 12GB, ~$280 used: 14B q4 at 25-30 tok/s. Best $/tok-on-day-one.
- RTX 4060 Ti 16GB, ~$450 new: 14B q4 at roughly 3060-class generation speed (its 288 GB/s memory bandwidth is below the 3060's 360 GB/s), plus room for q5/q6 or longer context.
- RTX 3090 24GB, ~$700-900 used: 14B q5_K_M with 32k context; or 32B q4. Best $/capability.
- RTX 4090 24GB, ~$1500+ used: Same VRAM ceiling as 3090, with ~25% more throughput. Diminishing returns for inference-only.
If you only want to run a 14B model, the 3060 12GB is still the rational pick in 2026: roughly a third of the cost of a used 3090 for roughly a third of the throughput on the workloads it can run, at a speed that is already comfortable for single-user chat. The 3090 wins the moment you want headroom (longer context, bigger model, fine-tuning), which is the case for most users planning their second build.
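The price and throughput figures above can be reduced to a single tokens-per-second-per-dollar number. The sketch below covers the three cards that have a 14B q4 throughput range in the earlier table, takes the midpoint of each range, and uses $800 as the midpoint of the 3090's quoted $700-900; used prices move quickly, so treat the output as a snapshot of the article's own numbers.

```python
# (name, approximate price in USD, reported 14B q4 tok/s range)
CARDS = [
    ("RTX 3060 12GB", 280, (25, 30)),
    ("RTX 3090 24GB", 800, (70, 90)),    # midpoint of the $700-900 range
    ("RTX 4090 24GB", 1500, (90, 110)),
]


def tok_s_per_100_dollars(price_usd: float, tok_s_range: tuple) -> float:
    """Midpoint throughput per $100 of card price."""
    midpoint = sum(tok_s_range) / 2
    return midpoint / price_usd * 100


for name, price, tok_range in CARDS:
    value = tok_s_per_100_dollars(price, tok_range)
    print(f"{name}: ~{value:.1f} tok/s per $100")
```

On these numbers the 3060 and 3090 land close together per dollar, so the choice comes down to absolute budget versus headroom, while the 4090 trails both for inference-only use.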
Common pitfalls
Most VRAM-mismatch problems we see in the community trace back to one of these:
- Picking a card for "AI" without picking a target model. Buying a 4090 to run 7B chat leaves most of its memory unused.
- Ignoring KV-cache growth. A 12GB build that runs a 14B q4 at 4k context dies at 16k.
- Believing the model card's "context length" implies you can use it. The architecture's limit and your memory budget are independent.
- Trying to run a model that doesn't fit instead of stepping down a tier. A 14B at q4 fully in VRAM beats a 32B at q4 with half its layers on CPU for nearly all interactive workloads.
- Not accounting for runtime overhead. llama.cpp, vLLM, and TGI each have a 1-2GB runtime cost separate from weights and cache.
When NOT to build a 12GB local rig at all
Skip 12GB and either go to 24GB or stay on cloud APIs if:
- Your target is 32B-class models or bigger, period.
- You need long contexts (32k+) on anything bigger than 8B.
- You want to fine-tune locally — even LoRA on a 14B benefits from 24GB.
- You're doing batch or RAG workloads where larger batches matter.
In all those cases the 3060 will frustrate you within a week. The 12GB tier is for single-user chat and small agentic loops on 7B-14B models, full stop.
Bottom line
Pick the model first. Quantize aggressively but stop at q4_K_M unless you have a measured reason. Confirm the KV-cache fits your real context length. Then size the card. A 12GB build like an RTX 3060 paired with a Ryzen 7 5800X and dual-channel DDR4 is the cheapest legitimate AI rig in 2026: it runs 14B-class q4 cleanly, takes a 32B at offloaded speeds when you need to, and leaves budget for an NVMe drive like the WD Blue SN550 1TB for model storage. If your target is bigger than 14B, save up for 24GB; if it's bigger than 32B, save for 48GB or use the cloud.
Citations and sources
- TechPowerUp — GeForce RTX 3060 specifications
- Hugging Face — Quantization overview
- Artificial Analysis — LLM benchmarks and tokens-per-second comparisons
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
