If you want to run Llama 3.1 70B fully on a single GPU in 2026, you need roughly 40 GB of VRAM for the weights alone at q4_K_M, which means a 48 GB workstation card such as the RTX A6000 or an 80 GB datacenter card such as the H100. A 12 GB consumer card like the MSI GeForce RTX 3060 Ventus 2X 12G is sized for the 7B–13B class. Pick by the model you want to load, then check tok/s, not the other way around.
Why VRAM, not TFLOPs, is the gating spec for local inference
For local language models, the question that decides whether a card is usable at all is "do the weights fit in VRAM?" Everything else — raw compute, memory bandwidth, PCIe lane count — is a secondary tuning knob.
The reason is simple: when a model does not fit in GPU memory, inference frameworks like llama.cpp offload layers to system RAM and shuttle activations across the PCIe bus on every token. PCIe 4.0 x16 tops out near 32 GB/s per direction; GDDR6 on an RTX 3060 runs at 360 GB/s; HBM3 on an H100 runs above 3 TB/s. A model that needs to bounce between system RAM and the GPU has effectively traded its VRAM bandwidth for the far slower host-side path, and that means tokens per second collapse.
So as of 2026, the practical map looks like this: 8 GB cards run 7B models at low context lengths, 12 GB cards run 7B comfortably and 13B usably, 16 GB cards reach 14B with room for context, 24 GB cards handle 32B-class models well, and you need 48 GB or more to keep a 70B model entirely on-GPU at a useful quantization. Anything outside those tiers either does not load or runs so slowly that you stop using it. This article walks each tier and names the models you can realistically expect to run.
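The bandwidth argument above can be turned into a back-of-the-envelope ceiling. The sketch below assumes that decoding reads every weight once per token, so bandwidth divided by model size bounds tokens per second. The function names and the simple two-pool model are illustrative assumptions, not how any framework actually schedules work, and real throughput lands below these ceilings.

```python
def decode_ceiling_tok_s(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every weight is read once per token."""
    return bandwidth_gb_s / weights_gb


def split_ceiling_tok_s(
    weights_gb: float,
    fraction_on_gpu: float,
    gpu_bw_gb_s: float,
    slow_bw_gb_s: float,
) -> float:
    """Same bound when part of the weights sits behind a slower path.

    Assumption: the two pools are read one after the other, so the
    per-token times simply add up.
    """
    gpu_time = weights_gb * fraction_on_gpu / gpu_bw_gb_s
    slow_time = weights_gb * (1.0 - fraction_on_gpu) / slow_bw_gb_s
    return 1.0 / (gpu_time + slow_time)


if __name__ == "__main__":
    # 32B at q4_K_M (~19 GB) fully resident on a 936 GB/s card.
    print(f"on-GPU ceiling: {decode_ceiling_tok_s(19.0, 936.0):.1f} tok/s")
    # Same weights, half on a 360 GB/s card and half behind a ~32 GB/s link.
    print(f"half offloaded: {split_ceiling_tok_s(19.0, 0.5, 360.0, 32.0):.1f} tok/s")
```

Under these assumptions the slow pool dominates the per-token time even when it holds only half the weights, which is why partial offload costs far more than half the speed.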
Key takeaways
- VRAM capacity decides what fits; memory bandwidth decides how fast it generates once loaded — in that order.
- A 12 GB RTX 3060 is the practical floor for serious local LLM work: 7B at q8_0 or 13B at q4_K_M with headroom for context.
- 24 GB consumer cards (4090-class) host 32B models comfortably and 70B with offload; 48 GB workstation cards run 70B fully on-GPU.
- Quantization is the dominant lever: going from fp16 to q4_K_M cuts VRAM use to roughly 30% of the original with a small quality hit.
- Context length quietly eats VRAM: for a 13B model, a 32k context can add 6 GB or more on top of the model weights.
- Used 12 GB cards remain the perf-per-dollar pick for single-user local chat in 2026.
How much VRAM does each model class actually need? (7B / 13B / 32B / 70B at q4_K_M)
These are working approximations at q4_K_M in the GGUF format, the community's usual sweet-spot quantization, with a modest 4k context. Real numbers vary by ±10% depending on the tokenizer, the framework, and how much KV cache you allocate.
| Model size | Weights at q4_K_M | + 4k KV cache | Total VRAM needed | Smallest GPU that fits |
|---|---|---|---|---|
| 7B | ~4.3 GB | ~0.5 GB | ~5 GB | 8 GB |
| 13B | ~7.5 GB | ~0.8 GB | ~8.5 GB | 12 GB |
| 32B | ~19 GB | ~1.5 GB | ~21 GB | 24 GB |
| 70B | ~40 GB | ~3 GB | ~43 GB | 48 GB |
| 405B | ~225 GB | ~6 GB | ~230 GB | multi-GPU or CPU-offload only |
Bump the quantization down to q3_K_S to gain roughly 25% VRAM headroom at a noticeable quality cost; bump up to q5_K_M or q6_K for better fidelity if the model still fits.
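The weights column in the table can be approximated from parameter count and average bits per weight. The sketch below is a rough check, not a replacement for measuring a real GGUF file: the 4.8 bits-per-weight figure for q4_K_M is an assumed average, and the helper names are hypothetical.

```python
# Assumed average bits per weight; real GGUF files vary by model and tensor mix.
ASSUMED_BITS_PER_WEIGHT = {"q4_K_M": 4.8, "fp16": 16.0}

# Weights column from the table above, in GB, keyed by parameters in billions.
TABLE_WEIGHTS_GB = {7: 4.3, 13: 7.5, 32: 19.0, 70: 40.0}


def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """params * bits / 8 gives bytes; with params in billions the result is GB."""
    return params_billions * bits_per_weight / 8.0


if __name__ == "__main__":
    bpw = ASSUMED_BITS_PER_WEIGHT["q4_K_M"]
    for params, table_gb in TABLE_WEIGHTS_GB.items():
        est = estimate_weights_gb(params, bpw)
        print(f"{params}B: estimated {est:.1f} GB, table {table_gb} GB")
```

With that assumed average, each estimate lands within the ±10% band quoted for the table.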
What can a 12 GB RTX 3060 run, and at what tok/s?
The RTX 3060 12 GB sits in the sweet spot for hobbyist local LLM work because it is one of the cheapest NVIDIA cards with enough VRAM to host a 13B model entirely on-GPU. Community benchmarks on Hugging Face and the llama.cpp discussion threads consistently show:
- 7B at q4_K_M — 60–80 tok/s, interactive-feeling for any chat or code task.
- 13B at q4_K_M — 30–45 tok/s, still responsive for single-user use.
- 14B at q4_K_S — 20–30 tok/s, the upper edge of what fits with room for context.
- 20B+ — only with CPU offload; tok/s drops into single digits.
If you pair the card with an 8-core CPU like the AMD Ryzen 7 5800X and 32 GB of system RAM, prefill on long prompts stays smooth and any offloaded layers do not bottleneck the GPU on warm-up. Our dedicated guide, "Is the RTX 3060 12GB Still Worth It in 2026?", walks through the broader value case.
GPU tier spec table
A snapshot of the cards that matter for local LLM work in 2026, ordered by VRAM:
| GPU | VRAM | Mem bandwidth | Approx MSRP / used | Largest comfortable model |
|---|---|---|---|---|
| RTX 3060 12 GB | 12 GB GDDR6 | 360 GB/s | $260–320 used | 13B q4_K_M |
| RTX 4060 Ti 16 GB | 16 GB GDDR6 | 288 GB/s | $450 new | 14B q4_K_M with room |
| RTX 3090 | 24 GB GDDR6X | 936 GB/s | $700–900 used | 32B q4_K_M |
| RTX 4090 | 24 GB GDDR6X | 1008 GB/s | $1600+ used | 32B q4_K_M, 70B with offload |
| RTX A6000 | 48 GB GDDR6 ECC | 768 GB/s | $4000+ used | 70B fully on-GPU |
| H100 80 GB | 80 GB HBM3 | 3 TB/s | datacenter | 70B at higher quant + long context |
For the single-user enthusiast, the 12 GB and 24 GB tiers do the heavy lifting; the 16 GB tier in between sits at an awkward price point, and the workstation cards above are rarely worth the markup at home.
Quantization matrix: what q-level should you use?
Picking the right quantization is where most local-LLM beginners lose performance. Here is the practical matrix for an 8B model and a 32B model on the RTX 3060 12 GB and a 24 GB card, respectively.
| Quant | VRAM for 8B | Tok/s on 12 GB | Quality loss | VRAM for 32B | Notes |
|---|---|---|---|---|---|
| q2_K | 3.2 GB | 90+ | noticeable | 11 GB | last resort; coherence drops |
| q3_K_M | 4.0 GB | 80 | small but visible | 14 GB | aggressive but usable |
| q4_K_S | 4.5 GB | 75 | very small | 17 GB | the speed pick |
| q4_K_M | 4.9 GB | 70 | minimal | 19 GB | the default sweet spot |
| q5_K_M | 5.7 GB | 60 | imperceptible for chat | 22 GB | the quality pick |
| q6_K | 6.6 GB | 55 | none in blind tests | 26 GB | 32B doesn't fit a 24 GB card |
| q8_0 | 8.5 GB | 45 | none | 34 GB | 8B fits 12 GB with limited context; reference |
| fp16 | 16 GB | n/a | none | 64 GB | 8B doesn't fit 12 GB; 32B doesn't fit 24 GB |
For most users, q4_K_M is the answer. Drop to q4_K_S if you want a little more speed or the model is a marginal fit; bump up to q5_K_M if you have VRAM to spare and want better quality.
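One way to apply the matrix is to pick the highest-quality quantization whose weights fit after reserving VRAM for the KV cache and runtime overhead. The sketch below uses the sizes from the table; the `best_quant` helper and the 3 GB reserve are illustrative assumptions.

```python
from typing import Dict, Optional

# Sizes in GB from the matrix above, ordered from highest to lowest fidelity.
QUANTS_8B: Dict[str, float] = {
    "fp16": 16.0, "q8_0": 8.5, "q6_K": 6.6, "q5_K_M": 5.7,
    "q4_K_M": 4.9, "q4_K_S": 4.5, "q3_K_M": 4.0, "q2_K": 3.2,
}
QUANTS_32B: Dict[str, float] = {
    "fp16": 64.0, "q8_0": 34.0, "q6_K": 26.0, "q5_K_M": 22.0,
    "q4_K_M": 19.0, "q4_K_S": 17.0, "q3_K_M": 14.0, "q2_K": 11.0,
}


def best_quant(sizes_gb: Dict[str, float], vram_gb: float,
               reserve_gb: float = 3.0) -> Optional[str]:
    """Return the first (highest-fidelity) quant that fits, or None."""
    budget = vram_gb - reserve_gb
    for name, size in sizes_gb.items():  # dicts keep insertion order
        if size <= budget:
            return name
    return None


if __name__ == "__main__":
    print("8B on 12 GB:", best_quant(QUANTS_8B, 12))
    print("32B on 24 GB:", best_quant(QUANTS_32B, 24))
    print("32B on 12 GB:", best_quant(QUANTS_32B, 12))
```

The reserve is the knob to adjust: a long-context workload needs a larger reserve, which pushes the pick toward smaller quantizations.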
Prefill vs generation: why context length quietly eats VRAM
Two phases of inference behave very differently. Prefill (reading your prompt) is compute-bound and scales roughly linearly with the number of input tokens. Generation (emitting one token at a time) is memory-bandwidth-bound and scales with model size, not prompt length. Both phases rely on the KV cache, which shares VRAM with the weights and grows with context length × number of layers × key/value width × two (key and value) × bytes per element.
In practice, a 13B model with a 4k context holds about 0.8 GB of KV cache; bump that to 32k and the cache balloons past 6 GB, which is enough to push a 13B q4 model out of a 12 GB card. If you plan to feed large documents into the model, allocate VRAM accordingly or use a framework that offloads the KV cache to system RAM (vLLM and llama.cpp both support this, with tok/s consequences).
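The KV-cache growth described above is plain multiplication, sketched below. The architecture numbers (48 layers, 8 key/value heads of dimension 128, fp16 cache) are an assumed configuration for a 13B to 14B class model with grouped-query attention, chosen for illustration; substitute the values from your model's config.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # fp16 cache; an assumption
) -> int:
    """Keys and values (factor 2) for every layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


if __name__ == "__main__":
    # Hypothetical 13B-14B class configuration, not a specific model's spec.
    layers, kv_heads, head_dim = 48, 8, 128
    for ctx in (4096, 32768):
        gb = kv_cache_bytes(layers, kv_heads, head_dim, ctx) / 1e9
        print(f"{ctx} tokens: {gb:.2f} GB of KV cache")
```

Under this assumed configuration the cache is about 0.8 GB at 4k and grows eightfold at 32k, since it is linear in context length. Models without grouped-query attention have more key/value heads and a proportionally larger cache.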
When do you need two GPUs or a CPU-offload fallback for 70B?
For a 70B model at any usable quality, the honest answer is: get a 48 GB card, or pair two 24 GB cards via tensor parallelism in vLLM or pipeline parallelism in llama.cpp. Two RTX 3090s give 48 GB combined and can run 70B at q4_K_M with a small PCIe-transfer tax. Scaling is sub-linear: expect roughly 1.6× the throughput a single card of the same type would deliver if the model fit on it.
CPU offload is the fallback when you cannot afford a second GPU. With a fast desktop CPU and 64 GB of system RAM, llama.cpp will run a 70B q4 model at 1–4 tok/s — slow enough that you start avoiding it, but fast enough to validate that your prompt template works before you provision real hardware.
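For the CPU-offload fallback, a quick estimate of how many layers fit on the GPU helps set expectations. The sketch below assumes evenly sized layers and ignores embeddings and output tensors; the 80-layer count and the 3 GB reserve are assumptions to adjust for your model. The result is the kind of number you would pass to llama.cpp's `--n-gpu-layers` option.

```python
def gpu_layers_that_fit(
    total_layers: int,
    weights_gb: float,
    vram_gb: float,
    reserve_gb: float = 3.0,
) -> int:
    """Number of whole layers that fit in VRAM after the reserve.

    Assumption: every layer is the same size (weights_gb / total_layers).
    """
    per_layer_gb = weights_gb / total_layers
    budget_gb = max(0.0, vram_gb - reserve_gb)
    return min(total_layers, int(budget_gb // per_layer_gb))


if __name__ == "__main__":
    # 70B at q4_K_M: ~40 GB of weights, assumed 80 transformer layers.
    for vram in (12, 24, 48):
        n = gpu_layers_that_fit(80, 40.0, vram)
        print(f"{vram} GB card: about {n} of 80 layers on GPU")
```

The remaining layers run from system RAM, which is where the single-digit tok/s figures come from.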
Perf-per-dollar and perf-per-watt: is a used 12 GB card still the value pick?
For single-user chat, code help, and summarization with 7B–13B models, the RTX 3060 12 GB at $260–320 used is hard to beat in 2026. A used 3090 doubles your VRAM but roughly triples the price and doubles the wall power, which only pays off if you actually need 32B-class models day to day.
For the perf-per-watt crowd, the 3060 also wins on idle: it draws under 15 W when not generating, which matters if the box runs 24/7 as part of a homelab. A 4090 draws over 30 W at idle and many times that during inference.
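The value comparison can be made concrete with two ratios: VRAM per dollar and memory bandwidth per dollar. The sketch below takes VRAM and bandwidth from the GPU tier table; the prices are assumed midpoints of the quoted ranges (and the lower bound for the 4090), so treat the ranking as illustrative.

```python
# (name, vram_gb, bandwidth_gb_s, assumed_price_usd)
# Prices are assumptions: midpoints of the ranges quoted in the tier table.
CARDS = [
    ("RTX 3060 12 GB", 12, 360, 290),
    ("RTX 4060 Ti 16 GB", 16, 288, 450),
    ("RTX 3090", 24, 936, 800),
    ("RTX 4090", 24, 1008, 1600),
]


def value_metrics(card):
    """Compute per-dollar ratios for one card tuple."""
    name, vram_gb, bandwidth_gb_s, price_usd = card
    return {
        "name": name,
        "vram_gb_per_100usd": 100.0 * vram_gb / price_usd,
        "bandwidth_gb_s_per_usd": bandwidth_gb_s / price_usd,
    }


def rank_by(metric: str):
    """Cards sorted best-first by the chosen metric."""
    rows = [value_metrics(card) for card in CARDS]
    return sorted(rows, key=lambda row: row[metric], reverse=True)


if __name__ == "__main__":
    for metric in ("vram_gb_per_100usd", "bandwidth_gb_s_per_usd"):
        print(metric)
        for row in rank_by(metric):
            print(f"  {row['name']}: {row[metric]:.2f}")
```

With these assumed prices the 3060 leads on both ratios, with the 3090 close behind on bandwidth per dollar. Neither ratio captures whether your target model fits at all, which remains the first filter.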
Bottom line: the smallest card that fits your target model
- Just trying things out, 7B is fine — any 8 GB current card.
- Coding help and chat, 13B is the target — 12 GB RTX 3060.
- Reasoning and long-context, 32B class — 24 GB RTX 3090 or 4090.
- 70B fully local at usable speed — 48 GB workstation card or dual 24 GB consumer cards.
- Frontier-scale models — rent cloud time, or wait for the next consumer VRAM bump.
Pick the model first, find the smallest card that fits it at q4_K_M with room for context, and only then optimize for bandwidth and price.
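That selection rule can be sketched as a small lookup. The tier list and the `smallest_tier` helper are illustrative assumptions; the weights and KV-cache figures in the usage example come from the VRAM table earlier in this article.

```python
from typing import Optional

# Common VRAM capacities in GB, smallest first (an assumed tier list).
VRAM_TIERS_GB = [8, 12, 16, 24, 48, 80]


def smallest_tier(weights_gb: float, headroom_gb: float) -> Optional[int]:
    """Smallest single-card tier that holds weights plus headroom, else None."""
    needed = weights_gb + headroom_gb
    for tier in VRAM_TIERS_GB:
        if tier >= needed:
            return tier
    return None  # multi-GPU or offload territory


if __name__ == "__main__":
    # (label, weights at q4_K_M in GB, 4k KV cache in GB) from the VRAM table.
    models = [("7B", 4.3, 0.5), ("13B", 7.5, 0.8), ("32B", 19.0, 1.5),
              ("70B", 40.0, 3.0), ("405B", 225.0, 6.0)]
    for label, weights, kv in models:
        print(label, "->", smallest_tier(weights, kv))
```

Raise `headroom_gb` if you plan on long contexts; the tier can jump by one step once the KV cache grows past a few GB.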
Common pitfalls when sizing a GPU for local LLMs
Three mistakes show up over and over in community threads.
The first is buying for parameter count, not for fit. "I want to run a 70B model" is a great target — but if your budget puts a 12 GB card in your hands, you will not run a 70B model well, and chasing it via offload means you will use the model less. Pick the model size that fits in your card at q4_K_M with ~3 GB of headroom for KV cache, and you'll actually use what you bought.
The second is ignoring memory bandwidth on otherwise-similar cards. A 16 GB card sounds strictly better than a 12 GB card until you notice that some 16 GB SKUs use a narrower memory bus and end up slower than the 12 GB card they were supposed to replace. The RTX 4060 Ti 16 GB is the classic example: it sits between the 3060 12 GB and the 3090 24 GB on capacity, but its 288 GB/s trails even the 3060's 360 GB/s and is less than a third of the 3090's 936 GB/s. Read the spec sheet, not the headline GB number.
The third is not budgeting for the rest of the system. A 4090 in a box with 16 GB of DDR4-2400 and an older 6-core CPU will idle fine but bottleneck on prefill for long prompts, because tokenization and any offloaded layers want CPU and RAM bandwidth. Match the host to the card: an 8-core CPU and 32 GB of DDR4-3200 (or DDR5) is the practical floor for a 12 GB card; 64 GB and a current 8–12-core chip is the right pairing for 24 GB and above.
Real-world numbers: what your stack will actually look like
A few representative builds we've documented for local LLM work:
| Build | Card | Cost (used) | Best for | Approx tok/s (13B q4 unless noted) |
|---|---|---|---|---|
| Entry hobby | RTX 3060 12 GB | $280 | learning, 7B–13B chat | 30–45 |
| Serious hobby | RTX 3090 | $800 | 32B work | 70–90 (on 13B) |
| Power user | RTX 4090 | $1700 | 32B + DirectStorage gaming | 100+ (on 13B) |
| Workstation | RTX A6000 48 GB | $4000+ | 70B on a single card | under 20 (on 70B; 768 GB/s over ~40 GB of weights) |
For most home users, the entry-hobby build is what you actually need. A serious-hobby build is the right step up when 13B is no longer enough and you have a clear 32B use case. The power-user and workstation tiers buy you headroom for the future, not necessarily a transformative experience today.
When NOT to invest in local LLM hardware
Three honest cases where renting time on a hosted endpoint beats building a local rig in 2026: (1) your workload is occasional — a few hundred queries per month is cheaper on a hosted API than on hardware that draws power 24/7, (2) you specifically need a frontier model and no smaller open-weight model will do the job, (3) you're not going to actually use the hardware enough to amortize it — if the box sits idle, the cloud's pay-per-token model wins.
If none of those apply, the local rig pays for itself in months and pays you back with privacy, no rate limits, and the slow accumulation of a workflow that is genuinely yours.
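Whether a rig pays for itself is simple arithmetic once you plug in your own numbers. The sketch below is a minimal break-even estimate; every input in the usage example (hosted spend, hours of use, electricity price, card power draw) is a placeholder assumption, not a measured figure.

```python
from typing import Optional


def monthly_power_cost(watts: float, hours_per_day: float,
                       usd_per_kwh: float, days: int = 30) -> float:
    """Electricity cost of running the card at the given draw."""
    return watts / 1000.0 * hours_per_day * days * usd_per_kwh


def break_even_months(hardware_usd: float, hosted_usd_per_month: float,
                      power_usd_per_month: float) -> Optional[float]:
    """Months until hardware cost is recovered, or None if it never is."""
    saving = hosted_usd_per_month - power_usd_per_month
    if saving <= 0:
        return None
    return hardware_usd / saving


if __name__ == "__main__":
    # Placeholder inputs: $280 card, 170 W under load, 2 h/day, $0.15/kWh,
    # replacing a hypothetical $30/month of hosted usage.
    power = monthly_power_cost(170, 2, 0.15)
    months = break_even_months(280, 30.0, power)
    print(f"power: ${power:.2f}/month, break-even: {months:.1f} months")
```

If the hosted bill is smaller than the electricity cost alone, the function returns `None`, which corresponds to the occasional-use case where renting wins.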
Related guides
- Best Local LLM You Can Run on 12GB of VRAM in 2026
- Ollama vs vLLM for Single-User Local Chat on an RTX 3060 12GB (2026)
- ComfyUI on an RTX 3060 12GB: Local Image Generation Setup
- Is the RTX 3060 12GB Still Worth Buying in 2026?
