If you want to run a specific local LLM in 2026, match the model to the hardware: a 12GB GPU like the RTX 3060 hosts 7B-13B models at q4 with full GPU offload, a 16-24GB card unlocks the 20B-32B class, and 70B-class work needs roughly 48GB of VRAM or aggressive offload. A mid-range Ryzen 7 5800X is enough CPU once the model fits on the GPU; bandwidth, not core count, sets your tokens per second.
Pick the model first, then the build
Most local-LLM build advice still starts with the GPU. That is backwards. The hardware that matters depends entirely on which model you actually plan to run day to day — a coding assistant in the 7B-13B range demands very different silicon from a 70B reasoning model, and overspending on a card whose VRAM you will never use is the most common mistake we see in r/LocalLLaMA build threads.
Start by being honest about the workload. If you primarily want a coding pair, a chat assistant for daily prompts, summarization, or document Q&A, the modern 7B and 13B instruction-tuned models (Llama 3.1 8B, Qwen 2.5 7B/14B, Mistral 7B variants) are the realistic target. They fit comfortably in 12GB at q4_K_M, run at interactive speeds, and produce output quality that closely tracks much larger models on routine tasks. If you instead need long-context reasoning, complex code synthesis, or multi-turn agentic chains, you are looking at the 20B-32B class and a 16-24GB card. And if you are chasing 70B-class quality at home, your build is fundamentally different — closer to a workstation than a gaming PC.
Once the model class is fixed, the build collapses to three decisions: how much VRAM, how much memory bandwidth, and how cleanly the rest of the platform feeds the GPU. That order matters. We will work through each decision with real numbers and end with a perf-per-dollar comparison covering the most common 12GB-class build.
Key takeaways
- VRAM is the gate. Pick a card that holds the model weights plus the KV cache for your target context length, with headroom for activations.
- Quant tier decides fit. q4_K_M is the sweet spot for most 7B-13B chat models on 12GB; q5 and q8 are nice-to-have, not required for usable quality.
- Bandwidth decides speed. Token generation re-reads weights every step, so GB/s memory bandwidth on the card sets your ceiling on tokens-per-second.
- The CPU only matters when you spill to system RAM. Once the model fits entirely on the GPU, a Ryzen 7 5800X-class chip is plenty.
- Context length is a hidden VRAM tax. The KV cache grows linearly with context, so a 32K context needs roughly 8x the KV cache of a 4K one; on a 13B model that is several gigabytes.
- Buy used or previous-gen for value. The RTX 3060 12GB at street pricing remains the best dollars-per-usable-VRAM card for the 7B-13B class as of 2026.
How much VRAM does each model size actually need at q4_K_M?
This is the table that decides your entire build. The numbers below are for weights at q4_K_M plus a typical 4K-context KV cache; the KV cache grows linearly with context length, so budget roughly 8x the KV column for a 32K context.
| Model size | Weights at q4_K_M | KV cache @ 4K | Total VRAM | Fits on |
|---|---|---|---|---|
| 7B | ~4.0 GB | ~0.5 GB | ~4.5 GB | 6GB+ |
| 8B (Llama 3.1) | ~4.5 GB | ~0.6 GB | ~5.1 GB | 6GB+ |
| 13B | ~7.5 GB | ~1.0 GB | ~8.5 GB | 12GB comfortably |
| 14B (Qwen 2.5) | ~8.0 GB | ~1.1 GB | ~9.1 GB | 12GB comfortably |
| 20B class | ~11.5 GB | ~1.5 GB | ~13.0 GB | 16GB+ |
| 32B | ~18.5 GB | ~2.5 GB | ~21.0 GB | 24GB |
| 70B | ~40.0 GB | ~6.0 GB | ~46.0 GB | 48GB+ or split |
The implication: a single 12GB card covers everything up to mid-teen-billion parameters at q4_K_M, but no 12GB GPU will host a 32B model without aggressive quant cuts or system-RAM offload that destroys throughput.
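The fit check behind that table can be sketched in a few lines of Python. This is a rough estimator, not a measurement tool: the figures are copied from the table above, the KV cache is scaled linearly from the 4K column, and the 1 GB `headroom_gb` default for activations is an assumption chosen for illustration.

```python
# Figures (GB) copied from the q4_K_M table above: (weights, KV cache at 4K context).
MODELS_GB = {
    "7B": (4.0, 0.5),
    "8B": (4.5, 0.6),
    "13B": (7.5, 1.0),
    "14B": (8.0, 1.1),
    "20B": (11.5, 1.5),
    "32B": (18.5, 2.5),
    "70B": (40.0, 6.0),
}


def total_vram_gb(model: str, context_tokens: int = 4096) -> float:
    """Weights plus a KV cache scaled linearly from the 4K-context figure."""
    weights, kv_at_4k = MODELS_GB[model]
    return weights + kv_at_4k * (context_tokens / 4096)


def fits(model: str, card_vram_gb: float, context_tokens: int = 4096,
         headroom_gb: float = 1.0) -> bool:
    """True if the model fits with headroom_gb spare (assumed activation budget)."""
    return total_vram_gb(model, context_tokens) + headroom_gb <= card_vram_gb


if __name__ == "__main__":
    for name in MODELS_GB:
        print(f"{name:>4}: {total_vram_gb(name):5.1f} GB at 4K, "
              f"fits 12 GB: {fits(name, 12.0)}")
```

Changing `context_tokens` is the quickest way to see how a model that fits at 4K stops fitting at longer contexts.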
Can an RTX 3060 12GB run a 13B model without offload?
Yes, and comfortably. According to the RTX 3060 12GB specifications on TechPowerUp, the card carries 12GB of GDDR6 on a 192-bit bus delivering 360 GB/s of memory bandwidth. The 13B q4_K_M footprint above is ~8.5GB, leaving roughly 3.5GB for activations and a longer context window. The community measurements posted in the llama.cpp discussions consistently report 12-18 tokens per second for 13B q4_K_M generation on this card, which is more than fast enough for interactive chat.
The catch is context. A 4K context fits trivially; an 8K context stays comfortable; a 32K context inflates the KV cache to roughly 6-8GB, which on top of ~7.5GB of weights exceeds the card's 12GB. If you need very long contexts on a 12GB card, drop the model to 7B-8B class or accept system-RAM spillover.
Why does memory bandwidth matter more than core count for token generation?
Autoregressive generation works one token at a time, and every token requires re-reading the entire model's weights to compute the next-token probabilities. That makes generation a memory-bandwidth-bound problem, not a compute-bound one — the GPU's matrix-multiply units sit partly idle waiting on memory, and you cannot get more tokens per second than your bandwidth allows.
A useful first-order model: peak token rate is approximately bandwidth divided by quantized model size. A 13B model at q4_K_M is ~7.5GB; an RTX 3060 12GB at 360 GB/s gives a ceiling of ~48 tokens/sec, and real-world measurements land around one third of that after sampling, attention overhead, and kernel inefficiencies, which is consistent with the 12-18 tok/s the community reports. The same logic explains why an Apple M2 Max with 400 GB/s of unified memory bandwidth runs the same models at roughly comparable speeds despite a very different compute architecture.
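That first-order model is simple enough to write down directly. The sketch below assumes the one-third real-world efficiency quoted above; the `efficiency` parameter is a rule of thumb, not a measured constant, and will vary by runtime and driver.

```python
def ceiling_tok_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound: every generated token re-reads all the weights once."""
    return bandwidth_gb_s / model_size_gb


def expected_tok_per_s(bandwidth_gb_s: float, model_size_gb: float,
                       efficiency: float = 1 / 3) -> float:
    """Ceiling scaled by an assumed real-world efficiency factor."""
    return ceiling_tok_per_s(bandwidth_gb_s, model_size_gb) * efficiency


if __name__ == "__main__":
    size_13b_q4 = 7.5  # GB, from the VRAM table
    for name, bw in [("RTX 3060 12GB", 360.0), ("Apple M2 Max", 400.0)]:
        print(f"{name}: ceiling {ceiling_tok_per_s(bw, size_13b_q4):.0f} tok/s, "
              f"expected ~{expected_tok_per_s(bw, size_13b_q4):.0f} tok/s")
```

Plugging in another card's bandwidth gives a quick sanity check before trusting any benchmark claim that exceeds the ceiling.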
Core count, by contrast, matters mostly for prompt processing (prefill), where the model crunches the input prompt in parallel. A 3060's 3,584 CUDA cores prefill long prompts faster than an integrated GPU, but for a 200-token follow-up question the prefill takes a small fraction of total response time and bandwidth dominates.
Does the CPU matter when the model fits entirely in VRAM?
Once weights and KV cache live entirely on the GPU, the CPU does relatively little: tokenize the input, sample logits, and feed the GPU. A mid-range chip like the AMD Ryzen 7 5800X is comfortably enough — its 8 cores and 32MB L3 keep up with the GPU's appetite, and a faster CPU would yield single-digit-percent gains at best. We routinely run interactive chat against a 3060 12GB paired with a 5800X and never see the CPU as the bottleneck.
Where the CPU matters is the moment you spill into system RAM, which is the next question.
How does CPU+iGPU offload (Ryzen 5 5600G) compare to a dedicated 3060 12GB?
The Ryzen 5 5600G is tempting for budget LLM builds because it has integrated graphics, so no discrete GPU is required. In practice, the 5600G's Vega 7 iGPU shares system memory at roughly 50 GB/s (DDR4-3200 dual-channel), which is about one-seventh of the RTX 3060 12GB's 360 GB/s. Even with all 6 cores and 12 threads running llama.cpp's CPU path, a 13B q4_K_M model averages 3-5 tokens per second on a 5600G, which is usable but not interactive. The same model on the 3060 12GB runs 12-18 tok/s.
The 5600G remains a fine processor for the system; just do not expect to skip the discrete card.
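The ~50 GB/s figure follows from the standard peak-bandwidth arithmetic for DDR memory: transfer rate times 8 bytes per 64-bit channel times the number of channels. A small sketch, where the function name is just an illustrative label:

```python
def ddr_bandwidth_gb_s(mega_transfers_per_s: int, channels: int = 2,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth for DDR memory with 64-bit (8-byte) channels."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


if __name__ == "__main__":
    igpu_bw = ddr_bandwidth_gb_s(3200)  # DDR4-3200, dual-channel
    rtx_3060_bw = 360.0                 # GB/s, from the TechPowerUp spec cited above
    print(f"DDR4-3200 dual-channel: {igpu_bw:.1f} GB/s")
    print(f"RTX 3060 12GB advantage: {rtx_3060_bw / igpu_bw:.1f}x")
```

This is a theoretical peak; sustained bandwidth on a real system is lower, which only widens the gap.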
What changes when you spill to system RAM?
The instant a model's working set exceeds your GPU's VRAM, llama.cpp and similar runtimes split layers across the GPU and CPU. Generation now traverses PCIe and DDR memory at every token, and your tokens-per-second can collapse by roughly 5-10x. On the 3060 12GB, a 13B q4 model runs 12-18 tok/s entirely on the GPU; the penalty grows with the share of layers left in system RAM, and a heavily offloaded run of the same model drops below 5 tok/s.
The practical rule: pick a quant level that keeps you fully on the GPU. Trading q4 for q8 to gain a tiny quality bump is rarely worth losing 70% of your throughput.
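To see why partial offload hurts so much, here is a deliberately simplified bandwidth-only model: per-token time is the bytes read on each side divided by that side's bandwidth. It ignores PCIe transfers, synchronization, and CPU compute limits, so it is an optimistic upper bound, and the 360 GB/s and 50 GB/s defaults are the figures used elsewhere in this guide.

```python
def split_ceiling_tok_per_s(model_gb: float, gpu_fraction: float,
                            gpu_bw_gb_s: float = 360.0,
                            cpu_bw_gb_s: float = 50.0) -> float:
    """Bandwidth-only ceiling when a fraction of the weights lives in VRAM.

    Assumes layers run one after another, so the per-token times add up.
    """
    gpu_time = model_gb * gpu_fraction / gpu_bw_gb_s
    cpu_time = model_gb * (1.0 - gpu_fraction) / cpu_bw_gb_s
    return 1.0 / (gpu_time + cpu_time)


if __name__ == "__main__":
    for fraction in (1.0, 0.9, 0.75, 0.5, 0.0):
        rate = split_ceiling_tok_per_s(7.5, fraction)
        print(f"{fraction:4.0%} of weights on GPU: ceiling {rate:5.1f} tok/s")
```

Because the slow side dominates the sum, even a modest share of weights in system RAM pulls the ceiling down sharply, and real throughput sits well below these ceilings.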
Spec delta: RTX 3060 12GB vs higher-VRAM options
| Card | VRAM | Bandwidth | TDP | Typical street | Fits up to (q4_K_M) |
|---|---|---|---|---|---|
| RTX 3060 12GB | 12 GB GDDR6 | 360 GB/s | 170 W | $260-330 | 13B-14B class |
| RTX 4060 Ti 16GB | 16 GB GDDR6 | 288 GB/s | 165 W | $420-470 | 20B class |
| RTX 4070 12GB | 12 GB GDDR6X | 504 GB/s | 200 W | $500-560 | 13B-14B, faster |
| RTX 4080 16GB | 16 GB GDDR6X | 717 GB/s | 320 W | $1000+ | 20B class, fast |
| RTX 3090 24GB | 24 GB GDDR6X | 936 GB/s | 350 W | $700-900 used | 32B class |
| RTX 4090 24GB | 24 GB GDDR6X | 1008 GB/s | 450 W | $1700+ | 32B class, fast |
The pattern: the 3060 12GB stays the value champion for 7B-13B work because its bandwidth-per-dollar is excellent and the next obvious step up (4060 Ti 16GB) costs roughly 50% more while offering lower memory bandwidth (288 GB/s vs 360 GB/s), so it buys VRAM rather than generation speed in the same model class. If you need 32B, jump straight to a used 3090 24GB.
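The bandwidth-per-dollar claim can be checked against the table with a short script. The sketch below uses the midpoint of each street-price range; for cards listed as "$1000+" or "$1700+" it assumes the floor price, which flatters those cards slightly.

```python
# (bandwidth GB/s, low street price USD, high street price USD) from the table above.
# Open-ended prices ("$1000+", "$1700+") use the floor for both bounds (assumption).
CARDS = {
    "RTX 3060 12GB": (360, 260, 330),
    "RTX 4060 Ti 16GB": (288, 420, 470),
    "RTX 4070 12GB": (504, 500, 560),
    "RTX 4080 16GB": (717, 1000, 1000),
    "RTX 3090 24GB": (936, 700, 900),
    "RTX 4090 24GB": (1008, 1700, 1700),
}


def bandwidth_per_dollar(card: str) -> float:
    """GB/s of memory bandwidth per dollar at the midpoint street price."""
    bandwidth, low, high = CARDS[card]
    return bandwidth / ((low + high) / 2)


ranked = sorted(CARDS, key=bandwidth_per_dollar, reverse=True)

if __name__ == "__main__":
    for card in ranked:
        print(f"{card:<18} {bandwidth_per_dollar(card):.2f} GB/s per dollar")
```

On these numbers the 3060 12GB and the used 3090 24GB lead the ranking, which matches the two recommendations in the paragraph above.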
Quantization matrix: q2 through fp16
The table below reflects measurements community contributors have posted across the llama.cpp discussions for a 3060 12GB; treat them as ballpark.
| Quant | 8B VRAM | 8B tok/s | 8B quality | 32B VRAM | 32B fits 12GB? |
|---|---|---|---|---|---|
| q2_K | ~3.0 GB | ~25 | poor | ~12.5 GB | no (just over 12 GB; needs partial offload, poor quality) |
| q3_K_M | ~3.5 GB | ~22 | usable | ~14.5 GB | no |
| q4_K_M | ~4.5 GB | ~20 | strong | ~18.5 GB | no |
| q5_K_M | ~5.5 GB | ~18 | strong+ | ~22.0 GB | no |
| q6_K | ~6.5 GB | ~16 | near-fp16 | ~25.5 GB | no |
| q8_0 | ~8.5 GB | ~14 | near-fp16 | ~32.0 GB | no |
| fp16 | ~16 GB | n/a OOM | reference | ~64 GB | no |
The takeaway: on a 12GB card you live in the q4_K_M-q5_K_M band for 7B-13B work. Going below q4 sacrifices output quality faster than it saves VRAM; going above q5 buys little measurable benefit and risks pushing context length out of memory.
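One way to turn the matrix into a decision is to pick the highest-precision quant whose weights still leave room for the KV cache and activations. The sketch below uses the 8B column from the table; `reserve_gb` is a hypothetical parameter standing in for whatever context budget you need.

```python
# 8B weight footprints (GB) from the table above, highest precision first.
QUANT_VRAM_8B = [
    ("fp16", 16.0),
    ("q8_0", 8.5),
    ("q6_K", 6.5),
    ("q5_K_M", 5.5),
    ("q4_K_M", 4.5),
    ("q3_K_M", 3.5),
    ("q2_K", 3.0),
]


def best_quant(card_vram_gb: float, reserve_gb: float):
    """Highest-precision quant that fits after reserving VRAM for KV cache etc.

    Returns None when even q2_K does not fit in the remaining budget.
    """
    budget = card_vram_gb - reserve_gb
    for name, vram_gb in QUANT_VRAM_8B:
        if vram_gb <= budget:
            return name
    return None


if __name__ == "__main__":
    for reserve in (2.0, 6.0, 8.0):
        print(f"12 GB card, {reserve:.0f} GB reserved: {best_quant(12.0, reserve)}")
```

A fit-only rule like this will happily choose q8_0 when memory allows; the takeaway above is that staying in the q4_K_M-q5_K_M band and spending the spare VRAM on context is usually the better trade.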
Prefill vs generation: where does each pairing stall?
Prefill is parallel matrix-multiplies over the input prompt; generation is sequential, one token at a time. The 3060 12GB has reasonable compute (12.7 TFLOPS FP16 per NVIDIA's RTX 3060 product page) and modest bandwidth, so it generally stalls on bandwidth in generation. An RTX 4070 12GB with 504 GB/s bandwidth generates 30-40% faster than a 3060 on the same 13B q4 model even though both cards carry 12GB of VRAM; the gain tracks the 40% bandwidth advantage.
If your workload is heavy prefill (RAG with long retrieved contexts, large code-completion windows), prioritize a card with more FLOPS even if VRAM is similar. If it is mostly conversational generation, prioritize bandwidth.
Context-length impact on a 12GB card
A 13B q4_K_M model occupies ~7.5 GB of weights. The KV cache scales linearly with context length and model dimensions. Empirically, a 13B model in llama.cpp adds roughly 150-200 MB of KV cache per 1K of context. A 4K context adds ~750 MB; a 16K context adds ~3 GB; a 32K context adds ~6 GB. On the 12GB card, that 32K context pushes total occupancy past 13.5 GB and you will spill. Keep contexts under 8-16K on a 12GB card, or drop to a 7B model when long contexts matter.
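The context budget above reduces to simple arithmetic. The sketch below takes the empirical 150-200 MB per 1K tokens quoted in this section as given (it is a community ballpark, not a derived figure) and assumes 1 GB of headroom for activations.

```python
WEIGHTS_GB = 7.5                  # 13B at q4_K_M, from this section
KV_MB_PER_1K = (150.0, 200.0)     # empirical range quoted above (low, high)


def kv_cache_gb(context_k: float):
    """(low, high) KV cache estimate in GB for a context of context_k thousand tokens."""
    low, high = KV_MB_PER_1K
    return context_k * low / 1000, context_k * high / 1000


def max_context_k(card_gb: float = 12.0, headroom_gb: float = 1.0) -> float:
    """Largest context (thousands of tokens) that fits at the pessimistic rate."""
    free_gb = card_gb - WEIGHTS_GB - headroom_gb
    return free_gb * 1000 / KV_MB_PER_1K[1]


if __name__ == "__main__":
    for ctx in (4, 8, 16, 32):
        low, high = kv_cache_gb(ctx)
        print(f"{ctx:>2}K context: {low:.1f}-{high:.1f} GB KV cache, "
              f"total {WEIGHTS_GB + low:.1f}-{WEIGHTS_GB + high:.1f} GB")
    print(f"Max context on 12 GB: ~{max_context_k():.1f}K tokens")
```

With these inputs the safe limit on a 12GB card lands in the mid-teens of thousands of tokens, in line with the 8-16K guidance.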
Perf-per-dollar and perf-per-watt math
Using the table above, the most relevant pairing is the RTX 3060 12GB with a Ryzen 7 5800X (or Ryzen 5 5600G if you skip the GPU). On a 13B q4_K_M chat workload:
- 3060 12GB + 5800X: ~15 tok/s at ~280 W system power. Build cost (DIY) ~$650. ~0.023 tok/s per dollar, ~0.054 tok/s per watt.
- 4070 12GB + 5800X: ~22 tok/s at ~330 W. Build cost ~$900. ~0.024 tok/s per dollar, ~0.067 tok/s per watt.
- 4060 Ti 16GB + 5800X: ~14 tok/s at ~265 W (similar to 3060 due to bandwidth ceiling), VRAM jumps to 16GB. Build cost ~$800. ~0.018 tok/s per dollar, ~0.053 tok/s per watt.
- 3090 24GB + 5800X: ~22 tok/s at ~480 W on 13B; ~9 tok/s on 32B q4. Build cost (used 3090) ~$1100. ~0.020 tok/s per dollar, ~0.046 tok/s per watt on 13B.
- 5600G iGPU only: ~4 tok/s at ~95 W. Build cost ~$450. ~0.009 tok/s per dollar, ~0.042 tok/s per watt.
For the lowest-cost route to interactive 7B-13B-class local chat, the 3060 12GB still wins: it matches the 4070 build on tok/s per dollar within rounding while costing about $250 less up front. The 3090 wins if you must run 32B-class models. The 4070 wins on perf-per-watt and raw speed, but at substantially higher upfront cost.
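The ratios in this comparison are plain divisions, so they are easy to recompute if your local prices or power draw differ. The sketch below uses the throughput, system power, and build cost figures listed above for the 13B q4_K_M workload.

```python
# (tok/s on 13B q4_K_M, system watts, build cost USD) from the list above.
BUILDS = {
    "3060 12GB + 5800X": (15, 280, 650),
    "4070 12GB + 5800X": (22, 330, 900),
    "4060 Ti 16GB + 5800X": (14, 265, 800),
    "3090 24GB + 5800X": (22, 480, 1100),
    "5600G iGPU only": (4, 95, 450),
}


def efficiency(build: str):
    """Return (tok/s per dollar, tok/s per watt) for a named build."""
    tok_per_s, watts, cost = BUILDS[build]
    return tok_per_s / cost, tok_per_s / watts


if __name__ == "__main__":
    for name in BUILDS:
        per_dollar, per_watt = efficiency(name)
        print(f"{name:<22} {per_dollar:.3f} tok/s per $   {per_watt:.3f} tok/s per W")
```

Swap in your own quotes for `cost` to see how sensitive the ranking is to used-market pricing.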
Common pitfalls
- Buying a 16GB card you do not need. A 4060 Ti 16GB tempts buyers into "future-proof" 20B work that they rarely actually run. The same money on a 3060 12GB plus a faster CPU often delivers a better daily experience.
- Underspecifying context length. People test at 4K context, fit the model perfectly, then load up RAG pipelines that send 16K-32K contexts and wonder why throughput craters. Plan for the longest context you actually use.
- Mixing fp16 weights with a 12GB card. Loading a model in fp16 on a 12GB card almost always OOMs above 7B. Use a quantized GGUF or AWQ build instead.
- Ignoring CPU offload thresholds. Tools default to a layer-offload count that puts a few layers on the CPU; force `--n-gpu-layers 999` to keep everything on the GPU when it fits.
- Forgetting power headroom. A 3060 12GB pulls 170W TGP and a 5800X up to 142W package power. Pair them with at least a 650W 80+ Bronze PSU.
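The offload and power pitfalls in this list can both be checked before launching anything. The sketch below builds a `llama-cli` argument list using llama.cpp's `--model`, `--n-gpu-layers`, and `--ctx-size` flags, and estimates PSU load. The model filename is a hypothetical placeholder, and the 75 W allowance for the rest of the system is an assumption, not a figure from this guide.

```python
def llama_cli_args(model_path: str, ctx_size: int = 4096,
                   gpu_layers: int = 999) -> list:
    """Argument list that asks llama.cpp to keep every layer on the GPU."""
    return [
        "llama-cli",
        "--model", model_path,
        "--n-gpu-layers", str(gpu_layers),  # a large value offloads all layers
        "--ctx-size", str(ctx_size),
    ]


def psu_load_fraction(psu_watts: float, gpu_watts: float = 170.0,
                      cpu_watts: float = 142.0, other_watts: float = 75.0) -> float:
    """Peak component draw as a fraction of PSU capacity (other_watts is assumed)."""
    return (gpu_watts + cpu_watts + other_watts) / psu_watts


if __name__ == "__main__":
    # Hypothetical model file name, used only for illustration.
    print(" ".join(llama_cli_args("models/example-13b.q4_K_M.gguf", ctx_size=8192)))
    print(f"650 W PSU peak load: {psu_load_fraction(650):.0%}")
```

Passing the list to `subprocess.run` would launch the runtime; it is left out here so the sketch runs without llama.cpp installed.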
When NOT to build local
If your workload is fully covered by hosted APIs and you do not care about privacy, latency, or cost predictability for moderate volume, hosted models will outperform any 12GB-class local build on quality. Local makes sense when the workload runs continuously (cost), requires offline operation (latency, sovereignty), or sends data that cannot leave your machine (privacy). Otherwise the build money is better spent on API credits.
Related guides
- RTX 3060 12GB benchmarks
- Ryzen 7 5800X benchmarks
- Best GPUs for local LLM in 2026
- Quantization on a 12GB GPU
Bottom line
For 7B-13B-class local LLM work in 2026, the ZOTAC Gaming GeForce RTX 3060 Twin Edge 12GB or MSI GeForce RTX 3060 Ventus 2X 12G paired with an AMD Ryzen 7 5800X and an NVMe drive like the WD Blue SN550 1TB for model storage remains the price-performance sweet spot. Step up to a used 3090 24GB only when 32B-class models or long contexts actually matter to you. If a 4070 12GB or better is already on your shortlist for gaming, skip the separate LLM build: that card runs the same 7B-13B models faster, so the LLM premium over a gaming-class card is zero. The AMD Ryzen 5 5600G is a good processor for the rest of the system if you ever need an iGPU fallback, but it is not a substitute for the discrete card.
Citations and sources
- TechPowerUp — GeForce RTX 3060 12GB specifications
- NVIDIA — GeForce RTX 3060 / 3060 Ti product page
- llama.cpp discussions on GitHub
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
