Match the VRAM, not the parameter count. To run a specific large language model on consumer hardware in 2026, the rule of thumb is: total model footprint at your chosen quantization, plus 1–3 GB for context cache and overhead, has to fit inside your GPU's VRAM. For most readers that means a 12GB RTX 3060 is the floor for 12B-class models at q4_K_M, an 8GB card limits you to 7B–8B models, and CPU-only on a Ryzen 5 5600G only makes sense for 1B–3B models or very patient workflows.
Key takeaways
- VRAM, not GPU FLOPS, is the gating factor for which model you can run.
- A 12GB card hits the sweet spot for Llama 3.x 8B, Qwen 3 14B, and DeepSeek 12B-class derivatives at q4_K_M.
- Quantization (q4_K_M, q5_K_M, etc.) is the single biggest lever between "fits" and "does not fit."
- K/V cache cost scales with context length and is often the reason a 12GB card OOMs on a model that should fit.
- CPU-only inference on a Ryzen 5 5600G is usable for 7B–8B models at low context but runs 6–8× slower than a GPU.
The VRAM-first mental model
llama.cpp and Ollama load three things into VRAM: model weights, key/value cache for the context window, and a small overhead for kernels and intermediate buffers. The biggest of the three is weights. The second-biggest, at long context, is the K/V cache.
For a 12B model at q4_K_M, weights take about 7.5 GB. K/V cache grows linearly with context: around 1.2 GB at 4k, 2.4 GB at 8k, and 4.8 GB at 16k. Add 500 MB of overhead and you can see why a 12GB card runs out of room around 12k context on a 12B q4_K_M model: 7.5 + 3.6 + 0.5 = 11.6 GB, leaving almost nothing to spare.
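
The linear growth follows from how the cache is laid out: one key vector and one value vector per layer, per KV head, per token. Below is a minimal sketch of that arithmetic, assuming a 16-bit cache (2 bytes per element). The shape values are hypothetical placeholders, not taken from any specific model; they were chosen only so the result lands near the roughly 1.2 GB at 4k used in the trade-off table later in this piece. Read the real values from your model's config.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold keys and values for n_tokens of context."""
    # Factor of 2: one K tensor and one V tensor per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens


# Hypothetical 12B-class shape (an assumption for illustration only).
HYPOTHETICAL_SHAPE = {"n_layers": 36, "n_kv_heads": 16, "head_dim": 128}


def kv_cache_gb(n_tokens: int) -> float:
    """K/V cache size in GB (10^9 bytes) for the hypothetical shape."""
    return kv_cache_bytes(n_tokens=n_tokens, **HYPOTHETICAL_SHAPE) / 1e9


for ctx in (4096, 8192, 16384):
    print(f"{ctx:>6} tokens: {kv_cache_gb(ctx):.2f} GB")
```

Models that use grouped-query attention have fewer KV heads and a correspondingly smaller cache, so the per-token cost can differ a lot between two models of similar parameter count.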
The RTX 3060 12GB has 12 GB of GDDR6 per TechPowerUp, and the practical ceiling is closer to 11.5 GB once driver and Windows display reservations are accounted for. Plan accordingly.
Quantization is the lever — here is the matrix
The HuggingFace transformers quantization documentation lays out the modern quant landscape (GGUF, GPTQ, AWQ, bitsandbytes). For llama.cpp / Ollama, you are picking from the GGUF family.
| Quant | Bits/weight | Quality loss | Typical use |
|---|---|---|---|
| fp16 | 16 | None | Reference; only fits when VRAM in GB exceeds 2× the parameter count in billions |
| q8_0 | 8 | <1% | Highest-fidelity GGUF; needs roughly 1.7× the VRAM of q4_K_M |
| q6_K | ~6.5 | ~1% | Safe quality floor for most production work |
| q5_K_M | ~5.5 | 1–2% | Sweet spot for 8GB cards on 7B models |
| q4_K_M | ~4.8 | 2–4% | Sweet spot for 12GB cards on 12–14B models |
| q3_K_M | ~3.7 | 5–8% | Squeeze a larger model onto smaller card; accept quality drop |
| q2_K | ~2.7 | 10–20% | Last resort; degradation is visible |
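The "K" in a quant name denotes the k-quant family, which groups weights into super-blocks with quantized per-block scales and gives better quality per bit than the older q4_0 / q4_1 formats. The "_M" suffix means "medium" within k-quants, sitting between the smaller "_S" and larger "_L" variants. Default to "_M" variants unless you are deliberately squeezing for the smallest file.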
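
As a rough check on the matrix above, weight size is approximately parameter count times bits per weight, divided by eight. The sketch below uses the approximate bits-per-weight figures from the table; the function and dictionary names are illustrative, and real GGUF files differ slightly because some tensors are stored at higher precision.

```python
# Approximate bits per weight, copied from the quant matrix above.
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "q8_0": 8.0,
    "q6_K": 6.5,
    "q5_K_M": 5.5,
    "q4_K_M": 4.8,
    "q3_K_M": 3.7,
    "q2_K": 2.7,
}


def weight_gb(params_billions: float, quant: str) -> float:
    """Estimated weight footprint in GB (10^9 bytes).

    params_billions * 1e9 weights * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    simplifies to params_billions * bits / 8.
    """
    return params_billions * BITS_PER_WEIGHT[quant] / 8


for quant in ("fp16", "q8_0", "q4_K_M"):
    print(f"14B at {quant}: about {weight_gb(14, quant):.1f} GB")
```

This covers weights only; K/V cache and runtime overhead come on top.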
Per-model hardware guide for Llama, DeepSeek, and Qwen
Mid-2026 community consensus, drawn from llama.cpp issues and HuggingFace model cards, lines up roughly as follows.
Llama family
| Model | Quant | Min VRAM | Recommended card |
|---|---|---|---|
| Llama 3.x 1B | q4_K_M | 1 GB | Any GPU; CPU-only fine |
| Llama 3.x 3B | q4_K_M | 2.5 GB | 4GB+ card |
| Llama 3.x 8B | q4_K_M | 5.5 GB | 8GB+ card |
| Llama 3.x 8B | q5_K_M | 6.5 GB | 8GB card, tight |
| Llama 3.x 8B | q8_0 | 9 GB | 12GB+ card |
| Llama 3.x 70B | q4_K_M | 42 GB | Multi-GPU or CPU offload |
DeepSeek family (V4-era derivatives)
| Model | Quant | Min VRAM | Recommended card |
|---|---|---|---|
| DeepSeek 1.5B distilled | q4_K_M | 1.2 GB | Any GPU |
| DeepSeek 7B distilled | q4_K_M | 5 GB | 8GB+ card |
| DeepSeek 12B class | q4_K_M | 7.5 GB | 12GB card (RTX 3060) |
| DeepSeek 14B class | q4_K_M | 9 GB | 12GB+ card |
| DeepSeek 32B class | q4_K_M | 19 GB | 24GB card |
| DeepSeek V4 671B-MoE | — | 400 GB+ | Datacenter / cloud only |
Qwen 3 family
| Model | Quant | Min VRAM | Recommended card |
|---|---|---|---|
| Qwen 3 0.5B | q4_K_M | 0.5 GB | Any GPU; CPU fine |
| Qwen 3 1.5B | q4_K_M | 1.2 GB | Any GPU |
| Qwen 3 4B | q4_K_M | 3 GB | 6GB+ card |
| Qwen 3 7B | q5_K_M | 6 GB | 8GB card |
| Qwen 3 14B | q4_K_M | 8.5 GB | 12GB card (RTX 3060) |
| Qwen 3 32B | q4_K_M | 19 GB | 24GB card |
| Qwen 3 72B | q4_K_M | 43 GB | Multi-GPU |
What the RTX 3060 12GB can do — the practical envelope
A 12GB card hits the sweet spot for one specific workload: a 12B–14B model at q4_K_M with 4–8k context, running at 25–35 tokens per second. That envelope covers Llama 3.x 8B, Qwen 3 14B, and any DeepSeek derivative up to ~12B–14B total parameters (VRAM is set by the total parameter count, not the active count). Per llama.cpp's repository, CUDA support has matured to the point where the 3060 12GB hits 90%+ of its theoretical memory bandwidth on autoregressive generation.
For larger models (32B class), the 12GB card forces CPU offload. With a Ryzen 7 5800X feeding it through PCIe 4.0, offload is workable but slow — expect a 3–5× generation slowdown versus the same model on a 24GB card with no offload.
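
To get a feel for how much of a 32B-class model stays on the GPU under offload, one rough approach is to divide the weights evenly across layers and see how many fit in what is left after cache and overhead. This is a sketch under stated assumptions: the 18 GB weight size and 64-layer count are hypothetical, layers are treated as equal in size, and the whole K/V cache is charged to the GPU (a conservative simplification). The result is the kind of number you would pass to llama.cpp's `--n-gpu-layers` (`-ngl`) option and then tune by trial.

```python
import math


def gpu_layers_that_fit(weights_gb: float, n_layers: int,
                        vram_budget_gb: float, kv_cache_gb: float,
                        overhead_gb: float = 0.5) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumes every layer has the same size and that the full K/V cache
    plus a fixed overhead must also live on the GPU.
    """
    per_layer_gb = weights_gb / n_layers
    room_gb = vram_budget_gb - kv_cache_gb - overhead_gb
    if room_gb <= 0:
        return 0
    return min(n_layers, math.floor(room_gb / per_layer_gb))


# Hypothetical 32B-class model on a 12GB card with ~11.5 GB usable.
layers = gpu_layers_that_fit(weights_gb=18.0, n_layers=64,
                             vram_budget_gb=11.5, kv_cache_gb=1.2)
print(f"Start with about {layers} of 64 layers on the GPU")
```

The remaining layers run on the CPU, which is where the slowdown comes from.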
CPU-only on a Ryzen 5600G — when does it make sense?
The Ryzen 5 5600G with DDR4-3200 in dual-channel hits about 45 GB/s of memory bandwidth — enough to run Llama 3.x 8B at q4_K_M at roughly 8 tokens per second. That is conversational speed for short outputs but a marathon for long context.
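
The link between bandwidth and speed can be sketched with a simple upper bound: generating one token requires reading roughly all the weights once, so tokens per second cannot exceed bandwidth divided by weight size. The 45 GB/s figure comes from the text above; the 4.8 GB weight size is an estimate from the ~4.8 bits/weight figure in the quant table; the 360 GB/s figure for the RTX 3060 12GB is an assumption taken from the card's published spec. Real throughput lands below these ceilings.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on generation speed when memory bandwidth is the limit.

    Each generated token streams approximately the full weight set once,
    so the ceiling is bandwidth / bytes-per-token. K/V cache reads and
    compute are ignored, which makes this optimistic.
    """
    return bandwidth_gb_s / weights_gb


WEIGHTS_8B_Q4_GB = 4.8  # estimate: 8B params * ~4.8 bits / 8

cpu = decode_ceiling_tokens_per_s(45.0, WEIGHTS_8B_Q4_GB)   # Ryzen 5 5600G, DDR4-3200
gpu = decode_ceiling_tokens_per_s(360.0, WEIGHTS_8B_Q4_GB)  # RTX 3060 12GB (assumed spec)

print(f"CPU ceiling: {cpu:.1f} tok/s")
print(f"GPU ceiling: {gpu:.1f} tok/s")
print(f"Bandwidth ratio: {gpu / cpu:.1f}x")
```

The ratio of the two bandwidth figures is what sets the rough size of the GPU advantage for the same model and quant.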
CPU-only on the 5600G makes sense when:
- You are running a 1B–3B model for lightweight, latency-sensitive tasks such as local autocompletion or speech wake-word handling.
- You want to test models before committing to GPU hardware.
- You are slot-constrained in a homelab where idle nodes are free but PCIe slots are scarce.
It does not make sense for daily chat or coding assist workflows. The GPU advantage is 6–8× on the RTX 3060 12GB, and that is on a card you can buy used for the price of two months of cloud API tokens.
VRAM-vs-context trade-off table
Same 12B q4_K_M model on a 12GB card; how does context length shrink your headroom?
| Context length | K/V cache | Weights | Overhead | Total used | Free (of 12 GB) |
|---|---|---|---|---|---|
| 2k | 0.6 GB | 7.5 GB | 0.5 GB | 8.6 GB | 3.4 GB |
| 4k | 1.2 GB | 7.5 GB | 0.5 GB | 9.2 GB | 2.8 GB |
| 8k | 2.4 GB | 7.5 GB | 0.5 GB | 10.4 GB | 1.6 GB |
| 12k | 3.6 GB | 7.5 GB | 0.5 GB | 11.6 GB | 0.4 GB (risky) |
| 16k | 4.8 GB | 7.5 GB | 0.5 GB | 12.8 GB | OOM |
A practical upper bound on the 3060 12GB for a 12B q4_K_M model is 10–12k context. Push for 16k and you will hit out-of-memory errors mid-prompt.
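
The table can be reproduced, and inverted to find the largest context that fits, with a few lines of arithmetic. This sketch uses the figures from this piece: 7.5 GB of weights, 0.5 GB of overhead, and about 0.3 GB of K/V cache per 1k tokens of context (the rate implied by the table rows). The constant and function names are illustrative.

```python
WEIGHTS_GB = 7.5      # 12B model at q4_K_M
OVERHEAD_GB = 0.5     # kernels and intermediate buffers
KV_GB_PER_1K = 0.3    # K/V cache per 1k tokens of context


def total_used_gb(context_k: float) -> float:
    """Total VRAM used at a given context length (in thousands of tokens)."""
    return WEIGHTS_GB + OVERHEAD_GB + KV_GB_PER_1K * context_k


def max_context_k(vram_gb: float) -> float:
    """Largest context (in thousands of tokens) that fits in vram_gb."""
    room_gb = vram_gb - WEIGHTS_GB - OVERHEAD_GB
    return max(0.0, room_gb / KV_GB_PER_1K)


for ctx in (2, 4, 8, 12, 16):
    used = total_used_gb(ctx)
    status = "OOM" if used > 12.0 else f"{12.0 - used:.1f} GB free"
    print(f"{ctx:>2}k: {used:.1f} GB used, {status}")

print(f"Nominal 12.0 GB: up to ~{max_context_k(12.0):.1f}k context")
print(f"Practical 11.5 GB: up to ~{max_context_k(11.5):.1f}k context")
```

Using the 11.5 GB practical ceiling rather than the nominal 12 GB is what pulls the safe limit down toward the 10–12k range.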
Common pitfalls when matching models to GPUs
- Counting params, not bytes. A 14B model is not 14 GB. It is 28 GB at fp16, 14 GB at q8, 8.5 GB at q4_K_M. Always quote VRAM after quantization.
- Forgetting context overhead. A model that loads fine at 2k context can OOM at 8k. Test with the prompt length you will actually use.
- Mixing quant families. GGUF q4_K_M, AWQ 4-bit, and GPTQ 4-bit are not interchangeable. Pick a runtime, pick its quant family.
- Assuming larger models always beat smaller ones. Qwen 3 14B at q4_K_M can underperform Llama 3 8B at q5_K_M on certain reasoning benches. Always test with your actual prompts.
- Buying for the model you do not yet run. A 24GB card to run a 32B model you have not validated is a $2,000 bet. Start on a 12GB card, prove the workflow, then upsize.
When NOT to buy more VRAM
If your typical task is short-prompt chat (1–4k context) and the answer is good enough on a 7B or 8B model, more VRAM is wasted spend. The RTX 3060 12GB at $200–280 used is the canonical right answer in this case. Buying 24GB to handle a 32B model that takes 3× longer to respond is not a win for interactive use; it is a win for batch jobs only.
Bottom line
The model picks the card; the card does not pick the model. For 2026 the canonical setups are: 8GB card for 7B q4_K_M, 12GB card (RTX 3060) for 12–14B q4_K_M, 24GB card for 32B q4_K_M, and multi-GPU or CPU offload for 70B+. Pair with a Ryzen 5 5600G for low-cost builds or a Ryzen 7 5800X when you want headroom for CPU-side prefill and tool-use orchestration. Across the consumer envelope, the Zotac Twin Edge RTX 3060 12GB and MSI Ventus 2X RTX 3060 12GB remain the value leaders.
Citations and sources
- HuggingFace — Transformers quantization overview — primary reference for GGUF, GPTQ, AWQ, bitsandbytes quant families.
- TechPowerUp — GeForce RTX 3060 — VRAM, memory bandwidth, architecture spec used in the cache math.
- llama.cpp GitHub repository — upstream engine; backend support matrix and quant format documentation.
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
