On a 12GB GPU like the RTX 3060 in 2026, q4_K_M is the right default for 7B-13B chat models: it fits the model with room for a usable context, runs at 18-22 tok/s on the 3060, and keeps near-fp16 output quality. q5_K_M trades 10-15% throughput for a marginal quality gain; q8 is rarely worth the doubled VRAM cost; q3 and below sacrifice quality you can feel.
Quantization in one paragraph
Quantization in llama.cpp's GGUF format stores model weights at lower bit precision (2, 3, 4, 5, 6, 8 bits) instead of fp16. Lower bits mean smaller files, smaller memory footprints, and — because token generation is bandwidth-bound — faster throughput. The cost is some lost precision, which shows up as occasional weaker responses, slightly worse code completion, or subtle drift on long reasoning chains. The whole question for a 12GB-class card is where the breakeven sits between bits and tokens-per-second, and the practical answer in 2026 is at q4_K_M for most chat work, with q5_K_M as the quality-first alternative when the model is smaller than 9B parameters.
Key takeaways
- q4_K_M is the universal default on a 12GB card. Strong quality, comfortable VRAM headroom, best speed-per-MB.
- q5_K_M is the quality alternative for 7B-8B models where you have 2-3 GB to spare.
- q8 doubles VRAM for a quality lift you usually cannot perceive in chat or code completion.
- q3 and below trade real quality for marginal VRAM savings. Use only when you must squeeze a larger model onto a card.
- Bandwidth limits speed. Smaller quants run faster on the same card because there is less data to move per token.
- KV cache is independent of weight quant. q4 weights still need full-precision KV unless you also enable KV-cache quant.
What does q4_K_M actually mean?
q4_K_M is a 4-bit quantization in llama.cpp's K-quant scheme, with a "medium" mix that keeps a handful of more important tensors (attention layers, embeddings) at 6-bit while the bulk of weights drop to 4-bit. The result is a near-uniform 4.5-5 bits per weight effective, with measurable quality preserved versus a uniform 4-bit quant. The detailed spec is documented in the llama.cpp quantize README. The "K" indicates the K-quant family, "M" the medium mix; "S" is small (more aggressive), "L" is large (more conservative).
The practical upshot: q4_K_M is the most-tested quantization in the ecosystem, with thousands of community-uploaded models on Hugging Face. Throughput and quality both land predictably.
Spec frame: the RTX 3060 12GB
TechPowerUp's RTX 3060 12GB page confirms the card's headline numbers: 12 GB GDDR6 on a 192-bit bus, 360 GB/s of memory bandwidth, 3,584 CUDA cores, and a 170 W TGP. The bandwidth is the single number that matters most for generation speed, and 360 GB/s sets the ceiling for tokens per second on this card.
How VRAM scales with quant on an 8B model
The table below uses Llama 3.1 8B as a reference, the most common 8B family in 2026. KV cache assumes 4K context.
| Quant | Weights | KV cache @ 4K | Total VRAM | tok/s (3060 12GB) | Quality |
|---|---|---|---|---|---|
| q2_K | ~2.7 GB | ~0.6 GB | ~3.3 GB | ~25 | noticeable degradation |
| q3_K_M | ~3.6 GB | ~0.6 GB | ~4.2 GB | ~22 | acceptable for non-critical chat |
| q4_K_M | ~4.6 GB | ~0.6 GB | ~5.2 GB | ~20 | strong, near-fp16 |
| q5_K_M | ~5.5 GB | ~0.6 GB | ~6.1 GB | ~18 | strong+ |
| q6_K | ~6.4 GB | ~0.6 GB | ~7.0 GB | ~16 | near-fp16 |
| q8_0 | ~8.5 GB | ~0.6 GB | ~9.1 GB | ~14 | near-fp16 (imperceptible vs q5) |
| fp16 | ~16 GB | ~1.0 GB | ~17 GB | OOM | reference |
A 12GB card holds every quant tier of an 8B model up to q8 with room for a longer context. The choice is purely about where you want to be on the speed/quality curve.
How VRAM scales with quant on a 32B model
| Quant | Weights | KV cache @ 4K | Total VRAM | Fits 12GB? |
|---|---|---|---|---|
| q2_K | ~11.5 GB | ~1.8 GB | ~13.3 GB | no, just over |
| q3_K_M | ~14.5 GB | ~1.8 GB | ~16.3 GB | no |
| q4_K_M | ~18.5 GB | ~1.8 GB | ~20.3 GB | no |
| q5_K_M | ~22.0 GB | ~1.8 GB | ~23.8 GB | no |
| q6_K | ~25.5 GB | ~1.8 GB | ~27.3 GB | no |
| q8_0 | ~32.0 GB | ~1.8 GB | ~33.8 GB | no |
A 32B model does not fit on a 12GB card in any usable form. Even q2_K, which already sacrifices noticeable quality, exceeds the available VRAM. The right answer for 32B is to drop down to a 13B model on the 3060 12GB, or step up to a 24GB card.
Is q8 worth the extra VRAM over q4?
For most chat, code completion, and summarization work, the answer is no. Independent measurements summarized in Hugging Face's quantization overview and in the llama.cpp discussions thread consistently show q4_K_M scoring within roughly 1-2 percentage points of fp16 on standard benchmarks (MMLU, HellaSwag, ARC), with most of that gap closed by q5_K_M. q8 closes the remaining gap almost completely, but the human-perceivable difference in everyday chat is small.
The cost is large. q8 roughly doubles the weights footprint versus q4_K_M, costing ~30% throughput on the bandwidth-bound 3060 and eliminating any chance of running a 13B model with reasonable context. Unless you are running formal evaluations where the last point of accuracy matters, q4_K_M or q5_K_M is the right choice.
Where the quality cliff lives: q3 and below
q2_K and q3_K family quants exist to squeeze larger models onto smaller cards. The quality cost is real and visible. On an 8B model at q2_K, code suggestions become noticeably worse, instruction-following degrades, and the model occasionally produces text that scans as fluent but ignores the prompt. q3_K_M is more usable — many people run 13B q3 on cards that cannot quite host q4 — but you should treat it as a compromise, not a default.
If a particular model only barely fits at q3, look at a smaller model at q4_K_M instead. A Llama 3.1 8B at q4_K_M will almost always outperform a Llama 3.1 13B at q2_K on the same card.
Why smaller quants run faster on the same card
Token generation is memory-bandwidth-bound. Every generated token requires reading the model's weights once. A 4-bit model has roughly half the bytes-per-weight of an 8-bit model, so the GPU spends half the time waiting on memory and produces tokens roughly twice as fast in the ideal case. Real-world ratios land lower than the theoretical because of attention overhead, sampling, and kernel inefficiencies, but the trend is consistent: lower quant equals more tokens per second on the same hardware.
This is why bandwidth and quant are the two knobs that matter most for throughput on a 12GB card. Going from q8 to q4_K_M on the 3060 12GB roughly doubles your throughput on the same model — no hardware change required.
Quality vs throughput: real-world measurements
Community-posted measurements in the llama.cpp discussions tracking perplexity (a proxy for quality) and tok/s across quants for the same base model show a consistent curve. Below is a representative summary for Llama 3.1 8B on the 3060 12GB.
| Quant | tok/s | Perplexity delta vs fp16 | Notes |
|---|---|---|---|
| fp16 | OOM | 0.0 | reference, does not fit |
| q8_0 | 14 | +0.02 | imperceptible quality change |
| q6_K | 16 | +0.05 | imperceptible |
| q5_K_M | 18 | +0.10 | strong, sometimes preferred for evals |
| q4_K_M | 20 | +0.20 | universal default |
| q3_K_M | 22 | +0.55 | acceptable for casual use |
| q2_K | 25 | +1.50 | noticeable, avoid for production |
The shape of the curve says everything: marginal returns above q5_K_M, marginal cost below q5_K_M down to q4_K_M, sharp quality cliff below q4_K_M.
Context-length impact
KV cache scales linearly with context length and roughly with hidden size. On an 8B model, a 4K context costs ~600 MB, 16K costs ~2.4 GB, 32K costs ~4.8 GB. On a 13B model the same factor is roughly 1.4x larger. A q4_K_M 13B model at 32K context occupies ~7.5 + 1.4 = ~14 GB and will not fit on a 12GB card. The fix is either smaller context, smaller model, or KV-cache quantization (q4_0 KV cache halves the KV footprint at small quality cost).
Prefill vs generation: which is your bottleneck?
Prefill (processing the prompt) is compute-bound and benefits from FLOPs. Generation (producing one token at a time) is bandwidth-bound. The 3060 12GB at 12.7 FP16 TFLOPs prefills a 200-token prompt in well under a second, even for a 13B q4 model. After that the user-perceived latency comes almost entirely from generation throughput.
If your usage is short prompts and long responses (interactive chat, code generation), you live entirely in the generation-bound regime and quant choice is the lever. If your usage is long prompts and short responses (RAG with retrieved context, summarization), prefill dominates and a slightly higher-quant model can still feel snappy because most of the wall-clock is spent on the prefill that happens once per query.
Common pitfalls
- Loading fp16 weights without realizing. Many hosted model files default to fp16. On a 12GB card, fp16 OOMs for anything above 7B. Always pick a quantized GGUF or AWQ build explicitly.
- Ignoring KV-cache quant. If you push toward long contexts on a 12GB card, q4_0 or q8_0 KV-cache quant can rescue a build that otherwise spills.
- Mistaking marketing labels. "Q4" in the file name without "K_M" or "K_S" usually means the older non-K quant. K-quants are uniformly better quality at the same bit width — prefer them.
- Comparing tok/s across cards without controlling for quant. A 3060 at q4 will beat a 4070 at q8 on the same model. Always match quant before comparing.
- Forgetting batch size. Single-stream chat does not benefit from larger batch sizes; batch-1 generation throughput is what you actually live with.
When NOT to bother quantizing
If you have a 24GB card and care about quality more than tokens per second, run q6 or q8 of the largest model that fits and ignore the speed cost. If you have a 48GB workstation card, you can run fp16 of anything up to 14B without issue. Quantization is a tool for fitting larger models onto smaller cards, not an end in itself.
Worked examples
Example 1 — 8B daily chat assistant on a 3060 12GB. Pick q4_K_M, 4K-8K context, full GPU offload. Expect ~20 tok/s, sub-second prefill on short prompts, and quality indistinguishable from fp16 for everyday tasks. Build pairs cleanly with an AMD Ryzen 7 5800X and a WD Blue SN550 1TB NVMe for fast model loads.
Example 2 — 13B coding assistant on a 3060 12GB. q4_K_M weights at ~7.5 GB plus ~1.5 GB KV cache for an 8K context leaves you with about 3 GB of headroom. Throughput lands at 13-16 tok/s; quality is strong for most code-completion workloads. Either the ZOTAC Gaming GeForce RTX 3060 Twin Edge 12GB or MSI GeForce RTX 3060 Ventus 2X 12G handles this build.
Example 3 — Trying a 32B reasoning model. Do not. On a 12GB card, even q2_K does not fit, and aggressive offload destroys throughput. If 32B reasoning matters to you, jump to a used 3090 24GB and run q4_K_M with room to spare.
Bottom line
For local LLM work on a 12GB GPU in 2026, default to q4_K_M for any model in the 7B-13B class. Step up to q5_K_M when the model is small enough to spare 1-2 GB and you want a quality margin. Avoid q8 unless you have abundant VRAM and a reason to chase the last quality point. Avoid q3 and below unless squeezing a larger model is the only path. Pair the ZOTAC Gaming GeForce RTX 3060 Twin Edge 12GB or MSI GeForce RTX 3060 Ventus 2X 12G with the AMD Ryzen 7 5800X and a WD Blue SN550 1TB NVMe and the resulting build covers the q4_K_M sweet spot cleanly.
Citations and sources
- llama.cpp quantize README
- TechPowerUp — GeForce RTX 3060 12GB specifications
- Hugging Face — Transformers quantization overview
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
