Skip to main content
Quantization on a 12GB GPU: q4 vs q5 vs q8 Tok/s and Quality on the RTX 3060

Quantization on a 12GB GPU: q4 vs q5 vs q8 Tok/s and Quality on the RTX 3060

q4_K_M is the universal default. q5 for quality, q8 rarely worth the cost.

Quantization on a 12GB GPU in 2026: q4_K_M is the right default for 7B-13B chat, q5 trades 10-15% speed for quality, q8 doubles VRAM for little gain.

On a 12GB GPU like the RTX 3060 in 2026, q4_K_M is the right default for 7B-13B chat models: it fits the model with room for a usable context, runs at 18-22 tok/s on the 3060, and keeps near-fp16 output quality. q5_K_M trades 10-15% throughput for a marginal quality gain; q8 is rarely worth the doubled VRAM cost; q3 and below sacrifice quality you can feel.

Quantization in one paragraph

Quantization in llama.cpp's GGUF format stores model weights at lower bit precision (2, 3, 4, 5, 6, 8 bits) instead of fp16. Lower bits mean smaller files, smaller memory footprints, and — because token generation is bandwidth-bound — faster throughput. The cost is some lost precision, which shows up as occasional weaker responses, slightly worse code completion, or subtle drift on long reasoning chains. The whole question for a 12GB-class card is where the breakeven sits between bits and tokens-per-second, and the practical answer in 2026 is at q4_K_M for most chat work, with q5_K_M as the quality-first alternative when the model is smaller than 9B parameters.

Key takeaways

  • q4_K_M is the universal default on a 12GB card. Strong quality, comfortable VRAM headroom, best speed-per-MB.
  • q5_K_M is the quality alternative for 7B-8B models where you have 2-3 GB to spare.
  • q8 doubles VRAM for a quality lift you usually cannot perceive in chat or code completion.
  • q3 and below trade real quality for marginal VRAM savings. Use only when you must squeeze a larger model onto a card.
  • Bandwidth limits speed. Smaller quants run faster on the same card because there is less data to move per token.
  • KV cache is independent of weight quant. q4 weights still need full-precision KV unless you also enable KV-cache quant.

What does q4_K_M actually mean?

q4_K_M is a 4-bit quantization in llama.cpp's K-quant scheme, with a "medium" mix that keeps a handful of more important tensors (attention layers, embeddings) at 6-bit while the bulk of weights drop to 4-bit. The result is a near-uniform 4.5-5 bits per weight effective, with measurable quality preserved versus a uniform 4-bit quant. The detailed spec is documented in the llama.cpp quantize README. The "K" indicates the K-quant family, "M" the medium mix; "S" is small (more aggressive), "L" is large (more conservative).

The practical upshot: q4_K_M is the most-tested quantization in the ecosystem, with thousands of community-uploaded models on Hugging Face. Throughput and quality both land predictably.

Spec frame: the RTX 3060 12GB

TechPowerUp's RTX 3060 12GB page confirms the card's headline numbers: 12 GB GDDR6 on a 192-bit bus, 360 GB/s of memory bandwidth, 3,584 CUDA cores, and a 170 W TGP. The bandwidth is the single number that matters most for generation speed, and 360 GB/s sets the ceiling for tokens per second on this card.

How VRAM scales with quant on an 8B model

The table below uses Llama 3.1 8B as a reference, the most common 8B family in 2026. KV cache assumes 4K context.

QuantWeightsKV cache @ 4KTotal VRAMtok/s (3060 12GB)Quality
q2_K~2.7 GB~0.6 GB~3.3 GB~25noticeable degradation
q3_K_M~3.6 GB~0.6 GB~4.2 GB~22acceptable for non-critical chat
q4_K_M~4.6 GB~0.6 GB~5.2 GB~20strong, near-fp16
q5_K_M~5.5 GB~0.6 GB~6.1 GB~18strong+
q6_K~6.4 GB~0.6 GB~7.0 GB~16near-fp16
q8_0~8.5 GB~0.6 GB~9.1 GB~14near-fp16 (imperceptible vs q5)
fp16~16 GB~1.0 GB~17 GBOOMreference

A 12GB card holds every quant tier of an 8B model up to q8 with room for a longer context. The choice is purely about where you want to be on the speed/quality curve.

How VRAM scales with quant on a 32B model

QuantWeightsKV cache @ 4KTotal VRAMFits 12GB?
q2_K~11.5 GB~1.8 GB~13.3 GBno, just over
q3_K_M~14.5 GB~1.8 GB~16.3 GBno
q4_K_M~18.5 GB~1.8 GB~20.3 GBno
q5_K_M~22.0 GB~1.8 GB~23.8 GBno
q6_K~25.5 GB~1.8 GB~27.3 GBno
q8_0~32.0 GB~1.8 GB~33.8 GBno

A 32B model does not fit on a 12GB card in any usable form. Even q2_K, which already sacrifices noticeable quality, exceeds the available VRAM. The right answer for 32B is to drop down to a 13B model on the 3060 12GB, or step up to a 24GB card.

Is q8 worth the extra VRAM over q4?

For most chat, code completion, and summarization work, the answer is no. Independent measurements summarized in Hugging Face's quantization overview and in the llama.cpp discussions thread consistently show q4_K_M scoring within roughly 1-2 percentage points of fp16 on standard benchmarks (MMLU, HellaSwag, ARC), with most of that gap closed by q5_K_M. q8 closes the remaining gap almost completely, but the human-perceivable difference in everyday chat is small.

The cost is large. q8 roughly doubles the weights footprint versus q4_K_M, costing ~30% throughput on the bandwidth-bound 3060 and eliminating any chance of running a 13B model with reasonable context. Unless you are running formal evaluations where the last point of accuracy matters, q4_K_M or q5_K_M is the right choice.

Where the quality cliff lives: q3 and below

q2_K and q3_K family quants exist to squeeze larger models onto smaller cards. The quality cost is real and visible. On an 8B model at q2_K, code suggestions become noticeably worse, instruction-following degrades, and the model occasionally produces text that scans as fluent but ignores the prompt. q3_K_M is more usable — many people run 13B q3 on cards that cannot quite host q4 — but you should treat it as a compromise, not a default.

If a particular model only barely fits at q3, look at a smaller model at q4_K_M instead. A Llama 3.1 8B at q4_K_M will almost always outperform a Llama 3.1 13B at q2_K on the same card.

Why smaller quants run faster on the same card

Token generation is memory-bandwidth-bound. Every generated token requires reading the model's weights once. A 4-bit model has roughly half the bytes-per-weight of an 8-bit model, so the GPU spends half the time waiting on memory and produces tokens roughly twice as fast in the ideal case. Real-world ratios land lower than the theoretical because of attention overhead, sampling, and kernel inefficiencies, but the trend is consistent: lower quant equals more tokens per second on the same hardware.

This is why bandwidth and quant are the two knobs that matter most for throughput on a 12GB card. Going from q8 to q4_K_M on the 3060 12GB roughly doubles your throughput on the same model — no hardware change required.

Quality vs throughput: real-world measurements

Community-posted measurements in the llama.cpp discussions tracking perplexity (a proxy for quality) and tok/s across quants for the same base model show a consistent curve. Below is a representative summary for Llama 3.1 8B on the 3060 12GB.

Quanttok/sPerplexity delta vs fp16Notes
fp16OOM0.0reference, does not fit
q8_014+0.02imperceptible quality change
q6_K16+0.05imperceptible
q5_K_M18+0.10strong, sometimes preferred for evals
q4_K_M20+0.20universal default
q3_K_M22+0.55acceptable for casual use
q2_K25+1.50noticeable, avoid for production

The shape of the curve says everything: marginal returns above q5_K_M, marginal cost below q5_K_M down to q4_K_M, sharp quality cliff below q4_K_M.

Context-length impact

KV cache scales linearly with context length and roughly with hidden size. On an 8B model, a 4K context costs ~600 MB, 16K costs ~2.4 GB, 32K costs ~4.8 GB. On a 13B model the same factor is roughly 1.4x larger. A q4_K_M 13B model at 32K context occupies ~7.5 + 1.4 = ~14 GB and will not fit on a 12GB card. The fix is either smaller context, smaller model, or KV-cache quantization (q4_0 KV cache halves the KV footprint at small quality cost).

Prefill vs generation: which is your bottleneck?

Prefill (processing the prompt) is compute-bound and benefits from FLOPs. Generation (producing one token at a time) is bandwidth-bound. The 3060 12GB at 12.7 FP16 TFLOPs prefills a 200-token prompt in well under a second, even for a 13B q4 model. After that the user-perceived latency comes almost entirely from generation throughput.

If your usage is short prompts and long responses (interactive chat, code generation), you live entirely in the generation-bound regime and quant choice is the lever. If your usage is long prompts and short responses (RAG with retrieved context, summarization), prefill dominates and a slightly higher-quant model can still feel snappy because most of the wall-clock is spent on the prefill that happens once per query.

Common pitfalls

  • Loading fp16 weights without realizing. Many hosted model files default to fp16. On a 12GB card, fp16 OOMs for anything above 7B. Always pick a quantized GGUF or AWQ build explicitly.
  • Ignoring KV-cache quant. If you push toward long contexts on a 12GB card, q4_0 or q8_0 KV-cache quant can rescue a build that otherwise spills.
  • Mistaking marketing labels. "Q4" in the file name without "K_M" or "K_S" usually means the older non-K quant. K-quants are uniformly better quality at the same bit width — prefer them.
  • Comparing tok/s across cards without controlling for quant. A 3060 at q4 will beat a 4070 at q8 on the same model. Always match quant before comparing.
  • Forgetting batch size. Single-stream chat does not benefit from larger batch sizes; batch-1 generation throughput is what you actually live with.

When NOT to bother quantizing

If you have a 24GB card and care about quality more than tokens per second, run q6 or q8 of the largest model that fits and ignore the speed cost. If you have a 48GB workstation card, you can run fp16 of anything up to 14B without issue. Quantization is a tool for fitting larger models onto smaller cards, not an end in itself.

Worked examples

Example 1 — 8B daily chat assistant on a 3060 12GB. Pick q4_K_M, 4K-8K context, full GPU offload. Expect ~20 tok/s, sub-second prefill on short prompts, and quality indistinguishable from fp16 for everyday tasks. Build pairs cleanly with an AMD Ryzen 7 5800X and a WD Blue SN550 1TB NVMe for fast model loads.

Example 2 — 13B coding assistant on a 3060 12GB. q4_K_M weights at ~7.5 GB plus ~1.5 GB KV cache for an 8K context leaves you with about 3 GB of headroom. Throughput lands at 13-16 tok/s; quality is strong for most code-completion workloads. Either the ZOTAC Gaming GeForce RTX 3060 Twin Edge 12GB or MSI GeForce RTX 3060 Ventus 2X 12G handles this build.

Example 3 — Trying a 32B reasoning model. Do not. On a 12GB card, even q2_K does not fit, and aggressive offload destroys throughput. If 32B reasoning matters to you, jump to a used 3090 24GB and run q4_K_M with room to spare.

Bottom line

For local LLM work on a 12GB GPU in 2026, default to q4_K_M for any model in the 7B-13B class. Step up to q5_K_M when the model is small enough to spare 1-2 GB and you want a quality margin. Avoid q8 unless you have abundant VRAM and a reason to chase the last quality point. Avoid q3 and below unless squeezing a larger model is the only path. Pair the ZOTAC Gaming GeForce RTX 3060 Twin Edge 12GB or MSI GeForce RTX 3060 Ventus 2X 12G with the AMD Ryzen 7 5800X and a WD Blue SN550 1TB NVMe and the resulting build covers the q4_K_M sweet spot cleanly.

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

Products mentioned in this article

Tap any product for full specs, live Amazon & eBay pricing, and alternatives.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Watch a review

Friendly Fire: AMD Ryzen 7 5800X CPU Review & Benchmarks vs. 5600X & 5900X — Gamers Nexus on YouTube

Frequently asked questions

What does q4_K_M actually mean?
q4_K_M is a 4-bit GGUF quantization using llama.cpp's K-quant scheme with a medium mix, which keeps a handful of more important tensors at higher precision while compressing the rest to roughly four bits per weight. The result cuts model size to about a quarter of fp16 with only a small measurable drop in perplexity, which is why it is the most popular default for 12 GB cards.
Is q8 worth the extra VRAM over q4?
For most chat and coding use, q8 delivers a barely perceptible quality improvement over q4_K_M while roughly doubling the VRAM footprint, so on a 12 GB card it usually forces you to a smaller model — a net loss. q8 makes sense only when you have headroom to spare and are running a model small enough that fit is not a concern, or when you need maximum fidelity for evaluation work.
Which quant lets me fit the biggest model on a 3060 12GB?
Lower-bit quants like q3_K_M or q2_K let larger models technically fit, but quality degrades sharply below q4, so the practical sweet spot is q4_K_M, which keeps a 13B model in 12 GB with usable context. Going to q3 to squeeze in a 32B model rarely pays off because the quality loss outweighs the benefit of the larger parameter count at that compression.
Does a lower quant make generation faster?
Yes, somewhat: smaller weights mean less data to stream from VRAM per token, so q4 generally generates faster than q8 on the same card, and the gap widens as the model approaches the VRAM limit. The bigger speed cliff, however, comes from spilling out of VRAM into system RAM — staying fully on-GPU at q4 beats a partially offloaded q8 by a wide margin.
How does context length interact with quantization?
Quantization shrinks the weights but not the KV cache, which grows with context length and can consume several gigabytes at 16K-32K tokens. On a 12 GB card the VRAM you save by choosing q4 over q8 is often exactly what you need to run a longer context, so the two settings should be tuned together rather than maximizing one at the expense of the other.

Sources

— SpecPicks Editorial · Last verified 2026-06-09

More guides & deep dives from the SpecPicks archive

Browse all articles & guides →

More reviews from the SpecPicks archive

Browse all reviews →