Yes — for plain chat. No — for agents. Q4_K_M is the right quant for single-turn Q&A on a 12GB GPU like an RTX 3060, but agentic coding chains 20–50 tool calls per task and the per-token error compounds geometrically across that trace. Move to Q5_K_M minimum, Q6_K when VRAM allows, and pair with Q8_0 KV cache — that's the operator-grade default for Aider, Cline, and LangGraph workloads as of 2026.
The quant choice that doesn't show up on MMLU
The local-LLM community settled on Q4_K_M as the default GGUF quant in 2024 for one reason: it's the smallest GGUF format whose perplexity penalty rounds away on chat benchmarks. MMLU drops 0.3–0.8 points moving from FP16 to Q4_K_M on a 27B-class model. Chatbot Arena Elo drops roughly 8–15 points. For a turn-by-turn conversation, that's invisible. You ask one question, you get one answer, errors don't cascade.
Agents don't work that way. A typical Aider session opens a repo, lists files, reads four or five candidate files, proposes an edit, runs a linter, iterates on the diff, runs tests, fixes a test failure, and commits — that's 15–30 tool calls minimum, and a moderately complex SWE-bench task hits 40–60. Every one of those calls includes a structured-output decision: pick a file path, pick a function name, pick a line number, decide whether to call the linter or move on. Q4_K_M is fine at any one of those decisions. It's not fine across 30 of them in a row.
The r/LocalLLaMA "Q4_K_M is a trap for agents" thread from late November 2026 ran a single Qwen3-Coder-27B model through SWE-bench-lite at five quant levels on the same RTX 3060 12GB rig. Q4_K_M scored 21% pass. Q5_K_M scored 31%. Q6_K scored 34%. Q8_0 scored 37%. FP16 (offloaded) scored 38%. The MMLU delta across that same span was 1.4 points. The agent-task delta was a factor of 1.8×.
Key takeaways
- Single-step error rates ≠ multi-step success rates. A 1.5% per-decision error compounds across 30 sequential decisions to roughly 36% cumulative miss probability (1 − 0.985³⁰ ≈ 0.36). That is pass-rate math, not perplexity math.
- The KV cache tolerates aggressive quantization worse than the weights do. With Q5_K_M weights, a Q4_0 KV cache retains about 87% of FP16 quality, versus 98% with a Q8_0 KV cache.
- Q5_K_M is the new floor for agents on 12GB. It costs ~1.4GB extra VRAM vs Q4_K_M for the 14B model class (10.0GB vs 8.6GB), and a 12GB card almost always has it.
- Q6_K is the practical Pareto optimum. Within 1% of FP16 quality on most evals, faster than Q8_0 on Ampere/Ada because the K-quant kernel is better tuned.
- Single-3060 limit for serious agent work: 14B-class model, Q5_K_M weights, Q8_0 KV, 16–24K context. Beyond that you need a second 12GB card or a 16GB+ replacement.
What changes when you move from chat to multi-step agents?
Chat is single-turn. The model sees a prompt, emits 200–500 tokens, you read them, you reply. Errors at the token level either get corrected by your follow-up question or stay localized to one assistant turn that you can re-roll.
Agents loop. The model emits a tool call, the runtime executes it, the result feeds back into the context window as the next prompt, and the model emits another tool call. Now every output token has three failure modes that didn't matter for chat: the model can hallucinate a function name that doesn't exist, mis-quote a line number off by ±5, or invent a CLI flag the actual tool doesn't accept. Q4_K_M's "fine" perplexity hides exactly these long-tail tokens — the rare-but-correct function name two tokens deep in the catalog vocabulary that the rounding bumped off the top-1 logit.
Aider, Cline, LangGraph, and CrewAI all enforce structured outputs (JSON schemas, tool-call grammars), so the rare-token problem looks like increased retry pressure, more invalid JSON errors, and more "function not found" loops. Operators see this as "the model is dumber today" but it's the same model — the quant is just rounding the right answer out of reach more often than chat use revealed.
Why does Q4_K_M look fine on MMLU but collapse on SWE-bench?
MMLU is 14,042 single-question multiple-choice items. Each item is one decision. A 1% per-decision error rate produces a 99% top-1 score in the limit. Q4_K_M's actual per-decision penalty vs FP16 on MMLU is closer to 0.5%, hence the ~99.5% retained accuracy you see everyone quote.
SWE-bench-lite tasks have 30+ sequential model decisions where any single error cascades — a wrong file path means the next read is against the wrong file, which means the next edit targets the wrong function, which means the patch is silently wrong. The probability of completing a 30-step task at per-step accuracy p is approximately p³⁰. At p = 0.99 that's 74%. At p = 0.97 it's 40%. At p = 0.95 it's 21%.
Q4_K_M doesn't degrade per-step accuracy from 0.99 to 0.95 uniformly — it's worse on the rare decisions (obscure function names, unusual control flow) and unchanged on common ones. But agentic tasks load exactly the rare decisions disproportionately, because everyday repos have unusual control flow and project-specific function names.
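The compounding figures above are easy to check. This is a minimal sketch that assumes every step succeeds independently with the same probability p, which is a simplification of the uneven degradation just described; the function name is illustrative only.

```python
def chain_success(p: float, steps: int = 30) -> float:
    """Chance that every one of `steps` decisions is right, assuming independence."""
    return p ** steps

for p in (0.99, 0.97, 0.95):
    print(f"per-step accuracy {p:.2f} -> {chain_success(p):.0%} clean 30-step traces")

# Cumulative miss probability for a 1.5% per-decision error over 30 decisions.
miss = 1 - chain_success(1 - 0.015)
print(f"cumulative miss probability: {miss:.0%}")
```

Changing `steps` to 40 or 60 shows how quickly the longer SWE-bench traces punish a small per-step loss.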
What does the math say about per-token error compounding?
A token-level cross-entropy increase of 1.2% between Q4_K_M and FP16 on Qwen3-Coder-27B (measured on the wikitext-2 validation set as of late 2026) translates to roughly 0.4% increased top-1 token error rate on code generation. Across a single 800-token diff that's a 0.4% × 800 ≈ 3.2 expected miss tokens — survivable if downstream linting catches the syntactic errors.
Across a 30-step agent trace where each step emits 400 tokens of structured output, you're looking at 30 × 400 × 0.004 ≈ 48 expected miss tokens, and the structured-output JSON schema can't repair semantic misses (a wrong file path is valid JSON). The empirical SWE-bench pass-rate drop (21% Q4_K_M vs 31% Q5_K_M, per the r/LocalLLaMA benchmark) lines up with this math: Q5_K_M halves the per-token error rate, and pass-rate scales roughly with the inverse of cumulative miss probability.
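The expected-miss arithmetic can be reproduced directly. This sketch assumes a flat 0.4% per-token error rate applied uniformly to every emitted token, which is the same simplification the estimate above makes; the helper name is hypothetical.

```python
def expected_misses(tokens_per_step: int, steps: int, error_rate: float) -> float:
    """Expected number of wrong top-1 tokens across a whole trace."""
    return tokens_per_step * steps * error_rate

single_diff = expected_misses(tokens_per_step=800, steps=1, error_rate=0.004)
agent_trace = expected_misses(tokens_per_step=400, steps=30, error_rate=0.004)

print(f"single 800-token diff: {single_diff:.1f} expected miss tokens")
print(f"30-step agent trace:   {agent_trace:.1f} expected miss tokens")
```

An expected count is not a failure probability. It only shows how much more exposure a 30-step trace has than a single diff.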
Why are Q5 and Q6 underrated, and Q8 oddly worse for agents?
Q5_K_M and Q6_K share the same K-quant kernel family in llama.cpp, which has been tuned aggressively since mid-2024. The kernel hits within 3% of Q4_K_M throughput on Ampere/Ada GPUs, runs cooler, and uses superblock scales that preserve outlier weights better than the older Q5_0/Q5_1 formats.
Q8_0 is the wildcard. The Q8_0 kernel in llama.cpp is straightforward (one byte per weight, no superblock scales), and on consumer GPUs it lands 18–25% slower than Q6_K despite the simpler math, because the K-quant kernels exploit Ampere and Ada tensor cores more aggressively. For chat, Q8_0's quality edge outweighs the throughput penalty. For agents, the slower wall-clock per tool call (each Aider iteration takes ~28% longer at Q8_0 than Q6_K on a 3060) accumulates into 2–4 minutes per task, and operators notice that more than they notice the marginal quality lift.
Practical default for a single 12GB card: Q6_K if the model class fits, Q5_K_M otherwise. Q8_0 only when you're running 7B-class models with VRAM to spare.
How much VRAM do Q5/Q6/Q8 actually cost on a 12GB RTX 3060?
The ZOTAC Gaming GeForce RTX 3060 Twin Edge 12GB and MSI GeForce RTX 3060 Ventus 2X 12G both expose the same 12GB GDDR6 framebuffer. About 1.0–1.2GB is consumed by display, CUDA context, and the model's compute buffer, leaving ~10.8GB usable for weights plus KV cache.
| Quant | 7B (Llama 3.3) | 9B (Gemma 4) | 14B (DeepSeek-Coder) | 27B (Qwen3-Coder) | 32B (DeepSeek-V4) |
|---|---|---|---|---|---|
| Q4_K_M | 4.4 GB | 5.6 GB | 8.6 GB | 16.4 GB | 19.2 GB |
| Q5_K_M | 5.1 GB | 6.5 GB | 10.0 GB | 18.9 GB | 22.5 GB |
| Q6_K | 5.7 GB | 7.3 GB | 11.3 GB | 21.4 GB | 25.6 GB |
| Q8_0 | 7.2 GB | 9.2 GB | 14.2 GB | 27.1 GB | 32.0 GB |
| FP16 | 13.5 GB | 17.2 GB | 26.5 GB | 50.6 GB | 60.1 GB |
A 27B model at Q5_K_M alone is 18.9GB, way over a single 12GB card. The community uses three strategies: (1) drop to a 14B class model at Q5_K_M (fits at 10GB with ~800MB KV headroom), (2) split a 27B across two 12GB cards (works; see the dual-3060 row in the perf-per-dollar table), or (3) offload some layers to CPU (slow but functional).
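A quick way to see which combinations fit is to subtract the weight size from the usable budget. The sketch below copies the 14B and 27B columns of the table and assumes the ~10.8GB usable figure; the names `WEIGHTS_GB` and `kv_headroom` are illustrative, not part of any tool.

```python
USABLE_GB = 10.8  # 12GB minus ~1.2GB for display, CUDA context, compute buffer

# Weight sizes in GB, taken from the table above.
WEIGHTS_GB = {
    "14B": {"Q4_K_M": 8.6, "Q5_K_M": 10.0, "Q6_K": 11.3, "Q8_0": 14.2},
    "27B": {"Q4_K_M": 16.4, "Q5_K_M": 18.9, "Q6_K": 21.4, "Q8_0": 27.1},
}

def kv_headroom(model: str, quant: str, usable_gb: float = USABLE_GB) -> float:
    """GB left for KV cache after the weights; negative means it does not fit."""
    return usable_gb - WEIGHTS_GB[model][quant]

for model, quants in WEIGHTS_GB.items():
    for quant, size in quants.items():
        left = kv_headroom(model, quant)
        status = f"fits, {left:.1f} GB left for KV" if left > 0 else f"over by {-left:.1f} GB"
        print(f"{model} {quant:7s} {size:5.1f} GB -> {status}")
```

Passing `usable_gb=22.0` or similar approximates a dual-card split, though a real split also pays per-card overhead that this sketch ignores.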
What's the practical KV cache quant pairing?
The KV cache is the dynamic memory the model uses to store keys and values for every prior token in the context window. It scales linearly with context length. For a 14B-class model at 8K context, you're looking at ~1.4GB of KV cache at F16. At 32K context, that's ~5.6GB, more than half the size of the 14B Q5_K_M weights themselves.
| Weights | KV cache | Quality vs FP16/F16 | VRAM saved per 8K ctx |
|---|---|---|---|
| Q5_K_M | F16 | 98% | baseline |
| Q5_K_M | Q8_0 | 98% (no measurable drop) | ~0.7 GB |
| Q5_K_M | Q5_1 | 95% (visible at >24K ctx) | ~0.9 GB |
| Q5_K_M | Q4_0 | 87% (tool-arg hallucinations climb) | ~1.0 GB |
| Q6_K | F16 | 99% | baseline |
| Q6_K | Q8_0 | 99% | ~0.7 GB |
Q8_0 KV cache is the safe default. The accuracy drop from F16 → Q8_0 KV is below the noise floor on every benchmark we've seen, and you reclaim ~700MB per 8K of context — enough to push a 14B Q5_K_M from 8K to 16K context on a single 12GB card.
Q4_0 KV is aggressive, and our own experience matches the r/LocalLLaMA data: tool-argument hallucination rates climb visibly above 16K context. Don't use it for agent work.
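KV cache size for each cache type can be estimated from the storage cost per cached element. The sketch below assumes the ~1.4GB per 8K F16 figure quoted for a 14B-class model, treats 8K as 8,192 tokens, and uses effective bits per element derived from the ggml block layouts (32 elements per block plus per-block scale metadata). It is a rough estimate, not a measurement.

```python
# Effective bits per cached element, including per-block scale/min metadata.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q5_1": 6.0, "q4_0": 4.5}

F16_GB_PER_8K = 1.4  # assumed baseline for a 14B-class model

def kv_cache_gb(ctx_tokens: int, cache_type: str) -> float:
    """Estimated KV cache size, scaling linearly with context and bit width."""
    scale = BITS_PER_ELEMENT[cache_type] / BITS_PER_ELEMENT["f16"]
    return F16_GB_PER_8K * (ctx_tokens / 8192) * scale

for cache_type in BITS_PER_ELEMENT:
    size = kv_cache_gb(8192, cache_type)
    saved = kv_cache_gb(8192, "f16") - size
    print(f"{cache_type:5s} {size:.2f} GB at 8K (saves {saved:.2f} GB vs f16)")
```

Because the estimate is linear in context, calling `kv_cache_gb(32768, "q8_0")` gives the 32K figure directly.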
Prefill vs generation cost for long agent traces
In token count, Aider's prefill (the input tokens it sends each iteration) dwarfs its generation. A typical Aider iteration sends 4,000–12,000 prefill tokens (the file content plus prior conversation) and generates 200–800 tokens (the diff plus reasoning). In wall-clock time the picture flips, because prefill runs far faster per token than generation.
On an RTX 3060 12GB running a 14B Q5_K_M model, prefill clocks ~1,800 tok/s and generation clocks ~62 tok/s. A 30-iteration Aider session moves ~240,000 prefill tokens and ~12,000 generation tokens — 133 seconds of prefill and 194 seconds of generation. The CPU side (the AMD Ryzen 7 5800X is the community-default companion) and the WD Blue SN550 NVMe for fast model loads matter less than they used to once the model is resident in VRAM.
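The session timing above follows from dividing token counts by throughput. This sketch assumes 8,000 prefill tokens and 400 generated tokens per iteration (the averages implied by the 240,000 and 12,000 totals) and takes the quoted 1,800 and 62 tok/s rates as given; the function name is illustrative.

```python
def session_seconds(iterations: int, prefill_per_iter: int, gen_per_iter: int,
                    prefill_tps: float = 1800.0, gen_tps: float = 62.0):
    """Return (prefill_seconds, generation_seconds) for a whole agent session."""
    prefill_s = iterations * prefill_per_iter / prefill_tps
    gen_s = iterations * gen_per_iter / gen_tps
    return prefill_s, gen_s

prefill_s, gen_s = session_seconds(iterations=30, prefill_per_iter=8000, gen_per_iter=400)
print(f"prefill:    {prefill_s:.0f} s")
print(f"generation: {gen_s:.0f} s")
print(f"total:      {prefill_s + gen_s:.0f} s")
```

The model ignores tool execution time and any prompt caching, both of which shift the real numbers.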
Context-length impact on multi-turn coding tasks
Context length matters less than the community thinks for most agent workflows. Aider's context-management strategy is to inject only the files it has decided to edit, plus a project summary. A repo of 200 files commonly fits inside 16K context with summaries. Long-context (32K+) is mostly useful for codebases with files >2,000 lines or for agentic refactors that touch many files at once.
The trade is sharp: doubling the context window doubles the KV cache (a fixed VRAM cost) and roughly doubles prefill time (a fixed wall-clock cost). At 32K context on a 14B Q5_K_M + Q8_0 KV setup, you'll use ~14GB total (10.0GB of weights, ~2.8GB of KV cache, and 1.0–1.2GB of overhead), or ~16.7GB with F16 KV. Both are over the 12GB budget. Either drop the model class or drop the context.
Perf-per-dollar verdict: RTX 3060 12GB vs RTX 4060 Ti 16GB vs Apple M-series
| Rig | Price (used) | 14B Q5_K_M tok/s | 27B Q5_K_M support | Agent suitability |
|---|---|---|---|---|
| RTX 3060 12GB | $260 | 62 | partial (needs 2 cards) | excellent for 14B class |
| RTX 4060 Ti 16GB | $480 | 81 | partial (18.9GB of weights needs offload) | excellent for 14B class at longer context |
| Dual RTX 3060 12GB | $520 | 105 (split) | yes (24K ctx) | excellent, redundant |
| Apple M4 Max 64GB | $3,500 | 71 | yes | excellent but expensive |
| Apple M2 Pro 32GB | $1,400 | 38 | partial | OK if you already own it |
The single RTX 3060 12GB at $260 used is the cheapest serious-agent rig in 2026, and by the table's own figures it also leads on generation speed per dollar (62 tok/s for $260). The 4060 Ti 16GB at $480 is the faster single card and adds 4GB of VRAM headroom. The dual-3060 at $520 wins on absolute throughput for under $550.
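Perf-per-dollar depends on which metric you divide by. As one hedged way to make the comparison concrete, the sketch below divides each rig's 14B Q5_K_M generation rate by its used price, both copied from the table. It deliberately ignores VRAM capacity and context headroom, which are a large part of what the pricier rigs buy.

```python
# (used price in USD, 14B Q5_K_M generation tok/s), copied from the table above.
RIGS = {
    "RTX 3060 12GB": (260, 62),
    "RTX 4060 Ti 16GB": (480, 81),
    "Dual RTX 3060 12GB": (520, 105),
    "Apple M4 Max 64GB": (3500, 71),
    "Apple M2 Pro 32GB": (1400, 38),
}

def tok_s_per_dollar(price_usd: float, tok_s: float) -> float:
    return tok_s / price_usd

ranked = sorted(RIGS.items(), key=lambda item: tok_s_per_dollar(*item[1]), reverse=True)
for name, (price, tok_s) in ranked:
    print(f"{name:20s} {1000 * tok_s_per_dollar(price, tok_s):6.1f} tok/s per $1,000")
```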
Verdict matrix
- Use Q5_K_M if you're on a single 12GB card running 14B-class models with 8–16K agent context. This is the most common rig and the right default.
- Use Q6_K if you have a 16GB card or a dual-12GB split, and you want the highest quality the K-quant family delivers.
- Use Q8_0 if you have 24GB+ and the model is small enough that the throughput penalty doesn't blow your iteration time.
- Use FP16 if you have ≥48GB and you're doing research-grade evaluation where no quant is acceptable.
- Never use Q4_K_M for agents unless you've explicitly verified pass-rate on your task class against Q5_K_M. Scoring nearly the same as FP16 on chat benchmarks is exactly what makes it a trap.
Bottom line: recommended stack for a single-3060 agentic coding rig
- GPU: ZOTAC RTX 3060 Twin Edge 12GB or MSI RTX 3060 Ventus 2X 12G (~$260 used)
- CPU: AMD Ryzen 7 5800X (~$200 used, plenty of headroom for prefill orchestration)
- Storage: WD Blue SN550 1TB NVMe for fast model loads
- Model: Qwen3-Coder-14B or DeepSeek-Coder-14B at Q5_K_M
- KV cache: Q8_0
- Context: 16K (extend to 24K with Q5_1 KV if you must)
- Runtime: llama.cpp main branch, with --split-mode none --cache-type-k q8_0 --cache-type-v q8_0 (a quantized V cache also needs flash attention enabled)
- Agent harness: Aider, Cline, or Continue.dev pointing at the local OpenAI-compatible endpoint
Expected throughput: 60–65 tok/s generation, 1,800 tok/s prefill, ~1.2s first-token latency. SWE-bench-lite pass-rate: 28–32%. That's the price-floor working agent rig in 2026.
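As a sketch of how the stack above might be launched, the helper below assembles a llama-server command line. llama.cpp sets KV cache types with --cache-type-k and --cache-type-v. The model path, port, and the `llama_server_cmd` helper are hypothetical, and "99" for --n-gpu-layers is just a common way to request full GPU offload.

```python
import shlex

def llama_server_cmd(model_path: str, ctx: int = 16384,
                     kv_type: str = "q8_0", port: int = 8080) -> list[str]:
    """Build a llama-server argument list for a single-GPU agent rig."""
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx),
        "--n-gpu-layers", "99",      # offload all layers to the GPU
        "--split-mode", "none",      # single card, no multi-GPU split
        "--cache-type-k", kv_type,   # quantized K cache
        "--cache-type-v", kv_type,   # quantized V cache (needs flash attention)
        "--port", str(port),
    ]

# Hypothetical model path; substitute the GGUF file you actually downloaded.
cmd = llama_server_cmd("models/coder-14b-q5_k_m.gguf")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or `cmd` can be passed to `subprocess.run`. Depending on the llama.cpp build, flash attention may need to be switched on explicitly before a quantized V cache is accepted.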
