For a single-card 24 GB setup, load Q4_K_M with llama.cpp (or ik_llama.cpp if you need the extra 8–15% at Q5_K_S). Dual RTX 3060 12 GB cards can run Qwen 3.6 27B with tensor-parallel split but carry a PCIe latency tax; a used RTX 3090 24 GB beats them on throughput and KV headroom, though the pair wins on up-front cost and resale flexibility. Update llama.cpp before enabling MTP: a regression in recent main was patched within 48 hours.
Why 24 GB Is the Canonical Budget for 27B Dense Models
The 27B parameter class has become the de facto "fits in a single prosumer GPU" tier for local inference, and 24 GB is its natural budget. Per Qwen's model card on Hugging Face, Qwen 3.6 27B ships as a dense transformer (not MoE), which means every parameter loads into VRAM at inference time. At Q4_K_M the weight file lands around 16–17 GB, leaving 6–8 GB for the KV cache, attention buffers, and activation tensors at reasonable context windows.
That math is why 24 GB is the practical floor for this tier: 20 GB cards (for example, the RTX A4500 or RX 7900 XT) load Q4_K_M but leave under 4 GB for KV cache, limiting you to ~8K context before spilling to RAM. 24 GB cards run Q4_K_M at 16K context, Q5_K_S at 8K, and Q6_K (barely) at 8K, which covers most of the quality ladder without CPU offload.
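The budget arithmetic above can be sketched in a few lines. This is a minimal illustration, not a measurement: the 16.5 GB weight size is the article's approximate Q4_K_M figure, and the 1 GB `overhead_gb` default (CUDA context, compute buffers) is an assumption introduced here.

```python
def kv_headroom_gb(vram_gb: float, weights_gb: float, overhead_gb: float = 1.0) -> float:
    """VRAM left for the KV cache after weights and a fixed runtime overhead.

    overhead_gb is an assumed placeholder, not a measured value.
    """
    return vram_gb - weights_gb - overhead_gb


Q4_K_M_WEIGHTS_GB = 16.5  # approximate figure quoted in the article

for vram in (20, 24):
    headroom = kv_headroom_gb(vram, Q4_K_M_WEIGHTS_GB)
    print(f"{vram} GB card: ~{headroom:.1f} GB left for KV cache")
```

Under these assumptions the 20 GB case leaves well under 4 GB and the 24 GB case lands inside the 6–8 GB range quoted above.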
Per the Qwen team's launch post, Qwen 3.6 27B is positioned as a flagship-level agentic coding model (beating Qwen3.5-397B-A17B on SWE-bench Verified, SWE-bench Pro, Terminal-Bench 2.0, and SkillsBench) and ships natively multimodal with a 262K context window. It also ships with MTP (multi-token prediction) heads that llama.cpp supports as of recent builds, but see the MTP section below for an important caveat on builds pulled during a recent regression window.
Key Takeaways
- Q4_K_M + llama.cpp is the safe default: ~16.5 GB weights, 6–7 GB KV headroom, 16K context without offload on a single RTX 3090/4090 24 GB card.
- ik_llama.cpp beats mainline by 8–15% on decode at Q5_K_S and Q6_K due to its IQ-quant kernels; the gap narrows to roughly 5–8% at Q4_K_M.
- vLLM AWQ is faster for batched serving (batch ≥ 4) but decodes more slowly at batch=1 (~24 tok/s versus ~28 tok/s for llama.cpp), despite faster prefill.
- MTP regression was patched in llama.cpp within 48 hours — pull main before benchmarking.
- Dual RTX 3060 12 GB adds PCIe round-trip latency that costs 15–25% versus a single 3090 at batch=1.
- KV cache quantization (Q8_0) roughly halves KV memory with no measurable quality loss at ≤32K context.
Quant Comparison: VRAM vs Quality for Qwen 3.6 27B
The table below synthesizes published VRAM usage figures from llama.cpp release notes and community measurements posted to r/LocalLLaMA. Numbers are approximate — VRAM usage varies slightly with batch size, context, and build flags.
| Quant | Weight size (GB) | 8K ctx VRAM | 16K ctx VRAM | 32K ctx VRAM | Quality vs FP16 |
|---|---|---|---|---|---|
| Q2_K | ~10.5 | ~12 GB | ~14 GB | ~18 GB | Noticeable loss |
| Q3_K_M | ~13.0 | ~15 GB | ~17 GB | ~21 GB | Mild loss |
| Q4_K_M | ~16.5 | ~18 GB | ~21 GB | OOM | Minimal loss |
| Q5_K_S | ~18.5 | ~21 GB | OOM | OOM | Near-lossless |
| Q6_K | ~21.0 | ~23 GB | OOM | OOM | Near-lossless |
| Q8_0 | ~28.5 | OOM | OOM | OOM | Indistinguishable |
| FP16 | ~54.0 | OOM | OOM | OOM | Reference |
On a single 24 GB card: Q4_K_M at 16K context and Q5_K_S at 8K context are the sweet spots. Q6_K barely fits at 8K with aggressive flash-attention settings and Q8_0 does not fit at all. For dual-card setups (e.g., two RTX 3060 12 GB sharing 24 GB via tensor-parallel), Q4_K_M at 8K context fits but PCIe overhead limits practical throughput.
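As a rough sketch of how the table can be read programmatically, the lookup below picks the highest-quality quant whose approximate VRAM figure fits a given card. The numbers are copied from the table above; `best_quant` and its `margin_gb` safety margin are hypothetical names introduced here for illustration.

```python
# Approximate total VRAM (GB) per quant and context size (in K tokens),
# copied from the table above. None marks an OOM cell.
# Entries are ordered from lowest to highest quality.
VRAM_TABLE = {
    "Q2_K":   {8: 12, 16: 14, 32: 18},
    "Q3_K_M": {8: 15, 16: 17, 32: 21},
    "Q4_K_M": {8: 18, 16: 21, 32: None},
    "Q5_K_S": {8: 21, 16: None, 32: None},
    "Q6_K":   {8: 23, 16: None, 32: None},
    "Q8_0":   {8: None, 16: None, 32: None},
}


def best_quant(ctx_k, vram_gb=24.0, margin_gb=0.0):
    """Return the highest-quality quant that fits, or None if nothing does."""
    best = None
    for name, usage in VRAM_TABLE.items():
        need = usage.get(ctx_k)
        if need is not None and need + margin_gb <= vram_gb:
            best = name  # later entries are higher quality
    return best


for ctx in (8, 16, 32):
    print(ctx, best_quant(ctx), best_quant(ctx, margin_gb=2.0))
```

With a 2 GB margin the lookup returns Q5_K_S at 8K and Q4_K_M at 16K, which matches the sweet spots named above; with no margin it returns Q6_K at 8K, the "barely fits" case.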
Backend Comparison: llama.cpp vs ik_llama.cpp vs vLLM vs BeeLlama
Per the r/LocalLLaMA backend-comparison thread that surfaced this week, results on a single RTX 3090 24 GB at Q4_K_M, batch=1, 8K context:
| Backend | Prefill (tok/s) | Decode (tok/s) | Notes |
|---|---|---|---|
| llama.cpp (main) | ~1,850 | ~28 | Baseline; MTP adds ~35% when working |
| ik_llama.cpp | ~1,900 | ~30 | +5–8% via IQ-quant kernel optimizations |
| vLLM AWQ | ~2,100 | ~24 | Faster prefill; slower decode at batch=1 |
| BeeLlama | ~1,750 | ~27 | Community fork; less tested on Qwen 3.6 |
At Q5_K_S, ik_llama.cpp's advantage widens to 8–15% on decode; per the same thread, this is where its custom IQ-quant kernels for NVIDIA hardware do the most work. At batch=4, the picture flips: vLLM leads on both prefill and decode.
Recommendation: llama.cpp main for casual use; ik_llama.cpp for the best single-user throughput at Q5_K_S+; vLLM AWQ for server/multi-user scenarios.
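To make the relative gaps explicit, the snippet below converts the table's approximate tok/s figures into percentage deltas against llama.cpp main. It only restates the numbers from the table; the helper names are illustrative.

```python
BASELINE = "llama.cpp (main)"

# (prefill tok/s, decode tok/s), approximate values from the table above
RESULTS = {
    "llama.cpp (main)": (1850, 28),
    "ik_llama.cpp": (1900, 30),
    "vLLM AWQ": (2100, 24),
    "BeeLlama": (1750, 27),
}


def pct_vs_baseline(value, base):
    """Percentage difference of value relative to base, one decimal place."""
    return round((value / base - 1) * 100, 1)


def deltas():
    """Map each backend to (prefill %, decode %) relative to the baseline."""
    base_prefill, base_decode = RESULTS[BASELINE]
    return {
        name: (pct_vs_baseline(p, base_prefill), pct_vs_baseline(d, base_decode))
        for name, (p, d) in RESULTS.items()
    }


for name, (prefill, decode) in deltas().items():
    print(f"{name:18s} prefill {prefill:+.1f}%  decode {decode:+.1f}%")
```

On these figures ik_llama.cpp gains about 7% on decode, while vLLM AWQ trades a prefill lead of roughly 13.5% for a decode deficit of roughly 14% at batch=1.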
Prefill vs Generation Throughput
Prefill (processing the prompt) and decode (generating new tokens) scale very differently. Per cited community measurements at batch=1:
- Prefill scales roughly linearly with prompt length up to flash-attention's effective window. All CUDA backends are roughly within 15% of each other.
- Decode is the bottleneck for interactive use and is dominated by memory bandwidth, not compute. On a 3090 24 GB (936 GB/s bandwidth), Q4_K_M decode tops out around 28–32 tok/s regardless of backend.
- Batch=4 decode throughput roughly doubles versus batch=1 for vLLM (better kernel occupancy) but only adds ~20% for llama.cpp (less optimized batched decode path).
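The bandwidth-bound claim can be checked with a back-of-the-envelope estimate. The sketch below assumes every decoded token requires reading all weights from VRAM once, which is a common simplification for dense models at batch=1 and ignores KV-cache reads and kernel overhead; `decode_ceiling_tok_s` is a name introduced here.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on batch=1 decode speed if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


RTX_3090_BANDWIDTH_GB_S = 936  # figure quoted above
Q4_K_M_WEIGHTS_GB = 16.5       # approximate weight size from the quant table

ceiling = decode_ceiling_tok_s(RTX_3090_BANDWIDTH_GB_S, Q4_K_M_WEIGHTS_GB)
print(f"theoretical ceiling: ~{ceiling:.0f} tok/s")

for observed in (28, 32):
    print(f"{observed} tok/s is {observed / ceiling:.0%} of the ceiling")
```

Under this simplification the ceiling is about 57 tok/s, so the reported 28–32 tok/s corresponds to roughly half of peak bandwidth. That is consistent with decode being limited by memory traffic rather than compute.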
Does MTP Help on Qwen 3.6 27B?
Multi-token prediction ships in Qwen 3.6 27B's weights as auxiliary heads. When correctly active in llama.cpp, per llama.cpp release notes, MTP delivers 1.3–1.8× decode throughput at batch=1 — it predicts 2–4 tokens per forward pass and verifies them speculatively, reducing total forward passes needed.
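One way to reason about a speedup range like this is the standard speculative-decoding expectation. The sketch below assumes each drafted token is accepted independently with the same probability, which is an idealization, and it ignores the cost of running the MTP heads, so it gives an upper bound rather than a prediction. `accept_rate` and `draft_tokens` are illustrative names, not llama.cpp parameters.

```python
def expected_tokens_per_pass(accept_rate: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes i.i.d. acceptance of each drafted token. The pass always yields
    at least one token, plus each draft that is accepted in sequence.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be non-negative")
    if accept_rate == 1.0:
        return float(draft_tokens + 1)
    return (1 - accept_rate ** (draft_tokens + 1)) / (1 - accept_rate)


for rate in (0.4, 0.6, 0.8):
    for drafts in (1, 3):
        tokens = expected_tokens_per_pass(rate, drafts)
        print(f"accept={rate:.1f} drafts={drafts}: ~{tokens:.2f} tokens/pass")
```

Real-world gains sit below these values because drafting and verification are not free, which is one reason measured speedups vary so widely between prompts and builds.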
The critical caveat: A regression in recent main temporarily disabled or mis-routed MTP for several model families. Per the LocalLLaMA PSA thread, users who pulled main between the regression commit and the fix (a ~48-hour window) saw 20–40% lower throughput, which is easy to mistake for normal performance without a baseline to compare against. The fix is in current main.
How to verify MTP is firing: Run llama.cpp with --verbose and grep for "mtp" in startup output. If MTP loaded, you'll see head dimensions logged. If you see "mtp head not found" or no MTP lines, either your model lacks MTP weights or your build needs updating.
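If you save the startup output to a file, a small helper can automate the check described above. This is a sketch under a loud assumption: the strings it looks for ("mtp" and "mtp head not found") are taken from the description in this article, not verified against any particular llama.cpp build, so adjust them to whatever your build prints. `mtp_status` is a hypothetical helper.

```python
def mtp_status(log_text: str) -> str:
    """Classify startup log text by what it says about MTP.

    Returns "missing" if a "mtp head not found" line appears, "mentioned"
    if any other MTP line appears, and "absent" if MTP is never mentioned.
    The match strings are assumptions based on the article's description.
    """
    mtp_lines = [ln.lower() for ln in log_text.splitlines() if "mtp" in ln.lower()]
    if any("mtp head not found" in ln for ln in mtp_lines):
        return "missing"
    if mtp_lines:
        return "mentioned"
    return "absent"
```

A result of "absent" or "missing" corresponds to the two failure cases above: either the model file lacks MTP weights or the build needs updating. A result of "mentioned" only means MTP lines were logged, so it is still worth comparing decode speed against a known baseline.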
Context Length Safety on 24 GB
KV cache grows linearly with context length (it is the attention compute, not the cache, that grows quadratically). The math with Q4_K_M on a 24 GB card (per llama.cpp documentation):
| Context | KV Cache (FP16) | KV Cache (Q8_0) | Total VRAM (Q4_K_M) | Fits 24 GB? |
|---|---|---|---|---|
| 8K | ~1.5 GB | ~0.75 GB | ~18.5 GB | Yes |
| 16K | ~3.0 GB | ~1.5 GB | ~20 GB | Yes |
| 32K | ~6.0 GB | ~3.0 GB | ~23 GB | Marginal |
| 64K | ~12.0 GB | ~6.0 GB | OOM | No with FP16 KV; barely with Q8_0 KV |
For 32K context with Q4_K_M, quantize the KV cache to Q8_0 (flag: --cache-type-k q8_0 --cache-type-v q8_0). Q8_0 KV shows no measurable quality degradation at ≤32K context per the cited r/LocalLLaMA quantized-MTP-KV-cache experiment. Q4_0 KV works at 64K but introduces a perplexity tax that grows with context length.
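The FP16 column in the table doubles each time the context doubles, which is what the usual KV-cache size formula predicts: two tensors (K and V) per layer, each of size KV heads × head dimension × context length. The sketch below uses that formula with placeholder hyperparameters (48 layers, 8 KV heads, head dimension 128). These are assumptions chosen only because they reproduce the table's ~1.5 GB at 8K; they are not taken from the Qwen 3.6 27B model card.

```python
def kv_cache_gib(
    ctx_tokens: int,
    n_layers: int = 48,      # placeholder, not the model's confirmed value
    n_kv_heads: int = 8,     # placeholder
    head_dim: int = 128,     # placeholder
    bytes_per_elem: float = 2.0,  # FP16
) -> float:
    """KV cache size in GiB: K and V tensors for every layer and token."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens
    return elems * bytes_per_elem / 2**30


# Q8_0 stores 32 int8 values plus one FP16 scale per block: 34 bytes / 32 values.
Q8_0_BYTES_PER_ELEM = 34 / 32

for ctx in (8192, 16384, 32768, 65536):
    fp16 = kv_cache_gib(ctx)
    q8 = kv_cache_gib(ctx, bytes_per_elem=Q8_0_BYTES_PER_ELEM)
    print(f"{ctx // 1024}K: FP16 ~{fp16:.2f} GiB, Q8_0 ~{q8:.2f} GiB")
```

With these placeholders, Q8_0 comes out slightly above half the FP16 size because of the per-block scale, so the table's Q8_0 column is best read as a rounded approximation.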
Multi-GPU Scaling: When Does a Second Card Help?
Per the linked thread measurements on dual RTX 3060 12 GB (total 24 GB via llama.cpp tensor-split):
- Weights: Split evenly across both cards — fits fine.
- KV cache: Behavior depends on split mode. --split-mode row concentrates KV on the main GPU (effectively limiting KV budget to one card's free VRAM), while the default --split-mode layer splits KV across cards alongside the layers each card owns.
- PCIe latency: Each token forward pass crosses the PCIe bus once per layer, adding ~15–25% decode latency versus a single 3090 that fits the model monolithically.
A second card does pay off for 70B-class models (Llama 3 70B, Qwen 72B) or for batched inference with 4+ concurrent users, where the additional VRAM unlocks quants not possible on a single card.
Verdict on dual 3060 12 GB for Qwen 3.6 27B: Technically works, practically inferior to a single used RTX 3090 24 GB. The 3090 benchmarks faster and has more usable KV headroom, at a similar rated board power (350 W versus 2 × 170 W).
Perf-per-Dollar Matrix
Approximate current street prices and decode throughput at Q4_K_M, batch=1:
| GPU | VRAM | ~Price (used/new) | Q4_K_M decode (tok/s) | $/tok/s |
|---|---|---|---|---|
| RTX 3060 12 GB ×2 | 24 GB (split) | ~$250 ($125×2) | ~22 | ~$11 |
| RTX 3090 24 GB | 24 GB | ~$450 used | ~30 | ~$15 |
| RTX 4090 24 GB | 24 GB | ~$1,400 new | ~38 | ~$37 |
| RTX 5090 32 GB | 32 GB | ~$2,000 new | ~55 (est.) | ~$36 |
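The RTX 3090 24 GB is the best value among single-card options for single-user Qwen 3.6 27B. The dual-3060 pair is cheaper per tok/s on paper (~$11 versus ~$15) but decodes roughly 27% slower and splits its VRAM across two cards. The 4090 buys ~27% more throughput for 3× the price. The 5090 adds Q8_0 capability (32 GB) plus higher bandwidth, which is justified only if you need the full 32 GB for larger models or batch serving.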
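The $/tok/s column is simple division, so it is easy to recompute as prices move. The sketch below reuses the approximate prices and decode figures from the table; substitute current local prices before drawing conclusions.

```python
# (approximate price in USD, approximate Q4_K_M decode tok/s) from the table
GPUS = {
    "RTX 3060 12 GB x2": (250, 22),
    "RTX 3090 24 GB": (450, 30),
    "RTX 4090 24 GB": (1400, 38),
    "RTX 5090 32 GB": (2000, 55),
}


def dollars_per_tok_s(price_usd: float, tok_s: float) -> float:
    """Purchase price divided by decode throughput."""
    return price_usd / tok_s


ranked = sorted(GPUS, key=lambda gpu: dollars_per_tok_s(*GPUS[gpu]))
for gpu in ranked:
    price, tok_s = GPUS[gpu]
    print(f"{gpu:18s} ${dollars_per_tok_s(price, tok_s):.0f} per tok/s")
```

This metric looks only at purchase price and batch=1 decode speed. It says nothing about absolute throughput, KV headroom, or power draw, which is why it should be read alongside the rest of the comparison rather than on its own.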
Bottom Line: Recommended Config Matrix
For single-user interactive use on a single 24 GB GPU:
| Scenario | Quant | Backend | Context | MTP |
|---|---|---|---|---|
| Daily driver, 8K context | Q5_K_S | ik_llama.cpp | 8K | Enable (verify build) |
| Daily driver, 16K context | Q4_K_M | llama.cpp or ik_llama.cpp | 16K | Enable |
| Server / batched | AWQ (4-bit) | vLLM | 8K | N/A (vLLM handles internally) |
| Code assistant, long context | Q4_K_M + Q8_0 KV | llama.cpp | 32K | Enable |
| Budget dual 3060 | Q4_K_M | llama.cpp tensor-split | 8K | Enable |
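
The llama.cpp rows of the matrix translate into command lines along these lines. This is a hedged sketch: the model filename is a hypothetical placeholder, `-ngl 99` is a common way to request full GPU offload, and no MTP option is shown because the exact flag depends on your build. Some builds also require flash attention to be enabled before the V cache can be quantized, so check your build's --help output.

```python
import shlex


def llama_server_args(model_path, ctx_tokens, kv_cache_type=None,
                      split_mode=None, tensor_split=None):
    """Build a llama-server argument list for one row of the config matrix."""
    args = ["llama-server", "-m", model_path, "-c", str(ctx_tokens), "-ngl", "99"]
    if kv_cache_type:
        # Quantize both halves of the KV cache to the same type.
        args += ["--cache-type-k", kv_cache_type, "--cache-type-v", kv_cache_type]
    if split_mode:
        args += ["--split-mode", split_mode]
    if tensor_split:
        # Proportions per GPU, e.g. (1, 1) for an even two-card split.
        args += ["--tensor-split", ",".join(str(p) for p in tensor_split)]
    return args


MODEL = "qwen3.6-27b-Q4_K_M.gguf"  # hypothetical filename

long_context = llama_server_args(MODEL, 32768, kv_cache_type="q8_0")
dual_3060 = llama_server_args(MODEL, 8192, split_mode="layer", tensor_split=(1, 1))

print(shlex.join(long_context))
print(shlex.join(dual_3060))
```

Building the argument list in code rather than by hand makes it easier to keep benchmark runs comparable when only one setting changes between them.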
Verdict Matrix
Get llama.cpp Q4_K_M if… you want zero-fuss setup, stay on mainline, and value community support over maximum throughput. Works on any CUDA GPU, well-documented, stable.
Get ik_llama.cpp Q5_K_S if… you want the best single-user decode performance and are comfortable building from source. The IQ-quant kernels matter most at Q5+.
Get vLLM AWQ if… you're running a local API server with multiple concurrent users, or integrating with OpenAI-compatible tools at higher batch sizes. Not the best for single-session interactive use.
Citations and Sources
- Qwen/Qwen3-27B model card — Hugging Face
- llama.cpp releases and release notes — GitHub
- ik_llama.cpp — ikawrakow's fork with IQ-quant kernels — GitHub
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
