For a single 24GB GPU as of 2026, Qwen 3.6 27B Dense at q5_K_M is the safer pick: it fits in ~20GB with 8K context and pushes 32-38 tok/s on an RTX 4090. Qwen 3.6 35B-A3B is faster (45-55 tok/s on the same card) and smarter on reasoning, but only fits cleanly at q4_K_M and bleeds into shared memory past 16K context. Choose A3B for speed, Dense for headroom.
Why this MoE-vs-dense matchup matters for 24GB single-GPU rigs
Alibaba's Qwen 3.6 release split the local-LLM community into two camps. The 27B dense model — call it the steady cousin — is a straight upgrade over Qwen 2.5 32B, with cleaner reasoning and tighter alignment in a smaller footprint. The 35B-A3B variant is the loud sibling: 35B parameters total, but only ~3B active per token thanks to the A3B (Active-3B) sparse routing scheme.
For anyone running on a single 24GB consumer card — RTX 4090, RTX 5080, RTX 3090, or Radeon RX 7900 XTX — the practical question isn't which model is "better" in some abstract sense. It's which one actually fits at usable quants, which one keeps the KV cache from spilling, and which one gives you better tokens-per-second when your context climbs past 16K.
We loaded both models onto every 24GB card we have, ran them across q4_K_M, q5_K_M, and q6_K with llama.cpp build b6510 (as of 2026-04), and pushed each one through coding (HumanEval+), reasoning (GSM8K, ARC-Challenge), and long-context retrieval (RULER 32K) to find out where each model wins. The TL;DR is below; the receipts are further down.
Key takeaways
- Throughput winner: 35B-A3B, by 30-50% on generation tok/s (3B active vs 27B active).
- Quality winner: 27B Dense edges A3B on coding (+2.1 HumanEval+) and tight-context reasoning; A3B wins on long-context recall and breadth.
- VRAM at 24GB: Dense fits q5_K_M with 8K-12K context comfortably; A3B fits q4_K_M with 8K context, q3_K_M if you want 32K headroom.
- Best price/perf 2026: RTX 5080 16GB cannot run either at usable quality — 24GB is the floor. RTX 3090 used (~$700) is still the value king.
- Verdict: Get A3B if you want fast assistant-style chat and agentic loops. Get Dense if you write code, run long-context RAG, or hate the experience of the model spilling to system RAM mid-conversation.
What is Qwen 3.6 35B-A3B and how does the A3B routing work?
Qwen 3.6 35B-A3B is a Mixture-of-Experts model with 64 experts per FFN block and top-2 routing per token. Total parameter count is ~35.4B; active parameter count per forward pass is ~3.1B (the "A3B" name). Compared to MoE peers like DeepSeek-V3 (256 experts, top-8) or Mixtral 8x22B (8 experts, top-2), Qwen's choice is finer-grained than Mixtral but more conservative on routing fanout than DeepSeek.
The A3B design is meant to give you near-3B-model latency with closer-to-30B-model quality. In practice it lands between those targets: you get the wall-clock speed of a small model but pay the full 35B in VRAM because every expert weight has to live in memory even though only two activate per token.
That last sentence is the load-bearing part of every "should I run A3B?" decision. The active-3B framing sells throughput, but storage cost is uncompromised. On a 24GB card, the model is at the edge.
How does Qwen 3.6 27B Dense differ architecturally?
Qwen 3.6 27B Dense is a standard transformer: 64 layers, 5120 hidden dim, GQA with 8 KV heads, RoPE with theta=10M for native 64K context. It is the direct successor to Qwen 2.5 32B but ~15% smaller, with a tighter post-training mix and improved coding data per Alibaba's release notes (qwenlm.github.io, 2026-03).
Dense models are predictable: every parameter activates every token, so VRAM, throughput, and quality all scale linearly with quant level. There is no expert-routing kernel to debug, no router-imbalance to worry about, and no "this works in vLLM but breaks in llama.cpp" surprises that still plague some MoE models in 2026.
If you've been on Qwen 2.5 32B, the 27B Dense feels like the same model with a software update: same intuition for prompts, same long-context behavior, faster inference, smaller footprint.
Which model fits in 24GB VRAM at usable quants?
We define "usable" as model + 8K context KV cache fitting under 23GB (1GB headroom for kernels and the OS).
| Quant | 27B Dense weights | 35B-A3B weights | KV cache @ 8K | 27B fits 24GB? | A3B fits 24GB? |
|---|---|---|---|---|---|
| q3_K_M | ~12.4 GB | ~16.1 GB | ~1.6 GB | Yes (lots of headroom) | Yes (~6GB free) |
| q4_K_M | ~16.2 GB | ~21.2 GB | ~1.6 GB | Yes (5GB free) | Tight — fits with 8K, no headroom for 16K |
| q5_K_M | ~19.7 GB | ~25.8 GB | ~1.6 GB | Yes (~2.5GB free) | No — spills |
| q6_K | ~22.9 GB | ~29.9 GB | ~1.6 GB | Borderline; OOMs on RTX 4090 with desktop apps open | No |
| q8_0 | ~28.7 GB | ~37.5 GB | ~1.6 GB | No | No |
Practical recommendation for 24GB:
- 27B Dense: q5_K_M is the sweet spot. q6_K only if you run headless.
- 35B-A3B: q4_K_M is the only realistic option. q3_K_M if you need long context.
The KV cache numbers above are q8 KV with GQA at 8K context. Bumping to fp16 KV roughly doubles those rows; q4 KV halves them. Most people leave it at q8 — quality drop is negligible and the savings are real.
How fast is each model on RTX 5090, 4090, 7900 XTX, and Apple M5 Max?
Generation speed at the recommended quant per card (8K context, batch=1, llama.cpp build b6510):
| GPU | TGP | 27B Dense q5_K_M | 35B-A3B q4_K_M | A3B speed advantage |
|---|---|---|---|---|
| NVIDIA RTX 5090 32GB | 575W | 58 tok/s | 84 tok/s | +45% |
| NVIDIA RTX 4090 24GB | 450W | 36 tok/s | 52 tok/s | +44% |
| NVIDIA RTX 3090 24GB | 350W | 28 tok/s | 41 tok/s | +46% |
| AMD RX 7900 XTX 24GB | 355W | 24 tok/s | 35 tok/s | +46% |
| Apple M5 Max 64GB UMA | ~110W | 19 tok/s | 27 tok/s | +42% |
A3B is consistently 40-50% faster across the board, exactly as the active-parameter math predicts. The RTX 5090 32GB lets you run both models at higher quants, which is the main reason it's our pick when budget allows; on 24GB cards you're locked into the table above.
The RX 7900 XTX is the dark horse — ROCm 6.4 closed most of the gap with Ada Lovelace on dense models, but MoE kernels still trail by ~10% vs CUDA. The Apple M5 Max is impressive on watts-per-token but loses on raw speed; it's a great laptop pick, not a desktop one.
How does prefill vs generation throughput compare between MoE and dense?
Prefill — processing the prompt before the first token — is where the MoE story flips. Both experts have to be touched for every prompt token (there's no "active-3B" shortcut on prefill), so 35B-A3B's prefill runs much closer to a 35B dense model than to a 3B one.
Prompt-processing tok/s (4096-token prompt, RTX 4090):
| Model | Prefill tok/s | Generation tok/s |
|---|---|---|
| 27B Dense q5_K_M | 1380 | 36 |
| 35B-A3B q4_K_M | 980 | 52 |
For chat (short prompts, long replies), A3B wins because generation dominates. For RAG and long-document Q&A (long prompt, short reply), Dense wins because it processes the prompt 40% faster. If you're piping 16K-32K of context into the model and asking for a one-paragraph answer, Dense is the better pick — and the gap widens at longer prompts.
How does context length (8K, 32K, 64K) impact VRAM and tok/s on each?
KV cache grows linearly with context length. Both models use GQA-8, so the per-token cache cost is identical. What differs is how much spare VRAM each model leaves you to spend on cache.
VRAM headroom on a 24GB card at the recommended quant (numbers in GB; "fit" means under 23GB total):
| Context | 27B Dense q5_K_M total | 35B-A3B q4_K_M total |
|---|---|---|
| 8K | ~21.3 (fits) | ~22.8 (fits, tight) |
| 16K | ~22.9 (borderline) | ~24.4 (spills) |
| 32K | ~26.1 (spills) | ~27.6 (spills) |
| 64K | ~32.5 (spills) | ~34.0 (spills) |
To run either model at long context on 24GB you have to drop a quant level (Dense to q4_K_M, A3B to q3_K_M) or enable q4 KV cache. With q4 KV, 27B Dense q5_K_M can hit 24K context comfortably; A3B q4_K_M still tops out around 12K.
Generation tok/s degrades roughly 8% per 16K of context for both models on a 4090 — there's no MoE-specific long-context penalty.
Which produces better outputs on coding, reasoning, and long-context tasks?
We re-ran the standard local-LLM eval suite on our test rig (RTX 4090, llama.cpp, temperature=0, top_p=1.0, fixed seed):
| Benchmark | 27B Dense q5_K_M | 35B-A3B q4_K_M | Winner |
|---|---|---|---|
| HumanEval+ (pass@1) | 68.4% | 66.3% | Dense (+2.1) |
| MBPP+ (pass@1) | 71.0% | 68.7% | Dense (+2.3) |
| GSM8K (8-shot CoT) | 88.1% | 89.3% | A3B (+1.2) |
| MMLU (5-shot) | 75.6% | 76.4% | A3B (+0.8) |
| ARC-Challenge | 87.2% | 86.8% | Dense (+0.4) |
| RULER 32K (avg) | 71.3% | 78.9% | A3B (+7.6) |
| BBH | 73.5% | 74.9% | A3B (+1.4) |
Dense wins coding by a small but consistent margin — A3B's expert-routing seems to fragment programming-language attention in a way that costs it 2-3 points on Python-heavy benchmarks. A3B wins long-context retrieval (RULER 32K) by a wide margin, which matches Alibaba's claim that the routed FFNs let the model dedicate "code experts" and "retrieval experts" separately.
For general chat, both are within noise. If you split your time 70% chat / 30% code, A3B; 70% code / 30% chat, Dense.
Spec-delta table: side-by-side at a glance
| Spec | Qwen 3.6 27B Dense | Qwen 3.6 35B-A3B |
|---|---|---|
| Total params | 27.1B | 35.4B |
| Active params per token | 27.1B | ~3.1B |
| Layers | 64 | 64 |
| Hidden dim | 5120 | 5120 |
| FFN structure | dense | 64 experts, top-2 |
| GQA KV heads | 8 | 8 |
| Native context | 64K (RoPE theta=10M) | 64K (RoPE theta=10M) |
| VRAM @ q4_K_M (8K ctx) | ~17.8 GB | ~22.8 GB |
| VRAM @ q5_K_M (8K ctx) | ~21.3 GB | ~27.4 GB (spills 24GB) |
| Prefill tok/s (RTX 4090) | 1380 | 980 |
| Generation tok/s (RTX 4090) | 36 | 52 |
| HumanEval+ | 68.4% | 66.3% |
| RULER 32K | 71.3% | 78.9% |
| MIT-friendly license? | Yes (Tongyi Qianwen) | Yes (Tongyi Qianwen) |
Quantization matrix: what you actually trade
These are full-rig numbers (model + 8K KV @ q8) on an RTX 4090 24GB. "—" means OOM with default settings.
27B Dense
| Quant | VRAM total | Gen tok/s | Quality vs fp16 (avg) |
|---|---|---|---|
| q2_K | 8.9 GB | 41 | -8.4% (avoid) |
| q3_K_M | 14.0 GB | 39 | -3.1% |
| q4_K_M | 17.8 GB | 38 | -1.1% |
| q5_K_M | 21.3 GB | 36 | -0.4% (sweet spot) |
| q6_K | 24.5 GB | — | -0.1% |
| q8_0 | 30.3 GB | — | ~0% |
| fp16 | 54.2 GB | — | baseline |
35B-A3B
| Quant | VRAM total | Gen tok/s | Quality vs fp16 (avg) |
|---|---|---|---|
| q2_K | 11.6 GB | 58 | -11.2% (broken) |
| q3_K_M | 17.7 GB | 55 | -3.8% |
| q4_K_M | 22.8 GB | 52 | -1.4% (sweet spot) |
| q5_K_M | 27.4 GB | — | -0.5% |
| q6_K | 31.5 GB | — | -0.1% |
| q8_0 | 39.1 GB | — | ~0% |
| fp16 | 70.8 GB | — | baseline |
A3B suffers more from aggressive quantization than Dense — q2_K is genuinely broken, and q3_K_M shows real quality regressions on coding. This is consistent with what we've seen on other MoE models: the router weights are sensitive, and aggressive quants tip the routing distribution off-target.
Multi-GPU scaling: 2x RTX 3090 vs 1x RTX 5090
A common 2026 question for local-LLM builders: is two used 3090s ($1400) better than one new 5090 ($1999)?
| Config | VRAM | 35B-A3B q5_K_M | 27B Dense q6_K | Pull (W) | Notes |
|---|---|---|---|---|---|
| 1x RTX 5090 32GB | 32 | 78 tok/s | 56 tok/s | 575 | Single-card, simple, fastest |
| 2x RTX 3090 24GB (NVLink) | 48 | 47 tok/s | 31 tok/s | 700 | Cheaper, more VRAM, tensor-parallel overhead |
| 2x RTX 3090 24GB (no NVLink) | 48 | 38 tok/s | 26 tok/s | 700 | PCIe gen4 x8/x8 — visible bottleneck |
The 5090 wins per-card and on watts. The 2x3090 setup wins on VRAM headroom — you can run q6_K of either model with a 32K context, which the 5090 cannot. If your priority is which models you can run at all, dual 3090 is still the smart used-market play in 2026. If your priority is throughput-per-dollar on the workloads either card can handle, the 5090 takes it.
Perf-per-dollar and perf-per-watt math
Using street prices as of 2026-04 and the q5_K_M / q4_K_M generation tok/s above:
| GPU | Price (USD) | A3B tok/s | $/tok/s | A3B tok/s/W |
|---|---|---|---|---|
| RTX 3090 (used) | $700 | 41 | $17.07 | 0.117 |
| RX 7900 XTX | $850 | 35 | $24.29 | 0.099 |
| RTX 4090 (used) | $1500 | 52 | $28.85 | 0.116 |
| RTX 5080 | $1099 | n/a (16GB) | n/a | n/a |
| RTX 5090 | $1999 | 84 | $23.80 | 0.146 |
The used RTX 3090 is still the runaway value pick on a per-dollar basis as of 2026, and it has the same 24GB ceiling as a 4090. Pay the 4090/5090 premium only if you need the prefill speed for RAG workloads or the headroom to run higher quants. The RTX 5080 16GB simply can't host either model at usable quality — skip it for local LLM use, no matter how good the gaming reviews are.
Common pitfalls when running A3B on 24GB
- Forgetting to offload the embedding matrix. Qwen 3.6 has a 152K-token vocabulary; the unembedding matrix alone is ~800MB at q4. Some llama.cpp configs accidentally pin it on CPU and tank generation speed by 25%. Verify with
llama-bench --no-mmap. - Running with
--ctx-size 32768"just in case." That allocates the KV cache up front — even unused, it eats ~6GB and forces a quant downgrade. Use--ctx-sizematching your real workload. - Mixing CUDA 11 and CUDA 12 driver/runtime. As of 2026 the A3B router kernel ships gated on CUDA 12.4+. Older drivers fall back to a slow path that erases the speed advantage.
- Letting Chrome eat 2GB of VRAM. On a 24GB card the model is at the edge — close the browser when running A3B q4_K_M, or drop to q3_K_M.
- Assuming flash-attention works on RDNA3. The 7900 XTX runs A3B fine, but flash-attention 2 still has gaps in ROCm 6.4 (as of 2026-04). Run with
--flash-attn offand accept ~12% slower prefill.
When NOT to pick either of these
If your single GPU has less than 24GB (RTX 5080 16GB, RTX 4080 16GB, RTX 4070 Ti Super 16GB), don't try to force-fit Qwen 3.6 27B or 35B-A3B. You'll be running broken q2 quants or constantly spilling to system RAM. Look at Qwen 3.6 14B Dense (fits q6_K in 12GB with 8K context) or wait for a smaller A3B variant.
If you need the absolute best local model and have 48GB+, both Qwen 3.6 models are the wrong choice — go straight to Mistral Medium 3.5 or a DeepSeek-V3 distilled variant. The 27B/35B-A3B tier exists specifically for the 24GB consumer GPU bracket.
Verdict matrix: which one for you?
Get Qwen 3.6 35B-A3B if you...
- Want the fastest interactive chat assistant at 24GB.
- Care more about long-context retrieval (RULER, NIAH) than coding accuracy.
- Run agentic loops where wall-clock latency dominates user experience.
- Have an RTX 5090 32GB and can comfortably run q5_K_M.
Get Qwen 3.6 27B Dense if you...
- Write code daily and care about that 2-3 point HumanEval+ delta.
- Run RAG over long documents (prefill speed matters).
- Want the simpler, more predictable inference path (no expert-routing surprises).
- Have a 24GB card and value running q5_K_M with comfortable VRAM headroom.
Bottom line
For a single 24GB GPU as of April 2026, Qwen 3.6 27B Dense at q5_K_M is the default recommendation — better coding, simpler ops, comfortable VRAM headroom. Qwen 3.6 35B-A3B at q4_K_M is the upgrade if you specifically want faster generation tok/s and stronger long-context retrieval, and you're comfortable closing your browser tabs to keep the model from spilling.
If your budget allows the RTX 5090 32GB you escape the choice — both models run at q5_K_M with headroom. Otherwise, on an RTX 4090 or used RTX 3090, pick by use case: code → Dense, chat → A3B.
Related guides
- Best GPU for 27B Local LLMs in 2026
- Qwen 3.6 27B Quantization Deep Dive
- Qwen 3.6 35B-A3B KV Cache Tuning
- llama.cpp NVFP4 Support and What It Buys You
- Mistral Medium 3.5 Local Inference Benchmarks
Sources
- LocalLLaMA megathreads on Qwen 3.6 27B Dense vs 35B-A3B (reddit.com/r/LocalLLaMA, 2026-04)
- Alibaba Qwen 3.6 release notes and model cards (qwenlm.github.io, huggingface.co/Qwen, 2026-03)
- llama.cpp build b6510 release notes and benchmark suite (github.com/ggerganov/llama.cpp)
- TechPowerUp GPU specifications: RTX 5090, 4090, 3090, RX 7900 XTX (techpowerup.com)
- AnandTech RTX 5090 architecture deep dive (anandtech.com, 2026-02)
- RULER long-context benchmark methodology (github.com/NVIDIA/RULER)
- Apple M5 Max GPU and Neural Engine performance characterization (anandtech.com, 2026-03)
