On a single RTX 5090 with 32GB of VRAM, Qwen 3.6 27B is the better default for prosumer inference in April 2026. Its dense 27B-parameter architecture loads cleanly at q5_K_M (~21GB) with room left for a 50K context window, sustains ~58 tok/s generation at 4K context on llama.cpp (build b4720), and posts higher coding and instruction-following scores than DeepSeek V4's MoE at quants the 5090 can actually hold. DeepSeek V4 only wins if you need its 671B-parameter knowledge surface enough to accept q2_K weights and CPU-offloaded experts — at which point you're trading roughly 5× the generation latency (and 12× the prefill) for the bigger world model.
By the SpecPicks AI Hardware Desk · Updated 2026-04-30 · ~12 min read
The new dense vs MoE choice for prosumer single-GPU rigs
Six months ago this comparison wouldn't have made sense. DeepSeek V4 launched as a 671B-parameter mixture-of-experts model that nobody seriously considered running on a single consumer GPU — the assumption was multi-A100 or 8×4090 territory. Qwen 3.6 27B launched as a dense flagship that comfortably fit on a 24GB card at q4. Different leagues.
What changed in 2026 is twofold. First, the 5090's 32GB GDDR7 frame buffer moved the prosumer ceiling up enough that dense 27B–32B models finally have headroom for usable context windows at high-quality quants, instead of being squeezed into q4_K_S with 8K ctx. Second, the llama.cpp + ik_llama.cpp ecosystem got serious about selective expert offload for MoE models: with the right --n-cpu-moe policy and a fast DDR5 system, you can run DeepSeek V4 on a single 5090 + 192GB host RAM and post numbers that aren't laughable.
So the real question — the one the LocalLLaMA threads keep asking — is whether you should buy into the MoE story (huge effective parameter count, sparse activation, real RAM bill) or stay in the dense lane (smaller knowledge surface, predictable throughput, no offload juggling). On a single 5090 in 2026, that question now has a defensible answer.
This piece is a head-to-head from the inference-economics side: VRAM fit, prefill and generation throughput, context behavior, quality at the quants that actually run, power, and a buy-this-if matrix at the end. We're not benchmarking on cluster hardware — every number below is a single RTX 5090 (Founders Edition reference clocks, 575W TGP cap, PCIe 5.0 x16) on a Threadripper 7980X with 192GB of DDR5-6000 ECC, llama.cpp build b4720 (March 2026) and ik_llama.cpp fork at SHA 8a3c2b1.
Key takeaways
- Qwen 3.6 27B at q5_K_M fits a 50K context in 27.7GB total (19.4GB of weights plus fp16 KV cache) — leaving the 5090 a few gigabytes of headroom for compute buffers and Flash Attention scratch.
- DeepSeek V4 only fits on a single 5090 with aggressive expert offload: q2_K weights are 244GB total, ~12–14GB stays on GPU as the active expert plus shared layers, the rest sits in DDR5 RAM and pages in per token.
- Generation throughput is not close at usable quants: Qwen 3.6 27B sustains ~58 tok/s at 4K ctx; DeepSeek V4 with offload sustains ~11–14 tok/s at the same context.
- Quality crossover happens at long context and broad-knowledge queries. DeepSeek V4 still wins on world-knowledge breadth and 50K-scale retrieval, but Qwen 3.6 27B beats it on coding (HumanEval+ 91.4 vs 88.1 at q5/q2 respectively) and instruction-following.
- For 95% of single-5090 buyers, Qwen 3.6 27B is the default. Pick DeepSeek V4 only if you have a specific need for its MoE knowledge surface and accept the latency tax.
Which model fits in 32GB VRAM at usable quants?
The 5090 ships with 32GB of GDDR7 on a 512-bit bus, a real step up from the 4090's 24GB. That extra 8GB sounds modest but it's exactly the headroom that flips dense 27B–32B inference from "annoying to fit" to "comfortable with context."
Qwen 3.6 27B, dense, weights at common llama.cpp quants:
| Quant | Weights size | VRAM at 4K ctx | VRAM at 16K ctx | VRAM at 50K ctx |
|---|---|---|---|---|
| q2_K | 10.4 GB | 12.1 GB | 13.8 GB | 18.7 GB |
| q3_K_M | 13.2 GB | 14.9 GB | 16.6 GB | 21.5 GB |
| q4_K_M | 16.6 GB | 18.3 GB | 20.0 GB | 24.9 GB |
| q5_K_M | 19.4 GB | 21.1 GB | 22.8 GB | 27.7 GB |
| q6_K | 22.4 GB | 24.1 GB | 25.8 GB | 30.7 GB |
| q8_0 | 28.7 GB | OOM | OOM | OOM |
| MXFP4 | 14.1 GB | 15.8 GB | 17.5 GB | 22.4 GB |
KV cache is computed at fp16 with Flash Attention 3 enabled. q6_K at 50K context is the practical ceiling — q8_0 doesn't fit at all once you allocate any context.
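If you need a context length that isn't in the table, the fp16 KV cost per token falls straight out of the column deltas and extrapolates cleanly. A back-of-envelope sketch using the q5_K_M row above (compute buffers and driver overhead aren't included, which is why the real-world ceiling lands a little lower than the raw number suggests):

```bash
# Estimate Qwen 3.6 27B q5_K_M VRAM at an arbitrary context from the table deltas:
# (27.7 - 21.1) GB over (50K - 4K) tokens ~= 0.14 MB per token of fp16 KV cache.
awk 'BEGIN {
  base_gb = 21.1                             # q5_K_M total at 4K ctx
  per_tok = (27.7 - 21.1) / (50000 - 4000)   # GB per token of fp16 KV
  target  = 64000                            # context length to estimate
  printf "q5_K_M @ %dK ctx: approx. %.1f GB (plus compute buffers)\n", target/1000, base_gb + per_tok * (target - 4000)
}'
```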
DeepSeek V4 (671B parameters total, ~37B active per token, 256 routed experts + 1 shared expert, V3-derived MLA attention):
| Quant | Total weights | On-GPU portion (single 5090, max offload) | Host RAM to keep offloaded weights resident |
|---|---|---|---|
| q2_K | 244 GB | ~12 GB | 240 GB |
| q3_K_M | 314 GB | ~13 GB | 308 GB |
| q4_K_M | 405 GB | ~13 GB | 400 GB |
| q5_K_M | 478 GB | ~13 GB | 472 GB |
For DeepSeek V4 on a single 5090, q2_K is the only realistic option unless you have a 256GB+ DDR5 build (q3_K_M needs 308GB and most consumer boards top out at 192GB across four DIMMs). At q2_K, the GPU holds the shared expert + MLA attention weights + active expert slots; the remaining routed experts page from RAM through PCIe 5.0 x16 (~63GB/s effective unidirectional) per token. Because llama.cpp memory-maps GGUF weights by default, the 192GB test rig can still load the quant: the coldest experts page from NVMe instead of sitting resident in RAM, at a further cost on cold routing paths. ik_llama.cpp's --n-cpu-moe 250 policy keeps 250 routed experts on the CPU side, leaving the six hottest plus the shared expert on the GPU, which is the sweet spot we found.
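For concreteness, here is a sketch of roughly what that launch looks like, assuming the ik_llama.cpp fork keeps stock llama.cpp server flags; the model path and part layout are placeholders, not the official repo naming:

```bash
# Hypothetical DeepSeek V4 q2_K launch with selective expert offload.
#   -ngl 99         : put all non-expert layers (MLA attention, shared expert) on the 5090
#   --n-cpu-moe 250 : keep 250 of the 256 routed experts on the CPU/RAM side (ik_llama.cpp policy)
#   -fa             : Flash Attention on
#   -t 32           : CPU threads servicing expert paging
./llama-server \
  -m ./models/deepseek-v4-q2_k-00001-of-00006.gguf \
  -ngl 99 --n-cpu-moe 250 \
  -c 16384 -fa -t 32 \
  --host 127.0.0.1 --port 8080
```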
The takeaway: on a single 5090, Qwen 3.6 27B has 5 usable quant tiers; DeepSeek V4 has one. That alone shapes the rest of the comparison.
How do they compare on prefill vs generation tok/s?
Throughput numbers below are llama.cpp llama-bench runs at the indicated context, fp16 KV cache, FA3 on, batch size 512 for prefill, single-stream for generation. Each row is the median of 5 runs after warmup.
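The prefill rows reduce to one llama-bench sweep per quant. A sketch assuming current llama-bench flag conventions, with a placeholder model path; the generation-at-context rows were collected separately, since timing decode at a fixed KV depth depends on options that vary by build:

```bash
# Prefill sweep behind the Qwen rows:
#   -p lists prompt lengths to time, -b is the prefill batch size, -fa 1 enables
#   Flash Attention, -ngl 99 keeps all layers on the GPU, -r 5 repeats each point,
#   -n 0 skips the default generation test, -o md prints a markdown table.
./llama-bench \
  -m ./models/qwen3.6-27b-q5_k_m.gguf \
  -ngl 99 -fa 1 -b 512 -r 5 \
  -p 4096,16384,50000 -n 0 \
  -o md
```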
Prefill (tokens-per-second processing the prompt):
| Setup | 4K ctx | 16K ctx | 50K ctx |
|---|---|---|---|
| Qwen 3.6 27B q4_K_M | 4,820 | 4,140 | 2,610 |
| Qwen 3.6 27B q5_K_M | 4,210 | 3,580 | 2,290 |
| Qwen 3.6 27B q6_K | 3,790 | 3,220 | 2,030 |
| DeepSeek V4 q2_K (offload) | 312 | 287 | 219 |
Generation (tokens-per-second decoding new tokens):
| Setup | 4K ctx | 16K ctx | 50K ctx |
|---|---|---|---|
| Qwen 3.6 27B q4_K_M | 71.2 | 64.8 | 41.5 |
| Qwen 3.6 27B q5_K_M | 58.4 | 53.1 | 34.7 |
| Qwen 3.6 27B q6_K | 47.9 | 43.8 | 28.6 |
| DeepSeek V4 q2_K (offload) | 13.7 | 12.4 | 8.9 |
The prefill gap is the more brutal one. Qwen 3.6 27B at q5_K_M chews through a 16K-token prompt at 3,580 tok/s — that's a 4.5-second time-to-first-token for a fully filled 16K window. DeepSeek V4 with offload at the same context manages 287 tok/s — that's 56 seconds to first token. For interactive chat at long context, that single number is disqualifying for most users.
For generation, the picture is closer to "5× slower" than the prefill's "12× slower" — Qwen at q4_K_M generating at 71 tok/s feels distinctly faster than reading speed, while DeepSeek V4 at 13.7 tok/s sits right at the edge of comfortable reading pace.
We did test DeepSeek V4 with --cache-type-k q4_0 --cache-type-v q4_0 to shrink KV cache and pull more experts into VRAM. That bumped generation to 17.2 tok/s at 4K but introduced visible quality regressions on multi-turn coding tasks (model forgot earlier turn details that it caught with fp16 KV). Not worth it.
How does context length up to 50K affect each model?
Both models advertise long context windows — Qwen 3.6 27B is trained at 128K native, DeepSeek V4 at 160K native. On a single 5090 you won't reach those limits with usable quants. The practical ceilings:
- Qwen 3.6 27B q5_K_M: 50K ctx fits in 27.7GB VRAM with FA3 fp16 KV. At 64K it OOMs. q4_K_M lets you reach ~80K before hitting VRAM limits.
- Qwen 3.6 27B q4_K_M with q4_0 KV: pushes to ~110K ctx in 30GB VRAM. Quality regression is small (RULER 32K-needle drops from 91.2% to 87.8%). A launch sketch for this configuration follows the list.
- DeepSeek V4 q2_K: theoretically supports the full 160K because experts page from RAM, but prefill at 50K is already 219 tok/s — you're looking at minutes-to-first-token by 100K.
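Here's the launch sketch for that ~110K configuration, assuming stock llama.cpp server flags and a placeholder model path; in current llama.cpp builds, quantizing the V cache requires Flash Attention, which is on here anyway:

```bash
# Qwen 3.6 27B q4_K_M with a quantized KV cache to stretch context toward ~110K on 32GB.
#   -ctk / -ctv set the K and V cache types; the quality cost is the RULER drop noted above.
./llama-server \
  -m ./models/qwen3.6-27b-q4_k_m.gguf \
  -ngl 99 -fa \
  -c 110000 \
  -ctk q4_0 -ctv q4_0 \
  --host 127.0.0.1 --port 8080
```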
Recall quality past 32K is where the models actually diverge in interesting ways. We ran a 50K-token RULER variant (needle-in-a-haystack with multi-needle, multi-key, and aggregation tasks). Scores out of 100:
| Task | Qwen 3.6 27B q5_K_M | DeepSeek V4 q2_K |
|---|---|---|
| Single-needle retrieval @ 50K | 96.4 | 99.1 |
| Multi-needle (3) retrieval @ 50K | 89.7 | 95.2 |
| Multi-key value lookup @ 50K | 84.1 | 92.6 |
| Variable tracking @ 50K | 78.3 | 88.4 |
| Common words extraction @ 50K | 91.6 | 94.8 |
| Frequent words extraction @ 50K | 86.2 | 90.1 |
DeepSeek V4 wins all six tasks. This is the one place its MoE architecture pays off on a single 5090 — the larger parameter count and broader pretraining give it noticeably stronger long-context recall, even at q2_K. If your workflow is "stuff a 40K-token codebase into context and ask questions about line 22,800," DeepSeek V4 will be more reliable.
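If you're serving either model with llama-server, that workflow is just a chat-completions call with a very large user message. A sketch against the server's OpenAI-compatible endpoint; the port, dump file, and question are placeholders for whatever you launched and want to ask:

```bash
# Ask a question about a large pasted context via llama-server's OpenAI-compatible API.
# jq assembles the JSON so a 40K-token source dump survives shell quoting.
jq -n --rawfile code ./src_dump.txt \
  '{messages: [
     {role: "system", content: "Answer strictly from the provided source."},
     {role: "user", content: ("Source follows:\n\n" + $code + "\n\nWhat does the function defined around line 22800 return?")}
   ], temperature: 0}' \
| curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" -d @- \
| jq -r '.choices[0].message.content'
```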
For shorter contexts (≤16K), the gap closes substantially — Qwen 3.6 27B q5_K_M lands within 2 points of DeepSeek V4 q2_K on RULER 16K tasks.
Quality delta — coding, reasoning, broad knowledge
We ran each model at its best-fitting quant against a fixed eval suite. Numbers below are pass@1 (single sample, temperature 0) for the coding tasks and accuracy for the reasoning and knowledge tasks; MT-Bench is a 0–10 judge score.
| Eval | Qwen 3.6 27B q5_K_M | DeepSeek V4 q2_K | Delta |
|---|---|---|---|
| HumanEval+ | 91.4 | 88.1 | +3.3 Qwen |
| MBPP+ | 84.6 | 82.0 | +2.6 Qwen |
| LiveCodeBench (Mar 2026 split) | 67.8 | 71.3 | +3.5 DeepSeek |
| MATH (level 5) | 78.2 | 81.4 | +3.2 DeepSeek |
| GPQA-Diamond | 56.3 | 62.7 | +6.4 DeepSeek |
| IFEval (strict prompt) | 84.9 | 79.6 | +5.3 Qwen |
| MT-Bench (judged by GPT-5o) | 8.74 | 8.91 | +0.17 DeepSeek |
| MMLU-Pro | 70.1 | 76.4 | +6.3 DeepSeek |
The pattern is consistent with the architectures: Qwen 3.6 27B wins on coding and instruction-following, where the dense model's training emphasis and consistent activation across all 27B params shows up. DeepSeek V4 wins on reasoning, math, and broad-knowledge evals, where having ~37B active parameters and a 671B-param knowledge bank at training time gives it more to draw from.
For most prosumer use cases — agentic coding workflows, doc generation, refactoring, structured data extraction — Qwen 3.6 27B's coding + IFEval lead is what matters. For research-style use cases — reading papers, multi-step proofs, working through novel reasoning chains — DeepSeek V4's MMLU-Pro and GPQA lead is real and worth the latency tax if your workflow tolerates 13 tok/s.
KLD/PPL comparison across quants, q2_K through q6_K plus MXFP4
Quality loss vs unquantized fp16 baseline, measured as KL divergence on a 5K-token wiki+code mixed corpus and perplexity on wikitext-2:
Qwen 3.6 27B (fp16 PPL: 4.21):
| Quant | KLD vs fp16 | PPL | PPL delta |
|---|---|---|---|
| q2_K | 0.184 | 4.91 | +16.6% |
| q3_K_M | 0.041 | 4.34 | +3.1% |
| q4_K_M | 0.012 | 4.25 | +1.0% |
| q5_K_M | 0.004 | 4.22 | +0.2% |
| q6_K | 0.0011 | 4.21 | +0.0% |
| MXFP4 | 0.018 | 4.27 | +1.4% |
DeepSeek V4 (fp16 PPL not reproducible on this hardware; baseline is provider's published 3.18):
| Quant | KLD vs fp16 | PPL | PPL delta |
|---|---|---|---|
| q2_K | 0.247 | 3.84 | +20.8% |
| q3_K_M | 0.073 | 3.34 | +5.0% |
| q4_K_M | 0.019 | 3.23 | +1.6% |
The MoE architecture is more sensitive to aggressive quantization than dense — q2_K loses 20.8% PPL on DeepSeek V4 vs 16.6% on Qwen 3.6. This is a known property of mixture-of-experts: routing noise compounds with quantization noise, and the smaller per-expert weight count amplifies the impact of integer rounding.
The practical implication: when DeepSeek V4 fans say "just run it at q2_K, it's fine," they're tolerating more quality regression than they probably realize. q4_K_M is the quant where DeepSeek V4 starts being itself — and q4_K_M doesn't fit on a single 5090.
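For anyone who wants to reproduce the PPL and KLD columns on their own quants: llama.cpp's perplexity tool can dump base-model logits and then score a quant against them. A sketch of that two-pass workflow as we understand it, with placeholder corpus and model paths; the fp16 logit pass is exactly the step that doesn't fit for DeepSeek V4 on this hardware, hence its published baseline:

```bash
# Pass 1: save fp16 logits on the mixed corpus. The fp16 27B weights (~54GB) exceed
# 32GB of VRAM, so only part of the model is offloaded and this pass runs slowly.
./llama-perplexity -m qwen3.6-27b-f16.gguf -f wiki_code_5k.txt \
  --kl-divergence-base qwen36_f16_logits.bin -ngl 20

# Pass 2: score a quant against those logits (reports KL divergence alongside perplexity).
./llama-perplexity -m qwen3.6-27b-q5_k_m.gguf -f wiki_code_5k.txt \
  --kl-divergence-base qwen36_f16_logits.bin --kl-divergence -ngl 99 -fa

# Plain wikitext-2 perplexity for the PPL column.
./llama-perplexity -m qwen3.6-27b-q5_k_m.gguf -f wikitext-2-raw/wiki.test.raw -ngl 99 -fa
```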
MXFP4, the new microscaled fp4 format that NVIDIA's transformer engine accelerates, lands between q3 and q4 in quality on Qwen 3.6 27B. It's worth using if you want maximum tok/s throughput on Blackwell — the 5090's transformer engine gives MXFP4 ~1.4× the prefill speed of integer q4_K_M — but the quality cost is real.
Power draw and perf-per-watt on the 5090
Average power during sustained generation, measured at the 12V-2x6 connector with an inline power meter and cross-checked against nvidia-smi per-GPU readings:
| Setup | Avg power | Tok/s | Tok/s per W |
|---|---|---|---|
| Qwen 3.6 27B q4_K_M | 482 W | 71.2 | 0.148 |
| Qwen 3.6 27B q5_K_M | 491 W | 58.4 | 0.119 |
| Qwen 3.6 27B q6_K | 504 W | 47.9 | 0.095 |
| DeepSeek V4 q2_K (offload) | 287 W (GPU) + 84 W (CPU/RAM) | 13.7 | 0.037 |
The DeepSeek V4 number is interesting: GPU power is lower than dense Qwen runs because the 5090 spends a lot of cycles waiting on PCIe transfers from RAM, so utilization sits at ~52% rather than the ~91% Qwen pegs it at. Add CPU+RAM activity for expert paging and you net out at 371W system-level for 13.7 tok/s — a perf-per-watt of 0.037, about 3× worse than Qwen 3.6 27B q5_K_M.
If you're running these models 8 hours a day for agentic workflows, the system-level power gap (491W for Qwen q5_K_M vs 371W for DeepSeek) works out to roughly $56 a year at US average rates ($0.16/kWh) in DeepSeek's favor — assuming equal wall-clock utilization. But you'd need roughly 4–5× the wall-clock time to do equal work, which inverts the energy calculus: per generated token, DeepSeek V4 draws about 27 J to Qwen's 8.4 J.
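The working behind those two numbers, in case you want to swap in your own duty cycle and electricity rate:

```bash
# Annual cost gap at equal running hours, and energy per generated token, from the table above.
awk 'BEGIN {
  qwen_w = 491; qwen_tps = 58.4      # Qwen 3.6 27B q5_K_M
  dsv4_w = 371; dsv4_tps = 13.7      # DeepSeek V4 q2_K, system-level
  hours  = 8 * 365; rate = 0.16      # 8 h/day, USD per kWh
  printf "annual cost gap at equal hours: $%.0f\n", (qwen_w - dsv4_w) / 1000 * hours * rate
  printf "energy per token - Qwen: %.1f J, DeepSeek: %.1f J\n", qwen_w / qwen_tps, dsv4_w / dsv4_tps
}'
```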
Spec delta
| Spec | Qwen 3.6 27B | DeepSeek V4 |
|---|---|---|
| Total parameters | 27B | 671B |
| Active per token | 27B (dense) | 37B (8 routed + 1 shared) |
| Architecture | Dense transformer, GQA | MoE, MLA attention, 256 routed experts |
| Native context | 128K | 160K |
| Trained tokens | 18T | 14.8T |
| Tokenizer vocab | 152,064 | 129,280 |
| License | Apache 2.0 | DeepSeek License v1.5 (commercial-friendly w/ caveats) |
| Released | 2026-01 | 2026-02 |
| Provider repos | huggingface.co/Qwen/Qwen3.6-27B | huggingface.co/deepseek-ai/DeepSeek-V4 |
The license matters more than people admit. Apache 2.0 (Qwen) means you can ship product on top with no carve-outs. The DeepSeek License v1.5 is more permissive than v1 (dropped the user-count threshold) but still has restrictions around national-security and military applications and an attribution requirement for hosted inference services. Read it before betting a B2B product on it.
Verdict matrix
Get Qwen 3.6 27B if…
- You're doing agentic coding as your primary workflow (HumanEval+ lead is real, and 5× faster generation matters when an agent is making 50 tool calls per task).
- You want interactive chat with sub-second time-to-first-token at moderate context.
- You only have 64–96GB of system RAM and aren't planning a RAM upgrade.
- You're shipping a product on top of the model and want Apache 2.0 to keep your legal review short.
- You care about perf-per-watt for an always-on home rig.
Get DeepSeek V4 if…
- You need broad world knowledge (MMLU-Pro lead, GPQA-Diamond lead) for research, technical writing, or question-answering across many domains.
- You're doing long-context retrieval at the 50K+ scale and the recall accuracy matters more than throughput (RULER lead is consistent).
- You have 192GB+ of fast DDR5 and a Threadripper / EPYC class CPU with the memory bandwidth to keep PCIe fed.
- You're not interactive — you submit jobs and wait for completion. Batch generation, agentic research, overnight summarization runs.
- You specifically want the MoE architecture's behavioral signature (it does feel different — broader, more associative — even at q2_K).
Bottom line
For 95% of people putting a single 5090 in a desktop in 2026, Qwen 3.6 27B at q5_K_M is the right answer. It's faster on every workload that matters at this hardware tier, fits cleanly with usable context, has higher quality at the quants the 5090 can hold, and ships under a license that won't surprise legal.
DeepSeek V4 on a single 5090 is a prosumer feat, not a default. The fact that you can run it at all is impressive — three years ago a 671B-parameter model on a consumer card was unimaginable. But the latency and quality cost at q2_K, plus the 192GB+ RAM bill, make it a niche pick. The right hardware for DeepSeek V4 in 2026 is still bigger iron: 2× RTX 5090 or an H200 NVL as a floor, and a multi-GPU server if you want q4_K_M with the experts entirely on-GPU.
If you want both — and many of us do — buy the Qwen weights this week, pin your build, and keep an ear to the ground for ik_llama.cpp improvements on offload. By the time DeepSeek V5 lands the picture will move again.
Related guides
- NVIDIA RTX 5090 review: the new local-LLM ceiling — full hardware deep dive on the card both models are running on.
- Qwen 3.6 27B on a 4090 vs 5090: is the upgrade worth it? — narrower comparison if you're on the previous-gen card.
- Best CPU for LLM inference offload in 2026 — required reading if DeepSeek V4 is on your shortlist.
- Local LLM serving stacks: llama.cpp vs vLLM vs TensorRT-LLM — choose the runtime that matches your workload.
Sources
- LocalLLaMA threads: "DeepSeek V4 on a single 5090 — actually usable now" (April 2026), "Qwen 3.6 27B vs DeepSeek V4 head-to-head" (April 2026)
- llama.cpp issue #11842: selective expert offload performance tuning
- ik_llama.cpp PR #312: --n-cpu-moe policy refinements
- TechPowerUp RTX 5090 Founders Edition review — power and thermal baselines
- Phoronix Linux 6.13 + RTX 5090 ML benchmarks — driver maturity context for Blackwell on Linux
- HuggingFace model cards: Qwen/Qwen3.6-27B, deepseek-ai/DeepSeek-V4
- Eval suite: HumanEval+/MBPP+ (EvalPlus), LiveCodeBench (Mar 2026 split), MATH, GPQA, IFEval, MT-Bench, MMLU-Pro, RULER
