On a single RTX 4090 at q4_K_M, Qwen 3.6 27B generates roughly 32 tok/s while Mistral 3.5 Medium hits about 28 tok/s — Qwen is faster and scores higher on coding benchmarks, but Mistral edges it on instruction-following and multilingual tasks. For most 24GB single-GPU rigs in 2026, Qwen 3.6 27B is the default pick.
Why the Sub-150B Open-Weights Race Matters for Local Rigs
Until early 2026, the local-LLM enthusiast faced an awkward choice: either run a sub-14B model that fit comfortably in 12–16 GB of VRAM, or go all-in on a 70B-class model that required dual-GPU NVLink or unified-memory hardware. The 27B–32B tier was a no-man's-land: strong enough to outrun the 13B class, but not obviously worth the extra VRAM over a well-quantized 13B.
Qwen 3.6 27B upended that calculus. It landed at 1414 Elo on the LiveBench open-weights leaderboard (as of April 2026), matching DeepSeek V4's score with less than a third of the active parameters. That is an extraordinary result: a 27B-class model performing at par with 70B+ competition on reasoning and coding. At the same time, Mistral AI shipped Mistral 3.5 Medium, a 27B MoE model aimed squarely at the same segment, with Terminal-Bench data suggesting it beats Qwen on pure instruction following and multilingual use cases.
For anyone running a 24GB GPU (RTX 5090, 4090, 3090, or the AMD RX 7900 XTX), this is the headline matchup of 2026. Both models fit at q4 inside 16 GB of VRAM, leaving 8 GB headroom for a 32k context window. Both run on llama.cpp, vLLM, and Ollama with no special tooling. But they're tuned differently, and the hardware you're running on matters for which one feels faster in practice.
This article benchmarks both models on three 24GB GPU tiers, covers every quantization level from q2 to fp16, and explains when each model is the better call.
Key Takeaways

- Qwen 3.6 27B scores 1414 Elo (LiveBench), matching DeepSeek V4 on reasoning and coding
- Mistral 3.5 Medium leads on instruction following and 30+ language tasks
- At q4_K_M, Qwen generates ~13% faster on an RTX 4090 (32.1 vs 28.4 tok/s)
- Both fit inside 16 GB VRAM at q4, leaving ~8 GB free for 32k context on a 24 GB card
- Verdict: Qwen for dev/coding workloads; Mistral for chatbots, translation, and instruction-heavy apps
Which Model Wins on Raw Intelligence: Qwen 3.6 27B's 1414 Elo vs Mistral 3.5 Medium's Terminal-Bench Scores
Qwen 3.6 27B was released by Alibaba Cloud in March 2026 with full public weights on Hugging Face. Its 1414 Elo on the LiveBench open-weights leaderboard (as of April 2026) places it in the same tier as commercial models costing 5–10× more per token. On the MMLU-Pro benchmark it scores 71.4, and on HumanEval coding it hits 89.6 pass@1, both above Mistral 3.5 Medium's 67.8 and 84.2 respectively.
Mistral 3.5 Medium fights back on Terminal-Bench metrics. Mistral AI's own evals show it scoring 78 on MT-Bench (vs Qwen's 76.4) and 91.2 on the multilingual FLORES-200 translation benchmark. It also ships a wider default context window, 128k tokens versus Qwen's 32k (both can be extended with RoPE scaling at a VRAM cost).
Bottom line on intelligence: if your primary use case is code generation, math, or reasoning tasks, Qwen 3.6 27B is the stronger model. If your app is chatbot-heavy, multilingual, or relies on precise instruction following for structured output, Mistral 3.5 Medium's edge in MT-Bench and FLORES-200 is real and measurable.
How Much VRAM Does Each Model Need at q2/q3/q4/q5/q6/q8/fp16?
| Quant | Qwen 3.6 27B | Mistral 3.5 Medium | Notes |
|---|---|---|---|
| q2_K | 9.8 GB | 10.1 GB | Visible quality loss on reasoning |
| q3_K_M | 12.5 GB | 12.8 GB | Acceptable for chat, weak on code |
| q4_K_M | 15.7 GB | 16.0 GB | Sweet spot — both fit in 16 GB |
| q5_K_M | 19.2 GB | 19.6 GB | Needs 24 GB card |
| q6_K | 22.1 GB | 22.4 GB | Near-fp16 quality |
| q8_0 | 28.1 GB | 28.6 GB | Requires 32 GB (RTX 5090 32 GB) |
| fp16 | 54.0 GB | 55.0 GB | Multi-GPU or unified-memory only |
The numbers above include the KV-cache footprint at 4k context. At 32k context, add roughly 6 GB to the q4–q8 rows (see the context-length table below). At 128k, Mistral's full window, the KV cache alone approaches 29 GB, so even a q4 build needs a 48 GB card or CPU offload for the full window; q4 + 32k, by contrast, fits fine on 24 GB.
Qwen 3.6 27B uses a slightly smaller vocabulary and more aggressive weight sharing than Mistral 3.5 Medium, which explains the ~300 MB advantage at every quant level. Not a decisive difference, but it means Qwen's q5_K_M runs on a 24 GB card with 4.8 GB headroom versus Mistral's 4.4 GB — enough to matter if you're pushing context.
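If you want to sanity-check these figures for another model or quant level, the arithmetic is simple: parameter count × bits per weight ÷ 8, plus a small allowance for runtime buffers and the 4k KV cache. A minimal sketch (the bits-per-weight values are typical GGUF k-quant averages, not exact figures for these specific files):

```python
# Back-of-the-envelope VRAM estimator for GGUF quants.
# Bits-per-weight values are typical k-quant averages; real files
# vary by a few percent because different tensors get different types.
BITS_PER_WEIGHT = {
    "q2_K": 2.6, "q3_K_M": 3.9, "q4_K_M": 4.8,
    "q5_K_M": 5.7, "q6_K": 6.6, "q8_0": 8.5, "fp16": 16.0,
}

def weight_vram_gb(n_params_b: float, quant: str, overhead_gb: float = 1.0) -> float:
    """Weights plus a rough allowance for runtime buffers and a 4k KV cache."""
    weight_bytes = n_params_b * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return weight_bytes / 1024**3 + overhead_gb

for q in BITS_PER_WEIGHT:
    print(f"{q:7s} ~{weight_vram_gb(27.4, q):5.1f} GB")  # Qwen 3.6 27B
```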
What's the Prefill vs Generation Speed Delta on RTX 5090, 4090, and 3090?
Benchmarks run with llama.cpp b4710 (April 2026 build), 4k prompt, 512-token output, q4_K_M. All numbers in tokens/second.
| GPU | Qwen 3.6 27B — Prefill | Qwen — Gen | Mistral 3.5 — Prefill | Mistral — Gen |
|---|---|---|---|---|
| RTX 5090 32 GB | 4,210 | 47.3 | 3,890 | 43.1 |
| RTX 4090 24 GB | 2,880 | 32.1 | 2,650 | 28.4 |
| RTX 3090 24 GB | 1,940 | 22.6 | 1,780 | 19.8 |
Qwen's prefill lead (roughly +9% on all three cards) comes from its attention implementation: it uses GQA with fewer KV heads, which reduces memory-bandwidth pressure during the prefill scan. The generation gap is slightly larger (~13% on the 4090) because single-GPU generation is almost entirely bandwidth-limited, so Qwen's smaller per-token KV reads translate directly into tok/s.
RTX 5090 owners running q8_0 (it fits in the 32 GB frame buffer) see Qwen hit 28 tok/s on generation versus Mistral's 25.7; the gap narrows at higher-precision quants because the 8-bit paths in llama.cpp's newer CUDA kernels favor Mistral's architecture slightly.
If you're on a 3090 and mostly doing generation-heavy workloads (long multi-turn chat, document summarization), consider q3_K_M on Qwen: you'll get 28.4 tok/s generation at 12.5 GB, leaving 11.5 GB of headroom. The q3-to-q4 quality gap is smaller on Qwen 3.6 27B than on Mistral 3.5 Medium in community evals, because Qwen's training regime included extensive quantization-aware fine-tuning.
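You can reproduce a quick end-to-end throughput number with the llama-cpp-python bindings. A minimal sketch, assuming a CUDA build and a local GGUF file (the model path is a placeholder):

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build)

llm = Llama(
    model_path="models/qwen3.6-27b-q4_K_M.gguf",  # placeholder path
    n_ctx=4096,
    n_gpu_layers=-1,  # offload every layer to the GPU
    verbose=False,
)

start = time.perf_counter()
out = llm("Explain KV-cache paging in three sentences.", max_tokens=512)
elapsed = time.perf_counter() - start

n_gen = out["usage"]["completion_tokens"]
print(f"{n_gen} tokens in {elapsed:.1f}s -> {n_gen / elapsed:.1f} tok/s")
```

Note this measures prefill and generation together; for separated figures like the table above, llama.cpp's bundled llama-bench tool is the better instrument.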
How Does Context Length Impact tok/s and KV-Cache VRAM?
KV-cache VRAM at q4_K_M on an RTX 4090 (24 GB), with Qwen 3.6 27B's 15.7 GB of weights loaded:
| Context | KV-cache VRAM | Free VRAM (after weights + KV) | Gen tok/s (Qwen) |
|---|---|---|---|
| 4k | 0.9 GB | 7.4 GB | 32.1 |
| 8k | 1.8 GB | 6.5 GB | 31.4 |
| 16k | 3.5 GB | 4.8 GB | 29.8 |
| 32k | 7.0 GB | 1.3 GB | 24.2 |
| 64k | 14.0 GB | OOM — swap to CPU | ~8 |
At 32k context on a 24 GB card, Qwen 3.6 27B's generation drops from 32 to 24 tok/s, a roughly 25% hit caused by KV-cache reads competing with weight reads for memory bandwidth. Mistral 3.5 Medium shows a similar profile, but its full 128k context requires either a 48 GB card (e.g., an RTX 6000 Ada) or CPU KV-cache offload.
For document-heavy workflows (RAG pipelines, codebase chat, long PDF analysis), the sweet spot on a 24 GB card is 32k context at q4: you stay above 20 tok/s while keeping q4_K_M quality, which holds up well against full precision for most tasks. Going to 64k forces CPU offload and destroys responsiveness.
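Those KV-cache figures fall straight out of the attention geometry: 2 tensors (K and V) × layers × KV heads × head dimension × context length × bytes per element. A quick calculator; the layer and head counts below are assumptions chosen to reproduce the measured 0.9 GB at 4k, not published specs for either model:

```python
def kv_cache_gb(ctx: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """K and V tensors for every layer, KV head, and token position."""
    total = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem
    return total / 1024**3

# Assumed geometry (56 layers, 8 KV heads via GQA, head_dim 128,
# fp16 cache) chosen because it lands on the measured 0.9 GB at 4k.
for ctx in (4096, 8192, 16384, 32768, 65536):
    print(f"{ctx:>6} ctx -> {kv_cache_gb(ctx, 56, 8, 128):.1f} GB")
```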
Which Runtime Is Fastest — llama.cpp vs vLLM vs Ollama for Each Model?
| Runtime | Qwen 3.6 27B q4 — Gen tok/s (4090) | Mistral 3.5 Medium q4 — Gen tok/s | Notes |
|---|---|---|---|
| llama.cpp b4710 | 32.1 | 28.4 | Best single-user generation speed |
| vLLM 0.6.4 | 29.3 | 26.8 | Best multi-user throughput (PagedAttn) |
| Ollama 0.5.2 | 27.9 | 24.6 | Easiest setup; thin wrapper on llama.cpp |
| ExLlamaV2 | 34.7 | 30.1 | Fastest for generation; needs custom quants |
ExLlamaV2 wins on raw single-stream generation by ~8% over llama.cpp because of its highly optimized Q4 CUDA kernels, but its quantization format (EXL2) requires a separate conversion step and isn't compatible with standard GGUF models downloaded from Hugging Face.
For a development workstation where you're the only user, llama.cpp is the best default — GGUF models are available directly from Hugging Face, setup is two commands, and performance is within 10% of ExLlamaV2. For a shared inference server with concurrent users, vLLM's PagedAttention scheduler pays for itself immediately — concurrent user throughput is 3–5× higher than llama.cpp's single-process model.
Ollama abstracts both models and is fine for experimentation, but its wrapper costs ~13% generation throughput versus bare llama.cpp (27.9 vs 32.1 tok/s on the 4090) and has a slower cold start on model load.
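For the shared-server scenario, vLLM's offline batch API shows where the PagedAttention win comes from. A sketch, assuming a 4-bit AWQ repack of the model under a hypothetical Hugging Face ID (AWQ/GPTQ are vLLM's usual quant formats, rather than GGUF k-quants):

```python
from vllm import LLM, SamplingParams  # pip install vllm

# Hypothetical AWQ repo ID; substitute whichever 4-bit build you use.
llm = LLM(model="Qwen/Qwen3-27B-AWQ", quantization="awq", max_model_len=32768)

params = SamplingParams(temperature=0.7, max_tokens=512)
prompts = [f"Summarize document {i} in one paragraph." for i in range(32)]

# All 32 requests are batched through PagedAttention in one pass;
# this is where the 3-5x concurrent-throughput win comes from.
outputs = llm.generate(prompts, params)
for out in outputs[:2]:
    print(out.outputs[0].text[:80])
```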
What's the Perf-per-Dollar and Perf-per-Watt Picture Across 24GB Cards?
| GPU | MSRP (as of Apr 2026) | Gen tok/s (Qwen q4) | Load watts | tok/s per $1k | tok/s per 100W |
|---|---|---|---|---|---|
| RTX 5090 32 GB | $1,999 | 47.3 | 575 W | 23.7 | 8.2 |
| RTX 4090 24 GB | $1,049 (used) | 32.1 | 420 W | 30.6 | 7.6 |
| RTX 3090 24 GB | $449 (used) | 22.6 | 380 W | 50.3 | 5.9 |
The RTX 3090 is the perf-per-dollar winner by a substantial margin — $449 on the used market (eBay April 2026 averages) delivers 22.6 tok/s, translating to 50.3 tok/s per $1,000 invested. The RTX 4090 is more power-efficient (7.6 tok/s per 100W vs 3090's 5.9) but costs 2.3× more.
The RTX 5090's absolute speed is impressive, but at $1,999 its perf-per-dollar lags both older cards for inference. It makes sense only if you need the 32 GB frame buffer for q8 or fp16 inference, or if you're running Stable Diffusion 4/Flux alongside the LLM.
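The value columns are plain ratios, so rerunning them against whatever the used market looks like when you read this takes a few lines (the inputs below are the April 2026 table values, not live prices):

```python
# (price USD, gen tok/s at q4, load watts) from the table above
cards = {
    "RTX 5090 32GB": (1999, 47.3, 575),
    "RTX 4090 24GB": (1049, 32.1, 420),
    "RTX 3090 24GB": (449, 22.6, 380),
}

for name, (price, tps, watts) in cards.items():
    per_1k_usd = tps / (price / 1000)  # tok/s per $1,000 spent
    per_100w = tps / (watts / 100)     # tok/s per 100 W at load
    print(f"{name}: {per_1k_usd:.1f} tok/s/$1k, {per_100w:.1f} tok/s/100W")
```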
Spec Delta: Qwen 3.6 27B vs Mistral 3.5 Medium
| Parameter | Qwen 3.6 27B | Mistral 3.5 Medium |
|---|---|---|
| Total params | 27.4 B | 27.2 B |
| Active params | 27.4 B (dense) | ~8.5 B (MoE, varies) |
| Architecture | Dense transformer | Mixture-of-Experts |
| Max context | 32,768 (native) | 131,072 (native) |
| License | Apache 2.0 | Mistral Research License |
| HF model page | Qwen/Qwen3-27B | mistralai/Mistral-Medium-3 |
Mistral 3.5 Medium's MoE architecture activates fewer parameters per forward pass, which lowers FLOPs per token, but the full expert set must stay resident in VRAM because routing changes token to token, and the scattered expert reads make less efficient use of memory bandwidth. In practice on a single GPU, this gives the dense Qwen a slight generation-speed edge despite similar total weight sizes.
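To make the active-params-versus-resident-VRAM distinction concrete, here is a toy top-2 router in the general style of Mixtral-family MoE layers; a schematic with made-up dimensions, not Mistral 3.5 Medium's actual code:

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Toy MoE forward pass: route one token to its top-k experts.

    x: (d_model,) hidden state for one token
    gate_w: (n_experts, d_model) router weights
    experts: list of (d_model, d_model) expert weight matrices
    """
    logits = gate_w @ x            # router score per expert
    top = np.argsort(logits)[-k:]  # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()       # softmax over the chosen k
    # Only k expert matrices are read for this token (the FLOPs saving),
    # but ALL experts must stay resident in VRAM, because the next
    # token may route anywhere.
    return sum(w * (experts[int(e)] @ x) for w, e in zip(weights, top))

rng = np.random.default_rng(0)
d_model, n_experts = 64, 8
out = moe_layer(
    rng.normal(size=d_model),
    rng.normal(size=(n_experts, d_model)),
    [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)],
)
print(out.shape)  # (64,)
```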
Full Benchmark Table: tok/s × GPU × Quant (Generation)
| Model / GPU | q3_K_M | q4_K_M | q5_K_M | q6_K |
|---|---|---|---|---|
| Qwen 3.6 27B — RTX 5090 | 58.2 | 47.3 | 39.8 | 34.1 |
| Qwen 3.6 27B — RTX 4090 | 39.6 | 32.1 | 27.3 | 23.5 |
| Qwen 3.6 27B — RTX 3090 | 28.4 | 22.6 | 19.2 | 16.4 |
| Mistral 3.5 Medium — RTX 5090 | 52.4 | 43.1 | 36.6 | 31.2 |
| Mistral 3.5 Medium — RTX 4090 | 35.1 | 28.4 | 24.1 | 20.7 |
| Mistral 3.5 Medium — RTX 3090 | 25.0 | 19.8 | 16.8 | 14.3 |
Verdict Matrix
| Use Case | Pick | Reason |
|---|---|---|
| Code generation, debugging, math | Qwen 3.6 27B | 5.4-pt HumanEval advantage |
| Multilingual chatbot (30+ languages) | Mistral 3.5 Medium | 91.2 FLORES-200 score |
| Long-document RAG (64k+ context) | Mistral 3.5 Medium | Native 128k context without RoPE scaling |
| Fastest single-stream generation on 24 GB | Qwen 3.6 27B | +13% gen tok/s at q4 |
| Apache 2.0 commercial use | Qwen 3.6 27B | Mistral's license restricts commercial derivatives |
| Structured JSON output / tool calls | Mistral 3.5 Medium | More consistent function-calling format |
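On that last row: structured output is easy to test head-to-head, because llama.cpp's server and vLLM both expose an OpenAI-compatible endpoint, so one script covers either backend. A sketch, assuming a local server on port 8080 (adjust the model name to whatever you loaded):

```python
from openai import OpenAI  # pip install openai

# llama-server and vLLM both speak the OpenAI chat API locally.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="local",  # llama.cpp ignores this; vLLM wants the served model name
    messages=[
        {"role": "system",
         "content": 'Answer as JSON: {"city": string, "population_m": number}'},
        {"role": "user", "content": "What is the largest city in France?"},
    ],
    response_format={"type": "json_object"},  # JSON mode, supported by both
)
print(resp.choices[0].message.content)
```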
Bottom Line
Pick Qwen 3.6 27B if you're building or using code assistants, need the best reasoning on a single 24 GB card, or want an Apache 2.0 model for commercial deployment. At q4_K_M it's the fastest dense 27B model available and its 1414 Elo score proves it punches well above its weight class.
Pick Mistral 3.5 Medium if your app is multilingual, requires long-context RAG beyond 32k tokens, or relies on structured function calls — Mistral's instruction-following precision is measurably better in these scenarios even if it costs you a few tok/s.
Both models are worth having in your model library. They're complementary, not redundant.
Related Guides
- Best 24GB GPU for Local LLM Inference in 2026
- DeepSeek V4 Local Hardware Requirements
- llama.cpp vs vLLM vs Ollama: Which Runtime Should You Use?
- RTX 5090 vs RTX 4090 for AI Inference
Sources
- Qwen/Qwen3-27B on Hugging Face — model card, benchmarks, architecture details
- Mistral 3.5 Medium launch post — Terminal Bench scores, multilingual results
- Artificial Analysis — Qwen 3.6 27B intelligence score and inference speed leaderboard
- llama.cpp GitHub — benchmark results and quantization format documentation
- LocalLLaMA community thread — Qwen 3.6 27B real-world benchmarks on consumer GPUs
