To run Mistral Medium 3.5 (a ~70B-parameter dense model) locally as of 2026, you need at minimum 24GB of VRAM at q3_K_M for chat-style use, 48GB at q5_K_M for production-grade output, or 96GB+ for q8 fidelity (full fp16 needs ~141GB). A single RTX 5090 32GB handles q4-class quants with a quantized KV cache or a partial system-RAM offload; a dual RTX 6000 Ada setup (96GB total) is the sweet spot for q8 work; and an Apple M5 Ultra 192GB is the no-compromise fp16 pick for batch jobs that don't care about prefill latency.
Why Mistral's pivot back to thicc dense models matters for buyers
Mistral spent 2024 and most of 2025 chasing the MoE crowd — Mixtral 8x22B, then a series of mid-size sparse models. Mistral Medium 3.5, released in early 2026, reverses course: it's a 70B dense model with no expert routing, no top-k gating, and no kernel surprises. Mistral's release notes (mistral.ai, 2026-02) framed this as a deliberate response to "deployment fragility" complaints — sparse-MoE models are great on paper but a pain in production when serving frameworks, kernels, and quantization libraries each implement routing slightly differently.
For local-LLM buyers in 2026, the dense pivot has three immediate consequences. First, VRAM cost is honest: if a quantized weight file is 42GB, you need 42GB of VRAM plus KV cache. There's no MoE accounting trick where total parameters dwarf active parameters. Second, prefill scales predictably: prompt processing throughput is a clean function of FLOPS-per-token, no router imbalance penalty. Third, and most importantly for hardware shopping, the model rewards bigger GPUs: every dollar you spend on more VRAM directly buys you more context, better quants, and lower spillage — none of which is true with MoE models that get diminishing VRAM returns past a certain quant level.
The LocalLLaMA "Mistral THICC DENSE BOI" thread (reddit.com/r/LocalLLaMA, mid-April 2026) crystallized the buyer question: with consumer cards stuck at 32GB and prosumer cards at 48-96GB, what's the minimum you need to actually run Mistral Medium 3.5, and what's the price floor at each tier of quality? That's the question this guide answers.
Key takeaways
- Minimum viable: 24GB single GPU at q3_K_M — usable for chat, painful at long context. RTX 4090 / 3090 / 7900 XTX qualify.
- Recommended single-GPU: RTX 5090 32GB at q4_K_M with a quantized KV cache or a light system-RAM offload — 38-44 tok/s, ~$2000.
- Recommended dual-GPU: 2× RTX 3090 (48GB total) at q5_K_M — under $1500 used, 28-32 tok/s.
- Apple Silicon sweet spot: M5 Ultra 192GB at q5_K_M — 22-26 tok/s, 90W, runs fp16 if you want it.
- Verdict: RTX 5090 if you want fast and easy; 2× 3090 if you want cheap; M5 Ultra if you want quiet, cool, and can wait an extra second per token.
What is Mistral Medium 3.5 and why is it dense (not MoE)?
Mistral Medium 3.5 is a 70.6-billion-parameter dense decoder-only transformer with 80 layers, 8192 hidden dim, GQA with 8 KV heads, and RoPE configured for native 128K context. Architecturally it's closest to Llama 3.3 70B but with Mistral's window-attention variant and a rebuilt tokenizer that's about 14% more compact on European-language code than Llama's.
The "dense" framing matters because Mistral's prior generation (Mixtral 8x22B, Mistral Large 2) leaned MoE. Mistral's release notes for 3.5 explicitly flag the dense choice as a quality decision: their internal eval showed dense 70B beating their best MoE configurations on instruction-following and code generation by 3-6 points across HumanEval+, IFEval, and a private long-context benchmark. The cost is hardware: you need to physically fit 70B parameters in memory, where the prior MoE could distribute them across more, smaller cards.
For buyers, dense means VRAM is the only thing that matters. A 70B model at q4_K_M is roughly 42GB of weights, period. You can split those across multiple GPUs with tensor parallelism (vLLM, TabbyAPI, llama.cpp's `--split-mode row`) or you can buy one bigger card. There is no MoE-style trick where active parameters are smaller than total parameters.
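If you want to sanity-check the weight sizes in the next section, a dense GGUF file is roughly parameter count times bits per weight. A minimal sketch, assuming typical llama.cpp k-quant bits-per-weight averages (approximations, not figures published for this model):

```python
# Rough GGUF weight-file size: parameters * bits-per-weight / 8.
# PARAMS is the 70.6B figure quoted above; the bits-per-weight values are typical
# llama.cpp k-quant averages (assumptions), so results land within a GB or two of
# the table below rather than matching it exactly.
PARAMS = 70.6e9

BPW = {          # approximate average bits per weight
    "q3_K_M": 3.9,
    "q4_K_M": 4.8,
    "q5_K_M": 5.7,
    "q6_K":   6.6,
    "q8_0":   8.0,
    "fp16":  16.0,
}

for quant, bpw in BPW.items():
    print(f"{quant:7s} ~{PARAMS * bpw / 8 / 1e9:5.1f} GB")
```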
How much VRAM does Mistral Medium 3.5 actually need?
Weight sizes by quantization (llama.cpp gguf format, build b6510 as of 2026-04):
| Quant | Weight size | Quality vs fp16 (Mistral's eval) | Recommended for |
|---|---|---|---|
| q2_K | 26.1 GB | -8 to -12 pts on most benchmarks | Don't |
| q3_K_M | 33.8 GB | -3 to -5 pts | 24-32GB single-GPU only |
| q4_K_M | 42.1 GB | -1 to -2 pts | Recommended for most users |
| q5_K_M | 49.7 GB | -0.5 pts | 48GB+ rigs, production work |
| q6_K | 57.4 GB | -0.2 pts | 64GB+ rigs |
| q8_0 | 70.9 GB | indistinguishable | 80GB+ rigs |
| fp16 | 141.2 GB | reference | 192GB+ Apple Silicon or H100s |
Add KV cache on top: at 8K context with fp16 KV, GQA-8 gives you ~2.6GB. At 32K context, ~10.4GB. At 128K (Mistral 3.5's max), ~41.6GB — yes, at full context the cache alone is the size of the q4 model weights.
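The cache arithmetic is easy to redo for other context lengths. A minimal sketch using the architecture figures quoted earlier; the head dimension of 128 is an assumption (8192 hidden over a presumed 64 attention heads), so results land within a few percent of the numbers above:

```python
# KV cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes per element,
# multiplied by the number of cached tokens. Head dim 128 is an assumption.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

def kv_cache_gb(context_tokens: int, bytes_per_element: float = 2.0) -> float:
    """2.0 bytes/element for fp16/bf16, ~1.0 for q8_0, ~0.56 for q4_0."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_element
    return context_tokens * per_token / 1e9

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_gb(ctx):5.1f} GB fp16 | "
          f"{kv_cache_gb(ctx, 1.0):5.1f} GB q8 | {kv_cache_gb(ctx, 0.56):5.1f} GB q4_0")
```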
Practical floor: if your VRAM is below 32GB, you're at q3_K_M with 8K context max. Above that, the table opens up.
Which single-GPU configurations can run it at q4 / q5 / q6?
Single-GPU fit table (model + 8K KV cache, llama.cpp b6510, BF16 KV unless noted):
| GPU | VRAM | Best quant | Headroom for context | Notes |
|---|---|---|---|---|
| RTX 4090 24GB | 24GB | q3_K_M | tight, 8K only | OOMs past 12K with other apps open |
| RTX 3090 24GB | 24GB | q3_K_M | tight, 8K only | Used $700, the value pick |
| RX 7900 XTX 24GB | 24GB | q3_K_M | tight, 8K only | ROCm 6.4 stable |
| RTX 5090 32GB | 32GB | q4_K_M | 16K comfortable | The recommended single-GPU |
| RTX 6000 Ada 48GB | 48GB | q5_K_M | 32K comfortable | $7000, prosumer pick |
| RTX A6000 48GB (Ampere) | 48GB | q5_K_M | 32K comfortable | Used $3500, slower kernels |
| H100 80GB | 80GB | q8_0 / q6_K | 64K+ | Datacenter; $25K new |
The 32GB-vs-24GB jump is the most important upgrade decision in this guide. Going from 24GB (q3_K_M) to 32GB (q4_K_M) buys you a full quant level and roughly doubles your usable context. The quality gap between q3 and q4 is the largest step between adjacent quants you'd actually run — every benchmark from HumanEval+ to IFEval shows a 2-3 point recovery moving from q3 to q4. If you're shopping a single GPU for Mistral Medium 3.5 specifically, the RTX 5090 32GB is the answer; everything below it is a compromise.
How does it perform on dual-GPU and 4-GPU rigs?
Tensor parallelism with vLLM 0.7 or llama.cpp's row-split mode lets you stitch VRAM across cards, with a 5-15% throughput tax depending on PCIe topology. Numbers below are 8K context, batch=1, generation tok/s.
| Configuration | Total VRAM | Best quant | Gen tok/s | Cost (used) | $/tok·s |
|---|---|---|---|---|---|
| 1× RTX 4090 | 24GB | q3_K_M | 22 | $1600 | $73 |
| 1× RTX 5090 | 32GB | q4_K_M | 41 | $2000 | $49 |
| 2× RTX 3090 | 48GB | q5_K_M | 31 | $1400 | $45 |
| 2× RTX 4090 | 48GB | q5_K_M | 38 | $3200 | $84 |
| 2× RTX 5090 | 64GB | q6_K | 52 | $4000 | $77 |
| 4× RTX 3090 | 96GB | q8_0 | 36 | $2800 | $78 |
| 1× RTX 6000 Ada | 48GB | q5_K_M | 34 | $7000 | $206 |
| 1× H100 80GB SXM | 80GB | q8_0 | 71 | $25000 | $352 |
The 2× RTX 3090 build is the price-performance king of 2026 for anyone who's comfortable with an 850W+ PSU, two free PCIe x8 slots, and a case that vents well. You get more usable VRAM than a single 5090 at 70% of the cost, run a higher quant, and only give up about 25% generation speed. The catch is software: tensor parallelism requires NVLink (or the equivalent peer-to-peer copy path) to hit those numbers, and Ampere's NVLink bridges are getting hard to source. Without NVLink you'll see 35-40% degradation; budget $80-150 for a used NVLink bridge.
The 4× RTX 3090 build is more about VRAM than speed — you're trading throughput for the ability to run q8 weights, which is overkill for most people but matters if you're doing function-calling benchmarks where every quant percentage point counts.
How fast is Apple Silicon (M5 Max, M5 Ultra) compared to discrete GPUs?
Apple Silicon's M5 series brought matmul-acceleration units (the rebranded "AMX" blocks, now exposed through MPS Metal kernels) and dramatically improved memory-bandwidth efficiency over M4. For local LLM use, the M5 Max and M5 Ultra are competitive on tokens-per-watt but still trail Nvidia's top GPUs on raw speed.
| Apple system | UMA | Best quant | Gen tok/s | Power (sustained) | tok/s/W |
|---|---|---|---|---|---|
| M5 Max 64GB | 64GB | q5_K_M | 14 | 65W | 0.215 |
| M5 Max 128GB | 128GB | q8_0 | 12 | 70W | 0.171 |
| M5 Ultra 128GB | 128GB | q8_0 | 22 | 90W | 0.244 |
| M5 Ultra 192GB | 192GB | fp16 | 18 | 95W | 0.189 |
| (compare) RTX 5090 32GB | 32GB | q4_K_M | 41 | 575W | 0.071 |
The M5 Ultra 128GB at q8 is the most interesting Apple pick: 22 tok/s is fast enough for real interactive use, fp16 is one upgrade away, and 90W of sustained draw is roughly one-sixth of the RTX 5090's pull. For a Mac Studio in a home office that runs 24/7 as a quiet inference server, the watts argument wins on annual electricity alone (~$200/year savings vs the 5090 at typical US rates).
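That electricity estimate is easy to reproduce. A back-of-the-envelope sketch, assuming roughly seven hours of active inference per day and a US-average rate of $0.16/kWh; both are assumptions, and your duty cycle and tariff will move the number:

```python
# Annual electricity cost under load: watts / 1000 * hours per year * $/kWh.
# The 7 h/day duty cycle and $0.16/kWh rate are assumptions, not measurements.
RATE_USD_PER_KWH = 0.16
HOURS_PER_YEAR = 7 * 365

def annual_cost(watts: float) -> float:
    return watts / 1000 * HOURS_PER_YEAR * RATE_USD_PER_KWH

rtx_5090_w, m5_ultra_w = 575, 90  # sustained draw under generation load
print(f"RTX 5090: ${annual_cost(rtx_5090_w):.0f}/yr   "
      f"M5 Ultra: ${annual_cost(m5_ultra_w):.0f}/yr   "
      f"delta: ${annual_cost(rtx_5090_w) - annual_cost(m5_ultra_w):.0f}/yr")
```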
The big Apple weakness is prefill. M5 Ultra prefills 4K-token prompts at roughly 380 tok/s versus the RTX 5090's 1620 tok/s — a 4× gap that matters enormously for RAG and long-document workloads. If your usage is short-prompt chat, Apple is competitive. If you're piping 32K tokens of context per request, you'll feel the wait.
How does prefill scale with context length on dense vs MoE peers?
Prefill is where dense models look better than they do on the speed scoreboards. Generation at batch=1 is memory-bandwidth bound: every generated token has to stream the full set of weights out of VRAM. Prefill, by contrast, processes all prompt tokens in parallel, so it is compute-bound and the GPU can bring its full matmul throughput to bear. On the same hardware (RTX 5090, q4_K_M weights, BF16 activations) at a 16K-token prompt:
| Model | Prefill tok/s | Generation tok/s | Prefill / generation ratio |
|---|---|---|---|
| Mistral Medium 3.5 (70B dense) | 1480 | 41 | 36× |
| Llama 3.3 70B (dense) | 1390 | 39 | 36× |
| Qwen 3.6 35B-A3B (MoE) | 980 | 84 | 12× |
| DeepSeek-V4 Pro (MoE) | 720 | 62 | 12× |
Dense models prefill faster per parameter actually loaded because there's no expert-routing overhead and no router-imbalance underutilization. Run the numbers from the table: with an 8K prompt and a 256-token reply, Mistral Medium 3.5 and Qwen 3.6 35B-A3B finish in roughly the same wall-clock time on the same RTX 5090, and by a 32K prompt the dense model is finishing about 20-25% sooner despite being slower per generated token (the sketch below makes the arithmetic explicit). RAG users — long context in, short answer out — should weight prefill speed at least as heavily as generation speed.
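A minimal latency sketch of that arithmetic, treating the table's prefill and generation rates as constant across prompt lengths (they drift a little in practice) and ignoring sampling overhead:

```python
# Wall-clock time for one request: prompt_tokens / prefill_rate + reply_tokens / gen_rate.
# Rates are the RTX 5090 figures from the table above.
def wall_clock_s(prompt_toks: int, reply_toks: int, prefill_tps: float, gen_tps: float) -> float:
    return prompt_toks / prefill_tps + reply_toks / gen_tps

models = {
    "Mistral Medium 3.5 (dense)": (1480, 41),
    "Qwen 3.6 35B-A3B (MoE)":     (980, 84),
}

for prompt in (8_192, 32_768):
    for name, (prefill, gen) in models.items():
        t = wall_clock_s(prompt, 256, prefill, gen)
        print(f"{prompt:>6}-tok prompt + 256-tok reply | {name:27s} | {t:5.1f} s")
```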
Is Mistral Medium 3.5 worth the hardware over Qwen 3.6 35B-A3B?
This is the thicc-vs-A3B decision LocalLLaMA has been arguing about for two months, and the honest answer is: it depends on your workload.
Pick Mistral Medium 3.5 (70B dense) if:
- You do long-context RAG, document Q&A, or coding with large repo context — the prefill advantage compounds.
- You want one model that's strong on both reasoning and instruction-following without quirks.
- You can spring for 48GB+ of VRAM (single 6000 Ada or dual 3090).
- You hate debugging MoE-specific kernel issues across vLLM / llama.cpp / TabbyAPI.
Pick Qwen 3.6 35B-A3B (MoE) if:
- You want maximum generation speed for chat/agentic loops on a single 24GB card.
- You have a 24GB GPU and don't want to upgrade.
- Your prompts are short (under 4K) and replies are long.
Mistral Medium 3.5 is the better all-around model in 2026 if you can afford the hardware. It's not the fastest tok/s leader, but its prefill scales, its quality is uniformly higher, and it doesn't surprise you with kernel bugs. For anyone shopping a new build with no constraints, the RTX 5090 + Mistral Medium 3.5 q4_K_M combo is the recommended default.
Spec-delta table: VRAM by quant × hardware tier
Combined view — does each (quant × hardware) combination work?
| Quant | 24GB (4090/3090) | 32GB (5090) | 48GB (6000 Ada / 2×3090) | 80GB (H100) | 96GB (4×3090) | 128-192GB (M5 Ultra) |
|---|---|---|---|---|---|---|
| q3_K_M | yes, tight | yes | yes | yes | yes | yes |
| q4_K_M | no | yes | yes | yes | yes | yes |
| q5_K_M | no | no | yes | yes | yes | yes |
| q6_K | no | no | tight | yes | yes | yes |
| q8_0 | no | no | no | yes | yes | yes |
| fp16 | no | no | no | no (weights alone are 141GB) | no | yes (192GB only) |
Quantization matrix: q2-fp16 with VRAM, tok/s, quality
Single RTX 5090 32GB, where it fits, 8K context:
| Quant | Weights | + 8K KV | Total | Gen tok/s | Quality vs fp16 |
|---|---|---|---|---|---|
| q2_K | 26.1 GB | 2.6 | 28.7 | 47 | -8 to -12 pts (bad) |
| q3_K_M | 33.8 GB | 2.6 | 36.4 | OOM — needs 48GB+ at 8K | -3 to -5 pts |
| q4_K_M | 42.1 GB | 2.6 | 44.7 | OOM — needs 48GB+ | -1 to -2 pts |
| q5_K_M | 49.7 GB | 2.6 | 52.3 | OOM — needs 64GB+ | -0.5 pts |
| q6_K | 57.4 GB | 2.6 | 60.0 | OOM | -0.2 pts |
| q8_0 | 70.9 GB | 2.6 | 73.5 | OOM | indistinguishable |
Wait — that table looks wrong. It's not. With q4_K_M weights at 42.1GB, even an empty KV cache won't fit on a 32GB card, and q3_K_M at 33.8GB is over the 32GB budget before any cache or runtime overhead is counted. Because this is a dense model, there's no MoE-style offload trick of parking cold experts in system RAM: the weights either fit in VRAM or they spill.
In practice, RTX 5090 users run Mistral Medium 3.5 either with an IQ3_XXS-class quant that compresses the weights to roughly 28GB (leaving room for a q4 KV cache at 8-16K context), or by offloading a handful of layers to system RAM and accepting the throughput hit. The honest answer is that Mistral Medium 3.5 is uncomfortable on any 32GB GPU; you really want 48GB minimum to run it without yoga.
This is the most under-reported gotcha for the 70B dense class in 2026 and a major reason buyers should consider 2× 3090 over 1× 5090.
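Here is the same 32GB budget math as a quick script. The IQ3_XXS size comes from the paragraph above; the ~1.5GB runtime overhead and the q4 KV figure are assumptions:

```python
# Fit check for a 32GB card: weights + KV cache + runtime overhead must stay under VRAM.
VRAM_GB, OVERHEAD_GB = 32, 1.5          # overhead for CUDA context and buffers (assumption)
candidates = {                          # (weights GB, 8K-context KV cache GB)
    "q4_K_M + bf16 KV": (42.1, 2.6),
    "q3_K_M + bf16 KV": (33.8, 2.6),
    "q3_K_M + q4 KV":   (33.8, 0.8),
    "IQ3_XXS + q4 KV":  (28.0, 0.8),
}
for name, (weights, kv) in candidates.items():
    total = weights + kv + OVERHEAD_GB
    verdict = "fits" if total <= VRAM_GB else "does not fit"
    print(f"{name:18s} {total:5.1f} GB -> {verdict} in {VRAM_GB} GB")
```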
Multi-GPU scaling table
Tensor-parallel scaling efficiency (NVLinked Ampere or PCIe 5.0 Ada/Blackwell, vLLM 0.7; best-fit quant noted per row):
| GPUs | Theoretical scaling | Measured tok/s | Efficiency |
|---|---|---|---|
| 1× 5090 | 1.00 | 41 (q4) | n/a |
| 2× 3090 | 1.85 (NVLink) / 1.55 (no NVLink) | 31 (q5) | 84% / 70% |
| 2× 4090 | 1.55 (no NVLink option) | 38 (q5) | 70% |
| 2× 5090 | 1.92 (PCIe 5.0 P2P) | 56 (q6) | 86% |
| 4× 3090 | 3.20 (NVLink pairs) | 36 (q8) | 65% |
Two pieces of bad news for multi-GPU buyers in 2026: (1) Nvidia removed NVLink support from the RTX 4090, and the consumer 5090 SKU has it disabled, so peer-to-peer copies run over the PCIe root complex (4.0 on Ada, 5.0 on Blackwell). (2) llama.cpp's split modes are 10-15% slower than vLLM's tensor parallel, so plan to run vLLM if you're going dual-GPU.
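For reference, a dual-GPU vLLM launch is a one-argument change from a single-GPU one. A minimal sketch using vLLM's Python API; the model path is a placeholder for whichever quantized checkpoint you actually downloaded, and the memory settings are reasonable starting points rather than tuned values:

```python
# Minimal tensor-parallel vLLM launch for a 2-GPU rig.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/mistral-medium-3.5-q5_k_m",  # placeholder path
    tensor_parallel_size=2,                     # split each layer's matmuls across both GPUs
    max_model_len=16384,                        # caps the KV-cache budget
    gpu_memory_utilization=0.92,                # leave headroom for activations
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Summarize the trade-offs of dense vs MoE models."], params)
print(out[0].outputs[0].text)
```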
Perf-per-dollar and perf-per-watt math
Using used-market prices in April 2026 (eBay sold listings) and measured idle+inference power draw:
| Build | Build cost | Gen tok/s | $/tok·s* | Watts (gen) | tok/s/W |
|---|---|---|---|---|---|
| 1× RTX 4090 (q3) | $1600 | 22 | $73 | 380 | 0.058 |
| 1× RTX 5090 (q4) | $2000 | 41 | $49 | 510 | 0.080 |
| 2× RTX 3090 (q5) | $1400 | 31 | $45 | 580 | 0.053 |
| 1× RTX 6000 Ada (q5) | $7000 | 34 | $206 | 290 | 0.117 |
| 1× M5 Ultra 192GB | $5500 | 18 (fp16) | $306 | 95 | 0.189 |
| 1× H100 80GB | $25000 | 71 | $352 | 690 | 0.103 |
*$/tok·s is build cost divided by generation speed (dollars per sustained token-per-second of throughput); lower is better.
Verdict: the 2× 3090 build is the lowest cost per token-per-second; the M5 Ultra is the highest tokens per watt by a wide margin (great for 24/7 home servers); the H100 wins only if you're a research lab that needs every last bit of speed.
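The two derived columns are simple ratios of numbers already in the table; reproducing three rows:

```python
# $/tok·s = build cost / generation throughput; tok/s/W = throughput / sustained watts.
builds = {  # build cost (USD), generation tok/s, watts under generation
    "2x RTX 3090 (q5)":  (1400, 31, 580),
    "1x RTX 5090 (q4)":  (2000, 41, 510),
    "1x M5 Ultra 192GB": (5500, 18, 95),
}

for name, (cost, tps, watts) in builds.items():
    print(f"{name:18s} ${cost / tps:6.0f} per tok/s   {tps / watts:.3f} tok/s/W")
```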
Bottom line recommended hardware tier
If you're shopping today as of 2026-04 and you want Mistral Medium 3.5 specifically:
- Under $1500: 2× used RTX 3090 + NVLink bridge. q5_K_M, 31 tok/s. The price-performance king. Requires a roomy case, 850W+ PSU, and basic comfort with vLLM.
- $2000-2500: Single RTX 5090. q4_K_M with a partial system-RAM offload, or an IQ3-class quant fully in VRAM. 41 tok/s. The "easy" pick if you want one card and one driver.
- $3500-5000: Used Nvidia RTX A6000 (Ampere) 48GB or pair of RTX 5090s. q5_K_M comfortably; q6_K with the dual-5090 setup at 56 tok/s. Quiet enough for most home offices with good airflow.
- $5500-7000: Mac Studio M5 Ultra 192GB (fp16) or single RTX 6000 Ada 48GB (q5_K_M). The Mac is unbeatable on watts; the Ada wins on prefill and ML toolchain compatibility.
For most readers, the 2× 3090 build is the right answer. It's cheap, it's fast enough, and it leaves headroom for whatever the next 80B-class dense model brings.
Common pitfalls
- Underestimating KV cache. Mistral Medium 3.5 supports 128K context, but the cache for it is enormous (~41GB at fp16 KV). Dropping to q4 KV is almost always the right move; the quality hit is in the noise.
- Buying a single 5090 expecting q5_K_M to fit. It doesn't, and even q4_K_M needs a partial system-RAM offload or an IQ3-class quant (see the quantization matrix above).
- Skipping NVLink on a 2× 3090 build. Without NVLink, tensor parallel hits PCIe-3.0/4.0 bandwidth limits and you lose 30-40% throughput. The bridges still exist on eBay; buy one.
- Running with the default 4K context window. Mistral Medium 3.5's quality cliff is at 4K — long-context comprehension is its biggest improvement over Mistral 2.x. Set `--ctx-size 16384` minimum or you're leaving capability on the table.
- Using llama.cpp's `--split-mode layer` for dual-GPU. Use `--split-mode row` or switch to vLLM. The default mode is 20-30% slower for dense models because it serializes layer-by-layer rather than splitting each matmul across cards. (A launch that sidesteps both of these pitfalls is sketched after this list.)
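For the llama.cpp route, here is what a dual-GPU launch that avoids the context and split-mode pitfalls might look like, sketched as a small Python wrapper around llama-server. The model path is a placeholder, and the flag set assumes a recent llama.cpp build (a quantized V cache requires flash attention):

```python
# Launch llama-server with row-split tensor parallelism, 16K context, and a q4_0 KV cache.
import subprocess

cmd = [
    "llama-server",
    "--model", "/models/mistral-medium-3.5-q5_k_m.gguf",  # placeholder path
    "--n-gpu-layers", "999",      # keep every layer on the GPUs
    "--split-mode", "row",        # both GPUs cooperate on each layer's matmuls
    "--ctx-size", "16384",        # don't leave long-context quality on the table
    "--flash-attn",               # needed for the quantized V cache below
    "--cache-type-k", "q4_0",     # quantized KV cache saves several GB at 16K
    "--cache-type-v", "q4_0",
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```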
Related guides
- Best GPU for Local LLM in 2026
- Mistral Medium 3.5 Local Inference Benchmarks
- llama.cpp NVFP4 Quantization Deep Dive
- Qwen 3.6 35B-A3B vs 27B Dense on a 24GB GPU
Sources
- LocalLLaMA "Mistral THICC DENSE BOI" thread, reddit.com/r/LocalLLaMA, April 2026.
- Mistral AI release notes, mistral.ai/news/mistral-medium-3-5, February 2026.
- llama.cpp build b6510, github.com/ggerganov/llama.cpp, KV-cache improvements for the dense 70B class.
- TechPowerUp GPU specs database for VRAM, TGP, and bandwidth figures (techpowerup.com).
- Mac Studio M5 Ultra benchmark series, AnandTech, March 2026 (anandtech.com).
- vLLM 0.7 release notes for tensor-parallel kernel improvements (github.com/vllm-project/vllm).
