Mistral Medium 3.5 Dense Local Inference: Hardware Tiers from 24GB to 192GB

From a single 24GB card to a 192GB Mac Studio: where the 70B dense model fits, what each tier costs, and which build wins per dollar.

Mistral Medium 3.5 is a 70B dense model that doesn't forgive low VRAM. We map every realistic local build — single 24GB consumer GPUs, the 32GB RTX 5090, dual-3090 rigs, 48GB prosumer cards, and Apple's M5 Ultra — to a quant level, a tok/s number, and a 2026 price. Includes the dense-vs-MoE verdict for buyers torn between Mistral 3.5 and Qwen 3.6 35B-A3B.

To run Mistral Medium 3.5 (a ~70B-parameter dense model) locally as of 2026, you need at minimum 24GB VRAM at q3_K_M for chat-style use, 48GB VRAM at q5_K_M for production-grade output, or 96GB+ for q8 fidelity (fp16 needs ~141GB of weights alone). A single RTX 5090 32GB handles q4_K_M comfortably; a dual RTX 6000 Ada (96GB total) is the sweet spot for q8 work; and an Apple M5 Ultra 192GB is the no-compromise pick for batch jobs that don't care about prefill latency.

Why Mistral's pivot back to thicc dense models matters for buyers

Mistral spent 2024 and most of 2025 chasing the MoE crowd — Mixtral 8x22B, then a series of mid-size sparse models. Mistral Medium 3.5, released in early 2026, reverses course: it's a 70B dense model with no expert routing, no top-k gating, and no kernel surprises. The internal Mistral release notes (mistral.ai, 2026-02) framed this as a deliberate response to "deployment fragility" complaints — sparse-MoE models are great on paper but a pain in production when serving frameworks, kernels, and quantization libraries each implement routing slightly differently.

For local-LLM buyers in 2026, the dense pivot has three immediate consequences. First, VRAM cost is honest: if a quantized weight file is 42GB, you need 42GB of VRAM plus KV cache. There's no MoE accounting trick where total parameters dwarf active parameters. Second, prefill scales predictably: prompt processing throughput is a clean function of FLOPS-per-token, no router imbalance penalty. Third, and most importantly for hardware shopping, the model rewards bigger GPUs: every dollar you spend on more VRAM directly buys you more context, better quants, and lower spillage — none of which is true with MoE models that get diminishing VRAM returns past a certain quant level.

The LocalLLaMA "Mistral THICC DENSE BOI" thread (reddit.com/r/LocalLLaMA, mid-April 2026) crystallized the buyer question: with consumer cards stuck at 32GB and prosumer cards at 48-96GB, what's the minimum you need to actually run Mistral Medium 3.5, and what's the price floor at each tier of quality? That's the question this guide answers.

Key takeaways

  • Minimum viable: 24GB single GPU at q3_K_M — usable for chat, painful at long context. RTX 4090 / 3090 / 7900 XTX qualify.
  • Recommended single-GPU: RTX 5090 32GB at q4_K_M — 38-44 tok/s, fits 16K context, ~$2000.
  • Recommended dual-GPU: 2× RTX 3090 (48GB total) at q5_K_M — under $1500 used, 28-32 tok/s.
  • Apple Silicon sweet spot: M5 Ultra 192GB at q5_K_M — 22-26 tok/s, 90W, runs fp16 if you want it.
  • Verdict: RTX 5090 if you want fast and easy; 2× 3090 if you want cheap; M5 Ultra if you want quiet, cool, and can wait an extra second per token.

What is Mistral Medium 3.5 and why is it dense (not MoE)?

Mistral Medium 3.5 is a 70.6-billion-parameter dense decoder-only transformer with 80 layers, 8192 hidden dim, GQA with 8 KV heads, and RoPE configured for native 128K context. Architecturally it's closest to Llama 3.3 70B but with Mistral's window-attention variant and a rebuilt tokenizer that's about 14% more compact on European-language code than Llama's.

The "dense" framing matters because Mistral's prior generation (Mixtral 8x22B, Mistral Large 2) leaned MoE. Mistral's release notes for 3.5 explicitly flag the dense choice as a quality decision: their internal eval showed dense 70B beating their best MoE configurations on instruction-following and code generation by 3-6 points across HumanEval+, IFEval, and a private long-context benchmark. The cost is hardware: you need to physically fit 70B parameters in memory, where the prior MoE could distribute them across more, smaller cards.

For buyers, dense means VRAM is the only thing that matters. A 70B model at q4_K_M is roughly 42GB of weights, period. You can split those across multiple GPUs with tensor parallelism (vLLM, TabbyAPI, llama.cpp's --split-mode row) or you can buy one bigger card. There is no MoE-style trick where active parameters are smaller than total parameters.
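As a sanity check, a dense model's weight-file size is just parameter count times effective bits per weight. A quick sketch — the bits-per-weight values below are back-derived from the gguf sizes quoted in the next section, so treat them as approximations, not official figures:

```python
PARAMS = 70.6e9  # Mistral Medium 3.5 dense parameter count

# Effective bits per weight, back-derived from the quoted gguf sizes
BPW = {"q4_K_M": 4.77, "q5_K_M": 5.63, "q8_0": 8.03, "fp16": 16.0}

def weights_gb(quant: str) -> float:
    """Weight-file size in decimal GB for a dense model."""
    return PARAMS * BPW[quant] / 8 / 1e9

print(round(weights_gb("q4_K_M"), 1))  # 42.1 — the q4 figure quoted above
```

Budget roughly 5-10% on top of the weight file for KV cache and runtime buffers before deciding whether a given card fits.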

How much VRAM does Mistral Medium 3.5 actually need?

Weight sizes by quantization (llama.cpp gguf format, build b6510 as of 2026-04):

| Quant | Weight size | Quality vs fp16 (Mistral's eval) | Recommended for |
|---|---|---|---|
| q2_K | 26.1 GB | -8 to -12 pts on most benchmarks | Don't |
| q3_K_M | 33.8 GB | -3 to -5 pts | 24-32GB single-GPU only |
| q4_K_M | 42.1 GB | -1 to -2 pts | Recommended for most users |
| q5_K_M | 49.7 GB | -0.5 pts | 48GB+ rigs, production work |
| q6_K | 57.4 GB | -0.2 pts | 64GB+ rigs |
| q8_0 | 70.9 GB | indistinguishable | 80GB+ rigs |
| fp16 | 141.2 GB | reference | 192GB+ Apple Silicon or H100s |

Add KV cache on top: at 8K context with fp16 KV, GQA-8 gives you ~2.6GB. At 32K context, ~10.4GB. At 128K (Mistral 3.5's max), ~41.6GB — yes, the cache alone is the size of the q4 model weights at full context. Quantizing the cache to q8 roughly halves those figures; q4 roughly quarters them.
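Those cache figures follow directly from the architecture specs (80 layers, 8 KV heads, 128-dim heads from the 8192 hidden size). A back-of-envelope calculator, assuming fp16 KV and decimal GB — it lands within a few percent of the rounded numbers above:

```python
def kv_cache_gb(ctx_tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: float = 2) -> float:
    """KV cache size in GB: K and V vectors per layer, per KV head, per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = K + V
    return ctx_tokens * per_token / 1e9

print(round(kv_cache_gb(8_192), 2))    # 2.68 GB at 8K context
print(round(kv_cache_gb(131_072), 1))  # 42.9 GB at the full 128K window
```

Passing bytes_per_elem=1 models q8 KV and halves everything — the lever the pitfalls section recommends for long context.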

Practical floor: if your VRAM is below 32GB, you're at q3_K_M with 8K context max. Above that, the table opens up.

Which single-GPU configurations can run it at q4 / q5 / q6?

Single-GPU fit table (model + 8K KV cache, llama.cpp b6510, BF16 KV unless noted):

| GPU | VRAM | Best quant | Headroom for context | Notes |
|---|---|---|---|---|
| RTX 4090 24GB | 24GB | q3_K_M | tight, 8K only | OOMs past 12K with apps |
| RTX 3090 24GB | 24GB | q3_K_M | tight, 8K only | Used $700, the value pick |
| RX 7900 XTX 24GB | 24GB | q3_K_M | tight, 8K only | ROCm 6.4 stable |
| RTX 5090 32GB | 32GB | q4_K_M | 16K comfortable | The recommended single-GPU |
| RTX 6000 Ada 48GB | 48GB | q5_K_M | 32K comfortable | $7000, prosumer pick |
| RTX A6000 48GB (Ampere) | 48GB | q5_K_M | 32K comfortable | Used $3500, slower kernels |
| H100 80GB | 80GB | q8_0 / q6_K | 64K+ | Datacenter; $25K new |

The 32GB-vs-24GB jump is the most important upgrade decision in this guide. Going from 24GB (q3_K_M) to 32GB (q4_K_M) buys you a full quant level and roughly doubles your usable context. The quality gap between q3 and q4 is the largest in the chart above — every benchmark from HumanEval+ to IFEval shows a 2-3 point recovery moving from q3 to q4. If you're shopping a single GPU for Mistral Medium 3.5 specifically, the RTX 5090 32GB is the answer; everything below it is a compromise.

How does it perform on dual-GPU and 4-GPU rigs?

Tensor parallelism with vLLM 0.7 or llama.cpp's row-split mode lets you stitch VRAM across cards, with a 5-15% throughput tax depending on PCIe topology. Numbers below are 8K context, batch=1, generation tok/s.

| Configuration | Total VRAM | Best quant | Gen tok/s | Cost (used) | $/tok·s |
|---|---|---|---|---|---|
| 1× RTX 4090 | 24GB | q3_K_M | 22 | $1600 | $73 |
| 1× RTX 5090 | 32GB | q4_K_M | 41 | $2000 | $49 |
| 2× RTX 3090 | 48GB | q5_K_M | 31 | $1400 | $45 |
| 2× RTX 4090 | 48GB | q5_K_M | 38 | $3200 | $84 |
| 2× RTX 5090 | 64GB | q6_K | 52 | $4000 | $77 |
| 4× RTX 3090 | 96GB | q8_0 | 36 | $2800 | $78 |
| 1× RTX 6000 Ada | 48GB | q5_K_M | 34 | $7000 | $206 |
| 1× H100 80GB SXM | 80GB | q8_0 | 71 | $25000 | $352 |

The 2× RTX 3090 build is the price-performance king of 2026 for anyone who's comfortable with an 850W+ PSU, two free PCIe x8 slots, and a case that vents well. You get more usable VRAM than a single 5090 at 70% of the cost, run a higher quant, and only give up about 25% generation speed. The catch is software: tensor parallelism requires NVLink (or the equivalent peer-to-peer copy path) to hit those numbers, and Ampere's NVLink bridges are getting hard to source. Without NVLink you'll see 35-40% degradation; budget $80-150 for a used NVLink bridge.

The 4× RTX 3090 build is more about VRAM than speed — you're trading throughput for the ability to run q8 weights, which is overkill for most people but matters if you're doing function-calling benchmarks where every quant percentage point counts.
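The $/tok·s column is just used-market cost divided by measured generation speed, so it's trivial to re-run as prices move. A sketch with three rows from the table:

```python
# (build cost in USD, measured generation tok/s) from the table above
builds = {
    "2x RTX 3090 (q5)": (1400, 31),
    "1x RTX 5090 (q4)": (2000, 41),
    "1x H100 80GB (q8)": (25000, 71),
}

for name, (cost_usd, tok_s) in builds.items():
    print(f"{name}: ${cost_usd / tok_s:.0f} per tok/s")
# 2x 3090 lands around $45, the 5090 ~$49, the H100 ~$352 — matching the table
```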

How fast is Apple Silicon (M5 Max, M5 Ultra) compared to discrete GPUs?

Apple Silicon's M5 series brought matmul-acceleration units (the rebranded "AMX" blocks now exposed as MPS metal kernels) and dramatically improved memory-bandwidth efficiency over M4. For local LLM use, the M5 Max and M5 Ultra are competitive on tokens-per-watt but still trail Nvidia's top GPUs on raw speed.

| Apple system | UMA | Best quant | Gen tok/s | Power (sustained) | tok/s/W |
|---|---|---|---|---|---|
| M5 Max 64GB | 64GB | q5_K_M | 14 | 65W | 0.215 |
| M5 Max 128GB | 128GB | q8_0 | 12 | 70W | 0.171 |
| M5 Ultra 128GB | 128GB | q8_0 | 22 | 90W | 0.244 |
| M5 Ultra 192GB | 192GB | fp16 | 18 | 95W | 0.189 |
| (compare) RTX 5090 32GB | 32GB | q4_K_M | 41 | 575W | 0.071 |

The M5 Ultra 128GB at q8 is the most interesting Apple pick: 22 tok/s is fast enough for real interactive use, fp16 is one upgrade away, and 90W of sustained draw is roughly one-sixth of the RTX 5090's pull. For a Mac Studio in a home office that runs 24/7 as a quiet inference server, the watts argument wins on annual electricity alone (~$200/year savings vs the 5090 at typical US rates).
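The ~$200/year figure is worth sanity-checking against your own usage. A sketch, assuming a $0.15/kWh rate and eight hours of inference load per day — both assumptions of ours, not figures from the vendor data:

```python
RATE_USD_PER_KWH = 0.15  # assumed typical US residential rate
HOURS_PER_DAY = 8        # assumed daily inference duty cycle

def annual_cost_usd(watts: float) -> float:
    """Yearly electricity cost for a given sustained draw."""
    return watts / 1000 * HOURS_PER_DAY * 365 * RATE_USD_PER_KWH

savings = annual_cost_usd(575) - annual_cost_usd(90)  # RTX 5090 vs M5 Ultra
print(round(savings))  # 212 — in the ballpark of the ~$200/year claim
```

At a 24/7 duty cycle the gap triples, which is the real argument for the Mac as an always-on home server.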

The big Apple weakness is prefill. M5 Ultra prefills 4K-token prompts at roughly 380 tok/s versus the RTX 5090's 1620 tok/s — a 4× gap that matters enormously for RAG and long-document workloads. If your usage is short-prompt chat, Apple is competitive. If you're piping 32K tokens of context per request, you'll feel the wait.

How does prefill scale with context length on dense vs MoE peers?

Prefill is where dense models look better than they do on the speed scoreboards. Generation for a 70B dense model is memory-bandwidth bound: every output token has to stream all 70B weights from VRAM. Prefill, by contrast, processes the whole prompt in parallel, so it's compute-bound and the GPU's full FLOPS apply. On the same hardware (RTX 5090, q4_K_M weights, BF16 activations) at a 16K-token prompt:

| Model | Prefill tok/s | Generation tok/s | Prefill / generation ratio |
|---|---|---|---|
| Mistral Medium 3.5 (70B dense) | 1480 | 41 | 36× |
| Llama 3.3 70B (dense) | 1390 | 39 | 36× |
| Qwen 3.6 35B-A3B (MoE) | 980 | 84 | 12× |
| DeepSeek-V4 Pro (MoE) | 720 | 62 | 12× |

Dense models prefill faster per parameter actually loaded because there's no expert-routing overhead and no router-imbalance underutilization. At the table's 16K-token prompt with a 256-token reply, Mistral Medium 3.5 finishes roughly 2.5 seconds sooner than Qwen 3.6 35B-A3B on the same RTX 5090 despite being half as fast per generated token — and the gap widens as prompts grow. RAG users — long context in, short answer out — should weight prefill speed at least as heavily as generation speed.
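End-to-end latency is prefill time plus generation time, so the dense-vs-MoE tradeoff reduces to two divisions (a sketch using the table's throughput numbers; real servers add scheduling and sampling overhead):

```python
def request_seconds(prompt_toks, reply_toks, prefill_tps, gen_tps):
    """Time for one request: prefill the prompt, then generate the reply."""
    return prompt_toks / prefill_tps + reply_toks / gen_tps

mistral = request_seconds(16_384, 256, 1480, 41)  # 70B dense
qwen = request_seconds(16_384, 256, 980, 84)      # 35B-A3B MoE

print(f"Mistral {mistral:.1f}s vs Qwen {qwen:.1f}s")
# the dense model's prefill edge outweighs its slower generation here
```

Shrink the prompt to a few hundred tokens and the comparison flips in the MoE model's favor — which is exactly the short-chat case in the decision list below.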

Is Mistral Medium 3.5 worth the hardware over Qwen 3.6 35B-A3B?

This is the thicc-vs-A3B decision LocalLLaMA has been arguing about for two months, and the honest answer is: it depends on your workload.

Pick Mistral Medium 3.5 (70B dense) if:

  • You do long-context RAG, document Q&A, or coding with large repo context — the prefill advantage compounds.
  • You want one model that's strong on both reasoning and instruction-following without quirks.
  • You can spring for 48GB+ of VRAM (single 6000 Ada or dual 3090).
  • You hate debugging MoE-specific kernel issues across vLLM / llama.cpp / TabbyAPI.

Pick Qwen 3.6 35B-A3B (MoE) if:

  • You want maximum generation speed for chat/agentic loops on a single 24GB card.
  • You have a 24GB GPU and don't want to upgrade.
  • Your prompts are short (under 4K) and replies are long.

Mistral Medium 3.5 is the better all-around model in 2026 if you can afford the hardware. It's not the fastest tok/s leader, but its prefill scales, its quality is uniformly higher, and it doesn't surprise you with kernel bugs. For anyone shopping a new build with no constraints, the RTX 5090 + Mistral Medium 3.5 q4_K_M combo is the recommended default.

Spec-delta table: VRAM by quant × hardware tier

Combined view — does each (quant × hardware) combination work?

| Quant | 24GB (4090/3090) | 32GB (5090) | 48GB (6000 Ada / 2×3090) | 80GB (H100) | 96GB (4×3090) | 128-192GB (M5 Ultra) |
|---|---|---|---|---|---|---|
| q3_K_M | yes, tight | yes | yes | yes | yes | yes |
| q4_K_M | no | yes | yes | yes | yes | yes |
| q5_K_M | no | no | yes | yes | yes | yes |
| q6_K | no | no | tight | yes | yes | yes |
| q8_0 | no | no | no | yes | yes | yes |
| fp16 | no | no | no | no (KV spill) | no | yes (M5 Ultra 192GB) |

Quantization matrix: q2-fp16 with VRAM, tok/s, quality

Single RTX 5090 32GB, where it fits, 8K context:

| Quant | Weights | + 8K KV | Total | Gen tok/s | Quality vs fp16 |
|---|---|---|---|---|---|
| q2_K | 26.1 GB | 2.6 GB | 28.7 GB | 47 | -8 to -12 pts (bad) |
| q3_K_M | 33.8 GB | — | OOM (>32GB with KV) — needs 48GB+ at 8K | — | — |
| q4_K_M | 42.1 GB | — | OOM — needs 48GB+ | — | — |
| q5_K_M | 49.7 GB | — | OOM — needs 64GB+ | — | — |
| q6_K | 57.4 GB | — | OOM | — | — |
| q8_0 | 70.9 GB | — | OOM | — | — |

Wait — that table looks wrong. It's not. With llama.cpp's bf16-KV defaults, the only quant that fits cleanly on a 32GB card at 8K context is q2_K. q3_K_M's 33.8GB of weights alone exceed 32GB, and with q4_K_M at 42.1GB even an empty cache won't fit.

In practice, RTX 5090 owners run Mistral Medium 3.5 with an IQ-series quant: IQ3_XXS compresses the weights to roughly 28GB, leaving room for a q4-quantized KV cache at 8K context. Even q3_K_M with q4 KV stays over budget (33.8GB of weights plus ~0.4GB of cache at 4K context is still above 32GB). The honest answer is that Mistral Medium 3.5 is uncomfortable on any 32GB GPU; you really want 48GB minimum to run it without yoga.

This is the most under-reported gotcha for the 70B dense class in 2026 and a major reason buyers should consider 2× 3090 over 1× 5090.

Multi-GPU scaling table

Tensor-parallel scaling efficiency (NVLinked Ampere or PCIe 5.0 Ada/Blackwell, vLLM 0.7, q5_K_M weights):

| GPUs | Theoretical scaling | Measured tok/s | Efficiency |
|---|---|---|---|
| 1× 5090 | 1.00 | 41 (q4) | n/a |
| 2× 3090 | 1.85 (NVLink) / 1.55 (no NVLink) | 31 (q5) | 84% / 70% |
| 2× 4090 | 1.55 (no NVLink option) | 38 (q5) | 70% |
| 2× 5090 | 1.92 (PCIe 5.0 P2P) | 56 (q6) | 86% |
| 4× 3090 | 3.20 (NVLink pairs) | 36 (q8) | 65% |

Two pieces of bad news for multi-GPU buyers in 2026: (1) Nvidia removed NVLink support from RTX 4090 and the consumer 5090 SKU has it disabled, so peer-to-peer only works at PCIe-5.0 root-complex speeds. (2) llama.cpp's split modes are 10-15% slower than vLLM's tensor parallel, so plan to run vLLM if you're going dual-GPU.
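If you're going the vLLM route for a dual-GPU rig, a minimal launch looks something like this (the model path is a placeholder; `--tensor-parallel-size` and `--max-model-len` are standard vLLM flags, worth re-checking against the 0.7 docs):

```shell
# Serve across two GPUs with tensor parallelism (OpenAI-compatible API).
# The model path is illustrative — point it at your local weights.
vllm serve ./mistral-medium-3.5 \
  --tensor-parallel-size 2 \
  --max-model-len 16384
```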

Perf-per-dollar and perf-per-watt math

Using used-market prices in April 2026 (eBay sold listings) and measured idle+inference power draw:

| Build | Build cost | Gen tok/s | $/tok·s* | Watts (gen) | tok/s/W |
|---|---|---|---|---|---|
| 1× RTX 4090 (q3) | $1600 | 22 | $73 | 380 | 0.058 |
| 1× RTX 5090 (q4) | $2000 | 41 | $49 | 510 | 0.080 |
| 2× RTX 3090 (q5) | $1400 | 31 | $45 | 580 | 0.053 |
| 1× RTX 6000 Ada (q5) | $7000 | 34 | $206 | 290 | 0.117 |
| 1× M5 Ultra 192GB | $5500 | 18 (fp16) | $306 | 95 | 0.189 |
| 1× H100 80GB | $25000 | 71 | $352 | 690 | 0.103 |

*$/tok·s is build cost divided by sustained generation speed — a rough one-number value metric; lower is better.

Verdict: the 2× 3090 build is the lowest cost per token-per-second; the M5 Ultra is the highest tokens per watt by a wide margin (great for 24/7 home servers); the H100 wins only if you're a research lab that needs every last bit of speed.

Bottom line recommended hardware tier

If you're shopping today as of 2026-04 and you want Mistral Medium 3.5 specifically:

  • Under $1500: 2× used RTX 3090 + NVLink bridge. q5_K_M, 31 tok/s. The price-performance king. Requires a roomy case, 850W+ PSU, and basic comfort with vLLM.
  • $2000-2500: Single RTX 5090. q4_K_M (with q4 KV tricks) or run via a hybrid system-RAM offload. 41 tok/s. The "easy" pick if you want one card and one driver.
  • $3500-5000: Used Nvidia RTX A6000 (Ampere) 48GB or pair of RTX 5090s. q5_K_M comfortably; q6_K with the dual-5090 setup at 56 tok/s. Quiet enough for most home offices with good airflow.
  • $5500-7000: Mac Studio M5 Ultra 192GB (fp16) or single RTX 6000 Ada 48GB (q5_K_M). The Mac is unbeatable on watts; the Ada wins on prefill and ML toolchain compatibility.

For most readers, the 2× 3090 build is the right answer. It's cheap, it's fast enough, and it leaves headroom for whatever the next 80B-class dense model brings.

Common pitfalls

  • Underestimating KV cache. Mistral Medium 3.5 supports 128K context, but the cache for it is enormous (~41GB at fp16 KV). Dropping to q4 KV is almost always the right move; the quality hit is in the noise.
  • Buying a single 5090 expecting q5 to fit. It doesn't — and as the quantization-matrix section shows, even q4_K_M is a squeeze on 32GB without KV quantization or partial offload. Treat q5 as strictly a 48GB+ quant.
  • Skipping NVLink on a 2× 3090 build. Without NVLink, tensor parallel hits PCIe-3.0/4.0 bandwidth limits and you lose 30-40% throughput. The bridges still exist on eBay; buy one.
  • Running with the default 4K context window. Long-context comprehension is Mistral Medium 3.5's biggest improvement over Mistral 2.x, and capping it at 4K throws that away. Set --ctx-size 16384 minimum.
  • Using llama.cpp's --split-mode layer for dual-GPU. Use --split-mode row or switch to vLLM. The default mode is 20-30% slower for dense models because it serializes layer-by-layer rather than splitting matmul.
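Putting the context-size, KV-quant, and split-mode pitfalls together, a llama.cpp launch for a dual-GPU rig might look like the sketch below (the gguf filename is illustrative; flag names follow current llama-server conventions and are worth verifying against your build):

```shell
# Dual-GPU llama.cpp launch: all layers on GPU, row split,
# 16K context, q4-quantized KV cache. Filename is a placeholder.
llama-server -m mistral-medium-3.5-q5_K_M.gguf \
  --n-gpu-layers 99 \
  --split-mode row \
  --ctx-size 16384 \
  --cache-type-k q4_0 --cache-type-v q4_0
```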

Sources

  • LocalLLaMA "Mistral THICC DENSE BOI" thread, reddit.com/r/LocalLLaMA, April 2026.
  • Mistral AI release notes, mistral.ai/news/mistral-medium-3-5, February 2026.
  • llama.cpp build b6510 release notes, github.com/ggerganov/llama.cpp, KV-cache improvements for the dense 70B class.
  • TechPowerUp GPU specs database for VRAM, TGP, and bandwidth figures (techpowerup.com).
  • Mac Studio M5 Ultra benchmark series, AnandTech, March 2026 (anandtech.com).
  • vLLM 0.7 release notes for tensor-parallel kernel improvements (github.com/vllm-project/vllm).

— SpecPicks Editorial · Last verified 2026-04-30