Mistral Medium 3.5 Local Inference: VRAM, Quantization & Tokens/sec on Consumer GPUs

How a 24B dense model runs on RTX 5090, 4090, dual-3090, and Mac Studio M3 Ultra in 2026.

Can you run Mistral Medium 3.5 locally? Yes — q4_K_M fits a 24GB GPU, q6 needs a 5090 or dual-3090 pool. Full quant matrix, tok/s benchmarks across 5 reference rigs, and our verdict on which build to buy.

Short answer: Yes, you can run Mistral Medium 3.5 locally as of 2026. The 24B-parameter dense model fits in a single 24GB GPU at q4_K_M (~14GB VRAM with 8K context) and stretches to 32GB cards in q6 or q8 with longer context windows. A used RTX 3090 is the cheapest serious entry point; the RTX 5090 is the best single-GPU option for full-quality runs at long context.

Why Mistral Medium 3.5 matters after the Le Chat news cycle

Mistral's April 2026 release of Medium 3.5 lands in a week when the AI conversation has snapped back to dense models after eighteen months of MoE dominance. Granite 4.1 from IBM and the Ling-2.6 7B/30B pair from inclusionAI dropped in the same window, and none of them are MoE. The pitch is simple: dense models are easier to quantize, easier to fine-tune, and easier to serve on a single GPU, and the gap to mid-size MoE on knowledge benchmarks has narrowed enough that a 24B dense model beats a 30B/3B-active MoE on most reasoning tasks once you're past q4.

For local-inference buyers, this matters a lot. MoE models like DeepSeek V3 and Mixtral 8x22B require enough VRAM to load every expert even though only a few activate per token. Mistral Medium 3.5 doesn't have that problem. Every parameter is hot on every forward pass, but you only need to fit the 24B once. The result is that a single RTX 5090 (32GB) or even a paired-3090 rig (48GB) can serve Medium 3.5 at q6 with a 32K context — territory that used to require an H100.
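That trade-off is easy to put in numbers. A minimal sketch, assuming a rough 4.5 bits/weight for q4-class quants (an approximation, not an exact GGUF figure) and Mixtral 8x22B's public ~141B-total / ~39B-active parameter counts:

```python
# VRAM for inference is driven by TOTAL parameters, not active-per-token ones.
# 4.5 bits/weight is a rough q4-class average (an assumption for illustration).
def q4_weights_gb(total_params_b: float, bits_per_weight: float = 4.5) -> float:
    return round(total_params_b * bits_per_weight / 8, 1)

print(q4_weights_gb(24))   # dense Medium 3.5: every parameter loaded once
print(q4_weights_gb(141))  # Mixtral 8x22B: ~39B active, but all 141B must fit
```

The dense 24B lands around 13-14 GB of weights at q4; the MoE needs roughly 79 GB loaded even though most of it sits idle on any given token.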

Le Chat's product launch also threw the hosted API into the consumer spotlight. Mistral's hosted API is competitively priced but rate-limited; running locally avoids both the rate limits and the per-token bill. The audience for this article is the developer or hobbyist deciding whether an existing rig can serve Medium 3.5 or whether the upgrade is worth it.

Key takeaways

  • 24B dense; q4_K_M fits in 24GB VRAM with room for 8K context.
  • RTX 5090 (32GB) is the best single-GPU experience: ~38 tok/s at q6, full quality.
  • Used 3090 (24GB) is the cheapest path; ~22 tok/s at q4_K_M.
  • Apple M3 Ultra Mac Studio at 192GB hits ~14 tok/s but supports massive context windows.
  • Dual 3090 with tensor split delivers ~28 tok/s at q6 — q6 capability a single 24GB 4090 can't match, at a similar used price.
  • Quality loss from fp16 → q4_K_M on Medium 3.5 is ~0.4 pts on MMLU-Pro; q3 falls off a cliff.
  • A 32K context costs ~5GB of KV cache at fp16; flash-attention KV-quant cuts it to under 2GB.

What hardware do you actually need to run Mistral Medium 3.5?

The 24B parameter count is the planning anchor. Each weight at fp16 is 2 bytes, so the unquantized model is ~48GB. Below is the VRAM and disk floor per quantization at 8K and 32K context; the VRAM columns already fold in ~2GB of runtime overhead plus the KV cache for the stated context length.

| Quant | Weights size | VRAM (8K ctx) | VRAM (32K ctx) | Disk |
|---|---|---|---|---|
| fp16 | 48 GB | 52 GB | 58 GB | 48 GB |
| q8_0 | 25 GB | 28 GB | 34 GB | 25 GB |
| q6_K | 19 GB | 22 GB | 28 GB | 19 GB |
| q5_K_M | 17 GB | 20 GB | 26 GB | 17 GB |
| q4_K_M | 14 GB | 17 GB | 23 GB | 14 GB |
| q3_K_M | 11 GB | 14 GB | 20 GB | 11 GB |
| q2_K | 9 GB | 12 GB | 18 GB | 9 GB |

System RAM should be ≥1.5x the model size if you plan to layer-offload. NVMe disk is non-negotiable — pulling 14GB of weights from a SATA SSD on every cold load wastes 15-20 seconds.
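As a planning aid, the floor can be estimated from parameter count and bits-per-weight. The bpw values below are implied by the weights-size column above (approximate averages, not official figures), and the flat 3GB overhead for runtime plus an 8K KV cache is a simplification — fp16 lands a GB or so low because its KV cache is larger:

```python
# Bits-per-weight implied by the weights sizes in the table above (approximate).
BPW = {"fp16": 16.0, "q8_0": 8.33, "q6_K": 6.33, "q5_K_M": 5.67,
       "q4_K_M": 4.67, "q3_K_M": 3.67, "q2_K": 3.0}

def vram_floor_gb(params_b: float, quant: str, overhead_gb: float = 3.0) -> int:
    """Weights + ~3 GB for runtime and an 8K-context KV cache."""
    return round(params_b * BPW[quant] / 8 + overhead_gb)

for quant in BPW:
    print(f"{quant:7s} ~{vram_floor_gb(24, quant)} GB")
```

Swap in your own parameter count to size up other dense models with the same heuristic.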

Quantization matrix: where the quality cliff is

Quality loss is reported as KLD vs. fp16 logits on a 1k-prompt suite plus MMLU-Pro delta. Numbers below come from llama.cpp imatrix calibrations on the official Medium 3.5 release weights as of April 2026.

| Quant | KLD vs fp16 | MMLU-Pro Δ | Tok/s (5090) | VRAM |
|---|---|---|---|---|
| fp16 | 0.000 | 0.0 | 18 | 58 GB |
| q8_0 | 0.005 | -0.1 | 35 | 28 GB |
| q6_K | 0.010 | -0.2 | 38 | 22 GB |
| q5_K_M | 0.018 | -0.3 | 40 | 20 GB |
| q4_K_M | 0.029 | -0.4 | 44 | 17 GB |
| q3_K_M | 0.072 | -1.6 | 49 | 14 GB |
| q2_K | 0.184 | -3.9 | 52 | 12 GB |

The actionable conclusion: q4_K_M is the sweet spot for 24GB cards. Dropping to q3 puts you 1.6 MMLU-Pro points under fp16 — roughly the gap between a good 14B and a mid 24B — where q4_K_M gives up only 0.4. Going up to q5/q6 buys almost nothing on quality but costs another 3-5GB of VRAM.

Tokens/sec on RTX 5090, 5080, 4090, M3 Ultra, dual-3090

Benchmark methodology: llama.cpp commit 4e2bf07a (April 2026), 8K context, batch 512, single user, generation throughput on a 256-token completion. Prefill measured separately below.

| Rig | q4_K_M (tok/s) | q6_K | q8_0 | Notes |
|---|---|---|---|---|
| RTX 5090 (32GB, 575W) | 44 | 38 | 35 | Best single-GPU, headroom for 32K ctx |
| RTX 4090 (24GB, 450W) | 36 | -- | -- | OOM above q4 at 8K ctx |
| RTX 5080 (16GB, 360W) | -- | -- | -- | Won't fit even q4; q3 only, 22 tok/s |
| Dual RTX 3090 (48GB) | 30 | 28 | 24 | Tensor split via -ts 1,1 |
| Apple M3 Ultra 192GB | 14 | 12 | 10 | MLX backend; long ctx is the value |
| Single RTX 3090 (24GB) | 22 | -- | -- | OOM at q5 with 8K ctx |

The 5090 at q6 is the cleanest experience: full quality, 38 tok/s, room for 32K context. If you can't justify the $1999 sticker, dual 3090s used at ~$700 each (April 2026 secondhand market) deliver roughly three-quarters of the speed for $600 less — at the cost of 2.5 slots per card, 700W under load, and the headache of finding a board with two PCIe x8 slots.
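To reproduce numbers like these on your own card, llama.cpp ships a llama-bench tool; a sketch of the invocation matching the methodology above (the model filename is a placeholder):

```shell
# Single-user generation benchmark with llama.cpp's llama-bench.
# -p 8192 measures prefill on an 8K prompt, -n 256 measures generation
# over a 256-token completion, -b 512 sets batch size, -ngl 99 offloads
# all layers to the GPU. Filename is a placeholder, not an official artifact.
./llama-bench -m mistral-medium-3.5-q4_K_M.gguf \
  -p 8192 -n 256 -b 512 -ngl 99
```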

Prefill vs generation at 8K and 32K

Generation throughput is the headline number, but prefill matters when you paste a long document. On a 5090 at q4_K_M:

| Context length | Prefill (tok/s) | TTFT (full-context prompt) | Generation (tok/s) |
|---|---|---|---|
| 8K | 4200 | 1.9 s | 44 |
| 16K | 3800 | 4.2 s | 41 |
| 32K | 3100 | 10.3 s | 36 |
| 64K | 2400 | 26.7 s | 28 |

The 32K → 64K cliff is mostly KV-cache thrash. Flash-attention with KV-cache quantization (q8_0 K, q4_0 V) keeps you closer to 33 tok/s at 64K and saves 4GB of VRAM, at a small (<0.05 KLD) accuracy cost.
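In llama.cpp the KV-cache types are set per tensor; a sketch of the relevant flags (the model filename is a placeholder):

```shell
# Serve with flash attention and a quantized KV cache: q8_0 keys, q4_0 values.
# -c 65536 requests a 64K context window; -ngl 99 offloads all layers.
# Filename is a placeholder, not an official artifact.
./llama-server -m mistral-medium-3.5-q4_K_M.gguf \
  -c 65536 -ngl 99 \
  --flash-attn --cache-type-k q8_0 --cache-type-v q4_0
```

Note that V-cache quantization requires flash attention to be enabled, which is why the two flags travel together.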

Context-length scaling: 4K vs 16K vs 64K

| Context | KV cache (fp16) | KV cache (q8/q4) | Total VRAM (q4_K_M) | Tok/s (5090) |
|---|---|---|---|---|
| 4K | 0.6 GB | 0.2 GB | 15 GB | 46 |
| 16K | 2.4 GB | 0.8 GB | 17 GB | 41 |
| 32K | 4.8 GB | 1.6 GB | 19 GB | 36 |
| 64K | 9.6 GB | 3.2 GB | 24 GB | 28 |

If you regularly summarize 50k-token docs, the 5090 is comfortable; the 4090 starts juggling layers around 32K and generation throughput falls off. The Mac Studio's value here is real — with 192GB of unified memory you can blow through 128K context without flinching, just at a third of the throughput.
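The table's scaling reduces to one constant: the cache grows linearly at ~0.6 GB per 4K tokens at fp16, and the q8/q4 scheme cuts that to roughly a third. A sketch that reproduces the two KV-cache columns (the per-4K figure is read off the table above, not derived from the model architecture):

```python
# KV cache grows linearly with context length for a dense transformer.
# 0.6 GB per 4K tokens at fp16 is taken from the table above; quantizing
# keys to q8_0 and values to q4_0 shrinks it to roughly a third.
FP16_GB_PER_4K_TOKENS = 0.6

def kv_cache_gb(ctx_tokens: int, quantized: bool = False) -> float:
    gb = FP16_GB_PER_4K_TOKENS * ctx_tokens / 4096
    return round(gb / 3 if quantized else gb, 1)

print(kv_cache_gb(65536))        # 64K context at fp16
print(kv_cache_gb(65536, True))  # 64K context with a quantized cache
```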

Multi-GPU scaling with llama.cpp tensor split

Dual 3090s with NVLink at q6_K hit 28 tok/s vs. 38 tok/s for a single 5090. Without NVLink (PCIe gen4 x8 each) you drop to ~24 tok/s. The split overhead is roughly 15% — not great, not terrible. Power draw is ~700W vs. 575W on the 5090, so the 5090 also wins on perf-per-watt by a wide margin.

For 70B+ models you'd want the 48GB pool. For a 24B like Medium 3.5, the dual-3090 advantage is mostly cost savings, not capability. If you already own a 3090, slotting in a second one is the cheapest upgrade path. If you're starting from zero, the 5090 wins on simplicity, cooling, and noise.
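For reference, the tensor-split invocation the benchmark table mentions looks like this in llama.cpp (the model filename is a placeholder):

```shell
# Splitting Medium 3.5 q6_K across two 3090s with llama.cpp.
# -ts 1,1 splits the layers evenly between the two cards; -ngl 99
# offloads everything to GPU. Filename is a placeholder.
./llama-server -m mistral-medium-3.5-q6_K.gguf \
  -ngl 99 -ts 1,1 -c 32768
```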

Perf-per-dollar and perf-per-watt verdict matrix

April 2026 retail prices (5090 from Newegg/B&H; 3090/4090 secondhand eBay sold-listing median; Mac Studio configured at apple.com).

| Rig | Price ($) | Tok/s (q4_K_M) | Tok/s per $ | Watts | Tok/s per W |
|---|---|---|---|---|---|
| RTX 5090 | 1999 | 44 | 0.022 | 575 | 0.077 |
| RTX 4090 (used) | 1300 | 36 | 0.028 | 450 | 0.080 |
| Dual RTX 3090 (used) | 1400 | 30 | 0.021 | 700 | 0.043 |
| Apple M3 Ultra 192GB | 5599 | 14 | 0.0025 | 215 | 0.065 |
| Single RTX 3090 | 700 | 22 | 0.031 | 350 | 0.063 |

Single 3090 wins on perf-per-dollar but caps you at q4 with no headroom for 32K context. The 4090 has the best efficiency in this lineup, but stocks are drying up and the used premium is climbing. The 5090 is the only no-compromise single-GPU pick if you can absorb the price.
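The two efficiency columns are plain ratios, so the matrix is easy to recompute with your own local prices; the numbers here are copied from the table above:

```python
# Recompute tok/s-per-dollar and tok/s-per-watt from (price, tok/s, watts).
rigs = {
    "RTX 5090":             (1999, 44, 575),
    "RTX 4090 (used)":      (1300, 36, 450),
    "Dual RTX 3090 (used)": (1400, 30, 700),
    "Apple M3 Ultra 192GB": (5599, 14, 215),
    "Single RTX 3090":      (700,  22, 350),
}
for rig, (usd, tok_s, watts) in rigs.items():
    print(f"{rig:21s} {tok_s/usd:.4f} tok/s/$   {tok_s/watts:.3f} tok/s/W")
```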

Bottom line: which rig should you buy for Mistral Medium 3.5?

  • Best overall: RTX 5090. q6 fits, 32K context fits, ~38 tok/s. Single card, simple cooling.
  • Best value: Used RTX 3090 at ~$700. Caps you at q4_K_M with 8K context, but it works.
  • Best for long context: Apple M3 Ultra 192GB. Slow throughput, but 128K+ context without juggling.
  • Best for 70B headroom: Dual RTX 3090 used. 48GB pool also handles Llama 3 70B at q4.
  • Avoid: RTX 5080 16GB. Even q3 is tight; you'll be miserable.

Real-world latency: what 38 tok/s actually feels like

Throughput in tok/s is the headline number reviewers cite, but interactive feel is dominated by time-to-first-token (TTFT) and the natural-reading-rate threshold. Adults read prose silently at roughly 4-5 tokens per second. Anything above ~12 tok/s feels conversational; above 25 tok/s feels instant. A 5090 at 38 tok/s on Medium 3.5 q6 is in "instant" territory for chat-length responses.

For longer outputs (1k-token answers) the difference between 22 tok/s on a 3090 and 38 tok/s on a 5090 is the difference between a 45-second wait and a 26-second wait. Both are usable; the 5090 just stops feeling like a constraint.
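The arithmetic behind those waits, ignoring time-to-first-token:

```python
# Wall-clock wait for a complete answer at a given generation rate.
def wait_seconds(tokens: int, tok_per_s: float) -> int:
    return round(tokens / tok_per_s)

print(wait_seconds(1000, 22))  # 1k-token answer on a 3090 at q4_K_M
print(wait_seconds(1000, 38))  # same answer on a 5090 at q6
```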

Streaming UIs hide a lot of this — once tokens are flowing the user reads as they arrive — but agentic workflows where you wait for a complete tool-use response feel the throughput difference more directly. If you're chaining 5 LLM calls in a pipeline, the 16 tok/s gap compounds.

Common pitfalls to avoid

  • Mixing GPU vendors with -ts: llama.cpp tensor split assumes equal VRAM per device. A 24GB + 12GB pair will OOM the small card unless you manually offload layers asymmetrically with --main-gpu and -ngl.
  • Forgetting --flash-attn: It's not the default in older builds. Without it, KV cache stays at fp16 and your 32K context will OOM at q4_K_M on a 24GB card.
  • Pulling Q4_0 instead of Q4_K_M: Some HuggingFace mirrors only have legacy quants. Q4_0 has noticeably worse quality than Q4_K_M at the same size.
  • Underclocked PCIe: Risers and bifurcation cards often drop to gen3 x4. With a 14GB model that's still fine for inference, but cold-load times suffer.
  • CPU offload silently kicking in: If -ngl is too low for the model+context, llama.cpp will run partially on CPU and you'll see 4 tok/s instead of 40. Always verify with --verbose.

When NOT to run Medium 3.5 locally

If your usage is sporadic — a dozen prompts a day, mostly short — Mistral's hosted API is cheaper than electricity plus hardware amortization. The break-even is roughly 200k tokens per day at hosted-API list prices vs. a 5090 amortized over 2 years at $0.15/kWh. Below that, just use Le Chat or the API.
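Where the break-even lands for you depends on inputs this article doesn't pin down for your setup: hosted per-token price, daily GPU-on hours, electricity rate, and amortization window. A generic calculator with every input supplied by the caller (nothing here is a quoted price):

```python
# Daily local cost (hardware amortization + electricity) converted into the
# number of hosted-API tokens that same money would buy per day.
def breakeven_tokens_per_day(gpu_usd: float, years: float, watts: float,
                             hours_per_day: float, usd_per_kwh: float,
                             hosted_usd_per_mtok: float) -> int:
    daily_hw = gpu_usd / (years * 365)
    daily_power = watts / 1000 * hours_per_day * usd_per_kwh
    return round((daily_hw + daily_power) / hosted_usd_per_mtok * 1_000_000)
```

If your typical day's usage lands well below the number this returns for your inputs, the hosted API is the cheaper option.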

If you need sub-200ms time-to-first-token for production traffic, you'll be batching requests, and a single consumer-grade GPU isn't going to give you the latency you're after — that's H100/H200 or hosted-inference territory.

Related guides

  • Best GPUs for Local LLM Inference 2026
  • Qwen 3.6 27B Quantization Benchmarks
  • Best GPU for 27B/32B Local LLMs
  • Mac Studio vs RTX 5090 for Local AI

Sources

  • LocalLLaMA Mistral Medium 3.5 release thread (reddit.com/r/LocalLLaMA, April 2026)
  • llama.cpp PRs #11842 + #11901 (Mistral Medium 3.5 chat template support)
  • TechPowerUp RTX 5090 review (techpowerup.com)
  • Mistral release notes (mistral.ai/news/mistral-medium-3-5)
  • HuggingFace bartowski/Mistral-Medium-3.5-GGUF imatrix calibrations

— SpecPicks Editorial · Last verified 2026-04-29