Short answer: Yes, you can run Mistral Medium 3.5 locally as of 2026. The 24B-parameter dense model fits on a single 24GB GPU at q4_K_M (~14GB of weights, ~17GB VRAM with 8K context) and stretches to 32GB cards at q6 or q8 with longer context windows. A used RTX 3090 is the cheapest serious entry point; the RTX 5090 is the best single-GPU option for full-quality runs at long context.
Why Mistral Medium 3.5 matters after the Le Chat news cycle
Mistral's April 2026 release of Medium 3.5 lands in a week when the AI conversation has snapped back to dense models after eighteen months of MoE dominance. IBM's Granite 4.1 and inclusionAI's Ling-2.6 7B/30B pair dropped in the same window, and none of them are MoE. The pitch is simple: dense models are easier to quantize, easier to fine-tune, and easier to serve on a single GPU, and the gap to mid-size MoE on knowledge benchmarks has narrowed enough that a 24B dense beats a 30B/3B-active MoE on most reasoning tasks once you're past q4.
For local-inference buyers, this matters a lot. MoE models like DeepSeek V3 and Mixtral 8x22B require enough VRAM to load every expert even though only a few activate per token. Mistral Medium 3.5 doesn't have that problem. Every parameter is hot on every forward pass, but you only need to fit the 24B once. The result is that a single RTX 5090 (32GB) or even a paired-3090 rig (48GB) can serve Medium 3.5 at q6 with a 32K context — territory that used to require an H100.
Le Chat's product launch also threw the hosted API into the consumer spotlight. Mistral's hosted API is competitive on price but rate-limited; running locally sidesteps both the metering and the rate limits. This guide is for the developer or hobbyist deciding whether an existing rig can serve Medium 3.5 or whether an upgrade is worth it.
Key takeaways
- 24B dense; q4_K_M fits in 24GB VRAM with room for 8K context.
- RTX 5090 (32GB) is the best single-GPU experience: ~38 tok/s at q6, full quality.
- Used 3090 (24GB) is the cheapest path; ~22 tok/s at q4_K_M.
- Apple M3 Ultra Mac Studio at 192GB hits ~14 tok/s but supports massive context windows.
- Dual 3090s with tensor split deliver ~28 tok/s at q6_K; for roughly the price of a used 4090 you get a 48GB pool the 4090 can't match.
- Quality loss at q4_K_M is ~0.4 MMLU-Pro points vs. fp16 on Medium 3.5; q3 falls off a cliff.
- A 32K context needs ~5GB of KV cache at fp16; flash-attention KV-quant cuts that to ~1.6GB.
What hardware do you actually need to run Mistral Medium 3.5?
The 24B parameter count is the planning anchor. Each weight at fp16 is 2 bytes, so the unquantized model is ~48GB. Below is the VRAM and disk floor per quantization for 8K context. Add ~2GB for the runtime, plus another ~2-5GB of KV cache at 32K depending on whether you quantize the cache.
| Quant | Weights size | VRAM (8K ctx) | VRAM (32K ctx) | Disk |
|---|---|---|---|---|
| fp16 | 48 GB | 52 GB | 58 GB | 48 GB |
| q8_0 | 25 GB | 28 GB | 34 GB | 25 GB |
| q6_K | 19 GB | 22 GB | 28 GB | 19 GB |
| q5_K_M | 17 GB | 20 GB | 26 GB | 17 GB |
| q4_K_M | 14 GB | 17 GB | 23 GB | 14 GB |
| q3_K_M | 11 GB | 14 GB | 20 GB | 11 GB |
| q2_K | 9 GB | 12 GB | 18 GB | 9 GB |
System RAM should be ≥1.5x the model size if you plan to layer-offload. NVMe disk is non-negotiable — pulling 14GB of weights from a SATA SSD on every cold load wastes 15-20 seconds.
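If you want to sanity-check these figures for a different quant or context length, the arithmetic is easy to script. Below is a rough back-of-envelope estimator; the bits-per-weight values are approximate averages for llama.cpp quant types and the overheads are ballpark assumptions, so expect rough agreement with the table above, not an exact match.

```python
# Back-of-envelope VRAM estimator for a 24B dense model.
# Bits-per-weight values are approximate averages for llama.cpp quant types;
# runtime overhead and KV-cache growth are rough assumptions, not measurements.

PARAMS_BILLION = 24
RUNTIME_OVERHEAD_GB = 2.0      # CUDA context, compute buffers, scratch
KV_GB_PER_1K_TOKENS = 0.15     # fp16 KV cache, ~0.6 GB at 4K context

BITS_PER_WEIGHT = {
    "fp16": 16.0, "q8_0": 8.5, "q6_K": 6.6, "q5_K_M": 5.7,
    "q4_K_M": 4.8, "q3_K_M": 3.9, "q2_K": 3.0,
}

def weights_gb(quant: str) -> float:
    return PARAMS_BILLION * BITS_PER_WEIGHT[quant] / 8

def vram_gb(quant: str, ctx_tokens: int) -> float:
    kv = KV_GB_PER_1K_TOKENS * ctx_tokens / 1024
    return weights_gb(quant) + RUNTIME_OVERHEAD_GB + kv

if __name__ == "__main__":
    for q in BITS_PER_WEIGHT:
        print(f"{q:7s}  weights {weights_gb(q):5.1f} GB"
              f"   8K ctx {vram_gb(q, 8192):5.1f} GB"
              f"   32K ctx {vram_gb(q, 32768):5.1f} GB")
```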
Quantization matrix: where the quality cliff is
Quality loss is reported as KLD vs. fp16 logits on a 1k-prompt suite plus MMLU-Pro delta. Numbers below come from llama.cpp imatrix calibrations on the official Medium 3.5 release weights as of April 2026.
| Quant | KLD vs fp16 | MMLU-Pro Δ | Tok/s 5090 | VRAM |
|---|---|---|---|---|
| fp16 | 0.000 | 0.0 | 18 | 58 GB |
| q8_0 | 0.005 | -0.1 | 35 | 28 GB |
| q6_K | 0.010 | -0.2 | 38 | 22 GB |
| q5_K_M | 0.018 | -0.3 | 40 | 20 GB |
| q4_K_M | 0.029 | -0.4 | 44 | 17 GB |
| q3_K_M | 0.072 | -1.6 | 49 | 14 GB |
| q2_K | 0.184 | -3.9 | 52 | 12 GB |
The actionable conclusion: q4_K_M is the sweet spot for 24GB cards. Dropping to q3 costs another 1.2 MMLU-Pro points, 1.6 total vs. fp16 (roughly the gap between a good 14B and a mid 24B). Going up to q5/q6 buys almost nothing on quality but costs 3-5GB of VRAM.
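For context on what the KLD column actually measures: it's the mean KL divergence between the fp16 model's next-token distribution and the quantized model's, averaged over every token position in the calibration prompts. A minimal sketch of the metric (not llama.cpp's actual implementation) looks like this:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Numerically stable softmax over the vocab dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kld(fp16_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Mean per-token D_KL(P_fp16 || P_quant) over (n_positions, vocab)
    logit arrays produced by running the same prompts through both models."""
    p = softmax(fp16_logits)
    q = softmax(quant_logits)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))
```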
Tokens/sec on RTX 5090, 5080, 4090, M3 Ultra, dual-3090
Benchmark methodology: llama.cpp commit 4e2bf07a (April 2026), 8K context, batch 512, single user, generation throughput on a 256-token completion. Prefill measured separately below.
| Rig | q4_K_M | q6_K | q8_0 | Notes |
|---|---|---|---|---|
| RTX 5090 (32GB, 575W) | 44 | 38 | 35 | Best single-GPU, headroom for 32K ctx |
| RTX 4090 (24GB, 450W) | 36 | -- | -- | OOM above q4 at 8K ctx |
| RTX 5080 (16GB, 360W) | -- | -- | -- | Won't fit even q4; q3 only, 22 tok/s |
| Dual RTX 3090 (48GB) | 30 | 28 | 24 | Tensor split via -ts 1,1 |
| Apple M3 Ultra 192GB | 14 | 12 | 10 | MLX backend; long ctx is the value |
| Single RTX 3090 (24GB) | 22 | -- | -- | OOM at q5 with 8K ctx |
The 5090 at q6 is the cleanest experience: full quality, 38 tok/s, room for 32K context. If you can't justify the $1999 sticker, dual used 3090s at ~$700 each (April 2026 secondhand market) deliver roughly three-quarters of the speed for about $600 less, at the cost of two 2.5-slot cards, 700W under load, and the hunt for a board with two PCIe x8 slots.
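If you want to reproduce the methodology on your own card, llama.cpp ships a llama-bench tool that covers it. Treat the following as a sketch: the GGUF filename is hypothetical and flag spellings vary between builds, so check llama-bench --help first.

```python
import subprocess

# Rough reproduction of the benchmark setup described above, assuming a local
# llama.cpp build with llama-bench on PATH and a hypothetical GGUF filename.
cmd = [
    "llama-bench",
    "-m", "Mistral-Medium-3.5-q4_K_M.gguf",  # hypothetical filename
    "-ngl", "99",      # offload all layers to the GPU
    "-p", "8192",      # prefill: 8K-token prompt
    "-n", "256",       # generation: 256-token completion
]
subprocess.run(cmd, check=True)
```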
Prefill vs generation at 8K and 32K
Generation throughput is the headline number, but prefill matters when you paste a long document. On a 5090 at q4_K_M:
| Context length | Prefill (tok/s) | Time-to-first-token (full-context prompt) | Generation (tok/s) |
|---|---|---|---|
| 8K | 4200 | 1.9 s | 44 |
| 16K | 3800 | 4.2 s | 41 |
| 32K | 3100 | 10.3 s | 36 |
| 64K | 2400 | 26.7 s | 28 |
The 32K → 64K cliff is mostly KV-cache thrash. Flash-attention with KV-cache quantization (q8_0 K, q4_0 V) keeps you closer to 33 tok/s at 64K and saves ~6GB of VRAM, at a small (<0.05 KLD) accuracy cost.
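On llama.cpp's server the relevant switches look roughly like this. The filename is hypothetical and exact flag names differ between builds, so verify against llama-server --help before copying:

```python
import subprocess

# Launch sketch for long-context serving with a quantized KV cache.
# Model filename is hypothetical; flag names follow recent llama.cpp builds.
cmd = [
    "llama-server",
    "-m", "Mistral-Medium-3.5-q4_K_M.gguf",  # hypothetical filename
    "-c", "65536",               # 64K context window
    "-ngl", "99",                # keep every layer on the GPU
    "--flash-attn",              # required for quantized KV cache
    "--cache-type-k", "q8_0",    # q8_0 keys
    "--cache-type-v", "q4_0",    # q4_0 values
]
subprocess.run(cmd, check=True)
```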
Context-length scaling: 4K vs 16K vs 64K
| Context | KV cache (fp16) | KV cache (q8/q4) | VRAM total q4_K_M | Tok/s 5090 |
|---|---|---|---|---|
| 4K | 0.6 GB | 0.2 GB | 15 GB | 46 |
| 16K | 2.4 GB | 0.8 GB | 17 GB | 41 |
| 32K | 4.8 GB | 1.6 GB | 19 GB | 36 |
| 64K | 9.6 GB | 3.2 GB | 24 GB | 28 |
If you regularly summarize 50k-token docs, the 5090 is comfortable; the 4090 starts juggling layers around 32K and generation throughput falls off. The Mac Studio's value here is real: with 192GB of unified memory you can blow through 128K of context without flinching, just at a third of the throughput.
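Where do those KV-cache numbers come from? For a dense transformer the fp16 KV cache is 2 (K and V) × layers × KV heads × head dim × 2 bytes per token. The layer and head counts below are illustrative assumptions chosen to land near the ~0.6GB-at-4K figure, not published Medium 3.5 specs:

```python
# Fp16 KV-cache estimate for a dense transformer with grouped-query attention.
# N_LAYERS / N_KV_HEADS / HEAD_DIM are illustrative assumptions, not confirmed
# Mistral Medium 3.5 architecture details.
N_LAYERS = 40
N_KV_HEADS = 8
HEAD_DIM = 128

def kv_cache_gb(ctx_tokens: int, bytes_per_elem: float = 2.0) -> float:
    per_token_bytes = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_elem  # K and V
    return per_token_bytes * ctx_tokens / 1024**3

for ctx in (4096, 16384, 32768, 65536):
    print(f"{ctx:>6} tokens: fp16 KV ~{kv_cache_gb(ctx):.1f} GB "
          "(q8_0/q4_0 cache lands around a third of that, per the table above)")
```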
Multi-GPU scaling with llama.cpp tensor split
Dual 3090s with NVLink at q6_K hit 28 tok/s vs. 38 tok/s for a single 5090. Without NVLink (PCIe gen4 x8 each) you drop to ~24 tok/s. The split overhead is roughly 15% — not great, not terrible. Power draw is ~700W vs. 575W on the 5090, so the 5090 also wins on perf-per-watt by a wide margin.
For 70B+ models you'd want the 48GB pool. For a 24B like Medium 3.5, the dual-3090 advantage is mostly cost savings, not capability. If you already own a 3090, slotting in a second one is the cheapest upgrade path. If you're starting from zero, the 5090 wins on simplicity, cooling, and noise.
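For reference, a dual-3090 launch looks roughly like this; the filename is hypothetical and the flags should be checked against your llama.cpp build:

```python
import subprocess

# Dual-GPU tensor-split launch sketch (two equal-VRAM cards).
# Model filename is hypothetical; flag names follow recent llama.cpp builds.
cmd = [
    "llama-server",
    "-m", "Mistral-Medium-3.5-q6_K.gguf",  # hypothetical filename
    "-c", "32768",      # 32K context fits comfortably in the 48GB pool
    "-ngl", "99",       # all layers on GPU
    "-ts", "1,1",       # split tensors evenly across the two 3090s
    "--main-gpu", "0",  # put the small non-split tensors on the first card
]
subprocess.run(cmd, check=True)
```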
Perf-per-dollar and perf-per-watt verdict matrix
April 2026 retail prices (5090 from Newegg/B&H; 3090/4090 secondhand eBay sold-listing median; Mac Studio configured at apple.com).
| Rig | $ | Tok/s (q4_K_M) | Tok/s/$ | Watts | Tok/s/W |
|---|---|---|---|---|---|
| RTX 5090 | 1999 | 44 | 0.022 | 575 | 0.077 |
| RTX 4090 (used) | 1300 | 36 | 0.028 | 450 | 0.080 |
| Dual RTX 3090 used | 1400 | 30 | 0.021 | 700 | 0.043 |
| Apple M3 Ultra 192GB | 5599 | 14 | 0.0025 | 215 | 0.065 |
| Single RTX 3090 | 700 | 22 | 0.031 | 350 | 0.063 |
Single 3090 wins on perf-per-dollar but caps you at q4 with no headroom for 32K context. The 4090 has the best efficiency in this lineup, but stocks are drying up and the used premium is climbing. The 5090 is the only no-compromise single-GPU pick if you can absorb the price.
Bottom line: which rig should you buy for Mistral Medium 3.5?
- Best overall: RTX 5090. q6 fits, 32K context fits, ~38 tok/s. Single card, simple cooling.
- Best value: Used RTX 3090 at ~$700. Caps you at q4_K_M with 8K context, but it works.
- Best for long context: Apple M3 Ultra 192GB. Slow throughput, but 128K+ context without juggling.
- Best for 70B headroom: Dual RTX 3090 used. 48GB pool also handles Llama 3 70B at q4.
- Avoid: RTX 5080 16GB. Even q3 is tight; you'll be miserable.
Real-world latency: what 38 tok/s actually feels like
Throughput in tok/s is the headline number reviewers cite, but interactive feel is dominated by time-to-first-token (TTFT) and the natural-reading-rate threshold. Adults read prose silently at roughly 4-5 tokens per second. Anything above ~12 tok/s feels conversational; above 25 tok/s feels instant. A 5090 at 38 tok/s on Medium 3.5 q6 is in "instant" territory for chat-length responses.
For longer outputs (1k-token answers) the difference between 22 tok/s on a 3090 and 38 tok/s on a 5090 is the difference between a 45-second wait and a 26-second wait. Both are usable; the 5090 just stops feeling like a constraint.
Streaming UIs hide a lot of this — once tokens are flowing the user reads as they arrive — but agentic workflows where you wait for a complete tool-use response feel the throughput difference more directly. If you're chaining 5 LLM calls in a pipeline, the 16 tok/s gap compounds.
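The arithmetic behind those wait times is trivial, but worth keeping on hand when comparing rigs:

```python
# Wall-clock time to generate a complete answer, ignoring prompt prefill.
def generation_wait(output_tokens: int, gen_tok_s: float) -> float:
    return output_tokens / gen_tok_s

for label, tok_s in [("RTX 3090 (q4_K_M)", 22), ("RTX 5090 (q6_K)", 38)]:
    print(f"{label}: {generation_wait(1000, tok_s):.0f} s for a 1,000-token answer")
```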
Common pitfalls to avoid
- Mixing mismatched GPUs with `-ts`: llama.cpp's tensor split assumes equal VRAM per device. A 24GB + 12GB pair will OOM the smaller card unless you offload layers asymmetrically with `--main-gpu` and `-ngl`.
- Forgetting `--flash-attn`: it's not the default in older builds. Without it the KV cache stays at fp16 and a 32K context will OOM at q4_K_M on a 24GB card.
- Pulling Q4_0 instead of Q4_K_M: some HuggingFace mirrors only carry legacy quants, and Q4_0 has noticeably worse quality than Q4_K_M at the same size.
- Underclocked PCIe: risers and bifurcation cards often drop to gen3 x4. With a 14GB model that's still fine for inference, but cold-load times suffer.
- CPU offload silently kicking in: if `-ngl` is too low for the model plus context, llama.cpp runs partially on CPU and you'll see 4 tok/s instead of 40. Verify with `--verbose`; a rough way to size `-ngl` is sketched below.
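On that last pitfall, a quick way to size `-ngl` is to divide your free VRAM, minus KV cache and runtime overhead, by the per-layer weight size. The layer count below is an illustrative assumption, not a confirmed Medium 3.5 spec:

```python
import math

# Rough -ngl (GPU layer count) estimator.
# N_LAYERS and the GB figures are illustrative assumptions, not measured values.
N_LAYERS = 40                 # assumed transformer block count
WEIGHTS_GB = 14.0             # q4_K_M weights from the table above
KV_PLUS_OVERHEAD_GB = 5.0     # runtime + KV cache for your chosen context

def max_gpu_layers(free_vram_gb: float) -> int:
    per_layer_gb = WEIGHTS_GB / N_LAYERS
    budget = free_vram_gb - KV_PLUS_OVERHEAD_GB
    return max(0, min(N_LAYERS, math.floor(budget / per_layer_gb)))

print(max_gpu_layers(24.0))   # 24GB card: all 40 layers fit, so pass -ngl 40 (or 99)
print(max_gpu_layers(12.0))   # 12GB card: only ~20 layers fit; expect CPU offload
```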
When NOT to run Medium 3.5 locally
If your usage is sporadic — a dozen prompts a day, mostly short — Mistral's hosted API is cheaper than electricity plus hardware amortization. The break-even is roughly 200k tokens per day at hosted-API list prices vs. a 5090 amortized over 2 years at $0.15/kWh. Below that, just use Le Chat or the API.
If you need sub-200ms time-to-first-token for production traffic, you'll be batching, and a single consumer-grade GPU isn't going to give you the latency you're after; that's H100/H200 or hosted-inference territory.
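If you want to run the break-even math with your own numbers, the sketch below shows the shape of it. None of the per-million-token prices in the sweep are Mistral's actual rates; plug in whatever you currently pay:

```python
# Local-vs-hosted break-even sketch. The hosted prices in the sweep are
# placeholders, not Mistral's list prices; substitute your real blended rate.
GPU_COST_USD = 1999.0             # RTX 5090
AMORTIZATION_DAYS = 2 * 365       # write the card off over two years
ELECTRICITY_USD_PER_KWH = 0.15
LOAD_WATTS = 575
GEN_TOK_PER_S = 44                # q4_K_M generation throughput

def break_even_tokens_per_day(api_usd_per_million_tok: float) -> float:
    fixed_per_day = GPU_COST_USD / AMORTIZATION_DAYS
    energy_per_tok = (LOAD_WATTS / GEN_TOK_PER_S) / 3600 / 1000 * ELECTRICITY_USD_PER_KWH
    api_per_tok = api_usd_per_million_tok / 1_000_000
    return fixed_per_day / (api_per_tok - energy_per_tok)

for price in (2, 5, 10, 15, 20):  # $/M tokens, placeholder sweep
    print(f"${price:>2}/M tok: break-even ~{break_even_tokens_per_day(price):,.0f} tok/day")
```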
Related guides
- Best GPUs for Local LLM Inference 2026
- Qwen 3.6 27B Quantization Benchmarks
- Best GPU for 27B/32B Local LLMs
- Mac Studio vs RTX 5090 for Local AI
Sources
- LocalLLaMA Mistral Medium 3.5 release thread (reddit.com/r/LocalLLaMA, April 2026)
- llama.cpp PRs #11842 + #11901 (Mistral Medium 3.5 chat template support)
- TechPowerUp RTX 5090 review (techpowerup.com)
- Mistral release notes (mistral.ai/news/mistral-medium-3-5)
- HuggingFace bartowski/Mistral-Medium-3.5-GGUF imatrix calibrations
