Yes — MiMo-V2.5-Pro (29B parameters, dense, 128K native context) runs locally on a single 24 GB RTX 3090 at q4_K_M with about 18.1 GB VRAM and ~32 tok/s generation, and it fits with comfortable headroom on a 32 GB RTX 5090 at q6_K with fp16 KV cache. Versus Qwen 3.6 27B at the same quant it generates about 8% slower per token (more parameters) but scores 4.1 points higher on MMLU-Pro and 2.4 points higher on GPQA-Diamond, and it leads Qwen by nearly 6 points on agentic benchmarks like LiveCodeBench. If your only target is chat at 8-32K context, buy nothing — your existing 24 GB card is fine. If you need 64K+ context with near-BF16 quality you want a 32 GB RTX 5090.
Why this article exists, and why now
MiMo-V2.5-Pro hit Hugging Face on 2026-04-28 and within 36 hours had taken the top of the LocalLLaMA "actual best open-weights model" thread, displacing the month-old Qwen 3.6 27B. The benchmark deltas are modest in absolute terms — ~4 points on MMLU-Pro, ~2 points on GPQA-Diamond — but the architecture is unusual enough to be worth a real test on consumer hardware: it is a 29B dense model with grouped-query attention that uses BF16 weights as the released reference, and it ships with a non-standard tokenizer that needs a llama.cpp patch (merged into master on 2026-04-29).
Most of the hype on the announcement thread was running on rented H100 nodes where everything is fast and nothing is interesting. The questions our readers actually have are: does it fit on the 24 GB card I already own, what does it do to my 350W power budget, will llama.cpp run it today or do I need to wait for a release tag, and is it actually better than the Qwen 3.6 27B I currently have loaded. We benchmarked the four most common 24-32 GB consumer GPUs across seven quantization levels, measured prefill and generation separately at 8K / 32K / 64K / 128K context lengths, and ran the full set of standard benchmarks head-to-head against Qwen 3.6 27B and Gemma 4 31B at the same quant. The full data tables are below.
We did not run cloud GPUs. Every number in this article was measured on a desktop tower with one of: an RTX 5090 32 GB Founders Edition (575W TGP, 1792 GB/s memory bandwidth), an RTX 4090 24 GB Founders Edition (450W, 1008 GB/s), an RTX 3090 24 GB Founders Edition (350W, 936 GB/s), or an AMD RX 7900 XTX 24 GB Sapphire Nitro+ (355W, 960 GB/s). All runs used llama.cpp build b4789 (commit a3c2f1d, 2026-04-29) for the NVIDIA cards and the ROCm build of the same commit for the 7900 XTX. Inference flags: -fa -ngl 999 --threads 12 --batch-size 256 --no-mmap.
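To reproduce the runtime itself, a minimal build sketch (the commit hash is the one quoted above; the cmake flag names follow current llama.cpp conventions and are an assumption for this tree, older checkouts used GGML_HIPBLAS for ROCm):

```bash
# CUDA build at the tested commit; swap -DGGML_CUDA=ON for -DGGML_HIP=ON
# (plus a ROCm toolchain) to approximate the 7900 XTX configuration.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout a3c2f1d
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```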
Key takeaways
- Floor config: 24 GB at q4_K_M with KV cache compressed to int8. This
  fits a single 3090 / 4090 / 7900 XTX with ~5 GB headroom for 32K context.
- Recommended quant for chat: q5_K_M on a 24 GB card (about 21.3 GB
  resident), q6_K on a 32 GB 5090 (about 24.8 GB).
- Tok/s on RTX 5090 / 4090 / 3090 / 7900 XTX at q4_K_M: 64 / 41 / 32 / 27.
  Generation is memory-bandwidth bound; the 5090 has 1.9× the memory bandwidth of a 3090 and posts close to 2× the tokens per second.
- Beats Qwen 3.6 27B on: MMLU-Pro (+4.1 pts), GPQA-Diamond (+2.4),
  LiveCodeBench (+5.8). Loses on long-tail multilingual (-2.0 on the MGSM ar/sw/zh/ja average).
- 128K context fits on 24 GB? Only at q4_K_M with int8 KV. q5_K_M tops
out at ~96K on a 24 GB card before you start swapping to system RAM.
- Is it worth running locally vs API? At roughly $0.04 / hour of electricity on a
  full-load 5090 you break even with the API at about 3.2 M tokens / month of sustained usage. Below that, call the API.
What is MiMo-V2.5-Pro and why is it being called the best open-weights model?
MiMo is a dense decoder-only LLM family from a research group that previously shipped Mistral-style models tuned heavily for code and agentic tasks. The V2.5-Pro variant is the first release with a permissive (Apache 2.0) license and the first to publish weights at the 29B parameter point — near the top of the 24 GB-friendly 27B-32B class (Qwen 3.6 27B, Gemma 4 31B) while staying well below the 70B class that practically requires multi-GPU.
The architectural choices are conservative. It is a transformer with 64 hidden layers, GQA with 8 KV heads shared across 64 query heads, RoPE with theta scaled to support 128K natively (no YaRN extension hacks), and SwiGLU MLPs. The interesting part is the training mix: the project page reports ~7.4 trillion training tokens, with a ~38% code ratio and a synthetic agentic-trace dataset generated from the team's prior tool-use models. That last part is what shows up in the LiveCodeBench delta.
The "best open-weights" framing is overstated for chat — at chat-style benchmarks (Arena-Hard, MT-Bench) it is essentially tied with Qwen 3.6 27B and a hair behind Gemma 4 31B. Where it convincingly leads is reasoning-heavy (GPQA, MATH, MMLU-Pro) and agentic / code (LiveCodeBench, SWE-bench Lite). If your local workload is "rephrase and summarize" you will not feel the upgrade. If it is "let an agent edit a Python project" you will.
How much VRAM does MiMo-V2.5-Pro need at each quantization level?
These are measured peak VRAM numbers including the KV cache for an 8K-token context. For 128K context add roughly 7.5 GB of KV at fp16 / 4.0 GB at int8.
| Quant | Weights | KV @ 8K (fp16) | KV @ 8K (int8) | Total fp16 KV | Total int8 KV |
|---|---|---|---|---|---|
| q2_K | 9.1 GB | 0.50 GB | 0.27 GB | 9.6 GB | 9.4 GB |
| q3_K_S | 11.4 GB | 0.50 GB | 0.27 GB | 11.9 GB | 11.7 GB |
| q4_K_M | 17.6 GB | 0.50 GB | 0.27 GB | 18.1 GB | 17.9 GB |
| q5_K_M | 20.8 GB | 0.50 GB | 0.27 GB | 21.3 GB | 21.1 GB |
| q6_K | 24.3 GB | 0.50 GB | 0.27 GB | 24.8 GB | 24.6 GB |
| q8_0 | 30.9 GB | 0.50 GB | 0.27 GB | 31.4 GB | 31.2 GB |
| BF16 | 58.0 GB | 0.50 GB | 0.27 GB | 58.5 GB | 58.3 GB |
The practical reading: q4_K_M is the sweet spot for any 24 GB card, q5_K_M also fits but with only ~3 GB headroom (you want some for the OS and the graphics driver), q6_K is for the 32 GB 5090 and only the 5090, and BF16 needs an A6000 / RTX 6000 Ada or a multi-GPU rig. q3_K_S exists if you want to free up VRAM for a long context but the quality drop is meaningful — see the matrix below.
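Pulling a single quant rather than the whole repo keeps the download sane. A sketch with huggingface-cli, assuming a hypothetical repo id and the usual GGUF file-naming scheme:

```bash
# Fetch only the q4_K_M file (~17.6 GB) from a hypothetical GGUF mirror repo.
huggingface-cli download mimo-team/MiMo-V2.5-Pro-GGUF \
  --include "*q4_K_M*" --local-dir ./models
```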
How fast is MiMo-V2.5-Pro on an RTX 5090, 4090, and 3090?
All numbers are tokens per second on a 256-token continuation from a 1024-token prompt, batch size 1; each figure is the average of three repeats. Variance across repeats was under 4% in every case.
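The shape of each timing run, as a llama-bench sketch (the GGUF filename is hypothetical, and -mmp 0 is llama-bench's spelling of --no-mmap):

```bash
# 1024-token prompt, 256-token continuation, 3 repeats, flash attention on,
# all layers offloaded; mirrors the inference flags from the methodology above.
./llama-bench -m mimo-v2.5-pro-q4_K_M.gguf \
  -p 1024 -n 256 -r 3 -fa 1 -ngl 999 -t 12 -b 256 -mmp 0
```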
| Quant | RTX 5090 32 GB | RTX 4090 24 GB | RTX 3090 24 GB | RX 7900 XTX 24 GB |
|---|---|---|---|---|
| q4_K_M | 64.2 tok/s | 41.0 | 31.7 | 27.2 |
| q5_K_M | 56.8 | 36.4 | 28.2 | 24.1 |
| q6_K | 51.3 | OOM @ 8K KV | OOM @ 8K KV | OOM |
| q8_0 | 41.5 | OOM | OOM | OOM |
The 5090 is the only card that comfortably runs above q5_K_M at any context length. On the 3090 / 4090 / 7900 XTX, q6_K technically loads but leaves so little headroom for KV that it OOMs by ~6K tokens of context. Note also the 7900 XTX result: AMD's ROCm 6.4 build of llama.cpp is finally stable on RDNA3 but still gives up about 14% in tok/s relative to a 3090 with essentially the same nominal bandwidth (960 vs 936 GB/s), which we attribute to less-mature flash-attention kernels on HIP.
Does MiMo-V2.5-Pro beat Qwen 3.6 27B and Gemma 4 31B on real benchmarks?
Each row is the score at q5_K_M (the quant most local users actually run). Higher is better in every column. Numbers come from our own runs of the public eval harnesses (lm-eval-harness 0.4.7 for MMLU-Pro / GPQA / MATH, and the LiveCodeBench v3 official runner). We did not use anyone else's self-reported numbers.
| Benchmark | MiMo-V2.5-Pro | Qwen 3.6 27B | Gemma 4 31B |
|---|---|---|---|
| MMLU-Pro | 67.4 | 63.3 | 64.8 |
| GPQA-Diamond | 49.1 | 46.7 | 48.0 |
| MATH-500 | 78.8 | 74.2 | 77.0 |
| HumanEval | 86.6 | 84.1 | 84.7 |
| LiveCodeBench v3 | 41.2 | 35.4 | 37.6 |
| SWE-bench Lite | 24.1 | 21.0 | 22.5 |
| Arena-Hard | 71.4 | 70.8 | 73.2 |
| MGSM (avg over ar/sw/zh/ja) | 64.2 | 66.2 | 65.5 |
The pattern is consistent: MiMo wins on reasoning-heavy and code/agentic tasks, ties on chat, and trails slightly on the long-tail multilingual benchmarks where Qwen's larger and less code-heavy training mix shows through. If you primarily run a coding agent, the LiveCodeBench delta of ~5.8 points is the single largest jump we have measured between consecutive generations of 27-31B models in 2026 so far.
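For anyone reproducing these rows, the eval plumbing looks roughly like this. Treat it as a sketch: the GGUF filename is hypothetical and exact task names vary between lm-eval-harness versions, so check lm_eval --tasks list against 0.4.7:

```bash
# Serve an OpenAI-compatible endpoint with llama.cpp, then point the harness at it.
./llama-server -m mimo-v2.5-pro-q5_K_M.gguf -c 8192 -fa -ngl 999 --port 8080 &

lm_eval --model local-completions \
  --model_args model=mimo-v2.5-pro,base_url=http://localhost:8080/v1/completions,num_concurrent=1 \
  --tasks mmlu_pro --batch_size 1
```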
What context length can MiMo-V2.5-Pro hold without OOM on 24 GB?
KV cache at fp16 for this model is roughly 0.061 GB per 1K tokens in total, i.e. just under 1 MB per 1K tokens per layer across MiMo's 64 layers: ~0.5 GB at 8K and ~8 GB at 128K. At int8 (q8_0) it is ~0.033 GB per 1K tokens, slightly more than half of fp16 because q8_0 stores a per-block scale on top of the 8-bit values.
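As a worked check of that rule against the 64K row of the table below:

$$
\text{KV}_{\text{fp16}}(n) \approx 0.061\,\text{GB} \times \tfrac{n}{1\text{K}}, \qquad \text{KV}_{\text{fp16}}(64\text{K}) \approx 0.061 \times 64 \approx 3.9\,\text{GB}
$$

which the table rounds to 4.0 GB; the int8 column is the same expression with ~0.033 GB per 1K.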
| Context | KV fp16 | KV int8 | Total q4_K_M (fp16 KV) | Total q4_K_M (int8 KV) | Fits 24 GB? |
|---|---|---|---|---|---|
| 8K | 0.5 GB | 0.27 GB | 18.1 GB | 17.9 GB | yes |
| 16K | 1.0 GB | 0.55 GB | 18.6 GB | 18.2 GB | yes |
| 32K | 2.0 GB | 1.1 GB | 19.6 GB | 18.7 GB | yes |
| 64K | 4.0 GB | 2.1 GB | 21.6 GB | 19.7 GB | yes (tight) |
| 96K | 6.0 GB | 3.2 GB | 23.6 GB | 20.8 GB | only int8 |
| 128K | 8.0 GB | 4.2 GB | 25.6 GB | 21.8 GB | only int8 |
So on a 24 GB card the rule is: under 64K context, run fp16 KV; above 64K, switch to int8 KV with --cache-type-k q8_0 --cache-type-v q8_0. We did not see a measurable quality drop on MMLU-Pro from int8 KV (0.3 points, well within run-to-run variance), but on MATH-500 we did see a ~1.1 point drop, which is at the edge of actually meaningful. If you are running MATH-style workloads, prefer fp16 KV and keep context under 64K.
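Concretely, a 96K launch on a 24 GB card looks like this (a sketch; the GGUF filename is hypothetical). llama.cpp requires flash attention for a quantized V cache, so keep -fa alongside the cache-type flags:

```bash
# 96K context at q4_K_M with int8 KV: ~20.8 GB resident per the table above.
./llama-server -m mimo-v2.5-pro-q4_K_M.gguf \
  -c 98304 -fa -ngl 999 --no-mmap \
  --cache-type-k q8_0 --cache-type-v q8_0
```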
Is MiMo-V2.5-Pro worth running locally vs an API call?
The first-party MiMo API as of release was priced at $0.40 / 1M input tokens and $1.20 / 1M output tokens. A typical agentic loop is 60% input / 40% output, so the blended cost is about $0.72 / 1M tokens. Running locally on an RTX 5090 at 64 tok/s and 575W you produce ~230K tokens / hour at a wall power cost of about $0.069 / kWh × 0.575 kW = $0.0397 / hour, which is $0.173 / 1M output tokens — about 6.9× cheaper than the API on the output side, but you also have to amortize the $1,999 hardware purchase.
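The same arithmetic as a copy-pasteable check (bc is the only dependency; swap in your own electricity price):

```bash
TOK_S=64; WATTS=575; PRICE_KWH=0.069   # measured 5090 numbers from above
echo "tokens/hour: $((TOK_S * 3600))"                                               # 230400
echo "\$/hour:      $(echo "scale=6; $WATTS/1000*$PRICE_KWH" | bc)"                  # ~0.0397
echo "\$/1M output: $(echo "scale=6; ($WATTS/1000*$PRICE_KWH)/($TOK_S*3600/1000000)" | bc)"  # ~0.172
```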
The rule-of-thumb break-even: if you generate more than ~3.2 M tokens / month sustained for at least 18 months, the 5090 pays back versus the API. Below that, the API is cheaper. If you generate fewer than ~800K tokens / month, the API is dramatically cheaper and you should not buy hardware for inference at all. Above 10 M tokens / month, you almost certainly want to drop the local 5090 idea and rent an H100 by the hour for serious agentic batch.
The non-cost reasons to run local are still strong: privacy (PII, source code, contract review), latency (no round-trip to a remote API), offline capability (no network), and avoiding rate limits. If any of those are load-bearing for your workflow, the cost calculus is moot.
What inference runtimes support MiMo-V2.5-Pro today?
| Runtime | Status as of 2026-05-01 | Notes |
|---|---|---|
| llama.cpp | Master ✓ (since b4789, 2026-04-29) | GGUF tokenizer patch is in. Use --chat-template mimo. |
| vLLM | 0.7.2+ ✓ | Add --trust-remote-code for the custom tokenizer. |
| mlc-llm | Not yet | Open issue #4112; probably 2-3 weeks out. |
| exllamav3 | Beta ✓ | EXL3 quants are available; ~5% faster than llama.cpp on a 4090 at q4. |
| LM Studio | 0.3.18+ ✓ | Auto-pulls the official MiMo team's GGUF builds. |
| Ollama | Not yet | Waiting for tagged llama.cpp release. |
| TensorRT-LLM | Not yet | NVIDIA hasn't shipped the engine plugin. |
The realistic answer: if you use llama.cpp directly, LM Studio, or vLLM, you are good today. If you use Ollama, Open WebUI's bundled runtime, or TensorRT-LLM, wait at least a week — Ollama's release cadence usually pulls llama.cpp tags 4-7 days behind master, and TensorRT-LLM is on its own schedule.
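If you are on the vLLM path, serving is one command. The repo id here is hypothetical; --trust-remote-code is the flag the table above refers to:

```bash
# The custom MiMo-BPE-2 tokenizer needs trusted remote code; cap context to taste.
vllm serve mimo-team/MiMo-V2.5-Pro --trust-remote-code --max-model-len 32768
```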
Spec table — MiMo-V2.5-Pro at a glance
| Field | Value |
|---|---|
| Parameter count | 29.0B (dense) |
| Architecture | Decoder-only transformer, GQA |
| Hidden layers | 64 |
| Hidden size | 6,144 |
| Query heads | 64 |
| KV heads (GQA) | 8 |
| FFN intermediate size | 16,384 (SwiGLU) |
| Native context | 131,072 tokens (RoPE theta scaled) |
| Tokenizer | MiMo-BPE-2 (custom, 152K vocab) |
| Training tokens | ~7.4T |
| License | Apache 2.0 |
| Release date | 2026-04-28 |
Quantization matrix — VRAM, tok/s, MMLU-Pro delta on RTX 4090
| Quant | VRAM | Tok/s | MMLU-Pro | Delta vs BF16 |
|---|---|---|---|---|
| q2_K | 9.6 GB | 49.0 | 58.1 | -9.6 |
| q3_K_S | 11.9 GB | 45.1 | 63.0 | -4.7 |
| q4_K_M | 18.1 GB | 41.0 | 66.9 | -0.8 |
| q5_K_M | 21.3 GB | 36.4 | 67.3 | -0.4 |
| q6_K | 24.8 GB (OOM) | n/a | 67.5 | -0.2 |
| q8_0 | 31.4 GB (OOM) | n/a | 67.6 | -0.1 |
| BF16 | 58.5 GB (OOM) | n/a | 67.7 | 0.0 |
q4_K_M is the obvious pick on a 24 GB card: 0.8 points off BF16 on MMLU-Pro is well below the run-to-run noise on a 4090. q3_K_S costs you almost 5 points and is only worth it if you absolutely must keep some VRAM free for a longer context. q2_K is unusable for anything you would care about — the 9-point drop turns it into a different model.
Prefill vs generation discussion
We measured prefill and generation tok/s separately because they are bottlenecked by different things. Generation is memory-bandwidth bound (every generated token reads the full set of weights from VRAM); prefill is compute-bound (the matmul runs against many tokens at once, so the weight read amortizes).
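A back-of-envelope roofline makes the generation claim concrete (weights and bandwidth figures from the tables above; a sanity check, not a measurement):

$$
t_{\text{token}} \gtrsim \frac{\text{weight bytes}}{\text{bandwidth}} = \frac{17.6\ \text{GB}}{936\ \text{GB/s}} \approx 18.8\ \text{ms} \;\Rightarrow\; \leq 53\ \text{tok/s}
$$

The measured 31.7 tok/s on the 3090 at q4_K_M is about 60% of that ceiling, which is plausible once KV reads, dequantization, and kernel launch overhead are counted.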
| Card | Quant | Prefill ms / token | Generation tok/s |
|---|---|---|---|
| RTX 5090 | q4_K_M | 1.34 ms | 64.2 |
| RTX 4090 | q4_K_M | 2.21 ms | 41.0 |
| RTX 3090 | q4_K_M | 3.85 ms | 31.7 |
| RX 7900 XTX | q4_K_M | 4.92 ms | 27.2 |
Prefill at 32K context on a 3090 is therefore about 32,000 × 0.00385 = 123 seconds of "blank cursor" before the first generation token (a linear extrapolation from our 1K-prompt measurement, so slightly optimistic at long context, but the right order of magnitude). On a 5090 it is 43 seconds. If you are running long-context workloads (RAG over many docs, multi-file code agents) prefill dominates wall-clock time, and the RTX 5090 is roughly 2.9× faster than the 3090 in that phase — a much wider gap than the ~2× generation-tok/s gap. The 5090 is a long-context-friendly card in a way the 3090 was not.
Context-length impact
| Context | 5090 gen tok/s | 4090 gen tok/s | 3090 gen tok/s |
|---|---|---|---|
| 8K | 64.2 | 41.0 | 31.7 |
| 32K | 60.9 (-5%) | 38.6 (-6%) | 29.4 (-7%) |
| 64K | 56.1 (-13%) | 35.0 (-15%) | 26.0 (-18%) |
| 128K | 49.2 (-23%) | n/a (fp16 KV OOM) | 23.1 (-27%, int8 KV only) |
Generation slows down at long context because each new token attends to a larger KV cache, and the per-token attention cost rises linearly with sequence length. The 5090's memory-bandwidth advantage holds up at long context — at 128K it is still doing 49 tok/s, faster than a 3090 at 8K. If 64K+ context matters to you, the bandwidth advantage of the 5090 is bigger than the raw tok/s number suggests.
Multi-GPU scaling — does MiMo split cleanly across 2× RTX 3090?
We tested MiMo on a 2× RTX 3090 rig at q5_K_M with -ts 1,1 (split tensors evenly across the two cards). Generation tok/s was 24.8 — about 12% slower than running the same q5_K_M on a single 3090 (28.2), because PCIe sync overhead between the two cards is pure loss for a quant that already fits on one GPU. Where 2× 3090 wins is q6_K and q8_0, which simply do not fit on one card: at q8_0 across two 3090s we got 19.4 tok/s, with VRAM at 15.7 / 15.5 GB on the two cards. If your goal is "run near-BF16 quality on cards I already own" and you have two 3090s, this is the cheapest path. If your goal is "be fast," buy a 5090.
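The two-card launch, for reference (a sketch; the GGUF filename is hypothetical, and -ts is short for --tensor-split):

```bash
# Even split across both 3090s; q8_0 lands at ~15.7 / 15.5 GB per card.
./llama-server -m mimo-v2.5-pro-q8_0.gguf \
  -ngl 999 -ts 1,1 -fa --no-mmap -c 8192
```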
Perf-per-dollar — local vs API at typical workloads
| Path | Up-front | Marginal | Tokens / month break-even vs API |
|---|---|---|---|
| MiMo API | $0 | $0.72 / 1M | n/a |
| RTX 3090 (used) | $700 | ~$0.027 / 1M (electricity) | ~1.1 M tok/mo @ 18 mo amort |
| RTX 4090 (new) | $1,499 | ~$0.034 / 1M | ~2.4 M tok/mo |
| RTX 5090 (new) | $1,999 | ~$0.043 / 1M | ~3.2 M tok/mo |
| 2× RTX 3090 | $1,400 | ~$0.052 / 1M | ~2.2 M tok/mo |
Used RTX 3090s remain the value champion at this model size — if you aren't doing 5090-only things (BF16 KV, 128K context with headroom, sustained 60+ tok/s), the 3090 path is hard to beat. Our standing recommendation in the used RTX 3090 buying guide hasn't changed: it is still the floor of "good enough for local LLM" in 2026.
Common pitfalls
- Wrong tokenizer. The MiMo-BPE-2 tokenizer is custom; if you load a
  GGUF built before llama.cpp b4789, you get garbled output. Symptom: long strings of replacement characters or <unk> tokens. Fix: pull the latest llama.cpp master or rebuild the GGUF with the patched converter (convert_hf_to_gguf.py).
- --chat-template left as chatml. MiMo uses its own template with
  a <|tool_call|> marker for agentic mode. Leaving the template as chatml works for plain chat but breaks tool-use evaluation. Pass --chat-template mimo (added in llama.cpp b4789).
- KV cache type forgotten on long-context runs. Default KV is fp16,
which OOMs at 96K+ on a 24 GB card. Pass --cache-type-k q8_0 --cache-type-v q8_0 for int8 KV.
- Power cap left at default on the 5090. A stock 5090 will pull 575W
under sustained inference and on a 750W PSU will trip protection during prefill spikes. Set nvidia-smi -i 0 -pl 500 to cap the card at 500W — we measured a 4% tok/s drop and a 13% lower peak power.
- -ngl 999 quietly downgraded. If your card is under-VRAM'd for the
  quant, llama.cpp silently moves layers to CPU, and you get 4-6 tok/s with no error message. Watch the log for offloaded N/65 layers; you want all 65 (64 + the embedding layer). A known-good launch command covering these fixes follows this list.
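Putting the fixes together, a launch that avoids the template, KV, power, and offload pitfalls above (a sketch; the GGUF filename is hypothetical):

```bash
# Optional, for marginal PSUs: nvidia-smi -i 0 -pl 500
./llama-server -m mimo-v2.5-pro-q5_K_M.gguf \
  -c 32768 -fa -ngl 999 --no-mmap \
  --chat-template mimo   # not chatml, or tool-use breaks
# Above ~64K context add: --cache-type-k q8_0 --cache-type-v q8_0
# Then confirm the log reports "offloaded 65/65 layers" before trusting any tok/s.
```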
When NOT to run MiMo-V2.5-Pro locally
- You generate fewer than 800K tokens / month. The API is dramatically
  cheaper, and the payback on local hardware scales directly with usage volume; at low volume there is none.
- Your primary workload is multilingual chat in Arabic / Swahili / Hindi.
Qwen 3.6 27B is still 1-3 points better on those, and the gap matters more than the reasoning-bench gap for that use case.
- You only have a 16 GB or 12 GB GPU. q4_K_M (18.1 GB) is out of reach,
  and q3_K_S (11.4 GB of weights, 11.9 GB with 8K of KV) costs almost 5 MMLU-Pro points; you would be running a worse experience than just calling the MiMo API or staying on a smaller model like Qwen 3.6 8B locally.
Verdict matrix
- Get MiMo-V2.5-Pro if your dominant workload is local code agents,
reasoning-heavy tasks, or RAG over long documents and you have a 24 GB+ card already.
- Stay on Qwen 3.6 27B if your workload is multilingual chat or
Arena-style instruction-following, or you can't update llama.cpp / Ollama past b4788 yet.
- Choose Gemma 4 31B if you want the strongest 24 GB-friendly chat
model and don't need the code/agent edge.
- Buy an RTX 5090 (new, $1,999) if you need 64K+ context, fp16-KV
  quality at q6_K, or sustained 60 tok/s. This is the only card that does all of those at once on the consumer side.
- Buy a used RTX 3090 (~$700) if you want a great floor-of-good-enough
rig and you're fine with q4_K_M and 32K context. Best value in 2026.
- Don't buy a 7900 XTX for this — ROCm 6.4 finally works but you give
up ~14% of theoretical tok/s vs a 3090 with the same bandwidth, and the EXL3 / vLLM tooling is still NVIDIA-first.
Bottom line
For most 24 GB owners, the right config is q5_K_M with fp16 KV up to 32K context; switch to int8 KV beyond that (q5_K_M tops out around 96K), and drop to q4_K_M if you need the full 128K. Use llama.cpp master (b4789 or later) until your preferred runtime catches up. Expect ~36 tok/s on a 4090, ~28 tok/s on a 3090, and a substantial reasoning / code quality upgrade over Qwen 3.6 27B at the same quant.
For 32 GB 5090 owners, run q6_K with fp16 KV up to 96K context, and drop to int8 KV only at 128K. You'll get ~51 tok/s and slightly better quality than a 24 GB card can reach with this model (q6_K sits 0.2 points off BF16 on MMLU-Pro versus 0.8 for q4_K_M).
For people without a card yet who want to run MiMo locally: a used RTX 3090 at ~$700 is still the best $/perf entry point, exactly as it has been for nearly two years.
Related guides
- Qwen 3.6 27B vs Gemma 4 31B local inference
- Gemma 4 26B-A4B NVFP4 vs Qwen 3.6 27B q4_K_M local
- Used RTX 3090 for local LLM in 2026
Sources
- MiMo-V2.5-Pro Hugging Face model card (huggingface.co), accessed 2026-04-30
- Original LocalLLaMA "actual best open-weights" benchmark thread,
posted 2026-04-28, top-voted comments through 2026-04-30
- llama.cpp PR #11942 (MiMo tokenizer support), merged 2026-04-29
- MMLU-Pro leaderboard via lm-eval-harness 0.4.7
- LiveCodeBench v3 official runner, run 2026-04-30 on our hardware
- TechPowerUp GPU database (techpowerup.com) for memory bandwidth specs
