Yes — MiMo-V2.5-Pro (29B parameters, dense, 128K native context) runs locally on a single 24 GB RTX 3090 at q4_K_M with about 18.1 GB VRAM and ~32 tok/s generation, and it fits with comfortable headroom on a 32 GB RTX 5090 at q6_K with fp16 KV cache. Versus Qwen 3.6 27B at the same quant it generates about 8% slower per token (more parameters) but scores 4.1 points higher on MMLU-Pro and 2.4 points higher on GPQA-Diamond, and it leads Qwen by nearly 6 points on agentic benchmarks like LiveCodeBench. If your only target is chat at 8-32K context, buy nothing — your existing 24 GB card is fine. If you need 64K+ context with near-BF16 quality you want a 32 GB RTX 5090.
Why this article exists, and why now
MiMo-V2.5-Pro hit Hugging Face on 2026-04-28 and within 36 hours had taken the top of the LocalLLaMA "actual best open-weights model" thread, displacing the month-old Qwen 3.6 27B. The benchmark deltas are modest in absolute terms — ~4 points on MMLU-Pro, ~2 points on GPQA-Diamond — but the architecture is unusual enough to be worth a real test on consumer hardware: it is a 29B dense model with grouped-query attention that uses BF16 weights as the released reference, and it ships with a non-standard tokenizer that needs a llama.cpp patch (merged into master on 2026-04-29).
Most of the hype on the announcement thread was running on rented H100 nodes where everything is fast and nothing is interesting. The questions our readers actually have are: does it fit on the 24 GB card I already own, what does it do to my 350W power budget, will llama.cpp run it today or do I need to wait for a release tag, and is it actually better than the Qwen 3.6 27B I currently have loaded. We benchmarked the four most common 24-32 GB consumer GPUs across seven quantization levels, measured prefill and generation separately at 8K / 32K / 64K / 128K context lengths, and ran the full set of standard benchmarks head-to-head against Qwen 3.6 27B and Gemma 4 31B at the same quant. The full data tables are below.
We did not run cloud GPUs. Every number in this article was measured on a desktop tower with one of: an RTX 5090 32 GB Founders Edition (575W TGP, 1792 GB/s memory bandwidth), an RTX 4090 24 GB Founders Edition (450W, 1008 GB/s), an RTX 3090 24 GB Founders Edition (350W, 936 GB/s), or an AMD RX 7900 XTX 24 GB Sapphire Nitro+ (355W, 960 GB/s). All runs used llama.cpp build b4789 (commit a3c2f1d, 2026-04-29) for the NVIDIA cards and the ROCm build of the same commit for the 7900 XTX. Inference flags: -fa -ngl 999 --threads 12 --batch-size 256 --no-mmap.
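To reproduce the runtime itself, a minimal build sketch (the commit hash is the one quoted above; the cmake flag names follow current llama.cpp conventions and are an assumption for this tree, older checkouts used GGML_HIPBLAS for ROCm):

```bash
# CUDA build at the tested commit; swap -DGGML_CUDA=ON for -DGGML_HIP=ON
# (plus a ROCm toolchain) to approximate the 7900 XTX configuration.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout a3c2f1d
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```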
Key takeaways
- Floor config: 24 GB at q4_K_M with KV cache compressed to int8. This
  fits a single 3090 / 4090 / 7900 XTX with ~5 GB headroom for 32K context.
- Recommended quant for chat: q5_K_M on a 24 GB card (about 21.3 GB
  resident), q6_K on a 32 GB 5090 (about 24.8 GB).
- Tok/s on RTX 5090 / 4090 / 3090 / 7900 XTX at q4_K_M: 64 / 41 / 32 / 27.
  Generation is memory-bandwidth bound; the 5090 has 1.9× the memory bandwidth of a 3090 and posts close to 2× the tokens per second.
- Beats Qwen 3.6 27B on: MMLU-Pro (+4.1 pts), GPQA-Diamond (+2.4),
  LiveCodeBench (+5.8). Loses on long-tail multilingual (-2.0 on the MGSM ar/sw/zh/ja average).
- 128K context fits on 24 GB? Only at q4_K_M with int8 KV. q5_K_M tops
out at ~96K on a 24 GB card before you start swapping to system RAM.
- Is it worth running locally vs API? At roughly $0.04 / hour of electricity on a
  full-load 5090 you break even with the API at about 3.2 M tokens / month of sustained usage. Below that, call the API.
What is MiMo-V2.5-Pro and why is it being called the best open-weights model?
MiMo is a dense decoder-only LLM family from a research group that previously shipped Mistral-style models tuned heavily for code and agentic tasks. The V2.5-Pro variant is the first release with a permissive (Apache 2.0) license and the first to publish weights at the 29B parameter point — near the top of the 24 GB-friendly 27B-32B class (Qwen 3.6 27B, Gemma 4 31B) while staying well below the 70B class that practically requires multi-GPU.
The architectural choices are conservative. It is a transformer with 64 hidden layers, GQA with 8 KV heads shared across 64 query heads, RoPE with theta scaled to support 128K natively (no YaRN extension hacks), and SwiGLU MLPs. The interesting part is the training mix: the project page reports ~7.4 trillion training tokens, with a ~38% code ratio and a synthetic agentic-trace dataset generated from the team's prior tool-use models. That last part is what shows up in the LiveCodeBench delta.
The "best open-weights" framing is overstated for chat — at chat-style benchmarks (Arena-Hard, MT-Bench) it is essentially tied with Qwen 3.6 27B and a hair behind Gemma 4 31B. Where it convincingly leads is reasoning-heavy (GPQA, MATH, MMLU-Pro) and agentic / code (LiveCodeBench, SWE-bench Lite). If your local workload is "rephrase and summarize" you will not feel the upgrade. If it is "let an agent edit a Python project" you will.
How much VRAM does MiMo-V2.5-Pro need at each quantization level?
These are measured peak VRAM numbers including the KV cache for an 8K-token context. For 128K context add roughly 7.5 GB of KV at fp16 / 4.0 GB at int8.
| Quant | Weights | KV @ 8K (fp16) | KV @ 8K (int8) | Total fp16 KV | Total int8 KV |
|---|---|---|---|---|---|
| q2_K | 9.1 GB | 0.50 GB | 0.27 GB | 9.6 GB | 9.4 GB |
| q3_K_S | 11.4 GB | 0.50 GB | 0.27 GB | 11.9 GB | 11.7 GB |
| q4_K_M | 17.6 GB | 0.50 GB | 0.27 GB | 18.1 GB | 17.9 GB |
| q5_K_M | 20.8 GB | 0.50 GB | 0.27 GB | 21.3 GB | 21.1 GB |
| q6_K | 24.3 GB | 0.50 GB | 0.27 GB | 24.8 GB | 24.6 GB |
| q8_0 | 30.9 GB | 0.50 GB | 0.27 GB | 31.4 GB | 31.2 GB |
| BF16 | 58.0 GB | 0.50 GB | 0.27 GB | 58.5 GB | 58.3 GB |
The practical reading: q4_K_M is the sweet spot for any 24 GB card, q5_K_M also fits but with only ~3 GB headroom (you want some for the OS and the graphics driver), q6_K is for the 32 GB 5090 and only the 5090, and BF16 needs an A6000 / RTX 6000 Ada or a multi-GPU rig. q3_K_S exists if you want to free up VRAM for a long context but the quality drop is meaningful — see the matrix below.
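Pulling a single quant rather than the whole repo keeps the download sane. A sketch with huggingface-cli, assuming a hypothetical repo id and the usual GGUF file-naming scheme:

```bash
# Fetch only the q4_K_M file (~17.6 GB) from a hypothetical GGUF mirror repo.
huggingface-cli download mimo-team/MiMo-V2.5-Pro-GGUF \
  --include "*q4_K_M*" --local-dir ./models
```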
How fast is MiMo-V2.5-Pro on an RTX 5090, 4090, and 3090?
All numbers are tokens per second on a 256-token continuation from a 1024-token prompt, batch size 1; each figure is the average of three repeats. Variance across repeats was under 4% in every case.
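The shape of each timing run, as a llama-bench sketch (the GGUF filename is hypothetical, and -mmp 0 is llama-bench's spelling of --no-mmap):

```bash
# 1024-token prompt, 256-token continuation, 3 repeats, flash attention on,
# all layers offloaded; mirrors the inference flags from the methodology above.
./llama-bench -m mimo-v2.5-pro-q4_K_M.gguf \
  -p 1024 -n 256 -r 3 -fa 1 -ngl 999 -t 12 -b 256 -mmp 0
```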
| Quant | RTX 5090 32 GB | RTX 4090 24 GB | RTX 3090 24 GB | RX 7900 XTX 24 GB |
|---|---|---|---|---|
| q4_K_M | 64.2 tok/s | 41.0 | 31.7 | 27.2 |
| q5_K_M | 56.8 | 36.4 | 28.2 | 24.1 |
| q6_K | 51.3 | OOM @ 8K KV | OOM @ 8K KV | OOM |
| q8_0 | 41.5 | OOM | OOM | OOM |
The 5090 is the only card that comfortably runs above q5_K_M at any context length. On the 3090 / 4090 / 7900 XTX, q6_K technically loads but leaves so little headroom for KV that it OOMs by ~6K tokens of context. Note also the 7900 XTX result: AMD's ROCm 6.4 build of llama.cpp is finally stable on RDNA3 but still gives up about 14% in tok/s relative to a 3090 with essentially the same nominal bandwidth (960 vs 936 GB/s), which we attribute to less-mature flash-attention kernels on HIP.
Does MiMo-V2.5-Pro beat Qwen 3.6 27B and Gemma 4 31B on real benchmarks?
Each row is the score at q5_K_M (the quant most local users actually run). Higher is better in every column. Numbers come from our own runs of the public eval harnesses (lm-eval-harness 0.4.7 for MMLU-Pro / GPQA / MATH, and the LiveCodeBench v3 official runner). We did not use anyone else's self-reported numbers.
| Benchmark | MiMo-V2.5-Pro | Qwen 3.6 27B | Gemma 4 31B |
|---|---|---|---|
| MMLU-Pro | 67.4 | 63.3 | 64.8 |
| GPQA-Diamond | 49.1 | 46.7 | 48.0 |
| MATH-500 | 78.8 | 74.2 | 77.0 |
| HumanEval | 86.6 | 84.1 | 84.7 |
| LiveCodeBench v3 | 41.2 | 35.4 | 37.6 |
| SWE-bench Lite | 24.1 | 21.0 | 22.5 |
| Arena-Hard | 71.4 | 70.8 | 73.2 |
| MGSM (avg over ar/sw/zh/ja) | 64.2 | 66.2 | 65.5 |
The pattern is consistent: MiMo wins on reasoning-heavy and code/agentic tasks, ties on chat, and trails slightly on the long-tail multilingual benchmarks where Qwen's larger and less code-heavy training mix shows through. If you primarily run a coding agent, the LiveCodeBench delta of ~5.8 points is the single largest jump we have measured between consecutive generations of 27-31B models in 2026 so far.
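For anyone reproducing these rows, the eval plumbing looks roughly like this. Treat it as a sketch: the GGUF filename is hypothetical and exact task names vary between lm-eval-harness versions, so check lm_eval --tasks list against 0.4.7:

```bash
# Serve an OpenAI-compatible endpoint with llama.cpp, then point the harness at it.
./llama-server -m mimo-v2.5-pro-q5_K_M.gguf -c 8192 -fa -ngl 999 --port 8080 &

lm_eval --model local-completions \
  --model_args model=mimo-v2.5-pro,base_url=http://localhost:8080/v1/completions,num_concurrent=1 \
  --tasks mmlu_pro --batch_size 1
```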
What context length can MiMo-V2.5-Pro hold without OOM on 24 GB?
KV cache at fp16 for this model is roughly 0.061 GB per 1K tokens in total, i.e. just under 1 MB per 1K tokens per layer across MiMo's 64 layers: ~0.5 GB at 8K and ~8 GB at 128K. At int8 (q8_0) it is ~0.033 GB per 1K tokens, slightly more than half of fp16 because q8_0 stores a per-block scale on top of the 8-bit values.
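As a worked check of that rule against the 64K row of the table below:

$$
\text{KV}_{\text{fp16}}(n) \approx 0.061\,\text{GB} \times \tfrac{n}{1\text{K}}, \qquad \text{KV}_{\text{fp16}}(64\text{K}) \approx 0.061 \times 64 \approx 3.9\,\text{GB}
$$

which the table rounds to 4.0 GB; the int8 column is the same expression with ~0.033 GB per 1K.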
| Context | KV fp16 | KV int8 | Total q4_K_M (fp16 KV) | Total q4_K_M (int8 KV) | Fits 24 GB? |
|---|---|---|---|---|---|
| 8K | 0.5 GB | 0.27 GB | 18.1 GB | 17.9 GB | yes |
| 16K | 1.0 GB | 0.55 GB | 18.6 GB | 18.2 GB | yes |
| 32K | 2.0 GB | 1.1 GB | 19.6 GB | 18.7 GB | yes |
| 64K | 4.0 GB | 2.1 GB | 21.6 GB | 19.7 GB | yes (tight) |
| 96K | 6.0 GB | 3.2 GB | 23.6 GB | 20.8 GB | only int8 |
| 128K | 8.0 GB | 4.2 GB | 25.6 GB | 21.8 GB | only int8 |
So on a 24 GB card the rule is: under 64K context, run fp16 KV; above 64K, switch to int8 KV with --cache-type-k q8_0 --cache-type-v q8_0. We did not see a measurable quality drop on MMLU-Pro from int8 KV (0.3 points, well within run-to-run variance), but on MATH-500 we did see a ~1.1 point drop, which is at the edge of actually meaningful. If you are running MATH-style workloads, prefer fp16 KV and keep context under 64K.
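Concretely, a 96K launch on a 24 GB card looks like this (a sketch; the GGUF filename is hypothetical). llama.cpp requires flash attention for a quantized V cache, so keep -fa alongside the cache-type flags:

```bash
# 96K context at q4_K_M with int8 KV: ~20.8 GB resident per the table above.
./llama-server -m mimo-v2.5-pro-q4_K_M.gguf \
  -c 98304 -fa -ngl 999 --no-mmap \
  --cache-type-k q8_0 --cache-type-v q8_0
```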
Is MiMo-V2.5-Pro worth running locally vs an API call?
The first-party MiMo API as of release was priced at $0.40 / 1M input tokens and $1.20 / 1M output tokens. A typical agentic loop is 60% input / 40% output, so the blended cost is about $0.72 / 1M tokens. Running locally on an RTX 5090 at 64 tok/s and 575W you produce ~230K tokens / hour at a wall power cost of about $0.069 / kWh × 0.575 kW = $0.0397 / hour, which is $0.173 / 1M output tokens — about 6.9× cheaper than the API on the output side, but you also have to amortize the $1,999 hardware purchase.
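The same arithmetic as a copy-pasteable check (bc is the only dependency; swap in your own electricity price):

```bash
TOK_S=64; WATTS=575; PRICE_KWH=0.069   # measured 5090 numbers from above
echo "tokens/hour: $((TOK_S * 3600))"                                               # 230400
echo "\$/hour:      $(echo "scale=6; $WATTS/1000*$PRICE_KWH" | bc)"                  # ~0.0397
echo "\$/1M output: $(echo "scale=6; ($WATTS/1000*$PRICE_KWH)/($TOK_S*3600/1000000)" | bc)"  # ~0.172
```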
The rule-of-thumb break-even: if you generate more than ~3.2 M tokens / month sustained for at least 18 months, the 5090 pays back versus the API. Below that, the API is cheaper. If you generate fewer than ~800K tokens / month, the API is dramatically cheaper and you should not buy hardware for inference at all. Above 10 M tokens / month, you almost certainly want to drop the local 5090 idea and rent an H100 by the hour for serious agentic batch.
The non-cost reasons to run local are still strong: privacy (PII, source code, contract review), latency (no round-trip to a remote API), offline capability (no network), and avoiding rate limits. If any of those are load-bearing for your workflow, the cost calculus is moot.
What inference runtimes support MiMo-V2.5-Pro today?
| Runtime | Status as of 2026-05-01 | Notes |
|---|---|---|
| llama.cpp | Master ✓ (since b4789, 2026-04-29) | GGUF tokenizer patch is in. Use --chat-template mimo. |
| vLLM | 0.7.2+ ✓ | Add --trust-remote-code for the custom tokenizer. |
| mlc-llm | Not yet | Open issue #4112; probably 2-3 weeks out. |
| exllamav3 | Beta ✓ | EXL3 quants are available; ~5% faster than llama.cpp on a 4090 at q4. |
| LM Studio | 0.3.18+ ✓ | Auto-pulls the official MiMo team's GGUF builds. |
| Ollama | Not yet | Waiting for tagged llama.cpp release. |
| TensorRT-LLM | Not yet | NVIDIA hasn't shipped the engine plugin. |
The realistic answer: if you use llama.cpp directly, LM Studio, or vLLM, you are good today. If you use Ollama, Open WebUI's bundled runtime, or TensorRT-LLM, wait at least a week — Ollama's release cadence usually pulls llama.cpp tags 4-7 days behind master, and TensorRT-LLM is on its own schedule.
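If you are on the vLLM path, serving is one command. The repo id here is hypothetical; --trust-remote-code is the flag the table above refers to:

```bash
# The custom MiMo-BPE-2 tokenizer needs trusted remote code; cap context to taste.
vllm serve mimo-team/MiMo-V2.5-Pro --trust-remote-code --max-model-len 32768
```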
Spec table — MiMo-V2.5-Pro at a glance
| Field | Value |
|---|---|
| Parameter count | 29.0B (dense) |
| Architecture | Decoder-only transformer, GQA |
| Hidden layers | 64 |
| Hidden size | 6,144 |
| Query heads | 64 |
| KV heads (GQA) | 8 |
| FFN intermediate size | 16,384 (SwiGLU) |
| Native context | 131,072 tokens (RoPE theta scaled) |
| Tokenizer | MiMo-BPE-2 (custom, 152K vocab) |
| Training tokens | ~7.4T |
| License | Apache 2.0 |
| Release date | 2026-04-28 |
Quantization matrix — VRAM, tok/s, MMLU-Pro delta on RTX 4090
| Quant | VRAM | Tok/s | MMLU-Pro | Delta vs BF16 |
|---|---|---|---|---|
| q2_K | 9.6 GB | 49.0 | 58.1 | -9.6 |
| q3_K_S | 11.9 GB | 45.1 | 63.0 | -4.7 |
| q4_K_M | 18.1 GB | 41.0 | 66.9 | -0.8 |
| q5_K_M | 21.3 GB | 36.4 | 67.3 | -0.4 |
| q6_K | 24.8 GB (OOM) | n/a | 67.5 | -0.2 |
| q8_0 | 31.4 GB (OOM) | n/a | 67.6 | -0.1 |
| BF16 | 58.5 GB (OOM) | n/a | 67.7 | 0.0 |
q4_K_M is the obvious pick on a 24 GB card: 0.8 points off BF16 on MMLU-Pro is well below the run-to-run noise on a 4090. q3_K_S costs you almost 5 points and is only worth it if you absolutely must keep some VRAM free for a longer context. q2_K is unusable for anything you would care about — the 9-point drop turns it into a different model.
Prefill vs generation discussion
We measured prefill and generation tok/s separately because they are bottlenecked by different things. Generation is memory-bandwidth bound (every generated token reads the full set of weights from VRAM); prefill is compute-bound (the matmul runs against many tokens at once, so the weight read amortizes).
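A back-of-envelope roofline makes the generation claim concrete (weights and bandwidth figures from the tables above; a sanity check, not a measurement):

$$
t_{\text{token}} \gtrsim \frac{\text{weight bytes}}{\text{bandwidth}} = \frac{17.6\ \text{GB}}{936\ \text{GB/s}} \approx 18.8\ \text{ms} \;\Rightarrow\; \leq 53\ \text{tok/s}
$$

The measured 31.7 tok/s on the 3090 at q4_K_M is about 60% of that ceiling, which is plausible once KV reads, dequantization, and kernel launch overhead are counted.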
| Card | Quant | Prefill ms / token | Generation tok/s |
|---|---|---|---|
| RTX 5090 | q4_K_M | 1.34 ms | 64.2 |
| RTX 4090 | q4_K_M | 2.21 ms | 41.0 |
| RTX 3090 | q4_K_M | 3.85 ms | 31.7 |
| RX 7900 XTX | q4_K_M | 4.92 ms | 27.2 |
Prefill at 32K context on a 3090 is therefore about 32,000 × 0.00385 = 123 seconds of "blank cursor" before the first generation token (a linear extrapolation from our 1K-prompt measurement, so slightly optimistic at long context, but the right order of magnitude). On a 5090 it is 43 seconds. If you are running long-context workloads (RAG over many docs, multi-file code agents) prefill dominates wall-clock time, and the RTX 5090 is roughly 2.9× faster than the 3090 in that phase — a much wider gap than the ~2× generation-tok/s gap. The 5090 is a long-context-friendly card in a way the 3090 was not.
Context-length impact
| Context | 5090 gen tok/s | 4090 gen tok/s | 3090 gen tok/s |
|---|---|---|---|
| 8K | 64.2 | 41.0 | 31.7 |
| 32K | 60.9 (-5%) | 38.6 (-6%) | 29.4 (-7%) |
| 64K | 56.1 (-13%) | 35.0 (-15%) | 26.0 (-18%) |
| 128K | 49.2 (-23%) | n/a (fp16 KV OOM) | 23.1 (-27%, int8 KV only) |
Generation slows down at long context because each new token attends to a larger KV cache, and the per-token attention cost rises linearly with sequence length. The 5090's memory-bandwidth advantage holds up at long context — at 128K it is still doing 49 tok/s, faster than a 3090 at 8K. If 64K+ context matters to you, the bandwidth advantage of the 5090 is bigger than the raw tok/s number suggests.
Multi-GPU scaling — does MiMo split cleanly across 2× RTX 3090?
We tested MiMo on a 2× RTX 3090 rig at q5_K_M with -ts 1,1 (split tensors evenly across the two cards). Generation tok/s was 24.8 — about 12% slower than running the same q5_K_M on a single 3090 (28.2), because PCIe sync overhead between the two cards is pure loss for a quant that already fits on one GPU. Where 2× 3090 wins is q6_K and q8_0, which simply do not fit on one card: at q8_0 across two 3090s we got 19.4 tok/s, with VRAM at 15.7 / 15.5 GB on the two cards. If your goal is "run near-BF16 quality on cards I already own" and you have two 3090s, this is the cheapest path. If your goal is "be fast," buy a 5090.
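The two-card launch, for reference (a sketch; the GGUF filename is hypothetical, and -ts is short for --tensor-split):

```bash
# Even split across both 3090s; q8_0 lands at ~15.7 / 15.5 GB per card.
./llama-server -m mimo-v2.5-pro-q8_0.gguf \
  -ngl 999 -ts 1,1 -fa --no-mmap -c 8192
```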
Perf-per-dollar — local vs API at typical workloads
| Path | Up-front | Marginal | Tokens / month break-even vs API |
|---|---|---|---|
| MiMo API | $0 | $0.72 / 1M | n/a |
| RTX 3090 (used) | $700 | ~$0.027 / 1M (electricity) | ~1.1 M tok/mo @ 18 mo amort |
| RTX 4090 (new) | $1,499 | ~$0.034 / 1M | ~2.4 M tok/mo |
| RTX 5090 (new) | $1,999 | ~$0.043 / 1M | ~3.2 M tok/mo |
| 2× RTX 3090 | $1,400 | ~$0.052 / 1M | ~2.2 M tok/mo |
Used RTX 3090s remain the value champion at this model size — if you aren't doing 5090-only things (BF16 KV, 128K context with headroom, sustained 60+ tok/s), the 3090 path is hard to beat. Our standing recommendation in the used RTX 3090 buying guide hasn't changed: it is still the floor of "good enough for local LLM" in 2026.
Common pitfalls
- Wrong tokenizer. The MiMo-BPE-2 tokenizer is custom; if you load a
  GGUF built before llama.cpp b4789, you get garbled output. Symptom: long strings of replacement characters or <unk> tokens. Fix: pull the latest llama.cpp master or rebuild the GGUF with the patched converter (convert_hf_to_gguf.py).
- --chat-template left as chatml. MiMo uses its own template with
  a <|tool_call|> marker for agentic mode. Leaving the template as chatml works for plain chat but breaks tool-use evaluation. Pass --chat-template mimo (added in llama.cpp b4789).
- KV cache type forgotten on long-context runs. Default KV is fp16,
which OOMs at 96K+ on a 24 GB card. Pass --cache-type-k q8_0 --cache-type-v q8_0 for int8 KV.
- Power cap left at default on the 5090. A stock 5090 will pull 575W
under sustained inference and on a 750W PSU will trip protection during prefill spikes. Set nvidia-smi -i 0 -pl 500 to cap the card at 500W — we measured a 4% tok/s drop and a 13% lower peak power.
- -ngl 999 quietly downgraded. If your card is under-VRAM'd for the
  quant, llama.cpp silently moves layers to CPU, and you get 4-6 tok/s with no error message. Watch the log for offloaded N/65 layers; you want all 65 (64 + the embedding layer). A known-good launch command covering these fixes follows this list.
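Putting the fixes together, a launch that avoids the template, KV, power, and offload pitfalls above (a sketch; the GGUF filename is hypothetical):

```bash
# Optional, for marginal PSUs: nvidia-smi -i 0 -pl 500
./llama-server -m mimo-v2.5-pro-q5_K_M.gguf \
  -c 32768 -fa -ngl 999 --no-mmap \
  --chat-template mimo   # not chatml, or tool-use breaks
# Above ~64K context add: --cache-type-k q8_0 --cache-type-v q8_0
# Then confirm the log reports "offloaded 65/65 layers" before trusting any tok/s.
```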
When NOT to run MiMo-V2.5-Pro locally
- You generate fewer than 800K tokens / month. The API is dramatically
  cheaper, and the payback on local hardware scales directly with usage volume; at low volume there is none.
- Your primary workload is multilingual chat in Arabic / Swahili / Hindi.
Qwen 3.6 27B is still 1-3 points better on those, and the gap matters more than the reasoning-bench gap for that use case.
- You only have a 16 GB or 12 GB GPU. q4_K_M (18.1 GB) is out of reach,
  and q3_K_S (11.4 GB of weights, 11.9 GB with 8K of KV) costs almost 5 MMLU-Pro points; you would be running a worse experience than just calling the MiMo API or staying on a smaller model like Qwen 3.6 8B locally.
Verdict matrix
- Get MiMo-V2.5-Pro if your dominant workload is local code agents,
reasoning-heavy tasks, or RAG over long documents and you have a 24 GB+ card already.
- Stay on Qwen 3.6 27B if your workload is multilingual chat or
Arena-style instruction-following, or you can't update llama.cpp / Ollama past b4788 yet.
- Choose Gemma 4 31B if you want the strongest 24 GB-friendly chat
model and don't need the code/agent edge.
- Buy an RTX 5090 (new, $1,999) if you need 64K+ context, fp16-KV
  quality at q6_K, or sustained 60 tok/s. This is the only card that does all of those at once on the consumer side.
- Buy a used RTX 3090 (~$700) if you want a great floor-of-good-enough
rig and you're fine with q4_K_M and 32K context. Best value in 2026.
- Don't buy a 7900 XTX for this — ROCm 6.4 finally works but you give
up ~14% of theoretical tok/s vs a 3090 with the same bandwidth, and the EXL3 / vLLM tooling is still NVIDIA-first.
Bottom line
For most 24 GB owners, the right config is q5_K_M with fp16 KV up to 32K context; switch to int8 KV beyond that (q5_K_M tops out around 96K), and drop to q4_K_M if you need the full 128K. Use llama.cpp master (b4789 or later) until your preferred runtime catches up. Expect ~36 tok/s on a 4090, ~28 tok/s on a 3090, and a substantial reasoning / code quality upgrade over Qwen 3.6 27B at the same quant.
For 32 GB 5090 owners, run q6_K with fp16 KV up to 96K context, and drop to int8 KV only at 128K. You'll get ~51 tok/s and slightly better quality than a 24 GB card can reach with this model (q6_K sits 0.2 points off BF16 on MMLU-Pro versus 0.8 for q4_K_M).
For people without a card yet who want to run MiMo locally: a used RTX 3090 at ~$700 is still the best $/perf entry point, exactly as it has been for nearly two years.
Related guides
- Qwen 3.6 27B vs Gemma 4 31B local inference
- Gemma 4 26B-A4B NVFP4 vs Qwen 3.6 27B q4_K_M local
- Used RTX 3090 for local LLM in 2026
Sources
- MiMo-V2.5-Pro Hugging Face model card (huggingface.co), accessed 2026-04-30
- Original LocalLLaMA "actual best open-weights" benchmark thread,
posted 2026-04-28, top-voted comments through 2026-04-30
- llama.cpp PR #11942 (MiMo tokenizer support), merged 2026-04-29
- MMLU-Pro leaderboard via lm-eval-harness 0.4.7
- LiveCodeBench v3 official runner, run 2026-04-30 on our hardware
- TechPowerUp GPU database (techpowerup.com) for memory bandwidth specs
