MTP in llama.cpp works correctly as of current main: the regression was patched within 48 hours of the LocalLLaMA PSA. If you're on a build pulled in the last week, enable MTP and expect roughly 30–60% higher decode throughput at batch=1 on supported models (DeepSeek V3, Qwen 3.6), per community measurements. Quantizing the MTP KV cache to Q8_0 showed no measurable quality loss at ≤16K context in the cited experiment and roughly halves KV memory usage, which is about as close to a free lunch as inference tuning gets.
What MTP Is and Why DeepSeek and Qwen Rely on It
Multi-token prediction (MTP) changes the fundamental inference loop. Standard autoregressive generation predicts one token per forward pass. MTP models — introduced by DeepSeek V3 and adopted by Qwen 3.6 — include auxiliary prediction heads that speculate 2–4 tokens ahead during the same forward pass, then verify the speculative tokens in a batched step.
Per llama.cpp release notes, when MTP is active and speculation is accurate (acceptance rates are typically 70–85% for natural-language continuation), each accepted speculative token is essentially free: you pay for one forward pass and get 2–4 tokens. At batch=1 and short context (where decode, not prefill, dominates), this translates to 1.3–1.8× overall throughput.
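As a rough sanity check on those numbers, here is a minimal sketch of the usual speculative-decoding arithmetic. It assumes each drafted token is accepted independently with the same probability and that drafting stops at the first rejection. The `overhead` parameter (extra cost of an MTP pass relative to a plain pass) is a hypothetical knob for illustration, not a measured llama.cpp value.

```python
def expected_tokens_per_pass(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per verified forward pass.

    Assumption: each drafted token is accepted independently with
    probability `accept_rate`, and the first rejection discards the rest.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    # One guaranteed token, plus p + p^2 + ... + p^n_draft drafted tokens.
    return sum(accept_rate ** i for i in range(n_draft + 1))


def estimated_speedup(accept_rate: float, n_draft: int, overhead: float = 0.0) -> float:
    """Speedup over plain decoding if one MTP pass costs (1 + overhead) plain passes."""
    return expected_tokens_per_pass(accept_rate, n_draft) / (1.0 + overhead)


if __name__ == "__main__":
    for p in (0.70, 0.85):
        print(f"accept={p:.2f}: {estimated_speedup(p, n_draft=1, overhead=0.2):.2f}x")
```

Under these assumptions, one drafted token at 80% acceptance yields 1.8 tokens per pass, and a 20% per-pass overhead brings the estimate to 1.5×, which sits inside the 1.3–1.8× range quoted above.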
MTP heads are trained weights embedded in the model checkpoint. Models that don't include them (Llama 3.x, Mistral, Phi-3) simply don't support MTP and the flag is a no-op — you won't get speedup or a crash, just no effect.
Key Takeaways
- MTP regression patched: An intermediate llama.cpp commit broke MTP for several model families; the fix landed within 48 hours. Pull current main before benchmarking.
- Speedup: 30–60% decode throughput improvement at batch=1 on DeepSeek V3 and Qwen 3.6 per cited measurements.
- KV cache quantization is a free lunch: Q8_0 KV quantization halves memory usage with no measurable quality loss at ≤16K context.
- Q4_0 KV: Works at 64K+ context (where KV memory pressure is acute) but introduces a perplexity tax that grows with context length.
- Supported models as of May 2026: DeepSeek V3, Qwen 3.6 27B (dense), Qwen 3.6 35B-A3B (MoE), select community models. Llama/Mistral: no MTP heads.
- Verify MTP is firing: run with --verbose and grep the startup log for "mtp". If no lines appear, MTP isn't loading.
What the llama.cpp MTP Regression Actually Broke
Per the LocalLLaMA PSA thread, an intermediate commit in llama.cpp main — landed during active MTP development — disabled or mis-routed the MTP dispatch path for several model families including Qwen 3.6 and some DeepSeek V3 variants. The symptom was decode throughput dropping 20–40% from prior measurements, with no error message — the regression looked exactly like "normal" non-MTP inference to users who didn't have a pre-regression benchmark to compare.
The regression affected users who pulled main between the problematic commit and the fix. Per llama.cpp's pull request log, the fix was identified within hours of the PSA post and merged within 48 hours. Anyone running current main is on the fixed version.
This class of regression is common in active inference projects: MTP dispatch involves conditional kernel selection based on model metadata, and a path-selection bug can silently disable the acceleration without producing any visible error.
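To make that failure mode concrete, here is a purely illustrative sketch of metadata-driven path selection. It is not llama.cpp code: the metadata keys `mtp_heads` and `mtp_head_dim` and the function name are hypothetical. The point is that a shape check which falls back quietly produces slower decoding with no error, while a logged fallback makes the same condition visible.

```python
import logging

logger = logging.getLogger("mtp_dispatch_sketch")


def select_decode_path(metadata: dict, expected_head_dim: int) -> str:
    """Return "mtp" or "standard" for a model described by `metadata`.

    Hypothetical keys: "mtp_heads" (int) and "mtp_head_dim" (int).
    """
    n_heads = metadata.get("mtp_heads", 0)
    if n_heads == 0:
        # Model ships no MTP heads: falling back is the correct no-op.
        return "standard"
    head_dim = metadata.get("mtp_head_dim")
    if head_dim != expected_head_dim:
        # Without this warning the fallback would be silent: decoding still
        # works, only slower, so nothing tells the user MTP is disabled.
        logger.warning(
            "MTP heads present but head dim %s != expected %s; using standard path",
            head_dim,
            expected_head_dim,
        )
        return "standard"
    return "mtp"
```

A validator that is too strict for some checkpoints would send them down the "standard" branch every time, which matches the symptom described in the PSA: lower throughput and nothing in the output to explain it.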
Which Commits Fixed It, and When?
Per the cited llama.cpp release notes and PR log, the regression fix was merged in the same release window as several other MTP-related improvements:
- The root-cause fix: corrected MTP head dispatch for GGUFs where the auxiliary head dimensions didn't match the expected shape validator.
- A follow-up: added an explicit log line ("MTP: loaded N heads") so users can verify MTP is active at startup without running a throughput benchmark.
- A third change: extended MTP support to additional quantized formats (IQ3_XS, IQ4_NL) that were previously excluded.
The safe approach: pull current main, run ./llama-cli --model your_model.gguf --verbose 2>&1 | grep -i mtp and confirm you see head-count output before benchmarking.
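The same check can be scripted if you want to run it across several models or builds. The sketch below only filters text that has already been captured, mirroring grep -i mtp. The sample log is a made-up placeholder that reuses the log line quoted in this article, whose exact wording has not been verified against a real build.

```python
def find_mtp_lines(log_text: str) -> list[str]:
    """Return the log lines that mention MTP, case-insensitively (like `grep -i mtp`)."""
    return [line for line in log_text.splitlines() if "mtp" in line.lower()]


# Placeholder log for illustration only. The MTP line follows the wording
# quoted in this article; real builds may print something different.
sample_log = """\
(unrelated startup line)
MTP: loaded 1 heads
(another unrelated startup line)
"""

matches = find_mtp_lines(sample_log)
if matches:
    print("MTP lines found:", *matches, sep="\n  ")
else:
    print("No MTP lines: the model lacks MTP heads or the build predates the fix.")
```

To use it on a real run, redirect the command's combined stdout and stderr to a file and pass the file contents to `find_mtp_lines`.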
MTP Speedup: Benchmark Table
Per community measurements cited in the LocalLLaMA threads, on RTX 3090 24 GB with Qwen 3.6 27B Q4_K_M at 8K context (DeepSeek V3 at 671B does not fit on a single 24 GB card):
| Configuration | Decode tok/s (batch=1) | vs no-MTP baseline |
|---|---|---|
| No MTP (pre-regression baseline) | ~28 | 1.00× |
| MTP regression (broken build) | ~18 | 0.64× |
| MTP working (current main) | ~42 | 1.50× |
| MTP + Q8_0 KV cache | ~44 | 1.57× |
At batch=4, MTP's advantage narrows significantly: batching already keeps the GPU busier, so there is less idle compute left for speculation to exploit:
| Configuration | Decode tok/s (batch=4) | vs no-MTP |
|---|---|---|
| No MTP | ~88 | 1.00× |
| MTP working | ~105 | 1.19× |
Takeaway: MTP is primarily a batch=1 (single-user interactive) optimization. For server workloads with multiple concurrent users, it's still worth enabling but the headline 1.5× figure won't hold.
Quantizing the MTP KV Cache: Free Lunch or Quality Tax?
The recent r/LocalLLaMA "quantized MTP KV cache = free lunch?" experiment — which ran Qwen 3.6 27B through a battery of standard LLM evals — found:
| KV Cache Quant | VRAM saving vs FP16 | Quality delta (perplexity) | Recommended? |
|---|---|---|---|
| FP16 (default) | — | Reference | Baseline |
| Q8_0 | ~50% | ~0% at ≤16K context | ✅ Always |
| Q4_0 | ~75% | Small but real at 16K+; grows at 32K+ | ✅ Only if VRAM-constrained at 32K+ |
| Q4_1 | ~75% | Slightly better than Q4_0 | ✅ Prefer over Q4_0 if VRAM forces Q4 |
Per r/LocalLLaMA community reporting, Q8_0 KV quantization is now the recommended default for any model that fits in VRAM — it's strictly better than FP16 from a VRAM perspective and indistinguishable from FP16 on quality benchmarks (MMLU, HumanEval, MT-Bench) at up to 16K context.
The memory-saving math at 8K context with Qwen 3.6 27B Q4_K_M:
- FP16 KV cache: ~1.5 GB
- Q8_0 KV cache: ~0.75 GB (saves 750 MB)
- Q4_0 KV cache: ~0.375 GB (saves 1.125 GB)
For 32K context:
- FP16 KV: ~6 GB (pushes total to ~23 GB — barely fits 24 GB card)
- Q8_0 KV: ~3 GB (fits comfortably)
- Q4_0 KV: ~1.5 GB (headroom for Q5_K_S weights)
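Enable via --cache-type-k q8_0 --cache-type-v q8_0 in the llama.cpp CLI. Note that llama.cpp has required flash attention to be enabled for a quantized V cache, so turn it on if your build does not do so by default.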
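The memory figures above can be approximated with a short calculation. This is a sketch under stated assumptions: the layer count, KV-head count, and head dimension below are hypothetical values chosen because they reproduce the article's ~1.5 GB FP16 figure at 8K context, not a confirmed Qwen 3.6 27B architecture. The bytes-per-element values follow the ggml block layouts (32 values per block plus per-block scale data), which is why Q8_0 lands slightly above exactly half of FP16.

```python
# Storage cost per cached element for common llama.cpp KV cache types.
# Quantized types store 32 values per block plus fp16 scale/offset fields.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale
    "q4_0": 18 / 32,  # 16 bytes of 4-bit values + one fp16 scale
    "q4_1": 20 / 32,  # 16 bytes of 4-bit values + fp16 scale and minimum
}


def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, cache_type: str = "f16") -> float:
    """Approximate KV cache size: K and V tensors for every layer and position."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_ELEMENT[cache_type]


# Hypothetical architecture, picked only to match the ~1.5 GB FP16 figure.
ASSUMED_MODEL = {"n_layers": 48, "n_kv_heads": 8, "head_dim": 128}

if __name__ == "__main__":
    for n_ctx in (8192, 32768):
        for cache_type in BYTES_PER_ELEMENT:
            gib = kv_cache_bytes(n_ctx, cache_type=cache_type, **ASSUMED_MODEL) / 2**30
            print(f"ctx={n_ctx:>6} {cache_type:>5}: {gib:.2f} GiB")
```

With these assumed dimensions the FP16 cache comes out at 1.5 GiB for 8K context and 6 GiB for 32K, in line with the list above. The quantized sizes come out slightly larger than the rounded halving and quartering because of the per-block scale overhead.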
Which Models Benefit Most from MTP?
As of May 2026, per llama.cpp release notes:
| Model | MTP Support | Notes |
|---|---|---|
| DeepSeek V3 | ✅ Full | Original MTP implementation; best-validated |
| Qwen 3.6 27B (dense) | ✅ Full | Best-supported Qwen MTP path in current main |
| Qwen 3.6 35B-A3B (MoE) | ✅ Full | Same MTP head architecture, additional MoE routing path |
| Llama 3.x | ❌ None | No MTP heads in checkpoint |
| Mistral / Mixtral | ❌ None | No MTP heads |
| Phi-3 / Phi-4 | ❌ None | No MTP heads |
| Community models (MTP-fine-tuned) | ⚠️ Varies | Check model card for MTP head presence |
For models without MTP heads, the --mtp-n-predict flag (or equivalent) is silently ignored — no speedup, no crash. Worth verifying with --verbose on any new model to avoid mistaking the no-op for a working optimization.
How to Verify MTP Is Actually Firing
Three-step verification:
Step 1: Check the startup log. Run ./llama-cli --model your_model.gguf --verbose 2>&1 | grep -i mtp.
Expected output: llm_load_tensors: MTP heads: 1, head dim: 7168
If you see no MTP lines, your model lacks MTP heads or your build predates the fix.
Step 2: Benchmark decode tokens/second. Run the same model and prompt with MTP on and with MTP off, then compare.
On DeepSeek V3 / Qwen 3.6, expect a 30–60% difference. If the numbers match, MTP isn't loading.
Step 3: Watch GPU utilization. With MTP active, GPU compute per forward pass is slightly higher (the speculative heads add work), but total inference time drops because each pass yields more tokens. If utilization has not risen and throughput stayed flat, you're likely running without MTP.
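The Step 2 comparison boils down to one ratio. Below is a minimal sketch that takes two measured decode rates and flags whether the gain clears a threshold. The 30% default mirrors the lower bound of the batch=1 range quoted in this article; both the threshold and the function names are illustrative assumptions, not a llama.cpp tool.

```python
def mtp_gain(tok_s_mtp_on: float, tok_s_mtp_off: float) -> float:
    """Fractional decode-throughput gain of the MTP run over the non-MTP run."""
    if tok_s_mtp_off <= 0:
        raise ValueError("tok_s_mtp_off must be positive")
    return tok_s_mtp_on / tok_s_mtp_off - 1.0


def mtp_looks_active(tok_s_mtp_on: float, tok_s_mtp_off: float,
                     min_gain: float = 0.30) -> bool:
    """True if the measured gain meets the expected batch=1 minimum."""
    return mtp_gain(tok_s_mtp_on, tok_s_mtp_off) >= min_gain


if __name__ == "__main__":
    # Figures from the batch=1 table in this article: ~42 tok/s vs ~28 tok/s.
    print(f"gain: {mtp_gain(42, 28):.0%}, active: {mtp_looks_active(42, 28)}")
```

The default threshold only makes sense at batch=1. With the article's batch=4 figures (~105 vs ~88 tok/s) the gain is about 19%, so a lower `min_gain` would be needed there.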
Bottom Line: What to Update, What to Enable, What to Leave Off
Do:
- Pull current llama.cpp main (MTP regression fixed, Q8_0 KV widely tested)
- Enable Q8_0 KV cache quantization on any model — it's free
- Verify MTP is loading before benchmarking
Don't:
- Enable MTP on Llama/Mistral models — no-op that adds confusion
- Use Q4_0 KV at context lengths below 32K: the quality hit is real, and Q8_0 already frees enough VRAM at those lengths
- Trust throughput benchmarks from builds older than ~2 weeks — the regression window was within that range
Citations and Sources
- llama.cpp releases — GitHub
- llama.cpp pull requests — GitHub
- r/LocalLLaMA community benchmarks and PSA threads
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
