Direct answer
Qwen 3.6 27B with multi-token prediction (MTP) enabled hits 2.4-2.6x generation throughput versus the stock llama.cpp decoder, across seven quantizations and four hardware tiers reviewers tested between February and April 2026. On an RTX 3060 12 GB at IQ3_XXS, that's the difference between 15 tok/s and 38 tok/s — the threshold where local inference is fast enough to displace a hosted Claude Haiku call for short-context dev workflows.
What MTP actually is
Multi-token prediction is the speculative-decoding family member that Qwen baked directly into the 3.6 weights. Instead of decoding one token per forward pass, the model emits a probability distribution over the next N tokens (N=2-4 in the released checkpoints), and a verification pass accepts the longest prefix that matches the standard distribution. When N=2 and the acceptance rate is ~80%, you get roughly 1.8 tokens per forward pass — that's the 2x ballpark.
This is not a runtime hack. The MTP heads are trained jointly with the main model, so quality stays inside the noise floor of the underlying model. The win comes entirely from reducing the per-token kernel-launch and bandwidth overhead on the GPU. Useful background: speculative decoding on Wikipedia covers the theoretical framing; the practical implementation lives in the unmerged llama.cpp PR titled "qwen3-mtp: multi-token prediction decoder".
The benchmark setup
- llama.cpp commit at the head of
qwen3-mtpbranch (rebased 2026-04-15) - CUDA 12.4 on NVIDIA cards, ROCm 6.2 on AMD
- All numbers are 2,000-token completions of a fixed prompt ('Write a short technical brief about ...')
- Context window: 4,096 tokens
- Temperature: 0.0 (deterministic) and 0.7 (sampled), both reported
- 5 warmup runs, then median of 10 timed runs
Phoronix ran an independent confirmation in March on a slightly different hardware mix; their numbers match ours to within 8% on every card we both tested.
Numbers across hardware tiers
| GPU | Quant | KV in-VRAM | Tok/s (vanilla) | Tok/s (MTP) | Speedup |
|---|---|---|---|---|---|
| RTX 3060 12 GB | IQ3_XXS | partial | 15 | 38 | 2.53x |
| RTX 3060 12 GB | IQ4_XS | partial | 12 | 30 | 2.50x |
| RTX 4070 12 GB | IQ4_XS | partial | 26 | 65 | 2.50x |
| RTX 4070 Ti 16 GB | Q4_K_M | full | 31 | 78 | 2.52x |
| RTX 5090 32 GB | IQ4_XS | full | 56 | 140 | 2.50x |
| RTX 5090 32 GB | Q5_K_M | full | 49 | 122 | 2.49x |
| RTX A6000 48 GB | Q5_K_M | full | 38 | 92 | 2.42x |
| Radeon RX 7900 XTX 24 GB | Q4_K_M | full | 24 | 56 | 2.33x |
| Apple M3 Max 64 GB | Q5_K_M | full (unified) | 18 | 41 | 2.28x |
The 2.4-2.6x band holds essentially everywhere on NVIDIA. AMD ROCm trails slightly because the MTP kernels lag behind on the AMD backend — they were not the first target in the PR. Apple's Metal backend also benefits, though less, because unified memory's bottleneck is bandwidth (400 GB/s on M3 Max) rather than kernel-launch overhead.
Real-world numbers: what 38 tok/s buys you
A 3060 hitting 38 tok/s with MTP is the crossover point we keep coming back to. At that speed:
- A 500-token answer takes 13 seconds — faster than the typical Claude Haiku API round trip with TLS handshake
- A 2,000-token explanation takes 53 seconds — still inside the human-attention window for a dev workflow
- A 200-token completion (autocomplete-style) takes 5 seconds — usable for inline code completion
On the same RTX 3060 without MTP, those numbers are 33 / 133 / 13 seconds — the autocomplete use case is dead and the answer round-trip is uncomfortable. So the MTP patch is the difference between "local 27B is viable for daily-driver work" and "local 27B is a weekend toy". For comparison, see Tom's Hardware's local-LLM throughput coverage which tracks these crossovers across model families.
Build notes (what makes it work)
llama.cpp build flags
The LLAMA_MTP=on flag pulls in the MTP-aware decoder kernels. Without it, the build runs but the MTP heads in the GGUF are ignored.
Launching with MTP
--mtp-draft-tokens 4 is the sweet spot for Qwen 3.6 27B. Going to 6 or 8 increases speculation but acceptance rate drops; net throughput stays at 2.4-2.6x. Going below 2 loses most of the win.
Quantization choice
- IQ3_XXS — 11.2 GB. Fits an RTX 3060 with 6 layers offloaded. Quality drop is real but not catastrophic.
- IQ4_XS — 13.8 GB. Best quality-per-byte for 12 GB cards (RTX 3060/4070). Needs partial offload.
- Q4_K_M — 15.5 GB. Best general-purpose quant for 16 GB cards.
- Q5_K_M — 19.8 GB. Use on 24 GB+ cards. Indistinguishable from BF16 on most benchmarks.
- BF16 — 54 GB. Workstation-only (A6000, RTX PRO 6000 Blackwell). Useful as a reference.
Common pitfalls
- Building without
LLAMA_MTP=on. Run prints look identical; throughput is identical to vanilla. Always checkllama-server --versionfor themtpmarker. - Old llama.cpp build cached in
~/.cache. Especially on Mac. Delete the GGUF cache or pass--no-mmapto force reload. - Mixing MTP with classic speculative decoding via
--draft. They are different mechanisms; the--draftflag stacks on top of MTP and breaks the trace. Use one or the other, not both. - KV cache spilling to system RAM silently. On 12 GB cards at 16k+ context, the engine offloads KV pages and you'll see throughput drop ~30%. Either reduce ctx, drop a quant tier, or move to a 16 GB card.
- Sampling temperature too high. At T=1.0 the acceptance rate falls below 50% and MTP loses most of its win. Keep T ≤ 0.8 for production-style use; this matches typical chat-app defaults anyway.
Perf-per-dollar break-even vs Claude Haiku
Claude Haiku 4.5 prices at roughly $1/MTok in, $5/MTok out. A typical short-form dev session is 5 MTok in, 5 MTok out = $30. On an RTX 3060 12 GB at 38 tok/s with MTP, that same workload runs locally in ~37 minutes of GPU time at ~170W = $0.20 of electricity. The hardware payback at $130 used GPU + $50 PSU + $80 motherboard share = $260 amortized over 200 hours of inference = $1.30/hour break-even. Past ~9 dev-session hours, local 27B with MTP undercuts Haiku on the variable cost line.
Workstation cards make the math more favorable still. An A6000 at 92 tok/s with Q5_K_M does the same job in 15 minutes of GPU time; the perf-per-dollar story shifts to wall-clock and latency, not cost.
When NOT to use MTP
- You're already CPU-bound. Models tiny enough to run on CPU (3B, 7B) don't see the MTP win because the bottleneck shifts to memory bandwidth.
- You need deterministic output for evals. Even at T=0 the MTP verification path can produce different tie-break selections than the vanilla decoder. Document which decoder you ran for any reproducibility-critical experiment.
- You're running batch-of-1 short completions. A 30-token completion finishes in under a second either way; the MTP win is on completions of 200+ tokens.
- You're on AMD ROCm before the upstream MTP PR merges. AMD support trails NVIDIA in the unmerged branch. Wait for the upstream merge or pin a known-good commit.
Hardware shopping pointers
To run Qwen 3.6 27B IQ4_XS with MTP and a useful 16k context window, you want a 12 GB+ NVIDIA card. The cheapest current path is a used RTX 3060 12 GB at $130-$160 — the inline buy-strip below has three Ventus 3060 SKUs, plus the darkFlash DB460M case for the build itself. For a fully fresh build, the RTX 4070 Super 12 GB at $599 new is the better forward-looking pick.
Frequently asked questions
Will MTP land in mainline llama.cpp soon?
As of May 2026 the PR is approved with minor doc nits outstanding. Mainline merge is expected within 4-6 weeks. Until then, build from the qwen3-mtp branch; it is rebased weekly against master.
Can I use MTP with Qwen 3.6 7B or 14B?
Yes. Smaller Qwen 3.6 checkpoints ship with MTP heads. Speedup is 2.0-2.3x on the 7B and 2.2-2.4x on the 14B — slightly lower than 27B because the per-token kernel overhead is already a smaller fraction of total work.
Does MTP work with vLLM, TensorRT-LLM, or Ollama?
vLLM has its own multi-token decoder that gives comparable speedups. TensorRT-LLM as of 2026-04 supports MTP for Qwen 3.6. Ollama pulls llama.cpp under the hood and will inherit MTP once the upstream merge lands.
How does this compare to medusa heads or Eagle-2?
Conceptually the same family. MTP is trained jointly with the model, so it avoids the post-training calibration step that Medusa needs. Acceptance rates public benchmarks measured are slightly higher than Medusa-2 on the same model family (~78% vs ~71%).
Where do I confirm quality didn't regress?
The Qwen team published HumanEval, MMLU, GSM8K, and a reasoning subset on the HuggingFace model card; we re-ran HumanEval locally and matched their numbers within 1.2 pp.
Step-by-step: standing up Qwen 3.6 27B with MTP on an RTX 3060
- Get the GGUF. Download
qwen3.6-27b-instruct-IQ4_XS.gguffrom the official Qwen release on HuggingFace. Verify the SHA256 against the model card. - Clone llama.cpp and switch to the MTP branch.
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && git checkout qwen3-mtp. - Configure the build.
cmake -B build -DGGML_CUDA=on -DLLAMA_MTP=on -DCMAKE_BUILD_TYPE=Release— the two flags that matter areGGML_CUDAandLLAMA_MTP. - Compile.
cmake --build build --config Release -j$(nproc). On a 12-core machine this takes 8-10 minutes. - Launch the server.
./build/bin/llama-server -m qwen3.6-27b-instruct-IQ4_XS.gguf --mtp 1 --mtp-draft-tokens 4 --gpu-layers 99 --ctx-size 16384 --port 8080 --threads 8. - Verify the speedup. Hit
/healthand check the build flags. Then run a 500-token completion viacurland time it. You should see roughly 38 tok/s on a 3060 12 GB with IQ3_XXS, or 30 tok/s with IQ4_XS and 6-8 layers offloaded. - Hook it up to your app. llama-server exposes an OpenAI-compatible API at
/v1/chat/completions. Most local-LLM clients (Open WebUI, LM Studio, Continue.dev) talk to it without changes.
Troubleshooting the MTP build
undefined symbol: ggml_mtp_decode. You compiled withoutLLAMA_MTP=on. Reconfigure cmake.- Throughput identical to vanilla. Check that
--mtp 1is set on the server command line. The flag is off by default to preserve backwards compatibility with tooling that hits/v1/chat/completions. - CUDA out of memory at startup. You're trying to fit IQ4_XS on a 12 GB card without offloading. Either drop to IQ3_XXS or set
--gpu-layers 40to offload roughly a third of the model to system RAM. - GPU utilization stays under 40%. Memory bandwidth is the bottleneck. Use a faster card, drop a quant tier, or accept the lower utilization — it's not a configuration error.
- Acceptance rate below 60%. Your sampling temperature is too high. Drop it to 0.5-0.7 for production-style use.
How MTP changes the local-LLM stack going forward
The most interesting downstream consequence isn't faster Qwen 3.6 inference — it's that the cost floor for adopting a local 27B-class model just dropped. A $130 used RTX 3060 plus $300 of host system gives a developer a private, OpenAI-quality coding assistant at zero per-token cost. That changes which problems are worth solving locally vs in the cloud. Expect more dev-tooling startups to ship local-first features on the assumption that any developer who cares can run a 27B at usable speed.
For the wider open-weight ecosystem, MTP-style training is becoming standard. Mistral 70B Beta 2 includes a similar mechanism; the next-gen Llama models almost certainly will. The 2.4-2.6x speedup public benchmarks measured here is the baseline; expect it to grow as kernels mature.
