llama.cpp MTP support landed - Qwen3.6 27B at 2.44× on a Strix Halo, 2.17× on an RTX 3090 rig

SpecPicks News — summary + source link

By SpecPicks News Desk · Published 2026-05-18 · Last verified 2026-05-20 · 10 min read

In brief — 2026-05-19 · PR #22673 landed Multi-Token Prediction (MTP) speculative decoding in mainline llama.cpp on May 16, delivering verified throughput gains of up to 2.44× on Qwen3.6 27B without touching model quality. The speedup is hardware- and task-dependent — AMD Strix Halo and NVIDIA RTX 3090 users both report meaningful lifts, but Apple Silicon users and anyone running tool-heavy agentic pipelines should benchmark their specific workload before expecting headline numbers.

What happened

Multi-Token Prediction speculative decoding arrived in the mainline llama.cpp repository on May 16, 2026, via PR #22673 (commit 4f13cb7). MTP is a capability baked into Qwen3.5 and Qwen3.6 model weights — the models are trained to predict multiple future tokens simultaneously, and llama.cpp can now exploit those predictions as draft candidates for speculative decoding, accelerating inference without any accuracy trade-off in greedy or near-greedy decoding.

Community benchmarks measured Qwen3.6 27B in single-stream chat mode at temperature 0, reporting median results across five runs on two distinct machines. On a Framework Desktop running Strix Halo silicon under ROCm 7.0.2, Q4_K_M quantization climbed from 11.7 tok/s to 21.2 tok/s (1.81×), while the heavier Q8_0 quantization jumped from 7.4 tok/s to 18.1 tok/s, a 2.44× uplift. A single RTX 3090 running at 450 W under CUDA 12.9 and driver 590.26 produced a 2.17× peak gain. Its Q4_K_M run rose from 38.7 tok/s to 59.5 tok/s, which works out to roughly 1.54×, so the 2.17× peak does not correspond to the Q4_K_M figures quoted here.
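As a quick sanity check, the speedup figures can be recomputed from the tok/s pairs quoted above. The snippet below is a minimal sketch: the dictionary labels and the speedup helper are illustrative names, and the only data in it are the medians reported in this article.

```python
# Reported medians in tok/s from the benchmarks quoted above: (baseline, with MTP).
# The labels are illustrative; the numbers are the ones given in the article.
reported = {
    "Strix Halo Q4_K_M": (11.7, 21.2),
    "Strix Halo Q8_0": (7.4, 18.1),
    "RTX 3090 Q4_K_M": (38.7, 59.5),
}


def speedup(baseline: float, with_mtp: float) -> float:
    """Throughput ratio: tok/s with MTP divided by tok/s without it."""
    return with_mtp / baseline


ratios = {name: speedup(base, mtp) for name, (base, mtp) in reported.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The same helper can be pointed at your own before-and-after measurements, ideally medians over several runs as in the community tests.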

One important caveat emerged almost immediately after launch: users who benchmarked very early builds saw poor or no gains. A widely upvoted r/LocalLLaMA post warned the community that the initial implementation had a prompt-processing regression that was patched within days; those who updated to the fixed build saw 1.5–1.8× improvements that had been invisible on the launch-day binary. If MTP looks like a non-event on a given system, pulling the latest commit is the correct first step before drawing conclusions.

Enabling MTP in llama-server requires two startup flags: --spec-type draft-mtp and --spec-draft-n-max 2. An optimization noted in community threads is quantizing the MTP layer's own KV cache: the draft head consumes additional VRAM, but passing --cache-type-k-draft q8_0 --cache-type-v-draft q8_0 recovers memory without measurable quality loss. A follow-on PR (#23269) has since landed further MTP refinements, continuing a rapid iteration cadence in the days after the initial merge.
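For readers who script their server launches, the flags above could be assembled roughly as follows. This is a sketch under stated assumptions, not a verified recipe: the MTP and draft-cache flag names are taken from this article (the draft-cache options are written in their double-dash long form), the model path is hypothetical, and the helper name mtp_server_args is invented for illustration. Checking llama-server --help on your own build is the safest way to confirm the spellings.

```python
import shlex


def mtp_server_args(model_path: str, draft_n_max: int = 2,
                    quantize_draft_kv: bool = True) -> list[str]:
    """Build a llama-server argument list with the MTP flags described above.

    Flag names come from the article, not from an independently verified
    reference; confirm them against your build before relying on them.
    """
    args = [
        "llama-server",
        "-m", model_path,
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(draft_n_max),
    ]
    if quantize_draft_kv:
        # Quantize the draft head's KV cache to q8_0 to recover some VRAM.
        args += ["--cache-type-k-draft", "q8_0", "--cache-type-v-draft", "q8_0"]
    return args


# Hypothetical model path, for illustration only.
cmd = mtp_server_args("models/qwen3.6-27b-q8_0.gguf")
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems with model paths that contain spaces.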

Why it matters for builders

MTP speculative decoding is among the most significant free throughput improvements to land in llama.cpp in recent memory — it requires no hardware upgrade, no model swap, and no quality compromise, provided the right conditions are met.

For anyone running Qwen3.6 27B as a local assistant, coding co-pilot, or summarization engine, a 1.8–2.4× throughput gain translates directly to a more responsive experience. At 18 tok/s on a Q8_0 model running entirely on unified Strix Halo memory — no discrete GPU required — the gap between local and cloud-hosted inference narrows in a meaningful way.

The gains are not universal, and the nuance matters for purchasing decisions. Community benchmarks show draft token acceptance rates vary sharply by task: code generation lands in the 79–89% range, where MTP shines, while factual and structured-output tasks sit at 62–70%. Tool-calling pipelines — structured JSON, constrained formats — likely fall at or below the factual range, meaning agentic frameworks that rely heavily on function calls may see smaller improvements or, in some configurations, a net slowdown from draft overhead. Builders running multi-step agentic workflows should benchmark their specific task mix before treating headline speedups as guaranteed.
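To see why acceptance rate matters so much, a simplified textbook model of speculative decoding is sketched below. It is not how llama.cpp computes anything, and it rests on loud assumptions: each draft token is accepted independently with the same probability, the quoted acceptance rates are per-token rates, a verification pass costs about as much as one normal decode step, and draft_cost (the cost of one draft token relative to a full forward pass) is a hypothetical parameter with made-up values.

```python
def expected_tokens_per_step(accept_rate: float, draft_n: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each draft token is accepted independently with probability
    accept_rate and that acceptance stops at the first rejection. The
    target model always contributes one token, so the expectation is
    1 + a + a^2 + ... + a^draft_n.
    """
    return sum(accept_rate ** k for k in range(draft_n + 1))


def estimated_speedup(accept_rate: float, draft_n: int, draft_cost: float) -> float:
    """Rough speedup estimate relative to plain decoding.

    draft_cost is a hypothetical figure: the cost of producing one draft
    token as a fraction of one full forward pass of the main model.
    """
    step_cost = 1.0 + draft_n * draft_cost
    return expected_tokens_per_step(accept_rate, draft_n) / step_cost


# Acceptance rates from the ranges quoted above; draft_cost=0.1 is an assumption.
for rate in (0.62, 0.70, 0.79, 0.89):
    print(f"accept={rate:.2f}: ~{estimated_speedup(rate, 2, 0.1):.2f}x")
```

Under this model, lowering the acceptance rate or raising the draft cost pushes the estimate toward or below 1.0, which matches the article's warning that some workloads can see a net slowdown.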

Apple Silicon users are reporting a different picture. Users running Qwen3.6 27B on M2 Max 96 GB machines measured MTP throughput below their baseline: 9–10 tok/s with MTP versus roughly 12 tok/s without. The root cause has not been confirmed in community discussion, but the behavior is consistent with speculative decoding overhead dominating on backends where draft verification is relatively expensive or where the baseline is already compute-bound.

Hardware angle

The Strix Halo results, nearly 2.5× on Q8_0, are the headline finding for AMD's integrated-graphics APU lineup. AMD ROCm 7.13, released within the same week as this MTP landing, explicitly expands support for Ryzen AI APUs on both Linux and Windows WSL. The two updates compound: builders who bought a high-RAM Strix Halo machine as a self-contained local LLM host now have concrete evidence that software updates continue to improve those machines. Lemonade v10.5.1 has packaged MTP together with ROCm 7.13 into a one-command quick-start for Strix Halo and Radeon AI PRO R9700 hardware, applying the MTP arguments automatically without manual flag configuration.

For NVIDIA users, the RTX 3090's 2.17× peak figure is particularly relevant because the 3090 remains one of the most common 24 GB GPUs in the local LLM community. Updating llama.cpp is a zero-cost upgrade for this install base.

Desktop builders pairing a capable CPU with a GPU-based local LLM setup may find the 12-core, 24-thread AMD Ryzen 9 7900X a strong match for host-side orchestration workloads: its 24 threads handle agentic preprocessing and multi-model server management without becoming the bottleneck in GPU-accelerated inference pipelines.

What other coverage is saying

A dedicated Lemonade v10.5.1 thread on r/LocalLLaMA confirmed 2× gains on Strix Halo hardware with ROCm 7.13, positioning the packaged release as the lowest-friction path to MTP for AMD APU users who prefer not to build llama.cpp from source. A separate llama-server compatibility thread clarified that models without MTP layers — Gemma and most non-Qwen3 families — will error out if MTP flags are passed at startup, which means operators running mixed model environments need separate server invocations per model type. Community analysis of why MTP can turn net-negative for agentic flows surfaced the acceptance-rate breakdown by task category — code and prose outperform factual retrieval and structured output by a meaningful margin. Phoronix's contemporaneous reporting on AMD ROCm 7.13 adds context: the same release that widens Strix Halo compatibility also brings Instinct MI350P support, signaling AMD's continued push on the ROCm stack that local LLM inference increasingly depends on.