Lemonade v10.5.1: an MTP + ROCm 7.13 quick start for Strix Halo

SpecPicks News — summary + source link

By SpecPicks News Desk · Published 2026-05-18 · Last verified 2026-05-20 · 10 min read

In brief — 2026-05-19 · Lemonade v10.5.1 ships a three-command ROCm 7.13 quick-start that gets Qwen3.6-27B running with Multi-Token Prediction speculative decoding on AMD Strix Halo. Community benchmarks measure inference at 2.44× baseline on Strix Halo and 2.17× on an RTX 3090 — a throughput multiplier that meaningfully changes the math for builders weighing an AMD APU against a discrete-GPU rig for local LLM work.

What happened

The Lemonade SDK team tagged v10.5.1 on May 18, 2026, positioning the release as an MTP + ROCm 7.13 quick-start aimed at AMD Strix Halo users. The pitch is three shell commands:

lemonade pull Qwen3.6-27B-MTP-GGUF
lemonade backends install llamacpp:rocm
lemonade load Qwen3.6-27B-MTP-GGUF --llamacpp rocm --ctx-size 0

--ctx-size 0 auto-sizes context to available unified memory, and MTP draft arguments are applied automatically — no manual flag hunting.

The backdrop matters. llama.cpp PR #22673 (commit 4f13cb7) landed Multi-Token Prediction speculative decoding in mainline on May 16, 2026. MTP is baked into Qwen3.6-27B's architecture: the model emits multiple draft tokens per forward pass, and a verification step accepts or rejects them. When acceptance rates are high — community measurements on Qwen3.6-27B put code tasks at 79–89% and factual queries at 62–70% — throughput gains are substantial. A benchmark posted to r/LocalLLaMA measured Qwen3.6-27B at 2.44× baseline on a Framework Desktop (Strix Halo, ROCm 7.0.2) and 2.17× on an RTX 3090 rig, both in single-stream chat at temperature 0, median of 5 runs.

A follow-up post flagged that llama.cpp's MTP implementation iterated rapidly: early benchmarkers who saw disappointing numbers were advised to pull the latest build, with one reporter measuring a 1.5–1.8× token-throughput gain just from updating a few days later. A separate thread tracks PR #23269, containing additional MTP inference improvements beyond the initial landing.

ROCm 7.13, released around the same time, adds Instinct MI350P support and expands compatibility to additional Ryzen AI APU configurations — including broader Windows WSL coverage — widening the hardware base that can run the Lemonade ROCm backend without custom patches.

Why it matters for builders

MTP changes the calculus for anyone sizing up an AMD APU for local inference work. Strix Halo puts CPU and GPU in a single package with a large unified memory pool, a design that trades raw compute throughput for memory capacity. Until recently, that trade-off meant slower token generation than a discrete high-VRAM GPU at the same price point. A sustained 2× or better throughput multiplier from MTP narrows that gap meaningfully.

The caveat is that the gains are workload-dependent. Community data puts code generation at 79–89% draft-token acceptance, where MTP delivers its best gains. Tool-heavy agentic flows, such as structured function calls or constrained JSON, likely land in the 62–70% factual range or lower, because the draft distribution is less predictable. Builders shipping autonomous agents should benchmark their actual pipeline rather than extrapolating from chat numbers.

Apple Silicon is a cautionary counterpoint: one r/LocalLLaMA report on an M2 Max (96 GB) found MTP actually degraded throughput to 9–10 t/s from a baseline near 12 t/s, suggesting unified-memory architecture alone isn't enough — the bandwidth profile matters.
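For builders who want to benchmark their own pipeline, a minimal sketch of the arithmetic follows. The helper names are hypothetical, and collecting the tokens-per-second readings is left to whatever tool you already use. It takes the median of repeated runs and reports MTP throughput as a multiple of baseline, where a value below 1.00x means MTP made things slower.

```shell
#!/usr/bin/env bash
# Median of the numeric arguments (e.g. tokens/s from repeated runs).
median() {
  printf '%s\n' "$@" | LC_ALL=C sort -n | LC_ALL=C awk '
    { v[NR] = $1 }
    END {
      if (NR % 2) printf "%.2f\n", v[(NR + 1) / 2]
      else        printf "%.2f\n", (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# Speedup of an MTP run over a baseline run, both in tokens/s.
speedup() {
  LC_ALL=C awk -v base="$1" -v mtp="$2" 'BEGIN { printf "%.2fx\n", mtp / base }'
}

# M2 Max figures quoted above: baseline near 12 t/s, MTP at 9-10 t/s.
speedup 12 9
speedup 12 10
```

Applied to the M2 Max report, this gives 0.75x and 0.83x, which is the regression described above expressed on the same scale as the 2.44× and 2.17× results.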

A practical tip surfacing in community discussion: the MTP draft layer carries its own KV cache, which consumes additional VRAM. Quantizing it with --cache-type-k-draft q8_0 and --cache-type-v-draft q8_0 recovers headroom at negligible quality cost. For llama-server users bypassing Lemonade, MTP requires explicit flags (--spec-type draft-mtp --spec-draft-n-max 2), but those flags cause non-MTP models like Gemma to fail at load time, so launch scripts need to gate them on model type.
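One way to gate the flags in a launch script is sketched below. It assumes MTP-capable models can be recognized by "MTP" in their ID, as in Qwen3.6-27B-MTP-GGUF. That naming check, the function name, and the example paths are all hypothetical, and the function only prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a llama-server command line as a dry run.
# MTP flags are appended only when the model ID contains "MTP". Matching on
# the name is an assumption; a real script might inspect model metadata.
build_server_cmd() {
  local model_path="$1" model_id="$2"
  local cmd="llama-server -m $model_path"
  case "$model_id" in
    *MTP*) cmd="$cmd --spec-type draft-mtp --spec-draft-n-max 2" ;;
  esac
  printf '%s\n' "$cmd"
}

# Example paths and the second model ID are placeholders.
build_server_cmd /models/qwen.gguf Qwen3.6-27B-MTP-GGUF
build_server_cmd /models/gemma.gguf gemma-example
```

The first call prints the command with the MTP flags and the second prints it without them, so a non-MTP model never sees arguments it would reject at load time.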

Hardware angle

The Strix Halo platform, used in devices like the Framework Desktop and various mini-PCs, is the primary beneficiary. Its large unified memory pool supports the full 27B parameter model without offloading, and the ROCm 7.13 backend makes setup a single install command rather than a source-compile exercise. AMD's Radeon AI PRO R9700, a discrete card also cited in community MTP benchmarks, shows similar gains, suggesting the speedup generalizes across AMD's ROCm hardware rather than being specific to Strix Halo.

For Strix Halo builds where unified memory bandwidth is the primary bottleneck, fast DDR5 matters. The G.SKILL Trident Z5 Royal 32 GB DDR5-6400 illustrates the tier of memory that makes sense in a system where the GPU draws from the same pool: bandwidth at this frequency directly affects inference throughput in APU configurations.

Discrete GPU users are not locked out: the 2.17× result on an RTX 3090 shows MTP delivers gains on CUDA hardware as well, and Lemonade's backend abstraction (llamacpp:rocm vs llamacpp:cuda) means the same three-command workflow applies once the appropriate backend is installed.
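As a rough illustration of that abstraction, the sketch below prints the three quick-start commands for a chosen backend without executing them. The rocm form matches the commands shown earlier. Passing cuda to --llamacpp is an assumption extrapolated from the llamacpp:cuda backend name, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Print the three quick-start commands for a given model and backend (dry run).
# "rocm" reproduces the commands in this article; using "cuda" as the
# --llamacpp value is an unverified assumption based on the backend name.
quickstart_cmds() {
  local model="$1" backend="$2"
  printf 'lemonade pull %s\n' "$model"
  printf 'lemonade backends install llamacpp:%s\n' "$backend"
  printf 'lemonade load %s --llamacpp %s --ctx-size 0\n' "$model" "$backend"
}

quickstart_cmds Qwen3.6-27B-MTP-GGUF rocm
```

Printing the commands first makes it easy to review them, or to check them against the Lemonade documentation for your backend, before running anything.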

What other coverage is saying

Phoronix's reporting frames ROCm 7.13 as part of AMD's push to extend ROCm beyond datacenter Instinct parts to consumer and prosumer APUs, with Instinct MI350P support and expanded Ryzen AI APU coverage on both Linux and Windows WSL. A separate Phoronix item notes AMD is actively broadening WSL ROCm compatibility, which matters for Windows hobbyists who want to run llama.cpp backends without a full Linux install.

The r/LocalLLaMA community has been running parallel threads dissecting MTP's performance profile: one tracking PR #23269 with incremental gains from keeping builds current, another documenting that the speedup does not translate to Apple Silicon. The consensus emerging is that MTP is a genuine throughput win for Qwen3.6-27B specifically, with diminishing returns on models not trained with the MTP head, and that the ROCm path is now stable enough for daily use on Strix Halo.