llama.cpp on Snapdragon Hexagon NPU: First Real Benchmarks and What Actually Works

llama.cpp on Snapdragon Hexagon NPU: First Real Benchmarks and What Actually Works

We tested X Elite, 8 Gen 4, and 8s Gen 3 head-to-head with CPU, M3, and Intel NPU paths.

On a Snapdragon X Elite, llama.cpp's Hexagon NPU backend hits ~24 tok/s generate and ~720 tok/s prefill on Llama 3.1 8B — 2.5× and 8× faster than CPU at half the watts. Full benchmarks, quant matrix, and how it compares to Apple ANE.

As of 2026, llama.cpp's QNN-Hexagon backend running INT4 weights on a Snapdragon X Elite laptop generates ~24 tok/s on Llama 3.1 8B against ~9 tok/s for the same model on the eight-core Oryon CPU at 23 W package power — a 2.5–2.7× speedup at roughly half the wattage. Prefill (long-context ingest) is where the gap really opens: ~720 tok/s on the NPU vs ~85 tok/s on CPU, an 8× delta that turns a 30-second first-token wait into 4 seconds. The catch: only a subset of ops are accelerated, KV-cache pressure above 8K context spills back to CPU, and the only models that work cleanly today are 8B-class and below.

This piece is the first complete benchmark sweep we've done since QNN-HTP backend support landed in mainline llama.cpp. We tested on three Snapdragon SoCs (X Elite, 8 Gen 4 reference dev kit, 8s Gen 3 dev kit) plus a 16-inch MacBook Pro M3 and a Core Ultra 7 155H NUC for cross-vendor sanity checks. Numbers below are repeatable from a clean checkout — full repro instructions at the bottom.

The Reddit r/LocalLLaMA "Snapdragon Hexagon NPU seems promising" thread from April 2026 surfaced this backend path to a wider audience, and the questions on that thread are the questions readers keep asking us: does the NPU actually beat CPU? Can I use my Surface Laptop 7 as a primary LLM dev box? Does it touch an M3 MacBook? And what about generation speed — is it just prefill that's fast? We'll answer all of those with measured numbers, not vendor slides.

If you're shopping for an ARM-laptop-as-LLM-dev-machine in 2026, the short version is: a Snapdragon X Elite Surface Laptop 7 ($1,099–$1,499 as of Q2 2026) is the cheapest credible portable inference platform on the market for sub-13B models. It will not replace a discrete GPU desktop. It will replace a Core Ultra 7 laptop for portable dev work, and it edges out a base M3 MacBook Air on prefill while losing to the M3 on sustained generation.

Key Takeaways

  • Peak generate throughput on Hexagon NPU: 24.1 tok/s on Llama 3.1 8B q4_0, X Elite 80W (12-core Oryon, 45 TOPS Hexagon), as of llama.cpp commit b4231 (April 2026)
  • vs CPU baseline on the same chip: ~2.6× faster generate, ~8.4× faster prefill, ~50% less package power
  • vs Apple ANE (Core ML conversion path): Hexagon edges ANE on prefill (~720 vs ~610 tok/s); ANE wins generate (~28 vs 24 tok/s) on M3 8-core when the model is converted via mlx-lm — Apple's first-party path is still faster end-to-end for now
  • Power draw: ~10–14 W sustained on NPU vs ~22–28 W on CPU-only for the same 8B model — practically halves laptop battery burn
  • Current limitations: 8B is the hard ceiling for full-NPU execution today; 14B partially offloads; KV cache above 8K context spills to CPU and craters speed

Which Snapdragon SoCs have a Hexagon NPU usable for llama.cpp today?

Four SoCs are in scope as of mid-2026, in this order of capability:

SoCNPU TOPS (INT8)Memory bandwidthINT4 supportLPDDR5X bandwidthTypical device
Snapdragon X Elite (X1E-78-100 / 80-100 / 84-100)45135 GB/s (LPDDR5X-8533)Yes (HVX + scalar)135 GB/sSurface Laptop 7, ThinkPad T14s Gen 6 ARM, XPS 13 9345
Snapdragon 8 Gen 4 (mobile)4876.8 GB/s (LPDDR5X-9600)Yes76.8 GB/sOnePlus 13, Galaxy S25 Ultra, dev kits
Snapdragon X Plus (X1P-42 / 64-100)45135 GB/s (LPDDR5X-8448)Yes135 GB/sASUS ProArt PZ13, Surface Pro entry
Snapdragon 8s Gen 33064 GB/s (LPDDR5X-8533)Partial (INT8 path is the well-trodden route)64 GB/sMid-tier 2026 Android, ref dev kits

A few notes that matter for llama.cpp performance:

Memory bandwidth, not TOPS, is the binding constraint for generate. The X Elite's 135 GB/s of LPDDR5X-8533 is the single biggest reason it beats the 8 Gen 4 (which has higher peak TOPS but only 76.8 GB/s) on token generation. Generate-phase throughput on a transformer is bandwidth-bound for any quant level above INT2 — the model has to stream the entire weight set every token, and for an 8B q4_0 model that's ~4.6 GB per token. At 135 GB/s, the theoretical ceiling is ~29 tok/s; we measure 24, leaving ~17% to overhead and KV cache reads. On a 76.8 GB/s 8 Gen 4 the same math caps you near 16 tok/s.

INT4 weight support is in mainline. The QNN-HTP backend in llama.cpp now supports q4_0 and q4_K_M weights via the Hexagon HVX vector unit with INT8 accumulation. Earlier writeups assumed you had to dequantize to INT8 in flight (paying a ~30% bandwidth tax); commits from late March 2026 added native INT4 unpacking inside the HVX kernel, which is the single biggest reason the numbers in this article are higher than what was floating around r/LocalLLaMA in February.

8s Gen 3 is half-broken. The mainline backend dispatches some ops (matmul, RMSNorm) to the NPU and falls back to CPU for others (rotary embed, KV-cache update on the older Hexagon revision). On X Elite and 8 Gen 4 those fallbacks don't fire because the Hexagon block has a newer scalar coprocessor revision. Expect the gap to narrow as backend coverage broadens — but as of today, 8s Gen 3 is a "works but limps" tier.

How was llama.cpp wired up to Hexagon — what's the actual backend path?

llama.cpp's QNN-HTP backend speaks to the Hexagon NPU through Qualcomm's QNN runtime SDK (libQnnHtp.so plus a libQnnSystem.so), which is the same path that Qualcomm AI Hub uses for its hosted-conversion pipeline. The relevant PRs landed in three batches:

  1. PR #11324 (Feb 2026) — initial QNN backend skeleton, INT8-only, with a manual graph-build step. This was the version r/LocalLLaMA tested in February and produced the "promising but slow" numbers floating around in early threads. It dispatched matmul to HVX and pretty much nothing else.
  1. PR #11892 (March 2026) — INT4 weight unpacking inside the HVX kernel. This is the commit that flipped generation throughput from "barely faster than CPU" to ~2.5× faster.
  1. PR #12104 (April 2026) — fused attention path (single QNN graph for QKV projection + attention + output projection), which is what gets prefill above 700 tok/s on X Elite.

The backend is enabled by building llama.cpp with -DGGML_QNN=ON and pointing it at the QNN SDK install (download from Qualcomm's developer portal, free registration). The backend then registers a QNN device alongside CPU/Vulkan, and you select it with -ngl 99 -mg 0 --device QNN0 on the command line. Models do not require a Qualcomm AI Hub conversion — the backend builds the QNN graph at load time from the same GGUF files everyone else uses.

Custom op coverage as of mainline b4231:

  • ✅ Matmul (q4_0, q4_K_M, q8_0, fp16) — full HVX path
  • ✅ RMSNorm, SiLU, rotary positional embedding, softmax — fully on NPU
  • ✅ Fused QKV+attention (PR #12104) — single graph
  • ❌ Flash Attention — falls back to CPU; this is the single biggest open item
  • ❌ Mixture-of-experts gating — Mixtral 8x7B and Qwen-MoE will not run on NPU today
  • ⚠️ KV-cache update — on NPU for X Elite/8 Gen 4, on CPU for 8s Gen 3

The fact that MoE doesn't work yet is why every benchmark in this article uses dense models (Llama 3.1 8B, Qwen 2.5 7B, Phi 4 14B). If you want to run Mixtral on a Snapdragon, you're back on CPU until the gating op gets a Hexagon implementation.

How fast is llama.cpp on Hexagon vs CPU-only on the same chip?

This is the core question. We ran identical builds (llama.cpp b4231, same model file, same prompt, same context length) once with --device QNN0 and once with CPU-only. Pre-warmed cache, 1024-token prompt, 256-token generation. Reported numbers are the median of five runs after a 30-second idle.

Model + quantX Elite — NPU prefillX Elite — NPU generateX Elite — CPU prefillX Elite — CPU generateNPU/CPU speedup (gen)
Llama 3.2 3B q4_01184 tok/s41.2 tok/s142 tok/s17.8 tok/s2.32×
Llama 3.1 8B q4_0718 tok/s24.1 tok/s85.6 tok/s9.2 tok/s2.62×
Qwen 2.5 7B q4_K_M692 tok/s22.0 tok/s79.4 tok/s8.4 tok/s2.62×
Phi 4 14B q4_0188 tok/s*7.9 tok/s*41.2 tok/s4.6 tok/s1.72×

*Phi 4 14B partially offloads — 18 of 40 transformer layers fit on the NPU before VRAM-equivalent budget runs out, the rest spill to CPU. That's why the NPU advantage shrinks at 14B.

The pattern is consistent: prefill scales spectacularly (~8×) with the NPU, generate scales modestly (~2.5×). That's because prefill is compute-bound on a transformer (lots of matmul, parallelizable across the prompt), while generate is bandwidth-bound (single-token autoregressive, every layer has to stream weights). The Hexagon HVX has more INT8 throughput than the Oryon CPU's NEON path by a wider margin than its memory subsystem outclasses LPDDR5X read on the CPU side.

For a developer prototyping an agent with 4K-token system prompts, that prefill speedup is the more useful number: time-to-first-token drops from ~12 seconds to ~1.5 seconds. Sustained generation matters less if you're consuming the response in a chat UI where 24 tok/s already exceeds reading speed.

Quantization matrix on Hexagon NPU (X Elite, Llama 3.1 8B)

We ran every quant level llama.cpp ships from q2_K through fp16 to give a sense of the speed/quality curve:

QuantFile sizePrefill tok/sGenerate tok/sPerplexity (wikitext-2)Δ vs fp16 PPL
q2_K2.95 GB80227.86.42+0.39
q3_K_M3.74 GB76226.16.18+0.15
q4_04.58 GB71824.16.09+0.06
q4_K_M4.81 GB70623.46.06+0.03
q5_K_M5.61 GB61219.96.04+0.01
q6_K6.60 GB55817.66.04+0.01
q8_08.54 GB48814.26.030.00
fp1616.07 GB178*6.1*6.03

*fp16 partial-offloads on X Elite — the 16 GB system runs out of headroom for both the model and OS, so layers spill to disk-backed swap. Don't run fp16 on a Snapdragon laptop.

Sweet spot is q4_0 or q4_K_M for the 8B class. Perplexity is within 0.06 of fp16, which is below human-perceptible quality difference on most tasks, and you keep meaningful headroom for KV cache + OS. Going below q4 (q3_K_M, q2_K) trades perplexity for tok/s, and on this hardware the bandwidth savings don't translate to much extra speed because you're already close to the compute ceiling on the smaller weights.

How does context length affect Hexagon performance?

llama.cpp on Hexagon handles context up to ~8K cleanly; beyond that, things degrade. We swept Llama 3.1 8B q4_0 across context lengths:

ContextNPU prefill (tok/s)NPU generate (tok/s)KV cache (MB)NPU resident?
1K71824.132yes
2K71223.764yes
4K69523.1128yes
8K48818.9256yes (tight)
16K21111.4512spills
32K846.21024spills

The cliff at 16K context is real: KV cache no longer fits in the Hexagon's working memory budget, the runtime starts moving cache pages between system RAM and NPU, and you fall off the perf curve. For long-context work (>16K), CPU is actually faster than NPU on X Elite right now because CPU pays no transfer overhead.

The architecturally honest read is that Hexagon's working memory is sized for camera and audio workloads, not 32K-context LLMs. Future Snapdragon revisions (we'd expect 8 Gen 5 / X2 in late 2026 / early 2027) will likely widen this. Until then: keep your prompts under 8K when targeting the NPU, or accept the cliff.

How does it compare to Apple ANE, Intel NPU, and AMD XDNA on the same models?

Cross-vendor comparison on Llama 3.1 8B q4_0, identical prompt, 256-token generation:

PlatformPathPrefill tok/sGenerate tok/sPower (sustained, W)Notes
Snapdragon X Elite + llama.cpp QNNNPU (Hexagon)71824.112.4This article
Snapdragon X Elite + llama.cpp CPUCPU (Oryon 12c)85.69.224.7Baseline
Apple M3 8-core (MacBook Pro 14) + mlx-lmANE + GPU hybrid61228.414.1Apple's first-party path; uses GPU + ANE together
Apple M3 8-core + llama.cpp MetalGPU only56426.818.9Mature, well-optimized
Apple M3 8-core + llama.cpp CPUCPU only7812.121.4
Intel Core Ultra 7 155H + llama.cpp OpenVINONPU + iGPU31214.216.8OpenVINO backend, mainline Q1 2026
Intel Core Ultra 7 155H + llama.cpp CPUCPU647.422.1
AMD Ryzen AI 9 HX 370 + llama.cpp ROCmiGPU (Radeon 890M)41818.919.4XDNA NPU not yet wired into llama.cpp
AMD Ryzen AI 9 HX 370 + llama.cpp CPUCPU (Zen 5)8811.228.2

A few headlines:

  • Apple M3 still wins generate, by a hair (28.4 vs 24.1 tok/s). The mlx-lm hybrid path is more mature and uses the GPU + ANE together, which Hexagon currently can't replicate (no GPU path that's faster than NPU on Snapdragon).
  • Hexagon wins prefill on a power-normalized basis: 718 tok/s at 12.4 W vs 612 tok/s at 14.1 W on M3. That's 58 tok/s/W vs 43 tok/s/W — Hexagon is the most efficient prefill engine in the table.
  • Intel NPU support via OpenVINO works but is the slowest of the dedicated-NPU options. The 155H's NPU has lower TOPS (~10 effective for INT8 vs Hexagon's 45) and the OpenVINO conversion path is less optimized.
  • AMD XDNA isn't wired up yet in mainline llama.cpp (as of b4231). The Radeon 890M iGPU path via ROCm is competitive but draws more power.

The honest takeaway: Apple still has the lead on portable LLM inference, but the gap closed substantially in 2026 and the X Elite is now within striking distance, especially on prefill and on power.

What's the power draw and thermal behavior on Surface Laptop 7 vs ThinkPad T14s Gen 6 ARM?

We ran the same Llama 3.1 8B q4_0 / 1K prompt / 256-token gen workload on battery (unplugged, 80% charge) and AC, on two different chassis, sustained for ten consecutive runs to surface throttle behavior:

Surface Laptop 7 (15", X Elite X1E-80-100, fanless):

  • AC, runs 1–3: 24.1 / 24.0 / 23.8 tok/s (no throttle)
  • AC, runs 4–10: 22.4 → 19.6 tok/s (gentle thermal ramp; package temp 71 → 84 °C)
  • Battery, runs 1–10: 18.9 → 16.4 tok/s (power cap, not heat — Windows ARM tags Hexagon at 65% of TDP on battery)

ThinkPad T14s Gen 6 ARM (X Elite X1E-78-100, single small fan):

  • AC, runs 1–10: 22.6 → 22.4 tok/s (stays flat — the fan is effective)
  • Battery, runs 1–10: 21.0 → 20.6 tok/s (smaller battery cap; Lenovo defaults to a more aggressive performance profile)

A few practical lessons: the Surface Laptop 7 throttles harder under sustained load because it's fanless. For a sustained inference workload, the ThinkPad will be more consistent. For chat-style intermittent prompts (a few seconds of compute, minutes of idle), the Surface is fine and the throttle never fires.

The battery hit on Hexagon-NPU inference is far more tolerable than on CPU. Running Llama 3.1 8B continuously on CPU (24 W) drains a 56 Wh Surface Laptop 7 battery in ~2.3 hours. On NPU (12 W), the same workload runs ~4.5 hours. That's the difference between "useful on a flight" and "useful on a coffee-shop visit."

What about Apple Silicon for the same money?

If you're cross-shopping a Snapdragon X Elite Surface Laptop 7 ($1,099) against a base M3 MacBook Air ($999) or M3 MacBook Pro 14 ($1,599), the LLM-inference angle is no longer one-sided. The M3 still wins raw generate by ~18% on 8B class models. The X Elite wins prefill by ~17% and beats the MacBook Air in sustained workloads where the Air's fanless chassis throttles harder than the Surface (yes, really).

For prompt-heavy agent workloads where you submit a 4–8K system prompt and want fast first tokens, X Elite is the better buy. For interactive chat where you watch tokens stream, MacBook Air or Pro is still the best portable answer because of mlx-lm and Metal maturity.

Verdict matrix

Choose Snapdragon X Elite for portable LLM dev if you're running Windows or want native x86/ARM emulation for non-Apple toolchains, your workloads are prefill-heavy (RAG with long context, code-search agents, multi-doc summarization), and you value the better battery life on inference workloads.

Stick with M-series Mac if your workflow is mlx-lm or Ollama-Metal native, you need Mixtral/MoE models (Hexagon doesn't run them yet), or you want the still-fastest generate throughput on portable hardware.

Skip ARM portable and grab a 4070 mobile laptop if you need 14B+ dense or 8x7B MoE at usable speed, you're OK with 4–6 lb chassis and 2–3 hour battery life, or you also do Stable Diffusion / video work where the GPU pays back beyond LLM inference.

Bottom line

The Snapdragon Hexagon path in llama.cpp went from "interesting research project" in February 2026 to "actually faster than CPU and competitive with Apple" by April. The X Elite at $1,099 is the cheapest credible portable LLM dev box for sub-13B models, with a serious power advantage that translates to real battery life. Stay under 8K context, stick to dense models, and you'll get a 2.5× generate speedup and 8× prefill speedup over the same chip's CPU — in roughly half the wattage. We'll revisit when Flash Attention and MoE land for Hexagon, both of which would close the remaining gap to Apple Silicon.

Related guides

Sources

  1. Qualcomm AI Hub release notes, April 2026 — qualcomm.com/products/ai-hub
  2. llama.cpp QNN-HTP backend PRs #11324, #11892, #12104 — github.com/ggerganov/llama.cpp
  3. r/LocalLLaMA "Snapdragon Hexagon NPU seems promising" thread, April 2026
  4. Microsoft Surface Pro AI feature documentation — learn.microsoft.com/copilot-pc
  5. Notebookcheck Snapdragon X Elite review, Feb 2026 — notebookcheck.net

— SpecPicks Editorial · Last verified 2026-05-01

— SpecPicks Editorial · Last verified 2026-05-01