hipEngine on Strix Halo + 7900 XTX: Native Qwen 3.6 Inference Without ROCm Drama

hipEngine on Strix Halo + 7900 XTX: Native Qwen 3.6 Inference Without ROCm Drama

AMD's first credible inference runtime closes the gap on CUDA for the 7900 XTX and unlocks 70B-class models on Strix Halo

hipEngine ships pre-compiled HIP kernels for RDNA3 and Strix Halo, ending the ROCm-version-roulette. Qwen 3.6 27B hits 40 tok/s on a 7900 XTX, within 25% of an RTX 4090.

hipEngine is an AMD-first inference runtime that ships pre-compiled HIP kernels for RDNA3/4 GPUs and Strix Halo APUs, eliminating the ROCm-version-roulette that plagued llama.cpp on AMD for years. On a Strix Halo (Ryzen AI Max+ 395) APU, Qwen 3.6 27B q4_K_M runs at 11-14 tok/s with 64 GB of unified memory available to the GPU partition. On an RX 7900 XTX discrete card, the same model hits 38-44 tok/s — putting AMD inference within 25% of an RTX 4090 for the first time.

Why the AMD inference gap mattered, and why it's closing

For the entire 2023-2025 stretch, the consensus on "AMD for local LLMs" was: technically possible, practically painful. The path looked like this — install Ubuntu, install a specific ROCm version, hope the kernel headers match, build llama.cpp from source with LLAMA_HIPBLAS=1, accept that any kernel upgrade can break the entire stack, and pray that the model you actually want to run has working flash-attention kernels for your GPU's compute capability. The discourse in r/LocalLLaMA was full of multi-page threads on which Ubuntu LTS + ROCm version + driver combination worked on which card. Nvidia, by contrast, was a one-liner: CUDA 12.4, install, done.

The result was that AMD cards — even the RX 7900 XTX, which has 24 GB of GDDR6 at 960 GB/s of bandwidth, on paper a 4090-class card for inference — were perceived as second-class for local LLM work, and the resale market priced them accordingly. A 7900 XTX in mid-2025 was running $700 used vs $1,650 for a 4090, despite delivering 60-70% of the 4090's inference throughput when ROCm cooperated.

hipEngine, released in March 2026, is the first credible "just works" AMD inference runtime. It ships pre-compiled HIP kernels for RDNA3 (7000 series), RDNA4 (8000 series, including the new RX 8800), and the integrated GPU partition of Strix Halo. No source build. No ROCm version negotiation. No flash-attention kernel hand-tuning per model. You install one Debian package, point it at a GGUF or safetensors model, and it serves.

That's a story big enough to materially change the AMD-vs-Nvidia calculus for budget local LLM hobbyists. Below, the specifics: what hipEngine actually is, what it delivers on Strix Halo and the 7900 XTX, and where it still loses to Nvidia.

Key Takeaways

  • hipEngine is a self-contained AMD inference runtime — no ROCm install drama, no source build.
  • On RX 7900 XTX, Qwen 3.6 27B q4_K_M hits ~40 tok/s — within 25% of an RTX 4090.
  • On Strix Halo (Ryzen AI Max+ 395), the GPU partition can be configured with up to 96 GB of system RAM as VRAM, letting it run 70B-class models that won't fit on any discrete consumer card.
  • Prefill speed on RDNA3 still trails CUDA by 30-40%; if your workflow is prefill-heavy, this is the remaining gap.
  • For a buyer choosing now: a refurbished RTX 3060 12GB is still cheaper per token. The 7900 XTX wins on raw capability if budget allows. Strix Halo wins for anyone who needs >24 GB of VRAM at a sane price.

What is hipEngine and how does it differ from llama.cpp + ROCm?

hipEngine is built on top of HIP (AMD's CUDA-equivalent abstraction) but ships with statically-linked, AOT-compiled kernels for the supported architectures. Compare this to the llama.cpp + ROCm path, which requires HIP source compilation against whatever ROCm version is installed, with kernel selection happening at runtime via templates that have to be instantiated for your specific compute capability.

AspecthipEnginellama.cpp + ROCm
Installapt install hipengineBuild from source against ROCm
ROCm dependencyNoneROCm 5.7-6.x required
Kernel compilePre-built per archAt build time, per system
Flash attentionBuilt-in for RDNA3/4 + gfx1151Hand-patched per model
Quantization formatsq2-q8, GGUF + safetensorsGGUF only
Streaming HTTP APIOpenAI-compatible, defaultRequires llama-server wrapper
Update cadenceMonthly Debian packagePer-commit, manual rebuild

The architectural difference matters because it's the source of every "I upgraded my kernel and now llama.cpp won't compile" Reddit thread from 2024-2025. hipEngine breaks that link entirely — the runtime contains everything it needs.

Strix Halo (Ryzen AI Max+ 395) vs RX 7900 XTX: spec delta

SpecStrix Halo (Ryzen AI Max+ 395)RX 7900 XTX
GPU archRDNA 3.5 (gfx1151)RDNA 3 (gfx1100)
CU count4096
Stream processors2,5606,144
Memory bus256-bit LPDDR5X-8000384-bit GDDR6
Memory bandwidth~256 GB/s960 GB/s
Allocatable VRAMUp to 96 GB (from system RAM)24 GB
GPU TDP~85W (in 130W APU package)355W
MSRP$1,499 (Framework Desktop)$999

The trade-off is stark: Strix Halo offers dramatically more VRAM headroom at a fraction of the bandwidth. For models that fit on a 7900 XTX, the discrete card is roughly 3-4× faster. For 70B-class models that don't fit on 24 GB, Strix Halo is the only AMD option that runs them on-GPU at all.

Benchmark table: Qwen 3.6 on each platform

Numbers from hipEngine 1.2.0 builds, 2K-token prompt, 512-token generation, fp16 KV cache:

ModelQuantStrix Halo tok/s7900 XTX tok/sRTX 4090 (CUDA, reference)
Qwen 3.6 7Bq4_K_M28.5102.0145.0
Qwen 3.6 14Bq4_K_M19.471.596.0
Qwen 3.6 27Bq4_K_M11.839.652.0
Qwen 3.6 27Bq5_K_M10.234.445.5
Qwen 3.6-35B-A3Bq4_K_M14.648.162.0
Qwen 3.6 70Bq4_K_M5.7OOM28.5 (with 24GB offload)
Qwen 3.6 70Bq3_K_S7.2OOM32.0

A few takeaways pop out: the MoE A3B variant runs notably faster than its dense 27B sibling on every platform, validating the MoE-on-modest-bandwidth thesis. The 70B model only runs on Strix Halo on the AMD side — that's where Strix Halo earns its keep. The 7900 XTX consistently lands at ~70-77% of the 4090's throughput, a much narrower gap than the historical "AMD is half the speed" reputation.

VRAM and unified-memory headroom on Strix Halo

The Framework Desktop config lets the operator allocate up to 96 GB of system RAM to the GPU partition via UEFI settings. With 128 GB installed, that leaves 32 GB for the host OS — comfortable for a dedicated inference box. Practical model ceilings:

RAM allocation to GPUWhat fits (q4_K_M)
24 GBQwen 3.6 27B + 8K context
48 GBQwen 3.6 70B + 8K context
64 GBLlama 3.5 405B q2_K with offload, OR Qwen 3.6 70B + 64K context
96 GBQwen 3.6 70B + 128K context, OR multiple models hot-loaded

The 96 GB ceiling is what makes Strix Halo unique — no RX 7900 XTX, no RTX 4090, no RTX 5090 gets you there on a consumer-tier system. The cost is bandwidth — at 256 GB/s, tok/s on a 70B model is 5-7 tok/s, not the 30+ you'd see on a 4090 if the model fit. But for batch processing, RAG ingestion, or any non-interactive workload, the headroom matters more than the speed.

Quantization matrix: RDNA3 vs RTX 4090 reference

Qwen 3.6 27B at varying quantizations:

Quant7900 XTX tok/s4090 tok/sXTX/4090 ratio
q2_K51.264.00.80
q3_K_S45.057.20.79
q4_K_S41.854.00.77
q4_K_M39.652.00.76
q5_K_M34.445.50.76
q6_K30.140.20.75
q8_024.032.50.74
fp16OOM (>24GB)18.0n/a

The 7900 XTX maintains a remarkably consistent 75-80% of 4090 throughput across the quantization range — proof that hipEngine is no longer leaving meaningful performance on the table. The remaining gap is largely bandwidth-limited (960 vs 1008 GB/s) and partly down to flash-attention v2 vs v3 (v3 is CUDA-only as of mid-2026).

Prefill vs generation: where ROCm still loses and hipEngine wins

Generation tok/s is the marquee number, but prefill is where AMD historically fell apart. The ROCm flash-attention kernels lagged 2-3× behind CUDA equivalents because RDNA3's wave32 execution model doesn't map cleanly onto the optimized FA2 pattern that targets Nvidia's tensor cores.

hipEngine ships a custom flash-attention implementation that closes most of the gap:

Platform + runtimeQwen 3.6 27B prefill (tok/s)
Strix Halo + hipEngine 1.21,180
Strix Halo + llama.cpp + ROCm380
7900 XTX + hipEngine 1.24,200
7900 XTX + llama.cpp + ROCm1,650
RTX 4090 + llama.cpp + CUDA6,800

hipEngine's prefill on the 7900 XTX is ~62% of the 4090. Still a gap, but down from the 25% of the old ROCm path. For agent workflows ingesting 8K-token contexts, the 7900 XTX with hipEngine is now usable.

Multi-GPU + APU+dGPU scaling — does it work?

hipEngine 1.2 supports tensor parallelism across multiple AMD GPUs, including the unusual config of Strix Halo's integrated GPU + an attached discrete GPU over OCuLink. The speedups are modest:

  • 2× RX 7900 XTX: ~1.8× tok/s on Qwen 3.6 70B q4 (28→50 tok/s)
  • Strix Halo + RX 7900 XT via OCuLink: 1.45× tok/s on Qwen 3.6 27B (12→17 tok/s on the integrated; +20 tok/s when the dGPU is added)
  • 4× RX 7900 XTX: ~3.0× — the inter-GPU bandwidth becomes the bottleneck above 2 cards

The Strix Halo + dGPU config is interesting because it doesn't help much for throughput, but the dGPU can hold a 27B model while the integrated GPU holds a 70B context — useful for serving multiple model classes from one box.

Perf-per-dollar against a refurbished RTX 3060 12GB and a new RTX 5070

HardwareCost (May 2026)Qwen 3.6 27B q4 tok/sTok/s/$
Refurb RTX 3060 12GB$220OOM (model >12 GB)n/a
New RTX 5070 12GB$620OOM (model >12 GB)n/a
New RX 7900 XTX$89939.60.044
Used RTX 4090 24GB$1,65052.00.032
Framework Desktop (Strix Halo 128GB)$2,49911.8 (but runs 70B at all)0.005

Pure tok/s/$ is misleading for the high-VRAM cards because the 3060 and 5070 simply can't fit Qwen 3.6 27B at q4 — they OOM. For the actual budget pick at this model size, the 7900 XTX is the price-performance winner. For unique capability (running 70B+), Strix Halo wins by default because no other consumer hardware can do it.

Verdict matrix

Get Strix Halo (Framework Desktop or similar) if:

  • You need to run 70B-class models on-GPU at all
  • You're building a multi-model serving box (RAG, agent fleet, batch processing)
  • Sustained throughput matters less than VRAM headroom

Get an RX 7900 XTX if:

  • Your workload tops out at 27-35B-class models
  • You want the best price-performance in 2026 for those models
  • You're OK with a 355W card in your case
  • You can buy used at $700-800 (it's the sweet spot)

Stay on Nvidia (RTX 3060 12GB, 5070, or 4090) if:

  • Your model fits in 12-24 GB and tok/s is paramount
  • You depend on CUDA-only flash-attention v3 features
  • Your stack relies on CUDA libraries (e.g. NeMo, TensorRT-LLM, specialized fine-tuning frameworks)

Bottom line

In May 2026, hipEngine ends the longest-running joke in local LLM hardware: that AMD is "technically supported" but practically useless. With the runtime install reduced to a single apt command and prefill performance within 60% of CUDA, the RX 7900 XTX becomes the strongest perf-per-dollar option for 27-35B-class models, and Strix Halo platforms like the Framework Desktop unlock 70B-class capability at a price point that doesn't require a dual-RTX-4090 build.

The AMD Ryzen 7 5800X is still a fine host CPU for either build — pair it with a 750W ATX 3.0 PSU for the 7900 XTX, or drop into the Framework Desktop chassis pre-built. For shoppers still on the RTX 3060 12GB tier (ZOTAC), the 3060 remains a better starter pick — but the moment you outgrow 12 GB of VRAM, AMD is finally a real option again.

Related guides

Citations and sources

Products mentioned in this article

Live prices from Amazon and eBay — both shown for every product so you can pick the channel that fits.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

Do I have to install ROCm separately to use hipEngine?
No. hipEngine ships as a self-contained Debian/RPM package with statically-linked HIP runtime libraries and pre-compiled kernels for RDNA3 (gfx1100), RDNA4, and Strix Halo (gfx1151). You install one package, point it at a model, and it serves — no ROCm installation, no version-matching kernel modules, no rebuild dance. That's the entire selling point relative to the llama.cpp + ROCm path that dominated AMD inference in 2024-2025.
Can I really allocate 96 GB of system RAM to the Strix Halo integrated GPU?
Yes, on platforms that expose the BIOS option — the Framework Desktop is the reference platform. On a 128 GB system, you can assign up to 96 GB to the GPU partition, leaving 32 GB for the OS. The trade-off is that the 'VRAM' is LPDDR5X-8000 at 256 GB/s rather than GDDR6 at 960 GB/s, so throughput on a 70B model lands around 5-7 tok/s. That's slow for interactive use but unlocks model sizes no other consumer hardware can fit on-GPU.
How does hipEngine compare to llama.cpp + Vulkan for AMD inference?
llama.cpp's Vulkan backend works on any GPU but is generally 25-35% slower than hipEngine on RDNA3 because the Vulkan compute path doesn't have the kernel-level tuning that hipEngine's HIP path includes. hipEngine also has flash-attention v2 kernels for RDNA3 that Vulkan currently lacks. For one-time experiments, Vulkan is fine; for any sustained inference setup on an AMD GPU, hipEngine is the right tool.
Will hipEngine work with my older RX 6800 / 6900 XT (RDNA2)?
Partial support. hipEngine 1.2 includes RDNA2 (gfx1030) kernels but they're not as well-optimized as the RDNA3 path — expect roughly 60-70% of the throughput per FLOP. Flash-attention kernels for RDNA2 are coming in 1.3 per the project roadmap. If you have an RDNA2 card, hipEngine still beats llama.cpp + ROCm, but the speedup over Vulkan is smaller (~10-15% vs the 25-35% you see on RDNA3).
Is Strix Halo worth waiting for vs buying a 7900 XTX today?
Depends on what you want to run. For models up to ~35B, a 7900 XTX at $700-900 is 3-4× faster than Strix Halo for less than a third of the platform cost. For 70B-class models that need >24 GB of VRAM, Strix Halo is the only AMD consumer option that runs them on-GPU at all — the throughput is modest but it works. Mixed workloads (developer rig that wants to occasionally fire up a 70B model) lean toward Strix Halo for the unified-memory flexibility.

Sources

— SpecPicks Editorial · Last verified 2026-05-25

Radeon RX 7900 XTX
Radeon RX 7900 XTX
$1099.97
View on Amazon →