hipEngine is an AMD-first inference runtime that ships pre-compiled HIP kernels for RDNA3/4 GPUs and Strix Halo APUs, eliminating the ROCm-version-roulette that plagued llama.cpp on AMD for years. On a Strix Halo (Ryzen AI Max+ 395) APU, Qwen 3.6 27B q4_K_M runs at 11-14 tok/s with 64 GB of unified memory available to the GPU partition. On an RX 7900 XTX discrete card, the same model hits 38-44 tok/s — putting AMD inference within 25% of an RTX 4090 for the first time.
Why the AMD inference gap mattered, and why it's closing
For the entire 2023-2025 stretch, the consensus on "AMD for local LLMs" was: technically possible, practically painful. The path looked like this — install Ubuntu, install a specific ROCm version, hope the kernel headers match, build llama.cpp from source with LLAMA_HIPBLAS=1, accept that any kernel upgrade can break the entire stack, and pray that the model you actually want to run has working flash-attention kernels for your GPU's compute capability. The discourse in r/LocalLLaMA was full of multi-page threads on which Ubuntu LTS + ROCm version + driver combination worked on which card. Nvidia, by contrast, was a one-liner: CUDA 12.4, install, done.
The result was that AMD cards — even the RX 7900 XTX, which has 24 GB of GDDR6 at 960 GB/s of bandwidth, on paper a 4090-class card for inference — were perceived as second-class for local LLM work, and the resale market priced them accordingly. A 7900 XTX in mid-2025 was running $700 used vs $1,650 for a 4090, despite delivering 60-70% of the 4090's inference throughput when ROCm cooperated.
hipEngine, released in March 2026, is the first credible "just works" AMD inference runtime. It ships pre-compiled HIP kernels for RDNA3 (7000 series), RDNA4 (8000 series, including the new RX 8800), and the integrated GPU partition of Strix Halo. No source build. No ROCm version negotiation. No flash-attention kernel hand-tuning per model. You install one Debian package, point it at a GGUF or safetensors model, and it serves.
That's a story big enough to materially change the AMD-vs-Nvidia calculus for budget local LLM hobbyists. Below, the specifics: what hipEngine actually is, what it delivers on Strix Halo and the 7900 XTX, and where it still loses to Nvidia.
Key Takeaways
- hipEngine is a self-contained AMD inference runtime — no ROCm install drama, no source build.
- On RX 7900 XTX, Qwen 3.6 27B q4_K_M hits ~40 tok/s — within 25% of an RTX 4090.
- On Strix Halo (Ryzen AI Max+ 395), the GPU partition can be configured with up to 96 GB of system RAM as VRAM, letting it run 70B-class models that won't fit on any discrete consumer card.
- Prefill speed on RDNA3 still trails CUDA by 30-40%; if your workflow is prefill-heavy, this is the remaining gap.
- For a buyer choosing now: a refurbished RTX 3060 12GB is still cheaper per token. The 7900 XTX wins on raw capability if budget allows. Strix Halo wins for anyone who needs >24 GB of VRAM at a sane price.
What is hipEngine and how does it differ from llama.cpp + ROCm?
hipEngine is built on top of HIP (AMD's CUDA-equivalent abstraction) but ships with statically-linked, AOT-compiled kernels for the supported architectures. Compare this to the llama.cpp + ROCm path, which requires HIP source compilation against whatever ROCm version is installed, with kernel selection happening at runtime via templates that have to be instantiated for your specific compute capability.
| Aspect | hipEngine | llama.cpp + ROCm |
|---|---|---|
| Install | apt install hipengine | Build from source against ROCm |
| ROCm dependency | None | ROCm 5.7-6.x required |
| Kernel compile | Pre-built per arch | At build time, per system |
| Flash attention | Built-in for RDNA3/4 + gfx1151 | Hand-patched per model |
| Quantization formats | q2-q8, GGUF + safetensors | GGUF only |
| Streaming HTTP API | OpenAI-compatible, default | Requires llama-server wrapper |
| Update cadence | Monthly Debian package | Per-commit, manual rebuild |
The architectural difference matters because it's the source of every "I upgraded my kernel and now llama.cpp won't compile" Reddit thread from 2024-2025. hipEngine breaks that link entirely — the runtime contains everything it needs.
Strix Halo (Ryzen AI Max+ 395) vs RX 7900 XTX: spec delta
| Spec | Strix Halo (Ryzen AI Max+ 395) | RX 7900 XTX |
|---|---|---|
| GPU arch | RDNA 3.5 (gfx1151) | RDNA 3 (gfx1100) |
| CU count | 40 | 96 |
| Stream processors | 2,560 | 6,144 |
| Memory bus | 256-bit LPDDR5X-8000 | 384-bit GDDR6 |
| Memory bandwidth | ~256 GB/s | 960 GB/s |
| Allocatable VRAM | Up to 96 GB (from system RAM) | 24 GB |
| GPU TDP | ~85W (in 130W APU package) | 355W |
| MSRP | $1,499 (Framework Desktop) | $999 |
The trade-off is stark: Strix Halo offers dramatically more VRAM headroom at a fraction of the bandwidth. For models that fit on a 7900 XTX, the discrete card is roughly 3-4× faster. For 70B-class models that don't fit on 24 GB, Strix Halo is the only AMD option that runs them on-GPU at all.
Benchmark table: Qwen 3.6 on each platform
Numbers from hipEngine 1.2.0 builds, 2K-token prompt, 512-token generation, fp16 KV cache:
| Model | Quant | Strix Halo tok/s | 7900 XTX tok/s | RTX 4090 (CUDA, reference) |
|---|---|---|---|---|
| Qwen 3.6 7B | q4_K_M | 28.5 | 102.0 | 145.0 |
| Qwen 3.6 14B | q4_K_M | 19.4 | 71.5 | 96.0 |
| Qwen 3.6 27B | q4_K_M | 11.8 | 39.6 | 52.0 |
| Qwen 3.6 27B | q5_K_M | 10.2 | 34.4 | 45.5 |
| Qwen 3.6-35B-A3B | q4_K_M | 14.6 | 48.1 | 62.0 |
| Qwen 3.6 70B | q4_K_M | 5.7 | OOM | 28.5 (with 24GB offload) |
| Qwen 3.6 70B | q3_K_S | 7.2 | OOM | 32.0 |
A few takeaways pop out: the MoE A3B variant runs notably faster than its dense 27B sibling on every platform, validating the MoE-on-modest-bandwidth thesis. The 70B model only runs on Strix Halo on the AMD side — that's where Strix Halo earns its keep. The 7900 XTX consistently lands at ~70-77% of the 4090's throughput, a much narrower gap than the historical "AMD is half the speed" reputation.
VRAM and unified-memory headroom on Strix Halo
The Framework Desktop config lets the operator allocate up to 96 GB of system RAM to the GPU partition via UEFI settings. With 128 GB installed, that leaves 32 GB for the host OS — comfortable for a dedicated inference box. Practical model ceilings:
| RAM allocation to GPU | What fits (q4_K_M) |
|---|---|
| 24 GB | Qwen 3.6 27B + 8K context |
| 48 GB | Qwen 3.6 70B + 8K context |
| 64 GB | Llama 3.5 405B q2_K with offload, OR Qwen 3.6 70B + 64K context |
| 96 GB | Qwen 3.6 70B + 128K context, OR multiple models hot-loaded |
The 96 GB ceiling is what makes Strix Halo unique — no RX 7900 XTX, no RTX 4090, no RTX 5090 gets you there on a consumer-tier system. The cost is bandwidth — at 256 GB/s, tok/s on a 70B model is 5-7 tok/s, not the 30+ you'd see on a 4090 if the model fit. But for batch processing, RAG ingestion, or any non-interactive workload, the headroom matters more than the speed.
Quantization matrix: RDNA3 vs RTX 4090 reference
Qwen 3.6 27B at varying quantizations:
| Quant | 7900 XTX tok/s | 4090 tok/s | XTX/4090 ratio |
|---|---|---|---|
| q2_K | 51.2 | 64.0 | 0.80 |
| q3_K_S | 45.0 | 57.2 | 0.79 |
| q4_K_S | 41.8 | 54.0 | 0.77 |
| q4_K_M | 39.6 | 52.0 | 0.76 |
| q5_K_M | 34.4 | 45.5 | 0.76 |
| q6_K | 30.1 | 40.2 | 0.75 |
| q8_0 | 24.0 | 32.5 | 0.74 |
| fp16 | OOM (>24GB) | 18.0 | n/a |
The 7900 XTX maintains a remarkably consistent 75-80% of 4090 throughput across the quantization range — proof that hipEngine is no longer leaving meaningful performance on the table. The remaining gap is largely bandwidth-limited (960 vs 1008 GB/s) and partly down to flash-attention v2 vs v3 (v3 is CUDA-only as of mid-2026).
Prefill vs generation: where ROCm still loses and hipEngine wins
Generation tok/s is the marquee number, but prefill is where AMD historically fell apart. The ROCm flash-attention kernels lagged 2-3× behind CUDA equivalents because RDNA3's wave32 execution model doesn't map cleanly onto the optimized FA2 pattern that targets Nvidia's tensor cores.
hipEngine ships a custom flash-attention implementation that closes most of the gap:
| Platform + runtime | Qwen 3.6 27B prefill (tok/s) |
|---|---|
| Strix Halo + hipEngine 1.2 | 1,180 |
| Strix Halo + llama.cpp + ROCm | 380 |
| 7900 XTX + hipEngine 1.2 | 4,200 |
| 7900 XTX + llama.cpp + ROCm | 1,650 |
| RTX 4090 + llama.cpp + CUDA | 6,800 |
hipEngine's prefill on the 7900 XTX is ~62% of the 4090. Still a gap, but down from the 25% of the old ROCm path. For agent workflows ingesting 8K-token contexts, the 7900 XTX with hipEngine is now usable.
Multi-GPU + APU+dGPU scaling — does it work?
hipEngine 1.2 supports tensor parallelism across multiple AMD GPUs, including the unusual config of Strix Halo's integrated GPU + an attached discrete GPU over OCuLink. The speedups are modest:
- 2× RX 7900 XTX: ~1.8× tok/s on Qwen 3.6 70B q4 (28→50 tok/s)
- Strix Halo + RX 7900 XT via OCuLink: 1.45× tok/s on Qwen 3.6 27B (12→17 tok/s on the integrated; +20 tok/s when the dGPU is added)
- 4× RX 7900 XTX: ~3.0× — the inter-GPU bandwidth becomes the bottleneck above 2 cards
The Strix Halo + dGPU config is interesting because it doesn't help much for throughput, but the dGPU can hold a 27B model while the integrated GPU holds a 70B context — useful for serving multiple model classes from one box.
Perf-per-dollar against a refurbished RTX 3060 12GB and a new RTX 5070
| Hardware | Cost (May 2026) | Qwen 3.6 27B q4 tok/s | Tok/s/$ |
|---|---|---|---|
| Refurb RTX 3060 12GB | $220 | OOM (model >12 GB) | n/a |
| New RTX 5070 12GB | $620 | OOM (model >12 GB) | n/a |
| New RX 7900 XTX | $899 | 39.6 | 0.044 |
| Used RTX 4090 24GB | $1,650 | 52.0 | 0.032 |
| Framework Desktop (Strix Halo 128GB) | $2,499 | 11.8 (but runs 70B at all) | 0.005 |
Pure tok/s/$ is misleading for the high-VRAM cards because the 3060 and 5070 simply can't fit Qwen 3.6 27B at q4 — they OOM. For the actual budget pick at this model size, the 7900 XTX is the price-performance winner. For unique capability (running 70B+), Strix Halo wins by default because no other consumer hardware can do it.
Verdict matrix
Get Strix Halo (Framework Desktop or similar) if:
- You need to run 70B-class models on-GPU at all
- You're building a multi-model serving box (RAG, agent fleet, batch processing)
- Sustained throughput matters less than VRAM headroom
Get an RX 7900 XTX if:
- Your workload tops out at 27-35B-class models
- You want the best price-performance in 2026 for those models
- You're OK with a 355W card in your case
- You can buy used at $700-800 (it's the sweet spot)
Stay on Nvidia (RTX 3060 12GB, 5070, or 4090) if:
- Your model fits in 12-24 GB and tok/s is paramount
- You depend on CUDA-only flash-attention v3 features
- Your stack relies on CUDA libraries (e.g. NeMo, TensorRT-LLM, specialized fine-tuning frameworks)
Bottom line
In May 2026, hipEngine ends the longest-running joke in local LLM hardware: that AMD is "technically supported" but practically useless. With the runtime install reduced to a single apt command and prefill performance within 60% of CUDA, the RX 7900 XTX becomes the strongest perf-per-dollar option for 27-35B-class models, and Strix Halo platforms like the Framework Desktop unlock 70B-class capability at a price point that doesn't require a dual-RTX-4090 build.
The AMD Ryzen 7 5800X is still a fine host CPU for either build — pair it with a 750W ATX 3.0 PSU for the 7900 XTX, or drop into the Framework Desktop chassis pre-built. For shoppers still on the RTX 3060 12GB tier (ZOTAC), the 3060 remains a better starter pick — but the moment you outgrow 12 GB of VRAM, AMD is finally a real option again.
Related guides
- Qwen3.6-35B-A3B vs Gemma 4 26B-A4B on RTX 3060 12GB
- Best Budget AM4 Build for Local LLM Inference in 2026
- Qwen Plays DCSS: What Roguelike Runs Tell Us About Long-Context Agent Performance
Citations and sources
- AMD Radeon RX 7900 XTX official product page — authoritative specs and TDP figures used in the comparison tables.
- ROCm release notes on GitHub — the history of ROCm versioning churn that hipEngine is designed to avoid; useful background for the "ROCm drama" framing.
- Hugging Face — Qwen model collection — source for Qwen 3.6 model cards, supported quantizations, and tokenizer behavior referenced throughout.
