Can I run local LLMs on AMD GPUs in 2026?
Yes — and for the first time, the answer is yes without an asterisk for most workloads. As of April 2026, ROCm 6.3 plus mainline llama.cpp, vLLM 0.7.x, and SGLang give an RX 7900 XTX or W7900 owner a working stack for 7B–70B local inference, with no hand-compiled forks and a pip install that resembles the CUDA path. You will still pay 15–35% in prefill throughput versus an RTX 4090 on most quantized models, AMD ships ROCm support for new architectures roughly 30–90 days behind NVIDIA, and multi-GPU scaling is rougher. But "AMD doesn't work" is no longer true. Read on for what actually runs, on which cards, at what numbers, and where the rough edges still are.
Why ROCm has been the punchline (and what changed)
For most of 2022–2024, ROCm was the answer to "I want pain." It officially supported a tiny GPU list, broke between point releases, required custom kernels for popular models, and on Windows was effectively non-existent. The community got better mileage out of CPU inference than ROCm in many cases — a dunk that AMD earned through years of treating consumer cards as second-class citizens.
Three things changed across 2025 and into early 2026:
- AMD officially extended ROCm consumer support. ROCm 6.0 in December 2024 broadened official Radeon GPU coverage; 6.1 and 6.2 added more cards and stabilized RX 7900 XTX in particular. ROCm 6.3 (Q1 2026) is the first release where Linux installs feel as boring as CUDA — `apt install rocm` works, and a `HSA_OVERRIDE_GFX_VERSION` override is rarely needed (a minimal install sketch follows this list).
- The toolchain got serious. llama.cpp's HIP backend matured to near-parity with CUDA on common quantizations. vLLM gained a maintained ROCm path from AMD's own engineers (the GitHub `rocm` branch is no longer a graveyard). SGLang added ROCm in 2024 and stabilized it in early 2026.
- AMD started showing up. AMD engineers publicly soliciting ROCm feedback on r/LocalLLaMA in April 2026 was the visible signal — the community attention spike that prompted this article. Behind it: an official AMD AI hub, weekly llama.cpp PR reviews from AMD staff, and a dedicated ROCm-for-LLMs Discord with response times measured in hours.
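For concreteness, here is a minimal sketch of that "boring" install path, assuming Ubuntu 24.04 with AMD's apt repository already configured per rocm.docs.amd.com; the `rocm` meta-package name follows the release notes quoted above, and exact steps may differ on your distro.

```bash
# Minimal ROCm 6.3 install sketch (assumes Ubuntu 24.04 and AMD's apt repo
# configured per rocm.docs.amd.com; package names may differ by distro).
sudo apt update
sudo apt install rocm                  # the meta-package referenced above
sudo usermod -aG render,video "$USER"  # grant your user GPU device access
# Log out and back in, then confirm the runtime sees the card:
rocminfo | grep -i "gfx"               # RX 7900 XTX reports gfx1100
```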
This isn't a "ROCm is now better than CUDA" story. It's a "ROCm is now usable for local LLMs without filing a tracker bug" story. That's a much lower bar — but the bar has been below the floor for years.
Key takeaways
- RX 7900 XTX is the honest 24GB AMD pick at ~$899 street, undercutting the RTX 4090 by $700–$1100 for the same VRAM. Expect 70–80% of 4090 generation throughput on llama.cpp q4 inference, less for prefill-bound workloads.
- W7900 (48GB, ~$3499) is the only realistic AMD card for 70B at high quant. It costs more than two RTX 3090s but eliminates the multi-GPU headache.
- Prefill is still ROCm's weak spot — 25–40% slower than CUDA at long contexts on most engines as of April 2026.
- Multi-GPU scaling on ROCm consumer cards is sub-linear (60–75% of theoretical) because of weaker peer-to-peer support versus NVLink-on-3090 pairs.
- Software lag persists — new model architectures (e.g. Qwen3.6 MoE, Llama 4) typically land on CUDA first, with ROCm following 30–90 days later.
What's actually working on ROCm 6.x for inference
The honest list as of ROCm 6.3 + April 2026 mainline tags:
llama.cpp — first-class. The HIP backend (make GGML_HIP=1) builds clean on RX 7900 XTX, W7900, and MI300X. q4_K_M, q5_K_M, q6_K, q8_0 all run. Speculative decoding works. --split-mode row for multi-GPU works but with the scaling caveat below. CLBlast and Vulkan backends still exist as fallbacks for older Radeon parts that ROCm doesn't officially cover. This is the path most local-LLM hobbyists should start on.
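A minimal build-and-run sketch of that path, using the `GGML_HIP=1` flag named above; the model path is illustrative, and `-ngl 99` offloads all layers to the GPU.

```bash
# Build llama.cpp's HIP backend and run a quantized model fully on the
# Radeon. GGML_HIP=1 is the flag named in the text; recent trees may use
# cmake -DGGML_HIP=ON instead. Model path is illustrative.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make GGML_HIP=1 -j"$(nproc)"
./llama-cli \
  -m models/qwen3.6-27b-instruct-q4_k_m.gguf \
  -ngl 99 \
  -p "Summarize the tradeoffs of q4_K_M vs q6_K in two sentences."
```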
vLLM — supported, with caveats. The ROCm path uses Triton-MLIR for attention kernels rather than FlashAttention-2's CUDA-only implementation, so prefill throughput is 20–40% lower than the equivalent CUDA build. AWQ and GPTQ quantizations work; FP8 needs MI300-class hardware. Continuous batching works. If you're building a local API server that serves 4–8 concurrent users, vLLM on a W7900 is viable.
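A sketch of that 4–8-concurrent-user setup using vLLM's OpenAI-compatible server; the model ID and limits are illustrative, and AWQ is one of the quantizations the ROCm path supports.

```bash
# OpenAI-compatible vLLM server on a W7900 (48GB), per the caveats above.
# Model ID and limits are illustrative; AWQ is a supported quant on ROCm.
vllm serve Qwen/Qwen2.5-32B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```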
SGLang — supported as of v0.4.x. Schedules the same way as on CUDA. RadixAttention works on ROCm. Slightly less stable than vLLM under load — expect to file a tracker issue every few weeks if you push it hard.
Text Generation Inference (TGI) — Hugging Face's TGI ships official ROCm Docker images for MI210/MI250/MI300X. Consumer Radeon support is unofficial but works on RX 7900 XTX after setting HSA_OVERRIDE_GFX_VERSION=11.0.0.
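A sketch of that unofficial RX 7900 XTX path; the image tag and model ID are illustrative (check the HF docs for current tags), and the device mappings are the standard ROCm container setup.

```bash
# TGI's ROCm image on an RX 7900 XTX with the gfx override noted above.
# Image tag and model ID are illustrative; check HF docs for current tags.
docker run --rm -it \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  -e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
  -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest-rocm \
  --model-id mistralai/Mistral-7B-Instruct-v0.3
```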
Ollama — works fine; Ollama bundles llama.cpp, so anything llama.cpp supports, Ollama gets for free. Ollama's installer detects ROCm on supported cards.
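If you just want a model answering prompts, the Ollama path is the shortest; the install script URL is Ollama's standard one, and the model tag is illustrative.

```bash
# Shortest path to a running model on a supported Radeon: Ollama bundles
# llama.cpp's HIP backend, so there is nothing ROCm-specific to configure.
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.1:8b "Why is generation throughput bandwidth-bound?"
```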
What is not working well: anything that needs Triton kernels written for NVIDIA-specific intrinsics (Flash Attention 3 in particular), bleeding-edge MoE routing kernels (Qwen3.6 MoE landed on CUDA first), and most fine-tuning frameworks that assume bitsandbytes (which is a CUDA-only library; QLoRA on ROCm uses HQQ or AWQ-based alternatives).
Which AMD GPUs are realistic for local LLM today
| Card | VRAM | Bandwidth | TDP | Street (Apr 2026) | Best for |
|---|---|---|---|---|---|
| RX 7900 XTX | 24GB GDDR6 | 960 GB/s | 355W | ~$899 | 7B–13B at high quant, 27B at q4 |
| RX 7900 XT | 20GB GDDR6 | 800 GB/s | 315W | ~$649 | 13B at q5–q6, 27B at q3 |
| Radeon Pro W7900 | 48GB GDDR6 | 864 GB/s | 295W | ~$3499 | 70B at q4, 27B at FP16 |
| Radeon Pro W7800 | 32GB GDDR6 | 576 GB/s | 260W | ~$2499 | 27B at q6/q8 |
| MI300X | 192GB HBM3 | 5.3 TB/s | 750W | datacenter SKU only | 70B FP16, 405B at q4 |
Skip: RX 7800 XT and below (a 16GB-or-lower VRAM ceiling makes 27B+ impractical) and the RX 6000 series (RDNA2, for which ROCm 6.x officially drops support on some models). The older Radeon VII (16GB HBM2) is a curiosity, not a daily driver in 2026.
Spec/price delta — RX 7900 XTX vs RTX 4090 vs RTX 5090 vs RTX 3090
| Card | VRAM | Bandwidth | FP16 TFLOPS | TDP | Street (Apr 2026) | $/GB VRAM |
|---|---|---|---|---|---|---|
| RX 7900 XTX | 24GB GDDR6 | 960 GB/s | 122 | 355W | $899 | $37 |
| RTX 3090 | 24GB GDDR6X | 936 GB/s | 71 | 350W | $899 (used) | $37 |
| RTX 4090 | 24GB GDDR6X | 1008 GB/s | 165 | 450W | $1599 | $67 |
| RTX 5090 | 32GB GDDR7 | 1792 GB/s | 209 | 575W | $1999 | $62 |
Bandwidth and VRAM are the inference-relevant numbers; raw TFLOPS matters for prefill and training. The RX 7900 XTX is competitive on bandwidth (a hair below the 4090) and matches it on VRAM, while undercutting on price by about $700. Against a used RTX 3090 the price match is exact — but RTX 3090s have NVLink (when paired) and a deeper software ecosystem.
Benchmark table — Qwen3.6-27B and Llama 3.1 70B tok/s on RX 7900 XTX
Numbers below are from llama.cpp llama-bench runs, ROCm 6.3, kernel 6.8, prompt length 512, generation length 256. Sources: Phoronix April 2026 review, llama.cpp PR #11942 benchmark thread, and r/LocalLLaMA aggregated reports.
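For reproduction, the equivalent llama-bench invocation would look like the sketch below; the model path is illustrative.

```bash
# Reproduction sketch for the table below: 512-token prompt, 256 generated
# tokens, all layers offloaded to the GPU. Model path is illustrative.
./llama-bench -m models/qwen3.6-27b-instruct-q4_k_m.gguf -p 512 -n 256 -ngl 99
```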
| Model | Quant | Prefill (t/s) | Generation (t/s) | Notes |
|---|---|---|---|---|
| Qwen3.6-27B | q4_K_M | 285 | 41.2 | Fits 24GB with 8K context |
| Qwen3.6-27B | q6_K | 218 | 32.5 | Tight; 4K context max |
| Qwen3.6-27B | q8_0 | — | — | Does not fit 24GB |
| Llama 3.1 70B | q4_K_M | — | — | Does not fit 24GB single card |
| Llama 3.1 70B | q4_K_M | 142 | 14.8 | 2× 7900 XTX, --split-mode row |
| Llama 3.1 8B | q8_0 | 1850 | 78.4 | Small model, very fast |
| Mistral 7B | q4_K_M | 2400 | 112.6 | |
Comparison data points (CUDA on RTX 4090, same configs): Qwen3.6-27B q4_K_M lands ~360 prefill / ~58 generation; Llama 3.1 8B q8_0 ~2300 / ~98. The 7900 XTX delivers ~70% of 4090 generation throughput and ~80% of 4090 prefill on this workload — closer than common wisdom suggests.
Quantization matrix per AMD card
What actually fits, by card and quantization, with 4K context. Models tested: Mistral 7B, Qwen3.6-27B, Llama 3.1 70B, DeepSeek-V4-Pro 70B.
| Card | 7B q4 | 7B q8 | 27B q4 | 27B q6 | 27B q8 | 70B q4 | 70B q6 |
|---|---|---|---|---|---|---|---|
| RX 7900 XT (20GB) | yes | yes | yes | tight | no | no | no |
| RX 7900 XTX (24GB) | yes | yes | yes | yes | tight | no | no |
| W7800 (32GB) | yes | yes | yes | yes | yes | tight | no |
| W7900 (48GB) | yes | yes | yes | yes | yes | yes | yes |
| 2× RX 7900 XTX | yes | yes | yes | yes | yes | yes | tight |
| MI300X (192GB) | yes | yes | yes | yes | yes | yes | yes (FP16) |
"Tight" means it loads but you have to drop context length to 2–4K and accept GGUF KV-cache offload tweaks.
Prefill vs generation: where ROCm still trails CUDA
Prefill (initial prompt processing) and generation (token-by-token decoding) stress the GPU differently. Generation is bandwidth-bound — if you can stream weights through the memory subsystem fast, you generate fast. Prefill is compute-bound at long contexts, especially for attention.
On generation, RX 7900 XTX runs about 70–80% of RTX 4090 throughput on quantized models. Bandwidth parity does most of the work; the gap is mostly kernel-quality overhead.
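A back-of-the-envelope roofline makes this concrete (my arithmetic, not from the source benchmarks): q4_K_M stores roughly 4.8 bits per weight, so a 27B model streams about 16 GB of weights per generated token.

```bash
# Roofline sketch for bandwidth-bound generation: peak tok/s is memory
# bandwidth divided by bytes streamed per token. Assumes ~4.8 bits/weight
# for q4_K_M; both assumptions are estimates, not measured values.
awk 'BEGIN {
  weights_gb = 27e9 * 4.8 / 8 / 1e9    # ~16.2 GB of weights per token
  printf "7900 XTX roofline: %.0f tok/s\n",  960 / weights_gb   # ~59
  printf "RTX 4090 roofline: %.0f tok/s\n", 1008 / weights_gb   # ~62
}'
```

Against the benchmark table above, the measured 41.2 t/s versus a ~59 t/s roofline implies roughly 70% kernel efficiency on ROCm, while the 4090's 58.4 against ~62 implies around 94% — consistent with the kernel-quality explanation.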
On prefill, the gap widens to 25–40% on long contexts. Two reasons:
- No FlashAttention-2 ROCm parity. llama.cpp uses its own HIP attention kernels on ROCm; vLLM uses Triton or Composable Kernel (CK) attention. Neither matches FA2's CUDA implementation on prefill at 8K+ contexts as of April 2026. AMD's CK-Tile work is closing this — expect parity by end of 2026 if the trajectory holds.
- Sub-warp scheduler differences. RDNA3's wave64/wave32 split makes some attention patterns less efficient than NVIDIA's warp model. This is hardware-level and won't fully close.
Practical impact: if you serve a chat workload where prompts are short (system + 1–2 turns), the gap is ~10%. If you serve a RAG workload with 16K+ contexts on every request, the gap is real and you'll feel it.
Multi-GPU scaling on ROCm — does it work?
Yes, with caveats. llama.cpp's --split-mode row and --split-mode layer both work across 2× RX 7900 XTX. vLLM tensor parallelism works on consumer Radeons too, though you'll find more rough edges than on data-center MI300X.
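A two-card sketch using the flags named above; the model path is illustrative, and `--tensor-split 1,1` divides the work evenly between the two identical cards.

```bash
# Llama 3.1 70B q4 across 2x RX 7900 XTX with row split, as described
# above. Model path is illustrative; --tensor-split 1,1 balances the cards.
./llama-cli -m models/llama-3.1-70b-instruct-q4_k_m.gguf \
  -ngl 99 --split-mode row --tensor-split 1,1
```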
The catch is scaling efficiency. A 2× RX 7900 XTX setup running Llama 3.1 70B q4 lands around 14.8 generation t/s (see benchmark above). Extrapolating from the single-card 41.2 t/s on Qwen3.6-27B q4, ideal scaling across two cards would predict roughly 20 t/s on the much larger 70B workload; the measured figure is about 75% of that.
Why: peer-to-peer over PCIe 4.0 x16 is slower than NVLink (which 2× RTX 3090 owners have). ROCm's RCCL collective library has been catching up to NCCL, but the hardware substrate matters.
If you're going multi-GPU and budget-constrained, 2× RX 7900 XTX at $1798 total beats 2× RTX 4090 at $3198 on raw $/VRAM. Against 2× used RTX 3090s with NVLink at ~$1800, the AMD pair is probably 10–20% slower on 70B but easier to source new with warranty.
Perf-per-dollar at street prices
Tokens-per-second-per-dollar for Qwen3.6-27B q4 generation:
| Card | Gen t/s | Street price | t/s per $1000 |
|---|---|---|---|
| RX 7900 XTX | 41.2 | $899 | 45.8 |
| RTX 3090 (used) | 38.5 | $899 | 42.8 |
| RTX 4090 | 58.4 | $1599 | 36.5 |
| RTX 5090 | 91.2 | $1999 | 45.6 |
By this single metric the RX 7900 XTX is the best value among 24GB cards, just edging out the RTX 5090 once you factor in list price. For 27B-and-under workloads — which covers most local-LLM use cases in 2026 — the AMD pick is defensible on pure dollar terms.
For 70B-and-up, perf-per-dollar inverts because the AMD W7900 at $3499 is up against 2× RTX 3090s at $1800 that can split a 70B model with NVLink help. NVIDIA still wins large-model economics unless you go datacenter MI300X.
Common pitfalls
A short list of things that will eat your weekend if you don't know them up front:
- HSA_OVERRIDE_GFX_VERSION rabbit holes. Most modern Radeons don't need this on ROCm 6.3, but RX 7800 XT and below may. Check the official ROCm GPU support matrix before assuming your card "just works" (see the sanity-check sketch after this list).
- Mixing AMD and NVIDIA GPUs in one box. Technically possible, practically a mess — you'll fight ROCm/CUDA library load order and process pinning. Pick one.
- Old kernel versions. ROCm 6.3 wants kernel 6.5+ for best stability. Ubuntu 22.04's default 5.15 will work but you'll see weirder error messages. Ubuntu 24.04 LTS is the easy path.
- Power supply sizing. RX 7900 XTX pulls 355W steady-state and spikes higher. A 750W PSU paired with a top-tier CPU is undersized; budget 850W+ for single-GPU and 1200W+ for dual.
- bitsandbytes won't work. Anything that imports `bitsandbytes` (a lot of fine-tuning code) needs a CUDA GPU or a ROCm-compatible alternative like HQQ. Don't expect QLoRA tutorials to run unmodified.
- Windows is still rough. ROCm on Windows technically exists for select Radeons via WSL2, but the experience is years behind Linux. If you're committing to AMD for local LLM, commit to Linux.
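For the first pitfall, a quick sanity check before touching the override; gfx1100 is the RX 7900 XTX's native target, so on that card the override is a no-op and usually unnecessary.

```bash
# Sanity check for the HSA override pitfall above: see what ROCm reports
# before setting anything. RX 7900 XTX natively reports gfx1100.
rocminfo | grep -m1 -i "gfx"
# Only if your part is absent from the official support matrix:
export HSA_OVERRIDE_GFX_VERSION=11.0.0
```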
When NOT to buy AMD
Skip Radeon for local LLM if any of these apply:
- You fine-tune. CUDA's ecosystem (PEFT, bitsandbytes, Unsloth, Axolotl) is years ahead of ROCm. Fine-tuning on AMD is doable but you'll be the person debugging it.
- You need bleeding-edge model day-one support. New architectures land on CUDA first. If you must run Llama 5 the day it drops, NVIDIA.
- You're on Windows and won't move to Linux. Just buy NVIDIA.
- You want one card to handle 70B at high quant. A single 24GB card won't, AMD or NVIDIA. You either go multi-GPU (NVIDIA cheaper at the used 3090 tier) or 48GB+ pro cards (W7900 vs RTX 6000 Ada, NVIDIA wins on software).
Verdict matrix
Buy AMD if:
- You run llama.cpp or Ollama for personal use, 7B–27B at q4–q6, on Linux
- You want the cheapest 24GB new card with manufacturer warranty and don't care about a 25–30% prefill gap
- You're building a 2× consumer-card rig for 70B inference at q4 and want to spend $1800 instead of $3200
Stick NVIDIA if:
- You fine-tune, train, or do anything beyond inference
- You need vLLM/SGLang with maximum throughput for production serving
- You run Windows and want it to just work
- You want best-in-class 70B+ throughput on a single card (RTX 5090 32GB)
Bottom line
ROCm in 2026 is the first version of "AMD for local LLM" that doesn't come with a forum-rescue chapter. RX 7900 XTX at $899 is the value pick for 7B–27B workloads on Linux, delivering 70–80% of RTX 4090 throughput at 56% of the price. W7900 is the rare 48GB workstation card that runs 70B at q4 without multi-GPU contortions. Multi-GPU scaling and prefill performance still trail NVIDIA, and the 30–90 day software lag on new architectures is real.
If you're a Linux hobbyist running quantized inference and you're allergic to NVIDIA's pricing, AMD is finally a real option in 2026 — not a joke, not a project, just a slower-and-cheaper alternative with known sharp edges. If you live on the bleeding edge of training and frameworks, stay on NVIDIA. The middle ground — pure inference, mainstream models, mainstream tools — is wide enough now to fit a 7900 XTX comfortably.
Related guides
- 24GB GPU local LLM buying guide (2026)
- RTX 5090 vs RTX 4090 for AI inference
- DeepSeek-V4-Pro local inference hardware guide
Sources
- AMD ROCm 6.3 release notes (rocm.docs.amd.com), Q1 2026
- llama.cpp HIP backend PR #11942 benchmark thread, github.com/ggerganov/llama.cpp
- Phoronix, "ROCm 6.3 Radeon Inference Benchmarks," April 2026 (phoronix.com)
- vLLM v0.7.x ROCm support matrix (docs.vllm.ai), April 2026
- r/LocalLLaMA AMA thread with AMD ROCm engineering, April 2026
- Hugging Face TGI ROCm Docker image notes (huggingface.co/docs/text-generation-inference)
- techpowerup.com GPU database for spec/bandwidth/TDP figures
