Short answer: Yes, with caveats. The Ryzen AI Max+ 395 box ships in June 2026 at roughly $2,000 with 128GB of unified LPDDR5x-8000 memory, which lets you load every dense model up to 120B parameters (q4_K_M) and run them at usable — though not blistering — speeds. Expect 3.5–5 tok/s on Llama 3.1 70B, 18–22 tok/s on Qwen 3.6 27B, and 35–45 tok/s on Gemma 4 9B. It is the cheapest credible 128GB LLM rig in 2026 and the first AMD box to take the Mac Studio fight seriously, but ROCm 7 driver maturity and 256-bit memory bandwidth are the gating constraints.
AMD's June 2026 in-house Strix Halo box and what it means for the budget local-LLM tier
For two years the Mac Studio M3 Ultra has been the only sub-$10k machine that can hold a 70B-class LLM in unified memory and run it without Frankenstein offload pipelines. Apple's lock on that niche was uncomfortable for a buyer base — local-LLM enthusiasts — that otherwise buys NVIDIA and AMD hardware on the desktop. NVIDIA's answer (the Project Digits / DGX Spark devkit) ships at $3,000+ for 128GB and is supply-constrained; until April 2026, AMD had no answer at all.
The Ryzen AI Max+ 395 box, announced at AMD's April 2026 event for a June ship date, is the first credible counter. It is not a workstation in the EPYC / Threadripper sense. It is a small-form-factor mini-PC built around the Strix Halo APU — 16 Zen 5 cores, a Radeon 8060S iGPU with RDNA 3.5, 128GB of LPDDR5x-8000 in a unified-memory configuration, and a 320W PSU. AMD will sell a reference design directly, and partners (Beelink, MINISFORUM, Corsair, GMKtec, Asus's ROG Flow Z13 in mobile form) are shipping nearly identical hardware at $1,899–$2,399.
What it brings to the table is unified memory at PC pricing. Strix Halo's I/O die lets the iGPU address the full 128GB pool directly, with none of the PCIe x16 transfer ceiling that cripples discrete-GPU tok/s the moment weights or KV-cache spill out of VRAM. Memory bandwidth tops out at 256 GB/s — not great compared to a Mac Studio M3 Ultra's 819 GB/s, and embarrassing next to an RTX 5090's 1,792 GB/s — but 4–8× what a PCIe 4.0 or 5.0 x16 link (32 or 64 GB/s) delivers when a discrete GPU has to pull weights from system DRAM. For local-LLM workloads, bandwidth and capacity are the figures of merit: bandwidth governs decode speed; capacity governs which models you can load at all. The Ryzen 395 box is the cheapest way in 2026 to land both 128GB capacity and >200 GB/s of usable bandwidth in a single chassis.
The catch — and we will spend the rest of this article on it — is software. ROCm 7 was supposed to make AMD a peer to CUDA on inference workloads. As of April 2026 it is closer than it has ever been, but llama.cpp's Vulkan backend is still the fastest option on Strix Halo for several model architectures, and ComfyUI / SDXL workflows remain a coin-flip. Strix Halo earns a strong recommendation only if you do your research on framework support before buying.
Key Takeaways
- 128GB unified memory at $2,000 is the headline feature. It puts dense Llama 3.1 70B in q4_K_M (~42GB) comfortably in memory with room left for 32k context KV-cache.
- Tok/s is mid-pack: ~4 tok/s on 70B q4, ~20 tok/s on 27B q4, ~40 tok/s on 9B q4. Faster than CPU-only by a wide margin, slower than the Mac Studio at 70B, and far slower than an RTX 5090 on anything that fits in its 32GB of VRAM.
- Bandwidth is the bottleneck: 256 GB/s of LPDDR5x-8000. Decode is memory-bandwidth-bound on both dense and MoE models, so tok/s scales with bandwidth, not FLOPs.
- ROCm 7 mostly works for Ollama and llama.cpp but has rough edges for vLLM and SGLang. The Vulkan backend in llama.cpp is often faster than ROCm on Strix Halo.
- Sustained inference draws ~180–230W out of a 320W PSU, with thermals on most boxes pushing the APU to 95–100°C under prolonged load. Cooling is a real consideration.
What is the AMD Ryzen AI Max+ 395 box and when does it ship?
The "AMD Ryzen AI Max+ 395 box" is shorthand for the consumer mini-PC form factor built around the Strix Halo APU. AMD's reference design (a SteamDeck-class chassis) launches in June 2026 in two SKUs: a 64GB / 1TB model at $1,599 list and a 128GB / 2TB model at $1,999 list. Partner SKUs — Beelink GTR9 Pro, MINISFORUM MS-S1 MAX, Corsair AI Workstation 300 (385 chip variant), GMKtec EVO-X3 — vary the 256GB SSD and chassis but ship the identical Ryzen AI Max+ 395 silicon and the same 128GB LPDDR5x-8000 memory configuration.
All vendors are shipping June–August 2026. AMD's reference design will be available direct from AMD.com and from a small set of channel partners; partner boxes are already preorder-listed on Amazon and Newegg. Pricing has clustered tightly around $1,999–$2,099 for 128GB / 2TB configurations, and we do not expect that to drop substantially in the first six months — Strix Halo silicon is supply-constrained at AMD's foundry and demand from the local-LLM crowd is real.
How much usable VRAM does 128GB unified memory give a local LLM?
The honest answer is: about 96GB usable, after you account for kernel reserve and the Windows / Linux working set. You set the GPU memory split either in the UEFI firmware menu or through AMD's Adrenalin Pro driver (26.3+): by default the iGPU gets a 64GB reserve, but you can raise it to 112GB in 8GB increments. Most local-LLM users we know set 96GB or 112GB and leave the rest for OS / scratch.
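On Linux there is a second lever: the amdgpu driver can map system RAM into the iGPU's GTT pool beyond the UEFI carve-out. Here is a minimal sketch of the boot parameters the Strix Halo community passes for this — the parameter names and values are assumptions to verify against your own kernel (`modinfo amdgpu ttm`) before relying on them:

```bash
# /etc/default/grub -- let the iGPU map ~108GB of system RAM via GTT.
# Assumed semantics: ttm.pages_limit is in 4KiB pages, amdgpu.gttsize in MiB;
# confirm both against your kernel version before using.
GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.gttsize=110592 ttm.pages_limit=28311552"
# then: sudo update-grub && sudo reboot
```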
96GB usable VRAM is enough to load all of the following (a quick fit-check sketch follows the list):
- Llama 3.1 70B at q4_K_M: ~42GB weights + 8–24GB KV-cache (context-length dependent). Comfortable.
- Mistral Large 2 (123B) at q4_K_M: ~74GB weights + 12GB KV-cache at 8k context. Tight; expect to use q3 for any room to grow.
- Qwen 3.6 72B at q5_K_M: ~50GB weights, big KV-cache headroom. Comfortable.
- Gemma 4 27B at fp16: ~54GB. Possible but unusual — most users run q4 or q5.
- Any dense 13B / 27B model at fp16 or 70B at q4: generous headroom.
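To sanity-check a model-plus-context combination yourself, here is a rough fit-checker. The weight sizes come from this article; the per-1k-token KV-cache rates are back-derived from its tables, so treat them as estimates, not vendor specs:

```python
# name -> (q4_K_M weights in GB, approx KV-cache GB per 1k tokens of context)
# Figures taken from / derived from the tables in this article.
MODELS = {
    "llama-3.1-70b": (42.0, 1.0),
    "qwen-3.6-27b": (16.5, 0.44),
    "mistral-large-2": (74.0, 1.5),
}

def fits(name: str, context_tokens: int, reserve_gb: float = 96.0) -> bool:
    """True if weights plus KV-cache fit inside the iGPU memory reserve."""
    weights_gb, kv_gb_per_1k = MODELS[name]
    total_gb = weights_gb + kv_gb_per_1k * context_tokens / 1_000
    print(f"{name} @ {context_tokens:>7,} ctx: {total_gb:5.1f} GB")
    return total_gb <= reserve_gb

fits("llama-3.1-70b", 32_000)    # ~74 GB: fits at a 96GB reserve
fits("mistral-large-2", 32_000)  # ~122 GB: does not fit
```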
What it cannot do is load Ling 2.6 1T, Kimi K2.6, DeepSeek V4 Pro, or Llama 4 Behemoth (the 405B-class) in any usable quant. The trillion-parameter MoE class is out of reach without aggressive offload that defeats the point of a mini-PC. The 405B dense class technically loads at 1.5-bit-class quants with most weights paged from disk, but the resulting throughput (under 1 tok/s) is unusable. Treat 70B as the practical ceiling.
What tok/s can Strix Halo hit on Llama 3.1 70B, Qwen 3.6 27B, and Gemma 4?
Numbers below are based on community benchmarks from the LocalLLaMA April 2026 Strix Halo megathread and AMD's published llama.cpp-Vulkan figures. All assume 8k context, q4_K_M weights, batch size 1.
| Model | Weights size | Strix Halo tok/s | Backend |
|---|---|---|---|
| Gemma 4 9B | ~5.5 GB | 38–45 | llama.cpp Vulkan |
| Llama 3.1 8B | ~4.7 GB | 42–50 | llama.cpp Vulkan |
| Qwen 3.6 27B | ~16.5 GB | 18–22 | llama.cpp Vulkan |
| Gemma 4 27B | ~16.0 GB | 19–23 | llama.cpp Vulkan |
| Llama 3.1 70B | ~42 GB | 3.5–5.0 | llama.cpp Vulkan |
| Qwen 3.6 72B | ~43 GB | 3.4–4.8 | llama.cpp Vulkan |
| Mistral Large 2 | ~74 GB | 1.8–2.4 | llama.cpp Vulkan |
Two surprises in this data. First: the Vulkan backend is consistently 8–15% faster than ROCm 7's native HIP path on Strix Halo as of April 2026, because Vulkan has had more shader-optimization work for the Radeon 8060S iGPU. On Linux, build llama.cpp with the Vulkan backend enabled (the recipe later in this article shows how); on Windows, use LM Studio's "Vulkan" runtime. Second: 70B-class throughput is genuinely usable. 4 tok/s is about 240 tokens per minute, roughly 180 words — slower than a Mac Studio's 9–11 tok/s but fast enough for chat and code review. It is not fast enough for agentic loops.
Prefill (prompt processing) is faster than generation, as always: expect 320–480 tok/s on 70B q4, which means an 8k retrieved-context prompt takes 17–25 seconds to first token. Long contexts magnify this. A 32k prompt against Qwen 3.6 27B takes ~38 seconds before the first generated token arrives. Plan your prompt budget accordingly.
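That budget is easy to compute: time to first token is just prompt length divided by prefill throughput. A two-line sketch using the measured range above:

```python
# Time-to-first-token estimate: prompt tokens / prefill throughput (tok/s).
def ttft_seconds(prompt_tokens: int, prefill_tok_s: float) -> float:
    return prompt_tokens / prefill_tok_s

# 8k prompt on 70B q4 at the slow and fast ends of the measured range:
print(ttft_seconds(8_192, 320))  # ~25.6 s
print(ttft_seconds(8_192, 480))  # ~17.1 s
```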
Ryzen 395 box vs Mac Studio M3 Ultra vs RTX 5090 — which is the right $2-3k local-LLM rig?
| Spec / Box | Ryzen 395 box | Mac Studio M3 Ultra 128GB | RTX 5090 desktop |
|---|---|---|---|
| Total memory | 128 GB unified | 128 GB unified | 32 GB GDDR7 + system DRAM |
| Memory bandwidth | 256 GB/s | 819 GB/s | 1,792 GB/s (VRAM) |
| Compute | RDNA 3.5 iGPU | M3 Ultra 80c GPU | RTX 5090 (21,760 CUDA) |
| Llama 3.1 70B q4 | 4 tok/s | 10 tok/s | 1–4 tok/s (offload) |
| Qwen 3.6 27B q4 | 20 tok/s | 28 tok/s | 70–95 tok/s |
| TDP / sustained | 200W | 280W | 580W |
| Idle power | 35W | 18W | 80W |
| MSRP (128GB config) | $1,999 | $4,999 | ~$3,000 system w/ 32GB |
| Software story | ROCm 7 / Vulkan | MLX, llama.cpp Metal | CUDA — best-supported |
| Fine-tuning friendly | No | Limited (MLX-LM) | Yes (CUDA + LoRA) |
Three buyers, three answers. The Ryzen 395 box wins on price for anyone whose primary need is running 70B-class models offline at $2,000. The Mac Studio wins on tok/s at the 70B tier and on perf-per-watt by a wide margin — its 819 GB/s memory bandwidth is the difference. The RTX 5090 wins decisively on every model that fits in 32GB VRAM and is the only realistic option if you also need fine-tuning or SDXL / Flux image-gen workloads on the same machine.
The honest framing: if 70B is your ceiling and budget is the binding constraint, the Ryzen 395 box is the right answer. If you can stretch to $5k and want quieter, faster, more polished software, the Mac Studio M3 Ultra is the better buy. If your workloads stay under 32B parameters or you need image-gen / fine-tuning, the RTX 5090 desktop dominates and the unified-memory boxes are not a comparison.
Does ROCm 7 actually work for Ollama / llama.cpp on Strix Halo?
Mostly yes — but the Vulkan backend in llama.cpp is often the better choice on Strix Halo specifically. Here is the state of play as of April 2026:
Ollama: Works. Uses llama.cpp under the hood; you can configure it to use either ROCm or Vulkan. Vulkan is faster on Strix Halo for most architectures.
llama.cpp: Works well on both ROCm 7 (HIP backend) and Vulkan. Vulkan is currently 8–15% faster on the Radeon 8060S because the shader-codegen path has been more aggressively optimized. Both backends are stable.
vLLM 0.7+: Works on ROCm 7 with the AMD-published wheel, but expect rough edges. Continuous batching is the biggest differentiator vs llama.cpp; if you are running multi-tenant inference, vLLM is worth the setup cost. Single-stream chat use cases are better served by llama.cpp.
SGLang: Experimental on ROCm. Basic generation works, but expert-parallel routing for MoE models is broken. Skip for now.
ComfyUI / SDXL / Flux: Coin-flip. ROCm 7 added the SDXL UNet kernels that were missing in 6.x, but Flux fp16 is unstable on iGPU memory layouts. If you primarily care about image-gen, this is not the right box.
Fine-tuning: Don't. Strix Halo's compute is not in the ballpark needed for LoRA on 70B-class models. Use a cloud GPU or buy a discrete GPU desktop.
The pragmatic recipe for most readers: install Ubuntu 24.04 LTS, follow AMD's ROCm 7 install guide, pull the latest llama.cpp, build it with Vulkan support enabled, and layer Ollama or another front-end on top if you want its model management. That stack is stable and performs well.
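A minimal sketch of that stack on Ubuntu 24.04 — package names and the model path are illustrative and worth double-checking for your release:

```bash
# Build dependencies plus the Vulkan pieces llama.cpp's backend needs
sudo apt install -y build-essential cmake git libvulkan-dev glslc

# Build llama.cpp with the Vulkan backend enabled
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j"$(nproc)"

# Serve a quantized model with every layer offloaded to the iGPU
./build/bin/llama-server \
  -m ~/models/llama-3.1-70b-instruct-q4_k_m.gguf \
  -ngl 99 -c 8192 --host 127.0.0.1 --port 8080
```

llama-server exposes an OpenAI-compatible endpoint, so any compatible client pointed at http://127.0.0.1:8080/v1 turns the box into the always-on appliance described in the verdict below.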
What are the thermal and power-draw realities of sustained inference?
The Strix Halo APU has a configurable TDP between 45W (silent mode) and 120W (performance mode). Local-LLM workloads pin the iGPU at near-100% utilization for the full duration of each request, so you will land at the high end of that envelope. Expect:
- Idle: 30–40W at the wall.
- Light inference (8B model, batch 1): 95–115W.
- Heavy inference (70B q4): 180–230W.
- Prefill on long context: brief spikes to 250W.
Junction temperatures on the partner boxes tested in the LocalLLaMA megathread cluster around 95–102°C under sustained load, with thermal throttling kicking in around 105°C on most chassis. Beelink's GTR9 Pro and MINISFORUM's MS-S1 MAX have the best-rated cooling (active fan with vapor-chamber heatsink); Corsair's AI Workstation 300 and GMKtec's EVO-X3 are noisier under sustained load. AMD's reference design splits the difference. None of these boxes are silent; expect 38–45 dBA of fan noise during heavy inference.
The 320W PSU is the headline rating; effective draw under sustained load tops out around 230W, leaving healthy headroom. None of these boxes need a dedicated circuit. Plug into any 15A residential outlet and you are fine.
Quantization matrix: q4_K_M tok/s and memory headroom
| Model | q4 weights | Total VRAM @ 8k | Total VRAM @ 32k | tok/s (q4) | Headroom on 96GB? |
|---|---|---|---|---|---|
| Llama 3.1 8B | 4.7 GB | 6.2 GB | 13 GB | 42–50 | Generous |
| Gemma 4 9B | 5.5 GB | 6.7 GB | 14 GB | 38–45 | Generous |
| Mistral 12B | 7.1 GB | 9.0 GB | 18 GB | 32–38 | Generous |
| Qwen 3.6 27B | 16.5 GB | 19 GB | 30 GB | 18–22 | Comfortable |
| Gemma 4 27B | 16.0 GB | 19 GB | 30 GB | 19–23 | Comfortable |
| Llama 3.1 70B | 42 GB | 50 GB | 74 GB | 3.5–5.0 | Tight at 32k |
| Qwen 3.6 72B | 43 GB | 51 GB | 75 GB | 3.4–4.8 | Tight at 32k |
| Mistral Large 2 123B | 74 GB | 86 GB | ~122 GB (OOM) | 1.8–2.4 | 8k only |
The "Headroom on 96GB" column assumes you have set the iGPU memory reserve to 96GB. At 112GB reserve (the practical maximum) you can fit Mistral Large 2 at 32k context, but most users will find it unusably slow at 2 tok/s.
Spec-delta: Ryzen 395 box vs Mac Studio M3 Ultra 128GB vs RTX 5090 desktop
| Field | Ryzen 395 box | Mac Studio M3 Ultra | RTX 5090 desktop |
|---|---|---|---|
| Memory | 128GB unified | 128GB unified | 32GB GDDR7 |
| Memory bandwidth | 256 GB/s | 819 GB/s | 1,792 GB/s |
| Compute | RDNA 3.5 iGPU | M3 Ultra 80c GPU | 21,760 CUDA cores |
| TDP / sustained | 120W APU / 230W system | 280W | 580W GPU + system |
| MSRP (128GB config) | $1,999 | $4,999 | ~$3,000 system |
| Form factor | Mini-PC | Mini-tower | Mid-tower desktop |
Prefill vs generation: why APUs are decode-bound
Strix Halo is memory-bandwidth-bound, not compute-bound, for typical local-LLM inference. The Radeon 8060S iGPU has plenty of FLOPs (roughly 57 TFLOPS fp16), but a 70B q4 decode step has to stream all 42GB of weights through the memory bus once per token. At 256 GB/s, a perfectly memory-bound implementation gives you 256 / 42 = ~6 tok/s as the theoretical ceiling. Real-world overhead pulls that to 4 tok/s.
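The same division works for any model in the benchmark table. A quick sketch of the roofline, using the weight sizes quoted in this article:

```python
# Decode roofline for a memory-bandwidth-bound APU: every generated token
# streams the full quantized weight set through the bus once, so
# bandwidth / weight-size is a hard ceiling. Real kernels land below it
# (KV-cache reads, dequantization, kernel-launch overhead).
BANDWIDTH_GB_S = 256.0  # Strix Halo's 256-bit LPDDR5x-8000 bus

def decode_ceiling_tok_s(weights_gb: float) -> float:
    return BANDWIDTH_GB_S / weights_gb

print(decode_ceiling_tok_s(42.0))  # ~6.1 for Llama 3.1 70B q4; ~4 measured
print(decode_ceiling_tok_s(74.0))  # ~3.5 for Mistral Large 2 q4; ~2 measured
```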
Prefill is different. Prefill processes many tokens per pass, reusing weights, so it is compute-bound rather than bandwidth-bound. The iGPU's FLOPs do useful work here, and prefill on 70B q4 hits 320–480 tok/s. That is why a long retrieved-context prompt is a 20-second wait but a chat reply streams at 4 tok/s.
Implication: Strix Halo is a good fit for chat and "answer this question" workflows. It is a poor fit for agentic loops that want fast turn-by-turn replies, and a poor fit for batched inference where you want to serve concurrent users. For those, look at GPU rigs.
Context-length impact at 8k / 32k / 128k
KV-cache scales linearly with context length. For Llama 3.1 70B q4 on Strix Halo:
- 8k context: ~8GB KV-cache. Weights + KV total ~50GB. Fits comfortably in 96GB.
- 32k context: ~32GB KV-cache. Total ~74GB. Tight but fits at 96GB reserve.
- 128k context: ~128GB KV-cache. Total ~170GB. Does not fit. Use q3 weights or move to a different rig.
For Qwen 3.6 27B (see the sketch after this list):
- 8k context: ~3GB KV-cache. Total ~19GB. Easy.
- 32k context: ~14GB KV-cache. Total ~30GB. Easy.
- 128k context: ~56GB KV-cache. Total ~72GB. Fits at 96GB reserve, leaves modest headroom.
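Running the fit-checker sketch from earlier over those three context lengths reproduces these totals to within a gigabyte or so:

```python
# Reuses MODELS and fits() from the fit-check sketch above.
for ctx in (8_000, 32_000, 128_000):
    fits("qwen-3.6-27b", ctx)  # ~20 GB, ~31 GB, ~73 GB -- all inside 96GB
```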
The 128GB capacity advantage shines at long-context 27B-class workloads. If your use case is "summarize this 100k-token document," the Ryzen 395 box is genuinely competitive with anything in its price class.
Perf-per-dollar and perf-per-watt math
| Rig | Cost | Tok/s on 70B q4 | $ / tok/s | Watts | Tok/s per W |
|---|---|---|---|---|---|
| Ryzen 395 box (128GB) | $1,999 | 4.0 | $500 | 220W | 0.018 |
| Mac Studio M3 Ultra 128GB | $4,999 | 10.0 | $500 | 280W | 0.036 |
| RTX 5090 desktop (32GB) | $3,000 | 1.5 (offload) | $2,000 | 600W | 0.0025 |
The Ryzen 395 box matches the Mac Studio on dollars-per-tok/s and is the cheapest entry into the 70B tier. The Mac Studio doubles its performance-per-watt because of the bandwidth advantage. The RTX 5090 is not in the running for 70B work — it is a fundamentally different rig optimized for a different workload.
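Both derived columns are a one-liner each if you want to re-run the math with street prices instead of MSRPs:

```python
# (price USD, tok/s on 70B q4, sustained watts), from the table above
rigs = {
    "ryzen-395-box": (1_999, 4.0, 220),
    "mac-studio-m3-ultra": (4_999, 10.0, 280),
    "rtx-5090-offload": (3_000, 1.5, 600),
}
for name, (usd, tok_s, watts) in rigs.items():
    print(f"{name}: ${usd / tok_s:,.0f} per tok/s, "
          f"{tok_s / watts:.4f} tok/s per W")
```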
Verdict matrix
- Get the Ryzen 395 box if you want the cheapest path to 70B-class local inference, are happy with chat-grade tok/s, and live mostly in Ollama / llama.cpp. The $2k price tag is unmatched.
- Get a Mac Studio M3 Ultra if budget allows $5k+, you want 2× the tok/s at the 70B tier, and you value polished software (MLX, llama.cpp Metal, LM Studio) over framework breadth.
- Get a 5090 desktop if your models stay at 32B or below, you also need image-gen or fine-tuning, or you need >50 tok/s decode for agentic workflows. Capacity above 32GB is not your bottleneck.
Bottom line
The AMD Ryzen AI Max+ 395 box is the most interesting local-LLM hardware launch of 2026 in the budget tier. For $2,000 it lands 128GB of unified memory and 256 GB/s of bandwidth in a small-form-factor chassis — neither is class-leading, but together they unlock 70B-class inference at a price point that no other rig hits in 2026. ROCm 7 is mostly fine, llama.cpp Vulkan is solid, and the box is genuinely usable as a quiet always-on local inference appliance. If your spend ceiling is $2,500 and you want to run Llama 3.1 70B or Qwen 3.6 72B at home, this is the right buy. If you can afford $5k or you primarily care about sub-32B models, look elsewhere — the Mac Studio is faster at this capacity tier and the 5090 desktop dominates everywhere a model fits in 32GB.
Related guides
- Best GPU for AI workstation in 2026
- ROCm 2026 state-of-play for AMD local LLMs
- Best 24GB GPU for local LLM in 2026
Sources
- AMD, Ryzen AI Max+ 395 product page and June 2026 launch announcement — amd.com
- r/LocalLLaMA, Strix Halo benchmark megathread, April 2026 — covering Beelink GTR9 Pro, MINISFORUM MS-S1 MAX, and Corsair AI Workstation 300
- Phoronix, ROCm 7 review and Strix Halo iGPU performance analysis, April 2026 — phoronix.com
- llama.cpp project, Vulkan backend optimizations for RDNA 3.5, March–April 2026 — github.com/ggerganov/llama.cpp
- Notebookcheck, Ryzen AI Max+ 395 deep-dive: thermals, power, sustained performance, April 2026 — notebookcheck.net
