AMD Ryzen AI Max 400 'Gorgon Halo': 192GB Unified Memory for Local LLMs

AMD Ryzen AI Max 400 'Gorgon Halo': 192GB Unified Memory for Local LLMs

AMD's $3-$4K APU lands 100B+ MoE LLMs in a 1-liter mini-PC for the price of a Mac Studio Ultra mid-tier.

Gorgon Halo's 192GB unified memory ceiling and LPDDR5X-9600 bandwidth land it as a real Mac Studio competitor for local 70B-120B LLM inference at half the price.

Yes — AMD's Ryzen AI Max 400 "Gorgon Halo" APU can run 70B-class LLMs locally because its 192GB unified-memory ceiling lets you load even a 120B-Q4 model without offloading to system RAM, and the LPDDR5X-9600 bandwidth (~307 GB/s) is enough to push 8-11 tok/s on Llama 3.1 70B-Q4. It is slower than a dual-RTX-3060 build for raw throughput but wins on idle power, form factor, and the ability to host model sizes that a discrete-GPU rig literally cannot fit.

Editorial intro: refreshing the APU lineup, the 192GB Mac-Studio-killer claim, and where Strix Halo left off

Per Tom's Hardware's launch coverage, AMD's Ryzen AI Max 400 "Gorgon Halo" refresh extends the Strix Halo lineup with a memory ceiling that genuinely repositions the APU against Apple's Mac Studio for local LLM hosting. Strix Halo's 128 GB ceiling was already enough to fit a 70B-Q4 model with room for context. The new 192 GB Gorgon Halo tier breaks past that into 120B-Q4 territory and even 405B-Q2 with tight context budgets — sizes that have, until now, required either a Mac Studio M3 Ultra ($5,599+) or a multi-GPU server build (RTX A6000 dual at ~$10,000 used).

The cynical read on a 192 GB unified-memory part is that bandwidth, not capacity, is the binding constraint for LLM throughput. That read is half right. A Mac Studio M3 Ultra runs 819 GB/s thanks to its 1024-bit memory bus; Strix Halo shipped at 256 GB/s on a 256-bit bus. Gorgon Halo's bump to LPDDR5X-9600 raises that to roughly 307 GB/s on the same bus width — closing the gap modestly but not erasing Apple's 2.7x bandwidth lead. For long-context prefill on dense models, Apple still wins on raw tok/s.

But the second half of the calculation has shifted. Local LLM users in 2026 are mostly running Mixture-of-Experts (MoE) models — DeepSeek V4, Qwen 3.6 MoE, Llama 4 MoE — where only a small fraction of weights is active per token. MoE makes the "you need 800 GB/s of bandwidth" argument less sharp because the per-token compute is dominated by the small active expert slice. A 307 GB/s APU keeps up with an 819 GB/s GPU much better on MoE than on dense 70B. That's the architectural shift that makes a 192 GB APU genuinely interesting at $3,000-$4,500 versus a $5,599 Mac Studio.

For builders who would rather assemble a discrete-GPU rig — multiple Ryzen 7 5800X hosts driving ZOTAC RTX 3060 12GB cards under tensor parallelism — Gorgon Halo's value proposition has to be weighed against raw throughput and CUDA-stack maturity. We'll walk through both paths.

Key takeaways

  • 192 GB unified memory addresses up to ~168 GB as VRAM via ROCm carveout — enough for 120B-Q4 or 405B-Q2.
  • Memory bandwidth: ~307 GB/s (LPDDR5X-9600, 256-bit) — up from 256 GB/s on Strix Halo, still 2.7x behind Mac Studio M3 Ultra (819 GB/s).
  • NPU: 80 TOPS on the AI engine block — useful for batched prefill and draft-token verification on speculative decoding.
  • Target workload: 70B-120B local hosting, long-context inference, low-idle-power 24/7 agents.
  • Pricing tier: expected $3,000-$4,500 in OEM mini-PCs (Framework Desktop refresh, Beelink, GMKTec) — about $1,500-$2,500 below Mac Studio M3 Ultra 192 GB.
  • Ship window: Q1-Q2 2026, with Framework Desktop confirmed as a launch partner.

What is the Ryzen AI Max 400 "Gorgon Halo" and how does it differ from Strix Halo?

The Ryzen AI Max 400 is the second-generation Strix Halo silicon, on a refined 4 nm node with the same 16-core Zen 5 CPU complex and the same Radeon 8060S (RDNA 3.5) integrated GPU as the original Strix Halo, but with three changes that matter:

  1. LPDDR5X-9600 support (was LPDDR5X-8533), bumping peak memory bandwidth from 256 GB/s to ~307 GB/s.
  2. 192 GB max RAM tier (was 128 GB), enabled by 48 Gbit LPDDR5X chips reaching production volume.
  3. NPU compute tuning that AMD claims hits 80 TOPS sustained on the embedded XDNA 2 block, up from the original Strix Halo's 50 TOPS peak.

Per AMD's official Ryzen AI Max product page, the chip is sold as a soldered-on-package solution to OEMs — no socketed desktop version exists. You buy it in a mini-PC (Framework Desktop, Beelink GTR, GMKTec K11) or a workstation-class laptop. The 192 GB SKU is BGA-soldered at the factory; you cannot upgrade memory after purchase.

That last constraint matters for buyers: pick the RAM tier carefully. A 96 GB Gorgon Halo at ~$2,800 is fine for 70B-Q4 with 32K context; the 192 GB tier at $3,800+ unlocks 120B-Q4 with comparable context and 70B at FP8 (which performs measurably better on multi-turn reasoning benchmarks).

How much VRAM-equivalent does 192 GB unified memory actually expose to llama.cpp?

ROCm exposes a configurable carveout via BIOS. On Strix Halo systems, AMD's reference firmware allowed 50% (default) up to 87.5% (manually set) of installed DRAM to be reserved for the GPU. Gorgon Halo's reference firmware extends that to a 96% max carveout in "AI workstation" mode — explicitly intended for local-LLM hosting.

On a 192 GB Gorgon Halo, that puts the upper bound at roughly 184 GB addressable as GPU memory when running in AI workstation mode. The OS gets the remaining 8 GB plus whatever swap you've configured. Plan to dedicate the box to inference if you push the carveout that aggressively — 8 GB is enough for Linux + a small inference server but not for a desktop with Chrome and a few IDEs.

llama.cpp's HIP backend treats the carveout as a single contiguous device, so a 120B-Q4 model (~75 GB on disk, ~80 GB loaded with KV cache for 32K context) fits cleanly without tensor splitting. Per llama.cpp's GitHub discussions, users running Strix Halo systems have reported successful 70B-FP16 loads (~140 GB) with the 87.5% carveout — the 96% carveout on Gorgon Halo extends that headroom further.

What memory bandwidth does Gorgon Halo deliver vs Mac Studio M3 Ultra?

SystemMemory TechBus WidthPeak BWEffective BW (LLM workload)
Strix Halo (Ryzen AI Max+ 395)LPDDR5X-8533256-bit256 GB/s~210 GB/s
Gorgon Halo (Ryzen AI Max 400)LPDDR5X-9600256-bit307 GB/s~250 GB/s
Mac Studio M3 UltraLPDDR5-85331024-bit819 GB/s~720 GB/s
RTX 3060 12GB (discrete)GDDR6192-bit360 GB/s~310 GB/s
RTX 4090 24GB (discrete)GDDR6X384-bit1,008 GB/s~880 GB/s
Dual RTX 3060 (tensor parallel)720 GB/s aggregate~540 GB/s

Mac Studio M3 Ultra still wins on raw bandwidth by 2.7x against Gorgon Halo. That gap is most visible in long-context prefill, where the model must process tens of thousands of input tokens before generating the first output token. A 32K-token prefill on Llama 70B takes roughly 18 seconds on Mac Studio Ultra vs ~52 seconds on Strix Halo at the same quant — Gorgon Halo should narrow that to ~38 seconds based on the 20% bandwidth uplift plus NPU prefill acceleration.

For short-prompt interactive chat (under 2K input tokens), the bandwidth gap matters much less. Decode tok/s on a 70B-Q4 model lands around 6-8 tok/s on Strix Halo, 8-11 tok/s expected on Gorgon Halo, and 15-18 tok/s on Mac Studio Ultra. You feel the Apple advantage in long-doc summarization; you don't feel it in chat.

Which LLM sizes actually fit — 70B BF16, 120B Q4, 405B Q2?

ModelQuantDisk SizeRAM Required (8K ctx)Fits on 192 GB?
Llama 3.1 70BQ4_K_M42 GB48 GB
Llama 3.1 70BFP16 (BF16)140 GB152 GB✅ (tight)
Llama 4 109B MoEQ4_K_M60 GB68 GB
Qwen 3 110BQ4_K_M65 GB74 GB
Llama 4 400B Maverick MoEQ4_K_M230 GB245 GB❌ (175 GB usable)
Llama 4 400B Maverick MoEQ2_K130 GB142 GB✅ (Q2 is lossy)
DeepSeek V3 671BQ4_K_M380 GB395 GB
DeepSeek V3 671BQ2_K220 GB240 GB❌ (175 GB usable)
Mistral Large 2 123BQ4_K_M70 GB78 GB

The headline 405B-Q2 fits only with a tight context budget (4K-8K). 70B-FP16 fits cleanly. The interesting band is 100B-130B-class MoE — Qwen 3 110B, Llama 4 109B, Mistral Large 2 123B — which sit comfortably in 192 GB at Q4 with 32K context. That's the "you cannot do this on any discrete-GPU rig under $8,000" zone where Gorgon Halo's value is strongest.

How does prefill speed compare to a discrete RTX 3060 12GB at the same model size?

A single RTX 3060 12GB does not fit a 70B model at any quantization. To run 70B-Q4 on RTX 3060, you need two cards in tensor parallelism (combined 24 GB VRAM is just barely enough for a 70B-Q4_K_S with a 4K context). Prefill on the dual-3060 build is dominated by PCIe x16 inter-card latency during the tensor parallel attention step.

Per llama.cpp's benchmark threads, a dual-3060 tensor-parallel rig on Llama 70B-Q4 hits roughly 320 prefill tok/s on an 8K input. Strix Halo measured ~180 prefill tok/s on the same workload (no inter-card overhead, but lower aggregate bandwidth). Gorgon Halo should land in the 215-240 prefill tok/s range — still slower than dual-3060 on prefill, but a meaningful step up.

For decode tok/s, the unified-memory APU narrows the gap considerably. Dual-3060 hits ~12-15 tok/s on 70B-Q4; Strix Halo measured ~6-8 tok/s; Gorgon Halo's bandwidth bump puts it at an expected 8-11 tok/s. Still a discrete-GPU win, but the gap is small enough that the APU's other advantages — power, form factor, single-box deployment — start to matter more than raw throughput.

What's the perf-per-dollar vs building an RTX 3060 dual-GPU rig?

Let's spec a comparable discrete-GPU build:

  • AMD Ryzen 7 5800X: $200
  • B550 motherboard (PCIe 4.0 x16/x8 split): $180
  • 64 GB DDR4-3600 (2x32): $140
  • 2x ZOTAC RTX 3060 12GB: $560 (current new pricing)
  • 850W Gold PSU: $130
  • 1 TB NVMe + case + cooler: $200
  • Total: ~$1,410

A 96 GB Gorgon Halo mini-PC lands at ~$2,800. A 192 GB tier at ~$3,800+. The discrete-GPU build is roughly half the price, and on dense 70B-Q4 it pushes more tok/s. The catch:

  1. The discrete rig can run 70B-Q4 with 24 GB combined VRAM — tight. Long contexts spill to system RAM and tank speed.
  2. It cannot run 100B+ MoE models at all — both the 70B-FP16 case and the 120B-Q4 case exceed combined VRAM.
  3. Idle power is roughly 250 W (both cards spun up). Gorgon Halo idles at ~40 W. For a 24/7 always-on local-agent rig, that's a $200/year electricity gap.
  4. Discrete is ~75 cm × 45 cm in a mid-tower. Gorgon Halo is 1L in a mini-PC. If your office desk matters, that's not nothing.

Perf-per-dollar on the specific workload of 70B-Q4 chat: discrete wins. Perf-per-dollar on the full envelope of "100B+ models, long context, 24/7, quiet desktop": Gorgon Halo wins.

Spec table: Gorgon Halo vs Strix Halo vs M3 Ultra

SpecRyzen AI Max+ 395 (Strix Halo)Ryzen AI Max 400 (Gorgon Halo)Apple M3 Ultra
Cores16 Zen 516 Zen 532 (24 perf + 8 eff)
NPU TOPS (peak)508038
iGPU/GPURadeon 8060S (RDNA 3.5, 40 CU)Radeon 8060S+ (RDNA 3.5, 40 CU)80-core Apple GPU
Memory techLPDDR5X-8533LPDDR5X-9600LPDDR5-8533
Bus width256-bit256-bit1024-bit
Peak BW256 GB/s307 GB/s819 GB/s
Max RAM128 GB192 GB512 GB
TDP55-120 W (configurable)55-120 W80-200 W
Starting OEM MSRP$1,900$2,800 (est)$5,599

Quantization matrix for Llama 70B / Qwen 110B on Gorgon Halo

QuantLlama 70B SizeLlama 70B Est tok/sQwen 110B SizeQwen 110B Est tok/s
Q2_K26 GB1441 GB11
Q3_K_M33 GB1351 GB10
Q4_K_M42 GB1165 GB9
Q5_K_M51 GB979 GB7
Q6_K60 GB892 GB6
Q8_075 GB7117 GB5
FP16140 GB4220 GB (DOESN'T FIT)

The sweet spot is Q4_K_M for Llama 70B and Q3_K_M for Qwen 110B — both fit comfortably with 32K context and push tok/s in the actually-usable range for interactive chat.

Common pitfalls and gotchas

  1. CUDA-only inference runtimes: there's no native CUDA on Gorgon Halo. AMD's path is ROCm/HIP. Workloads pinned to bitsandbytes 4-bit kernels, FlashAttention-2 CUDA-only kernels, or CUDA-12.8-specific ops will need ROCm-equivalent wrappers or won't run at all. Stick to llama.cpp, the ROCm fork of vLLM, MLC-LLM, or Ollama for clean operation.
  1. Memory tier locked at factory: the 192 GB SKU is BGA-soldered. You cannot upgrade a 96 GB unit to 192 GB later. Buy the tier you'll need in 18 months.
  1. OEM firmware quality: Strix Halo had a turbulent first six months of OEM firmware bugs (carveout sizing inconsistencies, idle-clock floor too high). Expect Gorgon Halo to clear those within three months of launch, but early adopters should be ready for BIOS-update cycles.
  1. Cooling under sustained inference: Strix Halo's 55W TDP target is idle plus light inference. Sustained 70B decode pulls the package up to 90W and pushes mini-PC fans into audible-range RPMs. The Framework Desktop has the best thermal headroom; cheap Beelink/GMKTec units will throttle within 20 minutes of sustained decode.
  1. Confusing "Ryzen AI Max" branding: AMD uses "Ryzen AI Max 300" for the Strix Halo refresh on the same silicon. The 400 series is the true Gorgon Halo refresh. Check the exact part number (e.g., "AI Max 400" vs "AI Max+ 395") before buying.

When NOT to choose Gorgon Halo

  • You only run 7B-13B models: a $300 used RTX 3060 12GB on any modern CPU host is dramatically cheaper and faster.
  • You need CUDA-specific stacks: bitsandbytes 4-bit + FlashAttention-2 + the latest llama-cpp-python features land on CUDA first.
  • You need >24 GB/s per-token decode: an RTX 4090 build pushes 30+ tok/s on 70B-Q4. Bandwidth-hungry users buy a $1,800 discrete GPU.
  • You play games on the same box: Gorgon Halo's iGPU is fine (Radeon 8060S, roughly RX 7600 class) but a dedicated RTX 4070 outclasses it. If gaming is 50% of your usage, build a discrete rig.

Verdict matrix

Get Gorgon Halo (192 GB) if... you want to host 100B-130B MoE models locally, value sub-50W idle power, want a 1L form factor, and can write off CUDA-only research code.

Get the dual RTX 3060 12GB build if... your model ceiling is 70B-Q4 with short context, you want max tok/s for the dollar, and you're comfortable maintaining a tower PC.

Get the MSI RTX 3060 Ventus 2X single-card build with Ryzen 7 5700X if... you live in the 7B-13B model space, want a cheap and quiet rig, and won't grow into 70B in the next 18 months.

Wait and get Mac Studio M3 Ultra 192 GB if... raw decode tok/s on dense 70B-FP16 is your only success metric and you can write the $5,599 check without flinching.

Bottom line

The Ryzen AI Max 400's 192 GB ceiling is the headline number, but the real upgrade is in MoE-era usability. Gorgon Halo makes 100B+ MoE models a normal-desk experience for under $4,000 — territory that previously required either a Mac Studio Ultra or a multi-GPU server build. Bandwidth still matters and Apple still wins it; AMD has chosen capacity-per-dollar as the lane to compete in, and at 192 GB / $3,800 it is genuinely the cheapest path to running models that simply do not fit anywhere else under $5,500.

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

Products mentioned in this article

Live prices from Amazon and eBay — both shown for every product so you can pick the channel that fits.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

How much of the 192GB unified memory is actually addressable as VRAM by llama.cpp?
Per AMD's developer documentation for the Strix Halo lineage, ROCm exposes a configurable VRAM carveout (BIOS-side, typically 50-87.5% of installed DRAM). On a 192GB Gorgon Halo system, that puts the upper bound near 168GB addressable as GPU memory. llama.cpp's HIP backend treats this as a single contiguous device, so a 120B-Q4 model (~75GB) loads without tensor splitting. System RAM available to the OS shrinks proportionally — plan to dedicate the box to inference if you push the carveout.
What memory bandwidth does Gorgon Halo deliver compared to the Mac Studio M3 Ultra?
Strix Halo shipped at 256 GB/s via LPDDR5X-8533 on a 256-bit bus. AMD's Gorgon Halo prerelease materials cite LPDDR5X-9600 on the same bus width, putting peak bandwidth near 307 GB/s. Mac Studio M3 Ultra runs 819 GB/s thanks to its 1024-bit bus. The 2.7× Apple advantage is real and shows up most in long-context prefill; for short-context decode, NPU and CPU efficiency narrow the gap considerably.
Will I get better tok/s than a dual RTX 3060 12GB build for a 70B-Q4 model?
Probably not for raw throughput. A dual-RTX-3060 rig pushes ~12-15 tok/s on Llama 3.1 70B-Q4 with tensor parallelism. Strix Halo measured ~6-8 tok/s on the same model; Gorgon Halo's bandwidth bump suggests 8-11 tok/s. The APU wins on idle power (40W vs 250W), form factor (1L mini-PC vs full ATX), and the ability to load 120B+ models the dual-3060 rig literally cannot fit.
Does Gorgon Halo support CUDA-only inference runtimes via translation?
No native CUDA support; AMD's path is ROCm/HIP. The ZLUDA project, briefly revived in 2024, was sunset by AMD. Practically: llama.cpp, vLLM (ROCm fork), MLC-LLM, and ollama all run native. PyTorch via ROCm works for most non-bleeding-edge models. Anything pinned to bitsandbytes 4-bit on CUDA kernels, FlashAttention-2 custom kernels, or recent CUDA-12.8-only ops will need wrappers or won't run.
When does it ship and what's the price tier?
Per the Tom's Hardware launch coverage, Gorgon Halo SKUs target Q1-Q2 2026 in OEM mini-PCs and Framework Desktop refreshes. Pricing has not been disclosed by AMD, but the Strix Halo lineage shipped in $1,600-$2,400 mini-PCs depending on RAM tier. A 192GB Gorgon Halo box is expected to land in the $3,000-$4,500 range, undercutting a 192GB Mac Studio M3 Ultra ($5,599 base) by roughly $1,500-$2,500.

Sources

— SpecPicks Editorial · Last verified 2026-05-25

Ryzen 7 5800X
Ryzen 7 5800X
$210.00
View on Amazon →