Yes. AMD's Ryzen AI Max 400 "Gorgon Halo" APU can run 70B-class LLMs locally: its 192 GB unified-memory ceiling lets you load even a 120B-Q4 model entirely in GPU-addressable memory, with no splitting across devices, and the LPDDR5X-9600 bandwidth (~307 GB/s) is projected to push 8-11 tok/s on Llama 3.1 70B-Q4. It is slower than a dual-RTX-3060 build for raw throughput but wins on idle power, form factor, and the ability to host model sizes that a discrete-GPU rig of that class cannot fit.
Editorial intro: refreshing the APU lineup, the 192GB Mac-Studio-killer claim, and where Strix Halo left off
Per Tom's Hardware's launch coverage, AMD's Ryzen AI Max 400 "Gorgon Halo" refresh extends the Strix Halo lineup with a memory ceiling that genuinely repositions the APU against Apple's Mac Studio for local LLM hosting. Strix Halo's 128 GB ceiling was already enough to fit a 70B-Q4 model with room for context. The new 192 GB Gorgon Halo tier breaks past that into 120B-Q4 territory and even 405B-Q2 with tight context budgets — sizes that have, until now, required either a Mac Studio M3 Ultra ($5,599+) or a multi-GPU server build (RTX A6000 dual at ~$10,000 used).
The cynical read on a 192 GB unified-memory part is that bandwidth, not capacity, is the binding constraint for LLM throughput. That read is half right. A Mac Studio M3 Ultra runs 819 GB/s thanks to its 1024-bit memory bus; Strix Halo shipped at 256 GB/s on a 256-bit bus. Gorgon Halo's bump to LPDDR5X-9600 raises that to roughly 307 GB/s on the same bus width — closing the gap modestly but not erasing Apple's 2.7x bandwidth lead. For long-context prefill on dense models, Apple still wins on raw tok/s.
But the second half of the calculation has shifted. Local LLM users in 2026 are mostly running Mixture-of-Experts (MoE) models (DeepSeek V4, Qwen 3.6 MoE, Llama 4 MoE) where only a small fraction of weights is active per token. MoE makes the "you need 800 GB/s of bandwidth" argument less sharp because per-token memory traffic is dominated by the small active expert slice rather than the full weight set. The bandwidth ratio between the two machines does not change, but a 307 GB/s APU lands at comfortably interactive decode speeds on MoE in a way it does not on dense 70B. That's the architectural shift that makes a 192 GB APU genuinely interesting at $3,000-$4,500 versus a $5,599 Mac Studio.
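A rough way to see the MoE effect is the bandwidth-bound decode ceiling, where each generated token streams the active weights from memory once. The sketch below is illustrative only: the ~0.6 bytes per parameter figure is an assumption derived from a 42 GB Q4_K_M file for 70B parameters, the 17B active-parameter count is an assumed value for a 109B-class MoE, and `decode_ceiling_tok_s` is a hypothetical helper rather than part of any library. It ignores KV-cache reads, compute limits, and real-world efficiency losses.

```python
# Bandwidth-bound decode ceiling: tok/s <= bandwidth / bytes of active weights.

BYTES_PER_PARAM_Q4 = 0.6  # assumption: ~42 GB / 70B params for Q4_K_M


def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float,
                         bytes_per_param: float = BYTES_PER_PARAM_Q4) -> float:
    """Upper bound on decode tok/s from memory bandwidth alone."""
    active_gb = active_params_b * bytes_per_param
    return bandwidth_gb_s / active_gb


# Peak bandwidth figures quoted in the article (GB/s).
machines = {"Gorgon Halo": 307.0, "M3 Ultra": 819.0}
# Active parameters per token, in billions (the MoE value is an assumption).
workloads = {"dense 70B": 70.0, "MoE, 17B active": 17.0}

for machine, bandwidth in machines.items():
    for label, active in workloads.items():
        ceiling = decode_ceiling_tok_s(bandwidth, active)
        print(f"{machine:12s} {label:16s} <= {ceiling:5.1f} tok/s")
```

Under these assumptions the dense 70B ceiling is about 7.3 tok/s at 307 GB/s and about 19.5 tok/s at 819 GB/s, while the MoE case rises to roughly 30 and 80 tok/s. Dense-model estimates above the ceiling would have to come from something other than raw bandwidth, such as speculative decoding.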
For builders who would rather assemble a discrete-GPU rig (a Ryzen 7 5800X host driving two ZOTAC RTX 3060 12GB cards under tensor parallelism), Gorgon Halo's value proposition has to be weighed against raw throughput and CUDA-stack maturity. We'll walk through both paths.
Key takeaways
- 192 GB unified memory exposes up to ~184 GB as GPU memory at the 96% BIOS carveout (~168 GB at the 87.5% setting), enough for 120B-Q4 or 405B-Q2.
- Memory bandwidth: ~307 GB/s (LPDDR5X-9600, 256-bit) — up from 256 GB/s on Strix Halo, still 2.7x behind Mac Studio M3 Ultra (819 GB/s).
- NPU: 80 TOPS on the AI engine block — useful for batched prefill and draft-token verification on speculative decoding.
- Target workload: 70B-120B local hosting, long-context inference, low-idle-power 24/7 agents.
- Pricing tier: expected $3,000-$4,500 in OEM mini-PCs (Framework Desktop refresh, Beelink, GMKTec), about $1,100-$2,600 below the $5,599 Mac Studio M3 Ultra.
- Ship window: Q1-Q2 2026, with Framework Desktop confirmed as a launch partner.
What is the Ryzen AI Max 400 "Gorgon Halo" and how does it differ from Strix Halo?
The Ryzen AI Max 400 is the second-generation Strix Halo silicon, on a refined 4 nm node with the same 16-core Zen 5 CPU complex and the same Radeon 8060S (RDNA 3.5) integrated GPU as the original Strix Halo, but with three changes that matter:
- LPDDR5X-9600 support (was LPDDR5X-8000), bumping peak memory bandwidth from 256 GB/s to ~307 GB/s.
- 192 GB max RAM tier (was 128 GB), enabled by 48 Gbit LPDDR5X chips reaching production volume.
- NPU compute tuning that AMD claims hits 80 TOPS sustained on the embedded XDNA 2 block, up from the original Strix Halo's 50 TOPS peak.
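The bandwidth figures in the first bullet follow from the standard peak-bandwidth formula: transfer rate times bus width in bytes. A minimal sketch, assuming the 256 GB/s Strix Halo figure corresponds to 8000 MT/s on a 256-bit bus; `peak_bandwidth_gb_s` is a hypothetical helper name, and the result is a theoretical peak, not a measured number.

```python
def peak_bandwidth_gb_s(transfer_rate_mt_s: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    # MT/s * bytes per transfer gives MB/s; divide by 1000 for GB/s.
    return transfer_rate_mt_s * bytes_per_transfer / 1000


configs = {
    "Gorgon Halo (9600 MT/s, 256-bit)": (9600, 256),
    "Strix Halo (8000 MT/s, 256-bit)": (8000, 256),  # assumed transfer rate
}

for name, (rate, width) in configs.items():
    print(f"{name}: {peak_bandwidth_gb_s(rate, width):.1f} GB/s")
```

The same function with 6400 MT/s on a 1024-bit bus gives 819.2 GB/s, matching the M3 Ultra figure quoted earlier.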
Per AMD's official Ryzen AI Max product page, the chip is sold as a soldered-on-package solution to OEMs — no socketed desktop version exists. You buy it in a mini-PC (Framework Desktop, Beelink GTR, GMKTec K11) or a workstation-class laptop. The 192 GB SKU is BGA-soldered at the factory; you cannot upgrade memory after purchase.
That last constraint matters for buyers: pick the RAM tier carefully. A 96 GB Gorgon Halo at ~$2,800 is fine for 70B-Q4 with 32K context; the 192 GB tier at $3,800+ unlocks 120B-Q4 with comparable context and 70B at FP8 (which performs measurably better on multi-turn reasoning benchmarks).
How much VRAM-equivalent does 192 GB unified memory actually expose to llama.cpp?
The GPU memory carveout is configured in BIOS, and ROCm sees whatever is reserved there as device memory. On Strix Halo systems, AMD's reference firmware allowed 50% (default) up to 87.5% (manually set) of installed DRAM to be reserved for the GPU. Gorgon Halo's reference firmware extends that to a 96% max carveout in "AI workstation" mode, explicitly intended for local-LLM hosting.
On a 192 GB Gorgon Halo, that puts the upper bound at roughly 184 GB addressable as GPU memory when running in AI workstation mode. The OS gets the remaining 8 GB plus whatever swap you've configured. Plan to dedicate the box to inference if you push the carveout that aggressively — 8 GB is enough for Linux + a small inference server but not for a desktop with Chrome and a few IDEs.
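The carveout arithmetic is simple enough to tabulate. The sketch below only multiplies the percentages quoted above by the installed capacity; `carveout_gb` is a hypothetical helper, and actual firmware may round to different allocation sizes.

```python
def carveout_gb(installed_gb: float, fraction: float) -> float:
    """GPU-reserved memory for a given carveout fraction of installed DRAM."""
    return installed_gb * fraction


for installed in (128, 192):
    for fraction in (0.50, 0.875, 0.96):
        gpu = carveout_gb(installed, fraction)
        left = installed - gpu
        print(f"{installed} GB installed, {fraction:.1%} carveout: "
              f"{gpu:.1f} GB for the GPU, {left:.1f} GB left for the OS")
```

On 192 GB this gives 168 GB at 87.5% and about 184 GB at 96%; on a 128 GB Strix Halo the 87.5% setting yields 112 GB.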
llama.cpp's HIP backend treats the carveout as a single contiguous device, so a 120B-Q4 model (~75 GB on disk, ~80 GB loaded with KV cache for 32K context) fits cleanly without tensor splitting. Per llama.cpp's GitHub discussions, users running Strix Halo systems have reported successful large-model loads with the 87.5% carveout, which works out to 112 GB on a 128 GB system. That is short of the ~140 GB a 70B-FP16 model needs; the 96% carveout on a 192 GB Gorgon Halo (~184 GB) is what brings 70B-FP16 within reach.
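Context length adds KV cache on top of the weights, which is where the gap between on-disk and loaded size comes from. A minimal sketch of the usual estimate, assuming an FP16 cache and Llama 3.1 70B's published shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128); `kv_cache_gib` is a hypothetical helper, and runtime compute buffers are not included.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB; the leading 2 covers keys plus values."""
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 2**30


# Assumed Llama 3.1 70B shape: 80 layers, 8 KV heads, head_dim 128.
for context in (8 * 1024, 32 * 1024, 128 * 1024):
    size = kv_cache_gib(80, 8, 128, context)
    print(f"{context:>7d} tokens: {size:.1f} GiB of KV cache")
```

With these inputs a 32K context costs about 10 GiB, so context budget matters far more for models sitting near the carveout limit than for a 42 GB Q4 file.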
What memory bandwidth does Gorgon Halo deliver vs Mac Studio M3 Ultra?
| System | Memory Tech | Bus Width | Peak BW | Effective BW (LLM workload) |
|---|---|---|---|---|
| Strix Halo (Ryzen AI Max+ 395) | LPDDR5X-8000 | 256-bit | 256 GB/s | ~210 GB/s |
| Gorgon Halo (Ryzen AI Max 400) | LPDDR5X-9600 | 256-bit | 307 GB/s | ~250 GB/s |
| Mac Studio M3 Ultra | LPDDR5-6400 | 1024-bit | 819 GB/s | ~720 GB/s |
| RTX 3060 12GB (discrete) | GDDR6 | 192-bit | 360 GB/s | ~310 GB/s |
| RTX 4090 24GB (discrete) | GDDR6X | 384-bit | 1,008 GB/s | ~880 GB/s |
| Dual RTX 3060 (tensor parallel) | — | — | 720 GB/s aggregate | ~540 GB/s |
Mac Studio M3 Ultra still wins on raw bandwidth by 2.7x against Gorgon Halo. That gap is most visible in long-context prefill, where the model must process tens of thousands of input tokens before generating the first output token. A 32K-token prefill on Llama 70B takes roughly 18 seconds on Mac Studio Ultra vs ~52 seconds on Strix Halo at the same quant. Gorgon Halo should narrow that to ~38 seconds: the 20% bandwidth uplift alone would give ~43 seconds, and the remainder assumes NPU prefill acceleration.
For short-prompt interactive chat (under 2K input tokens), the bandwidth gap matters much less. Decode tok/s on a 70B-Q4 model lands around 6-8 tok/s on Strix Halo, 8-11 tok/s expected on Gorgon Halo, and 15-18 tok/s on Mac Studio Ultra. You feel the Apple advantage in long-doc summarization; you don't feel it in chat.
Which LLM sizes actually fit — 70B BF16, 120B Q4, 405B Q2?
| Model | Quant | Disk Size | RAM Required (8K ctx) | Fits on 192 GB? |
|---|---|---|---|---|
| Llama 3.1 70B | Q4_K_M | 42 GB | 48 GB | ✅ |
| Llama 3.1 70B | FP16 (BF16) | 140 GB | 152 GB | ✅ (tight) |
| Llama 4 109B MoE | Q4_K_M | 60 GB | 68 GB | ✅ |
| Qwen 3 110B | Q4_K_M | 65 GB | 74 GB | ✅ |
| Llama 4 400B Maverick MoE | Q4_K_M | 230 GB | 245 GB | ❌ (~184 GB max carveout) |
| Llama 4 400B Maverick MoE | Q2_K | 130 GB | 142 GB | ✅ (Q2 is lossy) |
| DeepSeek V3 671B | Q4_K_M | 380 GB | 395 GB | ❌ |
| DeepSeek V3 671B | Q2_K | 220 GB | 240 GB | ❌ (~184 GB max carveout) |
| Mistral Large 2 123B | Q4_K_M | 70 GB | 78 GB | ✅ |
The headline 405B-Q2 case (represented in the table by the 400B Llama 4 Maverick at Q2_K) fits only with a tight context budget (4K-8K). 70B-FP16 fits, but with limited headroom. The interesting band is 100B-130B-class models (Qwen 3 110B, Llama 4 109B, Mistral Large 2 123B), which sit comfortably in 192 GB at Q4 with 32K context. That's the "you cannot do this on any discrete-GPU rig under $8,000" zone where Gorgon Halo's value is strongest.
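The fit column can be reproduced by comparing each row's RAM requirement against the usable carveout. This is a sketch using the table's own 8K-context figures and an assumed 184 GB usable ceiling (the 96% carveout on 192 GB); `headroom_gb` is a hypothetical helper, and the result changes if you run a smaller carveout or a longer context.

```python
USABLE_GB = 184  # assumption: 96% carveout on a 192 GB system

# (model and quant, RAM required at 8K context in GB), copied from the table.
REQUIREMENTS = [
    ("Llama 3.1 70B Q4_K_M", 48),
    ("Llama 3.1 70B FP16", 152),
    ("Mistral Large 2 123B Q4_K_M", 78),
    ("Llama 4 Maverick Q2_K", 142),
    ("Llama 4 Maverick Q4_K_M", 245),
    ("DeepSeek V3 671B Q2_K", 240),
]


def headroom_gb(required_gb: float, usable_gb: float = USABLE_GB) -> float:
    """Positive means the model fits with that much memory to spare."""
    return usable_gb - required_gb


for name, required in REQUIREMENTS:
    spare = headroom_gb(required)
    status = "fits" if spare >= 0 else "does not fit"
    print(f"{name:30s} needs {required:3d} GB: {status} ({spare:+.0f} GB)")
```

Passing a different `usable_gb` models other machines; for example, `headroom_gb(48, usable_gb=24)` is negative, which is the dual-12 GB-card case.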
How does prefill speed compare to a discrete RTX 3060 12GB at the same model size?
A single RTX 3060 12GB does not fit a 70B model at any quantization. To run 70B-Q4 on RTX 3060 hardware, you need at least two cards in tensor parallelism, and even then the combined 24 GB of VRAM is well below the ~42 GB Q4_K_M footprint listed above, so part of the model has to sit in system RAM or you have to drop to a far more aggressive quant with a short (4K) context. Prefill on the dual-3060 build is dominated by PCIe inter-card latency during the tensor-parallel attention step.
Per llama.cpp's benchmark threads, a dual-3060 tensor-parallel rig on Llama 70B-Q4 hits roughly 320 prefill tok/s on an 8K input. Strix Halo measured ~180 prefill tok/s on the same workload (no inter-card overhead, but lower aggregate bandwidth). Gorgon Halo should land in the 215-240 prefill tok/s range — still slower than dual-3060 on prefill, but a meaningful step up.
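Prefill rates are easier to compare as time-to-first-token. The sketch below converts the rates quoted above into seconds for one prompt, assuming "8K input" means 8,192 tokens and that the rate stays constant across the prompt; `prefill_seconds` is a hypothetical helper.

```python
PROMPT_TOKENS = 8 * 1024  # assumption: "8K input" means 8,192 tokens

# Prefill rates in tok/s as quoted in the text; Gorgon Halo values are estimates.
PREFILL_RATES = {
    "dual RTX 3060": 320,
    "Strix Halo": 180,
    "Gorgon Halo (low estimate)": 215,
    "Gorgon Halo (high estimate)": 240,
}


def prefill_seconds(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Seconds spent processing the prompt before the first output token."""
    return prompt_tokens / prefill_tok_s


for name, rate in PREFILL_RATES.items():
    wait = prefill_seconds(PROMPT_TOKENS, rate)
    print(f"{name:28s} {wait:5.1f} s before the first token")
```

On these numbers the dual-3060 rig waits about 26 seconds, Strix Halo about 46 seconds, and Gorgon Halo somewhere between 34 and 38 seconds.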
For decode tok/s, the unified-memory APU narrows the gap considerably. Dual-3060 hits ~12-15 tok/s on 70B-Q4; Strix Halo measured ~6-8 tok/s; Gorgon Halo's bandwidth bump puts it at an expected 8-11 tok/s. Still a discrete-GPU win, but the gap is small enough that the APU's other advantages — power, form factor, single-box deployment — start to matter more than raw throughput.
What's the perf-per-dollar vs building an RTX 3060 dual-GPU rig?
Let's spec a comparable discrete-GPU build:
- AMD Ryzen 7 5800X: $200
- B550 motherboard (PCIe 4.0 x16/x8 split): $180
- 64 GB DDR4-3600 (2x32): $140
- 2x ZOTAC RTX 3060 12GB: $560 (current new pricing)
- 850W Gold PSU: $130
- 1 TB NVMe + case + cooler: $200
- Total: ~$1,410
A 96 GB Gorgon Halo mini-PC lands at ~$2,800, and the 192 GB tier at ~$3,800+. The discrete-GPU build is roughly half the price of the 96 GB unit, and on dense 70B-Q4 it pushes more tok/s. The catch:
- The discrete rig has only 24 GB of combined VRAM, so 70B-Q4 is tight at best. Long contexts spill to system RAM and tank speed.
- It cannot run 70B-FP16 or 100B+ models at all: both the 70B-FP16 case and the 120B-Q4 case exceed combined VRAM.
- Idle power is roughly 250 W (both cards spun up). Gorgon Halo idles at ~40 W. For a 24/7 always-on local-agent rig, that 210 W difference is about 1,840 kWh per year, or roughly a $200/year electricity gap at ~$0.11/kWh.
- Discrete is ~75 cm × 45 cm in a mid-tower. Gorgon Halo is 1L in a mini-PC. If your office desk matters, that's not nothing.
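The electricity gap in the list above depends heavily on your tariff. A short sketch of the arithmetic, using the 250 W and 40 W idle figures from the text; the per-kWh prices are illustrative assumptions, and `annual_cost_usd` is a hypothetical helper.

```python
HOURS_PER_YEAR = 24 * 365


def annual_energy_kwh(watts: float) -> float:
    """Energy used in a year by a constant load, in kWh."""
    return watts * HOURS_PER_YEAR / 1000


def annual_cost_usd(watts: float, usd_per_kwh: float) -> float:
    """Yearly cost of a constant load at a flat tariff."""
    return annual_energy_kwh(watts) * usd_per_kwh


IDLE_GAP_W = 250 - 40  # dual-3060 rig minus Gorgon Halo, per the text

for tariff in (0.11, 0.15, 0.30):  # assumed example tariffs in USD/kWh
    cost = annual_cost_usd(IDLE_GAP_W, tariff)
    print(f"${tariff:.2f}/kWh: about ${cost:.0f} per year")
```

At $0.11/kWh the gap is about $202 per year; at $0.30/kWh it is closer to $550, which shortens the payback period on the more expensive box.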
Perf-per-dollar on the specific workload of 70B-Q4 chat: discrete wins. Perf-per-dollar on the full envelope of "100B+ models, long context, 24/7, quiet desktop": Gorgon Halo wins.
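For the narrow 70B-Q4 chat case, the comparison can be put into numbers. This sketch uses the decode ranges and prices quoted in this article (all of them estimates) and a hypothetical `tok_s_per_1000_usd` helper; it says nothing about the workloads the discrete rig cannot run at all.

```python
# Decode tok/s ranges on 70B-Q4 and build prices, as quoted in the article.
BUILDS = {
    "dual RTX 3060 rig": {"price_usd": 1410, "tok_s": (12, 15)},
    "Gorgon Halo 96 GB": {"price_usd": 2800, "tok_s": (8, 11)},
}


def tok_s_per_1000_usd(tok_s: float, price_usd: float) -> float:
    """Decode throughput bought per $1,000 of hardware."""
    return tok_s / price_usd * 1000


for name, build in BUILDS.items():
    low, high = (tok_s_per_1000_usd(t, build["price_usd"]) for t in build["tok_s"])
    print(f"{name:18s} {low:.1f}-{high:.1f} tok/s per $1,000")
```

On these inputs the discrete rig delivers roughly 8.5-10.6 tok/s per $1,000 against about 2.9-3.9 for the 96 GB Gorgon Halo.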
Spec table: Gorgon Halo vs Strix Halo vs M3 Ultra
| Spec | Ryzen AI Max+ 395 (Strix Halo) | Ryzen AI Max 400 (Gorgon Halo) | Apple M3 Ultra |
|---|---|---|---|
| Cores | 16 Zen 5 | 16 Zen 5 | 32 (24 perf + 8 eff) |
| NPU TOPS (peak) | 50 | 80 | 38 |
| iGPU/GPU | Radeon 8060S (RDNA 3.5, 40 CU) | Radeon 8060S+ (RDNA 3.5, 40 CU) | 80-core Apple GPU |
| Memory tech | LPDDR5X-8000 | LPDDR5X-9600 | LPDDR5-6400 |
| Bus width | 256-bit | 256-bit | 1024-bit |
| Peak BW | 256 GB/s | 307 GB/s | 819 GB/s |
| Max RAM | 128 GB | 192 GB | 512 GB |
| TDP | 55-120 W (configurable) | 55-120 W | 80-200 W |
| Starting OEM MSRP | $1,900 | $2,800 (est) | $5,599 |
Quantization matrix for Llama 70B / Qwen 110B on Gorgon Halo
| Quant | Llama 70B Size | Llama 70B Est tok/s | Qwen 110B Size | Qwen 110B Est tok/s |
|---|---|---|---|---|
| Q2_K | 26 GB | 14 | 41 GB | 11 |
| Q3_K_M | 33 GB | 13 | 51 GB | 10 |
| Q4_K_M | 42 GB | 11 | 65 GB | 9 |
| Q5_K_M | 51 GB | 9 | 79 GB | 7 |
| Q6_K | 60 GB | 8 | 92 GB | 6 |
| Q8_0 | 75 GB | 7 | 117 GB | 5 |
| FP16 | 140 GB | 4 | 220 GB (DOESN'T FIT) | — |
The sweet spot is Q4_K_M for Llama 70B and Q3_K_M for Qwen 110B — both fit comfortably with 32K context and push tok/s in the actually-usable range for interactive chat.
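The size column in the matrix maps onto effective bits per weight, which is handy for estimating models that are not listed. A minimal sketch using the Llama 70B sizes from the table; `bits_per_weight` and `estimated_size_gb` are hypothetical helpers, and real GGUF files vary slightly because different tensors use different quant types.

```python
def bits_per_weight(file_size_gb: float, params_billion: float) -> float:
    """Effective bits per parameter implied by a file size."""
    return file_size_gb * 8 / params_billion


def estimated_size_gb(params_billion: float, bpw: float) -> float:
    """File size in GB for a parameter count at a given bits per weight."""
    return params_billion * bpw / 8


# Llama 70B file sizes in GB, copied from the quantization matrix.
LLAMA_70B_SIZES_GB = {
    "Q2_K": 26, "Q3_K_M": 33, "Q4_K_M": 42, "Q5_K_M": 51,
    "Q6_K": 60, "Q8_0": 75, "FP16": 140,
}

for quant, size in LLAMA_70B_SIZES_GB.items():
    print(f"{quant:7s} {size:3d} GB -> {bits_per_weight(size, 70):.2f} bits/weight")
```

Q4_K_M works out to about 4.8 bits per weight here, which is why a "4-bit" 70B file is 42 GB rather than 35 GB.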
Common pitfalls and gotchas
- CUDA-only inference runtimes: there's no native CUDA on Gorgon Halo. AMD's path is ROCm/HIP. Workloads pinned to bitsandbytes 4-bit kernels, FlashAttention-2 CUDA-only kernels, or CUDA-12.8-specific ops will need ROCm-equivalent wrappers or won't run at all. Stick to llama.cpp, the ROCm fork of vLLM, MLC-LLM, or Ollama for clean operation.
- Memory tier locked at factory: the 192 GB SKU is BGA-soldered. You cannot upgrade a 96 GB unit to 192 GB later. Buy the tier you'll need in 18 months.
- OEM firmware quality: Strix Halo had a turbulent first six months of OEM firmware bugs (carveout sizing inconsistencies, idle-clock floor too high). Expect Gorgon Halo to clear those within three months of launch, but early adopters should be ready for BIOS-update cycles.
- Cooling under sustained inference: the 55 W end of Strix Halo's configurable TDP range covers idle plus light inference. Sustained 70B decode pulls the package up to 90 W and pushes mini-PC fans into audible-range RPMs. The Framework Desktop has the best thermal headroom; cheap Beelink/GMKTec units will throttle within 20 minutes of sustained decode.
- Confusing "Ryzen AI Max" branding: AMD uses "Ryzen AI Max 300" (e.g., "AI Max+ 395") for the original Strix Halo parts. The 400 series is the Gorgon Halo refresh of that silicon. Check the exact part number (e.g., "AI Max 400" vs "AI Max+ 395") before buying.
When NOT to choose Gorgon Halo
- You only run 7B-13B models: a sub-$300 RTX 3060 12GB on any modern CPU host is dramatically cheaper and faster.
- You need CUDA-specific stacks: bitsandbytes 4-bit + FlashAttention-2 + the latest llama-cpp-python features land on CUDA first.
- You need decode faster than ~24 tok/s: an RTX 4090 build pushes 30+ tok/s on 70B-Q4. Bandwidth-hungry users buy a $1,800 discrete GPU.
- You play games on the same box: Gorgon Halo's iGPU is fine (Radeon 8060S, roughly RX 7600 class) but a dedicated RTX 4070 outclasses it. If gaming is 50% of your usage, build a discrete rig.
Verdict matrix
Get Gorgon Halo (192 GB) if... you want to host 100B-130B MoE models locally, value sub-50W idle power, want a 1L form factor, and can write off CUDA-only research code.
Get the dual RTX 3060 12GB build if... your model ceiling is 70B-Q4 with short context, you want max tok/s for the dollar, and you're comfortable maintaining a tower PC.
Get the MSI RTX 3060 Ventus 2X single-card build with Ryzen 7 5700X if... you live in the 7B-13B model space, want a cheap and quiet rig, and won't grow into 70B in the next 18 months.
Get the Mac Studio M3 Ultra if... raw decode tok/s on dense 70B-FP16 is your only success metric and you can write the $5,599 check without flinching.
Bottom line
The Ryzen AI Max 400's 192 GB ceiling is the headline number, but the real upgrade is in MoE-era usability. Gorgon Halo makes 100B+ MoE models a normal-desk experience for under $4,000 — territory that previously required either a Mac Studio Ultra or a multi-GPU server build. Bandwidth still matters and Apple still wins it; AMD has chosen capacity-per-dollar as the lane to compete in, and at 192 GB / $3,800 it is genuinely the cheapest path to running models that simply do not fit anywhere else under $5,500.
Citations and sources
- Tom's Hardware — AMD Ryzen AI Max 400 'Gorgon Halo' Launch Coverage
- AMD Ryzen AI Max Product Page
- llama.cpp GitHub Discussions
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
