AMD Ryzen AI Max 400 'Gorgon Halo': What 192GB of Unified Memory Unlocks for Local AI

Author: Mike Perry

Capacity climbs to 192GB but bandwidth still caps generation in the low single digits on 70B+ models. Here is what actually fits and runs.

By Mike Perry · Published 2026-05-29 · Last verified 2026-06-02 · 10 min read

In short: 192 GB of unified memory on the upcoming Ryzen AI Max 400 "Gorgon Halo" lets you hold a Llama 3.1 405B at q3, a 235B Qwen at q4_K_M with comfortable KV headroom, or two 70B models loaded side by side for routing. What it does not give you is high tokens-per-second on those models: Strix Halo–class bandwidth caps generation in the low single digits on anything over 70B. It is a capacity tool, not a throughput tool.

Who this is for

You are an advanced local-AI builder who already runs 70B-class models on a workstation and has hit the ceiling. You want to test 120B+ models at home for the same reason you put a 64-core Threadripper in last year's build: not because you need it for everyday work, but because you do not want to send the heavy prompts to a cloud API. The Tom's Hardware headline "AMD Ryzen AI Max 400 Gorgon Halo packs up to 192GB of unified memory" drew a surge of search interest this week, and the immediate question is whether 192 GB is finally enough capacity to host the new class of frontier-adjacent open weights that started shipping in 2026.

This guide is not the buyer's pitch — it is the reality check. We will line up what 192 GB really lets you load, what happens to generation speed as you climb the model-size ladder on an APU-class bandwidth budget, why prefill behaves differently from generation on this platform, and where a discrete-GPU rig is still the right call.

Key takeaways

Of the 192 GB, roughly 170 GB is usable in practice after the OS, runtime overhead, and KV-cache headroom.
Model ceiling at q4_K_M: 235B fits with 8K context, 405B needs q3 or aggressive expert-pruning.
Bandwidth is the cap. Expect 1.5–4 tok/s on 70B–120B q4, dropping below 1 tok/s on 235B+ class.
Price tier: Gorgon Halo platforms are tracking $4,500–$5,500 at launch, well above Strix Halo street prices.
The honest alternative for sub-70B work is still a single discrete card. The RTX 3060 12GB plus a Ryzen 7 5700X host runs 7B–13B faster than any APU on the market.
Buy Gorgon Halo if you specifically need to host 120B+ at home in one quiet box. Otherwise stay on discrete GPUs.

How big a model fits in 192 GB?

Capacity is the headline. Here is the model-size table that matters, with weights, weights plus a 4K KV cache, and weights plus a 32K KV cache. The 32K column is the one most readers care about — that is where modern agents live.

Model class	q4_K_M weights	q4 + 4K KV	q4 + 32K KV	q6_K weights	q8_0 weights
70B	~42 GB	~48 GB	~82 GB	~57 GB	~74 GB
120B	~72 GB	~82 GB	~115 GB	~98 GB	~126 GB
235B	~141 GB	~151 GB	~178 GB	~190 GB	~248 GB
405B	~243 GB	n/a (too big)	n/a	n/a	n/a

What this means in practice: at q4_K_M you can run 235B comfortably with a moderate 8K window. You can run 120B with the full 32K window. You can run 70B at q6 or even q8 with serious headroom for parallel KV slots. You cannot run 405B at q4; that needs q3 or expert-pruned mixture-of-experts variants. Two 70B models loaded in parallel for ensemble routing also fit, which is interesting for production patterns.
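The fit checks above can be sketched in a few lines of Python. This is a rough sketch, not a sizing tool: the GB-per-billion-parameter ratios are back-calculated from the table (for example, 42 GB / 70B ≈ 0.6 for q4_K_M), the ~170 GB budget comes from the key takeaways, and the function names are illustrative.

```python
# Approximate GB of weights per billion parameters, back-calculated from the
# capacity table (assumption: the ratio holds across model sizes).
GB_PER_BILLION = {"q4_K_M": 0.60, "q6_K": 0.81, "q8_0": 1.055}

USABLE_GB = 170  # usable share of the 192 GB, per the key takeaways


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimated weight footprint in GB for a given quantization."""
    return params_billion * GB_PER_BILLION[quant]


def fits(params_billion: float, quant: str, kv_gb: float = 0.0,
         usable_gb: float = USABLE_GB) -> bool:
    """True if weights plus KV cache fit inside the usable memory budget."""
    return weights_gb(params_billion, quant) + kv_gb <= usable_gb


if __name__ == "__main__":
    for size in (70, 120, 235, 405):
        gb = weights_gb(size, "q4_K_M")
        print(f"{size}B q4_K_M: ~{gb:.0f} GB, fits={fits(size, 'q4_K_M')}")
```

Pass the KV-cache budget for your target context as `kv_gb` to see how much room a longer window leaves.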

Why memory capacity is not memory bandwidth

The Gorgon Halo platform is reported to use LPDDR5X-8000 on a 256-bit bus, which works out to a theoretical peak of 256 GB/s (8,000 MT/s × 32 bytes per transfer). Early lab leaks put the 400 series closer to ~280 GB/s, which would imply memory running faster than 8,000 MT/s. Either figure is meaningfully under a single RTX 4090 (1,008 GB/s) and a long way under an H100 SXM (3,350 GB/s).
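The peak-bandwidth figures above follow from bus width and transfer rate. Here is a minimal sketch, assuming the usual relation (transfers per second × bytes per transfer). The function names are illustrative, and the discrete-card inputs are the published per-pin data rates and bus widths for those cards.

```python
def peak_bandwidth_gbs(transfer_rate_mts: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s from MT/s and bus width in bits."""
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9


def required_rate_mts(target_gbs: float, bus_width_bits: int) -> float:
    """Transfer rate (MT/s) a bus of this width needs to reach target_gbs."""
    return target_gbs * 1e9 / (bus_width_bits / 8) / 1e6


if __name__ == "__main__":
    print(peak_bandwidth_gbs(8000, 256))   # LPDDR5X-8000, 256-bit
    print(peak_bandwidth_gbs(15000, 192))  # RTX 3060 12GB: 15 Gbps GDDR6, 192-bit
    print(peak_bandwidth_gbs(21000, 384))  # RTX 4090: 21 Gbps GDDR6X, 384-bit
    print(required_rate_mts(280, 256))     # rate needed for the leaked ~280 GB/s
```

On a 256-bit bus the leaked ~280 GB/s figure corresponds to 8,750 MT/s, so treat it as a rumor until the memory speed is confirmed.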

Generation tokens-per-second is bandwidth-bound. Per token, the runtime reads the model weights once (all of them for a dense model, only the active experts for a mixture-of-experts model) and writes a small KV-cache update. For a dense model the arithmetic is roughly: tok/s ≈ bandwidth / model_size. A 120B q4_K_M model is roughly 72 GB; at 280 GB/s the upper bound is ~3.9 tok/s, before any overhead. Real systems land at 2.0–2.8 tok/s after KV traffic and runtime cost.
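The bandwidth ceiling can be sketched as a one-line estimate. The `efficiency` parameter is an assumption introduced here to stand in for KV traffic and runtime cost. The 0.5–0.7 range is simply what the 2.0–2.8 tok/s figure above implies against the ~3.9 tok/s ceiling, not a measured value.

```python
def max_tokens_per_second(bandwidth_gbs: float, model_gb: float,
                          efficiency: float = 1.0) -> float:
    """Generation-speed estimate when every token reads all weights once.

    efficiency=1.0 gives the theoretical ceiling; lower values approximate
    real-world losses (hypothetical knob, not a measured constant).
    """
    return bandwidth_gbs / model_gb * efficiency


if __name__ == "__main__":
    ceiling = max_tokens_per_second(280, 72)
    print(f"ceiling: {ceiling:.1f} tok/s")
    for eff in (0.5, 0.7):
        est = max_tokens_per_second(280, 72, eff)
        print(f"efficiency {eff}: {est:.1f} tok/s")
```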

This is the part of the story the launch marketing tends to skip. Yes, the model fits. The model also reads slowly. Plan accordingly.

Quantization matrix

The right row for you is the one whose tok/s is fast enough for your use case at the quality you can tolerate. The figures are synthesized from published llama.cpp and ROCm performance threads, AMD's Ryzen AI series page, and the Phoronix Strix Halo review. Gorgon Halo is Strix Halo's successor, so the numbers are projected from that baseline with a ~10% lift for the 400 series.

Quant	70B tok/s	120B tok/s	235B tok/s	Quality
q2_K	6.0	3.5	1.8	Reasoning regression visible
q3_K_M	5.2	3.0	1.5	Acceptable for chat, weaker on code
q4_K_M	4.0	2.4	1.1	Good general sweet spot
q5_K_M	3.5	2.0	0.9	Near-fp16 quality
q6_K	3.0	1.7	n/a (too big)	Indistinguishable on most tasks
q8_0	2.4	1.4	n/a (too big)	Effectively fp16
fp16	1.3	n/a (too big)	n/a	Reference

Read the table this way: pick the model size that fits your task complexity, then pick the highest-precision quant (the lowest row) whose tok/s you can live with. For an agent that thinks for 10 seconds and then emits a 200-token response, 2 tok/s is tolerable, though the response itself still takes about 100 seconds to arrive. For an autocomplete copilot you need 30+ tok/s. The 400 platform is built for the first workload, not the second.
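That selection rule can be sketched as a small lookup over the projected figures in the matrix. The names `TOK_S` and `pick_quant` are illustrative. The 235B entries for q6_K and q8_0 are treated as not runnable because their weights (~190 GB and ~248 GB in the capacity table) exceed the ~170 GB usable budget.

```python
# Quants ordered from lowest to highest precision, as in the matrix.
QUANT_ORDER = ["q2_K", "q3_K_M", "q4_K_M", "q5_K_M", "q6_K", "q8_0", "fp16"]

# Projected tok/s per model class; None marks combinations that do not fit.
TOK_S = {
    "70B":  [6.0, 5.2, 4.0, 3.5, 3.0, 2.4, 1.3],
    "120B": [3.5, 3.0, 2.4, 2.0, 1.7, 1.4, None],
    "235B": [1.8, 1.5, 1.1, 0.9, None, None, None],
}


def pick_quant(model: str, min_tok_s: float):
    """Return the highest-precision quant that meets the speed floor, or None."""
    best = None
    for quant, rate in zip(QUANT_ORDER, TOK_S[model]):
        if rate is not None and rate >= min_tok_s:
            best = quant  # later entries are higher precision
    return best


if __name__ == "__main__":
    print(pick_quant("70B", 3.0))
    print(pick_quant("120B", 2.0))
    print(pick_quant("235B", 30))
```

A result of `None` means no quant on that model class reaches the speed floor, which is the autocomplete-copilot case.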

Prefill versus generation on a high-capacity, moderate-bandwidth APU

Prefill on Gorgon Halo benefits from the RDNA 3.5 iGPU's compute, and on multi-thousand-token prompts it is projected to scale to about 2,000 prefill tokens per second on a 70B model. At that rate a 4K prompt is processed in about two seconds, after which you wait for the generation loop to deliver answer tokens at single-digit speed.

For agent loops with long tool outputs and short responses, this prefill compute matters. For interactive chat with long generations, you spend nearly all your wall-clock time in the bandwidth-bound generation phase, so the iGPU compute headroom barely helps.
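The split between the two phases is easy to estimate. Here is a minimal sketch using the rates quoted in this article (about 2,000 tok/s prefill and 4.0 tok/s generation on a 70B q4_K_M). The function name is illustrative, and the model ignores scheduling and sampling overhead.

```python
def phase_seconds(prompt_tokens: int, output_tokens: int,
                  prefill_tok_s: float, gen_tok_s: float):
    """Return (prefill_seconds, generation_seconds) for one request."""
    return prompt_tokens / prefill_tok_s, output_tokens / gen_tok_s


if __name__ == "__main__":
    # 4K prompt, 200-token answer.
    prefill, gen = phase_seconds(4096, 200, prefill_tok_s=2000, gen_tok_s=4.0)
    share = gen / (prefill + gen)
    print(f"prefill {prefill:.1f}s, generation {gen:.1f}s, "
          f"generation share {share:.0%}")
```

With these inputs generation accounts for about 96% of wall-clock time, which is why extra prefill compute barely moves the total for chat-style workloads.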

Context-length impact on long-context KV cache

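KV cache cost scales with context_tokens × num_layers × num_kv_heads × head_dim × 2 × kv_bytes_per_value, where the factor of 2 covers keys and values. Without grouped-query attention, num_kv_heads × head_dim equals hidden_size; on grouped-query models such as Llama 3 70B it is several times smaller, which is what keeps the cache manageable. On a 70B at q4 you spend ~1 GB per 4K of context; on a 235B you spend ~3 GB per 4K. With 192 GB unified memory you can run 235B q4 plus 32K context (about 23 GB of KV), or 120B q4 plus 128K context (about 43 GB of KV). Those are real, usable configurations for retrieval-heavy agents.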

The tradeoff: more context means more KV-cache reads per token, which compounds the bandwidth ceiling. Stretching a 235B to 32K context drops generation from ~1.1 tok/s to ~0.8 tok/s. Capacity gives you the option; it does not give you free performance.
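The KV-cache arithmetic can be sketched directly. One caveat: for grouped-query-attention models the hidden_size term is replaced by num_kv_heads × head_dim, which is the form used below. The Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128) is taken from the published model config, and a 16-bit cache is assumed. The function name is illustrative.

```python
def kv_cache_bytes(context_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV-cache size in bytes; the leading 2 counts keys and values."""
    return (2 * context_tokens * num_layers * num_kv_heads
            * head_dim * bytes_per_value)


if __name__ == "__main__":
    # Llama 3 70B with grouped-query attention, 16-bit cache, 4K context.
    gqa = kv_cache_bytes(4096, num_layers=80, num_kv_heads=8, head_dim=128)
    # Same model shape if every one of the 64 attention heads kept its own KV.
    full = kv_cache_bytes(4096, num_layers=80, num_kv_heads=64, head_dim=128)
    print(f"GQA: {gqa / 1e9:.2f} GB per 4K, no GQA: {full / 1e9:.2f} GB per 4K")
```

The grouped-query case comes out at about 1.3 GB per 4K of context, in line with the ~1 GB figure above; the non-grouped case would be eight times larger.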

Spec delta

Spec	Ryzen AI Max 400 (Gorgon Halo)	RTX 3060 12GB rig (Ryzen 7 5700X)
Memory available to model	~170 GB unified	12 GB VRAM (+ host RAM for offload)
Memory bandwidth	~280 GB/s LPDDR5X (leaked figure; 256 GB/s at LPDDR5X-8000)	360 GB/s GDDR6
TDP under sustained load	~140 W	~250–280 W
Launch street price	$4,500–$5,500 (projected)	$900–$1,100 all-in
Practical model ceiling	235B q4 / 120B q6 / 70B q8	13B q4

Benchmark table

Model	Gorgon Halo 192GB	RTX 3060 12GB
Llama-3 70B q4_K_M	4.0 tok/s	n/a (out of VRAM)
Qwen 2.5 72B q4_K_M	3.8 tok/s	n/a (out of VRAM)
120B-class q4_K_M	2.4 tok/s	n/a
235B-class q4_K_M	1.1 tok/s	n/a
Llama-3 8B q4_K_M	16 tok/s	55 tok/s

The bottom row is the cautionary one: on small models the discrete card destroys the APU. You are paying for memory capacity, full stop.

Performance-per-dollar and per-watt

Take 70B q4_K_M as the reference workload. At 4.0 tok/s, the Gorgon Halo system delivers ~0.7–0.9 tok/s per $1,000 across the projected $4,500–$5,500 price range and ~0.03 tok/s per watt. The closest discrete-GPU comparison is a used dual RTX 3090 build at ~$1,500, which runs the same model at ~18 tok/s and ~600 W: that is ~12 tok/s per $1,000 and ~0.03 tok/s per watt. Per watt the two are roughly even. Per dollar the dual-3090 build wins by an order of magnitude, but it is loud, hot, and physically large, and it cannot hold a 120B model in its 48 GB of VRAM.
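The ratios above are simple divisions, sketched here so you can substitute your own prices and power draw. The inputs are the article's projected figures, and the function names are illustrative.

```python
def tok_s_per_thousand_dollars(tok_s: float, price_usd: float) -> float:
    """Generation speed normalized to $1,000 of system cost."""
    return tok_s / (price_usd / 1000)


def tok_s_per_watt(tok_s: float, watts: float) -> float:
    """Generation speed normalized to sustained power draw."""
    return tok_s / watts


if __name__ == "__main__":
    # Gorgon Halo, 70B q4_K_M at 4.0 tok/s, ~140 W, projected price range.
    for price in (4500, 5500):
        print(f"Gorgon Halo @ ${price}: "
              f"{tok_s_per_thousand_dollars(4.0, price):.2f} tok/s per $1,000")
    print(f"Gorgon Halo: {tok_s_per_watt(4.0, 140):.3f} tok/s per watt")
    # Used dual RTX 3090 build: ~18 tok/s, ~$1,500, ~600 W.
    print(f"Dual 3090: {tok_s_per_thousand_dollars(18, 1500):.1f} tok/s per $1,000")
    print(f"Dual 3090: {tok_s_per_watt(18, 600):.3f} tok/s per watt")
```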

That is the whole positioning of the 400 platform: a single quiet box that can host model classes a dual-GPU build cannot reach.

Verdict matrix

Buy Gorgon Halo if you have a confirmed use case for 120B+ class local inference, you can absorb the $4,500–$5,500 ticket, you want a quiet appliance-style box, and you accept low single-digit tok/s on your heaviest workload.

Stay on discrete GPUs if your daily driver is 70B and below, latency matters, you have space and ventilation for a tower, or you are not yet sure which model class will dominate your workflow. A used dual RTX 3090 build still delivers more raw tokens per second for less money on the workloads that actually exist in 2026.

Stay even smaller if you mostly run 7B–13B coding assistants. A Ryzen 7 5700X plus an MSI RTX 3060 12GB Ventus 2X gives you the cleanest tokens-per-dollar on the models 80% of home users actually run. Drop a WD Blue SN550 1TB NVMe in for model storage and you are done for under $1,100.

Recommended pick

For 2026, our recommended local-LLM build for the typical reader is still the discrete-GPU route — specifically the RTX 3060 12GB rig, not the Gorgon Halo platform. The 400 is interesting and pushes a real capacity ceiling forward, but unless you have a confirmed daily-driver workload at 120B+, the APU's bandwidth tradeoff makes it the wrong tool for most readers. Revisit when LPDDR6 brings unified bandwidth above 500 GB/s; today's numbers do not justify the platform jump for sub-70B work.

Common pitfalls for first-time APU-LLM buyers

Three traps consistently bite first-time Strix/Gorgon Halo buyers in our reading of the local-LLM threads.

Trap one: comparing tok/s without comparing model size. "The mini-PC does 16 tok/s on Llama 8B; my 4090 does 200." That comparison is meaningless. The platforms compete on different model classes. The fair comparison is "what is the largest model class you can run at usable speed", and there the 400 actually has a niche. Anchor your tok/s expectations to the model size, not to the platform.

Trap two: assuming the iGPU runs at full bandwidth. Strix Halo's unified bandwidth is shared between CPU and iGPU. If you are also running a heavy retrieval pipeline on the CPU side, expect generation tok/s to drop another 10–20%. Plan workloads so the LLM has the bandwidth budget to itself during generation.

Trap three: underestimating thermals in a small case. The 400's TDP is ~140 W, but the chassis style most vendors ship (mini-PC, ~2L volume) struggles to dissipate that much heat under sustained load. Some lab samples thermally throttle after 5–10 minutes of full-tilt inference. If you plan to run the box 24/7 as an inference server, verify that the chassis cooling can handle sustained load, or pick a larger 5–10L workstation chassis variant.

When NOT to buy Gorgon Halo

Skip this platform entirely if you do any of the following: run primarily models 13B and below (a discrete GPU murders the APU on those), need 30+ tok/s on any model (the bandwidth ceiling will frustrate you), or have not actually deployed a 70B+ model in your current workflow. The platform's value depends on a confirmed workload that needs the capacity. Without that, you are paying frontier-rig money for non-frontier throughput.

Bottom line

Gorgon Halo's 192 GB unified memory is a milestone — it lets you host model classes that previously required a multi-GPU workstation. It is also slower than a single mid-range discrete card for any model that would have fit in 12 GB. Capacity unlocks new possibilities; bandwidth still wins the throughput races. Match the platform to the model class, not to the spec-sheet flex.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

Can 192GB of unified memory actually run a 235B-class model locally?

It can hold one at an aggressive quantization — a 235B model at q4 lands in the rough vicinity of 130-150GB, which fits with room for KV cache. Loading it is the easy part; useful generation speed is the hard part, because the platform's memory bandwidth is far below a multi-GPU server, so expect slow tokens-per-second rather than interactive chat speeds.

How does Gorgon Halo's bandwidth compare to a discrete GPU?

Unified LPDDR5X on these APUs offers less memory bandwidth than a discrete card's GDDR6 or HBM: a projected ~280 GB/s against 360 GB/s on an RTX 3060 12GB, 1,008 GB/s on an RTX 4090, and 3,350 GB/s on an H100 SXM. That gap is the single biggest determinant of generation throughput in memory-bound LLM inference. So while the 192GB capacity dwarfs a 12GB RTX 3060, the smaller card is faster on any model that actually fits inside its 12GB.

Is a 192GB APU better than two used GPUs for local AI?

It depends on the model size you target. Two discrete GPUs give far more aggregate bandwidth and usually faster tokens-per-second, but their combined VRAM rarely matches 192GB and multi-GPU setups add power, heat, and configuration complexity. The APU wins on capacity, quiet operation, and simplicity; the GPUs win on raw speed for models that fit their combined VRAM.

What quantization should I use to balance quality and speed?

For most large models, q4_K_M is the common sweet spot — it roughly halves memory versus q8 with modest quality loss, and on a bandwidth-limited platform the smaller footprint directly improves tokens-per-second. Drop to q3 or q2 only when a model otherwise will not fit, and verify output quality on your own prompts, because degradation accelerates sharply below q4.

Who should skip Gorgon Halo and buy a regular GPU instead?

Anyone whose target models fit in 12-24GB should buy a discrete GPU — the MSI RTX 3060 12GB is a low-cost starting point that outruns a unified-memory APU on those models. Gorgon Halo only makes sense if your workload genuinely requires holding very large models in memory and you accept slower generation in exchange for that capacity.
