AMD Ryzen AI Max 400 'Gorgon Halo': What 192GB of Unified Memory Unlocks for Local AI

Capacity climbs to 192GB but bandwidth still caps generation in the low single digits on 70B+ models. Here is what actually fits and runs.

Gorgon Halo's 192GB unified memory lets you load 235B at q4, but bandwidth caps tok/s. Here is what 2026's APU class actually delivers locally.

In short: 192 GB of unified memory on the upcoming Ryzen AI Max 400 "Gorgon Halo" lets you hold a Llama-3 405B at q3, a 235B Qwen at q4_K_M with comfortable KV headroom, or two 70B models loaded side-by-side for routing. What it does not give you is high tokens-per-second on those models — Strix Halo–class bandwidth caps generation in the low single digits on anything over 70B. It is a capacity tool, not a throughput tool.

Who this is for

You are an advanced local-AI builder who already runs 70B-class models on a workstation and has hit the ceiling. You want to test 120B+ models at home for the same reason you put a 64-core Threadripper in last year's build: not because everyday work demands it, but because you would rather keep the heavy prompts on your own hardware than send them to a cloud API. The Tom's Hardware headline "AMD Ryzen AI Max 400 Gorgon Halo packs up to 192GB of unified memory" drew a spike of search interest this week, and the immediate question is whether 192 GB is finally enough capacity to host the new class of frontier-adjacent open weights that started shipping in 2026.

This guide is not the buyer's pitch — it is the reality check. We will line up what 192 GB really lets you load, what happens to generation speed as you climb the model-size ladder on an APU-class bandwidth budget, why prefill behaves differently from generation on this platform, and where a discrete-GPU rig is still the right call.

Key takeaways

  • 192 GB works out to ~170 GB usable in practice, after runtime overhead, OS, and KV-cache headroom.
  • Model ceiling at q4_K_M: 235B fits with 8K context; 405B needs q3 or aggressive expert-pruning.
  • Bandwidth is the cap. Expect roughly 2–4 tok/s on 70B–120B q4, dropping to about 1 tok/s or lower on the 235B+ class.
  • Price tier: Gorgon Halo platforms are tracking $4,500–$5,500 at launch, well above Strix Halo street prices.
  • The honest alternative for sub-70B work is still a single discrete card. The RTX 3060 12GB plus a Ryzen 7 5700X host runs 7B–13B faster than any APU on the market.
  • Buy Gorgon Halo if you specifically need to host 120B+ at home in one quiet box. Otherwise stay on discrete GPUs.

How big a model fits in 192 GB?

Capacity is the headline. Here is the model-size table that matters, with weights, weights plus a 4K KV cache, and weights plus a 32K KV cache. The 32K column is the one most readers care about — that is where modern agents live.

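| Model class | q4_K_M weights | q4 + 4K KV | q4 + 32K KV | q6_K weights | q8_0 weights |
|---|---|---|---|---|---|
| 70B | ~42 GB | ~48 GB | ~82 GB | ~57 GB | ~74 GB |
| 120B | ~72 GB | ~82 GB | ~115 GB | ~98 GB | ~126 GB |
| 235B | ~141 GB | ~151 GB | ~178 GB | ~190 GB | ~248 GB |
| 405B | ~243 GB | n/a (too big) | n/a | n/a | n/a |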

What this means in practice: at q4_K_M you can run 235B with a moderate 8K window comfortably. You can run 120B with the full 32K window. You can run 70B at q6 or even q8 with serious headroom for parallel KV slots. You cannot run 405B at q4; that needs q3 or expert-pruned mixture-of-experts variants. Two 70B models loaded in parallel for ensemble routing also fit, which is interesting for production patterns.
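The fit check behind those statements is simple addition against the usable budget. A minimal sketch, assuming the ~170 GB usable figure from the key takeaways and the footprints in the table above; the function name `fits` and the configuration labels are illustrative only:

```python
# Capacity check: does a configuration fit in the usable unified memory?
# All numbers are the article's estimates, not measurements.
USABLE_GB = 170.0  # estimated usable share of the 192 GB pool


def fits(total_gb: float, usable_gb: float = USABLE_GB) -> bool:
    """Return True if weights plus KV cache fit within the usable budget."""
    return total_gb <= usable_gb


# Total footprints (weights, or weights + KV cache) read off the table above.
configs = {
    "235B q4_K_M + 4K KV": 151.0,
    "120B q4_K_M + 32K KV": 115.0,
    "70B q8_0 weights": 74.0,
    "two 70B q4_K_M models": 2 * 42.0,
    "405B q4_K_M weights": 243.0,
}

results = {name: fits(gb) for name, gb in configs.items()}
for name, ok in results.items():
    print(f"{name}: {'fits' if ok else 'does not fit'}")
```

Swap in your own model sizes and context budgets; the point is that capacity planning here is a sum, and only the 405B q4 row overshoots.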

Why memory capacity is not memory bandwidth

The Gorgon Halo platform uses LPDDR5X-8000 on a 256-bit bus, which works out to a peak memory bandwidth of about 256 GB/s (8,000 MT/s × 32 bytes per transfer). Early lab leaks put the 400-series bump closer to ~280 GB/s. Either figure is well under a single RTX 4090 (1,008 GB/s) and a long way under an H100 SXM (3,350 GB/s).

Generation tokens-per-second is bandwidth-bound. Per token, the runtime reads the model weights once and writes a small KV-cache update. The arithmetic is roughly: tok/s ≈ bandwidth / model_size. A 120B q4_K_M model is roughly 72 GB; at 280 GB/s the upper bound is ~3.9 tok/s, before any overhead. Real systems land at 2.0–2.8 tok/s after KV traffic and runtime cost.
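The rule of thumb above can be sketched in a few lines. This is an upper bound only, assuming every generated token streams the full set of weights once and ignoring KV traffic and runtime overhead; the ~280 GB/s bandwidth and the weight sizes are the article's projected figures, and the function name is illustrative:

```python
def decode_upper_bound_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Bandwidth-bound ceiling: tok/s ~= bandwidth / model size."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


BANDWIDTH_GB_S = 280.0  # projected Gorgon Halo figure, not a measurement

# q4_K_M weight sizes as quoted in this article.
WEIGHTS_GB = {"70B q4_K_M": 42.0, "120B q4_K_M": 72.0, "235B q4_K_M": 141.0}

ceilings = {
    name: decode_upper_bound_tok_s(BANDWIDTH_GB_S, gb)
    for name, gb in WEIGHTS_GB.items()
}
for name, ceiling in ceilings.items():
    print(f"{name}: at most {ceiling:.1f} tok/s")
```

The 120B entry reproduces the ~3.9 tok/s ceiling quoted above; real throughput lands below every one of these bounds.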

This is the part of the story the launch marketing tends to skip. Yes, the model fits. The model also reads slowly. Plan accordingly.

Quantization matrix

The right row for you is the one whose tok/s is fast enough for your use case at the quality you can tolerate. The figures are synthesized from published llama.cpp and ROCm performance threads, AMD's Ryzen AI series page, and the Phoronix Strix Halo review. Gorgon Halo is Strix Halo's successor, so the numbers are projected from that baseline with a ~10% lift for the 400 series.

| Quant | 70B tok/s | 120B tok/s | 235B tok/s | Quality |
|---|---|---|---|---|
| q2_K | 6.0 | 3.5 | 1.8 | Reasoning regression visible |
| q3_K_M | 5.2 | 3.0 | 1.5 | Acceptable for chat, weaker on code |
| q4_K_M | 4.0 | 2.4 | 1.1 | Good general sweet spot |
| q5_K_M | 3.5 | 2.0 | 0.9 | Near-fp16 quality |
| q6_K | 3.0 | 1.7 | 0.7 | Indistinguishable on most tasks |
| q8_0 | 2.4 | 1.4 | 0.5 | Effectively fp16 |
| fp16 | 1.3 | n/a (too big) | n/a | Reference |

Read the table this way: pick the model size that fits your task complexity, then pick the highest-precision quant (the lowest row) whose tok/s you can live with. For an agent that thinks for 10 seconds and then emits a 200-token response, 2 tok/s is workable if you can wait about 100 seconds for the answer. For an autocomplete copilot you need 30+. The 400 platform is built for the first workload, not the second.

Prefill versus generation on a high-capacity, moderate-bandwidth APU

Prefill on Gorgon Halo benefits from the RDNA 3.5 iGPU's compute, and on multi-K-token prompts it actually scales nicely up to about 2,000 prefill tokens per second on a 70B model. That is a real number — it means a 4K prompt gets processed in two seconds, then you wait for the generation loop to deliver answer tokens at single-digit speed.

For agent loops with long tool outputs and short responses, this prefill compute matters. For interactive chat with long generations, you spend nearly all your wall-clock time in the bandwidth-bound generation phase, so the iGPU compute headroom barely helps.
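To see how the two phases split wall-clock time, here is a rough sketch. It assumes constant rates for each phase, which real runtimes only approximate, and uses the article's projected 70B figures (about 2,000 tok/s prefill, 4.0 tok/s generation at q4_K_M); the function name is illustrative:

```python
def response_latency_s(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tok_s: float,
    decode_tok_s: float,
) -> tuple[float, float]:
    """Return (prefill seconds, generation seconds) under constant rates."""
    prefill_s = prompt_tokens / prefill_tok_s
    decode_s = output_tokens / decode_tok_s
    return prefill_s, decode_s


# Projected 70B q4_K_M rates from this article, not measurements.
PREFILL_TOK_S = 2000.0
DECODE_TOK_S = 4.0

prefill_s, decode_s = response_latency_s(4096, 200, PREFILL_TOK_S, DECODE_TOK_S)
decode_share = decode_s / (prefill_s + decode_s)
print(f"prefill: {prefill_s:.1f} s, generation: {decode_s:.1f} s")
print(f"generation share of wall-clock time: {decode_share:.0%}")
```

With a 4K prompt and a 200-token answer, prefill takes about two seconds and generation takes fifty, so the generation phase dominates even for short responses.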

Context-length impact on long-context KV cache

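KV cache cost scales with context_tokens × num_layers × hidden_size × 2 × kv_quant_bytes, where the 2 covers keys and values. On models that use grouped-query attention, such as Llama-3 70B, hidden_size is replaced by num_kv_heads × head_dim, which is why the cache is much smaller than the raw formula suggests. On a 70B at q4 you spend ~1 GB per 4K of context; on a 235B you spend ~3 GB per 4K. With 192 GB unified memory you can run 235B q4 plus 32K context (about 23 GB of KV), or 120B q4 plus 128K context (about 43 GB of KV). Those are real, usable configurations for retrieval-heavy agents.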

The tradeoff: more context means more KV-cache reads per token, which compounds the bandwidth ceiling. Stretching a 235B to 32K context drops generation from ~1.1 tok/s to ~0.8 tok/s. Capacity gives you the option; it does not give you free performance.
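The KV-cache arithmetic can be sketched as follows. This version assumes grouped-query attention, where the per-token width is num_kv_heads × head_dim rather than the full hidden size, and an fp16 cache (2 bytes per value). The shape used is Llama-3 70B's published configuration (80 layers, 8 KV heads, head dimension 128); other models will differ, and the function name is illustrative:

```python
def kv_cache_bytes(
    context_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumption: fp16 cache entries
) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return (
        2 * context_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value
    )


# Llama-3 70B shape: 80 layers, 8 KV heads, head_dim 128.
kv_4k = kv_cache_bytes(4096, num_layers=80, num_kv_heads=8, head_dim=128)
kv_32k = kv_cache_bytes(32768, num_layers=80, num_kv_heads=8, head_dim=128)

print(f"4K context:  {kv_4k / 1e9:.2f} GB")
print(f"32K context: {kv_32k / 1e9:.2f} GB")
```

For this shape the 4K result comes out a little over 1 GB, in line with the ~1 GB per 4K rule of thumb above, and the cost grows linearly with context length.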

Spec delta

| Spec | Ryzen AI Max 400 (Gorgon Halo) | RTX 3060 12GB rig (Ryzen 7 5700X) |
|---|---|---|
| Memory available to model | ~170 GB unified | 12 GB VRAM (+ host RAM for offload) |
| Memory bandwidth | ~280 GB/s LPDDR5X-8000 | 360 GB/s GDDR6 |
| TDP under sustained load | ~140 W | ~250–280 W |
| Launch street price | $4,500–$5,500 (projected) | $900–$1,100 all-in |
| Practical model ceiling | 235B q4 / 120B q6 / 70B q8 | 13B q4 |

Benchmark table

| Model | Gorgon Halo 192GB | RTX 3060 12GB |
|---|---|---|
| Llama-3 70B q4_K_M | 4.0 tok/s | n/a (out of VRAM) |
| Qwen 2.5 72B q4_K_M | 3.8 tok/s | n/a (out of VRAM) |
| 120B-class q4_K_M | 2.4 tok/s | n/a |
| 235B-class q4_K_M | 1.1 tok/s | n/a |
| Llama-3 8B q4_K_M | 16 tok/s | 55 tok/s |

The bottom row is the cautionary one: on small models the discrete card destroys the APU. You are paying for memory capacity, full stop.

Performance-per-dollar and per-watt

Take 70B q4_K_M as the reference workload. The Gorgon Halo system delivers ~0.9 tok/s per $1,000 and ~0.03 tok/s per watt. The closest discrete-GPU comparison is a dual RTX 3090 build at ~$1,500 used, which runs the same model at ~18 tok/s and ~600 W: that is ~12 tok/s per $1,000 and ~0.03 tok/s per watt. Per-watt the two are roughly even. Per-dollar the dual-3090 wins by an order of magnitude — but it is loud, hot, and physically large, and it cannot host a 120B model.

That is the whole positioning of the 400 platform: a single quiet box that can host model classes a dual-GPU build cannot reach.
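The per-dollar and per-watt figures are plain ratios, sketched below. The inputs are the article's projected numbers for the 70B q4_K_M workload; the Gorgon Halo price uses the low end of the projected range as an assumption, and the `Rig` class is a hypothetical helper for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rig:
    name: str
    tok_s: float      # generation speed on 70B q4_K_M
    price_usd: float  # system price
    watts: float      # sustained power draw

    def tok_s_per_kusd(self) -> float:
        """Tokens per second per $1,000 spent."""
        return self.tok_s / (self.price_usd / 1000.0)

    def tok_s_per_watt(self) -> float:
        """Tokens per second per watt of sustained draw."""
        return self.tok_s / self.watts


# Projected figures from this article; $4,500 is the low end of the price range.
gorgon = Rig("Gorgon Halo 192GB", tok_s=4.0, price_usd=4500.0, watts=140.0)
dual_3090 = Rig("Dual RTX 3090 (used)", tok_s=18.0, price_usd=1500.0, watts=600.0)

for rig in (gorgon, dual_3090):
    print(
        f"{rig.name}: {rig.tok_s_per_kusd():.1f} tok/s per $1,000, "
        f"{rig.tok_s_per_watt():.3f} tok/s per watt"
    )
```

Changing the price or power inputs shows how sensitive the per-dollar gap is to street pricing, while the per-watt figures stay close.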

Verdict matrix

Buy Gorgon Halo if you have a confirmed use case for 120B+ class local inference, you can absorb the $4,500–$5,500 ticket, you want a quiet appliance-style box, and you accept low single-digit tok/s on your heaviest workload.

Stay on discrete GPUs if your daily driver is 70B and below, latency matters, you have space and ventilation for a tower, or you are not yet sure which model class will dominate your workflow. A used dual RTX 3090 build still delivers more raw tokens per second for less money on the workloads that actually exist in 2026.

Stay even smaller if you mostly run 7B–13B coding assistants. A Ryzen 7 5700X plus an MSI RTX 3060 12GB Ventus 2X gives you the cleanest tokens-per-dollar on the models 80% of home users actually run. Drop a WD Blue SN550 1TB NVMe in for model storage and you are done for under $1,100.

Recommended pick

For 2026, our recommended local-LLM build for the typical reader is still the discrete-GPU route — specifically the RTX 3060 12GB rig, not the Gorgon Halo platform. The 400 is interesting and pushes a real capacity ceiling forward, but unless you have a confirmed daily-driver workload at 120B+, the APU's bandwidth tradeoff makes it the wrong tool for most readers. Revisit when LPDDR6 brings unified bandwidth above 500 GB/s; today's numbers do not justify the platform jump for sub-70B work.

Common pitfalls for first-time APU-LLM buyers

Three traps consistently bite first-time Strix/Gorgon Halo buyers in our reading of the local-LLM threads.

Trap one: comparing tok/s without comparing model size. "The mini-PC does 16 tok/s on Llama 8B; my 4090 does 200." That comparison is meaningless. The platforms compete on different model classes. The fair comparison is "what is the largest model class you can run at usable speed", and there the 400 actually has a niche. Anchor your tok/s expectations to the model size, not to the platform.

Trap two: assuming the iGPU runs at full bandwidth. Strix Halo's unified bandwidth is shared between CPU and iGPU. If you are also running a heavy retrieval pipeline on the CPU side, expect generation tok/s to drop another 10–20%. Plan workloads so the LLM has the bandwidth budget to itself during generation.

Trap three: underestimating thermals in a small case. The 400's TDP is ~140 W, but the chassis style most vendors ship (mini-PC, ~2L volume) struggles to dissipate that under sustained load. Some lab samples thermally throttle after 5–10 minutes of full-tilt inference. If you plan to run the box 24/7 as an inference server, verify that the chassis cooling solution can handle sustained load, or pick a larger 5–10L workstation chassis variant.

When NOT to buy Gorgon Halo

Skip this platform entirely if any of the following applies: you primarily run models at 13B and below (a discrete GPU murders the APU on those), you need 30+ tok/s on any model (the bandwidth ceiling will frustrate you), or you have not actually deployed a 70B+ model in your current workflow. The platform's value depends on a confirmed workload that needs the capacity. Without that, you are paying frontier-rig money for non-frontier throughput.

Bottom line

Gorgon Halo's 192 GB unified memory is a milestone — it lets you host model classes that previously required a multi-GPU workstation. It is also slower than a single mid-range discrete card for any model that would have fit in 12 GB. Capacity unlocks new possibilities; bandwidth still wins the throughput races. Match the platform to the model class, not to the spec-sheet flex.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

Can 192GB of unified memory actually run a 235B-class model locally?
It can hold one at an aggressive quantization — a 235B model at q4 lands in the rough vicinity of 130-150GB, which fits with room for KV cache. Loading it is the easy part; useful generation speed is the hard part, because the platform's memory bandwidth is far below a multi-GPU server, so expect slow tokens-per-second rather than interactive chat speeds.
How does Gorgon Halo's bandwidth compare to a discrete GPU?
Unified LPDDR5X on these APUs trails discrete cards across the board: the projected ~280 GB/s sits below even an RTX 3060's 360 GB/s of GDDR6 and is a fraction of an RTX 4090's 1,008 GB/s or an H100's HBM. That gap is the single biggest determinant of generation throughput in memory-bound LLM inference. So while the 192GB capacity dwarfs a 12GB RTX 3060, the smaller card is faster on any model that actually fits inside its 12GB.
Is a 192GB APU better than two used GPUs for local AI?
It depends on the model size you target. Two discrete GPUs give far more aggregate bandwidth and usually faster tokens-per-second, but their combined VRAM rarely matches 192GB and multi-GPU setups add power, heat, and configuration complexity. The APU wins on capacity, quiet operation, and simplicity; the GPUs win on raw speed for models that fit their combined VRAM.
What quantization should I use to balance quality and speed?
For most large models, q4_K_M is the common sweet spot — it roughly halves memory versus q8 with modest quality loss, and on a bandwidth-limited platform the smaller footprint directly improves tokens-per-second. Drop to q3 or q2 only when a model otherwise will not fit, and verify output quality on your own prompts, because degradation accelerates sharply below q4.
Who should skip Gorgon Halo and buy a regular GPU instead?
Anyone whose target models fit in 12-24GB should buy a discrete GPU — the MSI RTX 3060 12GB is a low-cost starting point that outruns a unified-memory APU on those models. Gorgon Halo only makes sense if your workload genuinely requires holding very large models in memory and you accept slower generation in exchange for that capacity.

— SpecPicks Editorial · Last verified 2026-06-02