In short: 192 GB of unified memory on the upcoming Ryzen AI Max 400 "Gorgon Halo" lets you hold a Llama-3 405B at q3, a 235B Qwen at q4_K_M with comfortable KV headroom, or two 70B models loaded side-by-side for routing. What it does not give you is high tokens-per-second on those models — Strix Halo–class bandwidth caps generation in the low single digits on anything over 70B. It is a capacity tool, not a throughput tool.
Who this is for
You are an advanced local-AI builder who already runs 70B-class models on a workstation and has hit the ceiling. You want to test 120B+ models at home for the same reason you put a 64-core Threadripper in last year's build: not because you need it for everyday work, but because you do not want to send the heavy prompts to a cloud API. The Tom's Hardware trend row "AMD Ryzen AI Max 400 Gorgon Halo packs up to 192GB of unified memory" lit up the search graph this week, and the immediate question is whether 192 GB is finally enough capacity to host the new class of frontier-adjacent open weights that started shipping in 2026.
This guide is not the buyer's pitch — it is the reality check. We will line up what 192 GB really lets you load, what happens to generation speed as you climb the model-size ladder on an APU-class bandwidth budget, why prefill behaves differently from generation on this platform, and where a discrete-GPU rig is still the right call.
Key takeaways
- 192 GB is real-world ~170 GB usable after runtime overhead, OS, and KV-cache headroom.
- Model ceiling at q4_K_M: 235B fits with 8K context, 405B needs q3 or aggressive expert-pruning.
- Bandwidth is the cap. Expect 1.5–4 tok/s on 70B–120B q4, dropping below 1 tok/s on 235B+ class.
- Price tier: Gorgon Halo platforms are tracking $4,500–$5,500 at launch, well above Strix Halo street prices.
- The honest alternative for sub-70B work is still a single discrete card. The RTX 3060 12GB plus a Ryzen 7 5700X host runs 7B–13B faster than any APU on the market.
- Buy Gorgon Halo if you specifically need to host 120B+ at home in one quiet box. Otherwise stay on discrete GPUs.
How big a model fits in 192 GB?
Capacity is the headline. Here is the model-size table that matters, with weights, weights plus a 4K KV cache, and weights plus a 32K KV cache. The 32K column is the one most readers care about — that is where modern agents live.
| Model class | q4_K_M weights | q4 + 4K KV | q4 + 32K KV | q6_K weights | q8_0 weights |
|---|---|---|---|---|---|
| 70B | ~42 GB | ~48 GB | ~82 GB | ~57 GB | ~74 GB |
| 120B | ~72 GB | ~82 GB | ~115 GB | ~98 GB | ~126 GB |
| 235B | ~141 GB | ~151 GB | ~178 GB | ~190 GB | ~248 GB |
| 405B | ~243 GB | n/a (too big) | n/a | n/a | n/a |
What this means in practice: at q4_K_M you can run 235B with a moderate 8K window comfortably. You can run 120B with the full 32K window. You can run 70B at q6 or even q8 with serious headroom for parallel KV slots. You cannot run 405B at q4 — that needs q3 or expert-pruned mixture-of-experts variants. Two parallel 70B models loaded for ensemble routing also fits, which is interesting for production patterns.
Why memory capacity is not memory bandwidth
The Gorgon Halo platform uses LPDDR5X-8000 in a 256-bit configuration, putting peak memory bandwidth in the ~256 GB/s range, with the 400-series mild bump landing it closer to ~280 GB/s in early lab leaks. That is meaningfully under a single RTX 4090 (1,008 GB/s) and a long way under an H100 SXM (3,350 GB/s).
Generation tokens-per-second is bandwidth-bound. Per token, the runtime reads the model weights once and writes a small KV-cache update. The arithmetic is roughly: tok/s ≈ bandwidth / model_size. A 120B q4_K_M model is roughly 72 GB; at 280 GB/s the upper bound is ~3.9 tok/s, before any overhead. Real systems land at 2.0–2.8 tok/s after KV traffic and runtime cost.
This is the part of the story the launch marketing tends to skip. Yes, the model fits. The model also reads slowly. Plan accordingly.
Quantization matrix
The right column for you is the one whose tok/s is fast enough for your use case at the quality you can tolerate. Synthesized from published llama.cpp and ROCm performance threads, AMD's Ryzen AI series page, and the Phoronix Strix Halo review (Gorgon Halo is its successor; figures projected from that baseline with ~10% lift for the 400 series).
| Quant | 70B tok/s | 120B tok/s | 235B tok/s | Quality |
|---|---|---|---|---|
| q2_K | 6.0 | 3.5 | 1.8 | Reasoning regression visible |
| q3_K_M | 5.2 | 3.0 | 1.5 | Acceptable for chat, weaker on code |
| q4_K_M | 4.0 | 2.4 | 1.1 | Good general sweet spot |
| q5_K_M | 3.5 | 2.0 | 0.9 | Near-fp16 quality |
| q6_K | 3.0 | 1.7 | 0.7 | Indistinguishable on most tasks |
| q8_0 | 2.4 | 1.4 | 0.5 | Effectively fp16 |
| fp16 | 1.3 | n/a (too big) | n/a | Reference |
Read the table this way: pick the model size that fits your task complexity, then pick the rightmost quant whose tok/s you can live with. For an agent that thinks for 10 seconds and then emits a 200-token response, 2 tok/s is fine. For an autocomplete copilot you need 30+. The 400 platform is built for the first workload, not the second.
Prefill versus generation on a high-capacity, moderate-bandwidth APU
Prefill on Gorgon Halo benefits from the RDNA 3.5 iGPU's compute, and on multi-K-token prompts it actually scales nicely up to about 2,000 prefill tokens per second on a 70B model. That is a real number — it means a 4K prompt gets processed in two seconds, then you wait for the generation loop to deliver answer tokens at single-digit speed.
For agent loops with long tool outputs and short responses, this prefill compute matters. For interactive chat with long generations, you spend nearly all your wall-clock time in the bandwidth-bound generation phase, so the iGPU compute headroom barely helps.
Context-length impact on long-context KV cache
KV cache cost scales with context_tokens × num_layers × hidden_size × 2 × kv_quant_bytes. On a 70B at q4 you spend ~1 GB per 4K of context; on a 235B you spend ~3 GB per 4K. With 192 GB unified memory you can run 235B q4 plus 32K context (about 23 GB of KV), or 120B q4 plus 128K context (about 43 GB of KV). Those are real, usable configurations for retrieval-heavy agents.
The tradeoff: more context means more KV-cache reads per token, which compounds the bandwidth ceiling. Stretching a 235B to 32K context drops generation from ~1.1 tok/s to ~0.8 tok/s. Capacity gives you the option; it does not give you free performance.
Spec delta
| Spec | Ryzen AI Max 400 (Gorgon Halo) | RTX 3060 12GB rig (Ryzen 7 5700X) |
|---|---|---|
| Memory available to model | ~170 GB unified | 12 GB VRAM (+ host RAM for offload) |
| Memory bandwidth | ~280 GB/s LPDDR5X-8000 | 360 GB/s GDDR6 |
| TDP under sustained load | ~140 W | ~250–280 W |
| Launch street price | $4,500–$5,500 (projected) | $900–$1,100 all-in |
| Practical model ceiling | 235B q4 / 120B q6 / 70B q8 | 13B q4 |
Benchmark table
| Model | Gorgon Halo 192GB | RTX 3060 12GB |
|---|---|---|
| Llama-3 70B q4_K_M | 4.0 tok/s | n/a (out of VRAM) |
| Qwen 2.5 72B q4_K_M | 3.8 tok/s | n/a (out of VRAM) |
| 120B-class q4_K_M | 2.4 tok/s | n/a |
| 235B-class q4_K_M | 1.1 tok/s | n/a |
| Llama-3 8B q4_K_M | 16 tok/s | 55 tok/s |
The bottom row is the cautionary one: on small models the discrete card destroys the APU. You are paying for memory capacity, full stop.
Performance-per-dollar and per-watt
Take 70B q4_K_M as the reference workload. The Gorgon Halo system delivers ~0.9 tok/s per $1,000 and ~0.03 tok/s per watt. The closest discrete-GPU comparison is a dual RTX 3090 build at ~$1,500 used, which runs the same model at ~18 tok/s and ~600 W: that is ~12 tok/s per $1,000 and ~0.03 tok/s per watt. Per-watt the two are roughly even. Per-dollar the dual-3090 wins by an order of magnitude — but it is loud, hot, and physically large, and it cannot host a 120B model.
That is the whole positioning of the 400 platform: a single quiet box that can host model classes a dual-GPU build cannot reach.
Verdict matrix
Buy Gorgon Halo if you have a confirmed use case for 120B+ class local inference, you can absorb the $4,500–$5,500 ticket, you want a quiet appliance-style box, and you accept low single-digit tok/s on your heaviest workload.
Stay on discrete GPUs if your daily driver is 70B and below, latency matters, you have space and ventilation for a tower, or you are not yet sure which model class will dominate your workflow. A dual RTX 3090 used build still delivers more raw tokens per second for less money on the workloads that actually exist in 2026.
Stay even smaller if you mostly run 7B–13B coding assistants. A Ryzen 7 5700X plus a MSI RTX 3060 12GB Ventus 2X gives you the cleanest tokens-per-dollar on the models 80% of home users actually run. Drop a WD Blue SN550 1TB NVMe in for model storage and you are done for under $1,100.
Recommended pick
For 2026, our recommended local-LLM build for the typical reader is still the discrete-GPU route — specifically the RTX 3060 12GB rig, not the Gorgon Halo platform. The 400 is interesting and pushes a real capacity ceiling forward, but unless you have a confirmed daily-driver workload at 120B+, the APU's bandwidth tradeoff makes it the wrong tool for most readers. Revisit when LPDDR6 brings unified bandwidth above 500 GB/s; today's numbers do not justify the platform jump for sub-70B work.
Common pitfalls for first-time APU-LLM buyers
Three traps consistently bite first-time Strix/Gorgon Halo buyers in our reading of the local-LLM threads.
Trap one: comparing tok/s without comparing model size. "The mini-PC does 16 tok/s on Llama 8B; my 4090 does 200." That comparison is meaningless. The platforms compete on different model classes. The fair comparison is "what is the largest model class you can run at usable speed", and there the 400 actually has a niche. Anchor your tok/s expectations to the model size, not to the platform.
Trap two: assuming the iGPU runs at full bandwidth. Strix Halo's unified bandwidth is shared between CPU and iGPU. If you are also running a heavy retrieval pipeline on the CPU side, expect generation tok/s to drop another 10–20%. Plan workloads so the LLM has the bandwidth budget to itself during generation.
Trap three: underestimating thermals in a small case. The 400's TDP is ~140W, but the chassis style most vendors ship in (mini-PC, ~2L volume) struggles to dissipate that sustained. Some lab samples thermally throttle after 5–10 minutes of full-tilt inference. If you plan to run the box 24/7 as an inference server, verify the chassis cooling solution accommodates sustained load — or pick a larger 5–10L workstation chassis variant.
When NOT to buy Gorgon Halo
Skip this platform entirely if you do any of the following: run primarily models 13B and below (a discrete GPU murders the APU on those), need 30+ tok/s on any model (the bandwidth ceiling will frustrate you), or have not actually deployed a 70B+ model in your current workflow. The platform's value depends on a confirmed workload that needs the capacity. Without that, you are paying frontier-rig money for non-frontier throughput.
Bottom line
Gorgon Halo's 192 GB unified memory is a milestone — it lets you host model classes that previously required a multi-GPU workstation. It is also slower than a single mid-range discrete card for any model that would have fit in 12 GB. Capacity unlocks new possibilities; bandwidth still wins the throughput races. Match the platform to the model class, not to the spec-sheet flex.
Related guides
- AMD Ryzen AI Max+ 395 'Strix Halo' 128GB for local LLMs vs RTX 3060
- Is the RTX 3060 12GB still worth it for 1080p gaming in 2026?
- Claude Opus 4.8 takes #1 on the Artificial Analysis Intelligence Index
