In brief (2026-05-30): A new r/LocalLLaMA thread asks whether anyone has tested AMD's Ryzen AI Max+ 395 "Strix Halo" mini-PC with 128 GB of unified memory for hosting large LLMs. The short answer: yes, it can load 70B-class models that a 12 GB discrete card cannot touch, but it pays for that capacity in generation throughput, because its LPDDR5X bandwidth (~256 GB/s) is roughly 70% of the 360 GB/s a discrete RTX 3060 12GB delivers. It's capacity tech, not throughput tech: useful for a specific buyer, not the new default.
What happened
The r/LocalLLaMA community revived its interest in unified-memory APUs this week. The catalyst: a thread asking whether anyone has tested a Ryzen AI Max+ 395 / 128 GB Corsair desktop for LLM inference, and whether the unified-memory pool genuinely changes the home-LLM calculus that has settled around discrete 12 GB and 24 GB cards. The thread cycled through familiar arguments (capacity vs bandwidth, prefill vs generation) and whether a $4,000 mini-PC is a better home-LLM box than two used Ryzen 7 5800X + RTX 3060 12GB rigs side by side.
The Ryzen AI Max+ 395 platform sits in a specific niche. It's an APU — CPU, integrated GPU, and NPU sharing a single LPDDR5X memory pool — packaged for high-end mini-PCs and workstations. The headline 128 GB SKU configures up to 96 GB of that pool as GPU-addressable memory in some BIOSes, which is the part that matters for LLM loading: a 70B-q5 model that needs ~50 GB of weights plus a 30 GB KV cache fits, where the same model on a 12 GB discrete card cannot load at all without aggressive offload.
Per AMD's Ryzen AI Max product page, the platform pairs a Zen 5-class CPU with an RDNA-class iGPU and an XDNA NPU, sharing a unified LPDDR5X memory subsystem at up to 256 GB/s of effective bandwidth depending on channel configuration. That bandwidth is the second number that matters — and it's why the platform's capability is misread by buyers who only look at the capacity figure.
Why it matters: the bandwidth-vs-capacity trade
Token generation in a transformer is memory-bandwidth-bound. Producing each new token requires reading every weight in the model once, layer by layer; for a 70B-q4 model, that's ~40 GB of reads per token. Divide bandwidth by weight footprint and you get a hard upper bound on tok/s:
- Ryzen AI Max+ 395 at ~256 GB/s, 70B-q4 model (~40 GB): ceiling ~6.4 tok/s. Real-world numbers from llama.cpp community testing land at 5–7 tok/s, consistent with the calculation.
- RTX 3060 12GB at 360 GB/s, same 70B-q4 model: the model doesn't fit. CPU/disk offload crashes throughput to <1 tok/s.
- RTX 3060 12GB, 8B-q4 model (~5 GB): ceiling ~72 tok/s. Real-world ~60–70 tok/s.
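The ceiling arithmetic in the bullets above can be sketched in a few lines of Python. The function name is hypothetical and the inputs are this article's own figures. It is an upper bound that ignores KV-cache reads and compute overhead, so measured numbers should land at or below it.

```python
def ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on generation speed: every token reads all weights once."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Bandwidth and footprint figures are the ones quoted in the bullets above.
print(f"APU, 70B q4:     {ceiling_tok_s(256, 40):.1f} tok/s")
print(f"RTX 3060, 8B q4: {ceiling_tok_s(360, 5):.1f} tok/s")
```

Swapping in your own model's file size and your hardware's memory bandwidth gives a quick sanity check before buying anything.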
The math frames the question. If your target model fits in 12 GB, a discrete card is faster and dramatically cheaper. If your target model doesn't fit, the APU's unified pool is the only consumer-class answer. The platforms aren't competing for the same buyer — they're solving different problems.
Spec-delta: APU vs discrete RTX 3060 12GB
| Metric | Ryzen AI Max+ 395 (128GB SKU) | RTX 3060 12GB |
|---|---|---|
| Memory pool | 128 GB LPDDR5X (shared) | 12 GB GDDR6 (dedicated) |
| Effective bandwidth | ~256 GB/s | ~360 GB/s |
| Max model (q4) | ~190B params (theoretical), ~70B (practical) | ~22B params (theoretical), 14B (practical with ctx) |
| Generation tok/s — Llama 8B q4 | ~30–35 | ~60–70 |
| Generation tok/s — Qwen 32B q4 | ~12–15 | <5 (offload) |
| Generation tok/s — Llama 70B q4 | ~5–7 | does not fit |
| Prefill speed (256-tok prompt, 8B model) | ~1.5–2 s | ~0.4 s |
| Idle / load draw | ~25 W / ~120 W | ~15 W / ~170 W |
| Approx street price | $3,500–$4,200 (full box) | $300–$400 (card only) |
That last row is the part buyers often skip. The Ryzen AI Max+ 395 is a complete mini-PC; the RTX 3060 12GB is a card you bolt into an existing AM4 box that costs another $400–$600 to build. The total-system gap is roughly 4–5×, not 10×, but it's still a $3,000 decision.
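As a rough check on that ratio, the sketch below takes the midpoint of each price range quoted above. Using midpoints is an assumption made for illustration; actual street prices vary.

```python
def midpoint(low: float, high: float) -> float:
    """Midpoint of a quoted price range."""
    return (low + high) / 2


apu_box = midpoint(3500, 4200)       # complete mini-PC
discrete_card = midpoint(300, 400)   # RTX 3060 12GB, card only
host_build = midpoint(400, 600)      # AM4 box to hold the card
discrete_system = discrete_card + host_build

ratio = apu_box / discrete_system
print(f"APU ~${apu_box:,.0f} vs discrete system ~${discrete_system:,.0f}: ~{ratio:.1f}x")
```

At the midpoints the gap works out to about 4.5×, or roughly $3,000 in absolute terms.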
When the APU is the right buy
The Ryzen AI Max+ 395 128 GB is the right buy when:
- Your primary models are 30B–70B class and you want them resident in fast memory, not on disk offload.
- You want a single small-form-factor box for the desk, not a tower with separate GPU.
- You'll run long-context (32K+) inference where the KV cache alone exceeds discrete-card VRAM.
- Power-and-noise budgets favor a 120 W APU over a 170 W discrete card in a 350 W tower.
- You need both the NPU and iGPU for non-LLM AI workloads (Stable Diffusion, real-time transcription, vision models) sharing one pool.
For those buyers, the platform has no direct consumer competitor. The next step up is a $5,000+ Threadripper + RTX A6000 build with markedly more performance but at a different price tier entirely.
Prefill is the second number that matters
Generation throughput gets the headlines; prefill (the time to digest the input prompt before the first new token) is the part that decides whether interactive use feels live. Prefill is compute-bound and parallel, the opposite of generation. The Ryzen AI Max+ 395's iGPU runs prefill at a fraction of what a discrete CUDA GPU manages — community measurements on 8B-class models show 1.5–2 second prefill on a 256-token prompt for the APU, versus under half a second for the RTX 3060 12GB.
That gap widens with prompt length. A 4,000-token system prompt + chat history takes 8–12 seconds to prefill on the APU and ~1.5 seconds on the discrete card. For chat-style turns, the discrete card feels snappy; the APU feels deliberative. For agentic workloads that feed multi-thousand-token contexts on every turn (logs, file diffs, error traces), prefill alone can break the interaction loop on the APU side.
This is why "the APU can load 70B" doesn't translate to "the APU should be your home-LLM box." Loading the model is the easy half; interactive use depends on both bandwidth-bound generation and compute-bound prefill, and the discrete card wins both on any model that fits in 12 GB of VRAM.
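To see how prefill and generation combine into what a user actually waits for, here is a minimal sketch. The function name and the 200-token reply length are illustrative assumptions; the prefill times and generation rates are midpoints of the ranges quoted in this article, assuming the 4,000-token prefill figures apply to the same 8B-class model.

```python
def turn_latency_s(prefill_s: float, reply_tokens: int, gen_tok_s: float) -> float:
    """Seconds until a reply is complete: prompt prefill plus generation."""
    if gen_tok_s <= 0:
        raise ValueError("gen_tok_s must be positive")
    return prefill_s + reply_tokens / gen_tok_s


REPLY_TOKENS = 200  # assumed reply length, for illustration only

# 8B-class model, 4,000-token prompt, midpoints of the quoted ranges.
apu = turn_latency_s(prefill_s=10.0, reply_tokens=REPLY_TOKENS, gen_tok_s=32.5)
discrete = turn_latency_s(prefill_s=1.5, reply_tokens=REPLY_TOKENS, gen_tok_s=65.0)

print(f"APU:      {apu:.1f} s per turn")
print(f"RTX 3060: {discrete:.1f} s per turn")
```

Under these assumptions the APU needs about 16 seconds per turn against under 5 for the discrete card, and most of the APU's time is spent before the first token appears.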
Context length is where the APU's pool actually pays off
The KV cache for transformer attention scales linearly with sequence length and is stored per layer. A 70B model at 128K context needs 30–50 GB of KV cache on top of the weights themselves, even with the grouped-query attention that Llama 70B-class models use; classic multi-head attention would need several times more. The 12 GB card is hopeless here; even a 32B model at 32K context spills to system RAM and crawls.
The APU's unified pool genuinely shines on long-context workloads. A 70B q5 model needs ~50 GB for weights and ~40 GB more for a 128K KV cache, ~90 GB in total. That fits in a 128 GB box, though it leaves little headroom under the ~96 GB GPU-addressable ceiling mentioned above. No discrete consumer GPU under $5,000 in 2026 can do that without offload. For research workloads that genuinely need long context (RAG over long docs, agentic chains with multi-thousand-token system prompts), this is the APU's strongest case.
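The KV cache figure can be estimated from a model's shape. The sketch below assumes a Llama-70B-like configuration (80 layers, 8 key/value heads under grouped-query attention, head dimension 128) and 16-bit cache values. Those shape parameters are assumptions for illustration, and other 70B models or quantized caches will differ.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB: keys and values, for every layer, for every token."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 1e9


# Assumed Llama-70B-like shape; 16-bit (2-byte) cache entries.
for ctx in (32 * 1024, 128 * 1024):
    print(f"{ctx:>7} tokens: {kv_cache_gb(80, 8, 128, ctx):.1f} GB")
```

With these inputs a 128K context comes to roughly 43 GB, in line with the ~40 GB used in the paragraph above.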
If your workload is short-context (4K–16K tokens per turn, which covers most chat and code completion use), the unified pool is capacity you're paying for but not using. If your workload is long-context RAG, multi-document summarization, or any agentic chain that holds large state in the prompt, the unified pool is the only consumer-class answer.
What community testing actually shows
Threads on r/LocalLLaMA, llama.cpp's GitHub discussions, and AnandTech reader forums converge on a handful of consistent numbers:
- Llama 3.1 8B q4_K_M generation: ~30–35 tok/s on the APU, ~60–70 tok/s on the discrete 3060 12GB. Discrete wins 2×.
- Qwen 32B q4_K_M generation: ~12–15 tok/s on the APU; the discrete card can't fit it natively and crashes to <5 tok/s under offload. APU wins ~3×.
- Llama 70B q4_K_M generation: ~5–7 tok/s on the APU; the discrete card can't load it. APU wins by default.
- Llama 70B q5_K_M with 32K context: 5–6 tok/s on the APU, impossible on a 12 GB card. APU wins by default.
The break-even is around 22–32B class models. Above that, the discrete card is forced into offload that ruins throughput; below that, the discrete card runs the same model faster and cheaper.
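One way to see where that break-even comes from is to ask how many parameters fit in a given pool at q4. This is a minimal sketch: the function name is hypothetical, the 0.5 bytes/param value is this article's rule of thumb, and the reserve sizes are illustrative assumptions rather than measurements.

```python
def max_params_billions(pool_gb: float, bytes_per_param: float,
                        reserve_gb: float) -> float:
    """Largest parameter count (in billions) whose weights fit after a reserve."""
    usable_gb = pool_gb - reserve_gb
    if usable_gb <= 0:
        return 0.0
    return usable_gb / bytes_per_param


Q4_BYTES_PER_PARAM = 0.5  # rule-of-thumb value used in this article

# reserve_gb values are assumptions: 1 GB for runtime overhead,
# 5 GB when leaving room for a working context.
print(max_params_billions(12, Q4_BYTES_PER_PARAM, reserve_gb=1))  # 12 GB card, bare fit
print(max_params_billions(12, Q4_BYTES_PER_PARAM, reserve_gb=5))  # 12 GB card, with context
print(max_params_billions(96, Q4_BYTES_PER_PARAM, reserve_gb=1))  # 96 GB GPU-addressable pool
```

Under these assumptions the results (22B, 14B, and 190B) line up with the theoretical and practical limits in the spec table.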
When the discrete RTX 3060 12GB is still the smarter spend
For everyone else — which is most home LLM builders in 2026 — the discrete card path is faster and cheaper:
- 8B–14B models comfortably fit in 12 GB; the RTX 3060 12GB is 2× faster than the APU on every one of them.
- Interactive chat with short prompts wants snappy prefill, which the discrete card's CUDA cores deliver in a fraction of the APU's time.
- The full system (12 GB card + Ryzen 7 5800X + 32 GB DDR4 + WD Blue SN550 1TB NVMe) lands at $700–$900, roughly a fifth of the APU.
- The CUDA ecosystem for image generation, fine-tuning, and adjacent ML tooling is still years ahead of the AMD ROCm/HIP equivalent. If you'll do anything beyond chat-style LLM inference, NVIDIA's tooling matters.
For sub-30B work, two RTX 3060 12GB cards on a B550 motherboard ($1,200 total system) sharded via vLLM or llama.cpp row-split give a combined 24 GB of VRAM and up to 720 GB/s of aggregate bandwidth when both cards work on every token in parallel. That is far past the APU's 256 GB/s, at roughly a third of the cost.
The bandwidth gap is the real story
The unified-memory pitch is "more memory equals more model." That's true in capacity terms and misleading in throughput terms. LPDDR5X at the APU's configured speeds delivers ~256 GB/s of effective bandwidth; GDDR6 on a midrange discrete card delivers 360 GB/s; GDDR6X on RTX 4070/4080 delivers 500+ GB/s; HBM3 on enterprise cards (H100, MI3xx) delivers 3–5 TB/s. The further up the bandwidth tier you go, the faster the same model generates tokens.
Per the llama.cpp discussions, the rule of thumb is: tok/s ≈ bandwidth ÷ (parameter count × bytes per parameter), treated as a ceiling. A 70B model at q4 (~0.5 bytes/param × 70B = 35 GB) on 256 GB/s of bandwidth tops out at ~7 tok/s; real q4_K_M files run closer to ~40 GB, which is where the ~6.4 tok/s figure above comes from. The same 35 GB model on a hypothetical 1 TB/s card tops out at ~28 tok/s. The APU's capacity advantage is real; its bandwidth ceiling is just as real.
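The same rule of thumb can be turned around to ask how much bandwidth a target generation speed would require. The sketch below is illustrative only: the function name is hypothetical, and it inherits the rule's simplification that weight reads are the only cost per token.

```python
def required_bandwidth_gb_s(params_billions: float, bytes_per_param: float,
                            target_tok_s: float) -> float:
    """Memory bandwidth needed to reach a target tok/s under the rule of thumb."""
    weights_gb = params_billions * bytes_per_param
    return weights_gb * target_tok_s


# 70B parameters at ~0.5 bytes/param, as in the paragraph above.
for target in (7, 15, 28):
    need = required_bandwidth_gb_s(70, 0.5, target)
    print(f"{target:>2} tok/s on 70B q4 needs ~{need:.0f} GB/s")
```

Read this way, doubling generation speed on a 70B model means doubling memory bandwidth; adding capacity alone does nothing for tok/s.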
The source
The originating discussion is the r/LocalLLaMA / llama.cpp community thread asking for first-hand benchmarks of the Ryzen AI Max+ 395 / 128 GB Corsair platform. Early community responses are converging on the spec-delta math summarised above: the APU is a capacity buy, not a throughput buy. Builders considering the platform should look at their target model list first — if every model fits in 12–16 GB, the discrete-card path is dramatically better value; if any of them don't, the APU may be the only home-class answer.
Bottom line for home builders
- Buy the Ryzen AI Max+ 395 128 GB if you genuinely need 70B+ models locally with long contexts and want a single small box for the desk.
- Build the RTX 3060 12GB + Ryzen 7 5800X box if you're running sub-30B models (most buyers), care about cost-per-tok/s, or want the CUDA ecosystem for image-gen and fine-tuning.
- Buy two RTX 3060 12GB cards if you want 24 GB aggregate on a $1,200 budget: the strongest cost-per-throughput for any model that fits in the combined pool.
The market is segmented, not contested. Pick the tier that matches your model list, not the headline capacity number.
