Can a Ryzen AI Max+ 395 Mini-PC With 128GB Run Big LLMs? The r/LocalLLaMA Question

Unified-memory APUs vs discrete 12 GB GPUs — the capacity / bandwidth tradeoff in plain numbers.

Yes, the Ryzen AI Max+ 395 / 128GB loads 70B models a 12GB card can't — but pays for capacity in bandwidth. The honest math for home LLM builders.

In brief (2026-05-30): A new r/LocalLLaMA thread asks whether anyone has tested AMD's Ryzen AI Max+ 395 "Strix Halo" mini-PC with 128 GB of unified memory for hosting large LLMs. The short answer: yes, it can load 70B-class models that a 12 GB discrete card cannot touch, but it pays for that capacity in generation throughput, because its ~256 GB/s of LPDDR5X bandwidth is roughly 70% of the 360 GB/s on a discrete RTX 3060 12GB. It's capacity tech, not throughput tech: useful for a specific buyer, not the new default.

What happened

The r/LocalLLaMA community revived its interest in unified-memory APUs this week. The catalyst: a thread asking whether anyone has tested a Ryzen AI Max+ 395 / 128 GB Corsair desktop for LLM inference, and whether the unified-memory pool genuinely changes the home-LLM calculus that has settled around discrete 12 GB and 24 GB cards. The thread cycled through familiar arguments: capacity vs bandwidth, prefill vs generation, and whether a $4,000 mini-PC is a better home-LLM box than two used Ryzen 7 5800X + RTX 3060 12GB rigs side by side.

The Ryzen AI Max+ 395 platform sits in a specific niche. It's an APU — CPU, integrated GPU, and NPU sharing a single LPDDR5X memory pool — packaged for high-end mini-PCs and workstations. The headline 128 GB SKU configures up to 96 GB of that pool as GPU-addressable memory in some BIOSes, which is the part that matters for LLM loading: a 70B-q5 model that needs ~50 GB of weights plus a 30 GB KV cache fits, where the same model on a 12 GB discrete card cannot load at all without aggressive offload.

Per AMD's Ryzen AI Max product page, the platform pairs a Zen 5-class CPU with an RDNA-class iGPU and an XDNA NPU, sharing a unified LPDDR5X memory subsystem at up to 256 GB/s of effective bandwidth depending on channel configuration. That bandwidth is the second number that matters — and it's why the platform's capability is misread by buyers who only look at the capacity figure.

Why it matters: the bandwidth-vs-capacity trade

Token generation in a transformer is memory-bandwidth-bound. Producing each new token requires reading every weight in the model once; for a 70B-q4 model, that's ~40 GB of reads per token. Divide bandwidth by weight footprint and you get a hard upper bound on tok/s:

  • Ryzen AI Max+ 395 at ~256 GB/s, 70B-q4 model (~40 GB): ceiling ~6.4 tok/s. Real-world numbers from llama.cpp community testing land at 5–7 tok/s, consistent with the calculation.
  • RTX 3060 12GB at 360 GB/s, same 70B-q4 model: the model doesn't fit. CPU/disk offload crashes throughput to <1 tok/s.
  • RTX 3060 12GB, 8B-q4 model (~5 GB): ceiling ~72 tok/s. Real-world ~60–70 tok/s.

The math frames the question. If your target model fits in 12 GB, a discrete card is faster and dramatically cheaper. If your target model doesn't fit, the APU's unified pool is the only consumer-class answer. The platforms aren't competing for the same buyer — they're solving different problems.
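The ceiling arithmetic in the bullets above can be reproduced with a minimal sketch. The function name is hypothetical, and the bandwidth and footprint figures are the approximate ones quoted in this article, not measurements.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on generation speed: every weight is read once per token."""
    return bandwidth_gb_s / weights_gb


# Approximate figures quoted in the article: (label, GB/s, q4 weight footprint in GB).
cases = [
    ("Ryzen AI Max+ 395, 70B-q4", 256.0, 40.0),
    ("RTX 3060 12GB, 8B-q4", 360.0, 5.0),
]

for name, bandwidth, footprint in cases:
    ceiling = decode_ceiling_tok_s(bandwidth, footprint)
    print(f"{name}: ceiling ~{ceiling:.1f} tok/s")
```

This prints ceilings of ~6.4 and ~72.0 tok/s. Treat these as upper bounds only: real throughput lands below them because of compute, cache reads, and scheduling overhead that the division ignores.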

Spec-delta: APU vs discrete RTX 3060 12GB

| Metric | Ryzen AI Max+ 395 (128GB SKU) | RTX 3060 12GB |
| --- | --- | --- |
| Memory pool | 128 GB LPDDR5X (shared) | 12 GB GDDR6 (dedicated) |
| Effective bandwidth | ~256 GB/s | ~360 GB/s |
| Max model (q4) | ~190B params (theoretical), ~70B (practical) | ~22B params (theoretical), 14B (practical with ctx) |
| Generation tok/s, Llama 8B q4 | ~30–35 | ~60–70 |
| Generation tok/s, Qwen 32B q4 | ~12–15 | <5 (offload) |
| Generation tok/s, Llama 70B q4 | ~5–7 | does not fit |
| Prefill time (256-tok prompt, 8B model) | ~1.5 s | ~0.4 s |
| Idle / load draw | ~25 W / ~120 W | ~15 W / ~170 W |
| Approx. street price | $3,500–$4,200 (full box) | $300–$400 (card only) |

That last row is the part buyers often skip. The Ryzen AI Max+ 395 is a complete mini-PC; the RTX 3060 12GB is a card you bolt into an existing AM4 box that costs another $400–$600 to build. The total-system gap is roughly 4–5×, not 10×, but it's still a $3,000 decision.

When the APU is the right buy

The Ryzen AI Max+ 395 128 GB is the right buy when:

  • Your primary models are 30B–70B class and you want them resident in fast memory, not on disk offload.
  • You want a single small-form-factor box for the desk, not a tower with separate GPU.
  • You'll run long-context (32K+) inference where the KV cache alone exceeds discrete-card VRAM.
  • Power-and-noise budgets favor a 120 W APU over a 170 W discrete card in a 350 W tower.
  • You need both the NPU and iGPU for non-LLM AI workloads (Stable Diffusion, real-time transcription, vision models) sharing one pool.

For those buyers, the platform has no direct consumer competitor. The next step up is a $5,000+ Threadripper + RTX A6000 build with markedly more performance but at a different price tier entirely.

Prefill is the second number that matters

Generation throughput gets the headlines; prefill (the time to digest the input prompt before the first new token) is the part that decides whether interactive use feels live. Prefill is compute-bound and parallel, the opposite of generation. The Ryzen AI Max+ 395's iGPU runs prefill at a fraction of what a discrete CUDA GPU manages — community measurements on 8B-class models show 1.5–2 second prefill on a 256-token prompt for the APU, versus under half a second for the RTX 3060 12GB.

That gap widens with prompt length. A 4,000-token system prompt + chat history takes 8–12 seconds to prefill on the APU and ~1.5 seconds on the discrete card. For chat-style turns, the discrete card feels snappy; the APU feels deliberative. For agentic workloads that feed multi-thousand-token contexts on every turn (logs, file diffs, error traces), prefill alone can break the interaction loop on the APU side.
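To see how prefill and generation combine into what a user actually waits for, here is a rough sketch of per-turn latency. It assumes the 4,000-token prefill figures above refer to the same 8B-class model as the generation numbers, uses midpoints of the quoted ranges, and picks a 200-token reply as an arbitrary illustration; the function name is hypothetical.

```python
def turn_latency_s(prefill_s: float, reply_tokens: int, gen_tok_s: float) -> float:
    """Time until a full reply is on screen: prefill, then token-by-token generation."""
    return prefill_s + reply_tokens / gen_tok_s


REPLY_TOKENS = 200  # assumed reply length, not a figure from the article

# Midpoints of the article's ranges for a 4,000-token prompt on an 8B-class model.
apu = turn_latency_s(prefill_s=10.0, reply_tokens=REPLY_TOKENS, gen_tok_s=32.5)
gpu = turn_latency_s(prefill_s=1.5, reply_tokens=REPLY_TOKENS, gen_tok_s=65.0)

print(f"APU: {apu:.1f} s, RTX 3060 12GB: {gpu:.1f} s")
```

Under these assumptions the APU turn takes about 16.2 s against about 4.6 s on the discrete card, and most of the APU's time is spent before the first token appears.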

This is why "the APU can load 70B" doesn't translate to "the APU should be your home-LLM box." Loading the model is the easy half; interactive use depends on both bandwidth-bound generation and compute-bound prefill, and the discrete card wins both on any model that fits in 12 GB of VRAM.

Context length is where the APU's pool actually pays off

The KV cache for transformer attention scales with sequence length and is per-layer. A 70B model at 128K context needs 30–50 GB of KV cache on top of the weights themselves, even with the grouped-query attention that Llama-class 70B models use; full multi-head attention would need several times more. The 12 GB card is hopeless here; even a 32B model at 32K context evicts to system RAM and crawls.

The APU's unified pool genuinely shines on long-context workloads. A 70B q5 model needs ~50 GB for weights and ~40 GB more for a 128K KV cache, total ~90 GB. That fits in a 128 GB box, though it leaves little headroom if the BIOS caps GPU-addressable memory at 96 GB. No discrete consumer GPU under $5,000 in 2026 can do that without offload. For research workloads that genuinely need long context (RAG over long docs, agentic chains with multi-thousand-token system prompts), this is the APU's strongest case.
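The ~40 GB KV-cache figure can be sanity-checked with a short sketch. The model shape below (80 layers, 8 KV heads, head dimension 128, 16-bit cache entries) is an assumption typical of Llama-style 70B models with grouped-query attention; it is not taken from the thread, and other architectures will give different numbers.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: int = 2) -> int:
    """Keys and values (the factor of 2) for every layer, KV head, and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Assumed shape: Llama-style 70B with grouped-query attention, fp16 cache.
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_tokens=128 * 1024)

print(f"KV cache at 128K context: {size / 1e9:.1f} GB")
```

With these assumptions the cache comes to about 42.9 GB, in line with the ~40 GB quoted above. Quantizing the cache to 8 bits (bytes_per_value=1) would halve it.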

If your workload is short-context (4K–16K tokens per turn, which covers most chat and code completion use), the unified pool is capacity you're paying for but not using. If your workload is long-context RAG, multi-document summarization, or any agentic chain that holds large state in the prompt, the unified pool is the only consumer-class answer.

What community testing actually shows

Threads on r/LocalLLaMA, llama.cpp's GitHub discussions, and AnandTech reader forums converge on a handful of consistent numbers:

  • Llama 3.1 8B q4_K_M generation: ~30–35 tok/s on the APU, ~60–70 tok/s on the discrete 3060 12GB. Discrete wins 2×.
  • Qwen 32B q4_K_M generation: ~13–15 tok/s on the APU; the discrete card can't fit it natively and crashes to <5 tok/s under offload. APU wins ~3×.
  • Llama 70B q4_K_M generation: ~5–7 tok/s on the APU; the discrete card can't load it. APU wins by default.
  • Llama 70B q5_K_M with 32K context: 5–6 tok/s on the APU, impossible on a 12 GB card. APU wins by default.

The break-even is around 22–32B class models. Above that, the discrete card is forced into offload that ruins throughput; below that, the discrete card runs the same model faster and cheaper.
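The break-even can be illustrated with a simple fit check. This is a sketch under stated assumptions: a nominal 0.5 bytes per parameter for q4, a 2 GB headroom for KV cache and buffers that is a guess rather than a measured value, and the 96 GB GPU-addressable figure mentioned earlier. The function names are hypothetical.

```python
def weights_gb(params_billion: float, bytes_per_param: float = 0.5) -> float:
    """Nominal q4 footprint: about half a byte per parameter."""
    return params_billion * bytes_per_param


def fits(params_billion: float, memory_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if weights plus an assumed headroom for KV cache and buffers fit."""
    return weights_gb(params_billion) + headroom_gb <= memory_gb


for size in (8, 14, 32, 70):
    on_card = fits(size, memory_gb=12)
    on_apu = fits(size, memory_gb=96)
    print(f"{size}B q4: 12 GB card={on_card}, 96 GB APU pool={on_apu}")
```

The 8B and 14B models fit on the 12 GB card; 32B and 70B fit only in the APU's pool, which is where the throughput comparison flips.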

When the discrete RTX 3060 12GB is still the smarter spend

For everyone else — which is most home LLM builders in 2026 — the discrete card path is faster and cheaper:

  • 8B–14B models comfortably fit in 12 GB; the RTX 3060 12GB is 2× faster than the APU on every one of them.
  • Interactive chat with short prompts wants snappy prefill, which the discrete card's CUDA cores deliver in a fraction of the APU's time.
  • The full system (12 GB card + Ryzen 7 5800X + 32 GB DDR4 + WD Blue SN550 1TB NVMe) lands at $700–$900, roughly a fifth to a quarter of the APU's price.
  • The CUDA ecosystem for image generation, fine-tuning, and adjacent ML tooling is still years ahead of the AMD ROCm/HIP equivalent. If you'll do anything beyond chat-style LLM inference, NVIDIA's tooling matters.

For sub-30B work, two RTX 3060 12GB cards on a B550 motherboard ($1,200 total system) sharded via vLLM or llama.cpp row-split give a combined 24 GB of VRAM with up to 720 GB/s of aggregate bandwidth when both cards read their shards in parallel. That is far past the APU's, at roughly a third of the cost.
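One way to frame the value question is cost per gigabyte of model memory versus cost per GB/s of bandwidth. The sketch below uses midpoints of the price ranges quoted in this article and treats the dual-card 720 GB/s as an ideal aggregate, which is an optimistic assumption.

```python
def usd_per(price: float, quantity: float) -> float:
    """Price divided by a capacity or bandwidth figure."""
    return price / quantity


systems = {
    # name: (approx. total system price in USD, memory in GB, bandwidth in GB/s)
    "Ryzen AI Max+ 395 128GB": (3850, 128, 256),  # midpoint of $3,500-$4,200
    "Single RTX 3060 12GB box": (800, 12, 360),   # midpoint of $700-$900
    "Dual RTX 3060 12GB box": (1200, 24, 720),    # ideal aggregate bandwidth
}

for name, (price, mem_gb, bw_gb_s) in systems.items():
    per_gb = usd_per(price, mem_gb)
    per_bw = usd_per(price, bw_gb_s)
    print(f"{name}: ${per_gb:.0f} per GB, ${per_bw:.2f} per GB/s")
```

On these numbers the APU is the cheapest per gigabyte (about $30) and by far the most expensive per GB/s (about $15.04, against $1.67 for the dual-card box), which is the capacity-versus-throughput split in price terms.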

The bandwidth gap is the real story

The unified-memory pitch is "more memory equals more model." That's true in capacity terms and misleading in throughput terms. LPDDR5X at the APU's configured speeds delivers ~256 GB/s of effective bandwidth; GDDR6 on a midrange discrete card delivers 360 GB/s; GDDR6X on RTX 4070/4080 delivers 500+ GB/s; HBM3 on enterprise cards (H100, MI3xx) delivers 3–5 TB/s. The further up the bandwidth tier you go, the faster the same model generates tokens.

Per the llama.cpp discussions, the rule of thumb is: tok/s = bandwidth ÷ (parameter count × bytes per parameter). A 70B model at a nominal q4 (~0.5 bytes/param × 70B = 35 GB) on 256 GB/s of bandwidth ceilings at ~7 tok/s; real q4_K_M files run closer to 40 GB, which is where the ~6.4 tok/s ceiling quoted earlier comes from. The same 35 GB model on a hypothetical 1 TB/s card ceilings at ~29 tok/s. The APU's capacity advantage is real; its bandwidth ceiling is just as real.
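The same rule of thumb can be inverted to ask how much bandwidth a target speed requires. This is a sketch using the nominal 0.5 bytes per parameter from the paragraph above; the function name and the target speeds are illustrative choices.

```python
def required_bandwidth_gb_s(target_tok_s: float, params_billion: float,
                            bytes_per_param: float = 0.5) -> float:
    """Invert tok/s = bandwidth / weight footprint to get the bandwidth needed."""
    return target_tok_s * params_billion * bytes_per_param


for target in (7, 15, 30):
    need = required_bandwidth_gb_s(target, params_billion=70)
    print(f"{target} tok/s on 70B q4 needs at least ~{need:.0f} GB/s")
```

Reaching 15 tok/s on a 70B q4 model would take at least ~525 GB/s, and 30 tok/s at least ~1,050 GB/s, which is why no amount of extra capacity moves the APU past single-digit speeds at that model size.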

The source

The originating discussion is the r/LocalLLaMA / llama.cpp community thread asking for first-hand benchmarks of the Ryzen AI Max+ 395 / 128 GB Corsair platform. Early community responses are converging on the spec-delta math summarised above: the APU is a capacity buy, not a throughput buy. Builders considering the platform should look at their target model list first — if every model fits in 12–16 GB, the discrete-card path is dramatically better value; if any of them don't, the APU may be the only home-class answer.

Bottom line for home builders

  • Buy the Ryzen AI Max+ 395 128 GB if you genuinely need 70B+ models locally with long contexts and want a single small box for the desk.
  • Build the RTX 3060 12GB + Ryzen 7 5800X box if you're running sub-30B models (most buyers), care about cost-per-tok/s, or want the CUDA ecosystem for image-gen and fine-tuning.
  • Buy two RTX 3060 12GB cards if you want 24 GB aggregate on a $1,200 budget: the strongest cost-per-throughput for any model that fits in the combined pool.

The market is segmented, not contested. Pick the tier that matches your model list, not the headline capacity number.

Frequently asked questions

Does 128 GB of unified memory replace a discrete GPU for local LLMs?
For capacity, yes: a 128 GB pool fits 70B-class models that a 12 GB discrete card cannot. For throughput, no: LPDDR5X bandwidth (~256 GB/s) is roughly 70% of GDDR6 on the RTX 3060 12GB (360 GB/s), so tokens per second on the APU run slower than on the discrete card for every model both can run. The platforms aren't competing for the same buyer; they're solving different problems.
What's the realistic tok/s for a 70B model on the Ryzen AI Max+ 395?
Roughly 5–7 tok/s on a 70B-q4 model with 256 GB/s of effective LPDDR5X bandwidth. This matches the theoretical ceiling (bandwidth divided by the ~40 GB of weights read per token). Throughput rises as model size falls: ~13–15 tok/s on Qwen 32B q4, ~30–35 tok/s on Llama 8B q4. The 5–7 tok/s on 70B is the platform's selling point: usable for batch work and slow interactive use, well below comfortable real-time chat.
Is it cheaper to buy an APU mini-PC or build a dual-RTX-3060 12GB box?
Two RTX 3060 12GB cards on a B550 motherboard with a Ryzen 7 5800X land around $1,200 total system; the Ryzen AI Max+ 395 128 GB mini-PC lands around $3,500–$4,200. The dual-3060 box has 24 GB combined VRAM with 720 GB/s aggregate bandwidth — far faster on any model that fits in the combined pool. The APU only wins when you genuinely need 70B+ models, which most home users don't run daily.
What about prefill speed on the APU vs a discrete GPU?
Prefill is compute-bound and parallel; generation is memory-bound and serial. The APU's iGPU runs prefill noticeably slower than a discrete CUDA card — 1.5–2 seconds on a 256-token prompt for the APU vs <0.5 second on the RTX 3060 12GB. For agent workloads that feed multi-thousand-token contexts on every turn, prefill alone can break interactive feel on the APU side.
Should I wait for benchmarks before buying the Ryzen AI Max+ 395 platform?
Yes, if you're price-sensitive. Community benchmarks are still converging, and the early numbers fit theoretical predictions (bandwidth ÷ model size). If you genuinely need 70B+ local inference, the platform has no consumer-class competitor at the price; if you're at all uncertain, a $700–$900 RTX 3060 12GB build covers 90% of home use cases at roughly a fifth to a quarter of the cost.

— SpecPicks Editorial · Last verified 2026-05-30
