For running local LLMs as of 2026, the Ryzen AI Max+ 395 with 128GB of unified memory wins whenever the model is too big to fit in 24GB — it loads 70B-class weights in a single pool that a dual RTX 3060 rig physically cannot hold. For models that fit in 24GB, two RTX 3060 12GB cards generate tokens faster thanks to much higher memory bandwidth. Pick the APU for capacity, the GPUs for speed.
The budget-LLM crossroads: one big memory pool vs two cheap GPUs
Local-LLM builders keep arriving at the same fork in the road, and a recent r/LocalLLaMA thread captured it perfectly: "Corsair desktop PC with Ryzen 395 and 128GB unified RAM — has anyone tested it for LLM?" On one side sits the Ryzen AI Max+ 395, an APU whose integrated GPU can address an enormous slice of 128GB of LPDDR5X. On the other side sits the build that has carried hobbyist inference for two years: a pair of RTX 3060 12GB cards, 24GB of GDDR6 total, bought used for not much money.
These two rigs are not the same tool. The APU is a capacity play: it trades bandwidth for the ability to hold models that simply will not fit anywhere else at this price. The dual-GPU rig is a bandwidth play: within the 24GB it can address, it moves data faster, so it generates tokens quicker on anything that fits. The mistake most buyers make is comparing them on a single number. There is no single number. There is the model you want to run, the context length you push, and the watts you are willing to burn 24/7.
This guide settles the question workload by workload. We cover what each platform can fit, what public benchmarks show for 7B, 27B, and 70B models, how quantization changes the math, why prefill and generation behave differently on a unified pool than on discrete cards, what happens as context scales toward 128k tokens, and whether that second 3060 actually pays for itself. By the end you will know which rig to buy for the model you actually intend to run.
Key takeaways
- Capacity king: The Ryzen AI Max+ 395 can allocate a large share of its 128GB unified memory to the iGPU, hosting 70B models in q4 entirely in memory. A dual RTX 3060 rig caps at 24GB and cannot.
- Bandwidth king: Two RTX 3060 cards deliver far higher memory bandwidth (360 GB/s each) than the APU's shared LPDDR5X pool, so they win tokens-per-second on any model that fits in 24GB.
- The second GPU adds VRAM, not 2x speed: Tensor-parallel scaling on a PCIe link lands around 1.4–1.7x, not 2x. Buy the second card to fit bigger models, not to double throughput.
- Power matters for always-on: A dual-3060 inference session can pull 400W+ at the wall; the APU platform sips far less, which is the real argument for an always-on assistant box.
- Quantization decides the fit: A 70B model needs ~40GB at q4_K_M — out of reach for 24GB split VRAM, comfortable for the unified pool.
How much model can each platform fit?
The fit question is the whole ballgame, because a model that does not fit either crawls at CPU-offload speeds (a few tokens per second at best) or does not run at all. The dual RTX 3060 rig has 24GB of VRAM, but it is split 12GB + 12GB across two devices. Multi-GPU runtimes such as vLLM (tensor parallelism) and llama.cpp (its layer and row split modes) can shard a model across both cards, so the practical ceiling is close to the combined 24GB minus overhead. Each card still has to hold its own shard, and no single tensor can exceed what one card holds without spilling. The APU presents a single contiguous pool, so there is no sharding overhead and no per-device ceiling.
| Platform | Usable memory for weights | Largest model in q4_K_M | Fits 70B q4 in memory? |
|---|---|---|---|
| Ryzen AI Max+ 395 (128GB unified) | ~96GB allocatable to iGPU | 70B+ comfortably | Yes |
| Dual RTX 3060 12GB | ~22GB after KV/overhead | ~27B–34B | No (needs offload) |
| Single RTX 3060 12GB | ~10.5GB after KV/overhead | ~13B | No |
The takeaway is stark: for anything up to a 27B-class model in q4, both rigs are viable. The moment you want a 70B model resident in memory, the dual-3060 build is out and the unified-memory APU is the only option in this price bracket that runs it without grinding offload.
What token throughput do public benchmarks show?
Throughput is where the discrete GPUs reassert themselves. Memory bandwidth is the dominant factor for autoregressive generation, and GDDR6 on a 3060 moves data at roughly 360 GB/s per card versus the far lower effective bandwidth of a shared LPDDR5X pool. Independent testing of the Ryzen AI Max platform by Phoronix and community llama.cpp runs converge on the same shape: the APU is usable, not fast.
| Model (q4_K_M) | Dual RTX 3060 (tok/s) | Ryzen AI Max+ 395 (tok/s) |
|---|---|---|
| 7B | 55–75 | 18–28 |
| 27B | 18–26 | 8–14 |
| 70B | does not fit (offload: 2–4) | 4–8 |
Read this table as two different jobs. On a 7B model both rigs are interactive, but the GPUs feel snappier. On a 27B model the GPUs still lead comfortably. On a 70B model the comparison inverts: the dual-3060 rig can only run it via slow CPU offload at a few tokens per second, while the APU holds it in memory and sustains a usable, if modest, rate. The "winner" flips entirely based on model size.
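The bandwidth argument behind these numbers can be sketched as a back-of-envelope estimate: if generating one token means reading every weight once, tokens per second cannot exceed bandwidth divided by model size. The snippet below is a rough sketch under that assumption. The 360 GB/s figure is the one quoted in this guide; the 256 GB/s APU figure and the per-model weight sizes are assumptions added for illustration, and the function name is hypothetical.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on generation speed, assuming each new token
    requires reading every weight from memory exactly once."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


RTX_3060_GB_S = 360.0      # per card, the figure quoted in this guide
APU_GB_S = 256.0           # ASSUMPTION: theoretical peak of the LPDDR5X pool
DUAL_3060_VRAM_GB = 24.0   # combined VRAM of the two cards

# ASSUMPTION: rough q4_K_M weight sizes in GB, for illustration only
WEIGHTS_GB = {"7B": 4.0, "27B": 16.0, "70B": 40.0}

for name, gb in WEIGHTS_GB.items():
    apu = decode_ceiling_tok_s(APU_GB_S, gb)
    if gb > DUAL_3060_VRAM_GB:
        gpu_text = "does not fit in 24GB"
    else:
        # Single-card bandwidth: a conservative choice for a layer-split setup
        gpu_text = f"<= {decode_ceiling_tok_s(RTX_3060_GB_S, gb):.1f} tok/s"
    print(f"{name}: 3060 {gpu_text}, APU <= {apu:.1f} tok/s")
```

The estimate ignores KV-cache reads, compute limits, and synchronization, so it is a sanity check on the shape of the results, not a prediction. Measured numbers depend on the runtime, the quant, and the exact model.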
Quantization matrix: VRAM, speed, and quality per format
Quantization is the lever that decides whether a model fits at all, and how much quality you trade to get there. The rough memory footprint for a 70B model and the practical posture of each rig:
| Quant | ~VRAM for 70B | Quality loss | Dual 3060 (24GB) | Ryzen 395 (unified) |
|---|---|---|---|---|
| q2_K | ~26GB | High | Borderline / offload | Fits, fast-ish |
| q3_K_M | ~31GB | Noticeable | Offload | Fits |
| q4_K_M | ~40GB | Low (recommended) | Offload | Fits |
| q5_K_M | ~47GB | Very low | Offload | Fits |
| q6_K | ~55GB | Near-lossless | No | Fits |
| q8_0 | ~70GB | Negligible | No | Fits |
| fp16 | ~140GB | Reference | No | No |
For the dual-3060 rig the realistic sweet spot is a 27B model at q4_K_M or q5_K_M, which fits in 24GB and runs fast. For the APU the sweet spot is a 70B model at q4_K_M — the largest format that keeps quality high while staying comfortably inside the memory pool. fp16 of a 70B model is off the table for both; you would need ~140GB.
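The fit column in the matrix above comes down to simple arithmetic: parameters times bits per weight, divided by eight. Here is a minimal sketch of that check. The bits-per-weight values are assumptions back-derived from the approximate sizes in the table (real GGUF files vary by model), the memory budgets are the "usable" figures from the fit table earlier, and the function names are hypothetical.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


# ASSUMPTION: effective bits per weight, back-derived from the table above
BITS_PER_WEIGHT = {"q2_K": 3.0, "q4_K_M": 4.6, "q8_0": 8.0, "fp16": 16.0}

# Usable memory for weights, taken from the fit table in this guide
BUDGETS_GB = {"Dual RTX 3060": 22.0, "Ryzen AI Max+ 395": 96.0}


def fits(params_billions: float, quant: str) -> dict:
    """Return, per platform, whether the weights fit in the usable budget."""
    need = weights_gb(params_billions, BITS_PER_WEIGHT[quant])
    return {name: need <= budget for name, budget in BUDGETS_GB.items()}


for size, quant in [(27, "q4_K_M"), (70, "q4_K_M"), (70, "fp16")]:
    print(f"{size}B {quant}: {weights_gb(size, BITS_PER_WEIGHT[quant]):.1f} GB",
          fits(size, quant))
```

This counts weights only. The KV cache and runtime buffers come on top, which is why the budgets used here are lower than the raw 24GB and 128GB figures.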
Prefill vs generation: pool versus discrete cards
Two phases dominate inference, and the two rigs handle them differently. Prefill (processing your prompt) is compute-heavy and parallel — it loves raw FLOPS and high bandwidth, which favors the discrete GPUs. Generation (producing each new token) is memory-bandwidth-bound and sequential, again favoring GDDR6 bandwidth. The APU's advantage is not speed in either phase; it is that the data never has to cross a PCIe boundary or get sharded, so there is no inter-device synchronization cost. On the dual-3060 rig, tensor-parallel generation pays a communication tax every layer as partial results move across PCIe. On a small model that tax is invisible; on a sharded large model it erodes the bandwidth advantage. The practical result: the GPUs win prefill decisively, win generation on models that fit one card, and narrow their lead on models that must be sharded across both.
What happens as context scales from 4k to 128k?
Context length is the quiet killer of throughput because the KV cache grows linearly with tokens and consumes both memory and bandwidth. On a 12GB card, a long context for a 13B model can claim several gigabytes of KV cache, squeezing the weights and forcing smaller batches. On the dual-3060 rig, pushing context toward 32k–64k tokens both raises memory pressure and slows generation as the attention step reads an ever-larger cache each token. The APU's huge pool absorbs the KV-cache growth: with ~96GB to spend, 128k context on a 70B model still fits next to q4 weights, although at that length an unquantized cache can itself run to tens of gigabytes. The underlying bandwidth ceiling remains, so long-context generation still crawls relative to the GPUs at short context. In short: the GPUs are fastest at short context, the APU is the only one that survives extreme context without offload, and both slow down as context grows.
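The linear growth of the KV cache is easy to put numbers on. The sketch below estimates cache size from the model shape: one key and one value vector per layer, per KV head, per token. The shape used here (80 layers, 8 KV heads, head dimension 128) is an assumption modeled on a Llama-3-70B-style architecture with grouped-query attention, and the fp16 cache (2 bytes per value) is also an assumption. Other models and quantized caches will give different numbers.

```python
def kv_cache_gib(n_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GiB for a single sequence."""
    # Factor 2: one key tensor plus one value tensor per layer
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * n_tokens / 2**30


# ASSUMPTION: Llama-3-70B-like shape with grouped-query attention
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

for ctx in (4_096, 32_768, 131_072):
    size = kv_cache_gib(ctx, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{ctx:>7} tokens: {size:.2f} GiB")
# Under these assumptions: 1.25, 10.00 and 40.00 GiB
```

Under these assumptions a 128k-token cache is about as large as the q4 weights themselves, so it fits in the unified pool but is far beyond what 24GB of split VRAM can hold alongside any sizable model.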
Does a second RTX 3060 actually double throughput?
No, and this is the most common budget-build misconception. Adding a second RTX 3060 roughly doubles available VRAM, which is genuinely valuable — it is what lets you step from a 13B model to a 27B model. But tokens-per-second does not double. Tensor-parallel inference splits each layer across the two cards and must synchronize partial results over the PCIe bus every step. Real-world scaling on consumer PCIe lands around 1.4–1.7x for generation, occasionally lower if the cards sit on x4 electrical slots. The honest framing: buy the second 3060 to fit a bigger model, not to make a model you already run go twice as fast.
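One way to see why the scaling falls short of 2x is a toy model: splitting the work across cards divides the per-token time by the number of GPUs, but a fixed synchronization cost is added back on every token. The sketch below is only that toy model, not a description of how any particular runtime behaves. The `comm_fraction` parameter (sync cost as a fraction of the single-GPU token time) and both function names are hypothetical.

```python
def tp_speedup(n_gpus: int, comm_fraction: float) -> float:
    """Speedup over one GPU when work splits evenly across n_gpus but each
    token also pays a sync cost of comm_fraction * (single-GPU token time)."""
    return 1.0 / (1.0 / n_gpus + comm_fraction)


def implied_comm_fraction(n_gpus: int, observed_speedup: float) -> float:
    """Invert the toy model: the sync overhead an observed speedup implies."""
    return 1.0 / observed_speedup - 1.0 / n_gpus


print(tp_speedup(2, 0.0))   # ideal case with no sync cost: 2.0
for observed in (1.4, 1.7):
    overhead = implied_comm_fraction(2, observed)
    print(f"{observed}x on two cards implies ~{overhead:.0%} sync overhead")
```

In this model the 1.4–1.7x range quoted above corresponds to sync overhead of roughly 9% to 21% of the single-card token time, which gives a feel for how quickly a slow PCIe link eats into the gain.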
Perf-per-dollar and perf-per-watt
The money math depends on street prices the day you buy, but the structure is stable. Used RTX 3060 12GB cards are cheap, so a dual-3060 rig is often the lowest-cost path to 24GB and the best tokens-per-dollar for models up to 27B. The APU platform costs more upfront and delivers fewer tokens per second, so on a small model it loses on perf-per-dollar. On a 70B model the dual-3060 rig cannot hold the weights in VRAM at all, so the APU wins by default: the only alternative is crawling along on CPU offload.
Power is the APU's strongest argument. Two RTX 3060 cards draw roughly 170W each under load, and with system overhead a sustained session pulls 400W or more at the wall, generating heat and fan noise in a room you may sleep next to. The Ryzen AI Max+ 395 platform targets far lower total board power, which makes it the saner choice for an always-on assistant that idles most of the day and answers occasional queries. For perf-per-watt on a 24/7 box, the APU wins; for raw perf-per-watt during active heavy generation on a fitting model, the GPUs are competitive because they finish the work faster.
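The two perf-per-watt claims above use different yardsticks, and a short calculation keeps them apart: energy per token during active generation, and total energy per day for a box that mostly idles. The sketch below shows both. Only the 400W dual-3060 figure comes from this guide. The APU wattage, both idle wattages, the token rates, and the two hours of daily use are hypothetical placeholders to replace with your own wall-meter readings.

```python
def joules_per_token(wall_watts: float, tok_per_s: float) -> float:
    """Energy spent per generated token during active generation."""
    return wall_watts / tok_per_s


def daily_kwh(active_watts: float, idle_watts: float,
              active_hours: float) -> float:
    """Energy per day for a box that is active part of the day, idle the rest."""
    idle_hours = 24.0 - active_hours
    return (active_watts * active_hours + idle_watts * idle_hours) / 1000.0


# 400W is the dual-3060 wall figure from this guide; everything else below
# is a HYPOTHETICAL placeholder, not a measurement.
rigs = {
    "Dual RTX 3060": {"active_w": 400.0, "idle_w": 60.0, "tok_s": 22.0},
    "Ryzen AI Max+ 395": {"active_w": 120.0, "idle_w": 15.0, "tok_s": 11.0},
}

for name, r in rigs.items():
    jpt = joules_per_token(r["active_w"], r["tok_s"])
    kwh = daily_kwh(r["active_w"], r["idle_w"], active_hours=2.0)
    print(f"{name}: {jpt:.1f} J/token active, {kwh:.2f} kWh/day")
```

For an always-on assistant the daily figure is dominated by idle draw, which is why idle wattage deserves as much attention as the load number when you compare the two rigs.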
Common pitfalls when choosing between these rigs
- Comparing on a single number: There is no one metric. The APU wins capacity, the GPUs win bandwidth; the right answer depends on the model size you actually run.
- Assuming two GPUs double speed: They roughly double VRAM, not tokens per second. Tensor-parallel scaling lands near 1.4–1.7x.
- Ignoring memory bandwidth on the APU: The 128GB pool is huge but its bandwidth is far below GDDR6, so large-model generation is usable, not fast.
- Underestimating power for an always-on box: A dual-3060 rig pulling 400W+ around the clock adds up in heat, noise, and electricity the APU largely avoids.
- Forgetting PCIe slot width: Two cards on x4 electrical slots scale worse than on x8/x16 — check your board before assuming full multi-GPU throughput.
- Buying for a model you won't run: If you never touch 70B, the APU's headroom is wasted; if you live in 70B, the dual-3060 rig simply cannot do the job.
Bottom line: which budget rig wins for which workload
- You run 7B–27B models and want speed: Buy the dual RTX 3060 12GB rig. Higher bandwidth means more tokens per second and better tokens-per-dollar.
- You want a 70B model resident in memory on a budget: Buy the Ryzen AI Max+ 395. It is the only option here that holds 70B in q4 without crippling offload.
- You want an always-on, low-power assistant: Buy the APU. For a 24/7 box, idle and sustained wattage matter more than peak throughput.
- You are unsure and run mostly mid-size models: Start with a single RTX 3060 12GB and add the second card when a model you want exceeds 12GB. Pair it with a strong CPU like the AMD Ryzen 7 5800X or the efficient Ryzen 7 5700X to keep prefill and data loading snappy.
Related guides
- Best GPU for Llama 70B at home: RTX 3060 stack vs workstation
- AMD Ryzen AI Max+ 400 "Gorgon Halo" 192GB unified memory for local LLMs
- Gemini 3.5 Flash vs RTX 3060 12GB local inference
- Best AM4 build for local LLM inference
- How much system RAM for Llama 70B on an RTX 3060
Citations and sources
- AMD Ryzen AI Max product page — unified-memory capacity and platform specs.
- TechPowerUp GeForce RTX 3060 specifications — VRAM, memory bandwidth, and TGP figures.
- Phoronix — AMD Ryzen AI Max review — independent inference and platform-power measurements.
