Does Ryzen 3D V-Cache Speed Up CPU-Only LLM Inference?

Phoronix's 9950X3D2 numbers, decoded for builders weighing cache vs cores

3D V-Cache helps CPU LLM inference 5-15% — meaningfully, but never enough to beat a discrete GPU. Where the cache wins and where to spend instead.

Yes — but only at the margins. 3D V-Cache helps CPU-only LLM inference by reducing pressure on main memory for prefill and small-batch generation, which can lift tok/s by roughly 5–15% over a non-cache part at the same core count. It does not change the fundamental bottleneck: CPU inference is memory-bandwidth-bound at the weight-streaming step, and no amount of L3 hides that. For most home builders, a used Ryzen 7 5800X with fast dual-channel DDR4 is a smarter inference CPU than a flagship X3D chip — and a $260 RTX 3060 12GB still beats both.

Why this is suddenly a question again

Phoronix's Linux-vs-Windows benchmarks on AMD's Ryzen 9 9950X3D2 set off a wave of "should I just run LLMs on the CPU now?" posts on r/LocalLLaMA this spring. The 9950X3D2's stacked L3 is a real architectural improvement; the part also happens to be the most expensive AM5 desktop chip on shelves. A natural question follows: if 3D V-Cache helps gaming so dramatically, does it help llama.cpp at all?

The honest answer is "a little, in specific places." This piece walks through where the cache helps, where it doesn't, and how to decide between a $700 X3D flagship, a $230 used 5800X, a $190 5700X, and a $140 5600G if your bottleneck is CPU-only inference. We'll also be clear about the moment to stop fighting CPU inference altogether and put the budget on a 12GB GPU.

Key takeaways

  • Bandwidth is the gate. CPU inference tok/s tracks main-memory bandwidth more closely than core count, clock, or cache.
  • 3D V-Cache helps prefill and small-batch decode by improving hit rates on weight-tile reuse — typical wins are 5–15%, not 2×.
  • More cores ≠ more tok/s beyond a single-socket sweet spot of 6–8 cores for an 8B q4 model. The 9950X3D2's 16 cores are wasted on most home workloads.
  • A used 5800X is the practical CPU-inference value pick on AM4 in 2026: 8 cores, dual-channel DDR4-3600 sweet spot, used at ~$230.
  • Switch to a GPU at 13B+. Even the cheapest discrete CUDA card beats any AM5 X3D part on per-token throughput once VRAM is sufficient.

Why CPU LLM inference is memory-bandwidth-bound

For autoregressive generation, the decode step reads the full set of model weights from memory for each token. A 7B model at q4_K_M is ~4.5GB; a 32B at q4_K_M is ~19GB. The CPU's job is to multiply tiles of those weights against the current activations and emit the next-token logits. On a modern desktop CPU, the compute fits comfortably in 8–12 cores; streaming the weights is what sets the pace.

Dual-channel DDR4-3600 delivers around 50 GB/s of effective bandwidth. Dual-channel DDR5-6000 is roughly 75 GB/s. Dividing that bandwidth by the ~4.5GB of weights read per token gives a theoretical decode ceiling of about 11 tok/s on DDR4 and 17 tok/s on DDR5 for a 7B q4 model, and real implementations hit roughly 60–70% of that ceiling because of compute overhead and cache misses. Each layer's loads also depend on the previous layer's output, so the fetches cannot all be overlapped, and the bottleneck never moves: the CPU asks for the weights, and the memory subsystem delivers them more slowly than the CPU can consume them. More cores don't help if they're all waiting on the same DDR pipes.
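The ceiling arithmetic in this section can be sketched in a few lines of Python. This is a back-of-envelope model, not a benchmark: it assumes every generated token streams the full weights from DRAM exactly once, and it reuses the article's effective-bandwidth and 60–70% efficiency figures. The function names are illustrative.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / model_size_gb


def realistic_range(ceiling: float, low: float = 0.60, high: float = 0.70):
    """Scale the ceiling by the article's rough 60-70% efficiency band."""
    return ceiling * low, ceiling * high


MODEL_GB = 4.5  # 7B model at q4_K_M, per the article

for label, bandwidth in (
    ("dual-channel DDR4-3600", 50.0),  # effective GB/s, per the article
    ("dual-channel DDR5-6000", 75.0),
):
    ceiling = decode_ceiling_tok_s(bandwidth, MODEL_GB)
    low, high = realistic_range(ceiling)
    print(f"{label}: ceiling {ceiling:.1f} tok/s, "
          f"realistic {low:.1f} to {high:.1f} tok/s")
```

Swapping in a larger model size shows how quickly the ceiling falls: core count never appears in the formula, only bandwidth and bytes per token.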

This is also why dual-channel vs quad-channel matters more than the CPU chip choice for inference work. Threadripper or Xeon-class platforms with quad-channel DDR can double the effective bandwidth and meaningfully change the math; consumer AM4 and AM5 desktops are stuck with two channels and pay for it.

Where does cache actually matter?

3D V-Cache stacks an extra 64MB (per CCD) of L3 on top of the die, lifting per-CCD L3 from 32MB to 96MB. That is a lot of cache compared to non-X3D parts, but model weights at 4.5GB–19GB blow past any L3, no matter how generous. So how can cache help at all?

Two places. First, weight-tile reuse during prefill: when you process a long input prompt, the same weight tiles are reused across many tokens in the prompt batch. The bigger L3 reduces the number of times each tile has to come back from main memory. Second, KV-cache locality during decode: the per-layer KV cache stays resident in L3 longer between layer reads, reducing DDR traffic for the attention step. Neither effect is huge for tiny prompts and tiny models, but on Phoronix-class workloads (1K-token-plus prompts on 7B–13B models) the difference is measurable.
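To get a feel for the second effect, a rough sketch can compare KV-cache size with L3 capacity. The shape values below are assumptions not taken from this article: they are Llama 3.1 8B's published configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128) with an fp16 KV cache. The class and method names are illustrative.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class KVShape:
    # Assumed Llama 3.1 8B shape; adjust for other models.
    n_layers: int = 32
    n_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2  # fp16 KV cache (assumption)

    def bytes_per_token_per_layer(self) -> int:
        # One K vector and one V vector per KV head.
        return 2 * self.n_kv_heads * self.head_dim * self.bytes_per_value

    def total_bytes(self, n_tokens: int) -> int:
        return self.bytes_per_token_per_layer() * self.n_layers * n_tokens


shape = KVShape()
for context in (256, 1024, 4096):
    total_mib = shape.total_bytes(context) / MIB
    per_layer_mib = total_mib / shape.n_layers
    print(f"{context:>5} tokens: {total_mib:6.0f} MiB total, "
          f"{per_layer_mib:5.1f} MiB per layer")
```

Under these assumptions, a single layer's KV slice at a 1K-token context is a few MiB and fits in either a 32MB or a 96MB L3, while the full KV cache does not. Whether a slice actually stays resident depends on how the streamed weights evict it, which this sketch does not model.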

What did Phoronix's 9950X3D2 numbers actually show?

Phoronix's AMD CPU benchmark coverage on Linux this spring posted llama.cpp tok/s for a 7B q4 model across the AM5 X3D lineup, with the 9950X3D2 leading non-cache and previous-gen X3D parts by mid-single-digit to low-double-digit percentages depending on prompt length. The clearest 9950X3D2 win was on prefill (~14% over a same-class non-cache part); the smallest was on continuous decode of a short-context query (~4%). The community discussion that followed was unanimous on one point: the X3D advantage exists, but it's nothing like the 30–50% wins you see in 1080p gaming where every cache miss is a frame-time spike.

Spec delta: AM4 and AM5 inference candidates

| CPU | Cores / threads | L3 cache | Memory | TDP | Street (2026) |
| --- | --- | --- | --- | --- | --- |
| Ryzen 9 9950X3D2 | 16 / 32 | 96 + 32 MB (3D V-Cache) | DDR5-6000 dual | 170 W | ~$699 |
| Ryzen 7 5800X | 8 / 16 | 32 MB | DDR4-3600 dual | 105 W | ~$229 used |
| Ryzen 7 5700X | 8 / 16 | 32 MB | DDR4-3600 dual | 65 W | ~$189 |
| Ryzen 5 5600G | 6 / 12 | 16 MB | DDR4-3200 dual | 65 W | ~$139 |

The 9950X3D2 has 2× the cores of the 5800X and, going by the 96 + 32 MB listed above, 4× the L3, but it costs 3× as much. The 5800X is the value lever AM4 builders should look at twice; the 5700X is the cooler, quieter sibling; the 5600G is the budget floor.

Benchmark table: synthesized CPU tok/s on Llama 3.1 8B q4_K_M

Numbers below are synthesized from llama.cpp issue threads, Phoronix benchmarks, and r/LocalLLaMA bench posts as of 2026. CPU-only, no GPU offload, dual-channel memory at platform-native speeds, 256-token prompt.

| CPU | Prefill (tok/s) | Decode (tok/s) | First-token latency (256 tok prompt) |
| --- | --- | --- | --- |
| Ryzen 9 9950X3D2 (DDR5-6000) | 165 | 13.2 | 1.55 s |
| Ryzen 7 5800X (DDR4-3600) | 96 | 8.8 | 2.66 s |
| Ryzen 7 5700X (DDR4-3600) | 91 | 8.5 | 2.81 s |
| Ryzen 5 5600G (DDR4-3200) | 64 | 7.1 | 4.00 s |

The 9950X3D2 wins on both columns — but the gap on decode is roughly 50%, while the price gap is 200%. The 5800X-vs-5700X gap is mostly TDP and clock speed and is small enough to be a wash for inference. The 5600G is the budget floor: it works, it's just slow.

Quantization matrix on CPU

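| Model | Quant | RAM required | Decode tok/s (5800X) | Decode tok/s (9950X3D2) | Quality |
| --- | --- | --- | --- | --- | --- |
| Llama 3.1 8B | q2_K | 3.5 GB | 12.5 | 16.8 | sharp loss |
| Llama 3.1 8B | q4_K_M | 5.8 GB | 8.8 | 13.2 | minor loss |
| Llama 3.1 8B | q5_K_M | 6.6 GB | 7.5 | 11.4 | near-fp16 |
| Llama 3.1 8B | q8_0 | 9.1 GB | 4.9 | 7.6 | indistinguishable |
| Mistral 13B | q4_K_M | 8.9 GB | 5.7 | 8.4 | usable |
| Qwen 32B | q4_K_M | 19 GB | 1.9 | 2.8 | usable |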

At 32B q4 the wheels come off both chips. Even the 9950X3D2 lands at under 3 tok/s, slower than any discrete GPU that can fit the model. That's the natural switchover point: above 13B, CPU inference stops being a credible interactive experience.

Prefill vs generation on CPU

The split is mechanical. Prefill scales with compute: more cores, faster clocks, and the cache help, because the same weight tiles get reused across many tokens of the prompt. Generation scales with memory bandwidth: each new token reads the model once, and more cores just queue more cores against the same DDR pipes.

That asymmetry is why the 9950X3D2's biggest lead is at prefill (~70% over a 5800X on the same model) and its smallest lead is at decode (~50%). The takeaway for buyers: if your real workload is short-prompt chat, you care about decode speed and the cache premium is hard to justify. If your real workload is long-context RAG, prefill matters and the X3D math improves — but if prefill matters that much, you should be on a GPU anyway.
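The asymmetry can be made concrete by splitting a response into its prefill and decode phases. The sketch below uses the prefill and decode rates from this article's benchmark table. The two workloads are hypothetical, and it assumes prefill tok/s stays constant as prompts grow, which real runs will not match exactly.

```python
# (prefill tok/s, decode tok/s) from the benchmark table in this article.
RATES = {
    "Ryzen 9 9950X3D2": (165.0, 13.2),
    "Ryzen 7 5800X": (96.0, 8.8),
}

# Hypothetical workloads: (prompt tokens, generated tokens).
WORKLOADS = {
    "short-prompt chat": (256, 256),
    "long-context RAG": (8192, 256),
}


def response_seconds(prompt_tokens, output_tokens, prefill_tok_s, decode_tok_s):
    """Total time = time to ingest the prompt + time to generate the reply."""
    return prompt_tokens / prefill_tok_s + output_tokens / decode_tok_s


def x3d_speedup(workload: str) -> float:
    """How many times faster the 9950X3D2 finishes than the 5800X."""
    prompt, output = WORKLOADS[workload]
    slow = response_seconds(prompt, output, *RATES["Ryzen 7 5800X"])
    fast = response_seconds(prompt, output, *RATES["Ryzen 9 9950X3D2"])
    return slow / fast


for name in WORKLOADS:
    print(f"{name}: 9950X3D2 is {x3d_speedup(name):.2f}x faster end to end")
```

For the chat-style workload the end-to-end speedup sits close to the decode gap, because decode dominates the total time. It only drifts toward the prefill gap once the prompt is thousands of tokens long.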

Does dual-channel vs quad-channel memory matter more than the CPU?

Yes, and emphatically so for inference. At the same memory speed, a quad-channel HEDT or workstation platform (Threadripper, Xeon W; server parts such as Epyc go wider still) doubles the effective DDR bandwidth, and with it the decode tok/s ceiling, for a comparable CPU. A Threadripper 7960X on quad-channel DDR5-5200 will comfortably outrun a desktop 9950X3D2 on 7B q4 decode at roughly the same per-core IPC, because its memory pipe is nearly twice as wide.
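The channel-count argument can be checked with peak-bandwidth arithmetic: channels × transfer rate × 8 bytes per 64-bit channel. The sketch below computes theoretical peaks, which are higher than the effective figures quoted earlier in this article. The function name is illustrative.

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int,
                        bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s for 64-bit (8-byte) channels."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


configs = {
    "dual-channel DDR4-3600": (2, 3600),
    "dual-channel DDR5-6000": (2, 6000),
    "quad-channel DDR5-5200": (4, 5200),
}

for label, (channels, rate) in configs.items():
    print(f"{label}: {peak_bandwidth_gb_s(channels, rate):.1f} GB/s peak")

ratio = peak_bandwidth_gb_s(4, 5200) / peak_bandwidth_gb_s(2, 6000)
print(f"quad DDR5-5200 vs dual DDR5-6000: {ratio:.2f}x")
```

Doubling the channels doubles the peak only at equal memory speed. Because quad-channel platforms often run slower DIMMs, the advantage in this particular pairing comes out somewhat below 2×.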

The catch: HEDT and workstation platforms cost dramatically more, and once you're paying $1,500+ for the platform, the right answer is almost always to add a $260 RTX 3060 12GB and run the model on the GPU instead.

Perf-per-dollar: is a used 5800X the smarter CPU-inference buy than a flagship X3D?

For most home builders, yes. Here's the math on Llama 3.1 8B q4_K_M:

| Build | Cost | Decode tok/s | $ per tok/s |
| --- | --- | --- | --- |
| 9950X3D2 + 64GB DDR5-6000 + B850 board | ~$1,250 | 13.2 | $94.7 |
| 5800X + 32GB DDR4-3600 + B550 board | ~$420 | 8.8 | $47.7 |
| 5700X + 32GB DDR4-3600 + B550 board | ~$380 | 8.5 | $44.7 |
| 5600G + 16GB DDR4-3200 + A520 board | ~$280 | 7.1 | $39.4 |

The 5600G wins on raw cost-per-throughput, the 5800X is the sweet spot if you also want a competent gaming CPU, and the 9950X3D2 has the highest absolute throughput but the worst dollar efficiency. The "smart CPU-inference" build is a budget 5700X or 5800X system with as much fast RAM as your board supports.
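The dollar-efficiency column is simply build cost divided by decode tok/s. A small sketch makes it easy to rerun with your own local prices. The build costs and throughput figures are the article's; the function name is illustrative.

```python
# (build, approximate cost in USD, decode tok/s) from the article's table.
BUILDS = [
    ("9950X3D2 + 64GB DDR5-6000 + B850 board", 1250, 13.2),
    ("5800X + 32GB DDR4-3600 + B550 board", 420, 8.8),
    ("5700X + 32GB DDR4-3600 + B550 board", 380, 8.5),
    ("5600G + 16GB DDR4-3200 + A520 board", 280, 7.1),
]


def dollars_per_tok_s(cost_usd: float, decode_tok_s: float) -> float:
    """Lower is better: dollars spent per token-per-second of decode speed."""
    return cost_usd / decode_tok_s


# Rank builds from most to least dollar-efficient.
ranked = sorted(BUILDS, key=lambda build: dollars_per_tok_s(build[1], build[2]))

for name, cost, tok_s in ranked:
    print(f"${dollars_per_tok_s(cost, tok_s):5.1f} per tok/s  "
          f"({tok_s:4.1f} tok/s for ${cost})  {name}")
```

The metric ignores everything except decode throughput, so it says nothing about prefill speed, gaming performance, or upgrade headroom.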

Common pitfalls

  • Buying for cores you can't feed. The 9950X3D2 has 16 cores; LLM decode uses 6–8 effectively. The rest of the silicon sits idle, drawing power for no throughput.
  • Single-channel memory. Running a single DIMM halves your bandwidth and crushes tok/s. Always populate both channels.
  • Slow RAM on a fast CPU. DDR4-2400 on a 5800X loses real performance vs DDR4-3600. The memory kit is a $30 difference and a 20% inference difference.
  • Forgetting thread placement on multi-CCD parts. A desktop Ryzen is a single NUMA node, but on a 9950X3D2 llama.cpp's thread placement still matters; cross-CCD traffic costs you cache hit rate.
  • Assuming the X3D cache scales to bigger models. It doesn't. Above 13B, the cache is irrelevant and bandwidth is everything.
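For the thread-placement pitfall, one way to experiment on Linux is to pin a benchmark run to the cores of a single CCD and compare it with an unpinned run. The sketch below only builds and prints the command. It assumes that logical CPUs 0–7 map to one CCD (verify with `lscpu -e` before trusting that), that `taskset` from util-linux is installed, and that llama.cpp's `llama-bench` binary is on your PATH. The model path is a placeholder.

```python
import shlex


def pinned_bench_command(cpus, model_path, prompt_tokens=256, gen_tokens=128):
    """Build a taskset + llama-bench argv with one thread per pinned CPU."""
    cpu_ids = list(cpus)
    cpu_list = ",".join(str(cpu) for cpu in cpu_ids)
    return [
        "taskset", "-c", cpu_list,     # restrict the process to these CPUs
        "llama-bench",
        "-m", model_path,              # GGUF model file
        "-t", str(len(cpu_ids)),       # thread count matches pinned CPUs
        "-p", str(prompt_tokens),      # prompt (prefill) length to test
        "-n", str(gen_tokens),         # number of tokens to generate
    ]


# Assumption: CPUs 0-7 are the eight cores of one CCD on this machine.
cmd = pinned_bench_command(range(0, 8), "models/your-model-q4_k_m.gguf")
print(shlex.join(cmd))
```

Running the printed command once with the pinned set and once with all cores gives a quick read on whether cross-CCD traffic is costing you anything on your own hardware.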

When NOT to spend on a flagship X3D for inference

If any of these apply, the X3D money belongs elsewhere:

  • Your primary use case is interactive chat or code completion. A 12GB discrete GPU will be 4–6× faster on the same model.
  • You're already at the budget for a GPU. $700 of CPU vs $700 of CPU+GPU is not a hard choice.
  • You're running models above 13B regularly. CPU decode falls off a cliff; no amount of cache helps.
  • You're targeting a quiet, low-power build. The 5700X or 5600G runs cooler and works just as well for inference.

Bottom line

3D V-Cache is real and the 9950X3D2 is a great gaming CPU. For LLM inference, it's a 5–15% improvement on a workload where the bottleneck is your RAM bus, not your silicon. Most home builders chasing local LLMs are better served by an AMD Ryzen 7 5800X or AMD Ryzen 7 5700X with fast dual-channel DDR4, $30 spent on a faster RAM kit, and an MSI GeForce RTX 3060 Ventus 2X 12G to handle the actual decode. Use CPU inference only when the GPU is genuinely unavailable.

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

Frequently asked questions

Is CPU-only LLM inference even worth doing?
For small quantized models (3B-8B at q4) on a modern desktop, CPU inference produces usable interactive speeds and lets you run a model when no GPU is free or VRAM is full. It is not competitive with a discrete GPU for throughput, but it is a legitimate fallback for batch jobs, embeddings, and small assistants — especially on a machine that already has plenty of system RAM.
Does 3D V-Cache actually raise tokens per second?
Cache helps most when the working set fits inside it, but LLM weight matrices are far larger than even a stacked 3D cache, so generation throughput stays dominated by main-memory bandwidth. Public CPU-inference results show modest, model-dependent gains from extra cache rather than the large uplift X3D delivers in games. Spend on faster, higher-channel memory before paying the X3D premium for inference.
How much system RAM do I need for CPU inference?
Match RAM to the quantized model size plus the KV cache and OS overhead: an 8B q4 model needs roughly 6-8GB, a 32B q4 around 20-24GB, and a 70B q4 over 40GB. For comfortable headroom on larger models, 32GB is a practical floor and 64GB is better. Dual-channel population is mandatory — running a single stick roughly halves your effective bandwidth and throughput.
Is the Ryzen 7 5800X a good CPU-inference value in 2026?
Yes — the 5800X's eight Zen 3 cores and mature AM4 platform make it an inexpensive, well-supported base for CPU inference, and it pairs cleanly with a discrete GPU later. The closely related Ryzen 7 5700X trims power and price with similar inference behavior, while the 5600G suits an even tighter budget. None of these match a GPU, but all deliver solid tokens-per-dollar on small models.
Should I buy an X3D chip or just add a GPU?
If your goal is faster local LLM inference, a discrete GPU like the RTX 3060 12GB will outrun any consumer CPU by a wide margin for models that fit its VRAM, so the GPU is the better first upgrade. Reserve the X3D purchase for gaming or mixed workloads where its cache shines; for pure inference the money is better spent on VRAM and memory bandwidth.

— SpecPicks Editorial · Last verified 2026-06-01