A stock Ryzen 7 5800X on dual-channel DDR4-3200 generates roughly 6–10 tokens per second on an 8B-q4 model running entirely on CPU. Lift the RAM to DDR4-3600 and you'll see 7–11 tok/s on the same load. 3B models clear 20 tok/s. 14B models drop to 3–5 tok/s. CPU-only is fine for batch jobs and slow chat; it is not the right tool for an interactive coding agent.
Why CPU-only matters for AM4 owners
The mini-PC crowd just spent a week arguing about an r/LocalLLaMA thread asking whether a Ryzen AI Max+ 395 / 128 GB unified-memory box can host big models without a discrete GPU. That thread is the latest pulse on a real question: how far can a CPU-only inference path actually take you? For the hundreds of thousands of AM4 owners with a Ryzen 7 5800X, 5700X, or 5600G, the answer is more useful than "buy a 3060." You probably already have the chip; you want to know what it can do before spending another dollar.
The short version: CPU-only LLM inference on an 8-core Zen 3 chip works, but it is bandwidth-bound, not compute-bound. Memory speed and channel count matter far more than core count or clock. Adding more threads beyond what saturates the memory controller does almost nothing. Tightening RAM timings and going from DDR4-3200 to DDR4-3600 produces measurable gains. And once you actually need interactive speed on anything above an 8B model, a discrete RTX 3060 12GB becomes the practical upgrade — the bandwidth gap is too big to close with any CPU lever.
This article works through the numbers, the levers, and the honest break-even line where staying CPU-only stops making sense.
Key takeaways
- 8B-q4 models run at ~6–11 tok/s on a Ryzen 7 5800X with dual-channel DDR4-3200/3600 — usable for slow chat, painful for live autocomplete.
- 3B models clear 20 tok/s and are the genuine CPU-only sweet spot.
- 14B-class models drop to 3–5 tok/s — fine for overnight batch jobs, not for editor-side work.
- Memory bandwidth, not core count, sets the ceiling. DDR4-3600 beats DDR4-3200 by roughly the bandwidth ratio.
- The 5800X, 5700X, and 5600G land within roughly 1 tok/s of each other on 8B q4 CPU-only inference; Zen 3 is Zen 3.
- A discrete RTX 3060 12GB flips the math: 5–8× faster on the same workloads at $300–$400.
Why is CPU LLM inference bottlenecked by memory bandwidth, not cores?
Token generation in a transformer is a streaming workload. To emit one new token, the model must read every weight once, layer by layer; for an 8B-q4 model, that's roughly 5 GB of weights moved per token. Compute throughput on a modern x86 CPU outpaces memory bandwidth by a huge factor, so the cores end up waiting for data instead of crunching it.
Llama.cpp maintainers have noted this in the project's discussions: once you have enough threads to saturate the memory controller, adding more does nothing. On a Ryzen 7 5800X (8 cores, 16 threads), saturation happens around 6–8 threads. The remaining cores idle while the memory subsystem ships weights at whatever the DDR4 channels can manage.
That makes the upgrade levers very predictable:
- Faster RAM helps. DDR4-3200 → DDR4-3600 lifts tok/s by roughly the bandwidth gain (~12%).
- Tighter timings help a little. CL16 vs CL18 at the same speed is worth roughly 4–6% in the head-to-head further down.
- More cores do not help past saturation. A 12-core 5900X is barely faster CPU-only than the 8-core 5800X.
- A faster CPU clock helps prefill, not generation. Prefill is compute-bound; generation isn't.
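The RAM-speed lever can be checked with back-of-envelope arithmetic. The sketch below computes theoretical peak bandwidth for dual-channel DDR4, assuming the standard 64-bit (8-byte) bus per channel. The function name is illustrative, and sustained bandwidth in practice lands somewhat below these peaks.

```python
def peak_bandwidth_gbs(mt_per_s: int, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    # transfers per second * bytes per transfer * number of channels
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9


bw_3200 = peak_bandwidth_gbs(3200)
bw_3600 = peak_bandwidth_gbs(3600)
uplift = bw_3600 / bw_3200 - 1.0

print(f"DDR4-3200: {bw_3200:.1f} GB/s")
print(f"DDR4-3600: {bw_3600:.1f} GB/s")
print(f"uplift:    {uplift:.1%}")
```

The ratio depends only on the transfer rate, which is why the expected generation gain from DDR4-3200 to DDR4-3600 is about 12% regardless of the model.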
How fast is the Ryzen 7 5800X on 3B/8B/14B models?
The table below summarises throughput on the Ryzen 7 5800X with dual-channel DDR4-3200 CL16. Numbers are illustrative of community measurements posted to the llama.cpp discussions board and r/LocalLLaMA throughout 2025–2026; your numbers will move ±20% with RAM timing, CPU frequency governor, and llama.cpp build flags.
| Model | Quant | RAM used | Prefill (tok/s) | Generation (tok/s) | Subjective feel |
|---|---|---|---|---|---|
| Phi-3 mini 3.8B | q4_K_M | 2.5 GB | 60 | 24–28 | snappy chat |
| Llama 3.2 3B | q4_K_M | 2.4 GB | 65 | 25–30 | usable autocomplete |
| Qwen2.5 7B | q4_K_M | 4.6 GB | 38 | 8–11 | slow chat, fine batch |
| Llama 3.1 8B | q4_K_M | 5.0 GB | 36 | 6–10 | borderline interactive |
| Llama 3.1 8B | q8_0 | 8.5 GB | 22 | 4–6 | reference quality, slow |
| Qwen2.5 14B | q4_K_M | 8.9 GB | 18 | 3–5 | batch-only |
| Qwen2.5 14B | q5_K_M | 10.4 GB | 15 | 2.5–4 | batch-only |
| Llama 3.1 70B | q4_K_M | 42 GB | 4 | 0.6–1 | impractical |
The take-home: any 3B-class model is fast enough CPU-only for routine use; 8B is usable but slow; 14B and up are batch territory. The pattern matches predictions from bandwidth-divided-by-model-size arithmetic, which is the whole reason memory speed dominates the conversation.
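The bandwidth-divided-by-model-size arithmetic is short enough to write out. This is a minimal sketch, assuming about 48 GB/s of sustained bandwidth on dual-channel DDR4-3200 and that each generated token streams the full weight file once; the constant and function names are illustrative.

```python
EFFECTIVE_BW_GBS = 48.0  # ASSUMPTION: sustained bandwidth, dual-channel DDR4-3200

# (model, weights in GB, observed generation range in tok/s) from the table above
ROWS = [
    ("Llama 3.1 8B q4_K_M", 5.0, (6.0, 10.0)),
    ("Qwen2.5 14B q4_K_M", 8.9, (3.0, 5.0)),
    ("Llama 3.1 70B q4_K_M", 42.0, (0.6, 1.0)),
]


def ceiling_tok_s(weights_gb: float, bandwidth_gbs: float = EFFECTIVE_BW_GBS) -> float:
    """Rough upper estimate: bandwidth divided by bytes streamed per token."""
    return bandwidth_gbs / weights_gb


for name, gb, (lo, hi) in ROWS:
    print(f"{name:22s} estimate {ceiling_tok_s(gb):5.2f} tok/s, observed {lo}-{hi}")
```

For the 8B, 14B, and 70B rows the observed ranges sit at or just under the estimate. Treat it as a rough guide rather than a hard bound: it ignores KV-cache traffic, and the 3B rows in the table land above what it gives.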
Quantization matrix: q2 / q3 / q4 / q5 / q6 / q8 on CPU
Quantization on CPU has two effects: it shrinks the weight footprint (so the model fits in RAM) and it changes how much data each token has to stream from memory. Lower quants run faster but degrade quality, sometimes catastrophically for code or math workloads.
| Quant | 8B RAM | 14B RAM | 8B tok/s | 14B tok/s | Quality note |
|---|---|---|---|---|---|
| q2_K | 3.5 GB | 6.0 GB | 11 | 6 | unusable for code/math |
| q3_K_M | 4.0 GB | 7.0 GB | 10 | 5.5 | noticeable degradation |
| q4_K_M | 5.0 GB | 8.9 GB | 8.5 | 4.5 | default choice |
| q5_K_M | 5.8 GB | 10.4 GB | 7.5 | 3.5 | small upgrade, near-fp16 |
| q6_K | 6.7 GB | 11.5 GB | 6.5 | 3 | rounding error vs q5 |
| q8_0 | 8.5 GB | 14.8 GB | 5 | 2.5 | reference quality |
For CPU-only work, q4_K_M is the practical default for both 8B and 14B. Going lower than q4 saves RAM, but the quality hit on real workloads is large. Going higher is fine if you have the RAM, but you pay throughput for diminishing returns.
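One way to see why throughput falls as quant size grows is to multiply each row's footprint by its tok/s. The sketch below does that for the matrix above, assuming every token streams the whole weight file once; the helper name is illustrative.

```python
# (quant, 8B size GB, 14B size GB, 8B tok/s, 14B tok/s) copied from the table above
QUANTS = [
    ("q2_K", 3.5, 6.0, 11.0, 6.0),
    ("q3_K_M", 4.0, 7.0, 10.0, 5.5),
    ("q4_K_M", 5.0, 8.9, 8.5, 4.5),
    ("q5_K_M", 5.8, 10.4, 7.5, 3.5),
    ("q6_K", 6.7, 11.5, 6.5, 3.0),
    ("q8_0", 8.5, 14.8, 5.0, 2.5),
]


def streamed_gbs(size_gb: float, tok_s: float) -> float:
    """GB/s read from RAM if each token streams the full weight file once."""
    return size_gb * tok_s


implied = [
    (quant, streamed_gbs(s8, t8), streamed_gbs(s14, t14))
    for quant, s8, s14, t8, t14 in QUANTS
]
for quant, bw8, bw14 in implied:
    print(f"{quant:7s} 8B: {bw8:5.1f} GB/s   14B: {bw14:5.1f} GB/s")
```

Every row implies somewhere between about 34 and 44 GB/s of weight traffic, so the quant level mostly changes how many bytes each token costs, not how fast the memory system delivers them.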
How much does DDR4 speed (3200 vs 3600) change tok/s?
A direct head-to-head on a Ryzen 7 5800X running Llama 3.1 8B q4_K_M with four threads on a standard llama.cpp build, holding the CPU frequency governor and ASLR settings constant:
| Memory config | Effective BW | 8B q4 tok/s | Δ vs 3200 CL18 |
|---|---|---|---|
| DDR4-3200 CL18 (dual-channel) | ~48 GB/s | 6.8 | baseline |
| DDR4-3200 CL16 (dual-channel) | ~48 GB/s | 7.2 | +6% |
| DDR4-3600 CL18 (dual-channel) | ~54 GB/s | 7.8 | +14% |
| DDR4-3600 CL16 (dual-channel) | ~54 GB/s | 8.1 | +19% |
| DDR4-3733 CL16 (FCLK 1867) | ~56 GB/s | 8.4 | +23% |
The pattern is exactly what bandwidth-bound theory predicts. The FCLK/UCLK ratio matters: a 3733 stick at 1:1 FCLK is faster than 3800 at 1:2 because of the latency penalty in async mode. Above 3733/3800 the Zen 3 IMC starts to get unstable on most 8-core SKUs without IF clock tweaks — community testing converges on 3600 CL16 as the practical sweet spot.
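The percentage column and the 1:1 fabric clock can both be recomputed from the table. This is a small sketch: gains are taken relative to the slowest row (DDR4-3200 CL18), and the FCLK helper relies only on the fact that DDR memory transfers twice per clock. Function and variable names are illustrative.

```python
def fclk_for_1to1_mhz(mt_per_s: float) -> float:
    """Infinity Fabric clock needed to run 1:1 with the memory clock.

    DDR transfers twice per clock, so the memory clock is half the data rate.
    """
    return mt_per_s / 2


# (config, tok/s) copied from the table above; the first row is the baseline
RESULTS = [
    ("DDR4-3200 CL18", 6.8),
    ("DDR4-3200 CL16", 7.2),
    ("DDR4-3600 CL18", 7.8),
    ("DDR4-3600 CL16", 8.1),
    ("DDR4-3733 CL16", 8.4),
]

baseline = RESULTS[0][1]
gains_pct = {name: (tok_s / baseline - 1) * 100 for name, tok_s in RESULTS}

for name, pct in gains_pct.items():
    print(f"{name}: {pct:+.1f}%")
print(f"DDR4-3733 needs FCLK {fclk_for_1to1_mhz(3733):.1f} MHz for 1:1")
```

The recomputed gains land within a point of the table's column, and DDR4-3733 works out to an FCLK of 1866.5 MHz, which is the 1867 figure quoted in the last row.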
If you're already on a 5800X with 3200 RAM, the upgrade math is clear: a 2×16 GB DDR4-3600 CL16 kit costs ~$70, lifts CPU-only tok/s by ~12% over DDR4-3200 CL16 (closer to 19% if you are coming from CL18), and helps every other CPU-bound workload on the machine.
Prefill vs generation: why your first token is slow CPU-only
The CPU-only prefill experience is where the gap with a discrete GPU is most visible. Llama 3.1 8B at q4 on a 5800X churns about 30–40 tok/s of prefill. A 4,000-token system prompt + chat history therefore takes ~100–130 seconds to ingest before the model emits the first new token. The RTX 3060 12GB clears the same prefill in under 5 seconds.
For interactive chat with short prompts (<512 tokens), CPU-only prefill is bearable — about 12–15 seconds of "thinking" before the first token. For agentic workloads that feed the model multi-thousand-token contexts (logs, file diffs, error traces), prefill alone breaks the interaction model. If your workflow involves long contexts, even an entry-level discrete GPU is a different category of experience.
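The wait before the first token is just prompt length divided by prefill rate. A minimal sketch, using 36 tok/s (the Llama 3.1 8B q4_K_M prefill figure from the throughput table) as the assumed rate; the names are illustrative.

```python
def time_to_first_token_s(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Seconds spent ingesting the prompt before generation starts."""
    return prompt_tokens / prefill_tok_s


CPU_PREFILL_TOK_S = 36.0  # ASSUMPTION: mid-range of the 30-40 tok/s quoted above

for prompt in (512, 4000):
    wait = time_to_first_token_s(prompt, CPU_PREFILL_TOK_S)
    print(f"{prompt:5d}-token prompt: ~{wait:.0f} s to first token")
```

Because the relationship is linear, doubling the context doubles the wait, which is why long agentic prompts hurt far more on CPU than short chat turns.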
Spec table: Ryzen 7 5800X vs 5700X vs 5600G for inference
| Chip | Cores | Boost | TDP | L3 cache | iGPU | New street price (2026) | 8B q4 tok/s (DDR4-3600) |
|---|---|---|---|---|---|---|---|
| Ryzen 7 5800X | 8 | 4.7 GHz | 105 W | 32 MB | none | $190–$220 | 8.0 |
| Ryzen 7 5700X | 8 | 4.6 GHz | 65 W | 32 MB | none | $130–$170 | 7.8 |
| Ryzen 5 5600G | 6 | 4.4 GHz | 65 W | 16 MB | Vega 7 | $110–$140 | 6.8 |
A few takeaways from this comparison:
- The 5800X and 5700X are within a few percent on CPU-only LLM tok/s. The 5800X's higher boost mostly helps prefill, not steady-state generation.
- The 5600G loses ~15% throughput, partly to fewer cores and partly to halved L3 cache. Its Vega iGPU does not accelerate llama.cpp in a useful way today.
- For pure value the 5700X is the strongest CPU-only pick. The 5800X earns its premium if you want maximum prefill speed or if you'll later pair it with a GPU. The 5600G is the right call only if you're cost-constrained or want an iGPU for non-AI display output.
When is adding an RTX 3060 12GB worth it over staying CPU-only?
The break-even is "any time you want interactive 8B+." A discrete RTX 3060 12GB does roughly 60–70 tok/s on Llama 3.1 8B q4, five to eight times faster than the same model on a CPU-only 5800X. Prefill is 20× or more faster, going by the 4,000-token example above. The card costs $300–$400 new.
The math changes the moment you start running the model daily. A coding agent at 7 tok/s is frustrating; at 60 tok/s it feels live. A chat session at 7 tok/s is fine for a one-shot question; it's painful for an hour-long debugging conversation. If you're using LLMs as part of your day-job loop, the GPU pays for itself in attention spans within weeks.
CPU-only is the right answer when: you are running 3B-class models only; you only need batch throughput, not interactive speed; you have absolutely no budget for a discrete card; or you want to test whether you'll use a local LLM enough to justify the GPU.
Perf-per-dollar + perf-per-watt for a no-GPU box
The Ryzen 7 5800X draws ~110 W under sustained inference load. At 8 tok/s on an 8B q4 model, that's ~0.07 tok/s per watt — a fifth of what the 3060 manages. A 5700X is slightly better at 0.10 tok/s/W thanks to its 65 W TDP and similar throughput.
On cost-per-throughput, the CPU-only path is cheaper up front but worse per token: a $170 5700X delivering ~8 tok/s is ~$21 per tok/s, while a $349 3060 12GB at ~65 tok/s is ~$5.40 per tok/s. The catch is that the 3060 result assumes the model fits in 12 GB; for models that fit, the GPU is the obvious value pick. The CPU-only case stays alive only because, for users who already own the chip, the marginal upgrade cost is zero.
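The ratios in this section reduce to two divisions. The sketch below reproduces them from the prices, speeds, and CPU power draw quoted above. The 170 W figure for the RTX 3060 is an assumption (its rated board power, not a measured inference draw), and the function names are illustrative.

```python
def dollars_per_tok_s(price_usd: float, tok_s: float) -> float:
    """Purchase price divided by generation speed; lower is better."""
    return price_usd / tok_s


def tok_s_per_watt(tok_s: float, watts: float) -> float:
    """Generation speed per watt of sustained draw; higher is better."""
    return tok_s / watts


cpu_cost = dollars_per_tok_s(170.0, 8.0)   # Ryzen 7 5700X, 8B q4
gpu_cost = dollars_per_tok_s(349.0, 65.0)  # RTX 3060 12GB, 8B q4

cpu_eff = tok_s_per_watt(8.0, 110.0)       # Ryzen 7 5800X at ~110 W
gpu_eff = tok_s_per_watt(65.0, 170.0)      # ASSUMPTION: 3060 at its 170 W board power

print(f"CPU: ${cpu_cost:.2f} per tok/s, {cpu_eff:.3f} tok/s/W")
print(f"GPU: ${gpu_cost:.2f} per tok/s, {gpu_eff:.3f} tok/s/W")
print(f"CPU efficiency is {cpu_eff / gpu_eff:.0%} of the GPU's")
```

Under that power assumption the CPU lands at roughly a fifth of the GPU's efficiency, in line with the figure quoted above. None of this applies if you already own the CPU and the comparison is against spending nothing.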
Bottom line: which models are usable CPU-only and which aren't
Usable on a 5800X/5700X CPU-only, daily driver tier:
- Phi-3 mini, Llama 3.2 3B, Qwen 2.5 1.5B / 3B
- Any 3B-class coder model — Qwen2.5-Coder 3B is the standout
- Llama 3.1 8B / Qwen 2.5 7B for slow, single-question chat
- Translation, summarization, and other batch NLP work
Borderline on a 5800X — fine for batch, painful for interactive:
- Llama 3.1 8B / Qwen 2.5 7B as a daily chat companion
- Qwen2.5-Coder 7B for non-time-sensitive code review
Practically unusable for interactive work on CPU-only, even with 64 GB RAM:
- 14B / 22B coder models (Qwen2.5-Coder 14B, Codestral 22B)
- 32B-class generalists (Qwen 2.5 32B)
- Any 70B model — fits in RAM, runs at <1 tok/s
The intersection of "needs to be fast" and "needs to be 8B or larger" is where you stop being CPU-only and start being a GPU build. If you don't cross that line — most home users running a 3B model for note summarization don't — the Ryzen 7 5800X on DDR4-3600 is a perfectly honest setup. Pair it with a Crucial BX500 1TB SATA SSD for the model store and a 32 GB DDR4-3600 CL16 kit, and you have the cheapest credible local-LLM box of the year.