How Fast Is Local LLM Inference on a Ryzen 7 5800X (CPU-Only, No GPU)?

Item: AMD Ryzen 7 5800X 8-core, 16-thread unlocked desktop processor
Author: Mike Perry

Bandwidth, not cores, sets the ceiling — tok/s tables for 3B / 8B / 14B and the honest GPU break-even.

By Mike Perry · Published 2026-05-30 · Last verified 2026-07-20 · 10 min read

A Ryzen 7 5800X hits 7–11 tok/s on 8B-q4 models CPU-only — usable for slow chat, painful for live autocomplete. Full breakdown of where CPU-only stops.

A stock Ryzen 7 5800X on dual-channel DDR4-3200 generates roughly 6–10 tokens per second on an 8B-q4 model running entirely on CPU. Lift the RAM to DDR4-3600 and you'll see 7–11 tok/s on the same load. 3B models clear 20 tok/s. 14B models drop to 3–5 tok/s. CPU-only is fine for batch jobs and slow chat; it is not the right tool for an interactive coding agent.

Why CPU-only matters for AM4 owners

The mini-PC crowd just spent a week arguing about an r/LocalLLaMA thread asking whether a Ryzen AI Max+ 395 / 128 GB unified-memory box can host big models without a discrete GPU. That thread is the latest pulse on a real question: how far can a CPU-only inference path actually take you? For the hundreds of thousands of AM4 owners with a Ryzen 7 5800X, 5700X, or 5600G, the answer is more useful than "buy a 3060." You probably already have the chip; you want to know what it can do before spending another dollar.

The short version: CPU-only LLM inference on an 8-core Zen 3 chip works, but it is bandwidth-bound, not compute-bound. Memory speed and channel count matter far more than core count or clock. Adding more threads beyond what saturates the memory controller does almost nothing. Tightening RAM timings and going from DDR4-3200 to DDR4-3600 produces measurable gains. And once you actually need interactive speed on anything above an 8B model, a discrete RTX 3060 12GB becomes the practical upgrade — the bandwidth gap is too big to close with any CPU lever.

This article works through the numbers, the levers, and the honest break-even line where staying CPU-only stops making sense.

Key takeaways

8B-q4 models run at ~6–11 tok/s on a Ryzen 7 5800X with dual-channel DDR4-3200/3600 — usable for slow chat, painful for live autocomplete.
3B models clear 20 tok/s and are the genuine CPU-only sweet spot.
14B-class models drop to 3–5 tok/s — fine for overnight batch jobs, not for editor-side work.
Memory bandwidth, not core count, sets the ceiling. DDR4-3600 beats DDR4-3200 by roughly the bandwidth ratio.
The 5800X, 5700X, and 5600G land within a few tok/s of each other on CPU-only inference — Zen 3 is Zen 3.
A discrete RTX 3060 12GB flips the math: 5–8× faster on the same workloads at $300–$400.

Why is CPU LLM inference bottlenecked by memory bandwidth, not cores?

Token generation in a transformer is a streaming workload. To emit one new token, the model must read every weight it holds, layer by layer; for an 8B-q4 model, that's roughly 5 GB of weights moved per token. Compute throughput on a modern x86 CPU outpaces memory bandwidth by a huge factor, so the cores end up waiting for data instead of crunching it.
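
As a rough sketch of that arithmetic, the ceiling on generation speed is effective memory bandwidth divided by the bytes streamed per token. The function name below is illustrative, the 48 GB/s, 5.0 GB, and 8.9 GB inputs are this article's approximate figures rather than fresh measurements, and treating the whole weight file as the per-token read is a simplifying assumption.

```python
def generation_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if every token streams all weights once."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Assumed inputs: ~48 GB/s effective dual-channel DDR4-3200,
# 5.0 GB for an 8B q4_K_M model, 8.9 GB for a 14B q4_K_M model.
ceiling_8b = generation_ceiling_tok_s(48.0, 5.0)
ceiling_14b = generation_ceiling_tok_s(48.0, 8.9)

print(f"8B q4 ceiling:  {ceiling_8b:.1f} tok/s")
print(f"14B q4 ceiling: {ceiling_14b:.1f} tok/s")
```

Measured throughput normally lands somewhat below this bound, since the memory controller rarely sustains its best-case rate for the whole run.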

Llama.cpp maintainers have noted this in the project's discussions: once you have enough threads to saturate the memory controller, adding more does nothing. On a Ryzen 7 5800X (8 cores, 16 threads), saturation happens around 6–8 threads. The remaining cores idle while the memory subsystem ships weights at whatever the DDR4 channels can manage.
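
One way to find the saturation point on your own machine is to sweep thread counts with llama.cpp's llama-bench tool. The helper below only builds the command line and does not run it; the model path is a placeholder, the thread list is an arbitrary choice, and it assumes a llama.cpp build where llama-bench is on your PATH.

```python
import shlex


def llama_bench_cmd(model_path, threads=(2, 4, 6, 8, 12, 16),
                    prompt_tokens=512, gen_tokens=128):
    """Build a llama-bench invocation that sweeps several thread counts."""
    return [
        "llama-bench",
        "-m", model_path,                         # GGUF model file
        "-p", str(prompt_tokens),                 # prompt tokens for the prefill test
        "-n", str(gen_tokens),                    # tokens for the generation test
        "-t", ",".join(str(t) for t in threads),  # comma-separated thread counts
    ]


# Hypothetical model path; substitute your own file.
cmd = llama_bench_cmd("models/llama-3.1-8b-q4_k_m.gguf")
print(shlex.join(cmd))
```

If generation tok/s stops improving between two adjacent thread counts, the lower one is a reasonable setting for day-to-day use.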

That makes the upgrade levers very predictable:

Faster RAM helps. DDR4-3200 → DDR4-3600 lifts tok/s by roughly the bandwidth gain (~12%).
Tighter timings help a little. CL14 vs CL18 at the same speed claws a percent or two.
More cores do not help past saturation. A 12-core 5900X is barely faster CPU-only than the 8-core 5800X.
A faster CPU clock helps prefill, not generation. Prefill is compute-bound; generation isn't.
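
The ~12% figure for the RAM upgrade follows from the theoretical peak bandwidth of DDR4, sketched here under the usual assumption of a 64-bit (8-byte) bus per channel. The function name is illustrative, and these are theoretical peaks, so the effective bandwidth a real system sustains will be lower.

```python
def ddr4_peak_gb_s(mt_per_s: int, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth: transfers/s x bytes per transfer x channels."""
    return mt_per_s * bus_bytes * channels / 1000  # MB/s -> GB/s


peak_3200 = ddr4_peak_gb_s(3200)
peak_3600 = ddr4_peak_gb_s(3600)
gain_pct = (peak_3600 / peak_3200 - 1) * 100

print(f"DDR4-3200 dual-channel peak: {peak_3200:.1f} GB/s")
print(f"DDR4-3600 dual-channel peak: {peak_3600:.1f} GB/s")
print(f"Bandwidth gain: {gain_pct:.1f}%")
```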

How fast is the Ryzen 7 5800X on 3B/8B/14B models?

The table below summarises throughput on the Ryzen 7 5800X with dual-channel DDR4-3200 CL16. Numbers are illustrative of community measurements posted to the llama.cpp discussions board and r/LocalLLaMA throughout 2025–2026; your numbers will move ±20% with RAM timing, kernel governor, and llama.cpp build flags.

Model	Quant	RAM used	Prefill (tok/s)	Generation (tok/s)	Subjective feel
Phi-3 mini 3.8B	q4_K_M	2.5 GB	60	24–28	snappy chat
Llama 3.2 3B	q4_K_M	2.4 GB	65	25–30	usable autocomplete
Qwen2.5 7B	q4_K_M	4.6 GB	38	8–11	slow chat, fine batch
Llama 3.1 8B	q4_K_M	5.0 GB	36	6–10	borderline interactive
Llama 3.1 8B	q8_0	8.5 GB	22	4–6	reference quality, slow
Qwen2.5 14B	q4_K_M	8.9 GB	18	3–5	batch-only
Qwen2.5 14B	q5_K_M	10.4 GB	15	2.5–4	batch-only
Llama 3.1 70B	q4_K_M	42 GB	4	0.6–1	impractical

The take-home: any 3B-class model is fast enough CPU-only for routine use; 8B is usable but slow; 14B and up are batch territory. The pattern matches predictions from bandwidth-divided-by-model-size arithmetic, which is the whole reason memory speed dominates the conversation.

Quantization matrix: q2 / q3 / q4 / q5 / q6 / q8 on CPU

Quantization on CPU has two effects: it shrinks the weight footprint (so the model fits in RAM) and it changes how much data each token has to stream from memory. Lower quants run faster but degrade quality, sometimes catastrophically for code or math workloads.

Quant	8B RAM	14B RAM	8B tok/s	14B tok/s	Quality note
q2_K	3.5 GB	6.0 GB	11	6	unusable for code/math
q3_K_M	4.0 GB	7.0 GB	10	5.5	flagged degradation
q4_K_M	5.0 GB	8.9 GB	8.5	4.5	default choice
q5_K_M	5.8 GB	10.4 GB	7.5	3.5	small upgrade, near-fp16
q6_K	6.7 GB	11.5 GB	6.5	3	rounding error vs q5
q8_0	8.5 GB	14.8 GB	5	2.5	reference quality

For CPU-only work, q4_K_M is the practical default for both 8B and 14B models. Going lower than q4 saves RAM, but the quality hit on real workloads is large. Going higher is fine if you have the RAM, but you pay throughput for diminishing returns.
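
A quick consistency check on the 8B columns of the table: if generation is bandwidth-bound, footprint multiplied by tok/s should stay roughly constant across quants. The sketch below uses the table's RAM figure as a stand-in for bytes streamed per token, which is an approximation because RAM use also includes buffers and cache; the names are illustrative.

```python
# (RAM in GB, tok/s) per quant, copied from the 8B columns of the table.
QUANT_8B = {
    "q2_K": (3.5, 11.0),
    "q3_K_M": (4.0, 10.0),
    "q4_K_M": (5.0, 8.5),
    "q5_K_M": (5.8, 7.5),
    "q6_K": (6.7, 6.5),
    "q8_0": (8.5, 5.0),
}


def implied_bandwidth_gb_s(ram_gb: float, tok_s: float) -> float:
    """GB read per token x tokens per second = implied GB/s."""
    return ram_gb * tok_s


implied = {q: implied_bandwidth_gb_s(gb, tps) for q, (gb, tps) in QUANT_8B.items()}
for quant, bw in implied.items():
    print(f"{quant:7s} ~{bw:.1f} GB/s")
```

Across the six rows the product stays between about 38 and 44 GB/s, which is the signature of a bandwidth-bound workload: fewer bytes per token buys a proportionally higher token rate.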

How much does DDR4 speed (3200 vs 3600) change tok/s?

A direct head-to-head on a Ryzen 7 5800X, Llama 3.1 8B q4_K_M, four threads, llama.cpp standard build, with the same kernel governor and ASLR settings:

Memory config	Effective BW	8B q4 tok/s	Δ vs 3200 CL18
DDR4-3200 CL18 (dual-channel)	~48 GB/s	6.8	baseline
DDR4-3200 CL16 (dual-channel)	~48 GB/s	7.2	+6%
DDR4-3600 CL18 (dual-channel)	~54 GB/s	7.8	+14%
DDR4-3600 CL16 (dual-channel)	~54 GB/s	8.1	+19%
DDR4-3733 CL16 (FCLK 1867)	~56 GB/s	8.4	+23%

The pattern is exactly what bandwidth-bound theory predicts. The FCLK/UCLK ratio matters: a 3733 stick at 1:1 FCLK is faster than 3800 at 1:2 because of the latency penalty in async mode. Above 3733/3800 the Zen 3 IMC starts to get unstable on most 8-core SKUs without IF clock tweaks — community testing converges on 3600 CL16 as the practical sweet spot.

If you're already on a 5800X with 3200 RAM, the upgrade math is clear: a 2×16 GB DDR4-3600 CL16 kit costs ~$70, lifts CPU-only tok/s by ~12–19% depending on the timings of the kit you're replacing, and helps every other CPU-bound workload on the machine.
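
The percentages can be recomputed from the tok/s column of the table, measured against the DDR4-3200 CL18 row that the table marks as baseline. This is plain arithmetic on the article's figures; the names are illustrative.

```python
# tok/s for Llama 3.1 8B q4_K_M, copied from the memory table.
TOK_S = {
    "DDR4-3200 CL18": 6.8,
    "DDR4-3200 CL16": 7.2,
    "DDR4-3600 CL18": 7.8,
    "DDR4-3600 CL16": 8.1,
    "DDR4-3733 CL16": 8.4,
}


def uplift_pct(new_tok_s: float, old_tok_s: float) -> float:
    """Percentage gain of new_tok_s over old_tok_s."""
    return (new_tok_s / old_tok_s - 1) * 100


baseline = TOK_S["DDR4-3200 CL18"]
for config, tok_s in TOK_S.items():
    print(f"{config}: {uplift_pct(tok_s, baseline):+.1f}% vs 3200 CL18")

# Upgrade path for someone already on a 3200 CL16 kit.
from_cl16 = uplift_pct(TOK_S["DDR4-3600 CL16"], TOK_S["DDR4-3200 CL16"])
print(f"3200 CL16 -> 3600 CL16: {from_cl16:+.1f}%")
```

The gain from a 3600 CL16 kit therefore depends on what it replaces: about 19% over 3200 CL18, about 12.5% over 3200 CL16.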

Prefill vs generation: why your first token is slow CPU-only

The CPU-only prefill experience is where the gap with a discrete GPU is most visible. Llama 3.1 8B at q4 on a 5800X churns about 30–40 tok/s of prefill. A 4,000-token system prompt + chat history therefore takes ~100–130 seconds to ingest before the model emits the first new token. The RTX 3060 12GB clears the same prefill in under 5 seconds.

For interactive chat with short prompts (<512 tokens), CPU-only prefill is bearable — about 12–15 seconds of "thinking" before the first token. For agentic workloads that feed the model multi-thousand-token contexts (logs, file diffs, error traces), prefill alone breaks the interaction model. If your workflow involves long contexts, even an entry-level discrete GPU is a different category of experience.
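
The wait before the first token is prompt length divided by prefill rate. A minimal sketch using this article's figures (30–40 tok/s prefill for Llama 3.1 8B q4, 36 tok/s in the throughput table); the function name is illustrative, and it ignores model load time and any prompt caching.

```python
def time_to_first_token_s(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Seconds spent ingesting the prompt before generation starts."""
    return prompt_tokens / prefill_tok_s


# 4,000-token context at the fast and slow ends of the prefill range.
long_prompt_fast = time_to_first_token_s(4000, 40)
long_prompt_slow = time_to_first_token_s(4000, 30)
# 512-token prompt at the table's 36 tok/s prefill figure.
short_prompt = time_to_first_token_s(512, 36)

print(f"4,000 tokens: {long_prompt_fast:.0f}-{long_prompt_slow:.0f} s")
print(f"512 tokens:   {short_prompt:.1f} s")
```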

Spec table: Ryzen 7 5800X vs 5700X vs 5600G for inference

Chip	Cores	Boost	TDP	L3 cache	iGPU	New street price (2026)	8B q4 tok/s (DDR4-3600)
Ryzen 7 5800X	8	4.7 GHz	105 W	32 MB	none	$190–$220	8.0
Ryzen 7 5700X	8	4.6 GHz	65 W	32 MB	none	$130–$170	7.8
Ryzen 5 5600G	6	4.4 GHz	65 W	16 MB	Vega 7	$110–$140	6.8

A few takeaways from this comparison:

The 5800X and 5700X are within a few percent on CPU-only LLM tok/s. The 5800X's higher boost mostly helps prefill, not steady-state generation.
The 5600G loses ~15% throughput, partly to fewer cores and partly to halved L3 cache. Its Vega iGPU does not accelerate llama.cpp in a useful way today.
For pure value the 5700X is the strongest CPU-only pick. The 5800X earns its premium if you want maximum prefill speed or if you'll later pair it with a GPU. The 5600G is the right call only if you're cost-constrained or want an iGPU for non-AI display output.

When is adding an RTX 3060 12GB worth it over staying CPU-only?

The break-even is "any time you want interactive 8B+." A discrete RTX 3060 12GB does roughly 60–70 tok/s on Llama 3.1 8B q4 — five-to-eight times faster than the same model on a CPU-only 5800X. Prefill is 10–20× faster. The card costs $300–$400 new.

That math changes the moment you start running the model daily. A coding agent at 7 tok/s is frustrating; at 60 tok/s it feels live. A chat session at 7 tok/s is fine for a one-shot question; it's painful for an hour-long debugging conversation. If you're using LLMs as part of your day-job loop, the GPU pays for itself in attention spans within weeks.

CPU-only is the right answer when: you are running 3B-class models only; you only need batch throughput, not interactive speed; you have absolutely no budget for a discrete card; or you want to test whether you'll use a local LLM enough to justify the GPU.

Perf-per-dollar + perf-per-watt for a no-GPU box

The Ryzen 7 5800X draws ~110 W under sustained inference load. At 8 tok/s on an 8B q4 model, that's ~0.07 tok/s per watt — a fifth of what the 3060 manages. A 5700X is slightly better at 0.10 tok/s/W thanks to its 65 W TDP and similar throughput.

On cost-per-throughput, the CPU-only path looks worse even on paper: a $170 5700X delivering ~8 tok/s is ~$21 per tok/s, while a $349 3060 12GB at ~65 tok/s is ~$5.40 per tok/s. The 3060 result assumes the model fits in 12 GB; for models that do, the GPU is the obvious value pick. The CPU-only case stays alive only because, for users who already own the chip, the marginal upgrade cost is zero.
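
Both metrics are simple ratios, sketched here with the prices, wattage, and tok/s figures quoted in this section. The function names are illustrative, and the inputs are the article's round numbers rather than new measurements.

```python
def tok_s_per_watt(tok_s: float, watts: float) -> float:
    """Generation throughput per watt of sustained draw."""
    return tok_s / watts


def dollars_per_tok_s(price_usd: float, tok_s: float) -> float:
    """Purchase price per unit of generation throughput (lower is better)."""
    return price_usd / tok_s


eff_5800x = tok_s_per_watt(8, 110)      # 5800X at ~110 W, 8 tok/s
cost_5700x = dollars_per_tok_s(170, 8)  # $170 5700X, ~8 tok/s
cost_3060 = dollars_per_tok_s(349, 65)  # $349 RTX 3060 12GB, ~65 tok/s

print(f"5800X efficiency: {eff_5800x:.3f} tok/s per watt")
print(f"5700X cost: ${cost_5700x:.2f} per tok/s")
print(f"3060 cost:  ${cost_3060:.2f} per tok/s")
```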

Bottom line: which models are usable CPU-only and which aren't

Usable on a 5800X/5700X CPU-only, daily driver tier:

Phi-3 mini, Llama 3.2 3B, Qwen 2.5 1.5B / 3B
Any 3B-class coder model — Qwen2.5-Coder 3B is the standout
Llama 3.1 8B / Qwen 2.5 7B for slow, single-question chat
Translation, summarization, and other batch NLP work

Borderline on a 5800X — fine for batch, painful for interactive:

Llama 3.1 8B / Qwen 2.5 7B as a daily chat companion
Qwen2.5-Coder 7B for non-time-sensitive code review

Practically unusable for interactive work on CPU-only, even with 64 GB RAM:

14B / 22B coder models (Qwen2.5-Coder 14B, Codestral 22B)
32B-class generalists (Qwen 2.5 32B)
Any 70B model — fits in RAM, runs at <1 tok/s

The intersection of "needs to be fast" and "needs to be 8B or larger" is where you stop being CPU-only and start being a GPU build. If you don't cross that line — most home users running a 3B model for note summarization don't — the Ryzen 7 5800X on DDR4-3600 is a perfectly honest setup. Pair it with a Crucial BX500 1TB SATA SSD for the model store and a 32 GB DDR4-3600 CL16 kit, and you have the cheapest credible local-LLM box of the year.

Frequently asked questions

Is the Ryzen 7 5800X usable for local LLMs without any GPU?

Yes for small models. Public llama.cpp CPU benchmarks indicate an 8-core Zen 3 chip on dual-channel DDR4 lands in the high-single-digit to low-double-digit tok/s range for an 8B-q4 model — fine for batch tasks and slow chat, but below comfortable interactive speed for larger models, where a discrete GPU becomes the practical answer.

Why doesn't adding more CPU cores speed up inference much?

Token generation is memory-bandwidth bound, not compute bound. Per llama.cpp maintainer discussion, once you have enough cores to saturate the memory controller, extra threads add little. The Ryzen 7 5800X's dual-channel DDR4 caps effective bandwidth, which is why DDR4-3600 helps more than throwing additional threads at the model.

Does faster RAM actually improve tokens per second?

It does, measurably. Because generation reads the full model from memory each token, raising DDR4 from 3200 to 3600 MT/s lifts throughput roughly in proportion to the bandwidth gain in community measurements. Tightening timings helps a little more. It won't transform a 5800X into a GPU, but it's the cheapest CPU-only tuning lever available.

How does the 5800X compare to the 5700X or 5600G for inference?

All three are Zen 3 and bandwidth-limited, so CPU-only tok/s is similar; the 5800X's higher boost mainly helps prefill. The 5600G's integrated graphics don't accelerate llama.cpp meaningfully. For pure CPU inference the cheaper 5700X is the value pick, while the 5800X edges ahead on prompt-heavy workloads per published spec comparisons.

At what point should I just buy an RTX 3060 12GB instead?

Once you want interactive speed on 8-14B models, a GPU wins decisively. TechPowerUp's RTX 3060 specs show memory bandwidth multiples higher than dual-channel DDR4, and community tok/s figures are several times faster than CPU-only. If you run models daily or need low latency, the featured 12GB card is the upgrade that ends the bandwidth bottleneck.

Can I run quantized models in system RAM if I don't have much of it?

Quantization is what makes CPU inference feasible — a q4_K_M 8B model needs roughly 5-6GB of RAM per public memory tables, so 16GB is workable and 32GB is comfortable for 14B-class models. Going below q4 saves RAM but degrades quality noticeably, so most CPU-only users settle on q4 or q5 quants.
