How Fast Is Local LLM Inference on a Ryzen 7 5800X (CPU-Only, No GPU)?

Bandwidth, not cores, sets the ceiling — tok/s tables for 3B / 8B / 14B and the honest GPU break-even.

A Ryzen 7 5800X hits roughly 6–11 tok/s on 8B-q4 models CPU-only, depending on RAM speed: usable for slow chat, painful for live autocomplete. Full breakdown of where CPU-only stops.

A stock Ryzen 7 5800X on dual-channel DDR4-3200 generates roughly 6–10 tokens per second on an 8B-q4 model running entirely on CPU. Lift the RAM to DDR4-3600 and you'll see 7–11 tok/s on the same load. 3B models clear 20 tok/s. 14B models drop to 3–5 tok/s. CPU-only is fine for batch jobs and slow chat; it is not the right tool for an interactive coding agent.

Why CPU-only matters for AM4 owners

The mini-PC crowd just spent a week arguing about an r/LocalLLaMA thread asking whether a Ryzen AI Max+ 395 / 128 GB unified-memory box can host big models without a discrete GPU. That thread is the latest pulse on a real question: how far can a CPU-only inference path actually take you? For the hundreds of thousands of AM4 owners with a Ryzen 7 5800X, 5700X, or 5600G, the answer is more useful than "buy a 3060." You probably already have the chip; you want to know what it can do before spending another dollar.

The short version: CPU-only LLM inference on an 8-core Zen 3 chip works, but it is bandwidth-bound, not compute-bound. Memory speed and channel count matter far more than core count or clock. Adding more threads beyond what saturates the memory controller does almost nothing. Tightening RAM timings and going from DDR4-3200 to DDR4-3600 produces measurable gains. And once you actually need interactive speed on anything above an 8B model, a discrete RTX 3060 12GB becomes the practical upgrade — the bandwidth gap is too big to close with any CPU lever.

This article works through the numbers, the levers, and the honest break-even line where staying CPU-only stops making sense.

Key takeaways

  • 8B-q4 models run at ~6–11 tok/s on a Ryzen 7 5800X with dual-channel DDR4-3200/3600 — usable for slow chat, painful for live autocomplete.
  • 3B models clear 20 tok/s and are the genuine CPU-only sweet spot.
  • 14B-class models drop to 3–5 tok/s — fine for overnight batch jobs, not for editor-side work.
  • Memory bandwidth, not core count, sets the ceiling. DDR4-3600 beats DDR4-3200 by roughly the bandwidth ratio.
  • The 5800X, 5700X, and 5600G land within roughly 1 tok/s of each other on 8B-q4 CPU-only inference: Zen 3 is Zen 3.
  • A discrete RTX 3060 12GB flips the math: 5–8× faster on the same workloads at $300–$400.

Why is CPU LLM inference bottlenecked by memory bandwidth, not cores?

Token generation in a transformer is a streaming workload. To emit one new token, the runtime must read essentially every weight in the model, layer by layer; for an 8B-q4 model, that's roughly 5 GB of weights moved per token. Compute throughput on a modern x86 CPU outpaces memory bandwidth by a wide margin, so the cores end up waiting for data instead of crunching it.

Llama.cpp maintainers have noted this in the project's discussions: once you have enough threads to saturate the memory controller, adding more does nothing. On a Ryzen 7 5800X (8 cores, 16 threads), saturation happens around 6–8 threads. The remaining cores idle while the memory subsystem ships weights at whatever the DDR4 channels can manage.
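
One way to find the saturation point on your own machine is a thread sweep. The sketch below only builds the command line. It assumes llama.cpp's `llama-bench` tool is built and on your PATH, and the model path is a hypothetical placeholder. The flags used (`-m`, `-t` with a comma-separated list, `-p`, `-n`, `-o json`) should be available in recent llama.cpp builds, but check `llama-bench --help` on your build before relying on them.

```python
import shlex


def build_thread_sweep(model_path, thread_counts, n_prompt=512, n_gen=128):
    """Return an argv list that benchmarks one model at several thread counts."""
    threads = ",".join(str(t) for t in thread_counts)
    return [
        "llama-bench",          # assumed to be on PATH
        "-m", model_path,       # GGUF model file
        "-t", threads,          # comma-separated thread counts to sweep
        "-p", str(n_prompt),    # prompt tokens for the prefill test
        "-n", str(n_gen),       # tokens to generate for the generation test
        "-o", "json",           # machine-readable output
    ]


# Hypothetical model path; the thread counts bracket the 6-8 thread range.
cmd = build_thread_sweep("models/llama-3.1-8b-q4_k_m.gguf", [2, 4, 6, 8, 12, 16])
print(shlex.join(cmd))
```

To execute it, pass the list to `subprocess.run(cmd, check=True)`. If generation tok/s flattens past 6 to 8 threads while prefill keeps climbing, you are looking at the bandwidth ceiling described above.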

That makes the upgrade levers very predictable:

  • Faster RAM helps. DDR4-3200 → DDR4-3600 lifts tok/s by roughly the bandwidth gain (~12%).
  • Tighter timings help a little. CL14 vs CL18 at the same speed claws back a few percent.
  • More cores do not help past saturation. A 12-core 5900X is barely faster CPU-only than the 8-core 5800X.
  • A faster CPU clock helps prefill, not generation. Prefill is compute-bound; generation isn't.
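
The "~12%" figure for faster RAM falls out of simple arithmetic. As a rough sketch, theoretical peak DDR4 bandwidth is transfers per second times 8 bytes per 64-bit channel times the number of channels. The function name is illustrative, and real-world effective bandwidth lands somewhat below these peaks.

```python
def ddr4_peak_bandwidth_gbs(mt_per_s: int, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s: transfers/s x bytes per transfer x channels."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9


bw_3200 = ddr4_peak_bandwidth_gbs(3200)
bw_3600 = ddr4_peak_bandwidth_gbs(3600)
gain_pct = (bw_3600 / bw_3200 - 1) * 100

print(f"DDR4-3200: {bw_3200:.1f} GB/s, DDR4-3600: {bw_3600:.1f} GB/s, gain: {gain_pct:.1f}%")
# prints: DDR4-3200: 51.2 GB/s, DDR4-3600: 57.6 GB/s, gain: 12.5%
```

The peaks sit a little above the ~48 and ~54 GB/s effective figures quoted later in this article, which is what you would expect once controller overhead is accounted for.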

How fast is the Ryzen 7 5800X on 3B/8B/14B models?

The table below summarises throughput on the Ryzen 7 5800X with dual-channel DDR4-3200 CL16. Numbers are illustrative of community measurements posted to the llama.cpp discussions board and r/LocalLLaMA throughout 2025–2026; your numbers will move ±20% with RAM timing, kernel governor, and llama.cpp build flags.

| Model | Quant | RAM used | Prefill (tok/s) | Generation (tok/s) | Subjective feel |
| --- | --- | --- | --- | --- | --- |
| Phi-3 mini 3.8B | q4_K_M | 2.5 GB | 60 | 24–28 | snappy chat |
| Llama 3.2 3B | q4_K_M | 2.4 GB | 65 | 25–30 | usable autocomplete |
| Qwen2.5 7B | q4_K_M | 4.6 GB | 38 | 8–11 | slow chat, fine batch |
| Llama 3.1 8B | q4_K_M | 5.0 GB | 36 | 6–10 | borderline interactive |
| Llama 3.1 8B | q8_0 | 8.5 GB | 22 | 4–6 | reference quality, slow |
| Qwen2.5 14B | q4_K_M | 8.9 GB | 18 | 3–5 | batch-only |
| Qwen2.5 14B | q5_K_M | 10.4 GB | 15 | 2.5–4 | batch-only |
| Llama 3.1 70B | q4_K_M | 42 GB | 4 | 0.6–1 | impractical |

The take-home: any 3B-class model is fast enough CPU-only for routine use; 8B is usable but slow; 14B and up are batch territory. The pattern matches predictions from bandwidth-divided-by-model-size arithmetic, which is the whole reason memory speed dominates the conversation.
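
The bandwidth-divided-by-model-size arithmetic can be sketched in a few lines. This is a rough estimate, not a benchmark: it assumes the ~48 GB/s effective DDR4-3200 bandwidth quoted in this article and uses the table's RAM-used column as a stand-in for the bytes streamed per token, which slightly overstates the weights.

```python
# Effective dual-channel DDR4-3200 bandwidth quoted in this article (GB/s).
EFFECTIVE_BW_GBS = 48.0

# RAM-used figures from the table above, treated as bytes streamed per token
# (an assumption made for this estimate).
MODEL_SIZE_GB = {
    "Llama 3.1 8B q4_K_M": 5.0,
    "Qwen2.5 14B q4_K_M": 8.9,
    "Llama 3.1 70B q4_K_M": 42.0,
}


def generation_ceiling(bandwidth_gbs: float, model_gb: float) -> float:
    """Upper bound on tok/s if every token streams the whole model once."""
    return bandwidth_gbs / model_gb


for name, size_gb in MODEL_SIZE_GB.items():
    ceiling = generation_ceiling(EFFECTIVE_BW_GBS, size_gb)
    print(f"{name}: at most ~{ceiling:.1f} tok/s")
```

Under these assumptions the ceilings come out near 9.6, 5.4, and 1.1 tok/s, in the same neighbourhood as the 6–10, 3–5, and 0.6–1 tok/s generation ranges in the table.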

Quantization matrix: q2 / q3 / q4 / q5 / q6 / q8 on CPU

Quantization on CPU has two effects: it shrinks the weight footprint (so the model fits in RAM) and it changes how much data each token has to stream from memory. Lower quants run faster but degrade quality, sometimes catastrophically for code or math workloads.

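| Quant | 8B RAM | 14B RAM | 8B tok/s | 14B tok/s | Quality note |
| --- | --- | --- | --- | --- | --- |
| q2_K | 3.5 GB | 6.0 GB | 11 | 6 | unusable for code/math |
| q3_K_M | 4.0 GB | 7.0 GB | 10 | 5.5 | flagged degradation |
| q4_K_M | 5.0 GB | 8.9 GB | 8.5 | 4.5 | default choice |
| q5_K_M | 5.8 GB | 10.4 GB | 7.5 | 3.5 | small upgrade, near-fp16 |
| q6_K | 6.7 GB | 11.5 GB | 6.5 | 3 | rounding error vs q5 |
| q8_0 | 8.5 GB | 14.8 GB | 5 | 2.5 | reference quality |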

For CPU-only work, q4_K_M is the practical default for both 8B and 14B models. Going lower than q4 saves RAM, but the quality hit on real workloads is large. Going higher is fine if you have the RAM, but you pay throughput for diminishing returns.
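
A quick sanity check on the quantization matrix: if generation is bandwidth-bound, footprint times tok/s should stay roughly constant across quants. The sketch below multiplies the 8B columns from the matrix. Treating the RAM figure as the data streamed per token is an assumption, so read the result as an implied bandwidth, not a measurement.

```python
# (quant, 8B RAM in GB, 8B tok/s) copied from the quantization matrix above.
ROWS_8B = [
    ("q2_K", 3.5, 11),
    ("q3_K_M", 4.0, 10),
    ("q4_K_M", 5.0, 8.5),
    ("q5_K_M", 5.8, 7.5),
    ("q6_K", 6.7, 6.5),
    ("q8_0", 8.5, 5),
]


def implied_bandwidth_gbs(size_gb: float, tok_s: float) -> float:
    """GB read per token times tokens per second gives implied GB/s."""
    return size_gb * tok_s


implied = {quant: implied_bandwidth_gbs(size, rate) for quant, size, rate in ROWS_8B}
for quant, gbs in implied.items():
    print(f"{quant}: ~{gbs:.1f} GB/s implied")
```

Every row lands between about 38 and 44 GB/s even though the footprint more than doubles from q2_K to q8_0, which is the signature of a memory-bound workload.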

How much does DDR4 speed (3200 vs 3600) change tok/s?

A direct head-to-head on a Ryzen 7 5800X, Llama 3.1 8B q4_K_M, four threads, llama.cpp standard build, with the same kernel governor and ASLR settings:

| Memory config | Effective BW | 8B q4 tok/s | Δ vs 3200 CL18 |
| --- | --- | --- | --- |
| DDR4-3200 CL18 (dual-channel) | ~48 GB/s | 6.8 | baseline |
| DDR4-3200 CL16 (dual-channel) | ~48 GB/s | 7.2 | +6% |
| DDR4-3600 CL18 (dual-channel) | ~54 GB/s | 7.8 | +14% |
| DDR4-3600 CL16 (dual-channel) | ~54 GB/s | 8.1 | +19% |
| DDR4-3733 CL16 (FCLK 1867) | ~56 GB/s | 8.4 | +23% |

The pattern is exactly what bandwidth-bound theory predicts. The FCLK/UCLK ratio matters: a 3733 stick at 1:1 FCLK is faster than 3800 at 1:2 because of the latency penalty in async mode. Above 3733/3800 the Zen 3 IMC starts to get unstable on most 8-core SKUs without IF clock tweaks — community testing converges on 3600 CL16 as the practical sweet spot.
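
To see how closely throughput tracks bandwidth, the head-to-head figures can be compared at matched timings. This is a small sketch over the table's numbers only; the dictionary layout and function names are illustrative.

```python
# (effective bandwidth in GB/s, 8B q4 tok/s) from the head-to-head table.
RESULTS = {
    "DDR4-3200 CL18": (48, 6.8),
    "DDR4-3200 CL16": (48, 7.2),
    "DDR4-3600 CL18": (54, 7.8),
    "DDR4-3600 CL16": (54, 8.1),
    "DDR4-3733 CL16": (56, 8.4),
}


def pct_gain(new: float, old: float) -> float:
    """Percentage increase of new over old."""
    return (new / old - 1) * 100


def compare(base: str, target: str) -> tuple[float, float]:
    """Return (bandwidth gain %, tok/s gain %) going from base to target."""
    bw_base, rate_base = RESULTS[base]
    bw_target, rate_target = RESULTS[target]
    return pct_gain(bw_target, bw_base), pct_gain(rate_target, rate_base)


for base, target in [("DDR4-3200 CL16", "DDR4-3600 CL16"),
                     ("DDR4-3200 CL18", "DDR4-3600 CL18")]:
    bw_gain, rate_gain = compare(base, target)
    print(f"{base} -> {target}: bandwidth +{bw_gain:.1f}%, tok/s +{rate_gain:.1f}%")
```

At CL16 the bandwidth and tok/s gains both come out at 12.5%; at CL18 the tok/s gain is slightly larger than the bandwidth gain, roughly 15% against 12.5%.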

If you're already on a 5800X with 3200 RAM, the upgrade math is clear: a 2×16 GB DDR4-3600 CL16 kit costs ~$70, lifts CPU-only tok/s by roughly 12–19% depending on how loose your current kit's timings are, and helps every other memory-bound workload on the machine.

Prefill vs generation: why your first token is slow CPU-only

Prefill is where the gap between CPU-only and a discrete GPU is most visible. Llama 3.1 8B at q4 on a 5800X manages about 30–40 tok/s of prefill. A 4,000-token system prompt plus chat history therefore takes ~100–130 seconds to ingest before the model emits the first new token. The RTX 3060 12GB clears the same prefill in under 5 seconds.

For interactive chat with short prompts (<512 tokens), CPU-only prefill is bearable — about 12–15 seconds of "thinking" before the first token. For agentic workloads that feed the model multi-thousand-token contexts (logs, file diffs, error traces), prefill alone breaks the interaction model. If your workflow involves long contexts, even an entry-level discrete GPU is a different category of experience.
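
The wait before the first token is just prompt length divided by prefill rate. A minimal sketch, using the 30–40 tok/s CPU prefill range from this section and the 36 tok/s figure from the throughput table (the function name is illustrative):

```python
def prefill_seconds(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Seconds spent ingesting the prompt before the first new token."""
    return prompt_tokens / prefill_tok_s


# Long agentic context at the slow and fast ends of the CPU prefill range.
slow = prefill_seconds(4000, 30)
fast = prefill_seconds(4000, 40)
# Short chat prompt at the 36 tok/s prefill rate listed for Llama 3.1 8B q4_K_M.
short = prefill_seconds(512, 36)

print(f"4,000-token prompt: {fast:.0f}-{slow:.0f} s")
print(f"512-token prompt: ~{short:.0f} s")
```

This ignores any prompt caching the runtime may do between turns, so treat it as the cold-start case.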

Spec table: Ryzen 7 5800X vs 5700X vs 5600G for inference

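| Chip | Cores | Boost | TDP | L3 cache | iGPU | New street price (2026) | 8B q4 tok/s (DDR4-3600) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ryzen 7 5800X | 8 | 4.7 GHz | 105 W | 32 MB | none | $190–$220 | 8.0 |
| Ryzen 7 5700X | 8 | 4.6 GHz | 65 W | 32 MB | none | $130–$170 | 7.8 |
| Ryzen 5 5600G | 6 | 4.4 GHz | 65 W | 16 MB | Vega 7 | $110–$140 | 6.8 |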

A few takeaways from this comparison:

  • The 5800X and 5700X are within a few percent on CPU-only LLM tok/s. The 5800X's higher boost mostly helps prefill, not steady-state generation.
  • The 5600G loses ~15% throughput, partly to fewer cores and partly to halved L3 cache. Its Vega iGPU does not accelerate llama.cpp in a useful way today.
  • For pure value the 5700X is the strongest CPU-only pick. The 5800X earns its premium if you want maximum prefill speed or if you'll later pair it with a GPU. The 5600G is the right call only if you're cost-constrained or want an iGPU for non-AI display output.

When is adding an RTX 3060 12GB worth it over staying CPU-only?

The break-even is "any time you want interactive 8B+." A discrete RTX 3060 12GB does roughly 60–70 tok/s on Llama 3.1 8B q4 — five-to-eight times faster than the same model on a CPU-only 5800X. Prefill is 10–20× faster. The card costs $300–$400 new.

That gap starts to matter the moment you run the model daily. A coding agent at 7 tok/s is frustrating; at 60 tok/s it feels live. A chat session at 7 tok/s is fine for a one-shot question; it's painful for an hour-long debugging conversation. If you're using LLMs as part of your day-job loop, the GPU pays for itself in attention spans within weeks.

CPU-only is the right answer when: you are running 3B-class models only; you only need batch throughput, not interactive speed; you have absolutely no budget for a discrete card; or you want to test whether you'll use a local LLM enough to justify the GPU.

Perf-per-dollar + perf-per-watt for a no-GPU box

The Ryzen 7 5800X draws ~110 W under sustained inference load. At 8 tok/s on an 8B q4 model, that's ~0.07 tok/s per watt — a fifth of what the 3060 manages. A 5700X is slightly better at 0.10 tok/s/W thanks to its 65 W TDP and similar throughput.

On cost-per-throughput, the CPU-only path looks worse on paper if you are buying new: a $170 5700X delivering ~8 tok/s is ~$21 per tok/s, while a $349 3060 12GB at ~65 tok/s is ~$5.40 per tok/s. The catch is that the 3060 result assumes the model fits in 12 GB; for models that fit, the GPU is the obvious value pick. The CPU-only case stays alive because, for users who already own the chip, the marginal upgrade cost is zero.
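
The cost and efficiency figures in this section reduce to two divisions. The sketch below reproduces them from the prices, throughput, and power draw quoted here; nothing in it is measured, and the helper names are illustrative.

```python
def dollars_per_tok_s(price_usd: float, tok_s: float) -> float:
    """Purchase price divided by generation throughput."""
    return price_usd / tok_s


def tok_s_per_watt(tok_s: float, watts: float) -> float:
    """Generation throughput divided by sustained power draw."""
    return tok_s / watts


cpu_cost = dollars_per_tok_s(170, 8)    # 5700X at ~8 tok/s
gpu_cost = dollars_per_tok_s(349, 65)   # RTX 3060 12GB at ~65 tok/s
cpu_eff = tok_s_per_watt(8, 110)        # 5800X at ~110 W sustained

print(f"5700X: ${cpu_cost:.2f} per tok/s")
print(f"RTX 3060 12GB: ${gpu_cost:.2f} per tok/s")
print(f"5800X: {cpu_eff:.3f} tok/s per watt")
```

Setting the CPU price to zero models the already-own-the-chip case, where cost per tok/s drops to nothing and only the speed gap remains.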

Bottom line: which models are usable CPU-only and which aren't

Usable on a 5800X/5700X CPU-only, daily driver tier:

  • Phi-3 mini, Llama 3.2 3B, Qwen 2.5 1.5B / 3B
  • Any 3B-class coder model — Qwen2.5-Coder 3B is the standout
  • Llama 3.1 8B / Qwen 2.5 7B for slow, single-question chat
  • Translation, summarization, and other batch NLP work

Borderline on a 5800X — fine for batch, painful for interactive:

  • Llama 3.1 8B / Qwen 2.5 7B as a daily chat companion
  • Qwen2.5-Coder 7B for non-time-sensitive code review

Practically unusable on CPU-only, even with 64 GB RAM:

  • 14B / 22B coder models (Qwen2.5-Coder 14B, Codestral 22B)
  • 32B-class generalists such as Qwen 2.5 32B
  • Any 70B model — fits in RAM, runs at <1 tok/s

The intersection of "needs to be fast" and "needs to be 8B or larger" is where you stop being CPU-only and start being a GPU build. If you don't cross that line — most home users running a 3B model for note summarization don't — the Ryzen 7 5800X on DDR4-3600 is a perfectly honest setup. Pair it with a Crucial BX500 1TB SATA SSD for the model store and a 32 GB DDR4-3600 CL16 kit, and you have the cheapest credible local-LLM box of the year.

Frequently asked questions

Is the Ryzen 7 5800X usable for local LLMs without any GPU?
Yes for small models. Public llama.cpp CPU benchmarks indicate an 8-core Zen 3 chip on dual-channel DDR4 lands in the high-single-digit to low-double-digit tok/s range for an 8B-q4 model — fine for batch tasks and slow chat, but below comfortable interactive speed for larger models, where a discrete GPU becomes the practical answer.
Why doesn't adding more CPU cores speed up inference much?
Token generation is memory-bandwidth bound, not compute bound. Per llama.cpp maintainer discussion, once you have enough cores to saturate the memory controller, extra threads add little. The Ryzen 7 5800X's dual-channel DDR4 caps effective bandwidth, which is why DDR4-3600 helps more than throwing additional threads at the model.
Does faster RAM actually improve tokens per second?
It does, measurably. Because generation reads the full model from memory each token, raising DDR4 from 3200 to 3600 MT/s lifts throughput roughly in proportion to the bandwidth gain in community measurements. Tightening timings helps a little more. It won't transform a 5800X into a GPU, but it's the cheapest CPU-only tuning lever available.
How does the 5800X compare to the 5700X or 5600G for inference?
All three are Zen 3 and bandwidth-limited, so CPU-only tok/s is similar; the 5800X's higher boost mainly helps prefill. The 5600G's integrated graphics don't accelerate llama.cpp meaningfully. For pure CPU inference the cheaper 5700X is the value pick, while the 5800X edges ahead on prompt-heavy workloads per published spec comparisons.
At what point should I just buy an RTX 3060 12GB instead?
Once you want interactive speed on 8–14B models, a GPU wins decisively. TechPowerUp's RTX 3060 specs show memory bandwidth several times higher than dual-channel DDR4, and community tok/s figures are several times faster than CPU-only. If you run models daily or need low latency, the 12 GB card is the upgrade that ends the bandwidth bottleneck.
Can I run quantized models in system RAM if I don't have much of it?
Quantization is what makes CPU inference feasible: a q4_K_M 8B model needs roughly 5–6 GB of RAM per public memory tables, so 16 GB is workable and 32 GB is comfortable for 14B-class models. Going below q4 saves RAM but degrades quality noticeably, so most CPU-only users settle on q4 or q5 quants.

— SpecPicks Editorial · Last verified 2026-05-30
