CPU-Only LLM Inference on a Ryzen 7 5800X: When 32GB of RAM Beats a 12GB GPU

Item: AMD Ryzen 7 5800X 8-core, 16-thread unlocked desktop processor
Author: Mike Perry

Memory bandwidth math, batch tok/s numbers, and where 32GB of DDR4 wins the cost case

By Mike Perry · Published 2026-05-30 · Last verified 2026-07-18 · 9 min read

A Ryzen 7 5800X with 32GB DDR4-3600 can run 7B-13B LLMs at usable speeds — the bottleneck is memory bandwidth, not cores. Numbers and math inside.

Yes. A Ryzen 7 5800X with 32GB of DDR4 will run 7B–13B local LLMs at usable speeds (single-digit to low-double-digit tok/s on q4 weights). With 64GB installed it can also host a 70B at q4 for batch jobs at about 1 tok/s; the roughly 41GB of weights do not fit in 32GB. It will not feel responsive for interactive chat at 30B+, but for batch summarization, agent backends, and overnight runs it is a credible no-GPU answer.

The cost case for CPU inference after the "$500M-on-Claude" cloud-spend headline

A widely-circulated 2026 report claimed a single company burned roughly $500 million on Claude API spend in one month. The size of the bill is interesting on its own, but the second-order question is the one operators actually have to answer: what fraction of that traffic was easy enough to run on a CPU in a closet, and what fraction genuinely needed frontier hosted models? For most production stacks the honest answer is that 70–80% of token volume is bulk extraction, classification, draft generation, agent loops, and RAG answers — work an 8B model on a Ryzen box does fine.

The Ryzen 7 5800X is interesting here because it is on the cheap end of the curve. Eight Zen 3 cores, a 105W TDP, $200 used, dropped into an AM4 board that costs $80, paired with 32GB of dual-channel DDR4-3600 for $50. Total host cost lands under $400, no GPU required. The catch is that CPU inference is bottlenecked by memory bandwidth, not core count, and dual-channel DDR4 on an AM4 board gives you about 40–50GB/s of sustained read bandwidth — roughly an order of magnitude under a budget GPU. So the question is not whether it works (it does) but whether the throughput-per-dollar story holds up against just renting time on a cloud GPU or buying a $300 RTX 3060.

This article walks the actual measurements, where the bottleneck is, and the bands of workload where the CPU box is the right answer.

Key takeaways

An 8B model at q4_K_M generates 8–12 tok/s on a Ryzen 7 5800X with DDR4-3600 dual-channel.
A 13B model at q4_K_M generates 4–6 tok/s on the same box.
A 70B model at q4_K_M generates roughly 0.8–1.2 tok/s, usable for batch but not for chat, and needs 64GB of RAM to hold its ~41GB of weights.
Bottleneck is memory bandwidth, not cores; faster RAM (a DDR4-4000 XMP kit, for example) helps more than overclocking the CPU.
Total host cost is under $400 with a used 5800X, an $80 board, and 32GB of DDR4. Once the box is paid for, that is a serious "free per token" floor for batch backends.

How does CPU LLM inference actually work, and where is the bottleneck?

Inference splits into two phases the same as on a GPU: prefill (processing the input prompt) and generation (autoregressive sampling).

Prefill is compute-bound for small prompts and becomes memory-bound for very long ones. The 5800X has eight Zen 3 cores with AVX2, so prefill on a 1K-token prompt takes roughly 9 seconds for an 8B model (about 110 tok/s prefill in the benchmark table) and looks reasonable.
Generation is memory-bandwidth-bound, period. Each generated token requires reading every weight in the model once. With a 5GB q4_K_M 8B model and ~45GB/s of effective DDR4-3600 read bandwidth, the estimate is roughly 9 generation passes per second, in the same range as the measured 8–12 tok/s.

The implication is counterintuitive: a faster CPU does not buy you proportionally more generation throughput. What matters is RAM speed and channel count. A Threadripper Pro with 8 channels of DDR4 will blow past a 5800X regardless of core count because it has four times the memory channels (8 vs 2) and so roughly 3.5–4× the memory bandwidth.
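To make the channel-count point concrete, here is a minimal sketch of the peak-bandwidth arithmetic. It assumes a 64-bit (8-byte) bus per channel, and the DDR4-3200 speed for the 8-channel platform is an assumption for illustration, not a figure from this article. The function name is hypothetical.

```python
def peak_bandwidth_gbs(mt_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Peak theoretical DDR bandwidth in GB/s.

    Each 64-bit channel moves `bus_bytes` bytes per transfer.
    """
    return mt_per_s * bus_bytes * channels / 1000


# AM4 desktop: dual-channel DDR4-3600 (the configuration used in this article).
dual_channel = peak_bandwidth_gbs(3600, channels=2)

# 8-channel workstation platform; DDR4-3200 is an assumed speed.
eight_channel = peak_bandwidth_gbs(3200, channels=8)

print(f"dual-channel DDR4-3600:  {dual_channel:.1f} GB/s")
print(f"eight-channel DDR4-3200: {eight_channel:.1f} GB/s")
print(f"ratio: {eight_channel / dual_channel:.2f}x")
```

These are peak figures; sustained read bandwidth is lower on both platforms, but the ratio between them is what drives the generation-speed gap.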

Which models make sense on CPU vs which need a GPU?

Model size	CPU 5800X (DDR4-3600 dual-channel) verdict
3B–7B	Comfortable. 12–20 tok/s on 3B, 8–12 tok/s on 7B at q4
8B–9B	Fine. 6–10 tok/s at q4. Usable for chat with patience
13B	Batch only. 4–6 tok/s at q4. Tolerable for non-interactive jobs
32B	Slow. 1.5–2.5 tok/s at q4. Overnight runs only
70B	Marginal. 0.8–1.2 tok/s at q4_K_M with mmap, batch-only. Needs 64GB of RAM

If you need interactive 13B chat, drop the cash on a 12GB GPU. If you need overnight bulk processing of millions of tokens through an 8B model, the 5800X box is the right tool.

Spec-delta table: 5800X vs 5700X vs 5600G

Spec	Ryzen 7 5800X	Ryzen 7 5700X	Ryzen 5 5600G
Cores / threads	8 / 16	8 / 16	6 / 12
Base / boost clock	3.8 / 4.7 GHz	3.4 / 4.6 GHz	3.9 / 4.4 GHz
L3 cache	32 MB	32 MB	16 MB
TDP	105 W	65 W	65 W
Memory channels	2 (dual-channel DDR4)	2	2
PCIe	Gen 4 x20	Gen 4 x20	Gen 3 x16
Integrated GPU	None	None	Vega 7
Typical used price (2026)	~$190	~$170	~$120

The 5800X is the sweet spot for CPU inference at this price band: the highest boost clocks of the three, 32MB of L3, and no GPU silicon eating die area. The 5700X is a tier down on clock but runs much cooler and quieter at 65W, which is a real argument for a 24/7 batch backend. The 5600G's smaller L3 (16MB vs 32MB) hurts inference a little, mostly in the prefill phase, where less of the working set stays in cache; for pure LLM work it is the weakest of the three even after price-normalizing.

See the AMD Ryzen 7 5800X product page for the manufacturer spec sheet.

Quantization matrix: what fits in 32GB system RAM

Model size	q2_K RAM	q3_K_M RAM	q4_K_M RAM	q5_K_M RAM	q6_K RAM	q8_0 RAM
7B	2.8 GB	3.6 GB	4.4 GB	5.0 GB	5.6 GB	7.2 GB
8B	3.2 GB	4.0 GB	5.0 GB	5.7 GB	6.4 GB	8.5 GB
13B	5.5 GB	6.8 GB	8.0 GB	9.2 GB	10.5 GB	14.0 GB
32B	13.5 GB	17.0 GB	19.5 GB	22.5 GB	26.0 GB	34.0 GB
70B	28.0 GB	33.0 GB	41.0 GB	47.5 GB	56.5 GB	73.0 GB

32GB of system RAM comfortably hosts any model up to 32B at q4_K_M (19.5GB) with KV cache headroom. A 70B model at q4_K_M is roughly 41GB, so you need 64GB of RAM; an AM4 board can be pushed to 128GB total, but the practical fast pick is a 4×16GB DDR4-3600 kit. Keep the models on an SSD such as the Crucial BX500 1TB so mmap'd weight reads on cold starts are not waiting on a hard drive; at the BX500's rated 540MB/s a 41GB file still takes over a minute to read in full.
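The fit check above can be sketched in a few lines. The q4_K_M sizes are copied from the quantization matrix; the 6GB reserve for the OS and KV cache is an illustrative assumption, not a measured figure, and the `fits` helper is hypothetical.

```python
# q4_K_M weight sizes in GB, copied from the quantization matrix.
Q4_K_M_GB = {"7B": 4.4, "8B": 5.0, "13B": 8.0, "32B": 19.5, "70B": 41.0}


def fits(model_gb: float, ram_gb: float, reserve_gb: float = 6.0) -> bool:
    """True if the weights fit with `reserve_gb` left for the OS and KV cache.

    reserve_gb is an assumed headroom figure; tune it for your context length.
    """
    return model_gb + reserve_gb <= ram_gb


for ram in (32, 64):
    ok = [name for name, gb in Q4_K_M_GB.items() if fits(gb, ram)]
    print(f"{ram} GB RAM: {', '.join(ok)}")
```

Swapping in another column of the matrix (q5_K_M or q8_0, for instance) shows how quickly the larger models fall out of the 32GB budget.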

Benchmark table: CPU tok/s across model sizes

Single-user batch-1 generation on a Ryzen 7 5800X with dual-channel DDR4-3600 (CL18), Linux 6.x, llama.cpp built with native AVX2 and threads pinned to physical cores only.

Model	Quant	Prefill 1K (tok/s)	Generation (tok/s)	Time-to-512-token reply
Llama 3.1 8B	q4_K_M	110	10.5	~52 s
Llama 3.1 8B	q5_K_M	95	9.0	~60 s
Qwen 2.5 14B	q4_K_M	65	5.5	~95 s
Qwen 2.5 32B	q4_K_M	28	2.2	~240 s
Llama 3.1 70B	q4_K_M	11	1.1	~470 s

The 8B numbers say "interactive but slow chat". The 14B is the divide: fine for an agent backend that fires a job and walks away, painful for chat. 70B is firmly in batch territory, but it works once the box has 64GB of RAM, and that is a meaningful capability at this price.
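As a rough way to turn the table into latency estimates for your own prompts, the sketch below adds prefill time to generation time. The throughput figures are copied from the benchmark table; the prompt length behind the table's last column is not stated, so it is a parameter here, and the 200 and 4,000 token prompt sizes are illustrative choices. `Bench` and `reply_seconds` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bench:
    prefill_tps: float  # prompt tokens processed per second
    gen_tps: float      # generated tokens per second


# Figures copied from the benchmark table.
BENCH = {
    "Llama 3.1 8B q4_K_M": Bench(110, 10.5),
    "Llama 3.1 8B q5_K_M": Bench(95, 9.0),
    "Qwen 2.5 14B q4_K_M": Bench(65, 5.5),
    "Qwen 2.5 32B q4_K_M": Bench(28, 2.2),
    "Llama 3.1 70B q4_K_M": Bench(11, 1.1),
}


def reply_seconds(b: Bench, prompt_tokens: int, reply_tokens: int = 512) -> float:
    """Prefill time plus generation time for a single request."""
    return prompt_tokens / b.prefill_tps + reply_tokens / b.gen_tps


for name, b in BENCH.items():
    short = reply_seconds(b, prompt_tokens=200)
    long = reply_seconds(b, prompt_tokens=4000)
    print(f"{name:22s} 200-token prompt: {short:5.0f} s   4K prompt: {long:5.0f} s")
```

This treats prefill throughput as constant across prompt lengths, which is a simplification; the 1K-token prefill figure is the only one the table provides.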

Prefill vs generation on CPU: why prompt processing is the real pain point

On GPU, prefill is fast enough that you never think about it. On CPU, prefill is the visible part of the latency budget. A 4K-token prompt fed to a 14B q4_K_M model on the 5800X takes roughly 4,000 / 65 ≈ 62 seconds before the first response token. That is a usability problem for chat and a non-problem for batch.

Things that help: keep prompts short (RAG pipelines that retrieve 200 tokens instead of 4,000 are massively faster on CPU), use a smaller model for the front end and route to the bigger model only when needed, and turn on llama.cpp's --cache-type-k q4_1 for KV cache compression to free up RAM. Things that do not help: running more threads than there are physical cores (SMT siblings add little), and pushing PBO or a manual overclock beyond stock.
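To get a feel for how much RAM a quantized K cache frees up, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128) is an assumption matching the published Llama 3.1 8B configuration, and the 5 bits per value for q4_1 assumes blocks of 32 four-bit values plus an fp16 scale and fp16 minimum. Only the K cache is quantized here, mirroring the --cache-type-k flag; the V cache stays at f16. The helper name is hypothetical.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       k_bits: float, v_bits: float) -> float:
    """Bytes of KV cache that one token occupies across all layers."""
    values = n_layers * n_kv_heads * head_dim  # count for K, and again for V
    return values * (k_bits + v_bits) / 8


# Assumed Llama 3.1 8B shape (not taken from this article).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
F16_BITS = 16
Q4_1_BITS = 5  # assumed: 20 bytes per block of 32 values

CTX = 4096  # context length in tokens
MIB = 2 ** 20

f16_cache = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, F16_BITS, F16_BITS) * CTX
k_q4_cache = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, Q4_1_BITS, F16_BITS) * CTX

print(f"f16 K and V:      {f16_cache / MIB:.0f} MiB")
print(f"q4_1 K, f16 V:    {k_q4_cache / MIB:.0f} MiB")
print(f"saved at 4K ctx:  {(f16_cache - k_q4_cache) / MIB:.0f} MiB")
```

Under these assumptions the saving at a 4K context is modest next to the weights themselves; it matters more as context length or concurrent sessions grow.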

Memory bandwidth math: why dual-channel DDR4 caps your ceiling

This is the load-bearing math for the whole article.

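Dual-channel DDR4-3600 → theoretical 57.6 GB/s (3,600 MT/s × 8 bytes × 2 channels), sustained ~45 GB/s real-world
8B q4_K_M weights → 5.0 GB
Per-token generation read pass → 5.0 GB
Memory ceiling → 45 / 5.0 = 9.0 tok/s sustained, 57.6 / 5.0 ≈ 11.5 tok/s theoretical

The measured ~10.5 tok/s on Llama 3.1 8B q4_K_M sits between those two figures, which suggests the ~45 GB/s sustained estimate is slightly conservative for this workload. You will see this same ratio drop out for every model size on this platform (45 / 41 ≈ 1.1 tok/s for the 70B, for example): divide your sustained DDR4 bandwidth by the quant'd model size and you get the rough generation ceiling.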
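The divide-bandwidth-by-model-size rule can be checked against the benchmark numbers with a short sketch. The q4_K_M sizes come from the quantization matrix and the measured figures from the benchmark table; the 14B entry is left out because the matrix lists a 13B size rather than a 14B one. `ceiling_tps` is a hypothetical helper name.

```python
SUSTAINED_GBS = 45.0  # sustained dual-channel DDR4-3600 estimate from this article
PEAK_GBS = 57.6       # theoretical dual-channel DDR4-3600

# name -> (q4_K_M size in GB, measured generation tok/s)
MODELS = {
    "Llama 3.1 8B": (5.0, 10.5),
    "Qwen 2.5 32B": (19.5, 2.2),
    "Llama 3.1 70B": (41.0, 1.1),
}


def ceiling_tps(bandwidth_gbs: float, model_gb: float) -> float:
    """Upper bound on tok/s if every token streams all weights once."""
    return bandwidth_gbs / model_gb


for name, (size_gb, measured) in MODELS.items():
    low = ceiling_tps(SUSTAINED_GBS, size_gb)
    high = ceiling_tps(PEAK_GBS, size_gb)
    print(f"{name:14s} sustained {low:5.2f}  peak {high:5.2f}  measured {measured:5.2f} tok/s")
```

The estimate is deliberately crude: it ignores KV cache reads and cache hits, so treat it as a sanity check on a measurement rather than a prediction.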

The corollary: pushing to a DDR4-4000 XMP kit buys you about 11% more bandwidth (4,000 / 3,600) and roughly the same gain in tok/s. Pushing CL18 → CL16 buys you another 3–5%. Both are worth doing. The bigger jump is going to a Threadripper Pro with 8 memory channels, but that platform costs $2,000+ for the CPU alone and is not in the same conversation.

Perf-per-dollar and perf-per-watt vs adding a 12GB GPU

Honest comparison at 7B q4_K_M generation:

Platform	Generation tok/s	Hardware cost	Watts under load	Tok/s per $
5800X CPU only, 32GB RAM	10	$400	130	0.025
5800X + RTX 3060 12GB	45	$700	270	0.064

The GPU build delivers roughly 2.5× the tok/s per dollar, ignoring electricity. Where the CPU box wins is the floor: if you already own it and never add a GPU, the marginal hardware cost of inference is $0. The CPU box also stays relevant for very large models that exceed even 24GB of VRAM: at 70B most of the weights sit in system RAM either way, so a hybrid GPU+CPU offload setup is still gated by the same DDR4 bandwidth for the layers left on the CPU side.
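The table gives tok/s per dollar but not the per-watt side of the comparison. A minimal sketch of both ratios, using only the figures in the table (names such as `per_watt` are hypothetical):

```python
# Figures copied from the comparison table (7B q4_K_M generation).
PLATFORMS = {
    "5800X CPU only, 32GB RAM": {"tps": 10, "cost": 400, "watts": 130},
    "5800X + RTX 3060 12GB": {"tps": 45, "cost": 700, "watts": 270},
}


def per_dollar(p: dict) -> float:
    """Generation tok/s per dollar of hardware."""
    return p["tps"] / p["cost"]


def per_watt(p: dict) -> float:
    """Generation tok/s per watt under load."""
    return p["tps"] / p["watts"]


def joules_per_token(p: dict) -> float:
    """Energy per generated token: watts divided by tok/s."""
    return p["watts"] / p["tps"]


for name, p in PLATFORMS.items():
    print(f"{name:26s} {per_dollar(p):.3f} tok/s/$  "
          f"{per_watt(p):.3f} tok/s/W  {joules_per_token(p):.1f} J/token")
```

On these numbers the GPU build leads on both axes, which is consistent with the article's framing that the CPU box competes on entry cost rather than efficiency.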

When NOT to do CPU inference

Interactive chat with long contexts. Time to first token (TTFT) on a 4K prompt is 30–60 seconds; users will not wait.
High concurrency. A single 5800X can serve maybe two concurrent users at acceptable throughput. Anything past that is GPU territory.
Frontier reasoning. Same caveat as the GPU path: the gap between 14B open models and frontier hosted models is real.
Cold starts in serverless. Loading a 70B q4 model takes minutes off a cold disk; this only works for long-running processes.

Bottom line

For a sub-$500 host, a Ryzen 7 5800X (or its 5700X / 5600G cousins) with 32GB of dual-channel DDR4-3600 and an SSD is a credible answer for any batch workload up to about 13B parameters, and with 64GB of RAM a slow-but-usable answer up to 70B. Generation is memory-bandwidth-bound, so the real upgrade path is faster RAM, not a faster CPU. If your workload is interactive or you need more than two concurrent users, save up for a 12GB GPU. If your workload is a nightly classification job over a corpus, the CPU box has been the right answer for two years and the math still works in 2026.

Frequently asked questions

How many tokens per second can a Ryzen 7 5800X push on CPU?

It depends heavily on model size and memory bandwidth. Community measurements on dual-channel DDR4 typically show single-digit to low-double-digit tok/s for 7B-13B models at q4, and around 1 tok/s for 70B at q4. The 5800X's eight Zen 3 cores help prefill, but generation is gated by memory bandwidth, not core count, so faster RAM helps more than more threads.

Why is CPU inference so much slower than GPU for the same model?

Token generation is memory-bandwidth bound. A 12GB GPU has hundreds of GB/s of bandwidth, while dual-channel DDR4 on an AM4 board offers far less. The 5800X has plenty of compute, but each generated token must stream the model weights from RAM, so the slower memory subsystem caps throughput regardless of how many cores you throw at it.

Does adding more RAM speed up CPU inference?

More RAM lets you load bigger models without swapping, but capacity alone does not raise tok/s. What helps is bandwidth and topology: populating both channels, running the kit's rated XMP speed (labeled DOCP on some AM4 boards), and using a low-latency kit. Going from a single stick to a matched dual-channel kit can meaningfully improve generation speed because it widens the memory path the weights stream through.

Is the Ryzen 5 5600G a worse choice than the 5800X for this?

For pure CPU inference the 5600G's six cores and smaller cache make it a step down, and its APU design splits some memory bandwidth with the iGPU. It is the better pick only if you want a no-discrete-GPU box for light experimentation. For sustained CPU inference on larger models, the eight-core 5800X or 5700X is the stronger value.

Should I just buy a GPU instead of doing CPU inference?

If your models fit in 12GB, yes, a GPU is dramatically faster per watt and per dollar of patience. CPU inference earns its place when a model is too large for your VRAM and you would otherwise pay cloud rates or buy a much pricier card. Treat CPU as the overflow path for occasional large-model runs, not your daily interactive driver.
