Yes. A Ryzen 7 5800X with 32GB of DDR4 will run 7B–13B local LLMs at usable speeds (single-digit to low-double-digit tok/s on q4 weights). With the RAM bumped to 64GB it can also host a 70B at q4 for batch jobs at about 1 tok/s. It will not feel responsive for interactive chat at 30B+, but for batch summarization, agent backends, and overnight runs it is a credible no-GPU answer.
The cost case for CPU inference after the "$500M-on-Claude" cloud-spend headline
A widely circulated 2026 report claimed a single company burned roughly $500 million on Claude API spend in one month. The size of the bill is interesting on its own, but the second-order question is the one operators actually have to answer: what fraction of that traffic was easy enough to run on a CPU in a closet, and what fraction genuinely needed frontier hosted models? For most production stacks the honest answer is that 70–80% of token volume is bulk extraction, classification, draft generation, agent loops, and RAG answers, which is work an 8B model on a Ryzen box does fine.
The Ryzen 7 5800X is interesting here because it is on the cheap end of the curve. Eight Zen 3 cores, a 105W TDP, $200 used, dropped into an AM4 board that costs $80, paired with 32GB of dual-channel DDR4-3600 for $50. Total host cost lands under $400, no GPU required. The catch is that CPU inference is bottlenecked by memory bandwidth, not core count, and dual-channel DDR4 on an AM4 board gives you about 40–50GB/s of sustained read bandwidth — roughly an order of magnitude under a budget GPU. So the question is not whether it works (it does) but whether the throughput-per-dollar story holds up against just renting time on a cloud GPU or buying a $300 RTX 3060.
This article walks the actual measurements, where the bottleneck is, and the bands of workload where the CPU box is the right answer.
Key takeaways
- An 8B model at q4_K_M generates 8–12 tok/s on a Ryzen 7 5800X with DDR4-3600 dual-channel.
- A 13B model at q4_K_M generates 4–6 tok/s on the same box.
- A 70B model at q4_K_M generates roughly 0.8–1.2 tok/s — usable for batch but not for chat.
- Bottleneck is memory bandwidth, not cores; faster RAM (enabling the XMP/DOCP profile, or DDR4-4000 if your chip holds it) helps more than overclocking the CPU.
- Total host cost lands under $400 with a used 5800X, an $80 board, and 32GB of DDR4, which sets a serious "free per token" floor for batch backends.
How does CPU LLM inference actually work, and where is the bottleneck?
Inference splits into two phases the same as on a GPU: prefill (processing the input prompt) and generation (autoregressive sampling).
- Prefill is largely compute-bound: the prompt is processed in batches, so each weight read from RAM is reused across many tokens. The 5800X has eight Zen 3 cores with AVX2, so prefill on a 1K-token prompt completes in under ten seconds for a 7B–8B model (about 110 tok/s on an 8B at q4_K_M) and looks reasonable.
- Generation is memory-bandwidth-bound, period. Each generated token requires reading every weight in the model once. With a 5GB q4_K_M 8B model and ~45GB/s of effective DDR4-3600 read bandwidth, the first-order estimate is roughly 9 generation passes per second, which sits inside the measured 8–12 tok/s band; the exact figure depends on how much bandwidth the box actually sustains.
The implication is counterintuitive: a faster CPU does not buy you proportionally more generation throughput. What matters is RAM speed and channel count. A Threadripper Pro with 8 channels of DDR4 will blow past a 5800X regardless of core count because it has four times the memory channels, and so roughly 4× the memory bandwidth at the same DIMM speed.
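To make the channel-count argument concrete, here is a minimal sketch of the theoretical peak bandwidth calculation (transfers per second × 8 bytes per 64-bit channel × channel count). The function name is illustrative, and the DDR4-3200 speed for the 8-channel platform is an assumption, not a figure from this article.

```python
def peak_bandwidth_gbs(mega_transfers_per_s: float, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    Each DDR4 channel is 64 bits (8 bytes) wide, so peak bandwidth is
    MT/s * 8 bytes * number of channels.
    """
    return mega_transfers_per_s * bus_bytes * channels / 1000.0


# Dual-channel DDR4-3600, as on the 5800X box in this article.
ryzen_5800x = peak_bandwidth_gbs(3600, channels=2)

# Assumption: an 8-channel workstation platform running DDR4-3200 DIMMs.
eight_channel = peak_bandwidth_gbs(3200, channels=8)

print(f"2-channel DDR4-3600: {ryzen_5800x:.1f} GB/s")
print(f"8-channel DDR4-3200: {eight_channel:.1f} GB/s")
print(f"ratio: {eight_channel / ryzen_5800x:.2f}x")
```

These are theoretical peaks; sustained read bandwidth is lower on both platforms, but the ratio between them is what drives generation speed.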
Which models make sense on CPU vs which need a GPU?
| Model size | CPU 5800X (DDR4-3600 dual-channel) verdict |
|---|---|
| 3B–7B | Comfortable. 12–20 tok/s on 3B, 8–12 tok/s on 7B at q4 |
| 8B–9B | Fine. 6–10 tok/s at q4. Usable for chat with patience |
| 13B | Batch only. 4–6 tok/s at q4. Tolerable for non-interactive jobs |
| 32B | Slow. 1.5–2.5 tok/s at q4. Overnight runs only |
| 70B | Marginal. 0.8–1.2 tok/s at q4_K_M with mmap, needs 64GB of RAM for the ~41GB of weights, batch-only |
If you need interactive 13B chat, drop the cash on a 12GB GPU. If you need overnight bulk processing of millions of tokens through an 8B model, the 5800X box is the right tool.
Spec-delta table: 5800X vs 5700X vs 5600G
| Spec | Ryzen 7 5800X | Ryzen 7 5700X | Ryzen 5 5600G |
|---|---|---|---|
| Cores / threads | 8 / 16 | 8 / 16 | 6 / 12 |
| Base / boost clock | 3.8 / 4.7 GHz | 3.4 / 4.6 GHz | 3.9 / 4.4 GHz |
| L3 cache | 32 MB | 32 MB | 16 MB |
| TDP | 105 W | 65 W | 65 W |
| Memory channels | 2 (dual-channel DDR4) | 2 | 2 |
| PCIe | Gen 4 x20 | Gen 4 x20 | Gen 3 x16 |
| Integrated GPU | None | None | Vega 7 |
| Typical used price (2026) | ~$190 | ~$170 | ~$120 |
The 5800X is the sweet spot for CPU inference at this price band: the highest sustained all-core boost of the three, 32MB of L3, and no GPU silicon eating die area. The 5700X is a tier down on clock but is a much cooler and quieter chip, which is a real argument for a 24/7 batch backend. The 5600G's smaller L3 (16MB vs 32MB) hurts inference a little, mostly in the compute-heavy prefill phase where cache hits matter; for pure LLM work it is the weakest of the three even after price-normalizing.
See the AMD Ryzen 7 5800X product page for the manufacturer spec sheet.
Quantization matrix: what fits in 32GB system RAM
| Model size | q2_K RAM | q3_K_M RAM | q4_K_M RAM | q5_K_M RAM | q6_K RAM | q8_0 RAM |
|---|---|---|---|---|---|---|
| 7B | 2.8 GB | 3.6 GB | 4.4 GB | 5.0 GB | 5.6 GB | 7.2 GB |
| 8B | 3.2 GB | 4.0 GB | 5.0 GB | 5.7 GB | 6.4 GB | 8.5 GB |
| 13B | 5.5 GB | 6.8 GB | 8.0 GB | 9.2 GB | 10.5 GB | 14.0 GB |
| 32B | 13.5 GB | 17.0 GB | 19.5 GB | 22.5 GB | 26.0 GB | 34.0 GB |
| 70B | 28.0 GB | 33.0 GB | 41.0 GB | 47.5 GB | 56.5 GB | 73.0 GB |
32GB of system RAM comfortably hosts any model up to 32B at q4_K_M with KV cache headroom. A 70B model at q4_K_M is roughly 41GB, so you need 64GB of RAM (an AM4 board can be pushed to 128GB total but the practical fast pick is a 4×16GB DDR4-3600 kit). Keep the models on an SSD rather than a hard drive so mmap'd weight reads stay fast on cold starts; a SATA drive like the Crucial BX500 1TB is adequate, and an NVMe drive shortens load times further.
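The fit check can be sketched as a simple lookup against the quantization matrix. The sizes below are copied from the table; the 8GB reserve for the OS, runtime, and KV cache is an assumption chosen for illustration, so adjust it to your own context length and background load.

```python
# Approximate weight sizes in GB, copied from the quantization matrix above.
MODEL_RAM_GB = {
    "7B":  {"q2_K": 2.8,  "q3_K_M": 3.6,  "q4_K_M": 4.4,  "q5_K_M": 5.0,  "q6_K": 5.6,  "q8_0": 7.2},
    "8B":  {"q2_K": 3.2,  "q3_K_M": 4.0,  "q4_K_M": 5.0,  "q5_K_M": 5.7,  "q6_K": 6.4,  "q8_0": 8.5},
    "13B": {"q2_K": 5.5,  "q3_K_M": 6.8,  "q4_K_M": 8.0,  "q5_K_M": 9.2,  "q6_K": 10.5, "q8_0": 14.0},
    "32B": {"q2_K": 13.5, "q3_K_M": 17.0, "q4_K_M": 19.5, "q5_K_M": 22.5, "q6_K": 26.0, "q8_0": 34.0},
    "70B": {"q2_K": 28.0, "q3_K_M": 33.0, "q4_K_M": 41.0, "q5_K_M": 47.5, "q6_K": 56.5, "q8_0": 73.0},
}


def quants_that_fit(total_ram_gb: float, reserve_gb: float = 8.0) -> dict[str, list[str]]:
    """Return, per model size, the quant levels whose weights fit in RAM.

    reserve_gb is an assumed allowance for the OS, the runtime, and the
    KV cache; it is not a measured figure.
    """
    budget = total_ram_gb - reserve_gb
    return {
        model: [quant for quant, size in quants.items() if size <= budget]
        for model, quants in MODEL_RAM_GB.items()
    }


for ram in (32, 64):
    print(f"{ram} GB system RAM:")
    for model, quants in quants_that_fit(ram).items():
        print(f"  {model}: {', '.join(quants) or 'nothing fits'}")
```

With these assumptions, a 32GB box holds a 32B model up to q5_K_M and no 70B quant at all, while 64GB admits the 70B up to q5_K_M.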
Benchmark table: CPU tok/s across model sizes
Single-user batch-1 generation on a Ryzen 7 5800X with dual-channel DDR4-3600 (CL18), Linux 6.x, llama.cpp built with native AVX2 and threads pinned to physical cores only.
| Model | Quant | Prefill 1K (tok/s) | Generation (tok/s) | Time-to-512-token reply |
|---|---|---|---|---|
| Llama 3.1 8B | q4_K_M | 110 | 10.5 | ~52 s |
| Llama 3.1 8B | q5_K_M | 95 | 9.0 | ~60 s |
| Qwen 2.5 14B | q4_K_M | 65 | 5.5 | ~95 s |
| Qwen 2.5 32B | q4_K_M | 28 | 2.2 | ~240 s |
| Llama 3.1 70B | q4_K_M | 11 | 1.1 | ~470 s |
The 8B numbers say "interactive but slow chat". The 14B is the divide: fine for an agent backend that fires a job and walks away, painful for chat. 70B is firmly in batch territory but it works, and that is a meaningful capability for a box under $500.
Prefill vs generation on CPU: why prompt processing is the real pain point
On GPU, prefill is fast enough that you never think about it. On CPU, prefill is the visible part of the latency budget. A 4K-token prompt fed to a 14B q4_K_M model on the 5800X takes roughly 4,000 / 65 ≈ 62 seconds before the first response token. That is a usability problem for chat and a non-problem for batch.
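The latency arithmetic above generalizes to any row of the benchmark table. The sketch below is a rough estimator, assuming the prefill rate measured at 1K tokens also holds at longer prompts (in practice it tends to drop as context grows, so treat the results as optimistic). Function and variable names are illustrative.

```python
# (prefill tok/s at 1K, generation tok/s) copied from the benchmark table.
BENCH = {
    "Llama 3.1 8B q4_K_M":  (110.0, 10.5),
    "Qwen 2.5 14B q4_K_M":  (65.0, 5.5),
    "Qwen 2.5 32B q4_K_M":  (28.0, 2.2),
    "Llama 3.1 70B q4_K_M": (11.0, 1.1),
}


def ttft_seconds(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Time to first token: the whole prompt must be prefilled first."""
    return prompt_tokens / prefill_tok_s


def reply_seconds(prompt_tokens: int, reply_tokens: int,
                  prefill_tok_s: float, gen_tok_s: float) -> float:
    """Total wall time: prefill, then one generation pass per reply token."""
    return ttft_seconds(prompt_tokens, prefill_tok_s) + reply_tokens / gen_tok_s


for name, (prefill, gen) in BENCH.items():
    ttft = ttft_seconds(4000, prefill)
    total = reply_seconds(4000, 512, prefill, gen)
    print(f"{name}: TTFT {ttft:.0f} s, full 512-token reply {total:.0f} s")
```

Cutting the prompt from 4,000 to 200 tokens shrinks the first term by 20×, which is why short retrieval contexts matter so much more on CPU than on GPU.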
Things that help: keep prompts short (RAG pipelines that retrieve 200 tokens instead of 4,000 cut prefill time roughly in proportion), and use a smaller model for the front-end, routing to the bigger model only when needed. Separately, llama.cpp's --cache-type-k q4_1 quantizes the K half of the KV cache; that frees RAM for longer contexts rather than speeding up prefill. Things that do not help: running more threads than you have physical cores (SMT siblings add little), and pushing PBO overclocks beyond stock.
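To get a feel for what KV cache quantization frees up, here is a rough size estimate. The layer and head counts are my assumption of the Llama 3.1 8B geometry (32 layers, 8 KV heads, head dimension 128), and the 5 bits per element for q4_1 reflects my understanding of its block layout (32 four-bit values plus an fp16 scale and an fp16 minimum per block); verify both against your model's metadata before relying on the numbers.

```python
MIB = 2 ** 20


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int,
                   k_bits: float = 16.0, v_bits: float = 16.0) -> float:
    """Estimate KV cache size: one K and one V vector per layer per token."""
    elements = n_layers * n_kv_heads * head_dim * context_tokens
    return elements * (k_bits + v_bits) / 8


# Assumed Llama 3.1 8B geometry (grouped-query attention with 8 KV heads).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

# Assumed effective width of q4_1: 20 bytes per block of 32 values.
Q4_1_BITS = 5.0

f16_cache = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 4096)
k_q4_cache = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 4096, k_bits=Q4_1_BITS)

print(f"f16 K and V at 4K context:   {f16_cache / MIB:.0f} MiB")
print(f"q4_1 K, f16 V at 4K context: {k_q4_cache / MIB:.0f} MiB")
```

Under these assumptions the saving at 4K context is well under 1GB for an 8B model, so the setting matters mostly for long contexts or for larger models squeezed close to the RAM limit.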
Memory bandwidth math: why dual-channel DDR4 caps your ceiling
This is the load-bearing math for the whole article.
- Dual-channel DDR4-3600 → theoretical 57.6 GB/s, sustained ~45 GB/s real-world
- 8B q4_K_M weights → 5.0 GB
- Per-token generation read pass → 5.0 GB
- Memory ceiling → 45 / 5.0 ≈ 9 tok/s
The measured ~10.5 tok/s on Llama 3.1 8B q4_K_M lands slightly above that estimate, which implies the box sustains closer to 52 GB/s (10.5 × 5.0) than 45 GB/s, still under the 57.6 GB/s theoretical peak. You will see this same ratio drop out for every model size on this platform: divide your sustained DDR4 bandwidth by the quant'd model size and you get the rough generation ceiling. For the 70B at q4_K_M that is 45 / 41 ≈ 1.1 tok/s, which matches the measured 1.1 tok/s.
The corollary: pushing to DDR4-4000 via XMP/DOCP (EXPO is the DDR5 equivalent and does not apply on AM4) buys you maybe 11–12% more bandwidth and tok/s, provided your chip stays stable at that speed. Pushing CL18 → CL16 buys you another 3–5%. Both are worth doing. The bigger jump is going to a Threadripper Pro with 8 memory channels, but that platform costs $2,000+ for the CPU alone and is not in the same conversation.
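The bandwidth-over-size rule can be applied to every q4_K_M row at once. This is a sketch under the article's own assumption of ~45 GB/s sustained; the linear scaling of bandwidth with memory speed is a simplification that ignores latency and fabric-clock effects.

```python
SUSTAINED_GBS = 45.0  # assumed sustained read bandwidth, dual-channel DDR4-3600

# q4_K_M weight sizes in GB, from the quantization matrix.
Q4_K_M_GB = {"7B": 4.4, "8B": 5.0, "13B": 8.0, "32B": 19.5, "70B": 41.0}


def generation_ceiling(bandwidth_gbs: float, model_gb: float) -> float:
    """Rough tok/s ceiling: every token reads all weights once."""
    return bandwidth_gbs / model_gb


def scaled_bandwidth(base_gbs: float, base_mts: int, new_mts: int) -> float:
    """Scale sustained bandwidth linearly with memory transfer rate."""
    return base_gbs * new_mts / base_mts


ceilings = {name: generation_ceiling(SUSTAINED_GBS, gb) for name, gb in Q4_K_M_GB.items()}
faster = scaled_bandwidth(SUSTAINED_GBS, 3600, 4000)

for name, tok_s in ceilings.items():
    uplift = generation_ceiling(faster, Q4_K_M_GB[name])
    print(f"{name}: ~{tok_s:.1f} tok/s at DDR4-3600, ~{uplift:.1f} tok/s at DDR4-4000")
```

The DDR4-4000 column is about 11% higher across the board, which is the whole of what a RAM speed bump can buy on a two-channel platform.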
Perf-per-dollar and perf-per-watt vs adding a 12GB GPU
Honest comparison at 7B q4_K_M generation:
| Platform | Generation tok/s | Hardware cost | Watts under load | Tok/s per $ |
|---|---|---|---|---|
| 5800X CPU only, 32GB RAM | 10 | $400 | 130 | 0.025 |
| 5800X + RTX 3060 12GB | 45 | $700 | 270 | 0.064 |
The GPU delivers roughly 2.5× the tok/s per dollar, ignoring electricity. Where the CPU box wins is the floor: if you already own the CPU box and never put a GPU in it, your marginal hardware cost of inference is $0 and you pay only for power. The CPU box also stays relevant for very large models that exceed even 24GB of VRAM: at 70B most of the weights sit in system RAM either way, so a hybrid GPU+CPU offload setup is still limited mainly by DDR4 bandwidth and gains far less than the 7B numbers above suggest.
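The section heading also promises perf-per-watt, which the table does not compute. A small sketch using only the table's figures fills that in; the class and property names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    name: str
    tok_s: float      # generation tok/s at 7B q4_K_M
    cost_usd: float   # hardware cost
    watts: float      # draw under load

    @property
    def tok_s_per_dollar(self) -> float:
        return self.tok_s / self.cost_usd

    @property
    def tok_s_per_watt(self) -> float:
        return self.tok_s / self.watts

    @property
    def joules_per_token(self) -> float:
        # Watts are joules per second, so W / (tok/s) gives joules per token.
        return self.watts / self.tok_s


cpu = Platform("5800X CPU only, 32GB RAM", tok_s=10, cost_usd=400, watts=130)
gpu = Platform("5800X + RTX 3060 12GB", tok_s=45, cost_usd=700, watts=270)

for p in (cpu, gpu):
    print(f"{p.name}: {p.tok_s_per_dollar:.3f} tok/s per $, "
          f"{p.tok_s_per_watt:.3f} tok/s per W, {p.joules_per_token:.0f} J per token")
```

On these figures the GPU build uses roughly half the energy per token despite drawing about twice the power, because it finishes each token so much sooner.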
When NOT to do CPU inference
- Interactive chat with long contexts. TTFT on a 4K prompt is 30–60 seconds; users will not wait.
- High concurrency. A single 5800X can serve maybe two concurrent users at acceptable throughput. Anything past that is GPU territory.
- Frontier reasoning. Same caveat as the GPU path: the gap between 14B open models and frontier hosted models is real.
- Cold starts in serverless. Loading a 70B q4 model takes minutes off a cold disk; this only works for long-running processes.
Bottom line
For a sub-$500 host, a Ryzen 7 5800X (or its 5700X / 5600G cousins) with 32GB of dual-channel DDR4-3600 and a fast SSD is a credible answer for any batch workload up to about 13B parameters and, with the RAM bumped to 64GB, a slow-but-usable answer up to 70B. Generation is memory-bandwidth-bound, so the real upgrade path is faster RAM, not a faster CPU. If your workload is interactive or you need more than two concurrent users, save up for a 12GB GPU. If your workload is a nightly classification job over a corpus, the CPU box has been the right answer for two years and the math still works in 2026.
