Yes, a Ryzen 7 5800X helps when a local LLM spills layers off a 12GB RTX 3060. But "helps" is doing a lot of work in that sentence. The brutal truth is that offloading just four of a 14B model's 40 layers cuts generation from ~38 tokens per second to about 16, and a fully CPU-resident 14B q4 model lands around 4 tok/s on this CPU. That is not a gentle slope; that is a cliff.
When VRAM runs out
Every local builder runs into this. The model card said "12B q4" and the model loaded fine, but the moment you stretched the context window to 8k or asked for a slightly larger sibling (a 14B, a 27B, a Mixtral 8x7B), the loader started parking layers in system RAM and your tok/s collapsed. This is the moment CPU offload becomes a real engineering decision rather than a flag in a config file.
This piece is for the builder who is already running an RTX 3060 12GB on AM4 and is weighing how much CPU to throw at the spillover problem. The cited measurements are from public llama.cpp benchmark threads on the llama.cpp GitHub and from manufacturer specs published by AMD and NVIDIA / TechPowerUp.
Key Takeaways
- Offloading even four of 40 layers drops single-user tok/s by nearly 60% on a 14B q4 model.
- Memory bandwidth, not core count, is the CPU-side bottleneck. The 5800X and 5700X land within 5-8% of each other.
- A bigger GPU beats more CPU on this workload. RAM upgrades are triage.
- The 12GB ceiling for fully-resident inference is roughly a 14B q4 model at 4k context.
- The right knob for sub-VRAM models is --n-gpu-layers; for over-VRAM models, the right knob is "use a smaller quant."
What does --n-gpu-layers actually do when you offload to CPU?
llama.cpp builds a layer graph and lets you specify how many of those layers live on the GPU. Set it to the maximum and every layer runs on CUDA; set it to zero and every layer runs on CPU; set it to anything in between and the model is split, with intermediate activations crossing the PCIe bus on every forward pass.
The catch: activations are small but the weights dominate. When a layer runs on CPU, every token streams that layer's weights from system RAM into the CPU cache. With DDR4-3200 dual-channel pushing roughly 51GB/s theoretical bandwidth, the CPU starves on every layer. The RTX 3060's GDDR6 pushes 360GB/s in comparison, per the TechPowerUp specs — a ~7x gap.
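The bandwidth gap translates into a rough ceiling on generation speed. The sketch below is a back-of-the-envelope model, not a benchmark: it assumes every generated token reads every weight exactly once, that GPU and CPU layers run one after the other, and that a 14B q4_K_M file is about 8.5 GB spread evenly over 40 layers. The file size, the even split, and the function name `tok_s_ceiling` are illustrative assumptions.

```python
# Rough bandwidth-only ceiling for single-user generation speed.
# Assumptions (illustrative, not measured): ~8.5 GB of weights for a
# 14B q4_K_M model, spread evenly over 40 layers; every token reads
# every weight once; GPU and CPU layers execute sequentially.

GPU_BW_GBS = 360.0   # RTX 3060 12GB GDDR6, theoretical
CPU_BW_GBS = 51.2    # dual-channel DDR4-3200, theoretical


def tok_s_ceiling(cpu_layers: int, total_layers: int = 40,
                  weights_gb: float = 8.5) -> float:
    """Upper bound on tokens/s when `cpu_layers` layers live in system RAM."""
    if not 0 <= cpu_layers <= total_layers:
        raise ValueError("cpu_layers must be between 0 and total_layers")
    per_layer_gb = weights_gb / total_layers
    # Time to stream the GPU-resident and CPU-resident weights once each.
    gpu_seconds = (total_layers - cpu_layers) * per_layer_gb / GPU_BW_GBS
    cpu_seconds = cpu_layers * per_layer_gb / CPU_BW_GBS
    return 1.0 / (gpu_seconds + cpu_seconds)


if __name__ == "__main__":
    for n in (0, 4, 8, 16, 40):
        print(f"{n:>2} layers on CPU: <= {tok_s_ceiling(n):.1f} tok/s")
```

Under these assumptions the ceiling falls from roughly 42 tok/s with everything on the GPU to roughly 6 tok/s with everything on the CPU. Real throughput lands below these ceilings because the model ignores compute time, PCIe transfers, and scheduling overhead.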
How much does a Ryzen 7 5800X slow down generation versus pure GPU on an RTX 3060 12GB?
Figures compiled from public llama.cpp benchmark threads for a 14B q4_K_M model at 4k context:
| Layers on CPU (of 40) | tok/s on 5800X + 3060 | Slowdown vs pure GPU |
|---|---|---|
| 0 (all on GPU) | 38 | 1.0x |
| 4 | 16 | 2.4x slower |
| 8 | 9 | 4.2x slower |
| 16 | 5 | 7.6x slower |
| 40 (all on CPU) | 4 | 9.5x slower |
The headline: the first few offloaded layers are not "a little slower." They are 2-4x slower. The model becomes noticeably laggy at four layers and uncomfortable at eight.
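To make "laggy" concrete, the tok/s figures in the table above can be converted into wall-clock wait. The sketch below recomputes the slowdown column and the generation time for a single reply; the 300-token reply length is an arbitrary illustration, and prefill time is left out.

```python
# Convert the table's tok/s figures into slowdown and wall-clock wait.
# The 300-token answer length is an arbitrary illustration.

# CPU-resident layers (of 40) -> tok/s, copied from the table above.
TOK_S_BY_CPU_LAYERS = {0: 38, 4: 16, 8: 9, 16: 5, 40: 4}


def slowdown(cpu_layers: int) -> float:
    """How many times slower than the all-GPU run."""
    return TOK_S_BY_CPU_LAYERS[0] / TOK_S_BY_CPU_LAYERS[cpu_layers]


def seconds_for_answer(cpu_layers: int, answer_tokens: int = 300) -> float:
    """Generation time for an answer of `answer_tokens` tokens (prefill excluded)."""
    return answer_tokens / TOK_S_BY_CPU_LAYERS[cpu_layers]


if __name__ == "__main__":
    for layers in TOK_S_BY_CPU_LAYERS:
        print(f"{layers:>2} CPU layers: {slowdown(layers):.1f}x slower, "
              f"{seconds_for_answer(layers):.0f} s per 300-token answer")
```

A 300-token reply that takes about 8 seconds fully on the GPU takes about 19 seconds with four layers offloaded and 75 seconds on the CPU alone.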
The lesson: if you can quantize one tier harder to keep the model fully GPU-resident, do it. q4_K_M to q4_K_S, or q5 to q4, almost always wins versus splitting the model.
Spec-delta table
| Spec | Ryzen 7 5800X | Ryzen 7 5700X | Ryzen 5 5600G |
|---|---|---|---|
| Cores / threads | 8 / 16 | 8 / 16 | 6 / 12 |
| Base / boost clock | 3.8 / 4.7 GHz | 3.4 / 4.6 GHz | 3.9 / 4.4 GHz |
| L3 cache | 32 MB | 32 MB | 16 MB |
| TDP | 105 W | 65 W | 65 W |
| Memory channels | 2 | 2 | 2 |
| Theoretical RAM BW (DDR4-3200) | ~51 GB/s | ~51 GB/s | ~51 GB/s |
| Integrated graphics | None | None | Radeon Vega 7 |
| Typical street price | ~$190 | ~$160 | ~$130 |
| Best for | High-load inference + gaming | Efficient inference build | No-dGPU server |
Per AMD's product page, all three target Socket AM4 and share the same Zen 3 core microarchitecture, so per-clock performance is close; the 5600G's halved L3 cache is the main architectural difference. The Ryzen 7 5700X is the price-conscious pick; the Ryzen 5 5600G is the budget no-dGPU option.
Quantization matrix
Practical fit on a 12GB RTX 3060 at 4k context (single-user):
| Model size | q2_K | q3_K_M | q4_K_M | q5_K_M | q6_K | q8_0 | fp16 |
|---|---|---|---|---|---|---|---|
| 7B | Fits | Fits | Fits | Fits | Fits | Fits | Spills |
| 8B | Fits | Fits | Fits | Fits | Fits | Fits | Spills |
| 13B-14B | Fits | Fits | Fits | Tight | Spills | Spills | Spills |
| 27B-32B | Fits | Tight | Spills | Spills | Spills | Spills | Spills |
| 70B | Spills | Spills | Spills | Spills | Spills | Spills | Spills |
"Tight" means it fits at 4k context but spills at 8k. "Spills" means at least one layer is parked in CPU RAM with the corresponding throughput hit. For a 14B q5, the spill is a handful of layers, which is still painful (see the 4-layer row in the table in the previous section).
Prefill versus generation: where the CPU hurts most
llama.cpp separates prefill (the initial pass that ingests the prompt) from generation (the per-token loop). Prefill is compute-bound and parallelizable; generation is memory-bound and serial. The CPU loses on both, but the gap is larger on generation because each token must touch all weights again.
Concretely, on a 14B q4 at all-CPU, prefill is about 1.5-2x slower than the GPU equivalent, and generation is the 9-10x slower number quoted above. If you have a workflow that asks one big question and reads a small answer, the CPU penalty is felt mostly at the front. If you ask short questions and read long answers, the CPU penalty is the wall.
Context-length impact
A longer context increases KV-cache size, which competes with weights for VRAM. On a 14B q4 model, the KV cache at 4k context takes roughly 1.5GB; at 8k it doubles. That pushes the spillover point earlier — what fits at 4k may not at 8k.
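The 4k-versus-8k effect can be sketched for a 14B model with a simple budget check: weights plus KV cache against usable VRAM. Every constant below is an approximation chosen for illustration. The bits-per-weight values are rough averages for llama.cpp quant formats, the 0.3 GB reserve for CUDA context and scratch buffers is a guess, and the KV-cache sizes are the 14B figures quoted above, so the sketch should not be reused for other model sizes without changing them.

```python
# Classify quant levels as Fits / Tight / Spills for a 14B model on a 12 GB card.
# All constants are approximations chosen for illustration:
#  - bits per weight are rough averages for llama.cpp quant formats
#  - KV cache is taken as 1.5 GB at 4k and 3.0 GB at 8k (fp16 cache, 14B model)
#  - 0.3 GB is reserved for CUDA context and scratch buffers (a guess)

BITS_PER_WEIGHT = {
    "q2_K": 3.35, "q3_K_M": 3.9, "q4_K_M": 4.85,
    "q5_K_M": 5.7, "q6_K": 6.6, "q8_0": 8.5, "fp16": 16.0,
}
KV_CACHE_GB = {4096: 1.5, 8192: 3.0}
VRAM_GB = 12.0
RESERVED_GB = 0.3


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the weights in GB for a given quant."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8.0


def classify(params_billion: float, quant: str) -> str:
    """Fits at 8k -> 'Fits'; fits only at 4k -> 'Tight'; otherwise 'Spills'."""
    budget = VRAM_GB - RESERVED_GB
    w = weights_gb(params_billion, quant)
    if w + KV_CACHE_GB[4096] > budget:
        return "Spills"
    if w + KV_CACHE_GB[8192] > budget:
        return "Tight"
    return "Fits"


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"14B {quant:<7} {weights_gb(14, quant):5.1f} GB -> {classify(14, quant)}")
```

With these assumptions a 14B q5_K_M model fits at 4k but not at 8k, while q4_K_M and below fit at both, which matches the 13B-14B row of the quantization matrix.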
The pragmatic move: cap context at the smallest length your task needs. Coding agents rarely need more than 8k; chat-with-doc workflows can spike higher. KV-cache quantization (q8 or q4 for the cache) helps as well, at a small quality cost.
Perf-per-dollar and perf-per-watt
| Build | Approx cost | Tok/s at 14B q4 | Tok/s/dollar | Tok/s/watt |
|---|---|---|---|---|
| 3060 + 5800X, all on GPU | $450 | 38 | 0.084 | 0.18 |
| 3060 + 5800X, partial CPU offload | $450 | 7 | 0.016 | 0.04 |
| 3060 + 5700X, all on GPU | $420 | 38 | 0.090 | 0.21 |
| 3060 + 5600G, all on GPU | $390 | 36 | 0.092 | 0.22 |
| 4060 Ti 16GB + 5700X | $620 | 42 (14B q5) | 0.068 | 0.16 |
| 4070 12GB + 5700X | $700 | 60 | 0.086 | 0.17 |
The lesson repeats: any layer on CPU collapses tok/s/dollar. The best move when you outgrow 12GB is a 16GB GPU, not a meatier CPU.
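The tok/s/dollar column is a plain division of the table's throughput by its approximate cost, so it is easy to recompute when prices move. The sketch below redoes that arithmetic from the table's own figures; the row labels are shortened for illustration, and the 4060 Ti row runs a 14B q5 model, so it is not a like-for-like comparison.

```python
# Recompute the tok/s-per-dollar column from the cost and tok/s columns.
# Costs and throughput are the approximate figures from the table above.

BUILDS = [
    # (build, approx cost in USD, tok/s)
    ("3060 + 5800X, all on GPU", 450, 38),
    ("3060 + 5800X, CPU offload", 450, 7),
    ("3060 + 5700X, all on GPU", 420, 38),
    ("3060 + 5600G, all on GPU", 390, 36),
    ("4060 Ti 16GB + 5700X (14B q5)", 620, 42),
    ("4070 12GB + 5700X", 700, 60),
]


def tok_s_per_dollar(cost_usd: float, tok_s: float) -> float:
    """Throughput per dollar of build cost, rounded to three decimals."""
    return round(tok_s / cost_usd, 3)


if __name__ == "__main__":
    ranked = sorted(BUILDS, key=lambda b: b[2] / b[1], reverse=True)
    for name, cost, tok_s in ranked:
        print(f"{name:<32} {tok_s_per_dollar(cost, tok_s):.3f} tok/s per dollar")
```

Swapping in current street prices shows how quickly the ranking shifts among the all-GPU builds, while the offloaded build stays at the bottom by a wide margin.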
Bottom line: when CPU offload is worth it
CPU offload is the right answer in exactly three cases:
- The model is run rarely and a few tokens per second is acceptable. Background indexing, overnight summarization, batch tasks.
- A more aggressive quantization is unavailable. Some niche models ship only in q5 or q8, and the q4 you would want does not exist yet.
- The CPU is already paid for, and the alternative is buying a second GPU. A $190 Ryzen 7 5800X you already own is cheaper than a $500 GPU upgrade.
CPU offload is the wrong answer when you can drop a quant tier and stay fully GPU-resident. q4_K_M to q4_K_S costs you a few percent of perplexity and gives you the entire VRAM budget back.
Common pitfalls
- Treating --n-gpu-layers as a free dial. Each layer offloaded has a non-linear cost.
- Buying a 16-core CPU for offload. Past eight cores the memory channels become the bottleneck.
- Pairing the 5800X with single-channel DDR4. Two sticks, not one — single-channel halves memory bandwidth.
- Ignoring power and cooling. CPU-heavy inference draws the full TDP under load, and the 5800X ships without a cooler; an undersized one lets it thermally throttle within minutes.
- Overlooking RAM speed. Dual-channel DDR4-2400 delivers 25% less theoretical bandwidth than DDR4-3200 (38.4 versus 51.2 GB/s).
When NOT to upgrade the CPU
If your inference workload sits entirely on the GPU — your model is 8B or 12B q4 and stays VRAM-resident — a bigger CPU is wasted. Save the $30-$60 between the 5700X and the 5800X and put it toward a future GPU upgrade.
Bottom line
The Ryzen 7 5800X is a solid all-purpose AM4 CPU for a 12GB RTX 3060 inference rig. But its real job is to keep the rest of the system fast — not to bail out the GPU during spillover. CPU offload is a stopgap, not an architecture. The cleanest upgrade path off a 12GB ceiling is a 16-24GB GPU; until you take it, the right discipline is to keep models GPU-resident and pick the quant that lets you do it.
Citations and sources
- AMD — Ryzen 7 5800X product page
- llama.cpp — single-user inference benchmark threads
- TechPowerUp — GeForce RTX 3060 specs and bandwidth
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
Why memory bandwidth, not core count, is the binding constraint
Single-batch local LLM inference is a memory-bandwidth-bound workload. The CPU streams weights from DDR4 (or DDR5) on every forward pass, and the cores spend most of their time waiting on memory. The implication: a 16-core CPU does not run a CPU-offloaded model meaningfully faster than an 8-core CPU on the same memory controller and the same memory speed. Per the public llama.cpp benchmark threads, the measured gap between a 12-core 5900X and an 8-core 5800X on a CPU-only 14B q4 model is under 10% in single-batch generation — and that's despite the 50% jump in core count.
The right lever is memory speed and channel count. DDR4-3200 dual-channel on AM4 delivers ~51 GB/s theoretical bandwidth; DDR4-3600 (overclocked) delivers ~58 GB/s; DDR5-6000 on AM5 delivers ~96 GB/s. The platform jump from DDR4 to DDR5 is the meaningful CPU-offload upgrade, not the core-count jump within the same platform.
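The bandwidth figures above follow from one multiplication: transfer rate times bus width times channel count. The sketch below reproduces them, assuming the standard 64-bit (8-byte) data path per memory channel; the function name is illustrative. These are theoretical peaks, and sustained bandwidth is lower.

```python
# Theoretical peak memory bandwidth = transfers/s * bytes per transfer * channels.
# Assumes a 64-bit (8-byte) data path per channel, as on AM4 and AM5 desktops.

def theoretical_bandwidth_gbs(mt_per_s: int, channels: int = 2,
                              bus_bytes: int = 8) -> float:
    """Peak bandwidth in GB/s for a given transfer rate in MT/s."""
    return mt_per_s * bus_bytes * channels / 1000.0


if __name__ == "__main__":
    configs = [
        ("DDR4-2400 dual-channel", 2400, 2),
        ("DDR4-3200 single-channel", 3200, 1),
        ("DDR4-3200 dual-channel", 3200, 2),
        ("DDR4-3600 dual-channel", 3600, 2),
        ("DDR5-6000 dual-channel", 6000, 2),
    ]
    for label, rate, channels in configs:
        print(f"{label:<26} {theoretical_bandwidth_gbs(rate, channels):5.1f} GB/s")
```

The same formula shows why a single DIMM is so costly: DDR4-3200 in single-channel mode peaks at 25.6 GB/s, half the dual-channel figure.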
This is why community builders running serious CPU-offloaded inference often skip the 5800X entirely and either stay on a cheaper 5700X (saving money for the next GPU upgrade) or move to a DDR5 platform (where the memory bandwidth materially helps).
The KV-cache math
KV cache size at 14B q4, by context length:
| Context | KV cache (fp16) | KV cache (q8) | KV cache (q4) |
|---|---|---|---|
| 2k | 0.75 GB | 0.4 GB | 0.2 GB |
| 4k | 1.5 GB | 0.8 GB | 0.4 GB |
| 8k | 3.0 GB | 1.6 GB | 0.8 GB |
| 16k | 6.0 GB | 3.2 GB | 1.6 GB |
Quantizing the KV cache to q8 or q4 cuts its VRAM footprint to roughly a half or a quarter, at a modest quality cost. For a builder fighting the 12GB ceiling on a 3060, enabling KV-cache quantization in llama.cpp (--cache-type-k q8_0 --cache-type-v q8_0) is often the difference between fitting and spilling.
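The table above follows a simple scaling rule: KV-cache size grows linearly with context length and with the bits stored per cache element. The sketch below starts from the table's own baseline (1.5 GB at 4k with an fp16 cache) and scales from there. It assumes about 8.5 bits per element for q8_0 and 4.5 for q4_0, which accounts for the per-block scale those formats store; results agree with the table to within rounding.

```python
# Scale the KV-cache size from one known data point.
# Baseline is the table's figure for a 14B model: 1.5 GB at 4k context, fp16 cache.
# Bits per element for q8_0 and q4_0 include the per-block scale (approximate).

BASELINE_GB = 1.5
BASELINE_CTX = 4096
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gb(n_ctx: int, cache_type: str = "f16") -> float:
    """Estimated KV-cache size in GB, linear in context and element width."""
    context_scale = n_ctx / BASELINE_CTX
    width_scale = BITS_PER_ELEMENT[cache_type] / BITS_PER_ELEMENT["f16"]
    return BASELINE_GB * context_scale * width_scale


if __name__ == "__main__":
    for n_ctx in (2048, 4096, 8192, 16384):
        row = ", ".join(f"{t}: {kv_cache_gb(n_ctx, t):.2f} GB" for t in BITS_PER_ELEMENT)
        print(f"{n_ctx:>5} ctx -> {row}")
```

At 8k context, moving from an fp16 cache to q8_0 frees about 1.4 GB, which is enough to turn several "Tight" configurations into ones that fit.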
Common community gotchas on AM4 CPU offload
- Loading the model in q5 by default. Some runners download the highest quant available; select q4_K_M explicitly.
- Running with the BIOS at stock memory speed (typically 2400 or 2666 MT/s). Enable XMP/DOCP to get the rated 3200 or 3600 speed your kit ships with.
- Populating all four DIMM slots on a budget B550 board, which often forces the memory to run below its rated speed. Two sticks at the rated speed beat four sticks at a downclocked speed.
- Forgetting to set thread count. llama.cpp defaults to the physical core count; some workloads run better with two fewer threads, which leaves room for the OS.
- Not pinning the inference process to physical cores. On Windows, use Process Lasso; on Linux, taskset. Both improve CPU-side throughput a few percent.
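The thread-count and pinning advice above can be combined into one launch command on Linux. The sketch below only builds the argument list; it assumes the llama.cpp binary is named llama-cli and is on the PATH, that logical CPUs 0-7 map to eight distinct physical cores (check with lscpu -e, since the numbering varies by system), and that model.gguf is a placeholder path. The helper name and its parameters are hypothetical.

```python
# Build a pinned llama.cpp launch command for Linux.
# Assumptions: binary is `llama-cli`; logical CPUs 0..N-1 are distinct
# physical cores (verify with `lscpu -e`); the model path is a placeholder.
import shlex


def build_command(model_path: str, n_gpu_layers: int, ctx_size: int = 4096,
                  physical_cores: int = 8, spare_threads: int = 0,
                  pin: bool = True) -> list[str]:
    """Return argv for llama-cli, optionally wrapped in taskset."""
    threads = max(1, physical_cores - spare_threads)
    cmd = [
        "llama-cli",
        "-m", model_path,
        "--n-gpu-layers", str(n_gpu_layers),
        "--ctx-size", str(ctx_size),
        "--threads", str(threads),
    ]
    if pin:
        # taskset -c takes a CPU list; here the first `physical_cores` CPUs.
        cmd = ["taskset", "-c", f"0-{physical_cores - 1}"] + cmd
    return cmd


if __name__ == "__main__":
    argv = build_command("model.gguf", n_gpu_layers=36, spare_threads=2)
    print(shlex.join(argv))
```

Passing the list to subprocess.run would start the process; printing it first makes it easy to confirm the core range and thread count before committing to a long run.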
