CPU Offload for Local LLMs: Does a Ryzen 7 5800X Help?

When a 14B model spills off 12GB of VRAM, how much do the eight cores of a 5800X actually save you?

Spilling layers onto a Ryzen 7 5800X cuts a 14B model from 38 tok/s to about 4. Here is when CPU offload is worth it and when to just quantize harder.

Yes, a Ryzen 7 5800X helps when a local LLM spills layers off a 12GB RTX 3060, but only in the sense that the model still runs. In the measurements cited below, offloading just four of a 14B q4 model's 40 layers cuts generation from ~38 tokens per second to about 16, and a fully CPU-resident copy lands around 4 tok/s on this CPU. That is not a gentle slope; that is a cliff.

When VRAM runs out

Every local builder runs into this. The model card said "12B q4" and the model loaded fine, but the moment you stretched the context window to 8k or asked for a slightly larger sibling (a 14B, a 27B, a Mixtral 8x7B), the loader started parking layers in system RAM and your tok/s collapsed. This is the moment CPU offload becomes a real engineering decision rather than a flag in a config file.

This piece is for the builder who is already running an RTX 3060 12GB on AM4 and is weighing how much CPU to throw at the spillover problem. The cited measurements come from public benchmark threads on the llama.cpp GitHub; hardware specs come from AMD, NVIDIA, and TechPowerUp.

Key Takeaways

  • Even a few offloaded layers hurt: moving four of 40 layers to the CPU drops single-user tok/s by roughly 58% (38 to 16) on a 14B model.
  • Memory bandwidth, not core count, is the CPU-side bottleneck. The 5800X and 5700X land within 5-8% of each other.
  • A bigger GPU beats more CPU on this workload. RAM upgrades are triage.
  • The 12GB ceiling for fully-resident inference is roughly a 14B q4 model at 4k context.
  • The right knob for sub-VRAM models is --n-gpu-layers; for over-VRAM models, the right knob is "use a smaller quant."

What does --n-gpu-layers actually do when you offload to CPU?

llama.cpp builds a layer graph and lets you specify how many of those layers live on the GPU. Set it to the maximum and every layer runs on CUDA; set it to zero and every layer runs on CPU; set it to anything in between and the model is split, with intermediate activations crossing the PCIe bus on every forward pass.

The catch: activations are small but the weights dominate. When a layer runs on CPU, every token streams that layer's weights from system RAM into the CPU cache. With DDR4-3200 dual-channel pushing roughly 51GB/s theoretical bandwidth, the CPU starves on every layer. The RTX 3060's GDDR6 pushes 360GB/s in comparison, per the TechPowerUp specs — a ~7x gap.
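The bandwidth argument can be turned into a rough ceiling on generation speed. The sketch below assumes every generated token streams the full set of weights once and that nothing else limits throughput. The 8.5 GB weight size is an assumed approximate figure for a 14B q4_K_M file, not a number from the cited threads, and `bandwidth_bound_tok_s` is a hypothetical helper name.

```python
def bandwidth_bound_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if each token reads all weights exactly once."""
    return bandwidth_gb_s / weights_gb


# Assumption: approximate on-disk size of a 14B q4_K_M model, in GB.
WEIGHTS_GB = 8.5

# Bandwidth figures quoted in the text above.
DDR4_3200_DUAL_GB_S = 51.0
RTX_3060_GB_S = 360.0

cpu_ceiling = bandwidth_bound_tok_s(DDR4_3200_DUAL_GB_S, WEIGHTS_GB)
gpu_ceiling = bandwidth_bound_tok_s(RTX_3060_GB_S, WEIGHTS_GB)

print(f"CPU ceiling: {cpu_ceiling:.1f} tok/s")
print(f"GPU ceiling: {gpu_ceiling:.1f} tok/s")
print(f"Bandwidth gap: {RTX_3060_GB_S / DDR4_3200_DUAL_GB_S:.1f}x")
```

Under these assumptions the ceilings come out near 6 tok/s on the CPU and 42 tok/s on the GPU, which is in the same range as the ~4 and ~38 tok/s figures quoted in this article. Real throughput sits below the ceiling because theoretical bandwidth is never fully achieved.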

How much does a Ryzen 7 5800X slow down generation versus pure GPU on an RTX 3060 12GB?

Measured against public llama.cpp threads for a 14B q4_K_M model at 4k context:

Layers on CPU (of 40) | tok/s on 5800X + 3060 | Slowdown vs pure GPU
0 (all on GPU) | 38 | 1.0x
4 | 16 | 2.4x slower
8 | 9 | 4.2x slower
16 | 5 | 7.6x slower
40 (all on CPU) | 4 | 9.5x slower

The headline: the first few offloaded layers are not "a little slower." They are 2-4x slower. The model becomes noticeably laggy at four layers and uncomfortable at eight.
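The slowdown column is simply the all-GPU rate divided by the split rate. A short sketch that recomputes it from the tok/s figures quoted in the table (the variable names are illustrative):

```python
# (layers on CPU, tok/s) pairs as quoted in the table above.
measurements = [(0, 38), (4, 16), (8, 9), (16, 5), (40, 4)]

# The all-GPU row is the baseline every other row is compared against.
baseline_tok_s = measurements[0][1]

slowdowns = {
    layers: round(baseline_tok_s / tok_s, 1)
    for layers, tok_s in measurements
}

for layers, factor in slowdowns.items():
    print(f"{layers:>2} layers on CPU: {factor}x")
```

Dividing 1000 by the tok/s value gives milliseconds per token, which is often the more intuitive number: 38 tok/s is about 26 ms per token, while 4 tok/s is 250 ms.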

The lesson: if you can quantize one tier harder to keep the model fully GPU-resident, do it. q4_K_M to q4_K_S, or q5 to q4, almost always wins versus splitting the model.

Spec-delta table

Spec | Ryzen 7 5800X | Ryzen 7 5700X | Ryzen 5 5600G
Cores / threads | 8 / 16 | 8 / 16 | 6 / 12
Base / boost clock | 3.8 / 4.7 GHz | 3.4 / 4.6 GHz | 3.9 / 4.4 GHz
L3 cache | 32 MB | 32 MB | 16 MB
TDP | 105 W | 65 W | 65 W
Memory channels | 2 | 2 | 2
Theoretical RAM BW (DDR4-3200) | ~51 GB/s | ~51 GB/s | ~51 GB/s
Integrated graphics | None | None | Radeon Vega 7
Typical street price | ~$190 | ~$160 | ~$130
Best for | High-load inference + gaming | Efficient inference build | No-dGPU server

Per AMD's product page, all three target Socket AM4 and ship with the same Zen 3 microarchitecture, so the per-clock IPC is identical. The Ryzen 7 5700X is the price-conscious pick; the Ryzen 5 5600G is the budget no-dGPU option.

Quantization matrix

Practical fit on a 12GB RTX 3060 at 4k context (single-user):

Model size | q2_K | q3_K_M | q4_K_M | q5_K_M | q6_K | q8_0 | fp16
7B | Fits | Fits | Fits | Fits | Fits | Fits | Spills
8B | Fits | Fits | Fits | Fits | Fits | Fits | Spills
13B-14B | Fits | Fits | Fits | Tight | Spills | Spills | Spills
27B-32B | Fits | Tight | Spills | Spills | Spills | Spills | Spills
70B | Spills | Spills | Spills | Spills | Spills | Spills | Spills

"Tight" means it fits at 4k context but spills at 8k. "Spills" means at least one layer is parked in CPU RAM with the corresponding throughput hit. For a 14B q5 at 8k context, the spill is only a handful of layers, which is still painful (see the 4-layer row of the slowdown table above).

Prefill versus generation: where the CPU hurts most

llama.cpp separates prefill (the initial pass that ingests the prompt) from generation (the per-token loop). Prefill is compute-bound and parallelizable; generation is memory-bound and serial. The CPU loses on both, but the gap is larger on generation because each token must touch all weights again.

Concretely, on a 14B q4 at all-CPU, prefill is about 1.5-2x slower than the GPU equivalent, and generation is the 9-10x slower number quoted above. If you have a workflow that asks one big question and reads a small answer, the CPU penalty is felt mostly at the front. If you ask short questions and read long answers, the CPU penalty is the wall.

Context-length impact

A longer context increases KV-cache size, which competes with weights for VRAM. On a 14B q4 model, the KV cache at 4k context takes roughly 1.5GB; at 8k it doubles. That pushes the spillover point earlier — what fits at 4k may not at 8k.

The pragmatic move: cap context at the smallest length your task needs. Coding agents rarely need more than 8k; chat-with-doc workflows can spike higher. KV-cache quantization (q8 or q4 for the cache) helps as well, at a small quality cost.
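The weights-plus-KV-cache budget can be sketched as a quick fit check. This is a rough estimate under stated assumptions: the bits-per-weight values are approximate averages for each quant type, the 0.5 GB overhead for compute buffers is a guess, and the helper names are hypothetical. The KV-cache sizes are the 4k and 8k figures quoted above.

```python
VRAM_GB = 12.0  # RTX 3060 12GB


def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB: parameters x bits / 8 bits per byte."""
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions: float, bits_per_weight: float,
                 kv_cache_gb: float, overhead_gb: float = 0.5,
                 vram_gb: float = VRAM_GB) -> bool:
    """True if weights + KV cache + assumed overhead fit in VRAM."""
    total = weights_gb(params_billions, bits_per_weight) + kv_cache_gb + overhead_gb
    return total <= vram_gb


# Assumptions: approximate average bits per weight for each quant type.
BITS_PER_WEIGHT = {"q4_K_M": 4.85, "q5_K_M": 5.7, "q6_K": 6.6}

# KV-cache sizes for a 14B model as quoted in the text (fp16 cache).
KV_CACHE_GB = {4096: 1.5, 8192: 3.0}

for quant, bpw in BITS_PER_WEIGHT.items():
    for ctx, kv_gb in KV_CACHE_GB.items():
        verdict = "fits" if fits_in_vram(14, bpw, kv_gb) else "spills"
        print(f"14B {quant} at {ctx} context: {verdict}")
```

With these inputs a 14B q5_K_M fits at 4k and spills at 8k, while q6_K spills at both. The margins are small, so a different overhead assumption can flip a borderline case; treat the result as a first-pass filter before actually loading the model.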

Perf-per-dollar and perf-per-watt

Build | Approx cost | Tok/s at 14B q4 | Tok/s/dollar | Tok/s/watt
3060 + 5800X, all on GPU | $450 | 38 | 0.084 | 0.18
3060 + 5800X, half on CPU | $450 | 7 | 0.016 | 0.04
3060 + 5700X, all on GPU | $420 | 38 | 0.090 | 0.21
3060 + 5600G, all on GPU | $390 | 36 | 0.092 | 0.22
4060 Ti 16GB + 5700X | $620 | 42 (14B q5) | 0.068 | 0.16
4070 12GB + 5700X | $700 | 60 | 0.086 | 0.17

The lesson is repeated: any layer on CPU collapses tok/s/dollar. The best move when you outgrow 12GB is a 16GB GPU, not a meatier CPU.
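The tok/s/dollar column is plain division, so it is easy to rerun with your own local prices. A minimal sketch using the cost and tok/s figures from the table above (the prices are the article's approximations, not live quotes):

```python
# (build, approximate cost in USD, tok/s) as listed in the table above.
builds = [
    ("3060 + 5800X, all on GPU", 450, 38),
    ("3060 + 5800X, half on CPU", 450, 7),
    ("3060 + 5700X, all on GPU", 420, 38),
    ("3060 + 5600G, all on GPU", 390, 36),
    ("4060 Ti 16GB + 5700X", 620, 42),
    ("4070 12GB + 5700X", 700, 60),
]

tok_s_per_dollar = {
    name: round(tok_s / cost, 3) for name, cost, tok_s in builds
}

# Print builds from best to worst value.
for name, value in sorted(tok_s_per_dollar.items(), key=lambda kv: -kv[1]):
    print(f"{name:<28} {value:.3f} tok/s per dollar")
```

Swapping in current street prices changes the ranking among the all-GPU builds, but the half-on-CPU row stays far behind because its numerator collapses.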

Bottom line: when CPU offload is worth it

CPU offload is the right answer in exactly three cases:

  • The model is run rarely and a few tokens per second is acceptable. Background indexing, overnight summarization, batch tasks.
  • A smaller quantization is unavailable. Some niche models ship only in q5 or q8, and the q4 you would want does not exist yet.
  • The CPU is already paid for, and the alternative is buying a second GPU. A $190 Ryzen 7 5800X you already own is cheaper than a $500 GPU upgrade.

CPU offload is the wrong answer when you can drop a quant tier and stay fully GPU-resident. q4_K_M to q4_K_S costs you a few percent of perplexity and, when the model is only slightly over budget, is enough to keep every layer in VRAM.

Common pitfalls

  • Treating --n-gpu-layers as a free dial. Each layer offloaded has a non-linear cost.
  • Buying a 16-core CPU for offload. Past eight cores the memory channels become the bottleneck.
  • Pairing the 5800X with single-channel DDR4. Two sticks, not one — single-channel halves memory bandwidth.
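  • Ignoring power and cooling. CPU-heavy inference draws the full TDP under sustained load; the 5800X ships without a bundled cooler, and an undersized one lets it thermally throttle within minutes.
  • Overlooking RAM speed. Dual-channel DDR4-2400 tops out around 38 GB/s versus ~51 GB/s for DDR4-3200, so running at the slower speed gives up about a quarter of the theoretical bandwidth on the AM4 platform.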

When NOT to upgrade the CPU

If your inference workload sits entirely on the GPU — your model is 8B or 12B q4 and stays VRAM-resident — a bigger CPU is wasted. Save the $30-$60 between the 5700X and the 5800X and put it toward a future GPU upgrade.

Bottom line

The Ryzen 7 5800X is a solid all-purpose AM4 CPU for a 12GB RTX 3060 inference rig. But its real job is to keep the rest of the system fast — not to bail out the GPU during spillover. CPU offload is a stopgap, not an architecture. The cleanest upgrade path off a 12GB ceiling is a 16-24GB GPU; until you take it, the right discipline is to keep models GPU-resident and pick the quant that lets you do it.

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

Why memory bandwidth, not core count, is the binding constraint

Single-batch local LLM inference is a memory-bandwidth-bound workload. The CPU streams weights from DDR4 (or DDR5) on every forward pass, and the cores spend most of their time waiting on memory. The implication: a 16-core CPU does not run a CPU-offloaded model meaningfully faster than an 8-core CPU on the same memory controller and the same memory speed. Per the public llama.cpp benchmark threads, the measured gap between a 12-core 5900X and an 8-core 5800X on a CPU-only 14B q4 model is under 10% in single-batch generation — and that's despite the 50% jump in core count.

The right lever is memory speed and channel count. DDR4-3200 dual-channel on AM4 delivers ~51 GB/s theoretical bandwidth; DDR4-3600 (overclocked) delivers ~58 GB/s; DDR5-6000 on AM5 delivers ~96 GB/s. The platform jump from DDR4 to DDR5 is the meaningful CPU-offload upgrade, not the core-count jump within the same platform.
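The bandwidth figures above follow from one formula: transfers per second, times 8 bytes per 64-bit channel, times the number of channels. A small sketch (the function name is illustrative) that reproduces them and adds the single-channel and unconfigured-XMP cases:

```python
def theoretical_bandwidth_gb_s(mt_per_s: int, channels: int = 2,
                               bytes_per_transfer: int = 8) -> float:
    """Peak DRAM bandwidth in GB/s for 64-bit (8-byte) memory channels."""
    return mt_per_s * bytes_per_transfer * channels / 1000


configs = {
    "DDR4-2400 dual-channel": (2400, 2),
    "DDR4-3200 single-channel": (3200, 1),
    "DDR4-3200 dual-channel": (3200, 2),
    "DDR4-3600 dual-channel": (3600, 2),
    "DDR5-6000 dual-channel": (6000, 2),
}

for name, (speed, channels) in configs.items():
    bandwidth = theoretical_bandwidth_gb_s(speed, channels)
    print(f"{name:<26} {bandwidth:5.1f} GB/s")
```

These are theoretical peaks. Sustained bandwidth in practice is lower, but the ratios between configurations are what matter when comparing platforms.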

This is why community builders running serious CPU-offloaded inference often skip the 5800X entirely and either stay on a cheaper 5700X (saving money for the next GPU upgrade) or move to a DDR5 platform (where the memory bandwidth materially helps).

The KV-cache math

KV cache size at 14B q4, by context length:

Context | KV cache (fp16) | KV cache (q8) | KV cache (q4)
2k | 0.75 GB | 0.4 GB | 0.2 GB
4k | 1.5 GB | 0.8 GB | 0.4 GB
8k | 3.0 GB | 1.6 GB | 0.8 GB
16k | 6.0 GB | 3.2 GB | 1.6 GB

Quantizing the KV cache to q8 or q4 frees significant VRAM at a modest quality cost. For a builder fighting the 12GB ceiling on an RTX 3060, enabling KV-cache quantization in llama.cpp (--cache-type-k q8_0 --cache-type-v q8_0) is often the difference between fitting and spilling.
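KV-cache size scales linearly with context length and with the bits stored per cache element. The sketch below starts from the 1.5 GB fp16 figure at 4k quoted above and scales from there. It assumes the q8 and q4 columns correspond to the q8_0 and q4_0 block formats, which store 8.5 and 4.5 bits per element once the per-block scale is counted. The function name is hypothetical.

```python
# Baseline from the table above: fp16 KV cache of a 14B model at 4k context.
BASELINE_GB = 1.5
BASELINE_CTX = 4096
FP16_BITS = 16.0

# Assumption: the cache types in use are q8_0 and q4_0.
BITS_PER_ELEMENT = {"fp16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gb(ctx_tokens: int, bits_per_element: float = FP16_BITS) -> float:
    """Scale the baseline linearly in context length and element width."""
    return BASELINE_GB * (ctx_tokens / BASELINE_CTX) * (bits_per_element / FP16_BITS)


for ctx in (2048, 4096, 8192, 16384):
    row = ", ".join(
        f"{name}: {kv_cache_gb(ctx, bits):.2f} GB"
        for name, bits in BITS_PER_ELEMENT.items()
    )
    print(f"{ctx:>5} tokens -> {row}")
```

The output tracks the table to within rounding in most cells. The absolute baseline depends on the model's layer count and KV-head layout, so for a different model, replace `BASELINE_GB` with a figure measured for that model.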

Common community gotchas on AM4 CPU offload

  • Loading the model in q5 by default. The runner downloaded the highest quant available; switch to q4_K_M explicitly.
  • Running with the BIOS at stock memory speed (typically 2400 or 2666 MT/s). Enable XMP/DOCP to get the rated 3200 or 3600 speed your kit ships with.
  • Populating all four DIMM slots on a budget B550 board, which often forces the memory to run below its rated speed. Two sticks at the rated speed beat four sticks at a downclocked speed.
  • Forgetting to set thread count. llama.cpp defaults to physical core count; some workloads prefer two fewer threads than that to leave room for the OS.
  • Not pinning the inference process to physical cores. On Windows, Process Lasso; on Linux, taskset — both improve CPU-side throughput a few percent.
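Several of these gotchas come down to passing explicit flags instead of trusting defaults. The sketch below only builds a Linux command line and does not run anything. It assumes a llama.cpp build that provides the `llama-cli` binary, that logical CPUs 0-7 map to the eight physical cores (check with `lscpu` on your system), and a hypothetical model path. Depending on the llama.cpp version, a quantized V cache may also require flash attention to be enabled.

```python
import shlex


def build_llama_command(model_path: str, n_gpu_layers: int,
                        ctx_size: int = 4096, threads: int = 6,
                        cache_type: str = "q8_0",
                        cpu_list: str = "0-7") -> list[str]:
    """Assemble a taskset-pinned llama.cpp invocation as an argv list."""
    return [
        "taskset", "-c", cpu_list,            # pin to the assumed physical cores
        "llama-cli",
        "-m", model_path,
        "--n-gpu-layers", str(n_gpu_layers),  # layers kept on the GPU
        "--ctx-size", str(ctx_size),          # cap context to what the task needs
        "--threads", str(threads),            # 8 physical cores minus 2 for the OS
        "--cache-type-k", cache_type,         # quantized K cache
        "--cache-type-v", cache_type,         # quantized V cache
    ]


# Hypothetical example: a 40-layer model with 36 layers on the GPU.
cmd = build_llama_command("model.gguf", n_gpu_layers=36)
print(shlex.join(cmd))
```

Building the argument list in one place makes it easy to log exactly which layer split, context size, and thread count produced a given tok/s reading. On Windows, drop the `taskset` prefix and set affinity with a tool such as Process Lasso instead.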

Frequently asked questions

Why does a model run so slowly once it spills to CPU?
CPU memory bandwidth is the bottleneck. A Ryzen 7 5800X paired with DDR4-3200 dual-channel has roughly 51GB/s of theoretical memory bandwidth, about one-seventh of the RTX 3060's 360GB/s VRAM bandwidth. Every layer that runs on CPU has to stream its weights from RAM on every token, so tok/s drops several-fold once more than a handful of layers are offloaded and approaches an order of magnitude when the whole model runs on CPU. The split is brutal even at four offloaded layers.
Is a Ryzen 7 5800X better than a 5700X for offloaded inference?
Marginally. The 5800X clocks slightly higher and has more thermal headroom under sustained load, but both ship eight cores and sixteen threads on the same memory controller. Per public llama.cpp threads, the measured CPU-side speedup between the two is on the order of 5-8% in single-batch generation. Memory channels matter more than core count past eight cores, so the upgrade case for the 5800X is gaming, not inference.
Should I buy a Ryzen 5 5600G instead to save money?
Only if you also drop the discrete GPU plan. The 5600G is a six-core APU with integrated graphics and a PCIe layout that limits the GPU slot to Gen3 rather than Gen4. For pure CPU-offload inference the six cores cost about 15% in generation throughput versus the 5800X's eight cores, and the IGP buys you exactly nothing on a serious local model. The 5600G is the right pick for a server-style box without a dGPU, not for a 3060 build.
Is more system RAM or a bigger GPU the better upgrade?
A bigger GPU, every time, if you can swing it. Adding RAM lets you load a model that previously OOM'd, but generation still runs at single-digit tok/s for the CPU portion. A GPU with 16GB or 24GB of VRAM lets you keep the entire model on the GPU side of the boundary, where tok/s is roughly 10x higher than an all-CPU run (38 versus 4 tok/s for a 14B q4 in the figures above). RAM is a triage upgrade; VRAM is a tier upgrade.
What size model can a 12GB RTX 3060 run without any CPU offload?
Practical ceilings at single-user inference: 8B at q4 fits comfortably with 4-5GB of headroom for KV cache and context. 14B at q4 fits at short context (around 4k tokens) but pushes 11.5GB at 8k context, leaving zero headroom. 32B at q4 spills to CPU on this card — no way to keep all layers on GPU. The 14B band is the practical ceiling of a single 12GB consumer GPU.

— SpecPicks Editorial · Last verified 2026-06-16
