For local LLM inference, 12 GB of GPU VRAM beats 48 GB of DDR5 system RAM almost every time. VRAM bandwidth on an RTX 3060 (360 GB/s) is roughly 3.7× that of dual-channel DDR5-6000 (~96 GB/s), and any layer that has to live in system RAM runs on the CPU at that lower bandwidth. Buy a card that fits your model in VRAM; only spend on more system RAM after you've maxed out GPU VRAM headroom.
Why the question even comes up
It is tempting to look at a $90 kit of 48 GB DDR5 and a $300 12 GB RTX 3060 and assume the obvious win is to pile on system RAM. After all, 48 GB > 12 GB, and a 30B-class model needs more than 12 GB of weight storage anyway. Why pay for a GPU upgrade when you can pay less than a third as much for system memory?
The answer is bandwidth, and it is not close. As of 2026, on a typical mid-range desktop with an AM4 platform and the AMD Ryzen 7 5800X, DDR5 is not even on the table: you're on DDR4-3200. On AM5 with DDR5-6000 in dual channel, your system memory bandwidth tops out around 96 GB/s. The 12 GB of GDDR6 on an MSI RTX 3060 Ventus 2X 12G sits at 360 GB/s, per the TechPowerUp GPU database. For a memory-bandwidth-bound workload, which local LLM token generation almost entirely is, that's the whole story.
This piece walks through where the lines actually fall: how llama.cpp splits work between CPU and GPU, what offloading to RAM costs you in tok/s, when more system RAM still helps, and the concrete numbers for a 3060 + 32 GB DDR5 build versus a hypothetical 3060 + 96 GB upgrade.
Key takeaways
- Local LLM inference is memory-bandwidth-bound, not compute-bound. The number that matters is GB/s of the device holding the active weights.
- GDDR6 on a 12 GB RTX 3060 is ~3.7× the bandwidth of DDR5-6000 dual channel, and ~7× the bandwidth of DDR4-3200. Any layer that lives in system RAM runs at a small fraction of its in-VRAM speed.
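- Layers that don't fit in VRAM stay in system RAM and run on the CPU. During generation llama.cpp does not stream their weights across PCIe on every token; only small activation tensors cross the bus (PCIe 4.0 x16 caps at ~32 GB/s), so the CPU's DRAM bandwidth is what throttles total tok/s.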
- Keeping the model fully in VRAM is roughly 3–6× faster in tok/s than a heavy partial offload. On a 3060 with a 13B Q4_K model fully in VRAM you'll see ~28–35 tok/s; the same model with a few layers in system RAM drops to ~18–24 tok/s; a 30B-class model with half its layers in RAM falls to ~3–6 tok/s.
- Put your upgrade money toward more VRAM before more system RAM if your goal is faster local inference. A used 16 GB card (e.g. RTX 4060 Ti 16GB) or even a second 12 GB 3060 (with the model split across both cards) does more for tok/s than an extra 48 GB of DDR5.
What does "memory-bandwidth bound" actually mean?
When llama.cpp generates a single token, it has to multiply the entire active layer set against the residual stream activations. For a 13B Q4_K_M model, that's roughly 8 GB of weights read once per token. On a 360 GB/s memory bus that's a hard physical floor of ~22 ms per token, or a ceiling of ~45 tok/s, before any compute, sampler, or KV-cache overhead.
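The bandwidth bound is just weight bytes read per token divided by memory bandwidth. A minimal sketch, assuming the ~8 GB weight size and 360 GB/s figure used in this section; the function name is illustrative, and real throughput lands below this ceiling once compute, sampling, and KV-cache reads are added.

```python
def decode_ceiling(weights_gb: float, bandwidth_gb_s: float) -> tuple[float, float]:
    """Return (ms per token, tok/s) if every weight byte is read once per token."""
    seconds_per_token = weights_gb / bandwidth_gb_s
    return seconds_per_token * 1000.0, 1.0 / seconds_per_token


ms, tps = decode_ceiling(weights_gb=8.0, bandwidth_gb_s=360.0)
print(f"{ms:.1f} ms/token, ceiling {tps:.1f} tok/s")  # 22.2 ms/token, ceiling 45.0 tok/s
```

Swapping in 51 GB/s for dual-channel DDR4-3200 shows why pure-CPU decode of the same model cannot exceed single-digit tok/s.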
The compute matters too, but not nearly as much as people think. Even a slow Ampere card has tens of TFLOPS of FP16 throughput, while a single forward pass needs only tens of GFLOPs of arithmetic per token (about two FLOPs per weight, so arithmetic intensity is low). The chip is mostly idle waiting for memory.
The corollary that matters here: if half your weights live in DDR5 at 96 GB/s and half live in GDDR6 at 360 GB/s, per-token time is the sum of the time spent in each tier, so the slow tier dominates. The GPU layers finish quickly and then wait for the CPU layers. So adding more system RAM to "fit a bigger model" does not make that model fast: every layer that lands in system RAM is read at DDR5 speed, by the CPU, on every token.
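Because the tiers are processed one after another, per-token time adds up across them. The sketch below assumes an 8 GB model and the bandwidth figures quoted in this article; `split_decode_ceiling` is a hypothetical helper, and it ignores compute and synchronization overhead.

```python
def split_decode_ceiling(tiers: list[tuple[float, float]]) -> float:
    """tiers: (weights_gb, bandwidth_gb_s) per memory tier, processed in sequence.

    Returns the tok/s ceiling when each tier's weights are read once per token.
    """
    seconds_per_token = sum(gb / bw for gb, bw in tiers)
    return 1.0 / seconds_per_token


all_vram = split_decode_ceiling([(8.0, 360.0)])
half_split = split_decode_ceiling([(4.0, 360.0), (4.0, 96.0)])
print(f"all in VRAM: {all_vram:.1f} tok/s, half in DDR5: {half_split:.1f} tok/s")
```

Under these assumptions the half-and-half split has a ceiling of about 19 tok/s against 45 tok/s fully in VRAM, and the DDR5 half accounts for roughly 79% of the per-token time.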
What happens when llama.cpp can't fit the whole model in VRAM?
llama.cpp's -ngl N flag (--n-gpu-layers) splits the model: N transformer layers go to GPU VRAM, and the remainder run on the CPU using system RAM. There is no PCIe-based weight streaming for layers you mark CPU-resident; the CPU runs them in place against system DRAM.
For a 30B model with 60 layers at Q4_K_M (~17 GB total weights) on a 3060 12GB system:
- -ngl 60: all on GPU. Won't fit (17 GB > 12 GB). Error or crash.
- -ngl 40: 40 layers on GPU (~11 GB VRAM), 20 layers on CPU (~5.7 GB RAM). Each token does 20 layers of CPU math at ~50–96 GB/s.
- -ngl 0: pure CPU. Each token runs all 60 layers against DDR.
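A rough way to pick a starting value for -ngl is to divide usable VRAM by the average per-layer size. This is a sketch, not llama.cpp's own logic: it assumes all layers are the same size, and `reserve_gb` is a hypothetical allowance for KV cache and compute buffers that you would tune for your context length.

```python
def max_gpu_layers(model_gb: float, n_layers: int, vram_gb: float,
                   reserve_gb: float = 1.0) -> int:
    """Estimate how many equally sized layers fit in VRAM after a reserve."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# 30B-class example from the text: ~17 GB of weights, 60 layers, 12 GB card.
print(max_gpu_layers(17.0, 60, 12.0))                  # 38 with a 1 GB reserve
print(max_gpu_layers(17.0, 60, 12.0, reserve_gb=0.5))  # 40 with a 0.5 GB reserve
```

The -ngl 40 case in the list leaves well under 1 GB free, so in practice you may need to step the value down if the context grows.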
Concrete numbers from public llama.cpp testing on a Ryzen 7 5800X + DDR4-3200 + RTX 3060 12GB, Q4_K_M, ~13B model:
| Split | tok/s | Notes |
|---|---|---|
| Fully in VRAM (-ngl 33/33) | 28–35 | Best case |
| 30/33 layers on GPU | 18–24 | One CPU offload tier |
| 24/33 layers on GPU | 11–16 | About a quarter of layers on CPU |
| Pure CPU | 3–6 | DDR4-3200 only |
The same test on an AM5 system with DDR5-6000 shows pure-CPU numbers around 6–10 tok/s. That is better, because DDR5 is faster than DDR4, but still 3–5× slower than fully-in-VRAM on the same 3060.
This is the central point: DDR5 is faster than DDR4, but it is not in the same league as GDDR6. Spending money on more system RAM lets you fit a bigger model, but the inference speed of that bigger model on system RAM is roughly the same per-token regardless of whether you have 32 GB or 96 GB of it.
Spec-delta: GDDR6 vs DDR5 vs DDR4
| Memory tier | Typical config | Bandwidth | Latency to active core |
|---|---|---|---|
| GDDR6 on RTX 3060 12GB | 192-bit @ 15 Gbps | 360 GB/s | tens of ns (on-board) |
| GDDR6X on RTX 3070 Ti | 256-bit @ 19 Gbps | ~608 GB/s | tens of ns |
| HBM3 on datacenter cards | 5120-bit @ 5.2 GT/s | ~3,000 GB/s | tens of ns |
| DDR5-6000 dual channel | 2× 64-bit @ 6 Gbps | ~96 GB/s | 60–80 ns |
| DDR5-7200 dual channel | 2× 64-bit @ 7.2 Gbps | ~115 GB/s | 60–80 ns |
| DDR4-3200 dual channel | 2× 64-bit @ 3.2 Gbps | ~51 GB/s | 80–100 ns |
| PCIe 4.0 x16 (data path) | x16 @ 16 GT/s | ~32 GB/s | 200+ ns transaction |
The takeaway: even if you upgraded an AM4 5800X build to a modern AM5 platform with DDR5-7200, your system memory bandwidth would still be ~30% of the 3060's VRAM bandwidth. That ratio is what determines token-per-second when layers are split.
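The bandwidth column follows from bus width times per-pin data rate. A small sketch, assuming a dual-channel desktop memory bus is 128 bits wide in total (two 64-bit channels); the labels and function name are illustrative, and these are theoretical peaks rather than measured throughput.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, rate_gbps_per_pin: float) -> float:
    """Theoretical peak bandwidth in GB/s: bits per transfer * rate / 8 bits per byte."""
    return bus_width_bits * rate_gbps_per_pin / 8.0


configs = {
    "GDDR6, RTX 3060 12GB": (192, 15.0),
    "DDR5-6000 dual channel": (128, 6.0),
    "DDR5-7200 dual channel": (128, 7.2),
    "DDR4-3200 dual channel": (128, 3.2),
}
for name, (bits, rate) in configs.items():
    print(f"{name}: {peak_bandwidth_gb_s(bits, rate):.1f} GB/s")
```

Dividing the DDR5-7200 result by the 3060's gives about 0.32, which is the ratio the paragraph above refers to.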
Prefill (prompt processing) vs generation (decode)
Local LLM workloads split into two phases:
- Prefill processes the entire input prompt at once. It is mostly compute-bound. On a 3060 you see hundreds of tok/s during prefill even for long context.
- Generation produces tokens one at a time. It is memory-bandwidth-bound. On the same 3060 you see tens of tok/s on a 13B model.
Both phases run faster when the model fits in VRAM, because both still need to fetch weights from memory. Prefill is less sensitive to the bandwidth gap because it amortizes the fetch across many sequence positions (the same weights get reused across the entire prompt in a single forward pass). Generation can't amortize — every token re-reads every weight.
The implication: long-prompt, short-output workloads (RAG, code review, document summarization) are less penalized by partial offload. Short-prompt, long-output workloads (chat, agent loops, story generation) get hammered by it. Most local-LLM users are doing the latter, which is why "fit in VRAM" is such a load-bearing requirement.
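The prefill/generation difference can be sketched with a simple roofline-style estimate: one pass reads the weights once and does about two FLOPs per weight per token position. The model below is an assumption-heavy illustration: `peak_flops` is a placeholder value rather than a measured 3060 figure, and attention cost and KV-cache traffic are ignored.

```python
def forward_pass(n_tokens: int,
                 params: float = 13e9,        # 13B-class model
                 weight_bytes: float = 8e9,   # ~8 GB at Q4_K_M
                 bandwidth: float = 360e9,    # bytes/s, RTX 3060 VRAM
                 peak_flops: float = 25e12):  # assumed placeholder, not measured
    """Return (seconds, limiting resource) for one pass over n_tokens positions."""
    memory_s = weight_bytes / bandwidth                # weights read once per pass
    compute_s = 2.0 * params * n_tokens / peak_flops   # ~2 FLOPs per weight per token
    limit = "memory" if memory_s >= compute_s else "compute"
    return max(memory_s, compute_s), limit


for n in (1, 512):  # 1 = one generation step, 512 = one prefill chunk
    seconds, limit = forward_pass(n)
    print(f"{n:>3} tokens/pass: {n / seconds:7.0f} tok/s, {limit}-bound")
```

Under these assumptions the single-token pass is memory-bound at about 45 tok/s, while the 512-token pass is compute-bound at several hundred tok/s, which matches the qualitative split described above.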
When does more system RAM actually help?
System RAM upgrades pay off in three specific cases:
- You need to load a model that's larger than VRAM. Even with the speed penalty, running Llama-3-70B at 2–4 tok/s on a 96 GB DDR5 box is preferable to "won't load." Operating system + KV cache + a ~40 GB Q4 quant of 70B fits in 64 GB but is uncomfortable in 32 GB.
- You're running KV cache for long context. A 128k-context Llama-3.1-8B at fp16 KV uses roughly 16 GB of KV cache on top of weights (32 layers × 8 KV heads × 128 dims × 2 bytes, doubled for K and V, per token). You can keep that in system RAM, and llama.cpp handles it gracefully, but you need the RAM for it to spill into.
- You're running other things alongside inference — a browser with 200 tabs, ComfyUI, multiple containers. 32 GB total is uncomfortably tight on a modern desktop running a 12 GB VRAM model plus normal workflows.
Outside of those three cases, more system RAM does not move tok/s. A 3060 12GB with 32 GB DDR4 runs the same 13B Q4 at the same speed as a 3060 12GB with 96 GB DDR5 — what changes is the size of the model you can attempt at all, and what compromises you accept in tok/s when you do.
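KV-cache size follows directly from the model shape, so it is worth computing before buying RAM for long context. The sketch below uses 32 layers, 8 KV heads, and a head dimension of 128 for Llama-3.1-8B; treat those shape values as assumptions to check against the model's published config, and note that the function name is illustrative.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to hold K and V for every layer and every token in context."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * context_tokens


size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      context_tokens=128 * 1024)
print(f"{size / 2**30:.1f} GiB")  # 16.0 GiB at fp16
```

Halving `bytes_per_value` (an 8-bit KV cache) or the context length halves the result, which is often a cheaper fix than more RAM.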
The "two GPUs vs one big GPU + more RAM" question
If you have an existing 12 GB 3060 system and a few hundred dollars to spend, the choice usually comes down to:
- +$90: 48 GB DDR5 kit (assumes AM5 board). Unlocks larger model loading at slow tok/s.
- +$280: Second RTX 3060 12GB. Doubles VRAM to 24 GB; llama.cpp can split a model across two cards in one box (by layer by default, or by row via --split-mode).
- +$450: Used RTX 3090 24GB. Replaces the 3060 with a card that has 936 GB/s bandwidth and 24 GB VRAM — single-card win for both speed and capacity.
- +$650: Used RTX 4090 24GB. Best per-token performance on a consumer card; arguably the canonical local-LLM GPU until a used-5090 market develops.
For pure tok/s the rank is: 4090 > 3090 > dual-3060 > single 3060 + more RAM. The first three are all dramatically ahead of the last. If your only constraint is budget and you must keep the 3060, the second-3060 path is the right next move; the system-RAM path is a distant fallback.
Common pitfalls
- Buying DDR5 to "speed up the GPU." It doesn't. The GPU has its own faster memory. System RAM only matters when work crosses to the CPU side.
- Mixing capacities/speeds. Mixing dual-rank and single-rank sticks, using mismatched kits, or populating all four slots can drop your DDR5 from its rated 6000 MT/s to around 4400 MT/s on most boards. Always run two matched sticks if you want rated speed.
- Ignoring the PSU. A second 3060 needs another ~170 W of headroom plus a separate PCIe power cable. A 550 W gold PSU is tight for a dual-3060 system under load.
- Loading at fp16 when Q4_K_M is fine. Quantization to 4-bit is the cheapest way to fit bigger models. Q4_K_M loses very little quality on most ~7B–13B models.
- Forgetting storage. Model files balloon: a single 70B Q4 model is ~40 GB, and a typical local-LLM library hits 200 GB fast. An NVMe SSD (for example, the Western Digital 1TB WD Blue SN550) is non-negotiable for load times.
When NOT to spend on either
If your existing 3060 + 32 GB DDR4 box already handles your workload — say, you run a 13B-class assistant at 25 tok/s and that's enough — don't spend on either upgrade. The marginal joy of going from 25 to 30 tok/s on the same model size is low. Spend on the GPU when you want a model the 3060 can't run; spend on RAM when you literally cannot load a model you need.
Bottom line
For local LLM inference, the ranking that matters is: fit-in-VRAM > more VRAM > faster VRAM > more system RAM > faster system RAM. A 12 GB RTX 3060 is a better local-LLM machine than a 96 GB DDR5 system without a discrete GPU, full stop. If you're choosing between $90 of DDR5 and $280 of GPU and you care about tok/s, the GPU wins by a wide margin.
The exception is the small slice of users who need to load very large models for occasional use — a researcher who runs a 70B at 3 tok/s once a week beats a researcher who can't run it at all. For that use case, the extra system RAM is worth the modest cost, but it doesn't change the everyday inference math: when you run, you want the weights to live in VRAM.
Related guides
- Best GPU for LLaMA 70B local inference — what it actually takes to run a 70B-class model with the model in VRAM.
- RTX 3060 12GB vs 3060 Ti 8GB for local LLM — the same "VRAM > bandwidth" argument inside the 30-series.
- Qwen 3 6.35B on the RTX 3060 12GB — what the 3060 actually does on a current 7B-class workload.
- Heterogeneous GPU weighting and layer splitting — when mismatched GPUs in one box actually help.
Citations and sources
- NVIDIA GeForce RTX 3060 product page
- TechPowerUp — GeForce RTX 3060 12 GB spec page
- llama.cpp on GitHub — reference inference engine
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
