48GB DDR5 or 12GB VRAM? What Actually Speeds Up Local LLMs

Why bandwidth — not capacity — decides tokens per second on a 12 GB RTX 3060 build.

For local LLM inference, 12 GB of GPU VRAM beats 48 GB of DDR5 system RAM almost every time. VRAM bandwidth on an RTX 3060 (360 GB/s) is roughly 3.7× that of dual-channel DDR5-6000 (~96 GB/s), and any layer that has to live in system RAM is processed at system-RAM speed rather than VRAM speed. Buy a card that fits your model in VRAM; only spend on more system RAM after you've maxed out GPU VRAM headroom.

Why the question even comes up

It is tempting to look at a $90 kit of 48 GB DDR5 and a $300 12 GB RTX 3060 and assume the obvious win is to pile on system RAM. After all, 48 GB > 12 GB, and a 30B-class model needs more than 12 GB of weight storage anyway. Why pay for a GPU upgrade when you can pay less than a third as much for system memory?

The answer is bandwidth, and it is not close. As of 2026, on a typical mid-range desktop built on AM4 with the AMD Ryzen 7 5800X, DDR5 is not even on the table: you're on DDR4-3200, at roughly 51 GB/s in dual channel. On AM5 with DDR5-6000 in dual channel, your system memory bandwidth tops out around 96 GB/s. The 12 GB of GDDR6 on an MSI RTX 3060 Ventus 2X 12G sits at 360 GB/s, per the TechPowerUp GPU database. For a memory-bandwidth-bound workload, which local LLM inference almost entirely is, that's the whole story.

This piece walks through where the lines actually fall: how llama.cpp splits work between CPU and GPU, what offloading to RAM costs you in tok/s, when more system RAM still helps, and the concrete numbers for a 3060 + 32 GB DDR5 build versus a hypothetical 3060 + 96 GB upgrade.

Key takeaways

  • Local LLM inference is memory-bandwidth-bound, not compute-bound. The number that matters is GB/s of the device holding the active weights.
  • GDDR6 on a 12 GB RTX 3060 has ~3.7× the bandwidth of dual-channel DDR5-6000 and ~7× the bandwidth of dual-channel DDR4-3200. Any layer that lives in system RAM runs at a small fraction of its in-VRAM speed.
  • During generation, layers that don't fit in VRAM stay in system RAM and run on the CPU at DDR speed. Only the small per-token activations cross PCIe (PCIe 4.0 x16 caps at ~32 GB/s), so the tier that throttles tok/s is system DRAM.
  • Keeping a model fully in VRAM is roughly 3–6× faster than heavy offload. On a 3060, a 13B Q4_K model fully in VRAM runs at ~28–35 tok/s; the same model with a few layers in system RAM drops to ~18–24 tok/s; a 30B-class model with half its layers in RAM falls to ~3–6 tok/s.
  • If your goal is faster local inference, put upgrade money toward more VRAM before more system RAM. A used 16 GB card (e.g. RTX 4060 Ti 16GB) or even a second 12 GB 3060 (split-mode inference) does more for tok/s than 48 GB of extra DDR5.

What does "memory-bandwidth bound" actually mean?

When llama.cpp generates a single token, it has to read every weight in every active layer and multiply it against the current activations. For a 13B Q4_K_M model, that's roughly 8 GB of weights touched once per token. On a 360 GB/s memory bus that sets a hard floor of ~22 ms per token, or a ceiling of ~45 tok/s, before any compute, sampler, or KV-cache overhead.
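
The bound described above is just weight bytes divided by memory bandwidth. A minimal sketch, assuming every weight is read exactly once per token and nothing else costs time (the function name is illustrative, not part of llama.cpp):

```python
def decode_ceiling(weights_gb: float, bandwidth_gb_s: float) -> tuple[float, float]:
    """Return (ms_per_token, tokens_per_s) if every weight is read once per token."""
    seconds = weights_gb / bandwidth_gb_s
    return seconds * 1000.0, 1.0 / seconds

# Figures from the text: ~8 GB of 13B Q4_K_M weights, 360 GB/s on the RTX 3060.
ms, tps = decode_ceiling(8.0, 360.0)
print(f"{ms:.1f} ms/token floor, {tps:.0f} tok/s ceiling")
```

Measured throughput lands below this ceiling because compute, sampling, and KV-cache reads all add time on top of the weight read.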

The compute matters too, but not nearly as much as people think. Even a slow Ampere card has tens of TFLOPS of FP16 throughput, while a single-token forward pass needs only tens of GFLOPs of arithmetic (roughly 2 FLOPs per weight, so very low arithmetic intensity). The chip is mostly idle waiting for memory.

The corollary that matters here: if half your weights live in DDR5 at 96 GB/s and half live in GDDR6 at 360 GB/s, your effective end-to-end memory bandwidth is dominated by the slow tier. The fast tier waits for the slow tier. So adding more system RAM to "fit a bigger model" does not help speed if the bottleneck becomes the CPU re-reading its share of the weights from DDR5 on every token.
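
To see how quickly the slow tier takes over, here is a rough sketch that adds up the read time of both tiers. It assumes layers run one after another, that each tier reaches its peak bandwidth, and that compute is free; the function name and defaults are illustrative.

```python
def split_tokens_per_s(weights_gb, gpu_fraction, gpu_bw=360.0, cpu_bw=96.0):
    """Per-token time is the sum of both tiers' read times, since layers run in sequence."""
    gpu_s = weights_gb * gpu_fraction / gpu_bw          # share read from VRAM
    cpu_s = weights_gb * (1.0 - gpu_fraction) / cpu_bw  # share read from system RAM
    return 1.0 / (gpu_s + cpu_s)

for frac in (1.0, 0.9, 0.5, 0.0):
    print(f"{frac:.0%} in VRAM -> {split_tokens_per_s(8.0, frac):.1f} tok/s ceiling")
```

In this model a 50/50 split spends roughly four-fifths of each token's time in the DDR5 half, which is why the fast tier mostly waits.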

What happens when llama.cpp can't fit the whole model in VRAM?

llama.cpp's -ngl N flag splits the model: N transformer layers go to GPU VRAM, and the remainder run on the CPU using system RAM. During generation there is no PCIe-based weight streaming for layers you leave CPU-resident; the CPU runs them in place against system DRAM.

For a 30B model with 60 layers at Q4_K_M (~17 GB total weights) on a 3060 12GB system:

  • -ngl 60: all on GPU. Won't fit (17 GB > 12 GB). Error or crash.
  • -ngl 40: 40 layers on GPU (~11 GB VRAM), 20 layers on CPU (~5.7 GB RAM). Each token does 20 layers of CPU math at ~50–96 GB/s.
  • -ngl 0: pure CPU. Each token runs all 60 layers against DDR.
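
The split arithmetic in the list above can be sketched as a small planner. It assumes all layers are the same size and uses a hypothetical 11.5 GB VRAM budget; real models have uneven layers and also need VRAM for the KV cache, so treat the result as a starting value for -ngl rather than a guarantee.

```python
def plan_split(n_layers, weights_gb, vram_budget_gb):
    """Return (gpu_layers, gpu_gb, cpu_gb), assuming equal-sized layers."""
    per_layer = weights_gb / n_layers
    gpu_layers = min(n_layers, int(vram_budget_gb // per_layer))
    gpu_gb = gpu_layers * per_layer
    return gpu_layers, gpu_gb, weights_gb - gpu_gb

# 30B-class example from the text: 60 layers, ~17 GB at Q4_K_M.
# The 11.5 GB budget is an assumption, not a measured limit.
gpu_layers, gpu_gb, cpu_gb = plan_split(n_layers=60, weights_gb=17.0, vram_budget_gb=11.5)
print(f"-ngl {gpu_layers}: {gpu_gb:.1f} GB in VRAM, {cpu_gb:.1f} GB in system RAM")
```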

Concrete numbers from public llama.cpp testing on a Ryzen 7 5800X + DDR4-3200 + RTX 3060 12GB, Q4_K_M, ~13B model:

Split | tok/s | Notes
Fully in VRAM (-ngl 33/33) | 28–35 | Best case
30/33 layers on GPU | 18–24 | A few layers on CPU
24/33 layers on GPU | 11–16 | About a quarter of layers on CPU
Pure CPU | 3–6 | DDR4-3200 only

The same test on an AM5 system with DDR5-6000 shows pure-CPU numbers around 6–10 tok/s: better, because DDR5 is faster than DDR4, but still 3–5× slower than fully-in-VRAM on the same 3060.

This is the central point: DDR5 is faster than DDR4, but it is not in the same league as GDDR6. Spending money on more system RAM lets you fit a bigger model, but the inference speed of that bigger model on system RAM is roughly the same per-token regardless of whether you have 32 GB or 96 GB of it.

Spec-delta: GDDR6 vs DDR5 vs DDR4

Memory tier | Typical config | Bandwidth | Latency to active core
GDDR6 on RTX 3060 12GB | 192-bit @ 15 Gbps | 360 GB/s | tens of ns (on-board)
GDDR6X on RTX 3070 Ti | 256-bit @ 19 Gbps | ~600 GB/s | tens of ns
HBM3 on datacenter cards | 5120-bit @ 5.2 GT/s | ~3,000 GB/s | tens of ns
DDR5-6000 dual channel | 2× 64-bit @ 6 Gbps | ~96 GB/s | 60–80 ns (+ PCIe hop if read by the GPU)
DDR5-7200 dual channel | 2× 64-bit @ 7.2 Gbps | ~115 GB/s | 60–80 ns
DDR4-3200 dual channel | 2× 64-bit @ 3.2 Gbps | ~51 GB/s | 80–100 ns
PCIe 4.0 x16 (data path) | x16 @ 16 GT/s | ~32 GB/s | 200+ ns transaction

The takeaway: even if you upgraded an AM4 5800X build to a modern AM5 platform with DDR5-7200, your system memory bandwidth would still be ~30% of the 3060's VRAM bandwidth. That ratio is what determines token-per-second when layers are split.
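
The bandwidth figures in the table follow from bus width times per-pin data rate. A short sketch that reproduces them, assuming a 128-bit total width for dual-channel DDR (two 64-bit channels) and the 192-bit bus of the 3060; the helper name is illustrative.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth = bus width in bytes x per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps

tiers = {
    "GDDR6, RTX 3060 12GB (192-bit @ 15 Gbps)": (192, 15.0),
    "DDR5-7200 dual channel (128-bit total)": (128, 7.2),
    "DDR5-6000 dual channel (128-bit total)": (128, 6.0),
    "DDR4-3200 dual channel (128-bit total)": (128, 3.2),
}
vram = peak_bandwidth_gb_s(192, 15.0)
for name, (bits, rate) in tiers.items():
    bw = peak_bandwidth_gb_s(bits, rate)
    print(f"{name}: {bw:.1f} GB/s ({bw / vram:.0%} of the 3060's VRAM)")
```

These are theoretical peaks; sustained bandwidth on real hardware is lower for every tier, but the ratios between tiers stay in the same range.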

Prefill (prompt processing) vs generation (decode)

Local LLM workloads split into two phases:

  • Prefill processes the entire input prompt at once. It is mostly compute-bound. On a 3060 you see hundreds of tok/s during prefill even for long context.
  • Generation produces tokens one at a time. It is memory-bandwidth-bound. On the same 3060 you see tens of tok/s on a 13B model.

Both phases run faster when the model fits in VRAM, because both still need to fetch weights from memory. Prefill is less sensitive to the bandwidth gap because it amortizes the fetch across many sequence positions (the same weights get reused across the entire prompt in a single forward pass). Generation can't amortize — every token re-reads every weight.

The implication: long-prompt, short-output workloads (RAG, code review, document summarization) are less penalized by partial offload. Short-prompt, long-output workloads (chat, agent loops, story generation) get hammered by it. Most local-LLM users are doing the latter, which is why "fit in VRAM" is such a load-bearing requirement.
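
The amortization argument can be made concrete with a roofline-style estimate: one pass over the weights is shared by every token in the batch, while arithmetic grows with batch size. This is a sketch under stated assumptions, namely about 2 FLOPs per parameter per token and a hypothetical 10 TFLOPS of sustained compute; neither figure is a measurement.

```python
def seconds_per_token(batch_tokens, params_b=13.0, weights_gb=8.0,
                      bandwidth_gb_s=360.0, compute_tflops=10.0):
    """Roofline-style estimate: one weight read is shared by every token in the batch."""
    memory_s = weights_gb / bandwidth_gb_s  # one pass over the weights
    # ~2 FLOPs per parameter per token (an approximation, not a measured figure)
    compute_s = batch_tokens * 2 * params_b * 1e9 / (compute_tflops * 1e12)
    return max(memory_s, compute_s) / batch_tokens

for n in (1, 8, 64, 512):
    print(f"batch of {n:>3} tokens -> {1 / seconds_per_token(n):,.0f} tok/s")
```

With a batch of one (generation) the memory term dominates; with a long prompt the compute term takes over and per-token cost drops to a small fraction of the decode cost.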

When does more system RAM actually help?

System RAM upgrades pay off in three specific cases:

  1. You need to load a model that's larger than VRAM. Even with the speed penalty, running Llama-3-70B at 2–4 tok/s on a 96 GB DDR5 box is preferable to "won't load." Operating system + KV cache + a ~40 GB Q4 quant of 70B fits in 64 GB but is a squeeze in 32 GB, even with part of the model on the GPU.
  2. You're running KV cache for long context. A 128k-context Llama-3.1-8B at fp16 KV uses roughly 16 GB of KV cache on top of weights. You can spill that to system RAM, and llama.cpp handles it gracefully, but you need the RAM to spill into.
  3. You're running other things alongside inference: a browser with 200 tabs, ComfyUI, multiple containers. 32 GB total is uncomfortably tight on a modern desktop running a 12 GB VRAM model plus normal workflows.
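
For the long-context case in the list above, KV cache size can be estimated from the model shape. The sketch below assumes Llama-3.1-8B uses 32 layers, 8 KV heads (grouped-query attention), and a head dimension of 128; check those values against the model's own config before relying on the result.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """K and V each store n_kv_heads * head_dim values per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens / 2**30

# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads (GQA), head_dim 128.
print(f"{kv_cache_gib(32, 8, 128, 128 * 1024):.1f} GiB at 128k context, fp16")
print(f"{kv_cache_gib(32, 8, 128, 8 * 1024):.1f} GiB at 8k context, fp16")
```

The cache grows linearly with context length, so halving the context halves the memory it needs.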

Outside of those three cases, more system RAM does not move tok/s. A 3060 12GB with 32 GB DDR4 runs the same 13B Q4 at the same speed as a 3060 12GB with 96 GB DDR5 — what changes is the size of the model you can attempt at all, and what compromises you accept in tok/s when you do.

The "two GPUs vs one big GPU + more RAM" question

If you have a few hundred dollars to spend and an existing 12 GB 3060 system, the choice usually comes down to:

  • +$90: 48 GB DDR5 kit (assumes AM5 board). Unlocks larger model loading at slow tok/s.
  • +$280: Second RTX 3060 12GB. Doubles VRAM to 24 GB; llama.cpp can split a model across two cards in one box (by layer or by row).
  • +$450: Used RTX 3090 24GB. Replaces the 3060 with a card that has 936 GB/s bandwidth and 24 GB VRAM, a single-card win for both speed and capacity.
  • +$650: Used RTX 4090 24GB. Best per-token performance on a consumer card; arguably the canonical local-LLM GPU until a used-5090 market develops.

For pure tok/s the rank is: 4090 > 3090 > dual-3060 > single 3060 + more RAM. The first three are all dramatically ahead of the last. If your only constraint is budget and you must keep the 3060, the second-3060 path is the right next move; the system-RAM path is a distant fallback.

Common pitfalls

  • Buying DDR5 to "speed up the GPU." It doesn't. The GPU has its own faster memory. System RAM only matters when work crosses to the CPU side.
  • Mixing capacities/speeds. Dual-rank vs single-rank modules, mismatched kits, or populating all four slots can drop your DDR5 from its rated 6000 MT/s to around 4400 MT/s on many boards. Always run two matched sticks if you want rated speed.
  • Ignoring the PSU. A second 3060 needs another ~170 W of headroom plus a separate PCIe power cable. A 550 W gold PSU is tight for a dual-3060 system under load.
  • Loading at fp16 when Q4_K_M is fine. Quantization to 4-bit is the cheapest way to fit bigger models. Q4_K_M loses very little quality on most ~7B–13B models.
  • Forgetting storage. Model files balloon: a single 70B Q4 model is ~40 GB, and a typical local-LLM library hits 200 GB fast. An NVMe drive such as the Western Digital 1TB WD Blue SN550 is non-negotiable for load times.

When NOT to spend on either

If your existing 3060 + 32 GB DDR4 box already handles your workload — say, you run a 13B-class assistant at 25 tok/s and that's enough — don't spend on either upgrade. The marginal joy of going from 25 to 30 tok/s on the same model size is low. Spend on the GPU when you want a model the 3060 can't run; spend on RAM when you literally cannot load a model you need.

Bottom line

For local LLM inference, the ranking that matters is: fit-in-VRAM > more VRAM > faster VRAM > more system RAM > faster system RAM. A 12 GB RTX 3060 is a better local-LLM machine than a 96 GB DDR5 system without a discrete GPU, full stop. If you're choosing between $90 of DDR5 and $280 of GPU and you care about tok/s, the GPU wins by a wide margin.

The exception is the small slice of users who need to load very large models for occasional use — a researcher who runs a 70B at 3 tok/s once a week beats a researcher who can't run it at all. For that use case, the extra system RAM is worth the modest cost, but it doesn't change the everyday inference math: when you run, you want the weights to live in VRAM.

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

Frequently asked questions

Does faster DDR5 RAM make local LLM inference faster?
Only when the model does not fully fit in VRAM. If the entire model lives on the GPU, system RAM speed is nearly irrelevant to generation throughput. Once you offload layers, DDR5 bandwidth becomes the limiter for those layers, and a faster kit helps marginally, but even DDR5-6000 is several times slower than the GDDR6 on an RTX 3060, so the penalty remains large.
How big a model can a 12GB RTX 3060 hold entirely in VRAM?
A 13B-14B-class model at Q4_K_M quantization occupies about 8-9 GB, which fits with room left for context. A 32B model at Q4 generally will not fit fully and must keep a portion in system RAM, which sharply reduces tok/s. For all-in-VRAM speed, stick to 7B-14B quantized models on a single 12 GB card.
Is a second RTX 3060 better than buying more system RAM?
For LLM inference, almost always yes. A second 12 GB card adds genuine high-bandwidth VRAM that llama.cpp and vLLM can split a model across, keeping everything off slow system memory. Extra DDR5 only helps the offloaded layers, which are the slow ones. Two 3060s give you 24 GB of fast memory versus a single card leaning on RAM.
Why does offloading to system RAM hurt tokens-per-second so much?
Generation is memory-bandwidth bound: every token requires streaming the model weights. GDDR6 on the RTX 3060 delivers about 360 GB/s, while DDR5-6000 dual-channel delivers roughly 96 GB/s and is read by the CPU rather than the GPU. So any layer living in system RAM runs at a fraction of GPU speed, dragging the whole token loop down to the slowest stage.
Does a long context window change the RAM-versus-VRAM math?
Yes. The KV cache grows with context length and lives alongside the model in memory. A large context can consume several gigabytes, pushing a model that otherwise fit in 12 GB into offload territory. If you run long-context workloads, budget VRAM for the cache first, or reduce context length, before assuming more system RAM will rescue throughput.
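
A quick way to budget VRAM for the cache first is to subtract weights, KV cache, and a runtime allowance from the card's capacity. This is a rough sketch; the 1 GB overhead figure and the example cache sizes are assumptions, not measurements.

```python
def vram_headroom_gb(weights_gb, kv_cache_gb, vram_gb=12.0, overhead_gb=1.0):
    """Positive: fits with room to spare. Negative: that much must be offloaded."""
    return vram_gb - (weights_gb + kv_cache_gb + overhead_gb)

# 8.5 GB of weights (13B-14B at Q4_K_M) with a short and a long context.
print(vram_headroom_gb(8.5, 1.0))  # short context
print(vram_headroom_gb(8.5, 4.0))  # long context
```

A negative result means either trimming context length or accepting offload and the tok/s penalty that comes with it.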

— SpecPicks Editorial · Last verified 2026-06-04
