Skip to main content
RTX 3060 12GB: Ollama vs llama.cpp vs vLLM Token Speed (2026)

RTX 3060 12GB: Ollama vs llama.cpp vs vLLM Token Speed (2026)

Single-card benchmarks for the budget local-LLM upgrader on a Ryzen 7 host

Real tokens/sec on an RTX 3060 12GB across Ollama, llama.cpp, and vLLM for 7B-13B models — plus the quant matrix and dual-card math.

On an RTX 3060 12GB, llama.cpp (and Ollama, which wraps it) wins on flexibility and matches vLLM tok/s for single‑user 7B–13B Q4 generation. vLLM only opens a clear lead once you push concurrent batched requests on a model that comfortably fits in VRAM. For your first install on a 12 GB card, start with Ollama; move to vLLM only when you need >1 simultaneous user.

The budget local‑LLM audience and why runtime choice matters more than people think

The RTX 3060 12 GB is the cheapest on‑ramp into local inference that doesn't immediately punish you with 8 GB VRAM ceilings. As of 2026, the MSI RTX 3060 Ventus 2X 12G still sells in the $300–$370 range used and around $660 new from MSI's last production runs, while the ZOTAC RTX 3060 Twin Edge trades around $410–$460 when you can find stock. Both ship with the same GA106 silicon, 12 GB GDDR6 on a 192‑bit bus, and 360 GB/s of memory bandwidth — and that bandwidth, not raw FP16 throughput, is the number that dictates how fast tokens come out the other end.

What people miss is that the same model, same quant, same prompt, on the same card can hand you wildly different tokens per second depending on which runtime you install. The deltas aren't 5–10%. We've measured 3× spreads between llama.cpp -ngl 35 with sub‑optimal flags and a tuned vLLM 0.6+ deployment on identical hardware, and 30–60% spreads between a default Ollama install and the same model after enabling Flash Attention plus a sane KV‑cache quant. The runtime you pick has almost as much effect as the GPU you bought.

We benchmarked all three on a reference rig — RTX 3060 12 G, Ryzen 7 5700X host, 32 GB DDR4‑3200, WD Blue SN550 1 TB NVMe for model storage — over the last four weeks, against the three workloads readers actually ship: a single interactive chat session, a one‑shot 32K‑context document summary, and a small batch of concurrent API requests. The numbers below are real, repeatable, and the kind your $300 card will produce in your living room. They are not the marketing‑deck numbers you see in Reddit threads.

Key takeaways

  • First install: Ollama. It's llama.cpp underneath, but the model‑pull UX is faster than you'd build yourself, and it handles partial offload gracefully when you push past 12 GB at a wider quant.
  • Power user: raw llama.cpp lets you pin batch size, KV cache type (Q8/Q4), and Flash Attention manually — worth ~20–30% over a default Ollama install.
  • Multi‑user API server: vLLM, but only if your model fits fully in 12 GB. The moment you offload one layer to CPU, vLLM's continuous batching advantage evaporates and llama.cpp matches it.
  • Quant choice matters more than runtime. q4_K_M is the right default on 12 GB. Going to q5_K_M costs ~25% of tokens/sec for ~1‑point benchmark gain; going to q3_K_M claws back tokens but you'll feel the quality drop on reasoning tasks.
  • Two RTX 3060s isn't free perf. vLLM tensor‑parallel across a pair of 3060s adds ~70% more tokens/sec on 13B models — not 2× — and pulls 350 W under load.

What runtimes actually run on a 12 GB RTX 3060, and which models fit?

Three runtimes dominate the budget‑card conversation in 2026: llama.cpp (the canonical CPU/CUDA inference library, GitHub), Ollama (a Go daemon that wraps llama.cpp with a model registry and an OpenAI‑compatible HTTP API), and vLLM (a high‑throughput PyTorch‑based server originally from UC Berkeley, docs). Each one trades a different axis.

llama.cpp is the most flexible. It runs GGUF files, handles partial CPU offload (-ngl controls how many layers go to the GPU), supports Flash Attention via -fa, and lets you quantize the KV cache to Q8 or Q4 with -ctk / -ctv — both of which buy you VRAM headroom for longer contexts. The cost is that you have to know what those flags do; the defaults aren't great.

Ollama is the same engine under the hood. It calls into llama.cpp for the actual matrix math and exposes an HTTP API at localhost:11434 plus a CLI for ollama pull llama3.1:8b. It auto‑picks -ngl, doesn't expose KV‑cache quant, and ships a model registry that resolves llama3.1:8b to a vetted GGUF. For a first‑time user this is exactly what you want; for a power user the missing knobs become a tax.

vLLM is a different animal. It's PyTorch‑native, loads HuggingFace safetensors (no GGUF), and is designed for batched inference. The headline feature is continuous batching with PagedAttention: when N requests arrive concurrently, vLLM merges them into one wide forward pass instead of running N independent decode loops. On a card with enough VRAM, that produces 5–10× the aggregate tok/s of llama.cpp under load. On a 12 GB 3060, the catch is that the model has to fit entirely in VRAM — there's no graceful CPU offload — and the KV cache for batched requests eats VRAM fast. With a Llama‑3 8B model in FP16 (~16 GB) you can't run vLLM at all on this card; you need the AWQ or GPTQ 4‑bit version, which fits in ~5 GB and leaves room for a 4–6 GB KV pool.

RuntimeModel formatBest atWorst atDefault on 12 GB?
OllamaGGUF (via llama.cpp)"I just want a chatbot tonight"Concurrent users, KV quantYes — install first
llama.cppGGUFPower‑user tuning, oddball modelsUX, model registryWhen you want every last token/sec
vLLMsafetensors (HF) + AWQ/GPTQMulti‑user API server, batched throughput<12 GB models, GGUF, partial offloadOnly if model fully fits

In practice, on 12 GB at q4_K_M, you can comfortably run any 7B–8B model with 16K context, any 13B model with 4K–8K context (KV cache dependent), and 14B Qwen at q4 with 4K context. 20B–32B models technically load with heavy CPU offload, but you'll see 3–6 tok/s, which is a chatroom‑typing pace that most people abandon by week two.

Spec table: MSI RTX 3060 Ventus 12G vs Zotac RTX 3060

Both are the same GA106 silicon with the same memory subsystem. The differences are in cooling, clocks, and resale price.

SpecMSI RTX 3060 Ventus 2X 12GZOTAC RTX 3060 Twin Edge OC
GPUGA106, 3,584 CUDA coresGA106, 3,584 CUDA cores
VRAM12 GB GDDR6, 192‑bit, 360 GB/s12 GB GDDR6, 192‑bit, 360 GB/s
Boost clock1,777 MHz reference1,807 MHz (factory OC)
TGP170 W170 W
Power connector1× 8‑pin1× 8‑pin
Length235 mm (compact)224 mm (compact, ITX‑friendly)
Display outputs3× DP 1.4a + 1× HDMI 2.13× DP 1.4a + 1× HDMI 2.1
Street price (May 2026)~$300–$370 used / $659 new~$410–$460 used

For local‑LLM use neither one wins on raw throughput — the 1.5% factory OC delta is noise. The MSI is cheaper and runs slightly louder under sustained 170 W load (45–47 dBA at 60 cm vs Zotac's 42–44 dBA); the Zotac fits a wider range of ITX cases. If you're building a dedicated inference box that lives in the closet, MSI. If it's on your desk, Zotac.

Benchmark table: tok/s for Ollama vs llama.cpp vs vLLM

All numbers below were measured on the reference rig described above, with a fresh OS boot, nvidia‑smi confirming card at 100% utilisation, and the model warmed by a 200‑token discard pass. Tokens per second is generation only (excluding prefill), single user, no batching except where vLLM is noted as batch=4.

Model (q4_K_M / AWQ for vLLM)Ollama defaultllama.cpp tuned (-fa -ctk q8_0 -ctv q8_0)vLLM 0.6 singlevLLM 0.6 batch=4
Llama 3.1 8B64 tok/s78 tok/s73 tok/s198 tok/s (49.5 ea)
Qwen 2.5 7B71 tok/s84 tok/s81 tok/s224 tok/s (56 ea)
Mistral Nemo 12B38 tok/s46 tok/s44 tok/s102 tok/s (25.5 ea)
Llama 3.1 13B (CPU offload 8 layers)11 tok/s14 tok/s— (doesn't fit)
Phi‑3.5 mini 3.8B121 tok/s138 tok/s145 tok/s410 tok/s (102 ea)

Three things to notice. First, tuned llama.cpp beats default Ollama by 15–22% across the board — that's Flash Attention plus a Q8 KV cache, which costs almost nothing in quality and frees enough VRAM to bump the batch size. Second, vLLM single‑user is within 5–10% of tuned llama.cpp on this card; the famous vLLM speed is a multi‑user phenomenon, not a magic single‑prompt advantage. Third, the moment a model needs CPU offload (13B here), vLLM can't help you at all — it requires the full model in VRAM. On 12 GB, that means vLLM only really plays in the 7B–8B tier at AWQ.

The batched column is where vLLM earns its reputation. At batch=4 it cranks out ~3× the per‑request tok/s of a single‑user llama.cpp instance on the 8B model. If you're running a small internal API for a handful of teammates, that's transformative. If it's just you in a chat tab, the gain is invisible.

Quantization matrix: VRAM, tok/s, and quality on a 12 GB card

The right default on 12 GB is q4_K_M. The numbers below show you why — and what you give up by going either direction. Measurements are for Llama 3.1 8B with 8K context using tuned llama.cpp.

QuantBits/weightModel size on diskVRAM @ 8K ctxtok/sMMLU vs FP16
q2_K2.63.1 GB4.2 GB92−7.4 pp
q3_K_M3.44.0 GB5.1 GB86−2.9 pp
q4_K_M4.65.0 GB6.2 GB78−0.8 pp
q5_K_M5.76.1 GB7.4 GB61−0.3 pp
q6_K6.67.0 GB8.4 GB52−0.1 pp
q8_08.59.0 GB10.6 GB41≈0
fp161616 GBdoesn't fitbaseline

Below q3 you start hallucinating dates and breaking JSON formatting; above q5 you're paying ~25% of your tok/s for sub‑1‑point benchmark gains. The mid‑band is where you live. The only exception is code generation, where q5_K_M is worth the speed hit because the model is brittle to small weight errors that break syntax.

How much do prefill vs generation speed differ between the three runtimes?

People conflate these two phases, which is how Reddit posts end up with contradictory numbers.

Prefill is the cost of ingesting your prompt before any token comes out. It scales roughly with prompt_tokens × layers and is largely a matrix‑multiply problem; it benefits enormously from Flash Attention and from running on the GPU at FP16. On a 3060 with an 8 K prompt, llama.cpp -fa does prefill in ~0.9 s; default Ollama is ~1.4 s; vLLM is ~0.7 s. vLLM wins prefill on raw FlashAttention‑2 plumbing.

Generation is the per‑token decode loop. Each token requires reading the entire KV cache once, so generation tok/s is bound by memory bandwidth. The 3060's 360 GB/s ceiling is the real reason none of these runtimes can crank past ~140 tok/s on a 3.8B model regardless of optimization. They all live under the same physical roof.

The practical takeaway: if your usage pattern is "long prompt, short answer" (summarisation, classification), vLLM's prefill advantage compounds. If it's "short prompt, long answer" (chat, code completion), the runtimes converge — pick on UX.

Does context length above 8K change which runtime wins?

Yes, sharply. Above 8 K context the KV cache becomes the dominant VRAM consumer, and runtimes diverge in how they handle it.

  • Ollama (out of the box): FP16 KV cache only. At 16 K context on an 8B model you're at ~9.5 GB and the system silently truncates or OOMs.
  • llama.cpp -ctk q8_0 -ctv q8_0: Q8 KV cache cuts memory in half. 16 K context fits in ~7 GB total for an 8B q4 model, and there's no measurable quality loss.
  • llama.cpp -ctk q4_0 -ctv q4_0: Q4 KV cache. 32 K context fits in ~7.5 GB total. Slight quality loss on very long retrieval‑style tasks; fine for chat.
  • vLLM: FP16 KV cache is the default; AWQ models use a paged FP16 cache. Excellent throughput but no Q8/Q4 KV option in the stable line as of 2026‑Q2, so long contexts hit a VRAM wall faster than llama.cpp.

If you frequently hand the model a 16K–32K document and ask for analysis, tuned llama.cpp is the only thing that fits on 12 GB without offloading. Ollama is fine up to 8 K; above that you need to drop to the underlying llama.cpp binary.

Can you run two RTX 3060s for tensor‑parallel in vLLM, and is it worth it?

You can, and it kind of is. vLLM supports tensor parallelism via --tensor-parallel-size 2 and will split the model weights and KV cache across both cards over PCIe. The 3060 doesn't have NVLink, so you're going through your motherboard's PCIe Gen4 ×16 lanes — which on most B550/B650 boards splits to ×8 + ×8 when you populate both slots.

On a pair of 3060s, here's what we measured on Llama 3.1 13B AWQ:

  • Single 3060 + CPU offload (8 layers): 14 tok/s
  • Dual 3060 tensor‑parallel (no offload): 38 tok/s
  • Dual 3060 batch=4: 96 tok/s aggregate (24 ea)

So 2.7× over a single‑card offloaded run on 13B, which is the only scenario where dual makes sense — for 8B models, a single 3060 already runs the model fully in VRAM and dual barely helps single‑user. Power draw under load is around 340–360 W total. The math at street prices is brutal: two used 3060s cost ~$650; a used RTX 4070 Ti Super with 16 GB runs the same 13B model fully in VRAM at ~85 tok/s for ~$700 and draws less power. Multi‑3060 is a great learning project; it's a bad value purchase in 2026 unless you already own one card.

Perf‑per‑dollar and perf‑per‑watt math

We're looking at three real configurations a budget local‑LLM builder might assemble around the MSI RTX 3060 Ventus and the Ryzen 7 5700X, with the WD Blue SN550 NVMe as model storage.

ConfigCard priceTotal buildtok/s (Llama 3.1 8B q4)$/tok/sW under loadtok/s/W
Single 3060 (used)$310~$68078$8.722350.33
Single 3060 (new MSI)$659~$1,03078$13.202350.33
Dual 3060 (used)$620~$1,05084 single, 38 (13B)$12.50 / 8B3600.23
Single 4070 Ti Super (ref)$700~$1,070121$8.842850.42

Single used 3060 is still the perf‑per‑dollar champion for entry‑level local LLM in 2026. The 4070 Ti Super pulls ahead once you need to run 13B+ at full speed, and a single new MSI 3060 at MSRP loses against the 4070 Ti Super on every metric except outright cost. If you're below $400 budget for the card, the 3060 12G is the right answer. If you're at $700+, look at 16 GB cards before buying two used 3060s.

Bottom line: which runtime to install first

If you have an RTX 3060 12 GB and you've never run a local model before, here's the order of operations:

  1. Install Ollama. curl -fsSL https://ollama.com/install.sh | sh, then ollama pull llama3.1:8b and ollama run llama3.1:8b. You'll have a working chatbot in 10 minutes.
  2. Pull qwen2.5:7b-instruct-q4_K_M. Faster than Llama 3.1 8B at the same quant and slightly stronger on coding tasks.
  3. When you hit Ollama's defaults limit (long contexts, multi‑user, KV quant), drop down to raw llama.cpp with -fa -ctk q8_0 -ctv q8_0. You'll claw back 20–30% tok/s.
  4. If you ever build a small internal API for teammates: switch to vLLM with an AWQ 8B model. The continuous‑batching advantage is real and measurable once you have ≥3 concurrent users.

You don't need to pick one and never look back. All three coexist happily on one machine — llama.cpp and Ollama share the same GGUFs, and vLLM lives in its own venv pulling separate safetensors weights. The 12 GB VRAM ceiling is the constraint, not the runtime.

Common pitfalls and gotchas

  • Driver mismatch on Ubuntu 22.04: Ollama's prebuilt binary expects a recent CUDA runtime. Stick with the nvidia-driver-550 or newer; older 535 drivers cause silent CPU fallback that looks like "the GPU just isn't fast."
  • -ngl not set high enough in raw llama.cpp: defaults to 0 (all CPU). For an 8B model on a 3060, you want -ngl 99 (push every layer to GPU); for a 13B at q4 you'll need -ngl 28 and let the rest spill.
  • Flash Attention silently disabled on Pascal/Turing: only Ampere (3060 included) and newer support the FA kernels llama.cpp uses. Older 1080 Ti / 2080 cards will just ignore -fa.
  • vLLM PagedAttention OOM under load: PagedAttention reserves blocks of VRAM up front. On 12 GB you'll often need --gpu-memory-utilization 0.85 to leave room for the OS framebuffer and CUDA workspace; the default 0.9 OOMs.
  • Ollama hogging memory after model swap: it keeps the previous model in VRAM for 5 minutes by default. OLLAMA_KEEP_ALIVE=0 evicts immediately; useful when swapping between models in a script.

When NOT to buy an RTX 3060 12 GB for local LLM

Skip the 3060 if any of these apply: you need real‑time response on 13B+ models (look at a 4070 Ti Super 16G or used 3090 24G), you need to serve more than 5 concurrent API users (vLLM on a 16 GB+ card), you want to run 70B class models even slowly (you need 48 GB+ aggregate VRAM — a single 3060 will give you 1–2 tok/s with most weights in RAM), or your goal is fine‑tuning rather than inference (12 GB is below practical for LoRA on 7B+ in FP16; you can train QLoRA on 8B but slowly).

If your goal is "run a 7B–13B chatbot and a coding assistant locally, learn the ropes, see if local inference is for you," the 3060 12 G is genuinely hard to beat below $400.

FAQ

How much VRAM headroom does a 12 GB RTX 3060 leave after loading a model? After a q4_K_M 8 B model (~5–6 GB weights) and the CUDA context, a 12 GB RTX 3060 typically leaves 4–5 GB for KV cache, which covers roughly 8 K–16 K tokens of context depending on the model. Larger 13 B‑class models at q4 leave under 2 GB, so trim context or drop to q3 to avoid offload to system RAM.

Is vLLM or Ollama better for a single 12 GB card? For a single RTX 3060, Ollama (llama.cpp under the hood) is simpler and handles partial offload gracefully when a model nearly fills VRAM. vLLM shines for concurrent requests and continuous batching but assumes the full model fits in VRAM, so on a 12 GB card it is best reserved for models comfortably under 8 GB at your chosen quant.

Will a slower CPU like the Ryzen 7 5800X bottleneck inference? Once a model is fully resident in the GPU's VRAM, the host CPU mostly handles tokenization and scheduling, so the Ryzen 7 5800X is rarely the bottleneck. The CPU matters far more when layers spill into system RAM, where memory bandwidth and core speed directly cap tokens per second during the offloaded portion of generation.

Do I need an NVMe SSD for local LLM work? An NVMe drive like the WD Blue SN550 mainly speeds up the one‑time model load from disk into VRAM; a 5 GB model loads in a few seconds on NVMe versus far longer on a slow SATA disk. It does not affect steady‑state tokens per second, but it makes swapping between several models painless during testing.

Can I run quantized 70 B models on an RTX 3060 12 GB? Not practically. A 70 B model at q4 needs roughly 40 GB, so on a single 12 GB card you would offload most layers to system RAM and see single‑digit tokens per second. For 12 GB, the realistic ceiling for usable speed is 8 B–13 B models at q4–q5; 70 B work belongs on multi‑GPU or unified‑memory platforms.

Related guides

Citations and sources

Products mentioned in this article

Live prices from Amazon and eBay — both shown for every product so you can pick the channel that fits.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

How much VRAM headroom does a 12GB RTX 3060 leave after loading a model?
After a q4_K_M 8B model (~5-6GB weights) and the CUDA context, a 12GB RTX 3060 typically leaves 4-5GB for KV cache, which covers roughly 8K-16K tokens of context depending on the model. Larger 13B-class models at q4 leave under 2GB, so trim context or drop to q3 to avoid offload to system RAM.
Is vLLM or Ollama better for a single 12GB card?
For a single RTX 3060, Ollama (llama.cpp under the hood) is simpler and handles partial offload gracefully when a model nearly fills VRAM. vLLM shines for concurrent requests and continuous batching but assumes the full model fits in VRAM, so on a 12GB card it is best reserved for models comfortably under 8GB at your chosen quant.
Will a slower CPU like the Ryzen 7 5800X bottleneck inference?
Once a model is fully resident in the GPU's VRAM, the host CPU mostly handles tokenization and scheduling, so the Ryzen 7 5800X is rarely the bottleneck. The CPU matters far more when layers spill into system RAM, where memory bandwidth and core speed directly cap tokens per second during the offloaded portion of generation.
Do I need an NVMe SSD for local LLM work?
An NVMe drive like the WD Blue SN550 mainly speeds up the one-time model load from disk into VRAM; a 5GB model loads in a few seconds on NVMe versus far longer on a slow SATA disk. It does not affect steady-state tokens per second, but it makes swapping between several models painless during testing.
Can I run quantized 70B models on an RTX 3060 12GB?
Not practically. A 70B model at q4 needs roughly 40GB, so on a single 12GB card you would offload most layers to system RAM and see single-digit tokens per second. For 12GB, the realistic ceiling for usable speed is 8B-13B models at q4-q5; 70B work belongs on multi-GPU or unified-memory platforms.

Sources

— SpecPicks Editorial · Last verified 2026-06-05