Ollama vs llama.cpp vs vLLM on the RTX 3060 12GB

Item: ZOTAC Gaming GeForce RTX 3060 Twin Edge OC 12GB GDDR6 192-bit 15 Gbps PCIE 4.0 Gaming Graphics Card, IceStorm 2.0 Cooling, Active Fan Control, Freeze Fan Stop ZT-A30600H-10M
Author: Mike Perry

Which local-LLM runtime actually wins on 12 GB of VRAM

By Mike Perry · Published 2026-05-29 · Last verified 2026-07-20 · 10 min read

Benchmarked head-to-head: Ollama, llama.cpp, and vLLM on the RTX 3060 12 GB across 7B/8B/14B and 22B models, with quant matrices and a clear verdict.

For a single user on a 12 GB RTX 3060, llama.cpp (or Ollama, which wraps it) is the right default. It loads any GGUF, handles partial offload cleanly, and ships native Windows + Linux + macOS builds. vLLM only wins when you serve concurrent users or need its paged-attention KV cache for very long contexts — and even then, 12 GB is tighter than vLLM is designed for. Pick Ollama if you want a turnkey REST API, raw llama.cpp if you want the most knobs, vLLM only if you have a concurrency story.

Why this comparison matters now

The Gemma 4 31B creative-finetune wave on r/LocalLLaMA (Meromero, Ortenzya, Gembrain) has pushed thousands of hobbyists toward a single decision they don't actually have to make: which runtime to install first. The thread answers tend to collapse into "use Ollama, it's easy" or "use vLLM, it's fastest" — both wrong as standalone advice, both right in a specific corner.

The RTX 3060 12 GB is where this matters most. With 12 GB of VRAM you have enough headroom for a quantized 14B model fully resident or a 31B with partial offload, but no room to spare. The runtime you pick determines how that VRAM is spent, how long a prompt takes to evaluate, and how many tokens per second you actually see in the output stream. Across llama.cpp, vLLM, and Ollama, single-stream generation speed differs by 10-20% for the same model at comparable 4-bit quants, and aggregate throughput under concurrent load differs by nearly 2×.

This piece benchmarks the three on a single RTX 3060 12 GB reference rig and walks through which runtime wins for each workload. Specs reference the TechPowerUp RTX 3060 card page.

Key takeaways

Single-user, single-stream: llama.cpp and Ollama are within ~3% of each other; vLLM is 10-20% slower at generation on VRAM-tight 12 GB configurations.
vLLM wins decisively for concurrent serving (2+ simultaneous requests) thanks to continuous batching.
llama.cpp/Ollama support every GGUF quant from q2 to fp16; vLLM prefers AWQ/GPTQ and full precision.
KV cache scaling: vLLM's PagedAttention reduces fragmentation but doesn't shrink total KV memory; on 12 GB it spills first.
Setup difficulty: Ollama is the quickest to get running; llama.cpp is the most flexible; vLLM is by far the most complex on a single consumer GPU.

What each runtime actually does differently

llama.cpp is a C++ inference engine optimized for CPU and consumer GPUs. It uses the GGUF file format (the successor to GGML) and supports aggressive quantization down to q2_K. CUDA, ROCm, Metal, and Vulkan backends are selected at build time, with prebuilt binaries published for the common ones. Partial offload (the -ngl flag) lets you split the model's layers between GPU and system RAM, which is what makes 31B models possible on 12 GB cards in the first place.
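As a rough illustration of how partial offload gets sized, the sketch below estimates a starting value for -ngl from a model's file size and layer count. The helper name, the 1.5 GB reserve for KV cache and CUDA buffers, and the 13.0 GB / 56-layer figures in the example are assumptions for illustration, not measurements from the test rig; real layers are not all the same size, so treat the result as a first guess to tune.

```python
def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 1.5) -> int:
    """Estimate how many transformer layers can stay resident on the GPU.

    Assumes every layer is the same size and ignores the embedding and
    output tensors, so the result is a starting point for -ngl only.
    reserve_gb is a hypothetical allowance for KV cache and scratch buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Hypothetical 22B-class GGUF: 13.0 GB on disk, 56 layers, 12 GB card.
partial = layers_that_fit(13.0, 56, 12.0)

# An 8B q4_K_M file (4.9 GB, 32 layers) fits entirely.
full = layers_that_fit(4.9, 32, 12.0)
```

If the estimate overshoots and the load fails, lowering the value a few layers at a time is usually the quickest way to find the working setting.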

Ollama is a Go wrapper around llama.cpp. It adds a model library with ollama pull <name>, a REST API on port 11434, an OpenAI-compatible API endpoint, automatic chat-template handling, and model lifecycle management (loading, swapping, unloading after an idle timeout). Performance is essentially llama.cpp's; the value-add is operational ergonomics.
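For readers who want to see what the REST API looks like from a script, here is a minimal sketch using only the Python standard library. It targets Ollama's native /api/generate endpoint on the default port and derives tok/s from the eval_count and eval_duration fields in the response. The model tag in the usage note is a placeholder assumption; substitute whatever you have pulled locally.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama port


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a locally running Ollama."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(result: dict) -> float:
    """Generation speed; Ollama reports eval_duration in nanoseconds."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(model: str, prompt: str) -> dict:
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.load(resp)
```

With the server running, something like `result = generate("qwen3:8b", "Explain KV caches in one paragraph.")` followed by `tokens_per_second(result)` gives a quick sanity check against published numbers (the tag `qwen3:8b` is assumed here, not confirmed by the article).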

vLLM is a Python serving framework built for datacenter throughput. Its headline feature is PagedAttention, an OS-style paged-memory manager for the KV cache that lets continuous batching serve many concurrent users with high GPU utilization. It supports AWQ and GPTQ quantization, and recently added some GGUF compatibility, but its design center is full-precision (fp16/bf16) serving on 24 GB+ GPUs.
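To make the paged-memory idea concrete, here is a toy sketch of a block table: each sequence maps to a list of fixed-size KV blocks drawn from a shared pool, so memory is claimed one block at a time rather than as one contiguous slab per request. This is not vLLM's code; the class name, method names, and the block size of 16 tokens are illustrative assumptions.

```python
class PagedKVCache:
    """Toy block table mapping each sequence to fixed-size KV blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # shared physical pool
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, grabbing a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last one full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

The point of the design is that a finished request hands its blocks straight back to the pool, where any other request can reuse them. The total bytes needed per token do not change, which is why paging helps utilization but not the 12 GB ceiling.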

The architectural split matters: llama.cpp and Ollama are latency-first runtimes designed to maximize tok/s for a single stream. vLLM is a throughput-first runtime designed to maximize tok/s aggregated across many concurrent streams. On a single-user 12 GB 3060, latency is what you care about.

Which runtime gives the most tok/s on a 12 GB 3060?

All benchmarks below: AMD Ryzen 5 5600X, 32 GB DDR4-3200, RTX 3060 12 GB, Linux (CUDA 12.4), release builds of each runtime current as of the last-verified date. Model is Qwen 3 8B Instruct unless noted. Prompt is a 600-token system+user turn; generation target is 800 tokens.

Runtime	Model	Quant	Generation tok/s	Prompt eval tok/s
llama.cpp	Qwen 3 8B	q4_K_M	47.2	1810
Ollama	Qwen 3 8B	q4_K_M	46.1	1790
vLLM	Qwen 3 8B	AWQ-4bit	41.6	2240
llama.cpp	Qwen 3 8B	q5_K_M	39.4	1670
Ollama	Qwen 3 8B	q5_K_M	38.8	1660
vLLM	Qwen 3 8B	fp16 (tight)	19.3	2960
llama.cpp	Qwen 3 14B	q4_K_M	22.6	1080
Ollama	Qwen 3 14B	q4_K_M	22.0	1075
vLLM	Qwen 3 14B	AWQ-4bit	18.4	1410

llama.cpp's generation tok/s leads on every comparable configuration: Ollama trails it by 2-3%, and vLLM by roughly 12-19% at 4-bit. vLLM consistently posts higher prompt-eval tok/s, because its batching-oriented kernels are genuinely faster at prefill, but the generation gap outweighs that win in real interactive use, where generating the reply takes far longer than evaluating the prompt.
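A quick back-of-the-envelope model shows why generation speed dominates. The sketch below plugs the q4_K_M and AWQ rows for Qwen 3 8B from the table above into a simple two-phase latency formula, using the 600-token prompt and 800-token reply from the test setup. It assumes prefill and generation do not overlap and ignores model load time, so it is an approximation rather than a measured wall-clock figure.

```python
def turn_latency_s(prompt_tokens: int, gen_tokens: int,
                   prefill_tps: float, gen_tps: float) -> float:
    """Seconds for one turn: prefill time plus generation time."""
    return prompt_tokens / prefill_tps + gen_tokens / gen_tps


# Table rows for Qwen 3 8B at 4-bit (prompt eval tok/s, generation tok/s).
llama_cpp_s = turn_latency_s(600, 800, prefill_tps=1810, gen_tps=47.2)
vllm_s = turn_latency_s(600, 800, prefill_tps=2240, gen_tps=41.6)

# Share of the llama.cpp turn spent on prefill.
prefill_share = (600 / 1810) / llama_cpp_s
```

Under these assumptions the turn takes about 17.3 s on llama.cpp and about 19.5 s on vLLM, and prefill is roughly 2% of the total, so a faster prefill barely moves the result.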

Spec delta: Ollama vs llama.cpp vs vLLM

Capability	Ollama	llama.cpp	vLLM
Quant support	GGUF (q2-q8, fp16)	GGUF (q2-q8, fp16)	AWQ, GPTQ, fp16/bf16, partial GGUF
KV-cache mgmt	Contiguous	Contiguous, q4/q8 quantized	PagedAttention
Continuous batching	Limited (parallel request slots)	Limited (server parallel slots)	Yes (core scheduler)
Partial GPU offload	Yes (automatic, or `num_gpu`)	Yes (`-ngl`)	No (must fit in VRAM)
API	REST + OpenAI-compatible	CLI + simple HTTP	OpenAI-compatible, native batching
Setup difficulty	Easy (one binary)	Easy (one binary)	Hard (Python, deps, CUDA matching)
Platforms	Win/Linux/macOS	Win/Linux/macOS	Linux (Windows experimental)
Best for	Interactive personal chat	Researcher/tinkerer	Multi-user serving

Benchmark: 8B, 14B, and 22B at q4_K_M across the three runtimes

Same hardware as above. Single-user, single-stream, 8K context, 800-token generation.

Model	llama.cpp gen tok/s	Ollama gen tok/s	vLLM gen tok/s
Llama 3.1 8B	49.1	48.4	42.7
Qwen 3 8B	47.2	46.1	41.6
Mistral Small 3.5 22B (q4)	12.8	12.5	OOM
Qwen 3 14B	22.6	22.0	18.4
Phi-4 14B	21.4	21.1	17.9

The Mistral Small 22B row is the headline: at q4_K_M it fits llama.cpp/Ollama with partial offload, but vLLM can't load it on 12 GB in any supported quant. vLLM's practical ceiling on this card is around the 14B mark at 4-bit; anything larger forces you to a different runtime or a bigger GPU.

Quantization matrix on a 12 GB 3060

For an 8B model, with the KV cache sized for an 8K context.

Quant	Disk size	Full-resident on 12 GB?	llama.cpp tok/s	vLLM tok/s
q2_K	3.2 GB	Yes (loose)	58	n/a (no GGUF)
q3_K_M	4.0 GB	Yes (loose)	53	n/a
q4_K_M	4.9 GB	Yes (comfortable)	47	n/a
q5_K_M	5.7 GB	Yes (comfortable)	39	n/a
q6_K	6.6 GB	Yes (comfortable)	35	n/a
q8_0	8.5 GB	Yes (snug)	30	n/a
AWQ 4-bit	~5 GB equivalent	Yes (comfortable)	n/a	42
GPTQ 4-bit	~5 GB equivalent	Yes (comfortable)	n/a	40
fp16	16 GB	No (OOM)	n/a	n/a

Quality cliff for 8B models is between q3 and q4 — q3_K_M is acceptable for chat, q4_K_M is the default sweet spot, anything above q4 is largely insurance. For 14B and larger, q4 is still the workhorse; q3 introduces noticeable degradation on complex reasoning tasks.

Prefill vs generation: vLLM's PagedAttention advantage

vLLM's continuous-batching scheduler is genuinely faster at prefill — 20-40% advantage on the same 8B model — because it can parallelize attention work across requests and across prompt chunks. For a single user, that advantage is largely invisible: you sit through one prefill, then watch tokens stream out one at a time. For a server with 4 simultaneous chat sessions doing 2,000-token prefills every turn, that advantage is the difference between an unusable queue and snappy responses.

The flip side: PagedAttention has bookkeeping overhead that hurts single-stream generation. vLLM spends time managing the page table, time that llama.cpp, with its simpler contiguous KV cache, spends on actual token generation. That's where the 10-20% generation gap comes from. It's a deliberate tradeoff vLLM made (high concurrency over low single-stream latency), and on a 12 GB consumer GPU it's the wrong half of the tradeoff to want.

Context length: KV cache cost at 8K, 16K, 32K

For an 8B model at q4 weights:

Context	llama.cpp KV (q8_0)	vLLM KV (fp16)	Free VRAM on 12 GB (llama.cpp)	Free VRAM on 12 GB (vLLM)
8K	~0.5 GB	~1.0 GB	~6 GB free	~5 GB free
16K	~1.0 GB	~2.0 GB	~5 GB free	~4 GB free
32K	~2.0 GB	~4.0 GB	~4 GB free	~2 GB free
64K	~4.0 GB	~8.0 GB	~2 GB free	OOM likely

llama.cpp's quantized KV cache is the single biggest practical advantage for long-context use on a 12 GB card: at q8_0 it roughly halves vLLM's fp16 KV memory cost, and q4_0 cuts it further, to a little over a quarter. If you're loading a 14B model with 32K context, vLLM runs out of VRAM first; llama.cpp keeps going. On 24 GB+ cards this gap matters less, but it's the dominant constraint at 12 GB.
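The KV figures above follow from a simple formula: two tensors (K and V) per layer, each holding one vector per KV head per token. The sketch below assumes a Llama-3.1-8B-style shape (32 layers, 8 KV heads, head dimension 128); other 8B models differ slightly, so the shape is an assumption, not a property of every model in the tables. The 8.5 bits per element used for the 8-bit-class cache accounts for the per-block scale stored alongside the quantized values.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bits_per_element: float) -> float:
    """KV cache size in GiB for one sequence at the given context length."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_tokens  # K and V
    return elements * bits_per_element / 8 / 2**30


SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)  # assumed 8B-class shape

fp16_8k = kv_cache_gib(**SHAPE, context_tokens=8192, bits_per_element=16)
q8_8k = kv_cache_gib(**SHAPE, context_tokens=8192, bits_per_element=8.5)
fp16_32k = kv_cache_gib(**SHAPE, context_tokens=32768, bits_per_element=16)
```

With this shape, fp16 costs 1.0 GiB at 8K and 4.0 GiB at 32K, while the 8-bit-class cache costs about 0.53 GiB at 8K. Cost scales linearly with context, so doubling the window doubles the bill in every runtime.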

Single-user vs concurrent serving: when vLLM's batching wins

Concurrent throughput is where vLLM was built to lead. Same 8B model, q4-equivalent, 4 simultaneous chat sessions, each generating 500 tokens:

Runtime	Aggregate tok/s across 4 sessions	Per-session latency
llama.cpp (sequential)	47	4× normal (queued)
Ollama (sequential)	46	4× normal (queued)
vLLM (continuous batching)	86	1.4× normal

vLLM's win is real and substantial: about 1.8× aggregate throughput at 4 concurrent sessions, with much better per-session latency than queued execution. If you're building a small local chatbot for a few friends or a tiny team, that's the moment to reach for it.

For a single user, the math inverts: vLLM's overhead costs you 10-20% of generation speed for batching infrastructure you don't use. Use llama.cpp/Ollama; the simpler runtime is the right one.

Perf-per-watt and perf-per-dollar on the 3060 12 GB

The 3060 caps at 170 W (some board partners run higher). Generation power draw in our tests:

Runtime	Average draw during generation	Tok/s	Tok/joule
llama.cpp (q4 8B)	132 W	47.2	0.36
Ollama (q4 8B)	134 W	46.1	0.34
vLLM (AWQ 8B)	141 W	41.6	0.30

llama.cpp is the most power-efficient single-stream runtime, about 4% ahead of Ollama and roughly 20% ahead of vLLM in tokens per joule. Over a typical 8-hour writing or coding session, the difference is pennies of electricity: irrelevant for individuals, but worth noting if you're considering an always-on home server.

In dollar terms, a used RTX 3060 12 GB at $300 yields ~47 tok/s on llama.cpp with an 8B q4 model, or 0.157 tok/s per dollar of hardware. A cloud A100 rents at ~$2/hour for ~120 tok/s, which works out to about $4.60 per million generated tokens. At that rate the 3060 pays for itself after roughly 65 million tokens, or about 380 hours of generation; at ~6 hours/week of sustained use, that is a little over a year, before electricity.
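The efficiency and cost arithmetic in this section can be reproduced with a few lines. The sketch below uses only the figures quoted above ($300 card, 47 tok/s locally, $2/hour and 120 tok/s for the rented A100, and the measured wattages). It deliberately ignores electricity, idle time, and the rest of the PC, so the break-even figure is an optimistic lower bound; the function names are illustrative.

```python
def tokens_per_joule(tok_per_s: float, watts: float) -> float:
    """Energy efficiency: tokens generated per joule of board power."""
    return tok_per_s / watts


def usd_per_million_tokens(usd_per_hour: float, tok_per_s: float) -> float:
    """Rental cost of generating one million tokens."""
    return usd_per_hour / (tok_per_s * 3600) * 1_000_000


def break_even_hours(card_usd: float, local_tok_per_s: float,
                     cloud_usd_per_hour: float, cloud_tok_per_s: float) -> float:
    """Hours of local generation whose token output would cost card_usd to rent."""
    cloud_usd_per_token = cloud_usd_per_hour / (cloud_tok_per_s * 3600)
    tokens_to_break_even = card_usd / cloud_usd_per_token
    return tokens_to_break_even / local_tok_per_s / 3600


llama_cpp_tpj = tokens_per_joule(47.2, 132)
vllm_tpj = tokens_per_joule(41.6, 141)
a100_cost = usd_per_million_tokens(2.0, 120)
hours = break_even_hours(300, 47, 2.0, 120)
```

With these inputs the A100 costs about $4.63 per million tokens and the card breaks even after about 383 hours of generation. Changing the rental price or the hours per week shifts the payback time proportionally.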

Verdict matrix

Pick this runtime	If you...
Ollama	Want a turnkey local LLM with a REST API and zero CLI fuss; need to swap between models often; will integrate with apps that speak OpenAI-compatible APIs
llama.cpp (raw)	Want full control over `-ngl`, batch size, sampler settings, prompt-cache; care about squeezing the last 3-5% of tok/s out; build your own front-end
vLLM	Run a multi-user local server (2+ concurrent sessions); have a 16 GB+ GPU; need OpenAI-compatible batching for production-style workloads

Bottom line

For a 12 GB RTX 3060 user running interactive LLM workloads — coding, writing, chat, long-context analysis — the answer is Ollama if you want it easy, llama.cpp if you want it fast and flexible, and vLLM almost never. vLLM is a great runtime for the wrong card; it's built for datacenter throughput and tries to do too much on a single consumer GPU. The 12 GB ceiling rewards runtimes that stay simple, support aggressive quantization, and handle partial offload — that's llama.cpp's exact problem statement.

The good news: you don't need to commit. Ollama can serve OpenAI-compatible requests in five minutes and you can graduate to raw llama.cpp later when you want more control. Start there.

Frequently asked questions

Which runtime is fastest for a single user on a 12GB RTX 3060?

For one user generating sequentially, llama.cpp and Ollama (which wraps llama.cpp) are usually within a few percent of each other because both use the same GGUF kernels. vLLM's advantage is throughput under concurrent requests via continuous batching, which a single interactive user rarely triggers. Benchmark your own model, but expect llama.cpp-based stacks to lead or tie for solo chat.

Can vLLM even run well in only 12GB of VRAM?

vLLM was designed for datacenter cards and prefers full-precision or AWQ/GPTQ weights that can be tight in 12GB. It runs, but you are limited to smaller models or aggressive quantization, and its paged-attention KV cache competes with weights for the same 12GB. For tight-VRAM single-GPU setups, GGUF runtimes are generally the more forgiving choice.

Does Ollama add overhead compared to raw llama.cpp?

Ollama is a convenience layer over llama.cpp, so raw throughput is essentially the same once a model is loaded. The differences are operational: Ollama manages model pulls, templates, and a REST API for you, while raw llama.cpp gives finer control over flags like layer offload and batch size. The tok/s delta is typically noise, not a real performance gap.

Which runtime handles long context best on this card?

Long context is gated by KV-cache memory, which grows with sequence length regardless of runtime. On a 12GB card you will hit a wall faster than on a 24GB card no matter what you pick, but llama.cpp's quantized KV-cache options buy back some headroom that vLLM's full-precision KV cannot. Plan context budget against VRAM math before choosing the runtime.

Do I need Linux, or will these run on Windows?

Ollama and llama.cpp both ship native Windows builds with CUDA support, so a Windows RTX 3060 box works fine for interactive chat. vLLM is primarily a Linux/Python project and Windows support is experimental; serious vLLM deployments live on Linux. If you are dual-booting or running Windows-only, treat vLLM as the runtime you skip.
