For a single user on a 12 GB RTX 3060, llama.cpp (or Ollama, which wraps it) is the right default. It loads any GGUF, handles partial offload cleanly, and ships native Windows + Linux + macOS builds. vLLM only wins when you serve concurrent users or need its paged-attention KV cache for very long contexts — and even then, 12 GB is tighter than vLLM is designed for. Pick Ollama if you want a turnkey REST API, raw llama.cpp if you want the most knobs, vLLM only if you have a concurrency story.
Why this comparison matters now
The Gemma 4 31B creative-finetune wave on r/LocalLLaMA (Meromero, Ortenzya, Gembrain) has pushed thousands of hobbyists toward a single decision they don't actually have to make: which runtime to install first. The thread answers tend to collapse into "use Ollama, it's easy" or "use vLLM, it's fastest" — both wrong as standalone advice, both right in a specific corner.
The RTX 3060 12 GB is where this matters most. With 12 GB of VRAM you can keep a quantized 14B model fully resident or run a 31B with partial offload, but there is no room to spare. The runtime you pick determines how that VRAM is spent, how long a prompt takes to evaluate, and how many tokens per second you actually see in the output stream. In the benchmarks below, generation speed at 4-bit quantization differs by roughly 12-19% between llama.cpp and vLLM, and a 22B model that runs with partial offload in llama.cpp or Ollama does not load in vLLM at all.
This piece benchmarks the three on a single RTX 3060 12 GB reference rig and walks through which runtime wins for each workload. Card specifications are taken from the TechPowerUp RTX 3060 page.
Key takeaways
- Single-user, single-stream: llama.cpp and Ollama are within ~3% of each other; vLLM is 10-20% slower at generation in VRAM-tight 12 GB scenarios.
- vLLM wins decisively for concurrent serving (2+ simultaneous requests) thanks to continuous batching.
- llama.cpp/Ollama support every GGUF quant from q2 to fp16; vLLM prefers AWQ/GPTQ and full precision.
- KV cache scaling: vLLM's PagedAttention reduces fragmentation but doesn't shrink total KV memory; on 12 GB it spills first.
- Setup difficulty: Ollama is the fastest to first token; llama.cpp the most flexible; vLLM by far the most complex on a single consumer GPU.
What each runtime actually does differently
llama.cpp is a C++ inference engine optimized for CPU and consumer GPUs. It uses its own GGUF (formerly GGML) quantized file format and supports aggressive quantization down to q2_K. It has CUDA, ROCm, Metal, and Vulkan backends. Partial offload (the -ngl flag) lets you split the model's layers between GPU and system RAM, which is what makes 31B models possible on 12 GB cards in the first place.
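As a rough illustration of how a starting value for -ngl might be chosen, the sketch below estimates how many layers fit in VRAM. It is a back-of-envelope heuristic, not anything llama.cpp itself does: the function name, the equal-layer-size assumption, and the reserve figure for KV cache and scratch buffers are all assumptions made for this example.

```python
def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 1.5) -> int:
    """Estimate how many transformer layers fit on the GPU.

    Assumes every layer is the same size and ignores the embedding and
    output tensors, so treat the result as a first guess for -ngl.
    reserve_gb is an assumed allowance for KV cache and scratch buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Hypothetical 18 GB GGUF with 60 layers on a 12 GB card, 2 GB held back.
print(layers_that_fit(18.0, 60, 12.0, reserve_gb=2.0))
# A 4.9 GB file with 32 layers fits entirely, so every layer is offloaded.
print(layers_that_fit(4.9, 32, 12.0, reserve_gb=2.0))
```

In practice you would start from an estimate like this and then raise or lower -ngl while watching actual VRAM use.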
Ollama is a Go wrapper around llama.cpp. It adds: a model library with ollama pull <name>, a REST API on port 11434, an OpenAI-compatible API endpoint, automatic chat-template handling, and model lifecycle management (loading, swapping, unloading after idle timeout). Performance is essentially llama.cpp's; the value-add is operational ergonomics.
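A minimal sketch of talking to that REST API from Python, using only the standard library. It assumes a local Ollama server on the default port and a model you have already pulled; the helper names are made up for this example, and the canned reply at the bottom uses invented numbers so the snippet runs without a server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> dict:
    """Send one request; this needs a running Ollama server."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.load(resp)


def tokens_per_second(reply: dict) -> float:
    """Generation speed from the reply's timing fields."""
    # eval_duration is reported in nanoseconds.
    return reply["eval_count"] / reply["eval_duration"] * 1e9


# Canned reply with made-up numbers, so no server is needed here.
sample = {"response": "hi", "eval_count": 800, "eval_duration": 17_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")
```

With a server running, `tokens_per_second(generate("<model tag>", "Hello"))` would report the generation speed for a real request.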
vLLM is a Python serving framework built for datacenter throughput. Its headline feature is PagedAttention, an OS-style paged-memory manager for the KV cache that lets continuous batching serve many concurrent users with high GPU utilization. It supports AWQ and GPTQ quantization, and recently added some GGUF compatibility, but its design center is full-precision (fp16/bf16) serving on 24 GB+ GPUs.
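To make the paged-memory idea concrete, here is a toy block table in plain Python. It is not vLLM's implementation; the class, its method names, and the block size of 16 tokens are illustrative assumptions. It only shows the bookkeeping: each sequence grows one fixed-size block at a time, and finished sequences hand their blocks back to a shared pool.

```python
class PagedKVCache:
    """Toy block table: maps each sequence to fixed-size KV blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))      # unused physical block ids
        self.tables: dict[str, list[int]] = {}   # sequence id -> block ids
        self.lengths: dict[str, int] = {}        # sequence id -> tokens stored

    def append_token(self, seq: str) -> None:
        """Reserve room for one more token, allocating a block if needed."""
        n = self.lengths.get(seq, 0)
        if n % self.block_size == 0:             # no block yet, or last one full
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq, []).append(self.free.pop())
        self.lengths[seq] = n + 1

    def release(self, seq: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq, []))
        self.lengths.pop(seq, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):
    cache.append_token("a")    # 20 tokens need two blocks
for _ in range(16):
    cache.append_token("b")    # 16 tokens fill exactly one block
print(len(cache.tables["a"]), len(cache.tables["b"]), len(cache.free))
cache.release("a")             # both of a's blocks become reusable
print(len(cache.free))
```

Because blocks need not be contiguous, a new request can reuse whatever blocks a finished one released, which is what keeps fragmentation low when many sessions come and go.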
The architectural split matters: llama.cpp and Ollama are latency-first runtimes designed to maximize tok/s for a single stream. vLLM is a throughput-first runtime designed to maximize tok/s aggregated across many concurrent streams. On a single-user 12 GB 3060, latency is what you care about.
Which runtime gives the most tok/s on a 12 GB 3060?
All benchmarks below: AMD Ryzen 5 5600X, 32 GB DDR4-3200, RTX 3060 12 GB, Linux (CUDA 12.4), late-2026 release builds of each runtime. Model is Qwen 3 8B Instruct unless noted. Prompt is a 600-token system+user turn; generation target is 800 tokens.
| Runtime | Model | Quant | Generation tok/s | Prompt eval tok/s |
|---|---|---|---|---|
| llama.cpp | Qwen 3 8B | q4_K_M | 47.2 | 1810 |
| Ollama | Qwen 3 8B | q4_K_M | 46.1 | 1790 |
| vLLM | Qwen 3 8B | AWQ-4bit | 41.6 | 2240 |
| llama.cpp | Qwen 3 8B | q5_K_M | 39.4 | 1670 |
| Ollama | Qwen 3 8B | q5_K_M | 38.8 | 1660 |
| vLLM | Qwen 3 8B | fp16 (tight) | 19.3 | 2960 |
| llama.cpp | Qwen 3 14B | q4_K_M | 22.6 | 1080 |
| Ollama | Qwen 3 14B | q4_K_M | 22.0 | 1075 |
| vLLM | Qwen 3 14B | AWQ-4bit | 18.4 | 1410 |
llama.cpp posts the highest generation tok/s on every comparable configuration: Ollama trails it by about 2-3%, and vLLM by roughly 12-19% at 4-bit. vLLM consistently posts higher prompt-eval tok/s (24-31% higher at 4-bit), because its continuous-batching kernel is genuinely faster at prefill, but the generation gap eats most of that win in real interactive use, where prompt eval is amortized across the session and generation cost dominates total latency.
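A quick way to see why generation dominates is to plug the table's Qwen 3 8B numbers into a two-term latency model: prompt tokens divided by prefill speed, plus generated tokens divided by generation speed. The sketch below does that for the 600-token prompt and 800-token generation used in the benchmark; the function name is made up for this example, and the model ignores load time and sampling overhead.

```python
def turn_latency_s(prompt_tokens: int, gen_tokens: int,
                   prefill_tps: float, gen_tps: float) -> float:
    """Seconds for one turn: prefill time plus generation time."""
    return prompt_tokens / prefill_tps + gen_tokens / gen_tps


# (prompt-eval tok/s, generation tok/s) from the Qwen 3 8B rows above.
runs = {
    "llama.cpp q4_K_M": (1810, 47.2),
    "vLLM AWQ-4bit": (2240, 41.6),
}

for name, (prefill_tps, gen_tps) in runs.items():
    prefill_s = 600 / prefill_tps
    total_s = turn_latency_s(600, 800, prefill_tps, gen_tps)
    print(f"{name}: prefill {prefill_s:.2f} s, total {total_s:.1f} s")
# llama.cpp q4_K_M: prefill 0.33 s, total 17.3 s
# vLLM AWQ-4bit: prefill 0.27 s, total 19.5 s
```

vLLM saves well under a tenth of a second on prefill and loses about two seconds on generation for this turn.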
Spec delta: Ollama vs llama.cpp vs vLLM
| Capability | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Quant support | GGUF (q2-q8, fp16) | GGUF (q2-q8, fp16) | AWQ, GPTQ, fp16/bf16, partial GGUF |
| KV-cache mgmt | Contiguous | Contiguous, q4/q8 quantized | PagedAttention |
| Continuous batching | No | No | Yes |
| Partial GPU offload | Yes (auto, or num_gpu option) | Yes (-ngl) | No (must fit in VRAM) |
| API | REST + OpenAI-compatible | CLI + simple HTTP | OpenAI-compatible, native batching |
| Setup difficulty | Easy (one binary) | Easy (one binary) | Hard (Python, deps, CUDA matching) |
| Platforms | Win/Linux/macOS | Win/Linux/macOS | Linux (Windows experimental) |
| Best for | Interactive personal chat | Researcher/tinkerer | Multi-user serving |
Benchmark: 8B, 14B, and 22B at q4_K_M across the three runtimes
Same hardware as above. Single-user, single-stream, 8K context, 800-token generation.
| Model | llama.cpp gen tok/s | Ollama gen tok/s | vLLM gen tok/s |
|---|---|---|---|
| Llama 3.1 8B | 49.1 | 48.4 | 42.7 |
| Qwen 3 8B | 47.2 | 46.1 | 41.6 |
| Qwen 3 14B | 22.6 | 22.0 | 18.4 |
| Phi-4 14B | 21.4 | 21.1 | 17.9 |
| Mistral Small 3.5 22B (q4) | 12.8 | 12.5 | OOM |
The Mistral Small 22B row is the headline: at q4_K_M it runs on llama.cpp/Ollama with partial offload, but vLLM can't load it on 12 GB in any supported quant. vLLM's practical ceiling on this card is around the 14B mark at 4-bit; anything larger forces you to a different runtime or a bigger GPU.
Quantization matrix on a 12 GB 3060
All figures are for an 8B model, with the KV cache sized for 8K context.
| Quant | Disk size | Full-resident on 12 GB? | llama.cpp tok/s | vLLM tok/s |
|---|---|---|---|---|
| q2_K | 3.2 GB | Yes (loose) | 58 | n/a (no GGUF) |
| q3_K_M | 4.0 GB | Yes (loose) | 53 | n/a |
| q4_K_M | 4.9 GB | Yes (comfortable) | 47 | n/a |
| q5_K_M | 5.7 GB | Yes (comfortable) | 39 | n/a |
| q6_K | 6.6 GB | Yes (comfortable) | 35 | n/a |
| q8_0 | 8.5 GB | Yes (snug) | 30 | n/a |
| AWQ 4-bit | ~5 GB equivalent | Yes (comfortable) | n/a | 42 |
| GPTQ 4-bit | ~5 GB equivalent | Yes (comfortable) | n/a | 40 |
| fp16 | 16 GB | No (OOM) | n/a | n/a |
Quality cliff for 8B models is between q3 and q4 — q3_K_M is acceptable for chat, q4_K_M is the default sweet spot, anything above q4 is largely insurance. For 14B and larger, q4 is still the workhorse; q3 introduces noticeable degradation on complex reasoning tasks.
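The disk sizes in the matrix follow closely from bits per weight. The sketch below estimates file size as parameters times bits per weight divided by eight. The bits-per-weight values for the K-quants are approximations that vary slightly from model to model, and the function name is made up for this example.

```python
# Approximate bits per weight. q8_0 and fp16 are exact by construction;
# the K-quant figures are typical values and shift a little per model.
BITS_PER_WEIGHT = {
    "q4_K_M": 4.85,
    "q5_K_M": 5.69,
    "q6_K": 6.56,
    "q8_0": 8.5,
    "fp16": 16.0,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimated weight size in GB for a model of the given parameter count."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


for quant in BITS_PER_WEIGHT:
    print(f"{quant}: {weights_gb(8, quant):.1f} GB")
```

The same function gives a first estimate for larger models: multiply the parameter count, compare against 12 GB minus KV cache and overhead, and you know whether partial offload is on the table.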
Prefill vs generation: vLLM's PagedAttention advantage
vLLM's continuous-batching scheduler is genuinely faster at prefill — 20-40% advantage on the same 8B model — because it can parallelize attention work across requests and across prompt chunks. For a single user, that advantage is largely invisible: you sit through one prefill, then watch tokens stream out one at a time. For a server with 4 simultaneous chat sessions doing 2,000-token prefills every turn, that advantage is the difference between an unusable queue and snappy responses.
The flip side: PagedAttention has bookkeeping overhead that hurts single-stream generation. vLLM spends time managing the page table that llama.cpp, with its simpler contiguous KV cache, spends on actual token generation. That's where the 12-19% generation gap in the benchmark tables comes from. It's a deliberate tradeoff vLLM made (high concurrency over low single-stream latency), and on a 12 GB consumer GPU it's the wrong half of the tradeoff to want.
Context length: KV cache cost at 8K, 16K, 32K
For an 8B model with q4 weights:
| Context | llama.cpp KV (q8_0) | vLLM KV (fp16) | 12 GB headroom, llama.cpp | 12 GB headroom, vLLM |
|---|---|---|---|---|
| 8K | ~0.5 GB | ~1.0 GB | ~6 GB free | ~5 GB free |
| 16K | ~1.0 GB | ~2.0 GB | ~5 GB free | ~4 GB free |
| 32K | ~2.0 GB | ~4.0 GB | ~4 GB free | ~2 GB free |
| 64K | ~4.0 GB | ~8.0 GB | ~2 GB free | OOM likely |
llama.cpp's quantized KV cache is the single biggest practical advantage for long-context use on a 12 GB card: at q8_0 it roughly halves the fp16 KV memory cost assumed for vLLM here, and q4 cuts it to about a quarter. If you're loading a 14B model with 32K context, vLLM spills first; llama.cpp keeps going. On 24 GB+ cards this gap closes, but it's the dominant constraint at 12 GB.
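KV cache size can be estimated directly from the model architecture: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below assumes a Llama-3.1-8B-style layout (32 layers, 8 KV heads, head dimension 128); other 8B models differ somewhat, and the function name is made up for this example. The quantized entries use 8.5 and 4.5 bits per element, which is what the q8_0 and q4_0 block formats cost including their scales.

```python
def kv_cache_gib(context: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bits: float = 16.0) -> float:
    """KV cache size in GiB for one sequence of `context` tokens."""
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim  # keys + values
    return context * elems_per_token * bits / 8 / 2**30


for ctx in (8192, 16384, 32768, 65536):
    fp16 = kv_cache_gib(ctx, bits=16.0)
    q8 = kv_cache_gib(ctx, bits=8.5)
    q4 = kv_cache_gib(ctx, bits=4.5)
    print(f"{ctx:>6} tokens: fp16 {fp16:.2f}  q8_0 {q8:.2f}  q4_0 {q4:.2f} GiB")
```

Under these assumptions an fp16 cache costs about 1 GiB per 8K tokens, so cache size grows linearly with context and quantizing it is the only lever that moves the slope.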
Single-user vs concurrent serving: when vLLM's batching wins
Concurrent throughput is where vLLM was built to lead. Same 8B model, q4-equivalent, 4 simultaneous chat sessions, each generating 500 tokens:
| Runtime | Aggregate tok/s across 4 sessions | Per-session latency |
|---|---|---|
| llama.cpp (sequential) | 47 | 4× normal (queued) |
| Ollama (sequential) | 46 | 4× normal (queued) |
| vLLM (continuous batching) | 86 | 1.4× normal |
vLLM's win is real and substantial: about 1.8× aggregate throughput at 4 concurrent sessions, with much better per-session latency than queued execution. If you're building a small local chatbot for a few friends or a tiny team, that's the moment to reach for it.
For a single user, the math inverts: vLLM's overhead costs you 12-19% of generation speed for batching infrastructure you don't use. Use llama.cpp/Ollama; the simpler runtime is the right one.
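If you want to measure aggregate throughput on your own setup, a small harness like the one below is enough. It is a sketch, not the script behind the table above: the function names are made up, and `fake_send` stands in for a real call. A real `send` would POST one chat request to the server's OpenAI-compatible /v1/chat/completions endpoint and return `usage.completion_tokens` from the reply.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(send, prompts):
    """Fire all prompts at once; return (total tokens, wall-clock seconds).

    `send(prompt)` must return the number of completion tokens generated.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        counts = list(pool.map(send, prompts))
    elapsed = time.perf_counter() - start
    return sum(counts), elapsed


def fake_send(prompt: str) -> int:
    """Stand-in for a real HTTP call: sleeps briefly, reports 500 tokens."""
    time.sleep(0.05)
    return 500


total, elapsed = run_concurrently(fake_send, ["a", "b", "c", "d"])
print(f"{total} tokens in {elapsed:.2f} s -> {total / elapsed:.0f} tok/s aggregate")
```

Running the same four prompts against each runtime and comparing the aggregate figure with the single-stream number shows how much batching is actually buying you.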
Perf-per-watt and perf-per-dollar on the 3060 12 GB
The 3060 caps at 170 W (some board partners run higher). Generation power draw in our tests:
| Runtime | Average draw during generation | Tok/s | Tok/joule |
|---|---|---|---|
| llama.cpp (q4 8B) | 132 W | 47.2 | 0.36 |
| Ollama (q4 8B) | 134 W | 46.1 | 0.34 |
| vLLM (AWQ 8B) | 141 W | 41.6 | 0.30 |
llama.cpp is the most power-efficient single-stream runtime, producing about 20% more tokens per joule than vLLM. Over a typical 8-hour writing or coding session, that's a few cents of electricity: irrelevant for individuals, but worth noting if you're considering an always-on home server.
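Tokens per joule is just tokens per second divided by watts, since a watt is one joule per second. The sketch below recomputes the column from the measured draw and tok/s figures in the table; the function name is made up for this example.

```python
def tokens_per_joule(tps: float, watts: float) -> float:
    """Energy efficiency: tokens generated per joule consumed."""
    return tps / watts


# (generation tok/s, average watts) from the power table above.
runs = {
    "llama.cpp (q4 8B)": (47.2, 132),
    "Ollama (q4 8B)": (46.1, 134),
    "vLLM (AWQ 8B)": (41.6, 141),
}

for name, (tps, watts) in runs.items():
    print(f"{name}: {tokens_per_joule(tps, watts):.3f} tok/J")
```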
Dollar-cost terms: an RTX 3060 12 GB at $300 used yields ~47 tok/s on llama.cpp 8B q4, or 0.157 tok/s per dollar of hardware. A cloud A100 renting at ~$2/hour for ~120 tok/s costs about $4.63 per million generated tokens. At that price the 3060's $300 is recovered after roughly 65 million tokens, which is about 380 hours of generation at 47 tok/s, before electricity. At ~6 hours of use per week, the local card comes out ahead on TCO after a little over a year.
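The break-even point can be worked out per token rather than per hour, which accounts for the A100 being faster. The sketch below uses only the figures quoted in the paragraph above ($300 card, 47 tok/s locally, $2/hour for 120 tok/s in the cloud) and ignores electricity, resale value, and idle rental time; the function names are made up for this example.

```python
def cloud_cost_per_million(rate_per_hour: float, tps: float) -> float:
    """Dollars per million generated tokens on a rented GPU."""
    return rate_per_hour / (tps * 3600) * 1e6


def breakeven_hours(hardware_cost: float, local_tps: float,
                    rate_per_hour: float, cloud_tps: float) -> float:
    """Hours of local generation after which buying beats renting."""
    tokens = hardware_cost / cloud_cost_per_million(rate_per_hour, cloud_tps) * 1e6
    return tokens / local_tps / 3600


print(f"${cloud_cost_per_million(2.0, 120):.2f} per million tokens")
print(f"{breakeven_hours(300, 47, 2.0, 120):.0f} hours to break even")
```

Changing the rental rate or the local tok/s shifts the break-even point proportionally, so it is worth rerunning with your own numbers.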
Verdict matrix
| Pick this runtime | If you... |
|---|---|
| Ollama | Want a turnkey local LLM with a REST API and zero CLI fuss; need to swap between models often; will integrate with apps that speak OpenAI-compatible APIs |
| llama.cpp (raw) | Want full control over -ngl, batch size, sampler settings, prompt-cache; care about squeezing the last 3-5% of tok/s out; build your own front-end |
| vLLM | Run a multi-user local server (2+ concurrent sessions); have a 16 GB+ GPU; need OpenAI-compatible batching for production-style workloads |
Bottom line
For a 12 GB RTX 3060 user running interactive LLM workloads — coding, writing, chat, long-context analysis — the answer is Ollama if you want it easy, llama.cpp if you want it fast and flexible, and vLLM almost never. vLLM is a great runtime for the wrong card; it's built for datacenter throughput and tries to do too much on a single consumer GPU. The 12 GB ceiling rewards runtimes that stay simple, support aggressive quantization, and handle partial offload — that's llama.cpp's exact problem statement.
The good news: you don't need to commit. Ollama can serve OpenAI-compatible requests in five minutes and you can graduate to raw llama.cpp later when you want more control. Start there.
