Ollama, llama.cpp, and vLLM are the three runtimes most home AI builders weigh against each other on a 12GB RTX 3060. The short answer: for single-user chat on one card, llama.cpp wins on raw tokens per second; Ollama is within a few percent and easier to live with; vLLM only pulls ahead once you serve multiple concurrent users.
Why this question matters more in 2026
Claude Opus 4.8 landed this week with an Intelligence Index of 61.4 on Artificial Analysis, and the predictable result lit up r/LocalLLaMA again: "what can I run at home that gets close?" The answer for most readers is a 7B-13B open model on a budget GPU, and the budget GPU that keeps showing up is the same one it has been for three years — the NVIDIA GeForce RTX 3060 12GB. It is still the cheapest 12GB CUDA card on Amazon, and 12GB is the floor where a Q4_K_M 13B model fits with room for a usable context window.
Picking the GPU is the easy half of the decision. The hard half is which runtime — Ollama, llama.cpp, or vLLM — actually gets the most tokens per second out of those 12GB. The three are not interchangeable, and the wrong choice can leave 30-40% of the card's potential on the floor. This piece walks through what each is doing under the hood, where they win, where they lose, and which one you should install first on a fresh RTX 3060 build in 2026.
Key Takeaways
- llama.cpp wins single-stream throughput on a 12GB RTX 3060 by a small but consistent margin against Ollama, and a larger margin against vLLM
- Ollama is llama.cpp under the hood; most of the time you give up only 1-4% of throughput in exchange for a lot less friction
- vLLM is built for batched serving; on one user it uses more VRAM and is often slower
- A 7B Q4_K_M model leaves room for a 32K context window on 12GB; a 13B Q4_K_M does not
- KV-cache size grows linearly with context length and is the silent killer of "why won't this fit anymore"
What is each runtime actually doing?
llama.cpp is a C++ inference engine that loads GGUF-quantized weights and runs them on CPU, GPU, or a mix. On an RTX 3060 you compile (or grab a prebuilt binary) with CUDA support, offload all layers to VRAM, and you have a single-process, single-stream tokens-per-second machine. There is no scheduler, no batching engine, no API gateway — it is the model and a ./server binary.
Ollama is a wrapper around llama.cpp. It downloads quantized models from its registry, runs them under a daemon, and exposes an HTTP API plus a CLI. Under the hood it is still calling llama.cpp's CUDA kernels. The overhead is the daemon, the request marshaling, and a defensive memory-budgeting layer that sometimes leaves a little VRAM on the table.
vLLM is a different beast. It implements PagedAttention, which manages the KV cache in fixed-size blocks like an operating system manages virtual memory. That lets it batch many concurrent requests, share KV pages between sequences, and squeeze more aggregate tokens per second out of the card when multiple users hit it at once. The trade-off: more engine overhead for a single user, and stricter VRAM accounting that on 12GB can refuse to load models that llama.cpp accepts.
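To make the paging analogy concrete, here is a toy block allocator in Python. It is a sketch of the idea only, not vLLM's implementation: the class name, the method names, and the 16-token block size are assumptions chosen for illustration.

```python
class PagedKVCache:
    """Toy model of block-based KV-cache bookkeeping (not vLLM's real code)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size              # tokens per block (assumed value)
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                    # sequence id -> physical block ids
        self.lengths = {}                         # sequence id -> tokens stored

    def append_token(self, seq_id):
        """Account for one more token, taking a new block only when the last one is full."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:         # first token, or last block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8)
for _ in range(33):
    cache.append_token("user-a")
for _ in range(5):
    cache.append_token("user-b")
blocks_a = len(cache.block_tables["user-a"])      # 33 tokens occupy 3 blocks of 16
blocks_b = len(cache.block_tables["user-b"])      # 5 tokens occupy 1 block
```

Because each sequence wastes at most one partially filled block, many sequences of different lengths can share the same pool. A single user never benefits from that sharing, which is why the bookkeeping shows up only as overhead.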
How much VRAM does each leave for the model?
This is where the 12GB ceiling bites. The Windows or Linux desktop session itself reserves a few hundred MB. CUDA contexts, cuBLAS handles, and runtime working memory take more. Then the model weights have to land, and finally the KV cache for the prompt and generation.
On an RTX 3060 12GB running a stock Ubuntu desktop, here is a representative budget at idle with a 7B Q4_K_M model loaded:
| Runtime | Engine + CUDA overhead | Model weights | KV-cache headroom | Notes |
|---|---|---|---|---|
| llama.cpp | ~700 MB | ~4.4 GB | ~6.5 GB | Most generous to the cache |
| Ollama | ~900 MB | ~4.4 GB | ~6.2 GB | A little more daemon overhead |
| vLLM | ~1.6 GB | ~4.4 GB | ~5.5 GB | Engine reserves bigger slabs |
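KV-cache headroom translates directly to context window: a longer context costs roughly n_layers × n_kv_heads × head_dim × 2 (K and V) × context_length × bytes_per_element per sequence. For a 7B with 32 layers, 32 KV heads of dimension 128, and FP16 KV, you spend about 0.5 MB per token. 6 GB of headroom is therefore in the neighborhood of 12K usable context, more if you accept Q8 or Q4 KV-cache quantization. Models with grouped-query attention, such as Llama-3.1 8B and Mistral 7B with 8 KV heads, cache a quarter as much per token, which is how a 32K window can fit.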
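The formula above can be checked with a few lines of Python. This is a sketch: the function names are mine, and the model shapes (32 layers, head dimension 128, with either 32 or 8 KV heads) are assumptions standing in for typical 7B-class architectures.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_element=2):
    """Bytes of KV cache one token occupies: K and V for every layer and KV head."""
    return n_layers * n_kv_heads * head_dim * 2 * bytes_per_element


def max_context(headroom_gib, per_token_bytes):
    """Largest context (in tokens) that fits in the given headroom."""
    return int(headroom_gib * 1024**3 // per_token_bytes)


# Assumed shapes: 32 layers, head_dim 128, FP16 cache (2 bytes per element).
mha = kv_bytes_per_token(32, 32, 128)   # 32 KV heads, full multi-head attention
gqa = kv_bytes_per_token(32, 8, 128)    # 8 KV heads, grouped-query attention

print(mha / 1024**2)          # MiB per token
print(max_context(6, mha))    # tokens that fit in 6 GiB
print(max_context(6, gqa))
```

With the 32-head shape this gives 0.5 MiB per token and 12,288 tokens in 6 GiB; the 8-head shape stretches the same headroom to 49,152 tokens.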
Benchmark table: tok/s across 7B/8B/13B
These are representative single-stream prompt-completion numbers from public benchmarks and our own re-runs on an RTX 3060 12GB paired with an AMD Ryzen 7 5800X, 32 GB DDR4-3600, and an NVMe SSD for model loads. Generation tok/s, single user, 512-token output, batch size 1.
| Model (Q4_K_M) | llama.cpp tok/s | Ollama tok/s | vLLM tok/s |
|---|---|---|---|
| Llama-3.1 8B | 47 | 45 | 38 |
| Mistral 7B | 52 | 51 | 41 |
| Qwen2.5 7B | 50 | 48 | 39 |
| Llama-2 13B | 24 | 23 | DNF (OOM at 4K ctx) |
The pattern: llama.cpp leads Ollama by 1-2 tok/s (about 2-4%) in single-stream, which is what you would expect from the same engine plus daemon overhead; vLLM trails by 9-11 tok/s on one user and OOMs on the 13B at meaningful context lengths because its block allocator reserves more headroom for hypothetical concurrent sequences.
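If you want to reproduce this kind of number on your own card, one option is to read the timing fields Ollama returns from its /api/generate endpoint. The sketch below is an illustration rather than the harness behind the table: the helper names are mine, it assumes a daemon listening on the default localhost:11434, and the sample response is made up.

```python
import json
import urllib.request


def generation_tok_per_s(response):
    """Generation speed from an Ollama /api/generate response body.

    eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_once(model, prompt, host="http://localhost:11434"):
    """Send one non-streaming request to a local Ollama daemon (assumed running)."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": 512},   # cap output at 512 tokens, as in the table
    }).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Made-up response body; on a real machine use run_once("<model tag>", "<prompt>").
sample = {"eval_count": 512, "eval_duration": 11_000_000_000}
print(round(generation_tok_per_s(sample), 1))
```

Run the same prompt several times and discard the first result, since the first request also pays the model-load cost.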
Quantization matrix: which model size fits at which quant on 12GB?
The choice of quantization decides whether you fit at all. Here is what a 13B model looks like across the common GGUF quants on a 12GB card with a 4K context:
| Quant | Bits/weight | 13B weights | Fits 12GB at 4K ctx? | Quality vs FP16 |
|---|---|---|---|---|
| Q2_K | ~2.6 | ~4.3 GB | Yes, with lots of room | Noticeable degradation |
| Q3_K_M | ~3.4 | ~5.6 GB | Yes | Mild degradation |
| Q4_K_M | ~4.5 | ~7.4 GB | Yes, tight | Near-imperceptible |
| Q5_K_M | ~5.4 | ~8.9 GB | Marginal; loads, but fails past ~3K | Indistinguishable |
| Q6_K | ~6.6 | ~10.8 GB | No | Indistinguishable |
| Q8_0 | ~8.5 | ~13.9 GB | No | Reference |
| FP16 | 16 | ~26 GB | No (offload required) | Reference |
For 7B/8B, Q4_K_M and Q5_K_M both fit comfortably with room for long context. For 13B, Q4_K_M is the practical ceiling at sensible context lengths; Q5_K_M loads but leaves so little KV-cache headroom that anything over ~3K tokens crashes.
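The weight sizes in the table follow from parameters × bits per weight ÷ 8, and the context ceiling follows from whatever is left over. Here is a rough Python sketch under stated assumptions: 0.7 GB of engine overhead (the llama.cpp row of the budget table), and about 0.78 MB of FP16 KV cache per token, which corresponds to a 13B shape of 40 layers and 40 KV heads of dimension 128. The function names are hypothetical.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size of quantized weights in GB: parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def max_context_tokens(weights, total_gb=12.0, overhead_gb=0.7, kv_mb_per_token=0.78):
    """Tokens of KV cache that fit after weights and engine overhead are loaded."""
    headroom_mb = (total_gb - overhead_gb - weights) * 1024
    return max(0, int(headroom_mb / kv_mb_per_token))


for name, bits in [("Q4_K_M", 4.5), ("Q5_K_M", 5.4), ("Q6_K", 6.6)]:
    size = weights_gb(13, bits)
    print(name, round(size, 1), "GB ->", max_context_tokens(size), "tokens")
```

With these assumptions Q4_K_M lands a little above 5K tokens, Q5_K_M a little above 3K, and Q6_K below 1K. The computed weight sizes come out slightly under the table's figures because the estimate ignores non-quantized tensors and rounds the parameter count to 13B.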
Prefill vs generation: where vLLM wins and loses
Prefill is processing the prompt; generation is producing new tokens. vLLM batches prefill across concurrent requests very efficiently — if you have ten users sending 2K-token prompts at the same time, vLLM can process them in a single fused pass, where llama.cpp processes them serially. That is the case vLLM was designed for.
On a single user, prefill happens once per request and generation dominates wall-clock time. Generation is harder to batch within one sequence — you are predicting one token at a time. PagedAttention does not help here, and the engine overhead becomes a tax.
So: if your endpoint is "I am the only user, give me a chat completion," llama.cpp or Ollama is faster. If your endpoint is "five people in my house share the assistant and sometimes send prompts at the same time," vLLM's aggregate throughput climbs even though each individual response is slower.
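To see which side of that line your household falls on, measure aggregate throughput under concurrent load rather than single-stream speed. The Python sketch below fires several requests at once and divides total generated tokens by wall-clock time. The `send` callable is a placeholder for your own client code, and `fake_send` only simulates a request so the example runs without a server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def aggregate_tok_per_s(send, n_users):
    """Fire n_users requests at once and return (total tokens, aggregate tok/s).

    `send` is any callable that performs one request and returns the number of
    tokens generated; it is a placeholder for your own client code.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        counts = list(pool.map(lambda _: send(), range(n_users)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    return total, total / elapsed


def fake_send():
    """Stand-in for a real request: pretends to generate 100 tokens in 50 ms."""
    time.sleep(0.05)
    return 100


total, rate = aggregate_tok_per_s(fake_send, n_users=5)
print(total, round(rate))
```

Running the same function with `n_users=1` and `n_users=5` against each runtime shows whether batching raises the aggregate figure enough to offset the slower individual responses.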
Context-length impact: the KV-cache ceiling
People underestimate this constantly. Doubling context length doubles KV-cache memory. A 7B with grouped-query attention at 32K context uses roughly 4-6 GB of KV-cache alone in FP16, and on a 12GB card with a 4.4 GB model that is your entire headroom. You either drop to Q8 or Q4 KV quantization (llama.cpp supports this; Ollama is starting to; vLLM has its own knob), lower max context, or swap to a smaller model.
If you are mostly running 4K-8K prompts, the trade-off does not bite. If you are pasting whole codebases or building a long-document RAG pipeline, KV-cache quantization is no longer optional, and llama.cpp is currently the most flexible engine for it.
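How much KV-cache quantization buys can be estimated from the storage cost per cached element. The sketch below assumes a 7B-class model with 32 layers and 8 KV heads of dimension 128, and uses the GGML block layouts for q8_0 and q4_0 (32 values plus a 2-byte scale per block). The function name and default shape are mine, not taken from any runtime.

```python
# Bytes per cached element. q8_0 and q4_0 store blocks of 32 values plus a
# 2-byte scale, giving 34 and 18 bytes per block respectively.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(context, cache_type, n_layers=32, n_kv_heads=8, head_dim=128):
    """KV-cache size in GiB for one sequence (assumed 7B-class GQA shape)."""
    elements = n_layers * n_kv_heads * head_dim * 2 * context   # K and V
    return elements * BYTES_PER_ELEMENT[cache_type] / 1024**3


for cache_type in BYTES_PER_ELEMENT:
    print(cache_type, round(kv_cache_gib(32 * 1024, cache_type), 3), "GiB")
```

Under these assumptions a 32K context drops from 4 GiB in FP16 to 2.125 GiB with q8_0 and 1.125 GiB with q4_0, which is the difference between filling the card and having room to spare.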
Does multi-GPU change the answer?
Two RTX 3060 12GBs is 24 GB of pooled VRAM at roughly the cost of a single RTX 4080. It is tempting. The reality is uneven: llama.cpp supports tensor split across GPUs but the PCIe bus becomes the bottleneck for some operations; Ollama inherits that support and works the same way; vLLM was designed for multi-GPU and actually scales the cleanest here, with near-linear speedup on prefill and modest gains on generation depending on the model.
If your goal is "run 70B at Q4 locally," two RTX 3060s plus vLLM is the cheapest entry point, but check the arithmetic first: at ~4.5 bits per weight a 70B model is roughly 39 GB of weights, so 24 GB only works with a lower quant or partial CPU offload. If your goal is "fastest 7B on a budget," a single RTX 3060 with llama.cpp beats the dual-card setup on tokens per second because you do not pay the cross-GPU communication tax.
Perf-per-dollar and perf-per-watt vs newer cards
The RTX 3060 12GB still costs less than half of an RTX 4070 Super, draws ~170W under load, and gets you 45-50 tok/s on a 7B Q4_K_M. The newer card pushes that to 75-90 tok/s for roughly twice the cost. Per-dollar, the 3060 still wins; per-watt at idle and light load it also wins (lower base TDP); per-watt at full throttle the newer card pulls ahead because it finishes the job faster.
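The per-dollar and per-watt claims reduce to two divisions. The Python sketch below uses the midpoints of the throughput ranges quoted above and a relative cost of 1.0 versus 2.0; the 220 W figure for the newer card is an assumption, since the text gives a power draw only for the 3060.

```python
def joules_per_token(watts, tok_per_s):
    """Energy spent per generated token at full load."""
    return watts / tok_per_s


def tok_per_s_per_cost(tok_per_s, relative_cost):
    """Throughput per unit of relative price (RTX 3060 = 1.0)."""
    return tok_per_s / relative_cost


# Midpoints of the ranges quoted above; 220 W for the newer card is an assumption.
cards = {
    "RTX 3060 12GB": {"watts": 170, "tok_per_s": 47.5, "relative_cost": 1.0},
    "RTX 4070 Super": {"watts": 220, "tok_per_s": 82.5, "relative_cost": 2.0},
}

for name, c in cards.items():
    print(name,
          round(joules_per_token(c["watts"], c["tok_per_s"]), 2), "J/token,",
          round(tok_per_s_per_cost(c["tok_per_s"], c["relative_cost"]), 1),
          "tok/s per cost unit")
```

With these inputs the 3060 delivers more tokens per second per unit of cost, while the newer card spends fewer joules per token under full load, matching the split verdict above.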
For an AMD Ryzen 7 5800X build aimed at a personal AI workstation in 2026, the RTX 3060 12GB remains the value floor. Step up to a 16GB card only when you have specifically hit the 12GB ceiling and know which model needs more.
Spec-delta table: runtime feature matrix
| Runtime | Backend | Batching | Quant support | Ease of setup |
|---|---|---|---|---|
| llama.cpp | C++/CUDA | Limited | GGUF (Q2-Q8, FP16, quantized KV cache) | Compile or grab prebuilt; CLI |
| Ollama | llama.cpp + daemon | Limited | GGUF via registry | One-command install; HTTP API |
| vLLM | Python/CUDA | PagedAttention, true continuous batching | AWQ, GPTQ, FP16, FP8 (limited GGUF) | pip install + Python config |
The take-away: llama.cpp and Ollama share a quant-support footprint that matches the GGUF ecosystem; vLLM speaks AWQ and GPTQ better and only awkwardly handles GGUF. If the model you want is published as GGUF, llama.cpp/Ollama is the smoother path.
Real-world gotchas on a 12GB card
- Idle VRAM matters. A second monitor and a few Chrome tabs can consume 300-600 MB. Close the GPU-accelerated apps you do not need before loading the model, or drop to Q4_K_S instead of Q4_K_M to claw back the difference.
- Flash Attention is non-optional. Build llama.cpp with LLAMA_CUDA_F16=1 LLAMA_CUDA_FA=1, or enable the equivalent flag in your runtime. The throughput bump is meaningful and the VRAM savings on long contexts are larger.
- Watch the power limit. The reference RTX 3060 12GB has a 170W TDP. Pair it with a quality 650W PSU at minimum; cheap PSUs sag under transient spikes and the card under-clocks itself defensively.
- NVMe matters at load time. A 7B Q4 model is ~4.4 GB on disk. On SATA SSD it loads in 8-12 seconds; on a Gen3 NVMe it is closer to 2-3 seconds. That difference disappears after the first load (the kernel caches the file), but it makes iterative testing much less annoying.
- The CPU matters less than people think. Once the model is GPU-resident, the Ryzen 7 5800X is well past the point of diminishing returns. Anything 8-core / 16-thread modern is fine.
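For the idle-VRAM point in the list above, it helps to check what the desktop is already holding before loading a model. The following Python sketch shells out to nvidia-smi's CSV query mode and parses the result. The helper names are mine, and the sample line is invented so the parsing can be shown without a GPU present.

```python
import subprocess


def parse_memory(csv_line):
    """Parse one 'used, total' line (MiB) from nvidia-smi's CSV output."""
    used, total = (int(field.strip()) for field in csv_line.split(","))
    return used, total


def gpu_memory_mib(gpu_index=0):
    """Query used and total VRAM in MiB for one GPU via nvidia-smi."""
    output = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory(output.strip().splitlines()[0])


# Parsing a sample line; call gpu_memory_mib() on a machine with an NVIDIA driver.
used, total = parse_memory("512, 12288")
print(f"{used} MiB in use, {total - used} MiB free before the model loads")
```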
Verdict matrix
Get Ollama if: you want a one-line install, an HTTP API, and a model registry; you are personally the only user; you are OK losing 1-4% throughput for a much smoother setup story.
Get llama.cpp if: you want the absolute most tokens per second from your RTX 3060 12GB; you are comfortable building from source; you want fine-grained control over context length, KV-cache quantization, and flash-attention flags.
Get vLLM if: you are serving multiple concurrent users, even informally; you are using AWQ or GPTQ models from Hugging Face rather than GGUFs; you plan to add a second GPU and want the cleanest scaling story.
Recommended pick
For most readers building a personal AI workstation around an MSI GeForce RTX 3060 Ventus 2X 12G or ZOTAC Gaming GeForce RTX 3060 Twin, install Ollama first. The throughput hit versus raw llama.cpp is small, the daily ergonomics are vastly better, and you can always switch to llama.cpp directly for the last few percent once you know which model and quant you actually want to live with. Pair it with a Ryzen 7 5800X and a WD Blue SN550 1TB NVMe and the rig is set up for years of useful local inference.
Bottom line
On a single-user RTX 3060 12GB in 2026, llama.cpp is fastest, Ollama is the same engine with a friendlier front door, and vLLM is the wrong tool unless you are serving multiple concurrent users. Pick Ollama for daily driving, learn llama.cpp's flags for when you want to wring out the last few tok/s, and remember vLLM exists for the day you outgrow single-user inference.
