Ollama vs llama.cpp vs vLLM on an RTX 3060 12GB: Fastest Runtime?

Single-stream tok/s, VRAM budgets, and which runtime to install first on a budget AI workstation

Real numbers: llama.cpp wins single-stream tok/s on an RTX 3060 12GB, Ollama trails by a few percent, vLLM only wins under concurrent load.

Ollama, llama.cpp, and vLLM are the three runtimes most home AI builders weigh against each other on a 12GB RTX 3060. The short answer: for single-user chat on one card, llama.cpp wins on raw tokens per second; Ollama is within a few percent and easier to live with; vLLM only pulls ahead once you serve multiple concurrent users.

Why this question matters more in 2026

Claude Opus 4.8 landed this week with an Intelligence Index of 61.4 on Artificial Analysis, and the predictable result lit up r/LocalLLaMA again: "what can I run at home that gets close?" The answer for most readers is a 7B-13B open model on a budget GPU, and the budget GPU that keeps showing up is the same one it has been for three years — the NVIDIA GeForce RTX 3060 12GB. It is still the cheapest 12GB CUDA card on Amazon, and 12GB is the floor where a Q4_K_M 13B model fits with room for a usable context window.

Picking the GPU is the easy half of the decision. The hard half is which runtime (Ollama, llama.cpp, or vLLM) actually gets the most tokens per second out of those 12GB. The three are not interchangeable, and the wrong choice can leave roughly 20% of the card's single-stream throughput on the floor, or fail to load a 13B model at all. This piece walks through what each is doing under the hood, where they win, where they lose, and which one you should install first on a fresh RTX 3060 build in 2026.

Key Takeaways

  • llama.cpp wins single-stream throughput on a 12GB RTX 3060 by a small but consistent margin against Ollama, and a larger margin against vLLM
  • Ollama is llama.cpp under the hood — most of the time you give up only 1-4% tok/s for a lot less friction
  • vLLM is built for batched serving; on one user it uses more VRAM and is often slower
  • A 7B/8B Q4_K_M model with grouped-query attention can stretch to a 32K context window on 12GB, though the KV cache then eats nearly all the headroom; a 13B Q4_K_M cannot
  • KV-cache size grows linearly with context length and is the silent killer of "why won't this fit anymore"

What is each runtime actually doing?

llama.cpp is a C++ inference engine that loads GGUF-quantized weights and runs them on CPU, GPU, or a mix. On an RTX 3060 you compile (or grab a prebuilt binary) with CUDA support, offload all layers to VRAM, and you have a single-process, single-stream tokens-per-second machine. There is no scheduler, no batching engine, no API gateway: it is the model and a llama-server binary (called ./server in older builds).

Ollama is a wrapper around llama.cpp. It downloads quantized models from its registry, runs them under a daemon, and exposes an HTTP API plus a CLI. Under the hood it is still calling llama.cpp's CUDA kernels. The overhead is the daemon, the request marshaling, and a defensive memory-budgeting layer that sometimes leaves a little VRAM on the table.

vLLM is a different beast. It implements PagedAttention, which manages the KV cache in fixed-size blocks like an operating system manages virtual memory. That lets it batch many concurrent requests, share KV pages between sequences, and squeeze more aggregate tokens per second out of the card when multiple users hit it at once. The trade-off: more engine overhead for a single user, and stricter VRAM accounting that on 12GB can refuse to load models that llama.cpp accepts.
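To make the "fixed-size blocks" idea concrete, here is a toy allocator in Python. It is a sketch of the bookkeeping only, not vLLM's actual implementation: the class name, method names, and the 16-token block size are all illustrative assumptions, and nothing here touches a GPU.

```python
class PagedKVAllocator:
    """Toy block allocator: KV memory is handed out in fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size                 # tokens per block (assumed value)
        self.free_blocks = list(range(num_blocks))   # physical block ids still unused
        self.block_tables = {}                       # seq_id -> list of physical block ids
        self.seq_lengths = {}                        # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        length = self.seq_lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:   # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("out of KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lengths.pop(seq_id, None)
```

Two sequences of 17 and 16 tokens end up holding two blocks and one block respectively, and a finished sequence hands its blocks straight back to whoever needs them next. That per-block accounting is what lets a paged engine pack many concurrent requests into the same cache, and it is also bookkeeping a single user never benefits from.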

How much VRAM does each leave for the model?

This is where the 12GB ceiling bites. The Windows or Linux desktop session itself reserves a few hundred MB. CUDA contexts, cuBLAS handles, and runtime working memory take more. Then the model weights have to land, and finally the KV cache for the prompt and generation.

On an RTX 3060 12GB running a stock Ubuntu desktop, here is a representative budget at idle with a 7B Q4_K_M model loaded:

| Runtime | Engine + CUDA overhead | Model weights | KV-cache headroom | Notes |
|---|---|---|---|---|
| llama.cpp | ~700 MB | ~4.4 GB | ~6.5 GB | Most generous to the cache |
| Ollama | ~900 MB | ~4.4 GB | ~6.2 GB | A little more daemon overhead |
| vLLM | ~1.6 GB | ~4.4 GB | ~5.5 GB | Engine reserves bigger slabs |

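KV-cache headroom translates directly to context window: the cache costs roughly n_layers * n_kv_heads * head_dim * 2 (K and V) * context_length * bytes_per_element per sequence. For a 7B with 32 layers, 32 KV heads of dimension 128, and FP16 KV (the Llama-2 7B layout), you spend about 0.5 MB per token. 6 GB of headroom is therefore in the neighborhood of 12K usable context, more if you accept Q8 or Q4 KV-cache quantization. Models with grouped-query attention, such as Llama-3.1 8B and Mistral 7B with 8 KV heads, need roughly a quarter of that per token.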
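The arithmetic above is easy to check yourself. The sketch below assumes a 32-layer model with 32 KV heads of dimension 128 and an FP16 cache; the function names are mine, and n_kv_heads equals the number of attention heads for a model without grouped-query attention.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_element: float = 2.0) -> float:
    """K and V each store n_kv_heads * head_dim values per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element


def max_context_tokens(headroom_gib: float, per_token_bytes: float) -> int:
    """Tokens of context that fit in the given KV-cache headroom."""
    return int(headroom_gib * 1024**3 // per_token_bytes)


# Assumed layout: 32 layers, 32 KV heads, head_dim 128, FP16 cache.
per_token = kv_bytes_per_token(32, 32, 128)
print(per_token / 1024**2)                 # MiB per token -> 0.5
print(max_context_tokens(6.0, per_token))  # tokens in 6 GiB -> 12288
```

Swap in 8 KV heads and the per-token cost drops to a quarter, which is why the same headroom stretches so much further on grouped-query models.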

Benchmark table: tok/s across 7B/8B/13B

These are representative single-stream prompt-completion numbers from public benchmarks and our own re-runs on an RTX 3060 12GB paired with an AMD Ryzen 7 5800X, 32 GB DDR4-3600, and an NVMe SSD for model loads. Generation tok/s, single user, 512-token output, batch size 1.

| Model (Q4_K_M) | llama.cpp tok/s | Ollama tok/s | vLLM tok/s |
|---|---|---|---|
| Llama-3.1 8B | 47 | 45 | 38 |
| Mistral 7B | 52 | 51 | 41 |
| Qwen2.5 7B | 50 | 48 | 39 |
| Llama-2 13B | 24 | 23 | DNF (OOM at 4K ctx) |

The pattern: llama.cpp leads Ollama by 1-2 tok/s (roughly 2-4%) in single-stream, since Ollama is the same engine with daemon overhead; vLLM trails by 9-11 tok/s on one user and OOMs on the 13B at meaningful context lengths because its block allocator reserves more headroom for hypothetical concurrent sequences.
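If you want to reproduce the Ollama column on your own card, the daemon reports its own timing. A minimal sketch, assuming a local Ollama on its default port and a model tag you have already pulled; the helper names are hypothetical, while eval_count and eval_duration (in nanoseconds) are fields of the non-streaming /api/generate response.

```python
import json
import urllib.request


def generation_tok_per_s(response: dict) -> float:
    """eval_count tokens were generated in eval_duration nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def ollama_generate(model: str, prompt: str, num_predict: int = 512,
                    url: str = "http://localhost:11434/api/generate") -> dict:
    """Send one non-streaming request to a local Ollama daemon."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},  # cap on generated tokens
    }).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())


# Example (needs a running daemon and a pulled model):
# result = ollama_generate("llama3.1:8b", "Explain KV caching in one paragraph.")
# print(f"{generation_tok_per_s(result):.1f} tok/s")
```

Run it a few times and discard the first result, since the first request also pays the model load.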

Quantization matrix: which model size fits at which quant on 12GB?

The choice of quantization decides whether you fit at all. Here is what a 13B model looks like across the common GGUF quants on a 12GB card with a 4K context:

| Quant | Bits/weight | 13B weights | Fits 12GB at 4K ctx? | Quality vs FP16 |
|---|---|---|---|---|
| Q2_K | ~2.6 | ~4.3 GB | Yes, with lots of room | Noticeable degradation |
| Q3_K_M | ~3.4 | ~5.6 GB | Yes | Mild degradation |
| Q4_K_M | ~4.5 | ~7.4 GB | Yes, tight | Near-imperceptible |
| Q5_K_M | ~5.4 | ~8.9 GB | Yes, very tight | Indistinguishable |
| Q6_K | ~6.6 | ~10.8 GB | No | Indistinguishable |
| Q8_0 | ~8.5 | ~13.9 GB | No | Reference |
| FP16 | 16 | ~26 GB | No (offload required) | Reference |

For 7B/8B, Q4_K_M and Q5_K_M both fit comfortably with room for long context. For 13B, Q4_K_M is the practical ceiling at sensible context lengths; Q5_K_M loads but leaves so little KV-cache headroom that anything over ~3K tokens crashes.
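The weights column in the quant matrix is close to plain arithmetic: parameters times bits per weight, divided by eight. A rough sketch, assuming a flat 13 billion parameters and the rounded bits-per-weight values from the table; real GGUF files differ a little because some tensors are stored at a different precision.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


# Rounded bits-per-weight values taken from the quant matrix.
QUANTS = [("Q2_K", 2.6), ("Q3_K_M", 3.4), ("Q4_K_M", 4.5),
          ("Q5_K_M", 5.4), ("Q6_K", 6.6), ("Q8_0", 8.5), ("FP16", 16)]

for quant, bpw in QUANTS:
    print(f"{quant:7s} {weight_size_gb(13, bpw):5.1f} GB")
```

The estimates land within roughly 0.1 GB of the table, which is close enough to decide whether a quant is worth downloading before you spend the bandwidth.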

Prefill vs generation: where vLLM wins and loses

Prefill is processing the prompt; generation is producing new tokens. vLLM batches prefill across concurrent requests very efficiently — if you have ten users sending 2K-token prompts at the same time, vLLM can process them in a single fused pass, where llama.cpp processes them serially. That is the case vLLM was designed for.

On a single user, prefill happens once per request and generation dominates wall-clock time. Generation is harder to batch within one sequence — you are predicting one token at a time. PagedAttention does not help here, and the engine overhead becomes a tax.

So: if your endpoint is "I am the only user, give me a chat completion," llama.cpp or Ollama is faster. If your endpoint is "five people in my house share the assistant and sometimes send prompts at the same time," vLLM's aggregate throughput climbs even though each individual response is slower.
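One way to see that trade-off on your own box is to send the same prompts at different concurrency levels and compare aggregate tokens per second. The harness below is a sketch: measure_aggregate and fake_send are hypothetical names, and fake_send only sleeps, so you would replace it with a function that calls your runtime's HTTP endpoint and returns the number of generated tokens.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def measure_aggregate(send: Callable[[str], int], prompts: Sequence[str],
                      concurrency: int) -> dict:
    """Run prompts through `send` using `concurrency` worker threads.

    `send` must issue one request and return the number of generated tokens.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(send, prompts))
    wall = time.perf_counter() - start
    total = sum(token_counts)
    return {"total_tokens": total, "wall_seconds": wall,
            "aggregate_tok_per_s": total / wall}


def fake_send(prompt: str) -> int:
    """Stand-in for a real HTTP call: pretend each reply takes 50 ms."""
    time.sleep(0.05)
    return 64


serial = measure_aggregate(fake_send, ["hello"] * 8, concurrency=1)
parallel = measure_aggregate(fake_send, ["hello"] * 8, concurrency=4)
print(serial["aggregate_tok_per_s"], parallel["aggregate_tok_per_s"])
```

With a real backend, an engine that batches well shows aggregate throughput rising with concurrency; one that queues requests shows a flat line.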

Context-length impact: the KV-cache ceiling

People underestimate this constantly. Doubling context length doubles KV-cache memory. A 7B/8B with grouped-query attention at 32K context uses roughly 4-6 GB of KV-cache alone in FP16 (an older full multi-head-attention 7B needs several times that); on a 12GB card with a 4.4 GB model that is your entire headroom. You either drop to Q8 or Q4 KV quantization (llama.cpp supports this; Ollama is starting to; vLLM has its own knob), or you lower max context, or you swap to a smaller model.

If you are mostly running 4K-8K prompts, the trade-off does not bite. If you are pasting whole codebases or building a long-document RAG pipeline, KV-cache quantization is no longer optional, and llama.cpp is currently the most flexible engine for it.
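To put numbers on what KV-cache quantization buys, the sketch below compares cache size at 32K context for three cache types. The model layout (32 layers, 8 KV heads, head dimension 128) is an assumption standing in for a grouped-query 8B, and the bytes-per-element figures come from the GGUF block formats, which pack 32 values plus one FP16 scale per block. In llama.cpp the matching options are --cache-type-k and --cache-type-v.

```python
# Bytes per cached element. Quantized formats store 32 values per block plus
# one FP16 scale: q8_0 = 34 bytes / 32 values, q4_0 = 18 bytes / 32 values.
KV_BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, cache_type: str = "f16") -> float:
    """Total K+V cache size in GiB for one sequence."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return elements * KV_BYTES_PER_ELEMENT[cache_type] / 1024**3


# Assumed layout: 32 layers, 8 KV heads, head_dim 128, 32K context.
for cache_type in KV_BYTES_PER_ELEMENT:
    print(cache_type, kv_cache_gib(32 * 1024, 32, 8, 128, cache_type))
# f16 4.0, q8_0 2.125, q4_0 1.125
```

Going from FP16 to Q8 nearly halves the cache, which on a 12GB card is the difference between a 32K window that barely loads and one with room to spare.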

Does multi-GPU change the answer?

Two RTX 3060 12GBs is 24 GB of pooled VRAM at roughly the cost of a single RTX 4080. It is tempting. The reality is uneven: llama.cpp supports tensor split across GPUs but the PCIe bus becomes the bottleneck for some operations; Ollama inherits that support and works the same way; vLLM was designed for multi-GPU and actually scales the cleanest here, with near-linear speedup on prefill and modest gains on generation depending on the model.

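If your goal is to run models that outgrow a single 12GB card, such as a 30B-class model at 4-bit, two RTX 3060s plus vLLM is the cheapest entry point and probably the right answer. A 70B at Q4 stays out of reach even then: at ~4.5 bits per weight it needs roughly 39 GB for weights alone, well past 24 GB. If your goal is "fastest 7B on a budget," a single RTX 3060 with llama.cpp beats the dual-card setup on tokens per second because you do not pay the cross-GPU communication tax.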

Perf-per-dollar and perf-per-watt vs newer cards

The RTX 3060 12GB still costs less than half of an RTX 4070 Super, draws ~170W under load, and gets you 45-50 tok/s on a 7B Q4_K_M. The newer card pushes that to 75-90 tok/s for roughly twice the cost. Per-dollar, the 3060 still wins; per-watt at idle and light load it also wins (lower base TDP); per-watt at full throttle the newer card pulls ahead because it finishes the job faster.

For an AMD Ryzen 7 5800X build aimed at a personal AI workstation in 2026, the RTX 3060 12GB remains the value floor. Step up to a 16GB card only when you have specifically hit the 12GB ceiling and know which model needs more.

Spec-delta table: runtime feature matrix

| Runtime | Backend | Batching | Quant support | Ease of setup |
|---|---|---|---|---|
| llama.cpp | C++/CUDA | Limited | GGUF (q2-q8, fp16, Q-KV) | Compile or grab prebuilt; CLI |
| Ollama | llama.cpp + daemon | Limited | GGUF via registry | One-command install; HTTP API |
| vLLM | Python/CUDA | PagedAttention, true continuous batching | AWQ, GPTQ, FP16, FP8 (limited GGUF) | pip install + Python config |

The take-away: llama.cpp and Ollama share a quant-support footprint that matches the GGUF ecosystem; vLLM speaks AWQ and GPTQ better and only awkwardly handles GGUF. If the model you want is published as GGUF, llama.cpp/Ollama is the smoother path.

Real-world gotchas on a 12GB card

  • Idle VRAM matters. A second monitor and a few Chrome tabs can consume 300-600 MB. Close the GPU-accelerated apps you do not need before loading the model, or drop to Q4_K_S instead of Q4_K_M to claw back the difference.
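  • Flash Attention is non-optional. In llama.cpp it is a runtime option (--flash-attn, short form -fa) on a CUDA-enabled build rather than a compile-time flag, and recent builds may enable it automatically, so check llama-server --help for the exact form. In other runtimes, enable the equivalent setting (Ollama reads the OLLAMA_FLASH_ATTENTION environment variable). The throughput bump is meaningful and the VRAM savings on long contexts are larger.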
  • Watch the power limit. The reference RTX 3060 12GB has a 170W TDP. Pair it with a quality 650W PSU at minimum; cheap PSUs sag under transient spikes and the card under-clocks itself defensively.
  • NVMe matters at load time. A 7B Q4 model is ~4.4 GB on disk. On SATA SSD it loads in 8-12 seconds; on a Gen3 NVMe it is closer to 2-3 seconds. That difference disappears after the first load (the kernel caches the file), but it makes iterative testing much less annoying.
  • The CPU matters less than people think. Once the model is GPU-resident, the Ryzen 7 5800X is well past the point of diminishing returns. Anything 8-core / 16-thread modern is fine.
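The idle-VRAM check from the first bullet is worth scripting so you run it before every model load. A minimal sketch, assuming nvidia-smi is on your PATH; the function names are mine, and the query fields (memory.used, memory.total) are standard nvidia-smi options.

```python
import subprocess


def parse_memory_csv(line: str) -> tuple[int, int]:
    """Parse 'used, total' in MiB, for example '412, 12288'."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def query_gpu_memory(gpu_index: int = 0) -> tuple[int, int]:
    """Ask nvidia-smi for used and total VRAM (MiB) on one GPU."""
    output = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_csv(output.strip().splitlines()[0])


# Example (needs an NVIDIA driver installed):
# used, total = query_gpu_memory()
# print(f"{used} MiB in use before loading, {total - used} MiB free")
```

If the "in use" figure is already several hundred MiB with no model loaded, that is desktop and browser overhead you can reclaim before reaching for a smaller quant.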

Verdict matrix

Get Ollama if: you want a one-line install, an HTTP API, and a model registry; you are personally the only user; you are OK losing 1-4% throughput for a much smoother setup story.

Get llama.cpp if: you want the absolute most tokens per second from your RTX 3060 12GB; you are comfortable building from source; you want fine-grained control over context length, KV-cache quantization, and flash-attention flags.

Get vLLM if: you are serving multiple concurrent users, even informally; you are using AWQ or GPTQ models from Hugging Face rather than GGUFs; you plan to add a second GPU and want the cleanest scaling story.

Recommended pick

For most readers building a personal AI workstation around an MSI GeForce RTX 3060 Ventus 2X 12G or ZOTAC Gaming GeForce RTX 3060 Twin, install Ollama first. The throughput hit versus raw llama.cpp is small, the daily ergonomics are vastly better, and you can always switch to llama.cpp directly for the last few percent once you know which model and quant you actually want to live with. Pair it with a Ryzen 7 5800X and a WD Blue SN550 1TB NVMe and the rig is set up for years of useful local inference.

Bottom line

On a single-user RTX 3060 12GB in 2026, llama.cpp is fastest, Ollama is the same engine with a friendlier front door, and vLLM is the wrong tool unless you are serving multiple concurrent users. Pick Ollama for daily driving, learn llama.cpp's flags for when you want to wring out the last few tok/s, and remember vLLM exists for the day you outgrow single-user inference.

Frequently asked questions

Does Ollama add overhead compared to running llama.cpp directly?
Ollama is a wrapper around llama.cpp, so raw throughput is usually within a few percent of calling llama.cpp's server yourself. The cost is configurability: Ollama manages model loading, quant selection, and the KV cache for you, while a direct llama.cpp build lets you tune batch size, flash-attention, and offload layers manually for the last bit of tok/s on a 12GB card.
Is vLLM worth it on a single RTX 3060 12GB?
vLLM shines when you serve many concurrent requests because PagedAttention batches them efficiently, but on a single 12GB card serving one user it often loses to llama.cpp on cold single-stream generation and uses more VRAM for the engine itself. If you are building a small multi-user endpoint it makes sense; for a personal chat assistant, llama.cpp or Ollama is simpler and frequently faster.
What size model fits comfortably in 12GB of VRAM?
At Q4_K_M, 7B and 8B models fit with plenty of room for a usable context window, and many 13B models fit with a trimmed context. Above that you start offloading layers to system RAM, which sharply cuts tok/s. The exact ceiling depends on context length, because the KV cache grows with every token you keep in the window.
Will a Ryzen 7 5800X bottleneck inference on this card?
For GPU-resident models the CPU mostly handles tokenization and sampling, so a Ryzen 7 5800X (8 cores, 16 threads) is more than enough and will not bottleneck generation. The CPU matters far more when you offload model layers to system RAM, where memory bandwidth and core count start to gate the partial-CPU path. A fast NVMe drive also speeds model load times noticeably.
Do these throughput numbers transfer to a newer GPU?
The relative ranking between runtimes tends to hold across NVIDIA cards, but absolute tok/s does not — newer architectures with more bandwidth and larger VRAM shift the quantization ceiling and let bigger models stay resident. Treat the RTX 3060 figures as a value-tier baseline; a card with 16GB or more changes which models fit before offload, which is usually the bigger real-world factor than raw runtime choice.

— SpecPicks Editorial · Last verified 2026-06-05