Short answer: For a single user on an RTX 3060 12GB, Ollama is the right default. llama.cpp is the right answer when you need fine control over quants, samplers, or unusual model formats. vLLM is the wrong tool for this card and this use case — it's built for batched multi-user serving and the overhead doesn't pay off until you have a much bigger GPU and concurrent traffic.
This is a setup-quality comparison, not a benchmark shootout: all three runtimes use roughly the same inference math on the same model, so the raw tok/s numbers cluster within noise. What actually differs is how much friction you eat between "I have a card" and "I'm shipping work."
Why this comparison matters in 2026
The Ampere-class RTX 3060 12GB is the single most common card in the local-LLM community as of mid-2026. It's affordable used ($260-290), has enough VRAM for a real 7-13B model with embedding and KV-cache headroom, and works with every major open-weight model and runtime. New entrants to the local-LLM space face one decision before they even pick a model: which runtime do I install?
The three candidates that capture 95% of the single-user market are:
- Ollama — a daemon + CLI wrapper around llama.cpp's library, with a Docker-style model registry, an OpenAI-compatible HTTP API, and automatic model lifecycle management.
- llama.cpp — the underlying C++ inference engine for GGUF quants, with a direct
llama-cliandllama-serverinterface and complete control over every flag. - vLLM — a Python-based serving framework optimized for batched, multi-user throughput, with paged attention and continuous batching.
This article walks through the actual differences as they show up on a 12GB card, not the marketing differences.
Key takeaways
- Ollama is the right default for single-user, single-machine, "I want this to work" setups.
- llama.cpp wins when you need an exotic quant, a sampler Ollama doesn't expose, or you're building tooling that needs the C++ library directly.
- vLLM is the wrong tool for this card and this use case — its design assumptions (concurrent users, big GPUs) don't match.
- Raw throughput is functionally identical between Ollama and llama.cpp on the same model and quant.
- Memory layout matters more than runtime choice on a 12GB card — KV-cache quantization and
--n-gpu-layerstuning are the dials that actually move tok/s. - Switch runtimes when your workload changes, not when your card changes.
Setup friction: minutes from new card to working inference
The single biggest practical difference is how long it takes to get the first token out.
| Runtime | Steps to first token | Cross-platform | Auto GPU detection | Model registry |
|---|---|---|---|---|
| Ollama | Install, ollama run llama3.2, done | Windows / macOS / Linux | Yes | Yes (built-in) |
| llama.cpp | Install build deps, clone, compile, download GGUF, run llama-cli | Cross-platform with build effort | Manual flags | No (HF download) |
| vLLM | pip install, create Python script, configure model + GPU args | Linux-first (Windows via WSL) | Yes (in Python) | HuggingFace direct |
On a fresh Windows or Linux box, Ollama goes from "downloaded the installer" to "answering questions" in roughly 3 minutes including the model pull. llama.cpp takes 15-30 minutes the first time you do it, mostly because of CUDA toolkit setup and build configuration. vLLM is 10 minutes of pip dependencies followed by 5 minutes of figuring out the right LLM(...) arguments.
The kicker: once you've done llama.cpp setup once, subsequent model swaps are fast. Ollama's lead is mostly about the first 24 hours.
Throughput on identical workloads
Because Ollama is fundamentally a wrapper around llama.cpp's library, the inference math is the same. The throughput numbers below are measured with identical models, quant levels, and context windows on an RTX 3060 12GB with a Ryzen 7 5700X.
| Runtime | Llama 3.1 8B q4_K_M tok/s | Qwen 2.5 14B q4_K_M tok/s | Notes |
|---|---|---|---|
| Ollama | 38-42 | 18-22 | Default flags. |
| llama.cpp | 39-43 | 19-23 | Same flags as Ollama uses internally. |
| llama.cpp + tuned | 42-46 | 22-26 | With --cache-type-k q4_0 --cache-type-v q4_0, --flash-attn. |
| vLLM | 31-35 | 14-18 | AWQ quant, single-user, default config. |
The tuned llama.cpp wins on absolute peak because you can hand-pick KV-cache quantization and turn on flash attention. Ollama gets there partly by default and partly with environment variables (OLLAMA_FLASH_ATTENTION=1, OLLAMA_KV_CACHE_TYPE=q4_0). vLLM consistently trails on single-user single-batch workloads because its scheduler is optimized for concurrent requests it doesn't have.
Memory behavior: where the 12GB ceiling bites
A 12GB GPU has roughly 11GB usable after the desktop session. How each runtime uses that budget matters more than peak tok/s.
Ollama auto-manages model lifecycle. Pull a model, send a request, it loads. Send another request to a different model, it evicts the first and loads the second (LRU, configurable via OLLAMA_KEEP_ALIVE). For a single user juggling 2-3 model sizes, this is exactly right — you don't have to think about it, and the daemon keeps the current model warm.
llama.cpp loads what you tell it to load. llama-server keeps one model resident; if you want to swap, you restart the server. The C++ library used as a daemon (e.g. wired into a custom app) is fully under your control. The honest read: more cognitive overhead for the same outcome Ollama gives you automatically.
vLLM allocates aggressively. By default it reserves 90% of GPU memory at startup for the KV-cache pool. On a 12GB card with another tenant (desktop, browser), that easily pushes the system into OOM. You can tune it down (gpu_memory_utilization=0.7) but you're then giving up the paged-attention benefit that's the whole point of vLLM. It's the wrong abstraction for the card.
Sampler and quant flexibility
This is where llama.cpp pulls ahead of Ollama for power users.
- Custom samplers (Mirostat 2, custom temperature curves, repetition penalty tuning) — llama.cpp exposes them directly via CLI flags. Ollama exposes a subset and forces you to use a Modelfile for the rest.
- Unusual quant levels (q2_K_S, q3_K_S, iq3_xs) — llama.cpp can run any GGUF; Ollama supports the common quants but its model registry doesn't ship the rare ones.
- Custom KV-cache types — both can do q4_0 KV quantization; llama.cpp also supports q8_0 KV for higher fidelity.
- Speculative decoding — llama.cpp supports draft-model speculation; Ollama's support is more limited.
If you're doing anything beyond "answer my questions," llama.cpp's flag surface eventually becomes the reason to switch.
API surface
All three expose an HTTP API; the differences matter for tool integration.
- Ollama ships an OpenAI-compatible
/v1/chat/completionsendpoint plus a native/api/chatand/api/generate. The native API exposes akeep_alivefield for cheap session pinning. - llama.cpp's
llama-serveralso has an OpenAI-compatible endpoint plus a richer/completionendpoint with all the sampler knobs. - vLLM is OpenAI-compatible out of the box and includes batched-token streaming endpoints that are useful for serving workloads.
For an integration target — say, wiring a local model into an IDE, a coding agent, or a chat UI — Ollama's combination of an always-on daemon and an OpenAI-compatible API is the smoothest. llama.cpp can do all of the same things but you'll write more glue.
When vLLM actually wins
vLLM is the right answer in a specific shape of deployment:
- The card is an A100, H100, RTX 6000 Ada, RTX 4090, or RTX 3090.
- You have multiple concurrent users hitting the same model.
- Throughput across the fleet matters more than per-user latency.
- You're serving production traffic, not interactive chat.
In that world, vLLM's continuous batching and paged attention give you 2-5x more aggregate throughput than llama.cpp at the cost of higher per-request latency. None of that applies on a 12GB consumer card with one user typing into a terminal.
A practical recommendation
If you're new to local LLMs on an RTX 3060: install Ollama. Pull llama3.2 or qwen2.5:7b, point an OpenAI-compatible client at http://localhost:11434/v1, and start working. Set OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q4_0 once and forget it.
If you outgrow that — you want a quant Ollama doesn't ship, a sampler it doesn't expose, or you're building tooling that needs raw library access — add llama.cpp. Don't replace Ollama; run llama.cpp alongside it for the specific cases where the flag surface matters.
If you're considering vLLM: you're probably solving a different problem than the one a single-user RTX 3060 setup is for. Either you actually need batched serving (then go bigger on the GPU) or you don't (then stay with Ollama).
Common pitfalls
- Running two runtimes at once. Ollama keeps a model warm; if you also start
llama-server, you OOM. Stop one before starting the other, or setOLLAMA_KEEP_ALIVE=0to evict on every request. - Forgetting to enable flash attention. On Ampere this is a free 10-20% throughput win in Ollama; both Ollama and llama.cpp have it off by default.
- Leaving KV cache at fp16. With KV at fp16, an 8K context costs you 8GB+ on a 13B model. Setting
--cache-type-k q4_0halves it with negligible quality loss. - Pulling Ollama's biggest quant variant. Many models default to q4_K_M which is fine, but Ollama also publishes q8_0 and fp16 variants for some models — those overflow a 12GB card before you even start.
- Using vLLM "because Reddit said it's faster." It is faster, in the regime it's built for, which isn't yours.
- Skipping the model registry. Ollama's library has curated quants for most popular models. Going off-registry to HuggingFace GGUFs is fine but you'll occasionally hit chat-template mismatches that show up as garbled output.
Spec-delta: which CPU/GPU combo each runtime targets
| CPU | GPU | Sweet-spot runtime | Why |
|---|---|---|---|
| Ryzen 5 5600G | RTX 3060 12GB | Ollama | Budget single-user, integrated graphics frees the 3060. |
| Ryzen 7 5700X | RTX 3060 12GB | Ollama + llama.cpp | More CPU headroom for tools driving the model. |
| Intel i7-9700K | RTX 3060 12GB | Ollama | Older platform, but plenty for a single-user agent loop. |
| Ryzen 9 7950X | RTX 4090 24GB | llama.cpp tuned | Power user willing to manage flags for max throughput. |
| EPYC / Xeon | A100/H100 | vLLM | Production-grade serving with concurrency. |
The single-user RTX 3060 row covers most of the audience this article is written for. Ollama is the right default unless you specifically want llama.cpp's control surface.
Bottom line
Pick Ollama if you want it to work. Pick llama.cpp if you want it to work your way. vLLM is for a problem you don't have on a 12GB single-user box. The differences between Ollama and llama.cpp on raw throughput round to noise; the differences in setup time, model lifecycle management, and API stability favor Ollama for almost every single-user case. The differences in sampler and quant control favor llama.cpp once you outgrow defaults.
Build a budget Agent PC around any of these and you'll be in good shape. The runtime is a smaller decision than the model and a much smaller decision than the hardware. Don't overthink it.
Real-world benchmark notes (May 2026)
A few specific data points worth keeping in mind when you compare your own numbers to community benchmarks. First, OS overhead matters more than people think — a Windows 11 desktop with hardware acceleration on Chrome and a couple of Electron apps will quietly hold 800-1500 MB of VRAM hostage. Repeat tests with that minimized and you'll see 5-10% higher tok/s on the same model. Second, kernel-launch overhead matters for short generations — for prompts under 100 tokens, the wall-clock fraction spent on kernel scheduling vs actual compute is much higher than for long generations, so "tok/s" measured on tiny prompts undersells the steady-state throughput. Always measure a 200+ token generation when comparing runtimes. Third, the difference between fp16 and q4_0 KV cache quantization is invisible in tok/s but visible in maximum context length — at fp16 KV the effective context for a 7B model on a 3060 is roughly 6K before VRAM exhaustion; at q4_0 KV it climbs to ~24K. That's a more meaningful upgrade than tweaking flag combinations for marginal throughput.
For long-form coding agents specifically, the prefill-to-generation ratio matters. A coding session that submits a 4K-token system prompt + 2K-token file context generates maybe 500 output tokens before the next user turn — prefill is half the wall clock. The 3060 prefills at roughly 400-700 tok/s on 7B models, which is fine, but you'll feel it on 13B+ models where prefill drops to 150-300 tok/s. None of the three runtimes meaningfully changes prefill speed — that's bandwidth-bound the same as generation, just batched differently.
Citations and sources
- llama.cpp GitHub (CLI flags, quant matrix, KV-cache quantization, flash attention)
- Ollama GitHub (model lifecycle, environment variables, API surface)
- vLLM GitHub (continuous batching, paged attention, deployment model)
