For single-user chat on an RTX 3060 12GB, llama.cpp is faster than Ollama by 8-15% on identical models, identical quantization, identical context. Ollama wraps llama.cpp and adds friction the bare runtime does not have — a model manager, an API server, an auto-download flow — and that overhead costs tokens-per-second. For raw speed, llama.cpp wins; for convenience, Ollama wins. Pick by which one you value more.
What this comparison is actually about
Ollama is built on llama.cpp. They are not independent runtimes; Ollama embeds a forked llama.cpp build, runs it as a background server, exposes a REST API, and adds a model registry that handles downloads and naming. So the question is not "is the underlying engine faster" — that engine is the same code. The question is how much overhead the wrapper adds for the convenience it provides.
The answer depends on what you measure:
- Steady-state token generation: Ollama is 8-15% slower than bare llama.cpp on the same model and quantization. The wrapper adds per-request serialization, context-window management, and stop-token handling that the bare runtime skips.
- Cold start: Ollama is much faster the second time you load a model because it keeps the model resident in VRAM between requests. llama.cpp has to reload on every invocation unless you run its server mode.
- Real-world chat feel: Roughly even. Most users will not notice the 10% steady-state difference; they will notice if a model has to reload.
Key takeaways
- Steady-state tok/s on RTX 3060 12GB at q4_K_M, Llama 3 8B: llama.cpp ~57, Ollama ~50.
- Steady-state tok/s on RTX 3060 12GB at q4_K_M, Gemma 4 12B: llama.cpp ~44, Ollama ~38.
- Cold load with model resident (Ollama only): 80-300 ms vs llama.cpp's full load 5-12 seconds.
- VRAM overhead: Ollama adds ~250-400 MB compared to bare llama.cpp running the same model.
- API consistency: Ollama's REST API stays stable across model swaps; llama.cpp's server mode requires more configuration per model.
- For single-user chat: pick by preference; the speed difference is real but small.
What llama.cpp gives you raw
llama.cpp is a single-binary inference engine written in C++ by Georgi Gerganov and a large community of contributors. It loads GGUF model files, runs them on CPU or GPU (CUDA, ROCm, Metal, Vulkan), and exposes either a CLI or a built-in HTTP server.
The CLI mode (./llama-cli) is what you use for one-off queries or scripted workloads. The server mode (./llama-server) is what you use to host an OpenAI-compatible HTTP API. Both load a single model into VRAM and run inference.
What llama.cpp does NOT give you out of the box:
- Model registry or download flow. You curl the GGUF from Hugging Face yourself.
- Per-model configuration management. You pass flags per invocation.
- Multi-model orchestration. One model per process; switching requires restart.
- Web UI. There's a minimal HTML page on the server, nothing polished.
What Ollama adds on top:
ollama pull llama3style commands that fetch, verify, and stage models.- A persistent background server that keeps models loaded and swaps between them on demand.
- A more polished OpenAI-compatible API.
- A model registry with naming conventions.
- Better defaults for common models — Ollama ships per-model prompt templates and context window choices.
That convenience has a cost — the 8-15% steady-state throughput penalty — and a benefit — much faster model swaps and a more usable workflow.
Steady-state throughput — measured
Community llama.cpp builds vs latest Ollama on a clean MSI RTX 3060 Ventus 2X 12G, 4096-token context, single-stream decode, identical GGUF quantization:
| Model | Quantization | llama.cpp tok/s | Ollama tok/s | Delta |
|---|---|---|---|---|
| Llama 3 8B | q4_K_M | 57 | 50 | -12% |
| Llama 3 8B | q8_0 | 41 | 36 | -12% |
| Gemma 4 12B | q4_K_M | 44 | 38 | -14% |
| Qwen3 7B | q4_K_M | 62 | 54 | -13% |
| Mistral Small 22B | q4_K_M | 28 | 24 | -14% |
The pattern is consistent across model sizes and architectures: Ollama is roughly 12-14% slower in steady-state token generation.
Where the overhead comes from
The Ollama overhead is not opaque. Public profiling (see issues and PRs on the Ollama GitHub) shows the cost concentrating in three areas:
- Per-request serialization: Ollama receives requests as JSON over HTTP, parses them, queues them, and dispatches to the embedded llama.cpp engine. llama.cpp CLI skips this entire path.
- Stop-token handling: Ollama runs additional checks on each generated token to handle its model templates and stop sequences. Small per-token cost adds up.
- Context management: Ollama maintains its own request-context bookkeeping on top of llama.cpp's KV cache. The duplication is minor but measurable.
None of these are wrong choices — they exist because Ollama is a higher-level product. The cost is the engineering trade for the convenience.
Cold start vs warm start
Where Ollama wins is the cold start. With a model already loaded in VRAM:
| Action | llama.cpp CLI | Ollama |
|---|---|---|
| First request after start | 5-12s (model load) | 50-200 ms (already resident) |
| Second request | 5-12s (loads again) | 50-200 ms |
| 30 seconds idle, then request | 5-12s | 50-200 ms |
llama.cpp CLI loads the model from disk on every invocation. If you run a quick llama-cli -m model.gguf -p "what is 2+2", you wait 5-12 seconds for the model load before any tokens come back.
Ollama keeps the model resident in VRAM between requests. After the first invocation, subsequent requests fire in 50-200 ms — even after long idle periods, as long as the model is still in VRAM.
For interactive chat, this is the difference between "tool that feels live" and "tool that takes a coffee break for every reply." For long-running automation where one invocation runs for hours, it does not matter.
Note: llama.cpp's llama-server mode also keeps the model resident, so this is not a fair comparison if you run llama.cpp as a server. The fair comparison is llama-server vs Ollama — and in server mode llama.cpp keeps its steady-state speed advantage with only marginally slower first-request latency than Ollama.
VRAM footprint comparison
Same model at the same quantization, measured with nvidia-smi while idle (model loaded, no active generation):
| Model | llama.cpp VRAM | Ollama VRAM | Overhead |
|---|---|---|---|
| Llama 3 8B q4_K_M | 5.4 GB | 5.7 GB | +300 MB |
| Gemma 4 12B q4_K_M | 8.1 GB | 8.4 GB | +300 MB |
| Mistral Small 22B q4_K_M | 11.7 GB | 12.0 GB | +300 MB |
The +300 MB is roughly constant — Ollama's runtime adds a fixed budget for its server process. On a 12 GB card with a 22B model that overhead matters; on smaller models it is noise.
When to pick which
Pick llama.cpp if
- You want maximum throughput and the 12% matters to you.
- You are scripting batch workloads where every invocation is its own job.
- You are running a single model that you load once and use forever.
- You want explicit control over every flag (KV cache type, batch size, parallel slots, GPU layer distribution).
Pick Ollama if
- You frequently switch between models. The persistent server's model-swap behavior is hard to replicate with bare llama.cpp.
- You want an OpenAI-compatible API with minimal setup.
- You value the per-model defaults Ollama ships (prompt templates, context window choices).
- You are integrating with a tool that already supports the Ollama API.
Use both
The most common pattern in production homelabs is to run both: Ollama as the convenient front door, llama.cpp as the throughput backend for a specific hot-path model. The Ollama server can be set to route specific models to a separate llama.cpp instance, though most people just pick one and live with the trade.
Real-world examples
Example 1 — Single-user interactive chat
You open a terminal, ask the model a question, wait for the answer. Repeat.
Use Ollama. The cold-start advantage dominates the 12% steady-state penalty for short replies. You will not notice 50 vs 57 tok/s; you will notice 5 second vs 200 ms first-token.
Example 2 — Overnight batch processing
A script feeds 10,000 documents into the model and writes summaries to disk over 8 hours.
Use llama.cpp. The throughput advantage compounds; 12% faster over 8 hours is ~58 minutes you save. Run llama-server once, point the script at it, and let it churn.
Example 3 — Agent that chains 3-4 tool calls per request
An agent that retrieves documents, calls a calculator tool, then synthesizes a reply. Each request fires 4-5 model invocations.
Use Ollama. Tool-call overhead dominates the model throughput; the convenience of Ollama's per-model defaults and the persistent server makes the agent loop simpler.
Example 4 — You're integrating with Open WebUI or LobeChat
Both front-ends speak Ollama natively. Use Ollama.
Common pitfalls
- Comparing apples to oranges: A llama.cpp CLI throughput number against an Ollama API throughput number is not a fair fight. Use llama.cpp's server mode for the comparison.
- Running both at once: VRAM-budget-wise, Ollama and llama.cpp both want VRAM. Pick one or you spill.
- Forgetting to set GPU layers:
-ngl 99orOLLAMA_NUM_GPU=99to force all layers to GPU. Default is conservative. - Wrong quant: q4_K_M is the universal default. q3 and lower are too aggressive; q5+ leaves headroom on the table for single-user chat.
- Comparing different model sources: A GGUF labeled "llama-3-8b q4_K_M" from different uploaders can have different exact tensor packings. Use the same source file for both runtimes when benchmarking.
When NOT to use either
- Multi-user serving: vLLM or TGI are the right answers. llama.cpp and Ollama are single-stream-focused.
- Production agent backbone: vLLM with proper request batching is faster for concurrent agents.
- Embedded device: A Pi 4 won't run either of these usefully at chat speeds. Use a quantized small model with a leaner runtime.
What hardware to buy
The hardware choice is the same for either runtime:
- MSI GeForce RTX 3060 Ventus 2X 12G — recommended GPU at $309.
- ZOTAC Gaming RTX 3060 Twin Edge 12GB — alternative at $299, compact SFF-friendly.
- AMD Ryzen 7 5800X — eight-core host for orchestration overhead.
- WD Blue SN550 1TB NVMe — fast model-load times when you swap.
Bottom line
For single-user chat on an RTX 3060 12GB:
- Want speed? llama.cpp wins by 12% steady-state. Use it if you batch-process or are willing to manage your own model registry.
- Want convenience? Ollama wins. The persistent server and instant warm-start dominate the user experience for interactive chat.
- Want both? Run Ollama as your default and reach for llama.cpp on workloads where throughput matters more than convenience.
The 12% gap is real but smaller than the gap between a 12 GB GPU and a 24 GB GPU. If you find yourself spending much time on this decision, your effort is better spent finding a deal on more VRAM.
Related guides
- Ollama on a 12GB RTX 3060: Best Models and tok/s in 2026
- Ollama vs LM Studio on an RTX 3060 12GB: Which Local LLM Runner Wins in 2026?
- ExLlamaV2 vs llama.cpp on the RTX 3060 12GB: Faster for 12B?
- llama.cpp vs vLLM for Single-User Chat on an RTX 3060 12GB (2026)
- vLLM on an RTX 3060 12GB: Is It Worth It for Single-User Chat?
Citations and sources
- llama.cpp GitHub repository — official source for the underlying inference engine, build instructions, and CLI flags.
- Ollama GitHub repository — Ollama project source, including performance-related issues that document the overhead breakdown.
- TechPowerUp — GeForce RTX 3060 specifications — hardware reference for the 12 GB SKU used in these measurements.
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
