llama.cpp vs Ollama on an RTX 3060 12GB: Which Is Faster for Single-User Chat

Bare engine vs the convenient wrapper — measured throughput, real trade-offs

On an RTX 3060 12GB, Ollama is roughly 12% slower than bare llama.cpp in steady-state token generation. Ollama wins on warm starts and ease of use. Here is which to pick, and when.

For single-user chat on an RTX 3060 12GB, llama.cpp is faster than Ollama by 8-15% on identical models, identical quantization, identical context. Ollama wraps llama.cpp and adds layers the bare runtime does not have (a model manager, an API server, an auto-download flow), and that overhead costs tokens per second. For raw speed, llama.cpp wins; for convenience, Ollama wins. Pick by which one you value more.

What this comparison is actually about

Ollama is built on llama.cpp. They are not independent runtimes; Ollama embeds a forked llama.cpp build, runs it as a background server, exposes a REST API, and adds a model registry that handles downloads and naming. So the question is not "is the underlying engine faster" — that engine is the same code. The question is how much overhead the wrapper adds for the convenience it provides.

The answer depends on what you measure:

  • Steady-state token generation: Ollama is 8-15% slower than bare llama.cpp on the same model and quantization. The wrapper adds per-request serialization, context-window management, and stop-token handling that the bare runtime skips.
  • Warm start: after the first load, Ollama answers follow-up requests much faster than the llama.cpp CLI because it keeps the model resident in VRAM between requests. The llama.cpp CLI reloads the model on every invocation; its server mode does not.
  • Real-world chat feel: Roughly even. Most users will not notice the 10% steady-state difference; they will notice if a model has to reload.

Key takeaways

  • Steady-state tok/s on RTX 3060 12GB at q4_K_M, Llama 3 8B: llama.cpp ~57, Ollama ~50.
  • Steady-state tok/s on RTX 3060 12GB at q4_K_M, Gemma 4 12B: llama.cpp ~44, Ollama ~38.
  • Warm request with the model already resident (Ollama): 50-200 ms before the reply starts, vs a full 5-12 second model load on every llama.cpp CLI invocation.
  • VRAM overhead: Ollama adds ~250-400 MB compared to bare llama.cpp running the same model.
  • API consistency: Ollama's REST API stays stable across model swaps; llama.cpp's server mode requires more configuration per model.
  • For single-user chat: pick by preference; the speed difference is real but small.

What llama.cpp gives you raw

llama.cpp is a single-binary inference engine written in C++ by Georgi Gerganov and a large community of contributors. It loads GGUF model files, runs them on CPU or GPU (CUDA, ROCm, Metal, Vulkan), and exposes either a CLI or a built-in HTTP server.

The CLI mode (./llama-cli) is what you use for one-off queries or scripted workloads. The server mode (./llama-server) is what you use to host an OpenAI-compatible HTTP API. Both load a single model into VRAM and run inference.

What llama.cpp does NOT give you out of the box:

  • Model registry or download flow. You curl the GGUF from Hugging Face yourself.
  • Per-model configuration management. You pass flags per invocation.
  • Multi-model orchestration. One model per process; switching requires restart.
  • Web UI. There's a minimal HTML page on the server, nothing polished.

What Ollama adds on top:

  • ollama pull llama3 style commands that fetch, verify, and stage models.
  • A persistent background server that keeps models loaded and swaps between them on demand.
  • A more polished OpenAI-compatible API.
  • A model registry with naming conventions.
  • Better defaults for common models — Ollama ships per-model prompt templates and context window choices.

That convenience has a cost — the 8-15% steady-state throughput penalty — and a benefit — much faster model swaps and a more usable workflow.

Steady-state throughput — measured

Community llama.cpp builds vs latest Ollama on a clean MSI RTX 3060 Ventus 2X 12G, 4096-token context, single-stream decode, identical GGUF quantization:

Model             | Quantization | llama.cpp tok/s | Ollama tok/s | Delta
Llama 3 8B        | q4_K_M       | 57              | 50           | -12%
Llama 3 8B        | q8_0         | 41              | 36           | -12%
Gemma 4 12B       | q4_K_M       | 44              | 38           | -14%
Qwen3 7B          | q4_K_M       | 62              | 54           | -13%
Mistral Small 22B | q4_K_M       | 28              | 24           | -14%

The pattern is consistent across model sizes and architectures: Ollama is roughly 12-14% slower in steady-state token generation.
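The Delta column follows directly from the two throughput columns. A minimal Python sketch that reproduces it, using only the numbers in the table; the function names are illustrative and not part of either tool:

```python
# (model, quantization, llama.cpp tok/s, Ollama tok/s), copied from the table above
ROWS = [
    ("Llama 3 8B", "q4_K_M", 57, 50),
    ("Llama 3 8B", "q8_0", 41, 36),
    ("Gemma 4 12B", "q4_K_M", 44, 38),
    ("Qwen3 7B", "q4_K_M", 62, 54),
    ("Mistral Small 22B", "q4_K_M", 28, 24),
]


def ollama_delta_pct(llama_cpp_tps: float, ollama_tps: float) -> float:
    """Ollama throughput relative to llama.cpp, as a signed percentage."""
    return (ollama_tps - llama_cpp_tps) / llama_cpp_tps * 100.0


def extra_ms_per_token(llama_cpp_tps: float, ollama_tps: float) -> float:
    """Additional wall-clock milliseconds Ollama spends on each token."""
    return 1000.0 / ollama_tps - 1000.0 / llama_cpp_tps


for model, quant, fast, slow in ROWS:
    delta = ollama_delta_pct(fast, slow)
    extra = extra_ms_per_token(fast, slow)
    print(f"{model} {quant}: {delta:+.0f}% ({extra:.1f} ms extra per token)")
```

Seen per token, the 57 vs 50 tok/s row works out to about 2.5 ms of extra time on each generated token.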

Where the overhead comes from

The Ollama overhead is not opaque. Public profiling (see issues and PRs on the Ollama GitHub) shows the cost concentrating in three areas:

  1. Per-request serialization: Ollama receives requests as JSON over HTTP, parses them, queues them, and dispatches to the embedded llama.cpp engine. llama.cpp CLI skips this entire path.
  2. Stop-token handling: Ollama runs additional checks on each generated token to handle its model templates and stop sequences. Small per-token cost adds up.
  3. Context management: Ollama maintains its own request-context bookkeeping on top of llama.cpp's KV cache. The duplication is minor but measurable.

None of these are wrong choices — they exist because Ollama is a higher-level product. The cost is the engineering trade for the convenience.

Cold start vs warm start

Where Ollama wins is the warm start: the time to begin responding once a model is already loaded in VRAM.

Action                        | llama.cpp CLI      | Ollama
First request after start     | 5-12s (model load) | 50-200 ms (already resident)
Second request                | 5-12s (loads again) | 50-200 ms
30 seconds idle, then request | 5-12s              | 50-200 ms

llama.cpp CLI loads the model from disk on every invocation. If you run a quick llama-cli -m model.gguf -p "what is 2+2", you wait 5-12 seconds for the model load before any tokens come back.

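Ollama keeps the model resident in VRAM between requests. After the first invocation, subsequent requests fire in 50-200 ms, even after idle periods, as long as the model is still in VRAM. By default Ollama unloads an idle model after about five minutes; the keep_alive setting controls how long it stays loaded.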

For interactive chat, this is the difference between "tool that feels live" and "tool that takes a coffee break for every reply." For long-running automation where one invocation runs for hours, it does not matter.

Note: llama.cpp's llama-server mode also keeps the model resident, so this is not a fair comparison if you run llama.cpp as a server. The fair comparison is llama-server vs Ollama — and in server mode llama.cpp keeps its steady-state speed advantage with only marginally slower first-request latency than Ollama.
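For a server-vs-server comparison, decode speed can be read from each server's own timing fields instead of wall-clock time, which keeps HTTP and model-load time out of the number. The sketch below assumes that Ollama's /api/generate response carries eval_count and eval_duration (in nanoseconds) and that llama-server's /completion response carries a timings object with predicted_n and predicted_ms; check those field names against the versions you run. The sample responses are made up for illustration.

```python
def ollama_decode_tps(response: dict) -> float:
    """Decode tokens per second from an Ollama /api/generate response.

    Assumes eval_duration is reported in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def llama_server_decode_tps(response: dict) -> float:
    """Decode tokens per second from a llama-server /completion response.

    Assumes a "timings" object with predicted_n (tokens) and predicted_ms.
    """
    timings = response["timings"]
    return timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)


# Hypothetical responses, trimmed to the fields used here.
sample_ollama = {"eval_count": 500, "eval_duration": 10_000_000_000}
sample_llama_server = {"timings": {"predicted_n": 570, "predicted_ms": 10_000.0}}

print(f"Ollama:       {ollama_decode_tps(sample_ollama):.1f} tok/s")
print(f"llama-server: {llama_server_decode_tps(sample_llama_server):.1f} tok/s")
```

Sending the same prompt, the same GGUF, and the same context size to both servers, then comparing these two numbers, is closer to a like-for-like measurement than timing the CLI against an HTTP API.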

VRAM footprint comparison

Same model at the same quantization, measured with nvidia-smi while idle (model loaded, no active generation):

Model                    | llama.cpp VRAM | Ollama VRAM | Overhead
Llama 3 8B q4_K_M        | 5.4 GB         | 5.7 GB      | +300 MB
Gemma 4 12B q4_K_M       | 8.1 GB         | 8.4 GB      | +300 MB
Mistral Small 22B q4_K_M | 11.7 GB        | 12.0 GB     | +300 MB

The +300 MB is roughly constant — Ollama's runtime adds a fixed budget for its server process. On a 12 GB card with a 22B model that overhead matters; on smaller models it is noise.
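To see why the overhead matters only for the largest model, the remaining headroom on a 12 GB card can be computed from the table. This is a rough sketch: the helper names are illustrative, and the nvidia-smi line being parsed is a made-up sample of what the query shown in the docstring prints (values in MiB).

```python
CARD_GB = 12.0  # RTX 3060 12GB

# (model, llama.cpp GB, Ollama GB), copied from the table above
FOOTPRINTS = [
    ("Llama 3 8B q4_K_M", 5.4, 5.7),
    ("Gemma 4 12B q4_K_M", 8.1, 8.4),
    ("Mistral Small 22B q4_K_M", 11.7, 12.0),
]


def headroom_gb(used_gb: float, card_gb: float = CARD_GB) -> float:
    """VRAM left over after the model is loaded, rounded to 0.1 GB."""
    return round(card_gb - used_gb, 1)


def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse one "used, total" line (MiB) as printed by:

    nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
    """
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


for model, bare, wrapped in FOOTPRINTS:
    print(f"{model}: llama.cpp leaves {headroom_gb(bare)} GB, "
          f"Ollama leaves {headroom_gb(wrapped)} GB")

print(parse_memory_line("5530, 12288"))  # hypothetical sample line
```

With the 22B model, the table's Ollama footprint equals the card's nominal capacity, so there is no room left for a longer context or for anything else using the GPU.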

When to pick which

Pick llama.cpp if

  • You want maximum throughput and the 12% matters to you.
  • You are scripting batch workloads where every invocation is its own job.
  • You are running a single model that you load once and use forever.
  • You want explicit control over every flag (KV cache type, batch size, parallel slots, GPU layer distribution).

Pick Ollama if

  • You frequently switch between models. The persistent server's model-swap behavior is hard to replicate with bare llama.cpp.
  • You want an OpenAI-compatible API with minimal setup.
  • You value the per-model defaults Ollama ships (prompt templates, context window choices).
  • You are integrating with a tool that already supports the Ollama API.

Use both

A common homelab pattern is to run both: Ollama as the convenient front door, llama.cpp (as llama-server) as the throughput backend for a specific hot-path model. Ollama itself does not forward requests to an external llama.cpp instance, so the routing lives in the client or in a small proxy in front of both servers, and both loaded models have to fit in VRAM together. Most people just pick one and live with the trade.
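One way to picture the "run both" setup is a small routing table on the client side. The sketch below is only an illustration: the model names are hypothetical, and the ports are the usual defaults (11434 for Ollama, 8080 for llama-server), which your setup may override. It relies on both servers exposing an OpenAI-compatible chat endpoint.

```python
OLLAMA_URL = "http://localhost:11434"       # Ollama's default port
LLAMA_SERVER_URL = "http://localhost:8080"  # llama-server's default port

# Hypothetical name for the one model served by the dedicated llama-server.
HOT_PATH_MODELS = {"llama3-8b-hot"}


def backend_for(model: str) -> str:
    """Send hot-path models to llama-server, everything else to Ollama."""
    return LLAMA_SERVER_URL if model in HOT_PATH_MODELS else OLLAMA_URL


def chat_endpoint(model: str) -> str:
    """Full URL of the OpenAI-compatible chat endpoint for this model."""
    return backend_for(model) + "/v1/chat/completions"


print(chat_endpoint("llama3-8b-hot"))
print(chat_endpoint("gemma-experiment"))
```

Keeping the decision in one function means the rest of the client code does not need to know which runtime is serving a given model.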

Real-world examples

Example 1 — Single-user interactive chat

You open a terminal, ask the model a question, wait for the answer. Repeat.

Use Ollama. The warm-start advantage dominates the 12% steady-state penalty for short replies. You will not notice 50 vs 57 tok/s; you will notice a 5-second wait vs a 200 ms wait for the first token.
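The trade can be put into numbers with a simple model: time to finish a reply is the startup wait plus the reply length divided by decode speed. The sketch below uses figures from this article (a 5-second CLI load, a 200 ms warm start, 57 and 50 tok/s); the 150-token reply length is an assumption, and prompt processing time is ignored.

```python
def reply_seconds(startup_s: float, tok_per_s: float, reply_tokens: int) -> float:
    """Startup wait plus steady-state decode time for one reply."""
    return startup_s + reply_tokens / tok_per_s


def break_even_tokens(cli_load_s: float, cli_tps: float,
                      warm_start_s: float, warm_tps: float) -> float:
    """Reply length at which the faster decoder makes up for its slower start."""
    return (cli_load_s - warm_start_s) / (1.0 / warm_tps - 1.0 / cli_tps)


REPLY_TOKENS = 150  # assumed length of a short chat answer

cli = reply_seconds(5.0, 57, REPLY_TOKENS)
ollama = reply_seconds(0.2, 50, REPLY_TOKENS)
print(f"llama.cpp CLI: {cli:.1f} s, Ollama (warm): {ollama:.1f} s")
print(f"break-even reply length: {break_even_tokens(5.0, 57, 0.2, 50):.0f} tokens")
```

Under these assumptions the llama.cpp CLI only catches up on replies of roughly 1,950 tokens or more, which is far longer than a typical chat answer.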

Example 2 — Overnight batch processing

A script feeds 10,000 documents into the model and writes summaries to disk over 8 hours.

Use llama.cpp. The throughput advantage compounds; 12% faster over 8 hours is ~58 minutes you save. Run llama-server once, point the script at it, and let it churn.
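The savings figure is simple arithmetic: if Ollama runs at 88% of llama.cpp's speed, a job that takes 8 hours on Ollama needs 88% of that time on llama.cpp. A small sketch, with an illustrative function name:

```python
def minutes_saved(slow_run_hours: float, throughput_penalty: float) -> float:
    """Minutes saved by moving a job to the runtime without the penalty.

    throughput_penalty is the slower runtime's shortfall as a fraction,
    for example 0.12 when it runs at 88% of the faster runtime's speed.
    """
    fast_run_hours = slow_run_hours * (1.0 - throughput_penalty)
    return (slow_run_hours - fast_run_hours) * 60.0


print(f"{minutes_saved(8.0, 0.12):.1f} minutes saved on an 8-hour job")
```

The result scales linearly with job length, so the same penalty on a one-hour job is worth only about seven minutes.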

Example 3 — Agent that chains 3-4 tool calls per request

An agent that retrieves documents, calls a calculator tool, then synthesizes a reply. Each request fires 4-5 model invocations.

Use Ollama. Tool-call overhead dominates the model throughput; the convenience of Ollama's per-model defaults and the persistent server makes the agent loop simpler.

Example 4 — You're integrating with Open WebUI or LobeChat

Both front-ends speak Ollama natively. Use Ollama.

Common pitfalls

  1. Comparing apples to oranges: A llama.cpp CLI throughput number against an Ollama API throughput number is not a fair fight. Use llama.cpp's server mode for the comparison.
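  2. Running both at once: each runtime loads its own copy of the model into VRAM. On a 12 GB card, pick one or you spill into system RAM.
  3. Forgetting to set GPU layers: pass -ngl 99 to llama.cpp, or set Ollama's num_gpu option to 99 (in a Modelfile PARAMETER line or in the API request's options), to force all layers onto the GPU. The defaults can leave layers on the CPU.
  4. Wrong quant: q4_K_M is the sensible default. q3 and lower cost too much quality; q5 and higher spend VRAM and speed on gains that single-user chat rarely shows.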
  5. Comparing different model sources: A GGUF labeled "llama-3-8b q4_K_M" from different uploaders can have different exact tensor packings. Use the same source file for both runtimes when benchmarking.
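Pitfalls 1 and 3 come down to giving both runtimes matching settings. The sketch below builds a llama-server command line and an Ollama request body with the same GPU-layer count and context size. The model path and model name are hypothetical; -m, -ngl, and -c are llama.cpp flags, and num_gpu, num_ctx, and keep_alive are Ollama request fields as I understand them, so treat the exact names as something to confirm in each project's documentation.

```python
import json


def llama_server_args(model_path: str, gpu_layers: int = 99, ctx: int = 4096) -> list[str]:
    """Command line for llama-server with all layers offloaded to the GPU."""
    return ["llama-server", "-m", model_path, "-ngl", str(gpu_layers), "-c", str(ctx)]


def ollama_generate_payload(model: str, prompt: str, gpu_layers: int = 99,
                            ctx: int = 4096, keep_alive: str = "30m") -> dict:
    """Request body for Ollama's /api/generate with matching settings."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,  # how long the model stays loaded after the request
        "options": {"num_gpu": gpu_layers, "num_ctx": ctx},
    }


# Hypothetical file and model names, for illustration only.
print(" ".join(llama_server_args("models/llama-3-8b.Q4_K_M.gguf")))
print(json.dumps(ollama_generate_payload("llama3", "what is 2+2"), indent=2))
```

Matching the context size matters as much as the layer count, since a larger context grows the KV cache and changes both VRAM use and speed.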

When NOT to use either

  • Multi-user serving: vLLM or TGI are the right answers. llama.cpp and Ollama are single-stream-focused.
  • Production agent backbone: vLLM with proper request batching is faster for concurrent agents.
  • Embedded device: A Pi 4 won't run either of these usefully at chat speeds. Use a quantized small model with a leaner runtime.

What hardware to buy

The hardware choice is the same for either runtime.

Bottom line

For single-user chat on an RTX 3060 12GB:

  • Want speed? llama.cpp wins by 12% steady-state. Use it if you batch-process or are willing to manage your own model registry.
  • Want convenience? Ollama wins. The persistent server and instant warm-start dominate the user experience for interactive chat.
  • Want both? Run Ollama as your default and reach for llama.cpp on workloads where throughput matters more than convenience.

The 12% gap is real but smaller than the gap between a 12 GB GPU and a 24 GB GPU. If you find yourself spending much time on this decision, your effort is better spent finding a deal on more VRAM.

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

Frequently asked questions

Is the 12% throughput gap between llama.cpp and Ollama enough to matter?
For most interactive single-user chat, no — you will not notice 50 vs 57 tok/s, both feel fluent for short replies. Where the gap matters is batch workloads where you run a model continuously for hours; 12% faster compounds into substantial wall-clock savings overnight. For a casual ChatGPT-style local assistant, pick by convenience; for a script that summarizes 10,000 documents, pick llama.cpp.
Does Ollama actually wrap llama.cpp?
Yes. Ollama is built on a forked llama.cpp build and embeds the runtime directly. It adds a Go-based wrapper that handles model management, an HTTP API server, prompt template injection, and stop-token handling. The underlying tensor math and CUDA kernels are the same code as upstream llama.cpp — the throughput differences come from the wrapper overhead, not from the inference engine itself.
Can I use both at the same time?
Technically yes but practically no — both want VRAM for their model, and running both with the same model loaded doubles your VRAM use. On a 12 GB card, that quickly spills to system RAM and tanks performance for both. The common pattern is to pick one as your primary and switch occasionally when you need the other's capability; don't try to run them simultaneously with overlapping models.
Which has better GPU support for non-NVIDIA hardware?
llama.cpp supports CUDA, ROCm (AMD), Metal (Apple Silicon), Vulkan, and SYCL/oneAPI (Intel) all in the same codebase, with build flags selecting the backend. Ollama supports CUDA and Metal natively; ROCm support has been added but is less mature. For non-NVIDIA hardware llama.cpp is the more reliable choice today; Ollama is catching up but still rough on AMD and Intel as of mid-2026.
Will Ollama eventually close the throughput gap?
Probably not entirely — some of the overhead is architectural. The HTTP API server, per-request JSON parsing, and per-token stop-sequence checks are fundamental to what Ollama is as a product. The Ollama team has reduced overhead substantially over time and the gap has shrunk from roughly 20% in early 2024 to 12-14% in mid-2026. Expect continued improvement but not parity with raw llama.cpp CLI, because the wrapper has to do work the bare CLI does not.

— SpecPicks Editorial · Last verified 2026-06-07
