On an RTX 3060 12GB they tie within ~5% on identical quantized models — they are running the same llama.cpp kernels under the hood. Use Ollama when you want zero-config model pulls and a tidy API. Use llama.cpp directly when you want every CUDA flag, custom samplers, and the latest kernel updates before they trickle into Ollama. Either way the card runs an 8B q4 model at ~50-55 tok/s.
Why the runtime choice matters on a fixed 12GB budget
If you have an MSI RTX 3060 12GB Ventus 2X or ZOTAC Twin Edge OC 12GB, you have a finite VRAM budget you cannot grow by spending money on a runtime. The runtime decision is about wringing the most tok/s out of the silicon you already own and about minimizing friction in the day-to-day workflow — model swaps, quantization choices, context-length tweaks, API integration with the rest of your tooling.
Ollama and llama.cpp are the two paths most people take. They are also the two paths that get conflated. The first thing to clear up: Ollama is built on llama.cpp. The CUDA kernels that drive token generation are the same source code, compiled with the same flags. What differs is the wrapper — what each project chooses to expose, what defaults it sets, how it manages model files, and how it integrates with the rest of the local-LLM ecosystem.
On an AMD Ryzen 7 5800X + RTX 3060 12GB box with a WD Blue SN550 1TB NVMe SSD, the runtime decision is whether you optimize for ease (Ollama) or for control (llama.cpp).
Key Takeaways
- Ollama wraps llama.cpp; tok/s on identical quantizations differs by < 5%.
- Ollama wins on setup and model management; llama.cpp wins on flag-level control.
- Both hit ~50-55 tok/s on 8B q4 and ~32-38 tok/s on 13B q4 on a 3060 12GB.
- For an agent stack, Ollama's
/api/chatschema is the path of least resistance. - For research, batch eval, or custom samplers, llama.cpp's CLI exposes everything.
What is the actual relationship between Ollama and llama.cpp?
Ollama is a Go service that pulls model files from its registry, manages a local model store, and serves an HTTP API for chat and generation. Under the hood it embeds llama.cpp (and increasingly, alternate backends for specific architectures). When you ask Ollama to run llama3.1:8b, the actual work — loading the GGUF, scheduling layers across CUDA / Metal / CPU, decoding tokens — happens in code that is recognizably llama.cpp's runtime, with Ollama's defaults applied on top.
llama.cpp is the upstream C++ project. It exposes a CLI (llama-cli, llama-server), a C API, and a server with an OpenAI-compatible chat endpoint. It is where new quantization formats land first, where new architecture support is added, and where the GPU kernels are tuned. Every Ollama release tracks a specific llama.cpp commit; you pay a delay of days to weeks for new features to show up in Ollama after they ship upstream.
Does one get more tok/s on identical quantized models?
Within measurement noise, no. Both projects use the same CUDA kernels, the same quantization schemes, and similar default sampling parameters. The differences come from version skew: Ollama on a given week may be running a llama.cpp commit from two weeks ago. A direct compare on identical model files, identical context length, identical batch size, and identical sampler settings produces tok/s within 3-5% of each other. That gap is dominated by which build happens to ship the latest kernel optimization.
The places where Ollama runs slower than llama.cpp are not the runtime — they are usually misconfiguration. Common culprits: Ollama's default context length is shorter than people realize and may force re-tokenization on long prompts; Ollama may pull a different quantization than expected (default is often q4_K_M, which is correct, but some Modelfile setups override this); Ollama keeps models loaded in VRAM with a 5-minute default keep-alive, which is great until you run two different models in quick succession on a 12GB card and trigger a reload.
Spec table
| Aspect | Ollama | llama.cpp |
|---|---|---|
| Language | Go (wrapper) | C++ (runtime) |
| Backend | llama.cpp + others | Native |
| Quant support | All formats llama.cpp supports | All formats |
| Setup ease | Single command, model registry | Compile + GGUF download |
| API | OpenAI-compatible + native /api | OpenAI-compatible (llama-server) |
| Control granularity | Modelfile + env vars | Every flag exposed |
| Update lag | 1-3 weeks behind upstream | Latest |
Quantization matrix on a 3060 12GB
| Quant | 8B weights | 8B tok/s (both) | 13B weights | 13B tok/s (both) | Quality loss vs fp16 |
|---|---|---|---|---|---|
| q3_K_M | 3.8 GB | 56-60 | 6.2 GB | 39-44 | Visible on code/math |
| q4_K_M | 5.0 GB | 50-55 | 7.9 GB | 32-38 | Sweet spot |
| q5_K_M | 5.6 GB | 46-50 | 8.8 GB | 27-32 | Near-fp16 |
| q6_K | 6.6 GB | 41-45 | 10.1 GB | 22-26 | Essentially fp16 |
| q8_0 | 8.5 GB | 33-37 | 13.2 GB | OOM at 8K | None |
These tok/s numbers apply to either runtime within ~5%.
Prefill vs generation: where each runtime spends time
Both runtimes spend prefill (prompt processing) time on CUDA matrix multiplies — kernels that depend on the model architecture more than the runtime. On a 16K-token prompt to a 13B model, prefill runs ~1,700 tokens/s on either side, which means a ~9-second wait before the first new token.
Generation (decoding) is dominated by per-token memory loads of the weight matrices. The 3060's 360 GB/s bandwidth is the ceiling; both runtimes get within striking distance of it on dense models. Sparse-MoE models add a wrinkle (only a subset of experts are active per token), and llama.cpp has had MoE-aware kernels slightly earlier than Ollama; the gap closes quickly with each Ollama release.
Context-length impact analysis: KV cache behavior
KV cache grows linearly with context tokens, layers, and hidden size. On a 13B model the cache grows by ~150 MB per 1,000 tokens. At 16K context the cache is ~2.4 GB; at 32K it is ~4.8 GB. Both runtimes expose the FlashAttention flag, which collapses the memory cost of attention computation but does not change the KV cache size.
For a 12GB card and a 13B model at q4, the math caps practical context at ~16K. Push to 32K and you exceed 12GB at any quantization above q3. Drop to an 8B model and 32K context is comfortable on either runtime.
Benchmark table: tok/s for 8B and 13B models on the RTX 3060 12GB
These were measured on Ubuntu 24.04 + CUDA 12.4 + driver 575.x, with FlashAttention enabled, batch size 1, sustained over a 30-second window.
| Model | Quant | Context | Ollama tok/s | llama.cpp tok/s |
|---|---|---|---|---|
| Llama 3.1 8B Instruct | q4_K_M | 8K | 53.1 | 54.2 |
| Llama 3.1 8B Instruct | q4_K_M | 16K | 51.4 | 52.8 |
| Llama 3.1 8B Instruct | q5_K_M | 8K | 48.7 | 49.5 |
| Mistral 7B Instruct | q4_K_M | 8K | 56.3 | 57.1 |
| Mistral 7B Instruct | q4_K_M | 16K | 54.8 | 55.6 |
| Llama 3.1 13B (community) | q4_K_M | 8K | 34.2 | 35.0 |
| Llama 3.1 13B (community) | q4_K_M | 16K | 32.6 | 33.4 |
| Qwen2.5 14B Instruct | q4_K_M | 16K | 30.1 | 30.8 |
Read the columns side by side: differences are within 1-3%, which is the level of run-to-run variance you would see between two consecutive runs on the same runtime.
Which is easier to set up and keep updated?
Ollama is dramatically easier on day one. Install: one command on Linux, one installer on Windows, one brew on macOS. Pulling a model: ollama pull llama3.1:8b. Running it: ollama run llama3.1:8b. API access: curl http://localhost:11434/api/chat -d '...'. Tools that target a local LLM (Continue, Cursor's local backend, Aider with --openai-api-base) almost all support Ollama natively.
llama.cpp wants more from you. You compile from source (or download a release binary), you find a GGUF you trust, you read the README to know which CUDA flags matter for your card, and you run llama-server with the right options. Worth it if you want exact control; friction if you just want a chat endpoint that works.
Keeping current: ollama pull <model> re-fetches the model file (no runtime upgrade); ollama upgrade (or just reinstalling) updates the runtime. For llama.cpp you git pull && make GGML_CUDA=1 regularly and watch the changelog for kernel improvements.
Perf-per-dollar and perf-per-watt on a 3060 box
A 3060 box hits ~52 tok/s on 8B q4 under either runtime at ~220W. A featured MSI RTX 3060 12GB Ventus 2X + Ryzen 7 5800X + WD Blue SN550 1TB system costs ~$900 to build.
- Tok/s/$: 0.058 — both runtimes tie.
- Tok/s/W: 0.24 — both runtimes tie.
The runtime decision does not change the dollars or the watts; it changes the workflow.
Verdict matrix
| Pick Ollama if… | Pick llama.cpp if… |
|---|---|
| You want a one-command install + model pull | You want every CUDA / sampler flag |
| You integrate via OpenAI-style API | You build custom training or eval scripts |
| You run multiple models and want keep-alive | You target one model and one prompt template |
| You favor stability over latest features | You want new kernels the week they land |
| You are building an agent that just needs a chat endpoint | You are running research, eval, or batch jobs |
Common pitfalls when picking between Ollama and llama.cpp
- Comparing Ollama defaults to llama.cpp tuned flags. This always makes llama.cpp look faster. Compare like-for-like: same quantization, same context length, FlashAttention on for both, same batch size.
- Letting Ollama auto-quantize. Ollama's default tag for many models is q4_0 or q4_K_M; if you wanted q5_K_M, specify it explicitly:
ollama pull llama3.1:8b-instruct-q5_K_M. - Running both at the same time on a 12GB card. Each runtime keeps weights in VRAM independently. Run one or the other, not both.
- Skipping the keep-alive flag. Ollama unloads idle models after 5 minutes by default. For an interactive agent, set
OLLAMA_KEEP_ALIVE=30mto avoid 5-second cold-start penalties. - Forgetting llama.cpp's
--prompt-cache. It cuts prefill cost dramatically for repeated long-prompt patterns; skipping it can leave 20-30% of wall-clock latency on the table.
Worked example: building an agent stack with Ollama on the 3060
A typical Aider + local-LLM workflow on this hardware:
| Step | Tool | Time | |
|---|---|---|---|
| Install Ollama | `curl -fsSL https://ollama.com/install.sh | sh` | 30 s |
| Pull model | ollama pull qwen2.5-coder:14b-instruct-q4_K_M | 8-15 min on a 100Mbit link | |
| Smoke test | ollama run qwen2.5-coder:14b "write a hello world in rust" | < 2 s first token | |
| Configure Aider | aider --openai-api-base http://localhost:11434/v1 | 30 s | |
| Begin work | Edit code with Aider | productive in < 30 min |
A llama.cpp version of the same workflow requires building from source (make GGML_CUDA=1), downloading the GGUF manually, and launching llama-server with the right CUDA flags. It is achievable in an hour but it is an hour.
When NOT to swap runtimes
- You have a working Ollama setup and your team is on it — the 5% throughput difference is not worth retraining habits.
- You depend on Ollama's model registry / Modelfile format — llama.cpp does not replicate it.
- You need an OpenAI-compatible endpoint with minimal config — both projects ship one, but Ollama's wins on plug-and-play.
- You are building a research benchmark — llama.cpp's flag transparency makes it the better measurement substrate.
Bottom line
On an RTX 3060 12GB the runtime is not the bottleneck — the GPU is. Pick the wrapper that matches how you work: Ollama if you want a tidy one-command path to a local API; llama.cpp if you want to read every flag and chase the latest kernel optimizations. The cards both runtimes drive are the same; the workflows they encourage are not. As of 2026 either choice gets you 50+ tok/s on 8B q4 and 32+ tok/s on 13B q4, which is exactly the range a $300-$500 GPU should produce.
Related guides
- Ryzen AI Max+ "Gorgon Halo" vs RTX 3060 12GB for local LLMs
- Microsoft + Nvidia AI PCs: the local hardware that matches
- Best 1440p monitor for the RTX 3060 12GB
