For single-user chat on an RTX 3060 12GB in 2026, llama.cpp is the right default — it's faster on cold starts, has the broadest GGUF model selection, and runs on smaller VRAM budgets at every quantization tier. vLLM wins if you're serving more than one concurrent request or measuring p50/p99 under sustained load, where continuous batching and PagedAttention pull ahead by a wide margin.
Why this matters in 2026
The RTX 3060 12GB is still the entry tier for serious local-LLM work, and the two dominant inference backends — llama.cpp and vLLM — have very different design centers. llama.cpp was built for laptops and single-user chat; vLLM was built for production serving with batched concurrent requests. The 12GB single-user case sits awkwardly between the two, so a head-to-head scoped to that exact workload is what most readers actually need.
This piece is editorial synthesis of public benchmarks, project documentation, and community measurements. We don't run a private testbench — what follows comes from cited sources organized for the 12GB single-user case.
Key takeaways
- For one user typing into a chat window, llama.cpp is faster off the line on a 12GB card.
- For multi-request serving (RAG, async agents, multi-tab), vLLM's continuous batching wins.
- llama.cpp handles smaller quantizations (q3, q2) that vLLM doesn't natively support.
- vLLM needs more VRAM headroom for its KV cache pool; on 12GB that's a real constraint.
- Hardware-side, pair either backend with an MSI RTX 3060 Ventus 2X 12G and a WD Blue SN550 NVMe for fast model loads.
What llama.cpp actually is
<code>llama.cpp</code> is a C++ inference engine that runs quantized GGUF models on CPU, GPU (CUDA, ROCm, Metal, Vulkan), or both. Its design center is "run anywhere, including on phones and laptops." For our purposes that means:
- Tiny memory footprint when idle.
- Fast cold start (model load + first token in seconds).
- Per-token streaming response, optimized for one user.
- Broad quantization support including q2_K, q3_K, q4_K_S, q4_K_M, q5_K_S, q5_K_M, q6_K, and q8_0, plus unquantized fp16.
- Batch size of 1 is the optimized path.
The trade-off is that batched throughput is not its strong suit. If you fire two simultaneous prompts at a llama.cpp server configured with a single request slot, the second waits until the first finishes.
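A single-user launch can be sketched as an argument list. This is a minimal sketch, assuming the <code>llama-server</code> binary is on your PATH. The model path is hypothetical, and the context size and slot count are illustrative values rather than recommendations from the cited sources.

```python
import shlex

def llama_server_argv(model_path, ctx_size=8192, gpu_layers=99, slots=1, port=8080):
    """Build a llama-server command line for one user on one GPU."""
    return [
        "llama-server",
        "-m", model_path,            # GGUF model file (hypothetical path below)
        "-c", str(ctx_size),         # context window in tokens
        "-ngl", str(gpu_layers),     # layers to offload to the GPU; a large value offloads them all
        "--parallel", str(slots),    # request slots; 1 matches the single-user case
        "--port", str(port),
    ]

argv = llama_server_argv("models/example-12b-q4_k_m.gguf")
print(shlex.join(argv))
```

Passing the list to <code>subprocess.run</code> starts the server; keeping the arguments in one function makes it easy to vary the context size when you test how much fits in 12GB.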
What vLLM actually is
vLLM is a Python inference engine designed for high-throughput serving with continuous batching and PagedAttention. It is widely used as a production serving backend. For our 12GB-card use case:
- Continuous batching: multiple in-flight requests share the same forward passes.
- PagedAttention: KV cache is allocated in pages, dramatically increasing concurrent request capacity.
- Higher steady-state throughput than llama.cpp under any non-trivial concurrent load.
- AWQ and GPTQ quantization support (no native q3 or smaller).
- Larger upfront VRAM commit for the KV cache pool.
The trade-off is that vLLM is a production server, not a chat-window companion. Cold start is slower, idle memory is higher, and single-user p50 latency usually lands within 10–20% of llama.cpp without reliably beating it.
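The paging idea can be illustrated with a toy allocator. This is a sketch of the general technique, not vLLM's implementation: the class name, the 16-token page size, and the 16,384-token pool are all assumptions chosen for the example. It compares how many chats fit when each sequence holds only the pages it has filled against reserving the full 4,096-token limit per chat up front.

```python
import math

BLOCK_TOKENS = 16  # tokens per KV page; an assumed value for this sketch

class PagedKVPool:
    """Toy allocator: a sequence holds only the pages its tokens have filled."""

    def __init__(self, total_blocks):
        self.free_blocks = list(range(total_blocks))
        self.block_tables = {}   # seq_id -> list of block ids
        self.lengths = {}        # seq_id -> tokens stored so far

    def append_tokens(self, seq_id, n_tokens):
        """Grow a sequence; return False if the pool cannot hold it."""
        table = self.block_tables.get(seq_id, [])
        new_length = self.lengths.get(seq_id, 0) + n_tokens
        needed = math.ceil(new_length / BLOCK_TOKENS) - len(table)
        if needed > len(self.free_blocks):
            return False  # a real engine would queue or preempt here
        new_blocks = [self.free_blocks.pop() for _ in range(needed)]
        self.block_tables[seq_id] = table + new_blocks
        self.lengths[seq_id] = new_length
        return True

    def release(self, seq_id):
        """Return a finished sequence's pages to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

def contiguous_capacity(pool_tokens, max_len):
    """Sequences that fit when each reserves max_len tokens up front."""
    return pool_tokens // max_len

POOL_TOKENS = 16384
pool = PagedKVPool(POOL_TOKENS // BLOCK_TOKENS)

# Admit chats that have each used 1,000 tokens of a 4,096-token limit.
admitted = 0
while pool.append_tokens(admitted, 1000):
    admitted += 1

print(admitted)                                 # 16
print(contiguous_capacity(POOL_TOKENS, 4096))   # 4
```

The gap between the two numbers is the capacity that up-front reservation leaves idle while chats are still short; it closes as sequences approach their maximum length.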
Spec-delta table
| Dimension | llama.cpp | vLLM |
|---|---|---|
| Language | C/C++ with Python bindings | Python with CUDA kernels |
| Default batch size | 1 (optimized) | N (continuous) |
| Quantization | GGUF (q2–fp16, all variants) | AWQ, GPTQ, fp16 |
| KV cache strategy | flat contiguous | paged |
| Cold start | seconds | tens of seconds |
| Single-user p50 latency | lower or equal | within 10–20% of llama.cpp |
| Multi-request throughput | poor | excellent |
| VRAM overhead | ~0.5–1 GB | 2–4 GB (KV pool) |
| Streaming | native | native (with config) |
| Vision/multimodal | yes (recent) | yes (vLLM 0.6+) |
Benchmark table on a 12GB RTX 3060
Per public LocalLLaMA threads and the <code>llama.cpp</code> discussion forums, these are single-user, single-shot decode speeds in tokens per second at 4K context, with llama.cpp running q4_K_M and vLLM running AWQ-int4 (12B-class models plus one 8B for comparison):
| Model + quant | llama.cpp tok/s | vLLM (AWQ-int4) tok/s |
|---|---|---|
| Qwen 3.5 12B | 34–40 | 30–38 |
| Gemma 4 12B | 32–38 | 28–36 |
| Llama 3.5 8B | 50–60 | 45–55 |
| Mistral Small 3 12B | 33–39 | 30–38 |
| Step 3.7 Flash 12B | 30–36 | 35–42 |
How to read this table:
- For most models, llama.cpp is 5–15% faster on a single-user, single-shot run.
- vLLM pulls ahead on Step-family models because of the architecture fit with PagedAttention.
- Under concurrent load (4 simultaneous chats), vLLM's effective throughput is roughly 2.5–3× that of llama.cpp on the same hardware.
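The single-user gap can be checked from the table itself. The sketch below takes the midpoint of each quoted range, which is a simplifying assumption (the sources report ranges, not distributions), and prints llama.cpp's advantage as a percentage.

```python
# (low, high) tok/s ranges copied from the benchmark table above:
# model -> (llama.cpp range, vLLM AWQ-int4 range)
BENCH = {
    "Qwen 3.5 12B": ((34, 40), (30, 38)),
    "Gemma 4 12B": ((32, 38), (28, 36)),
    "Llama 3.5 8B": ((50, 60), (45, 55)),
    "Mistral Small 3 12B": ((33, 39), (30, 38)),
    "Step 3.7 Flash 12B": ((30, 36), (35, 42)),
}

def midpoint(tok_range):
    low, high = tok_range
    return (low + high) / 2

def llama_cpp_advantage_pct(llama_range, vllm_range):
    """Positive when llama.cpp is faster, negative when vLLM is."""
    return (midpoint(llama_range) / midpoint(vllm_range) - 1) * 100

for model, (llama_range, vllm_range) in BENCH.items():
    print(f"{model}: {llama_cpp_advantage_pct(llama_range, vllm_range):+.1f}%")
```

On midpoints, the first four rows land between roughly +6% and +10%, and the Step row comes out near -14%, which is the one case where vLLM leads.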
Where vLLM wins on a 12GB card
Once you cross from "one human typing into a chat box" to any of these patterns, vLLM is the right backend:
- A small office of 3–5 people sharing one local LLM.
- An async agent that fires multiple parallel tool-result inferences.
- A RAG pipeline that runs many short prompts back-to-back.
- A coding tool with multi-buffer streaming completions.
- Any serving scenario where p99 latency under load matters.
The reason is continuous batching: in vLLM, two in-flight requests share each forward pass. In a single-slot llama.cpp chat setup, the second request queues behind the first.
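A toy timing model makes the difference concrete. Everything here is an assumption for illustration: four requests arrive together, each generates 300 tokens, the single-stream rate is 35 tok/s, and each extra in-flight request slows a batched decode step by 15%. That penalty was picked so the result falls inside the 2.5–3× range quoted earlier; it is not a measured value.

```python
def serialized_finish_times(n_requests, tokens_each, tok_per_s):
    """One request at a time: request i waits for all earlier ones."""
    per_request_s = tokens_each / tok_per_s
    return [per_request_s * (i + 1) for i in range(n_requests)]

def batched_finish_times(n_requests, tokens_each, tok_per_s, step_penalty=0.15):
    """Each decode step emits one token for every in-flight request.

    step_penalty is the assumed fractional slowdown of a step per extra
    request in the batch (an illustrative parameter, not a measurement).
    """
    step_s = (1 + step_penalty * (n_requests - 1)) / tok_per_s
    return [tokens_each * step_s] * n_requests

serial = serialized_finish_times(4, 300, 35)
batched = batched_finish_times(4, 300, 35)

print([round(t, 1) for t in serial])           # last request finishes after ~34 s
print(round(batched[0], 1))                    # every request finishes after ~12 s
print(round(serial[-1] / batched[-1], 2))      # aggregate speedup under these assumptions
```

The first request finishes sooner in the serialized case, which is why a lone user does not benefit; the gain shows up in the worst-case wait, which is what p99 under load measures.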
Where llama.cpp wins on a 12GB card
For everyday single-user use:
- Faster cold-start when you switch models mid-day.
- Lower idle VRAM (you can run a game on the same card without unloading the model).
- Smaller quantization options if you need to squeeze a 27B+ model onto 12GB.
- Better support for CPU+GPU split inference if you spill past VRAM.
- Simpler debugging — one process, one log file, no Python event loop.
For readers asking whether 12GB of VRAM is enough for local LLMs, an MSI RTX 3060 12GB running Ollama (which uses llama.cpp under the hood) is the practical sweet spot.
VRAM math: KV cache headroom
The single biggest 12GB constraint when running vLLM is the KV cache pool. By default vLLM reserves a large pool up front to maximize concurrent request capacity. A 7B model in fp16 needs about 14 GB for weights alone and does not fit at all, and even a 4-bit 8B model leaves surprisingly little headroom for long single-user context once the pool is carved out.
| Setup | Model VRAM | KV pool VRAM | Free for context |
|---|---|---|---|
| llama.cpp + Llama 3.5 8B q4_K_M | 4.5 GB | dynamic | ~7 GB → 24K+ context |
| vLLM + Llama 3.5 8B AWQ-int4 | 5 GB | 4 GB pool | ~3 GB → 4 concurrent users at 4K each |
| llama.cpp + Qwen 3.5 12B q4_K_M | 7.5 GB | dynamic | ~4 GB → 8–12K context |
| vLLM + Qwen 3.5 12B AWQ-int4 | 8 GB | 3 GB pool | ~1 GB → tight |
For long-context single-user work, llama.cpp clearly wins on a 12GB card. For multi-user shared deployments at 4K each, vLLM's pool is the point.
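The KV cache cost behind these rows follows from a standard formula: two tensors (keys and values) per layer, each sized by KV heads times head dimension, per token. The sketch below applies it to a hypothetical 12B-class shape with grouped-query attention (40 layers, 8 KV heads, head dimension 128, fp16 cache). Those numbers are assumptions for illustration, not the published configuration of any model named in this article.

```python
GIB = 1024 ** 3

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: keys plus values in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def kv_cache_gib(context_tokens, per_token_bytes):
    """Total KV cache size in GiB for a given number of cached tokens."""
    return context_tokens * per_token_bytes / GIB

# Hypothetical 12B-class architecture (assumed, see lead-in)
per_token = kv_bytes_per_token(n_layers=40, n_kv_heads=8, head_dim=128)

print(per_token)                            # 163840 bytes, i.e. 160 KiB per token
print(kv_cache_gib(16 * 1024, per_token))   # 2.5
```

Under these assumptions a 16K context costs 2.5 GiB, and four users at 4K each cost the same, since the cache scales with total cached tokens rather than with the number of sessions.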
Perf-per-dollar + perf-per-watt
Both backends saturate the same RTX 3060 12GB at the same ~170 W full load, so per-token energy differs only by the small single-user throughput gap. The bigger cost difference is operational complexity:
- llama.cpp is <code>cmake --build . && ./llama-server</code> and you're serving.
- vLLM is a Python environment, model conversion to AWQ/GPTQ, a config file, and a daemon.
For one developer on one card, the dollar value of operational simplicity is real.
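The energy side is simple arithmetic: watts divided by tokens per second gives joules per token. The sketch below assumes the card holds the ~170 W full-load figure quoted above throughout decoding and uses the midpoints of the Qwen 3.5 12B ranges from the benchmark table.

```python
def joules_per_token(board_watts, tok_per_s):
    """Energy per generated token at a steady decode rate."""
    return board_watts / tok_per_s

BOARD_WATTS = 170  # full-load figure quoted above

# Midpoints of the Qwen 3.5 12B ranges from the benchmark table
llama_cpp_j = joules_per_token(BOARD_WATTS, 37.0)
vllm_j = joules_per_token(BOARD_WATTS, 34.0)

print(round(llama_cpp_j, 2))  # 4.59
print(round(vllm_j, 2))       # 5.0
```

The difference is well under half a joule per token, which is why setup and maintenance effort dominates the cost comparison for a single developer.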
A budget single-user build pairs the MSI Ventus 2X RTX 3060 or ZOTAC Twin Edge OC with a WD Blue SN550 1TB NVMe for fast model loading and a Crucial BX500 1TB SATA SSD for archive.
Common pitfalls
- Picking vLLM for a single chat-window workload. Cold start is slower, idle VRAM is higher, and the throughput edge doesn't show up unless you're batching.
- Picking llama.cpp for a multi-user RAG pipeline. You'll serialize requests and hit p99 latency cliffs.
- Running fp16 on 12GB. Either backend will OOM on a 12B+ model at fp16. Stick to q4 or q5 (llama.cpp) or AWQ-int4 (vLLM).
- Forgetting context KV cost. A long 16K context can eat 2–3 GB of VRAM by itself on a 12B model.
- Not measuring under your real workload. Public benchmarks are useful for orientation but your prompt shape determines which backend wins on your card.
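Measuring your own workload does not need much tooling. Below is a minimal harness sketch that times any token stream and reports time to first token and decode rate. The <code>stream_tokens</code> callable is a hypothetical adapter you would write around your backend's streaming client; <code>fake_backend</code> is a stand-in so the example runs without a server.

```python
import time
from typing import Callable, Iterable

def measure_stream(stream_tokens: Callable[[str], Iterable[str]],
                   prompt: str,
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time one streamed response: token count, time to first token, decode rate."""
    start = clock()
    first = last = None
    count = 0
    for _ in stream_tokens(prompt):
        last = clock()
        if first is None:
            first = last
        count += 1
    ttft = None if first is None else first - start
    decode_s = 0.0 if first is None else last - first
    # Decode rate covers the tokens after the first one.
    rate = (count - 1) / decode_s if count > 1 and decode_s > 0 else None
    return {"tokens": count, "ttft_s": ttft, "decode_tok_per_s": rate}

def fake_backend(prompt: str):
    """Stand-in generator; replace with a wrapper around a real streaming client."""
    for word in prompt.split():
        yield word

print(measure_stream(fake_backend, "one two three"))
```

Run it with prompts shaped like your real traffic (length, context size, number of parallel callers) against both backends before committing to either.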
Verdict matrix
Pick llama.cpp if:
- You're one user, one chat window, one or two LLM-using tools.
- You swap models several times a day.
- You need to squeeze a 14B+ model onto a 12GB card via q3/q4 quantization.
- You're running Ollama (which is llama.cpp under the hood).
- You want the simplest possible operational profile.
Pick vLLM if:
- You serve more than one concurrent user or agent.
- You measure p50/p99 latency at concurrency > 1.
- You're building a small office LLM box or a RAG service.
- Your model is well-supported in AWQ-int4 or GPTQ.
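The two lists can be condensed into a small decision helper. This is only a sketch of the article's reasoning: the function and parameter names are invented for illustration, and the ordering (hard constraints before concurrency) is a design assumption, not something the cited sources prescribe.

```python
def pick_backend(concurrent_requests=1,
                 needs_sub_4bit_quant=False,
                 needs_cpu_gpu_split=False):
    """Return "llama.cpp" or "vLLM" for a 12GB single-GPU setup."""
    # Hard constraints first: per the comparison above, only llama.cpp
    # covers q2/q3 quantization and CPU+GPU split inference.
    if needs_sub_4bit_quant or needs_cpu_gpu_split:
        return "llama.cpp"
    # More than one in-flight request is where continuous batching pays off.
    if concurrent_requests > 1:
        return "vLLM"
    # One user, one chat window: the simpler, leaner default.
    return "llama.cpp"

print(pick_backend())                                               # llama.cpp
print(pick_backend(concurrent_requests=4))                          # vLLM
print(pick_backend(concurrent_requests=4, needs_sub_4bit_quant=True))  # llama.cpp
```

The third call shows the ordering at work: if the model only fits with a q3 quantization, concurrency does not change the answer on this card.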
Bottom line
For a 12GB RTX 3060 running single-user chat, llama.cpp (via Ollama or direct) is the right default — it's faster off the line, leaner on memory, and dramatically simpler to operate. vLLM is the right answer the moment you cross into multi-request serving. Don't pick vLLM for a chat companion; don't pick llama.cpp for a shared inference service.
Either way, the right card is a 12GB RTX 3060 — see the MSI Ventus 2X for the cheapest path in.
Related guides
- Ollama vs llama.cpp vs vLLM on the RTX 3060 12GB
- Is 12GB VRAM Still Enough for Local LLMs in 2026?
- vLLM vs Ollama on an RTX 3060 12GB: Which Server Wins?
- Q4_K_M Is Fine for Chat, a Trap for Agents
- Best SSD for a Local AI / LLM Workstation in 2026
Citations and sources
- <code>llama.cpp</code> on GitHub — engine source, discussion threads, GGUF spec
- vLLM project on GitHub — engine source, PagedAttention paper, benchmarks
- Artificial Analysis — model benchmarks referenced throughout
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
