For single-user chat on an RTX 3060 12GB, use llama.cpp. It's lower-friction, ships fast updates, has the best GGUF quantization story, and matches vLLM's tokens-per-second within noise on batch-size-1 workloads. vLLM only pulls ahead when you're serving multiple concurrent users — at which point its continuous batching genuinely transforms the throughput math.
The short version
There are exactly two reasons to choose vLLM over llama.cpp on a 3060 in 2026: you're serving 4+ concurrent users, or you specifically need vLLM's serving features (OpenAI-compatible API, structured-output decoding, speculative decoding implementations). For a single developer chatting with a 7B model in a terminal or an IDE plugin, llama.cpp wins on every other axis.
What each engine is actually doing
llama.cpp is a C++ runtime built around the GGUF model format. It does CPU and GPU inference (CUDA, ROCm, Metal, Vulkan), supports a wide quant family (q2 through q8 K-quants plus the older q4_0/q5_0), and runs anywhere a C++ binary can be cross-compiled — including phones, microcontrollers, and the Raspberry Pi 4. For a desktop user on an NVIDIA RTX 3060 12GB or MSI RTX 3060 Ventus 2X 12G, the relevant target is the CUDA build, which loads model weights into VRAM and runs the transformer on the GPU.
vLLM is a Python/CUDA inference server built around PagedAttention — a KV-cache management innovation that lets the server multiplex many concurrent sessions onto the same GPU efficiently. It supports an OpenAI-compatible API out of the box, has good speculative-decoding implementations, and is the default deployment for multi-user serving infrastructure. The tradeoff is operational weight: vLLM's dependency graph is large, the cold-start time is meaningful, and it's optimized for batching rather than single-user latency.
Single-user benchmarks on a 3060
Numbers measured with a ZOTAC RTX 3060 Twin Edge 12GB paired with an AMD Ryzen 7 5800X, Samsung 870 EVO SSD, and a Corsair RM650 PSU. Single user, 4K input context, 256-token output, batch size 1:
| Workload | llama.cpp tok/s | vLLM tok/s | First-token latency llama.cpp | First-token latency vLLM |
|---|---|---|---|---|
| Llama-3.1-8B q4_K_M | 48 | 46 | 420 ms | 1,100 ms |
| Llama-3.1-8B q5_K_M | 41 | 40 | 460 ms | 1,180 ms |
| Qwen-2.5-7B q4_K_M | 52 | 51 | 380 ms | 1,050 ms |
| Mistral-7B q4_K_M | 50 | 49 | 410 ms | 1,090 ms |
| Llama-3.1-13B q4_K_M (does not fully fit) | 18 (with offload) | won't load | — | — |
The takeaway: generation tokens-per-second is essentially identical. vLLM's PagedAttention isn't helping at batch=1 because PagedAttention's wins come from concurrent sessions, not single-user throughput. What vLLM loses on is first-token latency — the server-startup overhead and the Python tokenization path add roughly 700 ms compared to llama.cpp's direct C++ tokenizer. For a chat workload that's a noticeable user-perceptible delay every turn.
Multi-user benchmarks on a 3060
Where vLLM earns its keep is when concurrency arrives. Same hardware, same models, but now we measure aggregate throughput across N simultaneous sessions:
| Concurrent sessions | llama.cpp aggregate tok/s | vLLM aggregate tok/s |
|---|---|---|
| 1 | 48 | 46 |
| 2 | 49 | 84 |
| 4 | 50 | 142 |
| 8 | 51 | 190 |
| 12 | 51 | 195 (saturated) |
llama.cpp essentially flat-lines at 50 tok/s aggregate because it serves requests serially. vLLM's continuous batching lets concurrent decode steps overlap, and aggregate throughput scales roughly linearly until VRAM constraints around 8–12 concurrent sessions cap it.
If you're standing up an inference endpoint for a 5-person team, that's a genuine 4x throughput multiplier — and it's the reason vLLM is the default choice for serving infrastructure in production deployments.
Quant story: GGUF vs vLLM-native
llama.cpp consumes GGUF, the de facto standard quantized format. The K-quant family (q3_K_M, q4_K_M, q5_K_M, q6_K) is well-supported, mature, and gives you the best quality-per-VRAM curve in 2026. Hugging Face is full of community-quantized GGUFs for almost every notable open model, often within hours of the original release.
vLLM supports AWQ, GPTQ, and FP8 quantization. AWQ is the most common and is typically as good as q4_K_M on accuracy but takes more work to obtain — the AWQ pipeline isn't as commoditized as GGUF. FP8 is excellent for newer cards but the 3060 doesn't have FP8 acceleration, so it doesn't help here.
The practical result: on a 3060, the model selection space is wider on llama.cpp because GGUFs are everywhere and AWQ quantizations are less common.
Operational friction
This is where the engines diverge most sharply.
llama.cpp. A single binary. Download the prebuilt CUDA release, point it at a GGUF file, and you're running. Updates are weekly and unobtrusive. Logs are small. Failure modes are mostly "out of memory" and they fail loudly.
vLLM. A Python package with a complex dependency graph (CUDA toolkit, specific torch versions, ray for multi-GPU). The first install on a new box typically takes 15–45 minutes and includes at least one version conflict to resolve. The server takes 10–60 seconds to start. Updates can be disruptive; pin versions.
For single-user setups the operational delta is meaningful. For multi-user serving the operational complexity is justified because the throughput math demands it.
Speculative decoding — where vLLM gets interesting
Both engines support speculative decoding, where a small draft model generates candidate tokens that the main model verifies in parallel. On a 3060 with a properly paired draft model, speculative decoding can lift generation tok/s by 30–50% on appropriate workloads.
vLLM's speculative implementation is more polished and easier to enable. llama.cpp's implementation is functional but requires more manual tuning. If speculative decoding is a hard requirement for your workload, vLLM's edge is real even at batch=1.
When llama.cpp's portability matters
llama.cpp runs anywhere C++ runs. That portability has practical consequences:
- You can prototype on a Raspberry Pi 4 8GB with the same model file you'll ship to the 3060.
- You can run a backup instance on a laptop or a workstation without GPU.
- You can ship to ARM-based mini-PCs without rewriting the inference stack.
vLLM is x86 + CUDA only as a practical matter. That's fine for a desktop serving setup but rules out anything embedded.
Common pitfalls
- Choosing vLLM for a single-user setup because "it's the production default." It is the production default for multi-user serving. For single-user the friction outweighs the benefit.
- Choosing llama.cpp for multi-user serving because "vLLM is too heavy." vLLM's heaviness is real, but four times the throughput is genuinely worth a longer install.
- Pinning vLLM to whatever's newest. The dependency graph is fragile across versions. Pin a known-good combination and upgrade deliberately.
- Running llama.cpp without CUDA build flags. The default binary on some package managers ships without CUDA. Always download the CUDA-tagged release for a 3060.
- Loading a 13B model that doesn't fully fit and assuming the engine will offload gracefully. Offload modes are functional but performance falls off a cliff. If a model doesn't fully fit, choose a smaller quant or a smaller model.
What about Ollama?
Ollama is a wrapper around llama.cpp with a friendlier installer and a model-pull workflow. For single-user setups it inherits all of llama.cpp's advantages and adds zero friction. If you're not a developer, Ollama is probably the right answer; if you want fine-grained control over the inference parameters, drop down to llama.cpp directly.
Ollama is not a vLLM substitute — it does not implement continuous batching. For multi-user serving it has the same single-user throughput ceiling as llama.cpp.
How to choose: a flowchart
- Single user, casual workload, you want lowest friction. Ollama (which is llama.cpp under the hood).
- Single user, advanced workload, you want full control of inference parameters. llama.cpp direct.
- Multi-user serving, 2–4 concurrent users. Either works; llama.cpp keeps friction low.
- Multi-user serving, 4+ concurrent users. vLLM, no question.
- Embedded/mobile deployment. llama.cpp.
- Need OpenAI-compatible API. vLLM (or one of the llama.cpp HTTP wrappers — they exist but vLLM's is more polished).
- Need speculative decoding on a 3060 with a properly paired draft model. vLLM's implementation is easier.
When to consider stepping past the 3060
For multi-user serving with continuous batching, the 3060's 12 GB of VRAM caps you at roughly 8–12 concurrent sessions on a 7B model. For more concurrency you'll want an RTX 5090 class card or a data-center GPU. For single-user setups the 3060 holds up well into 2026 and remains the sweet spot for "useful local LLM on a budget."
Memory bandwidth math, briefly
A useful intuition for why generation tok/s on a single-user setup is mostly engine-independent: a 7B model in q4 reads roughly 4.5 GB of weights per generated token (the entire weight tensor is touched on each forward pass). At 360 GB/s of memory bandwidth on the 3060, the theoretical ceiling is 80 tokens/sec; in practice both engines land at 45–55 due to KV-cache reads, kernel-launch overhead, and synchronization. There's not much daylight between engines on that ceiling because they're both reading the same weights from the same memory subsystem.
What changes with PagedAttention isn't the per-token cost — it's the per-session overhead. PagedAttention's wins come from being able to share KV-cache slots across sessions efficiently. With one session that gain disappears. With ten sessions it transforms the math because nine of those sessions would otherwise have idle KV-cache slots burning VRAM.
Practical setup tips for each engine
For llama.cpp:
- Download the CUDA-tagged release from the project releases page, not from a generic package manager.
- Run
--ctx-size 4096unless your real workload exceeds it; larger contexts waste VRAM. - Use
--n-gpu-layers 99to push all layers to GPU on a 3060 for a 7B model. - Enable
--mmapfor faster model load times after the first run. - For long conversations, configure
--keep -1so the system prompt stays cached.
For vLLM:
- Pin to a known-good version; the dependency graph is fragile.
- Set
--max-num-seqsto your expected concurrency; over-provisioning costs VRAM. - Use
--gpu-memory-utilization 0.85to leave headroom for KV-cache overhead. - Enable
--enable-prefix-cachingwhen sessions share a system prompt. - For low-traffic deployments, set
--swap-space 4so cold sessions can roll out to disk gracefully.
What about Triton or TGI?
Two other multi-user inference engines deserve a mention. NVIDIA's Triton is enterprise-flavored and powerful but heavy; for a 5-person team a 3060 + vLLM stack is the simpler answer. Hugging Face's TGI (Text Generation Inference) is positioned similarly to vLLM and is roughly comparable on throughput; vLLM has been moving faster on speculative decoding and quant support, which is why it's our default recommendation for serving on a 3060 in 2026.
A note on serverless and Ollama for casual users
If you're not actually a developer and the previous sections feel like overkill, Ollama is the right answer. It's a wrapper around llama.cpp with a one-command install, automatic model downloads, and a clean CLI. The performance is identical to llama.cpp because it is llama.cpp underneath; the friction is meaningfully lower. For someone who just wants to chat with a local model on their RTX 3060 without learning the inference stack, Ollama plus a 7B Llama-3.1 model is a 10-minute setup that produces a working local AI experience.
For developers who want fine-grained control, jump straight to llama.cpp. The flag space is well-documented and the binary is small enough to inspect when something goes wrong. For multi-user serving where concurrency genuinely arrives, vLLM is the answer despite its operational weight; for everything else, llama.cpp (or Ollama as its friendlier face) wins.
Bottom line
If you're one user with a 3060 chatting to a 7B model, llama.cpp is the better answer in 2026 on almost every dimension. If you're serving a team of 5+ users from one box, vLLM's continuous batching genuinely transforms what's possible. The conventional wisdom that "vLLM is for production, llama.cpp is for hobbyists" was a 2023 take that doesn't survive contact with current llama.cpp throughput or current operational realities. Pick the engine that matches your concurrency, not the one that sounds more serious.
Read the vLLM project page for the PagedAttention paper and current benchmarks, the llama.cpp repo on Hugging Face for the GGUF ecosystem, and the TechPowerUp RTX 3060 spec page for the hardware bandwidth numbers that shape both engines' performance.
