For single-user local chat on a 12GB GPU like the RTX 3060 12GB, llama.cpp is the right runner. It loads quantized GGUF models in seconds, idles at near-zero VRAM cost, supports the broadest quant range, and ships with first-class CUDA, ROCm, Metal, and Vulkan backends. vLLM is the right runner for multi-user serving — its PagedAttention and continuous batching shine when many concurrent requests share a model — but those wins disappear when only one user is connected. Pick vLLM only if you're hosting an inference endpoint for a team or app; pick llama.cpp for everything else on a single-card desktop.
Why this comparison matters in 2026
The "what should I run my local LLM on" question has settled around two answers: llama.cpp and vLLM. Both are open-source, both run on a RTX 3060 12GB or equivalent 12GB GPU, and both have credible production deployments. The confusion is that they're optimizing for different workloads, and the marketing copy looks similar enough that first-time users assume they're substitutes.
They're not. llama.cpp is a single-binary runner that ingests a quantized GGUF file and serves it efficiently to one client at a time. vLLM is a Python serving stack with PagedAttention, continuous batching, prefix caching, and tensor parallelism — features that earn their keep when you have ten chats happening at once. For a single-user desktop on a 12GB card, those features are either inert or actively costly. This piece is the honest breakdown of when each is correct, with concrete throughput numbers from public benchmarks.
Key takeaways
- llama.cpp wins for single-user chat on a 12GB GPU on every metric that matters at home: startup time, idle VRAM, quant flexibility, and ergonomics.
- vLLM wins when you're serving many users at once — PagedAttention's memory efficiency and continuous batching unlock multi-tenant throughput llama.cpp can't match.
- vLLM's full-precision/16-bit weight assumption makes it a poor fit for 12GB cards beyond ~7B models without aggressive AWQ/GPTQ quantization.
- llama.cpp's GGUF quantization (q4_K_M, q5_K_M, q6_K) packs more model into a 12GB card than vLLM typically does.
- For most single-card local-AI builders in 2026, Ollama (a llama.cpp wrapper) is the simplest entry point.
What each runner actually is
llama.cpp started as a CPU-only C/C++ port of Meta's LLaMA model and grew into the de-facto cross-platform local-inference runtime. It loads quantized GGUF files (the format developed in the llama.cpp ecosystem), supports CUDA, ROCm, Metal, Vulkan, and SYCL backends, and runs as a small static binary or as a server (llama-server). The project's GitHub repository documents the supported quantization formats and the server reference.
vLLM is a Python-based serving stack out of UC Berkeley. Its headline contribution is PagedAttention, described in the original vLLM paper, which manages the KV cache in paged blocks the way an OS manages virtual memory. The result is dramatically better memory efficiency under concurrent requests, which translates to higher request throughput on a fixed GPU. vLLM also implements continuous batching, prefix caching, and tensor parallelism, and supports OpenAI-compatible HTTP endpoints out of the box.
The structural difference: llama.cpp is a runner optimized for one client at a time; vLLM is a serving stack optimized for many.
Spec-delta table
| Dimension | llama.cpp | vLLM |
|---|---|---|
| Primary use case | Single-user local chat / agent | Multi-tenant serving (apps, teams) |
| Model format | GGUF (quantized) | Hugging Face safetensors (typically FP16/BF16; AWQ/GPTQ available) |
| Startup time on a cold model | Seconds | Tens of seconds (PyTorch + model load) |
| Idle VRAM footprint | Low — model + small KV cache | High — preallocated KV cache pool |
| Best quantization range | q2_K through q8_0 + FP16 | AWQ 4-bit, GPTQ 4/8-bit, FP16 |
| Concurrency win | Marginal | Large (PagedAttention + continuous batching) |
| OpenAI-compatible API | Yes (llama-server) | Yes (built-in) |
| Hardware coverage | CUDA, ROCm, Metal, Vulkan, SYCL, CPU | CUDA-first; ROCm support improving |
Single-user throughput — what to expect on a 12GB card
For Llama 3.1 8B class models on an RTX 3060 12GB, public community measurements compiled on r/LocalLLaMA and the llama.cpp project's performance discussions consistently put llama.cpp q4_K_M throughput in the 35–55 tok/s range for single-user chat, with prompt-eval (prefill) several times faster than generation. The model fits with comfortable headroom for an 8K context window.
vLLM on the same card with the same model in AWQ 4-bit also lands in a comparable range for single-request throughput — vLLM's advantage isn't single-request latency, it's request scheduling under load. With one user and one request at a time, you're paying for the PagedAttention machinery without using it.
The real divergence shows up in idle behavior. llama.cpp idles at the size of the model plus a small KV cache — maybe 5–6 GB resident on a 7B q4_K_M model. vLLM preallocates a much larger KV cache pool at startup (its --gpu-memory-utilization default is 0.9, meaning 90% of VRAM is claimed up front). On a 12GB card this is fine for the model you loaded, but it means you can't run other workloads (a small embedding model, a Whisper instance, an SDXL pipeline) on the same card without restarting vLLM.
Concurrency — where vLLM earns its complexity
The real reason vLLM exists is continuous batching with PagedAttention, and that benefit only materializes under concurrent load. The vLLM paper reports 2–24× throughput uplifts on multi-user benchmarks compared to naive batched serving.
For a single user on a desktop, you're never queuing requests behind each other — by the time you've finished reading a reply, you've forgotten there was a queue. So you don't see those uplifts. You see the overhead: longer startup, larger memory baseline, a Python serving stack instead of a static binary, and a CUDA-first hardware footprint that excludes Macs and AMD-Vulkan users.
If you're hosting an internal chat endpoint for a small team — say, 5–20 colleagues hitting the same model — vLLM starts to pay off. The KV cache packing means more users fit in 12GB, and continuous batching keeps the GPU saturated. At that point, you're also probably outgrowing a single RTX 3060 12GB and should think about a 16GB or 24GB card and a proper CPU + cooler on the host — workstation territory.
Quantization fit on a 12GB card
The llama.cpp quantization range is the broader of the two. Practical guidance for Llama 3.1 8B on a 12GB card:
| Quant | VRAM (7B/8B model) | Quality vs FP16 | Notes |
|---|---|---|---|
| q4_K_M | ~5.5 GB | Near-FP16 | Sweet spot for 12GB cards |
| q5_K_M | ~6.5 GB | Very low loss | Quality bump with VRAM to spare |
| q6_K | ~7.5 GB | Effectively FP16 | Larger context fits comfortably |
| q8_0 | ~9 GB | Indistinguishable from FP16 | Limits context window |
| FP16 | ~14 GB | Reference | Does not fit a 12GB card |
vLLM with AWQ 4-bit on a 7B/8B model lands at roughly comparable VRAM to llama.cpp q4_K_M, but with less granularity in quant choices and a heavier serving overhead. vLLM also offers FP16 and BF16 paths for users with larger cards, but those don't fit on a 12GB.
For 13B-class models, the llama.cpp story is "q4_K_M fits with limited context"; the vLLM story is "AWQ 4-bit fits but you're tight against the PagedAttention pool." Both are doable, neither is comfortable.
Real-world workflow gotchas
Three places llama.cpp wins in day-to-day use:
- Model swap is fast.
ollama run llama3.1:8bto switch from one model to another takes a few seconds. vLLM requires a full process restart for a model swap. - Multi-modal stacks share the GPU. If you also run Stable Diffusion or Whisper on the same card, llama.cpp's lower idle VRAM lets the workloads coexist.
- Quant experimentation is one download away. Hugging Face hosts every popular GGUF quant for every popular model. Swapping q4_K_M for q5_K_M is one
ollama pullaway.
Three places vLLM wins:
- Throughput under contention. When ten chats hit the same model simultaneously, vLLM keeps the GPU saturated while llama.cpp serializes.
- OpenAI-compatible API maturity. vLLM's OpenAI endpoint is a closer drop-in for production clients than llama-server's, with fewer compatibility gotchas.
- Production observability. Prometheus metrics, request tracing, and a more "stack"-shaped deployment story.
Verdict matrix
| Use llama.cpp if… | Use vLLM if… |
|---|---|
| You're the only user on the rig | You're hosting an endpoint for a team |
| You want fast model swaps and idle-VRAM headroom | You need maximum throughput under concurrent load |
| You're on a Mac, AMD Vulkan, or Intel Arc | You're on NVIDIA CUDA in production |
| You want a static binary, not a Python stack | You're comfortable running Python services with PyTorch |
| You need q4_K_M, q5_K_M, q6_K flexibility | You're shipping FP16/BF16 or AWQ-quantized models |
Common pitfalls
- Don't run vLLM on a 12GB card for a single-user setup. You'll pay the overhead and get no concurrency benefit.
- Don't fight the PagedAttention pool. vLLM is happiest when it owns most of the GPU's VRAM; if you need to share VRAM with other workloads, use llama.cpp.
- Don't expect vLLM AWQ to outperform llama.cpp q4_K_M on single-user latency. They're in the same neighborhood; choose by workflow, not by marginal tok/s.
- Don't ignore the cooler and CPU on the host. Even GPU-bound inference benefits from a stable host — a six-to-eight-core CPU with a good air cooler and a well-built 8-core like the Ryzen 7 5800X eliminates host-side stalls.
Real-world setup walkthroughs
A concrete comparison helps. Here's what bringing up each runner looks like on a fresh Ubuntu 24.04 box with an RTX 3060 12GB, going from "blank install" to "first reply."
llama.cpp via Ollama (the easy path): install Ollama with the upstream curl script, ollama pull llama3.1:8b-instruct-q4_K_M to grab the model (~4.7 GB download), ollama run llama3.1:8b-instruct-q4_K_M and start typing. End-to-end on a 1 Gbps connection: roughly 8–10 minutes, most of which is the model download. The OpenAI-compatible HTTP server at localhost:11434 is already running — point a client like Continue.dev or Open WebUI at it and you're done.
vLLM (the heavier path): install Python 3.10+, create a virtualenv, pip install vllm (which pulls a multi-gigabyte CUDA-enabled PyTorch stack), then download a Hugging Face model in safetensors or AWQ form. For a 12GB card, you want an AWQ 4-bit quant — community-built quants of Llama 3.1 8B are on the Hub. Launch with python -m vllm.entrypoints.openai.api_server --model <path> --quantization awq --gpu-memory-utilization 0.85. First boot takes 30–60 seconds before the server is ready. The endpoint behaves like OpenAI's /v1/chat/completions and /v1/completions.
The "first reply" wall-clock difference is roughly 12 minutes for llama.cpp vs 25–40 minutes for vLLM the first time you do it. After that, the difference is in model swap latency (llama.cpp wins by a wide margin) and concurrent request behavior (vLLM wins).
When to combine them
A pattern worth knowing: run llama.cpp for your interactive desktop chat, and stand up vLLM only when you need to serve an internal endpoint or a benchmarking job. They don't fight on the same machine if you orchestrate them — the issue is just GPU memory pressure when both are running. On a 12GB card you almost certainly want to pick one and stick with it.
If you're building an agentic system with many parallel tool-call streams (a SWE-bench harness, a parallel research agent, a batch evaluation), vLLM's continuous batching keeps your GPU saturated where llama.cpp would serialize. That kind of workload is exactly the multi-user case in disguise — the "users" are agent threads.
Bottom line
For single-user local chat on a 12GB GPU, llama.cpp is the answer. It's smaller, faster to start, more flexible in quantization, and runs on every backend you might switch to. vLLM is the right pick when you're hosting an endpoint with concurrent users, and at that point your hardware needs to scale up too. Match the runner to the workload — not to the runner's marketing.
Citations and sources
- llama.cpp — GitHub repository
- llama.cpp server example documentation
- llama.cpp — performance discussions
- vLLM — GitHub repository
- vLLM paper — Efficient Memory Management for Large Language Model Serving with PagedAttention
- Ollama — official site
- r/LocalLLaMA — community subreddit
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
