For a single user running local chat on a 12GB RTX 3060, use llama.cpp. It's the easier install, runs natively on Windows, supports the GGUF quant ecosystem that fits 12GB comfortably, and matches or beats vLLM on tokens-per-second when there's exactly one user — because vLLM's main edge, continuous batching, only pays off with concurrent requests.
The llama.cpp-vs-vLLM debate gets the most attention in production deployments where vLLM's batching, paged-attention, and quantization-aware kernels dominate any sensible benchmark. That story changes completely for a single-user desktop. With one prompt at a time, the inference engine's job collapses to "run one token, then the next" — and that's exactly the workload llama.cpp was designed for. Builders coming in from a cloud-deployment mindset routinely pick vLLM out of habit and then complain about the install, the missing Windows support, and the Linux-only AWQ tooling. They shouldn't.
Key takeaways
- For single-user 12GB chat: use llama.cpp. Easier install, full Windows support, GGUF quant ecosystem, equivalent or better single-stream throughput.
- For multi-user or production serving: use vLLM. Continuous batching is the unbeatable feature once you have >2 concurrent users.
- Quant story: llama.cpp owns GGUF; vLLM owns AWQ + GPTQ. Both cover the 4-bit space; the practical difference is tooling friction.
- OS: llama.cpp builds cleanly on Windows, macOS, and Linux. vLLM is Linux-first; Windows support is via WSL2 and feels like it.
- Maintenance: llama.cpp has a smaller dependency surface and a faster startup. vLLM ships a Docker image that's easier to redeploy at scale.
Spec-delta table: llama.cpp vs vLLM on a 12GB card
| Axis | llama.cpp | vLLM |
|---|---|---|
| VRAM model | KV cache + weights (predictable) | PagedAttention (denser, fewer fragments) |
| Batching | Single request at a time | Continuous batching across concurrent requests |
| Quant formats | GGUF (q2-q8, multiple K variants) | AWQ, GPTQ, FP8 (Marlin kernels) |
| Setup | One binary, no Python required | Python venv, CUDA toolkit, larger dep tree |
| OS | Linux, macOS, Windows native | Linux-first; Windows via WSL2 |
| Startup time (8B model) | 1-3 s warm, 5-10 s cold | 20-40 s cold (CUDA graph capture) |
| Concurrent-user benefit | None — same speed | Huge — near-linear throughput up to batch size |
| GPU offload to CPU | Yes, partial layer offload | Limited — designed for full-GPU residency |
Which is faster for one user at a time?
The single-user numbers are closer than people expect. On an RTX 3060 12GB, with an 8B model at q4_K_M:
| Engine | Model | Tok/s (single user) |
|---|---|---|
| llama.cpp (CUDA) | Llama 3.1 8B q4_K_M | 55-72 |
| vLLM (Marlin) | Llama 3.1 8B AWQ-4bit | 58-75 |
| llama.cpp (CUDA) | Qwen3 14B q4_K_M | 30-42 |
| vLLM (Marlin) | Qwen3 14B AWQ-4bit | 32-44 |
| llama.cpp (CUDA) | Qwen3 32B q3_K_M | 8-14 (partial offload) |
| vLLM (Marlin) | Qwen3 32B AWQ-4bit | OOM at 32B on 12GB |
vLLM has a slight edge on the same model size because of its more aggressive kernel work, but the gap is rarely larger than 10% for single-user use. The bigger story is the 32B row: llama.cpp's partial-offload support lets you fit a 32B model on 12GB with real tokens-per-second; vLLM is built around full-GPU residency and will simply OOM. That's a serious consideration if you want to host the largest model your hardware can handle.
Why vLLM's batching advantage disappears with one user
vLLM's signature feature is continuous batching: when one user's request is mid-generation, vLLM can slot in a second user's prompt and feed both through the GPU in the same forward pass. The GPU's tensor cores spend roughly the same time per pass regardless of batch size up to a point, so a 4× batched workload gives nearly 4× total throughput. Production deployments serving multiple concurrent users live or die by this property.
With one user, there's nothing to batch. The GPU does one token, returns, does the next. vLLM's complex scheduler is doing zero useful work. llama.cpp's simpler single-request path runs the same kernel without the scheduling overhead. The numbers reflect that.
If you're building a personal chat rig, the absence of a batching benefit means there's no engineering reason to take on vLLM's setup cost. If you're building something that needs to serve five people from one card simultaneously, the calculation inverts.
Quantization: GGUF vs AWQ/GPTQ on 12GB
GGUF is llama.cpp's native format. It ships in q2 through q8 variants with K-quant tweaks; the ecosystem has stabilized around q4_K_M and q5_K_M as the standard interactive-chat picks. Hundreds of pre-quantized GGUFs ship on Hugging Face for every popular model within days of release. Practically, "find the GGUF on the Bartowski or Quant Factory repo" is the standard workflow.
AWQ and GPTQ are vLLM's preferred quants. Both are 4-bit, both pack weights efficiently for GPU residency, and both produce slightly faster kernels than GGUF q4 on supported GPUs (Marlin kernels in particular are the speed leaders on Ampere and Ada). The tooling is more involved: you typically wait days longer for a community AWQ to appear, and producing your own is a Linux-only Python workflow with calibration steps.
For a 12GB card, both formats fit the same model sizes (4-bit is 4-bit). The choice mostly comes down to which engine you're running.
Setup and maintenance friction
llama.cpp ships as a single statically-linked binary on each platform. You download it, you run it, you're done. Want server mode? Pass --server. Want OpenAI-compatible endpoint? Built in. Want a different model? Point it at the GGUF file. There's no Python environment to maintain, no CUDA toolkit version drift to debug.
vLLM ships as a Python package that pulls in PyTorch, the appropriate CUDA toolkit, NVIDIA's NCCL, xFormers, and a half-dozen other accelerated-attention dependencies. The recommended path is the Docker image, which works well but adds container overhead. A working vLLM install on bare metal involves a CUDA version match dance that bites first-time users. The Phoronix coverage of vLLM benchmarks is excellent on the deployment-side considerations.
For a single-user desktop where you're going to swap models occasionally and want the setup to "just work," llama.cpp is dramatically less painful.
Verdict matrix
Get llama.cpp if you're a single user, want native Windows support, want minimal install friction, want partial CPU offload for models larger than your VRAM, want to use the GGUF ecosystem, value a small dependency surface, or are running on a Mac with Metal acceleration.
Get vLLM if you're serving multiple concurrent users, deploying to a Linux server in production, comfortable with Python deployment, want the absolute fastest kernels (Marlin) for AWQ-quant models, building an internal API serving a team, or you specifically need PagedAttention's fragmentation behavior at high concurrency.
Recommended pick for a single-user RTX 3060 rig
Use llama.cpp. Build it from source or grab the prebuilt binary, download a GGUF (Qwen3 8B q5_K_M is a great default), and start the server mode with --port 11434 --n-gpu-layers 99. You're done. Point Open WebUI, Anything LLM, Aider, Cline, or any other OpenAI-compatible client at http://localhost:11434/v1. Total setup time: 10 minutes. Total dependency maintenance going forward: download the next GGUF when you want a different model.
Perf-per-watt note
The single-stream comparison also matters for power. An RTX 3060 at ~170W during generation consumes roughly the same power on either engine — the GPU is the dominant load and tokens-per-second-per-watt scales with raw throughput. The differentiator is idle: vLLM keeps more of the model graph warm in GPU memory and runs background scheduler threads even when idle, while llama.cpp drops to a few watts between requests. For an always-on personal rig, that idle delta adds up to 1-2 kWh/month — not huge, but worth noting for builds that double as 24/7 home assistants.
Bottom line
For a single user on a 12GB GPU, llama.cpp wins on every axis that matters to a desktop builder: setup ease, native Windows support, GGUF ecosystem, partial-offload flexibility, equivalent single-stream throughput. Save vLLM for the day you decide to host a model for your team. The mistake to avoid is assuming the engine that's best-in-class for production is also best-in-class for the desk — it almost never is.
Frequently asked questions
Is vLLM faster than llama.cpp for one user?
Marginally, sometimes — usually within 5-10% on the same model size when both engines run the same 4-bit quant on a 12GB card. The gap closes once you account for llama.cpp's partial-CPU-offload support, which lets you fit a 32B model on 12GB with real tokens per second. vLLM's true speed advantage shows up only with concurrent users and continuous batching, which a single-user desktop never triggers. For one user at a time the engines are effectively a tie on raw throughput.
Which engine handles 12GB VRAM better?
For full-GPU residency of small-to-medium models (7B-14B), both are fine. For models larger than your VRAM, llama.cpp wins because it supports partial layer offload — you can run a 32B model with most layers on GPU and the rest on CPU, taking a throughput hit but still getting useful tokens per second. vLLM is built around fully-resident models and will OOM rather than spill, which is the correct tradeoff for production but the wrong one for a desktop that wants to push past its VRAM ceiling.
Does vLLM run on Windows?
Not natively. The supported path is WSL2 on Windows, which works but adds a Linux VM layer to maintain and complicates GPU passthrough configuration. llama.cpp ships as a native Windows binary with CUDA support and runs out of the box on a default Windows 11 install with NVIDIA drivers. For a Windows desktop builder, the OS-support gap alone is decisive.
Which quantization formats does each support?
llama.cpp uses GGUF in q2-q8 variants, including K-quants and the newer i-quants. The Hugging Face ecosystem ships pre-quantized GGUFs for nearly every model within days of release. vLLM uses AWQ and GPTQ (both 4-bit) with the Marlin kernels giving the fastest single-stream throughput on Ampere/Ada GPUs. Both cover the practical 4-bit space on 12GB; the friction difference is in tooling — building a GGUF takes minutes, building an AWQ takes a Linux Python workflow with calibration data.
Can a weaker CPU bottleneck either engine?
Mostly no for full-GPU inference — the CPU's job is tokenization, sampling, and orchestration, all of which are cheap. The exception is llama.cpp's partial-CPU-offload mode: when layers spill to system RAM, the CPU's memory bandwidth becomes the bottleneck and a weak CPU can drop tokens per second by 30-50%. A Ryzen 7 5700X or 5800X handles either engine without bottlenecking. A four-core or older budget CPU is fine for vLLM full-GPU work but slow for llama.cpp offload.
Citations and sources
- llama.cpp on GitHub — single-binary local LLM inference engine
- vLLM on GitHub — high-throughput LLM serving for production
- Phoronix — vLLM benchmark coverage across hardware tiers
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
