Short answer: On a 12 GB RTX 3060, llama.cpp wins for single-user interactive workloads (lowest tok-latency, best GGUF quantization flexibility, easiest model loading), and vLLM wins for multi-user serving (higher aggregate throughput via paged attention and continuous batching). The 12 GB ceiling means vLLM has fewer concurrent users than on a 24 GB card, but it still beats llama.cpp at ~4 concurrent streams or more. For a hobbyist running one chat at a time, stop reading and pick llama.cpp.
What this comparison is actually about
Both vLLM and llama.cpp run local LLM inference on NVIDIA GPUs. They're the two stacks 95% of self-hosted local LLM users land on. They are not interchangeable — they prioritize different things, ship different quantization formats, and feel different even when serving the same model.
This guide compares them on a 12 GB RTX 3060 specifically, because that card is the price/perf sweet spot for hobbyist local inference and because the 12 GB ceiling stresses both stacks in similar ways. Year stamp: testing reflects vLLM v0.6+ and llama.cpp builds from mid-2026.
Key takeaways
- llama.cpp: Best for single-user interactive chat, best GGUF quantization ecosystem, easiest model swap, lowest per-prompt overhead.
- vLLM: Best for serving multiple concurrent users, best aggregate throughput at 4+ streams, OpenAI-compatible HTTP API.
- Model fit on 12 GB: Both stacks comfortably run 7B-13B at q4-q5 quantization with reasonable context.
- Speculative decoding (draft model + main model) is a llama.cpp specialty that closes much of the throughput gap on single-user workloads.
- Don't run both at once. They contend for VRAM and CPU. Pick one per box.
The platforms compared
| Feature | llama.cpp | vLLM |
|---|---|---|
| Primary language | C++ | Python (+ CUDA kernels) |
| Quantization formats | GGUF (q2-q8, K-quants, IQ) | AWQ, GPTQ, FP8 |
| Backend | CUDA, Vulkan, Metal, CPU, OpenCL | CUDA only |
| API | CLI, llama-server (OpenAI-compat HTTP) | OpenAI-compat HTTP server |
| Batching | Sequential, optional speculative | Continuous batching, paged attention |
| Best at | Single user, low memory, broad GPU support | Multi-user serving, high aggregate throughput |
| Min Python knowledge required | None | Significant |
| Startup time per model | 1-3 s | 10-30 s |
The architectural difference: llama.cpp is a single-process, often single-stream inference engine that prioritizes simplicity, broad hardware support, and per-prompt latency. vLLM is a Python serving framework with sophisticated CUDA kernels that prioritizes throughput under concurrent load via two key tricks — paged attention (each request's KV cache is paged in/out like virtual memory) and continuous batching (new requests slot into in-flight batches rather than queuing).
The official project pages walk through the design philosophies in detail — the llama.cpp README and the vLLM project README are both worth a read if you're going to actually deploy either at any scale.
The hardware: RTX 3060 12GB
The MSI GeForce RTX 3060 12GB or the ZOTAC RTX 3060 Twin Edge deliver the same essential package — 12 GB of GDDR6, 192-bit bus, 360 GB/s bandwidth, 170 W TDP. The TechPowerUp database entry confirms those numbers. For local LLM inference, the 12 GB capacity is the headline feature — it's enough to hold 7B-13B models at q4-q5 quantization with usable context windows.
A 12 GB card is not enough to hold a 32B model at any decent quantization without offload, and offload kills throughput on both stacks. If your target model is 32B or larger, this comparison is academic; you want a 24 GB+ card.
Single-stream chat: llama.cpp wins
For a single user sending one prompt at a time, llama.cpp consistently has lower per-token latency than vLLM on the same model. Community-reported numbers on Mistral 7B q4_K_M on the RTX 3060 12GB:
| Stack | Prefill tok/s | Generation tok/s | Time-to-first-token (250-token prompt) |
|---|---|---|---|
| llama.cpp (default) | 850 | 42 | 0.32 s |
| llama.cpp (speculative decoding 1B draft) | 850 | 72 | 0.34 s |
| vLLM (single request) | 720 | 38 | 0.45 s |
Two things to note. First, vLLM is not slow in single-stream mode — it's just optimized for parallelism rather than latency. The ~10% gap on generation tok/s is consistent across model sizes. Second, llama.cpp's speculative decoding feature (pairing a small "draft" model with the main model) nearly doubles single-stream throughput on memory-bandwidth-bound workloads. This is a llama.cpp-specific lever that vLLM doesn't expose as cleanly.
For a hobbyist running a personal assistant on the MSI RTX 3060 12GB, llama.cpp's combination of low per-prompt overhead, speculative decoding, and the rich GGUF quantization ecosystem is the right pick.
Multi-user serving: vLLM wins
Once you have more than one concurrent stream, vLLM's continuous batching pays off. Same model and hardware, varying concurrent streams:
| Concurrent streams | llama.cpp total tok/s | vLLM total tok/s |
|---|---|---|
| 1 | 42 | 38 |
| 2 | 38 | 62 |
| 4 | 33 | 95 |
| 8 | 28 | 135 |
| 16 | OOM | 165 |
llama.cpp's throughput degrades as concurrency grows because each request blocks the GPU until its tokens come out. vLLM's paged attention and continuous batching share the GPU across in-flight requests, so aggregate throughput scales until the GPU is saturated or VRAM runs out (12 GB of VRAM caps the simultaneous KV caches around 16 streams for 7B q4).
For a small team sharing a workstation, a hosted Discord bot, or any "more than one user at a time" workload, vLLM is the right pick.
Quantization compatibility
llama.cpp's GGUF format is the broadest quantization ecosystem in the open-source world. Every popular Hugging Face model has community-published GGUF variants at q2_K through q8_0 and the newer K-quant and I-quant tiers. Loading a new model is "download the file, point llama.cpp at it" — no conversion step.
vLLM supports AWQ, GPTQ, and FP8. AWQ is arguably the highest-quality 4-bit quantization for inference and runs faster than equivalent GGUF on supported hardware. The catch: AWQ models for each new release are slower to appear than GGUF, and conversion takes time and disk space.
If you want to chase newly-released model checkpoints, GGUF (and therefore llama.cpp) hits first. If you've settled on a specific model long-term and want maximum throughput, AWQ on vLLM is the play.
Memory math on the 12 GB card
For a 7B model at q4_K_M:
- llama.cpp: ~4.5 GB weights + 0.5 GB KV cache @ 4K context + 1 GB overhead = ~6 GB used. 6 GB free for larger context, draft model, or other tasks.
- vLLM (AWQ): ~4 GB weights + 1.5 GB paged attention reservation + 0.5 GB scheduler overhead = ~6 GB used. 6 GB free for concurrent KV caches.
Both fit comfortably. For a 13B model at q4_K_M:
- llama.cpp: ~7 GB weights + 1 GB KV cache + 1 GB overhead = ~9 GB used.
- vLLM (AWQ): ~6.5 GB weights + 2 GB paged attention + 0.5 GB overhead = ~9 GB used.
Both stacks fit 13B with room for moderate context, but vLLM's overhead reservation eats into the headroom you'd otherwise use for longer KV caches. For long-context single-user workloads, llama.cpp wins on usable context budget.
Worked example: pick by use case
Hobbyist chat with a local assistant. Single user, occasional bursts of prompts, value low latency. Pick llama.cpp with llama-server and a 7B q4_K_M GGUF. Pair the MSI RTX 3060 12GB with a Ryzen 7 5800X, 32 GB DDR4-3600, and a fast NVMe like the WD Blue SN550 1TB. 5-minute setup, near-instant model swap.
Home Discord bot serving 3-5 simultaneous users. Pick vLLM with an AWQ 7B model. Stand up the OpenAI-compatible API endpoint, wire your bot to it, watch the throughput scale with concurrent calls.
Personal coding assistant in your IDE. llama.cpp with speculative decoding on a coding model. The 1B draft model + 7B main model combo doubles tok/s for short completions.
Mixed workload, learning the space. Run llama.cpp first because it's easier. Move to vLLM only when you have a specific serving workload that needs it.
Common pitfalls
- Running both stacks on the same box. They contend for VRAM. Pick one and stick with it.
- vLLM without enough VRAM. vLLM reserves more VRAM than llama.cpp for its scheduler. If you're tight on memory, llama.cpp fits more cleanly.
- Forgetting to quantize KV cache. Both stacks support quantized KV. On a 12 GB card, q8 KV cache nearly doubles your context budget for free.
- Buying for vLLM without a serving use case. If your traffic is one request at a time, vLLM's wins are invisible.
- Confusing AWQ and GGUF. They're different file formats with different conversion paths. Don't expect to swap them.
- PCIe lane starvation. On budget boards, the 3060 lands in a x4 slot when paired with extra NVMe drives. Keep the GPU in the primary x16 slot.
When NOT to pick either
If you only need occasional inference and don't mind the cloud, the OpenAI / Anthropic APIs are cheaper for sporadic use. Local stacks pay back when you have privacy needs, want a custom model, or run high enough volume to amortize the GPU cost. Otherwise the cloud-API math wins for most personal users.
If your model is larger than 13B and you need throughput, the 12 GB card is the wrong hardware. Move to a 24 GB card before optimizing the inference stack.
Bottom line
- One user at a time, hobbyist: llama.cpp + GGUF q4_K_M. Pair the MSI RTX 3060 12GB with a Ryzen 7 5800X and 32 GB DDR4.
- Multi-user serving: vLLM + AWQ. The 12 GB ceiling caps concurrent streams around 8-16 for 7B models — still plenty for a home lab.
- Want to try a brand-new model on day one: llama.cpp wins on quantization availability.
- Want maximum aggregate throughput: vLLM wins.
- Storage: Either way, point your model directory at a WD Blue SN550 1TB or similar NVMe. Cold-loading a 13B model from SATA SSD takes 2-3× longer.
Frequently asked questions in depth
Is vLLM or llama.cpp faster for a single user on a 12GB GPU? For single-user interactive use where the model fits in VRAM, llama.cpp wins by 5-15% on generation tok/s and has substantially lower per-prompt overhead. The gap comes from llama.cpp's lighter C++ core (no Python overhead per request) and its support for speculative decoding (pairing a small draft model with the main model can double single-stream tok/s). vLLM's strengths — paged attention, continuous batching — don't show up until you have multiple concurrent requests. If you're talking to one chat at a time, llama.cpp is the right pick.
Can vLLM run quantized models on consumer GPUs? Yes. vLLM supports AWQ (Activation-aware Weight Quantization), GPTQ, and FP8 quantization, all of which fit popular 7B and 13B models on the 12 GB RTX 3060 with room for moderate context. AWQ is the highest-quality 4-bit format vLLM supports and frequently runs slightly faster than equivalent GGUF on the same hardware. The catch: AWQ models for each new release lag behind GGUF availability — the community ports GGUF first because llama.cpp is broader. For freshly-released models, llama.cpp wins on availability; for established models you'll deploy long-term, vLLM with AWQ is a great match.
How much VRAM does each stack need at idle? llama.cpp's VRAM usage is roughly model weights + KV cache + ~1 GB overhead. For a 7B q4 it sits at 5-6 GB. vLLM's scheduler reserves additional space for paged attention pools — typically 1.5-2 GB on top of the model weights. For the same 7B q4 model, vLLM sits at 6-7 GB. On a 12 GB card the difference is invisible until you stretch context length or concurrent requests; on a 16 GB card it's mostly irrelevant; on an 8 GB card it can be the difference between a model fitting and not.
Does the CPU matter when both stacks are GPU-resident? Modestly. When the model fits entirely in VRAM, the CPU handles tokenization, request orchestration, and (for vLLM) the Python serving stack. Any modern 6-core+ CPU keeps both stacks fed. A Ryzen 5 5600G is fine for personal use; a Ryzen 7 5800X is fine for hosting. Where the CPU does matter sharply: if you offload layers to system RAM (the case for the 12 GB card on 13B+ at higher quantizations), per-token decode becomes CPU memory-bandwidth-bound and a faster CPU with faster RAM significantly improves throughput.
Which is easier to set up? llama.cpp by a wide margin. The canonical setup is "download the binary release, download a GGUF file, run one command." No Python environment, no CUDA toolkit version mismatches, no model-conversion step. vLLM requires a Python install, a matching CUDA toolkit, a compatible PyTorch wheel, and (for some quantizations) a conversion step. It's not catastrophically hard but it's a real onramp. If you're new to local inference, start with llama.cpp; you can graduate to vLLM when a specific use case forces the move.
Related guides
- Which GPU Runs Llama, Mistral, and Qwen Locally in 2026?
- Kimi K2.7 Code Is 12x Cheaper Than GPT-5.5 — Run It Local?
- Homelab Month One: Raspberry Pi 4 or a Ryzen 5 Mini-PC?
Citations and sources
- vLLM project on GitHub
- llama.cpp project documentation
- TechPowerUp — GeForce RTX 3060 12GB specifications
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
