Short answer: For a model that fits entirely in the RTX 3060 12GB's VRAM, ExLlamaV2 typically delivers 1.4-1.8× the tokens per second of llama.cpp with full GPU offload. For models that need partial CPU offload, llama.cpp closes the gap or wins outright because its CPU code path is faster than ExLlamaV2's. Both runtimes are mature, both have active communities, and most local-LLM users end up with both installed for different workloads.
This piece is editorial synthesis of the upstream repository documentation, TheBloke's published quant benchmarks on Hugging Face, and community throughput threads on r/LocalLLaMA. No first-party benchmarks are reported.
Key takeaways
- ExLlamaV2 is GPU-native, fastest for models that fit fully in VRAM.
- llama.cpp is hybrid CPU+GPU, more flexible for larger models that need partial offload.
- On a 12GB RTX 3060, a 7B-13B model fits comfortably in VRAM at 4-bit and ExLlamaV2 wins.
- A 14B model at q4 is tight; llama.cpp's partial-offload path becomes useful.
- A 32B model can't fit; both runtimes will work but with substantial CPU offload — llama.cpp typically wins.
What are these two projects?
ExLlamaV2 is a CUDA-optimized inference engine for quantized LLM models, maintained by turboderp and a small group of contributors. It uses a custom quantization format (EXL2) and is designed to extract maximum throughput from NVIDIA GPUs by hand-tuning kernels for tensor cores. It's narrower in scope than llama.cpp — it does GPU inference well, doesn't try to do everything else — and that focus shows in benchmarks.
llama.cpp is a broader project, originally written by Georgi Gerganov, that supports CPU inference, GPU offload via multiple backends (CUDA, ROCm, Metal, Vulkan, SYCL), and runs on almost everything from a Raspberry Pi to an H100 cluster. It uses GGUF quantization, supports hybrid CPU+GPU execution where layers can split between the two, and is the substrate underneath most user-facing local-LLM apps (Ollama, LM Studio, KoboldCPP, etc.).
The trade is exactly what you'd expect: ExLlamaV2 is faster in its narrow sweet spot, llama.cpp is more flexible and works everywhere.
Quantization format differences
| Aspect | ExLlamaV2 (EXL2) | llama.cpp (GGUF) |
|---|---|---|
| Primary target | NVIDIA GPU | CPU + GPU hybrid |
| Bit width | 2.0-8.0 bpw (continuous) | q2_K, q3_K_M, q4_K_M, q5_K_M, q6_K, q8_0, fp16 (discrete) |
| Mixed-precision per layer | Yes (calibrated) | Yes (K-quants) |
| CPU offload | Limited | First-class |
| File format | Single .safetensors + config | Single .gguf |
| Community support | Smaller | Larger |
The continuous bits-per-weight in EXL2 is genuinely useful — you can tune the quant level precisely to fit your VRAM budget. A 13B model in EXL2 at 4.5 bpw fits in 12GB more comfortably than the same model in GGUF q4_K_M (~4.6 bpw equivalent) because the EXL2 quantizer can distribute precision more finely across layers.
Throughput benchmarks (community-reported)
Approximate generation tok/s on a 12GB RTX 3060 with a quiet Ryzen 7 5700X host:
| Model | Quant level | ExLlamaV2 tok/s | llama.cpp tok/s (full GPU) | llama.cpp tok/s (partial offload) |
|---|---|---|---|---|
| Llama 3.1 8B | 4 bpw / q4_K_M | 95–110 | 60–75 | n/a (fits in VRAM) |
| Mistral 7B | 4 bpw / q4_K_M | 100–120 | 70–85 | n/a |
| Qwen3 14B | 4 bpw / q4_K_M | 60–75 | 40–55 | 25–35 (with offload) |
| Llama 3.2 11B | 5 bpw / q5_K_M | 55–70 | 38–48 | n/a |
| Qwen3 32B | 4 bpw / q4_K_M | n/a (OOM) | n/a (OOM) | 8–14 |
Numbers vary substantially with batch size, context length, draft-spec speculation, and exact quant level — these are mid-range community-reported figures, not promises. The pattern is consistent: ExLlamaV2 wins by 40-60% when everything fits; llama.cpp's hybrid mode is the only option for larger models.
Memory layout: what fits in 12GB
The math for VRAM usage on a 12GB RTX 3060:
- ~1 GB reserved for the OS and other apps (assume monitor connected)
- ~11 GB available for the model + KV cache + activations
At 4 bpw / q4_K_M:
| Model size | Weights | KV cache (4k context) | Total VRAM | Fits 12GB? |
|---|---|---|---|---|
| 7B | 4.0 GB | 0.5 GB | ~4.8 GB | Yes, comfortably |
| 8B | 4.6 GB | 0.6 GB | ~5.5 GB | Yes, comfortably |
| 11B | 6.3 GB | 0.7 GB | ~7.5 GB | Yes |
| 13B | 7.5 GB | 0.9 GB | ~9.0 GB | Yes |
| 14B | 8.1 GB | 1.0 GB | ~9.7 GB | Yes, tight |
| 20B | 11.5 GB | 1.4 GB | ~13.5 GB | No (overflow) |
| 32B | 18.4 GB | 2.2 GB | ~21.0 GB | No (heavy overflow) |
The 14B model is right at the edge of what fits with reasonable context. For 14B you want ExLlamaV2 at exactly 4 bpw, with cache_q4 enabled and context capped at 4-8k. Going to 16k context starts to push past the budget.
Installation and setup
ExLlamaV2 setup on a typical Ryzen + RTX 3060 Linux box:
- Install CUDA 12.x and matching PyTorch
pip install exllamav2- Download an EXL2-quantized model from Hugging Face (e.g., from turboderp's repos)
- Run with
python -m exllamav2.server --model <path>
llama.cpp setup:
git clone https://github.com/ggerganov/llama.cppcmake -B build -DGGML_CUDA=ON && cmake --build build --config Release- Download a GGUF-quantized model from TheBloke's HF repos or similar
- Run with
./build/bin/llama-server -m <path> -ngl 99 --port 8080
The llama.cpp install is more steps but gives you a binary with no Python runtime dependencies — useful for deployment. ExLlamaV2's Python install is faster to get running for development.
Where ExLlamaV2 wins
- Pure-GPU throughput on models that fit in VRAM. Consistently 40-60% faster than llama.cpp on the same hardware.
- Continuous quant level. EXL2 lets you tune bpw precisely; GGUF is discrete steps.
- KV cache quantization. EXL2's cache_q4 / cache_q8 options let you fit longer contexts than GGUF can manage on the same VRAM.
- Tensor-core utilization. Hand-tuned kernels actually use the RTX 3060's tensor cores effectively, where llama.cpp's CUDA path is more generic.
Where llama.cpp wins
- Models that don't fit in VRAM. Partial offload (some layers on GPU, rest on CPU) is fundamental to llama.cpp's architecture; in ExLlamaV2 it's a limp-mode.
- CPU-only inference. llama.cpp's CPU kernels are some of the fastest in the open-source world; ExLlamaV2 has no CPU mode worth using.
- AMD/Apple/multi-backend portability. Same GGUF file runs on CUDA, ROCm, Metal, Vulkan, and CPU.
- Ecosystem. Ollama, LM Studio, KoboldCPP, llamafile, and most other user-friendly local-LLM apps wrap llama.cpp.
- Streaming response handling, function calling, structured output. llama.cpp has had more time to bake in the server-side niceties.
Practical recipe: which one for which job
| Use case | Recommended runtime |
|---|---|
| 7B-8B chat on RTX 3060, max tok/s | ExLlamaV2 |
| 13B chat on RTX 3060, max tok/s | ExLlamaV2 at 4-4.5 bpw |
| 14B chat on RTX 3060, comfortable context | ExLlamaV2 at 4 bpw + cache_q4 |
| 20-32B with partial CPU offload | llama.cpp (partial GPU layers) |
| AMD GPU (RX 7600 XT, etc.) | llama.cpp with ROCm or Vulkan |
| Apple Silicon | llama.cpp with Metal |
| Multi-platform deployment | llama.cpp |
| Embedding in Ollama / LM Studio | llama.cpp (already there) |
| Custom integration in Python | Either; ExLlamaV2 if you want max throughput |
Storage and host considerations
LLM model files are large. A typical setup:
- WD Blue SN550 1TB NVMe for the OS and a working set of 4-6 quantized models. Fast load times matter when you're switching models frequently.
- Crucial BX500 1TB SATA SSD for the model library — older, less-used models live here. SATA is fine because model load is once per session.
- Host CPU like a Ryzen 7 5700X — 8 cores is enough headroom for llama.cpp's CPU layers if you do partial offload, and quiet enough thermally for a desktop in a quiet room.
Common pitfalls
- Wrong quant for the runtime. GGUF in ExLlamaV2 doesn't work; EXL2 in llama.cpp doesn't work. Download the right format.
- OOM from too-large context. A 14B model at 4 bpw fits at 4k context but not at 16k. Cap context to what your VRAM supports.
- Wrong tensor parallel split. Both runtimes can run multi-GPU; getting the layer split wrong wastes one GPU's worth of throughput.
- Old CUDA versions. Both runtimes target CUDA 12.x. Older 11.x installs work but you lose some kernels.
- Driver mismatches. NVIDIA driver versions matter for tensor-core kernels. Stay current.
When NOT to use either
If your workload is one-shot batch inference at scale, use vLLM instead — it's not what either of these is for. If your workload is fine-tuning, use Hugging Face transformers or Axolotl, not ExLlamaV2/llama.cpp (they're inference-only). If your workload is chat with a hosted API, use the API; running locally is for control, privacy, or cost reasons, not speed.
Bottom line
For a 12GB RTX 3060 in 2026, ExLlamaV2 wins on pure throughput for 7B-13B models that fit in VRAM, llama.cpp wins on flexibility and on larger models that need CPU offload. Most local-LLM tinkerers install both, pick the runtime per session, and don't pick a side. The right answer is "both."
Related guides
- Running a 1-Trillion-Parameter LLM on 768GB of Cheap Optane
- AMD Instinct MI300X vs Radeon RX 7600 XT: Datacenter vs Desk
Citations and sources
- GitHub — turboderp/exllamav2 repository
- GitHub — ggerganov/llama.cpp repository
- Hugging Face — TheBloke (community quant publisher)
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
