ExLlamaV2 vs llama.cpp on the RTX 3060 12GB: Faster for 12B?

Name: ExLlamaV2 vs llama.cpp on the RTX 3060 12GB: Faster for 12B?
Item: MSI GeForce RTX 3060 Ventus 2X 12G Gaming Graphics Card - RTX 3060
Author: Mike Perry

Two of the most popular local-LLM runtimes — and which one actually delivers more tokens per second on a budget 12GB GPU.

By Mike Perry · Published 2026-06-05 · Last verified 2026-07-01 · 10 min read

ExLlamaV2 and llama.cpp both run 12B-class models on a 12GB RTX 3060. Here's what each runtime is for, where they win, and how the tokens-per-second math actually shakes out.

Short answer: For a model that fits entirely in the RTX 3060 12GB's VRAM, ExLlamaV2 typically delivers 1.4-1.8× the tokens per second of llama.cpp with full GPU offload. For models that need partial CPU offload, llama.cpp closes the gap or wins outright because its CPU code path is faster than ExLlamaV2's. Both runtimes are mature, both have active communities, and most local-LLM users end up with both installed for different workloads.

This piece is editorial synthesis of the upstream repository documentation, TheBloke's published quant benchmarks on Hugging Face, and community throughput threads on r/LocalLLaMA. No first-party benchmarks are reported.

Key takeaways

ExLlamaV2 is GPU-native, fastest for models that fit fully in VRAM.
llama.cpp is hybrid CPU+GPU, more flexible for larger models that need partial offload.
On a 12GB RTX 3060, a 7B-13B model fits comfortably in VRAM at 4-bit and ExLlamaV2 wins.
A 14B model at q4 is tight; llama.cpp's partial-offload path becomes useful.
A 32B model can't fit; both runtimes will work but with substantial CPU offload — llama.cpp typically wins.

What are these two projects?

ExLlamaV2 is a CUDA-optimized inference engine for quantized LLM models, maintained by turboderp and a small group of contributors. It uses a custom quantization format (EXL2) and is designed to extract maximum throughput from NVIDIA GPUs by hand-tuning kernels for tensor cores. It's narrower in scope than llama.cpp — it does GPU inference well, doesn't try to do everything else — and that focus shows in benchmarks.

llama.cpp is a broader project, originally written by Georgi Gerganov, that supports CPU inference, GPU offload via multiple backends (CUDA, ROCm, Metal, Vulkan, SYCL), and runs on almost everything from a Raspberry Pi to an H100 cluster. It uses GGUF quantization, supports hybrid CPU+GPU execution where layers can split between the two, and is the substrate underneath most user-facing local-LLM apps (Ollama, LM Studio, KoboldCPP, etc.).

The trade is exactly what you'd expect: ExLlamaV2 is faster in its narrow sweet spot, llama.cpp is more flexible and works everywhere.

Quantization format differences

Aspect	ExLlamaV2 (EXL2)	llama.cpp (GGUF)
Primary target	NVIDIA GPU	CPU + GPU hybrid
Bit width	2.0-8.0 bpw (continuous)	q2_K, q3_K_M, q4_K_M, q5_K_M, q6_K, q8_0, fp16 (discrete)
Mixed-precision per layer	Yes (calibrated)	Yes (K-quants)
CPU offload	Limited	First-class
File format	Single .safetensors + config	Single .gguf
Community support	Smaller	Larger

The continuous bits-per-weight in EXL2 is genuinely useful — you can tune the quant level precisely to fit your VRAM budget. A 13B model in EXL2 at 4.5 bpw fits in 12GB more comfortably than the same model in GGUF q4_K_M (~4.6 bpw equivalent) because the EXL2 quantizer can distribute precision more finely across layers.

Throughput benchmarks (community-reported)

Approximate generation tok/s on a 12GB RTX 3060 with a quiet Ryzen 7 5700X host:

Model	Quant level	ExLlamaV2 tok/s	llama.cpp tok/s (full GPU)	llama.cpp tok/s (partial offload)
Llama 3.1 8B	4 bpw / q4_K_M	95–110	60–75	n/a (fits in VRAM)
Mistral 7B	4 bpw / q4_K_M	100–120	70–85	n/a
Qwen3 14B	4 bpw / q4_K_M	60–75	40–55	25–35 (with offload)
Llama 3.2 11B	5 bpw / q5_K_M	55–70	38–48	n/a
Qwen3 32B	4 bpw / q4_K_M	n/a (OOM)	n/a (OOM)	8–14

Numbers vary substantially with batch size, context length, draft-spec speculation, and exact quant level — these are mid-range community-reported figures, not promises. The pattern is consistent: ExLlamaV2 wins by 40-60% when everything fits; llama.cpp's hybrid mode is the only option for larger models.

Memory layout: what fits in 12GB

The math for VRAM usage on a 12GB RTX 3060:

~1 GB reserved for the OS and other apps (assume monitor connected)
~11 GB available for the model + KV cache + activations

At 4 bpw / q4_K_M:

Model size	Weights	KV cache (4k context)	Total VRAM	Fits 12GB?
7B	4.0 GB	0.5 GB	~4.8 GB	Yes, comfortably
8B	4.6 GB	0.6 GB	~5.5 GB	Yes, comfortably
11B	6.3 GB	0.7 GB	~7.5 GB	Yes
13B	7.5 GB	0.9 GB	~9.0 GB	Yes
14B	8.1 GB	1.0 GB	~9.7 GB	Yes, tight
20B	11.5 GB	1.4 GB	~13.5 GB	No (overflow)
32B	18.4 GB	2.2 GB	~21.0 GB	No (heavy overflow)

The 14B model is right at the edge of what fits with reasonable context. For 14B you want ExLlamaV2 at exactly 4 bpw, with cache_q4 enabled and context capped at 4-8k. Going to 16k context starts to push past the budget.

Installation and setup

ExLlamaV2 setup on a typical Ryzen + RTX 3060 Linux box:

Install CUDA 12.x and matching PyTorch
pip install exllamav2
Download an EXL2-quantized model from Hugging Face (e.g., from turboderp's repos)
Run with python -m exllamav2.server --model <path>

llama.cpp setup:

git clone https://github.com/ggerganov/llama.cpp
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release
Download a GGUF-quantized model from TheBloke's HF repos or similar
Run with ./build/bin/llama-server -m <path> -ngl 99 --port 8080

The llama.cpp install is more steps but gives you a binary with no Python runtime dependencies — useful for deployment. ExLlamaV2's Python install is faster to get running for development.

Where ExLlamaV2 wins

Pure-GPU throughput on models that fit in VRAM. Consistently 40-60% faster than llama.cpp on the same hardware.
Continuous quant level. EXL2 lets you tune bpw precisely; GGUF is discrete steps.
KV cache quantization. EXL2's cache_q4 / cache_q8 options let you fit longer contexts than GGUF can manage on the same VRAM.
Tensor-core utilization. Hand-tuned kernels actually use the RTX 3060's tensor cores effectively, where llama.cpp's CUDA path is more generic.

Where llama.cpp wins

Models that don't fit in VRAM. Partial offload (some layers on GPU, rest on CPU) is fundamental to llama.cpp's architecture; in ExLlamaV2 it's a limp-mode.
CPU-only inference. llama.cpp's CPU kernels are some of the fastest in the open-source world; ExLlamaV2 has no CPU mode worth using.
AMD/Apple/multi-backend portability. Same GGUF file runs on CUDA, ROCm, Metal, Vulkan, and CPU.
Ecosystem. Ollama, LM Studio, KoboldCPP, llamafile, and most other user-friendly local-LLM apps wrap llama.cpp.
Streaming response handling, function calling, structured output. llama.cpp has had more time to bake in the server-side niceties.

Practical recipe: which one for which job

Use case	Recommended runtime
7B-8B chat on RTX 3060, max tok/s	ExLlamaV2
13B chat on RTX 3060, max tok/s	ExLlamaV2 at 4-4.5 bpw
14B chat on RTX 3060, comfortable context	ExLlamaV2 at 4 bpw + cache_q4
20-32B with partial CPU offload	llama.cpp (partial GPU layers)
AMD GPU (RX 7600 XT, etc.)	llama.cpp with ROCm or Vulkan
Apple Silicon	llama.cpp with Metal
Multi-platform deployment	llama.cpp
Embedding in Ollama / LM Studio	llama.cpp (already there)
Custom integration in Python	Either; ExLlamaV2 if you want max throughput

Storage and host considerations

LLM model files are large. A typical setup:

WD Blue SN550 1TB NVMe for the OS and a working set of 4-6 quantized models. Fast load times matter when you're switching models frequently.
Crucial BX500 1TB SATA SSD for the model library — older, less-used models live here. SATA is fine because model load is once per session.
Host CPU like a Ryzen 7 5700X — 8 cores is enough headroom for llama.cpp's CPU layers if you do partial offload, and quiet enough thermally for a desktop in a quiet room.

Common pitfalls

Wrong quant for the runtime. GGUF in ExLlamaV2 doesn't work; EXL2 in llama.cpp doesn't work. Download the right format.
OOM from too-large context. A 14B model at 4 bpw fits at 4k context but not at 16k. Cap context to what your VRAM supports.
Wrong tensor parallel split. Both runtimes can run multi-GPU; getting the layer split wrong wastes one GPU's worth of throughput.
Old CUDA versions. Both runtimes target CUDA 12.x. Older 11.x installs work but you lose some kernels.
Driver mismatches. NVIDIA driver versions matter for tensor-core kernels. Stay current.

When NOT to use either

If your workload is one-shot batch inference at scale, use vLLM instead — it's not what either of these is for. If your workload is fine-tuning, use Hugging Face transformers or Axolotl, not ExLlamaV2/llama.cpp (they're inference-only). If your workload is chat with a hosted API, use the API; running locally is for control, privacy, or cost reasons, not speed.

Bottom line

For a 12GB RTX 3060 in 2026, ExLlamaV2 wins on pure throughput for 7B-13B models that fit in VRAM, llama.cpp wins on flexibility and on larger models that need CPU offload. Most local-LLM tinkerers install both, pick the runtime per session, and don't pick a side. The right answer is "both."

Related guides

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

Products mentioned in this article

Tap any product for full specs, live Amazon & eBay pricing, and alternatives.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Watch a review

What the 5800X Should Have Been: AMD Ryzen 7 5700X CPU Review & Benchmarks — Gamers Nexus on YouTube

Frequently asked questions

Which runtime gives the most tokens per second on RTX 3060 12GB?

For a 7B-13B model that fits entirely in 12GB of VRAM, ExLlamaV2 typically delivers 1.4-1.8x the tokens-per-second of llama.cpp with full GPU offload. The advantage shrinks once you have to partially offload to system RAM, because llama.cpp's CPU code path is faster than ExLlamaV2's. For pure-GPU inference, ExLlamaV2 wins; for hybrid GPU+CPU loads (larger models with partial offload), llama.cpp is competitive or better.

Why are llama.cpp models GGUF and ExLlamaV2 models EXL2?

They use different quantization formats. GGUF is llama.cpp's container format optimized for mixed CPU+GPU execution and broad architecture support. EXL2 is ExLlamaV2's GPU-only format optimized for NVIDIA tensor cores. You generally need to download the right-quant version for your runtime — most community models on Hugging Face (like TheBloke's collections) are published in both formats.

Can I run the same model in both runtimes?

Yes, almost always — but you need different files. A Llama 3.1 8B model in GGUF format runs in llama.cpp; the same model in EXL2 format runs in ExLlamaV2. You can have both formats sitting on disk and pick the runtime per use case. Disk-space cost is roughly 2x for keeping both, since each is its own quant of the original weights.

What quant level should I use on a 12GB RTX 3060?

For most 7B-13B models, q4 (GGUF) or 4-bit (EXL2) is the sweet spot — fits in 12GB with comfortable context, quality loss is small. A 13B model at q5_K_M (GGUF) or 5.0bpw (EXL2) starts to push against the VRAM ceiling but delivers slightly better quality. q8 / 8.0bpw is usually too large for a 12GB card on anything above 7B.

Does ExLlamaV2 require CUDA or does it work on AMD?

ExLlamaV2 is primarily NVIDIA/CUDA — that's where its performance optimizations are. It has experimental ROCm support but the experience is less polished than on NVIDIA. For AMD GPUs, llama.cpp with ROCm or Vulkan backend is the more mature path; for NVIDIA, both runtimes are viable and ExLlamaV2 generally wins on pure-GPU throughput.

Sources

More guides & deep dives from the SpecPicks archive

Browse all articles & guides →

More reviews from the SpecPicks archive

Browse all reviews →

More buying guides from SpecPicks

Browse all buying guides →

ExLlamaV2 vs llama.cpp on the RTX 3060 12GB: Faster for 12B?

Key takeaways

What are these two projects?

Quantization format differences

Throughput benchmarks (community-reported)

Memory layout: what fits in 12GB

Installation and setup

Where ExLlamaV2 wins

Where llama.cpp wins

Practical recipe: which one for which job

Storage and host considerations

Common pitfalls

When NOT to use either

Bottom line

Related guides

Citations and sources

Products mentioned in this article

MSI GeForce RTX 3060 Ventus 2X 12G Gaming Graphics Card - RTX 3060

MSI GeForce RTX 3060 Ventus 2X 12G Gaming Graphics Card - RTX 3060

Crucial BX500 1TB 3D NAND SATA 2.5-Inch Internal SSD, up to 540MB/s…

AMD Ryzen 7 5700X 8-Core, 16-Thread Unlocked Desktop Processor

Watch a review

Frequently asked questions

Sources

Recommended reading

More guides & deep dives from the SpecPicks archive

More reviews from the SpecPicks archive

More buying guides from SpecPicks

ExLlamaV2 vs llama.cpp on the RTX 3060 12GB: Faster for 12B?

Key takeaways

What are these two projects?

Quantization format differences

Throughput benchmarks (community-reported)

Memory layout: what fits in 12GB

Installation and setup

Where ExLlamaV2 wins

Where llama.cpp wins

Practical recipe: which one for which job

Storage and host considerations

Common pitfalls

When NOT to use either

Bottom line

Related guides

Citations and sources

MSI GeForce RTX 3060 Ventus 2X 12G Gaming Graphics Card - RTX 3060

MSI GeForce RTX 3060 Ventus 2X 12G Gaming Graphics Card - RTX 3060

Crucial BX500 1TB 3D NAND SATA 2.5-Inch Internal SSD, up to 540MB/s…

AMD Ryzen 7 5700X 8-Core, 16-Thread Unlocked Desktop Processor

📹 Watch a review

Frequently asked questions

Sources

Recommended reading

Keep reading on SpecPicks

More from the archive

Deeper dives from the SpecPicks archive

Just published on SpecPicks

Watch a review