Skip to main content
ExLlamaV2 vs llama.cpp on the RTX 3060 12GB: Faster for 12B?

ExLlamaV2 vs llama.cpp on the RTX 3060 12GB: Faster for 12B?

Two of the most popular local-LLM runtimes — and which one actually delivers more tokens per second on a budget 12GB GPU.

ExLlamaV2 and llama.cpp both run 12B-class models on a 12GB RTX 3060. Here's what each runtime is for, where they win, and how the tokens-per-second math actually shakes out.

Short answer: For a model that fits entirely in the RTX 3060 12GB's VRAM, ExLlamaV2 typically delivers 1.4-1.8× the tokens per second of llama.cpp with full GPU offload. For models that need partial CPU offload, llama.cpp closes the gap or wins outright because its CPU code path is faster than ExLlamaV2's. Both runtimes are mature, both have active communities, and most local-LLM users end up with both installed for different workloads.

This piece is editorial synthesis of the upstream repository documentation, TheBloke's published quant benchmarks on Hugging Face, and community throughput threads on r/LocalLLaMA. No first-party benchmarks are reported.

Key takeaways

  • ExLlamaV2 is GPU-native, fastest for models that fit fully in VRAM.
  • llama.cpp is hybrid CPU+GPU, more flexible for larger models that need partial offload.
  • On a 12GB RTX 3060, a 7B-13B model fits comfortably in VRAM at 4-bit and ExLlamaV2 wins.
  • A 14B model at q4 is tight; llama.cpp's partial-offload path becomes useful.
  • A 32B model can't fit; both runtimes will work but with substantial CPU offload — llama.cpp typically wins.

What are these two projects?

ExLlamaV2 is a CUDA-optimized inference engine for quantized LLM models, maintained by turboderp and a small group of contributors. It uses a custom quantization format (EXL2) and is designed to extract maximum throughput from NVIDIA GPUs by hand-tuning kernels for tensor cores. It's narrower in scope than llama.cpp — it does GPU inference well, doesn't try to do everything else — and that focus shows in benchmarks.

llama.cpp is a broader project, originally written by Georgi Gerganov, that supports CPU inference, GPU offload via multiple backends (CUDA, ROCm, Metal, Vulkan, SYCL), and runs on almost everything from a Raspberry Pi to an H100 cluster. It uses GGUF quantization, supports hybrid CPU+GPU execution where layers can split between the two, and is the substrate underneath most user-facing local-LLM apps (Ollama, LM Studio, KoboldCPP, etc.).

The trade is exactly what you'd expect: ExLlamaV2 is faster in its narrow sweet spot, llama.cpp is more flexible and works everywhere.

Quantization format differences

AspectExLlamaV2 (EXL2)llama.cpp (GGUF)
Primary targetNVIDIA GPUCPU + GPU hybrid
Bit width2.0-8.0 bpw (continuous)q2_K, q3_K_M, q4_K_M, q5_K_M, q6_K, q8_0, fp16 (discrete)
Mixed-precision per layerYes (calibrated)Yes (K-quants)
CPU offloadLimitedFirst-class
File formatSingle .safetensors + configSingle .gguf
Community supportSmallerLarger

The continuous bits-per-weight in EXL2 is genuinely useful — you can tune the quant level precisely to fit your VRAM budget. A 13B model in EXL2 at 4.5 bpw fits in 12GB more comfortably than the same model in GGUF q4_K_M (~4.6 bpw equivalent) because the EXL2 quantizer can distribute precision more finely across layers.

Throughput benchmarks (community-reported)

Approximate generation tok/s on a 12GB RTX 3060 with a quiet Ryzen 7 5700X host:

ModelQuant levelExLlamaV2 tok/sllama.cpp tok/s (full GPU)llama.cpp tok/s (partial offload)
Llama 3.1 8B4 bpw / q4_K_M95–11060–75n/a (fits in VRAM)
Mistral 7B4 bpw / q4_K_M100–12070–85n/a
Qwen3 14B4 bpw / q4_K_M60–7540–5525–35 (with offload)
Llama 3.2 11B5 bpw / q5_K_M55–7038–48n/a
Qwen3 32B4 bpw / q4_K_Mn/a (OOM)n/a (OOM)8–14

Numbers vary substantially with batch size, context length, draft-spec speculation, and exact quant level — these are mid-range community-reported figures, not promises. The pattern is consistent: ExLlamaV2 wins by 40-60% when everything fits; llama.cpp's hybrid mode is the only option for larger models.

Memory layout: what fits in 12GB

The math for VRAM usage on a 12GB RTX 3060:

  • ~1 GB reserved for the OS and other apps (assume monitor connected)
  • ~11 GB available for the model + KV cache + activations

At 4 bpw / q4_K_M:

Model sizeWeightsKV cache (4k context)Total VRAMFits 12GB?
7B4.0 GB0.5 GB~4.8 GBYes, comfortably
8B4.6 GB0.6 GB~5.5 GBYes, comfortably
11B6.3 GB0.7 GB~7.5 GBYes
13B7.5 GB0.9 GB~9.0 GBYes
14B8.1 GB1.0 GB~9.7 GBYes, tight
20B11.5 GB1.4 GB~13.5 GBNo (overflow)
32B18.4 GB2.2 GB~21.0 GBNo (heavy overflow)

The 14B model is right at the edge of what fits with reasonable context. For 14B you want ExLlamaV2 at exactly 4 bpw, with cache_q4 enabled and context capped at 4-8k. Going to 16k context starts to push past the budget.

Installation and setup

ExLlamaV2 setup on a typical Ryzen + RTX 3060 Linux box:

  1. Install CUDA 12.x and matching PyTorch
  2. pip install exllamav2
  3. Download an EXL2-quantized model from Hugging Face (e.g., from turboderp's repos)
  4. Run with python -m exllamav2.server --model <path>

llama.cpp setup:

  1. git clone https://github.com/ggerganov/llama.cpp
  2. cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release
  3. Download a GGUF-quantized model from TheBloke's HF repos or similar
  4. Run with ./build/bin/llama-server -m <path> -ngl 99 --port 8080

The llama.cpp install is more steps but gives you a binary with no Python runtime dependencies — useful for deployment. ExLlamaV2's Python install is faster to get running for development.

Where ExLlamaV2 wins

  • Pure-GPU throughput on models that fit in VRAM. Consistently 40-60% faster than llama.cpp on the same hardware.
  • Continuous quant level. EXL2 lets you tune bpw precisely; GGUF is discrete steps.
  • KV cache quantization. EXL2's cache_q4 / cache_q8 options let you fit longer contexts than GGUF can manage on the same VRAM.
  • Tensor-core utilization. Hand-tuned kernels actually use the RTX 3060's tensor cores effectively, where llama.cpp's CUDA path is more generic.

Where llama.cpp wins

  • Models that don't fit in VRAM. Partial offload (some layers on GPU, rest on CPU) is fundamental to llama.cpp's architecture; in ExLlamaV2 it's a limp-mode.
  • CPU-only inference. llama.cpp's CPU kernels are some of the fastest in the open-source world; ExLlamaV2 has no CPU mode worth using.
  • AMD/Apple/multi-backend portability. Same GGUF file runs on CUDA, ROCm, Metal, Vulkan, and CPU.
  • Ecosystem. Ollama, LM Studio, KoboldCPP, llamafile, and most other user-friendly local-LLM apps wrap llama.cpp.
  • Streaming response handling, function calling, structured output. llama.cpp has had more time to bake in the server-side niceties.

Practical recipe: which one for which job

Use caseRecommended runtime
7B-8B chat on RTX 3060, max tok/sExLlamaV2
13B chat on RTX 3060, max tok/sExLlamaV2 at 4-4.5 bpw
14B chat on RTX 3060, comfortable contextExLlamaV2 at 4 bpw + cache_q4
20-32B with partial CPU offloadllama.cpp (partial GPU layers)
AMD GPU (RX 7600 XT, etc.)llama.cpp with ROCm or Vulkan
Apple Siliconllama.cpp with Metal
Multi-platform deploymentllama.cpp
Embedding in Ollama / LM Studiollama.cpp (already there)
Custom integration in PythonEither; ExLlamaV2 if you want max throughput

Storage and host considerations

LLM model files are large. A typical setup:

  • WD Blue SN550 1TB NVMe for the OS and a working set of 4-6 quantized models. Fast load times matter when you're switching models frequently.
  • Crucial BX500 1TB SATA SSD for the model library — older, less-used models live here. SATA is fine because model load is once per session.
  • Host CPU like a Ryzen 7 5700X — 8 cores is enough headroom for llama.cpp's CPU layers if you do partial offload, and quiet enough thermally for a desktop in a quiet room.

Common pitfalls

  • Wrong quant for the runtime. GGUF in ExLlamaV2 doesn't work; EXL2 in llama.cpp doesn't work. Download the right format.
  • OOM from too-large context. A 14B model at 4 bpw fits at 4k context but not at 16k. Cap context to what your VRAM supports.
  • Wrong tensor parallel split. Both runtimes can run multi-GPU; getting the layer split wrong wastes one GPU's worth of throughput.
  • Old CUDA versions. Both runtimes target CUDA 12.x. Older 11.x installs work but you lose some kernels.
  • Driver mismatches. NVIDIA driver versions matter for tensor-core kernels. Stay current.

When NOT to use either

If your workload is one-shot batch inference at scale, use vLLM instead — it's not what either of these is for. If your workload is fine-tuning, use Hugging Face transformers or Axolotl, not ExLlamaV2/llama.cpp (they're inference-only). If your workload is chat with a hosted API, use the API; running locally is for control, privacy, or cost reasons, not speed.

Bottom line

For a 12GB RTX 3060 in 2026, ExLlamaV2 wins on pure throughput for 7B-13B models that fit in VRAM, llama.cpp wins on flexibility and on larger models that need CPU offload. Most local-LLM tinkerers install both, pick the runtime per session, and don't pick a side. The right answer is "both."

Related guides

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

Products mentioned in this article

Live prices from Amazon and eBay — both shown for every product so you can pick the channel that fits.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

Which runtime gives the most tokens per second on RTX 3060 12GB?
For a 7B-13B model that fits entirely in 12GB of VRAM, ExLlamaV2 typically delivers 1.4-1.8x the tokens-per-second of llama.cpp with full GPU offload. The advantage shrinks once you have to partially offload to system RAM, because llama.cpp's CPU code path is faster than ExLlamaV2's. For pure-GPU inference, ExLlamaV2 wins; for hybrid GPU+CPU loads (larger models with partial offload), llama.cpp is competitive or better.
Why are llama.cpp models GGUF and ExLlamaV2 models EXL2?
They use different quantization formats. GGUF is llama.cpp's container format optimized for mixed CPU+GPU execution and broad architecture support. EXL2 is ExLlamaV2's GPU-only format optimized for NVIDIA tensor cores. You generally need to download the right-quant version for your runtime — most community models on Hugging Face (like TheBloke's collections) are published in both formats.
Can I run the same model in both runtimes?
Yes, almost always — but you need different files. A Llama 3.1 8B model in GGUF format runs in llama.cpp; the same model in EXL2 format runs in ExLlamaV2. You can have both formats sitting on disk and pick the runtime per use case. Disk-space cost is roughly 2x for keeping both, since each is its own quant of the original weights.
What quant level should I use on a 12GB RTX 3060?
For most 7B-13B models, q4 (GGUF) or 4-bit (EXL2) is the sweet spot — fits in 12GB with comfortable context, quality loss is small. A 13B model at q5_K_M (GGUF) or 5.0bpw (EXL2) starts to push against the VRAM ceiling but delivers slightly better quality. q8 / 8.0bpw is usually too large for a 12GB card on anything above 7B.
Does ExLlamaV2 require CUDA or does it work on AMD?
ExLlamaV2 is primarily NVIDIA/CUDA — that's where its performance optimizations are. It has experimental ROCm support but the experience is less polished than on NVIDIA. For AMD GPUs, llama.cpp with ROCm or Vulkan backend is the more mature path; for NVIDIA, both runtimes are viable and ExLlamaV2 generally wins on pure-GPU throughput.

Sources

— SpecPicks Editorial · Last verified 2026-06-05

NVIDIA GeForce RTX 3060
NVIDIA GeForce RTX 3060
$389.22
View on Amazon →