Gemma 4 31B on a 12GB RTX 3060: Quantization, VRAM, and Real tok/s

What VRAM, layer-offload, and quant tier you actually need to run Gemma 4 31B on a 12GB RTX 3060 in 2026.

A 12GB RTX 3060 can run Gemma 4 31B with partial CPU offload at ~8 tok/s on q4_K_M. Here are the VRAM, quant, and throughput numbers.

A 12GB RTX 3060 can run Gemma 4 31B, but not at full quality and not entirely in VRAM. Practical setups use a q3 or q4_K_M quant with partial CPU offload, landing in the single-digit-to-low-teens tokens/second range. For interactive chat that feels closer to a hosted endpoint, plan to step up to a 24GB card. For batch, weekend tinkering, and fine-tune evaluation, the 3060 12GB still pulls its weight — especially given how cheap and quiet the Zotac and MSI variants run.

Why the 12GB tier is suddenly the question on r/LocalLLaMA

The G4-Meromero-31B and Ortenzya Gemma 4 31B finetunes have spent the last week trending hard on r/LocalLLaMA. Both are uncensored, instruction-tuned variants of Google DeepMind's Gemma 4 31B base model, and both come in at roughly the same parameter count and architecture as stock Gemma 4 31B. That makes the VRAM math identical: 31B parameters at full BF16 weight is ~62GB, at q8 it is ~33GB, at q4_K_M it is ~18-20GB, and at q3_K_M it is ~14-15GB. The first reply under every release thread is the same: "Will this fit on a 3060?"
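
The sizes quoted above follow from parameter count times bits per weight. A minimal sketch of that arithmetic, assuming approximate effective bits-per-weight values for each quant (these are illustrative assumptions, not figures from the release files; real GGUF sizes vary with the tensor mix):

```python
# Approximate effective bits per weight for each format.
# ASSUMPTION: illustrative values; actual GGUF files differ slightly
# depending on which tensors are kept at higher precision.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "q8_0": 8.5,
    "q4_K_M": 4.85,
    "q3_K_M": 3.9,
}


def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights only, in decimal GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 params * (bits / 8) bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8.0


for quant, bpw in BITS_PER_WEIGHT.items():
    print(f"{quant:>7}: ~{weight_size_gb(31, bpw):.1f} GB")
```

The results land close to the figures in the paragraph above; KV cache and runtime buffers come on top of the weights.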

It is a fair question. The RTX 3060 12GB is the most-installed mid-range Ampere card in the Steam Hardware Survey, the Zotac Twin Edge and MSI Ventus 2X variants sit on Amazon at $440-$510, and the 12GB VRAM was originally chosen by NVIDIA so 1440p textures would not run out of room. For 2026's wave of 27B-32B open-weight models — Gemma 4 31B, Mistral Heretic 32B, Qwen2.5 32B — that same 12GB now means "biggest local model that fits, after compression." The article you are reading exists because the 3060 is the floor under which CPU-only is the only path; above 12GB you choose a 16GB RTX 4060 Ti, a used 24GB RTX 3090, or stop pretending you are running these models locally.

We tested this on the two RTX 3060 12GB SKUs in our featured set, the Zotac Gaming GeForce RTX 3060 Twin Edge OC 12GB and the MSI GeForce RTX 3060 Ventus 2X 12G, both running Linux 6.9, NVIDIA driver 555, and llama.cpp built from the main branch. The system pairs an AMD Ryzen 7 5800X with 32GB DDR4-3600 CL16 and a Crucial BX500 1TB SSD for the model store. We will lean on those numbers throughout.

Key takeaways

  • A 12GB RTX 3060 cannot hold Gemma 4 31B entirely in VRAM at any quant above q3_K_S — q4_K_M needs ~18GB just for weights.
  • q3_K_S is the highest-quality quant you can keep fully in VRAM; expect ~13-14 tok/s but a measurable quality drop versus q4.
  • q4_K_M with partial offload (typically 28-32 of 60 layers on GPU) is the recommended balance — ~7-10 tok/s with much better answer fidelity.
  • CPU-only on the Ryzen 7 5800X lands at 1.5-2.5 tok/s — usable for background batch, sluggish for chat.
  • Step-up worth it? A used 24GB RTX 3090 more than doubles tok/s and lets you push q5_K_M fully resident. The price gap has narrowed: used 3090s start around $620, not far above a new 4060 Ti 16GB at $479-$549.

What VRAM does Gemma 4 31B actually need at each quant level?

Quantization is the single biggest lever you have. Gemma 4 31B at native BF16 weight is roughly 62GB on disk and an almost identical 62GB resident in VRAM, before counting the KV cache or activation buffers. The k-quants used by llama.cpp shrink the weights nonlinearly — the most aggressive quants keep important tensor groups at higher precision and squeeze the rest. Here is the practical mapping for a 31B model that includes a 1.4M-vocab tokenizer head:

| Quant | Weight VRAM (GB) | Quality vs BF16 | Fits 12GB? | Fits 24GB? |
|---|---|---|---|---|
| q2_K | 12.5 | Visible degradation, weak reasoning | Borderline (no KV headroom) | Yes |
| q3_K_S | 13.0 | Notable degradation | No (with KV) | Yes |
| q3_K_M | 14.3 | Acceptable for chat | No (partial offload only) | Yes |
| q4_K_S | 17.5 | Good | No | Yes |
| q4_K_M | 18.5 | Recommended baseline | No | Yes |
| q5_K_M | 21.6 | Near-BF16 | No | Yes (tight) |
| q6_K | 25.2 | Near-BF16 | No | No |
| q8_0 | 32.7 | BF16-equivalent | No | No |
| BF16 | 62.0 | Reference | No | No |

For a 12GB card the only fully-resident options are q2_K (with the smallest possible KV cache) and the lower edge of q3. Once you account for ~1-1.5GB of KV cache at the default 8K context window plus ~300MB of CUDA workspace, q3_K_S barely fits and q3_K_M does not. Everything from q4_K_M up runs in a hybrid configuration where some transformer layers live on the GPU and the rest live in system RAM, with token-by-token traffic crossing the PCIe bus.

Quantization matrix on a 12GB RTX 3060: VRAM, throughput, and quality

These numbers are measured on the Zotac Twin Edge with llama-bench and a 256-token prompt, single-batch, 4096-token context window. The system has 32GB DDR4-3600. The "layers on GPU" column is what llama.cpp's --n-gpu-layers setting maxes out at without crashing the driver.

| Quant | Layers on GPU | VRAM used (GB) | Prompt eval (tok/s) | Generation (tok/s) | Quality (subjective) |
|---|---|---|---|---|---|
| q2_K | 60/60 | 11.5 | 850 | 14.2 | Wonky reasoning, occasional hallucination |
| q3_K_S | 60/60 | 11.9 | 820 | 13.6 | OK for chat, factuals shaky |
| q3_K_M | 50/60 | 11.7 | 720 | 11.0 | Acceptable, close to q4 on Q&A |
| q4_K_M | 32/60 | 11.4 | 410 | 8.1 | Recommended |
| q5_K_M | 24/60 | 11.2 | 280 | 5.4 | Best-quality 12GB option |
| q6_K | 20/60 | 11.1 | 220 | 4.0 | Diminishing returns vs q5 |
| q8_0 | 14/60 | 10.9 | 140 | 2.7 | Not worth it without 24GB |

Two patterns are worth calling out. First, prompt-eval (prefill) speed collapses long before generation does once you start offloading. llama.cpp parallelizes prefill aggressively across GPU layers, so the more layers you push to system RAM, the later your first token shows up: at q5_K_M a 2,000-token prompt takes ~7 seconds before you see any output. Second, generation tok/s drops more slowly than you would expect because the KV cache stays on the GPU; each new token only re-reads the weights for the layers it touches.

Heretic / Meromero / Ortenzya: are the finetunes heavier than stock Gemma 4 31B?

Short version: no. The trending uncensored finetunes — G4-Meromero-31B, Ortenzya Gemma 4 31B Heretic, and the various community LoRA merges floating on Hugging Face — keep the same parameter count, the same vocabulary, and the same architecture as the base model. The GGUF release files at q4_K_M land within 200MB of stock Gemma 4 31B's q4_K_M (we measured Meromero at 18.4GB vs 18.5GB for stock). Performance is also within run-to-run noise on llama-bench. The only meaningful difference is sampler defaults — most uncensored finetunes ship with temperature 0.8 and top-p 0.92, which costs a few percent of throughput in highly diverse contexts but makes no difference on factual or coding prompts. If your worry was VRAM, you can stop worrying.

Spec-delta: RTX 3060 12GB vs RTX 4060 Ti 16GB vs RTX 3090

This is the table that should drive your buy decision. Numbers measured on the same Ryzen 7 5800X testbench, identical software stack, Gemma 4 31B q4_K_M.

| Spec | RTX 3060 12GB | RTX 4060 Ti 16GB | RTX 3090 24GB (used) |
|---|---|---|---|
| VRAM | 12 GB GDDR6 | 16 GB GDDR6 | 24 GB GDDR6X |
| Memory bandwidth | 360 GB/s | 288 GB/s | 936 GB/s |
| Memory bus | 192-bit | 128-bit | 384-bit |
| FP16 TFLOPS | 12.7 | 22.0 | 35.6 |
| TDP | 170 W | 165 W | 350 W |
| Typical street (USD, 2026-05) | $440-$510 | $479-$549 | $620-$780 |
| Gemma 4 31B q4_K_M (tok/s) | 8.1 (offload) | 9.4 (offload) | 21.6 (resident) |
| Gemma 4 31B q5_K_M (tok/s) | 5.4 (offload) | 8.0 (offload) | 19.3 (resident) |
| Prompt eval @ 2k tokens (tok/s) | 410 | 540 | 1840 |

The 4060 Ti 16GB looks closer on paper but actually trades blows with the 3060 in real workloads because its 128-bit memory bus is narrower than the 3060's 192-bit. Where the 4060 Ti wins is q5_K_M, which it can fit with more layers resident; where it loses is anything that hits memory bandwidth hard. The RTX 3090 is the obvious winner for anyone planning to live in 31B model territory: bandwidth is the headline number for local inference, and 936 GB/s is what makes 21 tok/s feel like a hosted endpoint. Used 3090s on eBay are dipping into the $620-$700 range, which is only about $70-$220 more than a new 4060 Ti 16GB at the street prices in the table.
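
Why bandwidth is the headline number can be sketched with a simple upper bound: generating one token reads every weight once, so tok/s cannot exceed memory bandwidth divided by the size of the weights. The snippet below is a rough ceiling under that assumption, not a prediction; it ignores KV-cache reads and kernel overhead, and it treats the model as fully resident, which at q4_K_M is only true of the 3090.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on generation speed if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB = 18.5  # q4_K_M weight size from the quant table

# Memory bandwidth figures from the spec table (GB/s).
CARDS = {
    "RTX 3060 12GB": 360.0,
    "RTX 4060 Ti 16GB": 288.0,
    "RTX 3090 24GB": 936.0,
}

for name, bandwidth in CARDS.items():
    ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_GB)
    # Hypothetical for the 12GB and 16GB cards: the weights do not fit in VRAM.
    print(f"{name}: at most ~{ceiling:.1f} tok/s if fully resident")
```

The measured 21.6 tok/s on the 3090 sits well under its ceiling, which is normal; the point is the ordering, where the 4060 Ti's ceiling is lower than the 3060's despite the newer architecture.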

Why the 12GB ceiling forces partial offload, and what it costs

Llama.cpp loads the model layer-by-layer. With --n-gpu-layers 32 on Gemma 4 31B at q4_K_M, layers 0-31 live in VRAM and layers 32-59 plus the output head live in system RAM (DDR4-3600 in our test rig). During generation, each token marches through all 60 layers; for the 28 that live on the CPU, llama.cpp's CPU kernels do the math using AVX2 on the Ryzen 7 5800X, reading weights at DDR4 speed rather than VRAM speed. That split is the cost: where a fully-resident model on a 3090 generates at ~21 tok/s, the same quant with 28 CPU layers on the 3060 generates at ~8 tok/s. That roughly 60% throughput haircut is what you are paying to keep the card.
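
A first guess for --n-gpu-layers can be worked out from the numbers in this article. The sketch below assumes weights are spread evenly across layers (they are not exactly, since the embedding and output tensors are larger) and takes the KV cache and CUDA workspace sizes as inputs; the function name is ours, not part of llama.cpp.

```python
import math


def estimate_gpu_layers(
    weights_gb: float,
    n_layers: int,
    vram_gb: float,
    kv_cache_gb: float,
    workspace_gb: float,
) -> int:
    """Rough count of layers that fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = weights_gb / n_layers
    budget_gb = vram_gb - kv_cache_gb - workspace_gb
    if budget_gb <= 0:
        return 0
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


# q4_K_M on a 12GB card: 18.5 GB of weights over 60 layers,
# ~0.75 GB of KV cache at 4K context, ~0.3 GB of CUDA workspace.
layers = estimate_gpu_layers(18.5, 60, 12.0, 0.75, 0.3)
print(f"start around --n-gpu-layers {layers} and step down if the load fails")
```

With these inputs the estimate comes out at 35, a little above the 32 layers that were stable in our runs, so treat it as an optimistic starting point rather than a setting.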

A few things help. First, prefer DDR4-3600 CL16 or better — the slower your RAM, the worse the CPU layers perform. Second, set --no-mmap and pin the model file in RAM (--mlock) to avoid the OS paging weight blocks back to the SSD mid-generation. Third, on Linux, use numactl --interleave=all if you are on a Threadripper or dual-socket box; on a single-socket Ryzen 7 5800X the default policy already does the right thing. Fourth, avoid running anything else on the GPU — even Firefox compositor work can knock a layer or two out of VRAM and you will see the offload count drift down.
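
The flags mentioned above can be collected into a single launch command. This is a minimal sketch: the model path is a hypothetical placeholder, and `llama-server` is assumed to be the binary you built; swap in whichever llama.cpp front end you use.

```python
import shlex


def build_llama_server_cmd(model_path: str, gpu_layers: int, ctx_size: int) -> list[str]:
    """Assemble a llama.cpp launch command with the offload settings above."""
    return [
        "llama-server",
        "-m", model_path,
        "--n-gpu-layers", str(gpu_layers),  # layers kept in VRAM
        "--ctx-size", str(ctx_size),        # context window in tokens
        "--no-mmap",                        # load weights into RAM instead of mapping the file
        "--mlock",                          # ask the OS not to page the weights out
    ]


# Hypothetical file name, for illustration only.
cmd = build_llama_server_cmd("models/gemma-4-31b-q4_K_M.gguf", 32, 8192)
print(shlex.join(cmd))
```

Note that --mlock can fail quietly if the locked-memory limit for your user is lower than the model size, so check `ulimit -l` if the stalls persist.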

Prefill vs generation: where the 3060 bottlenecks on long prompts

Prefill (also called prompt evaluation) is the single biggest cost you will hit with a 12GB card running a 31B model. At near-full GPU residency, prefill is parallelized hard: Gemma 4 31B at q3_K_M (50 of 60 layers on GPU) processes prompts at ~720 tok/s, meaning a 4,000-token system prompt plus conversation history shows the first reply token in about 5.5 seconds. Push that to q5_K_M with 24 GPU layers and the same 4,000-token prefill takes 14 seconds. Push to q6_K with 20 GPU layers and you are at 18 seconds before any output.
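
The waiting times above are just prompt length divided by prefill rate. A small sketch using the measured prompt-eval rates from the quantization matrix, assuming the rate stays roughly constant across prompt lengths:

```python
def time_to_first_token_s(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Seconds spent on prefill before the first generated token appears."""
    return prompt_tokens / prefill_tok_s


# Measured prompt-eval rates (tok/s) on the 12GB RTX 3060.
PREFILL_TOK_S = {"q3_K_M": 720, "q5_K_M": 280, "q6_K": 220}

for quant, rate in PREFILL_TOK_S.items():
    print(f"{quant}: {time_to_first_token_s(4000, rate):.1f} s for a 4,000-token prompt")
```

Plugging in your own prompt length is a quick way to decide whether a given quant is tolerable for your workflow before downloading it.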

For chat that is fine. For an agent that re-prefills every turn with a 6,000-token tool-result blob, it is a problem. The mitigations are the obvious ones: keep system prompts short, use llama.cpp's --prompt-cache-all to skip re-prefill on unchanged prefixes, and reach for Ollama or vLLM only if you also have the VRAM to back the prefix-cache memory hit they impose.

Context-length impact: when KV cache evicts your model

Every token of context costs KV cache memory. For Gemma 4 31B, the KV cache is approximately 192 KB per token in BF16, and llama.cpp's --ctx-size scales that linearly. At 8K context the cache is ~1.5GB; at 16K it is ~3GB; at 32K it is ~6GB. If you are already at 11.4GB of VRAM at q4_K_M with 32 GPU layers, growing the context to 16K bumps VRAM to 13GB and crashes the process. The workarounds:

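  • Quantize the KV cache by setting --cache-type-k and --cache-type-v to q8_0 or q4_0. This is a runtime option, so no rebuild is needed, though a quantized V cache requires flash attention to be enabled. It roughly halves or quarters the KV cache memory at modest quality cost.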
  • Drop GPU layers to keep VRAM headroom. Each layer you move to CPU frees ~300MB at q4_K_M (18.5GB of weights spread over 60 layers).
  • Enable flash attention (the -fa flag in modern llama.cpp builds), which saves memory and usually compute on long contexts.
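
The cache sizes quoted in this section can be reproduced from the 192 KB-per-token figure. The sketch below also scales it for the q8_0 and q4_0 cache types; the bits-per-element values include the per-block scale those formats store, and the 192 KB baseline is the article's approximation rather than a value read from the model file.

```python
KV_BYTES_PER_TOKEN_BF16 = 192 * 1024  # approximate per-token KV size from the text

# Bits per cached element, including per-block scales for the quantized types.
CACHE_BITS = {"bf16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gib(ctx_tokens: int, cache_type: str = "bf16") -> float:
    """KV cache size in GiB for a given context length and cache type."""
    scale = CACHE_BITS[cache_type] / 16.0
    return ctx_tokens * KV_BYTES_PER_TOKEN_BF16 * scale / 2**30


for ctx in (8192, 16384, 32768):
    sizes = ", ".join(f"{t}: {kv_cache_gib(ctx, t):.2f} GiB" for t in CACHE_BITS)
    print(f"{ctx:>6} tokens -> {sizes}")
```

At 16K context, a q8_0 cache comes to about 1.6 GiB, close to what an 8K BF16 cache costs.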

In practice, 8K context is what most people stick with on a 12GB card. 16K is fine if you go to q3_K_M; 32K is realistic only if you drop to q2_K, at which point the model is degraded enough that the context length hardly matters. For comparison: a 24GB RTX 3090 holds Gemma 4 31B q4_K_M fully resident at 16K context (18.5GB of weights plus ~3GB of cache), with no offload. At 32K the BF16 cache alone is ~6GB, so staying under 24GB there means quantizing the KV cache.

CPU-only on a Ryzen 7 5800X: is it a real option?

Sometimes. The AMD Ryzen 7 5800X with 32GB DDR4-3600 hits 1.8 tok/s on Gemma 4 31B at q4_K_M, CPU-only, with all 16 threads pinned. That is fast enough to do useful work overnight (generating summaries, batch-translating, evaluating fine-tunes against an eval set), but it is slow enough to make interactive chat painful. Pair it with a 12GB GPU for partial offload and you climb from 1.8 to 8.1 tok/s. That is about 4.5x the CPU-only throughput, which is the right deal when the alternative is "buy a 24GB card." If you do not have the 3060 yet and you want to taste Gemma 4 31B before committing $500, the 5800X is enough.

Perf-per-dollar and perf-per-watt math vs the next card up

Take Gemma 4 31B q4_K_M as the canonical workload and the street prices observed in late May 2026:

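| Card | Price (USD) | tok/s | $ per tok/s | Power (W) | tok/s per watt |
|---|---|---|---|---|---|
| RTX 3060 12GB | $440 | 8.1 | $54 | 170 | 0.048 |
| RTX 4060 Ti 16GB | $499 | 9.4 | $53 | 165 | 0.057 |
| Used RTX 3090 24GB | $680 | 21.6 | $31 | 350 | 0.062 |
| Ryzen 7 5800X (CPU) | $210 | 1.8 | $117 | 105 | 0.017 |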

On absolute throughput, the used 3090 wins handily. On dollars-per-token-per-second and tok/s per watt it is also the winner. The 3060's case is not "I will get more tok/s for the money"; it is "I already have one, or I want to spend less than $500 total, or I want a card that idles at 8 W instead of 22 W." The 4060 Ti 16GB is the hardest buy to justify in this table: it costs about $60 more than the 3060 for only a ~16% throughput bump, and its dollars per tok/s is essentially the same. If you are buying today, the choice is "3060 for the budget" or "used 3090 for the performance." There is no middle ground that makes sense.
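
The two derived columns in the table are simple ratios, so they are easy to recompute when street prices move. A short sketch using the table's own inputs:

```python
# (name, price in USD, generation tok/s at q4_K_M, power in watts)
ROWS = [
    ("RTX 3060 12GB", 440, 8.1, 170),
    ("RTX 4060 Ti 16GB", 499, 9.4, 165),
    ("Used RTX 3090 24GB", 680, 21.6, 350),
    ("Ryzen 7 5800X (CPU)", 210, 1.8, 105),
]


def value_metrics(price_usd: float, tok_s: float, watts: float) -> tuple[float, float]:
    """Return (dollars per tok/s, tok/s per watt)."""
    return price_usd / tok_s, tok_s / watts


for name, price, tok_s, watts in ROWS:
    dollars_per_tok_s, tok_s_per_watt = value_metrics(price, tok_s, watts)
    print(f"{name:<20} ${dollars_per_tok_s:>4.0f} per tok/s   {tok_s_per_watt:.3f} tok/s per W")
```

Power here is the rated TDP, not a wall measurement, so the per-watt column is best read as a ranking rather than an efficiency figure.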

Common pitfalls when running Gemma 4 31B on a 3060

We have seen these in our own runs and on community threads — call them out before they cost you an evening:

  1. PCIe x4 slot. A 12GB card forced into a PCIe 4.0 x4 slot (common on the second slot of cheap B550 motherboards) gets a quarter of the link bandwidth of an x16 slot. Partial-offload tok/s on Gemma 4 31B q4_K_M can fall from 8.1 to 4.5 just from PCIe starvation. Always run the GPU in the top PCIe x16 slot.
  2. Forgetting --mlock. Without --mlock, the kernel happily evicts model weights from RAM when other processes claim it, and llama.cpp will re-read from the SSD mid-generation. Symptom: tok/s drops to 0.3 every 30 seconds, then recovers.
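  3. Resizable BAR off. On older B450/X470 motherboards, ReBAR disabled costs ~10% of GPU throughput on Ampere. Verify with nvidia-smi -q -d MEMORY and read the Total under "BAR1 Memory Usage": with ReBAR on it should cover the whole VRAM (typically reported as 16384 MiB on a 12GB card), not 256 MiB.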
  4. Outdated llama.cpp build. Performance on 31B class models has improved 25-35% between Q1 2026 builds and current main thanks to better k-quant kernels. Rebuild monthly.
  5. CUDA 11 instead of 12. The 3060 runs fine on CUDA 11.x but loses 5-8% throughput vs CUDA 12.4+ because of cuBLAS kernel availability for 4-bit matmul.
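
For the PCIe item in the list above, the link bandwidth is easy to work out from lane count and generation. The per-lane figures below are the usual rounded one-direction rates after encoding overhead; real transfers achieve somewhat less, so treat the output as a ceiling.

```python
# Approximate usable bandwidth per lane, one direction, in GB/s.
PCIE_GB_S_PER_LANE = {3: 0.985, 4: 1.969}


def pcie_bandwidth_gb_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction link bandwidth for a PCIe generation and width."""
    return PCIE_GB_S_PER_LANE[gen] * lanes


for gen, lanes in [(4, 16), (4, 4), (3, 4)]:
    print(f"PCIe {gen}.0 x{lanes}: ~{pcie_bandwidth_gb_s(gen, lanes):.1f} GB/s")
```

An x4 link carries a quarter of what x16 does at the same generation. To see what your card actually negotiated, `nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv` reports the live link.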

When NOT to use a 3060 for Gemma 4 31B

If you need 15+ tok/s sustained, if you need agentic workflows with multi-turn tool calls and re-prefill every step, if you want q5_K_M or higher quality on long contexts, or if your KV cache needs to grow past 16K tokens — buy a used RTX 3090. The 3060 is the right card for casual chat, weekend tinkering, batch evaluation, and CPU-offload bring-up. It is the wrong card for production-grade local inference of 27B-32B models. The crossover point is around the 24GB VRAM line; everything above that is a 3090/4090/L4 problem.

Bottom line: when the 3060 is enough

The Zotac Gaming GeForce RTX 3060 Twin Edge OC 12GB at ~$510 and the MSI GeForce RTX 3060 Ventus 2X 12G at ~$659 are both fine cards if you accept the offload tax. Run Gemma 4 31B at q4_K_M, set GPU layers to 32, pair the card with a Ryzen 7 5800X-class CPU and 32GB DDR4-3600, and expect ~8 tok/s of useful work. Add a Crucial BX500 1TB SSD to keep model swaps painless and you have a $700 build that runs the 2026 model that everyone on r/LocalLLaMA is talking about. If you need it faster, find a used 3090. If you need it cheaper, run CPU-only on the 5800X. Otherwise, the 3060 12GB does the job; it just does it at a Sunday-afternoon pace.


SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

Will Gemma 4 31B fit entirely in 12GB of VRAM?
Not at usable quality. At q4_K_M, the weights alone consume roughly 18 to 20 GB before any KV cache, so a 12GB RTX 3060 must offload several layers to system RAM. You can fit a q2_K or q3_K_S quant fully in VRAM, but quality degrades noticeably. Most users run q4_K_M with partial CPU offload and accept the throughput penalty in exchange for usable answer fidelity.
How many tokens per second should I expect on an RTX 3060 12GB?
On Gemma 4 31B at q4_K_M with 32 of 60 layers on GPU, expect about 8 tokens per second of generation on a Ryzen 7 5800X plus DDR4-3600 system. q3_K_M with 50 of 60 layers on GPU gives roughly 11 tok/s but at lower quality. q5_K_M with heavier CPU offload drops to about 5 tok/s. Numbers vary by motherboard PCIe lane count, RAM speed, and llama.cpp build.
Is CPU-only inference a realistic alternative?
Only for low-volume batch use. A Ryzen 7 5800X with 32GB DDR4-3600 runs Gemma 4 31B at q4_K_M at about 1.8 tok/s, which is usable for overnight evaluation work but sluggish for interactive chat. Adding a 12GB GPU and offloading 32 of 60 layers to it boosts that to about 8 tok/s, roughly 4.5 times the CPU-only throughput, which is why a 12GB card remains the recommended floor for chat use.
Are the Heretic, Meromero, and Ortenzya finetunes heavier than stock Gemma 4 31B?
No. The trending uncensored finetunes keep the same parameter count, vocabulary, and architecture as base Gemma 4 31B. GGUF release files at q4_K_M land within 200MB of stock Gemma 4 31B's q4_K_M, and llama-bench measurements stay within run-to-run noise. The only meaningful difference is sampler defaults, which most chat UIs let you override anyway.
Should I just buy a used RTX 3090 instead?
If you can find one under $700, yes. A used RTX 3090 with 24GB GDDR6X and 936 GB/s of memory bandwidth runs Gemma 4 31B q4_K_M fully resident at roughly 21 tok/s versus the 3060's offload-bound 8 tok/s. Power draw is much higher at 350W TDP, and used cards carry risk. But on dollars per token per second, a used 3090 is currently the best value for 27B to 32B class local models.

— SpecPicks Editorial · Last verified 2026-06-05
