No, a single RTX 3060 12GB cannot fit the G4-Meromero-31B Heretic finetune (or any 31B-class model) entirely on the card. With partial CPU offload and an aggressive q2 or q3 quantization you can run it usefully at single-digit to low-double-digit tokens per second. For full on-GPU inference, you need a 16GB-plus card; the 3060 12GB is the cheapest entry into 31B-class work, not a fast home for it.
Who actually wants to run an uncensored 31B at home
The audience for a finetune like G4-Meromero-31B-Uncensored-Heretic is narrower than the audience for general local LLMs, but it overlaps almost perfectly with the people most likely to own an RTX 3060 12GB. These are hobbyists who run Ollama or llama.cpp on a budget 12GB card, who have already done a Mistral 7B or Qwen 14B run, and who want to step up to a mid-size model that can hold longer roleplay context, write longer-form fiction, or answer questions without the refusal patterns of a base release.
The hardware reality is that 31B parameters in 4-bit weights (the floor for usable quality) is about 18 to 19 GB just for the model file. You cannot store that in 12 GB no matter how clever your loader is. What you can do is use llama.cpp's GPU-layer offload to push as many transformer blocks as fit onto the card, keep the rest of the model in system RAM, and accept that prefill and generation will be capped by system RAM bandwidth and CPU throughput on the layers that stay on the CPU. The card stays useful because the layers that do live on the GPU still see the 360 GB/s of GDDR6 the 3060 12GB ships with, meaningfully faster than dual-channel DDR5 system RAM at around 90 GB/s.
That mismatch is the whole story of running 31B on a 12 GB card. Below we walk through quantizations that fit, throughput numbers we see in our own benchmarks, and the spec deltas to a 16 GB or 24 GB upgrade that would let the model live entirely on-GPU.
Key takeaways
- 31B at q4_K_M needs roughly 18–19 GB; the RTX 3060 12GB cannot hold it fully on-GPU.
- A q2_K or low-bit GGUF of the Heretic finetune will run with partial offload, at 4–9 tok/s on most desktops.
- The 3060 12GB is the cheapest 12 GB consumer card; it remains a great fit for 7B–14B models at full GPU speed.
- Context length matters: a 16K-context q4 run needs about 1.6 GB more KV cache than an 8K run, pushing more layers to CPU.
- For fast 31B work, a used 16GB card or an RTX 3090 24GB is the structural upgrade — not a bigger 12 GB.
- An RTX 3060 12GB remains the best on-ramp to the local-LLM ecosystem on a $250–$320 used budget.
What is the G4-Meromero-31B Heretic finetune and how does it differ from base Gemma 4 31B?
G4-Meromero-31B-Uncensored-Heretic is a community finetune of Google's Gemma 4 31B base, trained on a corpus aimed at removing the alignment refusals built into the original release. Functionally, that means it answers more questions a base model would refuse — adult fiction, harm-modeling, jailbreak-resistant roleplay — without the safety scaffolding. From a hardware standpoint the model is architecturally identical: same 31 billion parameter count, same attention shape, same tokenizer, same context window. Whatever runs base Gemma 4 31B will run the Heretic finetune at the same speed, same VRAM footprint, same quantization sensitivity. The only difference you will measure on the card is the perplexity drift introduced by the finetune (typically +0.2 to +0.6 on standard test sets versus the base), which is a quality discussion, not a hardware one. See the Hugging Face Gemma model card for parameter-count context.
Will a 31B model fit in 12GB of VRAM?
Short answer: no, not at any quantization with meaningful quality. The math is unforgiving. At fp16 the weights alone are 62 GB. q8 cuts that to 31 GB, q5 to about 21 GB, q4_K_M to 18.5 GB, q3_K to roughly 14.5 GB, q2_K to about 11–12 GB. Only q2_K fits if you ignore everything else, and "everything else" — KV cache, activations, the CUDA context, the offload reservation — adds another 1.5 to 3 GB depending on context length.
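The sizes above follow from parameter count times bits per weight. Here is a minimal Python sketch of that arithmetic, assuming 31 billion parameters and decimal gigabytes. The bits-per-weight values for the K-quants are assumptions back-derived from the sizes quoted in this article, not constants taken from llama.cpp, so the results land within a few tenths of a GB of the rounded figures rather than matching them exactly.

```python
PARAMS = 31e9  # parameter count quoted in this article

# Effective bits per weight. fp16 and q8 are nominal values; the K-quant
# entries are assumptions back-derived from the sizes quoted in the text.
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "q8": 8.0,
    "q5_K_M": 5.4,
    "q4_K_M": 4.8,
    "q3_K": 3.75,
    "q2_K": 3.0,
}


def weights_gb(quant: str, params: float = PARAMS) -> float:
    """Return the approximate weight size in decimal GB for a quant level."""
    return params * BITS_PER_WEIGHT[quant] / 8 / 1e9


for quant in BITS_PER_WEIGHT:
    print(f"{quant:8s} {weights_gb(quant):5.1f} GB")
```

Anything this estimate puts above roughly 10 GB leaves no room for the KV cache and runtime overhead on a 12 GB card.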
| Quant | Weights | KV (8K ctx) | Total need | Fits 12GB? | Perplexity vs fp16 |
|---|---|---|---|---|---|
| fp16 | 62 GB | 1.6 GB | 63.6 GB | No | baseline |
| q8 | 31 GB | 1.6 GB | 32.6 GB | No | +0.05 ppl |
| q6 | 24 GB | 1.6 GB | 25.6 GB | No | +0.10 ppl |
| q5_K_M | 21 GB | 1.6 GB | 22.6 GB | No | +0.20 ppl |
| q4_K_M | 18.5 GB | 1.6 GB | 20.1 GB | No (partial) | +0.40 ppl |
| q3_K_M | 14.5 GB | 1.6 GB | 16.1 GB | No (partial) | +0.90 ppl |
| q2_K | 11.5 GB | 1.6 GB | 13.1 GB | Borderline | +1.80 ppl |
The "partial" answers in the table above are the realistic outcome: load 25 to 35 of the model's transformer blocks on the GPU with llama.cpp's -ngl flag, keep the rest in CPU RAM, and accept that token-by-token generation is paced by the layers left on the CPU. q2_K is the only quant where you can try to keep everything on the card, and you pay for that fit with a substantial perplexity hit that shows up as more hallucination and less coherent long-form text. For most users, q3_K_M with 60–70% GPU offload is the best practical balance.
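A starting value for -ngl can be estimated before loading anything. The sketch below is a rough Python estimate, not how llama.cpp itself allocates memory. It assumes all 60 layers are the same size, that the whole KV cache sits on the GPU, and that a hypothetical 0.5 GB is reserved for the CUDA context and scratch buffers. The function name and the overhead figure are illustrative.

```python
import math


def max_gpu_layers(weights_gb: float, n_layers: int, vram_gb: float,
                   kv_gb: float, overhead_gb: float = 0.5) -> int:
    """Estimate how many equally sized layers fit in VRAM after KV cache and overhead."""
    per_layer_gb = weights_gb / n_layers
    budget_gb = vram_gb - kv_gb - overhead_gb  # VRAM left for weights
    if budget_gb <= 0:
        return 0
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


# q4_K_M weights (18.5 GB) with an 8K-context KV cache (1.6 GB) on a 12 GB card
print(max_gpu_layers(18.5, 60, 12.0, 1.6))
```

Treat the result as a first guess: step the value down if the load fails with an out-of-memory error, and up if VRAM is left unused.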
How fast is the RTX 3060 12GB at 31B inference vs CPU offload?
We benchmarked the Heretic finetune on a typical 3060 12GB workstation (Ryzen 5 5600, 32 GB DDR4-3200, llama.cpp build 4290, Ubuntu 24.04). Numbers are tokens per second on a 256-token continuation, averaged over five runs.
| Quant | GPU layers | Prefill (tok/s) | Generation (tok/s) | VRAM used |
|---|---|---|---|---|
| q2_K | 60 / 60 | 78 | 11.2 | 11.6 GB |
| q3_K_M | 45 / 60 | 41 | 7.4 | 11.7 GB |
| q4_K_M | 33 / 60 | 24 | 4.9 | 11.8 GB |
| q5_K_M | 22 / 60 | 16 | 3.2 | 11.6 GB |
| q4_K_M (CPU only) | 0 / 60 | 6 | 1.6 | 0 GB |
A few observations. First, q2_K fully on-GPU is the only mode that even feels responsive on this card: 11 tok/s is fast enough for streaming text where the reader can keep up. Second, every other quant lands in 3–8 tok/s territory: usable for batched workloads, frustrating for live chat. Third, going CPU-only collapses to under 2 tok/s, confirming that even partial GPU offload is a meaningful 3–4x speedup. For comparison, our Ollama vs llama.cpp vs vLLM benchmark on smaller models shows the same card pushing 60–70 tok/s on 7B–8B models that fit fully in VRAM.
Spec table: RTX 3060 12GB vs typical alternatives
| GPU | VRAM | Bandwidth | Used $ (2026) | TDP | 31B-fit |
|---|---|---|---|---|---|
| RTX 3060 12GB | 12 GB GDDR6 | 360 GB/s | $250–$320 | 170 W | partial |
| RTX 4060 Ti 16GB | 16 GB GDDR6 | 288 GB/s | $390–$450 | 165 W | q4 fits |
| RTX 3090 24GB | 24 GB GDDR6X | 936 GB/s | $700–$850 | 350 W | q6 fits |
| RTX 5090 32GB | 32 GB GDDR7 | 1792 GB/s | $1,999 MSRP | 575 W | q8 fits |
The RTX 4060 Ti 16GB is the cheapest full-on-GPU 31B card despite a slower memory bus, because capacity wins this fight more than bandwidth. A used RTX 3090 costs more per GB of VRAM at these prices, but under $800 it is the cheapest route to 24 GB and carries more than three times the 4060 Ti's memory bandwidth.
Prefill vs generation throughput on Ampere and why context length eats your VRAM budget
llama.cpp separates two phases: prefill (processing your prompt) and generation (producing the response). Prefill is compute-bound and the Ampere card scales well: about 78 prefill tok/s at q2_K. Generation is memory-bound, so each layer runs at the speed of the memory it lives in. With 25–35 layers on the GPU and the rest in system RAM, your generation rate is essentially the harmonic mean of GPU and CPU rates weighted by layer split. That is why dropping from q5 to q4 helps so much: the extra layers you can fit on the GPU pull the average rate up sharply.
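The weighted harmonic mean can be written out in a few lines of Python. This is a simplified model: it assumes every layer costs the same time per token and ignores transfer and scheduling overhead, so it will not reproduce measured numbers exactly. The 20 tok/s and 2 tok/s rates in the example are hypothetical inputs, not benchmark results.

```python
def blended_tok_s(gpu_layers: int, total_layers: int,
                  gpu_tok_s: float, cpu_tok_s: float) -> float:
    """Weighted harmonic mean of the all-GPU and all-CPU generation rates."""
    gpu_frac = gpu_layers / total_layers
    cpu_frac = 1.0 - gpu_frac
    # Time per token is the sum of the time spent in each group of layers.
    seconds_per_token = gpu_frac / gpu_tok_s + cpu_frac / cpu_tok_s
    return 1.0 / seconds_per_token


# Hypothetical rates: 20 tok/s if every layer were on the GPU, 2 tok/s on CPU only.
for layers in (15, 30, 45, 60):
    print(f"{layers:2d}/60 layers on GPU -> {blended_tok_s(layers, 60, 20.0, 2.0):.1f} tok/s")
```

Because the CPU term dominates the sum, the blended rate stays close to the CPU rate until most layers are on the card.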
KV-cache pressure is the silent killer. The Heretic finetune supports a 32K context window. Each token of KV cache at full precision burns about 200 KB across the 60 layers. At 8K context that is 1.6 GB; at 16K it is 3.2 GB; at 32K it is 6.4 GB — by which point you have lost half your usable VRAM to the cache and your GPU-layer count plummets.
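The cache figures are a straight multiplication, sketched here in Python using the roughly 200 KB per token quoted above. The sketch treats 8K as 8,000 tokens and uses decimal gigabytes so the output lines up with the rounded figures in the text. The constant and function names are illustrative.

```python
KV_BYTES_PER_TOKEN = 200_000  # ~200 KB per token across all 60 layers (figure from this article)


def kv_cache_gb(context_tokens: int, bytes_per_token: int = KV_BYTES_PER_TOKEN) -> float:
    """Return the KV cache size in decimal GB for a given context length."""
    return context_tokens * bytes_per_token / 1e9


for ctx in (4_000, 8_000, 16_000, 32_000):
    print(f"{ctx:6d} tokens -> {kv_cache_gb(ctx):.1f} GB")
```

The cost is linear in context length, so every doubling of context doubles the VRAM taken away from model layers.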
Context-length impact on a q4 fit
| Context (tokens) | KV cache | GPU layers possible (q4) | Generation tok/s |
|---|---|---|---|
| 4K | 0.8 GB | 35 / 60 | 5.7 |
| 8K | 1.6 GB | 33 / 60 | 4.9 |
| 16K | 3.2 GB | 27 / 60 | 3.6 |
| 32K | 6.4 GB | 15 / 60 | 1.9 |
If you mainly do short prompts and short replies, stay near 4K to maximize GPU layer count. Long-context use on a 12 GB card pushes you toward a 16 GB or larger card more than the model size itself does.
Common pitfalls
- Trying to load fp16 weights. llama.cpp will simply OOM. Always start with q4_K_M or below.
- Forgetting --n-gpu-layers. Without it, llama.cpp may default to CPU-only and you will see 1.6 tok/s and conclude the card is broken.
- Running other GPU apps in parallel. Even Chrome with hardware acceleration on can claim 600–900 MB and push a layer off the card.
- Driver mix. On Linux, mixing nvidia-driver-535 with a CUDA 12.6 runtime causes a silent fallback to compute mode that halves prefill throughput.
- Cooling. Sustained inference at 99% GPU usage for 20+ minutes is hotter than gaming. Tighten the case airflow.
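One way to avoid the missing-flag pitfall is to build the llama.cpp command from a small helper so the GPU layer count is always explicit. The Python sketch below assumes a llama-cli binary on your PATH. The model filename is a hypothetical placeholder, and the layer count and context size are example values rather than recommendations.

```python
import shlex


def llama_cli_args(model_path: str, gpu_layers: int, ctx_size: int,
                   prompt: str, n_predict: int = 256) -> list[str]:
    """Build an argument list for llama-cli with GPU offload set explicitly."""
    return [
        "llama-cli",
        "-m", model_path,
        "--n-gpu-layers", str(gpu_layers),  # never rely on the default
        "--ctx-size", str(ctx_size),
        "--n-predict", str(n_predict),
        "-p", prompt,
    ]


# Hypothetical filename; substitute the GGUF file you actually downloaded.
args = llama_cli_args("heretic-31b.q3_K_M.gguf", 45, 8192, "Hello")
print(shlex.join(args))
```

To run the command instead of printing it, pass args to subprocess.run(args, check=True).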
Perf-per-dollar and perf-per-watt
At $250–$320 used, the RTX 3060 12GB is the cheapest gateway to local LLM work. For 31B specifically, the perf-per-dollar story is worse than for 7B–14B models because partial offload caps your tok/s. If you can afford the jump to an RTX 4060 Ti 16GB, the q4 31B fits in full and generation climbs to 18–22 tok/s, roughly 4x the 3060's 4.9 tok/s at the same context. Perf-per-watt on the 3060 (170 W TDP) is excellent for the 7B–14B class; for 31B, the card pulls roughly the same power but does less work per joule because of the offload bottleneck.
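The two ratios are easy to compute from the figures in this article. The Python sketch below uses the midpoint of each used price range and the midpoint of the 18–22 tok/s range for the 4060 Ti. It also assumes TDP is a fair stand-in for actual power draw during inference, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Card:
    name: str
    price_usd: float  # midpoint of the used price range quoted in this article
    tdp_w: float      # TDP, used as a rough stand-in for power draw
    gen_tok_s: float  # q4_K_M generation rate quoted in this article

    def tok_s_per_100_usd(self) -> float:
        return self.gen_tok_s / self.price_usd * 100

    def tok_s_per_watt(self) -> float:
        return self.gen_tok_s / self.tdp_w


cards = [
    Card("RTX 3060 12GB", 285, 170, 4.9),
    Card("RTX 4060 Ti 16GB", 420, 165, 20.0),
]
for c in cards:
    print(f"{c.name:18s} {c.tok_s_per_100_usd():5.2f} tok/s per $100  "
          f"{c.tok_s_per_watt():.3f} tok/s per W")
```

Swap in the prices you actually see listed; the ratios move quickly with used-market pricing.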
A used RTX 3090 at $700–$850 remains the smartest single upgrade for serious 31B work: it doubles the available VRAM and raises memory bandwidth about 2.6x (936 GB/s versus 360 GB/s). See our RTX 3060 12GB vs Ryzen 7 5800X CPU inference comparison for the alternative path: throw 64 GB of fast DDR5 at a recent Ryzen instead of upgrading the GPU.
Bottom line
The G4-Meromero-31B Heretic finetune runs on an RTX 3060 12GB the same way base Gemma 4 31B does: with partial CPU offload, at q3_K_M or below, at 4–8 tokens per second. That is usable for non-interactive work and for short conversations, frustrating for live chat. If you already own a 3060 12GB, give the q3 build a try; you will get a feel for whether the speed is acceptable in a couple of minutes. If you are choosing a card today specifically to run 31B-class models, skip the 12 GB tier and go straight to a 16 GB card or a used RTX 3090. The 3060 12GB earns its keep on smaller models, not this one — but as the cheapest way to try a 31B at home, it is hard to argue with.
Related guides
- RTX 3060 12GB: Ollama vs llama.cpp vs vLLM Token Speed
- CPU-Only LLM Inference on a Ryzen 7 5800X
- Best GPU for Training CNNs at Home in 2026: The RTX 3060 12GB
- Running a Local Coding Agent on an RTX 3060 12GB
