Harmonia-31B-Heretic is a Gemma-4-31B-IT merge focused on reduced refusal behavior. At Q4_K_M it weighs 18-20GB, too large for a 12GB RTX 3060 unless some layers are offloaded to the CPU; in that configuration community measurements report mid-single-digit tok/s on llama.cpp. For full-GPU residency at 31B you want a 24GB+ card or unified-memory Apple Silicon; on 12GB, stick with Q4_K_M plus offload or pick a smaller model.
The uncensored-merge ecosystem around Gemma 4 31B
Gemma 4 31B-IT shipped with the same safety alignment as the rest of the Gemma family — heavy refusal on creative-writing prompts that touch violence, sexuality, or anything Google's policy team flagged as off-limits. For agentic workflows, story generation, and most red-team / security research, those refusals turn the model into a dead end. The merge ecosystem on Hugging Face exists to recover the underlying capability without the refusal layer, while preserving the instruction-following behavior that makes Gemma 4 useful in the first place.
Harmonia-31B-Heretic is the latest of these. It's a multi-source merge: take Gemma-4-31B-IT, blend it with finetunes that have already softened the refusal vector, and ship the result as a single model file the community can quantize and run locally. The "Heretic" suffix is convention across the merge scene for variants that have explicitly removed alignment layers; the lineage matters because it tells you what evaluation suites the result has and hasn't been measured against.
This is a hardware-first review. We're not going to walk through every nuance of the merge formula or the safety implications in detail — the model card on Hugging Face covers the base model and the community release thread covers the merge methodology. What we will do is answer the question every reader actually has: can my rig run it, what quant should I pick, and is it worth the upgrade if my current daily-driver is a 13B-class model?
Key takeaways
| Question | Answer |
|---|---|
| Base model | Google Gemma 4 31B-IT |
| Merge purpose | Reduced refusal, preserved instruction following |
| Q4_K_M file size | ~18-20 GB |
| Fits on 12GB RTX 3060? | Only with offload; expect mid-single-digit tok/s |
| Sweet-spot quant | Q6_K when VRAM allows; Q4_K_M for tight budgets |
| Recommended VRAM (no offload) | 24GB+ for Q4_K_M; 32GB+ for Q6 |
| Customer-facing OK? | No; add your own guardrails if you ship |
What is Gemma-4-Harmonia-31B-Heretic?
The r/LocalLLaMA release thread describes the model as a community merge of multiple Gemma 4 31B-IT finetunes targeted at reduced-refusal output for creative writing and roleplay. The base is unchanged in factual terms — no new training data, no continued pretraining — so factual recall and the underlying language model behavior remain anchored to whatever Gemma 4 31B-IT learned during Google's training run.
Practically, this means:
- Instruction following is preserved. The model still follows the <start_of_turn>user / <start_of_turn>model structure and respects multi-turn context the way base Gemma 4 does.
- Reasoning benchmarks are not dramatically different from the base. The merge does not unlock new capability; it removes a refusal layer.
- Factual recall window is whatever the base model has. Don't expect the merge to know anything the base didn't.
- Safety alignment is reduced. Outputs that the base would have refused will often complete.
How does it differ from base Gemma 4 31B-IT?
The most useful framing: think of the merge as a fine-tuned variant of base Gemma 4 31B-IT where the refusal layer has been softened. The underlying intelligence is the same; the policy behavior is different. If your use case for the base model was hitting refusal walls on legitimate creative work, the merge unblocks you. If your use case was production Q&A or anything compliance-sensitive, the merge is the wrong choice — base or instruction-tuned-with-policy is what you want.
| Behavior | Base Gemma 4 31B-IT | Harmonia-31B-Heretic |
|---|---|---|
| Instruction following | Strong | Strong (preserved) |
| Factual recall | Anchored to training window | Same as base |
| Creative-writing refusal | High | Low |
| Roleplay refusal | High | Low |
| Coding tasks | Strong | Strong (no change measured) |
| Safety guardrails | Built-in | Removed |
Quantization matrix: q2 / q3 / q4_K_M / q5 / q6 / q8 / fp16
For 31B-class models the typical llama.cpp quant ladder produces roughly these on-disk sizes; the VRAM (full) column is the card size needed to hold the whole file without offload:
| Quant | File size (GB) | VRAM (full) | VRAM (offload OK) | Quality loss | Typical use |
|---|---|---|---|---|---|
| Q2_K | ~11 | 12GB possible | Yes | Significant | Last-resort, low-end VRAM |
| Q3_K_M | ~14 | 16GB | Yes | Noticeable | Tight budgets |
| Q4_K_M | ~18-20 | 24GB | Yes (3060 12GB w/ offload) | Floor for serious work | Balanced |
| Q5_K_M | ~22-23 | 24GB | Yes | Small | Quality-leaning |
| Q6_K | ~26 | 32GB | Yes | Minimal | Sweet spot when VRAM allows |
| Q8_0 | ~33 | 48GB | Yes | Near-zero | Quality-critical |
| FP16 | ~62 | 80GB | Yes | None | Reference / training |
The Q4-to-Q6 jump is the one most community testers single out. A recent r/LocalLLaMA thread on Qwen3.6 35B explicitly called out "huge quality gain from Q4 to Q6 for coding agent", and a similar delta plausibly applies to Gemma 4 derivatives, although it has not been measured for this merge. If you can fit Q6, do.
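The file sizes in the table can be sanity-checked with simple arithmetic: parameters times bits per weight, divided by eight. The sketch below is a rough estimate only. The bits-per-weight values are approximate averages assumed for illustration, not figures taken from this model's files, and real GGUF sizes vary with architecture and metadata.

```python
# Approximate average bits per weight for common llama.cpp quant types.
# ASSUMPTION: rough illustrative values; real files differ by architecture.
APPROX_BPW = {
    "Q2_K": 2.8,
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def estimate_file_gb(params_billion: float, quant: str) -> float:
    """Estimate on-disk size in GB (10^9 bytes): params * bits / 8."""
    return params_billion * APPROX_BPW[quant] / 8


if __name__ == "__main__":
    for quant in APPROX_BPW:
        print(f"{quant:7s} ~{estimate_file_gb(31, quant):.1f} GB")
```

For a 31B model this lands inside the 18-20 GB range quoted for Q4_K_M and at 62 GB for FP16, which suggests the table is internally consistent even if individual rows are off by a gigabyte or so.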
Spec / bench table: model size by quant, recommended VRAM, target hardware
| Quant | Recommended VRAM | Single-GPU candidates | Multi-GPU candidates |
|---|---|---|---|
| Q4_K_M | 24GB | RTX 3090, RTX 4090, RTX 5090, Radeon RX 7900 XTX | 2×3060 12GB (split) |
| Q5_K_M | 24GB | RTX 3090, RTX 4090, RTX 5090 | 2×3060 12GB (tight) |
| Q6_K | 32GB | RTX 5090, RTX A6000, Mac Studio M-Ultra | 2×16GB cards |
| Q8_0 | 48GB+ | RTX A6000 48GB, multi-GPU rigs | 4×3060 12GB (split) |
Apple Silicon with unified memory changes the picture: an M4 Max with 64GB of unified memory can load Q6 directly into GPU-addressable space without the multi-card complexity.
Will it fit on a 12GB RTX 3060?
Short answer: only with offload, and only at Q4_K_M or lower. Community measurements on RTX 3060 12GB rigs running Harmonia or comparable 31B Q4_K_M merges via llama.cpp with partial offload (-ngl) land in the 3-6 tok/s range for single-user generation. At the common rule of thumb of about 0.75 English words per token, that is roughly 135-270 words per minute, close to normal reading speed: usable for a chat session, but painfully slow for any agentic loop that depends on completing dozens of small tool-calling turns per task.
The trade-off is concrete: every additional layer pushed to GPU costs VRAM but saves milliseconds per token. On a 12GB card you can typically fit 30-40 of the 60-odd layers; the rest stay in system RAM and are computed on the CPU, with only activations crossing PCIe on each token. CPU memory bandwidth and core count therefore become the bottleneck for the offloaded portion, which is why the same 3060 12GB rig hits very different tok/s depending on whether it's paired with a Ryzen 5800X or a budget i3.
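A rough way to pick a starting -ngl value is to divide the VRAM left after overhead by the average size of one layer. This is a minimal sketch under stated assumptions: it treats all layers as equal in size (embedding and output tensors are not), and the 1.5 GB reserve and 0.5 GB KV-cache allowance are hypothetical defaults, not measured values.

```python
def max_gpu_layers(file_gb, n_layers, vram_gb, reserve_gb=1.5, kv_cache_gb=0.5):
    """Estimate how many layers fit in VRAM, assuming equal-sized layers.

    reserve_gb covers compute buffers and desktop use (ASSUMPTION);
    kv_cache_gb is the space set aside for the KV cache (ASSUMPTION).
    """
    per_layer_gb = file_gb / n_layers
    usable_gb = vram_gb - reserve_gb - kv_cache_gb
    if usable_gb <= 0:
        return 0
    return min(n_layers, int(usable_gb // per_layer_gb))


if __name__ == "__main__":
    # 19 GB Q4_K_M file, 60 layers (the article says "60-odd"), 12 GB card.
    print(max_gpu_layers(19, 60, 12))
    # Same file on a 24 GB card: every layer fits.
    print(max_gpu_layers(19, 60, 24))
```

With these inputs the estimate for a 12GB card is 31 layers, at the low end of the 30-40 range above. Treat the result as a first guess and raise -ngl until the load fails or VRAM usage gets uncomfortably close to the limit.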
Prefill vs generation throughput
Prefill (processing the input prompt) and generation (producing tokens) are fundamentally different workloads. Prefill is highly parallel and scales with raw compute; generation is sequential and gated by memory bandwidth. On a 3060 12GB running 31B Q4_K_M with offload:
- Prefill: 100-400 tok/s for the GPU-resident layers, much slower for the CPU-offloaded portion. Long prompts (8k+ tokens) take real time.
- Generation: 3-6 tok/s sustained, dropping further at longer context.
For interactive chat with short inputs and medium-length outputs the experience is acceptable. For RAG with thousands of tokens of context or agentic workflows with long tool-call chains, the prefill cost dominates and a smaller model with higher throughput is often the better business choice.
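To see where prefill starts to dominate, it helps to split a request into its two phases. The sketch below uses the throughput ranges quoted above; the effective prefill rate of 100 tok/s for a partially offloaded model is an assumption on the pessimistic end of that range, and the token counts are made-up examples.

```python
def response_seconds(prompt_tokens, output_tokens, prefill_tps, gen_tps):
    """Return (prefill_seconds, generation_seconds) for one request."""
    return prompt_tokens / prefill_tps, output_tokens / gen_tps


if __name__ == "__main__":
    # Short chat turn: 200-token prompt, 300-token reply.
    print(response_seconds(200, 300, prefill_tps=100, gen_tps=4))
    # RAG-style request: 8,000-token prompt, same reply length.
    print(response_seconds(8000, 300, prefill_tps=100, gen_tps=4))
```

In the chat case prefill takes 2 seconds against 75 seconds of generation. In the RAG case prefill alone takes 80 seconds, longer than the reply itself, which is the situation the paragraph above warns about.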
Context-length impact at q4_K_M
KV cache scales linearly with context length and with the model's layer count, number of KV heads, and head dimension; the weight quantization level does not change it. For this 31B model, every additional 1k tokens of context adds roughly 80-100 MB of KV cache when stored at half precision. On a 12GB card already filled with GPU-resident weights, this is the difference between a workable 4k context and an OOM at 8k.
| Context | KV cache (approx) | Practical headroom on 3060 12GB |
|---|---|---|
| 2k | ~160 MB | Fine |
| 4k | ~320 MB | Usable |
| 8k | ~640 MB | Tight; risk of OOM with offload |
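| 16k | ~1.3 GB | Not realistic without a quantized KV cache (--cache-type-k q8_0 --cache-type-v q8_0) |
llama.cpp's --cache-type-k and --cache-type-v options (short forms -ctk and -ctv) accept q8_0 and q4_0 to quantize the cache itself, recovering significant headroom at minor quality cost. Quantizing the V cache generally requires flash attention to be enabled in your build. On a 12GB card this is often what makes 8k+ context viable.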
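The table above can be reproduced, and extended to quantized caches, from the article's own per-1k figure. This is a sketch under assumptions: it takes 80 MB per 1,024 tokens at f16 as given rather than deriving it from the model architecture, and it scales by the block layouts of the llama.cpp cache formats (q8_0 packs 32 values into 34 bytes, q4_0 into 18 bytes, versus 64 bytes at f16).

```python
# Storage cost per cached value relative to f16, from block sizes:
# f16 = 64 bytes per 32 values, q8_0 = 34 bytes, q4_0 = 18 bytes.
CACHE_SCALE = {"f16": 1.0, "q8_0": 34 / 64, "q4_0": 18 / 64}


def kv_cache_mb(context_tokens, cache_type="f16", mb_per_1k_f16=80.0):
    """Approximate KV cache size in MB.

    mb_per_1k_f16 is the article's rough estimate (ASSUMPTION), not a
    value computed from the model's layer and head configuration.
    """
    return context_tokens / 1024 * mb_per_1k_f16 * CACHE_SCALE[cache_type]


if __name__ == "__main__":
    for ctx in (2048, 4096, 8192, 16384):
        sizes = {t: round(kv_cache_mb(ctx, t)) for t in CACHE_SCALE}
        print(ctx, sizes)
```

At 16k context the f16 estimate is 1,280 MB, while q8_0 brings it to 680 MB and q4_0 to 360 MB, which is roughly the headroom an 8k f16 cache would need.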
Multi-GPU vs single-card considerations
Two RTX 3060 12GB cards give you 24GB of combined VRAM, enough to host Q4_K_M of a 31B model fully on GPU when llama.cpp splits the layers across them. Expect generation speed well above the offload case but below a single RTX 3090: with a layer split the cards work in sequence, and each 3060 offers roughly 360 GB/s of memory bandwidth against roughly 936 GB/s on the 3090. The pair can cost less than a 3090, but PSU, case, and motherboard slot constraints make it a fiddly upgrade path. For a clean upgrade most operators find a single RTX 3090 (24GB) or RTX 4090 (24GB) easier to justify than a second 3060.
For aspirational rigs, an RTX 5090 (32GB) gets you Q6 of 31B with room to spare, or comfortable Q4 on much larger models.
Perf-per-dollar vs. running a smaller Q6 model
The hard question: is Q4_K_M of 31B-Harmonia actually better than Q6_K of a 14B-class merge that fits your card cleanly? For creative writing, the answer is usually yes — the larger model's broader linguistic capacity shows up in tone, vocabulary, and consistency. For coding, the answer is often no — a well-tuned 14B coding model at Q6 outperforms a generalist 31B at Q4 on most benchmarks.
Decide based on workload, not parameter count. A 31B with painful offload that you grow to hate isn't worth the bragging rights.
Verdict matrix: when Harmonia beats base Gemma 4, when it doesn't
| Use case | Pick |
|---|---|
| Creative writing where base refuses | Harmonia |
| Roleplay / character work | Harmonia |
| Production Q&A where output goes to customers | Base Gemma 4 IT |
| Coding agent | Either; quant matters more than the merge |
| Compliance-sensitive workflow | Base, with policy enforcement on top |
| Research / red-team eval | Harmonia |
Bottom line — recommended quant for a 12GB rig
For a 12GB RTX 3060, the canonical budget LLM card, Q4_K_M with -ngl 35 to -ngl 40 and a q8_0 KV cache (--cache-type-k q8_0 --cache-type-v q8_0) gives you a workable single-user chat experience at 3-6 tok/s for 4k context. If you'd rather have full-GPU residency at acceptable speed, drop to a 14B-class merge at Q6_K; throughput then moves into the tens of tok/s, in line with the full-GPU rows in the real-world numbers table later in this review.
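The recommendation above can be turned into a launch command. The sketch below only builds and prints the argument list; it assumes the llama-server binary from a recent llama.cpp build, where the cache quantization options are spelled --cache-type-k and --cache-type-v, and the model filename is a hypothetical placeholder. Quantizing the V cache may also require flash attention to be enabled, so check llama-server --help for your build before relying on it.

```python
import shlex


def llama_server_args(model_path, n_gpu_layers=35, ctx_size=4096, cache_type="q8_0"):
    """Build an argv list for llama-server with partial GPU offload."""
    return [
        "llama-server",
        "-m", model_path,              # path to the GGUF file
        "-ngl", str(n_gpu_layers),     # layers kept on the GPU
        "-c", str(ctx_size),           # context length in tokens
        "--cache-type-k", cache_type,  # quantize the K cache
        "--cache-type-v", cache_type,  # quantize the V cache
    ]


# HYPOTHETICAL filename; substitute the file you actually downloaded.
args = llama_server_args("harmonia-31b-heretic.Q4_K_M.gguf")
print(shlex.join(args))
```

Passing the list to subprocess.run would start the server. Printing it first makes it easy to confirm that the filename in the command matches the quant you intended to run.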
For 31B-class daily-driver use with long context, the upgrade target is a 24GB card (RTX 3090, RTX 4090) or a unified-memory Mac that can address most of its system RAM as GPU memory. A 16GB card (RTX 4080, RTX 5080) reduces how many layers you offload at Q4_K_M but does not eliminate offload. Until then, 31B on a 12GB card is best treated as an occasional experiment, not your default.
Common pitfalls and gotchas
The single most common failure mode in local-LLM operations is silent quantization mismatch: pulling a Q6_K weight file when your config still references the Q4_K_M filename. The model loads, the API responds, the output looks plausible, but the throughput is half what you expected because the larger file spilled more layers into CPU memory without you noticing. Always hash-verify the on-disk file against the checksum published upstream before declaring a benchmark run valid.
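Hash verification needs nothing beyond the standard library. This is a minimal sketch: it streams the file in chunks so a 20 GB GGUF does not have to fit in RAM, and it assumes the upstream source publishes a SHA-256 hex digest. The function names are illustrative, not part of any tool mentioned here.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare the on-disk digest with a published SHA-256 hex string."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

Running matches_published on the model path with the digest copied from the download page, and recording the digest alongside each benchmark result, makes it obvious later which quant a given number came from.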
The second most common: assuming an MCP server bound to "all interfaces" is fine because your home network is "behind NAT." Modern routers increasingly hand out IPv6 prefixes to internal devices and the firewall behavior on IPv6 is materially less protective than on IPv4. If you've never explicitly checked, run ss -tlnp6 on the inference host and confirm nothing is listening on a global-scope IPv6 address.
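Once you have the list of listening addresses, classifying them is straightforward with the standard ipaddress module. The sketch below assumes you have already extracted the address part (without the port) from the ss output; the helper name is hypothetical. It flags wildcard binds such as :: and 0.0.0.0, which accept traffic on every interface, and any globally routable address.

```python
import ipaddress


def is_exposed(listen_addr):
    """True if the address is a wildcard bind or globally routable.

    Accepts forms like "::", "[::]", "fe80::1%eth0", or "192.168.1.10".
    """
    cleaned = listen_addr.strip("[]").split("%")[0]  # drop brackets and zone id
    addr = ipaddress.ip_address(cleaned)
    return addr.is_unspecified or addr.is_global


if __name__ == "__main__":
    for candidate in ("::", "::1", "fe80::1%eth0", "192.168.1.10"):
        print(candidate, is_exposed(candidate))
```

A wildcard bind is the case to watch for: on a host that holds a global-scope IPv6 address, a service listening on :: is reachable on that address unless a firewall blocks it.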
The third: trusting an LLM's own "I cannot run as a tool" refusal as evidence of safety. Reduced-refusal merges and clever prompt-injection will route around model-level guardrails. The trust boundary lives at the MCP allowlist and the network layer, not in the model's text output.
Real-world numbers from comparable setups
On an RTX 3060 12GB paired with a Ryzen 5800X + 32GB DDR4-3200, the practical throughput envelope for common configurations is:
| Configuration | Single-user tok/s | Notes |
|---|---|---|
| Llama 3.1 8B Q4_K_M, full GPU | 35-50 | Sweet spot for daily-driver |
| Llama 3.1 8B Q6_K, full GPU | 28-40 | Quality jump worth the small speed cost |
| Mistral Small 22B Q4_K_M, full GPU | 14-20 | Tight but viable |
| 31B Q4_K_M with -ngl 35 offload | 3-6 | Painfully slow for agents; usable for chat |
| 70B Q4_K_M with offload | <1 | Avoid; swap to disk dominates |
These numbers should hold on rigs with similar memory bandwidth. Within a given model class, generation tok/s scales almost linearly with memory bandwidth in GB/s, because bandwidth is the gating resource for generation.
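The bandwidth argument can be made concrete with a back-of-envelope ceiling: each generated token requires reading every weight once, so time per token is at least the bytes on each device divided by that device's bandwidth. The figures below are assumptions for illustration: 360 GB/s is the RTX 3060 12GB spec-sheet bandwidth, 51.2 GB/s is the theoretical peak of dual-channel DDR4-3200, and the model sizes are approximate.

```python
def tok_per_sec_ceiling(model_gb, gpu_fraction, gpu_bw_gbs, cpu_bw_gbs):
    """Upper bound on generation speed when weights are split GPU/CPU.

    Ignores compute, KV cache reads, and PCIe transfers, so real
    throughput will be lower than the value returned here.
    """
    gpu_seconds = model_gb * gpu_fraction / gpu_bw_gbs
    cpu_seconds = model_gb * (1 - gpu_fraction) / cpu_bw_gbs
    return 1 / (gpu_seconds + cpu_seconds)


if __name__ == "__main__":
    # 31B Q4_K_M (~19 GB) with 35 of 60 layers on the GPU.
    print(round(tok_per_sec_ceiling(19, 35 / 60, 360, 51.2), 1))
    # 8B Q4_K_M (~4.9 GB) fully on the GPU.
    print(round(tok_per_sec_ceiling(4.9, 1.0, 360, 51.2), 1))
```

The offloaded 31B case comes out at a ceiling of about 5.4 tok/s, inside the measured 3-6 range, and most of that time is spent on the CPU-resident layers. The 8B case has a ceiling near 73 tok/s against a measured 35-50, a reminder that this bound is optimistic.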
When NOT to use this setup
Skip this hardware / config combination if your workload is batched serving for multiple concurrent users; for that, a single H100 / MI300 is more cost-effective than a stack of consumer cards because batching amortizes the per-user cost. Skip it if you need GPU-resident fine-tuning of 13B+ models; the VRAM ceiling on a 12GB card is too tight. Skip it if your latency budget per token is below 50ms and you need the 31B model: at 3-6 tok/s the offloaded configuration spends roughly 170-330ms per token, and on this card only the 8B-class configurations in the table above get under 50ms. For chat-style single-user LLM use, this is the right rig; for anything production-grade, scale up.
Citations and sources
- Google Gemma 4 31B-IT — Hugging Face
- llama.cpp — github.com/ggerganov/llama.cpp
- r/LocalLLaMA release thread and quant-comparison discussion
Reviewed: May 2026.
