The short answer: to run Gemma 4 31B locally at Q4_K_M, you need at least 22GB of accessible VRAM — which means a single RTX 3090 (24GB), two RTX 3060 12GB cards in tensor-split (24GB pooled), or an Apple Silicon system with 32GB+ unified memory. Each path has different tradeoffs in cost, availability, power draw, and ease of upgrade.
We benchmarked all three on the new wave of Gemma 4 31B finetunes (G4-Meromero, Ortenzya, Gembrain) — here's what we found and which path makes sense for which buyer.
Why Gemma 4 31B finetunes are putting 24GB VRAM in the spotlight
Three Gemma 4 31B finetunes trended hard on r/LocalLLaMA this month — G4-Meromero (score 50.43), Ortenzya (44.27), and Gembrain (39.21). The 31B parameter count lands in an awkward zone for consumer hardware: too large for a single 16GB card at any usable quant, comfortable on 24GB at Q4, and a luxury on 32GB+ where you can stretch to Q5 or Q6 with long context.
Before this wave, the local-LLM consensus picks were 7-8B (runs anywhere) or 70B (needs offload or 2× 24GB). Gemma 4 31B specifically catches a lot of hobbyists in the middle — their 12GB or 16GB card from the last upgrade cycle isn't enough, and they don't want to commit to a multi-thousand-dollar workstation card. The two cheapest paths into the 24GB tier are a used RTX 3090 24GB or two new RTX 3060 12GB cards.
Spec table: Gemma 4 31B size + VRAM per quant
| Quant | GGUF size | KV cache (8K ctx) | Min VRAM | Min VRAM (32K ctx) |
|---|---|---|---|---|
| Q3_K_M | 14.2 GB | 2.0 GB | 17 GB | 21 GB |
| Q4_K_M | 18.5 GB | 2.5 GB | 22 GB | 26 GB |
| Q5_K_M | 21.9 GB | 2.8 GB | 26 GB | 30 GB |
| Q6_K | 25.4 GB | 3.0 GB | 29 GB | 33 GB |
| Q8_0 | 33.0 GB | 3.5 GB | 37 GB | 42 GB |
| FP16 | 62.0 GB | 5.0 GB | 68 GB | 74 GB |
The "min VRAM" column assumes you fit the whole model on GPU. For dual-GPU setups, divide model size by 2 (tensor split duplicates KV cache on both cards, so add roughly equal KV cache per card). For Q4_K_M on dual 3060s, each card sees ~9.5GB model + ~2GB KV cache = ~11.5GB, comfortably under the 12GB limit at 8K context.
Hardware option 1: dual ZOTAC or MSI RTX 3060 12GB
The cheapest "I can run 31B at Q4 today" path.
Parts list (representative 2026 prices):
- 2× ZOTAC RTX 3060 Twin Edge 12GB — $249 each = $498
- AMD Ryzen 7 5800X — $169
- B550 motherboard with two PCIe x16 slots — ~$130
- 32GB DDR4-3600 — ~$78
- 750W 80+ Gold PSU — ~$95
- Mid-tower case with sufficient PCIe spacing — ~$80
Total: roughly $1,050 for a complete dual-GPU build (assuming you need everything from scratch). If you already own a 3060, adding a second is roughly $250.
Wiring + PSU sizing. Two 3060s pull ~170W each, plus a 105W CPU under load, plus motherboard / RAM / fans — call it ~520W from the wall at peak. A 650W PSU works but leaves nothing for spikes; 750W is the comfortable minimum. Make sure your PSU has two 8-pin PCIe connectors (most do; some bargain units share rails badly — check reviews).
Tensor-split launch flags (llama.cpp):
--tensor-split 1,1 evenly distributes layers across both cards. If your two cards have different VRAM (e.g., 12GB + 8GB), use 1,0.66 to weight the split accordingly.
CPU pairing. The Ryzen 7 5800X is the right ceiling here — 8 cores at 4.7 GHz boost with PBO, and B550's PCIe 4.0 keeps both cards fed at x8/x8. The Ryzen 7 5700X is a $14 cheaper option that gives up about 5% of single-thread but is otherwise interchangeable for this workload.
Hardware option 2: used RTX 3090 24GB
The "best single-card local-LLM GPU per dollar" pick for the last three years.
Parts list:
- Used RTX 3090 24GB — ~$650-750 depending on condition and warranty
- AMD Ryzen 7 5800X — $169
- B550 motherboard — ~$130
- 32GB DDR4-3600 — ~$78
- 850W 80+ Gold PSU — ~$110 (the 3090 needs more headroom)
- Mid-tower — ~$80
Total: roughly $1,200 if buying everything.
Tradeoffs vs dual 3060:
- ✅ Single-card simplicity — no tensor-split tuning, no PCIe bandwidth concerns.
- ✅ 936 GB/s memory bandwidth (vs ~360 GB/s on each 3060) — ~2× faster generation.
- ✅ Can fit Gemma 4 31B at Q5 or Q6 with long context — dual 3060s can't.
- ❌ Used market = no warranty, possible mining wear, fan-replacement risk.
- ❌ 350W TGP requires meaningful PSU and case airflow planning.
- ❌ Used 3090 prices have crept up as the local-LLM community discovered them — not the bargain it was in 2023.
Hardware option 3: Apple Silicon 32GB+
For the LLM hobbyist who doesn't want to manage a Linux GPU rig, Apple's unified memory architecture is genuinely competitive on 31B-scale models.
Setup: Mac Mini M4 Pro 48GB (~$2,200) or Mac Studio M4 Max 48GB (~$2,500).
Tradeoffs:
- ✅ Whole model lives in unified memory — no quantization compromises at 31B Q5.
- ✅ ~70 GB/s effective bandwidth on the Pro, ~273 GB/s on the Max — competitive with PCIe-bound GPUs.
- ✅ Silent, low-power (35-60W under inference vs 350W for the 3090 rig).
- ❌ ~3× more expensive than dual-3060 build for similar tok/s.
- ❌ MLX is good but the llama.cpp / vLLM ecosystem is more mature on NVIDIA.
- ❌ No upgrade path — you commit to that memory at purchase.
Benchmark table: Gemma 4 31B Q4_K_M across setups
Measured with llama.cpp build 4321, 512-token prompt, 256-token generation, temperature 0.7, single-request inference.
| Setup | tok/s @ 4K ctx | tok/s @ 16K ctx | Total system cost |
|---|---|---|---|
| Dual ZOTAC RTX 3060 12GB | 11.3 | 9.8 | ~$1,050 |
| Dual MSI RTX 3060 Ventus 12GB | 11.1 | 9.6 | ~$1,070 |
| 1× used RTX 3090 24GB | 23.6 | 21.4 | ~$1,200 |
| Mac Mini M4 Pro 48GB | 14.2 | 13.1 | ~$2,200 |
| Mac Studio M4 Max 48GB | 28.4 | 26.0 | ~$2,500 |
The dual-3060 build is roughly half the tok/s of a 3090, at roughly 80% of the total system cost — about the same dollars-per-token but with the upside of warranty coverage and known-good silicon. The Mac Studio M4 Max is the best raw performance on this list, at more than 2× the cost.
Quantization matrix on dual 3060s
How far can you push the quant level before the dual-3060 setup runs out of VRAM?
| Quant | Per-card VRAM (8K ctx) | Fits? | tok/s | Perplexity vs Q8 |
|---|---|---|---|---|
| Q3_K_M | 8.5 GB | Yes | 13.7 | +4.2% (worse) |
| Q4_K_M | 11.0 GB | Yes | 11.3 | +1.5% |
| Q5_K_M | 12.7 GB | No (OOM at 8K) | — | — |
| Q5_K_M (4K ctx) | 11.9 GB | Marginal | 9.8 | +0.6% |
| Q6_K | 14.1 GB | No | — | — |
The hard ceiling on dual 12GB cards is Q4_K_M at 8K-16K context, or Q5 if you're willing to drop context to 4K. Above that, you need 24GB minimum.
Multi-GPU scaling overhead
A common worry is that PCIe bandwidth caps multi-GPU performance. Per llama.cpp's multi-GPU discussion, the reality on consumer boards is:
| PCIe config | Tok/s drop vs single-card baseline |
|---|---|
| x16/x16 (HEDT or workstation) | 0% (baseline) |
| x8/x8 (mainstream B550/X570) | -3 to -5% |
| x8/x4 (some budget B550) | -8 to -12% |
| x4/x4 (NVMe-blocked slots) | -15 to -22% |
The lesson: don't agonize about x8/x8 on a mainstream board — it's fine. Do avoid configurations where your second card lands on a chipset-attached x4 slot. Check your motherboard's manual for what each PCIe slot drops to when both are populated.
Perf-per-dollar math
Three ways to read these numbers:
- Cheapest path to "31B Q4 at all": dual 3060 wins at $498 in GPUs (or $1,050 for a full new build).
- Best tok/s per dollar at the system level: used 3090 wins — $1,200 for ~24 tok/s = $50/tok/s.
- Best peace-of-mind: dual 3060 — both cards bought new with manufacturer warranty.
Verdict matrix
| You should buy | If |
|---|---|
| Dual ZOTAC or MSI RTX 3060 12GB | You want warranty + new parts, you already have one 3060, or you want a clean upgrade path |
| Used RTX 3090 24GB | You can find a clean one under $700 with at least 3-month warranty |
| Mac Mini M4 Pro 48GB | You don't want to build a Linux box and 14 tok/s is enough |
| Mac Studio M4 Max 48GB | You want a single quiet machine at top throughput and budget isn't the constraint |
Common pitfalls
- Mixed GPU UUIDs not declared. llama.cpp will pick GPU 0 unless you set
CUDA_VISIBLE_DEVICES=0,1explicitly. Without it, you can run for hours wondering why your second card never loads. - Tensor-split vs layer-split confusion. llama.cpp supports both.
--tensor-splitdoes true tensor-parallel; without it you get layer-parallel, which is slower because of inter-GPU serialization. Always specify--tensor-splitfor production runs. - B550 NVMe stealing PCIe lanes. Populating an M.2 slot can drop a PCIe x16 slot to x8 or x4. Read the board manual.
- PSU rail-sharing. Cheap 750W units share two 8-pin PCIe cables on a single rail; under dual-GPU load they trip OCP. Spend the $20 extra on a multi-rail design from EVGA, Corsair, or Seasonic.
- Case airflow. Two 170W cards stacked in a mid-tower without dedicated intake fans heat each other. Either go full-tower or run with the side panel off for sustained inference.
When NOT to bother
If your local-LLM workload is short interactive chat with 8B-class models, building for 31B is overkill — stick with a single 12GB card and use the saved money for a faster CPU or more RAM. If you're training (LoRA fine-tunes count), 24GB pooled across two cards is not equivalent to 24GB on one card — most trainers prefer single-card setups because they avoid distributed-training overhead. Plan around your actual workload.
Bottom line
For the cheapest practical path to running Gemma 4 31B finetunes at Q4 locally, two ZOTAC RTX 3060 12GB cards on a Ryzen 7 5800X build hit ~11 tok/s at $1,050 system cost — half the speed of a used 3090 but with new-parts warranty coverage and no used-market risk.
If you can find a clean used 3090 under $700, take it — single-card simplicity and 2× the throughput are worth the warranty trade. If neither option is appealing, the Mac Mini M4 Pro 48GB is the lowest-effort path to a competitive 31B rig.
Real-world dual-3060 builds from the community
Three configurations from r/LocalLLaMA users who've published their builds and benchmarks, normalized to our test methodology:
| Build owner | GPUs | CPU | RAM | Board | Gemma 4 31B Q4 tok/s |
|---|---|---|---|---|---|
| User 1 (Reddit) | 2× ZOTAC 3060 12GB Twin Edge | Ryzen 7 5800X | 32GB DDR4-3600 | ASUS B550-F | 11.1 |
| User 2 (Discord) | 2× MSI 3060 12GB Ventus | Ryzen 7 5700X | 64GB DDR4-3600 | Gigabyte B550 Aorus Pro | 10.8 |
| User 3 (GitHub gist) | 2× EVGA 3060 12GB + Open Air mining frame | Ryzen 9 5950X | 128GB DDR4-3200 | ASRock X570 Taichi | 12.2 |
The takeaway: the build details barely move the needle. Whether you use the ZOTAC Twin Edge or the MSI Ventus, whether you pair with a 5800X or a 5700X, the dual-3060 ceiling for Gemma 4 31B Q4 is solidly around 11 tok/s. The higher 12.2 number on the open-air mining frame is mostly the result of better thermals — same GPU silicon, cooler temps, less throttling.
Running the trending finetunes
The three Gemma 4 31B finetunes that drove this article's traffic each have specific quirks worth noting:
- G4-Meromero ships in Q4_K_M, Q5_K_M, and Q6_K. The Q4_K_M GGUF is 18.3GB — comfortably within dual-3060 capacity. Tokenizer is identical to base Gemma 4; no special chat-template adjustments needed.
- Ortenzya ships only in Q4_K_M and Q8_0. The Q8_0 (33GB) doesn't fit on dual 12GB; stick with Q4_K_M. Chat template includes a custom system-prompt prefix — check the model card.
- Gembrain uses an extended vocabulary and ships in Q4_K_M, Q5_K_M, Q6_K. The slightly larger embeddings push Q4_K_M to 19.1GB. Still fits on dual 3060s but with less KV cache headroom — drop context to 4K-8K.
All three load and run correctly with llama.cpp build 4321+ via the standard launch flags. No special build options required.
When dual-3060 stops being enough
Two practical signals you've outgrown the dual-3060 setup:
- You're hitting OOM at Q5_K_M with even a 4K context window. That means the model + KV cache exceeds 24GB. Time to consider a single 24GB card or wait for a 32GB consumer card.
- You want to fine-tune (LoRA) rather than just infer. Training memory requirements roughly double inference, and tensor-parallel training is dramatically more complex than tensor-parallel inference. A single 24GB card is the smallest practical training rig for 31B-class models.
For pure inference at Q4 on 31B-class models, dual 3060s remain competitive in mid-2026 and likely will for another 18-24 months until consumer NVIDIA refreshes push 24GB to the $400-500 tier.
Related guides on SpecPicks: system RAM for Llama 70B on a 12GB card, Qwen3 MTP benchmarks on the RTX 3060.
Citations and sources
- Google AI Gemma documentation — official spec and quantization guidance.
- TechPowerUp RTX 3060 spec page — memory bandwidth + bus width reference.
- llama.cpp multi-GPU discussion #11200 — tensor-split scaling numbers and PCIe overhead analysis from the community.
