Gemma 4 31B-IT on a 12GB RTX 3060: What Fits, What Offloads, How Fast

The VRAM math, the right quant, and the offload penalty for running a 31B instruction model on a budget card.

Not entirely on the GPU. Gemma 4 31B-IT does not fit in 12GB of VRAM at q4_K_M, because the weights alone need roughly 18-19GB. On an RTX 3060 12GB you drop to a more aggressive quant like q3_K_M, offload part of the model to system RAM, or both. It runs, and it can be genuinely useful, but plan your quant level around the 12GB ceiling first.

A 31B instruction model on a budget card

Gemma 4 is the kind of release that sends every local-LLM tinkerer back to their hardware spreadsheet. A 31B instruction-tuned model promises noticeably stronger reasoning, coding, and instruction-following than the 7B-to-14B models most budget rigs run comfortably — and the community immediately started asking the obvious question: does it fit on the most popular budget inference card, the RTX 3060 12GB?

The honest answer is "with compromises." 12GB was generous for the 8B-to-13B era, but a 31B model is a different weight class. At the q4_K_M quantization most people consider the quality sweet spot, the weights are larger than the card's entire frame buffer. You are not going to load this model purely on-GPU and walk away. You will be choosing between a smaller quant that fits, or a partial offload that keeps quality high but cuts throughput.

That does not make the 3060 a bad choice — it makes it a card with a clear ceiling you can plan around. If Gemma 4 31B is an occasional "bring out the big model" tool and your daily driver is a 7-13B model, the 3060 handles both, and the offload penalty on the rare 31B session is tolerable. If 31B is your everyday workload, this guide will also tell you honestly when it is time to size up. Either way, you will leave knowing exactly what fits, what spills, and how fast it goes.

Key takeaways

  • Gemma 4 31B-IT does not fit in 12GB at q4_K_M: weights need ~18-19GB, so expect q3_K_M, partial offload, or both on a 3060.
  • q3_K_M with part of the model offloaded is the practical sweet spot for speed; q4_K_M with a larger offload preserves quality but drops tokens per second sharply.
  • llama.cpp and Ollama beat vLLM on a single 12GB card because they handle CPU offload of overflow layers gracefully.
  • Context length costs VRAM separately from weights — jumping 8K → 32K can push several more layers off the GPU.
  • Two 3060s (24GB) hold a 31B model cleanly and are often a better Gemma 4 box than one bigger single GPU.

Does Gemma 4 31B fit in 12GB at all? The VRAM math

Start from the rule of thumb that q4_K_M weights occupy roughly 0.6GB per billion parameters. A 31B model therefore needs about 18-19GB just for weights, before you account for the KV cache and runtime overhead. That is well past 12GB.

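| Quant | GB per 1B params | 31B weights (approx) | Fits in 12GB on-card? |
| --- | --- | --- | --- |
| q2_K | ~0.40 | ~12.5 GB | Borderline: weights alone fill the card, no context headroom |
| q3_K_M | ~0.50 | ~15.5 GB | No, partial offload needed |
| q4_K_M | ~0.60 | ~18.5 GB | No, significant offload |
| q5_K_M | ~0.70 | ~21.5 GB | No |
| q8_0 | ~1.06 | ~33 GB | No |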

Even q2_K puts the weights at or just over the card's 12GB with no room left for context, and q2 visibly degrades coding and math. The realistic option is q3_K_M with roughly a quarter or more of its layers offloaded (15.5GB of weights against a 12GB card), or q4_K_M with a larger offload if you prioritize quality over speed.
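The rule of thumb is easy to turn into a quick fit check. The sketch below is a minimal illustration, assuming the per-billion figures from the table above. The `reserved_gb` allowance for KV cache and runtime overhead is an assumption for illustration, not a measured value.

```python
# Approximate GB of weights per 1B parameters, copied from the table above.
GB_PER_BILLION = {
    "q2_K": 0.40,
    "q3_K_M": 0.50,
    "q4_K_M": 0.60,
    "q5_K_M": 0.70,
    "q8_0": 1.06,
}


def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate weight footprint in GB for a given quant."""
    return params_billion * GB_PER_BILLION[quant]


def overflow_gb(params_billion: float, quant: str,
                vram_gb: float = 12.0, reserved_gb: float = 1.5) -> float:
    """GB of weights that cannot stay on the GPU.

    reserved_gb is an assumed allowance for KV cache and runtime overhead.
    """
    usable_gb = vram_gb - reserved_gb
    return max(0.0, weight_gb(params_billion, quant) - usable_gb)


if __name__ == "__main__":
    for quant in GB_PER_BILLION:
        weights = weight_gb(31, quant)
        spill = overflow_gb(31, quant)
        print(f"{quant:7s} weights ~{weights:5.1f} GB, overflow ~{spill:4.1f} GB")
```

Any quant with a non-zero overflow needs some layers in system RAM. Adjust `reserved_gb` to match the context length you actually plan to run.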

Which quantization should you pick?

For a 31B model on a 12GB card, the choice is between speed (smaller quant, more layers resident on the GPU) and fidelity (larger quant, more layers offloaded to system RAM). Public community measurements consistently show q4 retains most reasoning quality, q3 is a reasonable compromise, and q2 starts to show cracks on structured tasks like code generation.

| Quant | Quality | On a 3060 12GB | Recommendation |
| --- | --- | --- | --- |
| q2_K | Degraded on code/math | Mostly on-card | Only if you need speed over accuracy |
| q3_K_M | Good | Roughly a quarter or more of layers offloaded | Best all-round pick |
| q4_K_M | Near-reference | Significant offload | Best quality if you tolerate lower tok/s |
| q5_K_M+ | Negligible loss | Heavy offload | Not worth it on 12GB |

Test both q3_K_M and a partially offloaded q4_K_M on your own prompts. If your work is conversational, q3_K_M's speed usually wins. If you are doing careful code or analysis, the q4_K_M quality edge can justify the slower pace.

How much do you offload, and what does it cost in tokens per second?

Offloading is the lever that makes a too-big model run, and it is also the thing that slows it down. Every layer you push to system RAM is read at system-memory bandwidth and processed on the CPU instead of the GPU, with the intermediate activations handed across the PCIe bus between the two. The more you offload, the closer your throughput drifts toward CPU-only speeds.

| Configuration | Approx offload | Relative throughput | Experience |
| --- | --- | --- | --- |
| q3_K_M, mostly on-GPU | Light | Fastest on 12GB | Snappy enough for chat |
| q4_K_M, ~40-50% offloaded | Moderate | Noticeably slower | Usable for non-interactive work |
| q4_K_M, heavy offload | Heavy | Slow | Batch/overnight jobs only |
| q2_K, almost entirely on-GPU | Minimal | Fast but lower quality | Speed-first compromise |

The takeaway: keep as much of the model on the GPU as your chosen quant allows, and accept that a 31B on 12GB will never feel like a 7B on the same card. Offload is a tool for "it runs at all," not "it runs fast."
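For a starting value of how many layers to keep on the GPU, divide the VRAM left after context and overhead by the approximate size of one layer. This is a rough sketch only: it assumes every layer is the same size, the KV cache and overhead defaults are guesses, and `N_LAYERS = 60` is a hypothetical layer count, not a figure from the Gemma 4 model card. Take the real count from the model's metadata.

```python
import math

# Hypothetical layer count for illustration; NOT taken from the Gemma 4 model card.
N_LAYERS = 60


def gpu_layers(weights_gb: float, n_layers: int, vram_gb: float = 12.0,
               kv_cache_gb: float = 1.5, overhead_gb: float = 0.5) -> int:
    """Estimate how many layers can stay on the GPU.

    Assumes equally sized layers; kv_cache_gb and overhead_gb are assumed values.
    """
    per_layer_gb = weights_gb / n_layers
    budget_gb = vram_gb - kv_cache_gb - overhead_gb
    if budget_gb <= 0:
        return 0
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


if __name__ == "__main__":
    for quant, weights in (("q3_K_M", 15.5), ("q4_K_M", 18.5)):
        on_gpu = gpu_layers(weights, N_LAYERS)
        offloaded_pct = 100 * (N_LAYERS - on_gpu) / N_LAYERS
        print(f"{quant}: {on_gpu}/{N_LAYERS} layers on GPU, ~{offloaded_pct:.0f}% offloaded")
```

Treat the result as a first guess, then lower it by a layer or two if the runtime reports an out-of-memory error at load time.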

vLLM vs llama.cpp vs Ollama on a single 12GB card

Runtime choice matters more than usual when you are over the VRAM line. vLLM is superb for batched, high-throughput serving — but it expects the model to fit in VRAM. On a 12GB card, vLLM is better suited to smaller Gemma 4 variants than to the full 31B, because it does not gracefully spill overflow layers to system RAM the way a hobbyist single-GPU setup needs.

llama.cpp and Ollama (which wraps llama.cpp) are the friendlier choice here precisely because CPU offload of overflow layers is a first-class feature. You tell them how many layers to keep on the GPU, and they handle the rest on the CPU. For a single consumer 12GB card running a model that does not fit, that is exactly the behavior you want. Match the runtime to whether you offload: vLLM if everything fits, llama.cpp/Ollama if it does not.
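As a concrete illustration of "tell them how many layers to keep on the GPU", the sketch below builds a llama.cpp `llama-server` command line and the matching Ollama request options. The `--model`, `--n-gpu-layers`, and `--ctx-size` flags and the `num_gpu` / `num_ctx` options are the usual knobs for this; the model filename and the layer count of 38 are placeholders, not tested values for Gemma 4.

```python
import shlex


def llama_server_command(model_path: str, gpu_layers: int, ctx_size: int = 8192) -> list[str]:
    """Build a llama.cpp llama-server command with a fixed number of GPU layers."""
    return [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", str(gpu_layers),  # layers kept on the GPU; the rest run on the CPU
        "--ctx-size", str(ctx_size),        # context length, which sizes the KV cache
    ]


def ollama_options(gpu_layers: int, ctx_size: int = 8192) -> dict[str, int]:
    """Equivalent settings for the `options` field of an Ollama request."""
    return {"num_gpu": gpu_layers, "num_ctx": ctx_size}


if __name__ == "__main__":
    # Hypothetical filename and layer count, for illustration only.
    cmd = llama_server_command("gemma-4-31b-it-q3_k_m.gguf", gpu_layers=38)
    print(shlex.join(cmd))
    print(ollama_options(gpu_layers=38))
```

Keeping the layer count and context size as explicit parameters makes it easy to step the GPU layer count down until the model loads without running out of VRAM.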

Prefill vs generation throughput on the RTX 3060

The 3060 has two different speed stories. Prefill — chewing through your prompt — is compute-bound and the 3060's 3,584 CUDA cores handle it acceptably. Generation — emitting tokens one at a time — is memory-bandwidth-bound, and here the 3060's 360 GB/s GDDR6 is the asset that makes it worth using over CPU-only inference for the layers that stay resident. The moment layers spill to system RAM, those layers generate at DDR4/DDR5 bandwidth, which is why offload hurts generation so much more than it hurts prefill.
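Because generation is bandwidth-bound, a simple ceiling on tokens per second comes from assuming every weight is read once per token: time per token is the GPU-resident bytes divided by GPU bandwidth plus the offloaded bytes divided by system-RAM bandwidth. The sketch below uses the 360 GB/s figure from above; the 50 GB/s system-RAM default is an assumption (roughly dual-channel DDR4-3200), and real throughput will land below this ceiling.

```python
def max_tokens_per_second(gpu_weights_gb: float, cpu_weights_gb: float,
                          gpu_bw_gbps: float = 360.0,
                          ram_bw_gbps: float = 50.0) -> float:
    """Bandwidth-only upper bound on generation speed.

    Assumes each weight is read exactly once per generated token and
    ignores compute, KV cache reads, and transfer overhead.
    ram_bw_gbps is an assumed system-memory bandwidth.
    """
    seconds_per_token = gpu_weights_gb / gpu_bw_gbps + cpu_weights_gb / ram_bw_gbps
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    # An ~18.5 GB model with 11 GB resident on the GPU and 7.5 GB in system RAM.
    print(f"split:   {max_tokens_per_second(11.0, 7.5):.1f} tok/s ceiling")
    # The same 11 GB on the GPU with nothing offloaded, for comparison.
    print(f"gpu-only: {max_tokens_per_second(11.0, 0.0):.1f} tok/s ceiling")
```

The offloaded term dominates quickly because system RAM is several times slower than GDDR6, which is why a modest offload costs far more than its share of the model size.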

Context-length impact: the KV cache eats your remaining VRAM

The KV cache grows linearly with context length and is entirely separate from the weights. On a card already near full from a quantized 31B, this is the difference between "it runs" and "it runs out of memory."

| Context | Approx KV cache (31B-class) | Effect on a near-full 12GB card |
| --- | --- | --- |
| 8K | ~1.5-2 GB | Manageable |
| 16K | ~3-4 GB | Forces more layers off-GPU |
| 32K | ~6-8 GB | Heavy offload; throughput drops |

Keep context modest when running large models on limited VRAM, or accept that long context will push more layers to the CPU and slow generation. For most Gemma 4 chat and coding tasks, 8K-16K is plenty and keeps you faster.
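The linear growth follows from the standard KV cache size formula: two tensors (keys and values) per layer, each storing one vector per KV head per token. The sketch below applies that formula with an fp16 cache and full attention in every layer. The architecture values in `ARCH` are hypothetical, chosen only to show the scaling; they are not Gemma 4's published configuration.

```python
def kv_cache_gib(ctx_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB.

    The leading factor 2 covers keys and values; bytes_per_value=2 means fp16.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_value
    return total_bytes / 1024**3


# Hypothetical architecture for illustration; NOT Gemma 4's real configuration.
ARCH = {"n_layers": 48, "n_kv_heads": 8, "head_dim": 128}

if __name__ == "__main__":
    for ctx in (8192, 16384, 32768):
        print(f"{ctx:6d} tokens: {kv_cache_gib(ctx, **ARCH):.1f} GiB")
```

Doubling the context doubles the cache, so every step up in context length comes straight out of the VRAM you would otherwise spend on resident layers.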

Is two RTX 3060 12GB cards a better Gemma 4 box than one bigger GPU?

This is the upgrade that changes everything for 31B-class models. Two 3060s give you 24GB of fast GDDR6 across the pair, enough to hold Gemma 4 31B at q4_K_M split across both cards with room for a healthy context — no system-RAM offload, full GDDR6 bandwidth on every layer. For sustained 31B work, dual 3060s frequently beat a single larger card on both price and tokens per second, and they reuse a part you may already own. We document a concrete two-card build in our dual RTX 3060 12GB local-LLM build, and cover the runtime side in our llama.cpp on the RTX 3060 12GB guide.
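The 24GB claim can be sanity-checked with the same arithmetic used for a single card. This is a minimal sketch assuming the ~18.5GB q4_K_M weight figure from earlier, an 8K-class KV cache of about 2GB, and a per-card runtime overhead of 1GB, which is a guess rather than a measurement.

```python
def vram_headroom_gb(weights_gb: float, kv_cache_gb: float, n_gpus: int = 2,
                     vram_per_gpu_gb: float = 12.0,
                     overhead_per_gpu_gb: float = 1.0) -> float:
    """Total VRAM left over after weights, KV cache, and per-card overhead.

    A negative result means the model must spill to system RAM.
    overhead_per_gpu_gb is an assumed value.
    """
    total_vram_gb = n_gpus * vram_per_gpu_gb
    used_gb = weights_gb + kv_cache_gb + n_gpus * overhead_per_gpu_gb
    return total_vram_gb - used_gb


if __name__ == "__main__":
    print(f"one 3060:  {vram_headroom_gb(18.5, 2.0, n_gpus=1):+.1f} GB")
    print(f"two 3060s: {vram_headroom_gb(18.5, 2.0, n_gpus=2):+.1f} GB")
```

This treats the pair as one pool, which is only an approximation: the runtime still has to place whole layers on each card. In llama.cpp the split between cards is controlled with the `--tensor-split` option.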

Perf-per-dollar and perf-per-watt vs the next step up

A single RTX 3060 12GB is the cheapest sane entry to local LLMs, and for 7-13B models nothing touches its value. For 31B, the math gets more interesting: a second 3060 roughly doubles your cost and power (each card ~170W) but removes the offload penalty entirely, often delivering more than double the 31B throughput. Compared with a single 16GB+ card, dual 3060s usually win on raw cost per usable token at this model size, at the expense of needing two PCIe slots, a bigger PSU (think 750W for the pair), and a case with airflow for two cards.

Real-world numbers: what to expect in tokens per second

Exact throughput depends on your quant, how many layers stay on the GPU, your system RAM speed, and the runtime, but the shape of the numbers is consistent across community reports and worth internalizing before you buy. A small model that fits entirely in 12GB — say an 8B at q4_K_M — generates briskly on a 3060, comfortably in the tens of tokens per second, fast enough to feel interactive. The moment you load a 31B that must offload, that figure collapses.

| Scenario | Rough throughput band | How it feels |
| --- | --- | --- |
| 8B q4_K_M, fully on-GPU | Tens of tok/s | Snappy, interactive |
| 31B q3_K_M, light offload | Single-digit to low-teens tok/s | Usable for chat, slight wait |
| 31B q4_K_M, moderate offload | Low single-digit tok/s | Fine for non-interactive work |
| 31B q4_K_M, heavy offload | Around reading speed or below | Batch jobs only |

The lesson is that a 31B on a 12GB card is best treated as a "thinking" model you queue work to, not a snappy assistant you chat with in real time. If you need instant responses, a 13B-or-smaller model on the same card is the better daily driver, and you bring out the 31B for harder problems where you can tolerate the wait.

Common pitfalls running Gemma 4 31B on 12GB

  1. Picking the quant before checking the math. People download a q4_K_M 31B, watch it crawl, and blame the card. Decide your quant against the 12GB ceiling first — q3_K_M is usually the right on-card choice.
  2. Forgetting the KV cache. A model that "just fits" at 8K context will OOM at 32K because the cache grows separately. Budget VRAM for context, not just weights.
  3. Reaching for vLLM on a single 12GB card. vLLM is excellent for serving models that fit; it is the wrong tool when you must offload. Use llama.cpp or Ollama instead.
  4. Slow system RAM. Offloaded layers run at system-memory speed, so DDR4-2133 versus DDR4-3600 is a visible difference. If you offload, faster RAM helps.
  5. Maxing context "just in case." Long context you do not use still costs throughput. Set context to what your prompts actually need.
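The RAM-speed pitfall is easy to quantify. Theoretical peak DRAM bandwidth is the transfer rate times 8 bytes per 64-bit channel times the number of channels. The sketch below assumes a dual-channel desktop board; sustained bandwidth in practice is lower than these peaks.

```python
def dram_bandwidth_gbps(mega_transfers_per_s: float, channels: int = 2,
                        bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    A standard 64-bit memory channel moves 8 bytes per transfer;
    channels=2 assumes a dual-channel configuration.
    """
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


if __name__ == "__main__":
    slow = dram_bandwidth_gbps(2133)   # DDR4-2133
    fast = dram_bandwidth_gbps(3600)   # DDR4-3600
    print(f"DDR4-2133: {slow:.1f} GB/s, DDR4-3600: {fast:.1f} GB/s, ratio {fast / slow:.2f}x")
```

On paper that is roughly a 1.7x gap, and since offloaded layers generate at system-memory speed, most of it shows up directly in tokens per second for the offloaded portion.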

When NOT to run Gemma 4 31B on a 3060

If your work is daily, latency-sensitive 31B inference — interactive coding assistance, real-time chat, anything where you sit and wait on every response — a single RTX 3060 12GB is the wrong tool, and no amount of quant tuning fixes the fundamental VRAM shortfall. In that case, do not fight 12GB: add a second 3060 for 24GB, or move to a 16GB-plus card. The 3060 is a superb value for models up to ~14B; it is a compromise, not a comfortable home, for a 31B.

Bottom line: who should run Gemma 4 31B on a 3060

Run Gemma 4 31B on a single RTX 3060 12GB if it is an occasional tool and your daily models are 7-13B — q3_K_M or a partially offloaded q4_K_M is workable, and the value is unbeatable. The card pairs cleanly with a Ryzen 7 5700X or Ryzen 7 5800X host, and the MSI RTX 3060 Ventus 2X 12G is our default pick at this tier.

Size up — to two 3060s for 24GB, or a single 16GB-plus card — if Gemma 4 31B is your everyday workload and you dislike offloading. The pain of fighting 12GB daily is real, and the step-up removes it. For the broader build, see our best CPU for a local-LLM homelab, our CUDA 13.3 RTX 3060 inference notes, and the Qwen3.6 27B agentic-coding deep dive for how a similarly sized model behaves on the same hardware.

Frequently asked questions

Will Gemma 4 31B-IT fit entirely in 12GB of VRAM?
Not at full precision, and not at q4_K_M either. A 31B model at q4_K_M needs roughly 18-19GB for weights alone, so on a 12GB RTX 3060 you must drop to a more aggressive quant like q3 or q2, offload a portion of the layers to system RAM, or both. Offloading works but drops throughput sharply, so plan your quant level around the 12GB ceiling first.
Which quantization gives the best quality-per-VRAM on a 3060?
For a 31B model on 12GB, q3_K_M or a partially offloaded q4_K_M are the practical sweet spots. Public community measurements show q4 retains most reasoning quality while q2 visibly degrades coding and math. The right choice depends on whether you value speed (smaller quant, more layers on GPU) or fidelity (larger quant, more offload), so test both on your own prompts.
Is vLLM or llama.cpp better for Gemma 4 on a single 12GB card?
llama.cpp and Ollama are generally friendlier on a single consumer 12GB GPU because they handle CPU offload of overflow layers gracefully. vLLM excels at batched serving and higher throughput but expects the model to fit in VRAM, so on a 12GB card it is better suited to smaller Gemma 4 variants than the full 31B. Match the runtime to whether you offload.
How much does context length cut into my usable VRAM?
The KV cache grows linearly with context length and is separate from the weights. On a 12GB card already near full from a quantized 31B model, jumping from 8K to 32K context can consume several additional gigabytes, forcing more layers off the GPU and lowering tokens per second. Keep context modest, or accept the throughput hit, when running large models on limited VRAM.
Should I just buy a bigger GPU instead of fighting 12GB?
If you run 31B-class models daily and dislike offloading, a card with 16GB or more removes most of the pain and is worth the step-up. Keep in mind that q4_K_M weights alone are roughly 18-19GB, so a 16GB card still means a smaller quant or a light offload, while 24GB holds q4_K_M outright. But if Gemma 4 31B is an occasional workload and your bread-and-butter is 7-13B models, the RTX 3060 12GB remains an excellent value and the offload penalty is tolerable for intermittent use. Match the purchase to your daily model size.

— SpecPicks Editorial · Last verified 2026-06-06