Gemma 4 31B-IT on a 12GB RTX 3060: What Fits, What Offloads, How Fast

Item: MSI GeForce RTX 3060 Ventus 2X 12G Gaming Graphics Card - RTX 3060
Author: Mike Perry

The VRAM math, the right quant, and the offload penalty for running a 31B instruction model on a budget card.

By Mike Perry · Published 2026-05-28 · Last verified 2026-07-22 · 10 min read

Gemma 4 31B-IT needs ~18-19GB at q4_K_M, so on a 12GB RTX 3060 you drop to q3_K_M, offload layers to system RAM, or both. Here's what fits, what spills, and how fast.

Not entirely. Gemma 4 31B-IT does not fit in 12GB at q4_K_M: the weights alone need roughly 18-19GB. On an RTX 3060 12GB you drop to a more aggressive quant like q3_K_M, offload part of the model to system RAM, or most often both, because q3_K_M weights still come to roughly 15.5GB. It runs, and it can be genuinely useful, but plan your quant level around the 12GB ceiling first.

A 31B instruction model on a budget card

Gemma 4 is the kind of release that sends every local-LLM tinkerer back to their hardware spreadsheet. A 31B instruction-tuned model promises noticeably stronger reasoning, coding, and instruction-following than the 7B-to-14B models most budget rigs run comfortably — and the community immediately started asking the obvious question: does it fit on the most popular budget inference card, the RTX 3060 12GB?

The honest answer is "with compromises." 12GB was generous for the 8B-to-13B era, but a 31B model is a different weight class. At the q4_K_M quantization most people consider the quality sweet spot, the weights are larger than the card's entire VRAM. You are not going to load this model purely on-GPU and walk away. You will be choosing between a smaller quant that keeps more of the model on the card, or a larger quant with a heavier offload that keeps quality high but cuts throughput.

That does not make the 3060 a bad choice — it makes it a card with a clear ceiling you can plan around. If Gemma 4 31B is an occasional "bring out the big model" tool and your daily driver is a 7-13B model, the 3060 handles both, and the offload penalty on the rare 31B session is tolerable. If 31B is your everyday workload, this guide will also tell you honestly when it is time to size up. Either way, you will leave knowing exactly what fits, what spills, and how fast it goes.

Key takeaways

Gemma 4 31B-IT does not fit in 12GB at q4_K_M: weights need ~18-19GB, so expect partial offload on a 3060 even at q3_K_M (~15.5GB).
q3_K_M is the practical on-card sweet spot; q4_K_M with offload preserves quality but drops tokens per second sharply.
llama.cpp and Ollama beat vLLM on a single 12GB card because they handle CPU offload of overflow layers gracefully.
Context length costs VRAM separately from weights — jumping 8K → 32K can push several more layers off the GPU.
Two 3060s (24GB) hold a 31B model cleanly and are often a better Gemma 4 box than one bigger single GPU.

Does Gemma 4 31B fit in 12GB at all? The VRAM math

Start from the rule of thumb that q4_K_M weights occupy roughly 0.6GB per billion parameters. A 31B model therefore needs about 18-19GB just for weights, before you account for the KV cache and runtime overhead. That is well past 12GB.

Quant	GB per 1B params	31B weights (approx)	Fits in 12GB on-card?
q2_K	~0.40	~12.5 GB	Marginal: weights alone reach the limit, no context headroom
q3_K_M	~0.50	~15.5 GB	No — partial offload needed
q4_K_M	~0.60	~18.5 GB	No — significant offload
q5_K_M	~0.70	~21.5 GB	No
q8_0	~1.06	~33 GB	No

Even q2_K puts the weights right at the 12GB limit with no room left for context, and q2 visibly degrades coding and math. The realistic option is q3_K_M with roughly a quarter of its layers offloaded (about 3.5GB of the ~15.5GB does not fit, before counting the KV cache), or q4_K_M with a larger offload if you prioritize quality over speed.
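The arithmetic behind the table can be sketched in a few lines. This is a rough estimator built only on the rule-of-thumb GB-per-billion figures quoted in this section, not on measured GGUF file sizes; the function names are illustrative.

```python
# Rule-of-thumb weight sizes, using the approximate GB-per-billion-parameter
# figures from the table in this section (not measured file sizes).
GB_PER_BILLION = {
    "q2_K": 0.40,
    "q3_K_M": 0.50,
    "q4_K_M": 0.60,
    "q5_K_M": 0.70,
    "q8_0": 1.06,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate weight footprint in GB for a given quant."""
    return params_billion * GB_PER_BILLION[quant]


def overflow_gb(params_billion: float, quant: str, vram_gb: float = 12.0) -> float:
    """GB of weights that cannot stay on the card (0.0 if everything fits)."""
    return max(0.0, weights_gb(params_billion, quant) - vram_gb)


if __name__ == "__main__":
    for quant in GB_PER_BILLION:
        size = weights_gb(31, quant)
        spill = overflow_gb(31, quant)
        print(f"{quant:7s} ~{size:5.1f} GB weights, ~{spill:4.1f} GB over a 12 GB card")
```

The overflow figure counts weights only. KV cache and runtime buffers come on top, so the real amount pushed to system RAM will be larger.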

Which quantization should you pick?

For a 31B model on a 12GB card, the choice is between speed (smaller quant, more layers resident on the GPU) and fidelity (larger quant, more layers offloaded to system RAM). Public community measurements consistently show q4 retains most reasoning quality, q3 is a reasonable compromise, and q2 starts to show cracks on structured tasks like code generation.

Quant	Quality	On a 3060 12GB	Recommendation
q2_K	Degraded on code/math	Mostly on-card	Only if you need speed over accuracy
q3_K_M	Good	Roughly a quarter of layers offloaded	Best all-round on-card pick
q4_K_M	Near-reference	Significant offload	Best quality if you tolerate lower tok/s
q5_K_M+	Negligible loss	Heavy offload	Not worth it on 12GB

Test both q3_K_M and a partially offloaded q4_K_M on your own prompts. If your work is conversational, q3_K_M's speed usually wins. If you are doing careful code or analysis, the q4_K_M quality edge can justify the slower pace.

How much do you offload, and what does it cost in tokens per second?

Offloading is the lever that makes a too-big model run, and it is also the thing that slows it down. Every layer you leave in system RAM is computed on the CPU at system-memory bandwidth instead of on the GPU, with the intermediate activations handed across the PCIe bus at the boundary. The more you offload, the closer your throughput drifts toward CPU-only speeds.

Configuration	Approx offload	Relative throughput	Experience
q3_K_M, mostly on-GPU	Light	Fastest on 12GB	Snappy enough for chat
q4_K_M, ~30-40% offloaded	Moderate	Noticeably slower	Usable for non-interactive work
q4_K_M, heavy offload	Heavy	Slow	Batch/overnight jobs only
q2_K, almost entirely on-GPU	Minimal	Fast but lower quality	Speed-first compromise

The takeaway: keep as much of the model on the GPU as your chosen quant allows, and accept that a 31B on 12GB will never feel like a 7B on the same card. Offload is a tool for "it runs at all," not "it runs fast."
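To turn "keep as much on the GPU as you can" into a number, a rough layer budget helps. The sketch below assumes all layers are the same size and that a fixed slice of VRAM is reserved for KV cache and buffers. The layer count of 60 and the 2GB reserve are hypothetical placeholders, not Gemma 4's published configuration. The result is the kind of value you would pass to llama.cpp's `--n-gpu-layers` (`-ngl`) option as a starting point, then adjust by watching actual VRAM use.

```python
import math


def gpu_layers_that_fit(weights_gb: float, n_layers: int,
                        vram_gb: float = 12.0, reserve_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in VRAM.

    reserve_gb is a hypothetical allowance for KV cache, compute buffers
    and anything else using the card (e.g. the desktop).
    """
    per_layer_gb = weights_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    N_LAYERS = 60  # hypothetical layer count, for illustration only
    for quant, size_gb in [("q3_K_M", 15.5), ("q4_K_M", 18.6)]:
        on_gpu = gpu_layers_that_fit(size_gb, N_LAYERS)
        print(f"{quant}: ~{on_gpu} layers on GPU, ~{N_LAYERS - on_gpu} on CPU")
```

Read the real layer count from your model's metadata before relying on the output; the estimate is only as good as that number and the reserve you choose.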

vLLM vs llama.cpp vs Ollama on a single 12GB card

Runtime choice matters more than usual when you are over the VRAM line. vLLM is superb for batched, high-throughput serving — but it expects the model to fit in VRAM. On a 12GB card, vLLM is better suited to smaller Gemma 4 variants than to the full 31B, because it does not gracefully spill overflow layers to system RAM the way a hobbyist single-GPU setup needs.

llama.cpp and Ollama (which wraps llama.cpp) are the friendlier choice here precisely because CPU offload of overflow layers is a first-class feature. You tell them how many layers to keep on the GPU, and they handle the rest on the CPU. For a single consumer 12GB card running a model that does not fit, that is exactly the behavior you want. Match the runtime to whether you offload: vLLM if everything fits, llama.cpp/Ollama if it does not.

Prefill vs generation throughput on the RTX 3060

The 3060 has two different speed stories. Prefill — chewing through your prompt — is compute-bound and the 3060's 3,584 CUDA cores handle it acceptably. Generation — emitting tokens one at a time — is memory-bandwidth-bound, and here the 3060's 360 GB/s GDDR6 is the asset that makes it worth using over CPU-only inference for the layers that stay resident. The moment layers spill to system RAM, those layers generate at DDR4/DDR5 bandwidth, which is why offload hurts generation so much more than it hurts prefill.
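The bandwidth argument can be made concrete with a simple ceiling estimate. It assumes each generated token streams every weight once, uses the 3060's 360 GB/s figure from above, and takes 51.2 GB/s as the theoretical peak of dual-channel DDR4-3200. That RAM figure is an assumption about the host, and real throughput will land below this bound.

```python
def decode_tokens_per_second(gpu_weights_gb: float, cpu_weights_gb: float,
                             gpu_bw_gbs: float = 360.0,
                             cpu_bw_gbs: float = 51.2) -> float:
    """Upper bound on generation speed when decoding is bandwidth-bound.

    Each token is assumed to read all GPU-resident weights at GPU memory
    bandwidth and all CPU-resident weights at system-RAM bandwidth.
    """
    if gpu_weights_gb + cpu_weights_gb <= 0:
        raise ValueError("model must have a non-zero weight footprint")
    seconds_per_token = gpu_weights_gb / gpu_bw_gbs + cpu_weights_gb / cpu_bw_gbs
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    # Illustrative splits: 10 GB of weights kept on the card, the rest in RAM.
    for label, on_gpu, on_cpu in [("q3_K_M", 10.0, 5.5), ("q4_K_M", 10.0, 8.6)]:
        bound = decode_tokens_per_second(on_gpu, on_cpu)
        print(f"{label}: at most ~{bound:.1f} tok/s")
```

Because system RAM is several times slower than GDDR6, the CPU-resident term dominates the time per token even when it covers the smaller share of the weights.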

Context-length impact: the KV cache eats your remaining VRAM

The KV cache grows linearly with context length and is entirely separate from the weights. On a card already near full from a quantized 31B, this is the difference between "it runs" and "it runs out of memory."

Context	Approx KV cache (31B-class)	Effect on a near-full 12GB card
8K	~1.5-2 GB	Manageable
16K	~3-4 GB	Forces more layers off-GPU
32K	~6-8 GB	Heavy offload; throughput drops

Keep context modest when running large models on limited VRAM, or accept that long context will push more layers to the CPU and slow generation. For most Gemma 4 chat and coding tasks, 8K-16K is plenty and keeps you faster.
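The linear growth is easy to see from the usual KV-cache size formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The architecture numbers below (60 layers, 8 KV heads, head dimension 128, 16-bit cache) are hypothetical stand-ins for a 31B-class model, not Gemma 4's actual configuration, and the sketch assumes every layer caches the full context. Models that use sliding-window attention on some layers will need less.

```python
def kv_cache_gib(n_ctx: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB for a full-attention model.

    The factor of 2 covers keys and values; bytes_per_value=2 means a
    16-bit cache.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical 31B-class shape, for illustration only.
    shape = dict(n_layers=60, n_kv_heads=8, head_dim=128)
    for n_ctx in (8192, 16384, 32768):
        print(f"{n_ctx:6d} tokens -> ~{kv_cache_gib(n_ctx, **shape):.2f} GiB")
```

Doubling the context doubles the cache, which is why a configuration that is comfortable at 8K can run out of memory at 32K without any change to the weights.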

Are two RTX 3060 12GB cards a better Gemma 4 box than one bigger GPU?

This is the upgrade that changes everything for 31B-class models. Two 3060s give you 24GB of fast GDDR6 across the pair, enough to hold Gemma 4 31B at q4_K_M (~18.5GB) split across both cards with roughly 5GB left for KV cache and runtime buffers: no system-RAM offload, full GDDR6 bandwidth on every layer. For sustained 31B work, dual 3060s frequently beat a single larger card on both price and tokens per second, and they reuse a part you may already own. We document a concrete two-card build in our dual RTX 3060 12GB local-LLM build, and cover the runtime side in our llama.cpp on the RTX 3060 12GB guide.

Perf-per-dollar and perf-per-watt vs the next step up

A single RTX 3060 12GB is the cheapest sane entry to local LLMs, and for 7-13B models nothing touches its value. For 31B, the math gets more interesting: a second 3060 roughly doubles your cost and power (each card ~170W) but removes the offload penalty entirely, often delivering more than double the 31B throughput. Compared with a single 16GB+ card, dual 3060s usually win on raw cost per usable token at this model size, at the expense of needing two PCIe slots, a bigger PSU (think 750W for the pair), and a case with airflow for two cards.

Real-world numbers: what to expect in tokens per second

Exact throughput depends on your quant, how many layers stay on the GPU, your system RAM speed, and the runtime, but the shape of the numbers is consistent across community reports and worth internalizing before you buy. A small model that fits entirely in 12GB — say an 8B at q4_K_M — generates briskly on a 3060, comfortably in the tens of tokens per second, fast enough to feel interactive. The moment you load a 31B that must offload, that figure collapses.

Scenario	Rough throughput band	How it feels
8B q4_K_M, fully on-GPU	Tens of tok/s	Snappy, interactive
31B q3_K_M, light offload	Single-digit to low-teens tok/s	Usable for chat, slight wait
31B q4_K_M, moderate offload	Low single-digit tok/s	Fine for non-interactive work
31B q4_K_M, heavy offload	Around reading speed or below	Batch jobs only

The lesson is that a 31B on a 12GB card is best treated as a "thinking" model you queue work to, not a snappy assistant you chat with in real time. If you need instant responses, a 13B-or-smaller model on the same card is the better daily driver, and you bring out the 31B for harder problems where you can tolerate the wait.
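Since the bands above are deliberately rough, it is worth measuring your own setup. The harness below is a minimal sketch: `generate` is a hypothetical callable you supply that runs one completion through whichever runtime you use and returns the number of tokens it produced. The timing includes prompt processing, so the result is end-to-end throughput rather than pure generation speed.

```python
import time
from typing import Callable, Iterable


def measure_tokens_per_second(generate: Callable[[str], int],
                              prompts: Iterable[str]) -> float:
    """Run each prompt once and return generated tokens per wall-clock second.

    generate(prompt) is user-supplied (hypothetical here): it must run one
    completion and return how many tokens were generated.
    """
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += generate(prompt)
    elapsed = time.perf_counter() - start
    if total_tokens == 0:
        raise ValueError("no tokens were generated; nothing to measure")
    return total_tokens / elapsed


if __name__ == "__main__":
    # Stand-in generator so the sketch runs without a model loaded.
    def fake_generate(prompt: str) -> int:
        time.sleep(0.05)  # pretend the model is working
        return 20         # pretend it emitted 20 tokens

    rate = measure_tokens_per_second(fake_generate, ["hello", "world"])
    print(f"~{rate:.0f} tok/s (synthetic)")
```

Run the same prompt set against q3_K_M and a partially offloaded q4_K_M and compare the two figures alongside the quality of the answers.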

Common pitfalls running Gemma 4 31B on 12GB

Picking the quant before checking the math. People download a q4_K_M 31B, watch it crawl, and blame the card. Decide your quant against the 12GB ceiling first — q3_K_M is usually the right on-card choice.
Forgetting the KV cache. A model that "just fits" at 8K context will OOM at 32K because the cache grows separately. Budget VRAM for context, not just weights.
Reaching for vLLM on a single 12GB card. vLLM is excellent for serving models that fit; it is the wrong tool when you must offload. Use llama.cpp or Ollama instead.
Slow system RAM. Offloaded layers run at system-memory speed, so DDR4-2133 versus DDR4-3600 is a visible difference. If you offload, faster RAM helps.
Maxing context "just in case." Long context you do not use still costs throughput. Set context to what your prompts actually need.

When NOT to run Gemma 4 31B on a 3060

If your work is daily, latency-sensitive 31B inference — interactive coding assistance, real-time chat, anything where you sit and wait on every response — a single RTX 3060 12GB is the wrong tool, and no amount of quant tuning fixes the fundamental VRAM shortfall. In that case, do not fight 12GB: add a second 3060 for 24GB, or move to a 16GB-plus card. The 3060 is a superb value for models up to ~14B; it is a compromise, not a comfortable home, for a 31B.

Bottom line: who should run Gemma 4 31B on a 3060

Run Gemma 4 31B on a single RTX 3060 12GB if it is an occasional tool and your daily models are 7-13B — q3_K_M or a partially offloaded q4_K_M is workable, and the value is unbeatable. The card pairs cleanly with a Ryzen 7 5700X or Ryzen 7 5800X host, and the MSI RTX 3060 Ventus 2X 12G is our default pick at this tier.

Size up — to two 3060s for 24GB, or a single 16GB-plus card — if Gemma 4 31B is your everyday workload and you dislike offloading. The pain of fighting 12GB daily is real, and the step-up removes it. For the broader build, see our best CPU for a local-LLM homelab, our CUDA 13.3 RTX 3060 inference notes, and the Qwen3.6 27B agentic-coding deep dive for how a similarly sized model behaves on the same hardware.


SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.


Frequently asked questions

Will Gemma 4 31B-IT fit entirely in 12GB of VRAM?

Not at full precision, and not at q4_K_M either. A 31B model at q4_K_M needs roughly 18-19GB for weights alone, so on a 12GB RTX 3060 you must drop to a more aggressive quant like q3 or q2, offload a portion of the layers to system RAM, or combine the two. By the rule-of-thumb sizes in this guide, only q2 comes close to fitting without offload, and it leaves no room for context. Offloading works but drops throughput sharply, so plan your quant level around the 12GB ceiling first.

Which quantization gives the best quality-per-VRAM on a 3060?

For a 31B model on 12GB, q3_K_M or a partially offloaded q4_K_M are the practical sweet spots. Public community measurements show q4 retains most reasoning quality while q2 visibly degrades coding and math. The right choice depends on whether you value speed (smaller quant, more layers on GPU) or fidelity (larger quant, more offload), so test both on your own prompts.

Is vLLM or llama.cpp better for Gemma 4 on a single 12GB card?

llama.cpp and Ollama are generally friendlier on a single consumer 12GB GPU because they handle CPU offload of overflow layers gracefully. vLLM excels at batched serving and higher throughput but expects the model to fit in VRAM, so on a 12GB card it is better suited to smaller Gemma 4 variants than the full 31B. Match the runtime to whether you offload.

How much does context length cut into my usable VRAM?

The KV cache grows linearly with context length and is separate from the weights. On a 12GB card already near full from a quantized 31B model, jumping from 8K to 32K context can consume several additional gigabytes, forcing more layers off the GPU and lowering tokens per second. Keep context modest, or accept the throughput hit, when running large models on limited VRAM.

Should I just buy a bigger GPU instead of fighting 12GB?

If you run 31B-class models daily and dislike offloading, a card with 16GB or more removes most of the pain and is worth the step-up. But if Gemma 4 31B is an occasional workload and your bread-and-butter is 7-13B models, the RTX 3060 12GB remains an excellent value and the offload penalty is tolerable for intermittent use. Match the purchase to your daily model size.
