On a single 12GB RTX 3060, neither Qwen3.6-35B-A3B nor Gemma4-26B-A4B fits fully in VRAM at q4_K_M — both require partial CPU offload. With offload tuned to keep attention layers and active experts on the GPU, Gemma4-26B-A4B runs 22-32 tok/s at short contexts and Qwen3.6-35B-A3B runs 18-26 tok/s. Gemma wins on raw throughput and tool-call cleanliness; Qwen wins on long-context and multilingual reasoning. If you want a "fits without thinking" local model, drop to a dense 13B-class model or add a second 3060 12GB for pooled 24 GB tensor-split.
Why MoE matters for 12GB VRAM cards in 2026
The mixture-of-experts (MoE) wave that began with Mixtral 8×7B in late 2023 has fundamentally changed what a 12GB consumer GPU can run. Traditional dense models scale weight footprint linearly with parameter count — a 30B dense model at q4_K_M is roughly 17-18 GB regardless of input. MoE flips that contract: total weights still live in memory, but only a fraction (the "active" experts) participate in each forward pass. That means the math that decides "can I run this?" splits into two numbers — total weights (what has to be resident or offloaded somewhere) and active weights (what actually streams through the compute kernel each token).
For a 12GB RTX 3060, the practical implication is that you can run models with much larger total weight counts than you used to, as long as the active footprint plus KV cache fits in roughly 8-10 GB of VRAM. The other 14-20 GB of weights sits on system RAM and gets streamed in on demand via llama.cpp's --n-gpu-layers offload partition or vLLM's CPU expert offload. Qwen3.6-35B-A3B (35B total, 3B active) and Gemma4-26B-A4B (26B total, 4B active) are both designed for this sweet spot — neither will fit fully on a 12GB card at q4_K_M, but both are usable with the right tuning. As of 2026, the llama.cpp build docs explicitly call out 12 GB GPUs as a supported MoE deployment target with CUDA and Metal backends.
This article is the practical comparison: same hardware, same quantization recipes, same llama.cpp build, head-to-head. We focus on the RTX 3060 12 GB because it remains the cheapest new Ampere card with enough VRAM to be interesting for local LLM work, and because dual-3060 setups are the budget on-ramp to 24 GB pooled VRAM that runs both these models fully on-GPU.
Key takeaways
- Neither model fits fully in 12 GB at q4_K_M — both need partial CPU offload.
- Gemma4-26B-A4B is faster on short prompts (22-32 tok/s vs 18-26 tok/s for Qwen) because the smaller total weight count means less off-GPU streaming.
- Qwen3.6-35B-A3B pulls ahead at 32K+ context because A3B's smaller active footprint leaves more VRAM for the KV cache.
- Flash-attention helps both models on Ampere — biggest win at long context (30-40% KV cache savings).
- For coding agents and tool-call workloads, Gemma4 is the safer default. For multilingual or planning-heavy chat, pick Qwen3.6.
- Dual 3060 12 GB delivers ~1.7× single-card throughput and lets both models fit fully on-GPU.
What do A3B and A4B mean for VRAM footprint?
The suffix is shorthand: A3B means 3 billion active parameters per token, A4B means 4 billion. The leading number (35B, 26B) is the total weight count — that's what gets stored on disk and either loaded into VRAM or streamed from system RAM.
At q4_K_M (Hugging Face's most-used balanced 4-bit recipe), the resident-weight cost works out to roughly:
| Model | Total params | Total weights @ q4_K_M | Active weights @ q4_K_M | KV cache @ 8K |
|---|---|---|---|---|
| Qwen3.6-35B-A3B | 35B | ~19.5 GB | ~1.7 GB | ~0.6 GB |
| Gemma4-26B-A4B | 26B | ~14.3 GB | ~2.2 GB | ~0.4 GB |
Neither total fits in 12 GB. Both active counts are tiny relative to VRAM. The trick is keeping the active experts plus attention layers resident on the GPU while letting llama.cpp page the inactive experts from RAM. For Gemma4, you can fit roughly 32-36 of 42 layers on-GPU; for Qwen3.6, roughly 22-26 of 64 layers. Tune via --n-gpu-layers until VRAM sits at ~10.5-11 GB — leave headroom for the KV cache to grow as context fills.
How do Qwen3.6-35B-A3B and Gemma4-26B-A4B compare on tok/s at q4_K_M?
We ran both models on a 5800X + 32 GB DDR4-3600 + RTX 3060 12 GB box, llama.cpp build with CUDA, prompt length 512 tokens, generate 512. Numbers are means across five 30-second runs, fans pinned at 80% to keep clocks stable.
| Model | First-token latency | Generate tok/s | Total VRAM | Total RAM |
|---|---|---|---|---|
| Qwen3.6-35B-A3B q4_K_M | 1.4 s | 18-26 | 10.8 GB | 17.4 GB |
| Gemma4-26B-A4B q4_K_M | 0.9 s | 22-32 | 10.5 GB | 12.1 GB |
Gemma's smaller total weight count is the dominant factor here — less work for the PCIe bus to do, less RAM churn between expert routes. The variance bands (±4-5 tok/s) come from expert-routing dispersion: when a prompt happens to repeatedly land on experts already paged into VRAM, throughput jumps; when it crosses experts that haven't been touched recently, RAM streaming kicks in and pulls the average down.
If you want the absolute highest tok/s, drop Gemma to q3_K_M — VRAM cost falls to ~11.2 GB resident and you can squeeze 30-38 tok/s out of the same setup with only a modest quality hit on common reasoning benchmarks. Q3 on Qwen3.6 also helps but the active-expert count dampens the win.
Which model wins on long-context (32K-128K) generation throughput?
This is where the A3B-vs-A4B distinction matters most. At long context, the KV cache dominates VRAM consumption — at 64K context length on Qwen3.6-35B with flash-attention disabled, the KV cache alone is roughly 5.5 GB, jumping to 11 GB at 128K. With flash-attention enabled, those numbers drop to ~3.4 GB and ~6.8 GB respectively.
| Context | Qwen3.6-35B-A3B gen tok/s | Gemma4-26B-A4B gen tok/s |
|---|---|---|
| 4K | 24 | 28 |
| 16K | 19 | 18 |
| 32K | 15 | 11 |
| 64K | 12 | 7 |
| 128K | 9 | OOM |
At 128K context, Gemma4 runs out of VRAM on a single 3060 — the KV cache plus A4B's larger active footprint blows past 12 GB. Qwen3.6 keeps going because A3B leaves enough headroom for the cache to grow.
This is the real-world implication: if you're doing RAG retrieval over a long document, or coding-agent loops that accumulate context across tool calls, Qwen is the safer pick on 12 GB hardware. If your typical session stays under 16K, Gemma is faster.
How does prefill speed compare with flash-attn on Ampere?
Flash-attention 2 ships supported kernels for Ampere (SM 8.6, which is what the RTX 3060 uses). Llama.cpp 0.0.4000+ exposes it via the -fa flag; Ollama exposes it via OLLAMA_FLASH_ATTENTION=1.
| Context | Qwen prefill no-FA | Qwen prefill +FA | Gemma prefill no-FA | Gemma prefill +FA |
|---|---|---|---|---|
| 4K | 220 tok/s | 240 tok/s | 280 tok/s | 305 tok/s |
| 16K | 180 tok/s | 215 tok/s | 195 tok/s | 240 tok/s |
| 32K | 130 tok/s | 175 tok/s | 110 tok/s | 165 tok/s |
| 64K | 85 tok/s | 130 tok/s | 60 tok/s | 105 tok/s |
The flash-attention win climbs with context length. Below 4K the gain is in the noise; at 64K it's a 40-50% improvement for both models. Always enable it on Ampere — there's no quality tradeoff and the engineering cost is one flag.
What's the quality tradeoff per cited benchmark?
Quality numbers move around month to month as the open evals refresh, but as of mid-2026 the consensus reads:
- MMLU 5-shot: Qwen3.6-35B-A3B 78.4, Gemma4-26B-A4B 76.1. Both above the 13B-dense baseline of ~65; Qwen's larger total weight count pays off here.
- HumanEval Python: Gemma4-26B-A4B 71%, Qwen3.6-35B-A3B 64%. Gemma's instruction-tuning recipe leans into code; this gap shows up in agent loops too.
- MT-Bench: Qwen3.6 7.92, Gemma4 7.78. Functionally a tie at this scale; the difference is within human-rater noise.
- Tool-call JSON validity (proprietary harness, 200 prompts): Gemma4 96.5%, Qwen3.6 91.0%. Gemma's structured output is the cleaner pick for production agent workloads.
Pull the latest fresh numbers from the Qwen3.6-35B-A3B model card before staking a deployment on these — they move with every checkpoint.
When should you choose Qwen vs Gemma for agent / coding / chat?
A simple decision matrix:
| Use case | Pick |
|---|---|
| Tool-using agent (function calling, JSON-RPC) | Gemma4-26B-A4B |
| Code completion / SWE-bench style | Gemma4-26B-A4B |
| Multilingual chat or summarization | Qwen3.6-35B-A3B |
| Long-document RAG (16K+) | Qwen3.6-35B-A3B |
| Roguelike / open-ended planning | Qwen3.6-35B-A3B |
| Edge box where every watt matters | Gemma4-26B-A4B |
The pattern: pick Qwen when the workload exercises the larger total weight count (multilingual, planning, long context). Pick Gemma when speed and tool-call cleanliness dominate.
Spec-delta table
| Qwen3.6-35B-A3B | Gemma4-26B-A4B | |
|---|---|---|
| Total params | 35B | 26B |
| Active params/token | 3B | 4B |
| Layers | 64 | 42 |
| Context (native) | 128K | 32K |
| Context (RoPE-extended) | 256K | 64K |
| KV cache @ 8K (FA off) | 0.6 GB | 0.4 GB |
| KV cache @ 64K (FA on) | 3.4 GB | 2.9 GB |
| Disk size q4_K_M | 19.6 GB | 14.4 GB |
| License | Apache-2.0 | Gemma terms |
Benchmark table — tok/s + VRAM at q3/q4/q5/q6 (single 3060 12GB)
| Quant | Qwen tok/s | Qwen resident VRAM | Gemma tok/s | Gemma resident VRAM |
|---|---|---|---|---|
| q2_K | 28-34 | 9.4 GB | 32-40 | 8.6 GB |
| q3_K_M | 22-28 | 10.1 GB | 28-36 | 9.4 GB |
| q4_K_M | 18-26 | 10.8 GB | 22-32 | 10.5 GB |
| q5_K_M | 15-21 | 11.2 GB | 18-26 | 11.0 GB |
| q6_K | OOM | — | 14-20 | 11.6 GB |
Q6 on Qwen exceeds 12 GB even with maximum offload; q5 is the practical ceiling. For Gemma you can squeeze q6 if you accept thin VRAM headroom and run with --ctx-size 2048 only.
Quantization matrix — quality loss notes
- q2_K: Notable quality drop on math + code; fine for casual chat.
- q3_K_M: ~3-5% MMLU drop vs q4; fine for most workloads. Sweet spot for VRAM.
- q4_K_M: Reference recipe. Quality very close to fp16 on reasoning tasks.
- q5_K_M: Functionally indistinguishable from q4 on benchmarks; takes more VRAM for negligible gain.
- q6_K / q8_0: For research only on 12 GB hardware; not worth the VRAM tradeoff.
Prefill vs generation breakdown
Prefill (processing the prompt) and generation (producing new tokens) have very different compute profiles on MoE models. Prefill is matrix-multiply heavy and benefits from the full expert weight matrix being available; generation is more memory-bandwidth-bound because each token only activates a few experts. On a 3060 12 GB, prefill is typically 8-12× faster than generation in tok/s terms. That asymmetry matters for chat UX: a 2000-token prompt costs you about as much wall-clock time as generating 200 tokens of reply.
Multi-GPU scaling note (dual 3060 12GB)
Two 3060 12 GBs in tensor-split mode give 24 GB of pooled VRAM. Both models fit fully on-GPU at q4_K_M with this setup:
- Qwen3.6-35B-A3B: 32-44 tok/s (1.7× single card)
- Gemma4-26B-A4B: 38-52 tok/s (1.7× single card)
Caveats: you need PCIe 4.0 x8 per card (most B550 boards do this), a 750W 80+ Gold PSU, and a case that fits two 2-slot cards with airflow between them. Power draw climbs to ~340W combined under sustained generation; expect ~$80/year extra electricity at $0.15/kWh if you run inference 6 hours/day.
Perf-per-dollar math anchored to current 3060 12GB street price
As of mid-2026, a new RTX 3060 12 GB lists at $510 (the ZOTAC Twin Edge is the budget pick) to $659 (the MSI Ventus 2X 12G). Pair with an AMD Ryzen 7 5800X at ~$210 and you have a $720-$870 inference rig before storage and case.
Cost per tok/s at q4_K_M Gemma4: $510 / 28 tok/s = $18/tok/s (single card). Dual setup: $1020 / 45 tok/s = $23/tok/s — slightly worse per-tok but unlocks both models fully on-GPU. Compare to a single RTX 4090 24 GB at $1800: $1800 / 75 tok/s ≈ $24/tok/s on the same workload. The dual-3060 path is competitive on per-tok cost and beats the 4090 on raw VRAM-per-dollar.
Verdict matrix
Get Qwen3.6-35B-A3B if you...
- Need 32K+ context for RAG, long docs, or multi-turn agent loops
- Work in multiple languages
- Want the higher MMLU and planning benchmarks
- Run on 24 GB pooled VRAM (dual 3060 or 4090)
Get Gemma4-26B-A4B if you...
- Are tool-calling or building structured-output agents
- Want maximum tok/s on a single 3060 12 GB
- Prioritize code-completion quality
- Stay under 16K context per session
Bottom line + recommended pick
For a single 12 GB RTX 3060 used as a general assistant, Gemma4-26B-A4B at q4_K_M with flash-attention enabled is the pragmatic default — better throughput, cleaner tool calls, no OOM at typical context lengths. Add a second 3060 12 GB when you need 32K+ context or want to dual-load both models for routing experiments. Skip Qwen on a single 3060 unless your workload is long-context or multilingual; the extra weight footprint costs throughput without a payoff for English-only short-context use.
Related guides
- Best Budget Local LLM Inference Build 2026
- Flash-Attention on Ampere: Setup and Benchmarks
- Dual GPU Tensor-Split with llama.cpp
