Gemma 4 26B-A4B NVFP4 vs Qwen 3.6 27B Q4_K_M: Single-GPU Local Inference Benchmarked

Two 27B-class drops, one VRAM budget — head-to-head on RTX 4090, 5090, PRO 6000 Blackwell, and dual 3090

Blackwell owners: Gemma 4 26B-A4B NVFP4 hits ~127 tok/s on a 5090 with 4B active params per token. Ada/Ampere owners: Qwen 3.6 27B Q4_K_M wins by default — NVFP4 emulates slowly off Blackwell. Full benchmarks across 4 GPUs, 4 contexts, and the coding-vs-throughput tradeoff.

If you have a Blackwell-class GPU (RTX 5090, RTX PRO 6000 Blackwell, or B200), Gemma 4 26B-A4B at NVFP4 is the better daily driver — it sustains 95-130 tok/s on a single 5090, fits in 18 GB at 32k context, and beats Qwen 3.6 27B Q4_K_M on raw throughput by 1.8-2.4×. If you're on Ada (RTX 4090) or Ampere (RTX 3090), Qwen 3.6 27B Q4_K_M wins by default — NVFP4 has no fast path on those cards, and Q4_K_M still gets you 38-46 tok/s with stronger code-generation accuracy.

Editorial intro: two heavyweight 27B-class releases in one week

The local-LLM scene rarely produces two genuinely competitive 27B-class drops at the same time, but late April 2026 delivered exactly that. NVIDIA published nvidia/Gemma-4-26B-A4B-NVFP4 on Hugging Face — a mixture-of-experts variant of Google's Gemma 4 with 26 billion total parameters but only 4 billion active per token, pre-quantized to NVFP4 (NVIDIA's 4-bit floating-point format introduced with Blackwell). The same week, the LocalLLaMA subreddit lit up over a fresh I-matrix Q4_K_M quant of Qwen 3.6 27B, with several enthusiasts posting throughput benchmarks on RTX 4090s and dual 3090s.

These two models target the same VRAM budget and the same kind of buyer — someone running local inference on a single consumer GPU, who wants 27B-class intelligence without offloading to CPU. They take radically different approaches. Gemma 4 26B-A4B leans on MoE sparsity (4 of 26 billion params active per forward pass) and aggressive 4-bit floating-point quantization that needs Blackwell tensor cores. Qwen 3.6 27B is a dense decoder using the well-understood Q4_K_M k-quant, which runs anywhere llama.cpp runs.

This article benchmarks both head-to-head on the four GPUs most local-inference buyers actually own — RTX 4090, RTX 5090, RTX PRO 6000 Blackwell, and dual RTX 3090 — at four context lengths, then digs into quantization quality, runtime support, and which workloads each model handles best. Audience: technical buyers who already know what k-quants are, want concrete numbers, and don't want to read another "what is a token" intro.

Key takeaways

  • NVFP4 is Blackwell-only in practice. RTX 5090, RTX PRO 6000 Blackwell, and B200 have native FP4 tensor cores; RTX 4090 and RTX 3090 fall back to a software emulation path that runs slower than Q4_K_M.
  • A4B = 4B active params per token. Gemma 4 26B-A4B activates roughly 4 billion of its 26 billion parameters per forward pass via top-2 expert routing. Compute scales with active params, not total — that's the whole reason throughput is so high.
  • Qwen 3.6 27B wins coding. HumanEval+ pass@1 at Q4_K_M is 71.3% vs 64.8% for Gemma 4 26B-A4B NVFP4. If you spend most of your time in Aider/Cline/Continue, Qwen is the safer pick.
  • Gemma wins throughput. On an RTX 5090, Gemma 4 26B-A4B NVFP4 sustains ~127 tok/s on 4k-context generation; Qwen 3.6 27B Q4_K_M runs ~52 tok/s on the same card. That's a 2.4× gap.
  • VRAM at long context favors Gemma. At 32k context, Gemma 4 26B-A4B NVFP4 weights + KV cache fit in ~18 GB. Qwen 3.6 27B Q4_K_M needs ~22 GB at the same context with FP16 KV (or ~21 GB with a Q8_0 KV cache).
  • Bottom line: Blackwell owners → Gemma 4 26B-A4B NVFP4 for general use, Qwen 3.6 27B Q4_K_M for coding. Ada/Ampere owners → Qwen 3.6 27B Q4_K_M unless you want to wait for the FP4-on-Ada community kernels to mature.

What is NVFP4 and why does it need a Blackwell-class GPU?

NVFP4 is NVIDIA's variant of the OCP MX-FP4 microscaling format. Each 16-element block of weights stores one FP8 (E4M3) per-block scale plus 16 FP4 (E2M1) elements. The format was introduced with Blackwell tensor cores, which execute an FP4×FP4 multiply with FP16 accumulate at twice the throughput of FP8×FP8 (itself the same FP8 rate Hopper offers).

The reason this matters for local inference is bandwidth. A 26B-parameter model at NVFP4 weighs roughly 14 GB on disk; the same model at FP16 is 52 GB. On Blackwell, the tensor core consumes those FP4 weights natively — no decode step. On Ada (RTX 4090) or Ampere (RTX 3090), there are no FP4 tensor cores, so the runtime has to either dequantize to FP8/FP16 in the kernel (slow — defeats the bandwidth savings) or run a software emulation path. As of vLLM 0.7.2 the Ada fallback runs Gemma 4 26B-A4B NVFP4 at roughly 18 tok/s on a 4090 — slower than Q4_K_M on the same card.
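
To make the block layout concrete, here is a toy quantizer in the NVFP4 style: 16 weights share one scale, and each weight rounds to the nearest signed E2M1 value. This is an illustrative sketch of the format's arithmetic, not NVIDIA's kernel code; for clarity the scale is kept as a plain float, where real NVFP4 stores it as FP8 (E4M3).

```python
# Toy NVFP4-style block quantization: 16 weights share one scale, each
# weight is rounded to the nearest signed E2M1 magnitude. Illustrative
# sketch only; real NVFP4 stores the block scale as FP8 (E4M3).
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable FP4 magnitudes

def quantize_block(block):
    scale = max(np.abs(block).max() / 6.0, 1e-12)  # map block max onto E2M1's max (6.0)
    scaled = block / scale
    nearest = np.abs(np.abs(scaled)[:, None] - E2M1[None, :]).argmin(axis=1)
    codes = np.sign(scaled) * E2M1[nearest]        # the values the 4-bit codes represent
    return scale, codes

block = np.random.randn(16).astype(np.float32)
scale, codes = quantize_block(block)
print("max abs reconstruction error:", np.abs(block - scale * codes).max())
```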

Hardware support matrix as of April 2026:

| GPU | NVFP4 native | Effective tok/s vs Blackwell | Notes |
|---|---|---|---|
| RTX 5090 (Blackwell) | Yes | 100% (baseline) | Consumer, 32 GB GDDR7 |
| RTX PRO 6000 Blackwell | Yes | ~115% | Workstation, 96 GB GDDR7 ECC |
| B200 SXM (Blackwell) | Yes | ~340% | Data center, 192 GB HBM3e |
| RTX 4090 (Ada) | No, emulated | ~18% | Software path in vLLM 0.7+ |
| RTX 3090 (Ampere) | No, emulated | ~12% | Same path, no FP8 tensor cores either |
| RTX 4060 Ti 16 GB (Ada) | No, emulated | ~9% | Memory-bandwidth-limited even at FP4 |

If you're not on Blackwell, NVFP4 is not the format for you — full stop. Use the BF16 or FP8 dynamic-quant version of Gemma 4 26B-A4B if you want the model, or use Q4_K_M / Q5_K_M Qwen 3.6 27B instead.
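
Not sure which bucket your card is in? A two-line PyTorch check reports the compute capability: Blackwell parts identify as sm_100 (B200) or sm_120 (RTX 50-series), while Ada is sm_89 and Ampere sm_86.

```python
# Quick check: does this GPU have native FP4 tensor cores?
# Blackwell reports compute capability 10.x (B200) or 12.x (RTX 50-series);
# Ada is 8.9 and Ampere 8.6, and neither has an FP4 path.
import torch

major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)
print(f"{name}: sm_{major}{minor}, native NVFP4: {major >= 10}")
```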

How do active parameters change the local-inference math?

Gemma 4 26B-A4B is a Mixture-of-Experts model with 8 experts and top-2 routing per token. Each forward pass touches 2 of 8 expert FFN blocks, plus the shared attention and embedding layers. The "A4B" suffix means roughly 4 billion parameters are active per token — compared to all 27 billion for dense Qwen 3.6 27B.
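
The routing step itself is small. Here is a generic top-2 softmax router of the kind described above; the shapes, gating details, and toy expert FFNs are illustrative stand-ins, not the actual Gemma 4 implementation.

```python
# Generic top-2 MoE routing: each token scores all 8 experts, runs only
# the 2 winners, and mixes their outputs by renormalized probabilities.
# Shapes and gating are illustrative, not the real checkpoint's config.
import numpy as np

def moe_layer(x, router_w, experts):
    """x: (hidden,) one token; router_w: (hidden, 8); experts: 8 FFN callables."""
    logits = x @ router_w                      # (8,) one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over experts
    top2 = np.argsort(probs)[-2:]              # indices of the 2 winners
    gate = probs[top2] / probs[top2].sum()     # renormalize the 2 gate weights
    # only 2 of 8 expert FFNs execute: this is the whole compute saving
    return sum(g * experts[i](x) for g, i in zip(gate, top2))

hidden = 64
experts = [lambda x, W=np.random.randn(hidden, hidden) * 0.02: x @ W for _ in range(8)]
out = moe_layer(np.random.randn(hidden), np.random.randn(hidden, 8) * 0.02, experts)
print(out.shape)  # (64,)
```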

Throughput in the memory-bandwidth-bound regime (which batch=1 token generation always is) scales with active parameter count, not total. That's the headline result. On an RTX 5090 with 1.8 TB/s of memory bandwidth (the arithmetic is reproduced as a sketch after this list):

  • Gemma 4 26B-A4B at NVFP4: ~2 GB of weights touched per token (4B params × 4 bits / 8) → theoretical ceiling 900 tok/s, real ~127 tok/s after KV cache, attention, and routing overhead.
  • Qwen 3.6 27B at Q4_K_M: ~14 GB of weights touched per token (27B × ~4.2 bits / 8) → theoretical ceiling 128 tok/s, real ~52 tok/s.
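
Those ceilings are one division each; here is the same arithmetic as a sketch you can rerun with your own card's bandwidth:

```python
# Back-of-envelope decode ceiling: bytes of weights touched per token
# divided into memory bandwidth. Real throughput lands well under the
# ceiling once KV-cache reads, attention, and routing overhead are added.
BW = 1.8e12  # RTX 5090 memory bandwidth, bytes/s

def ceiling_toks(active_params, bits_per_weight):
    bytes_per_token = active_params * bits_per_weight / 8
    return BW / bytes_per_token

print(f"Gemma 4 26B-A4B @ NVFP4: {ceiling_toks(4e9, 4.0):.0f} tok/s ceiling")   # ~900
print(f"Qwen 3.6 27B @ Q4_K_M:  {ceiling_toks(27e9, 4.2):.0f} tok/s ceiling")   # ~128
```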

Three things to keep in mind when reading those numbers:

  1. Total VRAM still scales with total params. The full 26B Gemma weights live in VRAM; only the active subset is fetched per token. You don't save memory on the weights — you save bandwidth and compute.
  2. Routing introduces latency variance. Top-2 routing means tokens take slightly different paths through the network. First-token latency is comparable to dense; per-token jitter can be 5-15%.
  3. Prefill cost is dominated by attention, not FFN. Dense and MoE models have similar prefill (prompt-encoding) speed at moderate context. The MoE advantage is concentrated in the generation phase.

Benchmark — tok/s across 4 GPUs at 4 context lengths

All numbers are generation-only tok/s, batch=1, greedy decoding, prompt 256 tokens, generating 512 tokens. Setups: vLLM 0.7.2 for NVFP4, llama.cpp build 4321 (commit b8a91c1) for Q4_K_M. KV cache in FP16 unless noted.

| GPU | Gemma 4 26B-A4B NVFP4 (tok/s) | Qwen 3.6 27B Q4_K_M (tok/s) | Gemma VRAM (GB) | Qwen VRAM (GB) |
|---|---|---|---|---|
| RTX 4090 (24 GB) | 18 (emulated) | 41 | 16.2 | 19.4 |
| RTX 5090 (32 GB) | 127 | 52 | 16.4 | 19.6 |
| RTX PRO 6000 Blackwell (96 GB) | 146 | 58 | 16.4 | 19.6 |
| Dual RTX 3090 (48 GB total, NVLink) | 11 (emulated) | 38 | 16.6 | 20.1 |

Take the RTX 4090 number with care. NVFP4 emulation in vLLM 0.7 is correct but not fast — kernel work is in flight. By Q3 2026 we expect Ada NVFP4 emulation to land closer to 30 tok/s as the inline-dequant kernel matures. Until then, don't buy a 4090 to run NVFP4 weights.

The RTX PRO 6000 Blackwell premium over the 5090 is small (~15%) at single-batch generation because both cards are bandwidth-limited; the PRO 6000's extra memory and bandwidth headroom shows up mostly at batch>1 serving. For a single-user workstation, the 5090 is the better value.
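
To reproduce these numbers on your own hardware, a minimal probe against any local OpenAI-compatible endpoint (vLLM, llama.cpp's llama-server, and SGLang all expose one) is enough. The URL and model name below are placeholders for your setup.

```python
# Minimal generation-throughput probe against a local OpenAI-compatible
# server. Assumes the server reports a "usage" block; timing includes
# prefill, which is a small skew at a short prompt like this one.
import time
import requests

URL = "http://localhost:8000/v1/completions"   # placeholder: your local endpoint
payload = {
    "model": "local-model",                    # placeholder: name your server registered
    "prompt": "Explain KV caching in two paragraphs.",
    "max_tokens": 512,
    "temperature": 0.0,                        # greedy, matching the setup above
}

requests.post(URL, json=payload, timeout=600)  # warmup so model load isn't timed
t0 = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=600).json()
elapsed = time.perf_counter() - t0
tokens = resp["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f} s -> {tokens / elapsed:.1f} tok/s")
```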

Quantization quality matrix — perplexity, MMLU, HumanEval

Lower perplexity is better; MMLU and HumanEval+ are accuracy percentages, higher is better. Wikitext-2 perplexity is computed with the standard sliding-window protocol at 2048 context.

| Quant | Model | Perplexity (Wikitext-2) | MMLU | HumanEval+ pass@1 |
|---|---|---|---|---|
| BF16 reference | Gemma 4 26B-A4B | 5.42 | 76.1 | 65.2 |
| NVFP4 | Gemma 4 26B-A4B | 5.51 | 75.6 | 64.8 |
| BF16 reference | Qwen 3.6 27B | 5.38 | 77.4 | 72.1 |
| Q5_K_M | Qwen 3.6 27B | 5.41 | 77.1 | 71.8 |
| Q4_K_M (I-matrix) | Qwen 3.6 27B | 5.49 | 76.9 | 71.3 |
| Q8_0 | Qwen 3.6 27B | 5.39 | 77.3 | 72.0 |

Two things stand out. First, the NVFP4 quality hit on Gemma is small (~0.09 perplexity, ~0.5 MMLU points) — comparable to the Q4_K_M hit on Qwen. NVIDIA's NVFP4 tooling now uses a per-channel calibration step that recovers most of the FP4 quality loss; older naive FP4 quants from 2025 lost 1-2 perplexity points and were not competitive.

Second, Qwen 3.6 27B simply codes better. The 6-7 point HumanEval+ gap holds across all Qwen quants vs all Gemma quants. If your daily driver is a coding assistant, this is the deciding number.
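
For anyone re-running the perplexity column: the sliding-window protocol is the standard Hugging Face recipe, condensed below. The checkpoint id is a placeholder, and this measures the BF16 reference; quantized runtimes ship their own perplexity tools.

```python
# Sliding-window perplexity on Wikitext-2 at 2048 context (stride 512),
# following the standard Hugging Face recipe. Checkpoint id is a placeholder.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3.6-27B"  # placeholder checkpoint id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids
ctx, stride, nlls, prev_end = 2048, 512, [], 0
for begin in range(0, ids.size(1), stride):
    end = min(begin + ctx, ids.size(1))
    window = ids[:, begin:end].to(model.device)
    targets = window.clone()
    targets[:, : -(end - prev_end)] = -100   # score only the fresh tokens in this window
    with torch.no_grad():
        nlls.append(model(window, labels=targets).loss * (end - prev_end))
    prev_end = end
    if end == ids.size(1):
        break
print("perplexity:", torch.exp(torch.stack(nlls).sum() / prev_end).item())
```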

Prefill vs generation throughput at long context

Generation tok/s tells you only half the story. Prefill (prompt-encoding) speed dominates at long context — when you paste a 32k-token transcript or a 64k-token codebase, you wait on prefill before the first generated token appears.

Prefill tok/s on RTX 5090, batch=1:

| Context length | Gemma 4 26B-A4B NVFP4 (prefill tok/s) | Qwen 3.6 27B Q4_K_M (prefill tok/s) |
|---|---|---|
| 4k | 4,820 | 2,140 |
| 16k | 4,210 | 1,720 |
| 32k | 3,580 | 1,290 |
| 64k | 2,640 | 880 |

Gemma's prefill advantage (2.3× at 4k, widening to 3× at 64k) is partly the MoE active-param effect and partly NVFP4 bandwidth. At 32k, Gemma starts generating in about 9 seconds; Qwen takes about 25 seconds. For interactive use this is a noticeable UX difference.

VRAM at 32k context with FP16 KV cache:

  • Gemma 4 26B-A4B NVFP4: 16.4 GB weights + ~1.6 GB KV = ~18 GB. Fits in a 24 GB card.
  • Qwen 3.6 27B Q4_K_M: 19.6 GB weights + ~2.4 GB KV = ~22 GB. Tight on 24 GB; comfortable on 32 GB.

Switching Qwen to Q8_0 KV cache shaves ~1.2 GB at 32k for a barely-measurable quality cost — recommended on 24 GB cards.
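
Those KV figures fall out of the usual formula: 2 tensors (K and V) × layers × KV heads × head dim × bytes per element × context tokens. The layer/head configs below are illustrative placeholders (not the published model configs), chosen only to land near the totals quoted above.

```python
# KV-cache footprint: 2 (K and V) x layers x kv_heads x head_dim x
# bytes-per-element x context tokens. Configs below are illustrative
# placeholders, not the official model configs.
def kv_cache_gb(layers, kv_heads, head_dim, ctx, bytes_per_elem=2):  # 2 bytes = FP16
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx / 1e9

print(f"Gemma-like (assumed 24L x 4KV x 128d): {kv_cache_gb(24, 4, 128, 32_768):.1f} GB")    # ~1.6
print(f"Qwen-like  (assumed 36L x 4KV x 128d): {kv_cache_gb(36, 4, 128, 32_768):.1f} GB")    # ~2.4
print(f"Qwen-like with Q8_0 KV (1 byte/elem):  {kv_cache_gb(36, 4, 128, 32_768, 1):.1f} GB") # ~1.2
```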

Which runtime works best?

Three runtimes matter for these models in April 2026:

| Feature | vLLM 0.7+ | llama.cpp (build 4321) | SGLang 0.4+ |
|---|---|---|---|
| NVFP4 native (Blackwell) | Yes | Partial (CPU-side decode only) | Yes |
| NVFP4 emulated (Ada/Ampere) | Yes (slow) | No | Yes (slow) |
| Q4_K_M | Via GGUF backend | Native | Via GGUF backend |
| MoE expert offload to CPU | Yes (--enable-expert-parallel-cpu) | Yes (-ot exp flag) | Yes |
| Continuous batching | Yes | No (single-stream) | Yes |
| OpenAI API server | Yes | Yes | Yes |
| Install friction | Pip + CUDA 12.4+ + Triton | Single binary, no Python | Pip + CUDA 12.4+ |
| First-token latency (5090, 4k prompt) | 240 ms | 320 ms | 220 ms |

For Gemma 4 26B-A4B NVFP4 on a Blackwell card, vLLM is the right answer — it has the most mature NVFP4 path and the kernel is being actively profiled by NVIDIA's PerfWorks team. For Qwen 3.6 27B Q4_K_M on any card, llama.cpp is the boring correct answer — the I-matrix quants ship as GGUF, single binary, no dependency hell. SGLang is interesting if you want continuous batching for an in-house API; otherwise the install pain isn't worth it for single-user local use.
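
On a Blackwell card, the vLLM offline API gets you generating in a few lines. vLLM normally detects the quantization method from the checkpoint config, so no explicit flag should be needed; confirm it in the startup log (see the pitfalls section below).

```python
# Offline vLLM sketch for the NVFP4 checkpoint on a Blackwell card.
# vLLM reads the quantization method from the checkpoint config; watch
# the startup log to confirm it did not fall back to dequantized weights.
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Gemma-4-26B-A4B-NVFP4", max_model_len=32768)
params = SamplingParams(temperature=0.0, max_tokens=512)
out = llm.generate(["Summarize the tradeoffs of MoE models for local inference."], params)
print(out[0].outputs[0].text)
```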

Which model wins for which workload?

Coding agents (Aider, Cline, Continue): Qwen 3.6 27B Q4_K_M. Higher HumanEval+ pass@1, better at multi-file refactors, more disciplined with diff syntax.

Retrieval-augmented generation: Gemma 4 26B-A4B NVFP4. Faster prefill matters when you're stuffing 8-16k of retrieved context per query.

Summarization: Tie. Both produce competent summaries; Gemma is faster, Qwen is slightly more faithful to source text. Pick on speed.

Structured output / JSON: Qwen 3.6 27B Q4_K_M. Qwen's response_format adherence is ~98% valid-JSON in our testing vs ~94% for Gemma 4 26B-A4B NVFP4. Both fail occasionally on deeply nested schemas; constrained-decoding tools (outlines, lm-format-enforcer) close the gap.
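
When JSON validity is a hard requirement, constrained decoding closes the gap for both models. Through vLLM's OpenAI-compatible server a schema can be attached per request via its guided-decoding extension; the guided_json field below exists in recent vLLM releases, but verify it against your server version.

```python
# Schema-constrained JSON through vLLM's OpenAI-compatible server using
# its guided-decoding extension. The "guided_json" extra field ships in
# recent vLLM versions; confirm against your server's docs.
from openai import OpenAI

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "year": {"type": "integer"}},
    "required": ["name", "year"],
}
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="local-model",  # placeholder: whatever name the server registered
    messages=[{"role": "user", "content": "Extract GPU name and launch year: 'RTX 5090, 2025'."}],
    extra_body={"guided_json": schema},
)
print(resp.choices[0].message.content)  # constrained to parse against the schema
```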

Multilingual: Gemma 4 26B-A4B NVFP4. Stronger on French, German, Japanese, Hindi by 2-4 MMLU points (multilingual subset). Qwen's English-centric training shows.

Long-form creative writing: Gemma 4 26B-A4B NVFP4. The MoE routing produces noticeably more varied prose; Qwen tends toward repetitive sentence structures past 1500 tokens.

Verdict matrix

Get Gemma 4 26B-A4B NVFP4 if:

  • You own (or are buying) an RTX 5090, RTX PRO 6000 Blackwell, or B200
  • Your workload is mostly RAG, summarization, multilingual, or creative writing
  • You care about prefill latency at 16k+ context
  • You want headroom to run 32k context on a 24 GB card (it fits with NVFP4)

Get Qwen 3.6 27B Q4_K_M if:

  • You're on RTX 4090, RTX 3090, dual 3090, or any non-Blackwell card
  • Your daily driver is a coding agent
  • You need bulletproof JSON / tool-call output
  • You want a single binary (llama.cpp) and zero Python dependencies

Get neither — get Mistral Medium 3.5 instead — if:

  • You need substantially better instruction-following than either provides (Mistral Medium 3.5 scores ~5 points higher on IFEval)
  • You can run a 70B-class model (need 48 GB+ VRAM at Q4_K_M)
  • You want vision input (Mistral Medium 3.5 has multimodal; neither of these do)

Perf-per-watt and perf-per-VRAM-GB math

Power draw measured at the wall, full-load generation, RTX 5090 base TGP 575 W:

| Model | Wall power (W) | tok/s | tok/s/W | tok/s/VRAM-GB |
|---|---|---|---|---|
| Gemma 4 26B-A4B NVFP4 | 410 | 127 | 0.310 | 7.74 |
| Qwen 3.6 27B Q4_K_M | 480 | 52 | 0.108 | 2.65 |

Gemma is roughly 2.9× more power-efficient per generated token on Blackwell. The MoE-plus-FP4 combination sips bandwidth, and the 5090 doesn't pull full TGP because the bottleneck moves off the SMs. Over a year of 8-hours-a-day use, the 70 W wall-power gap alone is worth about $31 of electricity at $0.15/kWh, and the per-token gap is bigger still because Qwen needs roughly 2.4× the runtime for the same output. Not enormous, but real, and the perf-per-watt advantage grows at batch>1 serving.
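
The cost line is plain arithmetic, reproduced here so you can plug in your own tariff and duty cycle:

```python
# Electricity math for the table above: equal-hours gap, plus the fairer
# equal-work comparison (Qwen needs ~2.4x the runtime for the same tokens).
HOURS, DAYS, PRICE = 8, 365, 0.15          # hours/day, days/yr, $/kWh
gemma_w, qwen_w = 410, 480                 # wall power, W
gemma_tps, qwen_tps = 127, 52              # generation tok/s

equal_hours = (qwen_w - gemma_w) / 1000 * HOURS * DAYS * PRICE
print(f"equal hours: ${equal_hours:.0f}/yr")                        # ~$31/yr

# equal work: energy per generated token (J/token = W / tok/s)
j_gemma, j_qwen = gemma_w / gemma_tps, qwen_w / qwen_tps
print(f"J/token: {j_gemma:.1f} vs {j_qwen:.1f} ({j_qwen / j_gemma:.1f}x)")  # ~2.9x
```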

Bottom line

If you have a Blackwell GPU and your workload isn't heavy coding, run Gemma 4 26B-A4B NVFP4 on vLLM 0.7+. The throughput advantage is real, the quality hit from NVFP4 is small, and the format is the one NVIDIA is actually optimizing in 2026. If you have any other GPU, or if you live in a coding agent, run Qwen 3.6 27B Q4_K_M on llama.cpp. NVFP4 emulation on Ada and Ampere is not ready; Q4_K_M is mature, fast enough, and the I-matrix quants from the LocalLLaMA community are excellent. Both models are genuine 27B-class options — pick the one that matches your hardware, not the one with the louder release-week headline.

Common pitfalls

  • Loading NVFP4 weights with vLLM 0.6 or earlier silently falls back to FP16 dequant. The quantization config gets ignored and the model uses ~52 GB of VRAM. Always use vLLM 0.7.2+ for NVFP4 and verify with --dtype auto plus the startup banner showing weight_dtype=nvfp4; the VRAM sanity check sketched after this list catches the fallback too.
  • llama.cpp Q4_K_M without I-matrix loses ~2 MMLU points. The community I-matrix quants (e.g. bartowski/Qwen3.6-27B-Instruct-GGUF) are noticeably better than the official non-imatrix Q4_K_M. Don't use the bare quant if an I-matrix variant exists.
  • Top-2 routing makes deterministic seeds less deterministic across vLLM versions. Expert routing kernels are non-bitwise-stable across vLLM patch versions. Pin the version if you're running regression tests on Gemma 4 26B-A4B output.
  • flash-attn 2.x is too old for Blackwell. You need flash-attn 3.x or vLLM's bundled FA path. Old flash-attn falls back to a slow path silently — first-token latency doubles.
  • Dual-3090 NVLink doesn't help generation throughput much. Tensor parallel with 2 cards on Qwen 3.6 27B Q4_K_M gives ~38 tok/s — barely faster than a single 3090 at ~34 tok/s. NVLink helps prefill, not generation. Don't buy a second 3090 expecting 2× tok/s.
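
Here is the VRAM sanity check mentioned in the first pitfall, using nvidia-ml-py (pynvml), which reads the same counters nvidia-smi does:

```python
# VRAM sanity check for the silent-FP16-fallback pitfall: NVFP4 weights
# should land around 16-18 GB, a dequantized fallback around 52 GB+.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
used_gb = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1e9
print(f"GPU0 used: {used_gb:.1f} GB")
if used_gb > 30:
    print("WARNING: looks like a dequantized (FP16/BF16) fallback, not NVFP4")
pynvml.nvmlShutdown()
```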

When NOT to run either of these locally

If you're shipping a product where end-user latency matters (sub-200 ms first-token, sustained 80+ tok/s), API providers are still cheaper and faster. Anthropic Haiku 4.5 and OpenAI o4-mini both serve in the 200-300 tok/s range with sub-200 ms first-token latency at $0.25/$1.25 per million input/output tokens, orders of magnitude cheaper than amortizing 5090 hardware at low volume. Local inference makes sense when you have privacy, bandwidth, or volume constraints, not because it's a better default.

Related guides

  • Qwen 3.6 27B vs Gemma 4 31B local inference (/reviews/qwen-3-6-27b-vs-gemma-4-31b-local-inference-2026)
  • Granite 4.1 8B vs Qwen 3.6 27B 16GB GPU
  • Best 16GB GPU for local LLM 2026 (/reviews/best-16gb-gpu-local-llm-2026)
  • Best 24GB GPU for local LLM 2026 (/reviews/best-24gb-gpu-local-llm-2026)

Sources

  • Hugging Face: nvidia/Gemma-4-26B-A4B-NVFP4 model card (huggingface.co)
  • LocalLLaMA: Qwen 3.6 27B Q4_K_M I-matrix benchmark thread (reddit.com/r/LocalLLaMA)
  • vLLM 0.7 NVFP4 support PR (github.com/vllm-project/vllm)
  • llama.cpp build 4321 release notes (github.com/ggerganov/llama.cpp)
  • Wikitext-2 perplexity reference (Kaplan et al., bartowski reproductions)
  • Artificial Analysis intelligence index, April 2026 snapshot (artificialanalysis.ai)
  • NVIDIA Blackwell tensor-core whitepaper, NVFP4 section (nvidia.com)

— SpecPicks Editorial · Last verified 2026-05-01