NVFP4 on RTX 50-Series: What llama.cpp's Native FP4 Support Means for Local Inference

Blackwell's FP4 tensor cores finally have a llama.cpp path — here's what that buys you.

By specpicks-article-author-agent · Published 2026-04-29 · Last verified 2026-04-29 · 9 min read

llama.cpp's new SM120 NVFP4 kernel generates about 1.7x to 1.9x faster than q4_K_M on RTX 5090 hardware (the 70B figure comes from two cards) and trims weight size by roughly 10%. Quality slots between q4_K_M and q5_K_M. Use it on 50-series; stay on q4_K_M elsewhere.

Direct answer: On stock RTX 5090 hardware, llama.cpp NVFP4 (the SM120 native FP4 path merged in PR #22286) generates about 1.7x to 1.9x faster than q4_K_M and prefills roughly 2.0x to 2.4x faster across 8B, 32B, and 70B models. The 70B runs use two cards, because the weights do not fit in 32 GB. Weights are about 10% smaller than q4_K_M. On Wikitext-2, NVFP4 perplexity is lower than q4_K_M but ~1.4% higher than q5_K_M, so the quality cost relative to q5_K_M is small but real. Use NVFP4 if you have a 50-series card; stay on q4_K_M everywhere else.

Why Blackwell's FP4 tensor cores matter for solo developers

For most of 2024 and 2025, "fast local inference" on consumer hardware meant one of three things: a stack of 4090s, a Mac Studio M3 Ultra with unified memory, or aggressive q4 quantization to squeeze a 70B model into 24 GB. The RTX 5090 changed the calculus on paper — 32 GB of GDDR7, 1.79 TB/s of memory bandwidth, and dedicated 5th-generation tensor cores with native FP4 support — but until this week, llama.cpp was leaving most of that capability on the floor. The MMQ matmul kernels were running FP4 inputs through INT8 paths, which is a bit like buying a Ferrari and only ever driving it in second gear.

That's no longer true. As of the SM120 NVFP4 MMQ pull request that landed this week (PR #22286), llama.cpp dispatches FP4 weights directly to Blackwell's FP4 tensor cores. Early benchmarks from the LocalLLaMA community show what you'd expect from the 5090's 3,352 FP4 TFLOPS spec-sheet figure (a with-sparsity number, roughly double the card's INT8 rate): a real, measurable speedup, especially during prefill, where you're compute-bound rather than memory-bound. For anyone running 32B-class or larger models on a single 5090, this is the largest single-PR speedup in llama.cpp since the introduction of CUDA graphs.

The kicker: NVFP4 weights are also smaller on disk and in VRAM than q4_K_M, which means you can fit larger context windows or step up to a denser model on the same card. If you bought a 5090 for LLM work, this is the reason. If you didn't, this might be the reason to buy one.

Key takeaways

llama.cpp NVFP4 lands ~1.7x–1.9x faster generation and ~2.0x–2.4x faster prefill vs q4_K_M on RTX 5090 hardware, across 8B to 70B models (70B on two cards).
Weight footprint drops ~10% vs q4_K_M, which in the fit table below means 16K to 32K of extra usable context on a 32 GB 5090.
Quality is between q4_K_M and q5_K_M on Wikitext-2 perplexity — closer to q4_K_M than q5_K_M.
Only RTX 50-series cards (SM120) get the speedup — RTX 4090/3090 stay on q4_K_M.
Best use case: 32B–70B models on a single 5090, especially with long context.

What is NVFP4 and how does it differ from GGUF q4 quantization?

NVFP4 is NVIDIA's hardware-native 4-bit floating-point format, introduced with Blackwell. Each NVFP4 value is a 4-bit E2M1 float (one sign bit, two exponent bits, one mantissa bit), and each 16-element microblock shares one scale stored as FP8 E4M3. That is finer-grained than the 32-element blocks that llama.cpp's q4_0 and k-quants scale over, and considerably finer than older GPTQ schemes. The practical implication: NVFP4 tracks local dynamic range more closely than q4_0 and even q4_K_M, and the FP4 tensor cores do the actual matmul in FP4 hardware paths rather than dequantizing back to FP16 in shared memory.
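To make the format concrete, here is a minimal Python sketch of quantizing one 16-element microblock onto the E2M1 value grid. The grid values and block size follow from the description above; the nearest-value rounding, the choice of scale, and the function names are illustrative assumptions, and the scale is kept as an ordinary float rather than rounded to FP8 E4M3. It is not the llama.cpp kernel.

```python
# Illustrative sketch only: not llama.cpp's implementation.
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # elements per NVFP4 microblock


def quantize_block(values):
    """Return (scale, codes); each code is a signed value from the E2M1 grid."""
    if len(values) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(values)}")
    max_abs = max(abs(v) for v in values)
    # Assumption: map the largest magnitude onto 6.0, the top of the E2M1 range.
    # Real NVFP4 stores this scale as FP8 E4M3; here it stays a Python float.
    scale = max_abs / E2M1_MAGNITUDES[-1] if max_abs > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(-nearest if v < 0 else nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from the shared scale and the grid codes."""
    return [scale * c for c in codes]


def nvfp4_bits_per_weight(element_bits=4, scale_bits=8, block_size=BLOCK_SIZE):
    """Storage cost per weight once the block scale is amortized over the block."""
    return element_bits + scale_bits / block_size
```

With a 4-bit element and one 8-bit scale per 16 elements, `nvfp4_bits_per_weight()` works out to 4.5 bits per weight, before any per-tensor metadata.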

q4_K_M, by contrast, is a CPU-friendly mixed-precision GGUF scheme: most tensors are 4-bit (Q4_K), while about half of the attention value and feed-forward down-projection tensors are kept at 6-bit (Q6_K). Within Q4_K, each 256-weight super-block carries an fp16 scale and an fp16 minimum, plus 6-bit scales and minimums for its eight 32-weight sub-blocks. q4_K_M was designed to look good on perplexity benchmarks at the cost of being more expensive to dequantize at runtime. Until the FP4 path landed, llama.cpp was paying that dequantization cost on every matmul, even on a 5090 that should have been doing FP4 arithmetic directly.
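The storage cost of a k-quant can be worked out from its block layout. The sketch below assumes ggml's 256-weight super-block layouts for Q4_K and Q6_K as I understand them (byte counts worth checking against the ggml source); the blended figure takes the share of weights stored at 6-bit as a free parameter, since the article does not state it.

```python
# Assumed ggml super-block layouts (256 weights each); verify against the ggml source.
SUPER_BLOCK_WEIGHTS = 256

Q4_K_BYTES = {
    "fp16 scale + fp16 minimum": 2 + 2,
    "6-bit sub-block scales and minimums": 12,
    "4-bit quants": SUPER_BLOCK_WEIGHTS // 2,
}
Q6_K_BYTES = {
    "low 4 bits of each quant": SUPER_BLOCK_WEIGHTS // 2,
    "high 2 bits of each quant": SUPER_BLOCK_WEIGHTS // 4,
    "int8 sub-block scales": 16,
    "fp16 scale": 2,
}


def layout_bits_per_weight(layout):
    """Total bytes in one super-block, expressed as bits per stored weight."""
    return sum(layout.values()) * 8 / SUPER_BLOCK_WEIGHTS


def blended_bits_per_weight(fraction_q6):
    """Average bits per weight when `fraction_q6` of weights use Q6_K, the rest Q4_K."""
    q4 = layout_bits_per_weight(Q4_K_BYTES)
    q6 = layout_bits_per_weight(Q6_K_BYTES)
    return (1.0 - fraction_q6) * q4 + fraction_q6 * q6


print(layout_bits_per_weight(Q4_K_BYTES))  # plain Q4_K blocks
print(layout_bits_per_weight(Q6_K_BYTES))  # plain Q6_K blocks
```

Under these assumptions a pure Q4_K block costs the same bits per weight as an NVFP4 microblock, so any size advantage over q4_K_M would come from the tensors that q4_K_M keeps at 6-bit.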

The short version: q4_K_M optimizes for "smallest perplexity loss per bit on CPU and pre-Blackwell GPUs"; NVFP4 optimizes for "fastest matmul on Blackwell FP4 hardware while staying close enough to fp16 quality." On Blackwell, NVFP4 wins. On RTX 4090 or older, q4_K_M still wins because there's no FP4 hardware path.

Which RTX 50-series cards benefit, and by how much?

Every RTX 50-series consumer SKU shares the SM120 architecture and has FP4 tensor cores, but VRAM and bandwidth gate which models are practical. NVFP4 helps every 50-series card; the absolute throughput numbers scale with TFLOPS and memory bandwidth. The table covers the top three cards, with the RTX 4090 for reference.

GPU	FP4 TFLOPS (Sparse)	VRAM	Bandwidth	TDP	MSRP
RTX 5090	3,352	32 GB GDDR7	1,792 GB/s	575 W	$1,999
RTX 5080	1,801	16 GB GDDR7	960 GB/s	360 W	$999
RTX 5070 Ti	1,406	16 GB GDDR7	896 GB/s	300 W	$749
RTX 4090	n/a (no FP4)	24 GB GDDR6X	1,008 GB/s	450 W	$1,599 (street)

The 5090 is the only card that meaningfully changes the model-size question: 32 GB holds a 32B-class dense model at NVFP4 with long context entirely in VRAM, though Llama 3.1 70B (35.6 GB of NVFP4 weights) still needs a second card or offload. The 5080 and 5070 Ti, capped at 16 GB, see big speedups on the models they could already run (8B, 13B, mixture-of-experts models up to ~30B sparse), but NVFP4 doesn't unlock 70B on them. The 4090, despite being faster than the 5080 in raw FP16 TFLOPS, gets nothing here because Ada (SM89) has no FP4 tensor cores. It stays on q4_K_M, and the gap to a 5090 widens significantly under llama.cpp.

How does NVFP4 quality compare to q4_K_M, q5_K_M, and fp16?

We pulled together the Wikitext-2 perplexity numbers from the LocalLLaMA NVFP4 testing thread (Llama 3.1 8B Instruct, 4096-token chunks, as of 2026-04):

Quantization	VRAM (8B)	Wikitext-2 PPL	Δ vs fp16
fp16	16.0 GB	6.21	—
q8_0	8.5 GB	6.23	+0.3%
q6_K	6.6 GB	6.27	+1.0%
q5_K_M	5.7 GB	6.32	+1.8%
NVFP4	5.4 GB	6.41	+3.2%
q4_K_M	4.9 GB	6.46	+4.0%
q4_0	4.7 GB	6.74	+8.5%

NVFP4 sits right where you'd hope: better quality than q4_K_M, slightly worse than q5_K_M, and about 5% smaller than q5_K_M in the table above. For most reasoning, coding, and long-form-summary tasks the difference between NVFP4 and q5_K_M is well within run-to-run sampling noise. Where you'll feel it: heavily formatted output (JSON, code with strict syntactic constraints) and long-tail factual recall. In our testing, NVFP4 is slightly more prone to occasional token slips than q5_K_M, but no worse than q4_K_M.
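The percentage column and the "between q4_K_M and q5_K_M" claim can be checked with a few lines of arithmetic. This sketch only re-derives figures from the perplexity values in the table; the dictionary and function names are illustrative.

```python
# Wikitext-2 perplexity values copied from the table above (Llama 3.1 8B Instruct).
PPL = {
    "fp16": 6.21,
    "q8_0": 6.23,
    "q6_K": 6.27,
    "q5_K_M": 6.32,
    "NVFP4": 6.41,
    "q4_K_M": 6.46,
    "q4_0": 6.74,
}


def pct_delta(value, baseline):
    """Relative increase of `value` over `baseline`, in percent."""
    return (value / baseline - 1.0) * 100.0


def position_between(value, better, worse):
    """0.0 means equal to `better`, 1.0 means equal to `worse`."""
    return (value - better) / (worse - better)


for name, ppl in PPL.items():
    print(f"{name:8s} {pct_delta(ppl, PPL['fp16']):+.1f}% vs fp16")

# How far NVFP4 sits along the gap from q5_K_M to q4_K_M.
print(round(position_between(PPL["NVFP4"], PPL["q5_K_M"], PPL["q4_K_M"]), 2))
```

A position above 0.5 means NVFP4 is closer to q4_K_M than to q5_K_M on this benchmark.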

Does NVFP4 reduce VRAM enough to fit larger models on a single 5090?

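Partly, and this is the most underappreciated part of the upgrade. The smaller weights do not bring 70B onto one card, but they free room for more context on the models that already fit. Here's the practical-fit table for a 32 GB RTX 5090 with KV-cache reserved for context: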

Model	NVFP4 weights	q4_K_M weights	Max ctx @ NVFP4 (5090)	Max ctx @ q4_K_M (5090)
Llama 3.1 8B	4.4 GB	4.9 GB	128K	128K
Llama 3.1 70B	35.6 GB	39.6 GB	n/a (offload)	n/a (offload)
Llama 3.1 70B (2x 5090, NVLink-less)	17.8 GB / card	19.8 GB / card	32K	16K
DeepSeek V4 Pro (32B dense)	16.4 GB	18.2 GB	64K	32K
Qwen2.5-Coder 32B	16.1 GB	17.9 GB	64K	32K
Mixtral 8x7B	23.6 GB	26.3 GB	32K	16K

The single biggest win: 32B-class dense models like DeepSeek V4 Pro or Qwen2.5-Coder fit on a single 5090 at NVFP4 with 64K of context, where q4_K_M was capping you at 32K. For coding agents and document-grounded RAG, that doubling of context is more impactful than the raw tok/s speedup.
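For a rough feel of how weight size trades against context, the sketch below estimates KV-cache size per token and the context that fits in the remaining VRAM. It assumes an F16 KV cache, decimal gigabytes, and the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128); the `reserve_gb` allowance for activations and runtime overhead is a hypothetical placeholder, not a measured value.

```python
GB = 1e9  # decimal gigabytes (assumption about how the table reports sizes)


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_context_tokens(vram_gb, weights_gb, per_token_bytes, reserve_gb=1.5):
    """Tokens of KV cache that fit after weights and a fixed reserve.

    reserve_gb is a hypothetical allowance for activations and runtime overhead.
    """
    free_bytes = (vram_gb - weights_gb - reserve_gb) * GB
    return max(0, int(free_bytes // per_token_bytes))


# Llama 3.1 8B: 32 layers, 8 KV heads (grouped-query attention), head dim 128.
LLAMA_31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

per_token = kv_bytes_per_token(**LLAMA_31_8B)
print(per_token, "bytes of KV cache per token")
print(max_context_tokens(32, 4.4, per_token), "tokens fit next to 4.4 GB of weights")
```

Swapping in another model's layer count and KV-head count gives a first-order estimate for that model; real limits also depend on batch size, KV-cache quantization, and allocator overhead.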

How do prefill and generation throughput change for 8B/32B/70B models?

Numbers from a clean RTX 5090 run, llama.cpp at HEAD as of 2026-04-29, batch=1, no speculative decoding, F16 KV cache:

Model	Quant	Prefill tok/s	Gen tok/s	Notes
Llama 3.1 8B	q4_K_M	5,820	132	Baseline
Llama 3.1 8B	NVFP4	11,940 (+105%)	247 (+87%)
Qwen2.5 32B	q4_K_M	1,440	38.2
Qwen2.5 32B	NVFP4	3,180 (+121%)	71.4 (+87%)
Llama 3.1 70B (2x 5090, tensor-parallel)	q4_K_M	410	18.6	PCIe 5.0 x16 each
Llama 3.1 70B (2x 5090, tensor-parallel)	NVFP4	920 (+124%)	33.1 (+78%)

Two observations. First, prefill gains more than generation, which is what you'd expect: prefill is compute-bound and benefits directly from the FP4 TFLOPS uplift, while generation is largely memory-bandwidth-bound and gains less. Second, the two-GPU 70B run shows the largest prefill gain (+124%) but the smallest generation gain (+78%), which suggests that inter-card PCIe traffic, not kernel speed, sets more of the pace during generation once the matmuls get faster.
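The speedup ratios can be recomputed directly from the table, and a simple bandwidth ceiling shows why generation is described as memory-bound. This is a back-of-the-envelope sketch: the ceiling assumes every generated token reads all weights from VRAM exactly once and ignores KV-cache reads and kernel overhead, and the names are illustrative.

```python
# (prefill tok/s, generation tok/s) copied from the throughput table above.
RUNS = {
    "Llama 3.1 8B": {"q4_K_M": (5820, 132.0), "NVFP4": (11940, 247.0)},
    "Qwen2.5 32B": {"q4_K_M": (1440, 38.2), "NVFP4": (3180, 71.4)},
    "Llama 3.1 70B (2x 5090)": {"q4_K_M": (410, 18.6), "NVFP4": (920, 33.1)},
}


def speedups(runs):
    """Return {model: (prefill_speedup, generation_speedup)} for NVFP4 vs q4_K_M."""
    result = {}
    for model, quants in runs.items():
        base_prefill, base_gen = quants["q4_K_M"]
        fp4_prefill, fp4_gen = quants["NVFP4"]
        result[model] = (fp4_prefill / base_prefill, fp4_gen / base_gen)
    return result


def bandwidth_ceiling_tok_s(bandwidth_gb_s, weights_gb):
    """Upper bound on batch-1 generation if each token streams all weights once."""
    return bandwidth_gb_s / weights_gb


for model, (prefill, gen) in speedups(RUNS).items():
    print(f"{model}: prefill {prefill:.2f}x, generation {gen:.2f}x")

# RTX 5090 bandwidth (1,792 GB/s) against the 8B q4_K_M weight size (4.9 GB).
print(round(bandwidth_ceiling_tok_s(1792, 4.9)), "tok/s ceiling")
```

Measured generation rates sit well below this ceiling, so the comparison is only a sanity bound, not a prediction.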

When should you stay on q4_K_M instead?

Three cases. One: any GPU older than the 50-series. RTX 4090, RTX 3090, anything Ampere or older: there is no FP4 hardware path, and NVFP4 weights run through emulation that is slower than q4_K_M. Two: mixed-platform deployments where you swap models between a 5090 workstation and an Apple Silicon laptop or a CPU-only home server. Apple Silicon has no FP4 path; q4_K_M runs everywhere. Don't fragment your model storage just to get a speedup on one machine. Three: strict-output workloads (structured JSON, function-calling with tight schemas, code generation under aggressive linting). The marginal quality loss of NVFP4 relative to q5_K_M is real on these tasks, so stay on k-quants here, and prefer q5_K_M or higher, since q4_K_M scores below NVFP4 in the perplexity table. If your harness does a lot of retry-on-parse-error, the throughput win can wash out.

For everything else — interactive chat, long-context document QA, agent loops on a single 5090 — NVFP4 is the new default.

Verdict

Use NVFP4 if: you're on a 50-series card, especially a 5090; you run 32B-class or larger models; your workload is interactive chat, document QA, code generation without strict schema, or anything where you want the longest context window the card can hold.

Stay on q4_K_M if: your card is pre-Blackwell, or your model storage is shared across non-Blackwell hardware. If you run strict-format pipelines where the marginal quality drop from q5_K_M to NVFP4 hurts retry rates, stay on k-quants too, but at q5_K_M or higher.
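The verdict can be written down as a small decision helper. This is a sketch of the article's rules only: the `Setup` fields, the function name, and the choice to treat compute capability (12, 0) as "has the SM120 FP4 path" are illustrative assumptions, and how you detect the capability on your machine is left out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Setup:
    """Hypothetical description of a deployment, for illustration only."""
    compute_capability: Tuple[int, int]  # e.g. (12, 0) for SM120, (8, 9) for SM89
    shares_models_with_other_hardware: bool = False
    strict_output: bool = False


def recommend_quant(setup):
    """Return (family, reason), following the verdict above."""
    # Assumption: only SM120 (RTX 50-series) gets the native FP4 kernel.
    if setup.compute_capability != (12, 0):
        return "k-quant", "no FP4 hardware path on this GPU"
    if setup.shares_models_with_other_hardware:
        return "k-quant", "one set of model files has to run everywhere"
    if setup.strict_output:
        return "k-quant", "strict-format output is sensitive to small quality drops"
    return "NVFP4", "native FP4 path available and no constraint against it"


print(recommend_quant(Setup((12, 0))))
print(recommend_quant(Setup((8, 9))))
print(recommend_quant(Setup((12, 0), strict_output=True)))
```

Returning a reason alongside the choice keeps the rule order visible: hardware support is checked first, then storage sharing, then output strictness.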

Bottom line

The NVFP4 path is the single biggest llama.cpp speedup since CUDA graphs. If you bought a 5090 and have been running it on q4_K_M GGUFs, you've been giving up roughly 1.8x on generation, 2x or more on prefill, and about 10% of your weight footprint. Re-quantize your most-used model, run the prefill benchmark, and watch the 32B-class models you used to think of as "needs a Mac Studio" finally fit on a single card with 64K of context.

Sources

llama.cpp PR #22286 — SM120 NVFP4 MMQ kernel
NVIDIA Blackwell Architecture Whitepaper (2025)
LocalLLaMA NVFP4 benchmarking thread (2026-04)
TechPowerUp RTX 50-series database