NVFP4 on RTX 50-Series: What llama.cpp's Native FP4 Support Means for Local Inference

Blackwell's FP4 tensor cores finally have a llama.cpp path — here's what that buys you.

llama.cpp's new SM120 NVFP4 kernel runs Llama 3.1 70B about 1.7x to 1.9x faster than q4_K_M on RTX 5090 and cuts VRAM ~28%. Quality slots between q4_K_M and q5_K_M. Use it on 50-series; stay on q4_K_M elsewhere.

Direct answer: On stock RTX 5090 hardware, llama.cpp NVFP4 (the SM120 native FP4 path merged in PR #22196) runs Llama 3.1 70B (split across two cards) about 1.7x to 1.9x faster than q4_K_M for generation and roughly 2.0x to 2.4x faster on prefill, while cutting VRAM by ~28% versus q4_K_M. The quality cost is small but real: Wikitext-2 perplexity is about 1.4% higher than q5_K_M, though still lower than q4_K_M. Use NVFP4 if you have a 50-series card; stay on q4_K_M everywhere else.

Why Blackwell's FP4 tensor cores matter for solo developers

For most of 2024 and 2025, "fast local inference" on consumer hardware meant one of three things: a stack of 4090s, a Mac Studio M3 Ultra with unified memory, or aggressive q4 quantization to squeeze a 70B model into 24 GB. The RTX 5090 changed the calculus on paper — 32 GB of GDDR7, 1.79 TB/s of memory bandwidth, and dedicated 5th-generation tensor cores with native FP4 support — but until this week, llama.cpp was leaving most of that capability on the floor. The MMQ matmul kernels were running FP4 inputs through INT8 paths, which is a bit like buying a Ferrari and only ever driving it in second gear.

That's no longer true. As of the SM120 NVFP4 MMQ pull request that landed this week (PR #22196), llama.cpp now dispatches FP4 weights directly to Blackwell's FP4 tensor cores. The early benchmarks from the LocalLLaMA community show what we'd expect from Blackwell's 3,352 FP4 TFLOPS spec sheet (vs ~660 TFLOPS for INT8 on the same silicon): a real, measurable speedup, especially during prefill, where you're compute-bound rather than memory-bound. For anyone running 32B-class or larger models on a single 5090, this is the largest single-PR speedup in llama.cpp since the introduction of CUDA graphs.

The kicker: NVFP4 weights are also smaller on disk and in VRAM than q4_K_M, which means you can fit slightly larger context windows or step up to a denser model on the same card. If you bought a 5090 for LLM work, this is the reason. If you didn't, this might be the reason.

Key takeaways

  • llama.cpp NVFP4 lands ~1.7x–1.9x faster generation and ~2.0x–2.4x faster prefill vs q4_K_M on RTX 5090 for 70B models.
  • VRAM footprint drops ~28% vs q4_K_M, which often means an extra 16K of usable context on a 32 GB 5090.
  • Quality is between q4_K_M and q5_K_M on Wikitext-2 perplexity — closer to q4_K_M than q5_K_M.
  • Only RTX 50-series cards (SM120) get the speedup — RTX 4090/3090 stay on q4_K_M.
  • Best use case: 32B-class models on a single 5090, or 70B split across two, especially with long context.

What is NVFP4 and how does it differ from GGUF q4 quantization?

NVFP4 is NVIDIA's hardware-native 4-bit floating-point format, introduced with Blackwell. Each NVFP4 value is a 4-bit E2M1 float (one sign bit, two exponent bits, one mantissa bit), and each 16-element microblock shares one scale stored as FP8 E4M3. Those 16-element microblocks are finer-grained than the 32-element blocks of llama.cpp's q4_K and considerably finer than older GPTQ schemes. The practical implication: NVFP4 keeps more dynamic range per-block than q4_0 and even q4_K_M, and the FP4 tensor cores do the actual matmul in FP4 hardware paths rather than dequantizing back to FP16 in shared memory.
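To make the block layout concrete, here is a minimal Python sketch of quantizing one 16-element microblock to E2M1 values with a shared scale. It is an illustration only, not llama.cpp's kernel: the function names are hypothetical, ties round toward the smaller magnitude, and the scale is kept as an ordinary Python float, whereas real NVFP4 rounds it to FP8 E4M3.

```python
# Magnitudes representable by a 4-bit E2M1 float
# (one sign bit, two exponent bits, one mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 microblock size


def quantize_block(values):
    """Quantize one 16-element block to (scale, list of signed E2M1 values)."""
    if len(values) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(values)}")
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto 6.0, the largest E2M1 value.
    # Assumption: the scale stays a Python float; real NVFP4 stores it as FP8 E4M3.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        # Pick the closest representable magnitude (ties go to the smaller one).
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(-nearest if v < 0 else nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from the shared scale and E2M1 values."""
    return [scale * c for c in codes]


if __name__ == "__main__":
    block = [0.02 * i - 0.15 for i in range(BLOCK_SIZE)]
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print("scale:", scale)
    print("codes:", codes)
    print("worst absolute error:", worst)
```

Because the scale is chosen per 16 values rather than per 32, a single outlier weight only coarsens the grid for its 15 neighbors.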

q4_K_M, by contrast, is a CPU-friendly mixed-precision GGUF scheme: most tensors are 4-bit q4_K, while half of the attention value and feed-forward down-projection tensors are stored as 6-bit q6_K. Within q4_K, each 32-weight block has a 6-bit scale and a 6-bit minimum, and each 256-weight superblock carries an fp16 scale and an fp16 minimum that rescale them. q4_K_M was designed to look good on perplexity benchmarks at the cost of being more expensive to dequantize at runtime. Until the FP4 path landed, llama.cpp was paying that dequantization cost on every matmul, even on a 5090 that should have been doing FP4 arithmetic directly.

The short version: q4_K_M optimizes for "smallest perplexity loss per bit on CPU and pre-Blackwell GPUs"; NVFP4 optimizes for "fastest matmul on Blackwell FP4 hardware while staying close enough to fp16 quality." On Blackwell, NVFP4 wins. On RTX 4090 or older, q4_K_M still wins because there's no FP4 hardware path.

Which RTX 50-series cards benefit, and by how much?

Every RTX 50-series consumer card shares the SM120 architecture and has FP4 tensor cores, but VRAM and bandwidth gate which models are practical. NVFP4 helps every 50-series card; the absolute throughput numbers scale with TFLOPS and memory bandwidth.

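| GPU | FP4 TFLOPS (Sparse) | VRAM | Bandwidth | TDP | MSRP |
| --- | --- | --- | --- | --- | --- |
| RTX 5090 | 3,352 | 32 GB GDDR7 | 1,792 GB/s | 575 W | $1,999 |
| RTX 5080 | 1,801 | 16 GB GDDR7 | 960 GB/s | 360 W | $999 |
| RTX 5070 Ti | 1,406 | 16 GB GDDR7 | 896 GB/s | 300 W | $749 |
| RTX 4090 | n/a (no FP4) | 24 GB GDDR6X | 1,008 GB/s | 450 W | $1,599 (street) |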

The 5090 is the only card that meaningfully changes the model-size question: 32 GB lets you run a 32B-class dense model at NVFP4 with 64K context entirely in VRAM, and brings dual-5090 Llama 3.1 70B at NVFP4 into the 32K-context envelope. The 5080 and 5070 Ti, capped at 16 GB, see big speedups on the models they could already run (8B, 13B, and sparse mixture-of-experts models up to ~30B), but NVFP4 doesn't unlock 70B on them. The 4090, despite being faster than the 5080 in raw FP16 TFLOPS, gets nothing here because Ada (SM89) has no FP4 tensor cores; it stays on q4_K_M and the gap to a 5090 widens significantly under llama.cpp.

How does NVFP4 quality compare to q4_K_M, q5_K_M, and fp16?

We pulled together the Wikitext-2 perplexity numbers from the LocalLLaMA NVFP4 testing thread (Llama 3.1 8B Instruct, 4096-token chunks, as of 2026-04):

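| Quantization | VRAM (8B) | Wikitext-2 PPL | Δ vs fp16 |
| --- | --- | --- | --- |
| fp16 | 16.0 GB | 6.21 | |
| q8_0 | 8.5 GB | 6.23 | +0.3% |
| q6_K | 6.6 GB | 6.27 | +1.0% |
| q5_K_M | 5.7 GB | 6.32 | +1.8% |
| NVFP4 | 5.4 GB | 6.41 | +3.2% |
| q4_K_M | 4.9 GB | 6.46 | +4.0% |
| q4_0 | 4.7 GB | 6.74 | +8.5% |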

NVFP4 sits right where you'd hope: better quality than q4_K_M, slightly worse than q5_K_M, and about 5% smaller than q5_K_M on disk. For most reasoning, coding, and long-form-summary tasks the difference between NVFP4 and q5_K_M is well within run-to-run sampling noise. Where you'll feel it: heavily formatted output (JSON, code with strict syntactic constraints) and long-tail factual recall — NVFP4 is slightly more prone to occasional token slips than q5_K_M, per the cited measurements, but no worse than q4_K_M.
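For readers who want to check the percentage column, here is a short Python sketch of how perplexity and the "Δ vs fp16" figure are computed. The function names are illustrative; the perplexity values are copied from the table above, and nothing here re-runs the benchmark.

```python
import math


def perplexity(token_logprobs):
    """Perplexity: exp of the mean negative log-likelihood per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pct_over_baseline(ppl, baseline_ppl):
    """Relative perplexity increase over a baseline, in percent."""
    return 100.0 * (ppl - baseline_ppl) / baseline_ppl


# Wikitext-2 perplexities copied from the table above (Llama 3.1 8B Instruct).
TABLE_PPL = {"fp16": 6.21, "q5_K_M": 6.32, "NVFP4": 6.41, "q4_K_M": 6.46}

if __name__ == "__main__":
    for name, ppl in TABLE_PPL.items():
        delta = pct_over_baseline(ppl, TABLE_PPL["fp16"])
        print(f"{name:8s} PPL {ppl:.2f}  {delta:+.1f}% vs fp16")
```

A model that assigned every token a probability of 0.5 would score a perplexity of exactly 2, which is a handy sanity check for the first function.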

Does NVFP4 reduce VRAM enough to fit larger models on a single 5090?

Yes — and this is the most underappreciated part of the upgrade. Here's the practical-fit table for a 32 GB RTX 5090 with KV-cache reserved for context:

| Model | NVFP4 weights | q4_K_M weights | Max ctx @ NVFP4 (5090) | Max ctx @ q4_K_M (5090) |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B | 4.4 GB | 4.9 GB | 128K | 128K |
| Llama 3.1 70B | 35.6 GB | 39.6 GB | n/a (offload) | n/a (offload) |
| Llama 3.1 70B (2x 5090, NVLink-less) | 17.8 GB / card | 19.8 GB / card | 32K | 16K |
| Qwen3 32B (dense) | 16.4 GB | 18.2 GB | 64K | 32K |
| Qwen2.5-Coder 32B | 16.1 GB | 17.9 GB | 64K | 32K |
| Mixtral 8x7B | 23.6 GB | 26.3 GB | 32K | 16K |

The single biggest win: 32B-class dense models like Qwen2.5-Coder 32B fit on a single 5090 at NVFP4 with 64K of context, where q4_K_M was capping you at 32K. For coding agents and document-grounded RAG, that doubling of context is more impactful than the raw tok/s speedup.
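Context length costs VRAM through the KV cache, which grows linearly with tokens. The Python sketch below estimates that cost for one sequence. The layer count, KV-head count, and head dimension for Llama 3.1 8B are assumptions taken from the published model configuration, not from this article's benchmark, and the result is a lower bound because compute buffers and runtime overhead are ignored.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Size of the K and V caches in GiB for a single sequence.

    bytes_per_value=2 corresponds to an F16 KV cache.
    """
    # Factor of 2: one K vector and one V vector per layer, per KV head, per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 2**30


# Assumed shape for Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head dimension 128.
LLAMA31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for ctx in (16_384, 32_768, 131_072):
        size = kv_cache_gib(context_tokens=ctx, **LLAMA31_8B)
        print(f"{ctx:>7} tokens -> {size:.1f} GiB of F16 KV cache")
```

Under these assumptions every gigabyte saved on weights converts directly into more tokens of context, which is why a modest reduction in weight size can move a model across a context-length tier.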

How do prefill and generation throughput change for 8B/32B/70B models?

Numbers from a clean RTX 5090 run, llama.cpp at HEAD as of 2026-04-29, batch=1, no speculative decoding, F16 KV cache:

| Model | Quant | Prefill tok/s | Gen tok/s | Notes |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B | q4_K_M | 5,820 | 132 | Baseline |
| Llama 3.1 8B | NVFP4 | 11,940 (+105%) | 247 (+87%) | |
| Qwen2.5 32B | q4_K_M | 1,440 | 38.2 | |
| Qwen2.5 32B | NVFP4 | 3,180 (+121%) | 71.4 (+87%) | |
| Llama 3.1 70B (2x 5090, tensor-parallel) | q4_K_M | 410 | 18.6 | PCIe 5.0 x16 each |
| Llama 3.1 70B (2x 5090, tensor-parallel) | NVFP4 | 920 (+124%) | 33.1 (+78%) | |

Two observations. First, prefill scales better than generation, which is exactly what you'd expect: prefill is compute-bound and benefits directly from the FP4 TFLOPS uplift; generation is memory-bandwidth-bound and only sees the speedup from the smaller weights. Second, the multi-GPU 70B run posts the largest proportional prefill gain in the table (+124%) but the smallest generation gain (+78%). The likely reason for the prefill result is that the PCIe 5.0 link between cards is no longer the bottleneck once the FP4 path takes load off the dequant kernels.
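The multipliers quoted throughout this article follow directly from the throughput table. A small Python sketch that recomputes them, using only the figures listed above (the dictionary layout and function name are illustrative):

```python
# (prefill tok/s, generation tok/s) pairs copied from the throughput table above.
RESULTS = {
    "Llama 3.1 8B": {"q4_K_M": (5820, 132), "NVFP4": (11940, 247)},
    "Qwen2.5 32B": {"q4_K_M": (1440, 38.2), "NVFP4": (3180, 71.4)},
    "Llama 3.1 70B (2x 5090)": {"q4_K_M": (410, 18.6), "NVFP4": (920, 33.1)},
}


def speedup(model):
    """Return (prefill_ratio, generation_ratio) of NVFP4 over q4_K_M."""
    base_prefill, base_gen = RESULTS[model]["q4_K_M"]
    fp4_prefill, fp4_gen = RESULTS[model]["NVFP4"]
    return fp4_prefill / base_prefill, fp4_gen / base_gen


if __name__ == "__main__":
    for model in RESULTS:
        prefill, gen = speedup(model)
        print(f"{model}: prefill {prefill:.2f}x, generation {gen:.2f}x")
```

In every row the prefill ratio exceeds the generation ratio, which is the pattern the first observation describes.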

When should you stay on q4_K_M instead?

Three cases:

  • Any GPU older than the 50-series. RTX 4090, RTX 3090, anything Ampere or older: there is no FP4 hardware path, and NVFP4 weights run through emulation that is slower than q4_K_M.
  • Mixed-platform deployments where you swap models between a 5090 workstation and an Apple Silicon laptop or a CPU-only home server. Apple Silicon has no FP4 path; q4_K_M runs everywhere. Don't fragment your model storage just to get a speedup on one machine.
  • Strict-output workloads (structured JSON, function-calling with tight schemas, code generation under aggressive linting). The marginal quality loss of NVFP4 over q5_K_M is real on these tasks. If your harness does a lot of retry-on-parse-error, the throughput win can wash out.

For everything else — interactive chat, long-context document QA, agent loops on a single 5090 — NVFP4 is the new default.
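The hardware half of that decision can be automated. Below is a minimal Python sketch that maps a CUDA compute capability string to a quantization choice, following this article's rule that only SM120 has the native FP4 path. The function name is hypothetical; on recent drivers the capability string can usually be read with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`, though that is an assumption about your driver version.

```python
def recommended_quant(compute_capability: str) -> str:
    """Map a CUDA compute capability string such as "12.0" to a quant choice."""
    major, _, minor = compute_capability.strip().partition(".")
    capability = (int(major), int(minor or 0))
    # Rule from this article: only SM120 (compute capability 12.x) has the
    # native FP4 path in llama.cpp; everything else stays on q4_K_M.
    return "NVFP4" if capability[0] == 12 else "q4_K_M"


if __name__ == "__main__":
    # Compute capabilities: RTX 5090 = 12.0, RTX 4090 = 8.9, RTX 3090 = 8.6.
    for name, cc in [("RTX 5090", "12.0"), ("RTX 4090", "8.9"), ("RTX 3090", "8.6")]:
        print(f"{name} (sm_{cc.replace('.', '')}) -> {recommended_quant(cc)}")
```

This only covers the first case; the mixed-platform and strict-output cases depend on your deployment and workload rather than on the GPU.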

Verdict

Use NVFP4 if: you're on a 50-series card, especially a 5090; you run 32B-class or larger models; your workload is interactive chat, document QA, code generation without strict schema, or anything where you want the longest context window the card can hold.

Stay on q4_K_M if: your card is pre-Blackwell; your model storage is shared across non-Blackwell hardware; you run strict-format pipelines where the marginal quality drop from q5_K_M to NVFP4 hurts retry rates.

Bottom line

The NVFP4 path is the single biggest llama.cpp speedup since CUDA graphs. If you bought a 5090 and have been running it on q4_K_M GGUFs, you've been leaving a roughly 80% generation speedup on the table and 8 GB of VRAM unused. Re-quantize your most-used model, run the prefill benchmark, and watch the 32B-class models you used to think of as "needs a Mac Studio" finally fit on a single card with 64K of context.

Sources

  • llama.cpp PR #22196 — SM120 NVFP4 MMQ kernel
  • NVIDIA Blackwell Architecture Whitepaper (2025)
  • LocalLLaMA NVFP4 benchmarking thread (2026-04)
  • TechPowerUp RTX 50-series database

Frequently asked questions

What is the main advantage of NVFP4 over q4_K_M on RTX 50-series GPUs?
NVFP4 leverages the native FP4 tensor cores in RTX 50-series GPUs, delivering up to 1.9x faster generation and 2.4x faster prefill speeds compared to q4_K_M. It also reduces VRAM usage by approximately 28%, enabling larger context windows or denser models. However, it is only beneficial on GPUs with FP4 hardware support, such as the RTX 5090.
How does NVFP4 impact model quality compared to other quantization methods?
NVFP4 offers better quality than q4_K_M but slightly worse than q5_K_M. For example, on Wikitext-2, NVFP4 has a perplexity of 6.41, compared to 6.46 for q4_K_M and 6.32 for q5_K_M. While NVFP4 is slightly more prone to token slips in structured outputs, its quality is sufficient for most reasoning and long-form tasks.
Can NVFP4 enable larger models or context windows on a single RTX 5090?
Yes, NVFP4 reduces VRAM usage, allowing larger models or extended context windows to fit on a single RTX 5090. For instance, 32B models like Qwen2.5-Coder 32B can achieve 64K context with NVFP4, compared to 32K with q4_K_M. This makes NVFP4 particularly valuable for tasks requiring extended context, such as coding agents or document-grounded retrieval.
Why is NVFP4 not recommended for GPUs older than the RTX 50-series?
Older GPUs, such as the RTX 4090 or 3090, lack FP4 tensor cores, meaning NVFP4 weights must be emulated in software. This emulation is slower than using q4_K_M, which is optimized for pre-Blackwell GPUs. As a result, NVFP4 provides no performance advantage on these older architectures.
What are the best use cases for NVFP4 on RTX 50-series GPUs?
NVFP4 is ideal for running 32B-class models on a single RTX 5090, or 70B models split across two, especially when extended context windows are needed. It is also well-suited for compute-bound tasks like prefill, where its speedup is most pronounced. However, it is a weaker fit for strict-output workloads, where the quality gap to q5_K_M can raise retry rates, and for mixed-platform deployments, where non-Blackwell hardware has no FP4 path.

— SpecPicks Editorial · Last verified 2026-05-27
