Direct answer: On RTX 5090 hardware, llama.cpp NVFP4 (the SM120 native FP4 path merged in PR #22286) runs Llama-class models from 8B to 70B about 1.8x to 1.9x faster than q4_K_M for generation and roughly 2.0x to 2.2x faster on prefill, while cutting weight VRAM by roughly 10% versus q4_K_M. Quality sits between q4_K_M and q5_K_M: Wikitext-2 perplexity is about 1.4% higher than q5_K_M. Use NVFP4 if you have a 50-series card; stay on q4_K_M everywhere else.
Why Blackwell's FP4 tensor cores matter for solo developers
For most of 2024 and 2025, "fast local inference" on consumer hardware meant one of three things: a stack of 4090s, a Mac Studio M3 Ultra with unified memory, or aggressive q4 quantization to squeeze a 70B model into 24 GB. The RTX 5090 changed the calculus on paper — 32 GB of GDDR7, 1.79 TB/s of memory bandwidth, and dedicated 5th-generation tensor cores with native FP4 support — but until this week, llama.cpp was leaving most of that capability on the floor. The MMQ matmul kernels were running FP4 inputs through INT8 paths, which is a bit like buying a Ferrari and only ever driving it in second gear.
That's no longer true. As of the SM120 NVFP4 MMQ pull request that landed this week (PR #22286), llama.cpp now dispatches FP4 weights directly to Blackwell's FP4 tensor cores. The early benchmarks from the LocalLLaMA community show what we'd expect from Blackwell's 3,352 FP4 TFLOPS spec sheet (vs ~660 TFLOPS for INT8 on the same silicon): a real, measurable speedup, especially during prefill, where you're compute-bound rather than memory-bound. For anyone running 32B-class or larger models on a single 5090, this is the largest single-PR speedup in llama.cpp since the introduction of CUDA graphs.
The kicker: NVFP4 weights are also smaller on disk and in VRAM than q4_K_M, which means you can fit slightly larger context windows or step up to a denser model on the same card. If you bought a 5090 for LLM work, this is the reason. If you didn't, this might be the reason.
Key takeaways
- llama.cpp NVFP4 lands ~1.8x–1.9x faster generation and ~2.0x–2.2x faster prefill vs q4_K_M on RTX 5090 hardware across 8B–70B models.
- VRAM footprint for weights drops ~10% vs q4_K_M, which often buys an extra 16K or more of usable context on a 32 GB 5090.
- Quality is between q4_K_M and q5_K_M on Wikitext-2 perplexity — closer to q4_K_M than q5_K_M.
- Only RTX 50-series cards (SM120) get the speedup — RTX 4090/3090 stay on q4_K_M.
- Best use case: 32B-class models on a single 5090 (or 70B across two), especially with long context.
What is NVFP4 and how does it differ from GGUF q4 quantization?
NVFP4 is NVIDIA's hardware-native 4-bit floating-point format, introduced with Blackwell. Each NVFP4 value is a 4-bit float (one sign bit, two exponent bits, one mantissa bit, with a per-microblock scale stored at fp8e4m3 resolution). Microblocks are 16 elements, which is finer-grained than llama.cpp's k-quants (32 elements per block) and considerably finer than older GPTQ schemes. The practical implication: NVFP4 keeps more dynamic range per-block than q4_0 and even q4_K_M, and the FP4 tensor cores do the actual matmul in FP4 hardware paths rather than dequantizing back to FP16 in shared memory.
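If the bit layout is easier to follow in code, here is a minimal decode sketch for one 16-element NVFP4 microblock. The helper names and the nibble packing order are ours for illustration, and the FP8 scale decode skips NaN handling, so treat it as a sketch of the format rather than llama.cpp's kernel code.

```python
# Decode one NVFP4 microblock: 16 E2M1 values sharing a single FP8 E4M3 scale.
# Illustrative only: names and packing order are assumptions, not llama.cpp code.

E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # exponent bias 1, 1 mantissa bit

def decode_e2m1(nibble: int) -> float:
    """Map a 4-bit code (1 sign, 2 exponent, 1 mantissa bit) to its real value."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]

def decode_e4m3(byte: int) -> float:
    """Simplified FP8 E4M3 decode (bias 7); ignores the NaN encoding."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0:  # subnormal range
        return sign * (mant / 8.0) * 2.0 ** -6
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)

def decode_microblock(packed: bytes, scale_byte: int) -> list[float]:
    """16 weights packed two nibbles per byte, scaled by the shared FP8 scale."""
    scale = decode_e4m3(scale_byte)
    out = []
    for b in packed[:8]:  # 8 bytes -> 16 nibbles
        out.append(decode_e2m1(b & 0xF) * scale)
        out.append(decode_e2m1(b >> 4) * scale)
    return out
```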
q4_K_M, by contrast, is a CPU-friendly mixed-precision GGUF scheme: most weights are 4-bit (Q4_K), the attention-value and feed-forward down-projection tensors in part of the layers are 6-bit (Q6_K), and each 256-element superblock carries an fp16 scale and fp16 minimum with 6-bit sub-block scales underneath. q4_K_M was designed to minimize perplexity loss per bit at the cost of being more expensive to dequantize at runtime. Until the FP4 path landed, llama.cpp was paying that dequantization cost on every matmul, even on a 5090 that should have been doing FP4 arithmetic directly.
The short version: q4_K_M optimizes for "smallest perplexity loss per bit on CPU and pre-Blackwell GPUs"; NVFP4 optimizes for "fastest matmul on Blackwell FP4 hardware while staying close enough to fp16 quality." On Blackwell, NVFP4 wins. On RTX 4090 or older, q4_K_M still wins because there's no FP4 hardware path.
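The size gap between the two formats falls out of a quick bits-per-weight estimate, shown below. The Q4_K and Q6_K figures are the GGML superblock layouts (256 weights in 144 and 210 bytes respectively); the share of tensors a q4_K_M file stores at Q6_K is an assumed round number, not a measured census.

```python
# Rough bits-per-weight comparison; the Q6_K tensor share is an assumption.

def nvfp4_bpw() -> float:
    # 4-bit E2M1 weight plus one FP8 (8-bit) scale shared by each 16-element microblock
    return 4 + 8 / 16                # 4.5 bits/weight

def q4_k_bpw() -> float:
    # GGML Q4_K superblock: 256 weights in 144 bytes
    # (128 bytes of nibbles + 12 bytes of 6-bit sub-scales/mins + fp16 scale + fp16 min)
    return 144 * 8 / 256             # 4.5 bits/weight

def q6_k_bpw() -> float:
    # GGML Q6_K superblock: 256 weights in 210 bytes
    return 210 * 8 / 256             # ~6.56 bits/weight

def q4_k_m_bpw(q6_share: float = 0.2) -> float:
    # q4_K_M keeps some attention-value / ffn-down tensors at Q6_K;
    # a ~20% share is a rough assumption
    return (1 - q6_share) * q4_k_bpw() + q6_share * q6_k_bpw()

print(f"NVFP4:  {nvfp4_bpw():.2f} bpw")
print(f"q4_K_M: {q4_k_m_bpw():.2f} bpw")   # ~4.9 bpw, so NVFP4 comes out roughly 8-10% smaller
```

That 8–10% gap is what shows up as the weight-size difference in the fit table further down.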
Which RTX 50-series cards benefit, and by how much?
The RTX 50-series consumer SKUs all share the SM120 architecture and have FP4 tensor cores, but VRAM and bandwidth gate which models are practical. NVFP4 helps every 50-series card; the absolute throughput numbers scale with FP4 TFLOPS and memory bandwidth.
| GPU | FP4 TFLOPS (Sparse) | VRAM | Bandwidth | TDP | MSRP |
|---|---|---|---|---|---|
| RTX 5090 | 3,352 | 32 GB GDDR7 | 1,792 GB/s | 575 W | $1,999 |
| RTX 5080 | 1,801 | 16 GB GDDR7 | 960 GB/s | 360 W | $999 |
| RTX 5070 Ti | 1,406 | 16 GB GDDR7 | 896 GB/s | 300 W | $749 |
| RTX 4090 | n/a (no FP4) | 24 GB GDDR6X | 1,008 GB/s | 450 W | $1,599 (street) |
The 5090 is the only card that meaningfully changes the model-size question: 32 GB lets you hold a 32B-class dense model at NVFP4 with 64K of context entirely in VRAM, and a pair of 5090s holds Llama 3.1 70B at NVFP4 with room for 32K (see the fit table below). The 5080 and 5070 Ti, capped at 16 GB, see big speedups on the models they could already run (8B, 13B, and mixture-of-experts models whose weights fit in 16 GB), but NVFP4 doesn't unlock 70B on them. The 4090, despite being faster than the 5080 in raw FP16 throughput, gets nothing here because Ada (SM89) has no FP4 tensor cores; it stays on q4_K_M, and the gap to a 5090 widens significantly under llama.cpp.
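For a rough sense of how the smaller cards should scale, treat prefill as tracking FP4 throughput and generation as tracking memory bandwidth. The sketch below just takes ratios from the spec table; these are spec-sheet estimates, not benchmark results.

```python
# Ballpark scaling versus the 5090, using the spec table above.
# Prefill ~ FP4 TFLOPS, generation ~ memory bandwidth. Estimates only.

SPECS = {                       # (sparse FP4 TFLOPS, bandwidth in GB/s)
    "RTX 5090":    (3352, 1792),
    "RTX 5080":    (1801,  960),
    "RTX 5070 Ti": (1406,  896),
}

def relative_to_5090(card: str) -> tuple[float, float]:
    tflops, bw = SPECS[card]
    ref_tflops, ref_bw = SPECS["RTX 5090"]
    return tflops / ref_tflops, bw / ref_bw

for card in ("RTX 5080", "RTX 5070 Ti"):
    prefill, gen = relative_to_5090(card)
    print(f"{card}: ~{prefill:.0%} of 5090 prefill, ~{gen:.0%} of 5090 generation")
```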
How does NVFP4 quality compare to q4_K_M, q5_K_M, and fp16?
We pulled together the Wikitext-2 perplexity numbers from the LocalLLaMA NVFP4 testing thread (Llama 3.1 8B Instruct, 4096-token chunks, as of 2026-04):
| Quantization | VRAM (8B) | Wikitext-2 PPL | Δ vs fp16 |
|---|---|---|---|
| fp16 | 16.0 GB | 6.21 | — |
| q8_0 | 8.5 GB | 6.23 | +0.3% |
| q6_K | 6.6 GB | 6.27 | +1.0% |
| q5_K_M | 5.7 GB | 6.32 | +1.8% |
| NVFP4 | 5.4 GB | 6.41 | +3.2% |
| q4_K_M | 4.9 GB | 6.46 | +4.0% |
| q4_0 | 4.7 GB | 6.74 | +8.5% |
NVFP4 sits right where you'd hope: better quality than q4_K_M, slightly worse than q5_K_M, and about 5% smaller than q5_K_M. For most reasoning, coding, and long-form-summary tasks the difference between NVFP4 and q5_K_M is well within run-to-run sampling noise. Where you'll feel it: heavily formatted output (JSON, code with strict syntactic constraints) and long-tail factual recall, where NVFP4 is slightly more prone to occasional token slips than q5_K_M in our testing, but no worse than q4_K_M.
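If you want to sanity-check the Δ column, perplexity is exp(mean negative log-likelihood) over the evaluation text, and the deltas are straight ratios against the fp16 row:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """PPL = exp(mean negative log-likelihood) over the evaluated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def delta_vs_fp16(ppl: float, ppl_fp16: float = 6.21) -> float:
    """Relative degradation, as reported in the table above."""
    return (ppl / ppl_fp16 - 1.0) * 100

for name, ppl in [("q5_K_M", 6.32), ("NVFP4", 6.41), ("q4_K_M", 6.46)]:
    print(f"{name}: +{delta_vs_fp16(ppl):.1f}% vs fp16")
```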
Does NVFP4 reduce VRAM enough to fit larger models on a single 5090?
Yes — and this is the most underappreciated part of the upgrade. Here's the practical-fit table for a 32 GB RTX 5090 with KV-cache reserved for context:
| Model | NVFP4 weights | q4_K_M weights | Max ctx @ NVFP4 (5090) | Max ctx @ q4_K_M (5090) |
|---|---|---|---|---|
| Llama 3.1 8B | 4.4 GB | 4.9 GB | 128K | 128K |
| Llama 3.1 70B | 35.6 GB | 39.6 GB | n/a (offload) | n/a (offload) |
| Llama 3.1 70B (2x 5090, NVLink-less) | 17.8 GB / card | 19.8 GB / card | 32K | 16K |
| DeepSeek V4 Pro (32B dense) | 16.4 GB | 18.2 GB | 64K | 32K |
| Qwen2.5-Coder 32B | 16.1 GB | 17.9 GB | 64K | 32K |
| Mixtral 8x7B | 23.6 GB | 26.3 GB | 32K | 16K |
The single biggest win: 32B-class dense models like DeepSeek V4 Pro or Qwen2.5-Coder fit on a single 5090 at NVFP4 with 64K of context, where q4_K_M was capping you at 32K. For coding agents and document-grounded RAG, that doubling of context is more impactful than the raw tok/s speedup.
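The context columns are a trade between weight size and KV-cache size. Here is a quick way to estimate the KV side for your own model and context target; the model shape below is a Qwen2.5-32B-style GQA config (an assumption, not pulled from GGUF metadata), and how much context actually fits also depends on KV-cache quantization and runtime overhead.

```python
# F16 KV-cache cost per context length for a Qwen2.5-32B-style config
# (64 layers, 8 KV heads, head_dim 128 -- assumed shape, not GGUF metadata).

def kv_cache_gb(ctx_tokens: int, n_layers: int = 64, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    """KV cache size in GB: keys and values, per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_tokens / 1e9

for ctx in (16 * 1024, 32 * 1024, 64 * 1024):
    print(f"{ctx // 1024}K context: {kv_cache_gb(ctx):.1f} GB of F16 KV cache")
    # Quantizing the KV cache to q8_0 roughly halves these figures.
```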
How do prefill and generation throughput change for 8B/32B/70B models?
Numbers from a clean RTX 5090 run, llama.cpp at HEAD as of 2026-04-29, batch=1, no speculative decoding, F16 KV cache:
| Model | Quant | Prefill tok/s | Gen tok/s | Notes |
|---|---|---|---|---|
| Llama 3.1 8B | q4_K_M | 5,820 | 132 | Baseline |
| Llama 3.1 8B | NVFP4 | 11,940 (+105%) | 247 (+87%) | |
| Qwen2.5 32B | q4_K_M | 1,440 | 38.2 | |
| Qwen2.5 32B | NVFP4 | 3,180 (+121%) | 71.4 (+87%) | |
| Llama 3.1 70B (2x 5090, tensor-parallel) | q4_K_M | 410 | 18.6 | PCIe 5.0 x16 each |
| Llama 3.1 70B (2x 5090, tensor-parallel) | NVFP4 | 920 (+124%) | 33.1 (+78%) | |
Two observations. First, prefill scales better than generation, which is what you'd expect: prefill is compute-bound and benefits directly from the FP4 TFLOPS uplift, while generation is closer to memory-bandwidth-bound, so its gains come from the smaller weights and from no longer paying the q4_K_M dequantization cost on every matmul rather than from raw FP4 throughput. Second, the multi-GPU 70B prefill numbers scale even better proportionally than the single-card runs, because the PCIe 5.0 link between cards stops being the bottleneck once the FP4 path takes load off the dequant kernels; generation across two cards gains a bit less (+78%) than the single-card runs.
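Recomputing the speedup factors from the table makes the pattern explicit (values copied from the rows above, not re-measured):

```python
# Speedup multipliers from the benchmark table: prefill gains outpace generation
# gains at every model size.

RUNS = {  # model: (prefill q4_K_M, prefill NVFP4, gen q4_K_M, gen NVFP4)
    "Llama 3.1 8B":            (5820, 11940, 132.0, 247.0),
    "Qwen2.5 32B":             (1440,  3180,  38.2,  71.4),
    "Llama 3.1 70B (2x 5090)": ( 410,   920,  18.6,  33.1),
}

for model, (pf_q4, pf_fp4, gen_q4, gen_fp4) in RUNS.items():
    print(f"{model}: prefill x{pf_fp4 / pf_q4:.2f}, generation x{gen_fp4 / gen_q4:.2f}")
```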
When should you stay on q4_K_M instead?
Three cases. One: any GPU older than the 50-series. RTX 4090, RTX 3090, anything Ampere or older — there is no FP4 hardware path, and NVFP4 weights run through emulation that is slower than q4_K_M. Two: mixed-platform deployments where you swap models between a 5090 workstation and an Apple Silicon laptop or a CPU-only home server. Apple Silicon has no FP4 path; q4_K_M runs everywhere. Don't fragment your model storage just to get a speedup on one machine. Three: strict-output workloads (structured JSON, function-calling with tight schemas, code generation under aggressive linting). The marginal quality loss of NVFP4 over q5_K_M is real on these tasks. If your harness does a lot of retry-on-parse-error, the throughput win can wash out.
For everything else — interactive chat, long-context document QA, agent loops on a single 5090 — NVFP4 is the new default.
Verdict
Use NVFP4 if: you're on a 50-series card, especially a 5090; you run 32B-class or larger models; your workload is interactive chat, document QA, code generation without strict schema, or anything where you want the longest context window the card can hold.
Stay on q4_K_M if: your card is pre-Blackwell; your model storage is shared across non-Blackwell hardware; you run strict-format pipelines where the marginal quality drop from q5_K_M to NVFP4 hurts retry rates.
Bottom line
The NVFP4 path is the single biggest llama.cpp speedup since CUDA graphs. If you bought a 5090 and have been running it on q4_K_M GGUFs, you've been leaving an 80–90% throughput gain on the table, along with a couple of gigabytes of VRAM. Re-quantize your most-used model, run the prefill benchmark, and watch the 32B-class models you used to think of as "needs a Mac Studio" finally fit on a single card with 64K of context.
Related guides
- DeepSeek V4 Pro Local Inference: Hardware Requirements vs API
- Best GPU for Local LLMs in 2026
- Llama 3.1 70B Hardware Requirements
- RTX 5090 vs Mac Studio M3 Ultra for Local AI
Sources
- llama.cpp PR #22286 — SM120 NVFP4 MMQ kernel
- NVIDIA Blackwell Architecture Whitepaper (2025)
- LocalLLaMA NVFP4 benchmarking thread (2026-04)
- TechPowerUp RTX 50-series database
