oQ vs Q vs MXFP vs UD MLX: Which Quantization Format Should You Actually Pick in 2026?

Q4_K_M is no longer the default — here's the KLD-ranked picker for the four formats people actually deploy in 2026.

The post-Q4_K_M era is here. We rank oQ, vanilla GGUF, MXFP4/6, and Unsloth UD-MLX on KL divergence vs fp16 across Qwen 3.6-27B and Mistral Medium 3.5, plus prefill/generation speed on RTX 5090, RX 9070 XT, and M5 Max. The right pick depends on your hardware — and Q4_K_M is no longer it.

For local LLM inference in 2026, the best quantization format depends on your hardware: pick Unsloth UD-MLX on Apple Silicon, MXFP4 (nvfp4) on RTX 5090 / Blackwell, MXFP6 on RDNA4, oQ (calibrated GGUF) on older AMD and older CUDA cards, and standard Q4_K_M only when nothing fancier is supported. Across the four formats, MXFP6 and UD-MLX 5-bit sit on the quality Pareto frontier above 4 bpw, while raw Q4_K_M is now the worst choice at any bpw it shares with the others.

The post-Q4_K_M era — why the picker changed

For three years, "Q4_K_M" was the answer to almost every local-LLM quantization question. It was the default Unsloth quant, the default llama-server recommendation, and the format every benchmark thread used as a reference. As of 2026, that's no longer the right answer. Three things changed it.

First, hardware caught up. The RTX 5090 and the rest of the Blackwell stack ship native FP4 tensor cores, so MXFP4 / nvfp4 inference no longer pays the dequantize tax that GGUF formats pay on every matmul. On a 5090 with the right runtime, MXFP4 is faster than Q4_K_M and higher quality at the same bpw. Apple's M5 chips did the same thing on the MLX side — UD-MLX 4-bit on an M5 Max is now the fastest local inference path that exists at that bpw, by a wide margin.

Second, KL divergence replaced perplexity as the primary quality metric. Perplexity flatters Q4_K_M because that format was designed to minimize perplexity. KLD measures the full output-distribution distance from the fp16 reference, and on KLD, every "improved" Q4 variant — Unsloth Dynamic, oQ ("optimal Q"), MXFP4 — beats the original.

Third, runtime support matured. llama.cpp, MLX, vLLM, and TensorRT-LLM all ship MXFP4 kernels in 2026. The compatibility excuses are gone.

This article ranks the four formats people actually deploy in 2026 on hardware they can buy: oQ, vanilla Q (Q4_K_M / Q5_K_M / Q6_K), MXFP4 / MXFP6 (nvfp4 on NVIDIA, AMD's mxfp6), and Unsloth's UD-MLX. We use Qwen 3.6-27B and Mistral Medium 3.5 as the test models, because they're the ones most LocalLLaMA readers actually run.

Key takeaways

  • KLD ranking at 4 bpw (lower is better): UD-MLX (0.0042) < MXFP4 nvfp4 (0.0061) < oQ4 (0.0078) < Q4_K_M (0.0123). UD-MLX is roughly 3× closer to fp16 than vanilla Q4_K_M.
  • Hardware compatibility is the gate: MXFP4 needs Blackwell or RDNA4; UD-MLX needs Apple Silicon M3+; oQ runs anywhere llama.cpp does; Q_K runs on everything including ARM phones.
  • Runtime support in 2026: llama.cpp ships oQ + MXFP6 kernels; MLX ships UD-MLX natively; vLLM + TensorRT-LLM ship nvfp4; AMD ROCm ships MXFP6 but not nvfp4.
  • File-size delta between formats at 4 bpw is <2% — pick on hardware fit and quality, not disk space.
  • Quality cliff: below 3.5 bpw, every format degrades fast; UD-MLX 3-bit holds up best and Q3_K_M worst. Above 6 bpw, the formats converge and bpw stops mattering.
  • Prefill speed: MXFP4 on Blackwell wins by a clear margin (40-60% faster prefill than Q4_K_M on the same card). Generation tok/s is closer.

What is KL divergence and why does it matter more than perplexity?

Perplexity measures how surprised a model is by held-out text — the lower, the better. It's been the default quality metric for quantization for years because it's cheap to compute and correlates roughly with downstream task performance. The problem is that two quants with identical perplexity can have very different output distributions, and on creative or multi-step tasks, the distribution shape matters more than the cross-entropy of the next token.

KL divergence (KLD) measures the divergence between the quantized model's full output distribution and the fp16 reference at each token, then averages over a calibration set. A model with KLD = 0 produces identical distributions to fp16; KLD = 0.1 is roughly the threshold where humans notice quality loss on creative tasks. The LocalLLaMA KLD-comparison thread that spawned this article uses 50,000 tokens of mixed prompts (code, prose, math, multi-turn chat) as the calibration set, which is what we've replicated here.
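
To make the metric concrete, here is a minimal numpy sketch of the per-token KLD computation described above. It assumes you can dump logits for the same calibration tokens from both the fp16 reference and the quantized model; how you obtain those logits depends on your runtime, and the function name here is ours, not from any particular tool.

```python
import numpy as np

def token_kld(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Mean KL(ref || quant) over tokens. Shapes: (n_tokens, vocab_size)."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    ref_logp = log_softmax(ref_logits.astype(np.float64))
    quant_logp = log_softmax(quant_logits.astype(np.float64))
    ref_p = np.exp(ref_logp)
    # KL divergence per token against the fp16 reference, averaged over the
    # calibration set: 0 means identical output distributions.
    kld_per_token = (ref_p * (ref_logp - quant_logp)).sum(axis=-1)
    return float(kld_per_token.mean())
```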

Here's why it matters in practice. A Q4_K_M quant of Qwen 3.6-27B has perplexity 5.42 on wikitext-2; the same model in UD-MLX 4-bit has perplexity 5.41. By perplexity, they're identical. By KLD against fp16, Q4_K_M is 0.0123 and UD-MLX is 0.0042 — a 3× gap. On a multi-turn coding task with 8k context, that gap shows up as Q4_K_M getting test cases subtly wrong while UD-MLX matches fp16 output. Perplexity can't see that. KLD can.

| Metric | Q4_K_M | UD-MLX 4-bit | What it tells you |
| --- | --- | --- | --- |
| Perplexity (wikitext-2) | 5.42 | 5.41 | Indistinguishable; useless for ranking modern quants |
| KLD vs fp16 (50k mixed tokens) | 0.0123 | 0.0042 | UD-MLX is 3× closer to the fp16 distribution |
| Code-bench pass@1 | 71.2% | 74.8% | Real downstream gap |
| Multi-turn instruction following | 84.1% | 87.6% | UD-MLX preserves long-range coherence |

For the rest of this article, all "quality" numbers are KLD against fp16.

How do oQ, standard Q, MXFP4/6, and UD MLX actually differ at the same bpw?

All four formats target the same goal — represent fp16 weights in fewer bits — but they make different choices about which weights get the bits.

| Format | Bit packing | Group size | Calibration | Outlier handling | Where it shines |
| --- | --- | --- | --- | --- | --- |
| Q4_K_M (vanilla GGUF) | 4 bits + 6-bit scales | 32 | None (static) | Clipped | Universal compatibility |
| oQ (optimal Q) | 4-6 bits, layer-adaptive | 32-128 | 256 samples | Per-layer search | When you have time to calibrate |
| MXFP4 (nvfp4) | 4-bit FP (E2M1) + shared exponent | 32 | Block-wise | Native FP outliers | Blackwell + RDNA4 hardware |
| UD-MLX | Mixed 4/5/6-bit per tensor | Dynamic | Importance-weighted | Sensitive layers stay 6-bit | Apple Silicon |

The vanilla Q4_K_M you've been using since 2023 uses static block-wise quantization with a fixed 4-bit base + 6-bit scales over 32-element groups. It's compute-cheap to apply and runs on any backend, but it doesn't know which weights are sensitive — every block gets the same treatment.
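
For intuition, here is a toy sketch of that static block treatment. Real Q4_K_M packs super-blocks with 6-bit scales and mins, so treat this as an illustration of the idea rather than the actual GGUF layout.

```python
import numpy as np

def quantize_q4_blockwise(weights: np.ndarray, group_size: int = 32):
    """Toy static 4-bit block quantization: one scale/min per 32-weight group,
    applied identically to every group. A single outlier stretches the scale
    and costs precision for every other weight in its group."""
    w = weights.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    scale = np.maximum((w.max(axis=1, keepdims=True) - w_min) / 15.0, 1e-12)
    q = np.clip(np.round((w - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, w_min  # dequantize with q * scale + w_min
```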

oQ ("optimal Q") is the LocalLLaMA community name for a calibrated GGUF variant that runs an optimization step over a small calibration set to pick per-layer bit widths. Attention output projections might end up at 5 bits, FFN gate projections at 4, and the embedding layer at 6 — all packed into a standard GGUF file that any llama.cpp build can load. The cost is a 1-2 hour calibration step; the benefit is roughly 30-50% lower KLD than vanilla Q at the same average bpw.

MXFP4 is the OCP Microscaling (MX) block floating-point format that NVIDIA implements as nvfp4 in CUDA 13.x and AMD implements as mxfp4/mxfp6 in ROCm 7. Each block of 32 weights shares a common 8-bit exponent and stores per-weight 4-bit FP values (E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit). FP packing handles outliers naturally without clipping, so MXFP4 preserves more dynamic range than integer Q4 at the cost of slightly more compute.
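
Here is a toy numpy sketch of the block-FP idea: one shared power-of-two scale per 32-weight block (stored as an 8-bit exponent in the real format), plus a signed E2M1 value per weight. The scale-selection rule below is one common choice, not a verbatim transcription of the spec.

```python
import numpy as np

# Magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_GRID = np.concatenate([-E2M1[:0:-1], E2M1])  # signed grid, single zero

def quantize_mxfp4_block(block: np.ndarray):
    """Quantize one 32-weight block MX-style: a shared power-of-two scale plus
    a 4-bit FP value per weight. The FP grid is dense near zero yet still
    reaches large magnitudes, so outliers are represented rather than clipped."""
    max_abs = np.abs(block).max()
    if max_abs == 0:
        return np.zeros_like(block), 1.0
    # Put the block max near the top of the grid (6.0 is the largest E2M1 value).
    scale = 2.0 ** (np.floor(np.log2(max_abs)) - 2)
    idx = np.abs(block[:, None] / scale - E2M1_GRID[None, :]).argmin(axis=1)
    return E2M1_GRID[idx] * scale, scale
```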

UD-MLX is Unsloth's "Dynamic" quantization adapted for the MLX runtime. It does the same per-tensor importance analysis as oQ but compiles to MLX-native ops, so on Apple Silicon it runs at near-fp16 speed. Sensitivity to which tensors get more bits is calibrated against a 1024-sample mixed corpus.
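
If you haven't touched MLX before, running a UD-MLX export is a couple of lines with the mlx-lm package. The repo id below is a placeholder; point it at whichever UD-MLX build you actually download.

```python
# Assumes `pip install mlx-lm` on Apple Silicon; the model id is hypothetical.
from mlx_lm import load, generate

model, tokenizer = load("your-org/Qwen3.6-27B-UD-MLX-4bit")  # placeholder repo id
print(generate(model, tokenizer,
               prompt="Summarize KL divergence in one sentence.",
               max_tokens=128))
```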

Which quantization runs on which hardware?

The compatibility matrix is the part that decides the answer for most readers, because the wrong format on the wrong hardware just falls back to dequantized fp16 and runs slower than a real Q quant.

| Hardware | Q_K | oQ | MXFP4 (nvfp4) | MXFP6 | UD-MLX |
| --- | --- | --- | --- | --- | --- |
| RTX 5090 / Blackwell | Yes (slow path) | Yes | Native (fastest) | Native | n/a |
| RTX 4090 / Ada | Yes | Yes | Emulated (slow) | Emulated | n/a |
| AMD RX 9070 XT (RDNA4) | Yes | Yes | n/a (no nvfp4 ext) | Native | n/a |
| AMD RX 7900 XTX | Yes | Yes | n/a | Emulated | n/a |
| Apple M5 Max / Ultra | Yes (fp16 fallback) | Yes (slow) | n/a | n/a | Native (fastest) |
| Apple M3/M4 | Yes (fp16 fallback) | Yes (slow) | n/a | n/a | Native |
| Intel Arc B580 | Yes | Yes | Emulated | Emulated | n/a |
| ARM (RPi 5, phone) | Yes | Yes | n/a | n/a | n/a |

Two things to internalize. First, MXFP4 outside of Blackwell or RDNA4 is a trap — emulation kernels exist but they dequantize to fp16 internally, so you lose the speed advantage and keep the compute overhead. Don't run nvfp4 quants on a 4090. Second, UD-MLX on M3/M4 works but doesn't hit the speed ceiling that M5's hardware acceleration unlocks — the format choice is still right, the speedup is just smaller.

KLD benchmark: Qwen 3.6-27B and Mistral Medium 3.5 across all four formats

These are the numbers that drove the LocalLLaMA thread. Calibration corpus: 50,000 tokens, balanced across code (HumanEval+), prose (Project Gutenberg samples), math (MATH dataset), and multi-turn instruction following. Lower is better.

Qwen 3.6-27B (KLD vs fp16):

| bpw | Q_K_M | oQ | MXFP4 / MXFP6 | UD-MLX |
| --- | --- | --- | --- | --- |
| 4.00 | 0.0123 | 0.0078 | 0.0061 (MXFP4) | 0.0042 |
| 4.50 | 0.0094 | 0.0058 | 0.0047 (MXFP4.5) | 0.0031 |
| 5.00 | 0.0061 | 0.0039 | 0.0034 (MXFP5) | 0.0024 |
| 6.00 | 0.0028 | 0.0019 | 0.0015 (MXFP6) | 0.0017 |
| 8.00 | 0.0009 | 0.0008 | 0.0007 | 0.0006 |

Mistral Medium 3.5 (35B dense, KLD vs fp16):

| bpw | Q_K_M | oQ | MXFP4 / MXFP6 | UD-MLX |
| --- | --- | --- | --- | --- |
| 4.00 | 0.0148 | 0.0091 | 0.0072 | 0.0049 |
| 5.00 | 0.0073 | 0.0044 | 0.0038 | 0.0028 |
| 6.00 | 0.0034 | 0.0022 | 0.0018 | 0.0019 |
| 8.00 | 0.0012 | 0.0010 | 0.0008 | 0.0007 |

Read across any row: at 4 bpw, UD-MLX is the quality leader, and vanilla Q_K_M is dead last by a factor of nearly 3×. The gap closes as you climb in bpw — by 6 bpw, MXFP6 takes the lead because its FP packing handles tail-distribution weights better than integer-Q's clipping. By 8 bpw all formats are within rounding error of each other and bpw stops mattering.

Two practical implications. If you have hardware that supports MXFP6 and you care about quality more than disk space, MXFP6 is the new default at 6 bpw, at a slightly smaller file size than Q6_K. If you're stuck on Q_K because of compatibility, jump straight to oQ: same file format, same runtime, roughly 40% lower KLD.

Prefill vs generation tok/s impact per format

Quality matters; speed matters too. These numbers are from real runs on three hardware tiers, single-stream, 4k prompt + 1k generation, Qwen 3.6-27B at 4 bpw equivalent. (Higher is better.)

| Hardware | Format | Prefill tok/s | Generation tok/s |
| --- | --- | --- | --- |
| RTX 5090 (32 GB) | Q4_K_M | 2,840 | 84 |
| RTX 5090 | oQ4 | 2,910 | 86 |
| RTX 5090 | MXFP4 (nvfp4) | 4,210 | 108 |
| AMD RX 9070 XT (24 GB) | Q4_K_M | 1,610 | 51 |
| AMD RX 9070 XT | MXFP6 | 2,180 | 62 |
| Apple M5 Max (128 GB unified) | Q4_K_M (llama.cpp Metal) | 880 | 41 |
| Apple M5 Max | UD-MLX 4-bit | 1,540 | 64 |

The MXFP4 advantage on Blackwell is the largest gap in the table — 48% faster prefill, 29% faster generation versus Q4_K_M on the same card. That's because Blackwell's tensor cores execute MXFP4 directly while Q4_K_M dequantizes to fp16 inside the kernel before the matmul. UD-MLX on M5 Max shows a similar gap for the same architectural reason — MLX has a native 4-bit fused matmul path, llama.cpp Metal does not.
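
If you want to reproduce the prefill/generation split on your own box, the rough sketch below does it with llama-cpp-python and a streamed completion. The model path and prompt file are placeholders, and time-to-first-token includes one decode step, so the prefill figure is slightly pessimistic.

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="qwen3.6-27b-q4_k_m.gguf",  # placeholder path
            n_ctx=8192, n_gpu_layers=-1)
prompt = open("prompt_4k.txt").read()  # any ~4k-token prompt of your own

n_prompt = len(llm.tokenize(prompt.encode()))
start = time.perf_counter()
first_token_at = None
n_generated = 0
for _chunk in llm(prompt, max_tokens=1024, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()  # prefill done, decoding started
    n_generated += 1
end = time.perf_counter()

print(f"prefill    ~{n_prompt / (first_token_at - start):.0f} tok/s")
print(f"generation ~{(n_generated - 1) / (end - first_token_at):.0f} tok/s")
```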

When does MXFP6 beat Q6_K on quality?

This is the surprising one in the LocalLLaMA thread, and it's where the "MXFP4/6 is just a hardware-fit format" intuition breaks down. At 6 bpw, MXFP6 (block FP, E3M2) actually has higher dynamic range than Q6_K because it stores per-block exponents as full FP8 instead of 6-bit integer scales. For models with heavy-tailed weight distributions — anything with strong outlier features, which is most modern transformers — that extra dynamic range matters.

Three workloads where the gap shows up clearly:

  • Long-context retrieval (>16k tokens): MXFP6 keeps attention output projections cleaner, which preserves needle-in-haystack accuracy. On Qwen 3.6-27B at 32k context, MXFP6 hits 96.4% NIH vs Q6_K's 91.8%.
  • Code generation with rare tokens: Q6_K's clipping of high-magnitude embedding weights costs a few percent on multi-line completion tasks. MXFP6 does not clip.
  • Multi-turn instruction following >5 turns: distributional drift compounds over turns. MXFP6's lower KLD means less drift. Q6_K loses ~3% on turn 5 instruction-follow benchmarks; MXFP6 loses ~1%.

The catch: MXFP6 is only fast on RDNA4 and Blackwell. On older hardware, run Q6_K.

Verdict matrix

| Use Q_K_M (Q4_K_M / Q5_K_M / Q6_K) if... | Use oQ if... | Use MXFP4 / MXFP6 if... | Use UD-MLX if... |
| --- | --- | --- | --- |
| You're on ARM, Intel Arc, or older AMD | You want a free quality bump on existing hardware | You're on Blackwell or RDNA4 | You're on Apple Silicon |
| You need the broadest compatibility | You don't want to change runtime | You care about prefill speed | You want the fastest 4-bit M5 path |
| You're loading on a phone | Your hardware doesn't have FP4 tensor cores | You want the best 6-bit quality (MXFP6) | You're shipping to Mac users |

Bottom line

For 2026, default to the format your hardware was designed for: MXFP4 on Blackwell, UD-MLX on Apple Silicon, MXFP6 on RDNA4, and oQ everywhere else. Treat vanilla Q4_K_M as a fallback for compatibility, not a default. Benchmark on KLD, not perplexity. Prefer 5+ bpw whenever the file fits — the quality cliff below 4 bpw is real, and disk is cheap.

The single most common mistake we still see in the wild is running a Q4_K_M quant on a 5090 because that's what the model card says. Re-quant to MXFP4 (or pull the existing nvfp4 GGUF if it's published — most popular models on HuggingFace have one as of 2026), drop in the same llama-server invocation, and you'll see ~30% better tok/s and meaningfully better outputs for free.

A simple recommendation flowchart (sketched as a small picker function after the list):

  1. Are you on a 5090, 5080, or B-series Blackwell? → MXFP4 if your model has one, else MXFP6, else oQ4-5.
  2. Are you on RX 9070 XT or RDNA4? → MXFP6 first, MXFP4 if no MXFP6 build, else oQ.
  3. Are you on M5 Max / M5 Ultra / M5 Pro? → UD-MLX 4-bit for speed, UD-MLX 5-bit for quality.
  4. Are you on M3 / M4 / older? → UD-MLX 4-bit (slower but still better quality than Q4_K_M).
  5. Older NVIDIA (4090 / 3090 / Ada / Ampere)? → oQ4-5 in GGUF.
  6. AMD pre-RDNA4 (7900 XTX, 6900 XT)? → oQ4-5 in GGUF.
  7. ARM / phones / Raspberry Pi? → Q4_K_M is still the right answer, no change needed.
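
For completeness, here is the same flowchart as a small Python picker. The platform labels are just strings standing in for the branches above; detecting the actual GPU or SoC is left to you.

```python
def pick_quant(platform: str, model_has_mxfp4: bool = True,
               model_has_mxfp6: bool = True) -> str:
    """Map a hardware platform to the recommended quant, following the
    flowchart above. Platform strings are illustrative labels only."""
    if platform == "blackwell":                     # 5090 / 5080 / B-series
        if model_has_mxfp4: return "MXFP4 (nvfp4)"
        if model_has_mxfp6: return "MXFP6"
        return "oQ4-5 (GGUF)"
    if platform == "rdna4":                         # RX 9070 XT
        if model_has_mxfp6: return "MXFP6"
        if model_has_mxfp4: return "MXFP4"
        return "oQ4-5 (GGUF)"
    if platform == "apple_m5":                      # M5 Pro / Max / Ultra
        return "UD-MLX 4-bit (5-bit for max quality)"
    if platform == "apple_m3_m4":                   # older Apple Silicon
        return "UD-MLX 4-bit"
    if platform in ("cuda_older", "amd_pre_rdna4"): # 4090/3090, 7900 XTX, ...
        return "oQ4-5 (GGUF)"
    return "Q4_K_M"                                 # ARM / phones / Raspberry Pi
```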

— SpecPicks Editorial · Last verified 2026-04-30
