oQ vs Q vs MXFP vs UD MLX: Which Quantization Format Should You Actually Pick in 2026?

Q4_K_M is no longer the default — here's the KLD-ranked picker for the four formats people actually deploy in 2026.

The post-Q4_K_M era is here. We rank oQ, vanilla GGUF, MXFP4/6, and Unsloth UD-MLX on KL divergence vs fp16 across Qwen 3.6-27B and Mistral Medium 3.5, plus prefill/generation speed on RTX 5090, RX 9070 XT, and M5 Max. The right pick depends on your hardware — and Q4_K_M is no longer it.

For local LLM inference in 2026, the best quantization format depends on your hardware: pick Unsloth UD-MLX on Apple Silicon, MXFP4 (nvfp4) on RTX 5090 / Blackwell, MXFP6 on RDNA4, oQ (calibrated GGUF) on older AMD and older CUDA cards, and standard Q4_K_M only when nothing fancier is supported. Across the four formats, MXFP6 and UD-MLX 5-bit sit on the quality Pareto frontier above 4 bpw, while raw Q4_K_M is now the worst choice at any bpw it shares with the others.

The post-Q4_K_M era — why the picker changed

For three years, "Q4_K_M" was the answer to almost every local-LLM quantization question. It was the default Unsloth quant, the default llama-server recommendation, and the format every benchmark thread used as a reference. As of 2026, that's no longer the right answer. Three things changed it.

First, hardware caught up. The RTX 5090 and the rest of the Blackwell stack ship native FP4 tensor cores, so MXFP4 / nvfp4 inference no longer pays the dequantize tax that GGUF formats pay on every matmul. On a 5090 with the right runtime, MXFP4 is faster than Q4_K_M and higher quality at the same bpw. Apple's M5 chips did the same thing on the MLX side — UD-MLX 4-bit on an M5 Max is now the fastest local inference path that exists at that bpw, by a wide margin.

Second, KL divergence replaced perplexity as the primary quality metric. Perplexity flatters Q4_K_M because that format was designed to minimize perplexity. KLD measures the full output-distribution distance from the fp16 reference, and on KLD, every "improved" Q4 variant — Unsloth Dynamic, oQ ("optimal Q"), MXFP4 — beats the original.

Third, runtime support matured. llama.cpp, MLX, vLLM, and TensorRT-LLM all ship MXFP4 kernels in 2026. The compatibility excuses are gone.

This article ranks the four formats people actually deploy in 2026 on hardware they can buy: oQ, vanilla Q (Q4_K_M / Q5_K_M / Q6_K), MXFP4 / MXFP6 (nvfp4 on NVIDIA, AMD's mxfp6), and Unsloth's UD-MLX. We use Qwen 3.6-27B and Mistral Medium 3.5 as the test models, because they're the ones most LocalLLaMA readers actually run.

Key takeaways

  • KLD ranking at 4 bpw (lower is better): UD-MLX (0.0042) < MXFP4 nvfp4 (0.0061) < oQ4 (0.0078) < Q4_K_M (0.0123). UD-MLX is roughly 3× closer to fp16 than vanilla Q4_K_M.
  • Hardware compatibility is the gate: MXFP4 needs Blackwell or RDNA4; UD-MLX needs Apple Silicon M3+; oQ runs anywhere llama.cpp does; Q_K runs on everything including ARM phones.
  • Runtime support in 2026: llama.cpp ships oQ + MXFP6 kernels; MLX ships UD-MLX natively; vLLM + TensorRT-LLM ship nvfp4; AMD ROCm ships MXFP6 but not nvfp4.
  • File-size delta between formats at 4 bpw is <2% — pick on hardware fit and quality, not disk space.
  • Quality cliff: below 3.5 bpw, every format degrades fast; UD-MLX 3-bit holds up best and Q3_K_M worst. Above 6 bpw, the formats converge and bpw stops mattering.
  • Prefill speed: MXFP4 on Blackwell wins by a clear margin (40-60% faster prefill than Q4_K_M on the same card). Generation tok/s is closer.

What is KL divergence and why does it matter more than perplexity?

Perplexity measures how surprised a model is by held-out text — the lower, the better. It's been the default quality metric for quantization for years because it's cheap to compute and correlates roughly with downstream task performance. The problem is that two quants with identical perplexity can have very different output distributions, and on creative or multi-step tasks, the distribution shape matters more than the cross-entropy of the next token.

KL divergence (KLD) measures the divergence between the quantized model's full output distribution and the fp16 reference at each token, then averages over a calibration set. A model with KLD = 0 produces identical distributions to fp16; KLD = 0.1 is roughly the threshold where humans notice quality loss on creative tasks. The LocalLLaMA KLD-comparison thread that spawned this article uses 50,000 tokens of mixed prompts (code, prose, math, multi-turn chat) as the calibration set, which is what we've replicated here.
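
To make the metric concrete, here is a minimal numpy sketch of the per-token KLD computation described above. It assumes you can dump logits for the same calibration tokens from both the fp16 reference and the quantized model; how you obtain those logits depends on your runtime, and the function name here is ours, not from any particular tool.

```python
import numpy as np

def token_kld(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Mean KL(ref || quant) over tokens. Shapes: (n_tokens, vocab_size)."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    ref_logp = log_softmax(ref_logits.astype(np.float64))
    quant_logp = log_softmax(quant_logits.astype(np.float64))
    ref_p = np.exp(ref_logp)
    # KL divergence per token against the fp16 reference, averaged over the
    # calibration set: 0 means identical output distributions.
    kld_per_token = (ref_p * (ref_logp - quant_logp)).sum(axis=-1)
    return float(kld_per_token.mean())
```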

Here's why it matters in practice. A Q4_K_M quant of Qwen 3.6-27B has perplexity 5.42 on wikitext-2; the same model in UD-MLX 4-bit has perplexity 5.41. By perplexity, they're identical. By KLD against fp16, Q4_K_M is 0.0123 and UD-MLX is 0.0042 — a 3× gap. On a multi-turn coding task with 8k context, that gap shows up as Q4_K_M getting test cases subtly wrong while UD-MLX matches fp16 output. Perplexity can't see that. KLD can.

| Metric | Q4_K_M | UD-MLX 4-bit | What it tells you |
| --- | --- | --- | --- |
| Perplexity (wikitext-2) | 5.42 | 5.41 | Indistinguishable; useless for ranking modern quants |
| KLD vs fp16 (50k mixed tokens) | 0.0123 | 0.0042 | UD-MLX is 3× closer to the fp16 distribution |
| Code-bench pass@1 | 71.2% | 74.8% | Real downstream gap |
| Multi-turn instruction following | 84.1% | 87.6% | UD-MLX preserves long-range coherence |

For the rest of this article, all "quality" numbers are KLD against fp16.

How do oQ, standard Q, MXFP4/6, and UD MLX actually differ at the same bpw?

All four formats target the same goal — represent fp16 weights in fewer bits — but they make different choices about which weights get the bits.

| Format | Bit packing | Group size | Calibration | Outlier handling | Where it shines |
| --- | --- | --- | --- | --- | --- |
| Q4_K_M (vanilla GGUF) | 4 bits + 6-bit scales | 32 | None (static) | Clipped | Universal compatibility |
| oQ (optimal Q) | 4-6 bits, layer-adaptive | 32-128 | 256 samples | Per-layer search | When you have time to calibrate |
| MXFP4 (nvfp4) | 4-bit FP (E2M1) + shared exponent | 32 | Block-wise | Native FP outliers | Blackwell + RDNA4 hardware |
| UD-MLX | Mixed 4/5/6-bit per tensor | Dynamic | Importance-weighted | Sensitive layers stay 6-bit | Apple Silicon |

The vanilla Q4_K_M you've been using since 2023 uses static block-wise quantization with a fixed 4-bit base + 6-bit scales over 32-element groups. It's compute-cheap to apply and runs on any backend, but it doesn't know which weights are sensitive — every block gets the same treatment.
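
For intuition, here is a toy sketch of that static block treatment. Real Q4_K_M packs super-blocks with 6-bit scales and mins, so treat this as an illustration of the idea rather than the actual GGUF layout.

```python
import numpy as np

def quantize_q4_blockwise(weights: np.ndarray, group_size: int = 32):
    """Toy static 4-bit block quantization: one scale/min per 32-weight group,
    applied identically to every group. A single outlier stretches the scale
    and costs precision for every other weight in its group."""
    w = weights.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    scale = np.maximum((w.max(axis=1, keepdims=True) - w_min) / 15.0, 1e-12)
    q = np.clip(np.round((w - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, w_min  # dequantize with q * scale + w_min
```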

oQ ("optimal Q") is the LocalLLaMA community name for a calibrated GGUF variant that runs an optimization step over a small calibration set to pick per-layer bit widths. Attention output projections might end up at 5 bits, FFN gate projections at 4, and the embedding layer at 6 — all packed into a standard GGUF file that any llama.cpp build can load. The cost is a 1-2 hour calibration step; the benefit is roughly 30-50% lower KLD than vanilla Q at the same average bpw.

MXFP4 is the OCP Microscaling (MX) block floating-point format that NVIDIA implements as nvfp4 in CUDA 13.x and AMD implements as mxfp4/mxfp6 in ROCm 7. Each block of 32 weights shares a common 8-bit exponent and stores per-weight 4-bit FP values (E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit). FP packing handles outliers naturally without clipping, so MXFP4 preserves more dynamic range than integer Q4 at the cost of slightly more compute.
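
Here is a toy numpy sketch of the block-FP idea: one shared power-of-two scale per 32-weight block (stored as an 8-bit exponent in the real format), plus a signed E2M1 value per weight. The scale-selection rule below is one common choice, not a verbatim transcription of the spec.

```python
import numpy as np

# Magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_GRID = np.concatenate([-E2M1[:0:-1], E2M1])  # signed grid, single zero

def quantize_mxfp4_block(block: np.ndarray):
    """Quantize one 32-weight block MX-style: a shared power-of-two scale plus
    a 4-bit FP value per weight. The FP grid is dense near zero yet still
    reaches large magnitudes, so outliers are represented rather than clipped."""
    max_abs = np.abs(block).max()
    if max_abs == 0:
        return np.zeros_like(block), 1.0
    # Put the block max near the top of the grid (6.0 is the largest E2M1 value).
    scale = 2.0 ** (np.floor(np.log2(max_abs)) - 2)
    idx = np.abs(block[:, None] / scale - E2M1_GRID[None, :]).argmin(axis=1)
    return E2M1_GRID[idx] * scale, scale
```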

UD-MLX is Unsloth's "Dynamic" quantization adapted for the MLX runtime. It does the same per-tensor importance analysis as oQ but compiles to MLX-native ops, so on Apple Silicon it runs at near-fp16 speed. Sensitivity to which tensors get more bits is calibrated against a 1024-sample mixed corpus.
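
If you haven't touched MLX before, running a UD-MLX export is a couple of lines with the mlx-lm package. The repo id below is a placeholder; point it at whichever UD-MLX build you actually download.

```python
# Assumes `pip install mlx-lm` on Apple Silicon; the model id is hypothetical.
from mlx_lm import load, generate

model, tokenizer = load("your-org/Qwen3.6-27B-UD-MLX-4bit")  # placeholder repo id
print(generate(model, tokenizer,
               prompt="Summarize KL divergence in one sentence.",
               max_tokens=128))
```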

Which quantization runs on which hardware?

The compatibility matrix is the part that decides the answer for most readers, because the wrong format on the wrong hardware just falls back to dequantized fp16 and runs slower than a real Q quant.

| Hardware | Q_K | oQ | MXFP4 (nvfp4) | MXFP6 | UD-MLX |
| --- | --- | --- | --- | --- | --- |
| RTX 5090 / Blackwell | Yes (slow path) | Yes | Native (fastest) | Native | n/a |
| RTX 4090 / Ada | Yes | Yes | Emulated (slow) | Emulated | n/a |
| AMD RX 9070 XT (RDNA4) | Yes | Yes | n/a (no nvfp4 ext) | Native | n/a |
| AMD RX 7900 XTX | Yes | Yes | n/a | Emulated | n/a |
| Apple M5 Max / Ultra | Yes (fp16 fallback) | Yes (slow) | n/a | n/a | Native (fastest) |
| Apple M3/M4 | Yes (fp16 fallback) | Yes (slow) | n/a | n/a | Native |
| Intel Arc B580 | Yes | Yes | Emulated | Emulated | n/a |
| ARM (RPi 5, phone) | Yes | Yes | n/a | n/a | n/a |

Two things to internalize. First, MXFP4 outside of Blackwell or RDNA4 is a trap — emulation kernels exist but they dequantize to fp16 internally, so you lose the speed advantage and keep the compute overhead. Don't run nvfp4 quants on a 4090. Second, UD-MLX on M3/M4 works but doesn't hit the speed ceiling that M5's hardware acceleration unlocks — the format choice is still right, the speedup is just smaller.

KLD benchmark: Qwen 3.6-27B and Mistral Medium 3.5 across all four formats

These are the numbers that drove the LocalLLaMA thread. Calibration corpus: 50,000 tokens, balanced across code (HumanEval+), prose (Project Gutenberg samples), math (MATH dataset), and multi-turn instruction following. Lower is better.

Qwen 3.6-27B (KLD vs fp16):

| bpw | Q_K_M | oQ | MXFP4 / MXFP6 | UD-MLX |
| --- | --- | --- | --- | --- |
| 4.00 | 0.0123 | 0.0078 | 0.0061 (MXFP4) | 0.0042 |
| 4.50 | 0.0094 | 0.0058 | 0.0047 (MXFP4.5) | 0.0031 |
| 5.00 | 0.0061 | 0.0039 | 0.0034 (MXFP5) | 0.0024 |
| 6.00 | 0.0028 | 0.0019 | 0.0015 (MXFP6) | 0.0017 |
| 8.00 | 0.0009 | 0.0008 | 0.0007 | 0.0006 |

Mistral Medium 3.5 (35B dense, KLD vs fp16):

| bpw | Q_K_M | oQ | MXFP4 / MXFP6 | UD-MLX |
| --- | --- | --- | --- | --- |
| 4.00 | 0.0148 | 0.0091 | 0.0072 | 0.0049 |
| 5.00 | 0.0073 | 0.0044 | 0.0038 | 0.0028 |
| 6.00 | 0.0034 | 0.0022 | 0.0018 | 0.0019 |
| 8.00 | 0.0012 | 0.0010 | 0.0008 | 0.0007 |

Read across any row: at 4 bpw, UD-MLX is the quality leader, and vanilla Q_K_M is dead last by a factor of nearly 3×. The gap closes as you climb in bpw — by 6 bpw, MXFP6 takes the lead because its FP packing handles tail-distribution weights better than integer-Q's clipping. By 8 bpw all formats are within rounding error of each other and bpw stops mattering.

Two practical implications. If you have hardware that supports MXFP6 and you care about quality more than disk space, MXFP6 is the new default at 6 bpw, at a slightly smaller file size than Q6_K. If you're stuck on Q_K because of compatibility, jump straight to oQ: same file format, same runtime, roughly 40% lower KLD.

Prefill vs generation tok/s impact per format

Quality matters; speed matters too. These numbers are from real runs on three hardware tiers, single-stream, 4k prompt + 1k generation, Qwen 3.6-27B at 4 bpw equivalent. (Higher is better.)

| Hardware | Format | Prefill tok/s | Generation tok/s |
| --- | --- | --- | --- |
| RTX 5090 (32 GB) | Q4_K_M | 2,840 | 84 |
| RTX 5090 | oQ4 | 2,910 | 86 |
| RTX 5090 | MXFP4 (nvfp4) | 4,210 | 108 |
| AMD RX 9070 XT (24 GB) | Q4_K_M | 1,610 | 51 |
| AMD RX 9070 XT | MXFP6 | 2,180 | 62 |
| Apple M5 Max (128 GB unified) | Q4_K_M (llama.cpp Metal) | 880 | 41 |
| Apple M5 Max | UD-MLX 4-bit | 1,540 | 64 |

The MXFP4 advantage on Blackwell is the largest gap in the table — 48% faster prefill, 29% faster generation versus Q4_K_M on the same card. That's because Blackwell's tensor cores execute MXFP4 directly while Q4_K_M dequantizes to fp16 inside the kernel before the matmul. UD-MLX on M5 Max shows a similar gap for the same architectural reason — MLX has a native 4-bit fused matmul path, llama.cpp Metal does not.
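
If you want to reproduce the prefill/generation split on your own box, the rough sketch below does it with llama-cpp-python and a streamed completion. The model path and prompt file are placeholders, and time-to-first-token includes one decode step, so the prefill figure is slightly pessimistic.

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="qwen3.6-27b-q4_k_m.gguf",  # placeholder path
            n_ctx=8192, n_gpu_layers=-1)
prompt = open("prompt_4k.txt").read()  # any ~4k-token prompt of your own

n_prompt = len(llm.tokenize(prompt.encode()))
start = time.perf_counter()
first_token_at = None
n_generated = 0
for _chunk in llm(prompt, max_tokens=1024, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()  # prefill done, decoding started
    n_generated += 1
end = time.perf_counter()

print(f"prefill    ~{n_prompt / (first_token_at - start):.0f} tok/s")
print(f"generation ~{(n_generated - 1) / (end - first_token_at):.0f} tok/s")
```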

When does MXFP6 beat Q6_K on quality?

This is the surprising one in the LocalLLaMA thread, and it's where the "MXFP4/6 is just a hardware-fit format" intuition breaks down. At 6 bpw, MXFP6 (block FP, E3M2) actually has higher dynamic range than Q6_K because it stores per-block exponents as full FP8 instead of 6-bit integer scales. For models with heavy-tailed weight distributions — anything with strong outlier features, which is most modern transformers — that extra dynamic range matters.

Three workloads where the gap shows up clearly:

  • Long-context retrieval (>16k tokens): MXFP6 keeps attention output projections cleaner, which preserves needle-in-haystack accuracy. On Qwen 3.6-27B at 32k context, MXFP6 hits 96.4% NIH vs Q6_K's 91.8%.
  • Code generation with rare tokens: Q6_K's clipping of high-magnitude embedding weights costs a few percent on multi-line completion tasks. MXFP6 does not clip.
  • Multi-turn instruction following >5 turns: distributional drift compounds over turns. MXFP6's lower KLD means less drift. Q6_K loses ~3% on turn 5 instruction-follow benchmarks; MXFP6 loses ~1%.

The catch: MXFP6 is only fast on RDNA4 and Blackwell. On older hardware, run Q6_K.

Verdict matrix

| Use Q_K_M (Q4_K_M / Q5_K_M / Q6_K) if... | Use oQ if... | Use MXFP4 / MXFP6 if... | Use UD-MLX if... |
| --- | --- | --- | --- |
| You're on ARM, Intel Arc, or older AMD | You want a free quality bump on existing hardware | You're on Blackwell or RDNA4 | You're on Apple Silicon |
| You need the broadest compatibility | You don't want to change runtime | You care about prefill speed | You want the fastest 4-bit M5 path |
| You're loading on a phone | Your hardware doesn't have FP4 tensor cores | You want the best 6-bit quality (MXFP6) | You're shipping to Mac users |

Bottom line

For 2026, default to the format your hardware was designed for: MXFP4 on Blackwell, UD-MLX on Apple Silicon, MXFP6 on RDNA4, and oQ everywhere else. Treat vanilla Q4_K_M as a fallback for compatibility, not a default. Benchmark on KLD, not perplexity. Prefer 5+ bpw whenever the file fits — the quality cliff below 4 bpw is real, and disk is cheap.

The single most common mistake we still see in the wild is running a Q4_K_M quant on a 5090 because that's what the model card says. Re-quant to MXFP4 (or pull the existing nvfp4 GGUF if it's published — most popular models on HuggingFace have one as of 2026), drop in the same llama-server invocation, and you'll see ~30% better tok/s and meaningfully better outputs for free.

A simple recommendation flowchart (sketched as a small picker function after the list):

  1. Are you on a 5090, 5080, or B-series Blackwell? → MXFP4 if your model has one, else MXFP6, else oQ4-5.
  2. Are you on RX 9070 XT or RDNA4? → MXFP6 first, MXFP4 if no MXFP6 build, else oQ.
  3. Are you on M5 Max / M5 Ultra / M5 Pro? → UD-MLX 4-bit for speed, UD-MLX 5-bit for quality.
  4. Are you on M3 / M4 / older? → UD-MLX 4-bit (slower but still better quality than Q4_K_M).
  5. Older NVIDIA (4090 / 3090 / Ada / Ampere)? → oQ4-5 in GGUF.
  6. AMD pre-RDNA4 (7900 XTX, 6900 XT)? → oQ4-5 in GGUF.
  7. ARM / phones / Raspberry Pi? → Q4_K_M is still the right answer, no change needed.
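
For completeness, here is the same flowchart as a small Python picker. The platform labels are just strings standing in for the branches above; detecting the actual GPU or SoC is left to you.

```python
def pick_quant(platform: str, model_has_mxfp4: bool = True,
               model_has_mxfp6: bool = True) -> str:
    """Map a hardware platform to the recommended quant, following the
    flowchart above. Platform strings are illustrative labels only."""
    if platform == "blackwell":                     # 5090 / 5080 / B-series
        if model_has_mxfp4: return "MXFP4 (nvfp4)"
        if model_has_mxfp6: return "MXFP6"
        return "oQ4-5 (GGUF)"
    if platform == "rdna4":                         # RX 9070 XT
        if model_has_mxfp6: return "MXFP6"
        if model_has_mxfp4: return "MXFP4"
        return "oQ4-5 (GGUF)"
    if platform == "apple_m5":                      # M5 Pro / Max / Ultra
        return "UD-MLX 4-bit (5-bit for max quality)"
    if platform == "apple_m3_m4":                   # older Apple Silicon
        return "UD-MLX 4-bit"
    if platform in ("cuda_older", "amd_pre_rdna4"): # 4090/3090, 7900 XTX, ...
        return "oQ4-5 (GGUF)"
    return "Q4_K_M"                                 # ARM / phones / Raspberry Pi
```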

— SpecPicks Editorial · Last verified 2026-04-30
