MLX engine comparison… and oMLX is the top choice.

SpecPicks News — summary + source link

By SpecPicks News Desk · Published 2026-05-18 · Last verified 2026-05-20 · 9 min read

In brief — 2026-05-19 · A community benchmark on r/LocalLLaMA comparing Apple Silicon inference engines on an M5 Max 64 GB found oMLX leading the pack running Qwen3-35B-A3B at 4-bit quantization. For Mac users weighing MLX versus GGUF with multi-token prediction, the result adds concrete throughput data to a debate that resurfaces monthly as new engines and MTP support shift the calculus for local LLM workstations.

What happened

A benchmark post on r/LocalLLaMA quickly became a reference point for Apple Silicon LLM users: a direct comparison of inference engines on an M5 Max with 64 GB of unified memory, running mlx-community/Qwen3-35B-A3B-4bit. The author tested oMLX alongside standard MLX and at least one additional engine, with oMLX taking the throughput lead.

One caveat surfaced immediately in the thread: the blog post under discussion used Qwen3-27B, not the 35B-A3B variant tested on the M5 Max, so the numbers are not strictly apples-to-apples. The 35B-A3B model is a mixture-of-experts architecture in which only a fraction of the parameters (roughly 3B of 35B, going by the model name) activate per token, so each decoded token reads far less weight data from memory than it would in a dense 27B model. Readers should treat the relative engine rankings as directionally valid while keeping the model mismatch in mind for absolute tokens-per-second figures.
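To put a rough scale on the model mismatch, the sketch below estimates how much weight data one decoded token has to read in each case. It assumes the "A3B" suffix means about 3 billion active parameters per token and that 4-bit weights cost exactly half a byte each; quantization scales, KV cache reads and routing details are ignored, and the function name is illustrative only.

```python
def weight_bytes_per_token(active_params: float, bits_per_weight: float) -> float:
    """Approximate weight bytes read from memory to decode one token."""
    return active_params * bits_per_weight / 8


# Parameter counts are read off the model names (assumption), not measured.
moe_active = 3e9     # Qwen3-35B-A3B: about 3B parameters active per token
dense_active = 27e9  # a dense 27B model touches every weight for every token

moe_bytes = weight_bytes_per_token(moe_active, 4)
dense_bytes = weight_bytes_per_token(dense_active, 4)
ratio = dense_bytes / moe_bytes

print(f"MoE:   {moe_bytes / 1e9:.1f} GB of weights read per token")
print(f"Dense: {dense_bytes / 1e9:.1f} GB of weights read per token")
print(f"The dense model reads about {ratio:.0f}x more weight data per token")
```

Under these assumptions the dense model moves roughly nine times more weight data per token, which is why absolute throughput figures from the two models should not be compared directly.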

The timing overlaps with an active r/LocalLLaMA thread asking whether MLX or GGUF with multi-token prediction (MTP) is now the better path for Mac users — a question that has grown more complex since llama.cpp merged MTP support. A separate post examined MTP for Qwen3-35B-A3B on a 6 GB VRAM laptop, concluding the feature is not worth enabling at that memory constraint, though the conclusion does not generalize to high-memory Mac environments.

Parallel community work has demonstrated Gemma4 26B running in MLX with turboquant and a rotating KV cache on an M5 MacBook Air, a sign that the MLX ecosystem continues to expand beyond Qwen-family models. Separately, at least one community member has reported early findings from integrating δ-mem with MLX on Apple Silicon, work aimed at dynamically adjusting weights outside of context windows.

Why it matters for builders

For anyone considering an Apple Silicon Mac as a local-LLM workstation, the oMLX result is a meaningful data point. Engine choice on Apple Silicon has historically received less attention than GPU selection on Linux or Windows, where CUDA and ROCm benchmarks dominate. Because token generation at these model sizes is typically limited by how fast weights can be streamed from unified memory rather than by raw compute, how efficiently an engine uses memory bandwidth tends to be the main differentiator.
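One way to compare engines on this basis is to ask what share of the theoretical bandwidth ceiling each one reaches. The sketch below is a simplified roofline estimate: it assumes decoding is purely bandwidth-bound and that each token streams its active weights once. All three input values are placeholders chosen for illustration, not Apple specifications or results from the benchmark post.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tokens/s if every token must stream its weights once."""
    return bandwidth_gb_s / gb_read_per_token


def bandwidth_utilization(measured_tps: float, bandwidth_gb_s: float,
                          gb_read_per_token: float) -> float:
    """Fraction of the bandwidth-bound ceiling that a measured run achieved."""
    return measured_tps / decode_ceiling_tps(bandwidth_gb_s, gb_read_per_token)


# Placeholder inputs (assumptions): substitute your own hardware and results.
PEAK_BANDWIDTH_GB_S = 500.0  # hypothetical memory bandwidth, not a real spec
GB_READ_PER_TOKEN = 1.5      # hypothetical weight traffic per decoded token
MEASURED_TPS = 100.0         # hypothetical engine result, not a real benchmark

ceiling = decode_ceiling_tps(PEAK_BANDWIDTH_GB_S, GB_READ_PER_TOKEN)
utilization = bandwidth_utilization(MEASURED_TPS, PEAK_BANDWIDTH_GB_S,
                                    GB_READ_PER_TOKEN)

print(f"Bandwidth-bound ceiling: {ceiling:.0f} tokens/s")
print(f"Utilization at {MEASURED_TPS:.0f} tokens/s: {utilization:.0%}")
```

Two engines running the same model on the same machine share the same ceiling, so a difference in this utilization figure reflects the engine rather than the hardware.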

The 64 GB M5 Max is a high-end consumer Apple Silicon configuration, and Qwen3-35B-A3B at 4-bit quantization fits within that envelope with room for context. If the engine ranking holds across model families, oMLX would be a reasonable default for users committed to the Apple Silicon path who want maximum throughput without switching to a GGUF workflow.
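The "room for context" claim can be sanity-checked with back-of-envelope arithmetic. The sketch below adds approximate 4-bit weight storage to a standard KV cache size estimate. The layer count, KV head count and head dimension are illustrative placeholders, not the published configuration of Qwen3-35B-A3B, and the estimate ignores quantization metadata, activations and whatever share of unified memory the operating system keeps for itself.

```python
GB = 1e9


def weights_bytes(total_params: float, bits_per_weight: float) -> float:
    """Approximate storage for quantized weights, ignoring scales and zeros."""
    return total_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_elem: int = 2) -> int:
    """Keys plus values for every layer, KV head and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem


# 35B total parameters is read off the model name (assumption).
weights_gb = weights_bytes(35e9, 4) / GB

# Architecture values below are hypothetical placeholders for illustration.
kv_bytes = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128,
                          context_tokens=32_768)
total_gb = weights_gb + kv_bytes / GB
fits = total_gb < 64  # compared against total memory, not the usable share

print(f"Weights:  {weights_gb:.1f} GB")
print(f"KV cache: {kv_bytes / GB:.2f} GB at 32,768 tokens (fp16 entries)")
print(f"Total:    {total_gb:.2f} GB, fits in 64 GB: {fits}")
```

Swapping in the real architecture values from the model card, and a realistic usable-memory budget in place of the full 64 GB, would tighten the estimate.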

The concurrent GGUF-with-MTP-versus-MLX debate is worth following separately. In principle, MTP improves throughput by proposing several tokens per forward pass and keeping the ones that verify, but the 6 GB VRAM finding suggests the benefit is hardware-dependent: on memory-constrained devices, MTP's overhead may not be recovered. On a 64 GB M5 Max the arithmetic is likely different, though no direct MTP-vs-oMLX comparison on identical hardware has been published.
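Why the overhead may or may not be recovered can be seen in a simplified speculative-decoding model. The sketch below is not llama.cpp's MTP implementation; it assumes each drafted token is accepted independently with the same probability and folds all extra work into a single relative cost per pass. The acceptance rate and cost values are illustrative assumptions.

```python
def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each of the draft_len drafted tokens is accepted independently
    with probability accept_prob, and that the verifier always contributes
    one token of its own (a geometric series in accept_prob).
    """
    if accept_prob >= 1.0:
        return float(draft_len + 1)
    return (1 - accept_prob ** (draft_len + 1)) / (1 - accept_prob)


def speedup(accept_prob: float, draft_len: int, relative_pass_cost: float) -> float:
    """Throughput relative to plain decoding.

    relative_pass_cost is the time of one speculative pass divided by the
    time of one ordinary decode step (1.0 would mean speculation is free).
    """
    return expected_tokens_per_pass(accept_prob, draft_len) / relative_pass_cost


# Illustrative values only: 70% acceptance, two drafted tokens per pass.
cheap = speedup(accept_prob=0.7, draft_len=2, relative_pass_cost=1.2)
costly = speedup(accept_prob=0.7, draft_len=2, relative_pass_cost=2.5)

print(f"Low-overhead case:  {cheap:.2f}x")
print(f"High-overhead case: {costly:.2f}x")
```

With the same acceptance rate, the result moves from a net gain to a net loss purely because the cost per pass changes, which is consistent with MTP helping on one machine and hurting on another.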

Hardware angle

The benchmark hardware at the center of this discussion, the M5 Max 64 GB, sits in the upper tier of current consumer Apple Silicon. A parallel r/LocalLLaMA thread on finalizing a desktop rig targeting 96 GB of total VRAM across multiple GPUs illustrates the alternative path: a multi-GPU AMD or NVIDIA configuration on a consumer desktop platform, accepting tensor-parallelism complexity in exchange for greater raw VRAM and a familiar CUDA/ROCm toolchain.

For users who want a capable local inference box without the Apple Silicon price premium, AMD Ryzen desktop platforms remain a common choice for CPU-offload workloads. Builds pairing a high-core-count processor with a discrete GPU can run models that overflow VRAM into system RAM, though at lower token throughput than a fully in-VRAM configuration. The AMD Ryzen 9 7900X bundle with ASUS TUF Gaming B650-PLUS is a capable foundation for that hybrid approach, combining a 12-core CPU with a modern AM5 platform whose DDR5 memory provides the bandwidth for CPU-side offload.

What other coverage is saying

The r/LocalLLaMA GGUF-with-MTP-vs-MLX thread directly contextualizes the oMLX result, with community members noting that LM Studio and similar frontends have made engine switching easier but that the underlying throughput gap is non-trivial and worth measuring on specific hardware. The MTP-on-6GB-VRAM post provides a useful counterpoint: multi-token prediction is not a universal win and must be benchmarked per configuration. The Gemma4-in-MLX-with-turboquant post from the same week demonstrates the MLX ecosystem extending to non-Qwen model families, broadening the relevance of engine-level benchmarks. The desktop-rig thread targeting 96 GB VRAM illustrates that high-memory discrete-GPU configurations remain the alternative for users who cannot or will not buy Apple Silicon.
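For readers who want to run that kind of per-configuration measurement themselves, a minimal timing harness might look like the sketch below. It is engine-agnostic: `generate` stands for any callable that runs one generation and returns the number of tokens produced, and `fake_engine` is a stand-in for demonstration only, not a call into oMLX, MLX or llama.cpp.

```python
import statistics
import time
from typing import Callable


def measure_tps(generate: Callable[[], int], warmup: int = 1, runs: int = 3) -> float:
    """Median tokens/s across several timed runs.

    generate() must perform one full generation and return the number of
    tokens it produced. Warmup runs are executed but not timed, so one-off
    costs such as model loading and caching do not skew the result.
    """
    for _ in range(warmup):
        generate()
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate()
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return statistics.median(rates)


def fake_engine() -> int:
    """Hypothetical stand-in: pretends to emit 10 tokens in about 10 ms."""
    time.sleep(0.01)
    return 10


tps = measure_tps(fake_engine)
print(f"Median throughput: {tps:.0f} tokens/s")
```

Replacing `fake_engine` with a wrapper around each engine, and using the same prompt, output length and quantization for all of them, keeps the comparison fair.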

Sources

MLX engine comparison — oMLX is the top choice (r/LocalLLaMA) — Primary benchmark post comparing inference engines on M5 Max 64 GB with Qwen3-35B-A3B-4bit; oMLX leads on throughput.
GGUF with MTP vs MLX without — is MLX still the way for Mac users? (r/LocalLLaMA) — Community thread examining whether llama.cpp's MTP has closed the gap with MLX on Apple Silicon.
MTP for Qwen3-35B-A3B on 6 GB VRAM laptop: not worth it (r/LocalLLaMA) — Empirical finding that MTP overhead exceeds its benefit on memory-constrained hardware.
Gemma4 26B MoE running in MLX with turboquant (r/LocalLLaMA) — Demonstrates MLX running a non-Qwen MoE model on M5 hardware with rotating KV cache.
Finalizing a desktop rig: 96 GB VRAM + 128 GB RAM (r/LocalLLaMA) — High-VRAM discrete-GPU desktop builds as the alternative to Apple Silicon for large-model inference.

Filed by the SpecPicks News Desk. We summarize and link — never paywall-bypass.