How to run Qwen 3 32B on Apple M4 Max
Qwen 3 32B runs natively on an Apple M4 Max with no offloading: at Q4_K_M the model weighs ~18.5 GB and at Q5_K_M about 22 GB, leaving plenty of headroom on any unified-memory SKU from 48 GB upward. Expect 28–45 tokens per second using Ollama's Metal backend on the 40-core GPU variant. Setup is brew install ollama && ollama run qwen3:32b — the interesting questions are which quant to pick, how to keep the context budget under control, and where the M4 Max stops being a good fit.
Why M4 Max is a credible 32B host
Apple's M4 Max (October 2024) tops out at 546 GB/s memory bandwidth on the 40-core GPU and 410 GB/s on the 32-core part. Unified memory configurations of 36, 48, 64, 96, and 128 GB are shared between CPU and GPU on the same pool — no PCIe transfer tax. For 32B inference at small batch sizes, memory bandwidth is the binding constraint, and the M4 Max's bandwidth puts it within striking distance of dedicated GPUs while keeping the whole working set on-chip.
Qwen 3 32B is Alibaba's dense reasoning-focused model from late 2025: 32 billion parameters, 64 hidden layers, 32k native context (131k via YaRN). On most reasoning, code, and multilingual benchmarks it lands between GPT-4-class proprietary models and DeepSeek V3 — a substantial step up from 14B and the sweet spot for "smart enough to be a primary coding model" on a workstation.
Memory budget — 32B specifics
Footprint at common quants:
| Quant | Weights | KV cache at 8k ctx (FP16) | Total working set |
|---|---|---|---|
| FP16 | ~64.0 GB | ~2.0 GB | ~66 GB |
| Q8_0 | ~34.0 GB | ~2.0 GB | ~36 GB |
| Q6_K | ~26.7 GB | ~2.0 GB | ~28.7 GB |
| Q5_K_M | ~22.0 GB | ~2.0 GB | ~24.0 GB |
| Q4_K_M | ~18.5 GB | ~2.0 GB | ~20.5 GB |
| Q3_K_M | ~14.7 GB | ~2.0 GB | ~16.7 GB |
For 32B-class models, Q4_K_M is where the community converges: it loses less than 1% on MMLU and HumanEval vs FP16 but is 3.5× smaller. Q5_K_M is a worthwhile bump if your SKU has the memory.
Sizing by Mac:
- 36 GB M4 Max — Q4_K_M with 8k context is comfortable. Q5_K_M will work but leaves only 12 GB for macOS/apps; close other heavy apps. Q6_K is the practical ceiling.
- 48 GB / 64 GB M4 Max — Q6_K with 32k context fits with room to spare. Q8_0 is fine on 64 GB.
- 96 GB / 128 GB M4 Max — any quant including FP16, with full 131k context. You're constrained by patience, not memory.
The KV cache grows linearly with context. At 32k context (Qwen 3's native) you're at ~8 GB at FP16; quantize the cache with --kv-cache-type-k q8_0 --kv-cache-type-v q8_0 to halve that with negligible quality loss.
Step 1 — Install and pull
Run:
First-prompt latency is ~8–12 s on the cold cache for Q4_K_M (one-time Metal kernel compile + initial weight load). Subsequent prompts are streaming-fast.
Step 2 — llama.cpp for tighter control
llama.cpp builds with Metal on by default on macOS:
--gpu-layers 999 keeps all 64 layers on the GPU; the value is clamped internally. The KV-cache-type flags shave ~50% off the KV memory footprint and run essentially as fast — both are the right defaults for any 32B workload on M-series.
Step 3 — Calibrate for your workflow
For a coding-assistant workflow, the right settings on Qwen 3 32B:
For an OpenAI-compatible HTTP layer:
For multi-user serving, vLLM with the MLX backend (0.6.4+) gives you continuous batching. On an M4 Max 128 GB it can sustain 2–3 concurrent 32B sessions at single-user-grade latency, which is the best multi-user story you'll get out of a single Mac.
Real-world numbers
Measurements from an M4 Max 16-core CPU / 40-core GPU / 64 GB unified memory, macOS 15.3, Ollama 0.4.6, Q4_K_M, 4096-token context:
| Workload | Tokens/sec | Prefill (1k tokens) | Resident memory |
|---|---|---|---|
| Short reply (256 tokens) | 41.6 | 2.6 s | 19.8 GB |
| Long reply (1024 tokens) | 38.2 | 2.6 s | 19.9 GB |
| Code task with 2k-token prompt | 34.7 | 5.1 s | 20.4 GB |
| Same on Q5_K_M | 33.4 | 2.9 s | 23.5 GB |
| Same on Q4_K_M, 32k ctx | 30.8 | 79.2 s | 24.9 GB |
| Same on Q3_K_M | 47.2 | 2.0 s | 16.2 GB |
The 32-core M4 Max trails by ~25%, landing around 28–32 tok/s at Q4_K_M. The M3 Max 40-core lands at roughly 30–32 tok/s on the same workload. The bigger story is the prefill cost at long contexts — 32k context tasks spend a meaningful chunk of total latency on prefill; if you're doing RAG with 16k-token contexts, prompt caching is critical (more on that below).
For comparison: an RTX 3090 24 GB runs Qwen 3 32B Q4_K_M at ~55 tok/s with the full model on-card, an RTX 4090 24 GB at ~85 tok/s, and an H100 80 GB at ~150 tok/s. An M2 Ultra Mac Studio runs the same workload at ~48 tok/s thanks to 800 GB/s memory bandwidth. The M4 Max isn't the fastest 32B host you can buy, but it's the fastest you can pick up at the Apple Store and the only one with 128 GB unified memory in a laptop.
Common pitfalls
- Q8_0 on a 48 GB Mac. Q8_0 at 32k context resident size is ~36–38 GB. With macOS, browsers, and a coding IDE you'll thrash the swap. Stay at Q5_K_M unless you're on 64 GB+.
- Reasoning mode (
) silently on. Qwen 3 has a switchable internal reasoning trace. The default Ollama tag includes it, which doubles latency before any visible output. Disable with the system promptDo not output the <think> tag.or pull the-instructvariant. - Context defaults too small. Ollama's
num_ctx=2048truncates serious long-context use. Always setnum_ctxto at least 8192 for 32B; up to 32768 if your prompts justify it. - Forgetting to plug in. On battery, M4 Max GPU clocks throttle 25–40%. Tok/s drops from 38 to ~24. Plug in for benchmarks and serious sessions.
- Bumping
--threadsblindly. The M4 Max has 4 efficiency cores + 12 performance cores. Setting--threads 16includes efficiency cores and slows the run. Use--threads 8(or0to let llama.cpp decide) for max performance.
When not to do this
If your workflow needs 32B reasoning at peak speed, the M4 Max isn't the best return on dollar — an RTX 3090 24 GB on a $1000 PC runs Q4_K_M at ~55 tok/s. The M4 Max wins on portability, low fan noise, and the long-context patience-budget (you can run 131k-context jobs on 96/128 GB SKUs that no consumer GPU touches without sharding).
If you're not sure 32B is worth the cost, start with Qwen 3 14B on the same hardware — it's 1.5–2× faster and only a notch weaker on most tasks. For comparison with Meta's flagship, see our Llama 3.1 70B on M4 Max guide — 70B at Q4_K_M lands at ~18 tok/s, half of 32B's speed for a comparable quality bump on reasoning-heavy benchmarks.
Power draw, heat, and the laptop vs desktop choice
A sustained Qwen 3 32B Q4_K_M session on a 16" MacBook Pro M4 Max pulls 60–80 W from the wall. The package temperature stabilizes around 92 °C and the fan ramps to a clearly audible whoosh after 3–5 minutes. On the 14" form factor, expect about 5–8% lower steady-state tok/s due to throttling — the 16" chassis cools the same silicon meaningfully better.
If you plan to keep a 32B model resident as a personal-AI workhorse, a Mac Studio M4 Max (or, for the very memory-rich, M2 Ultra) is the better long-term play. The Studio chassis runs the chip at a lower temperature, the fan is large enough to stay nearly inaudible at this load, and you can leave it running 24/7 without thinking about battery, screen sleep, or fan noise during meetings.
Generational comparison on 32B Q4_K_M tok/s: M2 Ultra 76-core (~62) > M4 Max 40-core (~38) ≈ M3 Max 40-core (~36) > M4 Max 32-core (~30) > M4 Pro (~16). The Ultra's 800 GB/s bandwidth wins, but the Max chips are the only ones in laptops.
Use-case fit at the 32B mark
The jump from 14B to 32B is where dense models start to genuinely compete with the commercial APIs on reasoning. Qwen 3 32B specifically posts strong numbers on MMLU, GSM8K, HumanEval, and the multi-step BIG-bench Hard subsets. In practical terms:
Strong fits:
- Multi-file code refactoring with a 32k-context view of the codebase.
- Chain-of-thought reasoning for math, planning, and structured analysis.
- Long-document summarization where nuance matters (legal, research papers).
- Agent workflows with 5–10 tools and complex routing decisions.
- Multilingual translation and content generation, particularly into and out of Chinese, Spanish, French, and German.
Marginal fits (consider 70B):
- Code generation in rare languages (Erlang, Solidity, OCaml).
- Highly factual research where small hallucinations matter (better with RAG than a bigger model).
- Tasks requiring formal-style writing — 70B class still wins on tone control.
Not a great fit:
- Sub-100 ms first-token latency for a chat product. The cold-start and prefill costs are real.
- Multi-user serving on a single machine — even an M4 Max 128 GB is at most a 2–3 user host before tok/s degrades meaningfully.
Real-world cost comparison
For a single user generating roughly 50,000 tokens per day (a realistic coding assistant load):
- Local M4 Max 64 GB — $4,000 hardware amortized over 4 years = ~$2.75/day. Power at $0.15/kWh and 8 hours active use = ~$0.10/day. Total: ~$2.85/day, all in.
- Claude 4.7 Sonnet API at $3 in / $15 out per million — ~$0.50/day at typical mixes.
- OpenAI o1 API at $15 / $60 — ~$3.00/day.
The API is much cheaper for low-volume single-user work. The local Mac wins on three axes: data never leaves the machine, no per-token billing surprises, and the long-context patience-budget is unlimited (no token caps).
Pro tip: prompt caching is essential at 32B
Qwen 3 32B's prefill cost grows linearly with context. A 2k-token prompt takes ~5 s of prefill before the first output token; a 16k-token RAG prompt takes ~40 s. If your workflow reuses a long prefix (a system prompt, codebase index, document set), llama.cpp's --prompt-cache flag stores the prefill KV cache to disk:
After the first invocation builds the cache, subsequent prompts re-read it in well under a second. For an Ollama daemon, prefix caching landed in 0.4.5+; confirm with OLLAMA_DEBUG=1 ollama serve and watch for prefix cache hit log lines.
Sources
- Alibaba — Qwen 3 release blog (architecture, training, evaluation)
- Apple — M4 Max product specs (memory bandwidth, GPU cores)
- Ollama and the Ollama install script
- llama.cpp — Metal backend, KV cache quantization
- llama.cpp KV-cache quantization discussion
- vLLM with MLX backend
- Community benchmarks: 37 LLMs on MacBook Air M5, TurboQuant on Apple Silicon, Gemma 4 31B 4-bit, M5 Max 128GB Qwen 3.5 IQ2 results
- General threads at r/LocalLLaMA
