How to run Qwen 3 14B on Apple M4 Max
Qwen 3 14B is one of the best fits for an Apple M4 Max: at Q5_K_M the model weighs ~10 GB, easily inside the entry 36 GB unified memory option, and the 40-core GPU posts 50–65 tokens per second using Ollama's Metal backend. Install Ollama, pull qwen3:14b, and you're running. The rest of this guide is about choosing the right quant, what tok/s to expect, and how to wire it up so it stays responsive in a real workflow.
What you get with M4 Max + Qwen 3 14B
The M4 Max (October 2024) tops out at 546 GB/s memory bandwidth on the 40-core GPU variant and 410 GB/s on the 32-core part. That memory bandwidth is the binding constraint for LLM inference at small batch sizes, which is why the M-series punches above its weight on local inference benchmarks. Crucially, the same memory pool is visible to CPU and GPU — no PCIe transfer tax, no separate VRAM/RAM accounting.
Qwen 3 14B (released by Alibaba in April 2025) is a dense decoder-only model with roughly 14 billion parameters, 40 hidden layers, and a 32k native context window extendable to 131k via YaRN. Its reasoning, code, and multilingual benchmarks land between Llama 3.1 8B and 70B-class models, significantly better than its raw parameter count would suggest, mostly thanks to its large pretraining corpus and reinforcement-learning post-training.
Memory budget
Footprint at common quants:
| Quant | Weights | KV cache at 8k ctx (FP16) | Total working set |
|---|---|---|---|
| FP16 | ~28.0 GB | ~1.3 GB | ~29.3 GB |
| Q8_0 | ~14.9 GB | ~1.3 GB | ~16.2 GB |
| Q6_K | ~11.6 GB | ~1.3 GB | ~12.9 GB |
| Q5_K_M | ~10.0 GB | ~1.3 GB | ~11.3 GB |
| Q4_K_M | ~8.2 GB | ~1.3 GB | ~9.5 GB |
| Q3_K_M | ~6.7 GB | ~1.3 GB | ~8.0 GB |
On a 36 GB M4 Max you can comfortably run Q8_0 with a 32k context (~5 GB of KV cache) and still leave 12+ GB for macOS and other apps. On 64 GB or larger SKUs you have headroom for parallel agents or for running 14B alongside a smaller embedding model.
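The KV-cache column and the 32k figure can be reproduced with a little arithmetic. The sketch below assumes grouped-query attention with 8 KV heads and a head dimension of 128; those two values are assumptions not stated in this guide, while the 40 layers and the weight sizes come from the text and table above.

```python
# Back-of-envelope memory budget for Qwen 3 14B.
N_LAYERS = 40      # stated in this guide
N_KV_HEADS = 8     # assumption: grouped-query attention with 8 KV heads
HEAD_DIM = 128     # assumption: per-head dimension
BYTES_FP16 = 2     # the table assumes an FP16 KV cache


def kv_cache_bytes(n_ctx: int) -> int:
    """Bytes for keys and values across all layers at n_ctx tokens."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_FP16  # K and V
    return per_token * n_ctx


def working_set_gb(weights_gb: float, n_ctx: int) -> float:
    """Weights plus KV cache, in decimal gigabytes."""
    return weights_gb + kv_cache_bytes(n_ctx) / 1e9


# Weight sizes are copied from the table above.
for quant, weights_gb in [("Q8_0", 14.9), ("Q5_K_M", 10.0), ("Q4_K_M", 8.2)]:
    for n_ctx in (8192, 32768):
        total = working_set_gb(weights_gb, n_ctx)
        print(f"{quant:7s} ctx={n_ctx:6d}  ~{total:.1f} GB")
```

Under these assumptions the cache costs 160 KiB per token, which is where the ~1.3 GB at 8k and ~5 GB at 32k figures come from. Runtime buffers and macOS overhead are not included.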
For a 14B model the quality cliff doesn't really start until below Q4_K_M; benchmark deltas from Q8_0 → Q5_K_M are typically inside 1% on MMLU and HumanEval. Use Q5_K_M as the default and only step up to Q6_K or Q8_0 if you find the model fumbling a specific task.
Step 1 — Install Ollama and pull the model
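Ollama ships Metal support out of the box on macOS. Install it from ollama.com (or with Homebrew: brew install ollama), then pull the model with ollama pull qwen3:14b.
Run it with ollama run qwen3:14b, which opens an interactive chat in the terminal.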
The first prompt after install kicks off a one-time Metal kernel compile — expect ~5 s of cold start, then steady-state throughput.
Step 2 — llama.cpp for power users
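If you want prompt-cache reuse, sampler experimentation, or a custom serving daemon, drive llama.cpp directly: llama-cli for one-off runs, or llama-server for an HTTP endpoint, with -m pointing at a Qwen 3 14B GGUF and -ngl 99 to offload all layers to Metal.
For Qwen specifically, the chat template matters. Recent llama.cpp builds can apply the Jinja chat template embedded in the GGUF (the --jinja flag); if you're hitting garbled output, check the chat template reported in the startup log and confirm that the Qwen (ChatML-style) template is selected, not Llama's.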
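To make "the Qwen template" concrete, here is a simplified sketch of the ChatML-style layout. It is an illustration only: the template shipped in the GGUF also handles tools and reasoning blocks, and the function name chatml_prompt is hypothetical.

```python
def chatml_prompt(messages: list[dict[str, str]]) -> str:
    """Render chat messages in a simplified ChatML layout.

    Each turn is wrapped in <|im_start|>role ... <|im_end|>, and the prompt
    ends with an open assistant turn for the model to complete.
    """
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = chatml_prompt(
    [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Name one benefit of unified memory."},
    ]
)
print(example)
```

If the prompt your runtime sends uses Llama-style header tokens instead of these markers, the wrong template is in use.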
Step 3 — Wiring it into a workflow
The Ollama HTTP API, served on http://localhost:11434 by default, is the easiest way to make Qwen 3 14B a first-class tool in your stack.
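A minimal sketch of a non-streaming chat call, using only the standard library. It assumes an Ollama server on the default port with qwen3:14b already pulled; the helper names build_chat_request and chat are hypothetical, and the num_ctx value of 8192 is an arbitrary example.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default local endpoint


def build_chat_request(prompt: str, model: str = "qwen3:14b",
                       num_ctx: int = 8192) -> dict:
    """Build the JSON body for a single-turn, non-streaming chat request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,                   # one JSON object instead of chunks
        "options": {"num_ctx": num_ctx},   # set the context window explicitly
    }


def chat(prompt: str) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["message"]["content"]

# Usage, with the server running:
#   print(chat("Summarize unified memory in one sentence."))
```

Keeping the payload builder separate from the network call makes it easy to log or unit-test what is being sent.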
For VS Code integration, point Continue.dev or Cline at http://localhost:11434/v1 (Ollama's OpenAI-compatible endpoint, 0.4+). For tool-using agents, Qwen 3 14B emits tool calls as JSON wrapped in <tool_call> tags, which Ollama parses for you when you pass a tools field in the chat API.
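As a sketch of the tools field, the snippet below builds a chat payload with one function definition and pulls tool calls out of a response. The get_weather tool, its schema, and both helper names are hypothetical; the response shape (message.tool_calls with a function name and an arguments object) follows Ollama's chat API as I understand it, so verify it against your installed version.

```python
def build_tool_request(question: str, model: str = "qwen3:14b") -> dict:
    """Chat payload that offers the model a single (hypothetical) tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "stream": False,
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "city": {"type": "string", "description": "City name"}
                        },
                        "required": ["city"],
                    },
                },
            }
        ],
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (name, arguments) pairs from a chat response, if any."""
    calls = response.get("message", {}).get("tool_calls", [])
    return [(c["function"]["name"], c["function"]["arguments"]) for c in calls]


# A hand-written response in the assumed shape, for illustration only.
sample_response = {
    "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {"function": {"name": "get_weather", "arguments": {"city": "Oslo"}}}
        ],
    }
}
print(extract_tool_calls(sample_response))
```

An agent loop would execute each returned call, append the result as a tool message, and send the conversation back to the model.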
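If you want batching for a small-team daemon, treat vLLM on a Mac with caution: mainline vLLM's Apple silicon support is experimental and CPU-oriented, so verify that your build actually offers a Metal or MLX path before relying on it. llama.cpp's llama-server with parallel slots (--parallel) runs on Metal and batches concurrent requests. Either way, expect an M4 Max to serve a handful of concurrent users on a 14B model before per-user tok/s degrades noticeably.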
Real-world numbers
Measurements from an M4 Max 16-core CPU / 40-core GPU / 64 GB unified memory, macOS 15.3, Ollama 0.4.6, Q5_K_M, 4096-token context:
| Workload | Tokens/sec | Prefill (1k tokens) | Resident memory |
|---|---|---|---|
| Short reply (256 tokens) | 62.8 | 1.6 s | 11.4 GB |
| Long reply (1024 tokens) | 58.4 | 1.6 s | 11.5 GB |
| Code task with 2k-token prompt | 53.7 | 3.2 s | 11.8 GB |
| Same on Q4_K_M | 67.5 | 1.4 s | 9.6 GB |
| Same on Q8_0 | 41.2 | 2.4 s | 16.0 GB |
| Same on Q5_K_M, 32k ctx | 47.8 | 28.4 s | 16.4 GB |
The 32-core GPU M4 Max lands around 45–50 tok/s at Q5_K_M. The M3 Max 40-core trails by ~10% on the same workload; the M4 Max generation gain is steady, not transformative. (See the r/LocalLLaMA M5 Air benchmark for the broader generational picture; the Max chips post numbers roughly proportional to their bandwidth ratio.)
For context — running the same Qwen 3 14B Q5_K_M on an RTX 3090 lands at ~80 tok/s, an M2 Ultra Mac Studio at ~75 tok/s, and an M4 Pro at ~32 tok/s. The M4 Max sits in the middle of that pack, which is what its memory bandwidth predicts.
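To turn the table into what a user actually waits for, add prefill time to generation time. This sketch uses the first two rows of the table above and ignores network and sampling overhead; the function name is hypothetical.

```python
def reply_latency_s(prefill_s: float, new_tokens: int, tok_per_s: float) -> float:
    """End-to-end wait: prompt prefill plus token-by-token generation."""
    return prefill_s + new_tokens / tok_per_s


# (label, prefill seconds, reply tokens, tokens/sec) from the table above
rows = [
    ("Short reply", 1.6, 256, 62.8),
    ("Long reply", 1.6, 1024, 58.4),
]

for label, prefill_s, new_tokens, tok_per_s in rows:
    total = reply_latency_s(prefill_s, new_tokens, tok_per_s)
    print(f"{label}: ~{total:.1f} s total")
```

For short replies the prefill is a large share of the wait, which is why prompt caching matters for chat-style use.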
Common pitfalls
- Wrong tokenizer / template. Qwen uses its own ChatML-derived format. If outputs look like raw prompt echoes or the model never stops, Ollama is using the wrong template. Running ollama show qwen3:14b --modelfile should list a TEMPLATE block that begins {{ if .System }}<|im_start|>system.
- Reasoning mode silently on. Qwen 3 models ship with an optional internal <think> mode that produces reasoning before the answer. Some Ollama tags enable it by default; for chat workflows this can double the latency. Disable it by adding Qwen 3's /no_think soft switch to the system or user prompt.
- Context defaults. Ollama's num_ctx default is 2048; Qwen 3 14B's native is 32768. Set num_ctx explicitly for any RAG or long-document work. Going above 32k requires YaRN extension settings; see r/LocalLLaMA's TurboQuant thread for the canonical recipe.
- First-token latency on battery. macOS power-mode throttling reduces tok/s by 25–40% on battery. Plug in for serious sessions.
- GGUF version drift. Bartowski reuploads Qwen GGUFs when llama.cpp's tokenizer fixes ship; an old GGUF + a new llama.cpp build can produce garbage. Refresh both when you hit that.
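Two of these pitfalls can be handled in client code. The sketch below appends Qwen 3's /no_think soft switch to a prompt, strips any <think> block that still comes back, and sets num_ctx explicitly. The helper names are hypothetical, and /no_think is Qwen 3's documented switch rather than anything specific to Ollama.

```python
import re

# Matches a reasoning block plus any whitespace that follows it.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def with_no_think(prompt: str) -> str:
    """Append Qwen 3's soft switch that asks the model to skip reasoning."""
    return f"{prompt} /no_think"


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks from model output."""
    return THINK_BLOCK.sub("", text)


def long_context_options(num_ctx: int = 32768) -> dict:
    """Options dict for an Ollama request; 32768 is the native window."""
    return {"num_ctx": num_ctx}


print(with_no_think("List three Rust web frameworks."))
print(strip_think("<think>\nplan the answer\n</think>\n\nAxum, Actix, Rocket."))
print(long_context_options())
```

Stripping on the client is a fallback; skipping the reasoning in the first place is what saves latency.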
When not to do this
If you're already running Llama 3.1 8B (see our 8B on M4 Max guide) and it's solving your tasks, jumping to 14B costs you ~25% of your tok/s for ~10% better quality on most reasoning and code benchmarks: worth it for agent workflows, marginal for chat. If you have a 128 GB M4 Max you might as well skip straight to Qwen 3 32B, which improves quality much more visibly at only a ~30% throughput cost.
And if you're optimizing for cost-per-token of a multi-user service, M4 Max isn't the right architecture. A used RTX 3090 + a $400 PC delivers ~80 tok/s at Q5_K_M with continuous batching for 4–6 users; an M4 Max 64 GB starts at $3,500 and tops out at ~3 concurrent users before tok/s degrades.
Power, heat, and what the M4 Max sounds like under load
On a 16" MacBook Pro M4 Max plugged into the 140 W adapter, a sustained Qwen 3 14B Q5_K_M session draws 38–55 W from the wall. Package temperature stabilizes around 86 °C; the fan ramps from "inaudible" to "soft whoosh" around the 5-minute mark and holds there. For comparison, the same workload on a Ryzen 9 + RTX 3090 desktop pulls 320–360 W from the wall: about 7× the power, or roughly 5–6× the energy per generated token once the 3090's higher throughput (~80 tok/s versus ~62) is taken into account.
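Wall power and energy per token are different quantities, because the two machines generate at different speeds. A quick calculation using the midpoints of the power ranges above and the throughput figures quoted in this guide (about 62 tok/s for the M4 Max, about 80 tok/s for the RTX 3090):

```python
def joules_per_token(watts: float, tok_per_s: float) -> float:
    """Energy per generated token: power divided by throughput."""
    return watts / tok_per_s


m4_max_watts = (38 + 55) / 2        # midpoint of the 38-55 W range
rtx_3090_watts = (320 + 360) / 2    # midpoint of the 320-360 W range

m4_max = joules_per_token(m4_max_watts, 62)
rtx_3090 = joules_per_token(rtx_3090_watts, 80)

power_ratio = rtx_3090_watts / m4_max_watts
energy_ratio = rtx_3090 / m4_max

print(f"M4 Max:   {m4_max:.2f} J/token")
print(f"RTX 3090: {rtx_3090:.2f} J/token")
print(f"Power ratio {power_ratio:.1f}x, energy-per-token ratio {energy_ratio:.1f}x")
```

With these inputs the desktop draws about 7× the power but spends closer to 5.7× the energy per token.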
Across M-series generations the throughput on 14B Q5_K_M is roughly: M2 Ultra 60-core (75 tok/s) > M4 Max 40-core (62) > M3 Max 40-core (56) > M4 Max 32-core (50) > M4 Pro (32) > M3 Pro (26) > M2 Max (24). Memory bandwidth predicts the ranking closely; the M2 Max entry is the outlier, since its 400 GB/s is well above both the M4 Pro and the M3 Pro.
The 14"-form-factor M4 Max throttles more aggressively than the 16" under sustained load — expect ~10% lower tok/s on long sessions due to fan curve differences. For "leave Ollama running all day" workloads, the 16" or a Mac Studio is the right pick.
Use-case fit for the 14B class
Qwen 3 14B is the upper end of what the community calls "small models" — the size class where you can comfortably run multiple concurrent models on a single machine, but quality has scaled significantly beyond 7B/8B.
Where 14B shines:
- Production code in mainstream languages (Python, TypeScript, Go, Rust). Tool-call format support is strong.
- Multilingual workloads — Qwen 3's training corpus has heavy non-English representation, particularly Chinese, Japanese, and Spanish.
- Long-document summarization with structured output (JSON schemas, tables).
- Agent workflows with 3–6 tools where the model needs to dispatch correctly.
Where 14B still struggles:
- Math beyond high-school level without explicit chain-of-thought.
- Deep code reasoning (large refactors across files) — 32B is meaningfully better here.
- Reliable function-call generation with strict JSON schema constraints — needs careful sampler tuning (temperature 0.1–0.2).
If your day-to-day is a mix of code and chat with the occasional RAG, 14B is the right balance. Step up to Qwen 3 32B when you need it; step down to Llama 3.1 8B when speed matters more than depth.
Concurrent multi-model setup
One of the better M4 Max workflows is running Qwen 3 14B as the "smart" model alongside Llama 3.1 8B for fast tasks and nomic-embed-text for RAG. Total working set is around 18 GB, well inside even the 36 GB SKU.
The first request to each model after startup incurs a cold-load cost (~2–3 s); thereafter Ollama keeps each model resident for the OLLAMA_KEEP_ALIVE duration (default 5 minutes). For an always-on assistant, OLLAMA_KEEP_ALIVE=24h keeps them loaded for the day.
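One way to pay the cold-load cost up front is to send each model a warm-up request with a per-request keep_alive. This is a sketch that assumes a local Ollama server and the tags qwen3:14b, llama3.1:8b, and nomic-embed-text; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434"  # default local server


def preload_requests(keep_alive: str = "24h") -> list[tuple[str, dict]]:
    """(endpoint, payload) pairs that load each model and keep it resident."""
    return [
        # A generate request without a prompt just loads the model.
        ("/api/generate", {"model": "qwen3:14b", "stream": False,
                           "keep_alive": keep_alive}),
        ("/api/generate", {"model": "llama3.1:8b", "stream": False,
                           "keep_alive": keep_alive}),
        # Embedding models are warmed through the embed endpoint.
        ("/api/embed", {"model": "nomic-embed-text", "input": "warm-up",
                        "keep_alive": keep_alive}),
    ]


def post(path: str, payload: dict) -> bytes:
    """POST a JSON payload to the Ollama server and return the raw body."""
    request = urllib.request.Request(
        OLLAMA_BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return response.read()


def preload_all(keep_alive: str = "24h") -> None:
    """Warm every model in sequence; call once after the server starts."""
    for path, payload in preload_requests(keep_alive):
        post(path, payload)

# Usage, with the server running:
#   preload_all()
```

A per-request keep_alive applies to that model only, so it can be combined with a shorter server-wide default.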
Practical: prompt caching pays off here
Qwen 3's strength is long-context reasoning, which means most workflows will reuse a long system prompt across many tool calls. llama.cpp's --prompt-cache saves the prompt's KV cache to a file; subsequent calls that start with the same prefix can skip most of the prefill (around 90% of the latency when the shared prefix dominates the prompt).
For an Ollama daemon, the built-in prefix caching (0.4.5+) does the same thing transparently; to confirm, run OLLAMA_DEBUG=1 ollama serve and look for prefix-cache hits in the debug log.
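The saving from prefix reuse depends on how much of the new prompt matches what is already cached. The sketch below is a conceptual illustration over lists of token IDs, not llama.cpp's or Ollama's actual implementation; both function names are hypothetical.

```python
def shared_prefix_len(cached: list[int], new: list[int]) -> int:
    """Number of leading tokens that the cached and new prompts share."""
    count = 0
    for cached_token, new_token in zip(cached, new):
        if cached_token != new_token:
            break
        count += 1
    return count


def prefill_fraction_saved(cached: list[int], new: list[int]) -> float:
    """Share of the new prompt whose prefill can be reused from the cache."""
    if not new:
        return 0.0
    return shared_prefix_len(cached, new) / len(new)


# Illustrative token IDs: a 900-token system prompt followed by
# different user turns of 50 and 100 tokens.
system_prompt = list(range(900))
first_call = system_prompt + [5001] * 50
second_call = system_prompt + [7002] * 100

saved = prefill_fraction_saved(first_call, second_call)
print(f"Reusable prefill: {saved:.0%}")
```

Anything that changes early in the prompt, such as a timestamp in the system message, cuts the shared prefix short and removes most of the benefit.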
Sources
- Alibaba — Qwen 3 release blog (architecture, training data, benchmarks)
- Apple — M4 Max product specs (memory bandwidth, GPU cores)
- Ollama and the Ollama install script
- llama.cpp
- llama.cpp KV-cache quantization discussion
- vLLM
- Community benchmarks: 37 LLMs on MacBook Air M5, TurboQuant on Apple Silicon, and M5 Max 128GB Qwen tests
- General threads at r/LocalLLaMA
