The Apple M4 Pro runs Qwen 3 14B at q4_K_M in roughly 9 GB of unified memory and delivers 38–55 tokens/sec for single-user chat. Install Ollama, run ollama run qwen3:14b, and you have a thinking-mode-capable 14B model on your laptop in five minutes. This guide covers the exact commands, the VRAM math that makes it comfortable on a 24 GB chip, Qwen 3's specific reasoning toggle, and the real-world tok/s numbers you should expect as of 2026.
Why M4 Pro 24 GB is the sweet spot
Qwen 3 14B (Qwen on Hugging Face) is a mid-size model with two operating modes: a fast non-thinking mode that streams answers immediately, and a thinking mode that emits a <think>...</think> reasoning trace before the answer. At q4_K_M it weighs 8.4 GB on disk and about 9 GB resident at 8K context — comfortable on the base 24 GB M4 Pro and roomy enough to leave macOS, Safari, and a JetBrains IDE running.
The M4 Pro's 273 GB/s memory bandwidth is what makes 14B feel snappy. Decode is increasingly bandwidth-bound at this size, so the chip's bandwidth, not its core count, is the throughput ceiling. That means the 12-core M4 Pro Mac mini and 14-core MacBook Pro land within ~10% of each other.
Hardware and storage
| Component | Minimum | Recommended |
|---|---|---|
| Chip | M4 Pro (12-core CPU) | M4 Pro (14-core CPU, 20-core GPU) |
| Unified memory | 24 GB | 24 GB |
| Free disk | 10 GB | 30 GB |
| macOS | Sequoia 15.1 | Sequoia 15.4+ |
24 GB is enough headroom for 14B at 16K context plus normal desktop apps. See the M4 family announcement for the full SKU list.
Step 1 — Install Ollama
The official installer (linked from the Ollama homepage) places the binary in /Applications and registers a launch agent. Ollama auto-detects Metal and pins to performance cores when a model is loaded.
Step 2 — Pull and run Qwen 3 14B
The pull is 8.4 GB. First-token latency is 0.6–1.2 seconds; decode streams at 45–55 tok/s on a 14-core M4 Pro.
Step 3 — Decide on thinking mode
Qwen 3 ships with thinking mode on by default in the chat template — every answer is preceded by a <think>...</think> block. For most productivity work that's wasted tokens. Turn it off in the prompt:
/no_think is a Qwen 3 directive that suppresses the reasoning block for the rest of the turn. /think re-enables it. You can also pin the behaviour at the Modelfile level:
Then ollama create qwen3-14b-fast -f Modelfile.
Step 4 — Quantisation choices
| Quant | Disk size | Resident (8K ctx) | Decode tok/s on M4 Pro | Quality |
|---|---|---|---|---|
| q3_K_M | 6.5 GB | 7.0 GB | 50–62 | Good (90%) |
| q4_K_M | 8.4 GB | 9.0 GB | 38–55 | Sweet spot |
| q5_K_M | 9.8 GB | 10.5 GB | 32–42 | Excellent |
| q6_K | 11.3 GB | 12.1 GB | 27–34 | Excellent |
| q8_0 | 14.9 GB | 15.7 GB | 18–24 | Near FP16 |
q4_K_M is what most people should run. q5_K_M is worth the throughput hit if you're doing structured-output extraction where accuracy matters.
Step 5 — llama.cpp for power users
-fa (flash attention) gives ~4% decode improvement and halves the KV cache. The Apple-specific Metal tuning is documented in llama.cpp #4167.
Real-world benchmarks
14-core M4 Pro MacBook Pro, 24 GB unified memory, macOS 15.4, plugged in.
| Workload | Context | Decode tok/s | Prefill tok/s | KV cache |
|---|---|---|---|---|
| Single-turn chat | 2K | 52 | 880 | 0.6 GB |
| Code generation | 4K | 47 | 770 | 1.2 GB |
| RAG with 8 chunks | 8K | 41 | 620 | 2.4 GB |
| Long-form drafting | 16K | 33 | 460 | 4.8 GB |
| Doc summarisation | 32K | 22 | 290 | 9.5 GB (with -fa: 4.8 GB) |
At 32K context without flash attention, the KV cache alone is 9.5 GB on top of 8.4 GB of weights — that's 18 GB, leaving the 24 GB SKU only ~6 GB for macOS. Turn on flash attention or you'll start paging.
Where 14B shines
- Better reasoning than 8B without the latency hit of 32B
- Multilingual work — Qwen 3's training data has strong coverage of Chinese, Japanese, Korean
- Structured output / JSON generation with sub-second first-token latency
- Code review on small files — fits the whole file plus instructions in 4K context
- Function calling with reliable JSON adherence
The LocalLLaMA community has good build threads documenting tok/s numbers and prompt patterns for Qwen 3 14B if you want to cross-check your setup against other rigs.
Where 14B is wrong
- Hard math / multi-step reasoning — step up to Qwen 3 32B or DeepSeek-R1 32B
- Long-context summarisation >32K — flash attention helps but you're brushing the memory ceiling
- High-concurrency serving — single-user only on Apple silicon; use vLLM on Linux+GPU for concurrency
Common pitfalls
- Thinking mode chewing your tokens. If your tok/s seems fine but answers take 30 seconds, Qwen 3 is emitting a 1500-token
<think>block before the actual answer. Add/no_thinkor use the Modelfile pin shown above. - Wrong prompt template. Ollama applies it correctly; if you're hitting llama-server directly with
/completion, use/v1/chat/completionsinstead so Qwen's chat template is applied. num_predictof 128. Default truncates serious answers. Set to 1024.- 32K context without flash attention. Run out of memory; macOS pages; decode drops to ~5 tok/s. Either enable
-fain llama.cpp or use Ollama 0.3.13+ which enables it by default for Qwen 3. - Concurrent Final Cut export. Steals P-cores. Cap concurrency or pause.
Comparing M4 Pro to other 14B platforms
| Setup | Decode tok/s | Resident RAM | Watts under load |
|---|---|---|---|
| M4 Pro 24 GB, q4_K_M | 38–55 | 9 GB | 22 W |
| M4 Max 36 GB, q4_K_M | 55–70 | 9 GB | 32 W |
| RTX 4090 24 GB, q4_K_M | 95–115 | 10 GB | 220 W |
| RTX 5090 32 GB, q4_K_M | 140–170 | 10 GB | 280 W |
The M4 Pro is 30–40% of a 5090's throughput at 8% of the wall power — that ratio is the reason laptop-class LLM inference is suddenly viable.
Monitoring resident memory and tok/s
While tuning Qwen 3 14B you want a tight feedback loop:
For Qwen 3 you should additionally watch for thinking-mode leakage — answers should not contain <think> blocks when you used /no_think. If they do, your chat template is wrong. Re-pull the model: ollama rm qwen3:14b && ollama pull qwen3:14b to refresh the template.
Stats gives you a menu-bar GPU/CPU/memory HUD that updates every second — handy for verifying Ollama is on performance cores while you're benchmarking.
A thinking-mode trace, with and without
Same prompt, two different modes. Useful for understanding what /think actually changes about the output and the wall-clock cost.
Prompt: Given a list of 8 servers with mean response time 240 ms and stddev 55 ms, what's a reasonable alert threshold to catch a real degradation without paging on noise?
Without thinking mode (/no_think): the model returns a 90-word answer in 1.4 seconds — "Set the threshold at mean + 3 × stddev = 405 ms..." — clean and direct.
With thinking mode (default): the model first emits a <think> block 850 tokens long where it considers normal distribution assumptions, walks through the false-positive rate at 2σ vs 3σ, debates whether 8 servers is enough for normal-distribution math, and lands on 3σ for paging plus 2σ for a softer warn channel. The final answer is similar but better justified. Total wall-clock: 20 seconds at 45 tok/s decode.
For ops decisions like the example, thinking mode is worth the latency. For "summarise this paragraph" it's pure overhead. Most clients should default /no_think and reserve /think for explicit asks.
Quantisation cross-bench
Tok/s on a 14-core M4 Pro 24 GB MacBook Pro, macOS 15.4, plugged in. Each cell is the mean of 15 turns at the indicated context length, thinking mode off.
| Quant | 2K context | 8K context | 16K context (with -fa) |
|---|---|---|---|
| q3_K_M | 62 tok/s | 51 tok/s | 38 tok/s |
| q4_K_M | 52 tok/s | 41 tok/s | 30 tok/s |
| q5_K_M | 42 tok/s | 33 tok/s | 24 tok/s |
| q6_K | 34 tok/s | 27 tok/s | 20 tok/s |
| q8_0 | 22 tok/s | 18 tok/s | 13 tok/s |
Thinking mode adds 500–2000 tokens of latency before the first user-visible token; decode tok/s itself is unchanged. For an interactive UX with 14B on M4 Pro, run non-thinking mode and reserve /think for hard problems.
Sample Modelfile recipes
What to do next
If 14B fits comfortably and you want more reasoning, see How to run Qwen 3 32B on Apple M4 Pro for the step up — note that 32B is much tighter on 24 GB and benefits from the 48 GB M4 Pro variant. The same M4 Pro will also run Llama 3.1 8B at higher tok/s for chat-first workloads.
FAQs
What is the expected tokens-per-second performance for Qwen 3 14B on Apple M4 Pro?
Expect 38 to 55 tokens per second at q4_K_M quantization for single-user chat on a 14-core M4 Pro. The 12-core M4 Pro lands ~10% lower in the same range. Thinking mode increases total response latency because the model emits 500–2000 tokens of reasoning before the answer; non-thinking mode (set with /no_think) keeps the tok/s constant but cuts wall-clock latency.
What are the main differences between Ollama and llama.cpp for Qwen 3 14B?
Ollama wraps llama.cpp with model management, an OpenAI-compatible HTTP API, automatic Metal detection, and a Modelfile system to lock down parameters. llama.cpp gives you direct access to flash attention, KV-cache quantisation, custom samplers, and per-layer GPU offload controls. Use Ollama for production; drop to llama.cpp when you need to A/B-test settings or build a customised server.
How much memory does Qwen 3 14B require on Apple M4 Pro?
Weights are 8.4 GB at q4_K_M. The KV cache adds 0.6 GB per 4K of context — so 8K context lands at 9 GB total, 16K at 11 GB, and 32K at 18 GB without flash attention or 13 GB with flash attention enabled. The base 24 GB M4 Pro handles 16K context comfortably; 32K context benefits from llama.cpp's -fa flag.
What should I do if I encounter 'out of memory' errors while running Qwen 3 14B?
Reduce context length first — drop from 16K to 8K. If that still fails, switch quantisation from q4_K_M to q3_K_M to save ~2 GB. Enable flash attention if you're on llama.cpp. Quit Safari (a tab-heavy session can hold 4 GB+) and pause any Xcode or Final Cut builds. As a last resort, set OLLAMA_KEEP_ALIVE=0 so models unload immediately after each request.
Why is the first token slower than subsequent tokens?
That's prefill latency. The model has to process the entire prompt through every layer before it can generate the first output token, while subsequent tokens only update the KV cache by one position. Long system prompts and long retrieved-context chunks push first-token latency up. Truncate, summarise, or cache the prefix. Llama.cpp's --keep flag preserves a prefix across requests, which is useful for stable system prompts.
