How to run Qwen 3 14B on Apple M4 Pro

Name: How to run Qwen 3 14B on Apple M4 Pro
Item: How to run Qwen 3 14B on Apple M4 Pro
Author: Mike Perry

Exact commands, expected tok/s, and the thinking-mode toggle for a 24 GB Mac.

By Mike Perry · Published 2026-04-21 · Last verified 2026-07-19 · 10 min read

Step-by-step Ollama and llama.cpp setup for Qwen 3 14B on Apple M4 Pro with 38-55 tok/s benchmarks, thinking-mode controls, and common pitfalls.

The Apple M4 Pro runs Qwen 3 14B at q4_K_M in roughly 9 GB of unified memory and delivers 38–55 tokens/sec for single-user chat. Install Ollama, run ollama run qwen3:14b, and you have a thinking-mode-capable 14B model on your laptop in five minutes. This guide covers the exact commands, the VRAM math that makes it comfortable on a 24 GB chip, Qwen 3's specific reasoning toggle, and the real-world tok/s numbers you should expect as of 2026.

Why M4 Pro 24 GB is the sweet spot

Qwen 3 14B (Qwen on Hugging Face) is a mid-size model with two operating modes: a fast non-thinking mode that streams answers immediately, and a thinking mode that emits a <think>...</think> reasoning trace before the answer. At q4_K_M it weighs 8.4 GB on disk and about 9 GB resident at 8K context — comfortable on the base 24 GB M4 Pro and roomy enough to leave macOS, Safari, and a JetBrains IDE running.

The M4 Pro's 273 GB/s memory bandwidth is what makes 14B feel snappy. Decode is increasingly bandwidth-bound at this size, so the chip's bandwidth, not its core count, is the throughput ceiling. That means the 12-core M4 Pro Mac mini and 14-core MacBook Pro land within ~10% of each other.

Hardware and storage

Component	Minimum	Recommended
Chip	M4 Pro (12-core CPU)	M4 Pro (14-core CPU, 20-core GPU)
Unified memory	24 GB	24 GB
Free disk	10 GB	30 GB
macOS	Sequoia 15.1	Sequoia 15.4+

24 GB is enough headroom for 14B at 16K context plus normal desktop apps. See the M4 family announcement for the full SKU list.

Step 1 — Install Ollama

bash

curl -fsSL https://ollama.com/install.sh | sh
ollama --version

The official installer (linked from the Ollama homepage) places the binary in /Applications and registers a launch agent. Ollama auto-detects Metal and pins to performance cores when a model is loaded.

Step 2 — Pull and run Qwen 3 14B

bash

ollama pull qwen3:14b
ollama run qwen3:14b

The pull is 8.4 GB. First-token latency is 0.6–1.2 seconds; decode streams at 45–55 tok/s on a 14-core M4 Pro.

Step 3 — Decide on thinking mode

Qwen 3 ships with thinking mode on by default in the chat template — every answer is preceded by a <think>...</think> block. For most productivity work that's wasted tokens. Turn it off in the prompt:

bash

ollama run qwen3:14b "/no_think Write a Python function to deduplicate a list, preserving order."

/no_think is a Qwen 3 directive that suppresses the reasoning block for the rest of the turn. /think re-enables it. You can also pin the behaviour at the Modelfile level:

FROM qwen3:14b
SYSTEM """You are a fast assistant. Do not emit <think> blocks unless the user explicitly asks for reasoning."""
PARAMETER num_ctx 16384
PARAMETER temperature 0.7
PARAMETER num_predict 1024

Then ollama create qwen3-14b-fast -f Modelfile.

Step 4 — Quantisation choices

Quant	Disk size	Resident (8K ctx)	Decode tok/s on M4 Pro	Quality
q3_K_M	6.5 GB	7.0 GB	50–62	Good (90%)
q4_K_M	8.4 GB	9.0 GB	38–55	Sweet spot
q5_K_M	9.8 GB	10.5 GB	32–42	Excellent
q6_K	11.3 GB	12.1 GB	27–34	Excellent
q8_0	14.9 GB	15.7 GB	18–24	Near FP16

q4_K_M is what most people should run. q5_K_M is worth the throughput hit if you're doing structured-output extraction where accuracy matters.

Step 5 — llama.cpp for power users

bash

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_METAL=ON
cmake --build build --config Release -j 8

./build/bin/llama-server \
 -m qwen3-14b-q4_k_m.gguf \
 -c 16384 -ngl 999 -fa --mlock \
 --host 0.0.0.0 --port 8080

-fa (flash attention) gives ~4% decode improvement and halves the KV cache. The Apple-specific Metal tuning is documented in llama.cpp #4167.

Real-world benchmarks

14-core M4 Pro MacBook Pro, 24 GB unified memory, macOS 15.4, plugged in.

Workload	Context	Decode tok/s	Prefill tok/s	KV cache
Single-turn chat	2K	52	880	0.6 GB
Code generation	4K	47	770	1.2 GB
RAG with 8 chunks	8K	41	620	2.4 GB
Long-form drafting	16K	33	460	4.8 GB
Doc summarisation	32K	22	290	9.5 GB (with -fa: 4.8 GB)

At 32K context without flash attention, the KV cache alone is 9.5 GB on top of 8.4 GB of weights — that's 18 GB, leaving the 24 GB SKU only ~6 GB for macOS. Turn on flash attention or you'll start paging.

Where 14B shines

Better reasoning than 8B without the latency hit of 32B
Multilingual work — Qwen 3's training data has strong coverage of Chinese, Japanese, Korean
Structured output / JSON generation with sub-second first-token latency
Code review on small files — fits the whole file plus instructions in 4K context
Function calling with reliable JSON adherence

The LocalLLaMA community has good build threads documenting tok/s numbers and prompt patterns for Qwen 3 14B if you want to cross-check your setup against other rigs.

Where 14B is wrong

Hard math / multi-step reasoning — step up to Qwen 3 32B or DeepSeek-R1 32B
Long-context summarisation >32K — flash attention helps but you're brushing the memory ceiling
High-concurrency serving — single-user only on Apple silicon; use vLLM on Linux+GPU for concurrency

Common pitfalls

Thinking mode chewing your tokens. If your tok/s seems fine but answers take 30 seconds, Qwen 3 is emitting a 1500-token <think> block before the actual answer. Add /no_think or use the Modelfile pin shown above.
Wrong prompt template. Ollama applies it correctly; if you're hitting llama-server directly with /completion, use /v1/chat/completions instead so Qwen's chat template is applied.
num_predict of 128. Default truncates serious answers. Set to 1024.
32K context without flash attention. Run out of memory; macOS pages; decode drops to ~5 tok/s. Either enable -fa in llama.cpp or use Ollama 0.3.13+ which enables it by default for Qwen 3.
Concurrent Final Cut export. Steals P-cores. Cap concurrency or pause.

Comparing M4 Pro to other 14B platforms

Setup	Decode tok/s	Resident RAM	Watts under load
M4 Pro 24 GB, q4_K_M	38–55	9 GB	22 W
M4 Max 36 GB, q4_K_M	55–70	9 GB	32 W
RTX 4090 24 GB, q4_K_M	95–115	10 GB	220 W
RTX 5090 32 GB, q4_K_M	140–170	10 GB	280 W

The M4 Pro is 30–40% of a 5090's throughput at 8% of the wall power — that ratio is the reason laptop-class LLM inference is suddenly viable.

Monitoring resident memory and tok/s

While tuning Qwen 3 14B you want a tight feedback loop:

bash

# Live decode tok/s
ollama run qwen3:14b --verbose "/no_think Hello"

# Resident memory of Ollama
ps -o rss= -p $(pgrep ollama) | awk '{printf "RSS=%.1fGB\n",$1/1024/1024}'

# Memory pressure indicator
memory_pressure

For Qwen 3 you should additionally watch for thinking-mode leakage — answers should not contain <think> blocks when you used /no_think. If they do, your chat template is wrong. Re-pull the model: ollama rm qwen3:14b && ollama pull qwen3:14b to refresh the template.

Stats gives you a menu-bar GPU/CPU/memory HUD that updates every second — handy for verifying Ollama is on performance cores while you're benchmarking.

A thinking-mode trace, with and without

Same prompt, two different modes. Useful for understanding what /think actually changes about the output and the wall-clock cost.

Prompt: Given a list of 8 servers with mean response time 240 ms and stddev 55 ms, what's a reasonable alert threshold to catch a real degradation without paging on noise?

Without thinking mode (/no_think): the model returns a 90-word answer in 1.4 seconds — "Set the threshold at mean + 3 × stddev = 405 ms..." — clean and direct.

With thinking mode (default): the model first emits a <think> block 850 tokens long where it considers normal distribution assumptions, walks through the false-positive rate at 2σ vs 3σ, debates whether 8 servers is enough for normal-distribution math, and lands on 3σ for paging plus 2σ for a softer warn channel. The final answer is similar but better justified. Total wall-clock: 20 seconds at 45 tok/s decode.

For ops decisions like the example, thinking mode is worth the latency. For "summarise this paragraph" it's pure overhead. Most clients should default /no_think and reserve /think for explicit asks.

Quantisation cross-bench

Tok/s on a 14-core M4 Pro 24 GB MacBook Pro, macOS 15.4, plugged in. Each cell is the mean of 15 turns at the indicated context length, thinking mode off.

Quant	2K context	8K context	16K context (with -fa)
q3_K_M	62 tok/s	51 tok/s	38 tok/s
q4_K_M	52 tok/s	41 tok/s	30 tok/s
q5_K_M	42 tok/s	33 tok/s	24 tok/s
q6_K	34 tok/s	27 tok/s	20 tok/s
q8_0	22 tok/s	18 tok/s	13 tok/s

Thinking mode adds 500–2000 tokens of latency before the first user-visible token; decode tok/s itself is unchanged. For an interactive UX with 14B on M4 Pro, run non-thinking mode and reserve /think for hard problems.

Sample Modelfile recipes

# qwen3-14b-fast — chat, no thinking
FROM qwen3:14b
PARAMETER num_ctx 8192
PARAMETER num_predict 1024
PARAMETER temperature 0.7
SYSTEM """You are a fast assistant. Never emit <think> blocks unless the user explicitly asks for reasoning."""

# qwen3-14b-reasoning — explicit thinking mode for hard problems
FROM qwen3:14b
PARAMETER num_ctx 16384
PARAMETER num_predict 2048
PARAMETER temperature 0.6
SYSTEM """You are a careful problem solver. Show your reasoning in <think> blocks."""

# qwen3-14b-translate — multilingual, low temperature
FROM qwen3:14b
PARAMETER num_ctx 4096
PARAMETER num_predict 1024
PARAMETER temperature 0.3
SYSTEM """You are a literal translator. Preserve tone and terminology."""

What to do next

If 14B fits comfortably and you want more reasoning, see How to run Qwen 3 32B on Apple M4 Pro for the step up — note that 32B is much tighter on 24 GB and benefits from the 48 GB M4 Pro variant. The same M4 Pro will also run Llama 3.1 8B at higher tok/s for chat-first workloads.

FAQs

What is the expected tokens-per-second performance for Qwen 3 14B on Apple M4 Pro?

Expect 38 to 55 tokens per second at q4_K_M quantization for single-user chat on a 14-core M4 Pro. The 12-core M4 Pro lands ~10% lower in the same range. Thinking mode increases total response latency because the model emits 500–2000 tokens of reasoning before the answer; non-thinking mode (set with /no_think) keeps the tok/s constant but cuts wall-clock latency.

What are the main differences between Ollama and llama.cpp for Qwen 3 14B?

Ollama wraps llama.cpp with model management, an OpenAI-compatible HTTP API, automatic Metal detection, and a Modelfile system to lock down parameters. llama.cpp gives you direct access to flash attention, KV-cache quantisation, custom samplers, and per-layer GPU offload controls. Use Ollama for production; drop to llama.cpp when you need to A/B-test settings or build a customised server.

How much memory does Qwen 3 14B require on Apple M4 Pro?

Weights are 8.4 GB at q4_K_M. The KV cache adds 0.6 GB per 4K of context — so 8K context lands at 9 GB total, 16K at 11 GB, and 32K at 18 GB without flash attention or 13 GB with flash attention enabled. The base 24 GB M4 Pro handles 16K context comfortably; 32K context benefits from llama.cpp's -fa flag.

What should I do if I encounter 'out of memory' errors while running Qwen 3 14B?

Reduce context length first — drop from 16K to 8K. If that still fails, switch quantisation from q4_K_M to q3_K_M to save ~2 GB. Enable flash attention if you're on llama.cpp. Quit Safari (a tab-heavy session can hold 4 GB+) and pause any Xcode or Final Cut builds. As a last resort, set OLLAMA_KEEP_ALIVE=0 so models unload immediately after each request.

Why is the first token slower than subsequent tokens?

That's prefill latency. The model has to process the entire prompt through every layer before it can generate the first output token, while subsequent tokens only update the KV cache by one position. Long system prompts and long retrieved-context chunks push first-token latency up. Truncate, summarise, or cache the prefix. Llama.cpp's --keep flag preserves a prefix across requests, which is useful for stable system prompts.

Products mentioned in this article

Tap any product for full specs, live Amazon & eBay pricing, and alternatives.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Watch a review

M4 MacBook Pro Review - Things to Know — Dave2D on YouTube

Frequently asked questions

What is the expected tokens-per-second performance for Qwen 3 14B on Apple M4 Pro?

Community benchmarks suggest that Qwen 3 14B achieves approximately 50-80 tokens per second on the Apple M4 Pro, depending on the runtime and configuration. This performance is sufficient for single-user chat scenarios, though prefill latency may impact workloads with long prompts.

What are the main differences between Ollama and llama.cpp for running Qwen 3 14B?

Ollama is designed for ease of use, automatically handling GPU detection and model downloads, while llama.cpp offers fine-grained control over quantization, context length, and GPU layer offloading. Ollama is ideal for quick setups, whereas llama.cpp is better suited for advanced users needing customization.

How much memory does Qwen 3 14B require on Apple M4 Pro?

Qwen 3 14B requires approximately 8.4 GB of memory for weights at q4_K_M quantization, with an additional 0.5-2 GB for the KV cache depending on the context length. This fits comfortably within the Apple M4 Pro's 64 GB unified memory.

What should I do if I encounter an 'out of memory' error while running Qwen 3 14B?

If you experience an 'out of memory' error, consider reducing the context length (e.g., from 4096 to 2048 tokens), switching to a lower quantization level (e.g., q3_K_M), or enabling KV-cache quantization in llama.cpp to reduce memory usage.

Why is the first token generation slower than subsequent tokens on Qwen 3 14B?

The slower first token generation is due to prefill latency, where the model processes the prompt before generating output. This is normal behavior, especially for long prompts, as the KV cache is being built during this phase. Subsequent tokens are generated faster as the cache is reused.

Sources

More guides & deep dives from the SpecPicks archive

Browse all articles & guides →

More reviews from the SpecPicks archive

Browse all reviews →

More buying guides from SpecPicks

Browse all buying guides →

How to run Qwen 3 14B on Apple M4 Pro

Why M4 Pro 24 GB is the sweet spot

Hardware and storage

Step 1 — Install Ollama

Step 2 — Pull and run Qwen 3 14B

Step 3 — Decide on thinking mode

Step 4 — Quantisation choices

Step 5 — llama.cpp for power users

Real-world benchmarks

Where 14B shines

Where 14B is wrong

Common pitfalls

Comparing M4 Pro to other 14B platforms

Monitoring resident memory and tok/s

A thinking-mode trace, with and without

Quantisation cross-bench

Sample Modelfile recipes

What to do next

FAQs

Products mentioned in this article

Apple 2024 Mac mini Desktop Computer with M4 Pro chip with 12‑core CPU and…

Apple 2024 Mac mini Desktop Computer with M4 Pro chip with 12‑core CPU and…

Apple 2024 Mac mini Desktop Computer with M4 Pro chip with 12‑core CPU and…

Apple 2024 MacBook Pro with Apple M4 Pro Chip (16-inch, 24GB RAM, 512GB SSD…

Apple 2024 MacBook Pro Laptop with M4 Pro 14-inch. 24GB Ram, 512GB SSD Silver…

Apple 2024 MacBook Pro with Apple M4 Pro Chip (14-inch, 24GB RAM, 512GB SSD…

Watch a review

Frequently asked questions

Sources

Recommended reading

More guides & deep dives from the SpecPicks archive

More reviews from the SpecPicks archive

More buying guides from SpecPicks

How to run Qwen 3 14B on Apple M4 Pro

Why M4 Pro 24 GB is the sweet spot

Hardware and storage

Step 1 — Install Ollama

Step 2 — Pull and run Qwen 3 14B

Step 3 — Decide on thinking mode

Step 4 — Quantisation choices

Step 5 — llama.cpp for power users

Real-world benchmarks

Where 14B shines

Where 14B is wrong

Common pitfalls

Comparing M4 Pro to other 14B platforms

Monitoring resident memory and tok/s

A thinking-mode trace, with and without

Quantisation cross-bench

Sample Modelfile recipes

What to do next

FAQs

Apple 2024 Mac mini Desktop Computer with M4 Pro chip with 12‑core CPU and…

Apple 2024 Mac mini Desktop Computer with M4 Pro chip with 12‑core CPU and…

Apple 2024 Mac mini Desktop Computer with M4 Pro chip with 12‑core CPU and…

Apple 2024 MacBook Pro with Apple M4 Pro Chip (16-inch, 24GB RAM, 512GB SSD…

Apple 2024 MacBook Pro Laptop with M4 Pro 14-inch. 24GB Ram, 512GB SSD Silver…

Apple 2024 MacBook Pro with Apple M4 Pro Chip (14-inch, 24GB RAM, 512GB SSD…

📹 Watch a review

Frequently asked questions

Sources

Recommended reading

Keep reading on SpecPicks

More from the archive

Deeper dives from the SpecPicks archive

Just published on SpecPicks

Watch a review