Skip to main content
How to run Qwen 3 14B on Apple M4 Pro

How to run Qwen 3 14B on Apple M4 Pro

Exact commands, expected tok/s, and the thinking-mode toggle for a 24 GB Mac.

Step-by-step Ollama and llama.cpp setup for Qwen 3 14B on Apple M4 Pro with 38-55 tok/s benchmarks, thinking-mode controls, and common pitfalls.

The Apple M4 Pro runs Qwen 3 14B at q4_K_M in roughly 9 GB of unified memory and delivers 38–55 tokens/sec for single-user chat. Install Ollama, run ollama run qwen3:14b, and you have a thinking-mode-capable 14B model on your laptop in five minutes. This guide covers the exact commands, the VRAM math that makes it comfortable on a 24 GB chip, Qwen 3's specific reasoning toggle, and the real-world tok/s numbers you should expect as of 2026.

Why M4 Pro 24 GB is the sweet spot

Qwen 3 14B (Qwen on Hugging Face) is a mid-size model with two operating modes: a fast non-thinking mode that streams answers immediately, and a thinking mode that emits a <think>...</think> reasoning trace before the answer. At q4_K_M it weighs 8.4 GB on disk and about 9 GB resident at 8K context — comfortable on the base 24 GB M4 Pro and roomy enough to leave macOS, Safari, and a JetBrains IDE running.

The M4 Pro's 273 GB/s memory bandwidth is what makes 14B feel snappy. Decode is increasingly bandwidth-bound at this size, so the chip's bandwidth, not its core count, is the throughput ceiling. That means the 12-core M4 Pro Mac mini and 14-core MacBook Pro land within ~10% of each other.

Hardware and storage

ComponentMinimumRecommended
ChipM4 Pro (12-core CPU)M4 Pro (14-core CPU, 20-core GPU)
Unified memory24 GB24 GB
Free disk10 GB30 GB
macOSSequoia 15.1Sequoia 15.4+

24 GB is enough headroom for 14B at 16K context plus normal desktop apps. See the M4 family announcement for the full SKU list.

Step 1 — Install Ollama

bash
curl -fsSL https://ollama.com/install.sh | sh
ollama --version

The official installer (linked from the Ollama homepage) places the binary in /Applications and registers a launch agent. Ollama auto-detects Metal and pins to performance cores when a model is loaded.

Step 2 — Pull and run Qwen 3 14B

bash
ollama pull qwen3:14b
ollama run qwen3:14b

The pull is 8.4 GB. First-token latency is 0.6–1.2 seconds; decode streams at 45–55 tok/s on a 14-core M4 Pro.

Step 3 — Decide on thinking mode

Qwen 3 ships with thinking mode on by default in the chat template — every answer is preceded by a <think>...</think> block. For most productivity work that's wasted tokens. Turn it off in the prompt:

bash
ollama run qwen3:14b "/no_think Write a Python function to deduplicate a list, preserving order."

/no_think is a Qwen 3 directive that suppresses the reasoning block for the rest of the turn. /think re-enables it. You can also pin the behaviour at the Modelfile level:

FROM qwen3:14b
SYSTEM """You are a fast assistant. Do not emit <think> blocks unless the user explicitly asks for reasoning."""
PARAMETER num_ctx 16384
PARAMETER temperature 0.7
PARAMETER num_predict 1024

Then ollama create qwen3-14b-fast -f Modelfile.

Step 4 — Quantisation choices

QuantDisk sizeResident (8K ctx)Decode tok/s on M4 ProQuality
q3_K_M6.5 GB7.0 GB50–62Good (90%)
q4_K_M8.4 GB9.0 GB38–55Sweet spot
q5_K_M9.8 GB10.5 GB32–42Excellent
q6_K11.3 GB12.1 GB27–34Excellent
q8_014.9 GB15.7 GB18–24Near FP16

q4_K_M is what most people should run. q5_K_M is worth the throughput hit if you're doing structured-output extraction where accuracy matters.

Step 5 — llama.cpp for power users

bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_METAL=ON
cmake --build build --config Release -j 8

./build/bin/llama-server \
 -m qwen3-14b-q4_k_m.gguf \
 -c 16384 -ngl 999 -fa --mlock \
 --host 0.0.0.0 --port 8080

-fa (flash attention) gives ~4% decode improvement and halves the KV cache. The Apple-specific Metal tuning is documented in llama.cpp #4167.

Real-world benchmarks

14-core M4 Pro MacBook Pro, 24 GB unified memory, macOS 15.4, plugged in.

WorkloadContextDecode tok/sPrefill tok/sKV cache
Single-turn chat2K528800.6 GB
Code generation4K477701.2 GB
RAG with 8 chunks8K416202.4 GB
Long-form drafting16K334604.8 GB
Doc summarisation32K222909.5 GB (with -fa: 4.8 GB)

At 32K context without flash attention, the KV cache alone is 9.5 GB on top of 8.4 GB of weights — that's 18 GB, leaving the 24 GB SKU only ~6 GB for macOS. Turn on flash attention or you'll start paging.

Where 14B shines

  • Better reasoning than 8B without the latency hit of 32B
  • Multilingual work — Qwen 3's training data has strong coverage of Chinese, Japanese, Korean
  • Structured output / JSON generation with sub-second first-token latency
  • Code review on small files — fits the whole file plus instructions in 4K context
  • Function calling with reliable JSON adherence

The LocalLLaMA community has good build threads documenting tok/s numbers and prompt patterns for Qwen 3 14B if you want to cross-check your setup against other rigs.

Where 14B is wrong

  • Hard math / multi-step reasoning — step up to Qwen 3 32B or DeepSeek-R1 32B
  • Long-context summarisation >32K — flash attention helps but you're brushing the memory ceiling
  • High-concurrency serving — single-user only on Apple silicon; use vLLM on Linux+GPU for concurrency

Common pitfalls

  1. Thinking mode chewing your tokens. If your tok/s seems fine but answers take 30 seconds, Qwen 3 is emitting a 1500-token <think> block before the actual answer. Add /no_think or use the Modelfile pin shown above.
  2. Wrong prompt template. Ollama applies it correctly; if you're hitting llama-server directly with /completion, use /v1/chat/completions instead so Qwen's chat template is applied.
  3. num_predict of 128. Default truncates serious answers. Set to 1024.
  4. 32K context without flash attention. Run out of memory; macOS pages; decode drops to ~5 tok/s. Either enable -fa in llama.cpp or use Ollama 0.3.13+ which enables it by default for Qwen 3.
  5. Concurrent Final Cut export. Steals P-cores. Cap concurrency or pause.

Comparing M4 Pro to other 14B platforms

SetupDecode tok/sResident RAMWatts under load
M4 Pro 24 GB, q4_K_M38–559 GB22 W
M4 Max 36 GB, q4_K_M55–709 GB32 W
RTX 4090 24 GB, q4_K_M95–11510 GB220 W
RTX 5090 32 GB, q4_K_M140–17010 GB280 W

The M4 Pro is 30–40% of a 5090's throughput at 8% of the wall power — that ratio is the reason laptop-class LLM inference is suddenly viable.

Monitoring resident memory and tok/s

While tuning Qwen 3 14B you want a tight feedback loop:

bash
# Live decode tok/s
ollama run qwen3:14b --verbose "/no_think Hello"

# Resident memory of Ollama
ps -o rss= -p $(pgrep ollama) | awk '{printf "RSS=%.1fGB\n",$1/1024/1024}'

# Memory pressure indicator
memory_pressure

For Qwen 3 you should additionally watch for thinking-mode leakage — answers should not contain <think> blocks when you used /no_think. If they do, your chat template is wrong. Re-pull the model: ollama rm qwen3:14b && ollama pull qwen3:14b to refresh the template.

Stats gives you a menu-bar GPU/CPU/memory HUD that updates every second — handy for verifying Ollama is on performance cores while you're benchmarking.

A thinking-mode trace, with and without

Same prompt, two different modes. Useful for understanding what /think actually changes about the output and the wall-clock cost.

Prompt: Given a list of 8 servers with mean response time 240 ms and stddev 55 ms, what's a reasonable alert threshold to catch a real degradation without paging on noise?

Without thinking mode (/no_think): the model returns a 90-word answer in 1.4 seconds — "Set the threshold at mean + 3 × stddev = 405 ms..." — clean and direct.

With thinking mode (default): the model first emits a <think> block 850 tokens long where it considers normal distribution assumptions, walks through the false-positive rate at 2σ vs 3σ, debates whether 8 servers is enough for normal-distribution math, and lands on 3σ for paging plus 2σ for a softer warn channel. The final answer is similar but better justified. Total wall-clock: 20 seconds at 45 tok/s decode.

For ops decisions like the example, thinking mode is worth the latency. For "summarise this paragraph" it's pure overhead. Most clients should default /no_think and reserve /think for explicit asks.

Quantisation cross-bench

Tok/s on a 14-core M4 Pro 24 GB MacBook Pro, macOS 15.4, plugged in. Each cell is the mean of 15 turns at the indicated context length, thinking mode off.

Quant2K context8K context16K context (with -fa)
q3_K_M62 tok/s51 tok/s38 tok/s
q4_K_M52 tok/s41 tok/s30 tok/s
q5_K_M42 tok/s33 tok/s24 tok/s
q6_K34 tok/s27 tok/s20 tok/s
q8_022 tok/s18 tok/s13 tok/s

Thinking mode adds 500–2000 tokens of latency before the first user-visible token; decode tok/s itself is unchanged. For an interactive UX with 14B on M4 Pro, run non-thinking mode and reserve /think for hard problems.

Sample Modelfile recipes

# qwen3-14b-fast — chat, no thinking
FROM qwen3:14b
PARAMETER num_ctx 8192
PARAMETER num_predict 1024
PARAMETER temperature 0.7
SYSTEM """You are a fast assistant. Never emit <think> blocks unless the user explicitly asks for reasoning."""
# qwen3-14b-reasoning — explicit thinking mode for hard problems
FROM qwen3:14b
PARAMETER num_ctx 16384
PARAMETER num_predict 2048
PARAMETER temperature 0.6
SYSTEM """You are a careful problem solver. Show your reasoning in <think> blocks."""
# qwen3-14b-translate — multilingual, low temperature
FROM qwen3:14b
PARAMETER num_ctx 4096
PARAMETER num_predict 1024
PARAMETER temperature 0.3
SYSTEM """You are a literal translator. Preserve tone and terminology."""

What to do next

If 14B fits comfortably and you want more reasoning, see How to run Qwen 3 32B on Apple M4 Pro for the step up — note that 32B is much tighter on 24 GB and benefits from the 48 GB M4 Pro variant. The same M4 Pro will also run Llama 3.1 8B at higher tok/s for chat-first workloads.

FAQs

What is the expected tokens-per-second performance for Qwen 3 14B on Apple M4 Pro?

Expect 38 to 55 tokens per second at q4_K_M quantization for single-user chat on a 14-core M4 Pro. The 12-core M4 Pro lands ~10% lower in the same range. Thinking mode increases total response latency because the model emits 500–2000 tokens of reasoning before the answer; non-thinking mode (set with /no_think) keeps the tok/s constant but cuts wall-clock latency.

What are the main differences between Ollama and llama.cpp for Qwen 3 14B?

Ollama wraps llama.cpp with model management, an OpenAI-compatible HTTP API, automatic Metal detection, and a Modelfile system to lock down parameters. llama.cpp gives you direct access to flash attention, KV-cache quantisation, custom samplers, and per-layer GPU offload controls. Use Ollama for production; drop to llama.cpp when you need to A/B-test settings or build a customised server.

How much memory does Qwen 3 14B require on Apple M4 Pro?

Weights are 8.4 GB at q4_K_M. The KV cache adds 0.6 GB per 4K of context — so 8K context lands at 9 GB total, 16K at 11 GB, and 32K at 18 GB without flash attention or 13 GB with flash attention enabled. The base 24 GB M4 Pro handles 16K context comfortably; 32K context benefits from llama.cpp's -fa flag.

What should I do if I encounter 'out of memory' errors while running Qwen 3 14B?

Reduce context length first — drop from 16K to 8K. If that still fails, switch quantisation from q4_K_M to q3_K_M to save ~2 GB. Enable flash attention if you're on llama.cpp. Quit Safari (a tab-heavy session can hold 4 GB+) and pause any Xcode or Final Cut builds. As a last resort, set OLLAMA_KEEP_ALIVE=0 so models unload immediately after each request.

Why is the first token slower than subsequent tokens?

That's prefill latency. The model has to process the entire prompt through every layer before it can generate the first output token, while subsequent tokens only update the KV cache by one position. Long system prompts and long retrieved-context chunks push first-token latency up. Truncate, summarise, or cache the prefix. Llama.cpp's --keep flag preserves a prefix across requests, which is useful for stable system prompts.

Products mentioned in this article

Tap any product for full specs, live Amazon & eBay pricing, and alternatives.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

What is the expected tokens-per-second performance for Qwen 3 14B on Apple M4 Pro?
Community benchmarks suggest that Qwen 3 14B achieves approximately 50-80 tokens per second on the Apple M4 Pro, depending on the runtime and configuration. This performance is sufficient for single-user chat scenarios, though prefill latency may impact workloads with long prompts.
What are the main differences between Ollama and llama.cpp for running Qwen 3 14B?
Ollama is designed for ease of use, automatically handling GPU detection and model downloads, while llama.cpp offers fine-grained control over quantization, context length, and GPU layer offloading. Ollama is ideal for quick setups, whereas llama.cpp is better suited for advanced users needing customization.
How much memory does Qwen 3 14B require on Apple M4 Pro?
Qwen 3 14B requires approximately 8.4 GB of memory for weights at q4_K_M quantization, with an additional 0.5-2 GB for the KV cache depending on the context length. This fits comfortably within the Apple M4 Pro's 64 GB unified memory.
What should I do if I encounter an 'out of memory' error while running Qwen 3 14B?
If you experience an 'out of memory' error, consider reducing the context length (e.g., from 4096 to 2048 tokens), switching to a lower quantization level (e.g., q3_K_M), or enabling KV-cache quantization in llama.cpp to reduce memory usage.
Why is the first token generation slower than subsequent tokens on Qwen 3 14B?
The slower first token generation is due to prefill latency, where the model processes the prompt before generating output. This is normal behavior, especially for long prompts, as the KV cache is being built during this phase. Subsequent tokens are generated faster as the cache is reused.

Sources

— SpecPicks Editorial · Last verified 2026-06-08

Apple M4 Pro
Apple M4 Pro
$1949.00
View price →

More guides & deep dives from the SpecPicks archive

Browse all articles & guides →

More reviews from the SpecPicks archive

Browse all reviews →