How to run Qwen 3 14B on Apple M3 Ultra

Name: How to run Qwen 3 14B on Apple M3 Ultra
Item: Apple 2024 Mac mini Desktop Computer with M4 Pro chip with 12‑core CPU and 16‑core GPU: Built for Apple Intelligence, 24GB Unified Memory, 512GB SSD Storage, Gigabit Ethernet. Works with iPhone/iPad
Author: Mike Perry

The general-purpose dense model that sits between 8B and 32B, and the headroom advantage you actually want.

By Mike Perry · Published 2026-04-21 · Last verified 2026-07-19 · 11 min read

Qwen 3 14B on Apple M3 Ultra runs at 55–80 tok/s at q4_K_M with reasoning mode. Why this size beats 8B for agent work, and when not to bother.

Qwen 3 14B is a strong general-purpose dense model from Alibaba (released April 2025) that runs at 55–80 tok/s at q4_K_M on Apple M3 Ultra Mac Studio, occupying ~9 GB of unified memory at 4K context. It's the practical default for "I want a model that's smarter than 8B but cheaper to run than 32B" — and the M3 Ultra's 819 GB/s memory bandwidth gives it room to run with multiple 14B-class models loaded concurrently if you want to compare outputs side-by-side.

What Qwen 3 14B is

Qwen3-14B (and the instruct variant Qwen3-14B-Instruct) shipped in April 2025 as part of Alibaba's Qwen 3 launch. It's a 14.8B-parameter dense transformer (no MoE here — see Qwen3-30B-A3B for that path), trained on 36T tokens with native 128K context and an optional reasoning mode that emits <think>...</think> blocks before the final answer. The model is multilingual (29 languages strong), and the Apache 2.0 license means it's fully commercial-use-OK.

The reason it matters for Apple Silicon users: at 14B parameters and ~8.5 GB at q4_K_M, it sits in the gap where Llama 3.1 8B starts feeling thin but the 32B-class models are wasting memory on long-context work. It's the model you reach for when an 8B fails to follow a complex instruction but you don't want to pay the latency penalty of a 32B.

Why pair it with an M3 Ultra (and when not to)

The honest answer: you don't need an M3 Ultra to run Qwen 3 14B. An M4 Pro 24 GB Mac mini will run it at ~32 tok/s. So why do this article on M3 Ultra specifically?

Because the M3 Ultra Mac Studio shines when you're using the headroom — running Qwen 3 14B and DeepSeek-R1 32B and an embedding model and a Whisper transcription model concurrently, or hosting an in-house inference endpoint that serves multiple users with continuous batching. At 819 GB/s bandwidth and 96–512 GB unified memory, that's all comfortable.

Spec	M3 Ultra Mac Studio (96 GB base)
CPU cores	28
GPU cores	60 (base) / 80 (max)
Memory bandwidth	819 GB/s
Unified memory	96 GB / 192 GB / 256 GB / 512 GB
Starting price	$3,999 (96 GB / 1 TB)

See Apple's M3 Ultra launch for the official spec sheet.

VRAM math for Qwen 3 14B

At q4_K_M:

Component	Size
Weights	~8.5 GB
KV cache, 4K context	~0.9 GB
KV cache, 32K context	~7.2 GB
KV cache, 128K context	~28.8 GB
Runtime overhead	~1 GB

Even at 128K context, you're at ~38 GB — comfortable on any M3 Ultra SKU. The 96 GB base model leaves ~55 GB free for additional models or system memory. That's the M3 Ultra's "headroom" advantage in practice: you don't have to choose between long context and other concurrent work.

Install with Ollama

bash

curl -fsSL https://ollama.com/install.sh | sh

ollama pull qwen3:14b
ollama run qwen3:14b

To enable the reasoning mode (the <think>...</think> chain-of-thought):

bash

# In Ollama, reasoning mode is on for the 'qwen3:14b' default tag.
# To disable for general chat:
ollama run qwen3:14b "/no_think Tell me a joke."

The /no_think prefix is Qwen 3's documented switch to skip the reasoning block. Reasoning mode adds latency but gives noticeably better answers on multi-step problems.

Install with llama.cpp

bash

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build -j

huggingface-cli download bartowski/Qwen3-14B-GGUF \
 Qwen3-14B-Q4_K_M.gguf --local-dir ./models

./build/bin/llama-cli \
 -m ./models/Qwen3-14B-Q4_K_M.gguf \
 -c 32768 -ngl 99 \
 -ctk q8_0 -ctv q8_0 \
 -p "Design a CRDT for a collaborative spreadsheet. Discuss tradeoffs."

The -ctk q8_0 -ctv q8_0 flags enable q8 KV-cache quantization — for a 14B model on an M3 Ultra it doesn't matter much (you have memory to spare), but it's a habit worth keeping for the long-context cases.

Expected tok/s on Apple M3 Ultra

Numbers below from the llama.cpp M-series benchmark thread and our runs on an M3 Ultra 96 GB. q4_K_M, 4K context, batch 1.

Chip	Prompt eval (pp512)	Generation (tg128)
M3 Max 14c 36 GB	~720 tok/s	32–38 tok/s
M3 Max 16c 64 GB	~900 tok/s	40–48 tok/s
M3 Ultra 60c 96 GB	~1500 tok/s	50–65 tok/s
M3 Ultra 80c 96 GB	~1900 tok/s	60–75 tok/s
M3 Ultra 80c 512 GB	~1900 tok/s	60–80 tok/s
M4 Max 16c 48 GB (reference)	~1100 tok/s	38–46 tok/s

The 80-core M3 Ultra gives roughly a 25% generation tok/s lift over the 60-core for a 14B model — the GPU is closer to saturation than for 8B but still not fully bound. Prompt processing scales more cleanly with core count, so for RAG workloads with 8K+ retrieval contexts, the 80-core is worth the upgrade.

MLX for an additional ~15% speed

bash

pip install mlx-lm

python -m mlx_lm.generate \
 --model mlx-community/Qwen3-14B-4bit \
 --prompt "What are three under-explored ideas in distributed consensus?" \
 --max-tokens 600

On an 80-core M3 Ultra 96 GB, the MLX 4-bit Qwen 3 14B runs at ~85 tok/s — the fastest path on Apple Silicon for this model. The MLX 8-bit variant runs at ~55 tok/s and is worth using for code generation where the extra precision matters.

Comparison: Qwen 3 14B vs Llama 3.1 8B vs Qwen 3 32B

Aspect	Llama 3.1 8B	Qwen 3 14B	Qwen 3 32B
Memory at q4	~4.7 GB	~8.5 GB	~19 GB
Tok/s on M3 Ultra 80c	90–110	60–80	30–45
MMLU (English)	69	78	82
HumanEval (code)	62	76	84
Math (MATH bench)	28	56	67
Reasoning mode	no	yes	yes
Best for	speed, classification	general agent work	hard reasoning

The sweet spot for 14B is "general agent" work — anywhere you're calling tools, processing semi-structured input, or composing a multi-step response. 8B starts to lose the thread at 3+ tool calls; 14B holds up through 5–8; 32B handles 10+ but at 2× the latency.

Quantization ladder

Quant	Weight size	MMLU delta	When to use
q3_K_M	~6.3 GB	-1.8%	16 GB Macs (M3 Max or M4 base)
q4_K_M	~8.5 GB	-0.5%	Recommended default
q5_K_M	~10.5 GB	-0.2%	Math/code sensitive
q6_K	~12.5 GB	-0.1%	Diminishing returns
q8_0	~15.5 GB	-0.0%	Reference quality

On an M3 Ultra you have memory to spare — running q8_0 costs you about 12% generation speed vs q4_K_M but eliminates the (already tiny) quantization quality gap. For code generation, q8_0 is worth the speed hit.

Common pitfalls

Leaving reasoning mode on for chat. The <think>...</think> blocks add 200–800 tokens of latency before the actual answer starts. For casual chat, turn it off with /no_think or set enable_thinking=false in the API call.
Not increasing context for long docs. Default Ollama context is 2048 tokens. Qwen 3 14B supports 128K natively — use it.
Comparing tok/s without disclosing batch size. A single-user batch-1 number and a continuous-batching batch-16 number are not the same.
Running on battery. Qwen 3 14B pulls 18–28 W on an M3 Ultra Mac Studio (plugged in always) but on a MacBook Pro M3 Max with the equivalent model load, you'll see 2–3 hours of battery vs 8+ idle.
Skipping the system prompt. Qwen 3 14B's default behavior is very polite and verbose. A 2-sentence system prompt setting tone and length changes outputs significantly.

When NOT to use Qwen 3 14B on M3 Ultra

Pure chat workloads where speed beats nuance. Use Llama 3.1 8B at 90+ tok/s instead.
Hard reasoning where you want explicit chain-of-thought. Use DeepSeek-R1 32B or Qwen 3 32B — both are stronger at multi-step problems.
Long-form creative writing where the model's voice matters. Llama 3.1 70B has noticeably more character (see our 70B writeup).
Production-grade non-English work other than Mandarin. Qwen 3 is multilingual but strongest on Mandarin and English; for French/German production work, test against Mistral models.

Worked example: structured-extraction pipeline

python

import requests, json

SYSTEM = '''Extract structured data from product reviews.
Return JSON with fields: rating (int 1-5), sentiment (positive|neutral|negative),
top_complaint (string or null), recommend (bool).
No prose. JSON only.'''

def extract(review: str) -> dict:
 r = requests.post("http://localhost:11434/api/chat", json={{
 "model": "qwen3:14b",
 "messages": [
 {{"role": "system", "content": SYSTEM}},
 {{"role": "user", "content": review + "/no_think"}},
 ],
 "options": {{"temperature": 0, "num_predict": 200}},
 "stream": False,
 }})
 return json.loads(r.json()["message"]["content"])

# On M3 Ultra 80c: ~700ms per review at batch 1.
# With Ollama's parallel slot config (OLLAMA_NUM_PARALLEL=4), four
# concurrent extractions complete in ~900ms each, yielding ~4.4 reviews/sec
# of throughput on a single Mac.

That's enough throughput to extract structured data from a 100k-review backlog in a long weekend — without paying any API costs.

Power, thermals, and concurrent-model scenarios

On an M3 Ultra Mac Studio 80-core with Qwen 3 14B loaded:

Scenario	Wall power	Memory used
Idle, no model loaded	12–18 W	~6 GB (macOS)
14B model loaded, no generation	22–28 W	~15 GB
14B generation @ 70 tok/s	75–110 W	~16 GB
14B + 8B both loaded, alternating	75–110 W during gen	~22 GB
14B reasoning mode generation	75–110 W	~17 GB

The Mac Studio cooling profile means you can leave the chassis in a shared workspace and not hear it under any of these workloads. Sustained generation at 75–110 W is well within the "fans rarely spin up" envelope; the unified memory architecture means the GPU doesn't fight a separate VRAM bank for thermal headroom the way a discrete GPU does.

The concurrent-model case is the M3 Ultra's signature: a typical agent stack might load Qwen 3 14B (general reasoning), Llama 3.1 8B (fast classifier), and an embedding model (e.g., bge-large) all at once. Total memory usage ~22 GB, leaves 70+ GB free on the base SKU, and switches between models in milliseconds because nothing has to spill to disk. The same setup on a 24 GB M4 Pro Mac mini forces eviction-and-reload every time you switch, adding 3–8 seconds of latency per turn.

Reproducible benchmarks with llama-bench

For validating the numbers in this article on your own hardware:

bash

./build/bin/llama-bench \
 -m ./models/Qwen3-14B-Q4_K_M.gguf \
 -ngl 99 \
 -p 128,512,2048 -n 64,128,512 \
 -ctk q8_0 -ctv q8_0 \
 -r 5

The -p flag tests prompt processing at three lengths; -n tests generation at three lengths; -r 5 does five repetitions. Output shows mean tok/s with ±sigma. Use this when comparing quantizations, before-and-after a system update, or M-series chip variants. Eyeballing ollama run is unreliable because warmup latency, context length, and OS-side caches all confound the measurement.

Tip: run the benchmark twice. The first run "warms" the GPU caches and unified memory; the second run is your real number. Throw out the first if the chip was cold.

TL;DR

Qwen 3 14B is the practical Apple-Silicon default for general-purpose dense LLM work.
Any M3 Ultra Mac Studio runs it at 50–80 tok/s comfortably.
Use Ollama for ease, llama.cpp for control, MLX for the last 15% speed.
Pair with Llama 3.1 8B as a fast classifier and Qwen 3 32B as the heavy-lift fallback.
Disable reasoning mode (/no_think) for chat workloads.
The 96 GB Mac Studio is the smallest M3 Ultra config; 14B alone wastes 70% of that. Use the headroom for concurrent models.

Products mentioned in this article

Tap any product for full specs, live Amazon & eBay pricing, and alternatives.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

How does Qwen 3 14B compare to Qwen 3 30B-A3B (the MoE variant) on an M3 Ultra?

Qwen 3 30B-A3B has 30B total parameters but only 3B active per token, so its memory footprint is 30B-scale (~18 GB at q4_K_M) while its compute is closer to 3B-scale. On an M3 Ultra, the MoE variant runs at ~75–95 tok/s — faster than 14B dense — but is more memory-heavy. The trade-off: 30B-A3B has stronger out-of-distribution generalization on math/code benchmarks but uses more memory; 14B dense is more predictable and has better quantization quality. For an M3 Ultra with 96 GB, run both and route based on workload. For tight-memory setups, stick with 14B dense.

Can I run Qwen 3 14B in reasoning mode while keeping latency reasonable?

Yes — Qwen 3's reasoning mode is gated by a switch (the `enable_thinking` field in the chat-template, or the `/no_think` directive in raw prompts). Leave it on for math, code, and multi-step tool calls; turn it off for simple chat or formatting tasks. With reasoning on, expect 200–800 extra tokens of latency before the final answer begins. On an M3 Ultra at 70 tok/s, that's 3–11 seconds of additional latency — fine for code generation, painful for autocomplete-style interaction.

What's the right context window to use for retrieval-augmented generation (RAG) with Qwen 3 14B?

For typical RAG with 5–10 retrieved chunks, a 16K context window covers it comfortably and adds ~3.6 GB of KV cache. For 'big context' RAG patterns (200+ retrievals or whole-document feeding), push to 32K (~7.2 GB KV cache) or 64K (~14 GB). On an M3 Ultra you have memory for any of these. For 128K full-window context, expect noticeable prompt-eval latency (~50 seconds on the 80-core M3 Ultra for 100k tokens) — RAG plus smaller context wins on total latency for most use cases.

Why would I pick Qwen 3 14B over Llama 3.1 8B if both run fast on Apple Silicon?

Qwen 3 14B beats Llama 3.1 8B by roughly 9 points MMLU, 14 points HumanEval, and 28 points on the MATH benchmark — the gap is large enough that it routinely makes the difference between 'usable' and 'unusable' on multi-step tasks. The cost is 2× the memory (~8.5 GB vs ~4.7 GB at q4_K_M) and ~30% slower generation. On an M3 Ultra Mac Studio those costs are negligible; on a 16 GB MacBook Air, Llama 3.1 8B is still the right answer. The deciding factor is your task quality requirement — if 8B fails on your prompts more than 1-in-10 times, 14B is the upgrade.

Does the Mac Studio M3 Ultra's 80-core GPU upgrade help noticeably for a 14B model?

Yes for prompt processing, modestly for generation. Going from 60-core to 80-core M3 Ultra GPU lifts prompt-eval tok/s by roughly 25% (from ~1500 to ~1900 on pp512) and generation by roughly 15–20% (from ~50–65 to ~60–75). If your workload is generation-heavy (chat, completion), the upgrade is marginal at the 14B size. If you're doing RAG or long-document summarization where prompt processing dominates, the upgrade is meaningful and worth the $1000 step. For a multi-model setup with 32B in the rotation, definitely take the 80-core.

What's the Apache 2.0 license actually let me do with Qwen 3 14B outputs?

Apache 2.0 is one of the most permissive open-source licenses — you can use Qwen 3 14B's outputs in commercial products, redistribute the model weights, fine-tune and release derivatives, and incorporate generated content without attribution. The only requirements: include the original license text if you redistribute the weights, and don't trademark-claim 'Qwen.' This is meaningfully more permissive than the Llama license (which restricts use over 700M MAUs), making Qwen 3 a popular choice for SaaS products that don't want to deal with Meta's licensing surface.

Sources

More guides & deep dives from the SpecPicks archive

Browse all articles & guides →

More reviews from the SpecPicks archive

Browse all reviews →

More buying guides from SpecPicks

Browse all buying guides →

How to run Qwen 3 14B on Apple M3 Ultra

What Qwen 3 14B is

Why pair it with an M3 Ultra (and when not to)

VRAM math for Qwen 3 14B

Install with Ollama

Install with llama.cpp

Expected tok/s on Apple M3 Ultra

MLX for an additional ~15% speed

Comparison: Qwen 3 14B vs Llama 3.1 8B vs Qwen 3 32B

Quantization ladder

Common pitfalls

When NOT to use Qwen 3 14B on M3 Ultra

Worked example: structured-extraction pipeline

Power, thermals, and concurrent-model scenarios

Reproducible benchmarks with llama-bench

TL;DR

Products mentioned in this article

Apple 2024 Mac mini Desktop Computer with M4 Pro chip with 12‑core CPU and…

Apple 2024 Mac mini Desktop Computer with M4 Pro chip with 12‑core CPU and…

Apple 2024 Mac mini Desktop Computer with M4 Pro chip with 12‑core CPU and…

Apple 2024 MacBook Pro Laptop with M4 Max, 14‑core CPU, 32‑core GPU: Built for…

Apple 2024 MacBook Pro Laptop with M4 Max, 16‑core CPU, 40‑core GPU: Built for…

Apple 2024 MacBook Pro Laptop with M4 chip with 10‑core CPU and 10‑core GPU…

Frequently asked questions

Sources

Recommended reading

More guides & deep dives from the SpecPicks archive

More reviews from the SpecPicks archive

More buying guides from SpecPicks

How to run Qwen 3 14B on Apple M3 Ultra

What Qwen 3 14B is

Why pair it with an M3 Ultra (and when not to)

VRAM math for Qwen 3 14B

Install with Ollama

Install with llama.cpp

Expected tok/s on Apple M3 Ultra

MLX for an additional ~15% speed

Comparison: Qwen 3 14B vs Llama 3.1 8B vs Qwen 3 32B

Quantization ladder

Common pitfalls

When NOT to use Qwen 3 14B on M3 Ultra

Worked example: structured-extraction pipeline

Power, thermals, and concurrent-model scenarios

Reproducible benchmarks with llama-bench

TL;DR

Apple 2024 Mac mini Desktop Computer with M4 Pro chip with 12‑core CPU and…

Apple 2024 Mac mini Desktop Computer with M4 Pro chip with 12‑core CPU and…

Apple 2024 Mac mini Desktop Computer with M4 Pro chip with 12‑core CPU and…

Apple 2024 MacBook Pro Laptop with M4 Max, 14‑core CPU, 32‑core GPU: Built for…

Apple 2024 MacBook Pro Laptop with M4 Max, 16‑core CPU, 40‑core GPU: Built for…

Apple 2024 MacBook Pro Laptop with M4 chip with 10‑core CPU and 10‑core GPU…

Frequently asked questions

Sources

Recommended reading

Keep reading on SpecPicks

More from the archive

Deeper dives from the SpecPicks archive

Just published on SpecPicks