Qwen 3 14B is a strong general-purpose dense model from Alibaba (released April 2025) that runs at 55–80 tok/s at q4_K_M on Apple M3 Ultra Mac Studio, occupying ~9 GB of unified memory at 4K context. It's the practical default for "I want a model that's smarter than 8B but cheaper to run than 32B" — and the M3 Ultra's 819 GB/s memory bandwidth gives it room to run with multiple 14B-class models loaded concurrently if you want to compare outputs side-by-side.
What Qwen 3 14B is
Qwen3-14B (and the instruct variant Qwen3-14B-Instruct) shipped in April 2025 as part of Alibaba's Qwen 3 launch. It's a 14.8B-parameter dense transformer (no MoE here — see Qwen3-30B-A3B for that path), trained on 36T tokens with native 128K context and an optional reasoning mode that emits <think>...</think> blocks before the final answer. The model is multilingual (29 languages strong), and the Apache 2.0 license means it's fully commercial-use-OK.
The reason it matters for Apple Silicon users: at 14B parameters and ~8.5 GB at q4_K_M, it sits in the gap where Llama 3.1 8B starts feeling thin but the 32B-class models are wasting memory on long-context work. It's the model you reach for when an 8B fails to follow a complex instruction but you don't want to pay the latency penalty of a 32B.
Why pair it with an M3 Ultra (and when not to)
The honest answer: you don't need an M3 Ultra to run Qwen 3 14B. An M4 Pro 24 GB Mac mini will run it at ~32 tok/s. So why do this article on M3 Ultra specifically?
Because the M3 Ultra Mac Studio shines when you're using the headroom — running Qwen 3 14B and DeepSeek-R1 32B and an embedding model and a Whisper transcription model concurrently, or hosting an in-house inference endpoint that serves multiple users with continuous batching. At 819 GB/s bandwidth and 96–512 GB unified memory, that's all comfortable.
| Spec | M3 Ultra Mac Studio (96 GB base) |
|---|---|
| CPU cores | 28 |
| GPU cores | 60 (base) / 80 (max) |
| Memory bandwidth | 819 GB/s |
| Unified memory | 96 GB / 192 GB / 256 GB / 512 GB |
| Starting price | $3,999 (96 GB / 1 TB) |
See Apple's M3 Ultra launch for the official spec sheet.
VRAM math for Qwen 3 14B
At q4_K_M:
| Component | Size |
|---|---|
| Weights | ~8.5 GB |
| KV cache, 4K context | ~0.9 GB |
| KV cache, 32K context | ~7.2 GB |
| KV cache, 128K context | ~28.8 GB |
| Runtime overhead | ~1 GB |
Even at 128K context, you're at ~38 GB — comfortable on any M3 Ultra SKU. The 96 GB base model leaves ~55 GB free for additional models or system memory. That's the M3 Ultra's "headroom" advantage in practice: you don't have to choose between long context and other concurrent work.
Install with Ollama
To enable the reasoning mode (the <think>...</think> chain-of-thought):
The /no_think prefix is Qwen 3's documented switch to skip the reasoning block. Reasoning mode adds latency but gives noticeably better answers on multi-step problems.
Install with llama.cpp
The -ctk q8_0 -ctv q8_0 flags enable q8 KV-cache quantization — for a 14B model on an M3 Ultra it doesn't matter much (you have memory to spare), but it's a habit worth keeping for the long-context cases.
Expected tok/s on Apple M3 Ultra
Numbers below from the llama.cpp M-series benchmark thread and our runs on an M3 Ultra 96 GB. q4_K_M, 4K context, batch 1.
| Chip | Prompt eval (pp512) | Generation (tg128) |
|---|---|---|
| M3 Max 14c 36 GB | ~720 tok/s | 32–38 tok/s |
| M3 Max 16c 64 GB | ~900 tok/s | 40–48 tok/s |
| M3 Ultra 60c 96 GB | ~1500 tok/s | 50–65 tok/s |
| M3 Ultra 80c 96 GB | ~1900 tok/s | 60–75 tok/s |
| M3 Ultra 80c 512 GB | ~1900 tok/s | 60–80 tok/s |
| M4 Max 16c 48 GB (reference) | ~1100 tok/s | 38–46 tok/s |
The 80-core M3 Ultra gives roughly a 25% generation tok/s lift over the 60-core for a 14B model — the GPU is closer to saturation than for 8B but still not fully bound. Prompt processing scales more cleanly with core count, so for RAG workloads with 8K+ retrieval contexts, the 80-core is worth the upgrade.
MLX for an additional ~15% speed
On an 80-core M3 Ultra 96 GB, the MLX 4-bit Qwen 3 14B runs at ~85 tok/s — the fastest path on Apple Silicon for this model. The MLX 8-bit variant runs at ~55 tok/s and is worth using for code generation where the extra precision matters.
Comparison: Qwen 3 14B vs Llama 3.1 8B vs Qwen 3 32B
| Aspect | Llama 3.1 8B | Qwen 3 14B | Qwen 3 32B |
|---|---|---|---|
| Memory at q4 | ~4.7 GB | ~8.5 GB | ~19 GB |
| Tok/s on M3 Ultra 80c | 90–110 | 60–80 | 30–45 |
| MMLU (English) | 69 | 78 | 82 |
| HumanEval (code) | 62 | 76 | 84 |
| Math (MATH bench) | 28 | 56 | 67 |
| Reasoning mode | no | yes | yes |
| Best for | speed, classification | general agent work | hard reasoning |
The sweet spot for 14B is "general agent" work — anywhere you're calling tools, processing semi-structured input, or composing a multi-step response. 8B starts to lose the thread at 3+ tool calls; 14B holds up through 5–8; 32B handles 10+ but at 2× the latency.
Quantization ladder
| Quant | Weight size | MMLU delta | When to use |
|---|---|---|---|
| q3_K_M | ~6.3 GB | -1.8% | 16 GB Macs (M3 Max or M4 base) |
| q4_K_M | ~8.5 GB | -0.5% | Recommended default |
| q5_K_M | ~10.5 GB | -0.2% | Math/code sensitive |
| q6_K | ~12.5 GB | -0.1% | Diminishing returns |
| q8_0 | ~15.5 GB | -0.0% | Reference quality |
On an M3 Ultra you have memory to spare — running q8_0 costs you about 12% generation speed vs q4_K_M but eliminates the (already tiny) quantization quality gap. For code generation, q8_0 is worth the speed hit.
Common pitfalls
- Leaving reasoning mode on for chat. The
<think>...</think>blocks add 200–800 tokens of latency before the actual answer starts. For casual chat, turn it off with/no_thinkor setenable_thinking=falsein the API call. - Not increasing context for long docs. Default Ollama context is 2048 tokens. Qwen 3 14B supports 128K natively — use it.
- Comparing tok/s without disclosing batch size. A single-user batch-1 number and a continuous-batching batch-16 number are not the same.
- Running on battery. Qwen 3 14B pulls 18–28 W on an M3 Ultra Mac Studio (plugged in always) but on a MacBook Pro M3 Max with the equivalent model load, you'll see 2–3 hours of battery vs 8+ idle.
- Skipping the system prompt. Qwen 3 14B's default behavior is very polite and verbose. A 2-sentence system prompt setting tone and length changes outputs significantly.
When NOT to use Qwen 3 14B on M3 Ultra
- Pure chat workloads where speed beats nuance. Use Llama 3.1 8B at 90+ tok/s instead.
- Hard reasoning where you want explicit chain-of-thought. Use DeepSeek-R1 32B or Qwen 3 32B — both are stronger at multi-step problems.
- Long-form creative writing where the model's voice matters. Llama 3.1 70B has noticeably more character (see our 70B writeup).
- Production-grade non-English work other than Mandarin. Qwen 3 is multilingual but strongest on Mandarin and English; for French/German production work, test against Mistral models.
Worked example: structured-extraction pipeline
That's enough throughput to extract structured data from a 100k-review backlog in a long weekend — without paying any API costs.
Power, thermals, and concurrent-model scenarios
On an M3 Ultra Mac Studio 80-core with Qwen 3 14B loaded:
| Scenario | Wall power | Memory used |
|---|---|---|
| Idle, no model loaded | 12–18 W | ~6 GB (macOS) |
| 14B model loaded, no generation | 22–28 W | ~15 GB |
| 14B generation @ 70 tok/s | 75–110 W | ~16 GB |
| 14B + 8B both loaded, alternating | 75–110 W during gen | ~22 GB |
| 14B reasoning mode generation | 75–110 W | ~17 GB |
The Mac Studio cooling profile means you can leave the chassis in a shared workspace and not hear it under any of these workloads. Sustained generation at 75–110 W is well within the "fans rarely spin up" envelope; the unified memory architecture means the GPU doesn't fight a separate VRAM bank for thermal headroom the way a discrete GPU does.
The concurrent-model case is the M3 Ultra's signature: a typical agent stack might load Qwen 3 14B (general reasoning), Llama 3.1 8B (fast classifier), and an embedding model (e.g., bge-large) all at once. Total memory usage ~22 GB, leaves 70+ GB free on the base SKU, and switches between models in milliseconds because nothing has to spill to disk. The same setup on a 24 GB M4 Pro Mac mini forces eviction-and-reload every time you switch, adding 3–8 seconds of latency per turn.
Reproducible benchmarks with llama-bench
For validating the numbers in this article on your own hardware:
The -p flag tests prompt processing at three lengths; -n tests generation at three lengths; -r 5 does five repetitions. Output shows mean tok/s with ±sigma. Use this when comparing quantizations, before-and-after a system update, or M-series chip variants. Eyeballing ollama run is unreliable because warmup latency, context length, and OS-side caches all confound the measurement.
Tip: run the benchmark twice. The first run "warms" the GPU caches and unified memory; the second run is your real number. Throw out the first if the chip was cold.
TL;DR
- Qwen 3 14B is the practical Apple-Silicon default for general-purpose dense LLM work.
- Any M3 Ultra Mac Studio runs it at 50–80 tok/s comfortably.
- Use Ollama for ease, llama.cpp for control, MLX for the last 15% speed.
- Pair with Llama 3.1 8B as a fast classifier and Qwen 3 32B as the heavy-lift fallback.
- Disable reasoning mode (
/no_think) for chat workloads. - The 96 GB Mac Studio is the smallest M3 Ultra config; 14B alone wastes 70% of that. Use the headroom for concurrent models.
