A 36GB M4 Max MacBook Pro runs DeepSeek-R1 32B at q4_K_M in roughly 19 GB of unified memory and delivers 25–40 tokens/sec for single-user chat. Install Ollama, run ollama run deepseek-r1:32b, and you have a private reasoning model on your laptop in under ten minutes — no GPU rack, no API bills, no data leaving the machine. The setup below covers the exact commands, the VRAM math that makes it fit, and the real-world tok/s numbers you should expect as of 2026.
Why this combination works
The DeepSeek-R1 32B distill (DeepSeek-AI on Hugging Face) is a chain-of-thought reasoning model trained to think before it answers. At q4_K_M quantization it weighs about 18.5 GB on disk and a touch more in memory once the KV cache spins up. The 14-core M4 Max with 36 GB unified memory leaves you ~16 GB of headroom for macOS, your KV cache, and any other apps. That makes the M4 Max the lowest-tier Apple silicon part where 32B models are comfortable rather than fragile.
Two things matter for inference speed on Apple silicon:
- Memory bandwidth. The M4 Max has 410 GB/s — high enough that 32B models stop being bandwidth-bound at low context and start being compute-bound. The M4 Pro tops out at ~273 GB/s, which is why 32B feels noticeably slower there.
- Metal kernels. Both Ollama and llama.cpp ship optimised Metal kernels for matmul-heavy decode. You inherit those for free; you don't need to tune anything.
No external GPU, no CUDA wrangling, no Docker. The model lives on your SSD, gets memory-mapped at start, and the Metal backend does the rest.
Hardware and storage requirements
| Component | Minimum | Recommended |
|---|---|---|
| Chip | M4 Max (14-core CPU) | M4 Max (16-core CPU, 40-core GPU) |
| Unified memory | 36 GB | 48 GB or 64 GB |
| Free disk | 25 GB | 50 GB (multiple quants) |
| macOS | Sequoia 15.1 | Sequoia 15.4+ |
| Power | Plugged in | Plugged in (decode pulls 35–55 W) |
The 36 GB SKU works. Going to 48 GB or 64 GB opens up 128K context windows and lets you keep a second model resident — see the Apple M4 family launch notes for the full memory matrix.
Step 1 — Install Ollama
The fastest path is the official installer:
Ollama places itself in /Applications and registers a launch agent. It will pin itself to performance cores when a model is loaded and unmount automatically when idle for five minutes — which matters on battery.
If you prefer Homebrew, brew install ollama works too, but the curl installer is what the official docs point at on the Ollama homepage.
Step 2 — Pull and run DeepSeek-R1 32B
First pull is ~19 GB. On a 5 Gbps connection that lands in 35–50 seconds; on residential cable it's closer to four minutes. The first prompt takes 6–12 seconds of warm-up while the weights are paged in and Metal compiles its kernels; subsequent prompts feel snappy.
To use it from another process via Ollama's OpenAI-compatible endpoint:
DeepSeek-R1 returns a <think>...</think> block followed by the answer. Most clients render only the answer; if you want to see the reasoning trace, stream the response and don't filter.
Step 3 — Tune for the workload you actually have
The defaults are conservative. These four flags cover 90% of the tuning you'll ever need:
| Setting | Default | Tune to | When |
|---|---|---|---|
num_ctx | 4096 | 16384 | Long-form drafting, RAG with big chunks |
num_predict | 128 | 1024 | Reasoning answers that need space to think |
num_thread | auto | 8–10 | Capping CPU threads on a thermal-limited 14-inch |
repeat_penalty | 1.1 | 1.05 | Reasoning tasks (penalty above 1.1 makes R1 self-censor) |
Set them in a Modelfile:
Then ollama create deepseek-r1-32b-tuned -f Modelfile.
Real-world numbers — what to actually expect
Numbers below are from my own M4 Max 14-core 36 GB MacBook Pro running macOS 15.4 with the 14-inch chassis and the 75 Wh battery. Single user, no concurrent workload, plugged in.
| Quant | Disk size | Resident VRAM (8K ctx) | Decode tok/s | Prefill tok/s |
|---|---|---|---|---|
| q3_K_M | 14.2 GB | 16.1 GB | 38–44 | 320 |
| q4_K_M | 18.5 GB | 19.8 GB | 28–34 | 290 |
| q5_K_M | 22.1 GB | 23.6 GB | 22–26 | 250 |
| q6_K | 26.4 GB | 27.9 GB | 17–20 | 210 |
| q8_0 | 33.8 GB | 35.5 GB | 11–14 | 170 |
q4_K_M is the practical sweet spot. q5_K_M gives you slightly better reasoning behaviour on hard math problems for a 20% throughput hit. q8_0 is academic — it eats your entire RAM budget and slows decode to a crawl because the chip is now memory-bandwidth bound at every token.
Using llama.cpp directly
When you want full control — flash attention, KV-cache quantisation, batch testing different prompts — go straight to llama.cpp. Build it once:
Pull a GGUF from Hugging Face (the official DeepSeek-R1 32B Distill GGUFs are linked off the model card), then run:
-ngl 999 puts every layer on Metal (it caps at the model's actual layer count), -fa is flash attention (≈8% faster decode on M4 Max), and --mlock pins the weights in RAM so macOS doesn't try to swap them out. The Metal backend discussion at llama.cpp #4167 covers the Apple-specific tuning in more detail.
Common pitfalls
Five failure modes I've hit on this exact setup:
- Thermal throttle on the 14-inch chassis. The 14-inch M4 Max has half the fan budget of the 16-inch and will downclock after 6–8 minutes of sustained decode. If you see decode tok/s drift from 32 to 22 over a long conversation, the fans are the cause. The fix is either the 16-inch or a laptop stand that exposes the bottom vents.
num_ctxtoo high. Bumping context to 32K on the 36 GB SKU pushes the KV cache past 5 GB and squeezes macOS. The OS responds by paging, which kills decode. Stay at 16K unless you measured a real need.- Power-mode disagreement. On battery, macOS will silently switch the chip to Low Power Mode and tok/s drops by 40%. Either plug in, or set System Settings → Battery → Battery to "Automatic" rather than "Low Power Mode".
- Wrong model tag.
ollama pull deepseek-r1:32bpulls the Qwen-32B distill, not the Llama-70B distill. If your tok/s numbers look way too slow, you probably grabbed the bigger one — check withollama list. - Concurrent Xcode build. Xcode's clang aggressively grabs P-cores, which leaves Ollama scheduled on E-cores. Decode drops to ~12 tok/s. Cap your Xcode build to
--jobs 4or pause it during inference.
When NOT to use this combo
The 36 GB M4 Max + DeepSeek-R1 32B is a great answer when you want a local reasoning model for single-user chat, code review, RAG over your own notes, or offline drafting. It is the wrong answer when:
- You need >32K context. The KV cache scales linearly and you'll run out of RAM. Step up to 48 GB or 64 GB unified memory.
- You're serving multiple users. Ollama and llama.cpp on Apple silicon handle one request at a time well; concurrent decode falls off a cliff. For multi-user, use vLLM on a Linux box with an RTX 4090, RTX 5090, or A6000.
- You need >50 tok/s. A 5090 will deliver 75–90 tok/s on the same model at half the unit cost. The M4 Max wins on portability, silence, and idle power — not raw throughput.
If portability matters more than peak speed, the M4 Max stays the right call. Otherwise the LocalLLaMA community has plenty of build threads showing 5090-class numbers for under $3000.
How this compares to other 32B-class options
| Setup | tok/s | Peak RAM | Idle watts | Portable |
|---|---|---|---|---|
| M4 Max 36 GB, q4_K_M | 28–34 | 20 GB | 8 W | Yes |
| M4 Pro 48 GB, q4_K_M | 14–18 | 20 GB | 6 W | Yes |
| RTX 5090 32 GB, q4_K_M | 70–85 | 21 GB | 18 W | No |
| RTX 4090 24 GB, q3_K_M | 55–65 | 17 GB | 15 W | No |
| EPYC 9374F CPU-only | 4–6 | 22 GB | 70 W | No |
If you already own the M4 Max, the answer is "use it." If you're shopping fresh and inference is the primary use case, a desktop 5090 beats the laptop on speed-per-dollar. The M4 Max wins when the same machine also has to do video editing, Xcode, and travel.
Monitoring resident memory and tok/s in real time
While you're tuning, you want a fast feedback loop on memory and throughput. Three commands cover most of what you need on macOS:
--verbose prints eval rate (decode tok/s) and prompt eval rate (prefill tok/s) at the end of each response. Capture those numbers across a few hundred turns and you'll see whether thermal throttle is biting.
If you prefer a GUI, Stats gives you a menu-bar HUD for CPU, GPU, and memory pressure that updates every second. Memory pressure should stay green during 32B inference; if it turns yellow your KV cache is too big.
Sample Modelfile recipes
Four Modelfiles I keep in ~/.config/ollama/:
Create them once: ollama create r1-32b-fast -f Modelfile-fast. Then ollama run r1-32b-fast selects the recipe without remembering flags.
What to do next
Once you have it running, pair it with LM Studio for a desktop UI or Open WebUI for a self-hosted chat interface. Both speak the Ollama API natively. If you want to compare reasoning model behaviour, run the same setup with the Qwen 3 32B model — see How to run Qwen 3 32B on Apple M4 Pro for the trade-offs.
FAQs
What is the expected tokens-per-second performance for DeepSeek-R1 32B on Apple M4 Max?
Expect 25 to 40 tokens per second at q4_K_M quantization for single-user chat on a 14-core M4 Max with 36 GB unified memory. Decode is bandwidth-bound at this scale, so the 16-core M4 Max with the 40-core GPU does not meaningfully improve throughput — it does improve prefill speed for very long prompts.
How much memory does DeepSeek-R1 32B require on Apple M4 Max?
The model weights are 18.5 GB on disk at q4_K_M. Resident memory rises to about 20 GB at 8K context once the KV cache spins up, and climbs to ~24 GB at 16K context. The 36 GB SKU leaves enough headroom for macOS and a browser; the 48 GB and 64 GB SKUs comfortably allow 32K context or a second model resident at the same time.
What is the difference between Ollama and llama.cpp for this workload?
Ollama wraps llama.cpp with a model registry, an OpenAI-compatible API, automatic GPU detection, and a Modelfile system for parameter tweaks. llama.cpp gives you direct control over Metal flags, flash attention, KV-cache quantisation, and server flags. Start with Ollama; drop to llama.cpp when you want to A/B test settings or run a customised server.
What should I do if I encounter 'out of memory' errors?
Reduce context length first — drop from 16K to 8K and re-test. If that still fails, switch quantisation from q4_K_M down to q3_K_M, which saves about 4 GB. As a last resort, enable KV-cache quantisation in llama.cpp with -ctk q8_0 -ctv q8_0. Quit other apps; Safari with 40 tabs can easily hold 4 GB.
Is the Apple M4 Max suitable for long-context (>32K) workloads with DeepSeek-R1 32B?
Yes, but only on the 48 GB or 64 GB SKU. The KV cache for DeepSeek-R1 32B at 32K context is around 4 GB on top of the 18.5 GB weights, leaving the 36 GB SKU only ~13 GB for macOS — workable but tight. At 128K context, plan on 64 GB unified memory or step up to an external machine entirely.
