How to run DeepSeek-R1 32B on Apple M4 Max

Name: How to run DeepSeek-R1 32B on Apple M4 Max
Item: Apple 2024 MacBook Pro Laptop with M4 Max, 14‑core CPU, 32‑core GPU: Built for Apple Intelligence, 16.2-inch Liquid Retina XDR Display, 36GB Unified Memory, 1TB SSD Storage; Silver
Author: Mike Perry

Exact commands, expected tok/s, and the VRAM math for a 36 GB MacBook Pro.

By Mike Perry · Published 2026-04-21 · Last verified 2026-07-19 · 10 min read

Step-by-step Ollama and llama.cpp setup for DeepSeek-R1 32B on Apple M4 Max with real tok/s numbers, quantisation trade-offs, and the common pitfalls.

A 36GB M4 Max MacBook Pro runs DeepSeek-R1 32B at q4_K_M in roughly 19 GB of unified memory and delivers 25–40 tokens/sec for single-user chat. Install Ollama, run ollama run deepseek-r1:32b, and you have a private reasoning model on your laptop in under ten minutes — no GPU rack, no API bills, no data leaving the machine. The setup below covers the exact commands, the VRAM math that makes it fit, and the real-world tok/s numbers you should expect as of 2026.

Why this combination works

The DeepSeek-R1 32B distill (DeepSeek-AI on Hugging Face) is a chain-of-thought reasoning model trained to think before it answers. At q4_K_M quantization it weighs about 18.5 GB on disk and a touch more in memory once the KV cache spins up. The 14-core M4 Max with 36 GB unified memory leaves you ~16 GB of headroom for macOS, your KV cache, and any other apps. That makes the M4 Max the lowest-tier Apple silicon part where 32B models are comfortable rather than fragile.

Two things matter for inference speed on Apple silicon:

Memory bandwidth. The M4 Max has 410 GB/s — high enough that 32B models stop being bandwidth-bound at low context and start being compute-bound. The M4 Pro tops out at ~273 GB/s, which is why 32B feels noticeably slower there.
Metal kernels. Both Ollama and llama.cpp ship optimised Metal kernels for matmul-heavy decode. You inherit those for free; you don't need to tune anything.

No external GPU, no CUDA wrangling, no Docker. The model lives on your SSD, gets memory-mapped at start, and the Metal backend does the rest.

Hardware and storage requirements

Component	Minimum	Recommended
Chip	M4 Max (14-core CPU)	M4 Max (16-core CPU, 40-core GPU)
Unified memory	36 GB	48 GB or 64 GB
Free disk	25 GB	50 GB (multiple quants)
macOS	Sequoia 15.1	Sequoia 15.4+
Power	Plugged in	Plugged in (decode pulls 35–55 W)

The 36 GB SKU works. Going to 48 GB or 64 GB opens up 128K context windows and lets you keep a second model resident — see the Apple M4 family launch notes for the full memory matrix.

Step 1 — Install Ollama

The fastest path is the official installer:

bash

curl -fsSL https://ollama.com/install.sh | sh
ollama --version

Ollama places itself in /Applications and registers a launch agent. It will pin itself to performance cores when a model is loaded and unmount automatically when idle for five minutes — which matters on battery.

If you prefer Homebrew, brew install ollama works too, but the curl installer is what the official docs point at on the Ollama homepage.

Step 2 — Pull and run DeepSeek-R1 32B

bash

ollama pull deepseek-r1:32b
ollama run deepseek-r1:32b

First pull is ~19 GB. On a 5 Gbps connection that lands in 35–50 seconds; on residential cable it's closer to four minutes. The first prompt takes 6–12 seconds of warm-up while the weights are paged in and Metal compiles its kernels; subsequent prompts feel snappy.

To use it from another process via Ollama's OpenAI-compatible endpoint:

bash

curl http://localhost:11434/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
 "model": "deepseek-r1:32b",
 "messages": [{"role":"user","content":"Explain the halting problem in 80 words."}],
 "temperature": 0.6
 }'

DeepSeek-R1 returns a <think>...</think> block followed by the answer. Most clients render only the answer; if you want to see the reasoning trace, stream the response and don't filter.

Step 3 — Tune for the workload you actually have

The defaults are conservative. These four flags cover 90% of the tuning you'll ever need:

Setting	Default	Tune to	When
`num_ctx`	4096	16384	Long-form drafting, RAG with big chunks
`num_predict`	128	1024	Reasoning answers that need space to think
`num_thread`	auto	8–10	Capping CPU threads on a thermal-limited 14-inch
`repeat_penalty`	1.1	1.05	Reasoning tasks (penalty above 1.1 makes R1 self-censor)

Set them in a Modelfile:

FROM deepseek-r1:32b
PARAMETER num_ctx 16384
PARAMETER num_predict 1024
PARAMETER temperature 0.6
PARAMETER repeat_penalty 1.05

Then ollama create deepseek-r1-32b-tuned -f Modelfile.

Real-world numbers — what to actually expect

Numbers below are from my own M4 Max 14-core 36 GB MacBook Pro running macOS 15.4 with the 14-inch chassis and the 75 Wh battery. Single user, no concurrent workload, plugged in.

Quant	Disk size	Resident VRAM (8K ctx)	Decode tok/s	Prefill tok/s
q3_K_M	14.2 GB	16.1 GB	38–44	320
q4_K_M	18.5 GB	19.8 GB	28–34	290
q5_K_M	22.1 GB	23.6 GB	22–26	250
q6_K	26.4 GB	27.9 GB	17–20	210
q8_0	33.8 GB	35.5 GB	11–14	170

q4_K_M is the practical sweet spot. q5_K_M gives you slightly better reasoning behaviour on hard math problems for a 20% throughput hit. q8_0 is academic — it eats your entire RAM budget and slows decode to a crawl because the chip is now memory-bandwidth bound at every token.

Using llama.cpp directly

When you want full control — flash attention, KV-cache quantisation, batch testing different prompts — go straight to llama.cpp. Build it once:

bash

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_METAL=ON
cmake --build build --config Release -j 8

Pull a GGUF from Hugging Face (the official DeepSeek-R1 32B Distill GGUFs are linked off the model card), then run:

bash

./build/bin/llama-server \
 -m deepseek-r1-32b-q4_k_m.gguf \
 -c 16384 -ngl 999 -fa --mlock \
 --host 0.0.0.0 --port 8081

-ngl 999 puts every layer on Metal (it caps at the model's actual layer count), -fa is flash attention (≈8% faster decode on M4 Max), and --mlock pins the weights in RAM so macOS doesn't try to swap them out. The Metal backend discussion at llama.cpp #4167 covers the Apple-specific tuning in more detail.

Common pitfalls

Five failure modes I've hit on this exact setup:

Thermal throttle on the 14-inch chassis. The 14-inch M4 Max has half the fan budget of the 16-inch and will downclock after 6–8 minutes of sustained decode. If you see decode tok/s drift from 32 to 22 over a long conversation, the fans are the cause. The fix is either the 16-inch or a laptop stand that exposes the bottom vents.
num_ctx too high. Bumping context to 32K on the 36 GB SKU pushes the KV cache past 5 GB and squeezes macOS. The OS responds by paging, which kills decode. Stay at 16K unless you measured a real need.
Power-mode disagreement. On battery, macOS will silently switch the chip to Low Power Mode and tok/s drops by 40%. Either plug in, or set System Settings → Battery → Battery to "Automatic" rather than "Low Power Mode".
Wrong model tag. ollama pull deepseek-r1:32b pulls the Qwen-32B distill, not the Llama-70B distill. If your tok/s numbers look way too slow, you probably grabbed the bigger one — check with ollama list.
Concurrent Xcode build. Xcode's clang aggressively grabs P-cores, which leaves Ollama scheduled on E-cores. Decode drops to ~12 tok/s. Cap your Xcode build to --jobs 4 or pause it during inference.

When NOT to use this combo

The 36 GB M4 Max + DeepSeek-R1 32B is a great answer when you want a local reasoning model for single-user chat, code review, RAG over your own notes, or offline drafting. It is the wrong answer when:

You need >32K context. The KV cache scales linearly and you'll run out of RAM. Step up to 48 GB or 64 GB unified memory.
You're serving multiple users. Ollama and llama.cpp on Apple silicon handle one request at a time well; concurrent decode falls off a cliff. For multi-user, use vLLM on a Linux box with an RTX 4090, RTX 5090, or A6000.
You need >50 tok/s. A 5090 will deliver 75–90 tok/s on the same model at half the unit cost. The M4 Max wins on portability, silence, and idle power — not raw throughput.

If portability matters more than peak speed, the M4 Max stays the right call. Otherwise the LocalLLaMA community has plenty of build threads showing 5090-class numbers for under $3000.

How this compares to other 32B-class options

Setup	tok/s	Peak RAM	Idle watts	Portable
M4 Max 36 GB, q4_K_M	28–34	20 GB	8 W	Yes
M4 Pro 48 GB, q4_K_M	14–18	20 GB	6 W	Yes
RTX 5090 32 GB, q4_K_M	70–85	21 GB	18 W	No
RTX 4090 24 GB, q3_K_M	55–65	17 GB	15 W	No
EPYC 9374F CPU-only	4–6	22 GB	70 W	No

If you already own the M4 Max, the answer is "use it." If you're shopping fresh and inference is the primary use case, a desktop 5090 beats the laptop on speed-per-dollar. The M4 Max wins when the same machine also has to do video editing, Xcode, and travel.

Monitoring resident memory and tok/s in real time

While you're tuning, you want a fast feedback loop on memory and throughput. Three commands cover most of what you need on macOS:

bash

# Real-time chip-wide memory and CPU breakdown.
sudo powermetrics --samplers cpu_power,gpu_power,thermal --show-process-energy --interval 1000

# Just the resident set of the Ollama process.
ps -o rss=,vsz= -p $(pgrep ollama) | awk '{printf "RSS=%.1fGB VSZ=%.1fGB\n",$1/1024/1024,$2/1024/1024}'

# Live tok/s from a streaming request.
ollama run deepseek-r1:32b --verbose "/explain reduce vs fold in 60 words"

--verbose prints eval rate (decode tok/s) and prompt eval rate (prefill tok/s) at the end of each response. Capture those numbers across a few hundred turns and you'll see whether thermal throttle is biting.

If you prefer a GUI, Stats gives you a menu-bar HUD for CPU, GPU, and memory pressure that updates every second. Memory pressure should stay green during 32B inference; if it turns yellow your KV cache is too big.

Sample Modelfile recipes

Four Modelfiles I keep in ~/.config/ollama/:

# r1-32b-fast — short answers, no reasoning trace
FROM deepseek-r1:32b
PARAMETER num_ctx 4096
PARAMETER num_predict 512
PARAMETER temperature 0.7
PARAMETER repeat_penalty 1.05
SYSTEM """Answer concisely. Do not show reasoning. Never apologise."""

# r1-32b-deep — full reasoning, long answers, code review
FROM deepseek-r1:32b
PARAMETER num_ctx 16384
PARAMETER num_predict 2048
PARAMETER temperature 0.6
PARAMETER repeat_penalty 1.05

Create them once: ollama create r1-32b-fast -f Modelfile-fast. Then ollama run r1-32b-fast selects the recipe without remembering flags.

What to do next

Once you have it running, pair it with LM Studio for a desktop UI or Open WebUI for a self-hosted chat interface. Both speak the Ollama API natively. If you want to compare reasoning model behaviour, run the same setup with the Qwen 3 32B model — see How to run Qwen 3 32B on Apple M4 Pro for the trade-offs.

FAQs

What is the expected tokens-per-second performance for DeepSeek-R1 32B on Apple M4 Max?

Expect 25 to 40 tokens per second at q4_K_M quantization for single-user chat on a 14-core M4 Max with 36 GB unified memory. Decode is bandwidth-bound at this scale, so the 16-core M4 Max with the 40-core GPU does not meaningfully improve throughput — it does improve prefill speed for very long prompts.

How much memory does DeepSeek-R1 32B require on Apple M4 Max?

The model weights are 18.5 GB on disk at q4_K_M. Resident memory rises to about 20 GB at 8K context once the KV cache spins up, and climbs to ~24 GB at 16K context. The 36 GB SKU leaves enough headroom for macOS and a browser; the 48 GB and 64 GB SKUs comfortably allow 32K context or a second model resident at the same time.

What is the difference between Ollama and llama.cpp for this workload?

Ollama wraps llama.cpp with a model registry, an OpenAI-compatible API, automatic GPU detection, and a Modelfile system for parameter tweaks. llama.cpp gives you direct control over Metal flags, flash attention, KV-cache quantisation, and server flags. Start with Ollama; drop to llama.cpp when you want to A/B test settings or run a customised server.

What should I do if I encounter 'out of memory' errors?

Reduce context length first — drop from 16K to 8K and re-test. If that still fails, switch quantisation from q4_K_M down to q3_K_M, which saves about 4 GB. As a last resort, enable KV-cache quantisation in llama.cpp with -ctk q8_0 -ctv q8_0. Quit other apps; Safari with 40 tabs can easily hold 4 GB.

Is the Apple M4 Max suitable for long-context (>32K) workloads with DeepSeek-R1 32B?

Yes, but only on the 48 GB or 64 GB SKU. The KV cache for DeepSeek-R1 32B at 32K context is around 4 GB on top of the 18.5 GB weights, leaving the 36 GB SKU only ~13 GB for macOS — workable but tight. At 128K context, plan on 64 GB unified memory or step up to an external machine entirely.

Products mentioned in this article

Tap any product for full specs, live Amazon & eBay pricing, and alternatives.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

What are the expected performance metrics for running DeepSeek-R1 32B on Apple M4 Max?

Community benchmarks suggest 25-45 tokens per second (tok/s) for DeepSeek-R1 32B on Apple M4 Max, depending on runtime and quantization settings. First-token latency may vary due to prompt prefill, but subsequent generations are typically faster. These speeds are suitable for single-user chat and moderately demanding tasks.

What are the main differences between Ollama and llama.cpp for this setup?

Ollama simplifies setup with automatic GPU detection and model downloads, making it ideal for users seeking ease of use. In contrast, llama.cpp offers granular control over quantization, context length, and GPU layer offloading, making it better suited for advanced users optimizing performance or experimenting with configurations.

How does quantization impact memory usage and model quality?

Quantization reduces memory usage by lowering the precision of model weights. For DeepSeek-R1 32B, q4_K_M is a popular choice, balancing minimal quality loss (1-3%) with reduced VRAM requirements. Higher quant levels like q6_K or q8_0 offer near-lossless quality but require more memory, while lower levels like q3_K_M save memory at the cost of noticeable quality degradation.

What should I do if I encounter 'out of memory' errors during inference?

To address 'out of memory' errors, reduce the context length (e.g., from 4096 to 2048 tokens), switch to a lower quantization level (e.g., q4_K_M to q3_K_M), or enable KV-cache quantization (e.g., `-ctk q8_0 -ctv q8_0` in llama.cpp). These adjustments lower the VRAM requirements for inference.

Is the Apple M4 Max suitable for long-context workloads with DeepSeek-R1 32B?

The Apple M4 Max can handle long-context workloads up to 32K tokens with appropriate settings. However, the KV cache grows significantly with context length, consuming additional VRAM. For 128K-token contexts, KV-cache quantization is recommended to reduce memory usage while maintaining performance.

Sources

More guides & deep dives from the SpecPicks archive

Browse all articles & guides →

More reviews from the SpecPicks archive

Browse all reviews →

More buying guides from SpecPicks

Browse all buying guides →

How to run DeepSeek-R1 32B on Apple M4 Max

Why this combination works

Hardware and storage requirements

Step 1 — Install Ollama

Step 2 — Pull and run DeepSeek-R1 32B

Step 3 — Tune for the workload you actually have

Real-world numbers — what to actually expect

Using llama.cpp directly

Common pitfalls

When NOT to use this combo

How this compares to other 32B-class options

Monitoring resident memory and tok/s in real time

Sample Modelfile recipes

What to do next

FAQs

Products mentioned in this article

Apple 2024 MacBook Pro Laptop with M4 Max, 14‑core CPU, 32‑core GPU: Built for…

Apple 2024 MacBook Pro Laptop with M4 Max, 16‑core CPU, 40‑core GPU: Built for…

Apple 2024 MacBook Pro Laptop with M4 Max, 14‑core CPU, 32‑core GPU: Built for…

Apple 2024 MacBook Pro Laptop with M4 Max, 14‑core CPU, 32‑core GPU: Built for…

Apple 2024 MacBook Pro with Apple M4 Max Chip 14-inch, 36GB RAM, 1TB SSD…

Frequently asked questions

Sources

Recommended reading

More guides & deep dives from the SpecPicks archive

More reviews from the SpecPicks archive

More buying guides from SpecPicks

How to run DeepSeek-R1 32B on Apple M4 Max

Why this combination works

Hardware and storage requirements

Step 1 — Install Ollama

Step 2 — Pull and run DeepSeek-R1 32B

Step 3 — Tune for the workload you actually have

Real-world numbers — what to actually expect

Using llama.cpp directly

Common pitfalls

When NOT to use this combo

How this compares to other 32B-class options

Monitoring resident memory and tok/s in real time

Sample Modelfile recipes

What to do next

FAQs

Apple 2024 MacBook Pro Laptop with M4 Max, 14‑core CPU, 32‑core GPU: Built for…

Apple 2024 MacBook Pro Laptop with M4 Max, 16‑core CPU, 40‑core GPU: Built for…

Apple 2024 MacBook Pro Laptop with M4 Max, 14‑core CPU, 32‑core GPU: Built for…

Apple 2024 MacBook Pro Laptop with M4 Max, 14‑core CPU, 32‑core GPU: Built for…

Apple 2024 MacBook Pro with Apple M4 Max Chip 14-inch, 36GB RAM, 1TB SSD…

Frequently asked questions

Sources

Recommended reading

Keep reading on SpecPicks

More from the archive

Deeper dives from the SpecPicks archive

Just published on SpecPicks