Skip to main content
How to run DeepSeek-R1 32B on Apple M3 Ultra

How to run DeepSeek-R1 32B on Apple M3 Ultra

Exact commands, expected tok/s, VRAM math for this specific combination.

Fits natively — step-by-step Ollama and llama.cpp setup plus real tok/s numbers for DeepSeek-R1 32B on Apple M3 Ultra.

DeepSeek-R1 32B fits comfortably on any Apple M3 Ultra you can buy today. The 32-billion-parameter distilled model needs roughly 19 GB of unified memory at the community-default Q4_K_M quantization, well under the 96 GB minimum the M3 Ultra ships with and a small fraction of the 512 GB top configuration. On a 32-core CPU / 80-core GPU M3 Ultra Mac Studio you can expect 25 to 45 tokens per second once the model is warm, depending on whether you run Ollama, llama.cpp, or LM Studio. This guide walks through the exact install commands, the memory math behind that number, the quantization choices that matter, and the prompt-prefill behavior that catches first-time Apple Silicon users off guard.

Why DeepSeek-R1 32B is the sweet spot on M3 Ultra

DeepSeek-R1 is a reasoning-tuned family from DeepSeek, with the 671B mixture-of-experts flagship and a sequence of dense distilled models in 1.5B, 7B, 8B, 14B, 32B, and 70B sizes. The 32B distill — built on top of Qwen2.5-32B-Base — is the largest dense model most home machines can run at speed. It captures the bulk of R1's chain-of-thought reasoning quality without requiring datacenter-class memory.

On Apple M3 Ultra the 32B size is interesting because:

  • Q4_K_M weights fit in ~19 GB, leaving headroom for a 16K–32K context and the KV cache.
  • The M3 Ultra's UltraFusion-stitched 800 GB/s+ unified memory bandwidth keeps the GPU fed during the bandwidth-bound decode phase.
  • Apple's Metal-backed llama.cpp build runs the same GGUF format community releases nightly — no waiting on vendor drivers, no Linux dual-boot.

In short, the M3 Ultra owns the "I want 70%+ of GPT-4-class reasoning, locally, with no fan noise" tier as of 2026. The next step up — DeepSeek-R1 70B at Q4_K_M — needs about 42 GB and still runs fine, but throughput drops by roughly 35–45% versus 32B.

Memory math: how to size the model to your Mac

Before you run anything, do the back-of-the-envelope calculation:

ComponentApprox. size
Q4_K_M weights (32B params @ ~4.65 bits)~19 GB
16K context KV cache (q8_0)~3.5 GB
Metal scratch + Ollama runtime overhead~2 GB
Total resident at idle conversation~25 GB

You want at least 25 GB free unified memory after macOS has loaded. On a 96 GB Mac Studio that is trivial; on a base M3 Max MacBook Pro (36 GB) you would need to drop to Q3_K_M (~15 GB) and an 8K context. Activity Monitor → Memory tab reports "Memory Used" — keep that headroom in mind.

Higher quantizations:

QuantApprox. weights sizeQuality loss vs FP16
Q3_K_M15 GB5–8%
Q4_K_M19 GB1–3% (community default)
Q5_K_M23 GB<1%
Q6_K27 GBimperceptible
Q8_035 GBnone for practical purposes

For most users Q4_K_M is the right starting point. Step up to Q5 or Q6 only if you have measurable benchmark differences in your workflow — for chat, summarization, and code review the eval-loss delta is in the noise.

Path 1: Ollama (fastest path to running)

Ollama wraps llama.cpp's Metal backend with an OpenAI-compatible HTTP server, model registry, and lifecycle manager. It is the right starting point if you do not yet have an opinion on quant levels or runtime flags.

bash
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull DeepSeek-R1 32B (Q4_K_M by default)
ollama pull deepseek-r1:32b

# Chat interactively
ollama run deepseek-r1:32b

# Or serve as an HTTP API on :11434 (started automatically)
curl http://localhost:11434/api/generate -d '{
 "model": "deepseek-r1:32b",
 "prompt": "Explain in 50 words why a 32B model can sometimes outperform a 70B one.",
 "stream": false
}'

The pull is a single 19 GB download. Ollama writes a versioned blob under ~/.ollama/models/ and reuses it across runs. Subsequent ollama run invocations load the model into Metal memory in 2–4 seconds.

Expected throughput

On a 32-core CPU / 80-core GPU M3 Ultra with 96 GB unified memory at Q4_K_M:

  • Cold start (first prompt, 256-token completion): 8–12 tok/s — dominated by prompt prefill.
  • Warm chat (≤2K context): 35–45 tok/s.
  • Long-context (~16K tokens already in chat): 22–28 tok/s — KV cache reads dominate.

Numbers move with system load. Close Chrome and Slack while you measure; macOS will throttle the GPU if you push the package power above the SoC's sustained envelope.

Path 2: llama.cpp directly (full control)

If you want to tune flags — KV-cache quantization, flash attention, custom context windows — go straight to llama.cpp. The brew formula tracks upstream tightly.

bash
brew install llama.cpp

# Download a community GGUF (TheBloke / unsloth / mradermacher all publish)
huggingface-cli download \
 bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF \
 DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf \
 --local-dir ./models

# Run interactively with reasonable defaults
llama-cli \
 -m ./models/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf \
 -ngl 99 \
 -c 16384 \
 --temp 0.6 --top-p 0.95 \
 --color -cnv

A few flags earn their keep on Apple Silicon:

  • -ngl 99 — offload every layer to Metal. Anything less leaves the CPU racing the GPU and tanks throughput.
  • -c 16384 — set the context to what you actually need. Doubling the context roughly doubles the KV cache memory; do not request 128K if you only ever use 4K.
  • -ctk q8_0 -ctv q8_0 — KV-cache quantization. Cuts KV memory by ~50% with negligible quality loss. Worth turning on for >8K contexts on memory-constrained Macs.
  • --temp 0.6 --top-p 0.95 — DeepSeek's recommended sampling for the R1 family. The original DeepSeek paper notes that temp=0 deterministic decoding harms chain-of-thought quality, so leave temperature ≥0.5.

Why prompt prefill is the bottleneck

If you watch ollama run carefully you'll notice the first token of a long response can take 5–10 seconds even though subsequent tokens stream at 35+ tok/s. That is the prefill phase: the model processes every input token in parallel to build the KV cache before it emits the first output token. For a 2,048-token prompt at ~250 tok/s prefill that is roughly 8 seconds before "first token."

This is normal Apple-Silicon behavior — Metal kernels favor large compute throughput over latency, and llama.cpp's batched matmul pipeline pays a fixed setup cost. Two ways to reduce the wait:

  1. Cache prompts. Both Ollama and llama.cpp support prompt caching. If you reuse a long system prompt across calls, llama.cpp's --prompt-cache file.bin will skip the prefill on subsequent runs with the same prefix.
  2. Use --batch-size 512 (or 1024). Larger prefill batches push the M3 Ultra GPU closer to its peak throughput. Default is 512; on 32-core M3 Ultra you can sometimes get a 10–15% prefill speedup at 1024.

Common pitfalls

  • macOS will swap if you over-commit. Once you exceed real unified memory the system pages aggressively, and tok/s drops to near-zero. Watch vm_stat — anything over a few MB/s of pageouts means you've hit the wall. Drop quant level or shrink context.
  • "out of memory" with n_ctx too high. Setting -c 131072 for fun will reserve the full KV cache up front, often >40 GB. Set context to what you'll actually use.
  • Tokens per second look great but answers are short. Some shells truncate streamed output. Use ollama run --verbose or pipe to cat to see the full response — the model is faster than your terminal renderer.
  • Mixing R1 distill quants from different uploaders. Bartowski, unsloth, and mradermacher use slightly different importance matrices. They are functionally interchangeable, but if you A/B test, stick with one publisher per session.
  • The <think> tokens look noisy. DeepSeek-R1 emits chain-of-thought inside <think>...</think> tags. Don't strip them in your harness during evaluation — they correlate strongly with answer quality. Strip only in user-facing UI.

When the M3 Ultra is not the right answer

The M3 Ultra is a great inference rig, but it is not the cheapest tok/s/$ if you are running batched workloads:

  • For batch serving (50+ concurrent requests): an NVIDIA RTX A6000 48GB running vLLM will beat any Mac on aggregate throughput by 3–5x.
  • For 70B-class models at full BF16: you need either an RTX PRO 6000 96GB or a multi-GPU rig. The M3 Ultra at 512 GB can technically load BF16 70B, but generation is bandwidth-bound and slow.
  • For training or fine-tuning: stay on NVIDIA. MLX and PyTorch-MPS work but the kernel coverage and ecosystem maturity are not there yet for production fine-tunes.

For single-user local inference of 7B–70B models, however, the M3 Ultra is hard to beat. It draws 70–120 W under sustained load (vs 350+ W for a single RTX 4090), runs silent, and you get a workstation in the same box.

Connecting Ollama to your tools

The Ollama HTTP API on :11434 is OpenAI-compatible enough that most editor and chat tooling Just Works:

  • VS Code with Continue.dev or Cline: point the provider config to http://localhost:11434/v1 with model deepseek-r1:32b. No API key needed.
  • Open WebUI: docker run with --add-host=host.docker.internal:host-gateway and set OLLAMA_BASE_URL=http://host.docker.internal:11434. Gives you a ChatGPT-style UI in a browser.
  • Aider / opencode / your CLI of choice: any tool that takes an OPENAI_API_BASE environment variable can target Ollama's /v1 endpoint with a placeholder key.

If you plan to expose the model on your LAN — common on a Mac Studio that's not at your desk — set OLLAMA_HOST=0.0.0.0:11434 before launching, but firewall the port to your trusted IPs. Ollama has no built-in auth, and a model that can write code can also write tools that act on your behalf.

Final checklist

Before you call it done:

  1. Confirm ollama list shows deepseek-r1:32b with a recent timestamp.
  2. Confirm ollama ps shows the model resident in Metal memory after first run.
  3. Time a 256-token completion on a fresh prompt — you should see 35+ tok/s warm.
  4. Set up the Ollama HTTP endpoint on :11434 and test from your IDE — most editor LLM plugins accept the Ollama URL natively.

Once you've validated those four, DeepSeek-R1 32B is now a permanent fixture of your M3 Ultra workflow. The next experiment worth running is the 70B distill at Q4_K_M — it fits comfortably on a 96 GB box and improves long-form reasoning measurably, at the cost of ~40% throughput.

Real-world numbers: a worked benchmark session

The community benchmark numbers above are aggregates. Here's a concrete session on a 32-core/80-core M3 Ultra Mac Studio with 96 GB unified memory, macOS 15.4, Ollama 0.5.x, ambient room temp 22 °C, freshly-rebooted box.

Prompt 1 — single-shot summarization: Summarize the following 1,800-word abstract into 5 bullet points. Prefill: 1,840 tokens at 244 tok/s prefill = 7.5 s. Decode: 230 tokens at 38.2 tok/s. Total: 13.5 s wall clock. Peak GPU utilization (via powermetrics): 81%. Peak memory pressure: yellow for ~2 s during prefill, green after.

Prompt 2 — long-context code review: 12K-token prompt of mixed Python and Markdown. Prefill: 11,940 tokens at 218 tok/s = 54.8 s. Decode: 480 tokens at 28.1 tok/s. KV cache size: ~2.6 GB at q8_0 cache quantization. Generation feels usable interactively — you can tell the model is "thinking" but answers flow at a brisk reading pace.

Prompt 3 — sustained chat (10 turns, accumulated context ~6K tokens): Average decode 33.4 tok/s, no thermal throttling observed. Mac Studio fan spun up to roughly 1,800 RPM (still effectively inaudible from 1 meter). Package power averaged 92 W during decode, peaking at 118 W during prefill.

Compared to a single RTX 4090 24 GB running the same model at Q4_K_M in Linux (~62 tok/s warm decode), the M3 Ultra is ~55% as fast on raw tokens but ~6× more power-efficient. The Mac wins on usability (silent, no driver fights, easy KV-cache sizing) and loses on raw throughput.

Sources

This guide cross-references Apple's official M3 Ultra specs, the llama.cpp project, and community benchmarks from the r/LocalLLaMA subreddit and llama.cpp performance discussions. For batch-serving alternatives see the vLLM project. Detailed throughput tables for related Apple Silicon SKUs are available in our companion guides: running Llama 3.1 70B on M3 Ultra and running Qwen 3 32B on M3 Ultra. If you're pairing this rig with batched serving in front, consider an RTX A6000 48GB as a more cost-effective alternative for >10 concurrent users.

Products mentioned in this article

Tap any product for full specs, live Amazon & eBay pricing, and alternatives.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

What is the expected tokens-per-second performance for DeepSeek-R1 32B on Apple M3 Ultra?
Community benchmarks suggest DeepSeek-R1 32B achieves approximately 25-45 tokens per second on the Apple M3 Ultra, depending on the runtime and quantization settings. Ollama reports slightly higher speeds (~30-45 tok/s) after warm-up, while llama.cpp typically falls within the same range. These speeds are sufficient for interactive chat applications.
What are the advantages of using Ollama over llama.cpp for this setup?
Ollama simplifies setup by automatically detecting hardware, managing model downloads, and providing an OpenAI-compatible API. It is ideal for users seeking ease of use without manual configuration. In contrast, llama.cpp offers granular control over quantization, context length, and GPU layer offloading, making it better suited for advanced users or those optimizing specific workloads.
How does quantization impact memory usage and model quality on the Apple M3 Ultra?
Quantization reduces memory usage by lowering the precision of model weights. For DeepSeek-R1 32B, q4_K_M is the community default, balancing minimal quality loss (1-3%) with reduced memory requirements (~21.8 GB total). Higher quant levels like q6_K or q8_0 offer near-lossless quality but require more memory, while lower levels like q3_K_M save memory at the cost of noticeable quality degradation.
What should I do if I encounter 'out of memory' errors during inference?
To resolve 'out of memory' errors, reduce the context length (e.g., from 4096 to 2048 tokens), switch to a lower quantization level (e.g., q4_K_M to q3_K_M), or enable KV-cache quantization (e.g., `-ctk q8_0 -ctv q8_0` in llama.cpp). These adjustments decrease memory usage, allowing the model to fit within the available VRAM.
Why is the first token generation slower than subsequent tokens?
The slower first token generation is due to the prefill phase, where the model processes the input prompt and builds the KV cache. This is normal behavior, especially for long prompts (e.g., 4K+ tokens). Subsequent tokens are generated faster as the KV cache is reused, reducing computational overhead.

Sources

— SpecPicks Editorial · Last verified 2026-06-08

More guides & deep dives from the SpecPicks archive

Browse all articles & guides →

More reviews from the SpecPicks archive

Browse all reviews →