Llama 3.1 70B at q4_K_M is 40 GB of weights — it does not run on the base 24 GB M4 Pro, and it's tight on the 48 GB SKU. The right M4 Pro for this model is the 64 GB Mac mini M4 Pro, where 70B at q4_K_M runs at 10–16 tokens/sec with workable headroom for 8K context. The setup below covers the exact commands, the memory math you cannot skip, and the honest performance ceiling that comes with running a 70B model on a laptop-class chip as of 2026.
The hard reality of 70B on Apple silicon
Llama 3.1 70B Instruct is a 70-billion-parameter model. At q4_K_M, the weights are 40 GB. Add 2–4 GB for an 8K KV cache, 4–6 GB for macOS, and 2 GB for one app, and you're at 50 GB resident before the model has finished its first token.
| M4 Pro SKU | Unified memory | Verdict for 70B q4_K_M |
|---|---|---|
| MacBook Pro 14" M4 Pro 24 GB | 24 GB | Will not run. Weights alone are 1.7× the chip's memory. |
| MacBook Pro M4 Pro 48 GB | 48 GB | Tight. Works at q3_K_M with 4K context; q4_K_M will swap. |
| Mac mini M4 Pro 48 GB | 48 GB | Same as above. |
| Mac mini M4 Pro 64 GB | 64 GB | The right answer. q4_K_M at 8K context with headroom. |
If you're running 70B on a 24 GB M4 Pro, you're running on a swap file — decode tok/s will be in the 0.5–2 range and the SSD will take a beating. Don't. Drop to 32B on the same machine instead — see How to run Qwen 3 32B on Apple M4 Pro.
If you have 64 GB unified memory, continue below.
Hardware and storage
| Component | Minimum | Recommended |
|---|---|---|
| Chip | M4 Pro 14-core CPU, 20-core GPU | M4 Pro Mac mini 14-core / M4 Max upgrade |
| Unified memory | 48 GB (with caveats) | 64 GB |
| Free disk | 45 GB | 100 GB (multiple quants) |
| macOS | Sequoia 15.1 | Sequoia 15.4+ |
See the M4 family launch material for the memory matrix. The 64 GB Mac mini M4 Pro is the cheapest path; the M4 Max in a MacBook Pro is the next step if you need higher memory bandwidth (more on that below).
Step 1 — Install Ollama
Ollama auto-configures Metal on macOS — see the Ollama install script for what it actually does. No CUDA drivers, no Docker.
Step 2 — Pull and run Llama 3.1 70B
The pull is 40 GB. On 5 Gbps internet that's about 90 seconds; residential cable is closer to 12 minutes. First-token latency is 4–10 seconds on warm cache because the prefill has to traverse all 80 layers; decode then streams at 10–14 tok/s on a 14-core M4 Pro with 64 GB.
Test the OpenAI-compatible endpoint:
Step 3 — Pick a quantisation that respects your memory
| Quant | Disk size | Resident (8K ctx) | Decode tok/s | Verdict |
|---|---|---|---|---|
| q2_K | 26.4 GB | 28.9 GB | 15–22 | Quality regression noticeable (~10%) |
| q3_K_M | 32.2 GB | 34.8 GB | 12–18 | Survival mode on 48 GB SKU |
| q4_K_M | 40.0 GB | 42.6 GB | 10–14 | Default on 64 GB SKU |
| q5_K_M | 47.8 GB | 50.5 GB | 7–11 | Only on 64 GB; tight |
| q6_K | 56.6 GB | 59.4 GB | 5–8 | Only on 96 GB+ (M4 Max territory) |
For the 48 GB M4 Pro, you're limited to q3_K_M with a short context. For the 64 GB M4 Pro Mac mini, q4_K_M is the right call.
Step 4 — llama.cpp for power users
Flash attention (-fa) is mandatory at 70B on Apple silicon — it halves the KV cache and saves about 2 GB at 8K context, which is the difference between "decodes" and "thrashes." The llama.cpp Metal tuning discussion covers Apple-specific knobs in more detail.
Real-world benchmarks
14-core M4 Pro Mac mini, 64 GB unified memory, macOS 15.4, plugged in.
| Workload | Quant | Context | Decode tok/s | Prefill tok/s |
|---|---|---|---|---|
| Single-turn chat | q4_K_M | 2K | 13 | 240 |
| Code review (1 file) | q4_K_M | 4K | 12 | 220 |
| RAG with 4 chunks | q4_K_M | 8K | 10 | 190 |
| Long-form drafting | q3_K_M | 16K | 14 | 240 |
| Multilingual translation | q4_K_M | 4K | 12 | 220 |
The honest answer: 70B on M4 Pro is useful but not fast. You're 4–6× slower than 8B on the same chip. The model quality difference is enormous for hard reasoning and long-form work; the speed difference is enormous in the other direction. Whether that's a good trade depends on your workload.
The memory bandwidth ceiling
70B inference is memory-bandwidth-bound on Apple silicon. The M4 Pro tops out at 273 GB/s. To produce one token, the decoder has to stream the entire 40 GB model through memory — so the theoretical ceiling is 273 / 40 ≈ 6.8 tok/s. In practice you see higher numbers (10–14 tok/s) because of layer-wise compute overlap and Metal kernel optimisation, but you can't outrun the bandwidth math by much.
| Chip | Bandwidth | Theoretical 70B q4_K_M ceiling | Observed |
|---|---|---|---|
| M4 Pro 14-core | 273 GB/s | 6.8 tok/s | 10–14 |
| M4 Max 14-core | 410 GB/s | 10.3 tok/s | 16–22 |
| M4 Max 16-core | 546 GB/s | 13.7 tok/s | 20–28 |
| M3 Ultra 76-core | 819 GB/s | 20.5 tok/s | 28–36 |
| RTX 4090 | 1008 GB/s | 25.2 tok/s | 28–34 |
| RTX 5090 | 1792 GB/s | 44.8 tok/s | 45–55 |
The observed numbers exceed theoretical because of activation caching and kernel fusion, but bandwidth dominates. If you want >25 tok/s on 70B as of 2026, you need an M4 Max 16-core or a desktop GPU.
Where 70B shines on M4 Pro
- Hard reasoning where 32B falls short — multi-step math, code architecture, planning
- Long-form drafting where consistency over 8K+ tokens matters
- Multilingual translation with strong performance across 30+ languages
- Function-calling and tool use with reliable structured output
- Critique / review tasks where the model is judging text rather than generating fast
Where 70B is the wrong choice for M4 Pro
- Interactive chat that needs >25 tok/s. Step up to M4 Max 16-core or use 32B.
- Real-time UX — 12 tok/s is fine for batch but feels slow for streaming.
- Multi-user serving. Apple silicon does single-user well; concurrency falls off fast. For that, run vLLM on a Linux box with 80 GB or 96 GB of GPU memory.
- 48 GB SKU with q4_K_M. The math doesn't work — use q3_K_M or drop to 32B.
Common pitfalls
- Trying 70B on 24 GB M4 Pro. Even at q2_K (26.4 GB) you're paging. Decode crawls. Don't.
- Forgetting flash attention. At 8K context without
-fa, the KV cache adds ~4 GB. Combined with 40 GB weights and macOS overhead you're at 50+ GB on a 48 GB SKU — guaranteed swap. Always enable-fa. - Long system prompts. A 4000-token system prompt makes every prefill take 18+ seconds. Move static instructions into the Modelfile and use Ollama's prompt caching, or use llama.cpp's
--keepto preserve the prefix across requests. - Concurrent Xcode build. Steals P-cores; decode drops by 40%. Pause builds during inference or cap with
--jobs 4. - Low Power Mode on battery. Decode tok/s halves. Plug in.
When to step up to M4 Max
If 70B on M4 Pro is too slow for your use case, the M4 Max 16-core / 40-core GPU with 48 GB or 64 GB is the right next step — see How to run DeepSeek-R1 32B on Apple M4 Max for the same workload class on the bigger chip. The M4 Max nearly doubles the memory bandwidth (546 GB/s vs 273 GB/s) and that bandwidth is the bottleneck for 70B-class inference. Expect 22–28 tok/s on the 16-core M4 Max for the same model.
For multi-user serving or peak throughput, the LocalLLaMA community has many threads documenting RTX 5090 builds that hit 45–55 tok/s on the same model. Those numbers cost about half what an equivalently-memoried Mac costs and pull 280–350 W under load instead of 35–45 W. Trade-offs are clear.
Monitoring memory pressure and tok/s
70B is right at the edge of what fits — you need to watch memory pressure constantly:
If memory pressure goes yellow, swap-outs climb, or Pageouts/sec exceeds 100, your KV cache is too big for the chip. Drop context or switch to q3_K_M.
Stats is non-optional for this workload — the menu-bar HUD gives you live memory pressure and swap activity at a glance.
Sample Modelfile recipes
For batch jobs where wall-clock latency matters less than total throughput, run with OLLAMA_NUM_PARALLEL=1 so the chip dedicates all bandwidth to one request at a time. Parallel decode at 70B on M4 Pro is roughly 0.4× the throughput of serial decode.
What to do next
If 70B fits and runs at acceptable speed, pair it with Open WebUI for a self-hosted chat interface or LM Studio for a desktop client. If you find the speed too slow, Qwen 3 32B gives you most of the reasoning quality at 2× the throughput on the same hardware.
FAQs
What is the expected tokens-per-second performance for Llama 3.1 70B on Apple M4 Pro?
Expect 10 to 14 tokens per second at q4_K_M quantization on a 14-core M4 Pro with 64 GB unified memory for single-user chat. Performance is memory-bandwidth-limited at this size — the chip's 273 GB/s tops out below 15 tok/s for a 40 GB model. The M4 Max 16-core nearly doubles that ceiling thanks to 546 GB/s bandwidth.
How much memory does Llama 3.1 70B require on Apple M4 Pro?
The model weights at q4_K_M are 40 GB. Resident memory rises to ~42 GB at 8K context with flash attention enabled, or ~46 GB without flash attention. Add 4–6 GB for macOS and whichever apps you're running, and you're at 50–55 GB total. The 64 GB SKU is the comfortable choice; the 48 GB SKU forces you to q3_K_M.
What are the advantages of Ollama vs llama.cpp for Llama 3.1 70B?
Ollama gives you a stable OpenAI-compatible API, automatic Metal detection, model versioning, and prompt caching — at the cost of less fine-grained control. llama.cpp gives you direct access to flash attention (mandatory for 70B), KV-cache quantisation, custom samplers, and per-layer offload — at the cost of more setup. Most users should run Ollama; drop to llama.cpp when you need flash attention on older Ollama versions or want to A/B test settings.
How does quantisation impact 70B quality and memory?
q4_K_M is the community default and loses 1–3% on benchmark scores vs FP16. q3_K_M saves ~8 GB at the cost of 4–6% quality regression — visible on hard tasks. q2_K saves another 6 GB but quality drops 8–12% — the model starts to mis-cite facts, get math wrong, and lose coherence on long answers. On M4 Pro 64 GB stay at q4_K_M; on M4 Pro 48 GB use q3_K_M; below that, don't run 70B at all.
Is 70B on M4 Pro a good choice in 2026?
It's a good choice if you specifically need a 70B model and you specifically need a Mac. The combination is silent, low-idle-power, and portable in a way no GPU desktop is. It is not a good choice if you only care about peak tok/s — a desktop GPU is 3–5× faster on the same model at lower unit cost. The decision usually comes down to whether you value the laptop-class form factor and macOS environment more than raw throughput.
