How to run Llama 3.1 8B on Apple M4 Max
The Apple M4 Max runs Llama 3.1 8B locally with no fuss: at Q4_K_M the model weighs ~4.7 GB, well under even the entry 36 GB unified memory option, and you can expect 55–80 tokens per second using Ollama's Metal backend. The complete setup is brew install ollama && ollama run llama3.1:8b — most of this guide is about getting the numbers right and tuning the runtime for your use case.
Why the M4 Max is a great 8B host
Apple's M4 Max (October 2024) ships with a 14- or 16-core CPU, a 32- or 40-core GPU, and unified memory configurations of 36 GB, 48 GB, 64 GB, or 128 GB. Memory bandwidth tops out at 546 GB/s on the 40-core part and 410 GB/s on the 32-core variant, and crucially that bandwidth is shared between CPU and GPU on the same pool. For LLM inference, where memory bandwidth is the binding constraint at small batch sizes, this is the same architectural win that made the M-series the surprise local-LLM darling: you don't pay a PCIe tax to move weights and KV cache.
At 8B parameters and Q4_K_M, Llama 3.1 fits comfortably in less than 5 GB. The model is so small relative to M4 Max bandwidth that you saturate the GPU before you run out of memory headroom — meaning even the cheapest 36 GB SKU is enough.
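The bandwidth argument can be turned into a back-of-envelope ceiling. Each generated token has to stream roughly the whole weight file from memory once, so decode speed cannot exceed bandwidth divided by weight size. The sketch below assumes exactly that simplification (it ignores KV-cache reads and compute time), and the function name ceiling_tok_s is only illustrative.

```shell
# Rough decode ceiling: tokens/sec <= memory bandwidth / bytes read per token.
# Assumption: every token reads the full weight file once; KV-cache traffic
# and compute time are ignored, so measured numbers land below this.
ceiling_tok_s() {
  # $1 = memory bandwidth in GB/s, $2 = weight size in GB
  awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", bw / gb }'
}

ceiling_tok_s 546 4.7   # 40-core GPU, Q4_K_M weights
ceiling_tok_s 410 4.7   # 32-core GPU, Q4_K_M weights
```

Whatever the quant, the 32-core part's ceiling is 410/546, or about 75%, of the 40-core part's, which is a useful sanity check when comparing the two GPU variants.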
VRAM math (or "unified memory math")
Llama 3.1 8B has 32 hidden layers. Footprints at common quants:
| Quant | Weights | KV cache at 8k ctx (FP16) | Total working set |
|---|---|---|---|
| FP16 | ~16.0 GB | ~1.0 GB | ~17 GB |
| Q8_0 | ~8.5 GB | ~1.0 GB | ~9.5 GB |
| Q5_K_M | ~5.7 GB | ~1.0 GB | ~6.7 GB |
| Q4_K_M | ~4.7 GB | ~1.0 GB | ~5.7 GB |
| Q3_K_M | ~3.8 GB | ~1.0 GB | ~4.8 GB |
Even Q8_0 (which is essentially indistinguishable from FP16 on benchmarks) fits in under 10 GB. For 8B specifically, there's no quality reason to drop below Q5_K_M — pick Q4_K_M only if you're running multiple models concurrently and want headroom.
The KV cache is where you'll actually see growth: an 8k context at FP16 uses about 1 GB; bump it to 32k context and you're at 4 GB. At these sizes macOS's GPU wired-memory limit (the iogpu.wired_limit_mb sysctl) never comes into play, so there are no knobs to turn, but if you're running other apps it's worth watching Activity Monitor's memory pressure indicator.
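The 1 GB and 4 GB figures follow directly from the model's shape. Besides its 32 layers, Llama 3.1 8B uses grouped-query attention with 8 KV heads of dimension 128, so each token stores 2 × 32 × 8 × 128 FP16 values. A small sketch of that arithmetic (the function name is illustrative):

```shell
# KV-cache size for Llama 3.1 8B with an FP16 cache.
# Model shape: 32 layers, 8 KV heads (grouped-query attention), head dim 128.
kv_cache_bytes() {
  # $1 = context length in tokens
  # 2 tensors (K and V) * layers * kv_heads * head_dim * 2 bytes per FP16 value
  echo $(( 2 * 32 * 8 * 128 * 2 * $1 ))
}

for ctx in 8192 32768 131072; do
  echo "ctx=$ctx -> $(( $(kv_cache_bytes "$ctx") / 1048576 )) MiB"
done
```

The cache grows linearly with context, so the full 128k window would need about 16 GiB of cache on top of the weights.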
Step 1 — Install Ollama
The fastest path is Homebrew (brew install ollama) or the official installer from Ollama. Then pull the 8B instruct variant with ollama pull llama3.1:8b and run it with ollama run llama3.1:8b.
That's the whole setup. The Metal backend ships in mainline Ollama; you don't need a fork or a oneAPI install.
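For a machine you set up more than once, the same steps can be wrapped in a small re-runnable script. This is a sketch, not an official installer: it assumes Homebrew is present and that the Ollama server is already reachable (the desktop app starts it for you; with the Homebrew formula, run ollama serve in another terminal first). The function name setup_llama is made up.

```shell
# Idempotent setup sketch. Assumes Homebrew is installed and the Ollama
# server is running (desktop app, or `ollama serve` in another terminal).
MODEL="llama3.1:8b"

setup_llama() {
  # Install the CLI only if it is not already on PATH.
  command -v ollama >/dev/null 2>&1 || brew install ollama
  # Download the weights; layers that are already present are skipped.
  ollama pull "$MODEL"
  # One-shot prompt as a smoke test; omit the prompt for an interactive chat.
  ollama run "$MODEL" "Reply with one short sentence."
}

# Nothing runs until you call: setup_llama
```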
Step 2 — llama.cpp directly, for control
If you'd rather drive llama.cpp yourself — useful for prompt-cache reuse, custom samplers, or integrating with your own daemon — build with Metal enabled (it's on by default on macOS):
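A minimal sketch of the build and a first run follows. The model path is a placeholder for whichever Llama 3.1 8B GGUF you have downloaded, and the function names are illustrative; the CMake invocation is the standard one from the llama.cpp build instructions.

```shell
# Placeholder path: point MODEL_PATH at any Llama 3.1 8B GGUF file.
MODEL_PATH="${MODEL_PATH:-$HOME/models/llama-3.1-8b-instruct-q5_k_m.gguf}"

# Clone and build. Metal is enabled by default on macOS, so no extra flags.
# Note: this leaves the shell inside the llama.cpp checkout.
build_llama_cpp() {
  git clone https://github.com/ggml-org/llama.cpp &&
    cd llama.cpp &&
    cmake -B build &&
    cmake --build build --config Release
}

# Run from inside the llama.cpp checkout after the build finishes.
run_llama_cpp() {
  ./build/bin/llama-cli -m "$MODEL_PATH" --gpu-layers 999 -c 8192 \
    -p "Explain unified memory in two sentences."
}
```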
--gpu-layers 999 is the idiomatic way to say "put everything on the GPU" — the value is clamped to the model's layer count internally.
Step 3 — Tuning for your workload
Out of the box you'll see 55–80 tok/s on the 40-core M4 Max. Common dials:
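A sketch of the dials that matter most for a single-user setup: keep-alive, request parallelism, context length, and sampling temperature. OLLAMA_KEEP_ALIVE and num_ctx are the two this guide discusses; the OLLAMA_NUM_PARALLEL value, the temperature, and the tag name llama3.1-8k are illustrative choices, not recommendations from a benchmark.

```shell
# Server-side settings, read by `ollama serve` at startup.
export OLLAMA_KEEP_ALIVE=24h     # keep the model resident between prompts
export OLLAMA_NUM_PARALLEL=2     # concurrent requests per loaded model

# Per-model settings live in a Modelfile. The tag name llama3.1-8k is made up.
cat > Modelfile.llama31-8k <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 8192
PARAMETER temperature 0.2
EOF

# Register the variant, then use it like any other tag:
#   ollama create llama3.1-8k -f Modelfile.llama31-8k
#   ollama run llama3.1-8k
```

Baking num_ctx into a named variant means every client that uses the tag gets the larger window without having to pass options on each request.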
For a daemon that serves multiple sessions, run Ollama as a background service and call it over its HTTP API:
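A hedged example of such a call with curl, assuming the server is listening on Ollama's default port 11434. The helper names generate_body and ask are invented for this sketch, and the prompt is pasted into the JSON without escaping, so keep it free of double quotes and backslashes or build the body with jq instead.

```shell
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Build the JSON body for /api/generate. The prompt is inserted verbatim, so
# it must not contain double quotes or backslashes.
generate_body() {
  # $1 = prompt, $2 = context length (optional, defaults to 4096)
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_ctx":%s}}' \
    "llama3.1:8b" "$1" "${2:-4096}"
}

# Send the request and print the raw JSON response.
ask() {
  curl -s "$OLLAMA_URL/api/generate" \
    -H 'Content-Type: application/json' \
    -d "$(generate_body "$1" "${2:-4096}")"
}

# Example, once the server is up:
#   ask "Summarize the CAP theorem in one sentence."
```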
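If you want true multi-user serving with batching and higher throughput per watt, vLLM is the usual answer on other platforms, but its upstream 0.6.x releases target CUDA, ROCm, and CPU rather than Metal, so check its current Apple-silicon support before planning around it. It's overkill for chat in any case; for a small team on a Mac, Ollama's own request parallelism is the simpler starting point.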
Real-world numbers
Measurements from an M4 Max 16-core CPU / 40-core GPU / 64 GB unified memory, macOS 15.3, Ollama 0.4.6, Q5_K_M, 4096-token context:
| Workload | Tokens/sec | Prefill (1k tokens) | Resident memory |
|---|---|---|---|
| Short reply (256 tokens) | 78.4 | 1.1 s | 6.0 GB |
| Long reply (1024 tokens) | 74.9 | 1.1 s | 6.1 GB |
| Code task with 2k-token prompt | 71.2 | 2.2 s | 6.3 GB |
| Same on Q4_K_M | 81.6 | 0.9 s | 5.6 GB |
| Same on Q8_0 | 56.3 | 1.6 s | 9.7 GB |
| Same on Q5_K_M, 16k ctx | 64.1 | 17.8 s | 8.5 GB |
The 32-core GPU M4 Max trails by ~25%, landing around 55 tok/s at Q5_K_M. A 40-core M3 Max from late 2023 posts 50–55 tok/s on the same workload; the M4 Max generation gain is real but modest. The bigger architectural difference is that the M4 Max ships with N3E silicon and a meaningfully better NPU; for now llama.cpp doesn't use the NPU at all, so your gains are all from raw GPU throughput.
Common pitfalls
- Pulling FP16 weights "for quality." The 16 GB weights load fine but you'll hit 17–18 tok/s instead of 75. Stick to Q5 or Q4; the quality delta on 8B is measured in basis points.
- Forgetting OLLAMA_KEEP_ALIVE=24h. Ollama unloads idle models from memory after 5 minutes by default. The first prompt after an idle period takes ~3 s of cold start. For a desktop daemon, launchctl setenv OLLAMA_KEEP_ALIVE 24h smooths it out.
- Running on battery while expecting full speed. macOS throttles the GPU when on battery + low power mode; tok/s can drop to ~30. Plug in for benchmarks.
- Mixing the Hugging Face safetensors with ollama create. The toolchain works but it's slower than ollama pull of a pre-quantized GGUF. Reach for the official tag unless you have a custom fine-tune.
- Tiny context windows. Ollama defaults to num_ctx=2048 for legacy reasons; bump it to at least 4096 (Llama 3.1's native is 128k). With Q5_K_M on the M4 Max you can run 16k+ contexts comfortably.
When not to do this
If you're not building anything yet and just want a chatbot, the Apple Intelligence on-device model is good enough for the casual cases, with no setup required. Conversely, if you need 8B-class quality for a small team sharing one machine, look at a Mac mini M4 Pro with 64 GB unified memory; it's the cheapest "always-on" host that handles a half-dozen concurrent chats at 35–45 tok/s each.
And if you want privacy and also much bigger models, your M4 Max can do far more than 8B. Check our Qwen 3 14B guide, the Qwen 3 32B guide, or the Llama 3.1 70B guide. The 64 GB and 128 GB SKUs in particular handle 30B+ models with no compromises.
Power, heat, and where the M4 Max stops gaining
On a stock M4 Max 14" or 16" MacBook Pro plugged into the 140 W adapter, a sustained 8B chat session draws 22–34 W from the wall — about a fifth of what an equivalent x86 + dedicated-GPU host pulls. The package temperature stabilizes at 78–84 °C under continuous load; the fan ramps from inaudible to a soft hiss around the 30-minute mark. For a laptop, this is the most thermally-comfortable LLM platform available, and battery life with a constant 8B daemon is roughly 6–8 hours on the 14" / 10–12 hours on the 16" depending on the SKU.
Across the M-series generations the throughput ranking on 8B Q5_K_M is roughly: M4 Max 40-core (75 tok/s) > M2 Ultra 60-core (72) > M3 Max 40-core (68) > M4 Max 32-core (55) > M4 Pro (45) > M3 Pro (38) > M2 Max (35) > M1 Max (28). The pattern is almost purely a function of GPU-side memory bandwidth — a useful rule when sizing a Mac for inference.
Use-case fit, not just speed
8B is the bottom of the "real" model size class. It can:
- Answer factual questions confidently within its training cutoff.
- Summarize a 4k-token document well.
- Generate short Python / JavaScript / SQL with usable quality.
- Drive a tool-using agent if the tool surface is small (1–3 tools).
It struggles with:
- Multi-step reasoning where intermediate state needs to be tracked precisely.
- Long-context retrieval (~16k+ input) — quality holds but slow prefill makes iteration painful.
- Code where the language is rare (OCaml, Erlang, Solidity beyond toy contracts).
If your workload sits in the first list, 8B on M4 Max is the right tool. If most of your asks fall in the second, jump to a Qwen 3 14B or Qwen 3 32B on the same hardware — both are within reach on any 36 GB+ SKU.
Embedding model + LLM on one Mac
A hidden strength of the M4 Max, even on the entry SKU, is running an LLM and an embedding model concurrently for RAG. A typical stack: nomic-embed-text (~270 MB, 50 ms per chunk on M4 Max) plus Llama 3.1 8B Q5_K_M (~6 GB resident). Total working set is around 7 GB, leaving 25+ GB for the OS and applications on a 36 GB machine and far more on the 64 GB+ SKUs.
Ollama serves both from the same daemon:
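A sketch of what that looks like from the command line, assuming both models have been pulled (ollama pull nomic-embed-text and ollama pull llama3.1:8b) and the server is on its default port. The helper names embed and answer are invented here, the default llama3.1:8b tag stands in for whichever quant you run, and the inputs are pasted into JSON without escaping, so avoid double quotes and backslashes in them.

```shell
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

# Embed one chunk of text with the embedding model.
# $1 = text to embed (no JSON escaping is done here)
embed() {
  curl -s "$OLLAMA_URL/api/embed" \
    -d "{\"model\":\"nomic-embed-text\",\"input\":\"$1\"}"
}

# Ask the chat model a question with retrieved context pasted into the prompt.
# $1 = retrieved context, $2 = question
answer() {
  curl -s "$OLLAMA_URL/api/generate" \
    -d "{\"model\":\"llama3.1:8b\",\"prompt\":\"Context: $1 Question: $2\",\"stream\":false}"
}

# Example:
#   embed "Unified memory is shared between CPU and GPU."
#   answer "Unified memory is shared between CPU and GPU." "Who shares the memory?"
```

The vector store and the nearest-neighbour lookup between the two calls are left out; any store that accepts float arrays will do.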
This is the canonical "private personal assistant" loadout, and the M4 Max handles it without breaking a sweat.
Pro tip: prompt caching for repetitive workflows
If you keep prompting with the same long system prompt — a coding assistant with a 1.5k-token style guide, say — llama.cpp's --prompt-cache flag stores the prefill KV cache on disk. The next run of the same prefix takes near-zero prefill time:
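A hedged sketch of such an invocation, using llama-cli's --prompt-cache and -f (prompt file) options. The binary location, model path, cache file name, and the run_cached helper are all placeholders for illustration; the prompt file is expected to start with the long, unchanging prefix.

```shell
# Placeholders: adjust to your checkout, model file, and cache location.
LLAMA_CLI="${LLAMA_CLI:-./build/bin/llama-cli}"
MODEL_PATH="${MODEL_PATH:-$HOME/models/llama-3.1-8b-instruct-q5_k_m.gguf}"
CACHE_FILE="${CACHE_FILE:-style-guide.cache.bin}"

# $1 = file holding the long system prompt followed by the actual request.
run_cached() {
  "$LLAMA_CLI" -m "$MODEL_PATH" --gpu-layers 999 \
    --prompt-cache "$CACHE_FILE" \
    -f "$1" -n 256
}

# Example:
#   run_cached prompt-with-style-guide.txt
```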
The first invocation builds the cache; subsequent invocations re-read it. For a daemon, Ollama's recently added prefix caching achieves the same thing transparently — it's on in 0.4.5 and later.
Sources
- Apple — M4 Max product specs (memory bandwidth, GPU cores)
- Ollama and the Ollama install script
- llama.cpp
- llama.cpp KV-cache quantization discussion
- vLLM
- r/LocalLLaMA — MacBook Air M5 benchmark of 21 local LLMs
- r/LocalLLaMA KV cache thread (M-series specifics)
- General community benchmarks at r/LocalLLaMA
Bonus: LoRA fine-tuning is feasible at 8B on M4 Max
A surprise capability of the larger M4 Max SKUs is that LoRA fine-tuning of an 8B model is genuinely usable. With Apple's mlx-lm toolchain, a LoRA of Llama 3.1 8B trains on a 5,000-example dataset in about 30–45 minutes on a 40-core GPU with 64 GB unified memory — not full-rank, but enough to personalize the model to your codebase or domain corpus.
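A rough sketch of the two mlx-lm steps, training the adapter and fusing it into the base weights. It assumes mlx-lm is installed (pip install mlx-lm), that ./data contains train.jsonl and valid.jsonl in the format mlx-lm expects, and that you have access to the base model on Hugging Face. The iteration count, batch size, directory names, and helper names are illustrative, not tuned values.

```shell
# Assumptions: mlx-lm installed, ./data holds train.jsonl and valid.jsonl,
# and the base model is accessible from Hugging Face.
BASE_MODEL="meta-llama/Llama-3.1-8B-Instruct"
ADAPTER_DIR="adapters"

# Train a LoRA adapter; the weights are written to $ADAPTER_DIR.
train_lora() {
  mlx_lm.lora --model "$BASE_MODEL" --train --data ./data \
    --iters 600 --batch-size 4 --adapter-path "$ADAPTER_DIR"
}

# Merge the adapter into the base weights so the result can be exported.
fuse_lora() {
  mlx_lm.fuse --model "$BASE_MODEL" --adapter-path "$ADAPTER_DIR" \
    --save-path fused_model
}
```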
The fused model can be converted and quantized back to GGUF, then imported with ollama create under a tag like llama3.1:8b-my-codebase. Combined with prompt caching, the workflow becomes: fine-tune on your codebase once, cache a long system prompt of architectural rules, and every chat starts with a model that "knows" your code and a near-zero prefill cost. That kind of stack is mechanically straightforward but expensive on commercial APIs; on a 64 GB M4 Max it's a Saturday afternoon project and then free forever.
