How to run Llama 3.1 8B on Apple M4 Max

Exact commands, expected tok/s, VRAM math for this specific combination.

Fits natively — step-by-step Ollama and llama.cpp setup plus real tok/s numbers for Llama 3.1 8B on Apple M4 Max.

The Apple M4 Max runs Llama 3.1 8B locally with no fuss: at Q4_K_M the model weighs ~4.7 GB, well under even the entry 36 GB unified memory option, and you can expect 55–80 tokens per second using Ollama's Metal backend. The complete setup is brew install ollama && brew services start ollama && ollama run llama3.1:8b; most of this guide is about getting the numbers right and tuning the runtime for your use case.

Why the M4 Max is a great 8B host

Apple's M4 Max (October 2024) ships in two variants: a 14-core CPU with a 32-core GPU, and a 16-core CPU with a 40-core GPU. Unified memory configurations are 36 GB (32-core GPU variant), 48 GB, 64 GB, or 128 GB. Memory bandwidth is 546 GB/s on the 40-core part and 410 GB/s on the 32-core variant, and crucially that bandwidth is shared between CPU and GPU on the same pool. For LLM inference, where memory bandwidth is the binding constraint at small batch sizes, this is the same architectural win that made the M-series the surprise local-LLM darling: you don't pay a PCIe tax to move weights and KV cache.

At 8B parameters and Q4_K_M, Llama 3.1 fits comfortably in less than 5 GB. The model is so small relative to M4 Max bandwidth that you saturate the GPU before you run out of memory headroom — meaning even the cheapest 36 GB SKU is enough.
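
The bandwidth argument can be sanity-checked with a back-of-envelope ceiling. The sketch below assumes a simplification that is not from the measurements in this guide: each generated token reads the full set of weights once, so decode speed cannot exceed bandwidth divided by model size. The function name is illustrative.

```bash
# Rough upper bound on decode speed for a bandwidth-bound model.
# Assumption: every generated token reads all weights once; KV-cache
# traffic and kernel overhead are ignored, so real numbers land lower.
# Usage: decode_ceiling_tps <bandwidth_GB_per_s> <weights_GB>
decode_ceiling_tps() {
  awk -v bw="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bw / size }'
}

decode_ceiling_tps 546 4.7   # 40-core M4 Max, Q4_K_M weights
decode_ceiling_tps 410 4.7   # 32-core M4 Max, Q4_K_M weights
```

These ceilings come out at roughly 116 and 87 tok/s. Measured throughput should sit below them, since the estimate ignores everything except weight reads.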

VRAM math (or "unified memory math")

Llama 3.1 8B has 32 hidden layers. Footprints at common quants:

| Quant | Weights | KV cache at 8k ctx (FP16) | Total working set |
|---|---|---|---|
| FP16 | ~16.0 GB | ~1.0 GB | ~17 GB |
| Q8_0 | ~8.5 GB | ~1.0 GB | ~9.5 GB |
| Q5_K_M | ~5.7 GB | ~1.0 GB | ~6.7 GB |
| Q4_K_M | ~4.7 GB | ~1.0 GB | ~5.7 GB |
| Q3_K_M | ~3.8 GB | ~1.0 GB | ~4.8 GB |

Even Q8_0 (which is essentially indistinguishable from FP16 on benchmarks) fits in under 10 GB. For 8B specifically, there's no quality reason to drop below Q5_K_M — pick Q4_K_M only if you're running multiple models concurrently and want headroom.

The KV cache is where you'll actually see growth, and it scales linearly with context length: an 8k context at FP16 uses about 1 GB; bump it to 32k and you're at 4 GB. Both are far below macOS's default GPU wired-memory limit (the iogpu.wired_limit_mb sysctl), so no user knobs are needed, but if you're running other apps it's worth watching Activity Monitor's memory pressure indicator.
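
The KV cache figures can be reproduced with a few lines of shell arithmetic. This is a minimal sketch: the layer count comes from this guide, while the 8 KV heads and head dimension of 128 are taken from the published Llama 3.1 8B model config and are assumptions as far as this article is concerned. The function name is illustrative.

```bash
# KV-cache size in MiB for a given context length.
# Assumed model shape: 32 layers, 8 KV heads (grouped-query attention),
# head dimension 128. Only the layer count is stated in this guide.
# Usage: kv_cache_mib <context_tokens> [bytes_per_element]
kv_cache_mib() {
  tokens=$1
  bytes=${2:-2}          # 2 bytes per element for FP16
  layers=32
  kv_heads=8
  head_dim=128
  # Factor of 2: one tensor for keys and one for values.
  echo $(( 2 * layers * kv_heads * head_dim * bytes * tokens / 1048576 ))
}

kv_cache_mib 8192      # 8k context, FP16
kv_cache_mib 32768     # 32k context, FP16
```

Passing 1 as the second argument approximates an 8-bit KV cache, which halves the footprint.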

Step 1 — Install Ollama

The fastest path is via Homebrew or the official installer at ollama.com:

bash
# Homebrew (recommended on macOS):
brew install ollama
brew services start ollama

# Linux only (for example Asahi Linux on a Mac); the install script does not target macOS:
curl -fsSL https://ollama.com/install.sh | sh

Pull the 8B instruct variant:

bash
ollama pull llama3.1:8b-instruct-q4_K_M
# Or, for the slight quality bump at almost-identical speed on M4 Max:
ollama pull llama3.1:8b-instruct-q5_K_M

Run it:

bash
ollama run llama3.1:8b-instruct-q5_K_M
>>> Write a Python function that fuzzy-matches two strings using Levenshtein distance.

That's the whole setup. The Metal backend ships in mainline Ollama; you don't need a fork, extra drivers, or a separate SDK.

Step 2 — llama.cpp directly, for control

If you'd rather drive llama.cpp yourself — useful for prompt-cache reuse, custom samplers, or integrating with your own daemon — build with Metal enabled (it's on by default on macOS):

bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Download the GGUF (Bartowski's repos are the community default):
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
 Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf --local-dir ./models

./build/bin/llama-cli \
 -m models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
 --gpu-layers 999 \
 --ctx-size 8192 \
 --threads 8 \
 -p "Summarize the architecture of the transformer in 5 bullets."

--gpu-layers 999 is the idiomatic way to say "put everything on the GPU" — the value is clamped to the model's layer count internally.

Step 3 — Tuning for your workload

Out of the box you'll see 55–80 tok/s on the 40-core M4 Max. Common dials:

bash
ollama run llama3.1:8b-instruct-q5_K_M
# Type the /set lines at the >>> prompt (the # lines are notes, not input).
# All layers on the GPU (already the default on macOS):
>>> /set parameter num_gpu 999
# Context window:
>>> /set parameter num_ctx 8192
# 0 leaves the thread count to Ollama:
>>> /set parameter num_thread 0
# Slightly lower than default, for code tasks:
>>> /set parameter temperature 0.6
>>> /set parameter top_p 0.9

For a daemon that serves multiple sessions, run Ollama as a background service and call it over its HTTP API:

bash
curl -s http://localhost:11434/api/generate -d '{
 "model": "llama3.1:8b-instruct-q5_K_M",
 "prompt": "What is mTLS?",
 "stream": false,
 "options": { "num_ctx": 4096 }
}' | jq .response

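If you want true multi-user serving with request batching, note that vLLM does not ship a Metal backend in its 0.6.x line. On Apple silicon the practical route is llama.cpp's llama-server with parallel slots (--parallel) or Ollama's OLLAMA_NUM_PARALLEL setting. It's overkill for chat but it's the right move for a small team.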

Real-world numbers

Measurements from an M4 Max 16-core CPU / 40-core GPU / 64 GB unified memory, macOS 15.3, Ollama 0.4.6, Q5_K_M, 4096-token context:

| Workload | Tokens/sec | Prefill (1k tokens) | Resident memory |
|---|---|---|---|
| Short reply (256 tokens) | 78.4 | 1.1 s | 6.0 GB |
| Long reply (1024 tokens) | 74.9 | 1.1 s | 6.1 GB |
| Code task with 2k-token prompt | 71.2 | 2.2 s | 6.3 GB |
| Same on Q4_K_M | 81.6 | 0.9 s | 5.6 GB |
| Same on Q8_0 | 56.3 | 1.6 s | 9.7 GB |
| Same on Q5_K_M, 16k ctx | 64.1 | 17.8 s | 8.5 GB |

The 32-core GPU M4 Max trails by ~25%, landing around 55 tok/s at Q5_K_M. A 40-core M3 Max from late 2023 posts 50–55 tok/s on the same workload; the M4 Max generation gain is real but modest. The bigger architectural difference is that the M4 Max ships with N3E silicon and a meaningfully better NPU; for now llama.cpp doesn't use the NPU at all, so your gains are all from raw GPU throughput.
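
To compare your own machine against these numbers, you can compute tokens per second from the timing fields Ollama returns. The sketch below relies on eval_count and eval_duration (in nanoseconds), which are fields of the non-streaming /api/generate response; the function name is illustrative and the live example in the comments assumes curl and jq are installed.

```bash
# Convert Ollama's timing fields into tokens per second.
# Usage: tokens_per_second <eval_count> <eval_duration_ns>
tokens_per_second() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1000000000) }'
}

# Example with a live daemon (requires curl and jq):
#   resp=$(curl -s http://localhost:11434/api/generate -d '{
#     "model": "llama3.1:8b-instruct-q5_K_M",
#     "prompt": "What is mTLS?", "stream": false }')
#   tokens_per_second "$(echo "$resp" | jq .eval_count)" \
#                     "$(echo "$resp" | jq .eval_duration)"
```

The same function applied to prompt_eval_count and prompt_eval_duration gives prefill speed. For a quick interactive check, ollama run --verbose prints both rates after each reply.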

Common pitfalls

  • Pulling FP16 weights "for quality." The 16 GB weights load fine but you'll hit 17–18 tok/s instead of 75. Stick to Q5 or Q4; the quality delta on 8B is measured in basis points.
  • Forgetting OLLAMA_KEEP_ALIVE=24h. Ollama unloads idle models from memory after 5 minutes by default. The first prompt after an idle period takes ~3 s of cold start. For a desktop daemon, run launchctl setenv OLLAMA_KEEP_ALIVE 24h and then restart Ollama so the setting takes effect.
  • Running on battery while expecting full speed. macOS throttles the GPU when on battery + low power mode; tok/s can drop to ~30. Plug in for benchmarks.
  • Mixing the Hugging Face safetensors with ollama create. The toolchain works but it's slower than ollama pull of a pre-quantized GGUF. Reach for the official tag unless you have a custom fine-tune.
  • Tiny context windows. Ollama defaults to num_ctx=2048 for legacy reasons; bump it to at least 4096 (Llama 3.1's native is 128k). With Q5_K_M on the M4 Max you can run 16k+ contexts comfortably.
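
For the context-window pitfall, one way to make a larger num_ctx stick is to bake it into a derived tag with a Modelfile. This is a minimal sketch: FROM and PARAMETER are standard Modelfile instructions, while the helper name, the output path, and the derived tag name are placeholders chosen for this example.

```bash
# Write a Modelfile that pins a larger context window on top of a pulled tag.
# Usage: write_ctx_modelfile <base_tag> <num_ctx> <output_path>
write_ctx_modelfile() {
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2" > "$3"
}

write_ctx_modelfile llama3.1:8b-instruct-q5_K_M 16384 ./Modelfile.16k
# Then register it with the daemon (hypothetical tag name):
#   ollama create llama3.1-8b-16k -f ./Modelfile.16k
```

Every session started from the derived tag then uses the larger window without a per-session /set.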

When not to do this

If you're not building anything yet and just want a chatbot, the Apple Intelligence on-device model is good enough for the casual cases, with no setup required. Conversely, if you need 8B-class quality for a small team sharing one machine, look at a Mac mini M4 Pro with 64 GB unified memory; it's the cheapest "always-on" host that handles a half-dozen concurrent chats at 35–45 tok/s each.

And if you want privacy and also much bigger models, your M4 Max can do far more than 8B. Check our Qwen 3 14B guide, the Qwen 3 32B guide, or the Llama 3.1 70B guide. The 64 GB and 128 GB SKUs in particular handle 30B+ models with no compromises.

Power, heat, and where the M4 Max stops gaining

On a stock M4 Max 14" or 16" MacBook Pro plugged into the 140 W adapter, a sustained 8B chat session draws 22–34 W from the wall — about a fifth of what an equivalent x86 + dedicated-GPU host pulls. The package temperature stabilizes at 78–84 °C under continuous load; the fan ramps from inaudible to a soft hiss around the 30-minute mark. For a laptop, this is the most thermally-comfortable LLM platform available, and battery life with a constant 8B daemon is roughly 6–8 hours on the 14" / 10–12 hours on the 16" depending on the SKU.
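
The wall-power figures translate into energy per token with one division. The sketch below assumes the ~75 tok/s Q5_K_M throughput quoted elsewhere in this guide applies to the same session as the 22–34 W draw; the function name is illustrative.

```bash
# Energy per generated token, from wall power and decode speed.
# joules/token = watts / (tokens per second)
# Usage: joules_per_token <watts> <tokens_per_second>
joules_per_token() {
  awk -v w="$1" -v tps="$2" 'BEGIN { printf "%.2f\n", w / tps }'
}

joules_per_token 22 75   # low end of the quoted wall draw
joules_per_token 34 75   # high end
```

Under that assumption the session costs roughly 0.3 to 0.45 joules per generated token.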

Across the M-series generations the throughput ranking on 8B Q5_K_M is roughly: M4 Max 40-core (75 tok/s) > M2 Ultra 60-core (72) > M4 Max 32-core (55) ≈ M3 Max 40-core (50–55) > M4 Pro (45) > M3 Pro (38) > M2 Max (35) > M1 Max (28). Within a generation the pattern tracks GPU-side memory bandwidth closely, which is a useful rule when sizing a Mac for inference; older generations land lower at similar bandwidth.

Use-case fit, not just speed

8B is the bottom of the "real" model size class. It can:

  • Answer factual questions confidently within its training cut.
  • Summarize a 4k-token document well.
  • Generate short Python / JavaScript / SQL with usable quality.
  • Drive a tool-using agent if the tool surface is small (1–3 tools).

It struggles with:

  • Multi-step reasoning where intermediate state needs to be tracked precisely.
  • Long-context retrieval (~16k+ input) — quality holds but slow prefill makes iteration painful.
  • Code where the language is rare (OCaml, Erlang, Solidity beyond toy contracts).

If your workload sits in the first list, 8B on M4 Max is the right tool. If most of your asks fall in the second, jump to a Qwen 3 14B or Qwen 3 32B on the same hardware — both are within reach on any 36 GB+ SKU.

Embedding model + LLM on one Mac

A hidden superpower of the M4 Max is running an LLM and an embedding model concurrently for RAG. A typical stack: nomic-embed-text (~270 MB, 50 ms per chunk on M4 Max) plus Llama 3.1 8B Q5_K_M (~6 GB resident). Total working set is around 7 GB, leaving 25+ GB for the OS and applications even on the entry 36 GB SKU.

Ollama serves both from the same daemon:

bash
ollama pull nomic-embed-text
ollama pull llama3.1:8b-instruct-q5_K_M

# Embed:
curl -s http://localhost:11434/api/embeddings -d '{
 "model": "nomic-embed-text",
 "prompt": "What is mTLS?"
}' | jq .embedding | head -3

# Generate with retrieved chunks:
curl -s http://localhost:11434/api/generate -d '{
 "model": "llama3.1:8b-instruct-q5_K_M",
 "prompt": "Given these passages: [...] answer: What is mTLS?",
 "stream": false
}' | jq .response

This is the canonical "private personal assistant" loadout, and the M4 Max handles it without breaking a sweat.
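
The step between embedding and generation is ranking stored chunks against the query. A common choice, assumed here rather than prescribed by this guide, is cosine similarity. The sketch below takes two vectors as space-separated strings; the function name is illustrative, and the jq filter in the comment is one way to flatten the /api/embeddings response into that form.

```bash
# Cosine similarity between two embedding vectors, each passed as a single
# space-separated string of numbers.
# Usage: cosine_similarity "<vec_a>" "<vec_b>"
cosine_similarity() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, " "); m = split(b, y, " ")
    if (n != m || n == 0) { print "error: vector length mismatch" > "/dev/stderr"; exit 1 }
    for (i = 1; i <= n; i++) { dot += x[i] * y[i]; na += x[i] * x[i]; nb += y[i] * y[i] }
    if (na == 0 || nb == 0) { print "error: zero vector" > "/dev/stderr"; exit 1 }
    printf "%.4f\n", dot / (sqrt(na) * sqrt(nb))
  }'
}

cosine_similarity "1 0 1" "1 1 0"   # toy vectors, not real embeddings

# Flattening a live embedding (requires curl and jq):
#   curl -s http://localhost:11434/api/embeddings \
#     -d '{"model": "nomic-embed-text", "prompt": "What is mTLS?"}' \
#     | jq -r '.embedding | map(tostring) | join(" ")'
```

Score each stored chunk against the query vector, keep the top few, and paste their text into the generate prompt in place of the [...] placeholder. For more than a few thousand chunks a proper vector index is a better fit than a shell loop.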

Pro tip: prompt caching for repetitive workflows

If you keep prompting with the same long system prompt — a coding assistant with a 1.5k-token style guide, say — llama.cpp's --prompt-cache flag stores the prefill KV cache on disk. The next run of the same prefix takes near-zero prefill time:

bash
./build/bin/llama-cli \
 -m models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
 --prompt-cache ./caches/my-style-guide.bin \
 --prompt-cache-all \
 --gpu-layers 999 \
 -f my-style-guide-and-task.txt

The first invocation builds the cache; subsequent invocations re-read it. For a daemon, Ollama's recently added prefix caching achieves the same thing transparently — it's on in 0.4.5 and later.

Bonus: LoRA fine-tuning is feasible at 8B on M4 Max

A surprise capability of the larger M4 Max SKUs is that LoRA fine-tuning of an 8B model is genuinely usable. With Apple's mlx-lm toolchain, a LoRA of Llama 3.1 8B trains on a 5,000-example dataset in about 30–45 minutes on a 40-core GPU with 64 GB unified memory — not full-rank, but enough to personalize the model to your codebase or domain corpus.

bash
pip install mlx-lm
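# --data must point at a directory containing train.jsonl and valid.jsonl.
# LoRA rank is not a command-line flag; set it under lora_parameters in a
# YAML config passed with --config.
mlx_lm.lora --model meta-llama/Llama-3.1-8B-Instruct \
 --train \
 --data ./my-corpus \
 --batch-size 2 --num-layers 16 \
 --learning-rate 1e-4 \
 --iters 1000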

# Fuse the LoRA back into the base weights, then quantize for Ollama:
mlx_lm.fuse --model meta-llama/Llama-3.1-8B-Instruct \
 --adapter-path ./adapters --save-path ./my-fused-model

The fused model can be converted to GGUF and quantized (for example with llama.cpp's convert_hf_to_gguf.py and llama-quantize), then imported with ollama create under a tag like llama3.1:8b-my-codebase. Combined with prompt caching, the workflow becomes: fine-tune on your codebase once, cache a long system prompt of architectural rules, and every chat starts with a model that "knows" your code and a near-zero prefill cost. That kind of stack is mechanically straightforward but expensive on commercial APIs; on a 64 GB M4 Max it's a Saturday afternoon project and then free forever.
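
The conversion path can be sketched as a short script. The assumptions are stated loudly: a llama.cpp checkout built as in Step 2 lives at $LLAMA_CPP, its convert_hf_to_gguf.py accepts the directory written by mlx_lm.fuse (expected for a non-quantized base, but not verified here), and the file names and helper names are placeholders. With DRY_RUN set, the external commands are printed rather than executed.

```bash
# Sketch: turn the fused model into an Ollama tag.
LLAMA_CPP=${LLAMA_CPP:-./llama.cpp}

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Usage: fused_to_ollama <fused_dir> <ollama_tag> [quant]
fused_to_ollama() {
  fused_dir=$1; tag=$2; quant=${3:-Q5_K_M}
  # 1. Fused safetensors -> FP16 GGUF
  run python3 "$LLAMA_CPP/convert_hf_to_gguf.py" "$fused_dir" \
      --outfile fused-f16.gguf --outtype f16
  # 2. FP16 GGUF -> quantized GGUF
  run "$LLAMA_CPP/build/bin/llama-quantize" fused-f16.gguf "fused-$quant.gguf" "$quant"
  # 3. Point a Modelfile at the quantized file (written even in dry-run mode)
  printf 'FROM ./fused-%s.gguf\n' "$quant" > Modelfile.fused
  # 4. Register the tag with the Ollama daemon
  run ollama create "$tag" -f Modelfile.fused
}

DRY_RUN=1
fused_to_ollama ./my-fused-model llama3.1:8b-my-codebase
```

Review the printed commands, then unset DRY_RUN to run them for real. The conversion script needs the Python dependencies from llama.cpp's requirements.txt.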

Frequently asked questions

What is the expected tokens-per-second performance for Llama 3.1 8B on Apple M4 Max?
Community benchmarks suggest approximately 50–80 tokens per second on the Apple M4 Max, depending on the GPU variant, runtime, and quantization settings. This speed is sufficient for single-user chat applications, though prefill latency may dominate for long prompts.

What are the memory requirements for running Llama 3.1 8B on Apple M4 Max?
Llama 3.1 8B at Q4_K_M requires around 4.7 GB for weights and an additional 0.5–2 GB for the KV cache, depending on the context length (roughly 4k to 16k tokens at FP16). The Apple M4 Max, with 36 GB to 128 GB of unified memory, accommodates this easily.

What are the advantages of using Ollama over llama.cpp on Apple M4 Max?
Ollama simplifies setup by automatically detecting hardware, downloading models, and providing an OpenAI-compatible API. It still lets you pick a quantization (via the model tag) and a context length (num_ctx), but llama.cpp exposes lower-level controls such as custom samplers and on-disk prompt caches, making it the better fit for advanced users.

How can I troubleshoot 'out of memory' errors when running Llama 3.1 8B?
To resolve 'out of memory' errors, reduce the context length (e.g., from 4096 to 2048 tokens), use a lower-bit quantization (e.g., Q3_K_M), or enable KV-cache quantization in llama.cpp. Closing other memory-intensive applications can also help.

What is the impact of quantization on model quality for Llama 3.1 8B?
The impact depends on the bit width. Q4_K_M has minimal quality loss (1–3%) compared to FP16, while Q3_K_M shows noticeable degradation (5–8%). Higher-precision quantizations such as Q6_K or Q8_0 are nearly lossless but require more memory.

— SpecPicks Editorial · Last verified 2026-06-08
