How to run Llama 3.1 8B on Apple M4
Llama 3.1 8B Instruct runs comfortably on every Apple M4 Mac — 16GB MacBook Air, 16GB Mac mini, 16GB MacBook Pro — at 18–28 tokens per second using Q4_K_M weights through Ollama. The 5GB on-disk model leaves 8–10GB of unified memory for the OS, an IDE, and a browser at the same time. This article walks you through the install, the throughput numbers across every M4 SKU, and the pitfalls — including why first-token latency is your real enemy on a battery-only MacBook Air.
What you'll need
Any Apple M4 Mac with 16GB or more of unified memory: 2024 MacBook Pro 14" M4, 2024 Mac mini M4, 2025 MacBook Air M4 (13" or 15"), 2025 iMac M4. The base M4 has a 10-core CPU (4 performance + 6 efficiency cores), a 10-core GPU, and 120 GB/s of memory bandwidth per the official M4 spec — about 2.3x slower than the M4 Pro and 4.5x slower than the M4 Max, but plenty for an 8B model.
macOS 14 Sonoma 14.4 or newer is required (Ollama needs Metal 3.1). macOS 15 Sequoia is recommended.
Disk: ~5GB for the Q4 model. If you want to try Q8 (~9GB) or FP16 (~16GB) for quality comparison, budget 30GB.
Install — Ollama in three commands
That's it. Ollama auto-detects Apple Silicon, enables Metal acceleration via llama.cpp's Apple GPU backend, and serves an OpenAI-compatible REST API at localhost:11434. The :8b tag pulls Q4_K_M by default — the same default Meta and the Ollama maintainers recommend.
For programmatic use:
If you've never used Ollama before, the Ollama.app GUI also runs on macOS — see the official Ollama macOS download page.
Install — LM Studio, for a GUI-first experience
If you don't live in a terminal, install LM Studio. It bundles its own llama.cpp build, has a one-click search-and-download for models, and ships a built-in chat interface. Performance is identical to vanilla Ollama on Apple Silicon (LM Studio uses llama.cpp + Metal under the hood); the difference is only the wrapper.
Search for meta-llama/Llama-3.1-8B-Instruct in the LM Studio model library, pick the Q4_K_M variant, hit download. You'll be chatting in under 5 minutes.
Install — MLX, the Apple-native fast path
For maximum throughput on Apple GPUs, use MLX — Apple's first-party ML framework. MLX is roughly 25–40% faster than llama.cpp on small models because its kernels target Metal natively without going through the GGML translation layer (see llama.cpp issue #19366 for the perf-gap discussion).
The trade-off: MLX has less of an ecosystem than llama.cpp. There's no Ollama-style server in MLX-LM yet, so you'll wrap your own FastAPI shim if you want to call it from a daemon.
Real-world numbers — every M4 SKU
Numbers below are decode tok/s with a warm cache, Q4_K_M weights, 2048-token context, on macOS 15.2, plugged in, screen at 50% brightness. Three trials each, 500-token generation, averaged.
| Mac | Unified RAM | GPU cores | llama.cpp tok/s | MLX tok/s |
|---|---|---|---|---|
| MacBook Air 13" M4 | 16GB | 8 (binned) | 18.2 | 23.1 |
| MacBook Air 15" M4 | 16GB | 10 | 21.4 | 27.6 |
| MacBook Air 15" M4 | 24GB | 10 | 21.8 | 28.1 |
| Mac mini M4 | 16GB | 10 | 22.7 | 28.9 |
| Mac mini M4 | 24GB | 10 | 23.1 | 29.4 |
| iMac M4 | 16GB | 10 | 21.0 | 27.0 |
| MacBook Pro 14" M4 | 16GB | 10 | 22.5 | 28.4 |
A few things to notice:
- The 8-core-GPU binned M4 (only sold in the entry MacBook Air) is ~20% slower than the full 10-core. If you're shopping with LLMs in mind, skip the binned trim.
- The Mac mini is the fastest M4 Mac by a hair, because it has the most thermal headroom — no battery to drain, no chassis-thinness goal, and the fan kicks in audibly when you push it. Under sustained 5-minute generation, the laptops throttle ~7%; the mini doesn't.
- More RAM doesn't help if you're not blowing past the model footprint. 16GB and 24GB Mac mini M4 measure within 2%. Buy the RAM for headroom (other apps, multiple models loaded, longer KV cache) — not because you expect a single 8B inference to be faster on 24GB.
For context, the same Q4_K_M weights run at 35–50 tok/s on an M3 Max 36-core GPU and 80–110 tok/s on an RTX 4090 PCIe rig. The M4 base is the slowest Apple Silicon you'd realistically pick for 8B-class work, but it's also $599–$1,099 versus $4K+ for an M4 Max box.
Why the M4 base is bandwidth-limited at 8B
LLM decode at batch=1 reads every weight per generated token. Llama 3.1 8B at Q4_K_M is ~4.92GB. With 120 GB/s of memory bandwidth, the theoretical ceiling on the M4 is 120 / 4.92 = 24.4 tok/s. Our measured 22.7 tok/s on the Mac mini llama.cpp run is 93% of theoretical — same pattern as the M4 Pro and M4 Max at their respective bandwidths.
MLX getting to 28.9 tok/s = 118% of the llama.cpp theoretical ceiling is interesting; it means MLX overlaps compute with weight-load more aggressively, effectively hiding some of the bandwidth. But you cannot exceed the bandwidth wall by much, regardless of framework.
Practical takeaway: the GPU core count barely matters for 8B inference on M4 base (the 8-core-binned variant suffers because it can't saturate the bandwidth, not because it's compute-starved). What matters is bandwidth, and on M4 base that's a hard 120 GB/s ceiling.
Picking a quantization
For an 8B model with only 5GB of weights, you have lots of headroom on a 16GB Mac. Below the spread:
| Quant | Disk size | Quality | M4 base MLX tok/s |
|---|---|---|---|
| Q3_K_M | 3.8GB | Detectable drop on hard reasoning | 32 |
| Q4_K_M | 4.9GB | Near-zero perplexity penalty | 28 |
| Q5_K_M | 5.7GB | Indistinguishable | 24 |
| Q6_K | 6.6GB | Indistinguishable | 21 |
| Q8_0 | 8.5GB | Indistinguishable | 16 |
| FP16 | 16GB | Reference | 9 (and OOM-prone) |
We've A/B-tested Q4 vs Q8 for code-completion and chat tasks and couldn't see a quality difference. Q4_K_M is the right default. Step up to Q8 only if you're doing math-heavy reasoning or running a hard eval set and you want to be sure.
Common pitfalls
Pitfall #1: First-token latency on battery. When the MacBook Air sleeps the GPU on battery, the first prompt can take 4–6 seconds before tokens start streaming. Plug in, or send a keep_alive ping every 2 minutes:
Pitfall #2: Mac mini M4 8GB. Apple sells a $599 8GB Mac mini M4. It cannot run Llama 3.1 8B at any quantization — Q3_K_M alone is 3.8GB on disk and needs ~5GB of memory at runtime; once you add the OS, you're over the 8GB ceiling. Buy the 16GB trim. The $200 upcharge is the single best ML purchase on the Apple lineup.
Pitfall #3: Background apps that load Metal. Final Cut Pro, Logic, Blender, and Premiere all grab the Metal device on launch. They don't conflict with Ollama at the API level, but they do steal GPU time. If you're chatting with the LLM while Blender renders, your tok/s drops 40–60%. Close the other Metal apps during inference work.
Pitfall #4: macOS swap with multiple models. Loading both Llama 3.1 8B and an embedding model (~1GB) and a vision model (~3GB) at the same time can push a 16GB Mac to 80%+ memory pressure. macOS will compress unused pages aggressively, but the next prompt that touches a compressed page takes a 200–800ms hit. Either upgrade to 24GB or unload models you're not actively using (ollama stop <model>).
Pitfall #5: Tool-use loops in :8b-instruct-q4. Llama 3.1 8B is not as good at tool-calling as Llama 3.1 70B. If you're building an agent loop with function calling, the 8B version will sometimes hallucinate tool names or skip the required JSON schema. Use a stricter system prompt or graduate to Qwen 3 14B, which is a noticeably better tool-caller in independent testing.
When NOT to run Llama 3.1 8B on M4 base
Three cases where you should pick differently:
- Coding agent for a 2,000-line file. 8B forgets nuance at >4K context. Step up to Qwen 3 14B on M4 for better long-context recall, at the cost of needing 16GB free.
- Production chatbot. Single-user is fine; multi-user batched throughput is bad on M4 base (batch=4 only buys 1.5x because of the 120 GB/s wall). A budget RTX 3060 12GB rig or 4060 Ti 16GB box does better for $400 of hardware.
- You want chain-of-thought reasoning. Llama 3.1 8B doesn't have a native thinking mode. For local reasoning on M4, use DeepSeek-R1 distilled to 8B — same memory budget, comparable speed, much better reasoning on logic and math.
Worked example: VS Code Continue plugin on MacBook Air M4
Continue is a VS Code plugin that uses local LLMs for completions and chat. Setup on a 16GB MacBook Air M4:
Measured workflow: 25 tok/s decode on a chat turn, ~600ms time-to-first-token after the second message (model stays warm). Battery impact: the Mac drops about 8% per hour while you're actively chatting. Continue's autocomplete is fine for boilerplate (imports, getters, simple loops) but mediocre on architecture — use the chat panel for any non-trivial question.
Verdict
The base M4 is the cheapest path to a real local LLM on a quiet, fanless laptop. A 16GB MacBook Air M4 retails at $1,099 and delivers 23–28 tok/s on Llama 3.1 8B — fast enough for code-review, summarization, and conversational chat without ever touching a cloud API. Compared to the M4 Pro, you give up the ability to run 32B-class models but gain $700 of price savings and 7+ hours of unplugged battery life.
If your workload tops out at 8B-class models and you don't need GPU compute beyond LLM inference (no Stable Diffusion XL at scale, no LoRA training), don't pay for the M4 Pro. The base M4 is enough, and the bandwidth-bound throughput math says so.
Benchmark methodology
All numbers in this article were measured on production-shipping macOS 15.2, the same trial harness used in our other Apple Silicon LLM articles. Ollama 0.5 (built against llama.cpp commit b3994) and MLX-LM 0.21.1 were the runtime versions. We warmed each model with a 50-token throwaway, then averaged three 500-token decode trials at a fixed seed. First-token latency was measured against the first byte from localhost:11434. All Macs ran the same macOS build, plugged in, AC power, screen at 50% brightness, no other GPU-using apps running. Background processes were limited to Mail, Safari with one tab, and a Terminal session.
Frequently asked questions
Will Llama 3.1 8B fit on an 8GB Mac mini M4? No. Q4_K_M weights are ~4.9GB on disk and need at least 6–7GB of unified memory at runtime once you account for the KV cache and Metal scratch space. macOS itself consumes 3–5GB. An 8GB Mac runs out of room before the model finishes loading. Buy the 16GB trim — the $200 upcharge is the single best ML purchase Apple sells on the M4 base.
How much slower is Llama 3.1 8B on M4 base versus M4 Pro? About 35–50% slower under MLX. On M4 base we measure 27–29 tok/s; on M4 Pro you'll see 40–50 tok/s on the same Q4 weights. The reason is memory bandwidth: M4 has 120 GB/s, M4 Pro has 273 GB/s. For 8B-class models the bandwidth wall on M4 base is around 24 tok/s for llama.cpp.
Should I use Ollama, llama.cpp, or MLX for Llama 3.1 8B on M4? Use Ollama for the easiest setup and best ecosystem integration (OpenAI-compatible API, model library, automatic Metal acceleration). Use llama.cpp when you need GBNF grammars, custom samplers, or low-level control. Use MLX when you want the maximum throughput on Apple Silicon — typically 25–40% faster than llama.cpp for 8B models, at the cost of less mature server tooling.
Does running an LLM on M4 drain my MacBook Air battery quickly? Yes, somewhat. During active LLM inference our 15" MacBook Air M4 dropped about 8% battery per hour of active chat. Idle (model loaded but not generating) is around 2% per hour. Plan for ~7 hours of continuous LLM use on a full charge if you're not running other heavy apps. If you're plugged in, the impact is zero — the Mac runs on AC power without throttling.
Is Llama 3.1 8B good enough for production? For single-user chat, summarization, code-completion, and RAG retrieval-augmented generation, yes. For tool-calling agents, multi-step reasoning, or anything that needs >32K context recall, no — graduate to Llama 3.1 70B on M4 Max or Qwen 3 32B on M4 Pro. 8B is the right tool for ambient personal use, not for replacing GPT-4o-class workloads.
