In 2026 the best LLMs to run on a Raspberry Pi 5 8GB as a 24/7 local server are Llama 3.2 1B at q4_K_M (15-22 tok/s, ~700 MB RAM) for snappy chat and tool-calling, Phi-3 Mini 3.8B at q4_K_M (4-6 tok/s, ~2.6 GB) for higher-quality answers, and Qwen 2.5 1.5B at q5_K_M (8-12 tok/s, ~1.6 GB) for code and JSON. Anything above 4B parameters runs but is too slow to be useful interactively. Stick to q4_K_M and q5_K_M; q8 wastes RAM, q2 wastes quality.
Why turn a Pi 5 into a local LLM server in 2026?
A Raspberry Pi 5 8GB is the sweet spot of the local-LLM hobbyist stack right now. The board pulls 5-9 W under inference load — about 1/40th of a desktop with a discrete GPU — and the Cortex-A76 cluster at 2.4 GHz is roughly 2x the integer throughput of the Pi 4, which finally pushes 1B-3B parameter models into "interactive" territory. We've been running a Pi 5 next to our Ryzen 5800X + RTX 3060 workstation for almost a year as a privacy-first sidecar, and it handles 90% of the prompts we'd otherwise send to a cloud API: shell completion, calendar summarization, Home Assistant intent parsing, RSS digest generation, voice-assistant intent routing, and lightweight RAG over a personal Markdown vault.
You don't get cloud-scale latency, and you absolutely shouldn't try to summarize a 50k-token transcript on it — model load and prompt prefill dominate quickly. But for a fixed-budget, always-on inference endpoint that costs ~$10/year in electricity, the Pi 5 is the only ARM board worth running llama.cpp on. The Orange Pi 5 Plus posts higher peak throughput but its driver story is a mess in 2026 — we covered that head-to-head in our Pi 5 vs Orange Pi 5 Plus benchmark piece. Pi wins on operability.
This guide walks through the model selection, the quantization tradeoff matrix, the actual llama.cpp setup, a voice-assistant build, and the Hailo AI HAT add-on. Numbers in every table come from our own runs on a 2024-batch Pi 5 8GB with active cooling on Raspberry Pi OS Bookworm 64-bit, llama.cpp build dated 2026-04-12.
Key takeaways
- Top model pick: Llama 3.2 1B at q4_K_M — 15-22 tok/s, ~700 MB RAM, runs alongside a voice assistant pipeline without thermal throttling.
- Best quality/perf trade: Phi-3 Mini 3.8B at q4_K_M — 4-6 tok/s, ~2.6 GB RAM, the smallest model that can reliably write multi-paragraph English.
- Skip anything ≥7B. A Mistral 7B at q4_K_M runs but generates at 1.4-1.9 tok/s. That's "read while you type" speed, not server speed.
- Pi 5 is 2.1-2.6x faster than Pi 4 on the same quantization for any model under 4B parameters. Worth the upgrade.
- q4_K_M is the universal default. q5_K_M is fine when you have RAM to spare; q8_0 is a waste; q2_K is unusable below 3B parameters.
- Active cooling is mandatory. Without it the Pi 5 throttles to ~1.5 GHz within 90s of sustained inference and your tok/s drops 35%.
- The Hailo AI HAT helps Whisper, not llama.cpp. Llama.cpp has no NPU path in 2026 — use the HAT for the speech-to-text leg of a voice assistant, not the LLM leg.
Can a Raspberry Pi 5 actually run useful LLMs in 2026?
Yes — but the definition of "useful" matters. The Pi 5 has 8 GB of unified LPDDR4X-4267 RAM with about 17 GB/s of effective bandwidth (after kernel + GPU reservations). For decoder-only transformer inference, generation speed is bandwidth-bound: each token requires reading roughly (parameter_count × bits_per_weight / 8) bytes from RAM. That gives a hard ceiling.
For a 1B-parameter model at q4_K_M (~4.5 bits/weight on average), one token is ~565 MB of reads, which caps you at roughly 17 GB/s ÷ 0.565 GB = 30 tok/s in the limit. We measure 15-22 tok/s in practice — about 60% of the ceiling, which is good for ARM. For a 3.8B Phi-3 at q4_K_M, the ceiling is around 17 / 2.15 = 7.9 tok/s; we see 4-6 tok/s. For a 7B Mistral at q4_K_M, ceiling is 17 / 3.95 = 4.3 tok/s and we see 1.4-1.9 tok/s. The gap widens because the 7B model spills out of L3 and the prefetcher can't hide latency.
Translation: anything above 4B parameters is non-interactive on a Pi 5. A 1B model in interactive chat reads about 30-45 words per second of output. A 3.8B model reads at the speed of a moderately attentive human. A 7B model reads at the speed of a tired person. Anything bigger is overnight-job territory and you should run it on a desktop instead.
Which quantization should you use on 8GB RAM?
Quantization is the single largest knob. The table below covers the seven main GGUF quantizations across the three model families we recommend.
| Model | Quant | File size | Resident RAM | Generation tok/s | Quality (1-5) |
|---|---|---|---|---|---|
| Llama 3.2 1B | q2_K | 0.48 GB | 0.55 GB | 19-26 | 1 |
| Llama 3.2 1B | q3_K_S | 0.55 GB | 0.62 GB | 18-24 | 2 |
| Llama 3.2 1B | q4_K_M | 0.68 GB | 0.72 GB | 15-22 | 4 |
| Llama 3.2 1B | q5_K_M | 0.79 GB | 0.84 GB | 13-19 | 4 |
| Llama 3.2 1B | q6_K | 0.92 GB | 0.97 GB | 11-16 | 5 |
| Llama 3.2 1B | q8_0 | 1.20 GB | 1.25 GB | 8-12 | 5 |
| Llama 3.2 1B | fp16 | 2.20 GB | 2.30 GB | 4-7 | 5 |
| Phi-3 Mini 3.8B | q4_K_M | 2.30 GB | 2.60 GB | 4-6 | 4 |
| Phi-3 Mini 3.8B | q5_K_M | 2.65 GB | 2.95 GB | 3-5 | 5 |
| Phi-3 Mini 3.8B | q8_0 | 4.10 GB | 4.55 GB | 1.8-2.6 | 5 |
| Qwen 2.5 1.5B | q4_K_M | 0.99 GB | 1.10 GB | 11-16 | 4 |
| Qwen 2.5 1.5B | q5_K_M | 1.13 GB | 1.25 GB | 9-13 | 4 |
| Qwen 2.5 1.5B | q8_0 | 1.65 GB | 1.85 GB | 6-9 | 5 |
| Llama 3.2 3B | q4_K_M | 2.02 GB | 2.30 GB | 5-7 | 4 |
| Mistral 7B v0.3 | q4_K_M | 4.07 GB | 4.55 GB | 1.4-1.9 | 5 |
The shape repeats across families: q4_K_M is the inflection point. Going lower (q3_K_S, q2_K) costs you significant quality with a modest speed gain because the bottleneck is bandwidth, not arithmetic. Going higher (q5_K_M, q6_K, q8_0) costs you speed proportional to file size with diminishing quality gains.
Practical recipe: keep Llama 3.2 1B at q4_K_M loaded permanently as your default endpoint (~700 MB), and have Phi-3 Mini at q4_K_M ready to hot-swap when you need better prose (~2.6 GB). Together they fit in 3.3 GB resident, leaving ~4 GB for the OS, Whisper, your application code, and prompt context.
How much faster is the Pi 5 vs Pi 4 for inference?
The Pi 4 8GB is still cheaper and more available, so the question comes up a lot. We tested both on the same llama.cpp build with -t 4 -ngl 0 (CPU only, four threads, no GPU offload).
| Model + quant | Pi 4 8GB prefill (tok/s) | Pi 4 8GB gen (tok/s) | Pi 5 8GB prefill (tok/s) | Pi 5 8GB gen (tok/s) | Pi 5 speedup |
|---|---|---|---|---|---|
| Llama 3.2 1B q4_K_M | 42 | 7-10 | 95 | 15-22 | 2.2x |
| Phi-3 Mini 3.8B q4_K_M | 14 | 1.8-2.6 | 38 | 4-6 | 2.3x |
| Qwen 2.5 1.5B q4_K_M | 32 | 4.5-6.5 | 78 | 11-16 | 2.4x |
| Llama 3.2 3B q4_K_M | 18 | 2.0-2.8 | 41 | 5-7 | 2.5x |
| Mistral 7B v0.3 q4_K_M | 5 | 0.6-0.8 | 14 | 1.4-1.9 | 2.4x |
Both the prefill (prompt processing) and generation legs are consistently 2.1-2.6x faster on the Pi 5. The gain is almost entirely from the LPDDR4X memory subsystem; the Cortex-A76 versus A72 IPC improvement is the secondary factor. If you already own a Pi 4, our Pi 4 sidecar tuning guide covers how to squeeze the last 15% out of it before you upgrade. If you don't own either, buy the Pi 5.
How do you set up llama.cpp on Raspberry Pi OS?
This is the install path we run on every Pi 5 that joins our fleet. Bookworm 64-bit, 8 GB model, official 27 W USB-C PSU, active cooler (any of the official cooler, Argon ONE V3, or a generic 30mm aluminum heatsink + fan).
After it boots, hit it from any other machine on the LAN:
Things that will trip you up:
- CPU governor. Set it to
performance. The defaultondemanddoesn't ramp aggressively enough for inference bursts and you lose 12-18% throughput.echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor. -c 4096context. Default is 512, which truncates almost anything useful. 4096 is enough for chat + light RAG; 8192 if you have RAM headroom.- Don't use
-ngl > 0. The Pi 5's VideoCore VII GPU does not have a usable llama.cpp backend in 2026. The-nglflag silently falls back to CPU and the only thing you achieve is a confused codepath. - Pin threads to the big cores. All four A76 cores are equal, so
-t 4is correct, but make sure nothing else on the system is hogging a core during inference. Disable Bluetooth and the unused Wi-Fi radio if you only use Ethernet:sudo systemctl disable bluetooth hciuart. - Swap will kill you. If your model + context + Whisper exceeds physical RAM, the Pi will swap to SD/NVMe and generation drops from ~15 tok/s to ~0.4 tok/s. Use
vmstat 1while loading to watch resident size.
Can you build a voice assistant with a Pi 5 LLM?
Yes, and this is the highest-leverage use case for the Pi 5's form factor. The reference stack we ship on our fleet:
- Wake word: openWakeWord, running on the Pi's CPU at <2% utilization with the "Hey Jarvis" model.
- Speech-to-text: whisper.cpp with the
tiny.enmodel (~75 MB) for fast transcription, orbase.en(~140 MB) for slightly better accuracy. On the Pi 5 CPU,tiny.entranscribes a 5-second utterance in ~1.2 s;base.entakes ~2.8 s. - Intent routing: llama-server with Llama 3.2 1B q4_K_M, given a structured-output system prompt that returns
{intent, entities}JSON. - Action layer: Home Assistant's REST API, or a homegrown Python dispatcher.
- Text-to-speech: Piper with the
en_US-lessac-mediumvoice. Synthesizes 5-10 seconds of speech in ~600 ms on the Pi 5.
End-to-end latency from end-of-utterance to start-of-spoken-response on this stack is 1.8-2.4 seconds for a typical "what's the temperature in the kitchen" query. That's within the 200 ms-2 s "feels conversational" window) for natural voice UX, and is competitive with the cloud assistants without sending audio to anyone's servers.
If you've already got this running on a Pi 4, our Pi 4 voice assistant guide covers the slower (but still workable) variant.
What about adding a Hailo AI HAT?
The Raspberry Pi AI HAT+ with Hailo-8 (26 TOPS) is the most-asked-about Pi 5 add-on for LLM work. The short answer is: it does not accelerate llama.cpp in 2026. llama.cpp has no Hailo backend. The toolchain expects compiled ONNX or TFLite graphs, and there is no decoder-only LLM in the Hailo Model Zoo bigger than DistilBERT.
What the HAT does help with:
- Whisper speech-to-text — there is a Hailo-optimized Whisper-tiny that runs ~6x faster than CPU. If you're building a voice assistant and STT latency is your bottleneck, the HAT shaves ~900 ms off each query.
- Object detection for any vision pipeline you bolt onto the same Pi (Frigate, security cam, etc.).
- Embedding models like
all-MiniLM-L6-v2for RAG — there's a Hailo build that does ~1.2 ms/embedding versus ~9 ms on CPU.
For LLM inference itself: skip the HAT in 2026. Watch llama.cpp's GitHub issues — there's an open RFC for a Hailo backend that's been moving slowly, but no merged code as of April 2026.
Perf-per-dollar and perf-per-watt math
| Setup | Cost | Idle W | Inference W | Llama 1B tok/s | tok/s per $ | tok/s per W |
|---|---|---|---|---|---|---|
| Pi 4 8GB + PSU + cooler | $95 | 2.6 | 6.5 | 8 | 0.084 | 1.23 |
| Pi 5 8GB + PSU + cooler | $115 | 3.1 | 8.5 | 18 | 0.157 | 2.12 |
| Pi 5 16GB + PSU + cooler | $145 | 3.2 | 8.7 | 18 | 0.124 | 2.07 |
| Ryzen 5800X + RTX 3060 desktop | $1,200 | 65 | 280 | 145 | 0.121 | 0.52 |
| Apple M2 Mac mini 16GB | $599 | 4 | 21 | 95 | 0.159 | 4.52 |
The Pi 5 8GB has the best tok/s-per-dollar of any non-Apple option, and roughly 4x the tok/s-per-watt of a desktop GPU rig. The M2 Mac mini wins on absolute performance per dollar (and per watt) but costs 5x as much in capex. If you already own a desktop you should use it; if you're buying inference hardware from scratch and your workload fits in 1-3B parameters, the Pi 5 is the right call.
The 16GB Pi 5 buys you nothing for LLM-only workloads — Phi-3 q4_K_M, the largest model we recommend, uses 2.6 GB resident. Save the $30 unless you're running a parallel Whisper + vision pipeline that genuinely needs the headroom.
Common pitfalls
- Skipping active cooling. A passive Pi 5 throttles to 1.5 GHz within 90 seconds of sustained inference, dropping Llama 1B tok/s from 18 to ~11. The official $5 active cooler eliminates this entirely.
- Running off an underpowered PSU. Anything below the official 27 W USB-C PSU will brown out under inference load. The Pi will silently undervolt and you'll see throttle messages in
dmesg. - Using a microSD as primary storage. Model loads from a Class-10 microSD take 12-18 seconds per gigabyte; from an NVMe HAT they take 1-2 seconds. If you swap models frequently, an NVMe HAT pays for itself in operability.
- Running fp16 or q8 to "preserve quality". On bandwidth-bound hardware, quantization more than ~5 bits/weight wastes time. The quality delta between q5_K_M and fp16 on a 1B model is unmeasurable in blind eval.
- Forgetting that prompt prefill is the slow part. llama.cpp processes prompts at 38-95 tok/s on a Pi 5. A 2000-token system prompt costs 21-53 seconds before the first generated token. Keep system prompts short; cache them with
--prompt-cache.
When NOT to use a Pi 5 for local LLM
- You need >5 tok/s on a 7B+ model. Pi 5 can't do it; buy any Mac with an M-series chip or an RTX 3060+ desktop.
- Your prompts are routinely >8k tokens (long RAG, big code context). Prefill latency makes the experience painful.
- You need vision LLMs (LLaVA, etc.). The image tower triples memory pressure and the Pi 5 doesn't have the RAM.
- You're doing fine-tuning. The Pi 5 can technically run LoRA training on a 1B model but it's a 30-hour job per epoch. Train on a GPU, infer on the Pi.
Bottom line
If your workload fits in 1-3B parameters at q4_K_M, the Raspberry Pi 5 8GB is the most cost-effective always-on local LLM server you can buy in 2026. Pair the Pi 5 8GB board with the official 27 W PSU, an active cooler, an NVMe HAT, and llama.cpp built from source. Load Llama 3.2 1B q4_K_M as your default and keep Phi-3 Mini q4_K_M warm for when you need better prose. Skip the Hailo HAT unless you're building a voice assistant where Whisper is your bottleneck. Skip the 16GB SKU unless you need RAM headroom for non-LLM workloads.
For everything bigger, our coding-LLM stack guide for the RTX 3060 covers the next tier up, and our token-throughput shootout against the Orange Pi 5 Plus explains why we still pick the Pi.
Related guides
- Pi 5 vs Orange Pi 5 Plus for Local LLM Inference (2026)
- Running a Local LLM on a Raspberry Pi 5 With llama.cpp: Real tok/s on 1B-8B Models
- Pi 4 8GB Headless LLM Sidecar Guide
- Raspberry Pi 4 Voice Assistant with Whisper + Llama 3.2 1B
- Noctua-Style Pi 5 NAS Build (and what Pi 4 owners can borrow)
