Running Local LLMs on a Raspberry Pi 5 in 2026: What Works, What Doesn't

A realistic envelope for Pi 5 LLM inference: which models fit, which build flags matter, and where the Pi loses to a Mac mini or Jetson.

Yes, a Raspberry Pi 5 local LLM setup works in 2026, with caveats. An 8GB Pi 5 comfortably runs Llama 3.2 3B, Phi-3 Mini, and Qwen 2.5 1.5B at q4 quantization, generating between 3 and 8 tokens per second on CPU alone. Anything above 7B at q4 is unrealistic without external acceleration. This guide breaks down the realistic envelope, the build flags that matter, and where the Pi actually sits on the perf-per-dollar curve.

Why anyone wants a Pi to run LLMs

The Pi 5 LLM use case is small but real. You want offline inference for privacy (no API calls), a low-power always-on assistant for home automation (Home Assistant + Ollama), an air-gapped dev workstation for sensitive code, or a learning platform for understanding how llama.cpp actually works under the hood. None of these workloads need GPT-4-class intelligence; they need a small model that runs reliably, doesn't melt the SD card, and can be reasoned about end-to-end.

The Pi 5 (4-core Cortex-A76 @ 2.4 GHz, 4 or 8 GB LPDDR4X, no NPU, no practical GPU compute path for LLMs) is the cheapest hardware that meets that bar in 2026. It is not the fastest. A used Mac mini M1 with 16 GB, selling for $300 on eBay, will outperform a Pi 5 8GB by 5-10x on the same model. A Jetson Orin Nano Super at $250 will outperform it by 8-15x with proper CUDA acceleration. The Pi wins exclusively on three axes: $80 (4GB) / $90 (8GB) sticker price, sub-10W power draw, and the world's largest community of similar tinkerers to copy configs from. If your workload survives those constraints, SBC LLM inference on a Pi 5 is delightful. If it doesn't, buy a Mac mini.

Key Takeaways

  • An 8GB Pi 5 runs Llama 3.2 3B at q4 at 5-7 tok/s sustained
  • Phi-3 Mini (3.8B) runs at q4_K_M comfortably with ~2.4 GB resident
  • Anything above 7B at q4 hits swap and tok/s collapses to <0.5
  • llama.cpp on ARM with NEON enabled is the canonical inference path; the Pi 5's A76 cores don't implement SVE, so it's a non-factor
  • Ollama on the Pi 5 is the easiest install path but adds ~10% overhead vs raw llama.cpp
  • The Hailo-8L AI HAT helps for vision/classification, not for transformer inference
  • Cooling matters; without an active cooler the Pi 5 throttles within minutes

What's the largest model a Pi 5 can run?

The hard ceiling on an 8GB Pi 5 is set by RAM. The model + KV cache + system overhead must fit. Llama 3.2 3B at q4_K_M is ~2.0 GB resident, leaves headroom for a 4K context, and runs ~6 tok/s. Phi-3 Mini 3.8B at q4_K_M is ~2.4 GB resident, ~4.5 tok/s. Qwen 2.5 1.5B at q4 is ~1.0 GB, ~10 tok/s. Llama 3.1 8B at q4_K_M is ~5.0 GB resident: technically fits, but with KV cache for any meaningful context you're looking at 6.5+ GB and the system has nothing left. Tok/s drops to 1-1.5. At q3_K_M the 8B drops to ~3.7 GB and runs at ~2 tok/s but quality degrades visibly. Anything at 13B and above is impractical even on the 8GB Pi.
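Before pulling a larger model, it's worth checking what the system actually has left rather than trusting the nominal 8 GB. A quick sanity check:

```bash
# How much headroom is really there? llama.cpp mmaps the weights by default,
# so the 'available' column (not 'free') is the realistic budget for
# model weights plus KV cache; exceed it and you're into swap.
free -m
```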

Quantization matrix

Quant | Bits/weight | 3B model RAM | 7B model RAM | Tok/s on Pi 5 (3B) | Quality vs fp16
fp16 | 16 | 6.0 GB | 14.0 GB | OOM | Reference
q8_0 | 8 | 3.2 GB | 7.2 GB | 2.5 | Indistinguishable
q6_K | 6.5 | 2.5 GB | 5.5 GB | 3.5 | Indistinguishable
q5_K_M | 5.5 | 2.2 GB | 4.8 GB | 4.5 | Near-identical
q4_K_M | 4.5 | 2.0 GB | 4.0 GB | 6.0 | Slight degradation
q3_K_M | 3.5 | 1.6 GB | 3.2 GB | 7.5 | Noticeable degradation
q2_K | 2.5 | 1.2 GB | 2.4 GB | 8.5 | Often unusable

For Pi 5 work, q4_K_M is the universal sweet spot.
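As a rough cross-check of the table, resident size is approximately parameters × bits-per-weight ÷ 8, plus runtime overhead. A back-of-envelope sketch (the 1.2 overhead factor is an assumption; real GGUF files vary by quant mix):

```bash
# Back-of-envelope resident-RAM estimate from the bits/weight column above.
# The 1.2 overhead factor is a rough assumption, not a measured constant.
estimate_gb() {
  local params_b=$1   # parameters, in billions
  local bits=$2       # effective bits per weight for the quant
  echo "scale=2; $params_b * $bits / 8 * 1.2" | bc
}
estimate_gb 3 4.5   # Llama 3.2 3B at q4_K_M -> ~2.0
estimate_gb 8 4.5   # Llama 3.1 8B at q4_K_M -> ~5.4
```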

llama.cpp on ARM: current build flags, NEON, and SVE

Build llama.cpp from source rather than using a packaged binary; the Pi 5's Cortex-A76 benefits from explicit NEON tuning that distro packages often miss. The canonical 2026 build invocation is cmake -B build -DGGML_NATIVE=ON -DGGML_CPU_AARCH64=ON -DCMAKE_BUILD_TYPE=Release, followed by cmake --build build --config Release -j 4 (see the sketch below). NEON is the SIMD path that matters; the llama.cpp ARM CPU kernels have been progressively rewritten around NEON intrinsics, and the speedup over a generic build is 1.5-2x. SVE (Scalable Vector Extension) is supported in llama.cpp, but the Pi 5's A76 cores do not implement SVE, so the toggle is irrelevant on this hardware. Set the thread count to 4 (one per core); switching the cpufreq governor to performance avoids clock ramp-up latency on short prompts. Do not run with -t 8: the A76 has no SMT, so oversubscribing the four cores only adds scheduler overhead.
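A minimal end-to-end sketch following those flags (the model path and prompt are placeholders; substitute whatever GGUF you've downloaded):

```bash
# Build llama.cpp from source with the NEON-tuned CPU path, then run a quick test.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_NATIVE=ON -DGGML_CPU_AARCH64=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j 4

# One thread per A76 core; -c sets context, -n caps generation so the test finishes quickly.
./build/bin/llama-cli -m models/llama-3.2-3b-instruct-q4_k_m.gguf \
  -t 4 -c 4096 -n 128 -p "Explain GPIO in one paragraph."
```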

Ollama on Pi 5: install and pitfalls

For the user who wants Ollama on the Pi 5 up in five minutes, curl -fsSL https://ollama.com/install.sh | sh followed by ollama pull llama3.2:3b is the path. Ollama wraps llama.cpp and adds a friendly REST API, model pulling, and chat templating. The trade-off is roughly 10% overhead vs raw llama.cpp built from source, plus an extra ~150 MB of resident memory for the Ollama daemon. The pitfalls: (1) the default model store under /usr/share/ollama may live on the SD card; point it at USB SSD storage (via the OLLAMA_MODELS environment variable, or a symlink) to avoid card wear, (2) the default context size is 2K; bump it to 4K-8K (num_ctx via a Modelfile or the API options) only if you actually need it, because the RAM cost is non-trivial, (3) ollama serve binds to localhost only by default; for LAN access set OLLAMA_HOST=0.0.0.0:11434 and front it with a reverse proxy that handles auth.
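A minimal setup sketch covering those pitfalls, assuming a USB SSD mounted at /mnt/ssd (the mount point and the exact service-override step will vary with your install):

```bash
# Install Ollama (on Raspberry Pi OS this creates an 'ollama' user and a systemd service).
curl -fsSL https://ollama.com/install.sh | sh

# Keep model blobs off the SD card and (optionally) expose the API on the LAN.
# OLLAMA_MODELS and OLLAMA_HOST are standard Ollama environment variables.
sudo mkdir -p /mnt/ssd/ollama-models
sudo chown -R ollama:ollama /mnt/ssd/ollama-models
sudo systemctl edit ollama     # add under [Service]:
                               #   Environment="OLLAMA_MODELS=/mnt/ssd/ollama-models"
                               #   Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama

# Pull a model (now stored on the SSD) and sanity-check the REST API.
ollama pull llama3.2:3b
curl http://localhost:11434/api/generate -d '{"model":"llama3.2:3b","prompt":"Say hi","stream":false}'
```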

Prefill vs generation: where the bottleneck lives

LLM inference splits into two phases. Prefill processes the prompt through the model in one batch and is compute-bound. Generation produces tokens one at a time and is memory-bandwidth bound. On a Pi 5, prefill of a 1K-token prompt for a 3B model takes 8-12 seconds; generation runs at the 5-7 tok/s steady-state. For interactive chat with short prompts, prefill is invisible. For RAG pipelines or document-summarization workloads with long prompts, prefill dominates wall time. Cache prefilled prompts when possible (llama.cpp's --prompt-cache) and keep system prompts short.
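A sketch of that caching pattern, assuming the cached text is an exact prefix of every later prompt (file names and model path are placeholders):

```bash
# Pay the prefill cost for the system prompt once, then reuse the saved KV state.
SYS="$(cat system_prompt.txt)"

# Prime the cache: evaluates the system prompt and writes sys.cache.
./build/bin/llama-cli -m models/llama-3.2-3b-instruct-q4_k_m.gguf -t 4 \
  --prompt-cache sys.cache -p "$SYS" -n 1

# Subsequent calls reload the cached prefix; only the new user text is prefilled.
./build/bin/llama-cli -m models/llama-3.2-3b-instruct-q4_k_m.gguf -t 4 \
  --prompt-cache sys.cache -p "$SYS User: summarize the attached notes in three bullets." -n 256
```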

Context-length impact: 2K vs 8K vs 32K

KV cache scales linearly with context length and is the second largest RAM consumer after model weights. For Llama 3.2 3B q4_K_M: 2K context costs ~150 MB of KV, 8K context costs ~600 MB, 32K context costs ~2.4 GB. On an 8GB Pi 5 with the model resident, 8K is comfortable, 16K is tight, 32K is impossible. For longer-context use cases (large doc summarization), use a smaller model (Qwen 2.5 1.5B fits 32K context comfortably) or chunk the input.
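As a rule of thumb (assuming an unquantized fp16 KV cache; the per-model figures above also depend on layer count and the grouped-query-attention configuration):

KV cache bytes ≈ 2 (K and V) × n_layers × n_kv_heads × d_head × n_ctx × 2 bytes per fp16 element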

Cooling and power: does throttling matter?

Yes. The Pi 5 ships with no cooler and the official Active Cooler ($5) is the difference between sustained 2.4 GHz operation and throttling to 1.5 GHz within 90 seconds of inference start. With the Active Cooler, expect the SoC to sit at 60-70°C under continuous llama.cpp load and never throttle. Power draw under load is ~8-9 W; the official 27W USB-C PD power supply has plenty of headroom. The case-with-passive-heatsink option works for short bursts only. If you're running an always-on inference service, get the Active Cooler.
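A quick way to watch for throttling during a long generation run (vcgencmd ships with Raspberry Pi OS):

```bash
# Temperature, ARM clock, and throttle flags, refreshed every 2 seconds.
# 'throttled=0x0' means no undervoltage or thermal throttling since boot;
# with the Active Cooler the arm clock should hold near 2.4 GHz under load.
watch -n 2 'vcgencmd measure_temp; vcgencmd measure_clock arm; vcgencmd get_throttled'
```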

Does adding an AI HAT (Hailo-8L) help?

Mostly no, with one exception. The Hailo-8L AI HAT is a 13 TOPS NPU optimized for fixed-function neural network inference (object detection, pose estimation, image classification). It does NOT accelerate transformer-style LLM decoding because the Hailo SDK doesn't yet expose the right primitives (attention, RoPE, quantized matmul) and the model conversion pipeline targets vision workloads. If you want hardware acceleration for LLMs on a Pi-form-factor board, the right move is to step up to a Jetson Orin Nano Super (CUDA + cuDNN + llama.cpp CUDA backend) or to wait for the rumored Pi 5 NPU HAT that targets GenAI specifically.

Spec table: Pi 5 8GB vs Jetson Orin Nano vs Mini PC

Platform | Compute | RAM | Power | Llama 3.2 3B q4 tok/s | Price
Raspberry Pi 5 8GB | 4-core A76 @ 2.4 GHz | 8 GB | 8 W | 6.0 | $90
Jetson Orin Nano Super 8GB | 1024 CUDA cores + 32 Tensor cores | 8 GB | 15 W | 35-45 | $250
Mini PC (N100, 16GB) | 4-core Alder Lake-N | 16 GB | 15 W | 12-18 | $180
Used M1 Mac mini 16GB | 8-core M1 + Metal | 16 GB | 20 W | 50-65 | $300

Benchmark table: tok/s on Pi 5 8GB

Model | Quant | Resident RAM | Tok/s
Llama 3.2 1B | q4_K_M | 0.7 GB | 14.0
Llama 3.2 3B | q4_K_M | 2.0 GB | 6.0
Phi-3 Mini 3.8B | q4_K_M | 2.4 GB | 4.5
Qwen 2.5 1.5B | q4_K_M | 1.0 GB | 10.0
Llama 3.1 8B | q4_K_M | 5.0 GB | 1.4

Bottom line: perf per dollar and perf per watt

For tok/s per dollar, the Pi 5 8GB ($90, 6 tok/s on 3B) sits at 0.067 tok/s/$. The Jetson Orin Nano Super ($250, 40 tok/s) sits at 0.16 tok/s/$. The N100 mini PC ($180, 15 tok/s) sits at 0.083 tok/s/$. The Pi loses on raw value. For tok/s per watt, the Pi 5 (8 W, 6 tok/s) at 0.75 tok/s/W is competitive. The Pi wins exclusively when you need a $90 always-on box that draws under 10 W. For everything else, a Mac mini or Jetson is the right buy.

FAQ

What's the biggest model that fits on an 8GB Pi? Llama 3.1 8B at q4 fits but runs at ~1.4 tok/s. Practical ceiling is 3-4B at q4.

Is the Pi 5 4GB usable? For 1B-1.5B models, yes. For anything above, no.

Should I use Ollama or raw llama.cpp? Ollama for ease, llama.cpp for maximum tok/s and minimum RAM overhead.

Does the AI HAT help LLM inference? No, not in 2026. It's for vision/classification.

Can a Pi 5 do RAG? Yes, with a small embedding model (BGE-small) + a vector DB like sqlite-vec, and a 1-3B chat model. It's slow but functional.

Citations and sources

  • llama.cpp GitHub README and ARM build documentation
  • Ollama Install Documentation, 2026 update
  • Raspberry Pi Foundation Pi 5 BCM2712 SoC datasheet
  • r/LocalLLaMA Pi 5 benchmark megathread (2025-2026)
  • Jeff Geerling YouTube: Pi 5 LLM Inference Test Series

— SpecPicks Editorial · Last verified 2026-05-07