As an Amazon Associate, SpecPicks earns from qualifying purchases. See our review methodology.
Local AI on Raspberry Pi 5: Real Benchmarks for Llama, Phi, and Gemma (2026)
By SpecPicks Editorial · Published April 21, 2026 · Last verified April 21, 2026 · 14 min read
The honest answer to "can a Raspberry Pi 5 run local LLMs?" is "yes, but only small ones, and only if you're realistic about what that means." A Pi 5 8GB runs Llama 3.2 1B quantized to Q4 at roughly 6–8 tokens per second. That's fast enough for a slow chat interface, fine for batch summarization, and far too slow for anything agentic. What it isn't is a "Pi AI server." Vendors market boards like the Jetson Nano and Orange Pi 5 Plus as "AI-capable," which makes it sound as if the Pi 5 is only slightly behind. In practice the Pi 5 is slower than those boards because it has no NPU or compute-capable GPU and relies entirely on its four Cortex-A76 CPU cores, and you should know that going in.
This tutorial covers what's actually practical: installing Ollama, picking a model, understanding quantization and context length trade-offs, getting reliable numbers, and deciding whether the Pi 5 is the right tool for the job you have in mind or whether you should just buy a used RTX 3060.
Key takeaways
- Practical models: Llama 3.2 1B, Phi-3.5 mini (3.8B params), Gemma 2 2B. All quantized to Q4_K_M.
- Performance ceiling: 6–8 tok/s for 1B models, 3–5 tok/s for 3B models. 7B is slow enough (~1.5 tok/s) to not be useful.
- Ollama install: One-line curl script on Raspberry Pi OS Bookworm 64-bit.
- Memory budget: 8GB RAM means you can load a ~5GB quantized model and still have headroom for context; 16GB is only worth it if you need longer contexts.
- Not practical: Multi-turn agent loops, long-context document processing, anything needing more than ~2,000 tokens of generation per response.
- When to move on: If your use case needs >10 tok/s or models >3B params, stop fighting and get a GPU.
What you need
- Raspberry Pi 5 8GB
- Official 27W USB-C PSU (5.1V/5A) — non-negotiable for sustained CPU load
- Active Cooler or equivalent — the chip will throttle without it
- 64GB A2 microSD or an M.2 HAT+ with 128GB+ NVMe (strongly preferred for model I/O)
- Raspberry Pi OS Bookworm 64-bit (aarch64), fully updated
- Network connection (models are 1–5 GB downloads)
Why an NVMe, not just an SD card? Ollama loads the entire model into RAM at first invocation. A 2.4 GB model on an SD card takes 22 seconds to first token. The same model on NVMe loads in 4 seconds. Once loaded it doesn't matter, but for ad-hoc queries the NVMe feels dramatically better.
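As a back-of-envelope check, load time is roughly model size divided by sequential read speed. The throughput figures in this sketch are assumptions for a typical A2 microSD and an NVMe drive on the Pi 5's PCIe 2.0 x1 link, not measurements:

```shell
# Rough load-time estimate: model size / sequential read throughput.
# Throughput values are assumptions, not measurements.
model_mb=2400    # Phi-3.5 mini Q4_K_M, ~2.4 GB
sd_mbps=90       # assumed A2 microSD sequential read, MB/s
nvme_mbps=700    # assumed NVMe read over the Pi 5's PCIe 2.0 x1 link, MB/s
echo "SD:   ~$((model_mb / sd_mbps)) s"
echo "NVMe: ~$((model_mb / nvme_mbps)) s"
```

Those estimates land close to the measured 22 s and 4 s above.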
Install Ollama on Raspberry Pi OS
Ubuntu 24.04 ARM and Raspberry Pi OS Bookworm 64-bit are both supported. The install is one line:
curl -fsSL https://ollama.com/install.sh | sh
The script detects aarch64 and pulls the correct binary. It installs a systemd service (ollama.service) that starts on boot and listens on 127.0.0.1:11434.
Verify:
ollama --version
systemctl status ollama
curl http://localhost:11434
# → "Ollama is running"
If you want to access Ollama from another machine on your LAN, edit /etc/systemd/system/ollama.service and add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Then sudo systemctl daemon-reload && sudo systemctl restart ollama. Be aware this exposes the API with no auth; use a reverse proxy (Caddy or Nginx) if the Pi is on an untrusted network.
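If you do expose it, a minimal Caddyfile along these lines puts basic auth in front of the API. The hostname and hash are placeholders (generate a real hash with caddy hash-password), and the basic_auth directive name assumes Caddy v2.8+; older releases spell it basicauth:

```
# Sketch only: hostname and bcrypt hash are placeholders.
http://pi.lan {
	basic_auth {
		admin $2a$14$REPLACE_WITH_REAL_HASH
	}
	reverse_proxy 127.0.0.1:11434
}
```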
llama.cpp as an alternative
If you want lower overhead than Ollama, install llama.cpp directly:
sudo apt install -y build-essential cmake git libopenblas-dev
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j$(nproc)
llama.cpp on the Pi 5 is about 8–12% faster than Ollama (which wraps llama.cpp), mostly because it skips the HTTP and JSON serialization overhead on every token. For interactive use the difference doesn't matter; for batch processing it does.
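Once built, llama.cpp's bundled llama-bench tool gives cleaner numbers than timing runs by hand; it reports prompt-processing and text-generation tok/s separately. The GGUF path below is a placeholder for wherever you keep your models:

```shell
# Benchmark prompt processing (pp) and token generation (tg) on 4 threads.
# The model path is a placeholder.
./build/bin/llama-bench -m ~/models/llama-3.2-1b-instruct-q4_k_m.gguf -t 4
```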
Which models actually work
The constraint is RAM. The Pi 5 8GB has ~7 GB usable after the OS, GPU reservation, and file cache. A model needs room for its weights plus its KV cache (context memory). Practical guidelines:
| Model | Parameters | Q4 size | Usable? | Why |
|---|---|---|---|---|
| Llama 3.2 1B | 1.2B | 0.8 GB | ✅ Great | Fits with huge context headroom |
| Gemma 2 2B | 2.6B | 1.6 GB | ✅ Good | Loads fast, reasonable tok/s |
| Phi-3.5 mini | 3.8B | 2.4 GB | ✅ OK | Highest-quality small model on the Pi |
| Llama 3.1 8B | 8.0B | 4.7 GB | ⚠️ Slow | Works but 1.5 tok/s is frustrating |
| Mistral 7B | 7.2B | 4.4 GB | ⚠️ Slow | Same story |
| Llama 3.1 70B | 70B | 42 GB | ❌ No | Out of RAM |
| DeepSeek-R1 14B | 14B | 8.5 GB | ❌ No | Exceeds the ~7 GB usable |
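A quick way to sanity-check any row above: weights plus KV cache plus about 1 GB of headroom must stay under the ~7 GB usable. A sketch with Phi-3.5 mini's numbers (the KV and headroom figures are rough assumptions):

```shell
# Fit-check sketch: weights + KV cache + headroom vs usable RAM (all GB).
# KV (8K context) and headroom values are rough assumptions.
awk -v weights=2.4 -v kv=1.6 -v headroom=1.0 -v usable=7.0 'BEGIN {
	need = weights + kv + headroom
	printf "need %.1f GB of %.1f GB usable: %s\n", need, usable,
	       (need <= usable) ? "fits" : "too big"
}'
```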
If you want quality and can tolerate slower generation, Phi-3.5 mini is the right pick — it's comparable to GPT-3.5 on reasoning benchmarks in ~3.8B parameters. If you want speed and don't need much reasoning, Llama 3.2 1B is the choice. If you want a middle ground, Gemma 2 2B.
Download with:
ollama pull llama3.2:1b
ollama pull phi3.5:3.8b
ollama pull gemma2:2b
Each pull is 0.8–2.5 GB. Budget storage accordingly.
Real benchmarks
Our numbers come from a Pi 5 8GB with the Active Cooler, booting from NVMe, with the SoC holding around 64°C during sustained inference, measured with ollama run <model> --verbose and averaged over 10 prompts. Tokens/sec is generation speed, not including prefill.
| Model | Quant | Model size | Prefill (tok/s) | Generation (tok/s) | First-token latency |
|---|---|---|---|---|---|
| Llama 3.2 1B | Q4_K_M | 0.8 GB | 32 | 8.1 | 1.2 s |
| Llama 3.2 1B | Q8_0 | 1.3 GB | 28 | 6.3 | 1.4 s |
| Gemma 2 2B | Q4_K_M | 1.6 GB | 21 | 5.9 | 1.8 s |
| Gemma 2 2B | Q8_0 | 2.8 GB | 18 | 4.2 | 2.1 s |
| Phi-3.5 mini | Q4_K_M | 2.4 GB | 17 | 4.6 | 2.0 s |
| Phi-3.5 mini | Q5_K_M | 2.8 GB | 15 | 3.9 | 2.2 s |
| Llama 3.1 8B | Q4_K_M | 4.7 GB | 6 | 1.5 | 6.1 s |
| Mistral 7B v0.3 | Q4_K_M | 4.4 GB | 7 | 1.7 | 5.4 s |
Key observations:
Quantization matters more than you think. Going from Q4_K_M to Q8_0 on the same model typically costs 20–30% in tok/s on the Pi: inference here is memory-bandwidth-bound, and the larger weights mean more bytes streamed from RAM per token. Q4_K_M is the sweet spot for Pi 5 deployment. The quality delta vs Q8 is in the 2–5% range on standard benchmarks, which is invisible for most use cases.
Prefill dominates first-token latency. For a 500-token prompt, prefill at 20 tok/s takes 25 seconds. The "latency" you feel as a user is prefill + queue + first token. For chatbot-style short prompts (20–50 tokens) this is manageable. For RAG or document-Q&A workloads with thousands of tokens of context, the Pi 5 is not the right tool.
Context length is a hard wall. Default 2K context fits comfortably. 8K context on a 3B model eats another ~1.5 GB of RAM. 32K context on anything larger than 2B runs out of memory. Plan for 2K–4K context max.
Active cooling is required. Without the Active Cooler, all four cores throttle within 2–3 minutes of sustained inference, and tok/s drops by 35–45% relative to the cooled numbers above. The difference between "Pi 5 as an AI server" and "Pi 5 as an AI toy" is the $5 Active Cooler.
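The prefill arithmetic above is worth making concrete. Using the Phi-3.5 mini Q4_K_M rates from the benchmark table (17 tok/s prefill, 4.6 tok/s generation), a 500-token prompt and 200-token answer works out as:

```shell
# Perceived-latency sketch: prefill time + generation time,
# using the Phi-3.5 mini Q4_K_M rates from the benchmark table.
awk -v prompt_toks=500 -v prefill_rate=17 -v out_toks=200 -v gen_rate=4.6 'BEGIN {
	pre = prompt_toks / prefill_rate
	gen = out_toks / gen_rate
	printf "prefill ~%.0f s + generation ~%.0f s = ~%.0f s wall clock\n", pre, gen, pre + gen
}'
```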
How context length changes the picture
More context = more memory per token generated. llama.cpp uses KV cache sized as:
KV cache bytes = n_layers × n_kv_heads × head_dim × 2 × context_length × bytes_per_element
where n_kv_heads is the key/value head count; on grouped-query-attention models such as Llama 3.2 it is smaller than the attention head count.
For Llama 3.2 1B that works out to about 0.5 GB at 8K context, 2 GB at 32K. Well within the Pi's budget for the 1B model. For Phi-3.5 mini at 8K context you're looking at ~1.6 GB of KV cache on top of the 2.4 GB model — still fits.
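The formula can be sketched in shell for Llama 3.2 1B. The shape values below are taken from the model's published config (16 layers, 8 KV heads, head dim 64) and assume an fp16 cache; actual memory use lands above this raw figure once llama.cpp's compute buffers are added:

```shell
# Raw KV-cache size for Llama 3.2 1B at 8K context (fp16 cache assumed;
# shape values taken from the published model config).
n_layers=16; n_kv_heads=8; head_dim=64; bytes_per_elem=2; ctx=8192
kv=$((n_layers * n_kv_heads * head_dim * 2 * ctx * bytes_per_elem))  # x2 for K and V
echo "$((kv / 1024 / 1024)) MiB raw cache at ${ctx}-token context"
```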
Practical context limits on a Pi 5 8GB:
| Model | Max practical context | Notes |
|---|---|---|
| Llama 3.2 1B Q4 | 32K | Fits comfortably |
| Gemma 2 2B Q4 | 8K (native max) | Model's native ceiling |
| Phi-3.5 mini Q4 | 16K | Reduced from its 128K native to fit RAM |
| Llama 3.1 8B Q4 | 4K | Above this, OOM |
Running Phi-3.5 mini for practical work
Here's a full-worked example for what we think is the best-quality-practical pairing: Phi-3.5 mini Q4_K_M for summarization and light Q&A.
ollama pull phi3.5:3.8b
ollama run phi3.5:3.8b
At the prompt, paste a 500-word article and ask for a summary. Expect:
- First token: ~40 seconds (a 500-word article is roughly 650 tokens, and prefill at ~17 tok/s dominates)
- Generation: ~4.6 tok/s, so a 200-token summary takes ~43 seconds
- Total wall-clock: ~80 seconds for a solid summary
That's slow enough to be annoying for interactive chat, fast enough to be usable for background jobs. Our own use: a nightly cron job that runs new RSS items through Phi-3.5 into a morning digest. The Pi churns through them overnight; by 7 AM the digest is ready.
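A stripped-down sketch of that digest idea, calling Ollama's /api/generate endpoint on localhost. The paths, the fetched-item file, and the cron schedule are illustrative, not our actual setup:

```shell
# Hypothetical core of a digest script; paths and schedule are illustrative.
# Cron entry (crontab -e):  0 2 * * * /home/pi/bin/digest.sh
# jq (sudo apt install -y jq) handles JSON escaping of the article text.
article="$(cat /tmp/item.txt)"   # one fetched RSS item, produced by an earlier step
jq -n --arg p "Summarize in three bullet points: $article" \
      '{model: "phi3.5:3.8b", prompt: $p, stream: false}' |
  curl -s http://localhost:11434/api/generate -d @- >> /home/pi/digest.jsonl
```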
Troubleshooting
Out of memory on model load: Check free -h while the model loads — if Available drops near zero, you need a smaller quant or a smaller model. Q4_K_M is almost always the right answer on the Pi 5.
Very slow first token, then normal generation: This is prefill, not a bug. The model runs a forward pass over every prompt token before it can generate anything. If prefill exceeds 30 seconds, the prompt is too long for the Pi 5's prefill throughput; truncate it or switch to a smaller model.
Model quality feels worse than on a desktop: Q4_K_M does lose 2–5% quality vs Q8 or FP16. If that matters, use Q5_K_M (20% slower, 2% quality gain) or run a larger model (slower still).
CUDA not detected: There is no CUDA on a Pi 5. The VideoCore VII GPU is not a compute accelerator. All inference is on CPU. This is expected.
Throttling: Run vcgencmd get_throttled while inference is running. Anything other than throttled=0x0 means either cooling is inadequate or the PSU is undersized. Fix the underlying problem; don't try to work around it in software.
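The hex value packs status bits; per Raspberry Pi's documentation, bit 0 is under-voltage, bit 2 is active throttling, and bits 16 and 18 are the "has occurred since boot" versions. A small decoder (the example value is made up):

```shell
# Decode a vcgencmd get_throttled value (bit meanings from Raspberry Pi docs).
val=0x50005   # example value; substitute the throttled=... output from vcgencmd
v=$((val))
[ $((v & 0x1))     -ne 0 ] && echo "under-voltage now"
[ $((v & 0x4))     -ne 0 ] && echo "throttled now"
[ $((v & 0x10000)) -ne 0 ] && echo "under-voltage has occurred"
[ $((v & 0x40000)) -ne 0 ] && echo "throttling has occurred"
```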
When to move on to a real GPU
The Pi 5 is the right tool for: ambient, always-on, small-model inference where latency doesn't matter much and power draw does. Edge classification. Home-automation voice command parsing. RSS summarization. Static text analysis. Always-on "smart sensor" style work.
The Pi 5 is the wrong tool for: anything interactive over 7B parameters, agentic workflows with many sequential model calls, RAG over large document sets, or any use case where a user is waiting on tokens in real time.
For those workloads, a used RTX 3060 12GB on a mini-ITX build runs circles around the Pi 5 — Llama 3.1 8B at Q4 at 40+ tok/s, versus the Pi's 1.5 tok/s. That's a 26x difference. For LLM work, the Pi's price advantage evaporates the moment you value your own waiting time at more than zero. See our LattePanda Sigma review for an SBC-form-factor alternative with 4x the Pi's LLM throughput (though still no GPU), or our Orange Pi 5 Plus vs Pi 5 comparison for an NPU-backed option.
Frequently asked questions
Can a Raspberry Pi 5 run Llama 3 70B? No. Llama 3 70B at Q4 needs ~42 GB of RAM. The Pi 5 16GB tops out at 16 GB. Even if it fit, generation would be measured in seconds per token, not tokens per second. Anything above 8B parameters is impractical on the Pi 5.
Is the Pi 5 16GB worth it for local AI? Only marginally. The 8GB model handles every practical Pi 5 LLM workload. The extra 8 GB lets you run 8B models with longer contexts — but 8B on a Pi 5 is 1.5 tok/s regardless. If you're running 8B models and need longer contexts, you've probably outgrown the Pi. If you're running 1B–3B models (the sensible choice), 8 GB is plenty.
Which is faster on LLMs, a Pi 5 or an Orange Pi 5 Plus? The Orange Pi 5 Plus is about 2x faster on LLM inference: the extra four A55 cores help at higher thread counts, and its 6 TOPS NPU can accelerate specific model architectures via rknn-llm (though the toolchain is far thinner than Ollama's). See our Orange Pi 5 Plus vs Pi 5 comparison for full numbers.
Should I use Ollama or llama.cpp directly? Ollama for everything unless you need specific llama.cpp features. Ollama handles model downloads, systemd integration, API server, and concurrent request management. It's ~10% slower than raw llama.cpp on the Pi 5 but the convenience is worth it for 95% of users. Use llama.cpp directly if you want to swap runtimes (e.g., rknn-llm on an Orange Pi) or if you're embedding the runtime in another application.
Can the Pi 5's VideoCore VII GPU accelerate inference? Not meaningfully. The VideoCore VII is a graphics GPU, not a compute one — no CUDA, no OpenCL-for-ML-grade support, no mature ML libraries. All modern LLM runtimes on the Pi 5 are CPU-only. Theoretical Vulkan-compute paths exist but are slower than the CPU in our testing. If the Pi 5 gets an ML acceleration story, it'll come from a future SoC, not the current GPU.
Sources
- Ollama Official Documentation — install, API, and model reference.
- llama.cpp GitHub — ARM performance discussion — community-sourced ARM tok/s benchmarks.
- Jeff Geerling — Local AI on the Pi 5 — cross-model benchmarks and thermal analysis.
- r/LocalLLaMA — Pi 5 benchmark threads — reproducible community numbers.
- Phi-3.5 Technical Report (Microsoft) — model architecture reference used for KV cache math.
Related guides
- Raspberry Pi 5 8GB Review (2026)
- Orange Pi 5 Plus vs Raspberry Pi 5
- Best Raspberry Pi Alternatives in 2026
- LattePanda Sigma Review (2026)
— SpecPicks Editorial · Last verified April 21, 2026
