A Raspberry Pi 4 8GB or Pi 5 8GB can comfortably run language models under about 4B parameters at usable speeds in 2026 (think Phi-3.5 Mini, Qwen 2.5 1.5B, and Llama 3.2 3B), plus most small vision and speech models. Models at 7B and above become painful at 1.5–4 tok/s. Vision and audio workloads (object detection, wake-word spotting, transcription) are where the Pi actually shines.
Why the Raspberry Pi keeps showing up in "local AI" conversations
The Raspberry Pi Foundation has shipped two products since late 2023 that meaningfully expanded what "local AI on a Pi" means in practical terms: the Pi 5, which roughly doubled CPU performance and quadrupled memory bandwidth versus the Pi 4, and the Pi AI Kit, a Hailo-8L M.2 accelerator that adds 13 TOPS of neural compute over the Pi 5's PCIe interface. The Pi 4 8GB remains in production and has stayed cheap, and it is still the most common SBC that newcomers reach for when they hear "run an LLM at home."
In 2026 the answer to "what can I run on a Pi?" depends almost entirely on three things: how much DRAM the board has, what its memory bandwidth is, and whether a coprocessor (Hailo, Coral, or the Pi 5's own VideoCore VII) is doing the heavy lifting. CPU clock speed barely matters: even the Pi 5's quad Cortex-A76 at 2.4 GHz is bandwidth-bound on transformer generation, not compute-bound.
This guide is the honest, no-hype version of what actually runs on a Pi 4 8GB and a Pi 5 8GB in 2026, where the bottlenecks sit, and where you should just pay for a Jetson or a Mac mini instead.
Key takeaways
- A Pi 4 8GB will run language models up to roughly 3–4B parameters at q4_K_M at usable speed (5–8 tok/s). A Pi 5 8GB does the same job at roughly 2× the throughput.
- For object detection, wake-word, and transcription workloads the Pi is genuinely excellent — these models are small (<200 MB) and most run faster than real time.
- The Pi AI Kit (Hailo-8L) drops object-detection latency by 8–15× compared to CPU-only on the Pi 5 and is the single biggest accelerator-per-dollar upgrade in the category.
- Anything 7B and up is masochism on a Pi. A used Jetson Orin Nano 8GB or an Apple Silicon Mac mini delivers a 5–10× better experience for the same money.
- Storage is the most-skipped variable: USB 3.0 SSDs trump SD cards for model loading. Plan on a 1 TB SATA SSD over USB 3.0 if you want to swap models without waiting 30 seconds.
Hardware reality: what each Pi actually delivers in 2026
| Board | CPU | Memory BW | Max RAM | TOPS (CPU only) | Notes |
|---|---|---|---|---|---|
| Pi Zero 2 W | 4× A53 @ 1 GHz | 1.6 GB/s | 512 MB | <0.05 | Vision only |
| Pi 4 8GB | 4× A72 @ 1.5 GHz | 4.0 GB/s | 8 GB | 0.20 | Workhorse for sub-3B LLMs |
| Pi 5 8GB | 4× A76 @ 2.4 GHz | 17 GB/s | 8 GB | 0.40 | 2× LLM throughput vs Pi 4 |
| Pi 5 + AI Kit | + Hailo-8L | + dedicated | 8 GB | + 13 TOPS INT8 | Object detection +8–15× |
The headline number for LLM work is memory bandwidth. The Pi 4's 4 GB/s LPDDR4 has not aged well: at q4_K_M a 3B model has to stream 1.6 GB of weights through that 4 GB/s pipe for every generated token, which puts a hard ceiling around 2.5 tok/s before any other inefficiency. The Pi 5 quadruples bandwidth to 17 GB/s, which is the entire reason a 3B model goes from "barely usable" to "actually usable" between generations.
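The ceiling described above is just bandwidth divided by the bytes read per token. A minimal sketch of that arithmetic, using the 4 GB/s and 17 GB/s figures from the table and the 1.6 GB weight size quoted above. The function name is illustrative, and the assumption that every weight byte is read exactly once per token is a simplification.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if every weight byte is read once per token."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Figures taken from this article; both are approximations.
WEIGHTS_3B_Q4_GB = 1.6
BOARDS_GB_S = {"Pi 4 8GB": 4.0, "Pi 5 8GB": 17.0}

for board, bandwidth in BOARDS_GB_S.items():
    ceiling = bandwidth_ceiling_tok_s(bandwidth, WEIGHTS_3B_Q4_GB)
    print(f"{board}: at most ~{ceiling:.1f} tok/s for a 3B q4_K_M model")
```

Treat the result as a rough bound rather than a prediction: the bandwidth and weight-size inputs are approximate, so reported throughput will not line up with it exactly.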
Jeff Geerling's benchmark data on the Pi 5 versus the Pi 4 is the canonical reference for these throughput numbers; community benchmarks on r/LocalLLaMA have backed them up across a dozen model sizes and quants.
What language models actually run on a Pi 4 8GB?
Numbers below are best-case at q4_K_M on llama.cpp with 4-thread inference on a Pi 4 8GB Model B with a heatsink and active cooling. The Pi 4 throttles aggressively without cooling; expect 30–40% lower numbers on a bare board.
| Model | Params | Quant | RAM used | tok/s | Verdict |
|---|---|---|---|---|---|
| TinyLlama 1.1B | 1.1B | q4_K_M | 0.8 GB | 12–15 | Fine for toy chat |
| Phi-3.5 Mini | 3.8B | q4_K_M | 2.4 GB | 5–7 | Best practical chat |
| Llama 3.2 3B Instruct | 3.0B | q4_K_M | 2.0 GB | 6–8 | Most reliable answers |
| Qwen 2.5 1.5B Instruct | 1.5B | q4_K_M | 1.0 GB | 9–12 | Strong tiny-model option |
| Gemma 2 2B IT | 2.6B | q4_K_M | 1.8 GB | 7–9 | Solid for short prompts |
| Llama 3.1 8B Instruct | 8.0B | q4_K_M | 4.7 GB | 1.5–2.0 | Painful |
| Mistral 7B v0.3 | 7.0B | q4_K_M | 4.1 GB | 1.5–2.2 | Painful |
The pattern: anything under about 4B at q4 is useful on a Pi 4, since 5–8 tok/s is about as fast as most people read. Anything 7B and above runs, but produces tokens too slowly to hold your attention.
A second pattern worth noting: short prompts dominate. The Pi 4 takes 2–4 seconds of prefill on a 1K-token prompt for a 3B model; bump to 4K context and prefill alone can take 12–18 seconds. For agentic workflows or RAG, that prefill cost is the actual user-facing bottleneck, not the per-token generation speed.
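One way to see why prefill dominates is to add it to the generation time for a short reply. The sketch below uses midpoints of the Pi 4 ranges quoted above; the 100-token reply length and the function name are assumptions chosen for illustration.

```python
def response_seconds(prefill_s: float, reply_tokens: int, tok_per_s: float) -> float:
    """Total wall-clock time: prompt prefill plus token-by-token generation."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return prefill_s + reply_tokens / tok_per_s


# Midpoints of the Pi 4 ranges quoted above for a 3B model.
GEN_TOK_S = 7.0  # 6-8 tok/s generation
PREFILL_S = {"1K-token prompt": 3.0, "4K-token prompt": 15.0}
REPLY_TOKENS = 100  # hypothetical reply length

for label, prefill in PREFILL_S.items():
    total = response_seconds(prefill, REPLY_TOKENS, GEN_TOK_S)
    share = prefill / total
    print(f"{label}: {total:.0f} s total, {share:.0%} spent in prefill")
```

The shorter the reply, the larger the share of the wait that is prefill, which is why RAG and agent loops that resend long contexts feel slow even when generation speed looks acceptable.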
What runs on a Pi 5 8GB
The Pi 5's 17 GB/s memory bandwidth roughly doubles every LLM number from the Pi 4: same models, same q4_K_M quants, same llama.cpp build, just faster.
| Model | tok/s on Pi 5 8GB |
|---|---|
| Llama 3.2 3B Instruct | 12–16 |
| Phi-3.5 Mini 3.8B | 10–14 |
| Qwen 2.5 1.5B Instruct | 18–22 |
| Gemma 2 2B IT | 14–18 |
| Llama 3.1 8B Instruct | 3–4 |
| Mistral 7B v0.3 | 3–4 |
The Pi 5 makes 3B-class chat genuinely real-time, and 8B-class chat technically possible. It still does not turn the Pi into a serious LLM host — that takes a discrete GPU or a dedicated AI accelerator — but it does open the door to a useful on-device assistant for things like home automation NLU, simple code completion, or a local "knock-on-it" summarizer.
The biggest practical upgrade if you have a Pi 5 is cooling. The board boosts to 2.4 GHz only when thermal headroom allows, and aggressive cooling like the official Active Cooler or a small case fan keeps sustained inference workloads from throttling.
Where the Pi actually shines: vision and audio
Language models stress the part of the Pi (memory bandwidth) that's weakest. Vision and audio models stress compute, which the Pi is surprisingly good at. Concrete examples that run real-time or faster on a Pi 5 8GB:
- YOLOv8n object detection at 640×640: 15–20 FPS on CPU, 60+ FPS with the Hailo-8L AI Kit. Per the Hailo developer documentation, the Pi 5 + AI Kit hits 90+ FPS on YOLOv8s and still maintains 30+ FPS on the heavier YOLOv8m model.
- Wake-word detection (openWakeWord, Picovoice Porcupine): single-digit-millisecond latency, runs on a single core.
- whisper.cpp small/base transcription: roughly 1.5× faster than real time on a Pi 5, near-real-time on a Pi 4. The tiny model is comfortably faster than real time on both.
- Background-removal models (BackgroundMattingV2): 5–10 FPS on Pi 5 at 720p — usable for low-frame-rate "smart camera" applications.
- TTS (Piper, Coqui XTTS small): faster than real time on a Pi 5.
The reason these workloads work where LLMs struggle: the models are small, and they reuse each weight many times per inference instead of streaming gigabytes of weights for every output. A 50 MB YOLO model applies the same convolution kernels at every position of every frame, so the work is dominated by arithmetic rather than memory traffic; an LLM has to walk every byte of its weights through the CPU for every single generated token.
The Pi AI Kit and other accelerators
If "AI on a Pi" is your project, the Pi 5 + Hailo-8L AI Kit is the configuration to buy. Raspberry Pi launched the AI Kit in mid-2024; it pairs the Pi 5's M.2 HAT+ with a Hailo-8L accelerator that delivers 13 TOPS of INT8 neural compute over PCIe Gen 2 x1. For vision pipelines specifically it is the largest single performance jump available on the platform.
It does not help LLMs. Hailo-8L is a neural accelerator tuned for convolutional and small transformer-block workloads with dedicated SRAM and INT8 paths; LLM inference at q4 doesn't map onto its architecture, and the existing community runtimes don't target it for autoregressive decoding. The honest framing is: AI Kit for cameras, not for chat.
Alternatives worth knowing:
- Coral USB Accelerator (Edge TPU): cheaper, connects over USB 3.0, lower TOPS (4 TOPS INT8). Buy it only if you already have the Coral software stack working.
- Jetson Orin Nano 8GB: not a Pi, but the natural step-up — adds 40 TOPS GPU + 7B-LLM-feasible memory bandwidth for around $250.
Storage matters more than people think
The single most-skipped variable on a "Pi LLM" build is the storage path. SD-card I/O is slow and unreliable for sustained model loading: a 3B model on an A1 SD card can take 25–35 seconds to load, and the card can stall reads during inference. Use a USB 3.0 SSD on a Pi 4 or an NVMe HAT on a Pi 5. Even an entry-grade SATA SSD like the Crucial BX500 1TB or SanDisk Ultra 3D 1TB over a USB 3.0 enclosure drops model-load times to 4–8 seconds and removes the SD-card variability entirely.
The Pi 5's M.2 HAT+ adds an NVMe slot at PCIe Gen 2 x1 (theoretical 500 MB/s); paired with a low-end NVMe drive, model loads land under 3 seconds.
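To check load times on your own storage rather than relying on quoted ranges, you can time a sequential read of the model file. This is a minimal sketch: the function name and the 8 MB chunk size are arbitrary choices, and a plain read is only a proxy for what an inference runtime does when it loads or memory-maps a model.

```python
import time
from pathlib import Path


def time_sequential_read(path: Path, chunk_bytes: int = 8 * 1024 * 1024) -> tuple[float, float]:
    """Read a file front to back and return (seconds, MB/s)."""
    total = 0
    start = time.perf_counter()
    with path.open("rb") as f:
        # Read fixed-size chunks until the file is exhausted.
        while chunk := f.read(chunk_bytes):
            total += len(chunk)
    elapsed = time.perf_counter() - start
    mb_per_s = (total / 1e6) / elapsed if elapsed > 0 else float("inf")
    return elapsed, mb_per_s


# Example (the path is a placeholder):
# seconds, rate = time_sequential_read(Path("/path/to/model.gguf"))
# print(f"read in {seconds:.1f} s ({rate:.0f} MB/s)")
```

Run it right after a reboot for a cold-cache number. A second run over the same file is usually served from the Linux page cache and will look far faster than the storage really is.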
Common pitfalls
- No active cooling. A Pi 4 will throttle from 1.5 GHz to 600 MHz within 30 seconds of sustained LLM inference. Add a heatsink and a 5V fan; for the Pi 5 use the Active Cooler or a passive case with airflow.
- Running off SD card. Symptoms: long pauses mid-response, occasional kernel hangs under load. Switch to USB SSD or NVMe.
- Wrong quant. People reach for q8 or fp16 GGUFs to "get better quality." On a Pi those quants are 2× slower for no perceivable benefit at the 1B–3B scale. Stay at q4_K_M.
- Underpowered PSU. The Pi 5 wants the official 27W PSU; a 15W generic supply browns out under sustained AI Kit + CPU load and can corrupt SD cards.
- Expecting Pi 4 to do Pi 5 numbers. Half the "Pi can run Llama!" videos online silently use a Pi 5. Check before buying — a Pi 4 8GB in 2026 is cheaper but it is meaningfully slower.
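The cooling and power pitfalls above can both be checked from software: the Pi firmware reports under-voltage and throttling through `vcgencmd get_throttled`, which prints a hex bitmask such as `throttled=0x50005`. The sketch below decodes that value. The function names are illustrative, and the bit meanings follow the Raspberry Pi documentation as I understand it, so verify them against the current docs for your firmware.

```python
import subprocess

# Bit positions reported by `vcgencmd get_throttled`.
FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output: str) -> list[str]:
    """Turn a line like 'throttled=0x50005' into human-readable flags."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [text for bit, text in FLAGS.items() if value & (1 << bit)]


def read_throttled() -> list[str]:
    """Query the firmware; only works on a Raspberry Pi with vcgencmd installed."""
    result = subprocess.run(
        ["vcgencmd", "get_throttled"], capture_output=True, text=True, check=True
    )
    return decode_throttled(result.stdout)
```

Calling `read_throttled()` before and after a long inference run shows whether the board throttled or dipped below its supply voltage during the run; an empty list means no flags were set.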
When NOT to use a Pi for AI
A Pi is the wrong tool when:
- You need to run a 7B+ language model at conversational speed. Just buy a used Jetson Orin Nano 8GB or an Apple Silicon Mac mini.
- You need batched inference. The Pi runs one request at a time, period.
- You need GPU-accelerated training or fine-tuning. The Pi has no usable GPU compute for ML.
- You need sustained throughput of more than 10 tok/s on anything above a 3B model. Bandwidth is the wall.
A realistic 2026 Pi-based AI rig
The build people actually keep:
- Raspberry Pi 5 8GB + Active Cooler + 27W PSU
- Pi AI Kit (Hailo-8L) for vision pipelines
- 1 TB SATA SSD over USB 3.0 (Crucial BX500 or SanDisk Ultra 3D) for model storage
- Ollama as the LLM serving layer for q4_K_M Phi-3.5 Mini and Llama 3.2 3B
- Frigate or Viseron as the vision orchestrator on top of the AI Kit
- Optional: USB microphone array + Piper TTS for a voice-loop assistant
This setup runs a 3B chat assistant, a real-time vision pipeline, and a Whisper-based transcription path concurrently without thermal throttling. It is genuinely useful for a single user. It will never compete with a discrete-GPU machine, and that's fine.
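If you use Ollama as the serving layer, you can measure generation speed on your own board instead of trusting published ranges. The sketch below assumes Ollama's default local endpoint and a non-streaming `/api/generate` call, whose JSON reply includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds). The model tag in the example is an assumption; substitute whatever `ollama list` shows on your system.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(model: str, prompt: str) -> dict:
    """Send one non-streaming request; needs a running Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Example (the model tag is an assumption):
# result = generate("llama3.2:3b", "Explain memory bandwidth in one sentence.")
# print(result["response"])
# print(f"{tokens_per_second(result):.1f} tok/s")
```

Running the same prompt with the vision pipeline active and idle gives a quick read on how much the concurrent workloads cost the chat assistant.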
Citations and sources
- Raspberry Pi Foundation — official site and product pages
- Raspberry Pi AI Kit announcement and product page
- Hailo Developer Zone — Hailo-8L documentation and benchmarks
- Jeff Geerling — Pi 5 power and performance bench
- NVIDIA — Jetson Orin Nano Developer Kit
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
