Local AI on a Raspberry Pi in 2026: What Actually Runs (and What Doesn't)

Pi 4, Pi 5, and the AI Kit — the honest map of what each board does well and what to skip.

A Pi 4 8GB or Pi 5 8GB runs sub-3B LLMs at usable speeds. Vision and audio workloads shine. Anything 7B+ becomes a slideshow — buy a Jetson instead.

A Raspberry Pi 4 8GB or Pi 5 8GB can comfortably run language models of roughly 3B parameters and under at usable speeds in 2026 (think Qwen 2.5 1.5B, Llama 3.2 3B, and the slightly larger 3.8B Phi-3.5 Mini), plus most small vision and speech models. Anything at 7B parameters and above becomes painful, at roughly 1.5–4 tok/s depending on the board. Vision and audio workloads (object detection, wake-word spotting, transcription) are where the Pi actually shines.

Why the Raspberry Pi keeps showing up in "local AI" conversations

Raspberry Pi has shipped two products in recent years that meaningfully expanded what "local AI on a Pi" means in practical terms: the Pi 5, which roughly doubled CPU performance and quadrupled memory bandwidth versus the Pi 4, and the Pi AI Kit, a Hailo-8L M.2 accelerator that adds 13 TOPS of neural compute over the Pi 5's PCIe interface. The Pi 4 8GB remains in production and has stayed cheap, and it is still the most common SBC that newcomers reach for when they hear "run an LLM at home."

In 2026 the answer to "what can I run on a Pi?" depends almost entirely on three numbers: how much DRAM the board has, what its memory bandwidth is, and whether a coprocessor (Hailo, Coral, or the Pi 5's own VideoCore VII) is doing the heavy lifting. CPU clock speed barely matters — even the Pi 5's quad Cortex-A76 at 2.4 GHz is bandwidth-bound on transformer generation, not compute-bound.

This guide is the honest, no-hype version of what actually runs on a Pi 4 8GB and a Pi 5 8GB in 2026, where the bottlenecks sit, and where you should just pay for a Jetson or a Mac mini instead.

Key takeaways

  • A Pi 4 8GB will run language models up to ~3B parameters at q4_K_M at usable speed (3–8 tok/s). A Pi 5 8GB does the same job at roughly 2× the throughput.
  • For object detection, wake-word, and transcription workloads the Pi is genuinely excellent — these models are small (<200 MB) and most run faster than real time.
  • The Pi AI Kit (Hailo-8L) drops object-detection latency by 8–15× compared to CPU-only on the Pi 5 and is the single biggest accelerator-per-dollar upgrade in the category.
  • Anything 7B and up is masochism on a Pi. A used Jetson Orin Nano 8GB or an Apple Silicon Mac mini delivers a 5–10× better experience for the same money.
  • Storage is the most-skipped variable: USB 3.0 SSDs trump SD cards for model loading. Plan on a 1 TB SATA SSD over USB 3.0 if you want to swap models without waiting 30 seconds.

Hardware reality: what each Pi actually delivers in 2026

| Board | CPU | Memory BW | Max RAM | TOPS (CPU only) | Notes |
|---|---|---|---|---|---|
| Pi Zero 2 W | 4× A53 @ 1 GHz | 1.6 GB/s | 512 MB | <0.05 | Vision only |
| Pi 4 8GB | 4× A72 @ 1.5 GHz | 4.0 GB/s | 8 GB | 0.20 | Workhorse for sub-3B LLMs |
| Pi 5 8GB | 4× A76 @ 2.4 GHz | 17 GB/s | 8 GB | 0.40 | 2× LLM throughput vs Pi 4 |
| Pi 5 + AI Kit | + Hailo-8L | + dedicated | 8 GB | + 13 TOPS INT8 | Object detection 8–15× faster |

The headline number for LLM work is memory bandwidth. The Pi 4's 4 GB/s LPDDR4 has not aged well: at q4_K_M a 3B model has to stream about 1.6 GB of weights through that 4 GB/s pipe for every generated token, which puts a back-of-envelope ceiling around 2.5 tok/s (4 GB/s ÷ 1.6 GB) before any other inefficiency. The Pi 5 roughly quadruples bandwidth to 17 GB/s, which is the entire reason a 3B model goes from "barely usable" to "actually usable" between generations.
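The bandwidth arithmetic above can be sketched in a few lines. This is a simplified model, assuming decoding is purely bandwidth-bound and every weight byte is read exactly once per token; the function name and the dictionary of boards are illustrative, and the bandwidth and weight-size figures are the ones quoted in this article.

```python
def bandwidth_bound_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough decode ceiling if every weight byte is read once per token."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Figures quoted in this article (assumptions, not measurements).
BOARD_BANDWIDTH_GB_S = {"Pi 4": 4.0, "Pi 5": 17.0}
WEIGHTS_GB = 1.6  # 3B model at q4_K_M, as quoted above

ceilings = {
    board: bandwidth_bound_tok_s(bw, WEIGHTS_GB)
    for board, bw in BOARD_BANDWIDTH_GB_S.items()
}
for board, tok_s in ceilings.items():
    print(f"{board}: ~{tok_s:.1f} tok/s bandwidth-bound ceiling")
```

Treat the output as a rough guide rather than a prediction: both inputs are approximations, so the result will not line up exactly with measured tok/s.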

Jeff Geerling's (geerlingguy) benchmark data on the Pi 5 vs Pi 4 is the canonical reference for these throughput numbers; community benchmarks on r/LocalLLaMA have backed them up across a dozen model sizes and quants.

What language models actually run on a Pi 4 8GB?

Numbers below are best-case at q4_K_M on llama.cpp with 4-thread inference on a Pi 4 8GB Model B with a heatsink and active cooling. The Pi 4 throttles aggressively without cooling; expect 30–40% lower numbers on a bare board.

| Model | Params | Quant | RAM used | tok/s | Verdict |
|---|---|---|---|---|---|
| TinyLlama 1.1B | 1.1B | q4_K_M | 0.8 GB | 12–15 | Fine for toy chat |
| Phi-3.5 Mini | 3.8B | q4_K_M | 2.4 GB | 5–7 | Best practical chat |
| Llama 3.2 3B Instruct | 3.0B | q4_K_M | 2.0 GB | 6–8 | Most reliable answers |
| Qwen 2.5 1.5B Instruct | 1.5B | q4_K_M | 1.0 GB | 9–12 | Strong tiny-model option |
| Gemma 2 2B IT | 2.6B | q4_K_M | 1.8 GB | 7–9 | Solid for short prompts |
| Llama 3.1 8B Instruct | 8.0B | q4_K_M | 4.7 GB | 1.5–2.0 | Painful |
| Mistral 7B v0.3 | 7.0B | q4_K_M | 4.1 GB | 1.5–2.2 | Painful |

The pattern: anything in the 3B class and under (up to Phi-3.5 Mini at 3.8B) at q4 is useful on a Pi 4, since 5–8 tok/s is faster than most humans can read. Anything 7B and above runs but produces tokens slower than you can stay engaged with them.

A second pattern worth noting: short prompts dominate. The Pi 4 takes 2–4 seconds of prefill on a 1K-token prompt for a 3B model; bump to 4K context and prefill alone can take 12–18 seconds. For agentic workflows or RAG, that prefill cost is the actual user-facing bottleneck, not the per-token generation speed.
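A rough way to see why prefill dominates for RAG-style requests is to split response time into prefill and decode. The sketch below is a simplified model with illustrative inputs: the rates are derived from the midpoints of the ranges quoted above (about 15 s of prefill for a 4K prompt, 7 tok/s decode), and the function name is hypothetical.

```python
def response_latency_s(prompt_tokens: int, output_tokens: int,
                       prefill_tok_s: float, decode_tok_s: float) -> tuple[float, float]:
    """Return (time_to_first_token, total_time) in seconds for one request."""
    if prefill_tok_s <= 0 or decode_tok_s <= 0:
        raise ValueError("rates must be positive")
    ttft = prompt_tokens / prefill_tok_s
    total = ttft + output_tokens / decode_tok_s
    return ttft, total


# Assumed rates: 4096 prompt tokens in ~15 s of prefill, 7 tok/s decode.
PREFILL_TOK_S = 4096 / 15.0
DECODE_TOK_S = 7.0

# A RAG-style request: long retrieved context, short answer.
ttft, total = response_latency_s(4096, 60, PREFILL_TOK_S, DECODE_TOK_S)
print(f"time to first token: {ttft:.1f} s, total: {total:.1f} s, "
      f"prefill share: {ttft / total:.0%}")
```

With a 60-token answer, prefill accounts for well over half of the wait in this example; with long answers and short prompts the balance shifts back toward decode speed.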

What runs on a Pi 5 8GB

The Pi 5's 17 GB/s memory bandwidth roughly doubles every LLM number from the Pi 4: same models, same q4_K_M quant, same llama.cpp build, just faster.

| Model | tok/s on Pi 5 8GB |
|---|---|
| Llama 3.2 3B Instruct | 12–16 |
| Phi-3.5 Mini 3.8B | 10–14 |
| Qwen 2.5 1.5B Instruct | 18–22 |
| Gemma 2 2B IT | 14–18 |
| Llama 3.1 8B Instruct | 3–4 |
| Mistral 7B v0.3 | 3–4 |

The Pi 5 makes 3B-class chat genuinely real-time, and 8B-class chat technically possible. It still does not turn the Pi into a serious LLM host — that takes a discrete GPU or a dedicated AI accelerator — but it does open the door to a useful on-device assistant for things like home automation NLU, simple code completion, or a local "knock-on-it" summarizer.

The biggest practical upgrade if you have a Pi 5 is cooling. The board boosts to 2.4 GHz only when thermal headroom allows, and aggressive coolers like the official Active Cooler or a small case fan keep sustained inference workloads from throttling.

Where the Pi actually shines: vision and audio

Language models stress the part of the Pi (memory bandwidth) that's weakest. Vision and audio models stress compute, which the Pi is surprisingly good at. Concrete examples that run real-time or faster on a Pi 5 8GB:

  • YOLOv8n object detection at 640×640: 15–20 FPS on CPU, 60+ FPS with the Hailo-8L AI Kit. Per the Hailo developer documentation, the Pi 5 + AI Kit hits 90+ FPS on YOLOv8s and still maintains 30+ FPS on the heavier YOLOv8m model.
  • Wake-word detection (openWakeWord, Picovoice Porcupine): single-digit-millisecond latency, runs on a single core.
  • Whisper.cpp small/base transcription: roughly 1.5× real-time on a Pi 5, near-real-time on a Pi 4. Tiny model is comfortably faster than real time on both.
  • Background-removal models (BackgroundMattingV2): 5–10 FPS on Pi 5 at 720p — usable for low-frame-rate "smart camera" applications.
  • TTS (Piper, Coqui XTTS small): faster than real time on a Pi 5.
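"Faster than real time" claims like the ones above are easy to check on your own board by comparing audio duration to processing time. The sketch below assumes the convention that a speed above 1.0 means faster than real time; the helper names are hypothetical, and `timed` can wrap whatever transcription or TTS call you actually use.

```python
import time


def realtime_speed(audio_seconds: float, processing_seconds: float) -> float:
    """Speed relative to real time: above 1.0 means faster than real time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Example: a 30-second clip that takes 20 seconds to transcribe.
print(f"{realtime_speed(30.0, 20.0):.2f}x real time")
```

Under this convention, a 30-second clip processed in 20 seconds is 1.5× real time, and anything below 1.0 means the pipeline falls behind live audio.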

The reason these workloads work where LLMs struggle: they fit entirely in cache or compress weights effectively, and they don't require streaming gigabytes of weights per inference. A 50 MB YOLO model gets re-used across thousands of frames with the same weights resident on-chip; an LLM has to walk every byte of weights through the CPU for every single generated token.

The Pi AI Kit and other accelerators

If "AI on a Pi" is your project, the Pi 5 + Hailo-8L AI Kit is the configuration to buy. Raspberry Pi launched the AI Kit in mid-2024; it pairs the Pi 5's M.2 HAT+ with a Hailo-8L accelerator that delivers 13 TOPS of INT8 neural compute over PCIe Gen 2 x1. For vision pipelines specifically it is the largest single performance jump available on the platform.

It does not help LLMs. Hailo-8L is a neural accelerator tuned for convolutional and small transformer-block workloads with dedicated SRAM and INT8 paths; LLM inference at q4 doesn't map onto its architecture, and the existing community runtimes don't target it for autoregressive decoding. The honest framing is: AI Kit for cameras, not for chat.

Alternatives worth knowing:

  • Coral USB Accelerator (Edge TPU): cheaper, attaches over USB 3.0 rather than PCIe, lower TOPS (4 vs 13). Buy it only if you already have the Coral software stack working.
  • Jetson Orin Nano 8GB: not a Pi, but the natural step-up — adds 40 TOPS GPU + 7B-LLM-feasible memory bandwidth for around $250.

Storage matters more than people think

The single most-skipped variable on a "Pi LLM" build is the storage path. SD-card I/O is slow and unreliable for sustained model loading: a 3B model on an A1 SD card can take 25–35 seconds to load, and the card's internal garbage collection can cause occasional pauses during inference. Use a USB 3.0 SSD on a Pi 4 or an NVMe HAT on a Pi 5. Even an entry-grade SATA SSD like the Crucial BX500 1TB or SanDisk Ultra 3D 1TB over a USB 3.0 enclosure drops model-load times to 4–8 seconds and removes the SD-card variability entirely.

The Pi 5's M.2 HAT+ adds an NVMe slot at PCIe Gen 2 x1 (theoretical 500 MB/s); paired with a low-end NVMe drive, model loads land under 3 seconds.
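Load times depend on the sequential read throughput you actually get, which is simple to measure. The following is a minimal sketch, assuming a plain sequential read is a fair proxy for model loading; the function names are hypothetical and the path in the comment is a placeholder for your own model file.

```python
import time


def measure_read_throughput(path: str, chunk_bytes: int = 8 * 1024 * 1024):
    """Read a file sequentially; return (bytes_read, seconds, MB_per_s)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            total += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total, elapsed, total / elapsed / 1e6


def estimated_load_seconds(model_bytes: int, mb_per_s: float) -> float:
    """Time to read model_bytes at a given sustained throughput."""
    return model_bytes / (mb_per_s * 1e6)


# Example (placeholder path):
#   size, secs, mbps = measure_read_throughput("/path/to/model.gguf")
# Note: a repeat read is usually served from the page cache, so measure
# after a reboot or on a file that has not been read recently.
```

As a sanity check on the arithmetic, `estimated_load_seconds(2_000_000_000, 500)` gives 4 seconds for a 2 GB file at 500 MB/s.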

Common pitfalls

  • No active cooling. A Pi 4 will throttle from 1.5 GHz to 600 MHz within 30 seconds of sustained LLM inference. Add a heatsink and a 5V fan; for the Pi 5 use the Active Cooler or a passive case with airflow.
  • Running off SD card. Symptoms: long pauses mid-response, occasional kernel hangs under load. Switch to USB SSD or NVMe.
  • Wrong quant. People reach for q8 or fp16 GGUFs to "get better quality." On a Pi those quants are roughly 2× slower or worse (fp16 more so, because the files are larger still) for no perceivable benefit at the 1B–3B scale. Stay at q4_K_M.
  • Underpowered PSU. The Pi 5 wants the official 27W PSU; a generic 15W supply browns out under sustained AI Kit + CPU load and corrupts SD cards.
  • Expecting Pi 4 to do Pi 5 numbers. Half the "Pi can run Llama!" videos online silently use a Pi 5. Check before buying — a Pi 4 8GB in 2026 is cheaper but it is meaningfully slower.
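The cooling and power pitfalls above can be checked from the firmware rather than guessed at. Raspberry Pi OS ships `vcgencmd get_throttled`, which reports a bitmask of under-voltage and throttling events. The sketch below decodes that bitmask; the flag labels and function names are my own, and `read_throttled` only works on a Pi with `vcgencmd` installed.

```python
import subprocess

# Bit positions as documented for `vcgencmd get_throttled`.
FLAGS = {
    0: "under-voltage now",
    1: "ARM frequency capped now",
    2: "throttled now",
    3: "soft temperature limit now",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output: str) -> list[str]:
    """Decode a line such as 'throttled=0x50005' into readable flags."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [label for bit, label in FLAGS.items() if value & (1 << bit)]


def read_throttled() -> list[str]:
    """Query the firmware; requires a Raspberry Pi with vcgencmd available."""
    result = subprocess.run(
        ["vcgencmd", "get_throttled"],
        capture_output=True, text=True, check=True,
    )
    return decode_throttled(result.stdout)
```

An empty list after a long inference run suggests the cooling and PSU are adequate; any "has occurred" flag means the board was held back at some point since boot.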

When NOT to use a Pi for AI

A Pi is the wrong tool when:

  • You need to run a 7B+ language model at conversational speed. Just buy a used Jetson Orin Nano 8GB or an Apple Silicon Mac mini.
  • You need batched inference. The Pi runs one request at a time, period.
  • You need GPU-accelerated training or fine-tuning. The Pi has no usable GPU compute for ML.
  • You need sustained throughput of more than 10 tok/s on anything above the 3B class. Bandwidth is the wall.

A realistic 2026 Pi-based AI rig

The build people actually keep:

  • Raspberry Pi 5 8GB + Active Cooler + 27W PSU
  • Pi AI Kit (Hailo-8L) for vision pipelines
  • 1 TB SATA SSD over USB 3.0 (Crucial BX500 or SanDisk Ultra 3D) for model storage
  • Ollama as the LLM serving layer for q4_K_M Phi-3.5 Mini and Llama 3.2 3B
  • Frigate or Viseron as the vision orchestrator on top of the AI Kit
  • Optional: USB microphone array + Piper TTS for a voice-loop assistant

This setup runs a 3B chat assistant, a real-time vision pipeline, and a Whisper-based transcription path concurrently without thermal throttling. It is genuinely useful for a single user. It will never compete with a discrete-GPU machine, and that's fine.
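To confirm what the LLM side of such a build delivers, you can query Ollama's local HTTP API and compute tok/s from the timing fields it returns. This is a minimal sketch, assuming Ollama is running on its default port and that the model tag `llama3.2:3b` has already been pulled; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint


def tokens_per_second(response: dict) -> float:
    """Decode speed from eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(prompt: str, model: str = "llama3.2:3b") -> dict:
    """Send one non-streaming request and return the parsed JSON reply."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as resp:
        return json.load(resp)


# Example, on a Pi with Ollama running:
#   reply = generate("Explain memory bandwidth in one sentence.")
#   print(reply["response"])
#   print(f"{tokens_per_second(reply):.1f} tok/s")
```

Running the same prompt a few times, with the vision pipeline both idle and active, gives a quick read on whether concurrent workloads are eating into decode speed.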

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

Can a Raspberry Pi 4 run an LLM in 2026?
Yes, but only small ones. A Pi 4 8GB will run language models up to about 3B parameters at 4-bit quantization at usable speeds, typically 3 to 8 tokens per second for models like Phi-3.5 Mini and Llama 3.2 3B. Models at 7B and above become painfully slow because the Pi 4's 4 GB/s memory bandwidth is the bottleneck.
Is the Pi 5 meaningfully faster than the Pi 4 for AI?
Yes — roughly 2x faster for LLM inference and significantly faster for vision workloads. The Pi 5 quadruples memory bandwidth from 4 to 17 GB/s, which is the single most important spec for transformer generation. A 3B model that runs at 6 tok/s on a Pi 4 will hit 12 to 16 tok/s on a Pi 5 with the same quant and settings.
Is the Pi AI Kit worth buying?
For computer vision workloads, decisively yes. The Hailo-8L accelerator delivers 13 TOPS over PCIe and drops object-detection latency by 8 to 15 times compared to CPU-only inference on the Pi 5. It does not help language models because Hailo's architecture is tuned for convolutional networks and small transformer blocks, not autoregressive decoding.
What can a Pi 5 with AI Kit actually do?
It can run real-time object detection at 60+ FPS using YOLOv8n, wake-word detection with single-digit-millisecond latency, near-real-time Whisper transcription, and small text-to-speech models faster than real time. Paired with a 3B chat model on the CPU, you get a genuinely useful on-device assistant for home automation, voice control, and edge analytics.
Should I use an SD card or SSD for AI workloads?
Use an SSD. SD cards make model loading slow (25 to 35 seconds for a 3B model) and introduce occasional pauses during inference because of garbage collection. A USB 3.0 SSD drops model loads to 4 to 8 seconds and removes that variability. On the Pi 5, an NVMe HAT with even a basic drive gets you under 3 seconds.

— SpecPicks Editorial · Last verified 2026-05-29