Yes, a Raspberry Pi 4 8GB can run a local LLM, but only slowly and on small models: at speeds you'd accept for background tasks, not for interactive chat. With a 3B-class quantized model and a fast USB-attached SSD, expect 2-6 tokens per second of generated output. That's enough for home-automation intents, classification pipelines, and short summaries; it's not enough for a responsive coding assistant or a chat partner.
The Pi-as-LLM-host idea has been circulating since the first LLaMA weights leaked, but the gap between "loads" and "runs usefully" is wide on this hardware. CPU-only inference at 8GB has fundamental ceilings: memory bandwidth caps the token-generation rate, the compute throughput of the four Cortex-A72 cores caps the prompt-processing rate, and the absence of dedicated tensor hardware means software alone can't close the gap. Knowing those ceilings helps you pick the right tasks for the platform and avoid the disappointment that comes from expecting desktop-class speed.
Key takeaways
- A Pi 4 8GB runs 1B-3B models at q4 quantization at 2-6 tokens per second — slow but usable for background tasks.
- 7B-class models technically load in 8GB at low quant but generate at 1-2 tok/s, which is too slow for chat.
- A USB-attached SSD (WD Blue SN550 or Crucial BX500 in an enclosure) is required for usable load times and long-term reliability.
- Best use cases: home-automation intent classification, offline assistants for narrow domains, prototype edge-AI projects, learning the local-LLM tooling.
- For interactive chat or larger models, step up to a Pi 5, a mini-PC with iGPU, or a 12GB GPU desktop rig.
What you'll need checklist
- A Raspberry Pi 4 Model B 8GB — the 8GB variant is non-negotiable; 4GB and below run out of headroom with any model larger than 1B.
- USB-attached storage: a WD Blue SN550 1TB NVMe in a USB 3.0 enclosure, or a Crucial BX500 1TB SATA SSD with a USB bridge. Either dramatically outperforms a microSD card for model loads.
- Active cooling: a fan-equipped case or the Argon ONE V2 chassis. The Pi 4 throttles at 80°C and you'll hit that within minutes of any sustained inference workload on a passive heatsink.
- A 3 A USB-C power supply (the official Raspberry Pi PSU is the safest pick — undervoltage on inference workloads causes silent throttling and crashes).
- Raspberry Pi OS 64-bit (32-bit can't address the full 8GB for a single process).
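The 80°C throttle point in the checklist is easy to watch for during a long inference run. Here is a minimal Python sketch, assuming the standard Linux thermal sysfs file that Raspberry Pi OS exposes (`/sys/class/thermal/thermal_zone0/temp`, in millidegrees Celsius). The 5°C warning margin and the helper names are illustrative choices, not part of any tool.

```python
from pathlib import Path

THERMAL_PATH = Path("/sys/class/thermal/thermal_zone0/temp")  # assumed sysfs location
THROTTLE_C = 80.0     # Pi 4 throttle point cited in the checklist
WARN_MARGIN_C = 5.0   # hypothetical early-warning margin


def parse_millidegrees(raw: str) -> float:
    """Convert a sysfs reading such as '61234\n' to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def thermal_status(temp_c: float) -> str:
    """Classify a temperature relative to the throttle point."""
    if temp_c >= THROTTLE_C:
        return "throttling"
    if temp_c >= THROTTLE_C - WARN_MARGIN_C:
        return "warning"
    return "ok"


def read_status(path: Path = THERMAL_PATH) -> str:
    """Read the sensor file and return a one-line summary."""
    temp_c = parse_millidegrees(path.read_text())
    return f"{temp_c:.1f} C: {thermal_status(temp_c)}"


# Only attempt a live reading when the sensor file is present.
if THERMAL_PATH.exists():
    print(read_status())
```

Running this in a loop alongside a benchmark shows whether a given case and fan hold the SoC below the throttle point.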
Which models fit in 8GB?
The Pi 4's 8GB is shared between the OS, background services, the model weights, and the inference runtime's working memory (KV cache and buffers). Realistic footprints for the weights:
| Model | Quant | RAM (approx.) | Notes |
|---|---|---|---|
| TinyLlama 1.1B | q4_K_M | 700 MB | Fast, weak — good for classification |
| Phi-3 Mini 3.8B | q4_K_M | 2.4 GB | Best 8GB sweet spot |
| Gemma 2 2B | q4_K_M | 1.5 GB | Lightweight chat |
| Qwen2.5 3B | q4_K_M | 2.0 GB | Strong for size |
| Llama 3.1 8B | q4_K_M | 4.9 GB | Loads but slow generation |
| Llama 3.1 8B | q2_K | 3.1 GB | Loads with headroom, quality degraded |
The sweet spot for a Pi 4 8GB is the 2-3B class at q4, with Phi-3 Mini at 3.8B sitting at the top of that range. That leaves 4-5 GB free for the OS and any other services running on the Pi, and the model generates at a usable speed for its size class.
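To turn the footprint table into a quick fit check, a small sketch along these lines can help. The footprints are copied from the table above. The OS reserve, the runtime overhead, and the minimum headroom are assumptions chosen for illustration, so adjust them for your own setup.

```python
# Approximate weight footprints in GB, copied from the table above.
FOOTPRINTS_GB = {
    ("TinyLlama 1.1B", "q4_K_M"): 0.7,
    ("Phi-3 Mini 3.8B", "q4_K_M"): 2.4,
    ("Gemma 2 2B", "q4_K_M"): 1.5,
    ("Qwen2.5 3B", "q4_K_M"): 2.0,
    ("Llama 3.1 8B", "q4_K_M"): 4.9,
    ("Llama 3.1 8B", "q2_K"): 3.1,
}

TOTAL_RAM_GB = 8.0
OS_RESERVE_GB = 1.5        # assumption: OS plus background services
RUNTIME_OVERHEAD_GB = 1.0  # assumption: KV cache and scratch buffers


def headroom_gb(weights_gb: float) -> float:
    """RAM left over after the OS, the runtime overhead, and the weights."""
    return TOTAL_RAM_GB - OS_RESERVE_GB - RUNTIME_OVERHEAD_GB - weights_gb


def fits(weights_gb: float, min_headroom_gb: float = 0.5) -> bool:
    """True when at least min_headroom_gb (an assumed margin) remains."""
    return headroom_gb(weights_gb) >= min_headroom_gb


# Print models from smallest to largest footprint.
for (model, quant), weights in sorted(FOOTPRINTS_GB.items(), key=lambda item: item[1]):
    print(f"{model:<16} {quant:<7} headroom {headroom_gb(weights):4.1f} GB  fits={fits(weights)}")
```

Under these assumptions every model in the table fits, but the 8B q4_K_M entry does so with little margin, which matches the "loads but slow" note.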
Benchmark table: tok/s on the Pi 4 8GB
Community measurements from Phoronix's Raspberry Pi benchmark coverage and r/LocalLLaMA's Pi-specific threads land roughly here:
| Model | Quant | Prompt tok/s | Generation tok/s |
|---|---|---|---|
| TinyLlama 1.1B | q4_K_M | 35-50 | 12-18 |
| Gemma 2 2B | q4_K_M | 20-30 | 6-9 |
| Qwen2.5 3B | q4_K_M | 14-20 | 4-7 |
| Phi-3 Mini 3.8B | q4_K_M | 12-16 | 3-5 |
| Llama 3.1 8B | q4_K_M | 6-9 | 1-2 |
These assume a stable CPU temperature (no throttling), the model loaded once and reused, and llama.cpp built with NEON optimizations. Cold-load times for the 3-4B models run 8-15 seconds on USB SSD vs 30-60 seconds on microSD.
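The two rate columns combine into a rough end-to-end response time: prompt tokens divided by the prompt rate, plus output tokens divided by the generation rate. The sketch below uses the ranges from the table. The 40-token prompt and 15-token reply are a hypothetical workload roughly the size of a home-automation command, not a measured one.

```python
# (prompt tok/s range, generation tok/s range), copied from the benchmark table.
RATES = {
    "TinyLlama 1.1B": ((35, 50), (12, 18)),
    "Gemma 2 2B": ((20, 30), (6, 9)),
    "Qwen2.5 3B": ((14, 20), (4, 7)),
    "Phi-3 Mini 3.8B": ((12, 16), (3, 5)),
    "Llama 3.1 8B": ((6, 9), (1, 2)),
}


def response_seconds(prompt_tokens, output_tokens, prompt_rate, gen_rate):
    """Time to process the prompt plus time to generate the reply."""
    return prompt_tokens / prompt_rate + output_tokens / gen_rate


def latency_range(model, prompt_tokens, output_tokens):
    """Best and worst case, using the top and bottom of each rate range."""
    (p_lo, p_hi), (g_lo, g_hi) = RATES[model]
    best = response_seconds(prompt_tokens, output_tokens, p_hi, g_hi)
    worst = response_seconds(prompt_tokens, output_tokens, p_lo, g_lo)
    return best, worst


# Hypothetical workload: 40-token prompt, 15-token structured reply.
for model in RATES:
    best, worst = latency_range(model, prompt_tokens=40, output_tokens=15)
    print(f"{model:<16} {best:4.1f}-{worst:4.1f} s")
```

This ignores model load time, so it applies only once the model is resident in memory.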
Why CPU-only inference is slow
Two architectural facts bound throughput:
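- Memory bandwidth: The Pi 4's LPDDR4 delivers around 4 GB/s of usable bandwidth for matrix-multiply workloads. Each generated token requires reading essentially all of the model's weights from memory, so the ceiling is roughly bandwidth divided by model size: a 3B model at q4 is about 2 GB, which gives ~2 tokens per second before any compute time is counted. Smaller models hit higher rates because they read less data per token. Treat this as a back-of-envelope bound; the community figures in the benchmark table imply a somewhat higher effective bandwidth than 4 GB/s.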
- No dedicated tensor hardware: The Pi 4's Cortex-A72 cores have NEON SIMD but no equivalent of NVIDIA's tensor cores or Apple's Neural Engine. Every matrix multiply runs on general-purpose vector units, which is dramatically less power-efficient than dedicated hardware.
Software optimizations (NEON kernels, quantization-aware code paths) get you some of the way; the fundamental ceilings remain. There's no software trick that turns a 4 GB/s memory bus into a 360 GB/s one.
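The bandwidth argument reduces to one division, sketched below. It assumes every generated token reads the full set of weights exactly once and ignores caches, compute time, and KV-cache reads. The 4 GB/s and 360 GB/s figures are the ones quoted in this section, and the function names are illustrative.

```python
def ceiling_tok_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on generation rate if each token reads all weights once."""
    return bandwidth_gb_s / model_gb


def implied_bandwidth_gb_s(tok_per_s: float, model_gb: float) -> float:
    """Effective bandwidth that an observed generation rate would require."""
    return tok_per_s * model_gb


MODEL_GB = 2.0  # a 3B model at q4, per the text above

for label, bandwidth in [("Pi 4 (4 GB/s)", 4.0), ("360 GB/s memory bus", 360.0)]:
    print(f"{label:<20} ceiling {ceiling_tok_per_s(bandwidth, MODEL_GB):6.1f} tok/s")
```

The inverse helper is a useful sanity check on any benchmark: multiply a reported tok/s by the model size and see whether the implied bandwidth is plausible for the hardware.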
SSD vs microSD
microSD cards are slow (typically 50-100 MB/s sequential read on the Pi's interface) and have limited write endurance. For LLM workloads where the model file is large (multi-GB) and read frequently, the microSD becomes the load-time bottleneck and a long-term reliability risk.
A USB 3.0-attached SSD (the Pi 4 has USB 3.0; earlier Pis didn't) achieves 200-400 MB/s reads through the bridge chip. Model load times drop by 3-5×, and write endurance becomes a non-issue. The WD Blue SN550 NVMe in a USB enclosure or a Crucial BX500 SATA SSD with a USB bridge are both well-tested in this role.
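A lower bound on cold-load time is file size divided by sequential read speed. The sketch below uses the read-speed ranges quoted above and the 2.4 GB Phi-3 Mini footprint. Real loads come out somewhat slower because of filesystem and runtime overhead, which this estimate deliberately ignores.

```python
def load_seconds(model_gb: float, read_mb_s: float) -> float:
    """Pure sequential-read time for a model file (1 GB taken as 1000 MB)."""
    return model_gb * 1000 / read_mb_s


MODEL_GB = 2.4  # Phi-3 Mini 3.8B at q4_K_M

# Sequential read ranges in MB/s, as quoted in this section.
STORAGE_MB_S = {"microSD": (50, 100), "USB 3.0 SSD": (200, 400)}

for name, (slow, fast) in STORAGE_MB_S.items():
    print(f"{name:<12} {load_seconds(MODEL_GB, fast):5.1f}-{load_seconds(MODEL_GB, slow):5.1f} s")
```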
A common upgrade pattern: boot the OS from microSD for the first install, then migrate the root filesystem to the SSD using rpi-clone or the official SD Card Copier. Subsequent boots happen from SSD at much higher speed and reliability.
Realistic use cases
Use cases where a Pi 4 LLM works well:
- Home automation intent classification: a small model translates spoken or typed commands into structured intents for Home Assistant or similar. Latency is acceptable (1-3 seconds) and the model is small.
- Offline narrow-domain assistants: a 3B model fine-tuned (or just system-prompted) for cooking recipes, gardening advice, or a specific game's lore. Limited scope keeps quality high despite small model size.
- Prototype edge-AI projects: validating that a workflow makes sense before deploying it to faster hardware. The Pi runs the same llama.cpp / Ollama stack as a desktop, just slower.
- Learning the local-LLM tooling: getting comfortable with model loading, prompt design, and inference-engine configuration on cheap hardware.
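As an illustration of the intent-classification case, here is a hedged Python sketch that talks to a local Ollama server over its HTTP API (`/api/generate` on the default port 11434, with `stream` disabled and JSON output requested). The model tag, the intent list, and the prompt wording are assumptions made for the example; substitute whatever small model you have pulled and whatever intents your automation setup uses.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "qwen2.5:3b"  # assumption: any small model already pulled
INTENTS = ["lights_on", "lights_off", "set_temperature", "unknown"]  # hypothetical


def build_payload(command: str) -> dict:
    """Build a non-streaming request that asks for a JSON-formatted reply."""
    prompt = (
        "Classify the command into one of these intents: "
        + ", ".join(INTENTS)
        + '. Reply with JSON like {"intent": "lights_on"}.\n'
        + f"Command: {command}"
    )
    return {"model": MODEL, "prompt": prompt, "stream": False, "format": "json"}


def parse_intent(response_body: str) -> str:
    """Extract the intent from the server reply; fall back to 'unknown'."""
    try:
        reply = json.loads(response_body)["response"]
        intent = json.loads(reply).get("intent", "unknown")
    except (KeyError, TypeError, ValueError, AttributeError):
        return "unknown"
    return intent if intent in INTENTS else "unknown"


def classify(command: str, timeout_s: float = 60.0) -> str:
    """Send one command to the local server and return the parsed intent."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(command)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return parse_intent(response.read().decode("utf-8"))
```

Calling `classify("turn on the kitchen lights")` requires a running Ollama server with the model available. Validating the reply against a fixed intent list keeps a small model's occasional malformed output from reaching the automation layer.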
Use cases where a Pi 4 LLM doesn't work:
- Interactive chat: 2-6 tok/s is below the threshold where chat feels responsive.
- Coding assistance: too slow per response, too small for code reasoning quality.
- Image generation: not realistically possible — diffusion models need GPU acceleration.
- Long context: KV cache grows quickly and the 8GB ceiling caps useful context length.
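The long-context limit can be estimated with the usual KV-cache formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below assumes a 16-bit cache and uses Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dimension 128) as the example. Runtimes that quantize the cache will use less memory.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Size of the key/value cache; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Example shape: Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for context in (2048, 8192, 32768):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, context) / 2**30
    print(f"context {context:>6}: {gib:5.2f} GiB")
```

At this shape the cache costs 128 KiB per token, so a long context quickly competes with the weights for the Pi's 8GB.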
When to step up
If your use case outgrows the Pi 4, the natural upgrade paths:
- Raspberry Pi 5 8GB: roughly 2-3× the LLM throughput at similar power, similar form factor. The right upgrade if you love the Pi platform.
- Intel N100 Mini PC: 4-6× the throughput, runs Ollama and Open WebUI comfortably with iGPU acceleration, lands around $200.
- 12GB GPU desktop rig: the cheapest serious local-LLM hardware (see our budget LLM build coverage). 50-100× the Pi 4's throughput, runs models 5-10× larger.
The Pi 4 is the right entry point for learning and tinkering. It's not the right platform for production local-LLM work; step up when your workload demands it.
Bottom line
A Pi 4 8GB runs local LLMs slowly, on small models, with constraints that come from architecture rather than software. It's a great learning platform, a credible edge-AI host for narrow tasks, and a fun tinkering target. For chat-class interactive use or any 7B+ model, you need different hardware. Pair the Pi with a USB SSD, install Ollama, grab a 3B-class q4 model, and treat the platform for what it is: a low-power, always-on, small-model edge node.
Frequently asked questions
How fast can a Raspberry Pi 4 8GB actually run an LLM?
Slowly by GPU standards. With small quantized models in the 1B-3B range, a Pi 4 produces a few tokens per second: usable for short prompts and background tasks but tedious for interactive chat. The 7B class technically loads in 8GB at low quantization but crawls at 1-2 tokens per second. Treat the Pi as an edge-AI and learning platform, not a responsive chat box. The ceiling is set by memory bandwidth, not by software; no amount of tuning will turn the Pi 4 into a fast LLM host.
Do I need an SSD, or will a microSD card work?
An SSD over USB is strongly recommended. Model weights are large and read frequently at load, and microSD cards are slow and prone to wear-out under heavy use, causing corruption on long-running projects. A USB-attached SSD like the WD SN550 or Crucial BX500 dramatically improves load times and reliability, which matters when a model fills most of the Pi's resources. The cost delta is small and the reliability improvement is real.
Which models should I run on a Pi 4?
Stick to small, quantized models — 1B to 3B parameter classes at q4 — for acceptable responsiveness. These handle classification, simple Q&A, summarization of short text, and home-automation intents well. Larger 7B models fit but respond too slowly for interactive use. Match the model to lightweight, latency-tolerant tasks rather than expecting desktop-class conversation. Phi-3 Mini 3.8B at q4_K_M is the current sweet spot for capability-per-GB on this hardware.
Can I use Ollama on a Raspberry Pi?
Yes. Ollama runs on ARM Linux and is one of the easiest ways to pull and serve small models on a Pi 4. It handles model management and exposes an API that your home-automation setup or scripts can call. Performance is bound by the Pi's CPU and memory bandwidth, so pick small models, but the software experience itself is straightforward. Installation is a single shell command, and the same Ollama API your desktop uses works identically on the Pi.
When should I move off the Pi to real hardware?
If you want responsive chat, larger models, or image generation, the Pi's CPU-only inference becomes the bottleneck and a 12GB GPU rig or a capable mini-PC is the next step. Use the Pi to learn the tooling and prototype edge tasks, then graduate to a GPU when speed and model size start limiting what you can build. A practical signal is when prompt-processing latency regularly exceeds 2-3 seconds and you start avoiding the model out of impatience.
Citations and sources
- Ollama on GitHub — local LLM runtime that works on ARM Linux
- Raspberry Pi 4 Model B product page — hardware spec reference
- Phoronix — Raspberry Pi benchmark coverage and performance data
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
