The 30-second answer
The Raspberry Pi 4 8GB runs small open-weights LLMs in 2026, but only at the 1B-3B parameter tier with usable throughput — 4-8 tok/s on a 1B model, 1-3 tok/s on a 3B model. A 7B model technically runs at sub-1 tok/s, which is fine for batch jobs and unusable for chat. The Pi 4 is a credible learning and tinkering platform for edge AI; for serious local-LLM work, the Pi 5 or a small x86 mini-PC delivers materially better numbers at similar cost.
Why this question keeps coming up
The Raspberry Pi 4 was a milestone product. The 8GB model in particular — released in May 2020, per the official Raspberry Pi 4 specifications — was the first single-board computer in the Pi line with enough RAM for serious workloads beyond hobbyist scripting. Combined with the quad-core Cortex-A72 SoC at 1.5GHz (later 1.8GHz on revised silicon), the 8GB Pi 4 became the default platform for self-hosted home services, Pi-hole DNS sinkholes, Home Assistant deployments, and a long tail of edge-compute projects.
The local-LLM wave that started in 2023 created an obvious follow-on question: can this same widely deployed hardware run the models that everyone is suddenly interested in? The answer is yes for small models and no for the flagship 70B-class releases. But "small" in 2026 includes some genuinely useful chat and code-completion models, and the Pi 4's price-to-capability ratio for that workload remains interesting.
Key Takeaways
- Inference on the Pi 4 8GB is CPU-only; the integrated VideoCore GPU is not LLM-capable, so there is no GPU acceleration path.
- Models in the 1B-3B class at 4-bit quantization fit comfortably and deliver hobbyist-grade throughput.
- 7B models technically run but at sub-1 tok/s — batch use only.
- Storage matters for model swapping; a fast SD card or external SSD is recommended.
- Active cooling is mandatory for sustained inference; the Pi 4 throttles aggressively under heat.
What models fit on 8GB of RAM
The Pi 4's 8GB of LPDDR4 RAM is the binding constraint. Subtract roughly 1GB for the OS and runtime, leaving ~7GB for model weights, KV-cache, and inference state. At 4-bit quantization:
| Model class | Approx RAM (4-bit) | Pi 4 8GB fit | Realistic use |
|---|---|---|---|
| TinyLlama 1.1B | ~600MB | Comfortable | Interactive chat, edge inference |
| Phi 2 (2.7B) / Phi 3 mini (3.8B) | ~1.8-2.4GB | Comfortable | Slow chat, batch summarization |
| Qwen 3 1.5B | ~1GB | Comfortable | Interactive chat |
| Gemma 4 2B | ~1.3GB | Comfortable | Edge applications |
| Llama 3.x 3B | ~2GB | Comfortable | Slow chat |
| Mistral 7B / Llama 3 8B | ~5GB | Tight | Batch only — sub-1 tok/s |
| Anything 13B+ | 8GB+ | Doesn't fit | N/A |
The 1B-3B tier is where the Pi 4 lives. Modern small LLMs have improved substantially since 2023 — TinyLlama, Phi 3 mini, Qwen 3 1.5B, and Gemma 4 2B are all genuinely useful for specific applications, not just toy demonstrations. For an edge-AI project running on a known prompt template (intent classification, log summarization, RAG over a small private dataset), the small-model tier on the Pi 4 is the right hammer.
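The fit column in the table above can be approximated with simple arithmetic. The sketch below is a rough estimate rather than a measurement: it assumes about 4.85 bits per weight for a 4-bit K-quant (an assumed average, since real GGUF files mix precisions) and a flat 0.5GB allowance for KV-cache and inference state (also an assumption), checked against the ~7GB budget described earlier. The function names are illustrative, and the results will not match the table's rounded figures exactly.

```python
# Rough RAM-fit estimate for quantized models on an 8GB Pi 4.
# ASSUMPTIONS (not measured): ~4.85 bits per weight for a 4-bit K-quant,
# and a flat 0.5GB allowance for KV-cache and inference state.

USABLE_RAM_GB = 7.0             # 8GB minus ~1GB for OS and runtime (from the text)
ASSUMED_BITS_PER_WEIGHT = 4.85  # assumption, varies by model and quant mix
ASSUMED_OVERHEAD_GB = 0.5       # assumption, grows with context length


def estimate_weights_gb(params_billion, bits_per_weight=ASSUMED_BITS_PER_WEIGHT):
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def fits_in_ram(params_billion, budget_gb=USABLE_RAM_GB,
                overhead_gb=ASSUMED_OVERHEAD_GB):
    """True if weights plus the assumed overhead fit in the RAM budget."""
    return estimate_weights_gb(params_billion) + overhead_gb <= budget_gb


if __name__ == "__main__":
    for label, size in [("1.1B", 1.1), ("3.8B", 3.8), ("7B", 7.0), ("13B", 13.0)]:
        print(f"{label}: ~{estimate_weights_gb(size):.1f}GB weights, "
              f"fits={fits_in_ram(size)}")
```

Under these assumptions a 7B model fits with a couple of gigabytes to spare, while a 13B model exceeds the budget, which is in line with the "Tight" and "Doesn't fit" rows.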
Expected throughput numbers
Community measurements aggregated from the llama.cpp issue tracker and the broader Pi-AI community:
| Model | Quant | Pi 4 8GB tok/s | Pi 5 8GB tok/s (reference) |
|---|---|---|---|
| TinyLlama 1.1B | Q4_K_M | 4-8 | 9-15 |
| Phi 3 mini 3.8B | Q4_K_M | 1-3 | 3-6 |
| Qwen 3 1.5B | Q4_K_M | 4-7 | 9-13 |
| Gemma 4 2B | Q4_K_M | 2-5 | 5-10 |
| Llama 3.x 3B | Q4_K_M | 1.5-3 | 4-7 |
| Mistral 7B | Q4_K_M | 0.4-0.8 | 1-2 |
The Pi 5's higher clock and improved memory subsystem deliver roughly 2x the Pi 4's throughput. Community estimates for a Pi 6, extrapolated from publicly demonstrated platforms, suggest another 1.5-2x over the Pi 5, though that figure is speculative.
For interactive chat use, the practical floor is around 5 tok/s. That makes the 1B class the only one that's comfortable on the Pi 4. The 2-3B class is usable for slower workflows. Beyond that, the Pi 4 is a batch-processing platform, not an interactive one.
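To make the interactive-versus-batch split concrete, the sketch below applies the two thresholds from the text (about 5 tok/s as the interactive floor, under 1 tok/s as batch-only) to the Pi 4 ranges in the table. Classifying by the midpoint of each range is an illustrative choice, not part of the community measurements, and the function names are hypothetical.

```python
# Classify the Pi 4 throughput ranges from the table by interactivity.
# Thresholds come from the text; using the range midpoint is an
# illustrative assumption.

PI4_TOK_PER_S = {
    "TinyLlama 1.1B": (4.0, 8.0),
    "Phi 3 mini 3.8B": (1.0, 3.0),
    "Qwen 3 1.5B": (4.0, 7.0),
    "Gemma 4 2B": (2.0, 5.0),
    "Llama 3.x 3B": (1.5, 3.0),
    "Mistral 7B": (0.4, 0.8),
}

INTERACTIVE_FLOOR = 5.0  # tok/s, practical floor for chat
BATCH_CEILING = 1.0      # tok/s, below this is batch-only


def classify(tok_range):
    """Return 'interactive', 'slow', or 'batch' for a (low, high) range."""
    midpoint = sum(tok_range) / 2.0
    if midpoint >= INTERACTIVE_FLOOR:
        return "interactive"
    if midpoint >= BATCH_CEILING:
        return "slow"
    return "batch"


def reply_seconds(n_tokens, tok_per_s):
    """Time to generate n_tokens at a steady generation rate."""
    return n_tokens / tok_per_s


if __name__ == "__main__":
    for model, tok_range in PI4_TOK_PER_S.items():
        print(f"{model}: {classify(tok_range)}")
```

The `reply_seconds` helper shows why the floor matters: a 100-token reply takes 20 seconds at 5 tok/s and well over two minutes at 0.6 tok/s.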
Storage configuration
Local LLM use on the Pi imposes more storage demands than typical Pi projects. A 4-bit Phi 3 mini is 2.4GB; a 4-bit Qwen 3 1.5B is 1GB; the Ollama runtime itself takes around 1GB. Switching between three or four models — which is the realistic workflow for someone experimenting — wants 16-32GB of free storage minimum, ideally on something faster than the slowest SD cards.
Two reasonable approaches:
- Fast microSD card (SanDisk Extreme Pro or equivalent, 64GB+). Cheap and self-contained. Boot times are reasonable, model load times are tolerable. This is the no-friction option.
- External SSD over USB 3.0. The WD Blue SN550 1TB NVMe in a USB 3.0 NVMe enclosure delivers genuinely fast model load times and effectively unlimited capacity. The SanDisk Ultra 3D SATA SSD and Crucial BX500 in USB 3.0 SATA enclosures are cheaper alternatives at slightly lower speeds. This is the right setup if you'll experiment with many models or run inference workloads where model swapping is part of the pipeline.
The Pi 4 is the first Pi with USB 3.0 ports (earlier models offered only USB 2.0), so external SSD throughput is in the 350-400MB/s range. That is well above microSD speeds, though still short of what a SATA SSD reaches on a native SATA port.
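The storage numbers above translate into load times with one division. The sketch below is a best-case estimate that assumes sustained sequential reads at the quoted rate and ignores filesystem and runtime overhead; the helper names are illustrative, and the free-space check simply wraps the standard-library `shutil.disk_usage`.

```python
# Best-case model load time and a free-space check.
# ASSUMPTION: sustained sequential reads at the quoted rate, with no
# filesystem or runtime overhead.
import shutil


def load_seconds(model_gb, throughput_mb_s):
    """Seconds to read a model of model_gb (10^9 bytes) at throughput_mb_s."""
    return model_gb * 1000.0 / throughput_mb_s


def has_room(path, needed_gb):
    """True if the filesystem holding path has at least needed_gb free."""
    return shutil.disk_usage(path).free >= needed_gb * 1e9


if __name__ == "__main__":
    # 2.4GB Phi 3 mini over USB 3.0 at the 350-400MB/s range from the text.
    print(f"{load_seconds(2.4, 400):.1f}s to {load_seconds(2.4, 350):.1f}s")
    print("16GB free here:", has_room(".", 16))
```

At 350-400MB/s a 2.4GB model reads in roughly six to seven seconds, which is why the SSD route suits workflows that swap models often.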
Software setup — Ollama vs llama.cpp directly
Two paths into local LLM on the Pi:
Ollama — the easy mode
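Ollama is the canonical "pull a model and run it" tool for local LLMs and works on the Pi 4 with ARM64 builds. Installation is one command, the official install script from the Ollama site: `curl -fsSL https://ollama.com/install.sh | sh`.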
The Ollama API server runs at http://localhost:11434 and works with the same client libraries as the desktop / cloud installations, which means Pi-hosted Ollama can serve as a drop-in for any Ollama-compatible application.
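As an illustration of that drop-in behavior, here is a minimal standard-library client for the local server. It is a sketch under stated assumptions: it uses Ollama's `/api/generate` endpoint with streaming disabled and reads the `eval_count` and `eval_duration` fields of the response to derive tok/s. The helper names are hypothetical, and the model name in the usage note is only an example of something you might have pulled.

```python
# Minimal client for a local Ollama server using only the standard library.
# Helper names are illustrative; the endpoint and fields are Ollama's
# /api/generate API with streaming turned off.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt):
    """Build the POST request for a single non-streaming generation."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(result):
    """Generation rate from a response dict; eval_duration is in nanoseconds."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(model, prompt, timeout_s=600.0):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(model, prompt),
                                timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a model already pulled, `result = generate("tinyllama", "Say hello")` returns the text in `result["response"]`, and `tokens_per_second(result)` gives a figure you can compare against the throughput table. The generous timeout is deliberate, since a Pi 4 can take minutes on longer replies.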
The trade-off is some overhead versus calling llama.cpp directly. Ollama's defaults are tuned for GPU inference, and on CPU it leaves some throughput on the table. For learning and quick experimentation, Ollama is the right pick.
llama.cpp directly — best throughput
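For users who care about every last token-per-second, llama.cpp compiled with the right ARM optimization flags delivers the best Pi 4 inference performance. The build is straightforward: clone the repository, configure it with native optimizations (in current CMake-based builds, `cmake -B build -DGGML_NATIVE=ON`), compile with `cmake --build build --config Release -j 4`, and run the resulting binary with `-t 4`, for example `./build/bin/llama-cli -m model.gguf -t 4 -p "Hello"`, where `model.gguf` stands in for whichever GGUF file you downloaded.
The `-t 4` flag tells llama.cpp to use all four of the Pi 4's cores. The native build flag tells the compiler to target the host CPU, which enables ARM NEON and the other SIMD extensions the Cortex-A72 supports.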
Real-world delta versus Ollama is typically 5-15% more throughput in exchange for managing models and runtime manually. Worth it for production deployments, optional for tinkering.
Cooling matters
The Pi 4 throttles aggressively under sustained CPU load. Without active cooling, a 30-minute inference session sees clocks drop from 1.8GHz to 1.0GHz or lower, with corresponding throughput loss. The throttling is silent — no error, just slower numbers — which makes it easy to attribute lower-than-expected tok/s to "Pi just isn't fast enough" when the real issue is thermal.
The fix is mandatory active cooling. The Argon ONE V2 case with integrated fan and heatsink is the gold-standard pick for sustained workloads. Cheap aluminum heatsinks with a 30mm fan also work. The fanless tall-heatsink "passive" cases are not sufficient under sustained LLM load — they keep idle temperatures fine but cannot dissipate the heat of a continuous 100% CPU duty cycle.
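Because throttling is silent, it helps to check for it explicitly before blaming the model. The sketch below decodes the bitmask printed by the Raspberry Pi firmware tool `vcgencmd get_throttled`, using the bit meanings from the Raspberry Pi documentation. The function names are illustrative, and `read_throttled` assumes `vcgencmd` is installed and on the PATH, as it is on Raspberry Pi OS.

```python
# Decode the throttling bitmask reported by `vcgencmd get_throttled`.
# Bit meanings follow the Raspberry Pi documentation; function names are
# illustrative.
import subprocess

THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "arm frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "arm frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output):
    """Turn a line like 'throttled=0x50005' into a list of active flags."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [label for bit, label in sorted(THROTTLE_FLAGS.items())
            if value & (1 << bit)]


def read_throttled():
    """Run vcgencmd on the Pi and decode its output (requires vcgencmd)."""
    result = subprocess.run(["vcgencmd", "get_throttled"],
                            capture_output=True, text=True, check=True)
    return decode_throttled(result.stdout)
```

Calling `read_throttled()` before and after a long inference run is a quick way to tell a thermal problem from a genuinely slow model: an empty list means the firmware has recorded no throttling since boot.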
What the Pi 4 is genuinely good for
Three use cases where the Pi 4 8GB plus a small LLM is a genuinely sensible build:
- Edge intent classification. A 1B model running locally for "is this user query about the lights, the thermostat, or the music?" beats sending every query to a cloud API on latency, cost, and privacy.
- Log / event summarization. A 3B model processing security camera motion events, system logs, or sensor telemetry into human-readable summaries works on the Pi 4 at batch speeds.
- Offline assistant for a specific narrow domain. RAG over a small knowledge base — household appliance manuals, a specific game's lore, a workshop's technical docs — runs comfortably on a 3B model with embeddings on the Pi.
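The intent-classification case in the first bullet works best when the prompt constrains the model to a fixed label set and the caller treats anything else as unknown. The sketch below shows one hypothetical way to do that; the label set mirrors the example in the text, while the prompt wording and function names are assumptions rather than a tested recipe.

```python
# Constrained-label intent classification around a small local model.
# The prompt wording and helper names are illustrative assumptions.

INTENTS = ("lights", "thermostat", "music")


def build_intent_prompt(query):
    """Ask for exactly one label so the reply stays a few tokens long."""
    options = ", ".join(INTENTS)
    return (
        f"Classify the request into exactly one of: {options}.\n"
        "Answer with the single label only.\n"
        f"Request: {query}\n"
        "Label:"
    )


def parse_intent(reply):
    """Map a free-form reply to a known label, or 'unknown' if ambiguous."""
    text = reply.lower()
    matches = [label for label in INTENTS if label in text]
    return matches[0] if len(matches) == 1 else "unknown"


if __name__ == "__main__":
    print(build_intent_prompt("it is freezing in here"))
    print(parse_intent(" Thermostat.\n"))
```

Keeping the expected reply to a single word matters on this hardware: at 1B-model speeds, a one-label answer returns in about a second, whereas a free-form sentence would not.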
What it is not good for: general-purpose chat, code generation across many languages, multi-turn agentic workflows. Those workloads want a real GPU.
Bottom line
The Pi 4 8GB is a credible platform for the 1B-3B class of modern open-weights LLMs in 2026, delivering 1-8 tok/s depending on model size and quantization. It is the right pick for edge AI projects where the Pi was already going to be the platform; it is the wrong pick when local LLM is the primary workload and you're spec'ing the box from scratch. For that, the Pi 5 8GB delivers roughly 2x the throughput at similar cost, and a small x86 mini-PC with an integrated GPU often beats both. But for the substantial population of Pi 4 8GB units already deployed in homes worldwide, adding small-LLM inference as a bonus capability is genuinely worthwhile, and the tooling has matured enough to make it a one-command setup.
A worked example — Pi 4 8GB as a private home assistant
To make the small-model-on-Pi case concrete, consider a realistic project: a private voice assistant for the home that runs entirely on local hardware, replacing Alexa or Google Home for a specific narrow set of commands. The hardware is the Pi 4 8GB plus a USB microphone and a small speaker; the software stack is whisper.cpp for speech-to-text, Phi 3.5 mini via Ollama for intent classification and response generation, and Piper for text-to-speech.
Performance profile on the Pi 4 8GB:
- whisper.cpp tiny model: ~3x realtime transcription (3 seconds of audio in 1 second of compute).
- Phi 3.5 mini Q4_K_M: 2 tok/s for response generation.
- Piper text-to-speech: ~10x realtime synthesis.
End-to-end latency from "turn off the lights" to the action is roughly 2-3 seconds, which is comfortably within the threshold that feels responsive. That figure depends on the model's reply being only a few tokens long, which it is by design (no chit-chat, just a confirmation of the action), so the slow 2 tok/s doesn't compound into long waits.
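The latency figure can be reproduced as a simple budget from the three stage speeds listed above. The sketch below is arithmetic only: it assumes the stages run strictly in sequence and ignores prompt processing, wake-word detection, and the time to actuate the device, all of which would add to the total. The example inputs (1.5 seconds of speech, a 3-token reply, 1 second of spoken confirmation) are illustrative assumptions.

```python
# Latency budget for the voice pipeline, using the stage speeds from the
# performance profile. ASSUMPTION: stages run in sequence, and prompt
# processing, wake-word detection, and actuation time are ignored.

WHISPER_REALTIME_FACTOR = 3.0   # seconds of audio per second of compute
LLM_TOK_PER_S = 2.0             # Phi 3.5 mini Q4_K_M generation rate
PIPER_REALTIME_FACTOR = 10.0    # seconds of speech per second of compute


def pipeline_latency_s(audio_s, reply_tokens, reply_speech_s):
    """Total seconds for speech-to-text, generation, and text-to-speech."""
    stt = audio_s / WHISPER_REALTIME_FACTOR
    llm = reply_tokens / LLM_TOK_PER_S
    tts = reply_speech_s / PIPER_REALTIME_FACTOR
    return stt + llm + tts


if __name__ == "__main__":
    print(f"short reply: {pipeline_latency_s(1.5, 3, 1.0):.1f}s")
    print(f"chatty reply: {pipeline_latency_s(1.5, 20, 5.0):.1f}s")
```

With those inputs the short reply comes to about 2.1 seconds, inside the 2-3 second range, while a 20-token reply pushes past 10 seconds. Generation dominates the budget, which is why the reply length is capped.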
What this builds:
- A device that does not phone home to cloud servers.
- A voice interface where the latency stays predictable regardless of internet outages.
- A platform that the household can extend without paying per-query API fees.
The trade-off is the breadth of capability: Alexa's NLU is broader, and the cloud assistants have access to much larger models. For a narrow set of home-automation commands, this is the right trade.
Edge AI workloads where the Pi 4 makes more sense than a cloud API
A few workloads where the Pi 4's local-LLM capability is materially better than calling a cloud API:
- Anything in a home network that handles personal data (security camera triage, family calendar parsing, document classification). The privacy story is the win.
- Anything with intermittent connectivity (remote-cabin home automation, off-grid sensor networks, vehicle-mounted edge processing).
- High-volume narrow tasks where API cost compounds (continuous log summarization, sensor event tagging at high rate).
- Latency-sensitive workflows where the round-trip to the cloud API would dominate the budget.
For all of these, the small-model-on-Pi setup pays back the throughput limitation with privacy, predictability, and cost savings.
A note on the Pi 5 8GB versus 16GB choice
When the Pi 5 is the right pick over the Pi 4, the 8GB versus 16GB capacity decision is about headroom rather than speed. A 4-bit 7B model (~5GB) already fits on the 8GB board, and extra RAM does not raise token throughput: the 1-2 tok/s Pi 5 figure for Mistral 7B in the throughput table applies to both capacities. What the 16GB variant adds is room for longer contexts, higher-precision quantizations, several models resident at once, or the 13B class that does not fit in 8GB. For LLM-first builds that need that headroom, the extra $30-40 for the 16GB Pi 5 is well-spent.
Citations and sources
- Raspberry Pi 4 Model B — official specifications
- Ollama — community benchmarks and ARM support
- llama.cpp — ARM optimization and Pi-specific issues
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
