Run a Local LLM on a Raspberry Pi 4 8GB: What Works in 2026

Item: Raspberry Pi 4 Computer Model B 8GB Single Board Computer Suitable for Building Mini PC/Smart Robot/Game Console/Workstation/Media Center/Etc.
Author: Mike Perry

The Pi 4 8GB runs small quantized LLMs at hobbyist throughput. A practical guide to what fits, what to expect, and where the Pi 5 / Pi 6 step up.

By Mike Perry · Published 2026-06-05 · Last verified 2026-07-21 · 10 min read

What 7B and smaller models actually run on a Raspberry Pi 4 8GB in 2026, what tok/s to expect, and the storage and cooling setup that keeps it usable.

The 30-second answer

The Raspberry Pi 4 8GB runs small open-weights LLMs in 2026, but only at the 1B-3B parameter tier with usable throughput — 4-8 tok/s on a 1B model, 1-3 tok/s on a 3B model. A 7B model technically runs at sub-1 tok/s, which is fine for batch jobs and unusable for chat. The Pi 4 is a credible learning and tinkering platform for edge AI; for serious local-LLM work, the Pi 5 or a small x86 mini-PC delivers materially better numbers at similar cost.

Why this question keeps coming up

The Raspberry Pi 4 was a milestone product. The 8GB model in particular — released in May 2020, per the official Raspberry Pi 4 specifications — was the first single-board computer in the Pi line with enough RAM for serious workloads beyond hobbyist scripting. Combined with the quad-core Cortex-A72 SoC at 1.5GHz (later 1.8GHz on revised silicon), the 8GB Pi 4 became the default platform for self-hosted home services, Pi-hole DNS sinkholes, Home Assistant deployments, and a long tail of edge-compute projects.

The local-LLM wave that started in 2023 created an obvious follow-on question: can this same widely-deployed hardware run the models that everyone is suddenly interested in? The answer is yes for small models and no for the flagship 70B-class releases — but "small" in 2026 includes some genuinely useful chat and code-completion models, and the Pi 4's price-to-capability ratio for that workload remains interesting.

Key Takeaways

Inference on the Pi 4 8GB is CPU-only: the integrated VideoCore VI GPU offers no practical acceleration path for LLMs.
Models in the 1B-3B class at 4-bit quantization fit comfortably and deliver hobbyist-grade throughput.
7B models technically run but at sub-1 tok/s — batch use only.
Storage matters for model swapping; a fast microSD card or external SSD is recommended.
Active cooling is mandatory for sustained inference; the Pi 4 throttles aggressively under heat.

What models fit on 8GB of RAM

The Pi 4's 8GB of LPDDR4 RAM is the binding constraint. Subtract roughly 1GB for the OS and runtime, leaving ~7GB for model weights, KV-cache, and inference state. At 4-bit quantization:

Model class	Approx RAM at 4-bit	Pi 4 8GB fit	Realistic use
TinyLlama 1.1B	~600MB	Comfortable	Interactive chat, edge inference
Phi 2 (2.7B) / Phi 3 mini (3.8B)	Up to ~2.4GB	Comfortable	Slow chat, batch summarization
Qwen 3 1.5B	~1GB	Comfortable	Interactive chat
Gemma 4 2B	~1.3GB	Comfortable	Edge applications
Llama 3.x 3B	~2GB	Comfortable	Slow chat
Mistral 7B / Llama 3 8B	~5GB	Tight	Batch only — sub-1 tok/s
Anything 13B+	8GB+	Doesn't fit	N/A
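
The fit column can be sanity-checked with back-of-envelope arithmetic. The sketch below is illustrative only: the function names are hypothetical, and both the ~4.85 bits per weight for a Q4_K_M-style quantization and the 0.5GB allowance for KV-cache and inference state are assumptions. Only the ~7GB usable figure comes from the text above.

```python
# Rough RAM-fit estimate for a quantized model on an 8GB Pi 4.
# Assumptions (not from the article): ~4.85 bits per weight for a
# Q4_K_M-style quantization, and 0.5GB for KV-cache and inference state.

USABLE_RAM_GB = 7.0  # 8GB minus roughly 1GB for the OS and runtime


def weights_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billion * bits_per_weight / 8


def fits_in_ram(params_billion: float, overhead_gb: float = 0.5) -> bool:
    """True if weights plus the assumed overhead fit in usable RAM."""
    return weights_gb(params_billion) + overhead_gb <= USABLE_RAM_GB


if __name__ == "__main__":
    for name, size in [("1.1B", 1.1), ("3.8B", 3.8), ("7B", 7.0), ("13B", 13.0)]:
        print(f"{name}: ~{weights_gb(size):.1f}GB, fits={fits_in_ram(size)}")
```

The estimates land in the same ballpark as the table. Real GGUF files differ by a few hundred MB depending on the exact quantization mix, and long contexts need more KV-cache than the flat allowance assumed here.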

The 1B-3B tier is where the Pi 4 lives. Modern small LLMs have improved substantially since 2023 — TinyLlama, Phi 3 mini, Qwen 3 1.5B, and Gemma 4 2B are all genuinely useful for specific applications, not just toy demonstrations. For an edge-AI project running on a known prompt template (intent classification, log summarization, RAG over a small private dataset), the small-model tier on the Pi 4 is the right hammer.

Expected throughput numbers

Community measurements aggregated from the llama.cpp issue tracker and the broader Pi-AI community:

Model	Quant	Pi 4 8GB tok/s	Pi 5 8GB tok/s (reference)
TinyLlama 1.1B	Q4_K_M	4-8	9-15
Phi 3 mini 3.8B	Q4_K_M	1-3	3-6
Qwen 3 1.5B	Q4_K_M	4-7	9-13
Gemma 4 2B	Q4_K_M	2-5	5-10
Llama 3.x 3B	Q4_K_M	1.5-3	4-7
Mistral 7B	Q4_K_M	0.4-0.8	1-2

The Pi 5's higher clock and improved memory subsystem deliver roughly 2x the Pi 4's throughput. The Pi 6 — based on community estimates from publicly demonstrated platforms — will likely deliver 1.5-2x the Pi 5 again.

For interactive chat use, the practical floor is around 5 tok/s. That makes the 1B class the only one that's comfortable on the Pi 4. The 2-3B class is usable for slower workflows. Beyond that, the Pi 4 is a batch-processing platform, not an interactive one.
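
To see what these rates mean in wall-clock terms, the sketch below converts tok/s into generation time and applies the thresholds used in this article (about 5 tok/s for comfortable chat, sub-1 tok/s for batch only). The function names and the 256-token reply length are illustrative assumptions, and the sample rates are rough midpoints of the table's ranges.

```python
# Convert throughput into wall-clock time and a rough usage tier.
# Thresholds follow the article: ~5 tok/s is the floor for comfortable
# chat, and anything below 1 tok/s is treated as batch-only.

INTERACTIVE_FLOOR_TOK_S = 5.0
BATCH_CEILING_TOK_S = 1.0


def generation_seconds(tokens: int, tok_per_s: float) -> float:
    """Time to generate `tokens` output tokens at a steady rate."""
    return tokens / tok_per_s


def usage_tier(tok_per_s: float) -> str:
    """Classify a throughput figure into the article's three tiers."""
    if tok_per_s >= INTERACTIVE_FLOOR_TOK_S:
        return "interactive"
    if tok_per_s >= BATCH_CEILING_TOK_S:
        return "slow chat"
    return "batch"


if __name__ == "__main__":
    # Approximate midpoints of the Pi 4 ranges in the table above.
    for name, rate in [("TinyLlama 1.1B", 6.0), ("Phi 3 mini", 2.0), ("Mistral 7B", 0.6)]:
        secs = generation_seconds(256, rate)
        print(f"{name}: {secs:.0f}s for 256 tokens ({usage_tier(rate)})")
```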

Storage configuration

Local LLM use on the Pi imposes more storage demands than typical Pi projects. A 4-bit Phi 3 mini is 2.4GB; a 4-bit Qwen 3 1.5B is 1GB; the Ollama runtime itself takes around 1GB. Switching between three or four models — which is the realistic workflow for someone experimenting — wants 16-32GB of free storage minimum, ideally on something faster than the slowest SD cards.

Two reasonable approaches:

Fast microSD card (SanDisk Extreme Pro or equivalent, 64GB+). Cheap and self-contained. Boot times are reasonable and model load times are tolerable. This is the no-friction option.
External SSD over USB 3.0. The WD Blue SN550 1TB NVMe in a USB 3.0 NVMe enclosure delivers fast model load times and far more capacity than a model collection needs. The SanDisk Ultra 3D and Crucial BX500 SATA SSDs in USB 3.0 SATA enclosures are cheaper alternatives at slightly lower speeds. This is the right setup if you'll experiment with many models or run inference workloads where model swapping is part of the pipeline.

The Pi 4 is the first Pi with USB 3.0 ports (earlier boards offered only USB 2.0), so external SSD throughput is in the 350-400MB/s range. That is well above microSD speeds, though still short of what a SATA SSD reaches on a direct SATA connection.
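
As a rough illustration of why the storage choice shows up in daily use, the sketch below estimates cold model-load time from file size and sequential read speed. The 350-400MB/s SSD range and the 2.4GB model size come from the text above. The 40MB/s microSD figure is an assumed placeholder, not a measurement, and the function name is hypothetical.

```python
# Estimate cold-load time for a model file from sequential read speed.
# ASSUMPTION: 40 MB/s for microSD is an illustrative placeholder only.


def load_seconds(model_gb: float, read_mb_per_s: float) -> float:
    """Seconds to read a model of `model_gb` GB at `read_mb_per_s` MB/s."""
    return model_gb * 1000 / read_mb_per_s


if __name__ == "__main__":
    model_gb = 2.4  # 4-bit Phi 3 mini, per the article
    for label, speed in [("USB 3.0 SSD (low)", 350.0),
                         ("USB 3.0 SSD (high)", 400.0),
                         ("microSD (assumed)", 40.0)]:
        print(f"{label}: ~{load_seconds(model_gb, speed):.0f}s")
```

Once a model has been read, the OS page cache usually keeps later loads faster, so the estimate matters most when swapping between models.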

Software setup — Ollama vs llama.cpp directly

Two paths into local LLM on the Pi:

Ollama — the easy mode

Ollama is the canonical "pull a model and run it" tool for local LLMs and ships ARM64 builds that work on the Pi 4. Install it, then pull and run a model:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull phi3:mini
ollama run phi3:mini

The Ollama API server runs at http://localhost:11434 and works with the same client libraries as the desktop / cloud installations, which means Pi-hosted Ollama can serve as a drop-in for any Ollama-compatible application.
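
A minimal sketch of talking to that server from Python, using only the standard library. It assumes Ollama's `/api/generate` endpoint with `"stream": false` and the `eval_count` / `eval_duration` fields in the response (the latter in nanoseconds). The helper names are hypothetical, and the long default timeout is an assumption chosen for the Pi's slow generation.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout_s: float = 600.0) -> dict:
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout_s) as resp:
        return json.load(resp)


def tokens_per_second(result: dict) -> float:
    """Generation rate; eval_duration is reported in nanoseconds."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


# Example (needs a running Ollama server with the model pulled):
#   result = generate("phi3:mini", "Summarize: fan failed, CPU throttled.")
#   print(result["response"])
#   print(f"{tokens_per_second(result):.1f} tok/s")
```

Reading the rate from the response is a convenient way to compare your own Pi against the throughput ranges quoted earlier.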

The trade-off is some overhead versus calling llama.cpp directly. Ollama's defaults are tuned for GPU inference, and on the CPU-only Pi it typically leaves roughly 5-15% of throughput on the table. For learning and quick experimentation, Ollama is still the right pick.

llama.cpp directly — best throughput

For users who care about every last token-per-second, llama.cpp compiled with the right ARM optimization flags delivers the best Pi 4 inference performance. The build is straightforward:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j 4
./build/bin/llama-cli -m phi3-mini-q4_k_m.gguf -t 4 -n 256 -p "Your prompt here"

The -t 4 flag tells llama.cpp to use all four Pi 4 cores. The native build flag tells the compiler to use ARM NEON and other SIMD extensions the Cortex-A72 supports.

Real-world delta versus Ollama is typically 5-15% more throughput in exchange for managing models and runtime manually. Worth it for production deployments, optional for tinkering.

Cooling matters

The Pi 4 throttles aggressively under sustained CPU load. Without active cooling, a 30-minute inference session sees clocks drop from 1.8GHz to 1.0GHz or lower, with corresponding throughput loss. The throttling is silent — no error, just slower numbers — which makes it easy to attribute lower-than-expected tok/s to "Pi just isn't fast enough" when the real issue is thermal.

The fix is mandatory active cooling. The Argon ONE V2 case with integrated fan and heatsink is the gold-standard pick for sustained workloads. Cheap aluminum heatsinks with a 30mm fan also work. The fanless tall-heatsink "passive" cases are not sufficient under sustained LLM load — they keep idle temperatures fine but cannot dissipate the heat of a continuous 100% CPU duty cycle.
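
Throttling raises no error in the inference tool, but the Pi's firmware does record it. The sketch below decodes the bitmask printed by `vcgencmd get_throttled` on Raspberry Pi OS. The function names are hypothetical, and `read_throttled` only works on a Pi with `vcgencmd` installed.

```python
import subprocess

# Bit positions in the value reported by `vcgencmd get_throttled`.
THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output: str) -> list[str]:
    """Decode a line such as 'throttled=0x50005' into readable flags."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [label for bit, label in THROTTLE_FLAGS.items() if value & (1 << bit)]


def read_throttled() -> list[str]:
    """Query the firmware (Raspberry Pi only) and decode the result."""
    out = subprocess.run(
        ["vcgencmd", "get_throttled"],
        capture_output=True, text=True, check=True,
    )
    return decode_throttled(out.stdout)
```

Checking this before and after a long inference run is a quick way to tell a thermal problem from a model that is simply too large for the board.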

What the Pi 4 is genuinely good for

Three use cases where the Pi 4 8GB plus a small LLM is a genuinely sensible build:

Edge intent classification. A 1B model running locally for "is this user query about the lights, the thermostat, or the music?" beats sending every query to a cloud API on latency, cost, and privacy.
Log / event summarization. A 3B model processing security camera motion events, system logs, or sensor telemetry into human-readable summaries works on the Pi 4 at batch speeds.
Offline assistant for a specific narrow domain. RAG over a small knowledge base — household appliance manuals, a specific game's lore, a workshop's technical docs — runs comfortably on a 3B model with embeddings on the Pi.

What it is not good for: general-purpose chat, code generation across many languages, multi-turn agentic workflows. Those workloads want a real GPU.
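
The intent-classification case from the list above can be sketched as a fixed prompt template plus a strict parser for the model's answer. This is one possible approach, not a prescribed method: the label set mirrors the example query in the text, and the function names and prompt wording are assumptions.

```python
from typing import Optional

# Label set taken from the example query in the article.
LABELS = ("lights", "thermostat", "music")


def build_prompt(query: str) -> str:
    """Fixed template that asks a small model for a one-word label."""
    options = ", ".join(LABELS)
    return (
        f"Classify the request into exactly one of: {options}.\n"
        f"Request: {query}\n"
        "Answer with the single word only.\n"
        "Answer:"
    )


def parse_label(model_output: str) -> Optional[str]:
    """Return the earliest known label in the output, or None if absent."""
    text = model_output.lower()
    hits = [(text.find(label), label) for label in LABELS if label in text]
    return min(hits)[1] if hits else None


if __name__ == "__main__":
    print(build_prompt("make it warmer in the bedroom"))
    print(parse_label(" Thermostat."))
```

Keeping the expected answer to a single word also keeps generation time short, which matters at 1B-class throughput.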

Bottom line

The Pi 4 8GB is a credible platform for the 1B-3B class of modern open-weights LLMs in 2026, delivering 1-8 tok/s depending on model size and quantization. It is the right pick for edge AI projects where the Pi was already going to be the platform; it is the wrong pick when local LLM is the primary workload and you're spec'ing the box from scratch. For that, the Pi 5 8GB delivers roughly 2x the throughput at similar cost, and a small x86 mini-PC with an integrated GPU often beats both. But for the substantial population of Pi 4 8GB units already deployed in homes worldwide, adding small-LLM inference as a bonus capability is genuinely worthwhile, and the tooling has matured enough to make it a one-command setup.

A worked example — Pi 4 8GB as a private home assistant

To make the small-model-on-Pi case concrete, consider a realistic project: a private voice assistant for the home that runs entirely on local hardware, replacing Alexa or Google Home for a specific narrow set of commands. The hardware is the Pi 4 8GB plus a USB microphone and a small speaker; the software stack is whisper.cpp for speech-to-text, Phi 3.5 mini via Ollama for intent classification and response generation, and Piper for text-to-speech.

Performance profile on the Pi 4 8GB:

Whisper.cpp tiny model: ~3x realtime transcription (3 seconds of audio in 1 second of compute).
Phi 3.5 mini Q4_K_M: 2 tok/s for response generation.
Piper text-to-speech: ~10x realtime synthesis.

End-to-end latency for "turn off the lights" → action: roughly 2-3 seconds, which is comfortably within the threshold that feels responsive. That figure depends on the reply being short by design (no chit-chat, just a confirmation of the action): at 2 tok/s every additional token adds half a second, so a 20-token answer would already take about 10 seconds to generate.
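
The latency figure can be checked with simple arithmetic over the three stages. The sketch below uses the realtime factors and tok/s from the performance profile above. The audio length, reply length, and spoken-confirmation length are illustrative assumptions, and prompt processing, wake-word detection, and the automation call itself are ignored.

```python
# Latency budget for the voice pipeline, using the article's figures.
STT_REALTIME_FACTOR = 3.0   # whisper.cpp tiny: ~3x realtime
LLM_TOK_PER_S = 2.0         # Phi 3.5 mini Q4_K_M
TTS_REALTIME_FACTOR = 10.0  # Piper: ~10x realtime


def pipeline_seconds(audio_s: float, reply_tokens: int, speech_s: float) -> float:
    """Sum of transcription, generation, and synthesis time in seconds."""
    stt = audio_s / STT_REALTIME_FACTOR
    llm = reply_tokens / LLM_TOK_PER_S
    tts = speech_s / TTS_REALTIME_FACTOR
    return stt + llm + tts


if __name__ == "__main__":
    # Assumed inputs: a 2s spoken command, a 3-token reply, 1s of speech.
    print(f"short reply: {pipeline_seconds(2.0, 3, 1.0):.2f}s")
    # A 20-token sentence with 4s of speech shows how quickly it grows.
    print(f"long reply:  {pipeline_seconds(2.0, 20, 4.0):.2f}s")
```

Under these assumptions generation dominates the budget, so the reply length is the main lever for keeping the assistant responsive.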

What this builds:

A device that does not phone home to cloud servers.
A voice interface where the latency stays predictable regardless of internet outages.
A platform that the household can extend without paying per-query API fees.

The trade-off is the breadth of capability: Alexa's NLU is broader, and the cloud assistants have access to much larger models. For a narrow set of home-automation commands, this is the right trade.

Edge AI workloads where the Pi 4 makes more sense than a cloud API

A few workloads where the Pi 4's local-LLM capability is materially better than calling a cloud API:

Anything in a home network that handles personal data (security camera triage, family calendar parsing, document classification). The privacy story is the win.
Anything with intermittent connectivity (remote-cabin home automation, off-grid sensor networks, vehicle-mounted edge processing).
High-volume narrow tasks where API cost compounds (continuous log summarization, sensor event tagging at high rate).
Latency-sensitive workflows where the round-trip to the cloud API would dominate the budget.

For all of these, the small-model-on-Pi setup pays back the throughput limitation with privacy, predictability, and cost savings.

A note on the Pi 5 8GB versus 16GB choice

When the Pi 5 is the right pick over the Pi 4, the 8GB versus 16GB capacity decision comes down to headroom, not speed. Extra RAM does not raise tok/s: a 4-bit 7B model already fits in 8GB and runs at roughly 1-2 tok/s on the Pi 5 either way. What the 16GB variant adds is room for 7B-8B models at higher-precision quantization or with longer context, and for running other services alongside inference. For LLM-first builds, the extra $30-40 for the 16GB Pi 5 is well-spent.

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

Frequently asked questions

Can the Raspberry Pi 4 8GB run real LLMs in 2026?

Yes, but only small ones. 1B to 3B parameter models at 4-bit quantization fit in the Pi's 8GB of LPDDR4 with usable throughput. Larger models up to 7B technically run but at sub-1 tok/s, which is fine for batch jobs and unusable for interactive chat. Treat the Pi 4 as a hobbyist platform for the smallest modern open-weights models.

What tok/s should I expect?

Community measurements on the Pi 4 8GB with llama.cpp put a 1B model at 4-8 tok/s, a 3B model at 1-3 tok/s, and a 7B model at 0.4-0.8 tok/s. Numbers vary by quantization, thread count, and ambient temperature. The Pi 5 roughly doubles these figures, and community estimates suggest the Pi 6 will likely add another 1.5-2x. Plan for these ranges, not for desktop-class throughput.

Does the model fit in RAM or do I need an SSD?

Models in the 1B-3B class at 4-bit quantization fit comfortably in the 8GB Pi 4. A 7B model at 4-bit (~5GB) also fits, but tightly once the OS, runtime, and KV-cache are counted. Models of 13B and up do not fit in RAM and would require disk-backed inference, which is impractically slow. An external SSD is recommended for storing multiple models and runtime files, not for inference itself.

Why not use Ollama if it's easier?

Ollama works on the Pi 4 and is the easiest path for first-time users — pull a model and run it with one command. The compromise is that Ollama's defaults are tuned for GPU inference; on a CPU-only Pi it adds some overhead versus calling llama.cpp directly. For learning and experimentation, Ollama is the right starting point. For best throughput on the Pi specifically, llama.cpp with manually-tuned thread counts wins.

Should I just buy a Pi 5 or Pi 6 instead?

If your only use case is local LLM inference, the Pi 5 8GB and especially the upcoming Pi 6 deliver substantially better throughput at similar price points. The Pi 4 8GB makes sense if you already own one, or if you're using the same unit for other tasks (home automation, Pi-hole, retro emulation) and want LLM inference as a bonus capability. For LLM-first builds, skip ahead.
