Can a Raspberry Pi 4 8GB Run Local LLMs in 2026? Ollama tok/s + SSD-Boot Setup

Item: Raspberry Pi 4 Computer Model B 8GB Single Board Computer Suitable for Building Mini PC/Smart Robot/Game Console/Workstation/Media Center/Etc.
Author: Mike Perry

Practical CPU-only inference on the 8GB Pi 4 — what fits, what tok/s to expect, and the SSD-boot setup that keeps model loads fast

By Mike Perry · Published 2026-05-31 · Last verified 2026-07-14 · 10 min read

Can a Raspberry Pi 4 8GB run local LLMs in 2026? Yes, for 1-3B models at roughly 3-12 tok/s with Ollama. We cover the BOM, SSD-boot setup, and where a real GPU still wins.

Yes: a Raspberry Pi 4 8GB runs local LLMs in 2026, but only in the small tier. With Ollama plus a quantized 1-3B-class model like TinyLlama or Phi-3 Mini, expect roughly 3-12 tokens per second of generation depending on model size, fast enough for a personal offline assistant. For anything bigger, get a real GPU. In the $80-150 SBC tier, though, the Pi 4 8GB plus an SSD boot disk is a credible little inference appliance.

Realistic expectations for CPU-only LLM inference on an SBC

There is a lot of breathless YouTube content about "running LLMs on a Raspberry Pi" that buries the actual tok/s figures. The honest 2026 picture: the Pi 4 8GB is a single-board ARM computer with no GPU acceleration for LLMs, four Cortex-A72 cores at 1.5 GHz, and a shared 8GB RAM pool. Per the official Raspberry Pi 4 product page, the memory bandwidth tops out around 4-5 GB/s — two orders of magnitude below a budget GPU. That is the fundamental ceiling.

So what does work? Small models at small quants. A 1.1B-class model like TinyLlama runs at 8-12 tok/s, usable for a chat-style assistant. A 2B model like Gemma 2 2B runs at 5-8 tok/s, and a 3-4B model like Llama 3.2 3B or Phi-3 Mini (3.8B) runs at 3-6 tok/s: usable, but you feel each token. A 7B model at q4 technically fits in memory but drops to roughly 1 tok/s, slow enough that the job belongs on a GPU.

The 8GB variant matters because the 4GB Pi 4 can only run the smallest models with no headroom for OS and apps. The 8GB Pi is the right pick for any LLM work; the 4GB Pi is the right pick for media servers and homelab tasks where LLMs aren't the use case.

Key takeaways

Raspberry Pi 4 8GB runs 1-3B quantized models at 3-12 tok/s, usable for tiny assistants
7B models technically work, but at about 1 tok/s they are too slow for interactive chat
SSD boot via a Crucial BX500 1TB or Samsung 870 EVO 250GB is the single biggest QoL upgrade
For real LLM work, the 12GB RTX 3060 at $269 is far faster (35-40 tok/s on Gemma 2 9B, against under 1 tok/s on the Pi) and runs models the Pi can't touch
The Pi 4 LLM rig is the right answer for offline-first tiny assistants and learning, not for production chat

BOM and setup

Part	Pick	Approx price
SBC	Raspberry Pi 4 Model B 8GB	$80
Boot SSD	Crucial BX500 1TB (any 2.5" SATA SSD works)	$60
Boot SSD (smaller)	Samsung 870 EVO 250GB	$35
SATA-to-USB 3.0 adapter	FIDECO SATA/IDE to USB 3.0	$25-30
Power supply	official Pi 4 USB-C 15W	$10
Case + fan	any aluminum case with active cooling	$15
Total		~$170-200

The Pi 4 8GB is the centerpiece. The SSD is on the official Pi 4 USB 3.0 boot path — drop a SATA SSD into the FIDECO adapter, plug it into one of the Pi's USB 3.0 ports, and configure the Pi to boot from USB. Total build is under $200 for a competent tiny inference box.

Why boot from an SSD instead of microSD

microSD cards have two problems for LLM work:

Slow sequential read. A typical Class 10 / U3 card delivers 50-100 MB/s sequential read. A 4GB quantized model takes 40-80 seconds to load from cold. An SSD over USB 3.0 delivers 300-400 MB/s sustained, dropping model load to 10-15 seconds.
Poor random-access throughput. OS operations such as package installs, log writes, and swap activity hammer random-access I/O. microSD cards manage perhaps 1-2K random IOPS; SATA SSDs deliver 10-15K, and NVMe on a Pi 5 goes further still. The difference shows in everyday responsiveness, not just LLM workloads.
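The load-time figures above are plain division: model size over sustained read speed. A minimal sketch of that arithmetic, using the throughput ranges quoted in this section (the function name is illustrative, and real loads also pay filesystem and USB overhead that this ignores):

```python
def load_seconds(model_gb: float, read_mb_per_s: float) -> float:
    """Cold-load time in seconds for a model read sequentially from storage."""
    return model_gb * 1000 / read_mb_per_s

# Sequential read ranges quoted in the text, in MB/s.
storage = {
    "microSD": (50, 100),
    "SATA SSD over USB 3.0": (300, 400),
}

for name, (slow, fast) in storage.items():
    best = load_seconds(4.0, fast)
    worst = load_seconds(4.0, slow)
    print(f"{name}: {best:.0f}-{worst:.0f} s for a 4 GB model")
```

Swap in the size of the model you actually plan to pull; a 700 MB TinyLlama loads quickly from either medium, so the SSD pays off most at the 3B-7B end.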

For the Pi 4 specifically, USB 3.0 boot is the right SSD path. The Pi 4 Model B does not expose a PCIe connector (its single PCIe lane feeds the onboard USB 3.0 controller), so an NVMe drive can only attach through a USB enclosure, which adds cost for marginal gains over USB 3.0 SATA. The Crucial BX500 is the cheap big SSD; the Samsung 870 EVO is the smaller and slightly higher-IOPS pick.

Step-by-step: installing Ollama and pulling a small quantized model

On Raspberry Pi OS (64-bit, Bookworm or later), the install is straightforward:

curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
ollama pull tinyllama
ollama run tinyllama "Explain how Hall-effect sensors work in one paragraph."

The first pull downloads the GGUF weights into /usr/share/ollama/.ollama/models/. The first run warms the model into memory; subsequent runs reuse the cached weights. Per the Ollama model library, the smaller models in the library — tinyllama, phi3:mini, gemma2:2b, qwen2.5:1.5b — are the right starting points for a Pi 4 build.

For more control, you can use llama.cpp directly. Per the llama.cpp GitHub project, the ARM NEON build path is well-optimized and what Ollama uses under the hood — there's no meaningful speed advantage to building llama.cpp by hand on a Pi unless you want non-default quantization formats.

Benchmark table: tok/s on the Pi 4 8GB

Community-reported numbers, quantized to q4_K_M unless noted, Raspberry Pi 4 8GB, 64-bit Bookworm, llama.cpp/Ollama, 512-token prompt:

Model	Params	Quant	Model size on disk	Prefill tok/s	Gen tok/s	Usable?
TinyLlama 1.1B	1.1B	q4_K_M	~700 MB	200-300	8-12	yes — snappy
Gemma 2 2B	2B	q4_K_M	~1.5 GB	150-220	5-8	yes — comfortable
Phi-3 Mini 3.8B	3.8B	q4_K_M	~2.3 GB	80-120	3-5	yes — usable but slow
Qwen 2.5 1.5B	1.5B	q4_K_M	~1.0 GB	180-260	6-9	yes — good balance
Llama 3.2 3B	3B	q4_K_M	~1.9 GB	100-140	4-6	yes — feels slow
Mistral 7B	7B	q4_K_M	~4.1 GB	30-50	0.8-1.2	borderline
Gemma 2 9B	9B	q4_K_M	~5.4 GB	20-35	0.5-0.8	no — too slow

Numbers vary with cooling, OS background load, and exact build of llama.cpp. The pattern is consistent: 1-3B models are the sweet spot; 7B is the practical ceiling; anything bigger is academic.
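Because the figures shift with cooling and build, it is worth measuring your own Pi. A hedged sketch, assuming Ollama is serving on its default local port: the non-streaming `/api/generate` response carries token counts and durations (in nanoseconds) from which both rates follow. The helper names `rates` and `benchmark` are illustrative, not part of Ollama.

```python
import json
import urllib.request
from typing import Optional, Tuple

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

def _rate(count, duration_ns) -> Optional[float]:
    """Tokens per second, or None if the server omitted the field."""
    if not count or not duration_ns:
        return None
    return count / (duration_ns / 1e9)

def rates(resp: dict) -> Tuple[Optional[float], Optional[float]]:
    """Return (prefill tok/s, generation tok/s) from an Ollama response."""
    prefill = _rate(resp.get("prompt_eval_count"), resp.get("prompt_eval_duration"))
    gen = _rate(resp.get("eval_count"), resp.get("eval_duration"))
    return prefill, gen

def benchmark(model: str, prompt: str):
    """Send one non-streaming request and return its measured rates."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as r:
        return rates(json.load(r))

# Example call on the Pi itself (needs the model pulled first):
# print(benchmark("tinyllama", "Explain how Hall-effect sensors work."))
```

Run it a few times and discard the first result, since the first request also includes loading the model from disk.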

What's the largest model that fits in 8GB?

The 8GB shared between OS and apps caps the practical model size near 5-6GB on disk — leaving 2-3GB for the OS, the KV cache during inference, and any other apps you're running. That maps to:

7B at q4_K_M (~4.1 GB on disk) — fits with a 2K context window, swap-pressured on longer prompts
3-4B at q4_K_M or q5_K_M — fits comfortably with 8K+ context
1-2B at q6 or even FP16 — fits with huge context and feels fast

The Pi has no GPU VRAM to separate from system RAM, so model + KV cache + OS + Ollama daemon all share the same pool. Adding swap on the SSD lets you push past the limit, but performance crashes the moment the model spills to swap.
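To see how context length eats into that shared pool, here is a rough budget sketch. The KV cache formula (one key and one value vector per layer per token) is standard for transformer inference; the architecture numbers for Mistral 7B (32 layers, 8 KV heads, head dimension 128) come from its published model config rather than from this article, and a 16-bit cache is assumed.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB for a given context length."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # K and V
    return per_token * context_tokens / 1e9

def headroom_gb(model_gb: float, kv_gb: float, total_gb: float = 8.0) -> float:
    """RAM left for the OS, the Ollama daemon, and other apps."""
    return total_gb - model_gb - kv_gb

# Assumed Mistral 7B shape; 4.1 GB is the q4_K_M size from the table above.
for ctx in (2048, 8192):
    kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=ctx)
    left = headroom_gb(4.1, kv)
    print(f"context {ctx}: KV cache {kv:.2f} GB, {left:.2f} GB left for everything else")
```

This ignores compute buffers and allocator overhead, so real headroom is somewhat smaller than the printed figure.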

Prefill vs generation on an ARM CPU

Prefill on a Pi 4 hits 20-300 tok/s depending on model size, which is not bad for digesting prompts. Generation is sequential: the CPU walks every weight in the model once per output token. With ~4-5 GB/s of memory bandwidth, the math caps a 4GB model at roughly 1 tok/s theoretical, and the NEON-optimized llama.cpp gets surprisingly close to that: 0.8-1.2 tok/s in practice for a 7B q4 model.
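The ceiling in that paragraph can be written out directly: if every weight is read once per token, generation speed is at most memory bandwidth divided by model size. A minimal sketch with the article's own figures (the function name is illustrative):

```python
def ceiling_tok_per_s(model_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on generation speed when each token reads all weights once."""
    return bandwidth_gb_per_s / model_gb

# Mistral 7B at q4_K_M is about 4.1 GB; bandwidth range is the 4-5 GB/s quoted above.
for bandwidth in (4.0, 5.0):
    bound = ceiling_tok_per_s(4.1, bandwidth)
    print(f"{bandwidth} GB/s -> at most {bound:.2f} tok/s")
```

For the 7B case this gives 0.98-1.22 tok/s, in line with the reported 0.8-1.2. The same estimate for the smallest models in the benchmark table comes out below their reported figures, so treat the 4-5 GB/s bandwidth number as approximate rather than exact.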

The Pi's VideoCore VI GPU is not a usable LLM target — its compute kernels are not exposed for matrix math the way CUDA or Metal are. So the Pi is purely an ARM CPU device for inference, and the bandwidth ceiling sets the speed limit cleanly.

Where the Pi 4 8GB LLM rig makes sense — and where it doesn't

It makes sense for:

Tiny offline assistants. A Pi running Phi-3 Mini that answers "what's the syntax for a Python list comprehension?" on a local network with no data leaving the building.
Personal RAG over a small notes folder. A 1-3B model can do retrieval-augmented Q&A over a few thousand notes well enough to be useful.
Learning and experimentation. The Pi is cheap, the models are free, and the workflow teaches the moving parts of local inference.
Offline-first or air-gapped deployments. The Pi draws under 10W from its 15W supply, fits anywhere, and doesn't phone home.

It doesn't make sense for:

Interactive chat with reasoning quality. 7B q4 at 1 tok/s is unpleasant. Get a real GPU.
Code completion. The Pi is too slow for "type, see suggestion appear" speeds.
Image generation. Stable Diffusion on a Pi is a science fair project, not a workflow.
Anything that needs 70B+ models. Not happening.

Cross-link: the GPU path

For real local LLM work in 2026, the 12GB RTX 3060 at $269 is the budget pick. It runs Gemma 9B at 35-40 tok/s, roughly 45-80x faster than the Pi's 0.5-0.8 tok/s on the same model. The Pi is the right answer when an $80 SBC and offline-first operation matter more than throughput; the RTX 3060 is the right answer when throughput matters.

For a complete budget LLM PC build see Best GPU for Local LLMs Under $300.

Bottom line

The Pi 4 8GB runs local LLMs in 2026, with a realistic 3-12 tok/s for 1-3B models and the option to push a 7B model at about 1 tok/s. With an SSD boot disk it's a competent little inference appliance for under $200: the right pick for offline-first tiny assistants and learning, not for serious chat workloads. For serious work, save up for the RTX 3060 12GB.

Common pitfalls

Skipping the SSD. microSD-based LLM rigs feel three times as slow as they should.
Picking the 4GB Pi 4 to save $20. The 8GB is the right minimum for LLM work.
Pulling a 7B model without expecting 1 tok/s. The Pi is honest about its limits; reset expectations to the 1-3B tier.
No active cooling. Sustained CPU loads thermal-throttle the Pi to ~60-70% performance without a fan. A $15 case with active cooling is mandatory.
Treating the Pi 4 as a substitute for a GPU. It isn't. It's a different tool.

Real-world numbers — what 2-5 tok/s actually feels like

Output speed	What it feels like	Useful for
10+ tok/s	faster than fast reading	interactive chat, code completion
5-9 tok/s	comfortable reading pace	chat, batch summarization
3-4 tok/s	slower than typing	tolerable for short queries, painful for long
1-2 tok/s	each word visibly streams	batch jobs only, not chat
< 1 tok/s	wait-staring	leave it running, come back later

The Pi 4's 1-3B model performance lands at 3-12 tok/s. TinyLlama reaches the top tier of the table, while the 3B-class models sit at the "tolerable for short, painful for long" boundary. For a personal assistant that answers concise queries ("what's the syntax for X in Python?") it works. For a tutoring conversation with long explanations, it's frustrating.

The 7B Mistral case at ~1 tok/s is in batch-only territory. Useful for "summarize these 20 short notes overnight"; useless for chat.
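One way to make the tiers concrete is to convert them into wall-clock wait for a reply of a given length. A small sketch, assuming a 100-token answer (roughly a short paragraph) and ignoring prompt processing time:

```python
def reply_seconds(tokens: int, tok_per_s: float) -> float:
    """Time to stream a reply of `tokens` tokens at a steady generation rate."""
    return tokens / tok_per_s

# Representative speeds from the table above.
for speed in (10, 5, 3, 1):
    wait = reply_seconds(100, speed)
    print(f"{speed:>2} tok/s: {wait:.0f} s for a 100-token reply")
```

At 1 tok/s the same paragraph takes over a minute and a half, which is why the 7B case belongs in batch jobs.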

Power, thermals, and 24/7 always-on use

The Pi 4 8GB draws ~3-5W idle and 6-8W under sustained CPU load. With LLM inference active most of the time, expect ~6W average. Over a year at $0.15/kWh that's roughly $8/year in electricity — essentially free.
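The running-cost estimate is easy to redo for your own tariff. A minimal sketch of the arithmetic, using the 6W average and $0.15/kWh from the paragraph above (the function name is illustrative):

```python
HOURS_PER_YEAR = 24 * 365

def yearly_cost_usd(avg_watts: float, usd_per_kwh: float) -> float:
    """Annual electricity cost for a device running around the clock."""
    kwh_per_year = avg_watts * HOURS_PER_YEAR / 1000
    return kwh_per_year * usd_per_kwh

print(f"${yearly_cost_usd(6, 0.15):.2f} per year")
```

Even at the 8W sustained-load figure the total stays near $10 a year at the same rate.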

Thermals are the real constraint. Without cooling, sustained inference loads thermal-throttle the Pi at 80°C after 2-3 minutes, dropping clocks to 1.0 GHz and inference speed by ~30-40%. Active cooling (a $15 case with a 30mm fan) keeps the SoC at 55-65°C indefinitely, holding full 1.5 GHz throughout.
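To check whether a given case keeps the SoC below the throttle point, you can poll the temperature during a long generation. A hedged sketch, assuming the standard Linux sysfs thermal file that Raspberry Pi OS exposes, which reports millidegrees Celsius; the 80°C limit is the figure quoted above, and the function names are illustrative.

```python
# Assumption: the SoC sensor is thermal zone 0, as on stock Raspberry Pi OS.
THERMAL_PATH = "/sys/class/thermal/thermal_zone0/temp"

def parse_millidegrees(text: str) -> float:
    """Convert the sysfs reading (millidegrees C as text) to degrees C."""
    return int(text.strip()) / 1000

def soc_temp_c(path: str = THERMAL_PATH) -> float:
    """Read the current SoC temperature in degrees C."""
    with open(path) as f:
        return parse_millidegrees(f.read())

def near_throttle(temp_c: float, limit_c: float = 80.0, margin_c: float = 5.0) -> bool:
    """True once the SoC is within `margin_c` of the throttle limit."""
    return temp_c >= limit_c - margin_c

# On the Pi, while a model is generating:
# t = soc_temp_c(); print(t, "near throttle" if near_throttle(t) else "ok")
```

If the reading climbs past the mid-70s within a few minutes of sustained inference, the cooling is not keeping up.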

For 24/7 always-on use as a personal LLM appliance, the right setup is:

Pi 4 8GB in an aluminum case with active cooling
SSD on USB 3.0 (no microSD)
Ethernet, not Wi-Fi (more stable for headless operation)
A small UPS (any cheap mini-UPS) for clean shutdowns on power blips

That stack runs forever. Replace the SSD in 5-7 years; everything else lasts a decade.

Building a tiny RAG over personal notes — a worked example

A practical use case where the Pi 4 LLM rig pays off: a personal RAG over your notes folder.

Convert your notes (Markdown, Obsidian, Joplin export) to plain text
Use a small embedding model (BAAI/bge-small-en at ~130MB) to build a FAISS index
Wire a tiny query loop: query → embedding → top-K retrieval → context-stuffed prompt to a 3B model in Ollama
Total latency: ~5-10 sec for retrieval plus roughly 5-35 sec for a 30-100 token reply at the 3-6 tok/s a 3B model manages, depending on prompt size

That's a workable offline knowledge assistant for a few thousand notes on a $200 box. Per llama.cpp's repo, the ARM NEON build path used by Ollama is well-optimized for exactly this kind of workload. The Pi handles retrieval-bound queries far better than it handles open-ended generation, because the retrieval step trims the prompt and the model only generates short responses.
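The query loop described above can be sketched in a few lines. This is a toy under stated assumptions: a word-count vector and brute-force cosine ranking stand in for the bge-small-en embeddings and the FAISS index, so that the example runs with only the standard library. The notes, function names, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def top_k(query: str, notes: list, k: int = 2) -> list:
    """Rank notes by similarity to the query and keep the best k."""
    q = embed(query)
    return sorted(notes, key=lambda n: cosine(q, embed(n)), reverse=True)[:k]

def build_prompt(query: str, notes: list, k: int = 2) -> str:
    """Stuff the retrieved notes into a prompt for the local model."""
    context = "\n".join(f"- {n}" for n in top_k(query, notes, k))
    return f"Answer using only these notes:\n{context}\n\nQuestion: {query}\nAnswer:"

notes = [
    "The garage door code was changed in March.",
    "Sourdough starter needs feeding every 12 hours.",
    "Router admin page is at 192.168.1.1.",
]
print(build_prompt("How often do I feed the sourdough starter?", notes, k=1))
```

The resulting string is what you would pass to a 3B model in Ollama. Keeping k small matters on the Pi, since every retrieved note adds prefill work before the first token appears.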

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

Frequently asked questions

Can a Raspberry Pi 4 8GB run local LLMs?

Yes, for the small tier. With Ollama and a quantized 1-3B-class model (TinyLlama, Gemma 2 2B, Phi-3 Mini), the Pi 4 8GB runs CPU-only inference at roughly 3-12 tokens per second depending on model size, usable for a personal assistant or a tiny RAG backend. 7B models technically run at q4 but drop to about 1 tok/s, which is too slow for interactive chat. For models above 7B, get a real GPU.

Why boot from an SSD instead of microSD?

microSD cards have ~50-100 MB/s read and limited random-access throughput, so loading a 2-4GB model takes 20-80 seconds and the OS feels sluggish even outside LLM use. A SATA SSD over USB 3.0 jumps to 300-400 MB/s sustained and ~10K IOPS random, so model loads finish in roughly 5-13 seconds and the OS feels like a real computer. SSD boot is the single biggest QoL upgrade for any Pi 4 build.

What's the largest model that fits in 8GB?

The Pi 4 8GB shares its memory between OS and apps, so the practical model ceiling is roughly 5-6GB to leave 2-3GB for the OS, KV cache, and overhead. That's enough for a 7B model at q4_K_M with a small context window, or a 3B model at q6 with plenty of context, or a 1-2B model unquantized for maximum quality. Above 7B, you're paging to swap and the system grinds.

Prefill vs generation on an ARM CPU with no GPU offload — what should I expect?

On the Pi 4's quad-core Cortex-A72 at 1.5 GHz, prefill processes around 80-140 tokens per second for a 3B model, so long prompts take seconds to digest. Generation runs sequentially through the model weights and drops to 3-6 tok/s for the same model. The Pi has no GPU offload path for LLMs (its VideoCore VI is not LLM-supported), so it's pure ARM SIMD inference. llama.cpp's NEON build is meaningfully faster than the generic ARM build, so use Ollama, which bundles llama.cpp with NEON enabled.

Where does the Pi 4 8GB LLM rig actually make sense?

Three places: tiny offline assistants (a local Phi-3 that answers questions without sending data anywhere), personal RAG over a small notes folder, and learning/experimentation projects where the Pi's $80 price tag is the right entry. For anything that needs reasoning quality or sub-second response time, the 12GB RTX 3060 at $269 is a better answer: it manages 35-40 tok/s on Gemma 2 9B, against under 1 tok/s on the Pi, and runs models the Pi can't touch.
