Can a Raspberry Pi 4 8GB Run Local LLMs in 2026? Ollama tok/s + SSD-Boot Setup

Practical CPU-only inference on the 8GB Pi 4 — what fits, what tok/s to expect, and the SSD-boot setup that keeps model loads fast

Can a Raspberry Pi 4 8GB run local LLMs in 2026? Yes — for 1-3B models at 2-5 tok/s with Ollama. We cover the BOM, SSD-boot setup, and where a real GPU still wins.

Yes — a Raspberry Pi 4 8GB runs local LLMs in 2026, but only for the small tier. With Ollama plus a quantized 1-3B model like TinyLlama or Phi-3 Mini, expect 2-5 tokens per second of generation, fast enough for a personal offline assistant. For anything bigger, get a real GPU — but for the $80-150 SBC tier, the Pi 4 8GB plus an SSD boot disk is a credible little inference appliance.

Realistic expectations for CPU-only LLM inference on an SBC

There is a lot of breathless YouTube content about "running LLMs on a Raspberry Pi" that buries the actual tok/s figures. The honest 2026 picture: the Pi 4 8GB is a single-board ARM computer with no GPU acceleration for LLMs, four Cortex-A72 cores at 1.5 GHz, and a shared 8GB RAM pool. Per the official Raspberry Pi 4 product page, the memory bandwidth tops out around 4-5 GB/s — two orders of magnitude below a budget GPU. That is the fundamental ceiling.

So what does work? Small models at small quants. A 1.1B-class model like TinyLlama runs at 8-12 tok/s, which is usable for a chat-style assistant. A 3-4B model like Llama 3.2 3B or Phi-3 Mini (3.8B) runs at 3-6 tok/s: usable, but you feel each token. A 7B model at q4 technically fits in memory but drops to 0.5-1 tok/s, which is slow enough that you'd rather use a GPU.

The 8GB variant matters because the 4GB Pi 4 can only run the smallest models with no headroom for OS and apps. The 8GB Pi is the right pick for any LLM work; the 4GB Pi is the right pick for media servers and homelab tasks where LLMs aren't the use case.

Key takeaways

  • Raspberry Pi 4 8GB runs 1-3B quantized models at 2-12 tok/s — usable for tiny assistants
  • 7B models technically work but at 0.5-1 tok/s — too slow for interactive chat
  • SSD boot via a Crucial BX500 1TB or Samsung 870 EVO 250GB is the single biggest QoL upgrade
  • For real LLM work, the 12GB RTX 3060 at $269 is 10-20x faster and runs models the Pi can't touch
  • The Pi 4 LLM rig is the right answer for offline-first tiny assistants and learning, not for production chat

BOM and setup

| Part | Pick | Approx price |
|---|---|---|
| SBC | Raspberry Pi 4 Model B 8GB | $80 |
| Boot SSD | Crucial BX500 1TB (any 2.5" SATA SSD works) | $60 |
| Boot SSD (smaller) | Samsung 870 EVO 250GB | $35 |
| SATA-to-USB 3.0 adapter | FIDECO SATA/IDE to USB 3.0 | $25-30 |
| Power supply | official Pi 4 USB-C 15W | $10 |
| Case + fan | any aluminum case with active cooling | $15 |
| Total | | ~$170-200 |

The Pi 4 8GB is the centerpiece. The SSD boots over USB 3.0, which the Pi 4 officially supports: drop a SATA SSD into the FIDECO adapter, plug it into one of the Pi's USB 3.0 ports, and configure the Pi to boot from USB. Total build is under $200 for a competent tiny inference box.

Why boot from an SSD instead of microSD

microSD cards have two problems for LLM work:

  1. Slow sequential read. A typical Class 10 / U3 card delivers 50-100 MB/s sequential read. A 4GB quantized model takes 40-80 seconds to load from cold. An SSD over USB 3.0 delivers 300-400 MB/s sustained, dropping model load to 10-15 seconds.
  2. Poor random-access throughput. OS operations such as package installs, log writes, and swap activity hammer random-access I/O. microSD cards manage perhaps 1-2K random IOPS; SATA SSDs deliver 10-15K, and NVMe on the Pi 5 goes further still. The difference shows in everyday responsiveness, not just LLM workloads.
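The load-time figures above follow from dividing model size by sequential read speed. A minimal sketch of that arithmetic, assuming the load is purely storage-bound (the function name and the 1 GB = 1000 MB rounding are illustrative choices, not from any tool):

```python
def load_seconds(model_gb: float, read_mb_s: float) -> float:
    """Cold-load time if the read is sequential and limited only by storage."""
    return model_gb * 1000 / read_mb_s


MODEL_GB = 4.0  # the 4GB quantized model used in the example above

# (slowest, fastest) sequential read speeds in MB/s, as quoted in the text
storage = {
    "microSD": (50, 100),
    "SATA SSD over USB 3.0": (300, 400),
}

for label, (slow, fast) in storage.items():
    best = load_seconds(MODEL_GB, fast)
    worst = load_seconds(MODEL_GB, slow)
    print(f"{label}: {best:.0f}-{worst:.0f} s")
```

Run as-is, this prints 40-80 s for microSD and 10-13 s for the SSD. Real loads also pay for filesystem and USB overhead, so treat these as lower bounds.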

For the Pi 4 specifically, USB 3.0 boot is the right SSD path. The Pi 4 Model B exposes no PCIe connector (its single lane feeds the onboard USB 3.0 controller), so NVMe is only possible through a USB enclosure, which adds cost and complexity for marginal gains over USB 3.0 SATA. The Crucial BX500 is the cheap, large-capacity SSD; the Samsung 870 EVO is the smaller and slightly higher-IOPS pick.

Step-by-step: installing Ollama and pulling a small quantized model

On Raspberry Pi OS (64-bit, Bookworm or later), the install is straightforward:

curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
ollama pull tinyllama
ollama run tinyllama "Explain how Hall-effect sensors work in one paragraph."

The first pull downloads the GGUF weights into /usr/share/ollama/.ollama/models/. The first run warms the model into memory; subsequent runs reuse the cached weights. Per the Ollama model library, the smaller models in the library — tinyllama, phi3:mini, gemma2:2b, qwen2.5:1.5b — are the right starting points for a Pi 4 build.

For more control, you can use llama.cpp directly. Per the llama.cpp GitHub project, the ARM NEON build path is well-optimized and what Ollama uses under the hood — there's no meaningful speed advantage to building llama.cpp by hand on a Pi unless you want non-default quantization formats.

Benchmark table: tok/s on the Pi 4 8GB

Community-reported numbers, quantized to q4_K_M unless noted, Raspberry Pi 4 8GB, 64-bit Bookworm, llama.cpp/Ollama, 512-token prompt:

| Model | Params | Quant | Model size on disk | Prefill tok/s | Gen tok/s | Usable? |
|---|---|---|---|---|---|---|
| TinyLlama 1.1B | 1.1B | q4_K_M | ~700 MB | 200-300 | 8-12 | yes (snappy) |
| Gemma 2 2B | 2B | q4_K_M | ~1.5 GB | 150-220 | 5-8 | yes (comfortable) |
| Phi-3 Mini 3.8B | 3.8B | q4_K_M | ~2.3 GB | 80-120 | 3-5 | yes (usable but slow) |
| Qwen 2.5 1.5B | 1.5B | q4_K_M | ~1.0 GB | 180-260 | 6-9 | yes (good balance) |
| Llama 3.2 3B | 3B | q4_K_M | ~1.9 GB | 100-140 | 4-6 | yes (feels slow) |
| Mistral 7B | 7B | q4_K_M | ~4.1 GB | 30-50 | 0.8-1.2 | borderline |
| Gemma 2 9B | 9B | q4_K_M | ~5.4 GB | 20-35 | 0.5-0.8 | no (too slow) |

Numbers vary with cooling, OS background load, and exact build of llama.cpp. The pattern is consistent: 1-3B models are the sweet spot; 7B is the practical ceiling; anything bigger is academic.
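Because the numbers vary this much, it is worth measuring your own board. The sketch below assumes a local Ollama server on its default port (11434) and reads the timing fields that its `/api/generate` endpoint returns in a non-streaming reply (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`, with durations in nanoseconds). The helper names are made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration into tok/s."""
    return count / (duration_ns / 1e9) if duration_ns else 0.0


def summarize(resp: dict) -> dict:
    """Pull prefill and generation speed out of one /api/generate reply."""
    return {
        "prefill_tok_s": tokens_per_second(
            resp.get("prompt_eval_count", 0), resp.get("prompt_eval_duration", 0)
        ),
        "gen_tok_s": tokens_per_second(
            resp.get("eval_count", 0), resp.get("eval_duration", 0)
        ),
    }


def benchmark(model: str, prompt: str) -> dict:
    """Send one prompt to a running Ollama server and return measured speeds."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=600) as reply:
        return summarize(json.load(reply))
```

With the server running and the model pulled, `benchmark("tinyllama", "Explain how Hall-effect sensors work in one paragraph.")` returns a dict you can compare against the table. Run it a few times, since the first call includes model load.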

What's the largest model that fits in 8GB?

The 8GB shared between OS and apps caps the practical model size near 5-6GB on disk — leaving 2-3GB for the OS, the KV cache during inference, and any other apps you're running. That maps to:

  • 7B at q4_K_M (~4.1 GB on disk) — fits with a 2K context window, swap-pressured on longer prompts
  • 3-4B at q4_K_M or q5_K_M — fits comfortably with 8K+ context
  • 1-2B at q6 or even FP16 — fits with huge context and feels fast

The Pi has no GPU VRAM to separate from system RAM, so model + KV cache + OS + Ollama daemon all share the same pool. Adding swap on the SSD lets you push past the limit, but performance crashes the moment the model spills to swap.
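To see how context length eats into that shared pool, here is a rough budget check. It is a sketch under stated assumptions: the KV cache is stored at fp16, the architecture numbers used for Mistral 7B (32 layers, 8 KV heads, head dimension 128) are my assumption and not from this article, and llama.cpp's compute buffers and the Ollama daemon's own overhead are not modeled.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the K and V tensors across all layers (fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def fits(model_gb: float, kv_bytes: int, total_gb: float = 8.0,
         reserved_gb: float = 2.0) -> bool:
    """True if weights plus KV cache fit in RAM left after an OS reserve."""
    return model_gb + kv_bytes / GIB <= total_gb - reserved_gb


# Assumed Mistral 7B shape: 32 layers, 8 KV heads, head_dim 128 (illustrative)
for ctx in (2048, 8192, 32768):
    kv = kv_cache_bytes(32, 8, 128, ctx)
    print(f"context {ctx}: KV cache {kv / GIB:.2f} GiB, fits={fits(4.1, kv)}")
```

The 2 GB reserve mirrors the low end of the 2-3GB headroom mentioned above. Models without grouped-query attention have more KV heads and a proportionally larger cache, so the same context costs more memory.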

Prefill vs generation on an ARM CPU

Prefill on a Pi 4 hits 60-300 tok/s depending on model size — not bad for digesting prompts. Generation is sequential: walk every weight in the model once per output token. With ~4-5 GB/s of memory bandwidth, the math caps a 4GB model at roughly 1 tok/s theoretical and the NEON-optimized llama.cpp gets surprisingly close to that — 0.8-1.2 tok/s in practice for a 7B q4 model.
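The ceiling estimate above is a single division: if every weight is read once per output token, tok/s cannot exceed memory bandwidth divided by model size. A minimal sketch, using the 4-5 GB/s bandwidth figure and the on-disk sizes quoted in this article (the function name is illustrative):

```python
def gen_ceiling_tok_s(model_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on generation speed when each token reads all weights once."""
    return bandwidth_gb_s / model_gb


# On-disk sizes from the benchmark table; bandwidth range from the text above
for name, size_gb in (("Mistral 7B q4_K_M", 4.1), ("Gemma 2 9B q4_K_M", 5.4)):
    low = gen_ceiling_tok_s(size_gb, 4.0)
    high = gen_ceiling_tok_s(size_gb, 5.0)
    print(f"{name}: ceiling {low:.2f}-{high:.2f} tok/s")
```

This gives about 0.98-1.22 tok/s for the 7B model and 0.74-0.93 tok/s for the 9B model. It is a simplification: it ignores cache effects and KV-cache reads, so use it as a sanity check rather than a prediction.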

The Pi's VideoCore VI GPU is not a usable LLM target — its compute kernels are not exposed for matrix math the way CUDA or Metal are. So the Pi is purely an ARM CPU device for inference, and the bandwidth ceiling sets the speed limit cleanly.

Where the Pi 4 8GB LLM rig makes sense — and where it doesn't

It makes sense for:

  • Tiny offline assistants. A Pi running Phi-3 Mini that answers "what's the syntax for a Python list comprehension?" on a local network with no data leaving the building.
  • Personal RAG over a small notes folder. A 1-3B model can do retrieval-augmented Q&A over a few thousand notes well enough to be useful.
  • Learning and experimentation. The Pi is cheap, the models are free, and the workflow teaches the moving parts of local inference.
  • Offline-first or air-gapped deployments. The Pi runs on 15W, fits anywhere, and doesn't phone home.

It doesn't make sense for:

  • Interactive chat with reasoning quality. 7B q4 at 1 tok/s is unpleasant. Get a real GPU.
  • Code completion. The Pi is too slow for "type, see suggestion appear" speeds.
  • Image generation. Stable Diffusion on a Pi is a science fair project, not a workflow.
  • Anything that needs 70B+ models. Not happening.

Cross-link: the GPU path

For real local LLM work in 2026, the 12GB RTX 3060 at $269 is the budget pick. It runs Gemma 9B at 35-40 tok/s, roughly 45-80x faster than the Pi's 0.5-0.8 tok/s on the same model. The Pi is the right answer when an $80 SBC and offline-first operation matter more than throughput; the RTX 3060 is the right answer when throughput matters.

For a complete budget LLM PC build see Best GPU for Local LLMs Under $300.

Bottom line

The Pi 4 8GB runs local LLMs in 2026, with realistic 2-5 tok/s for 1-3B models and the option to push a 7B model at slow speeds. With an SSD boot disk it's a competent little inference appliance for under $200 — the right pick for offline-first tiny assistants and learning, not for serious chat workloads. For serious work, save up for the RTX 3060 12GB.

Common pitfalls

  • Skipping the SSD. microSD-based LLM rigs feel three times as slow as they should.
  • Picking the 4GB Pi 4 to save $20. The 8GB is the right minimum for LLM work.
  • Pulling a 7B model without expecting 1 tok/s. The Pi is honest about its limits; reset expectations to the 1-3B tier.
  • No active cooling. Sustained CPU loads thermal-throttle the Pi to ~60-70% performance without a fan. A $15 case with active cooling is mandatory.
  • Treating the Pi 4 as a substitute for a GPU. It isn't. It's a different tool.

Real-world numbers — what 2-5 tok/s actually feels like

| Output speed | What it feels like | Useful for |
|---|---|---|
| 10+ tok/s | faster than fast reading | interactive chat, code completion |
| 5-9 tok/s | comfortable reading pace | chat, batch summarization |
| 3-4 tok/s | slower than typing | tolerable for short queries, painful for long |
| 1-2 tok/s | each word visibly streams | batch jobs only, not chat |
| < 1 tok/s | wait-staring | leave it running, come back later |

The Pi 4's 1-3B model performance lands at 3-9 tok/s, which sits at the "tolerable for short, painful for long" boundary. For a personal assistant that answers concise queries — "what's the syntax for X in Python?" — it works. For a tutoring conversation with long explanations, it's frustrating.

The 7B Mistral case at ~1 tok/s is in batch-only territory. Useful for "summarize these 20 short notes overnight"; useless for chat.

Power, thermals, and 24/7 always-on use

The Pi 4 8GB draws ~3-5W idle and 6-8W under sustained CPU load. With LLM inference active most of the time, expect ~6W average. Over a year at $0.15/kWh that's roughly $8/year in electricity — essentially free.
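The electricity figure is easy to re-run for your own tariff. A small sketch of the arithmetic, assuming a constant average draw around the clock (the function name is illustrative):

```python
def annual_cost_usd(avg_watts: float, usd_per_kwh: float,
                    hours: float = 24 * 365) -> float:
    """Yearly electricity cost for a device with a constant average draw."""
    return avg_watts / 1000 * hours * usd_per_kwh


# 6W average and $0.15/kWh are the figures quoted above
print(f"${annual_cost_usd(6, 0.15):.2f} per year")
```

At 6W and $0.15/kWh this prints $7.88 per year, matching the rough $8 estimate.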

Thermals are the real constraint. Without cooling, sustained inference loads thermal-throttle the Pi at 80°C after 2-3 minutes, dropping clocks to 1.0 GHz and inference speed by ~30-40%. Active cooling (a $15 case with a 30mm fan) keeps the SoC at 55-65°C indefinitely, holding full 1.5 GHz throughout.
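To check whether your cooling holds up, you can poll the SoC temperature during a long generation. The sketch below assumes the standard Linux thermal sysfs file, which reports millidegrees Celsius. The path and the 80°C limit (taken from the paragraph above) are assumptions to adjust for your setup.

```python
from pathlib import Path

# Assumed sensor path on Raspberry Pi OS; value is in millidegrees Celsius
THERMAL_ZONE = Path("/sys/class/thermal/thermal_zone0/temp")
THROTTLE_C = 80.0  # throttle point quoted in the text above


def parse_millidegrees(raw: str) -> float:
    """Convert the raw sysfs string (e.g. '61234\\n') to degrees Celsius."""
    return int(raw.strip()) / 1000


def read_temp_c(path: Path = THERMAL_ZONE) -> float:
    """Read the current SoC temperature in degrees Celsius."""
    return parse_millidegrees(path.read_text())


def headroom_c(temp_c: float, limit_c: float = THROTTLE_C) -> float:
    """Degrees remaining before the assumed throttle point."""
    return limit_c - temp_c
```

Calling `headroom_c(read_temp_c())` every few seconds while a model is generating shows whether the fan keeps the board in the 55-65°C range or lets it creep toward the throttle point.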

For 24/7 always-on use as a personal LLM appliance, the right setup is:

  • Pi 4 8GB in an aluminum case with active cooling
  • SSD on USB 3.0 (no microSD)
  • Ethernet, not Wi-Fi (more stable for headless operation)
  • A small UPS (any cheap mini-UPS) for clean shutdowns on power blips

That stack runs forever. Replace the SSD in 5-7 years; everything else lasts a decade.

Building a tiny RAG over personal notes — a worked example

A practical use case where the Pi 4 LLM rig pays off: a personal RAG over your notes folder.

  1. Convert your notes (Markdown, Obsidian, Joplin export) to plain text
  2. Use a small embedding model (BAAI/bge-small-en at ~130MB) to build a FAISS index
  3. Wire a tiny query loop: query → embedding → top-K retrieval → context-stuffed prompt to a 3B model in Ollama
  4. Expected latency per query: ~5-10 sec for retrieval plus 3-15 sec for a 30-100 token reply, depending on prompt size

That's a workable offline knowledge assistant for a few thousand notes on a $200 box. Per llama.cpp's repo, the ARM NEON build path used by Ollama is well-optimized for exactly this kind of workload. The Pi handles retrieval-bound queries far better than it handles open-ended generation, because the retrieval step trims the prompt and the model only generates short responses.
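The query loop in the steps above can be sketched without any dependencies. To be clear about the stand-ins: `embed` below is a toy bag-of-words counter standing in for a real embedding model such as bge-small-en, and the sorted scan stands in for a FAISS index. All function names are hypothetical. The point is only the shape of the loop: embed the query, rank notes, stuff the top hits into a prompt.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, notes: list[str], k: int = 2) -> list[str]:
    """Return the k notes most similar to the query (brute-force scan)."""
    q = embed(query)
    return sorted(notes, key=lambda n: cosine(q, embed(n)), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble a context-stuffed prompt for the small model."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only these notes:\n{joined}\n\nQuestion: {query}\nAnswer:"


notes = [
    "Pi 4 boots from the SSD on a USB 3.0 port.",
    "Grocery list: eggs, rice, coffee.",
    "Ollama stores models under /usr/share/ollama/.ollama/models/.",
]
question = "Where does Ollama store models?"
prompt = build_prompt(question, top_k(question, notes, k=1))
print(prompt)
```

The resulting `prompt` string is what you would send to the 3B model in Ollama. Keeping `k` small keeps the prompt short, which is what makes this workload a good fit for the Pi.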

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

Can a Raspberry Pi 4 8GB run local LLMs?
Yes, for the small tier. With Ollama and a quantized 1-3B model (TinyLlama, Phi-3 Mini, Gemma 2B), the Pi 4 8GB runs CPU-only inference at 2-5 tokens per second — usable for a personal assistant or a tiny RAG backend. 7B models technically run at q4 but drop to 0.5-1 tok/s, which is too slow for interactive chat. For models above 7B, get a real GPU.
Why boot from an SSD instead of microSD?
microSD cards have ~50-100 MB/s read and limited random-access throughput, so loading a 2-4GB model takes 20-80 seconds and the OS feels sluggish even outside LLM use. A SATA SSD over USB 3.0 jumps to 300-400 MB/s sustained and ~10K random IOPS, so model loads finish in roughly 5-13 seconds and the OS feels like a real computer. SSD boot is the single biggest QoL upgrade for any Pi 4 build.
What's the largest model that fits in 8GB?
The Pi 4 8GB shares its memory between OS and apps, so the practical model ceiling is roughly 5-6GB to leave 2-3GB for the OS, KV cache, and overhead. That's enough for a 7B model at q4_K_M with a small context window, or a 3B model at q6 with plenty of context, or a 1-2B model unquantized for maximum quality. Above 7B, you're paging to swap and the system grinds.
Prefill vs generation on an ARM CPU with no GPU offload — what should I expect?
On the Pi 4's quad-core Cortex-A72 at 1.5 GHz, prefill processes around 60-120 tokens per second for a 3B model — long prompts take seconds to digest. Generation runs sequentially through the model weights and drops to 2-5 tok/s for the same model. The Pi has no GPU offload path for LLMs (its VideoCore VI is not LLM-supported), so it's pure ARM SIMD inference. llama.cpp's NEON build is meaningfully faster than the generic ARM build — use Ollama, which bundles llama.cpp with NEON enabled.
Where does the Pi 4 8GB LLM rig actually make sense?
Three places: tiny offline assistants (a local Phi-3 that answers questions without sending data anywhere), personal RAG over a small notes folder, and learning/experimentation projects where the Pi's $80 price tag is the right entry. For anything that needs reasoning quality or sub-second response time, the [12GB RTX 3060](/product/B08WRVQ4KR) at $269 is a better answer — 10-20x faster and runs models the Pi can't touch.

— SpecPicks Editorial · Last verified 2026-05-31
