Run a Local LLM on a Raspberry Pi 4 (8GB) in 2026: What Actually Works and How Slow It Is

Item: Raspberry Pi 4 Computer Model B 8GB Single Board Computer Suitable for Building Mini PC/Smart Robot/Game Console/Workstation/Media Center/Etc.
Author: Mike Perry

By Mike Perry · Published 2026-06-10 · Last verified 2026-07-24 · 8 min read

Yes, a Raspberry Pi 4 8GB can run a local LLM — but slowly, on small models, and only at the speeds you'd accept for background tasks rather than interactive chat. With a 3B-class quantized model and a fast USB-attached SSD, expect 2-6 tokens per second of generated output. That's enough for home-automation intents, classification pipelines, and short summaries; it's not enough for a responsive coding assistant or a chat partner.

The Pi-as-LLM-host idea has been circulating since the first LLaMA leaked, but the gap between "loads" and "runs usefully" is wide on this hardware. CPU-only inference at 8GB has fundamental ceilings — memory bandwidth caps the token rate, single-thread performance caps the prompt-processing rate, and the absence of dedicated tensor hardware means there's no way to close the gap with software alone. Knowing those ceilings helps you pick the right tasks for the platform and avoid the disappointment that comes from expecting desktop-class speed.

Key takeaways

A Pi 4 8GB runs 1B-3B models at q4 quantization at 2-6 tokens per second — slow but usable for background tasks.
7B-8B-class models technically load in 8GB at q4 or lower quantization but generate at about 1-2 tok/s, which is too slow for chat.
A USB-attached SSD (WD Blue SN550 or Crucial BX500 in an enclosure) is strongly recommended for usable load times and long-term reliability.
Best use cases: home-automation intent classification, offline assistants for narrow domains, prototype edge-AI projects, learning the local-LLM tooling.
For interactive chat or larger models, step up to a Pi 5, a mini-PC with iGPU, or a 12GB GPU desktop rig.

What you'll need checklist

A Raspberry Pi 4 Model B 8GB — the 8GB variant is non-negotiable; 4GB and below run out of headroom with any model larger than 1B.
USB-attached storage: a WD Blue SN550 1TB NVMe in a USB 3.0 enclosure, or a Crucial BX500 1TB SATA SSD with a USB bridge. Either dramatically outperforms a microSD card for model loads.
Active cooling: a fan-equipped case or the Argon ONE V2 chassis. The Pi 4 throttles at 80°C and you'll hit that within minutes of any sustained inference workload on a passive heatsink.
A 3 A USB-C power supply (the official Raspberry Pi PSU is the safest pick — undervoltage on inference workloads causes silent throttling and crashes).
Raspberry Pi OS 64-bit (32-bit can't address the full 8GB for a single process).
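
Two of the checklist items (64-bit OS, staying under the 80°C throttle point) can be checked from a script. The following is a minimal sketch, assuming the kernel exposes the SoC temperature in millidegrees at /sys/class/thermal/thermal_zone0/temp, which is the usual location on Raspberry Pi OS but is an assumption here. The function names are illustrative, not part of any Pi tooling.

```python
import platform
from pathlib import Path

THROTTLE_C = 80.0  # throttle point quoted in the checklist above
# Assumed location of the SoC temperature (millidegrees Celsius).
THERMAL_FILE = Path("/sys/class/thermal/thermal_zone0/temp")


def parse_millidegrees(raw: str) -> float:
    """Convert the kernel's millidegree string (for example '65000') to Celsius."""
    return int(raw.strip()) / 1000.0


def is_64bit_arm(machine: str) -> bool:
    """True when the reported machine type is a 64-bit ARM architecture."""
    return machine in ("aarch64", "arm64")


def headroom_c(temp_c: float, limit_c: float = THROTTLE_C) -> float:
    """Degrees remaining before the throttle point (negative means already over)."""
    return limit_c - temp_c


if __name__ == "__main__":
    print("64-bit ARM kernel:", is_64bit_arm(platform.machine()))
    if THERMAL_FILE.exists():
        temp = parse_millidegrees(THERMAL_FILE.read_text())
        print(f"SoC temperature: {temp:.1f} C, headroom {headroom_c(temp):.1f} C")
    else:
        print("No thermal zone file found; this may not be a Raspberry Pi.")
```

Running it in a loop during a long generation shows whether the cooling setup keeps the board below the throttle point.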

Which models fit in 8GB?

The Pi 4's 8GB is shared between the OS, the file cache, any other services on the board, and the model. Realistic footprints:

Model	Quant	RAM footprint	Notes
TinyLlama 1.1B	q4_K_M	700 MB	Fast, weak — good for classification
Phi-3 Mini 3.8B	q4_K_M	2.4 GB	Best 8GB sweet spot
Gemma 2 2B	q4_K_M	1.5 GB	Lightweight chat
Qwen2.5 3B	q4_K_M	2.0 GB	Strong for size
Llama 3.1 8B	q4_K_M	4.9 GB	Loads but slow generation
Llama 3.1 8B	q2_K	3.1 GB	Loads with headroom, quality degraded

The sweet spot for a Pi 4 8GB is the 2-3B class at q4. You leave 4-5 GB free for the OS and any other services running on the Pi, and the model generates at a usable speed for its size class.
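
The headroom arithmetic can be sketched as follows. The footprints are copied from the table above; the 1 GB reserve for the OS and background services is an assumption for illustration, not a measured figure, and the function name is hypothetical.

```python
TOTAL_RAM_GB = 8.0
OS_RESERVE_GB = 1.0  # assumption: OS plus background services

# Model footprints in GB, copied from the table above.
FOOTPRINTS_GB = {
    "TinyLlama 1.1B q4_K_M": 0.7,
    "Gemma 2 2B q4_K_M": 1.5,
    "Qwen2.5 3B q4_K_M": 2.0,
    "Phi-3 Mini 3.8B q4_K_M": 2.4,
    "Llama 3.1 8B q2_K": 3.1,
    "Llama 3.1 8B q4_K_M": 4.9,
}


def free_after_load(model_gb: float,
                    total_gb: float = TOTAL_RAM_GB,
                    reserve_gb: float = OS_RESERVE_GB) -> float:
    """RAM left for KV cache and other services once the model is resident."""
    return round(total_gb - reserve_gb - model_gb, 1)


if __name__ == "__main__":
    for name, size_gb in FOOTPRINTS_GB.items():
        print(f"{name:26s} {size_gb:4.1f} GB model, "
              f"{free_after_load(size_gb):4.1f} GB left")
```

Whatever is left over also has to hold the KV cache, so the remaining figure is an upper bound on what other services can use.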

Benchmark table: tok/s on the Pi 4 8GB

Community measurements from Phoronix's Raspberry Pi benchmark coverage and r/LocalLLaMA's Pi-specific threads land roughly here:

Model	Quant	Prompt tok/s	Generation tok/s
TinyLlama 1.1B	q4_K_M	35-50	12-18
Gemma 2 2B	q4_K_M	20-30	6-9
Qwen2.5 3B	q4_K_M	14-20	4-7
Phi-3 Mini 3.8B	q4_K_M	12-16	3-5
Llama 3.1 8B	q4_K_M	6-9	1-2

These assume a stable CPU temperature (no throttling), the model loaded once and reused, and llama.cpp built with NEON optimizations. Cold-load times for the 3-4B models run 8-15 seconds on USB SSD vs 30-60 seconds on microSD.
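
To turn the table into a rough per-request latency, divide prompt tokens by the prompt rate and output tokens by the generation rate. The sketch below uses the ranges from the table above; the token counts passed in are hypothetical workloads, and the estimate ignores model load time and throttling.

```python
# (prompt tok/s range, generation tok/s range), copied from the table above.
RATES = {
    "TinyLlama 1.1B": ((35, 50), (12, 18)),
    "Gemma 2 2B": ((20, 30), (6, 9)),
    "Qwen2.5 3B": ((14, 20), (4, 7)),
    "Phi-3 Mini 3.8B": ((12, 16), (3, 5)),
    "Llama 3.1 8B": ((6, 9), (1, 2)),
}


def response_seconds(model: str, prompt_tokens: int, output_tokens: int):
    """Return (best_case, worst_case) seconds for one request."""
    (p_lo, p_hi), (g_lo, g_hi) = RATES[model]
    best = prompt_tokens / p_hi + output_tokens / g_hi
    worst = prompt_tokens / p_lo + output_tokens / g_lo
    return best, worst


if __name__ == "__main__":
    # Hypothetical workload: a 100-token prompt and a 50-token answer.
    for model in RATES:
        best, worst = response_seconds(model, 100, 50)
        print(f"{model:16s} {best:5.1f} to {worst:5.1f} s")
```

Short prompts with short, structured answers stay in the low seconds on the smallest models, which is why intent classification fits this hardware better than open-ended chat.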

Why CPU-only inference is slow

Two architectural facts bound throughput:

Memory bandwidth: The Pi 4's LPDDR4 delivers around 4 GB/s of usable bandwidth for matrix-multiply workloads. Generating each token requires reading essentially the full set of weights from memory; a 3B model at q4 is about 2 GB, so bandwidth alone caps generation at roughly 2 tokens per second (4 GB/s divided by 2 GB per token) before any compute time is counted. Smaller models reach higher rates because they read less data per token.
No dedicated tensor hardware: The Pi 4's Cortex-A72 cores have NEON SIMD but no equivalent of NVIDIA's tensor cores or Apple's Neural Engine. Every matrix multiply runs on general-purpose vector units, which is dramatically less power-efficient than dedicated hardware.

Software optimizations (NEON kernels, quantization-aware code paths) get you some of the way; the fundamental ceilings remain. There's no software trick that turns a 4 GB/s memory bus into a 360 GB/s one.
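
The bandwidth ceiling described above is a single division, sketched below. It assumes every generated token reads each weight exactly once and ignores caches and compute time, so it is a first-order bound rather than a prediction. The 4 GB/s and 360 GB/s inputs are the figures quoted in this section; the function name is illustrative.

```python
def ceiling_tok_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on generation rate if each token reads all weights once."""
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    model_gb = 2.0  # a 3B model at q4, per the text above
    for label, bandwidth in (("Pi 4 (4 GB/s)", 4.0), ("360 GB/s bus", 360.0)):
        rate = ceiling_tok_per_s(bandwidth, model_gb)
        print(f"{label:14s} ceiling: {rate:.0f} tok/s")
```

The result is only as good as the bandwidth figure fed in, since real rates depend on the bandwidth the inference kernels actually achieve.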

SSD vs microSD

microSD cards are slow (typically 50-100 MB/s sequential read on the Pi's interface) and have limited write endurance. For LLM workloads where the model file is large (multi-GB) and read frequently, the microSD becomes the load-time bottleneck and a long-term reliability risk.

A USB 3.0-attached SSD (the Pi 4 has USB 3.0; earlier Pis didn't) achieves 200-400 MB/s reads through the bridge chip. Model load times drop by 3-5×, and write endurance becomes a non-issue. The WD Blue SN550 NVMe in a USB enclosure or a Crucial BX500 SATA SSD with a USB bridge are both well-tested in this role.

A common upgrade pattern: boot the OS from microSD for the first install, then migrate the root filesystem to the SSD using rpi-clone or the official SD Card Copier. Subsequent boots happen from SSD at much higher speed and reliability.
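
To see where a particular card or SSD lands, you can time a sequential read of a large file and convert that into an expected model load time. This is a rough sketch: the function names are illustrative, and a file that is already in the page cache will read far faster than the storage can deliver, so test with a file that has not been read since boot.

```python
import time


def read_throughput_mb_s(path: str, chunk_bytes: int = 4 * 1024 * 1024) -> float:
    """Sequentially read a file and return the observed throughput in MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:  # unbuffered at the Python level
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / 1e6 / max(elapsed, 1e-9)


def load_seconds(model_gb: float, mb_per_s: float) -> float:
    """Estimated cold-load time for a model of the given size."""
    return model_gb * 1000 / mb_per_s


if __name__ == "__main__":
    # Using read speeds from the ranges quoted above for a 2.4 GB model.
    for label, speed in (("microSD at 50 MB/s", 50), ("USB SSD at 300 MB/s", 300)):
        print(f"{label}: about {load_seconds(2.4, speed):.0f} s")
```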

Realistic use cases

Use cases where a Pi 4 LLM works well:

Home automation intent classification: a small model translates spoken or typed commands into structured intents for Home Assistant or similar. Latency is acceptable (1-3 seconds) and the model is small.
Offline narrow-domain assistants: a 3B model fine-tuned (or just system-prompted) for cooking recipes, gardening advice, or a specific game's lore. Limited scope keeps quality high despite small model size.
Prototype edge-AI projects: validating that a workflow makes sense before deploying it to faster hardware. The Pi runs the same llama.cpp / Ollama stack as a desktop, just slower.
Learning the local-LLM tooling: getting comfortable with model loading, prompt design, and inference-engine configuration on cheap hardware.

Use cases where a Pi 4 LLM doesn't work:

Interactive chat: the 4-7 tok/s a 3B model manages is below the threshold where chat feels responsive.
Coding assistance: too slow per response, too small for code reasoning quality.
Image generation: not realistically possible — diffusion models need GPU acceleration.
Long context: KV cache grows quickly and the 8GB ceiling caps useful context length.
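
The long-context limit can be made concrete with the standard KV cache size formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below plugs in the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) and assumes a 16-bit cache; a runtime configured with a quantized cache would use less. The function name is illustrative.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size: keys plus values, for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


if __name__ == "__main__":
    # Llama 3.1 8B shape; 16-bit cache values are an assumption.
    for context in (2048, 8192, 32768):
        size_gib = kv_cache_bytes(32, 8, 128, context) / 2**30
        print(f"{context:6d} tokens: {size_gib:.2f} GiB of KV cache")
```

This memory comes on top of the model weights, so on an 8GB board the usable context shrinks as the model gets larger.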

When to step up

If your use case outgrows the Pi 4, the natural upgrade paths:

Raspberry Pi 5 8GB: roughly 2-3× the LLM throughput at similar power, similar form factor. The right upgrade if you love the Pi platform.
Intel N100 Mini PC: 4-6× the throughput, runs Ollama and Open WebUI comfortably with iGPU acceleration, lands around $200.
12GB GPU desktop rig: the cheapest serious local-LLM hardware (see our budget LLM build coverage). 50-100× the Pi 4's throughput, runs models 5-10× larger.

The Pi 4 is the right entry point for learning and tinkering. It's not the right platform for production local-LLM work; step up when your workload demands it.

Bottom line

A Pi 4 8GB runs local LLMs slowly, on small models, with constraints that come from architecture rather than software. It's a great learning platform, a credible edge-AI host for narrow tasks, and a fun tinkering target. For chat-class interactive use or any 7B+ model, you need different hardware. Pair the Pi with a USB SSD, install Ollama, grab a 3B-class q4 model, and treat the platform for what it is: a low-power, always-on, small-model edge node.

Frequently asked questions

How fast can a Raspberry Pi 4 8GB actually run an LLM?

Slowly by GPU standards. With small quantized models in the 1B-3B range, a Pi 4 produces a few tokens per second — usable for short prompts and background tasks but tedious for interactive chat. The 7B class technically loads in 8GB at low quantization but crawls at 1-2 tokens per second. Treat the Pi as an edge-AI and learning platform, not a responsive chat box. The hardware ceiling is memory bandwidth, not software optimization; no amount of tuning will turn the Pi 4 into a fast LLM host.

Do I need an SSD, or will a microSD card work?

An SSD over USB is strongly recommended. Model weights are large and read frequently at load, and microSD cards are slow and prone to wear-out under heavy use, causing corruption on long-running projects. A USB-attached SSD like the WD SN550 or Crucial BX500 dramatically improves load times and reliability, which matters when a model fills most of the Pi's resources. The cost delta is small and the reliability improvement is real.

Which models should I run on a Pi 4?

Stick to small, quantized models — 1B to 3B parameter classes at q4 — for acceptable responsiveness. These handle classification, simple Q&A, summarization of short text, and home-automation intents well. Larger 7B models fit but respond too slowly for interactive use. Match the model to lightweight, latency-tolerant tasks rather than expecting desktop-class conversation. Phi-3 Mini 3.8B at q4_K_M is the current sweet spot for capability-per-GB on this hardware.

Can I use Ollama on a Raspberry Pi?

Yes. Ollama runs on ARM Linux and is one of the easiest ways to pull and serve small models on a Pi 4. It handles model management and exposes an API your home-automation or scripts can call. Performance is bound by the Pi's CPU and memory bandwidth, so pick small models, but the software experience itself is straightforward. Installation is a single shell command and the same Ollama API your desktop uses works identically on the Pi.
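
As an illustration of calling that API from a script, here is a minimal sketch of a home-automation intent classifier. It assumes Ollama is listening on its default local port (11434) and uses the non-streaming /api/generate endpoint with JSON output. The model tag, the prompt wording, and the "intent"/"target" keys are assumptions chosen for the example, not something this article or Home Assistant prescribes.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "qwen2.5:3b"  # assumption: a small model tag pulled beforehand


def build_payload(command: str, model: str = MODEL) -> dict:
    """Build a non-streaming request that asks the model for JSON output."""
    prompt = (
        "Classify the home-automation command as JSON with the keys "
        '"intent" and "target". Command: ' + command
    )
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def parse_intent(response_body: str) -> dict:
    """Extract the model's JSON answer from the reply's 'response' field."""
    outer = json.loads(response_body)
    return json.loads(outer["response"])


def classify(command: str, timeout_s: float = 120.0) -> dict:
    """Send one command to the local Ollama server and return the parsed intent."""
    data = json.dumps(build_payload(command)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as reply:
        return parse_intent(reply.read().decode("utf-8"))


if __name__ == "__main__":
    print(classify("turn on the kitchen light"))
```

The generous timeout reflects the Pi's speed: the first request also pays the model load time, so a caller should tolerate a slow initial reply.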

When should I move off the Pi to real hardware?

If you want responsive chat, larger models, or image generation, the Pi's CPU-only inference becomes the bottleneck and a 12GB GPU rig or a capable mini-PC is the next step. Use the Pi to learn the tooling and prototype edge tasks, then graduate to a GPU when speed and model size start limiting what you can build. The threshold most builders cross is when prompt processing latency exceeds the 2-3 second mark and you start avoiding the model out of impatience.

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.
