Can a Raspberry Pi 4 8GB Run a Local LLM in 2026? Realistic tok/s

Item: Raspberry Pi 4 Computer Model B 8GB Single Board Computer Suitable for Building Mini PC/Smart Robot/Game Console/Workstation/Media Center/Etc.
Author: Mike Perry

Yes, at 1B-3B parameters at q4: realistic tok/s, honest limits, and where an $85 SBC beats a $2000 GPU (spoiler: perf-per-watt).

By Mike Perry · Published 2026-05-31 · Last verified 2026-07-20 · 8 min read

A Pi 4 8GB runs 1B-3B parameter LLMs at q4 at a few tok/s: great for offline command parsing and always-on inference at 5W, not for chat.

Can a Raspberry Pi 4 8GB run a useful local LLM in 2026? Yes, if "useful" means a 1B-3B parameter model at q4 quantization producing 2-8 tokens per second. That is enough for offline command parsing, intent classification, and background tasks, and not enough for interactive chat. The 8GB RAM ceiling is your hard limit; the Cortex-A72 CPU is your soft one. Under those constraints, the Pi is a genuinely interesting inference node.

Who this is for

You already have a Pi 4 8GB, or you're deciding whether to buy one for an "on-device AI" project. You've read that llama.cpp runs on ARM and want to know what that actually delivers on your board before you commit. Maybe you're building a voice assistant that shouldn't send commands to the cloud, or a homelab-style content router that classifies text before deciding where to send it. Those workloads fit the Pi. Anything that expects a 7B-class model at snappy chat speed does not.

The Pi 4's compute story, per the official Raspberry Pi product page: 1.8 GHz quad-core ARM Cortex-A72 and 8GB LPDDR4-3200 (this variant). The A72 has NEON SIMD but lacks the dot-product and int8 matrix-multiply extensions of newer Arm cores. Peak memory bandwidth is roughly 12 GB/s. For context: an RTX 3060 12GB does 360 GB/s. That 30x bandwidth gap explains most of what happens next.

The good news is that quantized inference isn't only about raw FLOPs. Small models running in the Pi's RAM sidestep the storage bottleneck entirely, and llama.cpp's ARM-optimized kernels do a genuinely respectable job of extracting throughput from the A72. The result is a computer that costs $85 and runs a real language model — that's a wonderful thing, even if the tokens come slowly.

Key takeaways

Target 1B-3B models at q4_K_M: TinyLlama 1.1B, Phi-3 Mini 3.8B, Gemma 2B, Llama 3.2 1B/3B.
Realistic tok/s: 2-8 for 1B-3B models at q4; ~1 for 7B models at q4 (if they fit at all).
The Pi 4's 8GB RAM is the ceiling — a q4 7B model sits at ~4.5 GB and leaves little headroom for KV cache or the OS.
Cool the SoC actively; sustained inference throttles a passive-cooled Pi within minutes.
Boot from an SSD over USB 3.0 — not for inference speed but for model load times and OS reliability.
The right workloads are asynchronous: home-automation intent, log summarization, offline command parsing. Interactive chat belongs on a discrete GPU.

Which models fit in 8GB of Pi RAM?

Every model at every quantization has a rough RAM footprint. The rule of thumb: quantized weights are approximately param_count x bytes_per_param, plus a couple hundred MB of KV cache for a short context, plus ~500 MB for the OS. On a Pi 4 8GB you have effectively ~7 GB usable.

Model	Params	Quant	Weight RAM	Total RAM (est)	Fits Pi 4 8GB?
TinyLlama	1.1B	q4_K_M	~0.7 GB	~1.2 GB	Yes, easily
Gemma 2B	2.0B	q4_K_M	~1.4 GB	~2.0 GB	Yes
Llama 3.2 1B	1.2B	q4_K_M	~0.8 GB	~1.3 GB	Yes
Llama 3.2 3B	3.2B	q4_K_M	~2.0 GB	~2.8 GB	Yes
Phi-3 Mini	3.8B	q4_K_M	~2.4 GB	~3.2 GB	Yes
Qwen 2.5 3B	3.1B	q4_K_M	~2.0 GB	~2.8 GB	Yes
Llama 3 8B	8.0B	q4_K_M	~4.5 GB	~5.5 GB	Tight, swappy
Mistral 7B	7.2B	q4_K_M	~4.1 GB	~5.0 GB	Tight
Llama 3 8B	8.0B	q3_K_M	~3.8 GB	~4.8 GB	Yes but rough quality

The comfortable envelope is 1B-3B. The 7B/8B tier technically fits at q3-q4 but leaves so little headroom that anything else you do on the Pi (network, storage I/O, the OS scheduler) starts pushing weights out of cache, and throughput craters.
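The rule of thumb above can be sketched in a few lines of Python. This is a rough estimator, not a measurement: the 4.5 bits/param figure for q4_K_M comes from the quantization table later in this article, and the 0.2 GB KV cache and 0.5 GB OS allowances are the article's ballpark numbers turned into default parameters. Its results land within a few hundred MB of the table.

```python
def estimate_ram_gb(params_billion, bits_per_param, kv_cache_gb=0.2, os_gb=0.5):
    """Return (weights_gb, total_gb) using the article's rule of thumb."""
    # 1e9 params * bits / 8 bits-per-byte / 1e9 bytes-per-GB simplifies to this:
    weights_gb = params_billion * bits_per_param / 8
    return weights_gb, weights_gb + kv_cache_gb + os_gb


USABLE_GB = 7.0  # "effectively ~7 GB usable" on a Pi 4 8GB

for name, params in [("TinyLlama", 1.1), ("Llama 3.2 3B", 3.2), ("Llama 3 8B", 8.0)]:
    weights, total = estimate_ram_gb(params, bits_per_param=4.5)
    verdict = "fits" if total <= USABLE_GB else "does not fit"
    print(f"{name}: weights ~{weights:.1f} GB, total ~{total:.1f} GB, {verdict}")
```

Note that "fits" here only means the estimate is under 7 GB; it says nothing about the headroom problem described above for the 7B/8B tier.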

How fast do they generate on a Pi 4?

Community benchmarks published on llama.cpp's GitHub and Pi forums cluster around a consistent pattern. These are measured with the Pi 4 8GB running Raspberry Pi OS 64-bit, actively cooled, kernel 6.6+, llama.cpp compiled with -mcpu=cortex-a72 -mtune=cortex-a72.

Model	Quant	Prefill tok/s	Generation tok/s
TinyLlama 1.1B	q4_K_M	~28	~8
Llama 3.2 1B	q4_K_M	~24	~6
Gemma 2B	q4_K_M	~14	~5
Llama 3.2 3B	q4_K_M	~8	~3
Phi-3 Mini 3.8B	q4_K_M	~7	~2.5
Mistral 7B	q4_K_M	~3	~1
Llama 3 8B	q3_K_M	~2.5	~0.9

Prefill (the pass that reads your prompt before generation starts) is faster than generation because it's compute-bound and the Pi's four cores can share the work. Generation is memory-bandwidth-bound: every generated token requires reading all of the model's weights from RAM. That's where the 12 GB/s bandwidth ceiling hits hardest.

Quantization matrix (q2-q8): RAM required + tok/s + quality loss on ARM

The llama.cpp codebase supports a wide range of quantizations. On ARM, the sweet spot for the Pi's cache hierarchy is q4_K_M. Below that, quality degrades noticeably on small models; above that, RAM and bandwidth kill throughput.

Quant	Bits/param	RAM for 3B model	Relative tok/s	Quality vs fp16
q2_K	~2.5	~1.2 GB	~1.4x	Major degradation on tiny models
q3_K_M	~3.3	~1.5 GB	~1.2x	Moderate
q4_K_M	~4.5	~2.0 GB	1.0x (baseline)	Small
q5_K_M	~5.5	~2.5 GB	~0.9x	Very small
q6_K	~6.4	~2.9 GB	~0.75x	Near-lossless
q8_0	8	~3.5 GB	~0.55x	Effectively fp16

If you're building anything user-facing, q4_K_M is where you land. Drop to q3 only when you must fit a larger model and can tolerate the coherence loss.
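As a rough aid for the "fit a larger model" decision, the sketch below picks the highest-precision quant whose weights fit a given RAM budget. The bits-per-parameter values are copied from the table above; the function name and the budget parameter are illustrative. It answers only the RAM question and ignores the throughput penalty of the larger quants.

```python
# (quant name, approximate bits per parameter), highest precision first.
QUANTS = [
    ("q8_0", 8.0),
    ("q6_K", 6.4),
    ("q5_K_M", 5.5),
    ("q4_K_M", 4.5),
    ("q3_K_M", 3.3),
    ("q2_K", 2.5),
]


def best_quant_that_fits(params_billion, weight_budget_gb):
    """Return the highest-precision quant whose weights fit, or None if none do."""
    for name, bits in QUANTS:
        if params_billion * bits / 8 <= weight_budget_gb:
            return name
    return None


print(best_quant_that_fits(3.2, weight_budget_gb=2.0))
print(best_quant_that_fits(8.0, weight_budget_gb=4.0))
```

A small model on a generous budget will come back as q8_0, so treat the result as an upper bound on precision and step down toward q4_K_M when speed matters.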

Prefill vs generation: why the first token is slow on a Pi

Prefill processes the entire prompt in one pass: every prompt token runs through every transformer layer. On the Pi 4, four A72 cores share this pass and it's mostly compute-bound. Prefill time grows roughly linearly with prompt length, so long prompts are expensive: at the ~8 tok/s prefill rate measured for a 3B model at q4, a 1000-token system prompt takes about two minutes before the first output token appears.

Generation, once started, is a per-token pass over the same weights (plus a growing KV cache). It's bandwidth-bound: at 12 GB/s and a 2 GB model, the theoretical ceiling is ~6 tokens/second. Measured throughput for a 2 GB model (~3 tok/s for Llama 3.2 3B) lands at about half of that ceiling, because sustained bandwidth falls short of the peak figure and because of KV cache growth and cache misses.
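The bandwidth ceiling is simple division, sketched here with the weight sizes and measured generation rates from the tables earlier in this article. The 12 GB/s figure is the article's peak-bandwidth number, not a measured sustained rate.

```python
def generation_ceiling_tok_s(bandwidth_gb_s, weights_gb):
    """Upper bound on tok/s if each token streams all weights from RAM once."""
    return bandwidth_gb_s / weights_gb


PI4_BANDWIDTH_GB_S = 12.0

# (model, q4_K_M weight size in GB, measured generation tok/s) from the tables above
MODELS = [
    ("TinyLlama 1.1B", 0.7, 8.0),
    ("Llama 3.2 3B", 2.0, 3.0),
    ("Mistral 7B", 4.1, 1.0),
]

for name, weights_gb, measured in MODELS:
    ceiling = generation_ceiling_tok_s(PI4_BANDWIDTH_GB_S, weights_gb)
    print(f"{name}: ceiling ~{ceiling:.1f} tok/s, measured ~{measured:.1f} "
          f"({measured / ceiling:.0%} of ceiling)")
```

On these numbers the measured rates come out at roughly a third to a half of the theoretical ceiling.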

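Practical guidance: keep prompts short. At the prefill rates in the benchmark table, a 200-token prompt takes roughly 8 seconds on a 1B model and 25 seconds on a 3B model before the first token appears, so a well-designed prompt for a Pi should be well under 200 tokens. Reserve long-context work for machines with the compute and bandwidth to sustain it.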
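To budget a prompt, divide token counts by the prefill and generation rates from the benchmark table. The sketch below does exactly that; it assumes both rates stay constant over the request, which ignores the slowdown from a growing KV cache.

```python
def response_time_s(prompt_tokens, output_tokens, prefill_tok_s, gen_tok_s):
    """Return (time_to_first_token_s, total_time_s) for one request."""
    ttft = prompt_tokens / prefill_tok_s
    total = ttft + output_tokens / gen_tok_s
    return ttft, total


# Llama 3.2 1B at q4_K_M: ~24 tok/s prefill, ~6 tok/s generation (benchmark table)
for prompt_tokens in (24, 120, 480):
    ttft, total = response_time_s(prompt_tokens, output_tokens=12,
                                  prefill_tok_s=24, gen_tok_s=6)
    print(f"{prompt_tokens:>4} prompt tokens: first token after {ttft:.0f} s, "
          f"done after {total:.0f} s")
```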

Does an SSD swap file help, and how to set it up?

Short answer: not really, for inference. But it's still worth having.

An SSD over USB 3.0 delivers ~350 MB/s read, an order of magnitude faster than a microSD card but still about 35x slower than the Pi's ~12 GB/s RAM. If a model spills into swap, generation slows from tok/s to sec/tok territory. The SSD lets a slightly-too-big model load without OOM, but running from swap is functionally unusable.

Where the SSD earns its keep:

Model load times. Phi-3 Mini 3.8B (a ~2.4 GB file at q4_K_M) loads in ~11 seconds from SSD vs. ~45 seconds from microSD.
OS reliability. SD cards wear from constant metadata writes; SSDs don't care.
Filesystem for logs, embeddings, and prompt history. Any workflow that stores state benefits.

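Configuration takes four commands:

# Reserve a 4 GB file for swap
sudo fallocate -l 4G /var/swapfile
# Restrict access to root; swapon warns about looser permissions
sudo chmod 600 /var/swapfile
# Format the file as swap space, then enable it
sudo mkswap /var/swapfile
sudo swapon /var/swapfile

For persistence, add the line "/var/swapfile none swap sw 0 0" to /etc/fstab. If you boot from the SSD, /var/swapfile already lives on it; otherwise substitute a path on the mounted SSD, not the SD card.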
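Because a model that spills into swap is functionally unusable, it helps to check swap usage while a model is loaded. The sketch below parses text in the /proc/meminfo format, a standard Linux interface that reports values in kB. The function names are illustrative, and the sample text stands in for the real file so the example runs anywhere.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_mb(meminfo):
    """Swap currently in use, in MB."""
    return (meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)) / 1024


def read_meminfo(path="/proc/meminfo"):
    """Read the live values on a Linux system such as Raspberry Pi OS."""
    with open(path) as f:
        return parse_meminfo(f.read())


# Made-up sample values for illustration only
SAMPLE = "MemTotal: 7999999 kB\nSwapTotal: 4194300 kB\nSwapFree: 4090000 kB\n"
info = parse_meminfo(SAMPLE)
print(f"swap in use: {swap_used_mb(info):.0f} MB")
```

On the Pi itself, call swap_used_mb(read_meminfo()) before and after loading a model; a large jump suggests the model does not fit in RAM.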

When a Pi Zero W or a discrete-GPU box makes more sense

A Pi Zero W has 512 MB of RAM — enough for a distilled 500M-class model at q4 for extremely narrow tasks (single-intent classification, wake-word triggering) but not for anything you'd call an LLM. Use it when the target is "microcontroller-plus-a-tiny-transformer," not "compact chat assistant."

A discrete-GPU box (a rig with an RTX 3060 12GB) jumps nearly two orders of magnitude in throughput. That card runs Llama 3 8B at q4 at ~55 tok/s vs. the Pi's ~1 tok/s. If you need interactive chat, coding assistance, or genuine reasoning at speed, the desktop is the correct answer.

The Pi's niche is between those extremes: too capable for a microcontroller task, too slow for a chat product, exactly right for an always-on background inference node in a homelab.

Perf-per-watt: the appeal of a 5-watt inference node

The Pi 4 8GB running a 1B model at q4 draws about 5.5 W at the wall. Compare:

System	Wattage under LLM load	tok/s (1B-8B)
Raspberry Pi 4 8GB	~5.5 W	1-8
Intel N100 mini-PC	~15 W	10-35
Ryzen 5 5600G desktop	~65 W	20-60
RTX 3060 12GB desktop	~250 W (system)	55-120
RTX 4090 desktop	~450 W (system)	200+

Watts per token favor bigger hardware for chat, and favor the Pi for always-on background work. A Pi 4 running an intent classifier 24/7 for a home-automation router costs about $7/year in electricity, assuming roughly $0.15/kWh. A 65W desktop doing the same costs about $85. That gap is the reason the Pi still belongs in the conversation.
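The arithmetic behind these figures can be checked with a short sketch. The electricity rate of $0.15/kWh is an assumption chosen because it reproduces the article's $7 and $85 figures. Pairing the top of each tok/s range with a 1B model and the bottom with an 8B model is also an assumption, since the table gives only ranges.

```python
HOURS_PER_YEAR = 24 * 365


def annual_cost_usd(watts, usd_per_kwh=0.15):
    """Electricity cost of running a constant load around the clock for a year."""
    return watts / 1000 * HOURS_PER_YEAR * usd_per_kwh


def joules_per_token(watts, tok_s):
    """Energy per generated token; a watt is one joule per second."""
    return watts / tok_s


print(f"Pi 4 at 5.5 W:   ${annual_cost_usd(5.5):.2f}/year")
print(f"Desktop at 65 W: ${annual_cost_usd(65):.2f}/year")

# (system, watts, assumed tok/s on a 1B model, assumed tok/s on an 8B model)
for name, watts, fast, slow in [("Pi 4 8GB", 5.5, 8, 1), ("RTX 3060 desktop", 250, 120, 55)]:
    print(f"{name}: {joules_per_token(watts, fast):.2f} J/token (1B), "
          f"{joules_per_token(watts, slow):.2f} J/token (8B)")
```

Under that pairing assumption, the Pi uses less energy per token on the smallest models and slightly more than the RTX 3060 system at the 8B end, which matches the split described above.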

Bottom line

The Pi 4 8GB is a real local-LLM host, provided you match the workload to the constraint. Small models (1B-3B) at q4 quantization run at usable speeds for asynchronous or background tasks: intent classification, log summarization, offline command parsing, home-automation glue, prompt routing before a bigger backend takes over. Storage should be an SSD over USB 3.0 for load times and OS durability. Cool actively. Keep prompts short.

Anything that requires interactive chat, code assistance, or reasoning beyond simple pattern matching wants a discrete GPU. That's where an RTX 3060 12GB rig becomes the correct answer — see our llama.cpp vs Ollama on the 12GB RTX 3060 coverage and the Qwen3-27B local LLM piece for what happens when you have real VRAM to work with.

The Pi is the right first step: it teaches you quantization, ARM inference tuning, and workload shaping without the risk of a $2000 upgrade. If you outgrow it, you'll know exactly what you're paying for.

Common pitfalls on the Pi

Passive cooling under sustained load. A bare Pi 4 hits ~80°C on a 5-minute sustained inference run and thermal-throttles into slower generation. Use an active cooler (Argon ONE V2, Argon ONE M.2, or a case with a small PWM fan).
Under-powered PSU. A weak phone charger or thin USB-C cable sags under load, triggers under-voltage throttling, and in a brownout collapses generation to zero. Use the official Raspberry Pi 4 supply (5.1V/3A, 15.3W) or an equivalent rated for a sustained 3A.
Booting an inference job from cold SD. First-token latency includes model load; on SD that's 45+ seconds. Boot from SSD via USB 3.0 so model loads land in 10-15 seconds.
Wrong context length. Doubling the prompt length roughly doubles prefill time on the Pi's compute-bound prefill pass; keep prompts tight.
Assuming NEON alone is enough. Compile llama.cpp with the correct -mcpu=cortex-a72 flag; the default build misses the tuning.
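For the cooling pitfall, a quick way to see whether a run is approaching the throttle point is to poll the SoC temperature. The sketch below assumes the usual Linux sysfs path /sys/class/thermal/thermal_zone0/temp, which reports millidegrees Celsius; verify that path on your own image. The 80°C threshold is the figure quoted above, and the 5°C margin is an arbitrary illustrative default.

```python
THROTTLE_C = 80.0  # temperature the article cites for a bare Pi 4 under load


def parse_millidegrees(raw):
    """Convert a sysfs thermal reading such as '61234\\n' to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_soc_temp_c(path="/sys/class/thermal/thermal_zone0/temp"):
    """Read the current SoC temperature (path is an assumption, see above)."""
    with open(path) as f:
        return parse_millidegrees(f.read())


def is_near_throttle(temp_c, margin_c=5.0):
    """True once the temperature is within margin_c of the throttle point."""
    return temp_c >= THROTTLE_C - margin_c


sample = parse_millidegrees("76500\n")  # canned reading so the sketch runs anywhere
print(f"{sample:.1f} C, near throttle: {is_near_throttle(sample)}")
```

On the Pi, call read_soc_temp_c() every few seconds during a long generation run and compare tok/s before and after the temperature crosses the margin.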

Two example builds worth stealing

Build A — Always-on intent classifier for Home Assistant. Pi 4 8GB running Llama 3.2 1B at q4 behind a tiny Flask endpoint. Home Assistant sends free-text commands; the model classifies them into automation intents ("turn on kitchen light", "set thermostat 68F"). Latency: ~500 ms. Power draw: 5W. Cost: $100 all-in including case and SSD. Beats a cloud round-trip for anything privacy-sensitive.
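The glue around the model in Build A might look like the sketch below. Everything named here is hypothetical: the intent labels, the prompt wording, and the generate callable, which stands in for whatever actually invokes the model (the Flask endpoint and the llama.cpp call are deliberately left out). The prompt is kept to a few dozen tokens because prefill time dominates latency on the Pi.

```python
INTENTS = ["light_on", "light_off", "set_thermostat", "unknown"]  # hypothetical labels


def build_prompt(command):
    """Build a short classification prompt to keep prefill time low."""
    labels = ", ".join(INTENTS)
    return (
        f"Classify the command into one of: {labels}.\n"
        f"Command: {command}\n"
        "Intent:"
    )


def parse_intent(model_output):
    """Map raw model text to a known intent, falling back to 'unknown'."""
    words = model_output.strip().lower().split()
    first = words[0].strip(".,:;\"'") if words else ""
    return first if first in INTENTS else "unknown"


def classify(command, generate):
    """generate is any callable that takes a prompt and returns model text."""
    return parse_intent(generate(build_prompt(command)))


def fake_generate(prompt):
    """Stub in place of the real model call so the sketch runs anywhere."""
    return " light_on\n" if "turn on" in prompt.lower() else "I am not sure"


print(classify("turn on kitchen light", fake_generate))
print(classify("what is the weather", fake_generate))
```

Validating the output against a fixed label list matters with 1B-class models, which often add stray words; anything unrecognized falls through to "unknown" instead of triggering an automation.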

Build B — Log-summarizing homelab helper. Pi 4 8GB running Phi-3 Mini 3.8B at q4, cron-scheduled to summarize each service's last 24h of logs at 3am. Output goes to a static HTML dashboard. Slow generation is fine because it runs unattended overnight. Power: still 5W average. Value: a nightly digest of what changed across your homelab without paying for a cloud LLM.
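One practical detail for Build B is keeping each service's log excerpt inside a prompt budget, since a day of logs can far exceed what the Pi can prefill in reasonable time. The sketch below keeps the most recent lines that fit. The 4-characters-per-token ratio is a rough heuristic assumed here, not a property of any particular tokenizer, and the function name is illustrative.

```python
def tail_within_budget(lines, max_prompt_tokens, chars_per_token=4):
    """Keep the most recent lines whose combined length fits a rough token budget."""
    budget_chars = max_prompt_tokens * chars_per_token
    kept, used = [], 0
    for line in reversed(lines):
        cost = len(line) + 1  # +1 for the newline that joins lines in the prompt
        if used + cost > budget_chars:
            break
        kept.append(line)
        used += cost
    return list(reversed(kept))  # restore chronological order


# Made-up log lines for illustration only
logs = [f"service restarted, attempt {i}" for i in range(1, 201)]
excerpt = tail_within_budget(logs, max_prompt_tokens=150)
print(f"kept {len(excerpt)} of {len(logs)} lines")
```

Because the job runs unattended overnight, the budget can be larger than for an interactive prompt; it just needs to finish before the next run.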

Both projects are wonderful weekend builds and demonstrate the pattern: async task, tiny model, high tolerance for latency.

Frequently asked questions

Which models actually run on a Pi 4 8GB?

Small models in the 1B-3B range at q4 quantization fit comfortably in 8GB and run without swapping, which is the sweet spot. A 7B model technically loads at aggressive quantization but leaves little headroom and slows down. For a Pi, the practical target is tiny, quantized models used for simple tasks rather than large general-purpose assistants that expect a GPU.

How many tokens per second should I expect?

Generation on a CPU-only ARM board is slow compared to a GPU: expect roughly 2-8 tok/s on 1B-3B models at q4, dropping to about 1 tok/s at 7B. It is usable for short prompts and background tasks, not for snappy interactive chat. Community benchmarks vary with model, quant, and cooling, so treat any figure as a rough guide rather than a fixed spec.

Does adding an SSD swap file help?

An SSD swap over USB 3.0 lets a slightly-too-big model load without crashing, but swapping to storage is far slower than RAM, so throughput suffers badly once the model spills to disk. It is a workaround, not a fix. The better approach is to choose a model that fits fully in the Pi's RAM; the SSD's real value is fast OS and file access.

Is the Pi 4 the best board for on-device AI?

For ultra-low-power experiments and learning, yes — its 5-watt-class draw makes it a fine always-on inference node for tiny models. For anything demanding, boards with NPUs or a discrete GPU vastly outperform it. Think of the Pi 4 as an excellent place to understand quantization and edge inference tradeoffs before investing in faster hardware like a 12GB desktop GPU.

What can I realistically build with Pi-based LLM inference?

Good fits include offline command parsing, simple text classification, home-automation intent detection, and small chat helpers where latency tolerance is high. These leverage the Pi's low power and always-on nature without needing frontier-model quality. Anything requiring fast, high-quality generation belongs on a GPU. Matching the task to the tiny-model constraint is the key to a satisfying Pi AI project.