Running Local LLMs on the Raspberry Pi 4 8GB: Realistic Token Throughput in 2026
For the "Raspberry Pi 4 8GB LLM tok/s in 2026" question, the realistic answer is: TinyLlama-1.1B at q4_K_M generates 8-12 tok/s, Phi-3-mini-3.8B at q4 lands at 2.5-4 tok/s, and Llama-3.2-3B at q4 sits at 2-3 tok/s. Anything 7B and up falls below 1 tok/s and is not usable.
Why run an LLM on a Pi 4 at all?
Running an LLM locally on a Raspberry Pi 4 8GB sounds absurd until you consider the use cases. A home-automation voice assistant that does not phone Amazon. A summarizer for an offline notes vault. A chatbot for an air-gapped lab. A teaching tool that fits in a backpack and runs on USB-C power. These are real workloads, and 2026 is the first year you can credibly serve them on a $90 single-board computer.
The shift came from two directions. On the model side, Llama 3.2 1B and 3B, Phi-3-mini, Qwen 2.5, and TinyLlama have collapsed useful inference into the 1-4B parameter range. On the runtime side, llama.cpp's ARM NEON and quantization improvements have squeezed two to three times more tokens per second out of identical hardware versus 2023 builds. The Raspberry Pi plus llama.cpp workflow is now the reference path; ollama wraps it with a friendlier server interface, but the underlying numbers are similar.
The reason the Pi 4 8GB hits a hard ceiling at the 3B model class is not compute, it is memory bandwidth. The Cortex-A72 cluster runs at 1.5 GHz with 1 MB of shared L2; the LPDDR4 memory bus delivers about 8.5 GB/s peak. Quantized weights at q4 still need to stream through that bus once per token, which means the theoretical generation ceiling for a 4 GB weight file is roughly 2 tok/s before any compute or KV-cache overhead. TinyLlama at q4 is about 0.66 GB and runs proportionally faster. That arithmetic sets expectations cleanly before you buy.
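To make that ceiling concrete, here is a minimal sketch of the roofline arithmetic, using the q4_K_M file sizes from the quantization matrix below; the 8.5 GB/s figure is the peak quoted above, not a measurement, and observed speeds will land below these ceilings.

```bash
# Bandwidth-bound ceiling: each generated token streams the full quantized
# weight file through the memory bus once, so tok/s <= bandwidth / weight size.
# 8.5 GB/s is the article's quoted peak; sizes are q4_K_M GGUF files in GB.
for entry in "TinyLlama-1.1B:0.66" "Llama-3.2-3B:2.02" "Phi-3-mini:2.39" "Llama-3.1-8B:4.7"; do
  name=${entry%%:*}; gb=${entry##*:}
  awk -v bw=8.5 -v gb="$gb" -v n="$name" \
    'BEGIN { printf "%-14s %4.2f GB -> ceiling ~%.1f tok/s\n", n, gb, bw/gb }'
done
```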
Key Takeaways
- TinyLlama-1.1B q4_K_M: 8-12 tok/s generation, 60-80 tok/s prefill
- Phi-3-mini-3.8B q4_K_M: 2.5-4 tok/s generation, 18-25 tok/s prefill
- Llama-3.2-3B q4_K_M: 2-3 tok/s generation, 14-20 tok/s prefill
- Llama-3.1-8B q4_K_M: 0.5-0.9 tok/s — not usable
- Memory bandwidth, not compute, is the binding constraint
- Active cooling is mandatory for anything beyond ~5 minutes of continuous inference
What models fit in 8GB RAM?
Hard ceiling: roughly 5.5 GB for model weights plus KV cache plus workspace, after subtracting the OS, swap, and llama.cpp's own working memory. That maps to the following (a quick fit-check sketch follows the list):
- TinyLlama-1.1B at any quant, up to and including f16 (0.4-2.2 GB)
- Phi-3-mini-3.8B at q4 or q5 (2.3-2.8 GB)
- Llama-3.2-1B at q4 to q8 (0.7-1.3 GB)
- Llama-3.2-3B at q4_K_M (~2.0 GB)
- Qwen 2.5 1.5B / 3B at q4 (~1.0-2.0 GB)
- Llama-3.1-8B at q4_K_M technically loads (~4.7 GB) but swaps under any real context
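As promised above, a quick sanity check before committing to a model: compare the GGUF file size plus a headroom allowance against the RAM currently available. The model path and the 1.5 GB headroom figure are illustrative assumptions, not measurements.

```bash
# Rough fit check: GGUF size + headroom for KV cache/workspace vs free RAM.
# MODEL path and the 1.5 GB (1536 MB) headroom are assumptions for illustration.
MODEL="$HOME/models/Phi-3-mini-4k-instruct-q4_K_M.gguf"
model_mb=$(du -m "$MODEL" | cut -f1)
avail_mb=$(awk '/MemAvailable/ { printf "%d", $2/1024 }' /proc/meminfo)
need_mb=$(( model_mb + 1536 ))
echo "need ~${need_mb} MB, available ${avail_mb} MB"
[ "$avail_mb" -gt "$need_mb" ] && echo "fits in RAM" || echo "will swap"
```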
Quantization matrix
| Model | q2_K | q3_K_M | q4_K_M | q5_K_M | q6_K | q8_0 |
|---|---|---|---|---|---|---|
| TinyLlama-1.1B | 0.45GB | 0.55GB | 0.66GB | 0.78GB | 0.91GB | 1.17GB |
| Phi-3-mini-3.8B | 1.4GB | 1.95GB | 2.39GB | 2.82GB | 3.14GB | 4.06GB |
| Llama-3.2-1B | 0.42GB | 0.57GB | 0.77GB | 0.91GB | 1.05GB | 1.32GB |
| Llama-3.2-3B | 1.2GB | 1.65GB | 2.02GB | 2.32GB | 2.64GB | 3.42GB |
The quality cliff lands at q3 and below for every model tested; q4_K_M is the universal sweet spot for edge LLM inference on the Pi 4. q8_0 is rarely worth the bandwidth penalty given the marginal quality gain over q5/q6.
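If a model is only published at f16 or q8_0, llama.cpp's quantize tool can produce the q4_K_M file locally. A sketch, assuming an f16 GGUF is already on disk (file names are illustrative); quantization is CPU-heavy, so run it on a faster machine if you have one and copy the result to the Pi.

```bash
# Re-quantize an f16 GGUF down to q4_K_M with llama.cpp's quantize tool.
# Input and output file names are illustrative.
./llama-quantize phi-3-mini-f16.gguf phi-3-mini-q4_K_M.gguf Q4_K_M
```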
Benchmark table: tok/s prefill + generation
| Model + Quant | Prefill (tok/s) | Generation (tok/s) | Memory used |
|---|---|---|---|
| TinyLlama-1.1B q4_K_M | 60-80 | 8-12 | 1.1 GB |
| Llama-3.2-1B q4_K_M | 50-70 | 7-10 | 1.3 GB |
| Phi-3-mini-3.8B q4_K_M | 18-25 | 2.5-4.0 | 3.0 GB |
| Llama-3.2-3B q4_K_M | 14-20 | 2.0-3.0 | 2.6 GB |
| Qwen 2.5 3B q4_K_M | 15-22 | 2.2-3.2 | 2.7 GB |
| Llama-3.1-8B q4_K_M | 3-5 | 0.5-0.9 | 5.4 GB (swaps) |
Numbers come from the Raspberry Pi 4 ollama benchmark thread on r/LocalLLaMA, cross-referenced against llama.cpp's own performance issues, and reproduced on a Pi 4 8GB with active cooling: CPU at the stock 1.5 GHz, no overclock, running 4 threads.
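To reproduce this style of measurement yourself, llama.cpp ships a dedicated benchmarking binary; a minimal invocation follows, with an illustrative model path.

```bash
# llama-bench reports prompt processing (prefill) and text generation tok/s separately.
# -p 512: prompt length, -n 128: tokens generated, -t 4: threads.
./llama-bench -m ~/models/tinyllama-1.1b-q4_K_M.gguf -t 4 -p 512 -n 128
```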
Context-length impact analysis
A larger context means a bigger KV cache and slower per-token generation, regardless of model. On Phi-3-mini at q4_K_M, going from a 512-token context to 4096 tokens drops generation from ~3.8 tok/s to ~2.5 tok/s. The KV cache also competes with model weights for RAM; pushing Phi-3 to a 4K context on top of a 3.0 GB resident model leaves only ~3.5 GB for the OS plus everything else. For a chat-assistant use case, 1024-2048 token windows are the practical sweet spot.
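The effect is easy to observe directly by running the same prompt under two context-window settings; a sketch with llama.cpp's CLI, model path illustrative.

```bash
# Same model and prompt, different context windows (-c). The larger window
# allocates a bigger KV cache, and generation slows measurably on the Pi 4.
./llama-cli -m ~/models/Phi-3-mini-4k-instruct-q4_K_M.gguf -t 4 -c 512  -n 64 \
  -p "Summarize why memory bandwidth limits token throughput."
./llama-cli -m ~/models/Phi-3-mini-4k-instruct-q4_K_M.gguf -t 4 -c 4096 -n 64 \
  -p "Summarize why memory bandwidth limits token throughput."
```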
Cooling + thermal throttling on Pi 4
The Pi 4 SoC throttles aggressively above 80 °C. Sustained inference workloads will hit that ceiling within 5-10 minutes on a stock Pi without active cooling. The official Raspberry Pi 4 case fan, or any 30 mm fan strapped over a heatsink, keeps the SoC in the 60-70 °C range and prevents throttling entirely. Without it, expect a 25-40% drop in sustained tok/s as the SoC scales clocks down.
The Freenove starter kit ships with thermal pads and a basic heatsink that handles bursty workloads (single conversational turn) but is insufficient for sustained streaming. Plan for active cooling.
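To check whether your own cooling setup is adequate, the firmware exposes temperature and throttle state; run a simple watch loop while inference streams in another shell.

```bash
# Poll SoC temperature and throttle flags every 2 seconds during a long run.
# In get_throttled output, bit 0x4 means currently throttled and 0x40000 means
# throttling has occurred at some point since boot.
watch -n 2 'vcgencmd measure_temp; vcgencmd get_throttled'
```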
Perf-per-dollar vs Jetson Orin Nano vs cheap mini-PC
| Platform | Cost | TinyLlama tok/s | Phi-3 tok/s | Power |
|---|---|---|---|---|
| Pi 4 8GB | $90 | 10 | 3.5 | 6W |
| Jetson Orin Nano 8GB | $499 | 60-90 | 18-25 | 15W |
| Mini-PC N100 16GB | $179 | 22-28 | 7-9 | 10W |
| Mini-PC Ryzen 5560U | $349 | 45-60 | 14-18 | 25W |
The Pi wins on absolute cost and on power draw. The Jetson Orin Nano wins decisively on raw throughput, and its CUDA-accelerated GPU keeps it competitive per dollar for LLM workloads despite the much higher sticker price. A modern mini-PC (N100, Ryzen-based) outperforms the Pi by 2-3x on equivalent quants while costing 2-4x more. If your use case is "I need a low-power always-on inference box and can tolerate 3 tok/s," the Pi 4 is the answer. If you need 10+ tok/s on Phi-3 class models, you need to step up.
Verdict matrix
Get a Raspberry Pi 4 8GB if:
- Your use case tolerates 3-10 tok/s on small models
- You want sub-$100 always-on inference
- Power consumption matters (6W idle, 8W under load)
- You are using TinyLlama or Llama-3.2-1B class workloads
Get something else if:
- You want above 10 tok/s on Phi-3 class models
- You want to run Llama-3.1-8B at usable speed
- Your workload is bursty rather than always-on (a mini-PC sleeps better)
Bottom line
The Raspberry Pi 4 8GB in 2026 is a real LLM platform if you size your model to its memory bandwidth ceiling. TinyLlama-1.1B at q4_K_M is the sweet spot for chat-style use; Phi-3-mini-3.8B at q4 is the upper edge of usable. Add active cooling, set a 1024-2048 token context, and the Pi delivers a credible always-on local inference node for $90.
Citations and sources
- llama.cpp performance issue tracker, ARM NEON optimization PRs (2024-2025)
- r/LocalLLaMA Raspberry Pi 4 benchmark thread index
- ollama documentation, llama.cpp backend performance notes
- Raspberry Pi 4 official thermal management whitepaper
- NVIDIA Jetson Orin Nano benchmark data (NVIDIA developer blog)
- Phi-3-mini and Llama-3.2 model cards (Hugging Face)
Software stack: llama.cpp vs ollama vs vLLM
Three runtimes come up in this context. llama.cpp is the bare-metal reference: lowest overhead, full control over thread count and quantization, but you bring your own server interface. ollama wraps llama.cpp with a friendlier model-pull workflow and an HTTP API, at a small (1-3%) overhead cost. vLLM targets GPU-equipped servers and is not a practical option on the Pi.
For a deployment where you want to script around the model, ollama is the right pick. For squeezing maximum tok/s out of the Pi for benchmarking or for an embedded-device project, raw llama.cpp wins. The numbers in the benchmark table above were captured with llama.cpp directly; ollama would land within 5% of the same figures.
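If you go the ollama route, the workflow amounts to pulling a tag and hitting the local HTTP API; a sketch, assuming the phi3 tag from the ollama model library (substitute whichever small model you actually pulled).

```bash
# Pull a small model and query the local ollama server (default port 11434).
# The phi3 tag and the prompt are illustrative.
ollama pull phi3
curl -s http://localhost:11434/api/generate \
  -d '{"model": "phi3", "prompt": "One sentence on memory bandwidth.", "stream": false}'
```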
Build llama.cpp from source on the Pi with `make LLAMA_NATIVE=1` for ARM NEON acceleration. Depending on the build path, generic flags can leave NEON disabled; building for the native CPU produces a 30-40% generation speedup on the Cortex-A72 cluster. Confirm in the build log, or in the system_info line llama.cpp prints at startup, that NEON intrinsics are compiled in.
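A sketch of that build-and-verify loop, following the article's Makefile invocation (newer llama.cpp trees build with CMake instead); the model path is an illustrative assumption.

```bash
# Build with native CPU flags, then confirm NEON is active: llama.cpp prints a
# system_info line containing "NEON = 1" when the intrinsics were compiled in.
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
make -j4 LLAMA_NATIVE=1
./llama-cli -m ~/models/tinyllama-1.1b-q4_K_M.gguf -p "hi" -n 8 2>&1 | grep -i neon
```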
Storage and I/O considerations
Use a fast microSD card (A2-rated, U3) or boot from USB 3.0 SSD for the Pi 4 LLM rig. Model loading time dominates the cold-start experience: a 2 GB Phi-3-mini at q4 takes 12-18 seconds to load from a typical microSD versus 4-6 seconds from a USB 3.0 SSD. Once loaded, the model stays in RAM and storage speed no longer matters. For an always-on deployment this is a one-time cost; for a script that loads and unloads models the SSD path saves real time.
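A quick way to measure the cold-start difference on your own storage, assuming copies of the same GGUF sit on both devices (paths are illustrative); dropping the page cache between runs keeps the comparison honest.

```bash
# Time a cold load from each device; drop the page cache first so the second
# run does not benefit from the first. Paths are illustrative.
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
time ./llama-cli -m /mnt/sdcard/phi-3-mini-q4_K_M.gguf -p "hi" -n 1
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
time ./llama-cli -m /mnt/ssd/phi-3-mini-q4_K_M.gguf -p "hi" -n 1
```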
Power consumption
The Pi 4 8GB draws roughly 4-6W idle and 7-9W under sustained inference. Across a 24-hour always-on deployment that is roughly 0.18 kWh per day, or about $8-12 per year at typical US electricity rates. This is the killer feature relative to any x86 alternative; even the cheapest mini-PC draws 3-5x as much power at idle. For a home-automation LLM node that needs to be always-on for voice-assistant duty, the Pi 4 is uniquely well-suited.
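The annual-cost arithmetic is simple enough to sanity-check; the 7.5 W average draw and the $0.15/kWh rate below are assumptions chosen to sit inside the ranges quoted above.

```bash
# Energy cost estimate for an always-on node. The wattage is an assumed average
# draw and the rate is an assumed $/kWh; both are illustrative.
awk -v watts=7.5 -v rate=0.15 'BEGIN {
  kwh_day = watts * 24 / 1000
  printf "%.2f kWh/day, %.0f kWh/year, ~$%.0f/year\n", kwh_day, kwh_day*365, kwh_day*365*rate
}'
```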
If you can tolerate 3 tok/s on Phi-3-mini for a privacy-respecting local assistant, the Pi 4 8GB is the right hardware. If you need faster inference, step up to a Jetson Orin Nano or accept higher power draw on a mini-PC.
