Running Local LLMs on the Raspberry Pi 4 8GB: Realistic Token Throughput in 2026
For the "Raspberry Pi 4 8GB LLM tok/s in 2026" question, the realistic answer is: TinyLlama-1.1B at q4_K_M generates 8-12 tok/s, Phi-3-mini-3.8B at q4 lands at 2.5-4 tok/s, and Llama-3.2-3B at q4 sits at 2-3 tok/s. Anything 7B and up falls below 1 tok/s and is not usable.
Why run an LLM on a Pi 4 at all?
Running an LLM locally on a Raspberry Pi 4 8GB sounds absurd until you consider the use cases. A home-automation voice assistant that does not phone Amazon. A summarizer for an offline notes vault. A chatbot for an air-gapped lab. A teaching tool that fits in a backpack and runs on USB-C power. These are real workloads, and 2026 is the first year you can credibly serve them on a $90 single-board computer.
The shift came from two directions. On the model side, Llama 3.2 1B and 3B, Phi-3-mini, Qwen 2.5, and TinyLlama have collapsed useful inference into the 1-4B parameter range. On the runtime side, llama.cpp's ARM NEON and quantization improvements have squeezed two to three times more tokens per second out of identical hardware versus 2023 builds. The Raspberry Pi plus llama.cpp workflow is now the reference path; ollama wraps it with a friendlier server interface, but the underlying numbers are similar.
The reason the Pi 4 8GB hits a hard ceiling at the 3B model class is not compute, it is memory bandwidth. The Cortex-A72 cluster runs at 1.5 GHz with 1 MB of shared L2; the LPDDR4 memory bus delivers about 8.5 GB/s peak. Quantized weights at q4 still need to stream through that bus once per token, which means the theoretical generation ceiling for a 4 GB weight file is roughly 2 tok/s before any compute or KV-cache overhead. TinyLlama at q4 is about 0.66 GB and runs proportionally faster. That arithmetic sets expectations cleanly before you buy.
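To make that ceiling concrete, here is a minimal sketch of the roofline arithmetic, using the q4_K_M file sizes from the quantization matrix below; the 8.5 GB/s figure is the peak quoted above, not a measurement, and observed speeds will land below these ceilings.

```bash
# Bandwidth-bound ceiling: each generated token streams the full quantized
# weight file through the memory bus once, so tok/s <= bandwidth / weight size.
# 8.5 GB/s is the article's quoted peak; sizes are q4_K_M GGUF files in GB.
for entry in "TinyLlama-1.1B:0.66" "Llama-3.2-3B:2.02" "Phi-3-mini:2.39" "Llama-3.1-8B:4.7"; do
  name=${entry%%:*}; gb=${entry##*:}
  awk -v bw=8.5 -v gb="$gb" -v n="$name" \
    'BEGIN { printf "%-14s %4.2f GB -> ceiling ~%.1f tok/s\n", n, gb, bw/gb }'
done
```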
Key Takeaways
- TinyLlama-1.1B q4_K_M: 8-12 tok/s generation, 60-80 tok/s prefill
- Phi-3-mini-3.8B q4_K_M: 2.5-4 tok/s generation, 18-25 tok/s prefill
- Llama-3.2-3B q4_K_M: 2-3 tok/s generation, 14-20 tok/s prefill
- Llama-3.1-8B q4_K_M: 0.5-0.9 tok/s — not usable
- Memory bandwidth, not compute, is the binding constraint
- Active cooling is mandatory for anything beyond ~5 minutes of continuous inference
What models fit in 8GB RAM?
Hard ceiling: roughly 5.5 GB for model weights plus KV cache plus workspace, after subtracting the OS, swap, and llama.cpp's own working memory. That maps to the following (a quick fit-check sketch follows the list):
- TinyLlama-1.1B at any quant, up to and including f16 (0.4-2.2 GB)
- Phi-3-mini-3.8B at q4 or q5 (2.3-2.8 GB)
- Llama-3.2-1B at q4 to q8 (0.7-1.3 GB)
- Llama-3.2-3B at q4_K_M (~2.0 GB)
- Qwen 2.5 1.5B / 3B at q4 (~1.0-2.0 GB)
- Llama-3.1-8B at q4_K_M technically loads (~4.7 GB) but swaps under any real context
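As promised above, a quick sanity check before committing to a model: compare the GGUF file size plus a headroom allowance against the RAM currently available. The model path and the 1.5 GB headroom figure are illustrative assumptions, not measurements.

```bash
# Rough fit check: GGUF size + headroom for KV cache/workspace vs free RAM.
# MODEL path and the 1.5 GB (1536 MB) headroom are assumptions for illustration.
MODEL="$HOME/models/Phi-3-mini-4k-instruct-q4_K_M.gguf"
model_mb=$(du -m "$MODEL" | cut -f1)
avail_mb=$(awk '/MemAvailable/ { printf "%d", $2/1024 }' /proc/meminfo)
need_mb=$(( model_mb + 1536 ))
echo "need ~${need_mb} MB, available ${avail_mb} MB"
[ "$avail_mb" -gt "$need_mb" ] && echo "fits in RAM" || echo "will swap"
```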
Quantization matrix
| Model | q2_K | q3_K_M | q4_K_M | q5_K_M | q6_K | q8_0 |
|---|---|---|---|---|---|---|
| TinyLlama-1.1B | 0.45GB | 0.55GB | 0.66GB | 0.78GB | 0.91GB | 1.17GB |
| Phi-3-mini-3.8B | 1.4GB | 1.95GB | 2.39GB | 2.82GB | 3.14GB | 4.06GB |
| Llama-3.2-1B | 0.42GB | 0.57GB | 0.77GB | 0.91GB | 1.05GB | 1.32GB |
| Llama-3.2-3B | 1.2GB | 1.65GB | 2.02GB | 2.32GB | 2.64GB | 3.42GB |
The quality cliff lands at q3 and below for every model tested; q4_K_M is the universal sweet spot for edge LLM inference on the Pi 4. q8_0 is rarely worth the bandwidth penalty given the marginal quality gain over q5/q6.
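If a model is only published at f16 or q8_0, llama.cpp's quantize tool can produce the q4_K_M file locally. A sketch, assuming an f16 GGUF is already on disk (file names are illustrative); quantization is CPU-heavy, so run it on a faster machine if you have one and copy the result to the Pi.

```bash
# Re-quantize an f16 GGUF down to q4_K_M with llama.cpp's quantize tool.
# Input and output file names are illustrative.
./llama-quantize phi-3-mini-f16.gguf phi-3-mini-q4_K_M.gguf Q4_K_M
```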
Benchmark table: tok/s prefill + generation
| Model + Quant | Prefill (tok/s) | Generation (tok/s) | Memory used |
|---|---|---|---|
| TinyLlama-1.1B q4_K_M | 60-80 | 8-12 | 1.1 GB |
| Llama-3.2-1B q4_K_M | 50-70 | 7-10 | 1.3 GB |
| Phi-3-mini-3.8B q4_K_M | 18-25 | 2.5-4.0 | 3.0 GB |
| Llama-3.2-3B q4_K_M | 14-20 | 2.0-3.0 | 2.6 GB |
| Qwen 2.5 3B q4_K_M | 15-22 | 2.2-3.2 | 2.7 GB |
| Llama-3.1-8B q4_K_M | 3-5 | 0.5-0.9 | 5.4 GB (swaps) |
Numbers come from the Raspberry Pi 4 ollama benchmark thread on r/LocalLLaMA, cross-referenced against llama.cpp's own performance issues, and reproduced on a Pi 4 8GB with active cooling: CPU at the stock 1.5 GHz, no overclock, running 4 threads.
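To reproduce this style of measurement yourself, llama.cpp ships a dedicated benchmarking binary; a minimal invocation follows, with an illustrative model path.

```bash
# llama-bench reports prompt processing (prefill) and text generation tok/s separately.
# -p 512: prompt length, -n 128: tokens generated, -t 4: threads.
./llama-bench -m ~/models/tinyllama-1.1b-q4_K_M.gguf -t 4 -p 512 -n 128
```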
Context-length impact analysis
A larger context means a bigger KV cache and slower per-token generation, regardless of model. On Phi-3-mini at q4_K_M, going from a 512-token context to 4096 tokens drops generation from ~3.8 tok/s to ~2.5 tok/s. The KV cache also competes with model weights for RAM; pushing Phi-3 to a 4K context on top of a 3.0 GB resident model leaves only ~3.5 GB for the OS plus everything else. For a chat-assistant use case, 1024-2048 token windows are the practical sweet spot.
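The effect is easy to observe directly by running the same prompt under two context-window settings; a sketch with llama.cpp's CLI, model path illustrative.

```bash
# Same model and prompt, different context windows (-c). The larger window
# allocates a bigger KV cache, and generation slows measurably on the Pi 4.
./llama-cli -m ~/models/Phi-3-mini-4k-instruct-q4_K_M.gguf -t 4 -c 512  -n 64 \
  -p "Summarize why memory bandwidth limits token throughput."
./llama-cli -m ~/models/Phi-3-mini-4k-instruct-q4_K_M.gguf -t 4 -c 4096 -n 64 \
  -p "Summarize why memory bandwidth limits token throughput."
```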
Cooling + thermal throttling on Pi 4
The Pi 4 SoC throttles aggressively above 80 °C. Sustained inference workloads will hit that ceiling within 5-10 minutes on a stock Pi without active cooling. The official Raspberry Pi 4 case fan, or any 30 mm fan strapped over a heatsink, keeps the SoC in the 60-70 °C range and prevents throttling entirely. Without it, expect a 25-40% drop in sustained tok/s as the SoC scales clocks down.
The Freenove starter kit ships with thermal pads and a basic heatsink that handles bursty workloads (single conversational turn) but is insufficient for sustained streaming. Plan for active cooling.
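To check whether your own cooling setup is adequate, the firmware exposes temperature and throttle state; run a simple watch loop while inference streams in another shell.

```bash
# Poll SoC temperature and throttle flags every 2 seconds during a long run.
# In get_throttled output, bit 0x4 means currently throttled and 0x40000 means
# throttling has occurred at some point since boot.
watch -n 2 'vcgencmd measure_temp; vcgencmd get_throttled'
```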
Perf-per-dollar vs Jetson Orin Nano vs cheap mini-PC
| Platform | Cost | TinyLlama tok/s | Phi-3 tok/s | Power |
|---|---|---|---|---|
| Pi 4 8GB | $90 | 10 | 3.5 | 6W |
| Jetson Orin Nano 8GB | $499 | 60-90 | 18-25 | 15W |
| Mini-PC N100 16GB | $179 | 22-28 | 7-9 | 10W |
| Mini-PC Ryzen 5560U | $349 | 45-60 | 14-18 | 25W |
The Pi wins on absolute cost and on power draw. The Jetson Orin Nano wins decisively on raw throughput, and its CUDA-accelerated GPU keeps it competitive per dollar for LLM workloads despite the much higher sticker price. A modern mini-PC (N100, Ryzen-based) outperforms the Pi by 2-3x on equivalent quants while costing 2-4x more. If your use case is "I need a low-power always-on inference box and can tolerate 3 tok/s," the Pi 4 is the answer. If you need 10+ tok/s on Phi-3 class models, you need to step up.
Verdict matrix
Get a Raspberry Pi 4 8GB if:
- Your use case tolerates 3-10 tok/s on small models
- You want sub-$100 always-on inference
- Power consumption matters (6W idle, 8W under load)
- You are using TinyLlama or Llama-3.2-1B class workloads
Get something else if:
- You want above 10 tok/s on Phi-3 class models
- You want to run Llama-3.1-8B at usable speed
- Your workload is bursty rather than always-on (a mini-PC sleeps better)
Bottom line
The Raspberry Pi 4 8GB in 2026 is a real LLM platform if you size your model to its memory bandwidth ceiling. TinyLlama-1.1B at q4_K_M is the sweet spot for chat-style use; Phi-3-mini-3.8B at q4 is the upper edge of usable. Add active cooling, set a 1024-2048 token context, and the Pi delivers a credible always-on local inference node for $90.
Citations and sources
- llama.cpp performance issue tracker, ARM NEON optimization PRs (2024-2025)
- r/LocalLLaMA Raspberry Pi 4 benchmark thread index
- ollama documentation, llama.cpp backend performance notes
- Raspberry Pi 4 official thermal management whitepaper
- NVIDIA Jetson Orin Nano benchmark data (NVIDIA developer blog)
- Phi-3-mini and Llama-3.2 model cards (Hugging Face)
Software stack: llama.cpp vs ollama vs vLLM
Three runtimes come up in this context. llama.cpp is the bare-metal reference: lowest overhead, full control over thread count and quantization, but you bring your own server interface. ollama wraps llama.cpp with a friendlier model-pull workflow and an HTTP API, at a small (1-3%) overhead cost. vLLM targets GPU-equipped servers and is not a practical option on the Pi.
For a deployment where you want to script around the model, ollama is the right pick. For squeezing maximum tok/s out of the Pi for benchmarking or for an embedded-device project, raw llama.cpp wins. The numbers in the benchmark table above were captured with llama.cpp directly; ollama would land within 5% of the same figures.
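If you go the ollama route, the workflow amounts to pulling a tag and hitting the local HTTP API; a sketch, assuming the phi3 tag from the ollama model library (substitute whichever small model you actually pulled).

```bash
# Pull a small model and query the local ollama server (default port 11434).
# The phi3 tag and the prompt are illustrative.
ollama pull phi3
curl -s http://localhost:11434/api/generate \
  -d '{"model": "phi3", "prompt": "One sentence on memory bandwidth.", "stream": false}'
```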
Build llama.cpp from source on the Pi with `make LLAMA_NATIVE=1` for ARM NEON acceleration. Depending on the build path, generic flags can leave NEON disabled; building for the native CPU produces a 30-40% generation speedup on the Cortex-A72 cluster. Confirm in the build log, or in the system_info line llama.cpp prints at startup, that NEON intrinsics are compiled in.
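A sketch of that build-and-verify loop, following the article's Makefile invocation (newer llama.cpp trees build with CMake instead); the model path is an illustrative assumption.

```bash
# Build with native CPU flags, then confirm NEON is active: llama.cpp prints a
# system_info line containing "NEON = 1" when the intrinsics were compiled in.
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
make -j4 LLAMA_NATIVE=1
./llama-cli -m ~/models/tinyllama-1.1b-q4_K_M.gguf -p "hi" -n 8 2>&1 | grep -i neon
```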
Storage and I/O considerations
Use a fast microSD card (A2-rated, U3) or boot from USB 3.0 SSD for the Pi 4 LLM rig. Model loading time dominates the cold-start experience: a 2 GB Phi-3-mini at q4 takes 12-18 seconds to load from a typical microSD versus 4-6 seconds from a USB 3.0 SSD. Once loaded, the model stays in RAM and storage speed no longer matters. For an always-on deployment this is a one-time cost; for a script that loads and unloads models the SSD path saves real time.
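A quick way to measure the cold-start difference on your own storage, assuming copies of the same GGUF sit on both devices (paths are illustrative); dropping the page cache between runs keeps the comparison honest.

```bash
# Time a cold load from each device; drop the page cache first so the second
# run does not benefit from the first. Paths are illustrative.
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
time ./llama-cli -m /mnt/sdcard/phi-3-mini-q4_K_M.gguf -p "hi" -n 1
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
time ./llama-cli -m /mnt/ssd/phi-3-mini-q4_K_M.gguf -p "hi" -n 1
```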
Power consumption
The Pi 4 8GB draws roughly 4-6W idle and 7-9W under sustained inference. Across a 24-hour always-on deployment that is roughly 0.18 kWh per day, or about $8-12 per year at typical US electricity rates. This is the killer feature relative to any x86 alternative; even the cheapest mini-PC draws 3-5x as much power at idle. For a home-automation LLM node that needs to be always-on for voice-assistant duty, the Pi 4 is uniquely well-suited.
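The annual-cost arithmetic is simple enough to sanity-check; the 7.5 W average draw and the $0.15/kWh rate below are assumptions chosen to sit inside the ranges quoted above.

```bash
# Energy cost estimate for an always-on node. The wattage is an assumed average
# draw and the rate is an assumed $/kWh; both are illustrative.
awk -v watts=7.5 -v rate=0.15 'BEGIN {
  kwh_day = watts * 24 / 1000
  printf "%.2f kWh/day, %.0f kWh/year, ~$%.0f/year\n", kwh_day, kwh_day*365, kwh_day*365*rate
}'
```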
If you can tolerate 3 tok/s on Phi-3-mini for a privacy-respecting local assistant, the Pi 4 8GB is the right hardware. If you need faster inference, step up to a Jetson Orin Nano or accept higher power draw on a mini-PC.
