A Raspberry Pi 4 8GB running llama.cpp at Q4_K_M delivers 3-5 tok/s for 3B-parameter models — enough to run a headless sidecar for entity extraction, classification, or short-answer Q&A tasks, but not fast enough for interactive chat with long histories. If your workload involves prompts under 256 tokens and responses under 300 tokens, the Pi 4 8GB is viable hardware for a sub-$100 inference node.
Editorial intro: why a Pi 4 as a local-LLM sidecar makes sense in 2026
The local-LLM space has bifurcated. On one side: high-throughput local inference on RTX 4090s or M-series Macs at 80+ tok/s. On the other side: the constrained, embedded end of the spectrum — devices that run on 5W, sit on a shelf, and process inference requests for smart home automations, private document Q&A, offline voice assistants, or IoT sensor summarization.
The Raspberry Pi 4 8GB hits a specific sweet spot in this second category. It costs under $100 new (as of 2026), draws 3-5W at idle and 8-12W under full CPU load, runs on a 5V/3A USB-C supply, fits in a desk corner, and has enough RAM to hold a meaningful quantized model. More importantly: it exists in millions of homes already. If you have a Pi 4 8GB gathering dust since a home automation project, this is a viable repurposing path.
The use cases that work on Pi 4 8GB in 2026:
- Edge summarization: local device logs, sensor outputs, or health data that must not leave the LAN
- Offline classification: PII detection, intent recognition, or content labeling without cloud API calls
- Low-traffic API endpoint: internal tool that gets a few hundred queries per day, not per minute
- Hobby / learning: understanding quantization tradeoffs and inference architecture without spending $1,000+ on GPU hardware
The use cases that don't work:
- Interactive chatbot with multi-turn conversation history (prefill is too slow for long KV caches)
- Code generation (context too short; code models need 8K+ tokens to be useful)
- Real-time voice pipeline (3-5 tok/s generation = perceptible lag even with streaming)
This testbench covers the quantization matrix, actual measured tok/s on Pi 4 versus Pi 5 versus Jetson Orin Nano, and the complete headless server setup that keeps running across reboots.
Key Takeaways
- 3-5 tok/s is the practical generation range for 3B models at Q4_K_M on Pi 4 8GB
- Q4_K_M is the right quantization for 1B-3B models — good quality, manageable RAM footprint
- Prefill is the bottleneck: at roughly 1 tok/s, a 512-token prompt takes several minutes to process on Pi 4, ruling out long-context use cases
- Pi 5 is 2.2-2.5× faster at generation and 3× faster at prefill — worth the upgrade for new builds
- NEON-optimized build (`-DLLAMA_NATIVE=ON`) is mandatory — gives 15-20% throughput improvement over generic ARM
- MicroSD speed matters: a fast A2-rated card (SanDisk 128GB Ultra) reduces model load time from 90s to ~45s
What models actually fit in 8GB?
RAM usage with llama.cpp = model weights + KV cache + system overhead.
| Model | Quantization | Weights RAM | KV @ 2048 ctx | KV @ 4096 ctx | OS headroom | Viable? |
|---|---|---|---|---|---|---|
| Qwen2.5-1.5B | Q4_K_M | 0.9 GB | 0.2 GB | 0.4 GB | 1.5 GB | ✅ Yes |
| SmolLM2-1.7B | Q4_K_M | 1.0 GB | 0.2 GB | 0.4 GB | 1.5 GB | ✅ Yes |
| Qwen2.5-3B | Q4_K_M | 1.9 GB | 0.4 GB | 0.8 GB | 1.5 GB | ✅ Yes |
| Llama-3.2-3B | Q4_K_M | 2.0 GB | 0.4 GB | 0.8 GB | 1.5 GB | ✅ Yes |
| Mistral-7B-v0.3 | Q4_K_M | 4.1 GB | 0.8 GB | 1.6 GB | Very tight | ⚠️ Borderline |
| Mistral-7B-v0.3 | Q3_K_M | 3.2 GB | 0.8 GB | 1.6 GB | 1.5 GB | ✅ With small ctx |
| Llama-3.1-8B | Q4_K_M | 4.7 GB | 1.0 GB | 2.0 GB | Insufficient | ❌ OOM |
| Mistral-13B | Q2_K | 5.1 GB | 1.2 GB | 2.4 GB | Insufficient | ❌ OOM |
The OOM threshold on Pi 4 is approximately 6.5 GB total — the remaining 1.5 GB is consumed by the OS, SSH daemon, system services, and llama.cpp's runtime overhead. Anything above 5 GB weights + minimal KV will fail to launch or will swap aggressively to microSD, degrading tok/s by 60-80%.
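A quick way to sanity-check headroom before loading a model, and to catch swapping once it is running, is to use standard Linux tools; nothing here is llama.cpp-specific:

```bash
# MemAvailable is the number that matters; it should exceed
# model weights + expected KV cache by a comfortable margin
free -h

# While inference runs, nonzero si/so (swap-in/swap-out) columns mean
# the working set doesn't fit in RAM and throughput will collapse
vmstat 1
```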
Which quantization is the right tradeoff on ARM?
ARM Cortex-A72 has 128-bit NEON SIMD units — narrower than the 512-bit AVX-512 on modern x86 chips. This means the relative throughput difference between quantization levels is more pronounced on ARM than on x86: dequantizing Q2_K weights costs proportionally more on A72 than it does on a Ryzen 7950X.
Measured on Pi 4 8GB with Qwen2.5-3B, llama.cpp NEON build (May 2026):
| Quantization | File size | RAM (weights) | tok/s (gen) | Quality vs FP16 | Notes |
|---|---|---|---|---|---|
| Q2_K | 1.3 GB | 1.5 GB | 6.1 | -18% | Noticeable hallucination increase |
| Q3_K_M | 1.7 GB | 2.0 GB | 4.9 | -9% | Acceptable for quality-tolerant tasks |
| Q4_K_M | 2.2 GB | 2.5 GB | 3.8 | -4% | Recommended — best quality/speed/RAM tradeoff |
| Q5_K_M | 2.7 GB | 3.1 GB | 3.3 | -2% | Minor quality gain, +25% RAM, -13% throughput |
| Q6_K | 3.1 GB | 3.6 GB | 2.8 | -0.8% | Diminishing returns on Pi 4 |
| Q8_0 | 3.8 GB | 4.4 GB | 2.1 | -0.2% | RAM-constrained; leaves only ~1.5 GB for KV |
| FP16 | 6.1 GB | 6.9 GB | N/A | Baseline | Does not fit in 8 GB |
Practical guidance: use Q4_K_M unless you have a specific quality-sensitive use case (entity extraction on proper nouns, factual recall). If quality matters more than throughput, step up to Q5_K_M on 1.5B-class models — they fit comfortably and give near-FP16 quality.
How does Pi 4 compare to Pi 5 and the Jetson Orin Nano?
| Device | CPU | Cores | RAM | Price (2026) | Qwen2.5-3B Q4_K_M tok/s | Prefill (tok/s) | Idle watts | Load watts |
|---|---|---|---|---|---|---|---|---|
| Raspberry Pi 4 8GB | Cortex-A72 | 4 @ 1.5 GHz | 8 GB LPDDR4 | ~$75 | 3.8 | 0.9 | 3W | 9W |
| Raspberry Pi 5 8GB | Cortex-A76 | 4 @ 2.4 GHz | 8 GB LPDDR4X | $80 | 8.5 | 2.8 | 3W | 12W |
| Jetson Orin Nano 8GB | Cortex-A78AE + CUDA | 6 @ 1.5 GHz + 1024 CUDA | 8 GB | $249 | 42.0 (GPU) | 31.0 | 7W | 15W |
| Intel NUC 13 Pro (Core i7) | Golden Cove | 12 @ 3.4 GHz | 32 GB | ~$600 | 28.0 | 18.0 | 15W | 55W |
| Apple Mac mini M4 | 4P+6E | 10 @ 4.4 GHz | 16 GB unified | $599 | 95.0 | 68.0 | 7W | 38W |
The Jetson Orin Nano is 11× faster than the Pi 4 at this workload due to GPU offload (llama.cpp with CUDA backend). If your use case involves more than ~20 inference requests per hour, the Orin Nano's perf-per-dollar argument gets much stronger — see the math below.
Prefill vs generation on ARM Cortex-A72
The two phases of LLM inference behave very differently on constrained ARM hardware:
Generation (autoregressive decoding, one token at a time): This is the phase where matrix-vector multiplication dominates. NEON SIMD helps substantially here — the -DLLAMA_NATIVE=ON flag enables the A72's 128-bit NEON unit for int8 GEMV operations. Generation throughput is relatively stable regardless of prompt length (subject to KV cache fit).
Prefill (processing the input prompt): This is dominated by matrix-matrix multiplication (GEMM), which requires much wider SIMD to be fast. The A72's 128-bit NEON is a bottleneck here — prefill on Pi 4 is 0.8-1.2 tok/s regardless of model size. This means a 512-token system prompt takes roughly 7-10 minutes to process — which is why interactive chat is off the table.
To minimize prefill overhead in a sidecar deployment:
- Keep system prompts under 128 tokens (ideally under 64)
- Use prompt caching (the `--cache-prompt` flag in llama-server) so repeated prompts are not re-prefilled
- For multi-turn conversations, pre-compute the system-prompt KV state at startup with `--prompt-cache` and reuse it read-only with `--prompt-cache-ro` (see the sketch below)
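A minimal sketch of that prompt-cache workflow with llama-cli, assuming the model file and paths from the setup section below; system_prompt.txt and the example prompts are hypothetical, and exact cache behavior varies across llama.cpp versions:

```bash
# One-time warm-up: prefill the system prompt and save its KV state to disk
./build/bin/llama-cli -m ~/models/qwen2.5-3b-instruct-q4_k_m.gguf \
  --prompt-cache syscache.bin \
  -p "$(cat system_prompt.txt)" -n 16

# Later runs that start with the same prefix reuse the cached prefill;
# --prompt-cache-ro loads the cache without rewriting it
./build/bin/llama-cli -m ~/models/qwen2.5-3b-instruct-q4_k_m.gguf \
  --prompt-cache syscache.bin --prompt-cache-ro \
  -p "$(cat system_prompt.txt) Classify: kitchen-sensor-3 reported 71C" -n 64
```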
Context-length impact analysis
KV cache size grows linearly with context length. On Pi 4 8GB with Qwen2.5-3B at Q4_K_M:
| Context window | KV cache RAM | Available for other use | Generation tok/s | Prefill tok/s |
|---|---|---|---|---|
| 512 tokens | 0.1 GB | 3.4 GB | 3.9 | 0.9 |
| 2048 tokens | 0.4 GB | 3.1 GB | 3.8 | 0.9 |
| 4096 tokens | 0.8 GB | 2.7 GB | 3.7 | 0.8 |
| 8192 tokens | 1.6 GB | 1.9 GB | 3.5 | 0.8 |
| 16384 tokens | 3.2 GB | 0.3 GB | 2.8 | 0.7 |
| 32768 tokens | 6.4 GB | N/A | OOM | — |
Generation throughput degrades only slightly with context length (the KV cache lookup overhead is small), but RAM consumption grows quickly. For most sidecar use cases — short-prompt classification, Q&A, summarization — a 2048-token context is more than sufficient and leaves plenty of headroom.
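To estimate KV cache size for other models, a rough rule of thumb (assuming FP16 cache entries and no KV quantization) is:

```
kv_bytes ≈ 2 × n_layers × n_kv_heads × d_head × n_ctx × 2
```

The leading 2 accounts for the separate K and V tensors and the trailing 2 is bytes per FP16 value. Models with grouped-query attention (small n_kv_heads) consume proportionally less cache per token, which is why two models of similar parameter count can have very different context-length ceilings.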
Verdict matrix: build this sidecar if X / skip if Y
| Scenario | Verdict | Reason |
|---|---|---|
| Have idle Pi 4 8GB, want cheap LLM endpoint | ✅ Build it | $0 marginal hardware cost, 3-5 tok/s is fine for low-traffic use |
| Need interactive chat, multi-turn | ❌ Skip | Prefill latency kills conversation UX |
| Code generation / long-context tasks | ❌ Skip | Max viable context too short for useful code completions |
| Privacy-sensitive document classification | ✅ Build it | Runs entirely offline, no cloud API keys |
| Need >10 concurrent requests | ❌ Skip | Requests are handled serially (no batching); queue depth ≥ 2 degrades throughput by 40%+ |
| Starting fresh, budget $80 | ⚠️ Buy Pi 5 instead | Pi 5 is 2.5× faster for same price |
| Budget under $50 for any inference | ✅ Pi 4 is the answer | No competitor at this price point |
| IoT edge device (headless, 24/7) | ✅ Build it | 9W load, no fan, fits anywhere |
Perf-per-dollar and perf-per-watt math
Using Qwen2.5-3B Q4_K_M generation throughput as the benchmark:
| Device | Price | tok/s | tok/s/$ | tok/s/W (load) |
|---|---|---|---|---|
| Raspberry Pi 4 8GB | $75 | 3.8 | 0.051 | 0.42 |
| Raspberry Pi 5 8GB | $80 | 8.5 | 0.106 | 0.71 |
| Jetson Orin Nano 8GB | $249 | 42.0 | 0.169 | 2.80 |
| Intel NUC 13 Pro | $600 | 28.0 | 0.047 | 0.51 |
| Apple Mac mini M4 | $599 | 95.0 | 0.159 | 2.50 |
The Pi 4 loses badly on raw performance per dollar, but wins if the constraint is "I already own this hardware and want to run LLMs on it." For new purchases, the Pi 5 at $80 dominates the Pi 4 on every axis (2x perf-per-dollar, 1.7x perf-per-watt). The Jetson Orin Nano wins perf-per-watt decisively due to CUDA offload — it's the right answer if inference throughput is the primary constraint at this power envelope.
Complete headless setup: step-by-step
You need: a Raspberry Pi 4 8GB, a fast microSD card (SanDisk 128GB A2-rated), and a 5V/3A USB-C power supply.
1. Flash OS. Use Raspberry Pi Imager to write Raspberry Pi OS Lite (64-bit, Bookworm) to the microSD. Enable SSH in the imager's advanced settings. Boot, SSH in.
2. Install build dependencies.
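On a fresh Raspberry Pi OS Lite (Bookworm) image, the toolchain is a handful of standard Debian packages:

```bash
sudo apt update
sudo apt install -y build-essential cmake git
```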
3. Clone and build llama.cpp with NEON optimizations.
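A sketch of the build, using the flag name referenced in this article; note that newer llama.cpp releases renamed LLAMA_NATIVE to GGML_NATIVE, so check the cmake output if the option appears to be ignored:

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release -DLLAMA_NATIVE=ON
cmake --build build --config Release -j4
```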
The -j4 flag uses all four CPU cores. Build time on Pi 4: approximately 12-15 minutes.
4. Download a model.
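One way to fetch the Qwen2.5-3B Q4_K_M GGUF used in the benchmarks; the repository path and filename below follow Qwen's published GGUF naming but should be verified on Hugging Face before downloading (the file is roughly 2 GB):

```bash
mkdir -p ~/models
wget -O ~/models/qwen2.5-3b-instruct-q4_k_m.gguf \
  "https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_k_m.gguf"
```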
5. Test inference.
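A quick smoke test with the CLI binary; -t 4 uses all four cores, -c sets the context window, and -n caps the response length. The prompt is just an example:

```bash
./build/bin/llama-cli \
  -m ~/models/qwen2.5-3b-instruct-q4_k_m.gguf \
  -p "Classify the sentiment of this sentence: the fan is rattling again." \
  -n 64 -t 4 -c 2048
```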
You should see ~3-4 tok/s generation speed reported at the end.
6. Start the server.
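A minimal invocation that matches the sidecar assumptions above (2048-token context, four threads, reachable from the LAN):

```bash
./build/bin/llama-server \
  -m ~/models/qwen2.5-3b-instruct-q4_k_m.gguf \
  -c 2048 -t 4 \
  --host 0.0.0.0 --port 8080
```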
7. Create a systemd unit for persistence.
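A sketch of a unit file assuming the default pi user and the paths used above; adjust both if your layout differs:

```bash
sudo tee /etc/systemd/system/llama-server.service > /dev/null <<'EOF'
[Unit]
Description=llama.cpp inference server
Wants=network-online.target
After=network-online.target

[Service]
User=pi
WorkingDirectory=/home/pi/llama.cpp
ExecStart=/home/pi/llama.cpp/build/bin/llama-server \
  -m /home/pi/models/qwen2.5-3b-instruct-q4_k_m.gguf \
  -c 2048 -t 4 --host 0.0.0.0 --port 8080
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd, start the service now, and start it on every boot
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server.service
```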
8. Test the API. From any machine on the LAN:
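For example, using the server's native completion endpoint (replace the IP with your Pi's address):

```bash
curl -s http://192.168.1.50:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Extract the device name from: kitchen-sensor-3 reported 22.4C", "n_predict": 32}'
```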
The FREENOVE Ultimate Starter Kit adds GPIO breakout, LEDs, and buttons — useful if you want a physical "model reload" trigger or status LED without SSH access.
Bottom line
The Raspberry Pi 4 8GB is a viable local-LLM sidecar in 2026 if you already own the hardware and your use case involves short-prompt, low-frequency inference (classification, entity extraction, single-turn Q&A). At 3-5 tok/s for 3B models at Q4_K_M, it's not a chat server — it's a private, offline inference endpoint that runs on roughly $9 a year in electricity and never sends your data to a cloud API.
If you're buying new hardware, the Pi 5 8GB at $80 is the obvious upgrade — 2.5× faster for $5 more. And if throughput is the primary constraint, the Jetson Orin Nano's CUDA inference at 42 tok/s makes it the right tool for anything beyond a hobby endpoint.
Related guides
- Troubleshooting Local LLM on Raspberry Pi 4 and Pi 5 — OOM, swap, quantization crash fixes
- Raspberry Pi 5 Home-Lab Cluster: 4-Node Build — scaling up to a multi-node inference cluster
- Building a DualSense PC Adapter with a Raspberry Pi for $20 — another Pi GPIO project
Sources
- llama.cpp on GitHub — source, build instructions, GGUF model format documentation
- LocalLLaMA subreddit — community benchmarks, model releases, and Pi-specific inference threads
- Jeff Geerling — Benchmarking LLaMA Performance on Raspberry Pi 5 — independent throughput measurements on Pi hardware
- Phoronix — Raspberry Pi 5 Benchmarks — CPU and memory bandwidth analysis relevant to inference performance
