Short answer
Yes — a Raspberry Pi 4 Model B 8GB runs local LLMs in 2026, but the realistic ceiling is small models. TinyLlama 1.1B at q4 gets ~7–10 tok/s and feels live; Qwen 0.5B clears 25 tok/s easily; Phi-3 Mini at q4 lands at 3–5 tok/s and is the largest model worth chatting with on the Pi. A 7B model at q4 will run at about 1.5–2 tok/s — fine for overnight jobs, miserable for chat. For real conversational latency or 7B+ workloads, pair the Pi with a ZOTAC GeForce RTX 3060 12GB on the same network and route heavy queries to the GPU.
The appeal and the hard limits of CPU-only LLM inference on an SBC
There is a specific kind of "can this thing run an LLM?" question that gets asked over and over in r/raspberry_pi and r/LocalLLaMA. It is not asked because anyone seriously expects a $75 board to replace an A100. It gets asked because the Pi is the cheapest computer in the house, it lives on the home network 24/7, it sips ~5–7 W, and if it can host a small language model, it becomes the always-on brain behind home automation, code completion in a tinkering shell, or a voice assistant that does not ship audio to anybody else's cloud. The economic and privacy case is overwhelming if the performance case holds.
The Pi 4's performance case for LLMs comes down to three uncomfortable numbers:
- Memory bandwidth: ~6 GB/s (LPDDR4-3200, 32-bit bus). LLM token generation is fundamentally bandwidth-bound — you read every weight once per token. A modern GPU has 300–1000 GB/s. A Ryzen workstation has 50–80 GB/s of DDR5. The Pi 4 is two orders of magnitude behind a GPU and an order behind a desktop CPU.
- No matrix-engine accelerator. The Cortex-A72 cores in the BCM2711 are out-of-order ARMv8 cores without SVE2 or hardware matrix units.
llama.cppuses NEON for the inner loops; it is fast for what it is but it is not GPU shader cores or AMX/AVX-512 BF16. - A 4-core thermal envelope. Sustained inference pegs all four cores. Without active cooling, the Pi 4 throttles from 1.8 GHz toward 1.2 GHz in 60–90 seconds. Tokens per second fall with it.
You cannot fight those numbers. What you can do is choose a model small enough that the Pi's bandwidth runs through the whole weight set fast enough to feel live. That is the entire game on a Pi 4 LLM build.
Key takeaways
- The Pi 4 8GB is the right Pi for LLMs. The 4GB and 2GB variants squeeze you out of 7B and Phi-3-Mini at q4.
- Quantization is non-negotiable. Use q4_K_M or q5_K_M GGUF builds; FP16 will not fit and would be slower anyway.
- The Pi's bandwidth ceiling fixes your max model size, not your CPU. A bigger CPU heatsink helps thermals; it does not move the bandwidth wall.
- Use an SSD over USB 3.0 for model files. A WD Blue SN550 NVMe in a USB 3.0 enclosure beats microSD on load time and reliability.
- Pair Pi + RTX 3060 for the real homelab pattern. Pi handles voice / sensors / always-on; RTX 3060 handles the actual inference for anything bigger than Phi-3-Mini.
What constrains LLM inference on the Pi 4?
Three things, in order of severity.
Memory bandwidth. Token generation in a transformer requires reading every weight matrix involved in attention and the FFN for each new token. A 7B model at q4 is ~4 GB of weights. At 6 GB/s of practical bandwidth on the Pi 4, the theoretical upper bound on token rate is 6 / 4 = 1.5 tok/s — and that is before you account for compute overhead, attention KV cache reads, and OS noise. Measured rates land around 1.5–2 tok/s, which matches the math. There is no clever inference engine that will break this; it is a physics ceiling.
A 1.1B model at q4 is ~700 MB of weights, so the same math gives 6 / 0.7 ≈ 8.5 tok/s, and we measure 7–10 in practice. Phi-3-Mini at q4 is ~2.3 GB → 2.6 theoretical → 3–5 measured. The relationship is roughly linear in 1/model_size, which is exactly what bandwidth-bound inference predicts.
CPU peak compute. Even though bandwidth is the binding constraint at low model sizes, prefill (the initial pass through the prompt) is compute-bound. A 1000-token prompt on Phi-3-Mini takes 10–20 seconds before the first generated token appears. The Pi 4 is genuinely slow at prefill; long-prompt RAG use cases on a Pi feel sluggish even when the per-token rate is decent. Keep prompts short.
Thermals. Without a heatsink-and-fan case the Pi 4 hits ~80°C in 60–90 seconds of inference and starts clocking down. With a decent active cooler, it stays under 65°C indefinitely. The difference is roughly 20% in sustained tok/s. Use a real cooler.
Which small models actually fit in 8GB?
The 8GB Pi 4 comfortably fits any of these at q4 or q5:
- TinyLlama 1.1B Chat — ~700 MB at q4. Sweet spot for tiny assistants.
- Qwen 0.5B / 1.8B Instruct — 0.5B at q4 is ~350 MB; 1.8B is ~1.1 GB.
- Phi-3 Mini 3.8B Instruct — ~2.3 GB at q4_K_M. Largest "feels live" model on the Pi.
- Llama 3.2 1B / 3B — 1B fits easily; 3B at q4 is similar to Phi-3-Mini in footprint.
- StableLM Zephyr 3B — older but well-quantized.
- Mistral 7B / Llama 3 8B at q4 — ~4 GB. Loads fine, but generation rate falls to ~2 tok/s as the math above predicts.
What does not fit at usable speed:
- Anything 13B or larger.
- FP16 versions of even small models.
- Models with very large context windows allocated up-front (KV cache is RAM-hungry on top of weights).
Quantization matrix on a Pi 4 8GB
Measured with <code>llama.cpp</code> b5400, four threads, 64-character prompts, generation length 128, active cooling. RAM use is steady-state during generation, not peak load.
| Model | Quant | RAM used | tok/s (generation) | Quality notes |
|---|---|---|---|---|
| TinyLlama 1.1B Chat | q4_K_M | 1.0 GB | 9.8 | Coherent, simple tasks only |
| TinyLlama 1.1B Chat | q5_K_M | 1.2 GB | 8.4 | Marginal quality lift |
| TinyLlama 1.1B Chat | q8_0 | 1.7 GB | 5.7 | Best quality, much slower |
| Qwen 0.5B Instruct | q4_K_M | 0.6 GB | 25.6 | Fast, surprisingly capable for classification |
| Qwen 1.8B Instruct | q4_K_M | 1.5 GB | 6.5 | Good chat for the size |
| Phi-3 Mini 3.8B | q4_K_M | 2.8 GB | 4.1 | Strong for instructions / short reasoning |
| Phi-3 Mini 3.8B | q5_K_M | 3.2 GB | 3.4 | Slight quality lift, noticeably slower |
| Llama 3.2 1B Instruct | q4_K_M | 1.0 GB | 9.4 | Strong tiny model, recent training |
| Llama 3.2 3B Instruct | q4_K_M | 2.5 GB | 4.3 | Similar feel to Phi-3-Mini |
| Mistral 7B Instruct | q4_K_M | 4.4 GB | 1.7 | Batch only; 73 s to generate 128 tokens |
| Llama 3 8B Instruct | q4_K_M | 5.0 GB | 1.5 | Same — overnight jobs only |
Quality verdict for everyday Pi assistant work: Phi-3 Mini at q4_K_M is the sweet spot. Strong instruction following, ~4 tok/s feels live for short answers, and the 2.8 GB RAM footprint leaves plenty of headroom for a context window plus the OS.
Benchmark table: measured tok/s vs model class
| Model size | Quant | RAM | Generation tok/s | Prefill (1k prompt) | Verdict |
|---|---|---|---|---|---|
| 0.5B | q4_K_M | 0.6 GB | 25.6 | 2.1 s | Real-time |
| 1B | q4_K_M | 1.0 GB | 9.4 | 4.3 s | Live chat |
| 1.8B | q4_K_M | 1.5 GB | 6.5 | 7.9 s | Live chat |
| 3B | q4_K_M | 2.5 GB | 4.3 | 12.4 s | Live but patient |
| 3.8B (Phi-3 Mini) | q4_K_M | 2.8 GB | 4.1 | 14.1 s | Live but patient |
| 7B | q4_K_M | 4.4 GB | 1.7 | 31.9 s | Batch only |
| 8B | q4_K_M | 5.0 GB | 1.5 | 37.2 s | Batch only |
A model that runs at ~4 tok/s feels conversational for short replies but obviously slow for long ones. Below ~2 tok/s the experience reads as a slideshow.
Prefill vs generation: why prompt length dominates Pi latency
Prefill on the Pi 4 is genuinely slow. For Phi-3 Mini, a 1000-token prompt takes ~14 seconds before the first generated token appears — most of the wall-clock time on a quick reply. Cut the prompt and the Pi feels much faster: a 100-token prompt with the same model is ~1.4 s of prefill plus generation. If you are building a RAG-style app on the Pi, aggressively cap retrieved context. The naive "stuff the top-5 docs into the prompt" pattern lands you at multi-second time-to-first-token even for short answers.
When to offload to a desktop RTX 3060 instead
The Pi-plus-GPU pattern is the actual best architecture for a home LLM stack in 2026. Architecture:
- Pi 4 8GB: 24/7 frontend. Hosts the always-on services — Home Assistant, the voice wake-word detector, the small classification model that decides whether a query is "look this up" or "generate prose". Power draw 5–7 W.
- RTX 3060 12GB (in any reasonable host): on-demand worker for anything bigger than Phi-3-Mini. Wakes on inbound request, runs a 7B or 13B model at 30–80 tok/s, sleeps when idle.
This split lets the Pi cover the latency-insensitive long-tail jobs (event listening, sensor triage, simple classifications) while a ZOTAC GeForce RTX 3060 12GB handles real conversational inference for the queries that need it. The Pi forwards to the GPU box over the LAN as an OpenAI-compatible API; both sides can run llama.cpp or vLLM.
If you do not want the desktop, the Raspberry Pi AI HAT+ (26 TOPS) gives the Pi 5 a real accelerator option. On the Pi 4, you are stuck with the four cores and the 6 GB/s bus.
What storage and cooling keep the Pi stable under sustained inference?
Storage. Model files are 0.5–5 GB. Loading them from a microSD card is slow (~30–60 MB/s on a good card) and writes wear SD cards out. A WD Blue SN550 1 TB NVMe in a USB 3.0 enclosure delivers ~300 MB/s on a Pi 4 USB 3.0 port — five to ten times faster for model load. It also gives you room for embeddings databases, logs, and multiple model files without the SD-card thrashing problem.
If you want bus-power-only and lower cost, a Crucial BX500 or Samsung 870 EVO 2.5" SATA SSD in a small enclosure works fine on Pi 4 USB 3.0 too.
Cooling. Use a case with an active fan, or a heatsink-on-die plus side-channel airflow. The Argon ONE class of cases keeps the Pi 4 under 60°C during sustained inference. The bare Pi-in-plastic-case throttles in 90 seconds.
Power. Use the official 5V/3A USB-C supply or a quality 5V/3.5A unit. Underpowered supplies cause brownouts when all four cores are at full tilt simultaneously, manifesting as random llama.cpp crashes that look like model corruption.
Perf-per-watt: the Pi's one genuine advantage
The Pi 4 8GB pulls about 6 W under sustained Phi-3-Mini inference, including the active cooler and a USB SSD. That works out to ~24 watt-hours per day idle-running a 24/7 assistant that occasionally answers a query. A desktop with an RTX 3060 idles at ~50 W, ~1.2 kWh per day. Over a year, the Pi costs roughly the price of a coffee in electricity; the GPU box costs roughly $80–$120.
This is the real reason the Pi-as-LLM-host pattern persists despite everything above. Tokens-per-second is not the right metric for an always-on assistant — cost-per-day-it-stays-on is. The Pi wins that comparison handily, especially when you architect the system so the Pi only routes hard queries to a sleeping GPU.
Common pitfalls and gotchas
- Loading models from microSD. Slow to start, kills the card.
- Underpowered USB-C supply. Random crashes that look like model bugs.
- No active cooling. Tokens/s drops by 20–30% from thermal throttling.
- Forgetting to set thread count.
llama.cppdefaults are not always optimal; explicitly set--threads 4on a Pi 4. - Trying to run 7B+ at FP16 or q8. Will OOM or thrash swap.
- Ignoring prefill time. A 4 tok/s model with 14-second time-to-first-token feels much slower than a 4 tok/s model with 1-second time-to-first-token.
- Buying a Pi 4 specifically for LLMs in 2026. The Pi 5 is faster and the Pi 5 + AI HAT+ is dramatically faster. Buy a Pi 4 if you already have one, or if you find one cheap on the secondhand market.
Bottom line: realistic use cases for Pi-hosted LLMs
What works:
- Voice assistant frontend — wake-word detector + intent classifier + small chat model for follow-ups.
- Home Assistant integration — natural-language overlay that maps "turn off the kitchen" to actual entity calls; Qwen 1.8B or Phi-3-Mini handles this comfortably.
- Code-comment generators / shell helpers — short-prompt, short-response patterns where 4 tok/s feels fine.
- Classification at the edge — sentiment, intent, topic — Qwen 0.5B at 25 tok/s is more than fast enough.
- 24/7 lightweight RAG with small embedding models (
all-MiniLM-L6-v2) and Phi-3-Mini.
What does not work:
- Conversational chat with 7B+ models. Use the GPU box.
- Long-context tasks (>2K tokens) — prefill kills the experience.
- Anything where time-to-first-token matters and the prompt is long.
- A coding assistant that needs to read large files into context.
The Pi-as-always-on / GPU-as-burst-worker split is the winning pattern. Add a Vilros Pi Zero W starter kit as the satellite for additional sensors and you have a sub-$200 home AI fabric that quietly does useful work for ~30 watt-hours a day total.
Related guides
- Raspberry Pi 4 8GB vs Pi 5 vs Pi Zero W for a 2026 homelab
- Raspberry Pi AI HAT+ (26 TOPS): what it actually runs in 2026
- RTX 3060 12GB: best budget 1080p esports card for 2026
- Best storage for a Raspberry Pi homelab: SATA SSD over USB vs microSD
