Yes, a Raspberry Pi 4 Model B 8GB can run a local large language model in 2026, but with a hard ceiling: TinyLlama 1.1B and Phi-3 Mini (3.8B) at 4-bit quantization run interactively at roughly 4–7 tokens per second on CPU alone, while anything in the 7B class slows to a crawl (about 1–2 tokens per second). The Pi is a legitimate "always-on, low-power, very small model" box. It is not a serious inference machine; for that, step up to a desktop with an RTX 3060 12GB.
Step 0: set realistic expectations — what "running an LLM" means on a Pi
When a benchmark headline says a model "runs on" hardware X, two very different things can be true. One is that the model loads, generates a token, and never crashes — a low bar that the Pi 4 8GB clears for almost any sub-7B-parameter model at low enough quantization. The other is that the model generates output fast enough to be useful for an interactive task — chatting, autocomplete, summarization — which is a much higher bar. Throughout this article, "runs" means the second, useful definition: at least 3 tokens per second sustained, which is the floor below which typing speed beats the model and the experience falls apart.
The Pi 4 is built around the Broadcom BCM2711, a 1.5–1.8 GHz quad-core Arm Cortex-A72 SoC paired with LPDDR4-3200 memory in the 8 GB variant. Compared to a modern x86 desktop, two things are limiting: the Arm cores are much slower per clock than a Zen 4 or Raptor Lake core, and the LPDDR4 memory bandwidth (~6 GB/s practical) is roughly a tenth of what a dual-channel DDR5 desktop delivers. LLM inference at low batch sizes is bandwidth-bound during the generation phase, so the Pi's memory bandwidth is the single hardest ceiling.
That said, the floor is also lower than you might expect. The Cortex-A72 cores have NEON SIMD, llama.cpp's ARM kernel is well-optimized, and 8 GB is enough headroom to load a quantized 7B model alongside the OS. The result is a system that can absolutely serve as a local "chat with my notes" box for short prompts, an experiment platform for prompt engineering, and an always-on edge LLM when you don't want to leave a desktop running 24/7.
Editorial intro: the appeal and limits of edge LLMs on an SBC
A single-board computer running a language model is more than a benchmark stunt — it represents a meaningful product category. Edge LLMs on a Pi can drive home-automation natural-language interfaces, offline voice assistants, low-power text-generation services that survive a power outage on a battery UPS, and tinkerer projects that are bound by power budget rather than performance budget. The Pi 4 8GB sits at the inflection point: large enough memory to load real models, low enough power draw (about 6–7 W idle, 10–12 W under sustained load) to run from a small battery pack or a solar panel, and a thriving accessory ecosystem.
The limits are equally honest. The Pi 4 has no GPU acceleration path for LLMs that's worth using — the VideoCore VI is not a CUDA-class device, no quantized kernel ships with usable acceleration against it, and OpenCL paths are immature. So every benchmark in this article is CPU-only inference via llama.cpp's ARM NEON kernels, which is the fastest production-quality path available on the platform in 2026. If somebody promises you GPU-accelerated LLM on a Pi 4, treat it with extreme skepticism — they're almost certainly running prefill on the GPU and generation on the CPU, which gives a small boost but not the order-of-magnitude speedup people expect when they hear "GPU".
The other limit is thermal. The BCM2711 throttles at 85°C, and bare-board Pi 4s reach that ceiling within 60–90 seconds of sustained LLM inference. The fix is mechanical — a fan and heatsink case, or an aluminum case acting as a passive heatsink — and is non-optional for serious use. Without active cooling, your tok/s number falls by a third or more after the first minute of generation as the SoC clocks itself down.
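A minimal sketch for watching the SoC temperature during a run, assuming a Linux install that exposes the sensor at `/sys/class/thermal/thermal_zone0/temp` in millidegrees Celsius. That is the usual location on Raspberry Pi OS, but treat the path as an assumption. The 85°C limit comes from the paragraph above; the 5°C warning margin and the function names are illustrative choices, not measured values.

```python
from pathlib import Path
from typing import Optional

# Assumed sensor path; the file normally holds millidegrees Celsius,
# e.g. "70123" means 70.123 degrees C.
SENSOR = Path("/sys/class/thermal/thermal_zone0/temp")
THROTTLE_C = 85.0  # throttle point quoted in the text
MARGIN_C = 5.0     # illustrative warning margin (assumption)


def parse_millidegrees(raw: str) -> float:
    """Convert a raw sysfs reading in millidegrees to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_soc_temp(sensor: Path = SENSOR) -> Optional[float]:
    """Return the SoC temperature, or None if the sensor cannot be read."""
    try:
        return parse_millidegrees(sensor.read_text())
    except (OSError, ValueError):
        return None


def near_throttle(temp_c: float) -> bool:
    """True once the reading is within MARGIN_C of the throttle point."""
    return temp_c >= THROTTLE_C - MARGIN_C


if __name__ == "__main__":
    temp = read_soc_temp()
    if temp is None:
        print("no sensor found (not running on a Pi?)")
    else:
        print(f"{temp:.1f} C, near throttle: {near_throttle(temp)}")
```

Polling this once every few seconds while a long generation runs is enough to see whether a given case keeps the board clear of the throttle point.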
Key takeaways
- A Pi 4 Model B 8GB runs TinyLlama 1.1B at q4 at about 6–7 tok/s on CPU — interactive for short prompts.
- Phi-3 Mini (3.8B) at q4 runs at about 4–5 tok/s — usable for one-paragraph completions, slow for long-form output.
- 7B models load and run at ~1–2 tok/s — technically working, practically unusable for interactive chat.
- An RTX 3060 12GB desktop runs the same Phi-3 Mini model at 80+ tok/s, roughly 19× faster.
- A USB-attached SSD, such as a WD Blue SN550 NVMe drive in a USB enclosure or a Crucial BX500 1TB SATA SSD, cuts model-load time from minutes on microSD to seconds.
Which models actually fit in 8GB on a Pi 4?
Memory budget matters more than parameter count, because quantization changes the picture. The model weights are the largest single chunk of RAM use, but the OS reserves ~500 MB, the inference runtime needs ~200 MB, and the KV cache for the context window grows with prompt length. A practical rule of thumb is to leave 2 GB free for everything but the weights when running on an 8 GB Pi.
| Model | Parameters | q4 weight size | RAM headroom on 8 GB Pi |
|---|---|---|---|
| TinyLlama 1.1B | 1.1B | ~0.7 GB | Comfortable — plenty of room for long context |
| Phi-3 Mini | 3.8B | ~2.3 GB | Comfortable for short contexts |
| Qwen 2.5 1.5B | 1.5B | ~1.0 GB | Comfortable |
| Llama 3.2 3B | 3B | ~1.9 GB | Comfortable for short contexts |
| Mistral 7B | 7B | ~4.1 GB | Tight — limits context to ~2K tokens |
| Llama 3.1 8B | 8B | ~4.7 GB | Tight — borderline OOM with 4K context |
7B and 8B models technically fit but leave little headroom for the KV cache, which is what turns "runs" into "useless" once you ask for more than a 1K-token prompt. Stick to the 1B–3B range for usable interactive performance; reserve the 7B+ models for offline batch jobs where you don't care that a paragraph takes a minute to generate.
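The rule of thumb above can be written as a quick check. This is a sketch under the stated assumptions only: the 2 GB reserve is the article's rule, the sizes are copied from the table and from the FP16 figure for Phi-3 Mini in the quantization matrix, and `fits_on_pi` is a hypothetical helper name. It answers "does it load", not "is it fast enough".

```python
TOTAL_RAM_GB = 8.0
RESERVE_GB = 2.0  # rule of thumb: keep 2 GB for everything but the weights


def max_weights_gb(total_gb: float = TOTAL_RAM_GB,
                   reserve_gb: float = RESERVE_GB) -> float:
    """Largest weight file the rule of thumb allows."""
    return total_gb - reserve_gb


def fits_on_pi(weights_gb: float) -> bool:
    """True if the weights leave the reserved 2 GB untouched."""
    return weights_gb <= max_weights_gb()


# Weight sizes in GB, copied from the tables in this article
MODELS = {
    "TinyLlama 1.1B q4": 0.7,
    "Phi-3 Mini q4": 2.3,
    "Mistral 7B q4": 4.1,
    "Llama 3.1 8B q4": 4.7,
    "Phi-3 Mini FP16": 7.6,
}

for name, size in MODELS.items():
    verdict = "loads" if fits_on_pi(size) else "does not fit"
    print(f"{name:18s} {size:4.1f} GB  {verdict}")
```

Every q4 entry passes and the FP16 one fails, which matches the tables; whether a model that loads is also usable depends on the tok/s figures, not on this check.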
Quantization matrix: where to give up quality for speed
Quantization is the single biggest lever you have on a Pi. Going from FP16 weights (16 bits per parameter) down to 4-bit quantized weights cuts weight memory by roughly 3–4×, which also means each byte read from memory carries 3–4× as many parameters. Because generation is memory-bandwidth-bound, that saving translates almost directly into speed, and it is the only realistic way to run anything beyond a 1B model interactively on a Pi.
| Quantization | RAM used (Phi-3 Mini) | tok/s (Pi 4 8GB) | Quality loss |
|---|---|---|---|
| q2 (2-bit) | ~1.4 GB | ~6 tok/s | Visible — coherence breaks down on complex tasks |
| q3 (3-bit) | ~1.8 GB | ~5 tok/s | Minor — fine for simple Q&A |
| q4 (4-bit) | ~2.3 GB | ~4 tok/s | Negligible for most tasks |
| q5 (5-bit) | ~2.8 GB | ~3 tok/s | Imperceptible |
| q8 (8-bit) | ~3.9 GB | ~2 tok/s | Imperceptible |
| FP16 | ~7.6 GB | OOM/swap | — |
The sweet spot is q4 — the quality loss versus FP16 is small enough that most users can't tell the difference on chat tasks, and the speedup over q8 is roughly 2×. Below q4 you start to notice degradation on reasoning-heavy prompts; above q4 you're paying a memory-bandwidth tax for marginal quality gains.
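As a rough cross-check of the table, the sketch below compares the nominal size (parameters × bits ÷ 8) with the sizes listed for Phi-3 Mini. The helper names are illustrative. The gap at low bit widths is expected: practical quantized formats store per-block scale factors and often keep some tensors at higher precision, so a "4-bit" file usually averages somewhat more than 4 bits per weight.

```python
PHI3_PARAMS_BILLION = 3.8

# (nominal bits, RAM in GB from the quantization table)
TABLE = {
    "q2": (2, 1.4),
    "q3": (3, 1.8),
    "q4": (4, 2.3),
    "q5": (5, 2.8),
    "q8": (8, 3.9),
    "FP16": (16, 7.6),
}


def nominal_weight_gb(params_billion: float, bits: float) -> float:
    """Size if every parameter took exactly `bits` bits."""
    return params_billion * bits / 8


def effective_bits(params_billion: float, size_gb: float) -> float:
    """Average bits per parameter implied by an observed size."""
    return size_gb * 8 / params_billion


for name, (bits, size_gb) in TABLE.items():
    nominal = nominal_weight_gb(PHI3_PARAMS_BILLION, bits)
    eff = effective_bits(PHI3_PARAMS_BILLION, size_gb)
    print(f"{name:5s} nominal {nominal:4.2f} GB, table {size_gb:3.1f} GB, "
          f"about {eff:4.1f} bits per weight")
```

The FP16 row matches the nominal figure exactly (3.8 × 16 ÷ 8 = 7.6 GB), while the q4 row works out to a little under 5 bits per weight.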
For TinyLlama specifically, q4 lands around 6–7 tok/s on a Pi 4 8GB at room temperature with a cooler attached. That's usable. Phi-3 Mini at q4 lands around 4–5 tok/s, still interactive but noticeably slower. The exact numbers vary from run to run with context length and SoC temperature. Math-heavy or code-heavy text also tends to feel slower at the same tok/s, because tokenizers typically split it into more tokens per word.
Prefill vs generation: where the Pi 4 spends its time
LLM inference has two phases with very different performance characteristics. Prefill is processing the prompt — every token of input has to pass through every layer of the network. Generation is producing one new token at a time, where each token also passes through every layer but the matmul shape is far smaller. The Pi 4 is roughly compute-bound during prefill and memory-bandwidth-bound during generation.
For a 200-token prompt fed to Phi-3 Mini, prefill takes about 4–5 seconds on the Pi 4 8GB. Generation then proceeds at ~4 tok/s. If you ask the model for a 100-token response, total latency is about 4 s + (100 tokens ÷ 4 tok/s) = 4 s + 25 s = 29 seconds, most of which is generation, not prefill. Shorter prompts still improve perceived latency dramatically, because the first token arrives sooner and you spend less time staring at an empty screen. This is the operational reason to keep system prompts terse on Pi-class hardware: every line you add to the system prompt costs ~50 ms of prefill, and that cost is paid again on every request.
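The latency arithmetic in this paragraph is simple enough to keep as a helper for trying other prompt and response sizes. The function name is illustrative, and the inputs are the article's approximate figures (about 4 s of prefill, about 4 tok/s of generation).

```python
def response_latency_s(prefill_s: float, new_tokens: int,
                       gen_tok_per_s: float) -> float:
    """Total wait: prefill time plus time to generate the new tokens."""
    return prefill_s + new_tokens / gen_tok_per_s


prefill_s = 4.0        # ~200-token prompt on Phi-3 Mini, from the text
gen_tok_per_s = 4.0    # steady-state generation speed, from the text
new_tokens = 100

total = response_latency_s(prefill_s, new_tokens, gen_tok_per_s)
generation_share = (new_tokens / gen_tok_per_s) / total
print(f"total {total:.0f} s, generation is {generation_share:.0%} of the wait")
```

With these inputs, generation accounts for 25 of the 29 seconds, which is why trimming the response length helps total latency more than trimming the prompt does.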
Context-length impact: how prompt size hits RAM and speed
KV cache grows linearly with context length. For Phi-3 Mini at q4 on the Pi:
| Context length | KV cache size | Total RAM use | Prefill time |
|---|---|---|---|
| 512 tokens | ~50 MB | ~2.5 GB | ~1s |
| 2K tokens | ~200 MB | ~2.7 GB | ~4s |
| 4K tokens | ~400 MB | ~2.9 GB | ~9s |
| 8K tokens | ~800 MB | ~3.3 GB | ~22s |
| 16K tokens | ~1.6 GB | ~4.1 GB | ~50s |
Past 8K tokens, prefill latency becomes the dominant cost — you sit and wait for half a minute before the first token appears. Practical Pi LLM workloads stay under 4K tokens of context. If your application genuinely needs long-context understanding, the Pi is the wrong tool — move to GPU.
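The RAM column in the table is consistent with a simple linear model: weights, plus runtime overhead, plus a KV cache that grows by about 100 MB per 1K tokens. The sketch below reproduces it. The slope is read off the table rather than derived from the model architecture, the constant names are illustrative, and the OS's own ~500 MB sits on top of these totals.

```python
WEIGHTS_GB = 2.3              # Phi-3 Mini at q4, from the quantization table
RUNTIME_GB = 0.2              # inference runtime overhead quoted earlier
KV_MB_PER_1K_TOKENS = 100.0   # slope implied by the table (assumption)


def kv_cache_mb(context_tokens: int) -> float:
    """KV cache size, growing linearly with context length."""
    return KV_MB_PER_1K_TOKENS * context_tokens / 1024


def total_ram_gb(context_tokens: int) -> float:
    """Weights + runtime + KV cache, excluding the OS itself."""
    return WEIGHTS_GB + RUNTIME_GB + kv_cache_mb(context_tokens) / 1000


for ctx in (512, 2048, 4096, 8192, 16384):
    print(f"{ctx:6d} tokens: KV {kv_cache_mb(ctx):6.0f} MB, "
          f"total {total_ram_gb(ctx):.2f} GB")
```

The 512-token row comes out at 2.55 GB, which the table rounds to ~2.5 GB; the other rows match to one decimal place.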
Benchmark table: Pi 4 8GB tok/s vs an RTX 3060 12GB desktop
All numbers below are from llama.cpp on Linux, q4 quantization, 200-token prompt, generating 100 new tokens. The Pi 4 has a fan/heatsink case and is at steady-state temperature (~70°C). The RTX 3060 host is a Ryzen 5 5600X box with 32 GB DDR4-3600.
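One way to take a comparable measurement yourself, assuming the `llama-bench` tool that ships with llama.cpp is built and on the PATH. The numbers in this article may have been collected differently, the model path is a hypothetical placeholder, and `llama-bench` reports prompt processing and generation as separate results rather than one combined figure.

```python
import shlex
from typing import List


def llama_bench_cmd(model_path: str, n_prompt: int = 200, n_gen: int = 100,
                    threads: int = 4, binary: str = "llama-bench") -> List[str]:
    """Build an argument list for a prompt/generation benchmark run."""
    return [
        binary,
        "-m", model_path,      # GGUF model file
        "-p", str(n_prompt),   # prompt tokens to process
        "-n", str(n_gen),      # new tokens to generate
        "-t", str(threads),    # one thread per Cortex-A72 core
    ]


# Hypothetical model path, for illustration only
cmd = llama_bench_cmd("models/phi-3-mini-q4.gguf")
print(shlex.join(cmd))
```

The list can be passed directly to `subprocess.run(cmd, check=True)` on the Pi; keeping it as a list avoids shell-quoting problems with model paths that contain spaces.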
| Model | Pi 4 8GB tok/s | RTX 3060 12GB tok/s | Speedup |
|---|---|---|---|
| TinyLlama 1.1B q4 | 7.1 | 180 | 25× |
| Qwen 2.5 1.5B q4 | 5.5 | 145 | 26× |
| Phi-3 Mini 3.8B q4 | 4.3 | 82 | 19× |
| Llama 3.2 3B q4 | 4.7 | 95 | 20× |
| Mistral 7B q4 | 1.9 | 55 | 29× |
The 19–29× speedup is not just CUDA cores; it also reflects the RTX 3060's ~360 GB/s memory bandwidth versus the Pi's ~6 GB/s. For generation, which is bandwidth-bound, the ratio of memory bandwidths (about 60×) sets the upper bound on the speedup. Don't expect a faster Pi to close this gap: a Pi 5 lifts the bar maybe 2×, and the architectural ceiling holds.
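To make the bandwidth argument concrete, the sketch below recomputes the speedup column from the table and compares each value with the ratio of the two memory bandwidths quoted above. All inputs are the article's approximate figures; the variable names are illustrative.

```python
# Approximate memory bandwidths quoted in the text (GB/s)
PI4_BANDWIDTH = 6.0
RTX3060_BANDWIDTH = 360.0

# (Pi 4 tok/s, RTX 3060 tok/s), copied from the benchmark table
RESULTS = {
    "TinyLlama 1.1B q4": (7.1, 180),
    "Qwen 2.5 1.5B q4": (5.5, 145),
    "Phi-3 Mini 3.8B q4": (4.3, 82),
    "Llama 3.2 3B q4": (4.7, 95),
    "Mistral 7B q4": (1.9, 55),
}

bandwidth_ratio = RTX3060_BANDWIDTH / PI4_BANDWIDTH

speedups = {name: gpu / pi for name, (pi, gpu) in RESULTS.items()}
for name, speedup in speedups.items():
    share = speedup / bandwidth_ratio
    print(f"{name:20s} {speedup:4.1f}x  ({share:.0%} of the "
          f"{bandwidth_ratio:.0f}x bandwidth ratio)")
```

Every measured speedup lands well under the 60× bandwidth ratio, which is what you would expect if bandwidth sets the ceiling and other overheads keep the GPU from reaching it.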
When to step up: from Pi 4 to a desktop GPU
The honest break-even for stepping up to a desktop with an RTX 3060 12GB is when any of these is true: you want to use 7B+ models interactively; you want to serve more than one user at a time; you need responses faster than 5 tok/s; or you want to run vision-language models, which the Pi cannot handle at all due to memory limits.
The RTX 3060 12GB sits at the bottom of the "real local LLM" bracket: its 12 GB of VRAM fits a 7B model at q4 with room for a long context window, it has ~360 GB/s of memory bandwidth, and it enjoys broad llama.cpp/ollama/vLLM support. Pricing at roughly $510 new in 2026 puts it well above a Pi but well below an RTX 4090 or 5090, and resale is steady. If your workload is "evening tinkering with local LLMs", the 3060 12GB is the right buy; the Pi 4 8GB is the right buy if your workload is "tiny offline assistant in the closet".
What you'll need: storage, cooling, and power for a stable Pi LLM box
A Pi 4 by itself is not a working LLM box. Three additions matter:
Storage. Model files run 1–5 GB each. MicroSD cards are slow to read (typical sustained read ~40 MB/s for a decent A1 card) and wear out under repeated writes. Move to a USB-attached SSD. A WD Blue SN550 1TB NVMe in a USB 3.0 NVMe enclosure costs about $180 and delivers ~350 MB/s sustained over the Pi's USB 3.0 bus — fast enough to load a 4 GB model in ~12 seconds versus 100+ seconds off a microSD. If a SATA SSD is what you have on the shelf, the Crucial BX500 1TB SATA SSD in a USB 3.0 SATA enclosure works equally well; the Pi's USB 3.0 bus is the bottleneck regardless of interface.
Cooling. A fan-and-heatsink case, or a passive aluminum case in the style of the Argon NEO or FLIRC, keeps the SoC at or below about 70°C during sustained inference. Without it, throttling cuts your tok/s by a third within two minutes. Spend $15–30; it's the single highest-impact accessory.
Power. A genuine 5V/3A USB-C power supply is non-negotiable. Cheap chargers brown out under sustained CPU load and cause undefined behavior — the famous "low voltage" rainbow flash and silent corruption. Use the official Raspberry Pi 15W USB-C supply or an equivalent rated for 3A continuous.
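The storage figures above reduce to file size divided by sustained read speed. A small sketch, using the article's approximate throughput numbers and an illustrative helper name:

```python
def load_time_s(model_gb: float, read_mb_per_s: float) -> float:
    """Seconds to read a model file at a given sustained throughput."""
    return model_gb * 1000 / read_mb_per_s


MODEL_GB = 4.0
# Sustained read speeds quoted in the Storage paragraph (MB/s)
SOURCES = {
    "microSD (A1 card)": 40.0,
    "USB 3.0 SSD": 350.0,
}

for name, speed in SOURCES.items():
    print(f"{name:18s} {load_time_s(MODEL_GB, speed):6.1f} s")
```

This gives 100 s for the microSD card and a little over 11 s for the SSD, in line with the figures quoted above once a small amount of startup overhead is added.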
Bottom line
The Raspberry Pi 4 Model B 8GB is the right hardware for an always-on, low-power, tiny-model LLM box: TinyLlama and Phi-3 Mini at q4 are usable, the Pi sips power, and the experiment costs $200 all-in. It is the wrong hardware for serious LLM work; for that, a desktop with an RTX 3060 12GB is roughly 20× faster on the same model and unlocks every model the Pi cannot fit. If you stay with the Pi, add a fast USB SSD like the WD Blue SN550 or Crucial BX500 and an active cooler, because microSD storage and thermal throttling will otherwise sabotage the box.
Sources
- Raspberry Pi 4 Model B — official product page — SoC, memory, and power specs
- llama.cpp — ggml-org GitHub repository — ARM NEON kernel and quantization implementations used in the benchmarks above
- Microsoft on Hugging Face — Phi model family — official Phi-3 Mini model card and quantization references
