Yes, but expectations matter. A Raspberry Pi 4 Computer Model B 8GB running Ollama handles 1B-3B parameter models at q4 quantization at single-digit tokens per second, CPU-only. The Pi has no usable GPU acceleration for LLMs, so do not compare its throughput to a desktop GPU; compare it to other small CPU-only Linux boxes.
"Running a local LLM on a Pi" sits at the intersection of three audiences: makers who want a self-hosted assistant in a low-power case, students learning the LLM stack on hardware they already own, and tinkerers who want to brag they got a 3B model running on $80 of silicon. None of those audiences expects ChatGPT-grade throughput. They want something that works, reliably, on a quiet board they can leave on 24/7. The Pi 4 8GB delivers that in a narrow but real sense.
Per the Raspberry Pi 4 Model B product page, the board ships with a quad-core Cortex-A72 at 1.5 GHz, 8GB of LPDDR4, and gigabit Ethernet. The Pi's VideoCore VI GPU exists but is not useful for general-purpose LLM inference, so Ollama on the Pi falls back to ARM Neon CPU code paths. That CPU is what bounds your throughput.
For comparison: a desktop with a ZOTAC RTX 3060 Twin Edge 12GB runs a 3B model at q4 at roughly 40 tok/s, about 8-10x what the Pi delivers. Per the Phoronix coverage of ARM Linux benchmarks, the Pi's per-core CPU is well-characterized, and small-LLM throughput on it tracks the published ARM Neon performance numbers closely.
Key takeaways
- The Pi 4 8GB runs 1B-3B models at q4 at roughly 4-9 tok/s for generation, CPU-only.
- 7B-8B models technically load, but generation drops to roughly 1-2 tok/s, which is not usable interactively.
- The Pi has no usable GPU acceleration for LLMs; treat it as a small ARM CPU box.
- Storage choice (for example, a SanDisk 1TB 3D NAND SSD over USB 3 instead of an SD card) affects load times, not inference speed.
- Active cooling is mandatory for sustained inference — passive heatsinks hit thermal throttle within minutes.
- A used ZOTAC RTX 3060 box is roughly 8-10x faster for about $500-600 more, which makes it a real upgrade path.
Which models fit in 8GB RAM?
The Pi shares its 8GB between the OS, your applications, and the model. Practical RAM headroom for the model is roughly 6GB on a stripped Raspberry Pi OS Lite install.
| Model | Params | Quant | RAM footprint | Fits Pi 4 8GB? | Usable? |
|---|---|---|---|---|---|
| TinyLlama 1.1B | 1.1B | q4_K_M | ~0.7 GB | yes | yes, snappy |
| Phi-3 Mini 3.8B | 3.8B | q4_K_M | ~2.3 GB | yes | yes, slow |
| Llama 3.2 3B | 3.0B | q4_K_M | ~2.0 GB | yes | yes, slow |
| Qwen 2.5 3B | 3.0B | q4_K_M | ~2.0 GB | yes | yes, slow |
| Llama 3.1 8B | 8B | q4_K_M | ~4.8 GB | yes, tight | barely usable |
| Mistral 7B | 7B | q4_K_M | ~4.2 GB | yes | very slow |
| Llama 3.1 8B | 8B | q5_K_M | ~5.6 GB | yes, tight | not usable interactively |
| Llama 3.1 8B | 8B | q8_0 | ~8.5 GB | no | n/a |
The 1-3B band is the practical zone. At 7-8B the model technically fits, but generation throughput collapses to roughly 1-2 tokens per second.
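The RAM figures in the table can be approximated with simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below is a rough estimate only. The bits-per-weight values are approximate figures assumed for illustration, the 6 GB headroom comes from the paragraph above, and the estimate covers weights only (no KV cache or runtime overhead).

```python
# Approximate bits per weight for a few GGUF quantizations.
# These are rough, assumed figures for illustration, not exact values.
BITS_PER_WEIGHT = {"q4_K_M": 4.85, "q5_K_M": 5.7, "q8_0": 8.5}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimate the size of the quantized weights in GB (weights only)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion: float, quant: str, headroom_gb: float = 6.0) -> bool:
    """True if the estimated weights fit in the RAM left over for the model."""
    return weights_gb(params_billion, quant) <= headroom_gb


for name, params, quant in [
    ("Llama 3.2 3B", 3.0, "q4_K_M"),
    ("Llama 3.1 8B", 8.0, "q5_K_M"),
    ("Llama 3.1 8B", 8.0, "q8_0"),
]:
    size = weights_gb(params, quant)
    print(f"{name} {quant}: ~{size:.1f} GB, fits={fits(params, quant)}")
```

Leave extra margin beyond the weights themselves: context length adds KV-cache memory on top, which is why the 8B rows are marked "tight".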
Benchmark table: tok/s on the Pi 4 8GB
Public community measurements consistently cluster in the ranges below. Treat them as orientation: your specific cooling, distribution, and Ollama version matter.
| Model | Quant | Approx. prefill tok/s | Approx. generation tok/s |
|---|---|---|---|
| TinyLlama 1.1B | q4_K_M | ~70 | ~7-9 |
| Llama 3.2 3B | q4_K_M | ~28 | ~4-5 |
| Phi-3 Mini 3.8B | q4_K_M | ~22 | ~3-4 |
| Llama 3.1 8B | q4_K_M | ~9 | ~1-2 |
| Mistral 7B | q4_K_M | ~10 | ~1.5-2 |
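To check figures like these on your own board, Ollama's local HTTP API reports token counts and durations with each non-streaming response. The sketch below is a minimal example that assumes a default Ollama install listening on port 11434 and a model that has already been pulled; the helper names `rates` and `measure` are illustrative, not part of Ollama.

```python
import json
import urllib.request

# Default endpoint of a local Ollama server (assumed; adjust if you changed it).
OLLAMA_URL = "http://localhost:11434/api/generate"


def rates(response: dict) -> tuple[float, float]:
    """Return (prefill tok/s, generation tok/s) from an Ollama response.

    Ollama reports durations in nanoseconds.
    """
    prefill = response["prompt_eval_count"] / (response["prompt_eval_duration"] / 1e9)
    generation = response["eval_count"] / (response["eval_duration"] / 1e9)
    return prefill, generation


def measure(model: str, prompt: str) -> tuple[float, float]:
    """Send one non-streaming request to the local server and compute rates."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return rates(json.load(reply))


# Example (needs a running Ollama server with the model already pulled):
# prefill, generation = measure("llama3.2:3b", "Explain what a Raspberry Pi is.")
```

Run it a few times after the board has warmed up; the first request includes model load time, and a throttled CPU will pull both numbers down.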
"Snappy" on a Pi means generation in the 5+ tok/s range. Anything under 2 tok/s is technically functional but feels broken in interactive use.
Quantization matrix: 3B model on the Pi 4
The 3B class is the sweet spot. The matrix below uses Llama 3.2 3B as the representative model.
| Quant | RAM footprint | Approx. gen tok/s | Quality vs fp16 |
|---|---|---|---|
| q2_K | ~1.2 GB | ~6 | noticeable quality loss |
| q3_K_M | ~1.5 GB | ~5 | small but visible drop |
| q4_K_M | ~2.0 GB | ~4-5 | the standard, near-lossless |
| q5_K_M | ~2.4 GB | ~3.5 | best quality per RAM |
| q6_K | ~2.8 GB | ~3 | marginal gain over q5 |
| q8_0 | ~3.5 GB | ~2 | reference quality, throughput pain |
q4_K_M is the sensible default. The drop from q4 to q3 is more visible on small models than on large ones because there are fewer parameters to absorb the precision loss.
Why prefill is slow on the Pi
Prefill — the model's first pass over your prompt — is compute-bound. On a GPU it is fast because the GPU has thousands of FP16 ALUs. On a Pi's quad-core ARM CPU it is slow because there are only four cores with Neon SIMD. The result: long prompts hurt the Pi more than they hurt a GPU box.
A 200-token prompt on the Pi 4 with Llama 3.2 3B at q4 takes roughly 8-10 seconds before the first generated token. A 2,000-token prompt takes roughly 80-100 seconds before generation starts. Time to first token grows at least in proportion to prompt length, which is the reason the Pi is not a good fit for RAG-heavy workflows or long-context agents.
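The wait before the first token is just prompt length divided by prefill rate. The sketch below works through the two prompts above, assuming a sustained prefill rate of 25 tok/s (a hypothetical figure slightly below the benchmark table's ~28 tok/s for Llama 3.2 3B).

```python
def seconds_to_first_token(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Estimate how long prefill takes before generation can start."""
    return prompt_tokens / prefill_tok_per_s


ASSUMED_PREFILL_RATE = 25.0  # tok/s, an assumption for illustration

for prompt_tokens in (200, 2000):
    wait = seconds_to_first_token(prompt_tokens, ASSUMED_PREFILL_RATE)
    print(f"{prompt_tokens} tokens -> ~{wait:.0f} s before the first token")
```

The practical lesson is to keep system prompts and retrieved context short on the Pi, since every extra prompt token is paid for before any output appears.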
The supporting build
Three components matter beyond the Pi itself.
- Storage. An NVMe SSD is overkill; a SATA SSD over USB 3 is the sweet spot. A 1TB SanDisk Ultra 3D NAND SSD gives you headroom for a handful of model files and faster cold loads than an SD card. SD card boot is fine but model load takes ~5x longer.
- Cooling. A passive aluminum heatsink case will throttle the CPU within 5 minutes of sustained inference. An active fan case (the official Pi 4 case fan or an Argon ONE) holds clock speed under load and is mandatory if you intend to run inference workloads continuously.
- Power. Use the official 15.3W USB-C PSU. Under-voltage events during inference cause silent slowdowns that look like buggy software.
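Both the cooling and the power problems show up in the firmware's throttle flags, which Raspberry Pi OS exposes through `vcgencmd get_throttled`. The sketch below decodes that output into readable labels; the helper names are illustrative, and `read_throttled` only works on a Pi where `vcgencmd` is installed.

```python
import subprocess

# Bit meanings reported by `vcgencmd get_throttled` on Raspberry Pi OS.
FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output: str) -> list[str]:
    """Decode a line such as 'throttled=0x50005' into readable flags."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [label for bit, label in FLAGS.items() if value & (1 << bit)]


def read_throttled() -> list[str]:
    """Query the firmware; only works on a Pi with vcgencmd installed."""
    result = subprocess.run(
        ["vcgencmd", "get_throttled"], capture_output=True, text=True, check=True
    )
    return decode_throttled(result.stdout)


# Example: decode a sample value without touching the hardware.
print(decode_throttled("throttled=0x50005"))
```

An empty list after a long inference run means the cooling and power supply are holding up. Any "has occurred" flag means throughput numbers from that session are suspect.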
Perf-per-dollar vs alternatives
| Box | Approx. cost | Approx. gen tok/s (3B q4) | Power draw |
|---|---|---|---|
| Pi 4 8GB | ~$80 board + $40 supporting | ~4-5 | ~7W |
| Pi 5 8GB | ~$80 board + $40 supporting | ~10-12 | ~12W |
| Used Intel N100 mini-PC | ~$120-180 | ~12-15 | ~10W |
| Used Ryzen 5 5600 desktop | ~$300 | ~25-35 | ~65W |
| Used RTX 3060 box | ~$650 | ~38-42 | ~200W |
The Pi 4 wins on power draw and silent operation. It loses badly on raw throughput per dollar — a used mini-PC delivers triple the tok/s for less than double the price. If your priority is bragging rights, the Pi is the right answer. If your priority is daily-driver LLM use, a mini-PC or used desktop is the smarter spend.
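To put a number on throughput per dollar, the sketch below converts the table into tok/s per $100 spent. The values are midpoints of the ranges in the table, which is a simplification assumed for illustration.

```python
# name -> (approx. total cost in USD, approx. generation tok/s on a 3B q4 model)
# Midpoints of the table's ranges, assumed for illustration.
BOXES = {
    "Pi 4 8GB": (120, 4.5),
    "Pi 5 8GB": (120, 11.0),
    "Used Intel N100 mini-PC": (150, 13.5),
    "Used Ryzen 5 5600 desktop": (300, 30.0),
    "Used RTX 3060 box": (650, 40.0),
}


def tok_s_per_100_usd(cost_usd: float, tok_s: float) -> float:
    """Generation throughput bought by each $100 of hardware."""
    return tok_s / cost_usd * 100


ranking = sorted(BOXES, key=lambda name: tok_s_per_100_usd(*BOXES[name]), reverse=True)
for name in ranking:
    print(f"{name}: {tok_s_per_100_usd(*BOXES[name]):.1f} tok/s per $100")
```

Swap in the prices you can actually find locally; used-hardware prices vary enough to reorder the middle of this ranking.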
Memory: 4GB vs 8GB Pi
The 8GB Pi 4 is the right tier for LLM work. The 4GB and 2GB variants can run TinyLlama-class models but cannot load a 3B model at q4 without aggressive swap, and Pi swap on SD or even a USB SSD is slow enough that swap-bound inference collapses to under one token per second. The cost difference between the 4GB and 8GB Pi 4 is small — about $20 — and worth it for any LLM project.
The newer Pi 5 extends this further: its top variant offers 16GB of LPDDR4X, and the meaningfully faster Cortex-A76 takes the 3B class from "barely interactive" to "comfortably interactive" and pulls 7B models into the usable zone. For new builds dedicated to LLM workloads, the Pi 5 8GB or 16GB is now the smarter board. The Pi 4 8GB remains relevant for existing hardware and for the cheapest possible "I want to brag a Pi runs an LLM" sticker.
Common pitfalls
- Booting from SD. SD cards are slow and wear-fragile. Boot from a USB 3 SSD for any serious use.
- Skipping active cooling. Without a fan, the Pi throttles in minutes. You think the model is slow; the CPU is actually capped.
- Loading 7B models. They technically fit. They do not run usefully. Stay in the 1-3B band.
What a Pi 4 8GB LLM rig is actually good for
The Pi 4 LLM rig works for a narrow set of real applications:
- Home automation assistants. Latency-tolerant. Pair Ollama with Home Assistant for voice control where a response time of a few seconds is fine.
- Background classification or summarization. Cron-driven jobs that process logs, emails, or RSS into structured output.
- Always-on local search. A small embedder + vector store + 3B chat model on a Pi makes a respectable personal-knowledge query box.
- Education and demos. Teaching the LLM stack on hardware students already own.
- Air-gapped, low-power deployments. Where a 200W desktop is not an option.
What it is not good for: real-time chat, code assistance, agent loops, document Q&A with long context, or anything where 4-5 tok/s feels too slow.
When NOT to use a Pi 4 for LLMs
If your use case is interactive (chat, code help, anything where you watch tokens stream), a Pi 4 will frustrate you. A used x86 mini-PC delivers roughly triple the throughput for less than double the cost. A used RTX 3060 box delivers roughly 8-10x the throughput for about five times the money. The Pi is the right tool for background, low-power, latency-tolerant workloads.
Worked example: home-automation assistant
A representative production deployment: Pi 4 8GB running Llama 3.2 3B at q4_K_M behind Home Assistant. Voice command captured, transcribed by a small Whisper variant on the same Pi, fed to Llama for intent extraction, action executed. End-to-end latency from voice end to action: 4-6 seconds. That is acceptable for "turn off the kitchen lights" — comparable to a hardware Alexa or Google Home interaction — and the whole stack runs offline with no cloud dependency.
Power and uptime economics
Always-on operation is one of the Pi's strongest cases. At ~7W typical and ~12W under load, a Pi LLM rig running 24/7 costs about $7-12 per year in electricity at typical U.S. residential rates. That is meaningfully cheaper than running an x86 mini-PC at $20-30 per year, and dramatically cheaper than running a desktop with a 3060 at $120-180 per year if you leave the GPU box on continuously. For 24/7 ambient deployments — a kitchen voice assistant, an always-listening home automation hub, an offline RSS classifier — the Pi's running cost is the deciding factor more often than its raw throughput.
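The running-cost arithmetic is easy to redo for your own tariff. The sketch below assumes a rate of $0.15 per kWh, a hypothetical figure for illustration; substitute the rate on your bill.

```python
HOURS_PER_YEAR = 24 * 365


def annual_cost_usd(watts: float, usd_per_kwh: float) -> float:
    """Yearly electricity cost of a device drawing a constant load."""
    return watts / 1000 * HOURS_PER_YEAR * usd_per_kwh


ASSUMED_RATE = 0.15  # USD per kWh, an assumption for illustration

# The Pi's ~7W typical draw from the paragraph above.
print(f"Pi 4 at 7W: ${annual_cost_usd(7, ASSUMED_RATE):.2f} per year")
```

Real draw varies with load, so treat the result as a ballpark and measure at the wall if the number matters to you.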
Bottom line
A Raspberry Pi 4 8GB can run small local LLMs with Ollama, slowly. The 1-3B parameter band is the practical zone, throughput is in the single digits of tok/s, and the win is privacy and 24/7 operation rather than speed. For interactive chat, look elsewhere. For an always-on, low-power home assistant or background classifier, the Pi 4 plus Ollama is a clean, cheap deployment that just works.
Related guides
- Run a local LLM on a Raspberry Pi 4 8GB: what works in 2026 — the broader Pi LLM survey
- Self-hosted Jellyfin on a Raspberry Pi 4 8GB — what else the Pi 4 8GB handles
- Ollama on a 12GB RTX 3060: best models and tok/s in 2026 — the upgrade target
- Air-gapped local LLM rig for privacy — privacy-first build at a higher tier
Citations and sources
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
