Yes — a Raspberry Pi 4 with 8GB of RAM can run a local LLM in 2026, but only the small ones. Per public community benchmarks on r/LocalLLaMA, a Pi 4 8GB runs 1B-3B parameter models at q4 quantization at single-digit to low-double-digit tokens per second on CPU-only inference. That is enough for batch tasks, home-automation triggers, and offline tinkering — but it is not a real-time chat assistant.
What self-hosters realistically expect from a $75 SBC
The Raspberry Pi 4 Model B 8GB is the cheapest credible local-LLM target on the market. At roughly $75-$90 in 2026, it is closer to a Roku stick than a workstation. People reach for it not because it is fast but because it is small, low-power (5-7W idle, 8-15W under load), silent, and runs Linux. For voice assistants, log summarizers, or always-on classifiers, that profile is exactly right.
The Pi 4 8GB pairs a quad-core ARM Cortex-A72 at 1.8 GHz with LPDDR4 memory. Per Raspberry Pi's official Pi 4 product page, the memory subsystem tops out at roughly 6-7 GB/s aggregate bandwidth — about 50x slower than an RTX 3060's 360 GB/s VRAM. That bandwidth gap is exactly what limits the Pi as an inference platform. Generation tok/s on transformer decoders is overwhelmingly memory-bandwidth-bound, and the Pi has very little to spend.
So self-hosters get a useful but bounded answer. The Pi 4 runs small quantized models adequately, opens up a learning path into the local-LLM stack at trivial cost, and lights up real projects in the home-automation and edge-classification space. It does not replace a discrete GPU for general-purpose chat. The wall is hard, predictable, and bandwidth-shaped — there is no software trick that removes it.
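The bandwidth wall can be approximated with one division. The sketch below assumes each generated token streams roughly the full set of weights from memory once, so tokens per second is about memory bandwidth divided by model size. The 6.5 GB/s midpoint and the function name are illustrative assumptions, the model sizes are the on-disk figures quoted later in this article, and the result is a first-order estimate rather than a measurement.

```python
def est_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """First-order decode estimate: every token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


PI4_BANDWIDTH_GB_S = 6.5  # assumed midpoint of the 6-7 GB/s figure quoted above

# On-disk q4_K_M sizes (GB) as quoted elsewhere in this article.
MODELS_GB = {
    "TinyLlama 1.1B": 0.7,
    "Phi-3 Mini 3.8B": 2.3,
    "Mistral 7B": 4.0,
}

for name, size_gb in MODELS_GB.items():
    rate = est_tokens_per_second(PI4_BANDWIDTH_GB_S, size_gb)
    print(f"{name}: ~{rate:.1f} tok/s")
```

The estimates land in the same bands as the community numbers reported below, which is what you would expect if generation is bandwidth-limited rather than compute-limited.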
Key takeaways
- The practical sweet spot is 1B-3B models at q4. TinyLlama 1.1B, Phi-3 Mini 3.8B, Llama 3.2 1B/3B fit comfortably and produce useful work at low speed.
- Generation speed is single-digit to low-double-digit tokens per second. Expect 3-12 tok/s depending on model size, quantization, and clocking.
- Use a USB 3.0 SSD, not the SD card. SD random I/O kills any operation that hits swap or reloads a model.
- q4_K_M is the universal default. q2/q3 saves memory at significant quality cost; q5/q6 fits but slows everything down.
- The Pi 4 is the right tool for always-on, low-throughput, edge-classification, or learning workloads. It is the wrong tool for interactive chat or anything mid-size.
Which models actually run on a Raspberry Pi 4's 8GB of RAM?
The 8GB Pi 4 has roughly 7GB of usable RAM after the kernel and base system overhead. That leaves a working budget for the model, the runtime, the KV cache, and your application code. Realistic candidates:
- TinyLlama 1.1B at q4_K_M — ~700MB on disk, 1-1.5GB at runtime. Generates roughly 8-12 tok/s on a Pi 4. Useful for small classifications, summarization of short text, and toy chat.
- Llama 3.2 1B at q4_K_M — similar footprint to TinyLlama, noticeably better quality.
- Phi-3 Mini 3.8B at q4_K_M — ~2.3GB on disk, ~3-4GB runtime. Generates roughly 3-6 tok/s. Quality jumps meaningfully — this is the smallest "real assistant" tier.
- Llama 3.2 3B at q4_K_M — ~2GB on disk, ~3GB runtime. Comparable speed to Phi-3 Mini; slightly different strengths.
- Mistral 7B at q4_K_M — ~4GB on disk, ~5-6GB runtime. Technically fits with care, but generation drops to 1-2 tok/s, which most users will find painfully slow.
Anything larger than 7B realistically requires aggressive swap and produces tok/s in the fractional range — closer to "the Pi will eventually finish" than to interactive use.
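The budget arithmetic behind that list can be sketched as follows. The runtime footprints are the upper-end figures from the bullets above; the 0.5 GB application overhead and the "fits / tight / needs swap" thresholds are hypothetical values chosen for illustration, not measured limits.

```python
USABLE_RAM_GB = 7.0  # usable RAM on the 8GB board, per the text above

# Upper-end runtime footprints (GB) from the candidate list above.
RUNTIME_GB = {
    "TinyLlama 1.1B": 1.5,
    "Llama 3.2 3B": 3.0,
    "Phi-3 Mini 3.8B": 4.0,
    "Mistral 7B": 6.0,
}


def headroom_gb(runtime_gb: float, app_overhead_gb: float = 0.5) -> float:
    """RAM left after the model runtime and a hypothetical app overhead."""
    return USABLE_RAM_GB - runtime_gb - app_overhead_gb


for name, runtime in RUNTIME_GB.items():
    left = headroom_gb(runtime)
    # Thresholds are illustrative: 1 GB or more of slack counts as comfortable.
    status = "fits" if left >= 1.0 else "tight" if left >= 0 else "needs swap"
    print(f"{name}: {left:.1f} GB headroom ({status})")
```

Under these assumptions Mistral 7B is the only candidate that lands in the "tight" band, which matches the "technically fits with care" description.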
Quantization matrix: RAM required + tok/s + quality loss for 1B-3B models
Per the llama.cpp project's quantization documentation on GitHub, the memory-vs-quality tradeoff at small model sizes:
| Quant | Bits/param | TinyLlama 1.1B size | Quality loss vs fp16 | Pi 4 tok/s (1.1B est.) |
|---|---|---|---|---|
| fp16 | 16 | ~2.2 GB | None (reference) | ~3-5 |
| q8_0 | 8 | ~1.1 GB | Minimal | ~6-9 |
| q6_K | 6 | ~880 MB | Very low | ~7-10 |
| q5_K_M | 5 | ~760 MB | Low | ~7-11 |
| q4_K_M | ~4.6 | ~700 MB | Modest, often imperceptible | ~8-12 |
| q3_K_M | ~3.5 | ~550 MB | Noticeable; small models suffer more | ~9-13 |
| q2_K | ~2.6 | ~420 MB | Significant; coherence drops | ~10-14 |
For small models like TinyLlama 1.1B, quality loss at q3/q2 hits harder than on larger models — there is less redundancy in the weights to absorb quantization noise. Stay at q4 or higher for production work on the Pi.
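The size column follows approximately from parameters times bits per parameter. A minimal sketch, using the bits-per-parameter values from the table and counting weights only:

```python
def approx_size_gb(params_billion: float, bits_per_param: float) -> float:
    """Weights only: parameters x bits / 8 bits per byte, in decimal GB."""
    return params_billion * bits_per_param / 8


# Bits-per-parameter values as listed in the table above.
BITS = {
    "fp16": 16,
    "q8_0": 8,
    "q6_K": 6,
    "q5_K_M": 5,
    "q4_K_M": 4.6,
    "q3_K_M": 3.5,
    "q2_K": 2.6,
}

for quant, bits in BITS.items():
    size_mb = approx_size_gb(1.1, bits) * 1000
    print(f"{quant}: ~{size_mb:.0f} MB")
```

The fp16 and q8_0 rows reproduce the table exactly. The lower quants come out somewhat below the table's figures; a plausible explanation is that real model files also carry metadata and keep some tensors at higher precision, so treat the formula as a floor rather than an exact file size.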
How slow is it really? Tok/s expectations on CPU-only inference
Per public llama.cpp benchmark threads on r/LocalLLaMA, real-world Pi 4 8GB measurements cluster around the following bands:
| Model | Quant | Pi 4 8GB (tok/s) | Use case |
|---|---|---|---|
| TinyLlama 1.1B | q4_K_M | 8-12 | Toy chat, learning |
| Llama 3.2 1B | q4_K_M | 7-11 | Small classifier, summarizer |
| Phi-3 Mini 3.8B | q4_K_M | 3-6 | Smallest "real assistant" |
| Llama 3.2 3B | q4_K_M | 3-6 | General small chat |
| Mistral 7B | q4_K_M | 1-2 | Borderline; slow chat |
Clocking matters. A stock Pi 4 at 1.5 GHz produces lower numbers; the Pi 4B at 1.8 GHz with proper cooling is the baseline above. Aggressive overclocks to 2.0 GHz with active cooling lift numbers another 10-20% per community reports, at the cost of throttling risk in hot rooms.
To put speeds in human terms: at a typical ratio of about three-quarters of an English word per token, 10 tok/s produces roughly 7-8 words per second of output, which is readable in real time but visibly slower than ChatGPT-grade response speed. 3 tok/s is closer to two words per second, which feels like watching someone type.
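Another way to read these rates is time per reply. The sketch below assumes a hypothetical 150-token answer (about a short paragraph) and ignores prompt processing time, so real waits will be somewhat longer.

```python
def reply_seconds(reply_tokens: int, tok_per_s: float) -> float:
    """Seconds to generate a reply, ignoring prompt processing time."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return reply_tokens / tok_per_s


REPLY_TOKENS = 150  # hypothetical short-paragraph answer

for rate in (10, 3, 1.5):
    print(f"{rate} tok/s -> {reply_seconds(REPLY_TOKENS, rate):.0f} s")
```

At 10 tok/s that reply takes about 15 seconds; at 3 tok/s, about 50 seconds; at the 1-2 tok/s of a 7B model, well over a minute.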
Setting up Ollama or llama.cpp on Raspberry Pi OS
The two practical runtimes for the Pi 4 are llama.cpp (the upstream C/C++ project) and Ollama (a wrapper that adds a model manager and HTTP API on top). For Pi 4 work, Ollama is the easier on-ramp because it handles model downloads and provides a stable API for your app.
A clean setup looks like:
- Install 64-bit Raspberry Pi OS — ARMv8 is required for the optimized kernels both runtimes ship.
- Move root or model storage to a USB 3.0 SSD via the boot config — the SD card is the bottleneck once you start swapping or reloading.
- Install Ollama with the one-line script from its README; it lays down a systemd service.
- Run `ollama pull tinyllama` or `ollama pull phi3:mini` to fetch a small model.
- Configure swap on the SSD, not the SD card. 4-8GB of swap is plenty for buffering.
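For llama.cpp directly, older releases build with `make LLAMA_BLAS=1 LLAMA_BLAS_VENDOR=OpenBLAS`; current releases have moved to CMake, where the equivalent options are `-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS`. Linking OpenBLAS for matmul produces a measurable speedup on the Pi's ARM cores over a vanilla build, mainly in prompt processing rather than token generation.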
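Once the Ollama service is running, an application can talk to it over HTTP. The following is a minimal sketch using only the Python standard library, assuming Ollama is listening on its default local port and that the `tinyllama` model has already been pulled. The helper names and the example prompt are hypothetical; the speed calculation relies on the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama returns with a non-streaming response.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(result: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(model: str, prompt: str, timeout_s: float = 300.0) -> dict:
    """Send the prompt and return the decoded JSON response."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))


def main() -> None:
    """Example run; needs a reachable Ollama server, so it is not called here."""
    out = generate("tinyllama", "Summarize in one sentence: the pump ran 14 times today.")
    print(out["response"])
    print(f"{tokens_per_second(out):.1f} tok/s")
```

Calling `main()` on the Pi prints the model's reply followed by the measured generation rate, which is a quick way to check your own board against the bands quoted above. The generous timeout is deliberate: on a Pi 4 a long reply can take minutes.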
Storage and swap: why a fast SSD over USB beats the SD card
The SD card is the single largest performance trap on a Pi 4 LLM build. SD random I/O is roughly 5-20 MB/s in the real world; a SATA SSD over USB 3.0 sustains 200-400 MB/s with much better random performance. For a workload that loads multi-gigabyte model files and occasionally swaps, that gap turns a 10-second load into a 5-minute crawl.
Practical recommendation: keep the OS on the SD card or on the SSD (either works), but keep model files and swap on the SSD. A 1TB drive like the SanDisk Ultra 3D NAND or the Crucial BX500 1TB plus a USB 3.0 SATA enclosure runs under $80 total and removes the storage bottleneck entirely. Higher-end NVMe-via-USB enclosures help marginally but are not required at the Pi 4's throughput ceiling.
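The load-time gap is simple division. The sketch below treats a model load as limited by the throughput figures quoted above, with 10 MB/s and 300 MB/s picked as representative points from those ranges. That is pessimistic for the SD card if reads are mostly sequential, so read it as a worst-case illustration.

```python
def load_seconds(model_gb: float, read_mb_s: float) -> float:
    """Time to read a model file once at a sustained read rate."""
    if read_mb_s <= 0:
        raise ValueError("read_mb_s must be positive")
    return model_gb * 1000 / read_mb_s


MODEL_GB = 4.0  # Mistral 7B q4_K_M on-disk size quoted earlier

for label, rate in (("SD card at 10 MB/s", 10), ("USB 3.0 SSD at 300 MB/s", 300)):
    print(f"{label}: ~{load_seconds(MODEL_GB, rate):.0f} s")
```

Under those assumptions the same 4GB file takes about 400 seconds from the SD card and about 13 seconds from the SSD.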
When to stop fighting the Pi and step up to an RTX 3060 12GB box
There is a point at which an enthusiast keeps trying to coax better numbers out of a Pi 4 and that effort would be better spent on different hardware. The trip-wire signals:
- You want to run a 7B-13B class model interactively and find 1-2 tok/s unacceptable.
- You want a multi-turn chat assistant where latency matters.
- You want to run multiple models concurrently for a RAG pipeline.
- You want a 32K+ context window for long-document work.
For any of those, a sub-$700 desktop with an RTX 3060 12GB lifts you from the Pi's 1-2 tok/s on a 7B-class q4 model to 35-55 tok/s on an 8B q4 model, an improvement of well over 10x that turns "barely usable" into "feels like ChatGPT." A used Ryzen 7 5800X plus 32GB DDR4 plus the 3060 is the well-trodden upgrade path. The Pi 4 remains useful in that build's shadow for edge tasks the workstation should not handle.
Perf-per-dollar and perf-per-watt: Pi 4 vs an entry GPU build
| Platform | Approx cost | TinyLlama 1.1B q4 (tok/s) | Power under load | tok/s per $ | tok/s per W |
|---|---|---|---|---|---|
| Pi 4 8GB | $75 | 10 | 8 W | 0.13 | 1.25 |
| 5800X + RTX 3060 12GB build | $650 | 200+ (caps at runtime overhead) | 280 W | 0.31 | 0.71 |
The Pi 4 wins on tok/s per watt: at these figures it is roughly 1.75x as power-efficient as the discrete GPU build for small models. The GPU build wins on tok/s per dollar by a somewhat larger margin (about 2.3x) and dominates on absolute speed. Per-watt advantage matters for always-on edge deployments; absolute speed matters for interactive use.
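The two derived columns in the table are plain ratios. A small sketch that recomputes them from the table's own inputs, treating the GPU build's "200+" as 200 (a lower bound):

```python
from dataclasses import dataclass


@dataclass
class Platform:
    name: str
    cost_usd: float
    tok_per_s: float
    load_watts: float

    @property
    def tok_per_dollar(self) -> float:
        return self.tok_per_s / self.cost_usd

    @property
    def tok_per_watt(self) -> float:
        return self.tok_per_s / self.load_watts


# Inputs copied from the table above; 200 stands in for "200+".
pi4 = Platform("Pi 4 8GB", 75, 10, 8)
gpu = Platform("5800X + RTX 3060 12GB", 650, 200, 280)

for p in (pi4, gpu):
    print(f"{p.name}: {p.tok_per_dollar:.2f} tok/s per $, {p.tok_per_watt:.2f} tok/s per W")
```

Swapping in your own purchase price or measured wall power is the point of keeping this as code: the ranking is sensitive to both inputs.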
Bottom line: good projects vs jobs that need real hardware
The Raspberry Pi 4 8GB is the right hardware for:
- Always-on home-automation triggers that run a small classifier
- Offline voice assistants with simple NLU
- Batch processing of incoming notes or logs
- Privacy-sensitive edge classification (camera frames, text logs)
- Learning the local-LLM stack at trivial cost
It is the wrong hardware for:
- Interactive chat at ChatGPT-grade speed
- Code completion in a real editor
- Long-context document understanding
- Anything larger than a 3B-class model
If your project lives in the first list, the Pi 4 is genuinely the right tool. If it lives in the second list, do the math and step up to a 12GB GPU box — you will save more time in week one than the cost gap represents.
Common pitfalls when trying to run an LLM on a Pi 4
Several recurring gotchas from community threads are worth heading off:
- Underestimating thermal throttling. The Pi 4's Cortex-A72 cores throttle at roughly 80°C. Sustained inference workloads hold the chip near that ceiling without active cooling. A simple passive heatsink helps; a small fan-on-heatsink combo is closer to what you actually want. Aluminum case heatsinks designed for the Pi 4 also work and are quieter.
- Running off the SD card and being baffled by 30-second loads. As covered above, this is the largest single cause of "the Pi feels broken" reports. Move the model files and swap to a USB 3.0 SSD before you start blaming the runtime.
- Picking q2 or q3 for tiny models and getting incoherent output. Quantization noise hits harder on smaller models because there is less redundancy to absorb. Stay at q4_K_M or higher on the 1B-3B class.
- Expecting parity with cloud models. A Pi 4 8GB running TinyLlama 1.1B is not a Gemini Pro replacement — it is a learning platform and an edge device. Set expectations accordingly and the Pi delivers; expect a desktop-class assistant and you will be disappointed.
- Forgetting that the Pi runs from a 5V/3A USB-C supply. Inadequate power supplies cause undervolting that silently slows the CPU. Use the official Raspberry Pi power supply or a high-quality equivalent — your inference speed will thank you.
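Two of the pitfalls above, thermal throttling and undervolting, can be checked directly on the board. The sketch below decodes the bitmask printed by `vcgencmd get_throttled` on Raspberry Pi OS; the function names are hypothetical, and the bit meanings follow Raspberry Pi's documentation for that command.

```python
import subprocess

# Bit meanings for `vcgencmd get_throttled`, per Raspberry Pi's documentation.
FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output: str) -> list[str]:
    """Decode a line like 'throttled=0x50005' into human-readable flags."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [text for bit, text in FLAGS.items() if value & (1 << bit)]


def read_throttled() -> list[str]:
    """Run vcgencmd on the Pi itself and decode its output."""
    result = subprocess.run(
        ["vcgencmd", "get_throttled"], capture_output=True, text=True, check=True
    )
    return decode_throttled(result.stdout)
```

Calling `read_throttled()` after a long inference run returns an empty list on a healthy board. Any "has occurred" entry means the supply or the cooling fell short at some point since boot, and the tok/s numbers from that run should not be trusted.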
When NOT to use a Pi 4 for LLM work
If your workload involves long-context document understanding, multi-turn interactive chat at human-typing speed, real-time code completion, or running multiple models concurrently, the Pi 4 is the wrong tool — even before considering memory limits, the bandwidth ceiling makes those workloads frustrating. Step up to a 12GB-class discrete GPU build at that point; the upgrade pays for itself in time saved within a week of real use.
Citations and sources
- Raspberry Pi 4 Model B — official product page
- llama.cpp on GitHub — quantization and ARM build notes
- Ollama on GitHub — model runner and ARM support
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
