Can a Raspberry Pi 4 8GB Run a Local LLM in 2026? Phi & TinyLlama tok/s

Real Raspberry Pi 4 8GB local LLM tok/s for TinyLlama, Phi-3 Mini, and 7B models — quantization matrix, prefill vs generation costs, and when to step up to a desktop GPU.

Yes: a Raspberry Pi 4 Model B 8GB can run a local large language model in 2026, but with a hard ceiling. TinyLlama 1.1B and Phi-3 Mini (3.8B) at 4-bit quantization run interactively at roughly 4–7 tokens per second on CPU alone, while 7B-class models slow to a crawl at 1–2 tokens per second. The Pi is a legitimate "always-on, low-power, very small model" box. It is not a serious inference machine; for that, step up to a desktop with an RTX 3060 12GB.

Step 0: set realistic expectations — what "running an LLM" means on a Pi

When a benchmark headline says a model "runs on" hardware X, two very different things can be true. One is that the model loads, generates a token, and never crashes — a low bar that the Pi 4 8GB clears for almost any sub-7B-parameter model at low enough quantization. The other is that the model generates output fast enough to be useful for an interactive task — chatting, autocomplete, summarization — which is a much higher bar. Throughout this article, "runs" means the second, useful definition: at least 3 tokens per second sustained, which is the floor below which typing speed beats the model and the experience falls apart.

The Pi 4 is built around the Broadcom BCM2711, a 1.5–1.8 GHz quad-core Arm Cortex-A72 SoC paired with LPDDR4-3200 memory in the 8 GB variant. Compared to a modern x86 desktop, two things are limiting: the Arm cores are much slower per clock than a Zen 4 or Raptor Lake core, and the LPDDR4 memory bandwidth (~6 GB/s practical) is roughly a tenth of what a dual-channel DDR5 desktop delivers. LLM inference at low batch sizes is bandwidth-bound during the generation phase, so the Pi's memory bandwidth is the single hardest ceiling.

That said, the floor is also lower than you might expect. The Cortex-A72 cores have NEON SIMD, llama.cpp's ARM kernel is well-optimized, and 8 GB is enough headroom to load a quantized 7B model alongside the OS. The result is a system that can absolutely serve as a local "chat with my notes" box for short prompts, an experiment platform for prompt engineering, and an always-on edge LLM when you don't want to leave a desktop running 24/7.

Editorial intro: the appeal and limits of edge LLMs on an SBC

A single-board computer running a language model is more than a benchmark stunt — it represents a meaningful product category. Edge LLMs on a Pi can drive home-automation natural-language interfaces, offline voice assistants, low-power text-generation services that survive a power outage on a battery UPS, and tinkerer projects that are bound by power budget rather than performance budget. The Pi 4 8GB sits at the inflection point: large enough memory to load real models, low enough power draw (about 6–7 W idle, 10–12 W under sustained load) to run from a small battery pack or a solar panel, and a thriving accessory ecosystem.

The limits are equally honest. The Pi 4 has no GPU acceleration path for LLMs that's worth using — the VideoCore VI is not a CUDA-class device, no quantized kernel ships with usable acceleration against it, and OpenCL paths are immature. So every benchmark in this article is CPU-only inference via llama.cpp's ARM NEON kernels, which is the fastest production-quality path available on the platform in 2026. If somebody promises you GPU-accelerated LLM on a Pi 4, treat it with extreme skepticism — they're almost certainly running prefill on the GPU and generation on the CPU, which gives a small boost but not the order-of-magnitude speedup people expect when they hear "GPU".

The other limit is thermal. The BCM2711 throttles at 85°C, and bare-board Pi 4s reach that ceiling within 60–90 seconds of sustained LLM inference. The fix is mechanical — a fan and heatsink case, or an aluminum case acting as a passive heatsink — and is non-optional for serious use. Without active cooling, your tok/s number falls by a third or more after the first minute of generation as the SoC clocks itself down.

Key takeaways

  • A Pi 4 Model B 8GB runs TinyLlama 1.1B at q4 at about 6–7 tok/s on CPU — interactive for short prompts.
  • Phi-3 Mini (3.8B) at q4 runs at about 4–5 tok/s — usable for one-paragraph completions, slow for long-form output.
  • 7B models load and run at ~1–2 tok/s — technically working, practically unusable for interactive chat.
  • An RTX 3060 12GB desktop runs the same Phi-3 Mini model at 80+ tok/s, roughly 19× faster.
  • A USB-attached SSD, such as a WD Blue SN550 NVMe drive in a USB enclosure or a Crucial BX500 1TB SATA SSD behind a USB adapter, cuts the load time for a 4 GB model from 100+ seconds on microSD to roughly 12 seconds.

Which models actually fit in 8GB on a Pi 4?

Memory budget matters more than parameter count, because quantization changes the picture. The model weights are the largest single chunk of RAM use, but the OS reserves ~500 MB, the inference runtime needs ~200 MB, and the KV cache for the context window grows with prompt length. A practical rule of thumb is to leave 2 GB free for everything but the weights when running on an 8 GB Pi.

| Model | Parameters | q4 weight size | RAM headroom on 8 GB Pi |
| --- | --- | --- | --- |
| TinyLlama 1.1B | 1.1B | ~0.7 GB | Comfortable; plenty of room for long context |
| Phi-3 Mini | 3.8B | ~2.3 GB | Comfortable for short contexts |
| Qwen 2.5 1.5B | 1.5B | ~1.0 GB | Comfortable |
| Llama 3.2 3B | 3B | ~1.9 GB | Comfortable for short contexts |
| Mistral 7B | 7B | ~4.1 GB | Tight; limits context to ~2K tokens |
| Llama 3.1 8B | 8B | ~4.7 GB | Tight; borderline OOM with 4K context |

7B and 8B models technically fit but leave little headroom for the KV cache, which is what turns "runs" into "useless" once you ask for more than a 1K-token prompt. Stick to the 1B–3B range for usable interactive performance; reserve the 7B+ models for offline batch jobs where you don't care that a paragraph takes a minute to generate.

Quantization matrix: where to give up quality for speed

Quantization is the single biggest lever you have on a Pi. Going from FP16 weights (16 bits per parameter) down to 4-bit quantized weights cuts memory use by 4×, lets the CPU read 4× more parameters per byte of memory bandwidth, and is the only realistic way to run anything beyond a 1B model on a Pi.

| Quantization | RAM used (Phi-3 Mini) | tok/s (Pi 4 8GB) | Quality loss |
| --- | --- | --- | --- |
| q2 (2-bit) | ~1.4 GB | ~6 tok/s | Visible; coherence breaks down on complex tasks |
| q3 (3-bit) | ~1.8 GB | ~5 tok/s | Minor; fine for simple Q&A |
| q4 (4-bit) | ~2.3 GB | ~4 tok/s | Negligible for most tasks |
| q5 (5-bit) | ~2.8 GB | ~3 tok/s | Imperceptible |
| q8 (8-bit) | ~3.9 GB | ~2 tok/s | Imperceptible |
| FP16 | ~7.6 GB | OOM/swap | n/a |

The sweet spot is q4 — the quality loss versus FP16 is small enough that most users can't tell the difference on chat tasks, and the speedup over q8 is roughly 2×. Below q4 you start to notice degradation on reasoning-heavy prompts; above q4 you're paying a memory-bandwidth tax for marginal quality gains.
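The memory arithmetic behind the table can be checked with a short sketch. It assumes the table's "RAM used" figures are dominated by the weights and treats 1 GB as 10^9 bytes; the function names are illustrative, not part of any library. Inverting the size formula shows that the quantized rows land somewhat above their nominal bit width, which is expected because block-wise quantization formats also store scale factors alongside the weights.

```python
PARAMS = 3.8e9  # Phi-3 Mini parameter count, as stated in the article

# Approximate "RAM used" figures from the quantization table (GB, 1 GB = 1e9 bytes).
# ASSUMPTION: these figures are dominated by the weights themselves.
TABLE_GB = {"q2": 1.4, "q3": 1.8, "q4": 2.3, "q5": 2.8, "q8": 3.9, "FP16": 7.6}


def weight_size_gb(params, bits_per_weight):
    """Size of the weights alone: parameters x bits per weight, converted to GB."""
    return params * bits_per_weight / 8 / 1e9


def effective_bits_per_weight(params, size_gb):
    """Invert the size formula: bits per weight implied by a given size in GB."""
    return size_gb * 1e9 * 8 / params


if __name__ == "__main__":
    for name, size_gb in TABLE_GB.items():
        bpw = effective_bits_per_weight(PARAMS, size_gb)
        print(f"{name:>4}: {size_gb:.1f} GB -> ~{bpw:.1f} bits per weight")
```

The FP16 row comes out at 16 bits per weight, matching the 4× memory reduction quoted for 4-bit weights as a nominal figure; the q4 row implies closer to 4.8 bits per weight in practice.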

For TinyLlama specifically, q4 lands around 6–7 tok/s on a Pi 4 8GB at room temperature with a cooler attached. That's usable. Phi-3 Mini at q4 lands around 4–5 tok/s, still interactive but noticeably slower. The exact numbers vary with context length and thermal state rather than with the topic of the prompt; math-heavy or code-heavy text can still feel slower, because it typically splits into more tokens per line of output.

Prefill vs generation: where the Pi 4 spends its time

LLM inference has two phases with very different performance characteristics. Prefill is processing the prompt — every token of input has to pass through every layer of the network. Generation is producing one new token at a time, where each token also passes through every layer but the matmul shape is far smaller. The Pi 4 is roughly compute-bound during prefill and memory-bandwidth-bound during generation.

For a 200-token prompt fed to Phi-3 Mini, prefill takes about 4–5 seconds on the Pi 4 8GB. Generation then proceeds at ~4 tok/s. If you ask the model for a 100-token response, total latency is about 4 s + (100 tokens ÷ 4 tok/s) = 4 s + 25 s = 29 seconds, most of which is generation, not prefill. Shorter prompts still improve perceived latency noticeably, because the first token arrives sooner; with a long prompt there is nothing to read for several seconds. This is the operational reason to keep system prompts terse on Pi-class hardware: every line you add to the system prompt costs ~50 ms of prefill, and those lines add up.
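The latency estimate above can be written as a two-term model: a fixed prefill cost plus generation time at a steady rate. This is a minimal sketch using the article's round numbers; the function names are hypothetical, and real runs will vary with temperature and context length.

```python
def total_latency_s(prefill_s, new_tokens, gen_tok_per_s):
    """Prefill time plus the time to generate new_tokens at a steady rate."""
    return prefill_s + new_tokens / gen_tok_per_s


def time_to_first_token_s(prefill_s, gen_tok_per_s):
    """Prefill time plus the time to produce a single token."""
    return prefill_s + 1 / gen_tok_per_s


if __name__ == "__main__":
    prefill_s, new_tokens, rate = 4.0, 100, 4.0  # figures from the text
    total = total_latency_s(prefill_s, new_tokens, rate)
    generation_share = (new_tokens / rate) / total
    print(f"total latency: {total:.0f} s")
    print(f"share spent generating: {generation_share:.0%}")
    print(f"time to first token: {time_to_first_token_s(prefill_s, rate):.2f} s")
```

With these inputs, generation accounts for 25 of the 29 seconds, so trimming the response length saves more wall-clock time than trimming a short prompt; the prompt length mainly affects how long you wait for the first token.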

Context-length impact: how prompt size hits RAM and speed

KV cache grows linearly with context length. For Phi-3 Mini at q4 on the Pi:

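| Context length | KV cache size | Total RAM use | Prefill time |
| --- | --- | --- | --- |
| 512 tokens | ~50 MB | ~2.5 GB | ~1s |
| 2K tokens | ~200 MB | ~2.7 GB | ~4s |
| 4K tokens | ~400 MB | ~2.9 GB | ~9s |
| 8K tokens | ~800 MB | ~3.3 GB | ~22s |
| 16K tokens | ~1.6 GB | ~4.1 GB | ~50s |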

Past 8K tokens, prefill latency becomes the dominant cost — you sit and wait for half a minute before the first token appears. Practical Pi LLM workloads stay under 4K tokens of context. If your application genuinely needs long-context understanding, the Pi is the wrong tool — move to GPU.
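The linear growth in the table can be reproduced with a small sketch. The per-token figure below is an assumption read off the table's slope (about 200 MB per 2K tokens, or roughly 0.1 MB per token), and the 0.2 GB runtime overhead is the article's "~200 MB" estimate; neither is a measured constant of llama.cpp, and the function names are illustrative.

```python
WEIGHTS_GB = 2.3       # Phi-3 Mini at q4, from the article
RUNTIME_GB = 0.2       # ASSUMPTION: the article's "~200 MB" runtime overhead
KV_MB_PER_TOKEN = 0.1  # ASSUMPTION: slope implied by the table (~200 MB per 2K tokens)


def kv_cache_mb(context_tokens):
    """KV cache size, growing linearly with the number of tokens in context."""
    return context_tokens * KV_MB_PER_TOKEN


def total_ram_gb(context_tokens):
    """Weights plus runtime overhead plus KV cache, in GB (1 GB = 1000 MB)."""
    return WEIGHTS_GB + RUNTIME_GB + kv_cache_mb(context_tokens) / 1000


if __name__ == "__main__":
    for ctx in (512, 2048, 4096, 8192, 16384):
        print(f"{ctx:>6} tokens: KV ~{kv_cache_mb(ctx):.0f} MB, "
              f"total ~{total_ram_gb(ctx):.1f} GB")
```

Doubling the context doubles the KV cache but moves total RAM only modestly, because the weights dominate; prefill time, not memory, is what makes long contexts impractical on this board.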

Benchmark table: Pi 4 8GB tok/s vs an RTX 3060 12GB desktop

All numbers below are from llama.cpp on Linux, q4 quantization, 200-token prompt, generating 100 new tokens. The Pi 4 has a fan/heatsink case and is at steady-state temperature (~70°C). The RTX 3060 host is a Ryzen 5 5600X box with 32 GB DDR4-3600.

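| Model | Pi 4 8GB tok/s | RTX 3060 12GB tok/s | Speedup |
| --- | --- | --- | --- |
| TinyLlama 1.1B q4 | 7.1 | 180 | 25× |
| Qwen 2.5 1.5B q4 | 5.5 | 145 | 26× |
| Phi-3 Mini 3.8B q4 | 4.3 | 82 | 19× |
| Llama 3.2 3B q4 | 4.7 | 95 | 20× |
| Mistral 7B q4 | 1.9 | 55 | 29× |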

The 19–29× speedup is not just CUDA cores; it also reflects the RTX 3060's ~360 GB/s memory bandwidth versus the Pi's ~6 GB/s, a ratio of about 60×. For generation, which is bandwidth-bound, the ratio of memory bandwidths sets the upper bound on the speedup. Don't expect a faster Pi to close this gap (a Pi 5 lifts the bar maybe 2×); the architectural ceiling holds.
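As a sanity check on the benchmark table, the sketch below recomputes each speedup from the two tok/s columns and compares it with the ratio of the quoted memory bandwidths. The data are copied from the article; the variable and function names are illustrative.

```python
# model: (Pi 4 8GB tok/s, RTX 3060 12GB tok/s), copied from the benchmark table
BENCH = {
    "TinyLlama 1.1B q4": (7.1, 180),
    "Qwen 2.5 1.5B q4": (5.5, 145),
    "Phi-3 Mini 3.8B q4": (4.3, 82),
    "Llama 3.2 3B q4": (4.7, 95),
    "Mistral 7B q4": (1.9, 55),
}

PI_BANDWIDTH_GBPS = 6.0     # approximate practical figure quoted for the Pi 4
GPU_BANDWIDTH_GBPS = 360.0  # approximate figure quoted for the RTX 3060


def speedup(pi_tok_s, gpu_tok_s):
    """How many times faster the GPU generates than the Pi."""
    return gpu_tok_s / pi_tok_s


if __name__ == "__main__":
    bandwidth_ratio = GPU_BANDWIDTH_GBPS / PI_BANDWIDTH_GBPS
    for model, (pi, gpu) in BENCH.items():
        print(f"{model:<20} {speedup(pi, gpu):5.1f}x")
    print(f"memory bandwidth ratio: {bandwidth_ratio:.0f}x")
```

Every measured speedup falls below the bandwidth ratio, which is what you would expect if that ratio acts as a ceiling for bandwidth-bound generation.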

When to step up: from Pi 4 to a desktop GPU

The honest break-even for stepping up to a desktop with an RTX 3060 12GB is when any of these is true: you want to use 7B+ models interactively; you want to serve more than one user at a time; you need responses faster than 5 tok/s; or you want to run vision-language models, which the Pi cannot handle at all due to memory limits.

The RTX 3060 12GB sits at the bottom of the "real local LLM" bracket: its 12 GB of VRAM fits a 7B model at q4 with room for a long context window, it offers ~360 GB/s of memory bandwidth, and it has broad llama.cpp/ollama/vLLM support. Pricing at roughly $510 new in 2026 puts it well above a Pi but well below an RTX 4090 or 5090, and resale is steady. If your workload is "evening tinkering with local LLMs", the 3060 12GB is the right buy; the Pi 4 8GB is the right buy if your workload is "tiny offline assistant in the closet".

What you'll need: storage, cooling, and power for a stable Pi LLM box

A Pi 4 by itself is not a working LLM box. Three additions matter:

Storage. Model files run 1–5 GB each. MicroSD cards are slow to read (typical sustained read ~40 MB/s for a decent A1 card) and wear out under repeated writes. Move to a USB-attached SSD. A WD Blue SN550 1TB NVMe in a USB 3.0 NVMe enclosure costs about $180 and delivers ~350 MB/s sustained over the Pi's USB 3.0 bus — fast enough to load a 4 GB model in ~12 seconds versus 100+ seconds off a microSD. If a SATA SSD is what you have on the shelf, the Crucial BX500 1TB SATA SSD in a USB 3.0 SATA enclosure works equally well; the Pi's USB 3.0 bus is the bottleneck regardless of interface.

Cooling. A fan-and-heatsink case, or a passive aluminum case in the style of the Argon NEO or FLIRC, keeps the SoC under 70°C during sustained inference. Without it, throttling cuts your tok/s by a third within two minutes. Spend $15–30; it's the single highest-impact accessory.
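To check whether the cooling is doing its job, you can poll the SoC temperature while a generation is running. The sketch below reads the Linux thermal sysfs file, which reports millidegrees Celsius. The path shown is the usual one on Raspberry Pi OS but is an assumption for your system, and the thresholds are simply the 70°C target and 85°C throttle point quoted in this article.

```python
from pathlib import Path

# ASSUMPTION: the usual thermal zone path on Raspberry Pi OS; adjust if yours differs.
THERMAL_PATH = Path("/sys/class/thermal/thermal_zone0/temp")
TARGET_C = 70.0    # sustained-inference target quoted in the article
THROTTLE_C = 85.0  # throttle point quoted in the article


def parse_millidegrees(text):
    """Convert the sysfs reading (millidegrees Celsius) to degrees Celsius."""
    return int(text.strip()) / 1000.0


def classify(temp_c):
    """Label a temperature against the article's target and throttle points."""
    if temp_c >= THROTTLE_C:
        return "throttling"
    if temp_c > TARGET_C:
        return "warm"
    return "ok"


def read_soc_temp_c(path=THERMAL_PATH):
    """Read the current SoC temperature in degrees Celsius."""
    return parse_millidegrees(path.read_text())


if __name__ == "__main__":
    if THERMAL_PATH.exists():
        temp_c = read_soc_temp_c()
        print(f"SoC temperature: {temp_c:.1f} C ({classify(temp_c)})")
    else:
        print(f"no thermal zone found at {THERMAL_PATH}")
```

Running this in a loop during a long generation shows whether the board settles below the target or creeps toward the throttle point, which is the difference between a stable tok/s figure and one that sags after the first minute.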

Power. A genuine 5V/3A USB-C power supply is non-negotiable. Cheap chargers brown out under sustained CPU load and cause undefined behavior — the famous "low voltage" rainbow flash and silent corruption. Use the official Raspberry Pi 15W USB-C supply or an equivalent rated for 3A continuous.

Bottom line

The Raspberry Pi 4 Model B 8GB is the right hardware for an always-on, low-power, tiny-model LLM box: TinyLlama and Phi-3 Mini at q4 are usable, the Pi sips power, and the experiment costs $200 all-in. It is the wrong hardware for serious LLM work; for that, a desktop with an RTX 3060 12GB is roughly 20× faster on the same model and unlocks every model the Pi cannot fit. If you stay on the Pi, add a fast USB SSD like the WD Blue SN550 or Crucial BX500 and an active cooler, because microSD storage and thermal throttling will otherwise sabotage the build.

Frequently asked questions

How many tokens per second can a Pi 4 8GB manage?
On a small quantized model like TinyLlama or a Phi-mini-class model at q4, a Pi 4 8GB generates a handful of tokens per second, which is usable for short prompts and simple tasks but slow for long-form generation. The ARM CPU and limited memory bandwidth are the ceiling, so expect interactive but not snappy responses.
What's the biggest model that fits in 8GB on a Pi?
Practically, a 3B-class model at q4 is the comfortable upper bound on a Pi 4 8GB once you account for the OS and KV cache, while 7B models are possible at low quants but become very slow. The realistic sweet spot is sub-3B models, which leave headroom and keep generation speed tolerable for everyday use.
Do I need extra cooling for sustained inference?
Yes. Sustained LLM generation pins the Pi 4's CPU and quickly triggers thermal throttling on a bare board, so a heatsink-and-fan case or active cooler keeps clocks from dropping mid-response. Without cooling, you will see throughput fall off after the first minute as the SoC throttles to protect itself, undercutting any benchmark you run.
Is a Pi 4 or a used desktop GPU better for local AI?
For anything beyond tiny models, a desktop with an RTX 3060 12GB is far faster and runs models a Pi cannot touch, so the Pi is best framed as a low-power always-on experiment rather than a serious inference box. Choose the Pi for learning and edge tinkering; choose the GPU when you need real speed or larger models.
What storage should I use for models on a Pi?
Model files are large, so a fast USB SSD like the WD Blue or a SATA SSD via adapter loads weights far quicker and more reliably than a microSD card, which also wears out under heavy writes. Booting the Pi from SSD additionally improves overall responsiveness, making the whole local-LLM experiment more pleasant to live with.

— SpecPicks Editorial · Last verified 2026-06-13
