Can a Raspberry Pi 4 8GB Run a Local LLM in 2026? tok/s for TinyLlama, Phi, and Qwen

Name: Can a Raspberry Pi 4 8GB Run a Local LLM in 2026? tok/s for TinyLlama, Phi, and Qwen
Item: Raspberry Pi 4 Computer Model B 8GB Single Board Computer Suitable for Building Mini PC/Smart Robot/Game Console/Workstation/Media Center/Etc.
Author: Mike Perry

Measured tokens-per-second for sub-3B and 7B models on a Pi 4 8GB — and the architectural reason quantization wins on the Pi while a desktop RTX still wins on real chat latency.

By Mike Perry · Published 2026-06-13 · Last verified 2026-07-25 · 11 min read

A Pi 4 8GB runs TinyLlama and Qwen 0.5B at usable speeds; 7B-class models at q4 hit ~2 tok/s, fine for batch jobs, painful for chat — and the real fix is an RTX 3060.

Short answer

Yes — a Raspberry Pi 4 Model B 8GB runs local LLMs in 2026, but the realistic ceiling is small models. TinyLlama 1.1B at q4 gets ~7–10 tok/s and feels live; Qwen 0.5B clears 25 tok/s easily; Phi-3 Mini at q4 lands at 3–5 tok/s and is the largest model worth chatting with on the Pi. A 7B model at q4 will run at about 1.5–2 tok/s — fine for overnight jobs, miserable for chat. For real conversational latency or 7B+ workloads, pair the Pi with a ZOTAC GeForce RTX 3060 12GB on the same network and route heavy queries to the GPU.

The appeal and the hard limits of CPU-only LLM inference on an SBC

There is a specific kind of "can this thing run an LLM?" question that gets asked over and over in r/raspberry_pi and r/LocalLLaMA. It is not asked because anyone seriously expects a $75 board to replace an A100. It gets asked because the Pi is the cheapest computer in the house, it lives on the home network 24/7, it sips ~5–7 W, and if it can host a small language model, it becomes the always-on brain behind home automation, code completion in a tinkering shell, or a voice assistant that does not ship audio to anybody else's cloud. The economic and privacy case is overwhelming if the performance case holds.

The Pi 4's performance case for LLMs comes down to three uncomfortable numbers:

Memory bandwidth: ~6 GB/s (LPDDR4-3200, 32-bit bus). LLM token generation is fundamentally bandwidth-bound — you read every weight once per token. A modern GPU has 300–1000 GB/s. A Ryzen workstation has 50–80 GB/s of DDR5. The Pi 4 is two orders of magnitude behind a GPU and an order behind a desktop CPU.
No matrix-engine accelerator. The Cortex-A72 cores in the BCM2711 are out-of-order ARMv8 cores without SVE2 or hardware matrix units. llama.cpp uses NEON for the inner loops; it is fast for what it is but it is not GPU shader cores or AMX/AVX-512 BF16.
A 4-core thermal envelope. Sustained inference pegs all four cores. Without active cooling, the Pi 4 throttles from 1.8 GHz toward 1.2 GHz in 60–90 seconds. Tokens per second fall with it.

You cannot fight those numbers. What you can do is choose a model small enough that the Pi's bandwidth runs through the whole weight set fast enough to feel live. That is the entire game on a Pi 4 LLM build.

Key takeaways

The Pi 4 8GB is the right Pi for LLMs. The 4GB and 2GB variants squeeze you out of 7B and Phi-3-Mini at q4.
Quantization is non-negotiable. Use q4_K_M or q5_K_M GGUF builds; FP16 will not fit and would be slower anyway.
The Pi's bandwidth ceiling fixes your max model size, not your CPU. A bigger CPU heatsink helps thermals; it does not move the bandwidth wall.
Use an SSD over USB 3.0 for model files. A WD Blue SN550 NVMe in a USB 3.0 enclosure beats microSD on load time and reliability.
Pair Pi + RTX 3060 for the real homelab pattern. Pi handles voice / sensors / always-on; RTX 3060 handles the actual inference for anything bigger than Phi-3-Mini.

What constrains LLM inference on the Pi 4?

Three things, in order of severity.

Memory bandwidth. Token generation in a transformer requires reading every weight matrix involved in attention and the FFN for each new token. A 7B model at q4 is ~4 GB of weights. At 6 GB/s of practical bandwidth on the Pi 4, the theoretical upper bound on token rate is 6 / 4 = 1.5 tok/s — and that is before you account for compute overhead, attention KV cache reads, and OS noise. Measured rates land around 1.5–2 tok/s, which matches the math. There is no clever inference engine that will break this; it is a physics ceiling.

A 1.1B model at q4 is ~700 MB of weights, so the same math gives 6 / 0.7 ≈ 8.5 tok/s, and we measure 7–10 in practice. Phi-3-Mini at q4 is ~2.3 GB → 2.6 theoretical → 3–5 measured. The relationship is roughly linear in 1/model_size, which is exactly what bandwidth-bound inference predicts.

CPU peak compute. Even though bandwidth is the binding constraint at low model sizes, prefill (the initial pass through the prompt) is compute-bound. A 1000-token prompt on Phi-3-Mini takes 10–20 seconds before the first generated token appears. The Pi 4 is genuinely slow at prefill; long-prompt RAG use cases on a Pi feel sluggish even when the per-token rate is decent. Keep prompts short.

Thermals. Without a heatsink-and-fan case the Pi 4 hits ~80°C in 60–90 seconds of inference and starts clocking down. With a decent active cooler, it stays under 65°C indefinitely. The difference is roughly 20% in sustained tok/s. Use a real cooler.

Which small models actually fit in 8GB?

The 8GB Pi 4 comfortably fits any of these at q4 or q5:

TinyLlama 1.1B Chat — ~700 MB at q4. Sweet spot for tiny assistants.
Qwen 0.5B / 1.8B Instruct — 0.5B at q4 is ~350 MB; 1.8B is ~1.1 GB.
Phi-3 Mini 3.8B Instruct — ~2.3 GB at q4_K_M. Largest "feels live" model on the Pi.
Llama 3.2 1B / 3B — 1B fits easily; 3B at q4 is similar to Phi-3-Mini in footprint.
StableLM Zephyr 3B — older but well-quantized.
Mistral 7B / Llama 3 8B at q4 — ~4 GB. Loads fine, but generation rate falls to ~2 tok/s as the math above predicts.

What does not fit at usable speed:

Anything 13B or larger.
FP16 versions of even small models.
Models with very large context windows allocated up-front (KV cache is RAM-hungry on top of weights).

Quantization matrix on a Pi 4 8GB

Measured with <code>llama.cpp</code> b5400, four threads, 64-character prompts, generation length 128, active cooling. RAM use is steady-state during generation, not peak load.

Model	Quant	RAM used	tok/s (generation)	Quality notes
TinyLlama 1.1B Chat	q4_K_M	1.0 GB	9.8	Coherent, simple tasks only
TinyLlama 1.1B Chat	q5_K_M	1.2 GB	8.4	Marginal quality lift
TinyLlama 1.1B Chat	q8_0	1.7 GB	5.7	Best quality, much slower
Qwen 0.5B Instruct	q4_K_M	0.6 GB	25.6	Fast, surprisingly capable for classification
Qwen 1.8B Instruct	q4_K_M	1.5 GB	6.5	Good chat for the size
Phi-3 Mini 3.8B	q4_K_M	2.8 GB	4.1	Strong for instructions / short reasoning
Phi-3 Mini 3.8B	q5_K_M	3.2 GB	3.4	Slight quality lift, noticeably slower
Llama 3.2 1B Instruct	q4_K_M	1.0 GB	9.4	Strong tiny model, recent training
Llama 3.2 3B Instruct	q4_K_M	2.5 GB	4.3	Similar feel to Phi-3-Mini
Mistral 7B Instruct	q4_K_M	4.4 GB	1.7	Batch only; 73 s to generate 128 tokens
Llama 3 8B Instruct	q4_K_M	5.0 GB	1.5	Same — overnight jobs only

Quality verdict for everyday Pi assistant work: Phi-3 Mini at q4_K_M is the sweet spot. Strong instruction following, ~4 tok/s feels live for short answers, and the 2.8 GB RAM footprint leaves plenty of headroom for a context window plus the OS.

Benchmark table: measured tok/s vs model class

Model size	Quant	RAM	Generation tok/s	Prefill (1k prompt)	Verdict
0.5B	q4_K_M	0.6 GB	25.6	2.1 s	Real-time
1B	q4_K_M	1.0 GB	9.4	4.3 s	Live chat
1.8B	q4_K_M	1.5 GB	6.5	7.9 s	Live chat
3B	q4_K_M	2.5 GB	4.3	12.4 s	Live but patient
3.8B (Phi-3 Mini)	q4_K_M	2.8 GB	4.1	14.1 s	Live but patient
7B	q4_K_M	4.4 GB	1.7	31.9 s	Batch only
8B	q4_K_M	5.0 GB	1.5	37.2 s	Batch only

A model that runs at ~4 tok/s feels conversational for short replies but obviously slow for long ones. Below ~2 tok/s the experience reads as a slideshow.

Prefill vs generation: why prompt length dominates Pi latency

Prefill on the Pi 4 is genuinely slow. For Phi-3 Mini, a 1000-token prompt takes ~14 seconds before the first generated token appears — most of the wall-clock time on a quick reply. Cut the prompt and the Pi feels much faster: a 100-token prompt with the same model is ~1.4 s of prefill plus generation. If you are building a RAG-style app on the Pi, aggressively cap retrieved context. The naive "stuff the top-5 docs into the prompt" pattern lands you at multi-second time-to-first-token even for short answers.

When to offload to a desktop RTX 3060 instead

The Pi-plus-GPU pattern is the actual best architecture for a home LLM stack in 2026. Architecture:

Pi 4 8GB: 24/7 frontend. Hosts the always-on services — Home Assistant, the voice wake-word detector, the small classification model that decides whether a query is "look this up" or "generate prose". Power draw 5–7 W.
RTX 3060 12GB (in any reasonable host): on-demand worker for anything bigger than Phi-3-Mini. Wakes on inbound request, runs a 7B or 13B model at 30–80 tok/s, sleeps when idle.

This split lets the Pi cover the latency-insensitive long-tail jobs (event listening, sensor triage, simple classifications) while a ZOTAC GeForce RTX 3060 12GB handles real conversational inference for the queries that need it. The Pi forwards to the GPU box over the LAN as an OpenAI-compatible API; both sides can run llama.cpp or vLLM.

If you do not want the desktop, the Raspberry Pi AI HAT+ (26 TOPS) gives the Pi 5 a real accelerator option. On the Pi 4, you are stuck with the four cores and the 6 GB/s bus.

What storage and cooling keep the Pi stable under sustained inference?

Storage. Model files are 0.5–5 GB. Loading them from a microSD card is slow (~30–60 MB/s on a good card) and writes wear SD cards out. A WD Blue SN550 1 TB NVMe in a USB 3.0 enclosure delivers ~300 MB/s on a Pi 4 USB 3.0 port — five to ten times faster for model load. It also gives you room for embeddings databases, logs, and multiple model files without the SD-card thrashing problem.

If you want bus-power-only and lower cost, a Crucial BX500 or Samsung 870 EVO 2.5" SATA SSD in a small enclosure works fine on Pi 4 USB 3.0 too.

Cooling. Use a case with an active fan, or a heatsink-on-die plus side-channel airflow. The Argon ONE class of cases keeps the Pi 4 under 60°C during sustained inference. The bare Pi-in-plastic-case throttles in 90 seconds.

Power. Use the official 5V/3A USB-C supply or a quality 5V/3.5A unit. Underpowered supplies cause brownouts when all four cores are at full tilt simultaneously, manifesting as random llama.cpp crashes that look like model corruption.

Perf-per-watt: the Pi's one genuine advantage

The Pi 4 8GB pulls about 6 W under sustained Phi-3-Mini inference, including the active cooler and a USB SSD. That works out to ~24 watt-hours per day idle-running a 24/7 assistant that occasionally answers a query. A desktop with an RTX 3060 idles at ~50 W, ~1.2 kWh per day. Over a year, the Pi costs roughly the price of a coffee in electricity; the GPU box costs roughly $80–$120.

This is the real reason the Pi-as-LLM-host pattern persists despite everything above. Tokens-per-second is not the right metric for an always-on assistant — cost-per-day-it-stays-on is. The Pi wins that comparison handily, especially when you architect the system so the Pi only routes hard queries to a sleeping GPU.

Common pitfalls and gotchas

Loading models from microSD. Slow to start, kills the card.
Underpowered USB-C supply. Random crashes that look like model bugs.
No active cooling. Tokens/s drops by 20–30% from thermal throttling.
Forgetting to set thread count. llama.cpp defaults are not always optimal; explicitly set --threads 4 on a Pi 4.
Trying to run 7B+ at FP16 or q8. Will OOM or thrash swap.
Ignoring prefill time. A 4 tok/s model with 14-second time-to-first-token feels much slower than a 4 tok/s model with 1-second time-to-first-token.
Buying a Pi 4 specifically for LLMs in 2026. The Pi 5 is faster and the Pi 5 + AI HAT+ is dramatically faster. Buy a Pi 4 if you already have one, or if you find one cheap on the secondhand market.

Bottom line: realistic use cases for Pi-hosted LLMs

What works:

Voice assistant frontend — wake-word detector + intent classifier + small chat model for follow-ups.
Home Assistant integration — natural-language overlay that maps "turn off the kitchen" to actual entity calls; Qwen 1.8B or Phi-3-Mini handles this comfortably.
Code-comment generators / shell helpers — short-prompt, short-response patterns where 4 tok/s feels fine.
Classification at the edge — sentiment, intent, topic — Qwen 0.5B at 25 tok/s is more than fast enough.
24/7 lightweight RAG with small embedding models (all-MiniLM-L6-v2) and Phi-3-Mini.

What does not work:

Conversational chat with 7B+ models. Use the GPU box.
Long-context tasks (>2K tokens) — prefill kills the experience.
Anything where time-to-first-token matters and the prompt is long.
A coding assistant that needs to read large files into context.

The Pi-as-always-on / GPU-as-burst-worker split is the winning pattern. Add a Vilros Pi Zero W starter kit as the satellite for additional sensors and you have a sub-$200 home AI fabric that quietly does useful work for ~30 watt-hours a day total.

Related guides

Sources

Products mentioned in this article

Tap any product for full specs, live Amazon & eBay pricing, and alternatives.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

What's a realistic tok/s for a 7B model on a Pi 4 8GB?

Slow. CPU-only inference on the Pi 4's memory-bandwidth-limited architecture typically yields low single-digit tokens per second for a 7B model at q4, which is usable for short, patient interactions but painful for chat. Smaller models in the 0.5B-3B range run much faster and are the practical choice if you want responsive output on the Pi alone.

Which small models run best on the Pi 4?

Compact models like TinyLlama, Phi-class small models, and Qwen 0.5B to 1.8B are the sweet spot. They fit comfortably in 8GB at q4 or q5 and deliver enough tokens per second for assistants, classification, and simple tasks. Larger 7B models technically load but run slowly, so match the model to the latency you can tolerate for your project.

Does the Pi 4 need active cooling for LLM workloads?

Yes. Sustained inference pegs all four cores, and without a heatsink and fan the Pi 4 throttles, dropping tokens per second further. A case with active cooling keeps clocks stable through long sessions. Reliable power is equally important; an underpowered supply causes brownouts that corrupt long runs, so use the official or a quality 5V/3A-class supply.

When should I offload to a desktop GPU instead?

Whenever you need real-time chat, larger models, or higher throughput. A desktop RTX 3060 with 12GB runs 7B-13B models an order of magnitude faster than the Pi. A common architecture uses the Pi as a low-power always-on front end or sensor node that forwards heavier requests to a GPU box on the network, combining the Pi's efficiency with real inference speed.

What storage should I use for models on the Pi?

Avoid loading multi-gigabyte models from a slow microSD card every boot. A USB-attached SSD such as the WD Blue SN550 in an enclosure dramatically cuts model load times and improves reliability for logging and datasets. SD cards also wear out under heavy writes, so external SSD storage is the more durable choice for any serious always-on Pi LLM project.

Sources

More guides & deep dives from the SpecPicks archive

Browse all articles & guides →

More reviews from the SpecPicks archive

Browse all reviews →

More buying guides from SpecPicks

Browse all buying guides →

Can a Raspberry Pi 4 8GB Run a Local LLM in 2026? tok/s for TinyLlama, Phi, and Qwen

Short answer

The appeal and the hard limits of CPU-only LLM inference on an SBC

Key takeaways

What constrains LLM inference on the Pi 4?

Which small models actually fit in 8GB?

Quantization matrix on a Pi 4 8GB

Benchmark table: measured tok/s vs model class

Prefill vs generation: why prompt length dominates Pi latency

When to offload to a desktop RTX 3060 instead

What storage and cooling keep the Pi stable under sustained inference?

Perf-per-watt: the Pi's one genuine advantage

Common pitfalls and gotchas

Bottom line: realistic use cases for Pi-hosted LLMs

Related guides

Sources

Products mentioned in this article

Raspberry Pi 4 Computer Model B 8GB Single Board Computer Suitable for…

Raspberry Pi 4 Computer Model B 8GB Single Board Computer Suitable for…

ZOTAC Gaming GeForce RTX 3060 Twin Edge OC 12GB GDDR6 192-bit 15 Gbps PCIE 4.0…

ZOTAC Gaming GeForce RTX 3060 Twin Edge OC 12GB GDDR6 192-bit 15 Gbps PCIE 4.0…

ZOTAC Gaming GeForce RTX 3060 Twin Edge OC 12GB GDDR6 192-bit 15 Gbps PCIE 4.0…

Western Digital 1TB WD Blue SN550 NVMe Internal SSD - Gen3 x4 PCIe 8Gb/s, M.2…

Raspberry Pi Zero W Basic Starter Kit-Includes Pi Zero W Board-Power Supply &…

Frequently asked questions

Sources

Recommended reading

More guides & deep dives from the SpecPicks archive

More reviews from the SpecPicks archive

More buying guides from SpecPicks

Can a Raspberry Pi 4 8GB Run a Local LLM in 2026? tok/s for TinyLlama, Phi, and Qwen

Short answer

The appeal and the hard limits of CPU-only LLM inference on an SBC

Key takeaways

What constrains LLM inference on the Pi 4?

Which small models actually fit in 8GB?

Quantization matrix on a Pi 4 8GB

Benchmark table: measured tok/s vs model class

Prefill vs generation: why prompt length dominates Pi latency

When to offload to a desktop RTX 3060 instead

What storage and cooling keep the Pi stable under sustained inference?

Perf-per-watt: the Pi's one genuine advantage

Common pitfalls and gotchas

Bottom line: realistic use cases for Pi-hosted LLMs

Related guides

Sources

Raspberry Pi 4 Computer Model B 8GB Single Board Computer Suitable for…

Raspberry Pi 4 Computer Model B 8GB Single Board Computer Suitable for…

ZOTAC Gaming GeForce RTX 3060 Twin Edge OC 12GB GDDR6 192-bit 15 Gbps PCIE 4.0…

ZOTAC Gaming GeForce RTX 3060 Twin Edge OC 12GB GDDR6 192-bit 15 Gbps PCIE 4.0…

ZOTAC Gaming GeForce RTX 3060 Twin Edge OC 12GB GDDR6 192-bit 15 Gbps PCIE 4.0…

Western Digital 1TB WD Blue SN550 NVMe Internal SSD - Gen3 x4 PCIe 8Gb/s, M.2…

Raspberry Pi Zero W Basic Starter Kit-Includes Pi Zero W Board-Power Supply &…

Frequently asked questions

Sources

Recommended reading

Keep reading on SpecPicks

More from the archive

Deeper dives from the SpecPicks archive

Just published on SpecPicks