Google's Gemma 3 reference demo for Pi-class single-board computers ran a 1B-parameter quantized chat model at roughly 4-6 tokens per second on a stock Raspberry Pi 5, with optional Coral Edge TPU acceleration pushing select workloads to ~14 tok/s. The demo is "free" in the sense that Google distributes the weights and the build scripts at no cost; the hardware bill of materials still runs $80-$230 depending on board and accelerator choice. This piece is an editorial synthesis of Google's public reference repo, community SBC benchmark threads, and the Coral Edge TPU ecosystem docs.
Why a Pi-class Gemma demo matters
Until 2026, the "local AI on an SBC" conversation was mostly aspirational. Pi 4 hosts could load a 1B-class model but inference was 1-2 tok/s on CPU — slower than people type. Pi 5 nearly doubled CPU performance and added LPDDR4X bandwidth that made small models feel responsive for the first time. Google's choice to ship a Pi-targeted Gemma 3 reference is a signal to the rest of the ecosystem: this is a real platform, not a curiosity.
The practical implication is that hobbyist and education-oriented projects (kids'-room voice assistants, classroom code-completion demos, kitchen-counter recipe parsers) now have a working compute target with predictable performance and a maintained software stack. That combination is the threshold at which hobbyist adoption starts to turn into useful products.
For readers shopping a parts list, this synthesis answers the practical question: what does the cheapest functional rig cost, and where does an accelerator actually pay for itself?
Key takeaways
- Stock Pi 5 (8GB): ~4-6 tok/s on Gemma 3 1B q4_K_M, no accelerator. Usable for narrow agent tasks at short contexts.
- Pi 5 + USB Coral TPU: ~14 tok/s on supported quant graphs. Coral accelerates select decoder ops, not the full model.
- Pi 4 (8GB): ~1.8-2.4 tok/s on the same model. Barely interactive; recommended for batch-mode tasks only.
- RAM matters more than clock. Skip 4GB variants; once a desktop OS takes its 1.5-2 GB, the model, KV cache, and wrapper leave little headroom at longer contexts.
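- Power budget: ~3 W idle, ~5 W sustained generation, ~8 W prefill peaks on a Pi 5; a USB Coral adds ~2.4 W under load. Use the official 27W USB-C PD adapter on a Pi 5; a 5V/3A USB-C supply is enough for a Pi 4.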
What is the Gemma 3 tiny-board build?
Google's reference demo ships three things: a quantized Gemma 3 1B GGUF weight file, a llama.cpp build flagged for ARMv8 NEON acceleration, and a thin Python wrapper that exposes the model as a local HTTP endpoint. The build instructions target Raspberry Pi OS Bookworm 64-bit; ports to Ubuntu Server, DietPi, and Armbian surfaced in community forks within the week.
The model itself is the standard Gemma 3 family's smallest variant — 1B parameters, 32k context window, RoPE-scaled to 128k for long-document reading at a steep tok/s cost. Quantization options run from q2_K through fp16, with q4_K_M as the recommended sweet spot for Pi-class hardware.
The "tiny board" framing is Google's marketing, not a separate model release. The same weights run on any ARMv8 platform with sufficient RAM; the Pi just gets first-class build support and tested defaults.
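Because the wrapper exposes the model as a local HTTP endpoint, any HTTP client can talk to it. The sketch below is a minimal standard-library client under stated assumptions: the host, port, `/completion` path, and the `prompt` / `n_predict` field names are placeholders modeled on a llama.cpp-server-style API, not taken from Google's reference repo, so adjust them to whatever the wrapper actually documents.

```python
import json
import urllib.request

# Assumptions (not from the reference repo): host, port, path, and JSON field
# names are placeholders. Check the wrapper's own docs for the real values.
BASE_URL = "http://127.0.0.1:8080"
ENDPOINT = "/completion"


def build_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build a JSON POST request for the hypothetical local endpoint."""
    body = json.dumps({"prompt": prompt, "n_predict": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 64, timeout_s: float = 120.0) -> dict:
    """Send the prompt and return the decoded JSON reply as a dict."""
    request = build_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

The generous default timeout is deliberate: at 4-6 tok/s, even a short reply can take tens of seconds to come back if the endpoint does not stream.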
Hardware compatibility matrix
| SBC | RAM | Tok/s (Gemma 3 1B q4_K_M) | Suitable for... |
|---|---|---|---|
| Raspberry Pi 5 8GB | 8 GB | 5.4 | Interactive chat, narrow agents |
| Raspberry Pi 5 4GB | 4 GB | 5.1 | Same model, tight RAM headroom |
| Raspberry Pi 4 8GB | 8 GB | 2.2 | Batch tasks, voice assistants |
| Raspberry Pi 4 4GB | 4 GB | 2.0 | Marginal; risk OOM at long ctx |
| Orange Pi Zero 2W 4GB | 4 GB | 1.6 | Tinkering; not for production |
| Pi 5 + USB Coral TPU | 8 GB | 14.1 (Coral-accelerated ops) | Vision + LLM combo workloads |
| Pi 5 + Coral M.2 dual TPU | 8 GB | 18.7 (Coral-accelerated ops) | Multi-model agent rigs |
Note that the Coral TPU numbers are misleading without context. The Coral does not accelerate full-model decode; it accelerates specific quantized integer ops that match its int8 matrix-multiply hardware. For Gemma 3 1B, that means projection and feed-forward layers see speedup while attention layers run on CPU. The end-to-end tok/s improvement is real but smaller than the headline integer-op throughput suggests.
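The gap between per-op speedup and end-to-end speedup is Amdahl's law. The sketch below shows the arithmetic; the 70 percent offloaded share and 10x op speedup are illustrative assumptions, not measurements from the reference repo or the Coral docs.

```python
def end_to_end_speedup(offloaded_fraction: float, op_speedup: float) -> float:
    """Amdahl's law: overall speedup when only part of the work gets faster."""
    if not 0.0 <= offloaded_fraction <= 1.0:
        raise ValueError("offloaded_fraction must be between 0 and 1")
    if op_speedup <= 0:
        raise ValueError("op_speedup must be positive")
    remaining = 1.0 - offloaded_fraction          # work still done on the CPU
    accelerated = offloaded_fraction / op_speedup  # work moved to the accelerator
    return 1.0 / (remaining + accelerated)


# Illustrative inputs only (assumed, not measured): 70% of decode time is in
# offloadable ops and the accelerator runs those ops 10x faster.
print(round(end_to_end_speedup(0.70, 10.0), 2))  # prints 2.7
```

Under those assumed inputs, even an infinitely fast accelerator would cap the overall gain at about 3.3x, because the 30 percent left on the CPU still has to run at CPU speed.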
Quantization choices for Pi-class hardware
| Quant | Model size | KV cache (4k ctx) | Tok/s Pi 5 | Quality loss |
|---|---|---|---|---|
| q2_K | 0.3 GB | 0.1 GB | 7.2 | Severe |
| q4_K_M | 0.6 GB | 0.1 GB | 5.4 | Negligible |
| q5_K_M | 0.7 GB | 0.1 GB | 4.6 | None |
| q6_K | 0.8 GB | 0.1 GB | 3.9 | None |
| q8_0 | 1.0 GB | 0.1 GB | 3.4 | None |
| fp16 | 2.0 GB | 0.1 GB | 1.7 | Reference |
q4_K_M is the right pick for Pi 5. q5_K_M is the right pick if you have any spare RAM and you value response quality over raw throughput. q2_K is included for completeness but Google's reference repo specifically discourages it — the quality cliff makes it useful only for autocomplete, not chat.
fp16 on a Pi is a curiosity benchmark, not a real deployment target. The 1.7 tok/s figure makes the whole rig feel sluggish even on short prompts.
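The selection logic in this section can be written down as a small helper. This is a sketch that hard-codes the table's figures; the `pick_quant` name, the "largest file wins" quality proxy, and the default exclusion of q2_K are choices made here for illustration.

```python
from typing import NamedTuple, Optional


class Quant(NamedTuple):
    name: str
    model_gb: float  # weight file size from the table
    kv_gb: float     # KV cache at 4k context from the table
    tok_s: float     # Pi 5 throughput from the table


# Rows copied from the quantization table in this section.
QUANTS = [
    Quant("q2_K", 0.3, 0.1, 7.2),
    Quant("q4_K_M", 0.6, 0.1, 5.4),
    Quant("q5_K_M", 0.7, 0.1, 4.6),
    Quant("q6_K", 0.8, 0.1, 3.9),
    Quant("q8_0", 1.0, 0.1, 3.4),
    Quant("fp16", 2.0, 0.1, 1.7),
]


def pick_quant(free_ram_gb: float, min_tok_s: float,
               exclude: tuple = ("q2_K",)) -> Optional[Quant]:
    """Return the largest quant that fits in free RAM and meets the speed floor.

    File size is used as a rough stand-in for quality (an assumption).
    """
    candidates = [
        q for q in QUANTS
        if q.name not in exclude
        and q.model_gb + q.kv_gb <= free_ram_gb
        and q.tok_s >= min_tok_s
    ]
    return max(candidates, key=lambda q: q.model_gb) if candidates else None


print(pick_quant(free_ram_gb=2.0, min_tok_s=5.0).name)  # prints q4_K_M
print(pick_quant(free_ram_gb=2.0, min_tok_s=4.0).name)  # prints q5_K_M
```

Relaxing the speed floor from 5 to 4 tok/s is what moves the pick from q4_K_M to q5_K_M, which mirrors the throughput-versus-quality trade described above.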
Power and thermal — what to expect
| State | Pi 5 power draw | Notes |
|---|---|---|
| Idle, no inference | 3.1 W | OS + HTTP wrapper idle |
| First-token (prefill) | 8.4 W peak | Brief spike during prompt encode |
| Sustained generation | 5.2 W average | Average over 60-second decode |
| Pi 5 + USB Coral, inference | 7.6 W average | Coral adds ~2.4 W under load |
A passively cooled Pi 5 in a standard aluminum case stays under 70°C indefinitely at sustained inference. The official Pi 5 active cooler keeps it under 55°C even in a 28°C ambient. Throttling is not a practical concern for steady-state chat workloads.
For battery-powered builds — kiosk applications, classroom rovers, portable demo stations — the 5.2 W average means a 20,000 mAh USB-C power bank delivers roughly 18 hours of mixed idle-plus-burst usage. That is the difference between "demo at the booth" and "leave it running all day."
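The battery estimate is simple enough to check by hand. The sketch below assumes the power bank's mAh rating is quoted at a 3.7 V nominal cell voltage and that about 85 percent of that energy survives conversion to 5 V; both figures are assumptions, as is the 20 percent generation duty cycle.

```python
def mixed_watts(idle_w: float, busy_w: float, busy_fraction: float) -> float:
    """Average draw for a device that is busy for a fraction of the time."""
    return idle_w + (busy_w - idle_w) * busy_fraction


def runtime_hours(capacity_mah: float, avg_watts: float,
                  cell_volts: float = 3.7, efficiency: float = 0.85) -> float:
    """Estimate runtime from a power bank's rated capacity.

    cell_volts and efficiency are assumed typical values, not measurements.
    """
    usable_wh = capacity_mah / 1000.0 * cell_volts * efficiency
    return usable_wh / avg_watts


# Table figures: 3.1 W idle, 5.2 W sustained generation.
# Assumed duty cycle: generating 20% of the time.
average = mixed_watts(3.1, 5.2, 0.20)
print(round(runtime_hours(20000, average), 1))  # mixed use, prints 17.9
print(round(runtime_hours(20000, 5.2), 1))      # nonstop generation, prints 12.1
```

With those assumptions the mixed-use figure lands close to the 18 hours quoted above, while nonstop generation drops to roughly 12 hours.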
The Coral TPU question: is it worth the $55 add-on?
Per Google's reference numbers, the USB Coral Edge TPU buys roughly 2.6x throughput on Gemma 3 1B. The price is $55-$130 depending on form factor, plus a USB 3.0 port (USB 2.0 works but caps the speedup at about 1.8x due to bus bandwidth).
The math depends on the workload:
- Pure chat, no vision: Coral is marginal. The 2.6x speedup is real but a stock Pi 5 already feels responsive at 5 tok/s for short replies.
- Vision + LLM agent: Coral is excellent. Running a TFLite vision model and Gemma in parallel without the TPU steals 30-40 percent of CPU from inference. Offloading vision to the TPU frees decoder throughput.
- Multi-model agents: Coral is effectively required. Running two small models in parallel on a stock Pi 5 collapses both to ~2 tok/s. With Coral handling embeddings or reranking, the main decoder stays at full speed.
The Coral M.2 Dual Edge TPU doubles the on-board accelerator count for the same socket footprint and is the right pick if you already have a Pi 5 with the official PCIe HAT. Two TPUs let you pipeline embeddings, reranking, and image classification independently.
What this is NOT good for
Per the limitations Google's reference repo calls out:
- Long-form generation. A 4k-token response at Pi 5 q4_K_M speeds takes about 12 minutes. Long-form drafting is not the target use case.
- Multi-turn reasoning depth. Gemma 3 1B is a 1B-parameter model. It is competent at single-step task following and shallow conversational context, weak at multi-step planning or any chain-of-thought longer than three or four steps.
- Code generation. The 1B variant lags meaningfully on HumanEval. Use a Pi-hosted Gemma rig for natural-language tasks; use a real workstation for coding.
- High-volume serving. A Pi 5 handles one concurrent user comfortably. Two concurrent decoders push context-switch overhead past the point where either user gets responsive output.
If you want a multi-user chat backend or a code-completion endpoint, run the model on a workstation with a discrete GPU; no Pi-class board reaches that performance envelope, least of all an older Raspberry Pi 4 8GB.
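The long-form figure above follows directly from the throughput numbers. A minimal sketch, taking "4k tokens" as 4,096 and using the tok/s values from the hardware matrix:

```python
def generation_minutes(tokens: int, tok_per_s: float) -> float:
    """Wall-clock minutes to decode a reply, ignoring prefill time."""
    return tokens / tok_per_s / 60.0


# Throughput figures from the hardware compatibility matrix (q4_K_M).
boards = {
    "Raspberry Pi 5 8GB": 5.4,
    "Raspberry Pi 4 8GB": 2.2,
}

for board, tok_s in boards.items():
    minutes = generation_minutes(4096, tok_s)
    print(f"{board}: {minutes:.1f} min for a 4,096-token reply")
```

On a Pi 5 that works out to about 12.6 minutes; on a Pi 4 it is roughly half an hour, which is why long-form drafting is off the table.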
Real-world build BOMs
$80 entry rig (slowest viable):
- Raspberry Pi 4 8GB
- 32GB microSD (A2 rated)
- USB-C 5V/3A PSU
- Passive aluminum case
Expect 2.2 tok/s on Gemma 3 1B q4_K_M. Good for tinkering, batch-mode tasks, or background voice-assistant duty where latency matters less.
$165 sweet-spot rig:
- Raspberry Pi 5 8GB (~$80)
- 64GB SSD via M.2 HAT (~$45)
- Active cooler (~$5)
- 65W USB-C PSU (~$15)
- NVMe-capable case (~$20)
Expect 5.4 tok/s on Gemma 3 1B q4_K_M. Real-time interactive chat, narrow-agent workloads, voice-assistant front-end with cloud fallback for hard queries.
$230 accelerated rig:
- Pi 5 sweet-spot rig above
- USB Coral Edge TPU (~$65)
Expect 14 tok/s on Coral-accelerated workloads, 5.4 tok/s on pure-CPU decode. Best fit when the application combines vision and LLM, or runs two small models concurrently.
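To compare the rigs on price per unit of throughput, the parts lists can be totalled in a few lines. The prices are the approximate figures listed above; dollars per tok/s is a rough metric chosen here for illustration, and the 14.1 tok/s figure only applies to Coral-accelerated workloads.

```python
# Approximate prices (USD) copied from the parts lists above.
SWEET_SPOT = {
    "Raspberry Pi 5 8GB": 80,
    "64GB SSD via M.2 HAT": 45,
    "Active cooler": 5,
    "65W USB-C PSU": 15,
    "NVMe-capable case": 20,
}
ACCELERATED = {**SWEET_SPOT, "USB Coral Edge TPU": 65}


def total(bom: dict) -> int:
    """Sum the parts list."""
    return sum(bom.values())


def dollars_per_tok_s(bom: dict, tok_s: float) -> float:
    """Hardware cost divided by throughput; lower is better."""
    return total(bom) / tok_s


print(total(SWEET_SPOT), round(dollars_per_tok_s(SWEET_SPOT, 5.4), 2))
print(total(ACCELERATED), round(dollars_per_tok_s(ACCELERATED, 14.1), 2))
```

The accelerated rig looks far cheaper per tok/s, but only for workloads that actually hit the accelerated ops; on pure-CPU decode it costs $65 more for the same 5.4 tok/s.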
Common pitfalls
- Undersized PSU. The Pi 5 needs the official 27W USB-C PD adapter for full performance. Generic 15W phone chargers let the SoC brown out under inference load, causing OS reboots that look like model crashes.
- Thermal throttling on bare boards. A Pi 5 with no heatsink or cooler throttles after about 8 minutes of sustained inference at 25°C ambient. Either add the active cooler or use a heatsink case, such as the NVMe-capable ones.
- Wrong llama.cpp build flags. The default build does not enable ARMv8 NEON acceleration. Without -DLLAMA_NATIVE=on -DLLAMA_LTO=on you get roughly 60 percent of the documented tok/s. Google's reference repo has these set correctly; manual builds frequently miss them.
- microSD bottleneck. Loading a 600 MB GGUF from an A1-rated microSD takes 15+ seconds. An A2-rated card cuts that to 6 seconds. A USB 3.0 SSD or M.2 HAT cuts it to 2 seconds. After the first prompt, the model lives in RAM and disk speed stops mattering.
- OS bloat eating RAM. A desktop-flavor Pi OS install consumes 1.5-2 GB before any inference starts. On a 4 GB Pi 5 that leaves dangerously little headroom for the model + KV cache + Python wrapper. Use Pi OS Lite for headless inference deployments.
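The RAM pitfall is easy to guard against before loading the model. The sketch below parses the Linux `/proc/meminfo` format and checks `MemAvailable` against the model footprint; the helper names and the 0.5 GiB safety margin are assumptions made for illustration.

```python
def parse_meminfo_kb(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: kibibytes}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def has_headroom(meminfo_text: str, needed_gb: float,
                 margin_gb: float = 0.5) -> bool:
    """True if MemAvailable covers the model footprint plus a safety margin.

    margin_gb is an assumed buffer for the wrapper and context growth.
    """
    available_kb = parse_meminfo_kb(meminfo_text).get("MemAvailable", 0)
    available_gb = available_kb / (1024 * 1024)
    return available_gb >= needed_gb + margin_gb


def check_host(needed_gb: float) -> bool:
    """Run the check against the live system (Linux only)."""
    with open("/proc/meminfo", encoding="ascii") as handle:
        return has_headroom(handle.read(), needed_gb)
```

Calling `check_host(0.7)` before starting the server would cover the q4_K_M weights plus a 4k-context KV cache, using the sizes from the quantization table.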
When NOT to bother with Pi-class hosting
If your local-AI use case needs 7B+ models, multi-user concurrency, vision-language reasoning, or long-context RAG, do not start with a Pi. You will outgrow the platform within a week. A used desktop with an RTX 3060 12GB is the next step up; it costs roughly 3-4x what an accelerated Pi rig costs but delivers 20-30x the inference throughput.
The Pi is right when the workload is narrow, latency-tolerant, embedded, or educational — and when the deployment context cannot tolerate a 200W desktop sitting on a desk.
Bottom line
Google's Gemma 3 tiny-board reference makes single-board AI a real platform rather than a curiosity. A $165 Pi 5 build delivers conversational chat at a readable pace. A $230 Coral-accelerated build extends that into multi-model agent territory. The next price-performance step up is a discrete GPU on a real desktop, but for embedded, educational, and demo applications the Pi rig sits in a sweet spot that no other platform fills.
If you are starting from zero hardware, buy the Pi 5 8GB, the active cooler, an NVMe HAT, and a fast SSD. Add the Coral only when a specific workload demands it; the marginal value depends entirely on whether you can use the accelerated ops.
Citations and sources
- Raspberry Pi Foundation — Raspberry Pi 5 product page and specs
- Google Coral — Edge TPU documentation and benchmark methodology
- ARMv8 NEON intrinsics reference, Arm Developer
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
