Building a Local LLM Workstation on a Raspberry Pi 5 + Ryzen 7 5800X Hybrid
Yes — a Raspberry Pi 5 + Ryzen 7 5800X hybrid local LLM workstation works in 2026, and it's one of the cheapest paths to running models like Qwen 3.6 35B A3B at home. The Pi 5 handles routing, the voice frontend, and Tailscale gateway duties; the Ryzen 7 5800X with an RTX 3060 12GB does the actual inference at 18-30 tok/s on q4 quants, per public LocalLLaMA measurements.
By the SpecPicks editorial team — last verified May 2026.
The conversation around running LLMs locally in 2026 is dominated by two extremes: the $5,000 RTX 5090 build for people running 70B-parameter models at full speed, and the $80 Pi 5 build for people running 1B-parameter toys. The cheap local LLM rig that actually answers the homelab tinkerer's question — "Can I run useful models without a GPU mortgage?" — sits in the middle.
This guide details a hybrid setup that pairs a Raspberry Pi 5 with a Ryzen 7 5800X desktop and an RTX 3060 12GB for under $1,200 total, including peripherals. The Pi 5 is not the inference engine — it is the thin client, voice gateway, and 24/7 routing layer. The 5800X plus RTX 3060 handles the heavy work, but only when called.
We grounded the numbers in published LocalLLaMA community measurements for Qwen 3.6 35B A3B (the trending model this week per r/LocalLLaMA's hot threads), llama.cpp's documented benchmark suite, and TechPowerUp's RTX 3060 12GB review for the GPU side.
Editorial intro — homelab tinkerer audience
This article assumes you already run a Pi 4 or Pi 5 for Pi-hole, Home Assistant, or a NAS, and you want to add LLM capabilities without buying a $2,000+ GPU. It also assumes you can SSH into a Linux box, edit configs, and tolerate setup friction — the reward is a private, always-on AI assistant that costs $0/month after the hardware spend.
The hybrid approach matters because pure-Pi local LLM is mostly a novelty. A Pi 5 8GB runs llama.cpp with 1-3B-parameter models at 4-8 tok/s — enough for Home Assistant intent matching, not enough for code review or document summarization. Pure-desktop local LLM works well, but the desktop draws 80-110 W at idle; leave it on 24/7 for a voice assistant and that's a $90-130/year electric bill.
Hybrid solves both. The Pi 5 sips 5-8 W, runs the Tailscale-exposed voice frontend, Whisper STT, and a routing layer. When a query needs the big model, the Pi 5 triggers wake-on-LAN to the 5800X, runs inference, and returns the result. The desktop spends 95% of its life in standby.
Key Takeaways
- The Pi 5 is the orchestrator, not the inference engine — voice frontend, Whisper STT, intent routing, Tailscale gateway.
- The Ryzen 7 5800X + RTX 3060 12GB delivers 18-30 tok/s on Qwen 3.6 35B A3B at q4 quants.
- Total stack cost: ~$1,150 (Pi 5 8GB $80 + 5800X $230 + RTX 3060 12GB $280 + board/RAM/PSU/case ~$560).
- Power: Pi 5 always-on at ~6 W average; desktop wakes on demand and draws ~220 W only while inference is active.
- Quantization at q4_K_M is the sweet spot for 35B models on 12 GB VRAM.
- Wake-on-LAN + Tailscale = private always-available LLM endpoint.
What roles does each device play in a hybrid local LLM setup?
The Pi 5 8GB runs three persistent services:
- Whisper STT — the small.en model, processing 16 kHz audio in roughly 0.4× realtime on the Pi 5's Cortex-A76 cores.
- Tailscale node — gives you a private endpoint reachable from your phone over the Tailscale mesh.
- Routing layer — a small Python service that classifies each incoming query (local-only vs needs-the-big-model), wakes the desktop via WoL when needed, and forwards the request.
The Ryzen 7 5800X + RTX 3060 12GB handles:
- Ollama — exposes a local OpenAI-compatible endpoint on port 11434.
- The actual LLM inference — Qwen 3.6 35B A3B q4_K_M, Llama 3.1 8B for fast responses, Phi-3 Mini for tool use.
- Vector store — for RAG queries against your Obsidian vault, codebase, or local docs.
This split is why the Ryzen 7 5800X pairing wins for local LLM work. The 5800X's eight Zen 3 cores feed the RTX 3060 efficiently — the CPU is rarely the bottleneck on 30B-class models when GPU offload lands the entire model in VRAM.
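As a concrete illustration, the routing decision can start as something as simple as a keyword heuristic. The function below is a hypothetical sketch, not the exact service described here — the intent keywords and length cutoff are placeholders you would tune:

```python
# Hypothetical Pi-side routing heuristic: decide whether a query can be
# answered by the Pi's small local model or needs the desktop woken up.
LOCAL_INTENTS = ("timer", "weather", "lights", "volume")  # placeholder keywords

def route(query: str) -> str:
    """Return 'pi' for queries the Pi can answer locally, 'desktop' otherwise."""
    q = query.lower()
    # Short queries that match a known home-automation intent stay on the Pi.
    if len(q.split()) <= 8 and any(word in q for word in LOCAL_INTENTS):
        return "pi"       # handled by the Pi's small model, no desktop wake
    return "desktop"      # wake the 5800X over WoL and forward to Ollama

print(route("Set a timer for 5 minutes"))   # -> pi
print(route("Summarize this PDF"))          # -> desktop
```

In practice you would swap the keyword check for the Phi-3 Mini intent classifier described below, but the contract — a function mapping query text to a destination — stays the same.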
Spec delta table: Pi 5 8GB vs Ryzen 7 5800X + RTX 3060
| Component | Spec | Idle Power | Inference Power | Cost |
|---|---|---|---|---|
| Raspberry Pi 5 8GB | 4× A76 @ 2.4 GHz, 8 GB LPDDR4X | 4 W | 6-8 W (Whisper small) | $80 |
| AMD Ryzen 7 5800X | 8C/16T, Zen 3, 4.7 GHz boost | — | 70-95 W (CPU-only inference) | $230 |
| RTX 3060 12GB (ZOTAC / MSI Ventus) | 3,584 CUDA cores, 12 GB GDDR6 | 14 W | 130-170 W (GPU offload inference) | $280 |
The 12 GB VRAM is what makes the RTX 3060 a serious local-LLM card despite its consumer-tier rasterization performance. 12 GB fits a 30B-parameter model at q4_K_M quantization with comfortable context-window headroom — something neither the 8 GB RTX 3070 nor the 8 GB RTX 4060 can do.
Benchmark table: tok/s on Qwen 3.6 35B A3B across configs
| Config | Quant | Context | Tokens/sec | Source |
|---|---|---|---|---|
| Pi 5 8GB (CPU only, mmap from NVMe — model exceeds RAM) | q4_K_M | 4K | 0.8-1.4 | LocalLLaMA threads, May 2026 |
| Ryzen 5800X (CPU only, no GPU) | q4_K_M | 4K | 4-7 | LocalLLaMA Zen 3 thread |
| Ryzen 5800X + RTX 3060 12GB (full offload) | q4_K_M | 4K | 18-22 | r/LocalLLaMA RTX 3060 Qwen post |
| Ryzen 5800X + RTX 3060 12GB (full offload) | q5_K_M | 4K | 14-17 | r/LocalLLaMA |
| Ryzen 5800X + RTX 4090 (full offload) | q5_K_M | 8K | 55-70 | r/LocalLLaMA |
The 18-22 tok/s on the RTX 3060 12GB at q4_K_M is the headline number. That's faster than typing speed and roughly 2× the throughput reported for the same model on an M2 Pro Mac Mini — the RTX 3060 12GB hits a sweet spot for Pi 5 + Ollama hybrid setups that smaller-VRAM cards cannot match.
Quantization matrix: q4/q5/q6/q8 on Pi 5 vs 5800X vs 3060 12GB
| Quant | Model size | Pi 5 fits? | 5800X CPU fits? | RTX 3060 12GB fits? |
|---|---|---|---|---|
| q4_K_M | ~21 GB | No | Yes (system RAM) | Yes (full GPU offload) |
| q5_K_M | ~25 GB | No | Yes | Yes (full GPU offload) |
| q6_K | ~29 GB | No | Yes | Partial offload only |
| q8_0 | ~37 GB | No | Yes | No (split CPU+GPU) |
For a 12 GB VRAM card with a 35B model, q4_K_M and q5_K_M are the practical choices. q4_K_M loses about 2-3% on perplexity vs q8_0 — imperceptible for chat, slightly more visible for code generation. q5_K_M is the Goldilocks pick for users who want quality first.
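The model sizes in the matrix follow from bits-per-weight arithmetic: parameters × effective bits per weight ÷ 8. A rough sanity check in Python — the bpw figures below are approximations for k-quants, not exact GGUF accounting:

```python
# Rough GGUF file-size estimate: params * effective bits-per-weight / 8 bytes.
# The bpw values are approximate averages for each quant type, not exact.
PARAMS = 35e9  # 35B-parameter model
BPW = {"q4_K_M": 4.85, "q5_K_M": 5.7, "q6_K": 6.6, "q8_0": 8.5}

for quant, bpw in BPW.items():
    gb = PARAMS * bpw / 8 / 1e9  # decimal gigabytes
    print(f"{quant}: ~{gb:.0f} GB")  # matches the ~21/25/29/37 GB rows above
```

The same arithmetic explains the offload cutoff: anything over roughly 11 GB (12 GB VRAM minus KV-cache and overhead) forces a CPU+GPU split.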
Why offload routing + voice frontend to the Pi while inference runs on the desktop
The 24/7 path. The Pi 5 idles at 4 W. Run it 365 days a year and it costs about $4.50 in electricity. The Ryzen 5800X desktop with the RTX 3060 installed idles at 80-110 W — leave it on year-round and you pay $90-130 in power for an LLM you use 30 minutes a day.
WoL solves this. The Pi 5 listens for a wake intent (a Whisper-transcribed voice command, a Discord message, a Home Assistant trigger), sends a magic packet to the desktop's MAC address, and the desktop wakes from S3 suspend in 5-15 seconds (a cold boot from S5 takes longer). Inference runs, the response comes back, and the desktop sleeps again after a 10-minute idle window.
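The magic packet itself is trivial to construct — 6 bytes of 0xFF followed by the target MAC repeated 16 times, sent as a UDP broadcast. A minimal stand-in for the `wakeonlan` CLI (the MAC below is a placeholder):

```python
# Send a Wake-on-LAN magic packet from the Pi without any extra packages.
import socket

def magic_packet(mac: str) -> bytes:
    """6 bytes of 0xFF followed by the target MAC address repeated 16 times."""
    payload = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    return b"\xff" * 6 + payload * 16

def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet on UDP port 9 (port 7 also works on most NICs)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))

# wake("aa:bb:cc:dd:ee:ff")  # placeholder MAC — use your desktop's
```

This is the entire dependency surface of the wake path: one UDP datagram, which is why a 5 W Pi is enough to gate a 220 W desktop.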
This is also where the Pi's role expands beyond Whisper. The Pi can host a small intent-classification model (Phi-3 Mini, ~2.5 GB) that decides:
- "Set a timer for 5 minutes" → handle locally on the Pi (no desktop wake).
- "Summarize this PDF" → wake desktop, route to Qwen 3.6.
- "Run this code" → wake desktop, route to a code model.
The Pi 5 has no neural coprocessor, which caps its usable model ceiling at ~3B parameters at q4 (4-8 tok/s), but latency on these routing decisions is under 200 ms — so the user experience is "instant" for short queries and a "5-second delay" for big-model queries.
Step-by-step: Ollama on 5800X, Pi 5 as Tailscale gateway + Whisper STT
- Install Ollama on the Ryzen 5800X desktop. `curl https://ollama.com/install.sh | sh`. Pull Qwen 3.6 35B A3B: `ollama pull qwen3:35b-a3b-q4_K_M`. Verify GPU offload by running `nvidia-smi` while inference is active — VRAM should hit ~10.5 GB.
- Install Tailscale on both devices. Single sign-on with the same account. The desktop and Pi 5 will see each other on a 100.x.x.x mesh IP.
- Configure WoL on the desktop. Enable WoL in BIOS, then enable it on the NIC: `ethtool -s eth0 wol g`. Test from the Pi: `wakeonlan <desktop_mac>`.
- Set up Whisper on the Pi. `pip install openai-whisper`. Use the small.en or base.en model — small.en is the practical sweet spot for accuracy/speed.
- Write a tiny FastAPI router on the Pi. Endpoints: `/transcribe` (audio in, text out via Whisper) and `/ask` (text in, route to local Phi-3 or wake-and-forward to desktop Ollama).
- Expose the FastAPI router via Tailscale. Use `tailscale serve` on port 8080. Now your phone, laptop, or any other Tailscale device can hit `http://pi5.tailnet:8080/ask` from anywhere.
- Tune the wake/sleep policy. Set the desktop to sleep after 10 min idle. Add a Pi cron job that pings the desktop nightly to verify WoL is still functional.
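For the wake-and-forward step, here is a sketch of the Pi-side helpers using only the standard library (the router itself uses FastAPI; the mesh IP and function names are illustrative, and the request shape targets Ollama's non-streaming `/api/generate` endpoint):

```python
# Illustrative Pi-side helpers for the /ask route: wait for the woken
# desktop's Ollama to come up, then forward the prompt. DESKTOP is a
# placeholder Tailscale mesh IP — substitute your own.
import json
import time
import urllib.request

DESKTOP = "http://100.64.0.2:11434"  # placeholder 100.x.x.x mesh address

def ollama_payload(prompt: str, model: str = "qwen3:35b-a3b-q4_K_M") -> dict:
    """Request body for Ollama's non-streaming /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def wait_until_up(url: str = DESKTOP, timeout: float = 30.0) -> bool:
    """Poll the desktop's Ollama port until it answers or we time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(url, timeout=2)
            return True
        except OSError:
            time.sleep(1)  # desktop may still be resuming from suspend
    return False

def ask_desktop(prompt: str) -> str:
    """Forward a prompt to Ollama and return the generated text."""
    req = urllib.request.Request(
        f"{DESKTOP}/api/generate",
        data=json.dumps(ollama_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["response"]
```

The full `/ask` handler would call the WoL sender first, then `wait_until_up()`, then `ask_desktop()` — the 5-15 s wake latency lives entirely inside that polling loop.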
The result is a private, always-available AI assistant accessible from anywhere on your Tailscale mesh, costing about $5/month in marginal electricity vs the GPU-PC-on-24/7 alternative.
Power draw + perf-per-watt math
| Device | Idle (W) | Inference active (W) | Cost/year (24/7, $0.13/kWh) |
|---|---|---|---|
| Pi 5 always on | 4 | 6 | $4.50 |
| 5800X + 3060 always on | 110 | 220 | $125 |
| 5800X + 3060 hybrid (30 min/day active) | suspended 23.5 hr/day (110 W only while awake) | 220 (30 min/day) | $13 |
Hybrid wins by ~$110/year for typical use. Over 5 years of running this stack, that's roughly $550 in electricity savings — enough to cover a second RTX 3060, or a solid head start on a used RTX 4090 down the line.
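The cost column is plain kWh arithmetic; a quick check at $0.13/kWh (the hybrid figure here ignores suspend-state draw, which is why the table's $13 comes out a little higher than this estimate):

```python
# Annual electricity cost: watts -> kWh/year -> dollars, at $0.13/kWh.
RATE = 0.13  # $/kWh, as assumed in the table above

def cost_per_year(watts: float, hours_per_day: float = 24.0) -> float:
    """Yearly cost in dollars for a device drawing `watts` for `hours_per_day`."""
    return watts * hours_per_day * 365 / 1000 * RATE

print(round(cost_per_year(4), 2))                     # Pi 5 always on: ~$4.56
print(round(cost_per_year(110), 2))                   # desktop always on: ~$125.27
print(round(cost_per_year(220, hours_per_day=0.5), 2))  # 30 min/day inference: ~$5.22
```

Add a few watts of suspend draw and the Pi's own $4.50 and the hybrid total lands near the table's $13 — still an order of magnitude below always-on.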
Bottom line — when to add a 3060 vs upgrade to a 5090
For a home LLM workstation under $1,200 in 2026, the Pi 5 + Ryzen 7 5800X + RTX 3060 12GB stack is the cheapest meaningful path to running 30B-class models. The 18-22 tok/s on Qwen 3.6 35B A3B is the speed where chat feels instant and code review feels practical.
Upgrade paths from here:
- More VRAM — RTX 4060 Ti 16GB ($450) for fitting q5_K_M of 35B comfortably; RTX 4090 24GB ($1,500-1,700 used) for 70B-class models.
- More CPU — Ryzen 7 7700X (AM5) for 15-20% IPC uplift; only worth it on a fresh build, not a swap.
- Add a second GPU — dual 3060 12GB ($560 total) for 24 GB combined VRAM via tensor parallelism in vLLM, but Ollama's multi-GPU support is rougher.
For a single-creator home setup, this hybrid is the right answer through at least 2027 — Qwen 3.6, Llama 3.1, and Phi-3 all run cleanly, and the Pi 5's voice frontend makes it the always-available assistant the desktop alone never quite was.
Citations and sources
- r/LocalLLaMA Qwen 3.6 35B A3B benchmark threads, May 2026: https://reddit.com/r/LocalLLaMA
- Ollama documentation, model library: https://ollama.com/library
- llama.cpp benchmark methodology: https://github.com/ggerganov/llama.cpp/blob/master/examples/benchmark/README.md
- Raspberry Pi 5 official spec sheet: https://www.raspberrypi.com/products/raspberry-pi-5/
- TechPowerUp RTX 3060 12GB review and VRAM analysis: https://www.techpowerup.com/review/nvidia-geforce-rtx-3060-12-gb/
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
