Building a Local LLM Workstation on a Raspberry Pi 5 + Ryzen 7 5800X Hybrid

Pi 5 as the 24/7 voice frontend, Ryzen 5800X + RTX 3060 12GB as the wake-on-LAN inference engine — an under-$1,200 path to a private, always-on Qwen 3.6 35B local LLM rig.

Yes — a Raspberry Pi 5 + Ryzen 7 5800X hybrid local LLM workstation works in 2026, and it's one of the cheapest paths to running models like Qwen 3.6 35B A3B at home. The Pi 5 handles routing, the voice frontend, and Tailscale gateway duties; the Ryzen 7 5800X with an RTX 3060 12GB does the actual inference at 18-22 tok/s on q4 quants, per public LocalLLaMA measurements.

By the SpecPicks editorial team — last verified May 2026.

The conversation around running LLMs locally in 2026 is dominated by two extremes: the $5,000 RTX 5090 build for people running 70B-parameter models at full speed, and the $80 Pi 5 build for people running 1B-parameter toys. The cheap local LLM rig that actually answers the homelab tinkerer's question — "Can I run useful models without a GPU mortgage?" — sits in the middle.

This guide details a hybrid setup that pairs a Raspberry Pi 5 with a Ryzen 7 5800X desktop and an RTX 3060 12GB for under $1,200 total, including peripherals. The Pi 5 is not the inference engine — it is the thin client, voice gateway, and 24/7 routing layer. The 5800X plus RTX 3060 handles the heavy work, but only when called.

We grounded the numbers in published LocalLLaMA community measurements for Qwen 3.6 35B A3B (the trending model this week per r/LocalLLaMA's hot threads), llama.cpp's documented benchmark suite, and TechPowerUp's RTX 3060 12GB review for the GPU side.

Editorial intro — homelab tinkerer audience

This article assumes you already run a Pi 4 or Pi 5 for Pi-hole, Home Assistant, or a NAS, and you want to add LLM capabilities without buying a $2,000+ GPU. It also assumes you can SSH into a Linux box, edit configs, and tolerate setup friction; the reward is a private, always-on AI assistant with no monthly subscription after the hardware spend.

The hybrid approach matters because pure-Pi local LLM is mostly a novelty. A Pi 5 8GB runs llama.cpp with 1-3B-parameter models at 4-8 tok/s — enough for Home Assistant intent matching, not enough for code review or document summarization. Pure-desktop local LLM works well, but the desktop idles at 80-110 W if left on 24/7 for a voice assistant, which adds $90-130 per year to the electric bill.
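
If you want to sanity-check that Pi ceiling yourself before buying anything, a few lines of llama-cpp-python are enough. A minimal sketch, assuming you've installed the bindings and downloaded some 1-3B GGUF (the model path is a placeholder):

    # Minimal small-model test on the Pi 5 (pip install llama-cpp-python).
    # The GGUF path is a placeholder; any 1-3B q4 model behaves similarly.
    from llama_cpp import Llama

    llm = Llama(
        model_path="/models/tiny-3b-q4_K_M.gguf",  # placeholder path
        n_ctx=2048,    # modest context keeps RAM use low on an 8 GB Pi
        n_threads=4,   # one thread per Cortex-A76 core
    )
    out = llm("Turn off the kitchen lights.", max_tokens=32)
    print(out["choices"][0]["text"])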

Hybrid solves both. The Pi 5 sips 5-8 W and runs the Tailscale-exposed voice frontend, Whisper STT, and a routing layer. When a query needs the big model, the Pi 5 triggers wake-on-LAN, the 5800X runs the inference, and the result comes back to the Pi. The desktop spends 95% of its life in standby.

Key Takeaways

  • The Pi 5 is the orchestrator, not the inference engine — voice frontend, Whisper STT, intent routing, Tailscale gateway.
  • The Ryzen 7 5800X + RTX 3060 12GB delivers 18-22 tok/s on Qwen 3.6 35B A3B at q4 quants.
  • Total stack cost: ~$1,150 (Pi 5 8GB $80 + 5800X $230 + RTX 3060 12GB $280 + board/RAM/PSU/case ~$560).
  • Power: Pi 5 always-on at ~6 W average; the desktop wakes on demand and draws 110-220 W only while awake.
  • Quantization at q4_K_M is the sweet spot for 35B models on 12 GB VRAM.
  • Wake-on-LAN + Tailscale = private always-available LLM endpoint.

What roles does each device play in a hybrid local LLM setup?

The Pi 5 8GB runs three persistent services:

  1. Whisper STT — the small.en model, processing 16 kHz audio at roughly 0.4× realtime on the Pi 5's Cortex-A76 cores.
  2. Tailscale node — gives you a private endpoint reachable from your phone over the Tailscale mesh.
  3. Routing layer — a small Python service that classifies each incoming query (small/local-only vs needs-the-big-model), wakes the desktop via WoL when needed, and forwards the query.
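
The Whisper piece needs almost no glue code. A minimal sketch, assuming a 16 kHz mono WAV captured by your mic pipeline (the filename is a placeholder):

    # Transcribe a voice command on the Pi (pip install openai-whisper).
    import whisper

    model = whisper.load_model("small.en")    # downloads the checkpoint on first run
    result = model.transcribe("command.wav")  # placeholder path to a 16 kHz mono WAV
    print(result["text"])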

The Ryzen 7 5800X + RTX 3060 12GB handles:

  1. Ollama on the desktop — exposes a local OpenAI-compatible endpoint on port 11434.
  2. The actual LLM inference — Qwen 3.6 35B A3B q4_K_M, Llama 3.1 8B for fast responses, Phi-3 Mini for tool use.
  3. Vector store — for RAG queries against your Obsidian vault, codebase, or local docs.
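
Because Ollama exposes a plain HTTP API, the Pi's router can forward queries with nothing more exotic than requests. A minimal sketch, assuming a placeholder Tailscale IP for the desktop and the model tag pulled in the setup steps below:

    # Forward a query from the Pi to the desktop's Ollama endpoint.
    import requests

    DESKTOP = "http://100.64.0.2:11434"  # placeholder Tailscale IP for the 5800X

    resp = requests.post(
        f"{DESKTOP}/api/generate",
        json={
            "model": "qwen3:35b-a3b-q4_K_M",  # tag from the setup steps below
            "prompt": "Explain wake-on-LAN in two sentences.",
            "stream": False,  # return one JSON object instead of a token stream
        },
        timeout=120,
    )
    print(resp.json()["response"])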

This split is why the Ryzen 5800X local-LLM pairing wins. The 5800X's eight Zen 3 cores feed the RTX 3060 efficiently; the CPU is rarely the bottleneck on 30B-class models once the GPU-offloaded layers stay resident in VRAM.

Spec delta table: Pi 5 8GB vs Ryzen 7 5800X + RTX 3060

Component | Spec | Idle Power | Inference Power | Cost
Raspberry Pi 5 8GB | 4× A76 @ 2.4 GHz, 8 GB LPDDR4X | 4 W | 6-8 W (Whisper small) | $80
AMD Ryzen 7 5800X | 8C/16T, Zen 3, 4.7 GHz boost | n/a | 70-95 W (CPU-only inference) | $230
ZOTAC RTX 3060 12GB | 3,584 CUDA cores, 12 GB GDDR6 | 14 W | 130-170 W (GPU offload inference) | $280
MSI RTX 3060 Ventus 12GB | 3,584 CUDA cores, 12 GB GDDR6 | 14 W | 130-170 W (GPU offload inference) | $280

The 12 GB of VRAM is what makes the RTX 3060 a serious local-LLM card despite its consumer-tier rasterization performance. A ~21 GB q4_K_M file can't live entirely on the card, but because Qwen 3.6 35B A3B is a sparse MoE with only ~3B parameters active per token, 12 GB comfortably holds the attention layers, shared weights, and KV cache while the expert weights sit in system RAM; neither the 8 GB RTX 3070 nor the 8 GB RTX 4060 has the headroom to pull that off.

Benchmark table: tok/s on Qwen 3.6 35B A3B across configs

Config | Quant | Context | Tokens/sec | Source
Pi 5 8GB (CPU only) | q4_K_M | 4K | 0.8-1.4 | LocalLLaMA threads, May 2026
Ryzen 5800X (CPU only, no GPU) | q4_K_M | 4K | 4-7 | LocalLLaMA Zen 3 thread
Ryzen 5800X + RTX 3060 12GB (GPU offload) | q4_K_M | 4K | 18-22 | r/LocalLLaMA RTX 3060 Qwen post
Ryzen 5800X + RTX 3060 12GB (GPU offload) | q5_K_M | 4K | 14-17 | r/LocalLLaMA
Ryzen 5800X + RTX 4090 (GPU offload) | q5_K_M | 8K | 55-70 | r/LocalLLaMA

The 18-22 tok/s on the RTX 3060 12GB at q4_K_M is the headline number. That's faster than typing speed and roughly 2× the throughput reported for the same model on an M2 Pro Mac Mini; the RTX 3060 12GB hits a sweet spot for Pi 5 + Ollama hybrid setups that smaller-VRAM cards cannot.

Quantization matrix: q4/q5/q6/q8 on Pi 5 vs 5800X vs 3060 12GB

Quant | Model size | Pi 5 fits? | 5800X CPU fits? | RTX 3060 12GB fits?
q4_K_M | ~21 GB | No | Yes (system RAM) | Partial offload (shared layers + KV cache in VRAM, experts in RAM)
q5_K_M | ~25 GB | No | Yes | Partial offload
q6_K | ~29 GB | No | Yes | Partial offload, slower
q8_0 | ~37 GB | No | Yes | CPU+GPU split, impractical

For a 12 GB VRAM card with a 35B model, q4_K_M and q5_K_M are the practical choices. q4_K_M loses about 2-3% on perplexity vs q8_0 — imperceptible for chat, slightly more visible for code generation. q5_K_M is the Goldilocks pick for users who want quality first.
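
The model-size column is straightforward bits-per-weight arithmetic. A quick sanity check, assuming typical effective bits per weight for llama.cpp quants (the exact figures vary slightly by architecture):

    # Approximate file sizes for a 35B-parameter model at common quants.
    # Effective bits/weight are ballpark llama.cpp figures, not exact.
    PARAMS = 35e9
    for quant, bits in [("q4_K_M", 4.85), ("q5_K_M", 5.7),
                        ("q6_K", 6.6), ("q8_0", 8.5)]:
        print(f"{quant}: ~{PARAMS * bits / 8 / 1e9:.0f} GB")
    # Prints roughly 21, 25, 29, and 37 GB, matching the table above.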

Why offload routing + voice frontend to the Pi while inference runs on the desktop

The 24/7 path. The Pi 5 idles at 4 W; run it 365 days a year and it costs about $4 in electricity. The Ryzen 5800X desktop with the RTX 3060 idles at 80-110 W; leave it on year-round and you pay $90-130 in power for an LLM you use 30 minutes a day.

WoL solves this. The Pi 5 listens for a wake intent (a Whisper-transcribed voice command, a Discord message, a Home Assistant trigger), sends a magic packet to the desktop's MAC, and the desktop resumes from S3 suspend in 5-15 seconds (a cold boot from S5 takes longer). Inference runs, the response comes back, and the desktop sleeps again after a 10-minute idle window.
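
There's no magic in the magic packet: it's 6 bytes of 0xFF followed by the target MAC repeated 16 times, sent over broadcast UDP. A minimal standard-library sketch of what the Pi fires off (the MAC is a placeholder):

    # Send a wake-on-LAN magic packet from the Pi (standard library only).
    import socket

    def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
        payload = bytes.fromhex("FF" * 6 + mac.replace(":", "") * 16)
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
            s.sendto(payload, (broadcast, port))

    wake("aa:bb:cc:dd:ee:ff")  # placeholder MAC for the 5800X desktop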

This is also where the Pi's role expands beyond Whisper. The Pi can host a small intent-classification model (Phi-3 Mini, ~2.5 GB) that decides:

  • "Set a timer for 5 minutes" → handle locally on the Pi (no desktop wake).
  • "Summarize this PDF" → wake desktop, route to Qwen 3.6.
  • "Run this code" → wake desktop, route to a code model.

The Pi 5 has no NPU, which caps its model ceiling at ~3B parameters at q4 (4-8 tok/s usable), but latency on these routing decisions is under 200 ms, so the user experience is "instant" for short queries and a "5-second delay" for big-model queries.

Step-by-step: Ollama on 5800X, Pi 5 as Tailscale gateway + Whisper STT

  1. Install Ollama on the Ryzen 5800X desktop: curl -fsSL https://ollama.com/install.sh | sh. Pull Qwen 3.6 35B A3B: ollama pull qwen3:35b-a3b-q4_K_M. Verify GPU offload by running nvidia-smi while inference is active; VRAM use should sit around 10.5 GB.
  2. Install Tailscale on both devices. Single sign-on with the same account. The desktop and Pi 5 will see each other on a 100.x.x.x mesh IP.
  3. Configure WoL on the desktop. Enable WoL in BIOS, then enable it in Linux with ethtool -s eth0 wol g (persist this via a systemd unit or your distro's network config, since it can reset on reboot). Test from the Pi: wakeonlan <desktop_mac>.
  4. Set up Whisper on the Pi. pip install openai-whisper. Use the small.en or base.en model — small.en is the practical sweet spot for accuracy/speed.
  5. Write a tiny FastAPI router on the Pi. Endpoints: /transcribe (audio in, text out via Whisper) and /ask (text in, route to local Phi-3 or wake-and-forward to desktop Ollama). A minimal sketch follows this list.
  6. Expose the FastAPI router via Tailscale. Set tailscale serve on port 8080. Now your phone, laptop, or any other Tailscale device can hit http://pi5.tailnet:8080/ask from anywhere.
  7. Tune the wake/sleep policy. Set the desktop to sleep after 10 min idle. Add a Pi cron job that pings the desktop nightly to verify WoL is still functional.
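
Here is the minimal /ask sketch promised in step 5. The keyword heuristic stands in for the Phi-3 intent classifier, the Tailscale IP and MAC are placeholders, and it leans on the wakeonlan CLI installed in step 3:

    # Minimal Pi-side router (pip install fastapi uvicorn requests).
    import subprocess
    import time

    import requests
    from fastapi import FastAPI

    app = FastAPI()
    DESKTOP = "http://100.64.0.2:11434"  # placeholder Tailscale IP for the 5800X
    BIG_MODEL_HINTS = ("summarize", "review", "explain", "write", "debug")

    @app.post("/ask")
    def ask(body: dict) -> dict:
        query = body["query"]
        # Small intents (timers, Home Assistant triggers) never wake the desktop.
        if not any(hint in query.lower() for hint in BIG_MODEL_HINTS):
            return {"answer": "handled locally on the Pi"}
        # Wake the 5800X, then poll until Ollama responds (~15 s from S3).
        subprocess.run(["wakeonlan", "aa:bb:cc:dd:ee:ff"], check=False)
        for _ in range(30):
            try:
                requests.get(f"{DESKTOP}/api/tags", timeout=1)
                break
            except requests.exceptions.ConnectionError:
                time.sleep(1)
        r = requests.post(f"{DESKTOP}/api/generate", json={
            "model": "qwen3:35b-a3b-q4_K_M", "prompt": query, "stream": False,
        }, timeout=300)
        return {"answer": r.json()["response"]}

Run it with uvicorn router:app --host 0.0.0.0 --port 8080, then point step 6's tailscale serve at that port.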

The result is a private, always-available AI assistant accessible from anywhere on your Tailscale mesh, costing under $2/month in electricity versus the $10+/month of leaving the GPU PC on 24/7.

Power draw + perf-per-watt math

Device | Idle (W) | Inference active (W) | Cost/year ($0.13/kWh)
Pi 5 always on | 4 | 6 | $4.50
5800X + 3060 always on | 110 | 220 | $125
5800X + 3060 hybrid (active 30 min/day) | 110 for ~30 min, then sleep 23.5 hr | 220 for ~30 min/day | $13

Hybrid wins by roughly $110/year for typical use. Over five years of running this stack, that's about $550 in electricity savings, a meaningful chunk of a future GPU upgrade.
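
The yearly figures fall out of duty-cycle arithmetic. A quick check, assuming $0.13/kWh, the table's wattages, and roughly 5 W in S3 standby (that standby figure is an assumption; measure your own board):

    # Annual electricity cost for each scenario at $0.13/kWh.
    RATE = 0.13  # dollars per kWh

    def yearly_cost(watts: float, hours_per_day: float) -> float:
        return watts / 1000 * hours_per_day * 365 * RATE

    pi = yearly_cost(4, 24)              # ~$4.50, matches the table
    always_on = yearly_cost(110, 24)     # ~$125 idle floor for the desktop
    hybrid = (yearly_cost(220, 0.5)      # 30 min of inference per day
              + yearly_cost(110, 0.5)    # 30 min idle before suspend
              + yearly_cost(5, 23))      # assumed ~5 W in S3 the rest of the day
    print(f"Pi ${pi:.0f}/yr, always-on ${always_on:.0f}/yr, hybrid ${hybrid:.0f}/yr")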

Bottom line — when to add a 3060 vs upgrade to a 5090

For a home LLM workstation under $1,200 in 2026, the Pi 5 + Ryzen 7 5800X + RTX 3060 12GB stack is the cheapest meaningful path to running 30B-class models. The 18-22 tok/s on Qwen 3.6 35B A3B is the speed where chat feels instant and code review feels practical.

Upgrade paths from here:

  • More VRAM — RTX 4060 Ti 16GB ($450) to keep more of a 35B q5_K_M resident in VRAM; RTX 4090 24GB ($1,500-1,700 used) for 70B-class models.
  • More CPU — Ryzen 7 7700X (AM5) for 15-20% IPC uplift; only worth it on a fresh build, not a swap.
  • Add a second GPU — dual 3060 12GB ($560 total) for 24 GB combined VRAM via tensor parallelism in vLLM, but Ollama's multi-GPU support is rougher.

For a single-creator home setup, this hybrid is the right answer through at least 2027 — Qwen 3.6, Llama 3.1, and Phi-3 all run cleanly, and the Pi 5's voice frontend makes it the always-available assistant the desktop alone never quite was.

Citations and sources

  • r/LocalLLaMA Qwen 3.6 35B A3B benchmark threads, May 2026: https://reddit.com/r/LocalLLaMA
  • Ollama documentation, model library: https://ollama.com/library
  • llama.cpp benchmark methodology: https://github.com/ggerganov/llama.cpp/blob/master/examples/benchmark/README.md
  • Raspberry Pi 5 official spec sheet: https://www.raspberrypi.com/products/raspberry-pi-5/
  • TechPowerUp RTX 3060 12GB review and VRAM analysis: https://www.techpowerup.com/review/nvidia-geforce-rtx-3060-12-gb/

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

— SpecPicks Editorial · Last verified 2026-05-09