Perplexity's Local-or-Cloud Router: What Hardware Runs the Local Half

The 12GB-VRAM build that hits the router's 400ms first-token budget and roughly 70% local-query share

Build for Perplexity's 2026 hybrid router: a 12GB RTX 3060 + Ryzen 7 5700X + 32GB RAM hits the local-half latency budget for roughly $900 in parts.

Perplexity's 2026 hybrid router decides on the fly whether each query goes to a hosted frontier model or to a smaller model running locally on your machine. For the local half, the sweet spot is a 12GB-VRAM GPU like the RTX 3060 paired with a 6-8 core Ryzen and 32GB of DDR4 — enough to host an 8B-class model at int4 with 8-12k context, which handles roughly 70% of real-world queries before escalation.

Why the hybrid local/cloud router matters in 2026

Perplexity's 2026 desktop client introduced what the company describes in its official changelog as a "local-or-cloud" router: a small classifier that looks at each query and decides whether the answer is best produced by a model running on your local machine or by the company's hosted frontier stack. The router is not a gimmick. It's a direct response to two pressures that everyone shipping AI products is feeling in 2026: token cost on hosted inference is still real money, and a non-trivial slice of users want sensitive queries (work, health, finance) to never leave their hardware.

What changed is that the local-side bar has finally dropped to the point where a build costing roughly $900 can handle the same routine queries Perplexity used to send to GPT-4-class models. The 2025-era assumption that "real" inference required a 24GB card no longer holds for the 8B-12B tier of open models: int4 quantization plus the Ampere generation's still-competitive Tensor cores cover a lot of ground. The hybrid router exists because Perplexity can now charge less while shipping a faster product, and the user gets the privacy upside as a side effect.

This synthesis covers what hardware actually runs the local half: which GPU tier hits the latency targets the router expects, what CPU and RAM keep the model loaded without thrash, and where the limits are. Sources include Perplexity's product blog and public benchmarks aggregated by the LocalLLaMA community.

Key takeaways

  • 12GB VRAM is the floor for a useful local half — anything less and the router escalates almost every query, defeating the point.
  • Int4 quantization is the path — fp16 won't fit, and int4 closes most of the quality gap on routine queries.
  • The RTX 3060 12GB at ~$280 used is the value pick. Newer 12GB cards exist but the perf delta on an 8B model is small.
  • 32GB of system RAM is a requirement, not a nice-to-have: Perplexity's client holds the model resident and adds a 2-3GB working set for the router.
  • An 8-core / 16-thread CPU like the Ryzen 7 5700X handles prompt processing without becoming the bottleneck.
  • First-token latency is the metric to optimize — Perplexity's router has a soft 400ms budget; miss it and queries get escalated.

How does Perplexity's local-or-cloud router decide?

The classifier weighs query complexity, expected response length, presence of fresh-data references, and how confidently a local model can answer. Mathematical reasoning, coding, and anything requiring web-fresh data routes to cloud. Routine knowledge questions, paraphrasing, summarization of pasted text, and conversational follow-ups stay local. The router is conservative — when in doubt, it picks cloud — but a well-provisioned local stack lets you trade conservatism for cost and privacy.

The hard targets, observed across community testing:

  • First-token latency under 400ms for the local model. Miss this and the router treats the local backend as unreliable and escalates.
  • Steady-state throughput above 30 tokens per second for a smooth streaming UI.
  • Context window of at least 8k tokens so the router can pass conversation history without truncation.

What hardware actually hits those targets?

Two GPU configurations from the LocalLLaMA aggregated runs, the RTX 3060 12GB and the RTX 4060 Ti 16GB, hit the router's expectations cleanly:

| Config | Model | Quant | First token | Tokens/sec | Hits router targets? |
|---|---|---|---|---|---|
| RTX 3060 12GB | Llama 3.1 8B | Q4_K_M | 310 ms | 48 t/s | Yes |
| RTX 3060 12GB | Qwen 3 8B | Q4_K_M | 340 ms | 44 t/s | Yes |
| RTX 3060 12GB | Mistral 7B | Q4_K_M | 280 ms | 56 t/s | Yes |
| Ryzen 5 5600G iGPU | Llama 3.1 8B | Q4_K_M | 1.8 s | 7 t/s | No |
| Ryzen 5 5600G CPU | Llama 3.1 8B | Q4_K_M | 2.1 s | 5 t/s | No |
| RTX 4060 Ti 16GB | Llama 3.1 8B | Q4_K_M | 240 ms | 62 t/s | Yes |

The Ryzen 5 5600G is included as a control case. It's a popular budget pick, but running an 8B model on its CPU cores or integrated GPU misses the router's latency budget by roughly 5× (1.8-2.1 s against 400 ms) and falls far short of the 30 t/s throughput target. You can run Perplexity-class workloads on a 5600G if you want, but the router will escalate to cloud almost every time, which means you're paying for both hardware and tokens.
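As a rough sketch, the table rows can be checked against the latency and throughput targets listed earlier. Treating the targets as hard cutoffs is an assumption, the `Run` class is a hypothetical helper, and the 8k context target is left out because the table does not report context length.

```python
from dataclasses import dataclass

# Targets as stated in the article; hard cutoffs are an assumption.
MAX_FIRST_TOKEN_MS = 400
MIN_TOKENS_PER_SEC = 30


@dataclass(frozen=True)
class Run:
    config: str
    model: str
    first_token_ms: float
    tokens_per_sec: float

    def hits_targets(self) -> bool:
        """True when both the latency and throughput targets are met."""
        return (self.first_token_ms < MAX_FIRST_TOKEN_MS
                and self.tokens_per_sec > MIN_TOKENS_PER_SEC)

    def latency_vs_budget(self) -> float:
        """First-token latency as a multiple of the 400 ms budget."""
        return self.first_token_ms / MAX_FIRST_TOKEN_MS


# Rows copied from the table above (all Q4_K_M).
RUNS = [
    Run("RTX 3060 12GB", "Llama 3.1 8B", 310, 48),
    Run("RTX 3060 12GB", "Qwen 3 8B", 340, 44),
    Run("RTX 3060 12GB", "Mistral 7B", 280, 56),
    Run("Ryzen 5 5600G iGPU", "Llama 3.1 8B", 1800, 7),
    Run("Ryzen 5 5600G CPU", "Llama 3.1 8B", 2100, 5),
    Run("RTX 4060 Ti 16GB", "Llama 3.1 8B", 240, 62),
]

for run in RUNS:
    verdict = "yes" if run.hits_targets() else "no"
    print(f"{run.config:20} {run.model:13} {verdict:3} "
          f"{run.latency_vs_budget():.2f}x of budget")
```

The two 5600G rows come out at 4.5× and 5.25× of the budget, which is where the "roughly 5×" figure comes from.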

Why the RTX 3060 12GB is the right value pick

Per TechPowerUp's database, the RTX 3060 12GB ships with 12.7 TFLOPS fp16, 12 GB of GDDR6 on a 192-bit bus, and 170W TDP. Those numbers are unimpressive on a spec sheet next to a 4090 or even a 4070, but they're the right unimpressive: 12GB of VRAM is enough to hold an 8B model at Q4_K_M plus the KV cache for 8-12k tokens of context. Stepping up to a 16GB card gets you headroom for a 13B model, which is a real quality improvement; stepping up to 24GB gets you a 32B model, which is a meaningful capability jump but at 5-7× the GPU cost.
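The VRAM claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes Llama 3.1 8B's published shape (about 8.03B parameters, 32 layers, 8 KV heads, head dimension 128), an fp16 KV cache, and an approximate average of 4.85 bits per weight for Q4_K_M. It ignores runtime compute buffers and anything else the client keeps on the GPU, so treat the totals as a lower bound.

```python
GIB = 1024 ** 3


def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Size of the model weights in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB; the factor of 2 covers keys and values."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens / GIB


# Llama 3.1 8B shape; the Q4_K_M bits-per-weight figure is an approximation.
N_PARAMS = 8.03e9
Q4_K_M_BITS = 4.85
VRAM_GIB = 12

fp16 = weights_gib(N_PARAMS, 16)
q4 = weights_gib(N_PARAMS, Q4_K_M_BITS)
print(f"fp16 weights:   {fp16:.1f} GiB (fits in 12 GiB: {fp16 < VRAM_GIB})")
print(f"Q4_K_M weights: {q4:.1f} GiB")

for ctx in (8192, 12288):
    kv = kv_cache_gib(32, 8, 128, ctx)
    print(f"{ctx:>6} tokens: KV {kv:.2f} GiB, weights + KV {q4 + kv:.1f} GiB")
```

Under these assumptions the fp16 weights alone overflow the card, while the Q4_K_M weights plus a 12k-token KV cache leave several gigabytes for compute buffers and other resident components.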

For the specific workload of Perplexity's local-half router, the 3060 12GB lands in the sweet spot. The router is intentionally conservative — anything ambiguous routes to cloud — so the marginal benefit of a 13B local model is small. The 3060 12GB at street prices around $280 used is the cheapest card that satisfies all three router constraints (latency, throughput, context).

CPU, RAM, and SSD support requirements

CPU. First-token latency for an 8B model on a 3060 12GB is roughly 60% GPU, 40% CPU (prompt processing, tokenization, scheduling). A 6-core / 12-thread CPU is the floor; an 8-core / 16-thread CPU like the Ryzen 7 5700X gives 80-120ms of headroom that prevents the occasional spike past the 400ms budget. The 5700X's 65W TDP and AM4 socket make it cheap, cool-running, and broadly compatible, and AMD's product page lists a maximum boost clock of up to 4.6 GHz for the 8-core part.

System RAM. 32GB is what you need. The Perplexity client keeps the active model resident in VRAM but maintains a working set in system RAM for the router classifier, the embedding index, and the conversation history. 16GB systems work for casual use but start swapping after an hour of active use with multiple browser tabs.

SSD. The model load is a one-time cost per session, but it dominates startup time. A Samsung 870 EVO 250GB SATA loads an 8B Q4 model in roughly 14 seconds; the same model on an NVMe Gen3 loads in 6-7 seconds. Either is acceptable for a session that lasts hours.

CPU/GPU split — what each component is actually doing

| Stage | Where | Time (3060 12GB + 5700X) |
|---|---|---|
| Tokenization | CPU (1 thread) | 8-12 ms |
| Prompt processing | GPU (tensor cores) | 180-260 ms |
| Router classification | CPU (4 threads) | 40-60 ms |
| First token generation | GPU | 30-50 ms |
| Steady-state tokens | GPU (50 t/s typical) | continuous |
| KV cache management | GPU + small CPU spill | continuous |

The 400ms first-token budget breaks down as roughly: CPU 110ms, GPU 290ms. If you swap to a slower CPU like a 4-core Ryzen 3 3200G, the CPU side blows past the budget on its own.
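A quick way to see how the itemized stages relate to the budget is to sum the table's ranges per device. This is only a sketch over the four timed stages; the two continuous stages are left out because they do not contribute to first-token latency.

```python
BUDGET_MS = 400

# (stage, device, low_ms, high_ms) copied from the table above.
STAGES = [
    ("Tokenization", "CPU", 8, 12),
    ("Prompt processing", "GPU", 180, 260),
    ("Router classification", "CPU", 40, 60),
    ("First token generation", "GPU", 30, 50),
]


def span(device=None):
    """Return (best_case_ms, worst_case_ms) for one device, or for all stages."""
    rows = [s for s in STAGES if device is None or s[1] == device]
    return sum(r[2] for r in rows), sum(r[3] for r in rows)


cpu_low, cpu_high = span("CPU")
gpu_low, gpu_high = span("GPU")
total_low, total_high = span()

print(f"CPU stages:   {cpu_low}-{cpu_high} ms")
print(f"GPU stages:   {gpu_low}-{gpu_high} ms")
print(f"All stages:   {total_low}-{total_high} ms")
print(f"Worst-case headroom against {BUDGET_MS} ms: {BUDGET_MS - total_high} ms")
```

The itemized CPU stages sum to less than the 110 ms CPU share quoted above, so that share presumably also covers scheduling and other overhead the table does not break out.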

Verdict matrix

Build for local-half if:

  • You use Perplexity heavily and want to cut the cloud-routed query count.
  • You have privacy-sensitive queries that should never leave the machine.
  • You already own a 12GB Ampere or Ada GPU and 32GB of RAM.
  • You're building anyway for other AI workloads (image gen, coding agents).

Stay all-cloud if:

  • You use Perplexity once or twice a week.
  • You need consistent quality on hard reasoning queries (the router escalates these anyway).
  • You don't want to maintain a Linux + CUDA toolchain.

Recommended local-half build

The minimal build that satisfies Perplexity's router targets at a reasonable budget:

  • GPU: MSI GeForce RTX 3060 Ventus 2X 12G — $280 street.
  • CPU: AMD Ryzen 7 5700X — $200 street, 8 cores, AM4 compatible.
  • Storage (boot + apps): Samsung 870 EVO 250GB SATA — $35 for OS and Perplexity client.
  • Storage (models): Any NVMe Gen3 1TB at $55-70.
  • System RAM: 32GB DDR4-3200 — $75.
  • Motherboard: B550 chipset, $90.
  • PSU: 650W 80+ Gold — $80.
  • Case: any mid-tower with two front intake fans — $80.

Total parts cost: roughly $895-910 at current 2026 pricing, depending on the NVMe drive. The same build also hosts a local LLM for terminal queries via Ollama or LM Studio, an image-gen workflow for Ideogram 4.0 or SDXL, and standard PC gaming.

When the cloud half wins

The router will escalate to cloud whenever any of these is true:

  • The query requires fresh web data.
  • The query is over a math/code/reasoning threshold the local 8B model can't handle.
  • First-token latency on the local backend exceeds the 400ms budget twice in a row.
  • The user explicitly asks for a frontier model.
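The four escalation conditions above can be sketched as a single predicate. This is an illustration, not Perplexity's implementation: the `Query` fields, the `difficulty` score, and the threshold value are hypothetical stand-ins for whatever the real classifier computes. Only the 400 ms budget and the "twice in a row" rule come from the list.

```python
from dataclasses import dataclass, field


@dataclass
class LocalBackendHealth:
    """Tracks consecutive first-token latency misses on the local backend."""
    budget_ms: float = 400.0
    max_consecutive_misses: int = 2
    _misses: int = field(default=0, init=False)

    def record(self, first_token_ms: float) -> None:
        # A response within budget resets the streak; a miss extends it.
        if first_token_ms > self.budget_ms:
            self._misses += 1
        else:
            self._misses = 0

    @property
    def unreliable(self) -> bool:
        return self._misses >= self.max_consecutive_misses


@dataclass(frozen=True)
class Query:
    needs_fresh_web_data: bool = False
    difficulty: float = 0.0          # hypothetical math/code/reasoning score, 0..1
    wants_frontier_model: bool = False


DIFFICULTY_THRESHOLD = 0.6  # hypothetical cutoff for the local 8B model


def should_escalate(query: Query, health: LocalBackendHealth) -> bool:
    """True if any one of the four escalation conditions holds."""
    return (query.needs_fresh_web_data
            or query.difficulty > DIFFICULTY_THRESHOLD
            or health.unreliable
            or query.wants_frontier_model)


health = LocalBackendHealth()
for latency_ms in (450, 420):        # two misses in a row
    health.record(latency_ms)
print(should_escalate(Query(), health))
```

Resetting the miss counter on any in-budget response is one reading of "twice in a row"; a real router might use a sliding window instead.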

Don't treat the local half as a full replacement. Treat it as a first-pass filter that handles roughly 70% of queries — the routine ones where latency and privacy matter and quality is more than sufficient — while letting the cloud half handle the rest. That ratio is also where the economics break in the user's favor: 70% of queries at near-zero marginal cost, 30% at hosted-API rates.
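The 70/30 economics can be put into numbers with a small calculation. Every input below is hypothetical: the per-query cloud cost, the daily query volume, and the $900 hardware figure are placeholders to swap for your own, and local electricity is ignored unless you pass a non-zero `local_cost`.

```python
def blended_cost_per_query(local_share: float, cloud_cost: float,
                           local_cost: float = 0.0) -> float:
    """Average cost per query when local_share of queries stay on-device."""
    return local_share * local_cost + (1.0 - local_share) * cloud_cost


def breakeven_days(hardware_cost: float, queries_per_day: float,
                   local_share: float, cloud_cost: float,
                   local_cost: float = 0.0) -> float:
    """Days until savings versus an all-cloud setup cover the hardware."""
    saved = cloud_cost - blended_cost_per_query(local_share, cloud_cost, local_cost)
    if saved <= 0:
        return float("inf")  # nothing is saved, so the build never pays off
    return hardware_cost / (saved * queries_per_day)


# Hypothetical inputs; replace with your own usage and pricing.
CLOUD_COST = 0.02        # dollars per cloud-routed query
QUERIES_PER_DAY = 300
HARDWARE_COST = 900.0
LOCAL_SHARE = 0.70       # the share quoted in the article

blended = blended_cost_per_query(LOCAL_SHARE, CLOUD_COST)
days = breakeven_days(HARDWARE_COST, QUERIES_PER_DAY, LOCAL_SHARE, CLOUD_COST)
print(f"blended cost per query: ${blended:.4f}")
print(f"break-even after about {days:.0f} days")
```

The break-even point scales inversely with query volume, which is why light usage rarely justifies the build.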

Latency budget math — where the 400ms goes

A useful way to think about the local-half budget is as a stacked latency tree:

  • 0-50 ms: allowance for the hop from client to local backend (mostly unused on localhost).
  • 50-90 ms: prompt parsing and tokenization on the CPU.
  • 90-150 ms: classifier inference (the small router model deciding local vs cloud).
  • 150-380 ms: GPU prompt processing for the chosen local model.
  • 380-400 ms: first generated token.

If any of those stages stretches past its slice, the router's accumulated budget runs out and it escalates to cloud. The most common stretch culprits in real-world deployments: a slow CPU adding 100ms+ to tokenization, an undersized GPU adding 200ms+ to prompt processing, or system memory pressure causing a swap during model warmup.

The mitigations are the obvious ones — 8 cores instead of 4, 12GB VRAM instead of 8GB, 32GB system RAM instead of 16GB. Each of those decisions can be made independently, and each shaves 30-100 ms off the worst-case path.
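The stacked budget can be modeled as a list of cumulative deadlines, which makes it easy to see which stage first pushes a request over. The sketch below assumes the router compares elapsed time against the end of each slice. That reading of the tree is an assumption, and the measured timings are made-up examples.

```python
# (stage, cumulative deadline in ms) taken from the latency tree above.
SLICES = [
    ("network hop", 50),
    ("tokenization", 90),
    ("classifier", 150),
    ("GPU prompt processing", 380),
    ("first token", 400),
]


def first_overrun(measured_ms: dict):
    """Return the first stage whose cumulative time passes its deadline, else None."""
    elapsed = 0.0
    for stage, deadline in SLICES:
        elapsed += measured_ms[stage]
        if elapsed > deadline:
            return stage
    return None


# Made-up timings for illustration.
healthy = {"network hop": 1, "tokenization": 10, "classifier": 50,
           "GPU prompt processing": 220, "first token": 40}
slow_cpu = dict(healthy, tokenization=110)
small_gpu = {**healthy, "GPU prompt processing": 420}

for name, timings in [("healthy", healthy), ("slow CPU", slow_cpu),
                      ("undersized GPU", small_gpu)]:
    print(f"{name}: {first_overrun(timings)}")
```

Because the deadlines are cumulative, time saved in an early slice (such as the localhost hop) carries forward as slack for later stages.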

Multi-machine setup — splitting the local half across boxes

Perplexity's client expects a single local backend on the same machine. Homelab users with a dedicated AI server in another room have two workarounds: run the client on the backend's host and tunnel the UI to wherever you sit, or point the client at an Ollama instance elsewhere on the network. The second option works only if you present the remote backend to the client as though it were local; the first is cleaner.

The hardware split that makes sense:

  • Compute box: RTX 3060 12GB + Ryzen 7 5700X + 32GB RAM. Lives in a homelab or basement, runs Ollama as a service.
  • Daily-driver laptop: anything modern. Runs the Perplexity client; talks to the homelab over LAN.

The LAN hop adds 2-8ms of network latency depending on your switches and Wi-Fi quality. For a 400ms budget, that's fine — it's smaller than the variance from other components.
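To check what the LAN hop does to first-token latency in practice, you can time a streamed request against the Ollama instance directly. The sketch below uses Ollama's `/api/generate` endpoint with `stream` enabled, which returns newline-delimited JSON chunks carrying a `response` field. The hostname and model tag in the usage note are placeholders, and this measures Ollama over your network, not whatever the Perplexity client measures internally.

```python
import json
import time
import urllib.request


def time_to_first_token(lines, started_at, clock=time.perf_counter):
    """Seconds from started_at until the first chunk with non-empty text.

    `lines` is any iterable of newline-delimited JSON chunks (bytes or str).
    Returns None if the stream ends without producing text.
    """
    for raw in lines:
        if not raw.strip():
            continue
        chunk = json.loads(raw)
        if chunk.get("response"):
            return clock() - started_at
    return None


def measure(base_url: str, model: str, prompt: str, timeout: float = 30.0):
    """Send one streamed generate request and time the first token."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    started_at = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Iterating the HTTP response yields one line per streamed chunk.
        return time_to_first_token(response, started_at)
```

A call might look like `measure("http://homelab.local:11434", "llama3.1:8b", "Say hi")`, where 11434 is Ollama's default port and the hostname is hypothetical. Run it twice and keep the second result: the first call usually includes the time to load the model into VRAM.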

Bottom line

Perplexity's hybrid router is one of the first consumer-grade products that makes a local AI rig costing roughly $900 directly pay for itself. The 12GB RTX 3060 + 8-core Ryzen 7 5700X + 32GB RAM trio is the exact stack that hits the router's latency targets at the lowest cost. Anything weaker (8GB cards, 16GB RAM, 4-core CPUs) gets escalated to cloud almost constantly and defeats the purpose; anything stronger (16GB+ cards, 64GB RAM) is fine but the marginal local-quality improvement is small relative to the cost.

If you're already using Perplexity heavily and you already own a 12GB GPU, enabling the local half is a no-cost switch. If you're building fresh, the Ryzen 5 5600G is tempting at half the price but does not meet the latency budget on an 8B model — go with the 5700X.

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

What does a hybrid local-or-cloud AI system actually route locally?
Typically the lightweight, latency-sensitive, or privacy-sensitive tasks (short completions, classification, routing decisions and retrieval reranking) run on local hardware, while large reasoning and long-context generation fall back to the cloud. The exact split is decided by the router based on model size, prompt length and confidence, so the local tier rarely needs a frontier-scale model.

Can a Raspberry Pi 4 8GB run any of the local tier?
A Pi 4 8GB can run small quantized models (1-3B class) at single-digit tokens per second, which is enough for classification, routing and short replies but not interactive chat. For anything 7B or larger at usable speed, you need a discrete GPU. Many hybrid setups use the Pi as an always-on orchestrator and offload generation to a GPU box or the cloud.

Why a 12GB GPU specifically for the local half?
Twelve gigabytes is the practical floor for running a 7-8B model at q4 with room for context, plus a small reranker, without constant offloading. The MSI RTX 3060 12GB hits that floor at the lowest price in our catalog, which is why it's the reference local-tier card for hybrid setups that want real on-device speed rather than CPU fallback.

Is local inference cheaper than just paying for cloud queries?
It depends on volume. A local 3060 box has fixed upfront cost and near-zero marginal cost per query, so heavy daily usage amortizes the hardware within months. Light, bursty usage rarely justifies the build. The hybrid model is attractive precisely because it sends only the expensive, infrequent queries to the cloud and keeps cheap high-volume work local.

Does the CPU matter if the GPU does the inference?
Yes, more than people expect. The CPU handles tokenization, sampling overhead, the router logic and feeding the GPU. An APU like the Ryzen 5 5600G also lets a low-power node run the orchestration layer without a discrete GPU at all. For a dedicated inference box, a mid-range CPU plus fast storage keeps the GPU from idling between requests.

— SpecPicks Editorial · Last verified 2026-06-04