An on-device agent that drives apps without Copilot needs a discrete 12GB-class GPU plus a fast CPU and NVMe storage. As of 2026 the practical baseline is an RTX 3060 12GB paired with a Ryzen 7 5800X, a 1TB NVMe like the WD Blue SN550, and a quantized 8B-14B model. Integrated NPU-only AI PCs are not there yet.
What "AI PCs that run agents instead of Copilot" actually requires
Microsoft and Nvidia have signalled that the next wave of Windows AI PCs will host autonomous agents — software that opens apps, reads files, calls APIs, and edits work on your machine without round-tripping every prompt to a cloud LLM. Reading past the marketing, the hardware contract is concrete: hold a capable instruction-tuned model in low-latency memory, serve 30+ tok/s, and keep a 16-64K context window warm while the user works in foreground apps.
Today, on consumer hardware, that means a discrete GPU. Integrated NPUs in Copilot+ class laptops top out at workloads that fit in 4-8GB of unified memory at INT4, which limits you to ~3B-7B parameter models, too small for reliable tool-calling and multi-turn planning. The model class that actually performs agentic work (8B-14B with strong function-calling fine-tunes) wants roughly 5-8.5GB of weights at q4_K_M plus 1-4GB of KV cache at 16K-32K context. A 12GB discrete GPU clears that bar; a current-generation NPU does not.
The path forward is split. Microsoft's Copilot+ platforms will keep moving forward on NPUs that target small, special-purpose models — vision, speech, transformer-based reranking. Real agent runtimes will continue to want a discrete GPU until on-package memory crosses the 16GB-class line. Until then, buying an "AI PC" without a discrete GPU is a 2027-or-later bet.
Key Takeaways
- A 12GB discrete GPU is the minimum credible target for on-device agent loops in 2026 — the RTX 3060 12GB is the price-floor SKU.
- Q4_K_M 8B-14B models hit 30-55 tok/s on a 3060, with 16K context comfortably available and 32K within reach for the 8B-11B class.
- The CPU still matters — a Ryzen 7 5800X provides 8C/16T for tool execution, file ops, and orchestration without choking the GPU dispatcher.
- NVMe storage like the WD Blue SN550 cuts model load times to around 5 seconds and keeps RAG indexes warm.
- Integrated NPU-only AI PCs are useful for narrow inference (speech, vision) but cannot host a real agent today.
What does an on-device agent loop demand from VRAM and CPU?
An agent loop is a tighter cycle than a chat session. Each turn typically: ingests a prompt, plans tool calls, decodes tokens, parses a structured output, executes a tool, ingests the tool result, and decodes again. Tool results are often long — a directory listing, a stack trace, a 4K-token doc snippet. The scratchpad grows fast.
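The cycle described above can be sketched in a few lines. This is a minimal illustration, not any particular runtime's API: `fake_model`, `list_dir`, and the `{"tool": ..., "args": ...}` call format are hypothetical stand-ins for a real local model and real tools.

```python
import json

def fake_model(scratchpad: list[str]) -> str:
    """Hypothetical stand-in for a local LLM: request one tool call, then finish."""
    if not any(entry.startswith("TOOL_RESULT") for entry in scratchpad):
        return json.dumps({"tool": "list_dir", "args": {"path": "."}})
    return json.dumps({"final": "done"})

def list_dir(path: str) -> str:
    """Toy tool: returns a canned listing instead of touching the disk."""
    return "\n".join(f"{path}/file_{i}.txt" for i in range(3))

TOOLS = {"list_dir": list_dir}

def run_agent(prompt: str, max_turns: int = 4) -> list[str]:
    scratchpad = [f"USER: {prompt}"]                        # ingest the prompt
    for _ in range(max_turns):
        raw = fake_model(scratchpad)                        # plan and decode
        scratchpad.append(f"MODEL: {raw}")
        action = json.loads(raw)                            # parse structured output
        if "final" in action:
            break
        result = TOOLS[action["tool"]](**action["args"])    # execute the tool
        scratchpad.append(f"TOOL_RESULT: {result}")         # ingest the tool result
    return scratchpad

if __name__ == "__main__":
    pad = run_agent("What files are here?")
    for entry in pad:
        print(entry)
    print("scratchpad characters:", sum(len(e) for e in pad))
```

Even in this toy run the tool result is the longest entry in the scratchpad, which is the growth pattern that drives the KV cache numbers discussed next.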
VRAM consumption is dominated by three lines: weights (fixed), KV cache (linear in tokens), and activation/runtime overhead (small but non-zero). For a 14B q4_K_M model with a 32K context window, weights are ~8.5GB, KV cache can reach ~3-4GB, and runtime overhead adds up to ~0.5GB. That totals ~12GB or more, which sits at or just past the edge of what a 12GB card holds. Drop to a 16K window, or move to a 10B-12B class model at q4_K_M, and the budget is comfortable.
CPU demand is steadier than people expect. Token decoding pegs the GPU, but the tool side of an agent — running shell commands, parsing structured outputs, hitting filesystem APIs — keeps several CPU threads warm. An 8-core part like the 5800X gives the agent 2-4 threads for the loop and reserves the rest for foreground work and Windows itself.
How much VRAM does an agentic 8B-14B model need with tool-calling context?
| Model class | Quant | Weights | 16K KV | 32K KV | Total at 32K |
|---|---|---|---|---|---|
| 8B | q4_K_M | 5.0 GB | 1.2 GB | 2.4 GB | ~7.5 GB |
| 11B | q4_K_M | 6.8 GB | 1.6 GB | 3.2 GB | ~10.1 GB |
| 14B | q4_K_M | 8.5 GB | 2.0 GB | 4.0 GB | ~12.6 GB |
| 8B | q5_K_M | 5.6 GB | 1.2 GB | 2.4 GB | ~8.1 GB |
| 14B | q5_K_M | 10.0 GB | 2.0 GB | 4.0 GB | OOM at 32K |
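The 14B q4_K_M row is the practical ceiling on a 12GB card, and only at 16K: weights plus 16K of KV cache come to ~10.6GB, while the ~12.6GB total at 32K overshoots the card. Push past that and you either quantize harder, shorten the context, or move to a larger GPU.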
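The table's arithmetic is easy to reproduce for other context lengths. The sketch below copies the weights and 16K KV figures from the table, scales the KV cache linearly with tokens, and checks the result against a 12GB card. The 0.1GB overhead is an assumption inferred from the table's totals, and a real card exposes somewhat less than its nominal capacity, so treat a marginal "fits" as a warning.

```python
# (weights_gb, kv_gb_at_16k) per row, copied from the table above.
ROWS = {
    ("8B", "q4_K_M"): (5.0, 1.2),
    ("11B", "q4_K_M"): (6.8, 1.6),
    ("14B", "q4_K_M"): (8.5, 2.0),
    ("8B", "q5_K_M"): (5.6, 1.2),
    ("14B", "q5_K_M"): (10.0, 2.0),
}
OVERHEAD_GB = 0.1  # assumption: inferred from the table's "Total at 32K" column

def total_vram_gb(weights_gb: float, kv_gb_at_16k: float, context_tokens: int) -> float:
    """Weights + KV cache (linear in tokens) + fixed runtime overhead."""
    kv_gb = kv_gb_at_16k * context_tokens / 16_384
    return weights_gb + kv_gb + OVERHEAD_GB

def fits(total_gb: float, card_gb: float = 12.0) -> bool:
    return total_gb <= card_gb

if __name__ == "__main__":
    for (size, quant), (weights, kv16) in ROWS.items():
        for ctx in (16_384, 32_768):
            total = total_vram_gb(weights, kv16, ctx)
            verdict = "fits" if fits(total) else "OOM"
            print(f"{size} {quant} @ {ctx // 1024}K: {total:.1f} GB ({verdict})")
```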
Spec-delta table: discrete vs integrated NPU AI PCs
| Spec | RTX 3060 12GB system | Copilot+ NPU AI PC (2026) |
|---|---|---|
| Usable model memory | 12GB GDDR6 | ~4-8GB unified |
| Sustained tok/s on 8B q4 | 50-55 | 8-18 |
| Sustained tok/s on 14B q4 | 32-38 | OOM or sub-5 |
| Max practical context | 32K @ 11B q4, 16K @ 14B q4 | 8-16K @ 8B q4 |
| System cost (2026) | ~$900-$1,100 | ~$1,200-$1,800 |
| Sustained system power | ~210W | ~28W |
| Form factor | Desktop / tower | Laptop |
The discrete system is cheaper and dramatically faster but burns ~7.5x the power and isn't portable. The NPU laptop is portable and quiet but caps at the wrong end of the workload curve: fine for "summarize this page," not for "drive this app for an hour."
Quantization matrix: q3-q8 with VRAM + tok/s for agent models
These numbers are from a 12GB RTX 3060 running an 11B instruct model on llama.cpp 2026.04 with FlashAttention and a 16K context.
| Quant | Weights | Total at 16K | tok/s | Tool-call quality vs fp16 |
|---|---|---|---|---|
| q3_K_M | ~5.3 GB | ~6.9 GB | 41-46 | Notable JSON-schema drift |
| q4_K_M | ~6.8 GB | ~8.4 GB | 36-41 | Sweet spot; rare schema drift |
| q5_K_M | ~7.6 GB | ~9.2 GB | 31-35 | Near-fp16 on tool calls |
| q6_K | ~8.7 GB | ~10.3 GB | 26-30 | Essentially fp16 |
| q8_0 | ~11.0 GB | ~12.6 GB | OOM at 16K | No loss, but does not fit |
Agent reliability degrades faster with quantization than chat quality does, because malformed tool-call JSON is a hard failure. Stay at q4_K_M or higher for production work.
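Because a malformed call is a hard failure, it helps to validate every tool call before executing it and to count how often validation fails per quantization level. A minimal sketch using only the standard library follows; the `{"tool": ..., "args": ...}` shape and the `TOOL_ARGS` registry are assumptions made for illustration, not a format any specific runtime mandates.

```python
import json

# Hypothetical registry: tool name -> exact set of argument names it accepts.
TOOL_ARGS = {"read_file": {"path"}, "run_shell": {"command", "timeout_s"}}

def parse_tool_call(raw: str, tools: dict[str, set[str]]) -> dict:
    """Return the parsed call, or raise ValueError describing the schema drift."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(call, dict):
        raise ValueError("top-level value must be an object")
    name = call.get("tool")
    if not isinstance(name, str) or name not in tools:
        raise ValueError(f"unknown tool: {name!r}")
    args = call.get("args")
    if not isinstance(args, dict):
        raise ValueError("'args' must be an object")
    if set(args) != tools[name]:
        raise ValueError(f"expected args {sorted(tools[name])}, got {sorted(args)}")
    return call

if __name__ == "__main__":
    samples = [
        '{"tool": "read_file", "args": {"path": "notes.txt"}}',
        '{"tool": "read_file", "args": {"file": "notes.txt"}}',   # wrong argument name
        '{"tool": "read_file", "args": {"path": "notes.txt"}',    # truncated JSON
    ]
    for raw in samples:
        try:
            print("ok:", parse_tool_call(raw, TOOL_ARGS))
        except ValueError as err:
            print("rejected:", err)
```

Feeding the rejection message back to the model as a tool result is one common recovery strategy, though it spends context on every retry.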
Prefill vs generation: why long tool-call transcripts stress prefill
Without prefix caching, each new turn re-processes the entire context window: system prompt, tool definitions, prior turns, and tool outputs. At the prefill rates in the benchmark table below (~1,700-3,400 tok/s), 2-4 seconds of prefill corresponds to roughly 5K-10K tokens of context, and a completely full 16K window takes closer to 5-10 seconds before the first new token. As the conversation grows, prefill becomes the wall the user actually feels. Two mitigations are real and cheap: enable prefix caching in your runtime (llama.cpp's --prompt-cache, Ollama's session reuse), and keep tool outputs short by paginating large results before they re-enter context.
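Pagination of tool output can be as simple as splitting on line boundaries and returning one page plus a footer the model can use to ask for more. The sketch below is one way to do it; the `paginate` helper, the 2,000-character default, and the footer format are all illustrative choices, not part of any agent framework.

```python
def paginate(text: str, page: int = 0, max_chars: int = 2000) -> str:
    """Return one page of a long tool result, with a footer showing the page count.

    Pages break on line boundaries. A single line longer than max_chars is kept
    whole on its own page rather than being cut mid-line.
    """
    pages, current, size = [], [], 0
    for line in text.splitlines():
        if current and size + len(line) + 1 > max_chars:
            pages.append(current)
            current, size = [], 0
        current.append(line)
        size += len(line) + 1  # +1 for the newline
    if current:
        pages.append(current)
    if not pages:
        return ""
    page = max(0, min(page, len(pages) - 1))  # clamp out-of-range requests
    footer = f"[page {page + 1}/{len(pages)}]"
    return "\n".join(pages[page] + [footer])

if __name__ == "__main__":
    listing = "\n".join(f"src/module_{i:04d}.py" for i in range(500))
    first = paginate(listing, page=0, max_chars=400)
    print(first)
    print(f"full result: {len(listing)} chars, first page: {len(first)} chars")
```

The agent then sees a few hundred characters per call instead of the whole listing, and only pays prefill for the pages it actually requests.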
Context-length impact analysis
A 24/7 agent loop with retrieval typically lands in the 16-32K context range. Below 16K the agent forgets earlier steps; above 32K the KV cache cost makes the budget tight on 12GB. Run with 16K context and trim aggressively, or run at 32K with a leaner 8B model.
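"Trim aggressively" usually means keeping the system prompt and dropping the oldest turns first. A minimal sketch follows, assuming a rough 4-characters-per-token estimate; that ratio is an approximation, and a real deployment would count tokens with the model's own tokenizer.

```python
def approx_tokens(text: str) -> int:
    # Assumption: ~4 characters per token. Use the real tokenizer in production.
    return max(1, len(text) // 4)

def trim_history(system_prompt: str, turns: list[str], budget_tokens: int) -> list[str]:
    """Keep the system prompt, then as many of the most recent turns as fit the budget."""
    remaining = budget_tokens - approx_tokens(system_prompt)
    kept: list[str] = []
    for turn in reversed(turns):          # walk from newest to oldest
        cost = approx_tokens(turn)
        if cost > remaining:
            break                         # everything older is dropped too
        kept.append(turn)
        remaining -= cost
    return [system_prompt] + kept[::-1]   # restore chronological order

if __name__ == "__main__":
    system = "You are a coding agent. " * 10
    history = [f"turn {i}: " + "x" * 4000 for i in range(30)]
    window = trim_history(system, history, budget_tokens=16_384)
    print(f"kept {len(window) - 1} of {len(history)} turns")
```

Keeping the system prompt and tool definitions at the front, unchanged, also preserves the shared prefix that prefix caching relies on.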
Benchmark table: tok/s for tool-calling models on a 12GB card
| Model | Quant | Context | Decode tok/s | Prefill tok/s |
|---|---|---|---|---|
| Qwen2.5 7B instruct | q4_K_M | 16K | 58-62 | ~3,400 |
| Llama 3.1 8B instruct | q4_K_M | 16K | 50-55 | ~2,800 |
| Mistral NeMo 12B instruct | q4_K_M | 16K | 36-41 | ~2,100 |
| Qwen2.5 14B instruct | q4_K_M | 16K | 30-34 | ~1,700 |
Decode is what the user sees as response speed. Prefill is what the user feels as latency between turns.
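Both numbers combine into a per-turn latency estimate. The sketch below uses two rows from the table (decode taken as the midpoint of each range) and shows how much of the prefill wait disappears when most of the context is served from a prefix cache. The 8,000-token context, 200-token reply, and 7,000 cached tokens are illustrative inputs, not measurements.

```python
# (decode tok/s midpoint, prefill tok/s), taken from the benchmark table above.
BENCH = {
    "Llama 3.1 8B q4_K_M": (52.5, 2800),
    "Qwen2.5 14B q4_K_M": (32.0, 1700),
}

def turn_latency_s(model: str, context_tokens: int, new_tokens: int,
                   cached_tokens: int = 0) -> tuple[float, float]:
    """Return (prefill seconds, decode seconds) for one agent turn."""
    decode_tps, prefill_tps = BENCH[model]
    uncached = max(0, context_tokens - cached_tokens)  # only these need prefill
    return uncached / prefill_tps, new_tokens / decode_tps

if __name__ == "__main__":
    for model in BENCH:
        cold_prefill, decode = turn_latency_s(model, 8_000, 200)
        warm_prefill, _ = turn_latency_s(model, 8_000, 200, cached_tokens=7_000)
        print(f"{model}: cold prefill {cold_prefill:.1f}s, "
              f"warm prefill {warm_prefill:.1f}s, decode {decode:.1f}s")
```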
Does a Ryzen CPU help, or is it all GPU?
A modern 8-core CPU matters more than people credit. Three reasons. First, tool execution — shell commands, JSON parsing, file I/O — happens on the CPU, and a stalled tool blocks the loop. Second, llama.cpp uses CPU threads for tokenization, sampling, and the parts of the runtime that don't live on the GPU. Third, the agent host (Aider, Cline, Continue, the Microsoft AI Toolkit) is a Python or Node process; it needs headroom for orchestration logic and background context retrieval.
The Ryzen 7 5800X at 8C/16T and ~4.7GHz boost stays a strong value pick for an agent box. It is fast enough to keep tool latency under 30ms for common ops while leaving 12+ threads for the GPU dispatcher and foreground work.
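Tool latency is cheap to measure on your own machine. The sketch below times two representative operations, parsing a moderately large JSON payload and listing the current directory, and reports the median in milliseconds. The operations and the payload size are arbitrary examples; substitute whatever your agent's tools actually do.

```python
import json
import os
import statistics
import time

def time_op_ms(op, repeats: int = 20) -> float:
    """Median wall-clock time of op() in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        op()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

if __name__ == "__main__":
    payload = json.dumps({"files": [f"src/module_{i}.py" for i in range(2000)]})
    ops = {
        "parse JSON payload": lambda: json.loads(payload),
        "list current directory": lambda: os.listdir("."),
    }
    for name, op in ops.items():
        print(f"{name}: {time_op_ms(op):.3f} ms")
```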
Perf-per-dollar + perf-per-watt math for a 24/7 agent box
A 3060 / 5800X system (motherboard, 32GB DDR4, SN550 1TB NVMe, 650W PSU, mid-tower case) runs about $900-$1,100 to assemble. Sustained agent load draws ~210W at the wall. That is roughly $36/year in power for about three hours of sustained load a day at an assumed rate of ~$0.16/kWh; a box that is genuinely under load 24/7 uses ~1,840 kWh/year, closer to $295. Throughput on Llama 3.1 8B q4 hits 52 tok/s.
A current Copilot+ NPU laptop with 32GB unified memory starts at ~$1,400. Sustained agent load on a 7B model is around 12-14 tok/s. Wall power at sustained inference is ~28W, or about $4-5/year in electricity on the same three-hours-a-day basis and roughly $40/year around the clock.
For most agent workloads the desktop wins on tok/s per dollar (~5-6x) and loses on tok/s per watt (~1.9x). Over a three-year horizon at a few hours of sustained load a day, the desktop is the cheaper, faster option and the laptop is the quieter, more portable one.
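The ratios are worth recomputing with your own prices and duty cycle. The sketch below plugs in the figures from the two systems described above, using $1,000 as the midpoint of the desktop's price range and 13 tok/s as the midpoint of the laptop's throughput. The electricity rate is an assumed placeholder, not a quoted tariff.

```python
DESKTOP = {"tok_s": 52, "cost_usd": 1000, "watts": 210}  # cost: midpoint of $900-$1,100
LAPTOP = {"tok_s": 13, "cost_usd": 1400, "watts": 28}    # tok/s: midpoint of 12-14
RATE_USD_PER_KWH = 0.16  # assumption: substitute your local rate

def per_dollar(system: dict) -> float:
    return system["tok_s"] / system["cost_usd"]

def per_watt(system: dict) -> float:
    return system["tok_s"] / system["watts"]

def annual_kwh(watts: float, hours_per_day: float) -> float:
    return watts * hours_per_day * 365 / 1000

if __name__ == "__main__":
    print(f"desktop tok/s per dollar advantage: {per_dollar(DESKTOP) / per_dollar(LAPTOP):.1f}x")
    print(f"laptop tok/s per watt advantage: {per_watt(LAPTOP) / per_watt(DESKTOP):.1f}x")
    for hours in (3, 24):
        for name, system in (("desktop", DESKTOP), ("laptop", LAPTOP)):
            kwh = annual_kwh(system["watts"], hours)
            print(f"{name} at {hours}h/day: {kwh:.0f} kWh/yr, ${kwh * RATE_USD_PER_KWH:.0f}/yr")
```

The duty cycle matters more than any other input: at a few hours a day the power bill is a rounding error next to the hardware price, while at true 24/7 load it becomes a meaningful share of three-year cost.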
Verdict matrix
| Build a discrete-GPU box if… | Wait for an NPU AI PC if… |
|---|---|
| You want 30+ tok/s on a real 8B-14B model today | You need a laptop and quiet operation |
| You can plug in a desktop | Foreground apps matter more than agent latency |
| Budget is $900-$1,100 | Budget is $1,400+ for a Copilot+ machine |
| You serve 16K-32K context with tools | 8K context is enough for your loop |
| You expect to swap models monthly | You will run one fine-tuned model long-term |
Common pitfalls when sizing an on-device agent box
- Targeting model size before tool-call quality. A 13B at q3 with fast tok/s sounds great but emits malformed JSON often enough to break a tool-driven loop. Pick the smallest model with reliable tool-calling first, then optimize tok/s.
- Overspending on CPU. Agent loops do not pin 16 cores. The marginal value of going from a 5800X (8C/16T) to a 7950X (16C/32T) for a single-session local agent is small; the marginal value of going from 12GB to 16GB VRAM is large.
- Skipping NVMe. Loading a 14B q4 model from a SATA SSD takes 25-35 seconds; loading from a WD Blue SN550 NVMe takes ~5 seconds. Model swaps are common in agent debugging.
- Forgetting prefix-cache hit rates. Without prefix caching enabled, every turn re-processes the system prompt. Turn it on; you save seconds per turn on long contexts.
Worked example: an "interactive coding agent" build at $1,000
A reasonable interactive-agent target — Aider, Continue, or Cline plus a local 11B-14B coding model — looks like this:
| Component | Pick | Price |
|---|---|---|
| GPU | MSI RTX 3060 12GB Ventus 2X | $329 |
| CPU | AMD Ryzen 7 5800X | $210 |
| Motherboard | B550 mid-range | $135 |
| RAM | 32GB DDR4-3600 CL16 | $79 |
| Storage | WD Blue SN550 1TB NVMe | $69 |
| PSU | 650W 80+ Gold | $79 |
| Case | Mid-tower with quiet fans | $79 |
| Total | | ~$980 |
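The system runs Qwen2.5 14B q4_K_M at 30-34 tok/s with a 16K context window and idles below 60W. By the sizing table earlier, that configuration uses ~10.6GB of the card's 12GB, so VRAM headroom is closer to 10% than 30%; ComfyUI or Stable Diffusion side-jobs realistically mean unloading the LLM first or dropping to an 8B model. Compared with cloud equivalents at typical hobbyist usage, the breakeven runs ~9-14 months, after which the hardware is paid off and the only ongoing cost is electricity.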
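As a sanity check on the parts list and the breakeven range, the sketch below sums the prices from the table and works out what monthly cloud spend a 9-14 month breakeven implies. It ignores electricity and resale value, so read the result as a rough bound.

```python
# Prices copied from the build table above (USD).
PARTS = {
    "GPU": 329,
    "CPU": 210,
    "Motherboard": 135,
    "RAM": 79,
    "Storage": 69,
    "PSU": 79,
    "Case": 79,
}

def implied_monthly_spend(total_usd: float, breakeven_months: float) -> float:
    """Monthly cloud bill at which the build pays for itself in breakeven_months."""
    return total_usd / breakeven_months

if __name__ == "__main__":
    total = sum(PARTS.values())
    print(f"build total: ${total}")
    for months in (9, 14):
        print(f"breakeven in {months} months implies "
              f"${implied_monthly_spend(total, months):.0f}/month of cloud spend")
```

If your actual cloud bill is well under the lower figure, the breakeven stretches out accordingly.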
When NOT to build a local-agent box
- You only run an agent occasionally — fewer than 30 minutes a day average. Cloud APIs likely cost less long-term.
- You need foundation-model-quality reasoning across all turns; a local 11B-14B is competent but visibly weaker than a 200B-class cloud model.
- You cannot accept a 1-3 second prefill wait at the start of each turn. Cloud is faster.
- Your data is allowed to leave the building anyway. Privacy is the main local-agent value prop; without that constraint, the cost math gets harder.
Bottom line
"AI PCs that run agents instead of Copilot" is real, and the hardware contract is already known. In 2026 the practical local-agent box is a discrete 12GB GPU, a 6-8 core CPU, and fast NVMe. Integrated NPUs will catch up — memory on package will grow, model formats will quantize harder — but the gap right now is wide enough that any serious local-agent deployment goes the discrete-GPU route. Don't buy the laptop class until your workload genuinely fits in 8GB and 18 tok/s.
