Yes — an RTX 3060 12GB hosts a working coding agent. A Qwen2.5-Coder 14B model at q4_K_M fits in roughly 9 GB of VRAM and leaves headroom for a 16K-token context, so an editor-side agent (Aider, Cline, Continue) runs entirely on the card with no cloud dependency. Throughput lands in the 25–40 tokens/sec range, fast enough for live autocomplete and one-shot file edits.
Why this matters now
Earlier this week The Decoder reported that OpenAI's Codex now operates a Windows desktop autonomously — clicking, typing, opening apps, navigating GitHub. That capability is impressive, and it is also the exact moment a lot of developers re-read their employer's data-handling policy and look at the source tree on their dev box. Self-hosting a coding agent stopped being a hobbyist concern the day a foundation-model vendor started shipping a screen-driving agent that, by design, sees everything in your editor.
The single hardware question that decides whether you can self-host is VRAM. Coder models in the 7B–14B class (the ones that actually feel useful in an editor) need 6–12 GB to fit a comfortable quant plus a working context. The RTX 3060 12GB is one of the cheapest discrete GPUs that gives you all 12 of those gigabytes; the MSI Ventus 2X variant and Zotac Twin Edge sit between $300 and $400 new, often less on the used market. Pair the card with an AM4 host like the Ryzen 7 5800X, a WD Blue SN550 NVMe for the model store, and a 650W PSU, and you have a private coding agent that costs less than a year of cloud-agent subscription.
This article walks through exactly which models fit, how fast they run, what the quantization trade-off looks like, which agent runners pair best with 12 GB, and where the card stops being enough. All measurements are based on community benchmarks and llama.cpp memory tables current as of 2026; year-stamping matters because the model landscape changes monthly.
Key takeaways
- A 14B-q4_K_M coder model fits in ~9 GB of VRAM; the 12 GB card carries it comfortably with a 16K context.
- Realistic generation throughput is 25–40 tok/s on a 14B-q4 and 60–100 tok/s on a 7B-q4, fast enough for editor-side agents.
- Memory bandwidth is the limit, not compute — the 360 GB/s bus on the 3060 is the reason quantization, not core count, controls speed.
- Aider and Continue are the best 12 GB pairings; Cline works but pushes context the hardest.
- The 12 GB ceiling shows up the moment you try to run >14B coder models, BF16 weights, or 32K-token contexts simultaneously.
- For privacy-sensitive code that must not leave the machine, a 12 GB local agent is genuinely good enough for daily work.
What coding models actually fit in 12 GB of VRAM?
The interesting candidates as of 2026 are Qwen2.5-Coder (7B, 14B, 32B), DeepSeek-Coder (6.7B, 33B) and DeepSeek-Coder V2 Lite (16B), Codestral 22B, and StarCoder2 (3B, 7B, 15B). On a 12 GB card the sweet spot is Qwen2.5-Coder 14B at q4_K_M, which reports ~8.8 GB of weight memory in the public llama.cpp memory tables; that leaves ~3 GB for KV cache, which translates to about 16K tokens with an fp16 KV cache or roughly 32K with a q8 KV cache.
The 7B variants — Qwen2.5-Coder 7B, DeepSeek-Coder 6.7B — fit at much higher quants. A 7B-q8 model fits comfortably in 8 GB of VRAM and runs measurably faster than the 14B-q4, so if your workflow is dominated by short-context single-file edits, dropping to 7B and raising the quant is the right move.
The 32B-class coder models do not fit at any usable quant. A 32B-q4 needs ~19 GB on its own, and dropping to q3 or q2 still leaves roughly 12–16 GB of weights before any KV cache, with quality falling noticeably on HumanEval and on real refactoring tasks. If you want a 32B-class local coder you need a 24 GB card (RTX 4090 or 3090), a used 48 GB RTX A6000, or a multi-GPU split.
How fast is Qwen2.5-Coder / DeepSeek-Coder on an RTX 3060 12GB?
Throughput numbers below are illustrative figures consistent with community measurements on r/LocalLLaMA and the llama.cpp issue tracker. Treat them as ranges, not promises — your numbers will vary with driver version, PCIe link width, and host CPU.
| Model | Quant | VRAM used | Prefill (tok/s) | Generation (tok/s) | Notes |
|---|---|---|---|---|---|
| Qwen2.5-Coder 7B | q4_K_M | 5.0 GB | ~900 | 75–95 | snappy autocomplete |
| Qwen2.5-Coder 7B | q8_0 | 7.5 GB | ~750 | 60–75 | best 7B quality |
| Qwen2.5-Coder 14B | q4_K_M | 8.8 GB | ~520 | 32–40 | recommended default |
| Qwen2.5-Coder 14B | q5_K_M | 10.2 GB | ~440 | 25–32 | tight, but fits |
| DeepSeek-Coder 6.7B | q4_K_M | 4.5 GB | ~950 | 80–100 | strong on Python |
| DeepSeek-Coder 6.7B | q8_0 | 7.0 GB | ~770 | 62–80 | near-fp16 quality at about half the memory |
| Qwen2.5-Coder 14B | q8_0 | 14.8 GB | spills | spills | does NOT fit |
| Qwen2.5-Coder 32B | q4_K_M | 19.0 GB | spills | spills | does NOT fit |
The pattern is consistent across every coder family at this size: each step up in parameter count cuts generation throughput by roughly the parameter ratio, because each token decode must re-read the full set of weights from VRAM. The 14B is roughly 2× slower than the 7B at the same quant, in line with what memory bandwidth predicts.
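The bandwidth argument above can be sketched as a simple ceiling: bandwidth divided by the bytes read per token. This is a rough model only. It uses the "VRAM used" figures from the table as a stand-in for weight size, which slightly overstates the bytes read and so slightly understates the ceiling. The function names are illustrative.

```python
# Bandwidth-bound ceiling for decode speed: every generated token streams the
# full set of weights from VRAM once, so tok/s <= bandwidth / bytes_per_token.
BANDWIDTH_GB_S = 360.0  # RTX 3060 12GB memory bandwidth, from this article


def decode_ceiling_tok_s(weights_gb, bandwidth_gb_s=BANDWIDTH_GB_S):
    """Approximate upper bound on generation tok/s for weights_gb of weights."""
    return bandwidth_gb_s / weights_gb


def slowdown(small_gb, large_gb):
    """How many times slower the larger model decodes at equal bandwidth."""
    return large_gb / small_gb


if __name__ == "__main__":
    # VRAM figures taken from the throughput table above.
    for name, gb in [("7B q4_K_M", 5.0), ("14B q4_K_M", 8.8)]:
        print(f"{name}: ceiling ~{decode_ceiling_tok_s(gb):.0f} tok/s")
    print(f"14B vs 7B slowdown: ~{slowdown(5.0, 8.8):.2f}x")
```

The ratio between the two ceilings is the useful part: it depends only on model size, not on the exact bandwidth figure.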
Quantization matrix: q2 / q3 / q4 / q5 / q6 / q8 / fp16
Quantization is where you pay for capability. The table below summarises the trade-off for 7B and 14B coder models, again drawing on llama.cpp's published memory tables and quality numbers from the Qwen2.5-Coder release notes.
| Quant | 7B VRAM | 14B VRAM | 7B tok/s | 14B tok/s | Quality (HumanEval delta) |
|---|---|---|---|---|---|
| q2_K | 3.0 GB | 5.5 GB | 100 | 45 | -8 to -12 pts; do not use for code |
| q3_K_M | 3.5 GB | 6.5 GB | 95 | 40 | -4 to -6 pts; edge case only |
| q4_K_M | 5.0 GB | 8.8 GB | 80 | 35 | -1 to -2 pts; default choice |
| q5_K_M | 5.8 GB | 10.2 GB | 70 | 28 | <1 pt; small upgrade |
| q6_K | 6.5 GB | 11.4 GB | 65 | 24 | rounding error |
| q8_0 | 7.5 GB | 14.8 GB (spills) | 60 | n/a | reference for 7B |
| fp16 | 13.5 GB (spills) | 28 GB (spills) | n/a | n/a | doesn't fit at any size |
The practical line is q4_K_M for 14B, q8_0 for 7B. Below q4 the model starts dropping function arguments and confusing variable scope on real refactors; above q5 the gain is invisible on coding-task benchmarks and the speed cost is real.
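The choice of quant can be framed as "the highest quality level whose weights still leave room for the KV cache". A minimal sketch using the 14B column of the table above; the `best_quant` helper and the reserve sizes are hypothetical and only illustrate the selection rule.

```python
# VRAM per quant (GB) for a 14B coder model, copied from the table above,
# ordered from highest quality to lowest.
QUANTS_14B = [
    ("q8_0", 14.8),
    ("q6_K", 11.4),
    ("q5_K_M", 10.2),
    ("q4_K_M", 8.8),
    ("q3_K_M", 6.5),
    ("q2_K", 5.5),
]


def best_quant(quants, vram_gb, kv_reserve_gb):
    """Return the highest-quality quant that leaves kv_reserve_gb of VRAM free."""
    for name, weights_gb in quants:
        if weights_gb + kv_reserve_gb <= vram_gb:
            return name
    return None  # nothing fits in the given budget


if __name__ == "__main__":
    for reserve in (0.5, 1.5, 3.0):
        choice = best_quant(QUANTS_14B, 12.0, reserve)
        print(f"reserve {reserve} GB for KV cache -> {choice}")
```

With a 3 GB reserve the rule lands on q4_K_M, which matches the default recommended above; smaller reserves buy a higher quant at the cost of context length.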
Prefill vs generation throughput with a long code context
A coding agent feels different from a chat model because it spends much more time in prefill — the moment where it digests your file, your project tree, the last few diffs, and the test output, before generating a single token. Prefill is parallel and compute-bound; generation is serial and bandwidth-bound. On the RTX 3060, prefill on a 14B-q4 model lands around 520 tok/s, which means a 4,000-token system + context block ingests in about 8 seconds.
The crossover point matters. If your agent runner sends the entire file plus a few dependency files on every turn (Cline's default), you spend most of the time budget on prefill, not generation. Aider's approach of sending only the changed hunk plus a project map keeps prefill cheap, which is why it feels much snappier on this hardware even though the underlying model is the same.
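The prefill versus generation split can be estimated per turn with two divisions. The sketch below uses the 14B-q4 rates quoted in this article (about 520 tok/s prefill, about 35 tok/s generation); the prompt and output sizes are hypothetical request shapes, not measurements of any particular runner.

```python
def turn_latency_s(prompt_tokens, output_tokens,
                   prefill_tok_s=520.0, gen_tok_s=35.0):
    """Return (prefill_seconds, generation_seconds) for one agent turn."""
    prefill_s = prompt_tokens / prefill_tok_s
    generation_s = output_tokens / gen_tok_s
    return prefill_s, generation_s


if __name__ == "__main__":
    # Hypothetical request shapes: a whole-file prompt vs a hunk-sized prompt,
    # both producing a 200-token edit.
    for label, prompt in [("whole files", 4000), ("hunk + repo map", 1000)]:
        prefill_s, generation_s = turn_latency_s(prompt, output_tokens=200)
        print(f"{label}: prefill {prefill_s:.1f}s + generation {generation_s:.1f}s")
```

Under these assumptions the whole-file turn spends more time ingesting the prompt than writing the edit, while the hunk-sized turn is dominated by generation.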
What does the 32K-context limit cost you on a 12 GB card?
A 14B-q4 model occupies ~8.8 GB of weights. The remaining ~3 GB of VRAM (less once the desktop and driver take their share) houses the KV cache, which scales linearly with context length. For a model with the Qwen2.5-14B attention layout (48 layers, 8 KV heads, head dimension 128), the cache costs roughly 190 MB per 1,000 tokens at fp16 and roughly 100 MB per 1,000 tokens at q8.
That gives you:
- ~16K tokens of context with fp16 attention
- ~32K tokens of context with q8 attention
- Up to 64K with flash-attention q4 KV — at noticeable quality risk on long-range tasks
Most coding tasks fit in 8–16K. The places you'll feel the ceiling are: large monorepo "explain the call graph for X" prompts; multi-file refactor sessions where the agent is holding 20 files of context; or test-driven runs where you're feeding stack traces and previous error output into a long conversation. Drop to 7B at q8 if you need >32K context routinely.
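The context budget above can be cross-checked from first principles: the KV cache stores one key and one value vector per layer, per KV head, per token. The sketch below assumes the layer and head counts of the public Qwen2.5-14B configuration (48 layers, 8 KV heads, head dimension 128), which are not stated in this article. It also assumes llama.cpp-style q8_0 and q4_0 block formats for the quantized cache. Treat the output as a sanity check rather than a measured figure.

```python
# KV-cache sizing sketch. Architecture values are ASSUMED from the public
# Qwen2.5-14B configuration; change them for other models.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 8, 128

# Bytes per cached element: fp16 is 2 bytes; the q8_0 and q4_0 block formats
# pack 32 values into 34 and 18 bytes respectively.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_bytes_per_token(cache_type):
    """Bytes of K plus V cache stored for one token across all layers."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM[cache_type]


def max_context_tokens(free_vram_gib, cache_type):
    """Largest context whose KV cache fits in free_vram_gib of VRAM."""
    return int(free_vram_gib * 1024**3 // kv_bytes_per_token(cache_type))


if __name__ == "__main__":
    for cache_type in ("f16", "q8_0", "q4_0"):
        per_1k_mb = kv_bytes_per_token(cache_type) * 1000 / 1e6
        tokens = max_context_tokens(3.0, cache_type)
        print(f"{cache_type}: ~{per_1k_mb:.0f} MB per 1K tokens, "
              f"~{tokens} tokens in 3 GiB")
```

This ignores compute buffers and whatever the desktop is holding, so real limits sit somewhat below what the function returns.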
Spec table: RTX 3060 12GB vs the discrete-VRAM tier above it
| Card | VRAM | Bus | Bandwidth | TGP | Coder model ceiling | New street price (early 2026) |
|---|---|---|---|---|---|---|
| RTX 3060 12GB | 12 GB GDDR6 | 192-bit | 360 GB/s | 170 W | 14B-q4 + 16K ctx | $300–$400 |
| RTX 4060 Ti 16GB | 16 GB GDDR6 | 128-bit | 288 GB/s | 165 W | 14B-q6 + 16K ctx | $450–$500 |
| RTX 4070 12GB | 12 GB GDDR6X | 192-bit | 504 GB/s | 200 W | 14B-q4 + 16K ctx | $550–$600 |
| RTX 3090 24GB | 24 GB GDDR6X | 384-bit | 936 GB/s | 350 W | 32B-q4 + 32K ctx | $700–$900 used |
| RTX 4090 24GB | 24 GB GDDR6X | 384-bit | 1008 GB/s | 450 W | 32B-q4 + 32K ctx | $1,700+ |
The interesting takeaway from this table is that the 3060 12GB sits in a unique spot: it is the only sub-$400 card here with enough VRAM to load a 14B-class coder model. The 4060 Ti 16GB has more memory but lower bandwidth, so generation throughput is no better and you pay $100–150 more for the extra headroom. If your workflow has settled on a 14B-q4 model, the 3060 is the smarter buy. If you suspect you'll want 14B at q8, the 4060 Ti 16GB or a used 3090 is worth the step up; 14B at BF16 needs ~28 GB and fits on neither.
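The bandwidth column follows directly from bus width and memory speed, which is why a narrower bus can leave a newer card behind an older one. A minimal sketch: the bus widths come from the table above, while the per-pin data rates are the commonly published memory speeds for each card and are an assumption not stated in this article.

```python
# Memory bandwidth (GB/s) = bus width (bits) / 8 * per-pin data rate (Gbit/s).
# Bus widths are from the spec table; data rates are ASSUMED from public specs.
CARDS = {
    "RTX 3060 12GB": (192, 15.0),
    "RTX 4060 Ti 16GB": (128, 18.0),
    "RTX 4070 12GB": (192, 21.0),
    "RTX 3090 24GB": (384, 19.5),
    "RTX 4090 24GB": (384, 21.0),
}


def bandwidth_gb_s(bus_bits, gbit_per_pin):
    """Peak memory bandwidth in GB/s for a given bus width and data rate."""
    return bus_bits / 8 * gbit_per_pin


if __name__ == "__main__":
    for card, (bus_bits, rate) in CARDS.items():
        print(f"{card}: {bandwidth_gb_s(bus_bits, rate):.0f} GB/s")
```

The 4060 Ti 16GB runs faster memory than the 3060, but its 128-bit bus more than cancels that out.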
Which agent runner pairs best with 12 GB?
- Aider: the best 12 GB match. Sends only the active "edit" set + a small repo map. Prefill stays under 1K tokens for most edits, so generation dominates and the 14B-q4 runs at full 30+ tok/s. The default gpt-4o-mini config swaps cleanly for an Ollama backend pointed at Qwen2.5-Coder 14B.
- Continue (VS Code): also strong. The inline autocomplete is small-context; the chat panel can be long-context but you control how much code you select. Pair with Qwen2.5-Coder 7B-q8 for autocomplete and 14B-q4 for chat.
- Cline: works, but pushes context hardest. Cline's autonomous mode reads many files at once, so prefill costs add up and you'll feel the 12 GB ceiling on big repos. Keep Cline's max-context low (8–16K) and prefer the 7B model for it.
- Open-WebUI: fine for chat-style coding, but it isn't really an "agent": no file edits, no shell. Use it to evaluate models, not as your daily driver.
The general rule: agent runners that operate at the hunk level pair with 14B-q4 + 12 GB. Agent runners that read whole projects on every turn want a 24 GB card.
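Before wiring a runner to a local backend, it can help to measure prefill and generation speed on your own card. The sketch below assumes an Ollama server on its default local port and a model pulled under the tag `qwen2.5-coder:14b`; both are assumptions about your setup. It relies on the timing fields Ollama returns from its generate endpoint, which are reported in nanoseconds.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="qwen2.5-coder:14b", num_ctx=16384):
    """Request body for one non-streaming completion with a fixed context size."""
    return {"model": model, "prompt": prompt, "stream": False,
            "options": {"num_ctx": num_ctx}}


def throughput(stats):
    """Return (prefill, generation) tok/s from Ollama's nanosecond timings."""
    prefill = stats["prompt_eval_count"] / (stats["prompt_eval_duration"] / 1e9)
    generation = stats["eval_count"] / (stats["eval_duration"] / 1e9)
    return prefill, generation


def measure(prompt):
    """Send one prompt to a running Ollama server and return its throughput."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return throughput(json.load(response))
```

Calling `measure()` with a prompt about the size of a typical edit request gives numbers to compare against the ranges in the throughput table. A repeated prompt may be served from cache and report little or no prefill work, so vary the prompt between runs.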
Perf-per-dollar + perf-per-watt math at 170W TGP
At ~30 tok/s on Qwen2.5-Coder 14B-q4 and a 170W TGP, the RTX 3060 12GB delivers about 0.18 tok/s per watt. At a typical $0.13/kWh you spend roughly $0.022 per hour to run the card at full load, which works out to ~5 million tokens per dollar of electricity — orders of magnitude cheaper per token than any frontier API and obviously zero data leaving the machine.
On purchase price the math is even clearer. A $349 card amortised over the typical 3-year coder upgrade window costs you less than $10 a month before electricity. A commercial cloud-agent subscription that costs more than that per month is undercut from the first billing cycle.
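The arithmetic in this section is easy to rerun with your own electricity rate and card price. A minimal sketch using the figures quoted above (30 tok/s, 170 W, $0.13/kWh, $349 over 36 months); the function names are illustrative, and the model assumes the card runs at full TGP the whole time.

```python
def running_cost(tok_s=30.0, watts=170.0, usd_per_kwh=0.13):
    """Return (tok/s per watt, USD per hour at full load, tokens per USD)."""
    tok_per_watt = tok_s / watts
    usd_per_hour = watts / 1000.0 * usd_per_kwh
    tokens_per_usd = tok_s * 3600.0 / usd_per_hour
    return tok_per_watt, usd_per_hour, tokens_per_usd


def monthly_amortisation(card_usd=349.0, months=36):
    """Purchase price spread evenly over the expected service life."""
    return card_usd / months


if __name__ == "__main__":
    tok_per_watt, usd_per_hour, tokens_per_usd = running_cost()
    print(f"{tok_per_watt:.2f} tok/s per watt")
    print(f"${usd_per_hour:.3f} per hour at full load")
    print(f"~{tokens_per_usd / 1e6:.1f} million tokens per dollar of electricity")
    print(f"${monthly_amortisation():.2f} per month for the card")
```

Idle time and host power draw are left out, so the per-token figure is a best case for the GPU alone.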
Bottom line: when a 12 GB local agent is enough, and when it isn't
A 12 GB card running Qwen2.5-Coder 14B-q4 is enough for: privacy-sensitive code that cannot leave the machine; daily test generation, docstring drafting, single-file refactor, and inline autocomplete; coding offline (planes, secure facilities, spotty internet); developers who hate per-token bills and want a flat-rate setup.
A 12 GB card is not enough when: you want a 32B-class coder; you need 32K+ contexts across many files simultaneously; you're chasing frontier-cloud quality on the hardest multi-step agent tasks. For those, a used 24 GB 3090, a 4090, or a used 48 GB A6000 is the right tier.
If you're sizing a build from scratch, the working combo as of 2026 is: MSI RTX 3060 Ventus 2X 12G or Zotac Twin Edge 12G, Ryzen 7 5800X, 32 GB of DDR4-3600, a WD Blue SN550 1TB NVMe for the model store (or a Crucial BX500 1TB SATA SSD if your storage budget is tight), and any 650W 80+ Gold PSU. That box runs every model in the tables above and gives you a coding agent that owes nobody anything for your source tree.
