If you only have one 12GB RTX 3060 and you want to pick one open-weights model for everyday local use, the short answer in mid-2026 is: run DeepSeek V4 Flash for coding and tool use, run GLM-5.2 for reasoning and long-form writing, and do not load either above q4_K_M. Both run on a single 3060 (GLM-5.2 fully on the GPU, V4 Flash with partial CPU offload), but which one really sings depends on the workload.
Why the model choice matters more than the GPU in 2026
The 12GB RTX 3060 is one of the most popular GPUs for budget AI rigs right now. The TechPowerUp spec page lists 12GB of GDDR6 at 360 GB/s on a 192-bit bus, and that block of VRAM is just enough to host a 12-14B dense model at q4_K_M or a small Mixture-of-Experts model at q4 with offload. The hardware has not moved in three years, but the open-weights landscape has: in the last twelve weeks alone, Z.AI's GLM-5.2 and DeepSeek's V4 (and V4 Flash) both shipped weights you can actually download and run with llama.cpp. For a 12GB owner, the live question is no longer "can I run a competitive model?" but "which one?"
The honest framing: a 3060 is a small-context, small-model rig. You will not run BF16 weights of either family. You will not feed them 128k-token prompts. You will get a working coding pair-programmer or a working reasoning assistant on the same card, but you have to choose your quantization, your context window, and ideally your model with intent. This synthesis pulls together the public benchmark posts, the model cards, and community measurements from r/LocalLLaMA threads through June 2026 to settle the head-to-head on this specific class of card.
Key takeaways
- GLM-5.2 and DeepSeek V4 Flash both run on a 12GB card at q4_K_M: GLM-5.2 fully on-GPU, V4 Flash with partial CPU offload, and both with a limited context window.
- DeepSeek V4 Flash's MoE design produces faster token output on a 3060 once you accept partial CPU offload — expect roughly 18-26 tokens/sec in community measurements with a 6-8GB active-expert footprint.
- GLM-5.2 (dense, 14B class) runs at 22-30 tokens/sec at q4_K_M fully on-GPU, but only with a 4-8K context window.
- Pairing the 3060 with a fast CPU (Ryzen 7 5800X-class) recovers most of the loss from MoE offload — single-core speed matters more than core count once layers spill.
- Pick GLM-5.2 for long prose, planning, and chain-of-thought. Pick DeepSeek V4 Flash for code completion, tool calls, and JSON-shaped output.
- Buying a 3060 12GB as an "AI rig" GPU still beats a used RTX 3090 24GB at twice the price for everyone except people who want models too large for 12GB (30B-class and up).
Step 0: Which model fits 12GB at all?
The honest VRAM math, per quant level, for both models is the only thing that matters here. Approximate weights footprint before KV cache and activations:
| Model | fp16 | q8 | q6_K | q5_K_M | q4_K_M | q3_K_M | q2_K |
|---|---|---|---|---|---|---|---|
| GLM-5.2 (14B dense) | 28 GB | 14 GB | 11 GB | 9 GB | 8 GB | 6 GB | 4.5 GB |
| DeepSeek V4 Flash (MoE, ~21B / 3.6B active) | 42 GB | 22 GB | 17 GB | 14 GB | 12 GB | 9 GB | 6.5 GB |
On a 12GB RTX 3060, your usable budget for weights is roughly 9-10 GB after the OS, the desktop compositor, and a 4-8K context KV cache. That puts both models into the same box: q4_K_M is the sweet spot for both, and q3 is the only option if you want a 16K+ context. The difference is where offload starts. GLM-5.2 stays fully on-GPU up to q5_K_M and spills at q6_K, while V4 Flash at 12 GB already exceeds the budget at q4_K_M, so some of its layers live on the CPU at every quant above q3.
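The budget check can be sketched in a few lines of Python. The footprints are copied from the table above; the 9.5 GB budget is simply the midpoint of the 9-10 GB range, and the function names are illustrative rather than part of any tool.

```python
# Approximate weight footprints in GB, copied from the table above.
FOOTPRINT_GB = {
    "GLM-5.2": {"q6_K": 11, "q5_K_M": 9, "q4_K_M": 8, "q3_K_M": 6, "q2_K": 4.5},
    "DeepSeek V4 Flash": {"q6_K": 17, "q5_K_M": 14, "q4_K_M": 12, "q3_K_M": 9, "q2_K": 6.5},
}


def offload_fraction(weights_gb: float, budget_gb: float) -> float:
    """Share of the weights that cannot stay on the GPU (0.0 = fully resident)."""
    if weights_gb <= budget_gb:
        return 0.0
    return (weights_gb - budget_gb) / weights_gb


def largest_resident_quant(model: str, budget_gb: float):
    """Largest quant whose weights fit entirely inside the budget, or None."""
    fitting = {q: gb for q, gb in FOOTPRINT_GB[model].items() if gb <= budget_gb}
    return max(fitting, key=fitting.get) if fitting else None


BUDGET_GB = 9.5  # assumed midpoint of the 9-10 GB usable range
for name in FOOTPRINT_GB:
    q4 = FOOTPRINT_GB[name]["q4_K_M"]
    print(name, largest_resident_quant(name, BUDGET_GB),
          f"q4_K_M offload >= {offload_fraction(q4, BUDGET_GB):.0%}")
```

The offload figure is a floor, not a recommendation: real runs usually push more layers to the CPU than this to leave headroom for the KV cache and activations.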
The KV-cache cost is what trips people up. At a 32K context, llama.cpp's KV cache grows to roughly 1.5-2 GB in total on a 14B dense model. That is about half of the VRAM left over once the q4_K_M weights are loaded. The 3060 is not a long-context card: set --ctx-size to 4096 or 8192 and stop pretending you have a workstation.
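For a rough feel of how the KV cache scales with context, the standard estimate is two tensors (keys and values) per layer, each holding one vector per KV head per token. The layer count, KV-head count, and head dimension below are hypothetical values chosen for illustration, not figures from either model card; they happen to land at the low end of the range quoted above.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache size: keys + values, for every layer and every token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical 14B-class shape with aggressive grouped-query attention.
# These three numbers are assumptions for the sketch, not published specs.
LAYERS, KV_HEADS, HEAD_DIM = 48, 2, 128

for ctx in (4096, 8192, 32768):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"ctx={ctx:>6}: {gib:.2f} GiB of fp16 KV cache")
```

The cost is linear in both context length and KV-head count, so a model with 8 KV heads instead of 2 would need four times as much at the same context. Check the real values in the GGUF metadata before trusting any estimate.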
Spec-delta — what each model actually is
| Field | GLM-5.2 | DeepSeek V4 Flash |
|---|---|---|
| Architecture | Dense decoder-only transformer | Mixture-of-Experts |
| Parameter count | ~14B | ~21B total / ~3.6B active per token |
| Context | 128K (native) | 128K (native) |
| License | MIT-style permissive | MIT-style permissive |
| Strengths called out by the model card | Reasoning, math, instruction following | Tool use, code, JSON, latency |
| Release | early 2026 | mid-2026 |
The architectural delta is the whole story. A dense 14B model on a 3060 runs every parameter for every token — predictable speed, predictable VRAM, no surprises. An MoE model only activates a subset of experts per token; the catch is you still need every expert in memory, which is why V4 Flash needs CPU offload on a 12GB card.
How fast is each on a 12GB RTX 3060?
Numbers below are synthesized from public r/LocalLLaMA posts and llama.cpp benchmark threads through June 2026, using a Zotac Twin Edge RTX 3060 12GB or MSI Ventus 2X RTX 3060 12GB paired with a Ryzen 7 5800X and DDR4-3600. Yours will vary by ±15% depending on driver version and llama.cpp build.
| Model + Quant | Context | tok/s (generation) | Notes |
|---|---|---|---|
| GLM-5.2 q4_K_M | 4096 | 26-30 | Fully GPU |
| GLM-5.2 q4_K_M | 8192 | 22-25 | Fully GPU, tight |
| GLM-5.2 q5_K_M | 4096 | 18-21 | Fully GPU |
| GLM-5.2 q3_K_M | 16384 | 28-32 | Fully GPU, noticeable quality loss |
| DeepSeek V4 Flash q4_K_M | 4096 | 20-26 | 30-40% layers on CPU |
| DeepSeek V4 Flash q4_K_M | 8192 | 16-20 | More offload, slower |
| DeepSeek V4 Flash q3_K_M | 8192 | 22-26 | Less offload, lower quality |
| DeepSeek V4 Flash q5_K_M | 4096 | 11-14 | Heavy offload, painful |
For comparison, Tom's Hardware ranks the 3060 12GB at roughly half the BF16 throughput of a 3090, but for q4-quantized models generation is limited by memory bandwidth rather than compute. The 3090's 936 GB/s is about 2.6x the 3060's 360 GB/s, so treat community reports of the 3060 reaching two-thirds of the 3090's tok/s on dense ≤14B models as a best case, not a guarantee.
Quantization matrix and quality loss
A high-level rule from community evaluations: GLM-5.2 keeps roughly 94-96% of its fp16 quality at q4_K_M on reasoning benchmarks, and 88-91% at q3_K_M. DeepSeek V4 Flash holds 96-98% at q4_K_M (the MoE routing is very tolerant of quantization) but degrades quickly below q4 — by q3 you have lost the very thing that makes it good at code (deterministic JSON output, accurate tool selection).
Net: q4_K_M for both, q3 only if you must have more context, q5+ only if you accept slower output.
Prefill vs generation, and context-length impact
A 3060 is a generation-bound card. With 360 GB/s of memory bandwidth, a dense model that reads its entire weight set for every token is capped at roughly bandwidth / weight_size tok/s. For GLM-5.2 q4_K_M that is 360 GB/s over 8 GB, a theoretical ceiling around 45 tok/s; you measure 26-30 because of overhead.
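The ceiling is easy to reproduce. This sketch uses only the numbers already quoted (360 GB/s, an 8 GB q4_K_M file, 26-30 tok/s measured) and treats every token as one full pass over the weights, which is a simplification.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/sec if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


ceiling = decode_ceiling_tok_s(360, 8)           # RTX 3060, GLM-5.2 q4_K_M
efficiency = [m / ceiling for m in (26, 30)]     # measured range from the table

print(f"ceiling: {ceiling:.0f} tok/s")
print(f"achieved: {efficiency[0]:.0%} to {efficiency[1]:.0%} of the ceiling")
```

Landing at roughly 60% of the bandwidth ceiling is a useful sanity check: a result far below that suggests layers are spilling to the CPU or the context is larger than intended.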
Prefill (the prompt-ingestion phase) is much faster: 250-400 tok/s on a 3060 for either model. A 4K prompt ingests in about 10 seconds at the fast end and just over 16 seconds at the slow end. Where you feel pain is long generations: 1000 output tokens at 25 tok/s is 40 seconds. Plan your UI for that.
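End-to-end response time is just the two phases added together. A minimal sketch using the prefill and generation rates quoted in this section (the function name is illustrative):

```python
def response_seconds(prompt_tokens: int, output_tokens: int,
                     prefill_tok_s: float, gen_tok_s: float) -> float:
    """Wall-clock time: prompt ingestion plus token-by-token generation."""
    return prompt_tokens / prefill_tok_s + output_tokens / gen_tok_s


slow_prefill = 4096 / 250    # slow end of the 250-400 tok/s range
fast_prefill = 4096 / 400    # fast end
generation = 1000 / 25       # 1000 output tokens at 25 tok/s

print(f"prefill: {fast_prefill:.1f}-{slow_prefill:.1f} s, generation: {generation:.0f} s")
print(f"worst case total: {response_seconds(4096, 1000, 250, 25):.1f} s")
```

Generation dominates as soon as the answer is more than a few hundred tokens, which is why streaming output matters more than shaving prefill time on this card.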
CPU offload: why the 5800X actually matters
When DeepSeek V4 Flash spills 30-40% of its layers to RAM, the per-token cost of those layers is bounded by single-thread CPU performance and DDR4 memory bandwidth, not core count. The AMD Ryzen 7 5800X is a near-ideal pairing: eight Zen 3 cores boosting to 4.7 GHz, 32 MB of L3, and dual-channel DDR4-3600. Its single-thread score (~3300 on PassMark in 2026) puts the CPU-resident layers at ~50% the speed of a modern Zen 5 chip. Because only part of each token's work runs on the CPU, that costs roughly 15-25% of your tok/s versus a current AM5 build, which is acceptable on a budget rig.
What does not help: 16+ cores or NUMA. llama.cpp's CPU backend pins one expert to a small set of cores; adding cores beyond ~8 only helps if you also raise -t and -tb in the launch flags, and on a 3060 you are bandwidth-bound on the GPU side anyway.
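To see why the CPU side dominates once layers spill, here is a bandwidth-only sketch. It assumes about 2 GB of weights are read per token (3.6B active parameters at roughly 4.5 bits each) and that those reads split between GPU and CPU in proportion to the offloaded layers. Both are simplifying assumptions, and the result is an upper bound, well above the measured 20-26 tok/s.

```python
def ddr_bandwidth_gb_s(mt_per_s: int, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical DRAM bandwidth: transfers/s x channels x 8-byte bus."""
    return mt_per_s * channels * bus_bytes / 1000


def split_decode(active_gb: float, cpu_fraction: float,
                 gpu_bw: float, cpu_bw: float):
    """Return (gpu_seconds, cpu_seconds, tok_per_s) for one token."""
    gpu_time = active_gb * (1 - cpu_fraction) / gpu_bw
    cpu_time = active_gb * cpu_fraction / cpu_bw
    return gpu_time, cpu_time, 1 / (gpu_time + cpu_time)


ddr4_3600 = ddr_bandwidth_gb_s(3600)                     # dual-channel DDR4-3600
gpu_t, cpu_t, tok_s = split_decode(2.0, 0.35, 360, ddr4_3600)

print(f"DDR4-3600 dual channel: {ddr4_3600:.1f} GB/s")
print(f"CPU share of per-token time: {cpu_t / (gpu_t + cpu_t):.0%}")
print(f"bandwidth-only ceiling: {tok_s:.0f} tok/s")
```

Even with only 35% of the reads on the CPU side, system RAM accounts for most of the per-token time in this model, which is the intuition behind pairing the card with fast memory and a strong CPU.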
For models like these, fast NVMe (a WD Blue SN550 1TB or better) only affects model load time — once the weights are mapped into memory, your SSD does nothing.
Coding vs reasoning vs chat: which wins which
Get DeepSeek V4 Flash if you:
- Want a code completion model — community evaluations on HumanEval and MBPP at q4_K_M put V4 Flash within 2 points of GPT-4o-mini.
- Use a model for tool calls in agentic flows. JSON output is sharper, tool selection more deterministic.
- Need fast first-token latency. MoE routing keeps prefill cheap.
- Are happy living at 4K context.
Get GLM-5.2 if you:
- Want a "default" assistant for writing, planning, summarization, light reasoning.
- Care about chain-of-thought quality for math and logic.
- Want fully-on-GPU operation with no CPU offload overhead.
- Live at 8K+ context for long documents.
Get both if you have 50GB of SSD to spare. They cost nothing to keep around, and llama.cpp swaps between them in seconds.
Perf-per-dollar + perf-per-watt on a 3060-class rig
The 3060 12GB is currently $260-310 new and $180-220 used (2026 prices, per community pricing threads). At 26 tok/s on GLM-5.2 q4_K_M, that is roughly 0.08-0.10 tok/sec/dollar new, easily best-in-class for cards above 8GB of VRAM. Power draw under sustained inference sits at 130-160W, so perf-per-watt clocks in around 0.18 tok/sec/W, well behind a 4060 Ti 16GB but ahead of every 8GB Ampere card you would substitute in. The MSI Ventus 2X is the quieter of the two SKUs we list; the Zotac Twin Edge runs ~3°C cooler at the same fan curve.
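The ratios are simple divisions, shown here across the full price and power ranges so the spread is visible. All inputs are the figures quoted in this section; nothing is measured.

```python
TOK_S = 26                      # GLM-5.2 q4_K_M at 4K context, low end of the table
PRICE_NEW = (260, 310)          # USD, new
PRICE_USED = (180, 220)         # USD, used
POWER_W = (130, 160)            # sustained inference draw


def per_unit(tok_s: float, low: float, high: float):
    """Return (worst, best) tok/s per unit of cost or power."""
    return tok_s / high, tok_s / low


new_worst, new_best = per_unit(TOK_S, *PRICE_NEW)
used_worst, used_best = per_unit(TOK_S, *PRICE_USED)
watt_worst, watt_best = per_unit(TOK_S, *POWER_W)

print(f"new:  {new_worst:.3f}-{new_best:.3f} tok/s per dollar")
print(f"used: {used_worst:.3f}-{used_best:.3f} tok/s per dollar")
print(f"power: {watt_worst:.3f}-{watt_best:.3f} tok/s per watt")
```

The same helper works for any other card you want to compare: swap in its price range, power draw, and a tok/s figure measured at the same quant and context.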
Common pitfalls
- Loading at q5 or q6 because "it fits". It fits weights-only. Once you add KV cache for a useful context, your generation rate halves. Stay at q4_K_M unless you have measured.
- Running with n-gpu-layers=99 blindly. On V4 Flash, the right number is roughly 28-32 of the 60 total layers. Past that you get OOM under load, not at launch. Tune by bisecting the layer count until the run is stable.
- Comparing tok/s between llama.cpp and Ollama without matching settings. Ollama has long defaulted to num_ctx=2048, older llama.cpp builds defaulted to a 512-token context, and both defaults have changed across releases. Pin the same context explicitly to compare.
- Forgetting that resizable BAR matters. Make sure your motherboard has it enabled; on a 3060 it bumps prefill by 8-12%.
- Skipping the warm-up batch. Both models need ~30 tokens of warm-up before the per-token timing stabilizes — discard the first batch when benchmarking.
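Finding the right GPU layer count is a binary search over a yes/no test. The sketch below assumes stability is monotonic (if N layers are stable, so is every smaller count) and takes a hypothetical `is_stable` callback, which in practice would launch llama.cpp with that many layers and run a sustained load test; that callback is not provided here.

```python
from typing import Callable


def max_stable_layers(is_stable: Callable[[int], bool], total_layers: int) -> int:
    """Largest layer count in [0, total_layers] for which is_stable() holds.

    Assumes 0 layers on the GPU always works and that stability is monotonic.
    """
    lo, hi = 0, total_layers
    while lo < hi:
        mid = (lo + hi + 1) // 2      # round up so the loop always makes progress
        if is_stable(mid):
            lo = mid                  # mid works, look for something larger
        else:
            hi = mid - 1              # mid fails, everything above it fails too
    return lo


# Stand-in for a real load test: pretend anything above 30 layers runs out of memory.
def fake_test(n_layers: int) -> bool:
    return n_layers <= 30


print(max_stable_layers(fake_test, 60))
```

For a 60-layer model this needs about six test runs instead of sixty. Make each test long enough to fill the context, since the failure mode described above only shows up under load.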
When NOT to run either on a 3060
If you need agentic workflows that actually plan over more than 30K tokens of history, neither model fits comfortably with usable context on a single 12GB card. You will be happier with a used RTX 3090 24GB (q4 fits 30B-class models fully on-GPU; 70B-class still needs offload or a much lower quant) or, at the high end, a Mac Studio M3 Ultra with 96GB of unified memory.
If you need sub-200ms first-token latency for a UI, you are bandwidth-starved no matter the model. Move up the memory hierarchy.
Bottom line + recommended pick
Default pick on a 12GB RTX 3060 in 2026: GLM-5.2 at q4_K_M, 8K context. It is the most predictable model on this card: fully GPU-resident, stable throughput, and useful enough at writing, reasoning, and chat that you almost never need to switch. If your daily work is coding or tool use, add DeepSeek V4 Flash q4_K_M as a second profile and switch by relaunching llama.cpp's server with the other model file. The --alias flag only sets the model name the API reports, so give each profile its own alias to keep clients pointed at the right one.
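One way to keep two profiles is a small launcher that builds the llama-server command line for each. The model file paths and alias names below are hypothetical placeholders. The layer count of 30 for V4 Flash comes from the 28-32 range in the pitfalls list, and 99 is the usual shorthand for "put every layer on the GPU".

```python
import shlex


def server_command(model_path: str, alias: str, ctx_size: int,
                   gpu_layers: int, port: int = 8080) -> list:
    """Build an argv list for llama-server with the context and offload pinned."""
    return [
        "llama-server",
        "--model", model_path,
        "--alias", alias,                    # name reported by the HTTP API
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", str(gpu_layers),
        "--port", str(port),
    ]


# File names are placeholders; point them at wherever your GGUF files live.
PROFILES = {
    "writing": server_command("models/glm-5.2-q4_k_m.gguf", "glm-5.2", 8192, 99),
    "coding": server_command("models/deepseek-v4-flash-q4_k_m.gguf", "v4-flash", 4096, 30),
}

for name, argv in PROFILES.items():
    print(f"{name}: {shlex.join(argv)}")
```

Pass the chosen list to `subprocess.Popen` to start the server, and stop the running process before starting the other profile, since both will not fit in 12GB at once.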
If you are buying the GPU itself today, the ZOTAC Gaming RTX 3060 Twin Edge OC 12GB is the steady value; the MSI Ventus 2X RTX 3060 12G OC trades a few dollars for a quieter cooler. Either pairs cleanly with the AMD Ryzen 7 5800X on a B550 board for a sub-$1000 local-AI rig that runs both models above.
Citations and sources
- TechPowerUp — GeForce RTX 3060 spec page
- ggerganov/llama.cpp on GitHub
- Tom's Hardware — PC components: GPUs
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
