Qwen 3.6 35B A3B on RTX 3060 12GB — Real Tok/s and What to Expect
Short answer: yes, an RTX 3060 12GB can run Qwen 3.6 35B A3B comfortably: Q3_K_M fits fully in VRAM, and Q4_K_M works with a handful of layers offloaded to system RAM. Per public llama.cpp benchmarks aggregated on r/LocalLLaMA, expect roughly 18–28 tok/s generation at short contexts on Q3_K_M, dropping to 10–15 tok/s as context grows toward 8K, with a further 1.5–1.8× lift when Multi-Token Prediction (MTP) speculative decoding is enabled.
The Qwen 3.6 35B A3B release reshaped the "what can my budget GPU actually run" calculus this month. The numbers floating around r/LocalLLaMA — community threads like "The Qwen 3.6 35B A3B hype is real!!!" and the RTX 5070 Ti benchmark posts — are not just enthusiasm: they reflect a real architectural shift. Mixture-of-Experts models that activate only a few billion parameters per token (the "A3B" suffix on Qwen's release naming literally means Active 3 Billion) finally bring 30B-class quality into the price bracket of a used or new RTX 3060 12GB.
That matters because the RTX 3060 12GB has been the unloved-but-correct answer for entry-level local LLM hosting for two years now. Its 12GB of VRAM is more than the 8GB on a 4060, the 192-bit memory bus is wide enough to keep prompt processing competitive, and the street price still hovers under \$300 for new cards (less for used). Until A3B-class MoE models arrived, asking the 3060 to host a 35B-parameter model meant aggressive Q2_K quantization, painful quality loss, and 6–8 tok/s. Now, with Qwen 3.6 A3B benchmark numbers landing in public threads, the same card runs the same nominal parameter count at usable interactive speeds — because only 3B parameters are actually computed per token.
This piece is a buying-decision article: not a how-to install guide, not a quantization tutorial, but a synthesis of what public benchmarks say the local-LLM experience on an RTX 3060 looks like for Qwen 3.6 35B A3B under llama.cpp, and when stepping up to a 4060 Ti 16GB or 5070 Ti is the right call. All numbers cited inline are from community-published runs; no first-party benchmarking is reported.
Key Takeaways
- The MoE A3B architecture activates ~3B parameters per token from a 35B total pool, slashing per-token compute and making the model VRAM-bandwidth-bound on a 12GB card.
- Q3_K_M is the sweet spot for 12GB: ~10.5GB model weights, leaving headroom for KV cache up to ~8K context.
- Q4_K_M is achievable with partial CPU offload (~4–6 layers to system RAM), trading 30–40% of generation tok/s for noticeably tighter outputs.
- Per LocalLLaMA llama.cpp benchmark threads, expect 18–28 tok/s generation at short context on Q3_K_M, 10–15 tok/s at 8K.
- MTP speculative decoding adds a community-measured 1.5–1.8× speedup on top — that's a 3060 12GB hitting 30+ tok/s effective on simple completions.
- Stepping up to a 4060 Ti 16GB unlocks Q5_K_M fully in VRAM; the 5070 Ti unlocks Q6_K with room for 16K+ context.
What's an A3B model and why does it fit on a 3060?
A3B stands for Active 3 Billion — the marketing shorthand Qwen uses to describe the mixture-of-experts (MoE) routing in this generation. The model has 35B total parameters spread across many expert subnetworks, but the router selects only a small subset for each token. Per the Qwen 3.6 release notes on GitHub, the active parameter count is ~3B; the remaining ~32B sit in VRAM (or system RAM, on offload) but contribute zero FLOPs to any individual token.
Why does this matter for a 12GB card? Two reasons. First, the bandwidth required per token is much lower than a dense 35B — only the activated experts plus the shared attention layers need to be read from VRAM. The RTX 3060's 360 GB/s memory bandwidth, often the bottleneck on dense models, becomes adequate. Second, the compute per token is dramatically reduced — the GPU isn't doing dense matmuls across 35B parameters, it's doing dense matmuls across 3B parameters plus routing. The 3060's modest 12.7 TFLOPS FP32 compute keeps up.
The catch: you still need to store all 35B parameters somewhere fast enough that the router can pull whichever experts it picks. Q3_K_M quantization compresses the full 35B into ~10.5GB. That fits on a 12GB card with room for the KV cache. Q4_K_M is ~13.5GB — over budget by 1.5GB, which forces a partial CPU offload of 4–6 layers via llama.cpp's --n-gpu-layers knob. Inference speed drops because system-RAM-resident experts are read over PCIe (16–32 GB/s effective) rather than VRAM (360 GB/s), but quality climbs.
This combination — MoE sparsity + aggressive quantization + partial offload — is the unlock that puts a 35B-class model on a sub-\$300 GPU. Two years ago this would have been a 4090-class workload. The budget local-LLM community has been chasing this exact configuration since the Mixtral 8x7B days; Qwen 3.6 A3B is the first 35B-tier model where the numbers land cleanly.
Quantization matrix — Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0
Per llama.cpp release notes and community-published GGUF size measurements for Qwen 3.6 35B A3B:
| Quant | Approx file size | Fits in 12GB? | Generation tok/s (short ctx) | Quality loss vs FP16 |
|---|---|---|---|---|
| Q2_K | ~7.2 GB | Yes, full GPU | 22–30 | Noticeable — code/math degrade |
| Q3_K_M | ~10.5 GB | Yes, full GPU | 18–28 | Mild — best 12GB-native pick |
| Q4_K_M | ~13.5 GB | Partial offload (4–6 layers) | 11–18 | Minor — recommended quality target |
| Q5_K_M | ~16.0 GB | Heavy offload (12+ layers) | 6–10 | Negligible |
| Q6_K | ~19.0 GB | Mostly CPU-resident | 4–7 | Effectively lossless |
| Q8_0 | ~25.0 GB | CPU-resident (impractical on 3060) | 2–4 | Lossless |
The cited tok/s ranges are pulled from r/LocalLLaMA threads benchmarking Qwen 3.6 35B A3B on RTX 3060 12GB under llama.cpp with default settings, batch size 1, and 2K-context prompts. Numbers vary by quantization tool, prompt characteristics, and CUDA driver version — treat the table as a rough map, not a calibration.
The buying decision falls between Q3_K_M and Q4_K_M. Q3_K_M is the only fully-VRAM-resident option that doesn't compromise quality the way Q2_K does, and you get the full ~25 tok/s the card is capable of. Q4_K_M pushes quality up a meaningful notch — the difference shows up in code generation, structured-output reliability, and long-context reasoning — but you pay with 30–40% lower throughput due to the offloaded experts. If your workflow tolerates 12–15 tok/s, Q4_K_M is the better default.
llama.cpp build flags + offload split for 12GB
For a clean build targeting an RTX 3060 12GB on Linux, a representative sequence looks like this (standard llama.cpp CMake flags; adjust paths to your environment):
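```bash
# Fetch and build llama.cpp with CUDA, compiling only for Ampere (sm_86)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j
```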
The CMAKE_CUDA_ARCHITECTURES=86 flag targets Ampere (the 3060's compute capability), which produces a slimmer binary than the multi-arch default. For a Q3_K_M run with the full model on the GPU, an invocation along these lines works (model filename illustrative):
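```bash
# Serve the Q3_K_M quant fully from VRAM with an 8K context window
./build/bin/llama-server \
  -m models/qwen3.6-35b-a3b-Q3_K_M.gguf \
  --n-gpu-layers 999 \
  --ctx-size 8192
```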
Setting --n-gpu-layers 999 tells llama.cpp to offload every layer it can; with Q3_K_M weights at ~10.5GB and an 8K-context KV cache at ~1.3GB, you're at ~11.8GB, just under the 12GB ceiling. If you hit OOM at higher contexts, drop --ctx-size to 4096 or step down to --n-gpu-layers 30 and let the remaining transformer blocks live on CPU.
For a Q4_K_M run with partial offload, the typical split is --n-gpu-layers 38 (the model has 48 layers total in Qwen 3.6 35B; numbers from community quantization metadata). The remaining 10 layers run on CPU, fed over PCIe; that's more than the 4–6 layer minimum quoted earlier because an 8K KV cache claims VRAM that would otherwise hold weights. Inference speed drops because of the PCIe round-trip, but Q4_K_M quality is worth it for any code or structured-output workload.
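A minimal sketch of that split, using the same illustrative filename convention:

```bash
# Q4_K_M with partial offload: 38 of the 48 layers on GPU, the rest on CPU
./build/bin/llama-server \
  -m models/qwen3.6-35b-a3b-Q4_K_M.gguf \
  --n-gpu-layers 38 \
  --ctx-size 4096
```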
The --flash-attn flag is on by default in recent llama.cpp builds and gives a measurable speedup on Ampere — keep it enabled unless you're debugging.
Prefill vs generation tok/s at 2K, 4K, 8K context
Per llama.cpp performance reports on r/LocalLLaMA for Qwen 3.6 35B A3B on an RTX 3060 12GB at Q3_K_M:
| Context | Prefill tok/s | Generation tok/s |
|---|---|---|
| 2K | ~520 | 25–28 |
| 4K | ~410 | 18–22 |
| 8K | ~280 | 12–15 |
Prefill — the rate at which the model processes the prompt — is compute-bound and benefits heavily from the 3060's tensor cores. Generation — the rate at which it produces tokens — is memory-bandwidth-bound, which is why generation tok/s degrades more steeply with context length (the KV cache grows and competes with weight reads for bandwidth).
The practical implication: short, focused prompts feel fast on a 3060 12GB. Long-context summarization or RAG workloads (8K+ retrieved chunks) drop into the 10–15 tok/s range, which is still usable for batch processing but feels sluggish for interactive chat. If your use case is heavy on long-context, this is the strongest argument to step up to a 16GB card.
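If you want to sanity-check these figures on your own card, llama.cpp ships a benchmarking tool; a minimal invocation (model filename illustrative) that sweeps the same three context lengths looks something like:

```bash
# Measure prompt processing at 2K/4K/8K and 128-token generation, all layers on GPU
./build/bin/llama-bench \
  -m models/qwen3.6-35b-a3b-Q3_K_M.gguf \
  -p 2048,4096,8192 \
  -n 128 \
  -ngl 999
```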
Multi-Token Prediction (MTP) speedup — community-measured 1.5–1.8× per LocalLLaMA
Multi-Token Prediction, the technique sometimes called "speculative decoding" or "draft model decoding," uses a smaller, faster model to propose several tokens at once, then has the main model verify them in parallel. When the draft model's guesses match, you get those tokens "for free" — the verification pass costs the same as generating one token but produces several.
Per LocalLLaMA threads measuring MTP speculative decoding with Qwen on an RTX 3060 12GB, the speedup landed in the 1.5–1.8× range for typical chat workloads, with peaks above 2× on highly predictable output (code completion, structured JSON). The draft model is typically a Qwen 3 0.6B or 1.5B GGUF, which uses negligible additional VRAM (~500MB at Q4) but produces high-quality drafts because it's the same model family.
To enable it in llama.cpp, point the server at both the main model and a small draft model; a representative invocation (filenames illustrative):
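```bash
# Pair the main model with a small same-family draft model
# -ngld keeps the draft model's layers on the GPU as well
./build/bin/llama-server \
  -m models/qwen3.6-35b-a3b-Q3_K_M.gguf \
  -md models/qwen3-0.6b-Q4_K_M.gguf \
  --n-gpu-layers 999 \
  -ngld 999 \
  --draft-max 16 \
  --draft-min 4
```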
The -md flag points at the draft model, and --draft-max/--draft-min control how many tokens the draft proposes per verification round. The 1.5–1.8× speedup pushes effective tok/s into the 30–45 range at short contexts on Q3_K_M — territory that used to require a 4070-class card on dense models.
RTX 3060 12GB vs RTX 4060 Ti 16GB vs RTX 5070 Ti for Qwen 3.6 35B
| GPU | VRAM | Max comfortable quant | Tok/s at Q3_K_M | Tok/s at Q4_K_M | Street price |
|---|---|---|---|---|---|
| RTX 3060 12GB | 12 GB | Q3_K_M full / Q4_K_M offload | 18–28 | 11–18 | ~\$280 new |
| RTX 4060 Ti 16GB | 16 GB | Q4_K_M full / Q5_K_M offload | 26–34 | 22–28 | ~\$450 new |
| RTX 5070 Ti | 16 GB | Q5_K_M full / Q6_K offload | 50–70 | 45–60 | ~\$800 new |
The 4060 Ti 16GB is the obvious mid-tier step-up. It puts Q4_K_M fully in VRAM and eliminates PCIe offload entirely for that quantization; its memory bandwidth is actually narrower than the 3060's (288 GB/s vs 360 GB/s), but the extra 4GB of capacity matters far more here, and keeping every expert in VRAM gets you to 22–28 tok/s on Q4_K_M. That's a meaningful interactive-chat improvement over the 3060's Q4_K_M offload performance.
The 5070 Ti is a different class entirely — Blackwell tensor cores, GDDR7, and bandwidth approaching 900 GB/s. Per community RTX 5070 Ti threads on r/LocalLLaMA, the same Qwen 3.6 35B A3B model at Q5_K_M lands in the 50–70 tok/s range, with room for Q6_K and 16K+ contexts. The price gap (~\$800 vs \$280) is the deciding factor, not the technical capability.
Verdict — when the 3060 is the right pick, when to step up
The RTX 3060 12GB is the right pick if:
- You're building your first local LLM rig and your budget caps at ~\$300 for the GPU
- Your primary use cases are chat, light coding assistance, summarization at <4K context
- You're comfortable running Q3_K_M and treating Q4_K_M as a "quality mode" you toggle on for important prompts
- You plan to use MTP speculative decoding (effectively turns the 3060 into a 4060-class card for many workloads)
Step up to the 4060 Ti 16GB if:
- You want Q4_K_M as your default, fully in VRAM, no offload penalty
- Your prompts routinely run 8K+ context (long documents, RAG)
- You're willing to spend ~\$170 more for ~50% better real-world tok/s
Step up to the 5070 Ti if:
- You're doing serious local development with structured outputs, agentic loops, or long-context analysis
- Q5_K_M is your target quality bar
- The \$500+ price delta over a 4060 Ti 16GB is justified by daily-use frequency
Bottom line
Qwen 3.6 35B A3B is the moment when a sub-\$300 GPU finally hosts a 35B-class model at interactive speeds, and the RTX 3060 12GB is the natural beneficiary. Public benchmarks point to 18–28 tok/s on Q3_K_M, 11–18 tok/s on Q4_K_M with offload, and a 1.5–1.8× lift from MTP that pushes the effective ceiling above 40 tok/s. The card was already the budget-tier sweet spot for 7B–13B dense models; A3B sparsity extends its useful life into the 30B tier without any hardware change.
The decision tree is simple: if you've already got a 3060 12GB, pairing it with Qwen 3.6 35B A3B is the upgrade-without-upgrading move of the year. If you're buying new and the budget is fixed at \$300, this is the card. If you have \$450 to spend, the 4060 Ti 16GB is the cleaner option and removes the offload-induced latency from Q4_K_M.
Related guides
- The Qwen 3.6 35B A3B MTP on RTX 3060 12GB deep-dive covers the MTP setup in more detail
- For a broader budget-GPU comparison see best GPUs for local LLMs under \$500
- The llama.cpp tuning checklist walks through every relevant build flag
Citations and sources
- Qwen 3.6 GitHub release notes — A3B architecture description, active parameter count, official quantization benchmarks
- llama.cpp release notes — CUDA performance changelog — --flash-attn, MTP/speculative decoding flags, Ampere build optimization
- r/LocalLLaMA — "The Qwen 3.6 35B A3B hype is real!!!" — community Qwen 3.6 launch-week benchmark thread (tok/s tables for 3060 / 4060 Ti / 5070 Ti)
- r/LocalLLaMA — RTX 5070 Ti Qwen 3.6 benchmark thread — comparison datapoints across Ampere / Ada / Blackwell generations
- r/LocalLLaMA — MTP speculative decoding measurements — 1.5–1.8× speedup measurements on Qwen 3.6 with 0.6B draft
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
