Yes, you can run Qwen3.6 35B on a single RTX 3060 12GB, but only with CPU offload. The model needs roughly 22 GB at q4_K_M, so the 3060 holds about half the layers in VRAM while the remainder spill to system RAM. Expect 3–6 tokens per second for generation in a typical 8K-context chat, dropping to 2–3 tokens per second on long prompts.
Why this article exists
The release of Qwen3.6 35B in mid-2026 reset what counts as a "useful local model" for newcomers. Until very recently, anything beyond 14B parameters required a 24GB card or a multi-GPU rig. Qwen3.6 changed that math: the architecture is tuned for aggressive quantization, the prefill kernels in llama.cpp and vLLM are now mature on Ampere, and 12GB cards are still the dominant tier in the Steam Hardware Survey. That combination means a $250 used RTX 3060 12GB is suddenly the cheapest plausible on-ramp into 30B-class local inference.
The trade-off is real, though. Running a 35B model on 12GB of VRAM is not the same experience as running a 14B model on the same card. Offload changes everything: prefill becomes the bottleneck on long prompts, KV cache eats into your available VRAM faster than you would expect, and the speed gap between "fits fully on GPU" and "spills to CPU" is roughly an order of magnitude. This piece walks through exactly what fits, what does not, and where the 3060 12GB stops being the right answer.
We are not claiming first-party measurements. The numbers below are synthesized from publicly reported benchmarks in the Qwen team blog, Hugging Face model cards, the llama.cpp performance discussions on GitHub, and community measurements posted to r/LocalLLaMA. Where a number varies materially across sources, we note the spread.
Key takeaways
- Qwen3.6 35B at q4_K_M weighs about 21–22 GB on disk and at runtime, so the RTX 3060 12GB cannot hold the whole model.
- With roughly 28–32 of the 64 layers on GPU and the rest on CPU, expect 3–6 tok/s generation in 8K-context chat and 2–3 tok/s with a long prompt.
- 32 GB of system DDR4 at 3600 MT/s is the practical floor; 16 GB will swap once the runtime, KV cache, and other apps are resident.
- A faster CPU and faster RAM help generation noticeably; a faster SSD only helps cold-start load times.
- If you want more layers on the GPU, the next sensible step is a 16 GB card such as the RTX 4060 Ti 16GB; running q4_K_M entirely in VRAM takes a 24 GB card such as a used RTX 3090.
What is Qwen3.6 35B and how big is it on disk per quant?
Qwen3.6 35B is a 64-layer, dense Transformer with grouped-query attention and a 128K-token context window. The Qwen team's release post lists the parameter count at about 35 billion and the native FP16 size at roughly 70 GB on disk. Quantization compresses both the on-disk weights and the runtime footprint at the cost of a small quality regression.
The community-quantized GGUF files at Bartowski's and TheBloke-style Hugging Face mirrors give a clean per-quant view:
| Quant | On-disk size | Runtime VRAM (weights only) | Quality vs FP16 |
|---|---|---|---|
| q2_K | ~13.0 GB | ~13.5 GB | Heavy degradation; not recommended |
| q3_K_M | ~16.3 GB | ~17.0 GB | Visible regressions on reasoning |
| q4_K_M | ~20.8 GB | ~21.5 GB | Minor regressions; community default |
| q5_K_M | ~24.3 GB | ~25.0 GB | Near-FP16 in most benchmarks |
| q6_K | ~28.2 GB | ~29.0 GB | Very close to FP16 |
| q8_0 | ~36.5 GB | ~37.5 GB | Effectively FP16-equivalent |
| FP16 | ~70.0 GB | ~70.5 GB | Reference |
"Runtime VRAM" here is weights only — KV cache and activations add another 1–4 GB depending on context length, which we cover later.
Does Qwen3.6 35B fit in 12 GB of VRAM, or do you have to offload?
It does not fit at any quant you would actually want to run. Even q2_K — which is too degraded to be useful for serious work — needs ~13.5 GB, which already exceeds the 12 GB the RTX 3060 exposes. The realistic answer is partial offload: keep as many of the 64 transformer layers as possible on the GPU, and let the rest run on the CPU through llama.cpp's offload path.
A practical configuration at q4_K_M on a stock 12 GB 3060 is about 28–32 layers on GPU and 32–36 layers on CPU. That leaves roughly 1.5–2 GB of VRAM free for the KV cache at 8K context. Going below 28 layers on GPU starts to hurt more than it helps; going above 32 typically forces VRAM exhaustion the moment you load a longer prompt.
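The "does it fit" question can be checked mechanically. The sketch below takes the weights-only sizes from the per-quant table and asks which quants fit entirely in a given amount of VRAM. The 1.5 GB reserve for KV cache and runtime overhead is an assumption borrowed from the layer-count note later in this article, and the function names are illustrative only.

```python
# Weights-only runtime sizes (GB) copied from the per-quant table above.
RUNTIME_WEIGHTS_GB = {
    "q2_K": 13.5,
    "q3_K_M": 17.0,
    "q4_K_M": 21.5,
    "q5_K_M": 25.0,
    "q6_K": 29.0,
    "q8_0": 37.5,
}


def fits_fully(weights_gb: float, vram_gb: float, reserve_gb: float = 1.5) -> bool:
    """True if the weights plus a reserve for KV cache and runtime fit in VRAM."""
    return weights_gb + reserve_gb <= vram_gb


def quants_that_fit(vram_gb: float, reserve_gb: float = 1.5) -> list[str]:
    """List the quants whose weights fit in VRAM without any CPU offload."""
    return [
        quant
        for quant, weights_gb in RUNTIME_WEIGHTS_GB.items()
        if fits_fully(weights_gb, vram_gb, reserve_gb)
    ]


# On a 12 GB card nothing fits, even with the reserve set to zero.
print(quants_that_fit(12.0))                  # []
print(quants_that_fit(12.0, reserve_gb=0.0))  # []
```

Calling `quants_that_fit` with a different `vram_gb` gives a first-pass answer for other cards. It is a rough screen only, since real headroom depends on context length and driver overhead.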
Spec table: RTX 3060 12GB vs the VRAM Qwen3.6 35B needs per quant
| Quant | Total weight VRAM | Layers on 3060 12GB | Layers offloaded to CPU | Quality vs FP16 |
|---|---|---|---|---|
| q2_K | ~13.5 GB | ~50 of 64 | ~14 | Severe — avoid |
| q3_K_M | ~17.0 GB | ~42 of 64 | ~22 | Noticeable on reasoning |
| q4_K_M | ~21.5 GB | ~30 of 64 | ~34 | Community default |
| q5_K_M | ~25.0 GB | ~26 of 64 | ~38 | Minimal vs FP16 |
| q6_K | ~29.0 GB | ~22 of 64 | ~42 | Essentially FP16 |
| q8_0 | ~37.5 GB | ~17 of 64 | ~47 | Reference-grade |
Layer counts assume ~340 MB per layer at q4 with 1.5 GB reserved for KV cache, activations, and the runtime context.
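The layer split can be approximated with a few lines of arithmetic. This is a minimal sketch that assumes all 64 layers are the same size (weights divided evenly) and uses the 1.5 GB reserve stated above. It lands within a few layers of the table, but it does not reproduce every row exactly (q3_K_M in particular), because real layers are not uniform and the table's figures come from the sources this article synthesizes.

```python
import math

TOTAL_LAYERS = 64  # layer count the article gives for Qwen3.6 35B

# Weights-only runtime sizes (GB) from the article's per-quant table.
WEIGHTS_GB = {
    "q2_K": 13.5,
    "q3_K_M": 17.0,
    "q4_K_M": 21.5,
    "q5_K_M": 25.0,
    "q6_K": 29.0,
    "q8_0": 37.5,
}


def gpu_layers(weights_gb: float, vram_gb: float = 12.0,
               reserve_gb: float = 1.5, total_layers: int = TOTAL_LAYERS) -> int:
    """Estimate how many layers fit on the GPU, assuming equal-sized layers."""
    per_layer_gb = weights_gb / total_layers
    budget_gb = vram_gb - reserve_gb
    fit = math.floor(budget_gb / per_layer_gb)
    return max(0, min(total_layers, fit))


for quant, gb in WEIGHTS_GB.items():
    on_gpu = gpu_layers(gb)
    print(f"{quant:7s} ~{on_gpu} on GPU, ~{TOTAL_LAYERS - on_gpu} on CPU")
```

The result is a reasonable starting value for -ngl. In practice you would nudge it down until a long prompt loads without running out of VRAM.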
Benchmark table: tok/s on RTX 3060 12GB at q4_K_M with CPU offload vs full-GPU smaller models
Numbers below are the median of publicly reported community measurements at 8K context, using llama.cpp built with CUDA 12 and -ngl (n-GPU-layers) tuned to fill VRAM without OOM. Synthesis sources include llama.cpp discussion #5021 threads and r/LocalLLaMA benchmark posts from May 2026.
| Model | Quant | Fits fully on 3060 12GB? | Prefill tok/s | Generation tok/s |
|---|---|---|---|---|
| Qwen3.6 7B | q4_K_M | Yes | ~280 | ~58 |
| Qwen3.6 14B | q4_K_M | Yes | ~165 | ~34 |
| Qwen3.6 32B (dense) | q4_K_M | No (~21GB) | ~38 | ~4.0 |
| Qwen3.6 35B | q3_K_M | No (~17GB, 42 layers GPU) | ~62 | ~5.5 |
| Qwen3.6 35B | q4_K_M | No (~21GB, 30 layers GPU) | ~46 | ~4.2 |
| Qwen3.6 35B | q5_K_M | No (~25GB, 26 layers GPU) | ~38 | ~3.4 |
The interesting line is the gap between 14B at q4 (34 tok/s, fits) and 35B at q4 (4.2 tok/s, partial offload). The 35B model is roughly 8× slower despite being only 2.5× larger, which is the offload tax made visible.
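To separate the cost of a bigger model from the cost of offload, one rough approach is to assume that generation speed on a fully GPU-resident model scales inversely with parameter count. That scaling is a simplifying assumption, not a measured result. Under it, the sketch below uses the two q4_K_M rows from the table to estimate what 35B would reach if it fit, and how much of the gap is left over for offload.

```python
def offload_tax(small_tps: float, small_params_b: float,
                large_tps: float, large_params_b: float) -> tuple[float, float]:
    """Return (expected tok/s if the large model fit on GPU, leftover slowdown).

    Assumes tok/s scales inversely with parameter count when both models
    fit fully in VRAM. This is a simplification for illustration.
    """
    expected_large_tps = small_tps * small_params_b / large_params_b
    leftover_slowdown = expected_large_tps / large_tps
    return expected_large_tps, leftover_slowdown


# Table values: 14B at 34 tok/s (fits), 35B at 4.2 tok/s (partial offload).
expected, tax = offload_tax(34.0, 14.0, 4.2, 35.0)
print(f"expected if fully on GPU: {expected:.1f} tok/s")
print(f"slowdown attributable to offload: {tax:.1f}x")
```

Under that assumption, size alone would account for a drop to about 13.6 tok/s, and the remaining factor of roughly 3 is the part attributable to offload.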
How much does CPU and system RAM matter when you offload Qwen3.6 35B?
When 30–35 layers are sitting in DDR4, the CPU side of inference is no longer a footnote. Generation throughput depends on two things: the matrix-multiply bandwidth of your CPU (cores × AVX2/AVX-512 width × clock) and the memory bandwidth between the CPU and DRAM. The disk almost never matters once the model is loaded.
In practical terms:
- CPU: A modern 6-core like a Ryzen 5 5600 will reach about 3.5 tok/s on partial-offload q4_K_M. An 8-core Ryzen 7 5700X pushes that closer to 4.2 tok/s. The headroom past 8 cores diminishes quickly because llama.cpp's offload kernels are memory-bound, not compute-bound.
- RAM speed: Going from 2666 MT/s to 3600 MT/s DDR4 lifts generation by roughly 15–20% on a 5700X. Going to DDR5-6400 on an AM5 platform lifts it again by a similar margin, but the platform upgrade (new CPU, motherboard, and RAM) usually costs more than the speed-up justifies for this workload alone.
- RAM capacity: 16 GB is the bare minimum and will swap once the OS, the runtime, and a browser are resident. 32 GB is the comfortable default and the configuration most community benchmarks assume.
System RAM does not change prefill much — prefill is dominated by the GPU side of the layered compute — but it sets a hard ceiling on what you can run at all. Below 32 GB you will start swapping to NVMe and lose another 2–3× on generation.
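A back-of-the-envelope ceiling shows why memory bandwidth dominates. The sketch below assumes dual-channel memory with a 64-bit (8-byte) bus per channel, that every generated token reads all CPU-resident weights once, and that GPU time and PCIe transfers are free. The 34 CPU layers and the 21.5 GB weight size come from the q4_K_M row of the layer table. The function names are illustrative, and the result is an upper bound, not a prediction.

```python
def dram_bandwidth_gbs(mt_per_s: float, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s for a given transfer rate."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9


def generation_ceiling_tps(mt_per_s: float, cpu_layers: int = 34,
                           weights_gb: float = 21.5, total_layers: int = 64,
                           channels: int = 2) -> float:
    """Upper bound on tok/s if each token streams all CPU-resident weights once."""
    cpu_weights_gb = weights_gb * cpu_layers / total_layers
    return dram_bandwidth_gbs(mt_per_s, channels) / cpu_weights_gb


for label, rate in (("DDR4-2666", 2666), ("DDR4-3600", 3600), ("DDR5-6400", 6400)):
    print(f"{label}: {dram_bandwidth_gbs(rate):6.1f} GB/s peak, "
          f"ceiling ~{generation_ceiling_tps(rate):.1f} tok/s")
```

For DDR4-3600 the ceiling comes out near 5 tok/s, close to the reported ~4.2 tok/s. The reported gains from faster RAM are smaller than the ceilings suggest because the GPU-side layers, transfers, and achievable (not peak) bandwidth all take their share.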
Prefill vs generation: why offloaded 35B stalls on long prompts
Prefill is the one-time pass that ingests the prompt and builds the KV cache; generation is the per-token loop that follows. When the whole model is on GPU, prefill is fast and generation is fast. When part of the model is on CPU, both slow down — but prefill slows down disproportionately because every prompt token has to traverse the layered compute end-to-end before the first response token comes out.
On a 12 GB 3060 with Qwen3.6 35B q4_K_M, a 256-token prompt warms up in roughly 5–6 seconds. A 4,000-token prompt takes 80–90 seconds before the first token of the answer appears. At 16,000 tokens prefill alone is several minutes, which is why people running RAG or long-document workflows on 12 GB cards either drop to a smaller model or invest in a 24 GB card.
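These wait times follow directly from the prefill rate in the benchmark table. A minimal sketch, assuming the ~46 tok/s prefill figure for q4_K_M holds constant across prompt lengths (in practice it tends to degrade as context grows, so long prompts may be slower than shown):

```python
def time_to_first_token_s(prompt_tokens: int, prefill_tps: float = 46.0) -> float:
    """Seconds spent ingesting the prompt before the first output token."""
    return prompt_tokens / prefill_tps


for prompt_tokens in (256, 4_000, 16_000):
    seconds = time_to_first_token_s(prompt_tokens)
    print(f"{prompt_tokens:>6} tokens: {seconds:6.1f} s ({seconds / 60:.1f} min)")
```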
If your use case is short-prompt conversational chat or code completion, the prefill penalty is tolerable. If it is long-document Q&A, summarization, or agentic workflows that re-send a growing context every turn, this configuration will frustrate you.
Context-length impact: KV-cache growth on a 12 GB card
Qwen3.6 35B's KV cache at 8K context is about 1.2 GB in FP16, or about 600 MB if you quantize it to q8_0 with --cache-type-k q8_0 --cache-type-v q8_0. At 32K the KV cache is roughly 4.8 GB in FP16, which on a 12 GB card means you have to take layers off the GPU to make room.
Practical guidance:
- 8K context: safe at 30+ layers on GPU at q4_K_M.
- 16K context: drop to ~24 layers on GPU and quantize the KV cache to q8.
- 32K context and above: consider q3_K_M weights to free more VRAM, or accept generation in the 2–3 tok/s range.
The 128K theoretical context Qwen3.6 supports is not realistically usable on 12 GB — the math just does not fit. If you need >32K context regularly, you are buying a different card.
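The KV cache grows linearly with context length, so the figures above can be extrapolated from the 8K data point. This sketch assumes the article's ~1.2 GB at 8K in FP16 and treats q8_0 as halving that, as the 600 MB figure implies. It does not derive the size from the model's attention dimensions, which are not given here.

```python
KV_GB_AT_8K_FP16 = 1.2   # figure quoted in this section
BASE_CONTEXT = 8 * 1024  # tokens


def kv_cache_gb(context_tokens: int, q8: bool = False) -> float:
    """Linear extrapolation of KV cache size from the 8K FP16 data point."""
    size_gb = KV_GB_AT_8K_FP16 * context_tokens / BASE_CONTEXT
    return size_gb / 2 if q8 else size_gb


for k in (8, 16, 32, 128):
    tokens = k * 1024
    print(f"{k:>3}K context: {kv_cache_gb(tokens):5.1f} GB FP16, "
          f"{kv_cache_gb(tokens, q8=True):5.1f} GB q8_0")
```

At 128K the FP16 cache alone extrapolates to about 19 GB, more than the card's entire 12 GB before a single layer of weights is loaded.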
Perf-per-dollar: is a used RTX 3060 12GB still the cheapest 12GB on-ramp?
As of mid-2026, used RTX 3060 12GB cards sell for $220–$280 on the secondary market. The closest competitive options at the same VRAM tier:
- A new RTX 4060 is 8GB and rules itself out for 30B inference.
- A new RTX 4060 Ti 16GB is $400–$450 and gives you the 16 GB that lifts Qwen3.6 35B at q3_K_M closer to "fits."
- A used RTX 3060 Ti is faster for gaming but only 8 GB, so it does not help here.
- A used RTX 2080 Ti has 11 GB, slightly less than the 3060, and is generally worse value for inference now.
For pure dollars-per-token-per-second on 30B+ models, the 3060 12GB is still the cheapest plausible answer. The next sensible step up is the 4060 Ti 16GB, which roughly doubles the price and roughly doubles real-world generation throughput for Qwen3.6 35B by keeping more layers on GPU.
When should you step up to 16 GB or 24 GB instead?
Three signals tell you the 3060 12GB has stopped being the right card:
- You routinely hit prefill timeouts. If you are sending 4K+ token prompts and watching the screen for minutes before a response, you are paying the offload tax constantly. A 16 GB card cuts that in half by keeping more layers on GPU; a 24 GB card (used 3090) removes the offload tax entirely for Qwen3.6 35B at q4_K_M.
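- You want q5 or higher quality. q5_K_M and above mean fewer layers on GPU at 12 GB and progressively worse generation throughput. A 24 GB card holds q4_K_M (~21.5 GB) fully in VRAM. q5_K_M (~25 GB) still spills a few layers even at 24 GB, and a 16 GB card still needs partial offload at q4_K_M, just with more layers on GPU.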
- You are running an agent or RAG pipeline. Anything that resends a growing context every turn punishes offloaded inference. The 3060 12GB can run those workloads but will feel slow; a 24 GB card makes them workable.
If none of those apply — if you are doing short-prompt chat, evaluation, or learning the toolchain — the 3060 12GB at q4_K_M with partial offload is genuinely the best dollar value in 2026 for 30B-class local inference.
Common pitfalls on 12 GB + 35B partial offload
- Forgetting to quantize the KV cache. Default FP16 KV silently steals 1–2 GB you could have spent on more layers on GPU. Use --cache-type-k q8_0 --cache-type-v q8_0 in llama.cpp.
- Running on Windows with WDDM driver overhead. WDDM reserves several hundred MB of VRAM for the desktop compositor. Linux with the nvidia-open or proprietary driver gives you 400–600 MB more usable VRAM, which translates to 1–2 more layers on GPU.
- Loading the wrong GGUF. Imatrix-quantized q4_K_M files from Bartowski are about 5–10% better quality than the older non-imatrix variants for the same VRAM. Always check the upload date.
- Pinning the model to a slow PCIe slot. A 3060 in a Gen3 x16 slot is fine; the same card in a Gen3 x4 chipset slot bottlenecks transfers between CPU layers and GPU layers and costs 10–15% on partial-offload generation.
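Putting the settings from this article together, a launch command for llama.cpp's llama-server might be assembled as below. This is a sketch, not a tested configuration: the model filename is hypothetical, the defaults (30 GPU layers, 8K context, q8_0 KV cache) are the article's q4_K_M starting point, and some llama.cpp builds require flash attention to be enabled before they accept a quantized V cache, so check your build's --help output.

```python
import shlex


def llama_server_cmd(model_path: str, gpu_layers: int = 30,
                     ctx: int = 8192, kv_type: str = "q8_0") -> list[str]:
    """Build an argument list for llama-server with partial GPU offload."""
    return [
        "llama-server",
        "-m", model_path,            # path to the GGUF file
        "-ngl", str(gpu_layers),     # number of layers kept on the GPU
        "-c", str(ctx),              # context length in tokens
        "--cache-type-k", kv_type,   # KV cache quantization for keys
        "--cache-type-v", kv_type,   # KV cache quantization for values
    ]


# Hypothetical filename; substitute the GGUF you actually downloaded.
cmd = llama_server_cmd("qwen3.6-35b-q4_k_m.gguf")
print(shlex.join(cmd))
```

The list form can be passed straight to `subprocess.run` once the path and layer count match your system. If the model fails to load at longer contexts, lower `gpu_layers` first.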
Bottom line
Qwen3.6 35B on a 12 GB RTX 3060 works. It does not feel like running a model that fits, but at 4 tok/s of generation on a $250 card you can do real work: code review, summarization, conversational chat, evaluation against frontier APIs. The main caveats are prefill latency on long prompts and a practical ceiling around 16K context, with 32K reachable only at lower speed or with q3_K_M weights.
For most people stepping into local LLMs in 2026 with a Steam library on the same machine, that combination — gaming GPU plus 30B-class local inference on a budget — is the right starting point. When you outgrow it, the RTX 4060 Ti 16GB or a used 24 GB card is the natural next step. Until then, the 3060 12GB is doing more work in 2026 than its launch reviews ever predicted.
Citations and sources
- Qwen team release blog — model architecture and parameter counts for Qwen3.6 35B.
- Qwen on Hugging Face — community-quantized GGUF files and per-quant size references.
- TechPowerUp RTX 3060 12GB page — VRAM bandwidth, GA106 silicon details.
- NVIDIA RTX 40-series page — RTX 4060 Ti 16GB spec reference used in the upgrade comparison.
- llama.cpp performance discussions — partial-offload tok/s threads used to synthesize the benchmark table.
- Steam Hardware Survey — 12 GB VRAM tier distribution context.
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
