Qwen3.6 35B on a Single RTX 3060 12GB: What Actually Fits

Item: MSI GeForce RTX 3060 Ventus 2X 12G Gaming Graphics Card - RTX 3060
Author: Mike Perry

How partial CPU offload, quant choice, and KV-cache pressure shape what a 12 GB card can actually do.

By Mike Perry · Published 2026-05-28 · Last verified 2026-07-22 · 11 min read

Yes, you can run Qwen3.6 35B on a single RTX 3060 12GB, but only with CPU offload. The model needs roughly 22 GB at q4_K_M, so the 3060 holds about half the layers in VRAM while the remainder spills to system RAM. Expect 3–6 tokens per second for generation in a typical 8K-context chat, dropping to 2–3 tok/s on long prompts.

Why this article exists

The release of Qwen3.6 35B in mid-2026 reset what counts as a "useful local model" for newcomers. Until very recently, anything beyond 14B parameters required a 24GB card or a multi-GPU rig. Qwen3.6 changed that math: the architecture is tuned for aggressive quantization, the prefill kernels in llama.cpp and vLLM are now mature on Ampere, and 12GB cards are still the dominant tier in the Steam Hardware Survey. That combination means a $250 used RTX 3060 12GB is suddenly the cheapest plausible on-ramp into 30B-class local inference.

The trade-off is real, though. Running a 35B model on 12GB of VRAM is not the same experience as running a 14B model on the same card. Offload changes everything: prefill becomes the bottleneck on long prompts, KV cache eats into your available VRAM faster than you would expect, and the speed gap between "fits fully on GPU" and "spills to CPU" is roughly an order of magnitude. This piece walks through exactly what fits, what does not, and where the 3060 12GB stops being the right answer.

We are not claiming first-party measurements. The numbers below are synthesized from publicly reported benchmarks in the Qwen team blog, Hugging Face model cards, the llama.cpp performance discussions on GitHub, and community measurements posted to r/LocalLLaMA. Where a number varies materially across sources, we note the spread.

Key takeaways

Qwen3.6 35B at q4_K_M weighs about 21–22 GB on disk and at runtime, so the RTX 3060 12GB cannot hold the whole model.
With roughly 28–32 of the 64 layers on GPU and the rest on CPU, expect 3–6 tok/s generation in 8K-context chat and 2–3 tok/s with a long prompt.
32 GB of system DDR4 at 3600 MT/s is the practical floor; 16 GB will swap once the runtime, KV cache, and other apps are resident.
A faster CPU and faster RAM help generation noticeably; a faster SSD only helps cold-start load times.
If you want faster 30B-class inference, the next sensible step is a 16 GB card such as the RTX 4060 Ti 16GB, which keeps more layers on GPU; full-VRAM speed at q4_K_M takes a 24 GB card such as a used RTX 3090.

What is Qwen3.6 35B and how big is it on disk per quant?

Qwen3.6 35B is a 64-layer, dense Transformer with grouped-query attention and a 128K-token context window. The Qwen team's release post lists the parameter count at about 35 billion and the native FP16 size at roughly 70 GB on disk. Quantization compresses both the on-disk weights and the runtime footprint at the cost of a small quality regression.

The community-quantized GGUF files on Hugging Face (Bartowski's uploads and TheBloke-style mirrors) give a clean per-quant view:

Quant	On-disk size	Runtime VRAM (weights only)	Quality vs FP16
q2_K	~13.0 GB	~13.5 GB	Heavy degradation; not recommended
q3_K_M	~16.3 GB	~17.0 GB	Visible regressions on reasoning
q4_K_M	~20.8 GB	~21.5 GB	Minor regressions; community default
q5_K_M	~24.3 GB	~25.0 GB	Near-FP16 in most benchmarks
q6_K	~28.2 GB	~29.0 GB	Very close to FP16
q8_0	~36.5 GB	~37.5 GB	Effectively FP16-equivalent
FP16	~70.0 GB	~70.5 GB	Reference
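
One way to sanity-check these sizes is to convert each file into average bits per weight. The sketch below is a minimal illustration, assuming exactly 35e9 parameters and decimal gigabytes (both simplifications); the function and variable names are illustrative only.

```python
# Effective bits per weight implied by the on-disk sizes in the table.
# PARAMS is the article's "about 35 billion"; sizes are decimal gigabytes.
PARAMS = 35e9

ON_DISK_GB = {
    "q2_K": 13.0,
    "q3_K_M": 16.3,
    "q4_K_M": 20.8,
    "q5_K_M": 24.3,
    "q6_K": 28.2,
    "q8_0": 36.5,
    "FP16": 70.0,
}


def bits_per_weight(size_gb: float, params: float = PARAMS) -> float:
    """Average bits stored per parameter for a file of size_gb gigabytes."""
    return size_gb * 1e9 * 8 / params


if __name__ == "__main__":
    for quant, size_gb in ON_DISK_GB.items():
        bpw = bits_per_weight(size_gb)
        print(f"{quant:7s} {size_gb:5.1f} GB  ~{bpw:.2f} bits/weight")
```

The FP16 row comes out at 16 bits per weight, as expected. The K-quants land a little above their nominal bit width because they store per-block scales and keep some tensors at higher precision.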

"Runtime VRAM" here is weights only — KV cache and activations add another 1–4 GB depending on context length, which we cover later.

Does Qwen3.6 35B fit in 12 GB of VRAM, or do you have to offload?

It does not fit at any quant. Even q2_K, which is too degraded to be useful for serious work, needs ~13.5 GB, which already exceeds the 12 GB the RTX 3060 exposes. The realistic answer is partial offload: keep as many of the 64 transformer layers as possible on the GPU, and let the rest run on the CPU through llama.cpp's offload path.

A practical configuration at q4_K_M on a stock 12 GB 3060 is about 28–32 layers on GPU and 32–36 layers on CPU. That leaves roughly 1.5–2 GB of VRAM free for the KV cache at 8K context. Going below 28 layers on GPU starts to hurt more than it helps; going above 32 typically forces VRAM exhaustion the moment you load a longer prompt.

Spec table: RTX 3060 12GB vs the VRAM Qwen3.6 35B needs per quant

Quant	Total weight VRAM	Layers on 3060 12GB	Layers offloaded to CPU	Expected quality loss
q2_K	~13.5 GB	~50 of 64	~14	Severe — avoid
q3_K_M	~17.0 GB	~42 of 64	~22	Noticeable on reasoning
q4_K_M	~21.5 GB	~30 of 64	~34	Community default
q5_K_M	~25.0 GB	~26 of 64	~38	Minimal vs FP16
q6_K	~29.0 GB	~22 of 64	~42	Essentially FP16
q8_0	~37.5 GB	~17 of 64	~47	Reference-grade

Layer counts assume equal-sized layers (~340 MB per layer at q4_K_M, scaling with total weight size for other quants) with 1.5 GB reserved for KV cache, activations, and the runtime context.
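
The layer budget can be estimated with a few lines of arithmetic. This is a rough sketch under the assumptions stated above (64 equal-sized layers, 1.5 GB reserved); the helper name `gpu_layers` is hypothetical, and the result is only a starting value for -ngl, not a measured limit.

```python
import math

VRAM_GB = 12.0      # RTX 3060 12GB
RESERVE_GB = 1.5    # KV cache, activations, runtime context (article's figure)
N_LAYERS = 64

# "Total weight VRAM" column from the spec table, in GB.
WEIGHT_GB = {
    "q2_K": 13.5, "q3_K_M": 17.0, "q4_K_M": 21.5,
    "q5_K_M": 25.0, "q6_K": 29.0, "q8_0": 37.5,
}


def gpu_layers(weight_gb: float, vram_gb: float = VRAM_GB,
               reserve_gb: float = RESERVE_GB, n_layers: int = N_LAYERS) -> int:
    """Whole layers that fit in VRAM after the reserve, assuming equal layer sizes."""
    per_layer_gb = weight_gb / n_layers
    fit = math.floor((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(n_layers, fit))


if __name__ == "__main__":
    for quant, weight_gb in WEIGHT_GB.items():
        on_gpu = gpu_layers(weight_gb)
        print(f"{quant:7s} {on_gpu:2d} layers on GPU, {N_LAYERS - on_gpu:2d} on CPU")
```

For q4_K_M this gives 31 layers, in line with the ~30 in the table. The estimate ignores embeddings and the output head, so back off by a layer or two if a longer prompt triggers an out-of-memory error.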

Benchmark table: tok/s on RTX 3060 12GB at q4_K_M with CPU offload vs full-GPU smaller models

Numbers below are medians of publicly reported community measurements at 8K context, using llama.cpp built with CUDA 12 and -ngl (--n-gpu-layers) tuned to fill VRAM without running out of memory. Sources include llama.cpp discussion #5021 threads and r/LocalLLaMA benchmark posts from May 2026.

Model	Quant	Fits fully on 3060 12GB?	Prefill tok/s	Generation tok/s
Qwen3.6 7B	q4_K_M	Yes	~280	~58
Qwen3.6 14B	q4_K_M	Yes	~165	~34
Qwen3.6 32B (dense)	q4_K_M	No (~21GB)	~38	~4.0
Qwen3.6 35B	q3_K_M	No (~17GB, 42 layers GPU)	~62	~5.5
Qwen3.6 35B	q4_K_M	No (~21GB, 30 layers GPU)	~46	~4.2
Qwen3.6 35B	q5_K_M	No (~25GB, 26 layers GPU)	~38	~3.4

The interesting line is the gap between 14B at q4 (34 tok/s, fits) and 35B at q4 (4.2 tok/s, partial offload). The 35B model is roughly 8× slower despite being only 2.5× larger, which is the offload tax made visible.

How much does CPU and system RAM matter when you offload Qwen3.6 35B?

When 30–35 layers are sitting in DDR4, the CPU side of inference is no longer a footnote. Generation throughput depends on two things: the matrix-multiply throughput of your CPU (cores × AVX2/AVX-512 width × clock) and the memory bandwidth between the CPU and DRAM. The disk almost never matters once the model is loaded.

In practical terms:

CPU: A modern 6-core like a Ryzen 5 5600 will reach about 3.5 tok/s on partial-offload q4_K_M. An 8-core Ryzen 7 5700X pushes that closer to 4.2 tok/s. The headroom past 8 cores diminishes quickly because the CPU-side kernels in llama.cpp are memory-bound, not compute-bound.
RAM speed: Going from 2666 MT/s to 3600 MT/s DDR4 lifts generation by roughly 15–20% on a 5700X. Going to DDR5-6400 on an AM5 platform lifts it again by a similar margin, but the platform upgrade (CPU, motherboard, and RAM) usually costs more than the speed-up justifies for this workload alone.
RAM capacity: 16 GB is the bare minimum for a 35B model and is likely to swap once the OS, the runtime, and a browser are all resident. 32 GB is the comfortable default and the configuration most community benchmarks assume.

System RAM does not change prefill much, because prefill is dominated by the GPU side of the layered compute, but it sets a hard ceiling on what you can run at all. With less than 32 GB you are likely to start swapping to NVMe and lose another 2–3× on generation.
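
A back-of-envelope model shows why memory bandwidth dominates generation. The sketch below assumes every weight is read once per generated token (the dense case), dual-channel DDR with 8 bytes per transfer per channel, and 360 GB/s for the RTX 3060 12GB (192-bit bus at 15 Gbps). It ignores compute time, PCIe transfers, and KV-cache reads, so it is an optimistic ceiling rather than a prediction; all names are illustrative.

```python
def ddr_bandwidth_gbs(mt_per_s: float, channels: int = 2,
                      bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s."""
    return mt_per_s * 1e6 * channels * bytes_per_transfer / 1e9


def est_tok_per_s(weight_gb: float, gpu_layers: int, n_layers: int = 64,
                  gpu_bw_gbs: float = 360.0, cpu_bw_gbs: float = 57.6) -> float:
    """Bandwidth-bound tokens/s: time to stream the GPU and CPU shares once."""
    per_layer_gb = weight_gb / n_layers
    gpu_seconds = gpu_layers * per_layer_gb / gpu_bw_gbs
    cpu_seconds = (n_layers - gpu_layers) * per_layer_gb / cpu_bw_gbs
    return 1.0 / (gpu_seconds + cpu_seconds)


if __name__ == "__main__":
    for speed in (2666, 3600):
        bw = ddr_bandwidth_gbs(speed)
        tps = est_tok_per_s(21.5, gpu_layers=30, cpu_bw_gbs=bw)
        print(f"DDR4-{speed}: {bw:.1f} GB/s -> ceiling of about {tps:.1f} tok/s")
```

With q4_K_M, 30 layers on GPU, and DDR4-3600, the ceiling is about 4.4 tok/s, close to the ~4.2 tok/s reported in the benchmark table. Nearly all of the per-token time is spent streaming the CPU-resident layers from system RAM, which is why each extra layer moved onto the GPU pays off.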

Prefill vs generation: why offloaded 35B stalls on long prompts

Prefill is the one-time pass that ingests the prompt and builds the KV cache; generation is the per-token loop that follows. When the whole model is on GPU, prefill is fast and generation is fast. When part of the model is on CPU, both slow down, and the prefill cost is paid up front: every prompt token has to pass through all 64 layers, CPU-resident ones included, before the first response token comes out, so the wait grows with prompt length.

On a 12 GB 3060 with Qwen3.6 35B q4_K_M, prefill runs at roughly 46 tok/s, so a 256-token prompt takes 5–6 seconds to process. A 4,000-token prompt takes 80–90 seconds before the first token of the answer appears. At 16,000 tokens prefill alone takes close to six minutes, which is why people running RAG or long-document workflows on 12 GB cards either drop to a smaller model or invest in a 24 GB card.
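
These waits follow directly from the prefill rate in the benchmark table. A minimal sketch, assuming the ~46 tok/s q4_K_M figure stays constant regardless of prompt length (in practice it tends to fall somewhat as the context grows); the function name is illustrative.

```python
PREFILL_TOK_PER_S = 46.0  # q4_K_M with ~30 layers on GPU, from the benchmark table


def time_to_first_token_s(prompt_tokens: int,
                          prefill_tps: float = PREFILL_TOK_PER_S) -> float:
    """Seconds spent in prefill before the first response token appears."""
    return prompt_tokens / prefill_tps


if __name__ == "__main__":
    for n in (256, 4000, 16000):
        seconds = time_to_first_token_s(n)
        print(f"{n:>6} tokens -> {seconds:6.1f} s ({seconds / 60:.1f} min)")
```

Swapping in the q3_K_M or q5_K_M prefill rates from the same table gives the equivalent wait for those quants.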

If your use case is short-prompt conversational chat or code completion, the prefill penalty is tolerable. If it is long-document Q&A, summarization, or agentic workflows that re-send a growing context every turn, this configuration will frustrate you.

Context-length impact: KV-cache growth on a 12 GB card

Qwen3.6 35B's KV cache at 8K context is about 1.2 GB in FP16, or about 600 MB if you quantize it to q8_0 with --cache-type-k q8_0 --cache-type-v q8_0. The cache grows linearly with context, so at 32K it is roughly 4.8 GB in FP16, which on a 12 GB card means you have to take layers off the GPU to make room.
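
Because the KV cache grows linearly with context, its size at any length can be scaled from the 8K figure. The sketch below assumes the article's 1.2 GB at 8,192 tokens and treats q8_0 as exactly half of FP16; real q8_0 blocks also store a scale, so the true figure is slightly higher. The names are illustrative.

```python
KV_GB_AT_8K_F16 = 1.2                      # article's figure for 8,192 tokens
SIZE_RATIO = {"f16": 1.0, "q8_0": 0.5}     # q8_0 approximated as half of f16


def kv_cache_gb(ctx_tokens: int, cache_type: str = "f16") -> float:
    """Approximate KV-cache size in GB, scaled linearly from the 8K figure."""
    return KV_GB_AT_8K_F16 * (ctx_tokens / 8192) * SIZE_RATIO[cache_type]


if __name__ == "__main__":
    for ctx in (8192, 16384, 32768, 131072):
        f16 = kv_cache_gb(ctx, "f16")
        q8 = kv_cache_gb(ctx, "q8_0")
        print(f"{ctx:>6} tokens: f16 ~{f16:4.1f} GB, q8_0 ~{q8:4.1f} GB")
```

At 16K with a q8_0 cache the estimate is about 1.2 GB, the same as FP16 at 8K, which is why quantizing the cache is the first lever to pull before giving up GPU layers.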

Practical guidance:

8K context: safe at 30+ layers on GPU at q4_K_M.
16K context: drop to ~24 layers on GPU and quantize the KV cache to q8.
32K context and above: consider q3_K_M weights to free more VRAM, or accept generation in the 2–3 tok/s range.

The 128K context Qwen3.6 supports in theory is not realistically usable on 12 GB: scaled from the 8K figure, the FP16 KV cache alone would be about 19 GB. If you need >32K context regularly, you are buying a different card.

Perf-per-dollar: is a used RTX 3060 12GB still the cheapest 12GB on-ramp?

As of mid-2026, used RTX 3060 12GB cards sell for $220–$280 on the secondary market. The nearest alternatives:

A new RTX 4060 is 8GB and rules itself out for 30B inference.
A new RTX 4060 Ti 16GB is $400–$450 and gives you the 16 GB that lifts Qwen3.6 35B at q3_K_M closer to "fits."
A used RTX 3060 Ti is faster for gaming but only 8 GB, so it does not help here.
A used RTX 2080 Ti has 11 GB, slightly less than the 3060, on the older Turing architecture, and is generally worse value for inference now.

For pure dollars-per-token-per-second on 30B+ models, the 3060 12GB is still the cheapest plausible answer. The next sensible step up is the 4060 Ti 16GB, which roughly doubles the price and roughly doubles real-world generation throughput for Qwen3.6 35B by keeping more layers on GPU.

When should you step up to 16 GB or 24 GB instead?

Three signals tell you the 3060 12GB has stopped being the right card:

You routinely hit prefill timeouts. If you are sending 4K+ token prompts and watching the screen for minutes before a response, you are paying the offload tax constantly. A 16 GB card cuts that in half by keeping more layers on GPU; a 24 GB card (used 3090) removes the offload tax entirely for Qwen3.6 35B at q4_K_M.
You want q5 or higher quality. q5_K_M and above mean fewer layers on GPU at 12 GB and progressively worse generation throughput. A 16 GB card keeps roughly two-thirds of the q4_K_M layers on GPU, a 24 GB card lets q4_K_M (~21.5 GB) fit fully, and q5_K_M (~25 GB) still needs a few layers offloaded even at 24 GB.
You are running an agent or RAG pipeline. Anything that resends a growing context every turn punishes offloaded inference. The 3060 12GB can run those workloads but will feel slow; a 24 GB card makes them workable.

If none of those apply — if you are doing short-prompt chat, evaluation, or learning the toolchain — the 3060 12GB at q4_K_M with partial offload is genuinely the best dollar value in 2026 for 30B-class local inference.

Common pitfalls on 12 GB + 35B partial offload

Forgetting to quantize the KV cache. The default FP16 KV cache takes about 1.2 GB at 8K context and grows linearly with context length; q8_0 roughly halves that, freeing VRAM you could spend on more GPU layers. Use --cache-type-k q8_0 --cache-type-v q8_0 in llama.cpp.
Running on Windows with WDDM driver overhead. WDDM reserves several hundred MB of VRAM for the desktop compositor. Linux with the nvidia-open or proprietary driver gives you 400–600 MB more usable VRAM, which translates to 1–2 more layers on GPU.
Loading the wrong GGUF. Imatrix-quantized q4_K_M files from Bartowski are about 5–10% better quality than the older non-imatrix variants for the same VRAM. Always check the upload date.
Pinning the model to a slow PCIe slot. A 3060 in a Gen3 x16 slot is fine; the same card in a Gen3 x4 chipset slot bottlenecks transfers between CPU layers and GPU layers and costs 10–15% on partial-offload generation.
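
Pulling the settings from this article together, a launch command might be assembled as below. This is a sketch, not a tested recipe: the model filename is hypothetical, the defaults (30 GPU layers, 8K context, q8_0 cache) are the q4_K_M starting points discussed above, and the helper name is illustrative. Some llama.cpp builds also require flash attention to be enabled before the V cache can be quantized, so check `llama-server --help` on your build.

```python
import shlex


def llama_server_args(model_path: str, n_gpu_layers: int = 30,
                      ctx_size: int = 8192, kv_type: str = "q8_0") -> list[str]:
    """Build an argv list for llama-server with partial offload and a quantized KV cache."""
    return [
        "llama-server",
        "-m", model_path,               # path to the GGUF file
        "-ngl", str(n_gpu_layers),      # layers kept on the GPU
        "-c", str(ctx_size),            # context length in tokens
        "--cache-type-k", kv_type,      # K cache precision
        "--cache-type-v", kv_type,      # V cache precision
    ]


if __name__ == "__main__":
    # Hypothetical filename; substitute the GGUF you actually downloaded.
    cmd = llama_server_args("qwen3.6-35b-q4_k_m.gguf")
    print(shlex.join(cmd))
```

For 16K context, the guidance above translates to `llama_server_args(path, n_gpu_layers=24, ctx_size=16384)`. The list can be passed to `subprocess.run` directly once the values are tuned for your machine.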

Bottom line

Qwen3.6 35B on a 12 GB RTX 3060 works. It does not feel like running a model that fits, but at 4 tok/s of generation on a $250 card you can do real work: code review, summarization, conversational chat, evaluation against frontier APIs. The main caveats are prefill latency on long prompts and a practical ceiling around 16K context at q4_K_M; 32K is reachable only by dropping to q3_K_M or accepting 2–3 tok/s.

For most people stepping into local LLMs in 2026 with a Steam library on the same machine, that combination — gaming GPU plus 30B-class local inference on a budget — is the right starting point. When you outgrow it, the RTX 4060 Ti 16GB or a used 24 GB card is the natural next step. Until then, the 3060 12GB is doing more work in 2026 than its launch reviews ever predicted.

Citations and sources

Qwen team release blog — model architecture and parameter counts for Qwen3.6 35B.
Qwen on Hugging Face — community-quantized GGUF files and per-quant size references.
TechPowerUp RTX 3060 12GB page — VRAM bandwidth, GA106 silicon details.
NVIDIA RTX 40-series page — RTX 4060 Ti 16GB spec reference used in the upgrade comparison.
llama.cpp performance discussions — partial-offload tok/s threads used to synthesize the benchmark table.
Steam Hardware Survey — 12 GB VRAM tier distribution context.

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

Frequently asked questions

Can Qwen3.6 35B run entirely on a 12GB RTX 3060 with no offload?

No. A 35B-parameter model exceeds 12GB at every quant listed earlier, even before you include the KV cache, so part of the model must live in system RAM. The RTX 3060 12GB can hold roughly half of the layers at q4_K_M (about two-thirds at q3_K_M), with the remainder offloaded to the CPU. That is workable, but slower than a model that fits fully in VRAM.

What quant should I use on a 12GB card?

q4_K_M is the usual balance of size and quality for a 35B model, but on a 12GB card you may drop to q3_K_M to keep more layers on the GPU and reduce offload penalty. Going below q3 saves VRAM but degrades reasoning noticeably, so most users on 12GB accept partial offload at q4 rather than running a heavily damaged q2 weight.

How much system RAM do I need to offload Qwen3.6 35B?

Plan for at least 32GB of system RAM so the offloaded layers and the OS have headroom; 16GB is tight once the model, KV cache, and your inference runtime are all resident. Dual-channel memory matters here because offloaded layers are bottlenecked by CPU-to-RAM bandwidth, not just capacity, so faster DDR4 helps generation speed on cards like the RTX 3060.

Will a faster CPU like the Ryzen 7 5700X speed up an offloaded model?

Yes, partially. When layers are offloaded to the CPU, generation throughput depends on CPU compute and memory bandwidth, so an 8-core Ryzen 7 5700X will out-generate a weaker quad-core when offload is heavy. It will not match a card that fits the whole model in VRAM, but it meaningfully lifts tok/s on a partial-offload 12GB setup compared with a low-end CPU.

Is a 12GB RTX 3060 still worth buying for local LLMs in 2026?

For an entry budget it remains one of the cheapest paths to 12GB of CUDA-accessible VRAM, which comfortably runs 7B-14B models fully on-GPU and lets you experiment with larger models via offload. If your main interest is 30B-plus models at full speed, a 16GB or 24GB card is the better long-term buy, but the 3060 12GB is a low-risk starting point.
