Running Qwen 3 32B locally is a VRAM problem first and a bandwidth problem second. At the community-standard q4_K_M quantization, Qwen 3 32B needs roughly 19 GB of GPU memory for weights, plus 2-8 GB for the KV cache depending on context length. Per the shortlist below, the cheapest card that fits it natively in 2026 is the AMD Radeon RX 7900 XTX ($999 MSRP); the fastest single consumer card is the NVIDIA GeForce RTX 5090. This guide pulls real tokens-per-second numbers from the SpecPicks benchmark database for every option.
Sweet-spot mid-size dense model — GPT-4-class quality on many tasks in ~22GB VRAM.
Does Qwen 3 32B fit on my GPU? (quantization matrix)
Quantization is the lever that decides whether Qwen 3 32B fits on your card. Each quant below lists the approximate VRAM for weights, the extra VRAM used by the KV cache at a 4K-token context window, and the quality tradeoff.
| Quant | Weights (VRAM) | + KV cache | Quality |
|---|---|---|---|
| q2_K_S | 10 GB | +1 GB @ 4K ctx | Severe — lose 15-25% on reasoning. Use only when desperate. |
| q3_K_M | 14 GB | +1.4 GB @ 4K ctx | Noticeable — lose 5-8% on HumanEval / MMLU. Fine for casual chat. |
| q4_K_M | 19.2 GB | +1.9 GB @ 4K ctx | Community default — 1-3% loss vs fp16. Almost free quality-wise. |
| q5_K_M | 22 GB | +2.2 GB @ 4K ctx | Minimal loss — <1%. Most users can't tell vs fp16. |
| q6_K | 26 GB | +2.6 GB @ 4K ctx | Effectively lossless. ~35% more VRAM than q4_K_M. |
| q8_0 | 34 GB | +3.4 GB @ 4K ctx | Lossless at the inference level. ~1.8x the weight size of q4_K_M. |
| fp16 | 64 GB | +6.4 GB @ 4K ctx | Original training precision. Baseline, rarely needed for inference. |
For nearly every user, q4_K_M is the right default. It costs you maybe 1-3% on benchmark scores versus fp16 but shrinks the memory footprint to under a third. Drop to q3_K_M only when VRAM is tight and you can tolerate a few percent more quality loss. q6_K and q8_0 are worth considering when you have the headroom and want to eliminate any question of quant damage.
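The VRAM figures in the table reduce to simple arithmetic: parameter count times average bits per weight, divided by 8. A minimal sketch, assuming a 32B parameter count and approximate bits-per-weight averages for llama.cpp k-quants (k-quants mix block formats, so these are rough; check the actual GGUF file size for your model):

```python
# Approximate average bits-per-weight for common llama.cpp quants.
# These are rough figures consistent with the table above, not exact
# per-file values -- k-quants mix block formats internally.
APPROX_BPW = {
    "q2_K_S": 2.5,
    "q3_K_M": 3.5,
    "q4_K_M": 4.8,
    "q5_K_M": 5.5,
    "q6_K": 6.5,
    "q8_0": 8.5,
    "fp16": 16.0,
}

def weight_vram_gb(params_billions: float, quant: str) -> float:
    """Estimate weight footprint in GB: params * bits-per-weight / 8."""
    bits = APPROX_BPW[quant]
    return params_billions * 1e9 * bits / 8 / 1e9  # bytes -> GB

for q in ("q4_K_M", "q8_0", "fp16"):
    print(f"{q}: ~{weight_vram_gb(32.0, q):.1f} GB")
# q4_K_M: ~19.2 GB
# q8_0: ~34.0 GB
# fp16: ~64.0 GB
```

The same formula works for any dense model; only the parameter count and quant change.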
What runs Qwen 3 32B at q4_K_M? (the shortlist)
Every number in the table below comes from a live query against the SpecPicks benchmark database. Tok/s values are single-user generation speed ("output tokens per second after the first token"). Perf-per-dollar is tokens/sec per $1,000 of MSRP; perf-per-watt is tokens/sec per 100W of TDP.
| Hardware | VRAM | MSRP | TDP | Gen tok/s | tok/s/$1k | tok/s/100W |
|---|---|---|---|---|---|---|
| NVIDIA GeForce RTX 3090 Ti | 24 GB | $1,999 | 450W | — | — | — |
| NVIDIA GeForce RTX 4090 | 24 GB | $1,599 | 450W | — | — | — |
| NVIDIA GeForce RTX 3090 | 24 GB | $1,499 | 350W | — | — | — |
| AMD Radeon RX 7900 XTX | 24 GB | $999 | 355W | — | — | — |
| NVIDIA GeForce RTX 5090 | 32 GB | $1,999 | 575W | — | — | — |
| Apple M3 Ultra | 512 GB | — | — | 31.0 tok/s | — | — |
| Apple M4 Max | 128 GB | — | — | 31.0 tok/s | — | — |
| Apple M4 Pro | 64 GB | — | — | 31.0 tok/s | — | — |
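The two derived columns are plain ratios of the numbers already in the table. A sketch of how they're computed, using illustrative inputs (not values from the benchmark database, which leaves tok/s blank for these cards):

```python
def perf_per_dollar(tok_s: float, msrp_usd: float) -> float:
    """Generation tok/s per $1,000 of MSRP, as defined above the table."""
    return tok_s / (msrp_usd / 1000)

def perf_per_watt(tok_s: float, tdp_w: float) -> float:
    """Generation tok/s per 100W of TDP."""
    return tok_s / (tdp_w / 100)

# Hypothetical card: 30 tok/s, $1,499 MSRP, 350W TDP.
print(f"{perf_per_dollar(30, 1499):.1f} tok/s per $1k")   # 20.0 tok/s per $1k
print(f"{perf_per_watt(30, 350):.1f} tok/s per 100W")     # 8.6 tok/s per 100W
```

Note that MSRP-based metrics understate used-market value: a secondhand RTX 3090 well below MSRP scores much better than the table suggests.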
How does quantization change tok/s?
Smaller quants don't just save VRAM — they also run faster. Memory bandwidth is the dominant bottleneck for dense-weight inference, so halving the bytes per weight roughly doubles the throughput (up to a point where compute becomes the limit).
Community benchmarks on the NVIDIA GeForce RTX 3090 Ti show approximate deltas:
- q8_0 → baseline (call it 100%)
- q5_K_M → ~1.4x faster than q8_0
- q4_K_M → ~1.7x faster than q8_0
- q3_K_M → ~2.0x faster than q8_0
Quality loss vs speed gain is not linear. q4 sits at the knee of the curve for most users: below q4 you lose quality faster than you gain speed.
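The deltas above fall out of a simple bandwidth model: for a dense model, every output token must stream all the weights through memory once, so generation speed is capped at bandwidth divided by model size. A back-of-envelope sketch (the ~1008 GB/s figure is the RTX 3090 Ti's rated GDDR6X bandwidth; real throughput lands below this ceiling due to kernel overhead, KV-cache reads, and compute limits):

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Theoretical max generation speed when each token streams all weights once."""
    return bandwidth_gb_s / model_gb

# RTX 3090 Ti (~1008 GB/s) with q4_K_M (19.2 GB) vs q8_0 (34 GB) weights.
q4 = bandwidth_ceiling_tok_s(1008, 19.2)
q8 = bandwidth_ceiling_tok_s(1008, 34.0)
print(f"q4 ceiling ~{q4:.1f} tok/s, q8 ceiling ~{q8:.1f} tok/s, ratio {q4/q8:.2f}x")
```

The predicted q4/q8 ratio is ~1.77x, which matches the ~1.7x community-measured delta above: the speedup is almost entirely a byte-count effect.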
Prefill vs generation speed
Two numbers matter for different workloads:
- Prefill (prompt-processing) — how fast the model ingests your input before the first token comes out. For a 4K-token prompt on the NVIDIA GeForce RTX 3090 Ti, expect ~300-700 tok/s prefill.
- Generation — sustained output speed after prefill, which is what the table above measures.
For chat you'll feel generation speed. For RAG where the model re-ingests a long retrieved context on every turn, prefill is often the bottleneck. Code completion sits in between — prompts are short so prefill is fast, generation dominates.
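The chat-vs-RAG distinction above can be made concrete with a two-term latency model: time-to-first-token is prompt length over prefill rate, and the remainder is output length over generation rate. The rates below are illustrative picks from the ranges quoted in this guide, not measured values:

```python
def turn_latency_s(prompt_tokens: int, output_tokens: int,
                   prefill_tok_s: float, gen_tok_s: float) -> tuple[float, float]:
    """Return (time-to-first-token, total turn time) in seconds."""
    ttft = prompt_tokens / prefill_tok_s
    return ttft, ttft + output_tokens / gen_tok_s

# Chat turn: short prompt, long answer -> generation-dominated.
ttft, total = turn_latency_s(200, 500, prefill_tok_s=500, gen_tok_s=30)
print(f"chat: TTFT {ttft:.1f}s, total {total:.1f}s")   # TTFT 0.4s, total 17.1s

# RAG turn: 4K of retrieved context, short answer -> prefill-dominated.
ttft, total = turn_latency_s(4096, 150, prefill_tok_s=500, gen_tok_s=30)
print(f"RAG:  TTFT {ttft:.1f}s, total {total:.1f}s")   # TTFT 8.2s, total 13.2s
```

In the RAG case most of the wall-clock time is spent before the first token appears, which is why prefill speed matters there even though the generation tok/s number looks identical.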
Context length and VRAM — the hidden cost
The KV cache grows linearly with context length. Here's the approximate overhead on top of the ~19.2 GB of q4_K_M weights for Qwen 3 32B:
| Context | KV cache | Total VRAM |
|---|---|---|
| 2K tokens | ~1.1 GB | ~20.3 GB |
| 4K tokens | ~2.2 GB | ~21.4 GB |
| 8K tokens | ~4.4 GB | ~23.6 GB |
| 32K tokens | ~17.6 GB | ~36.8 GB |
| 128K tokens | ~70.4 GB | ~89.6 GB |
For 128K-context workloads you need 4-5x more VRAM than the weights alone. llama.cpp supports KV-cache quantization (-ctk q8_0 -ctv q8_0), which cuts cache size roughly in half with minimal quality loss; use it if you're pushing context limits.
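The general KV-cache formula is 2 (K and V) x layers x KV heads x head dimension x bytes per element, per token. A generic calculator sketch; the default dimensions below are placeholders, not confirmed Qwen 3 32B values — check the model card, since published estimates (including the table above) vary with the layer count, KV-head configuration, and KV precision assumed:

```python
def kv_cache_gb(ctx_tokens: int, layers: int = 64, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV-cache size in GB: 2 (K and V) * layers * kv_heads * head_dim
    * bytes_per_elem, per token, times the context length."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return ctx_tokens * per_token_bytes / 1e9

# With these assumed dimensions:
print(f"fp16 KV @ 32K: ~{kv_cache_gb(32768):.1f} GB")
print(f"q8_0 KV @ 32K: ~{kv_cache_gb(32768, bytes_per_elem=1):.1f} GB")
```

The bytes_per_elem parameter is where -ctk/-ctv quantization bites: dropping from fp16 (2 bytes) to q8_0 (~1 byte) halves the cache, exactly as the flag description promises.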
Which runtime should I use?
For single-user chat on one GPU, all three runtimes (Ollama, llama.cpp, vLLM) produce similar numbers within 10-15% of each other. Rule of thumb:
- Ollama — easiest. Good for anyone who doesn't want to manage models manually. Wraps llama.cpp.
- llama.cpp — direct control over quantization, offload, KV-cache precision. Where the LocalLLaMA community benchmarks its numbers.
- vLLM — production serving. Tensor parallelism, PagedAttention, continuous batching. NVIDIA-first; ROCm support exists but trails CUDA.
For more: Ollama vs llama.cpp vs vLLM →.
Multi-GPU — does it help?
Not usually worth it for a 32B model: a single card with enough VRAM beats two smaller cards splitting layers over PCIe, which adds transfer overhead on every token. Reach for two cards only when it's the cheapest way to get the VRAM you need for a higher quant or a longer context.
Perf-per-dollar vs perf-per-watt
Shopping on pure tok/s is expensive. The "tok/s/$1k" column above is a better lens for budget-constrained builds. Apple Silicon dominates the perf-per-watt column by a wide margin: an M4 Max sustaining ~31 tok/s at roughly 60W package power is several times more efficient per watt than an RTX 5090 drawing up to 575W.
- Best price-performance under $1,500: NVIDIA GeForce RTX 3090
- Best max speed, cost ignored: NVIDIA GeForce RTX 5090
- Best power efficiency: Apple M4 Max — silent, 60W, unified memory fits far more model than any consumer NVIDIA
Getting started — concrete commands
With Ollama:
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:32b
ollama run qwen3:32b
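Once the model runs, the Ollama daemon also serves a local HTTP API on port 11434, which is how you'd script against it. A minimal sketch of a non-streaming call to the /api/generate endpoint (the actual network call is left commented so the snippet doesn't require a running daemon):

```python
import json
import urllib.request

# Request body for Ollama's /api/generate endpoint.
payload = {
    "model": "qwen3:32b",
    "prompt": "Write a haiku about GPUs",
    "stream": False,  # return one JSON object instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment with the Ollama daemon running locally:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```

The completed text comes back in the response object's "response" field; leave "stream" true (the default) if you want token-by-token output instead.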
With llama.cpp (more control):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # NVIDIA; Metal is on by default on macOS; use -DGGML_HIP=ON for AMD Linux
cmake --build build --config Release -j
# Download a q4_K_M GGUF from HuggingFace (e.g. bartowski publishes well-maintained quants)
./build/bin/llama-cli -m ~/models/qwen-3-32b-q4_k_m.gguf -n 512 -c 4096 -ngl 999 \
-p "Write a haiku about GPUs"
Expect first-token latency of 1-3 seconds (prefill), then sustained generation at the numbers in the perf table.
Bottom line
For Qwen 3 32B at q4_K_M in 2026, the AMD Radeon RX 7900 XTX is the cheapest entry point, the NVIDIA GeForce RTX 5090 is the ceiling for single-card consumer builds, and Apple M3 Ultra is the quiet-and-efficient alternative if you care more about power draw than raw tok/s.
Buy more VRAM than you think you need. Context-window growth, longer conversation histories, and KV-cache pressure eat VRAM faster than model weights do. A 32GB card is a materially better long-term bet than a 24GB card for anything in this class.
