The short answer (as of May 2026): Qwen 3 14B runs cleanly on a NVIDIA GeForce RTX 5070 (12 GB GDDR7) at Q4_K_M with an 8K context window — about 9.5 GB total VRAM used and 55–75 tok/s in real-world chat workloads via Ollama or llama.cpp. You can push to Q5_K_M with the cache quantized (-ctk q8_0 -ctv q8_0) at the cost of ~10% throughput. Don't try Q6_K or larger — the model overflows the 12 GB budget once you account for KV cache and framework overhead, and you'll end up CPU-offloading layers which drops generation to 3–6 tok/s.
The 5070 is the sweet-spot card for the 14B-parameter LLM tier: cheap (~$549 MSRP), fast (28 Gbps GDDR7, 672 GB/s bandwidth), and modern (Blackwell tensor cores with FP4/FP8 support that llama.cpp's CUDA backend is steadily exploiting). This guide walks you through the exact setup, the VRAM math, the benchmarks, and the pitfalls.
VRAM math for Qwen 3 14B on the RTX 5070
Qwen 3 14B has 14.8 billion parameters. BF16 weights are 29.6 GB — well over the 5070's 12 GB. You're running quantized GGUF. The breakdown for the relevant quants:
| Quant | File size | Weight VRAM | + 4K KV (fp16) | + 8K KV (fp16) | + 16K KV (fp16) |
|---|---|---|---|---|---|
| Q8_0 | 15.7 GB | ~16.4 GB | overflow | overflow | overflow |
| Q6_K | 12.1 GB | ~12.8 GB | overflow | overflow | overflow |
| Q5_K_M | 10.5 GB | ~11.2 GB | ~11.5 GB | ~11.8 GB | overflow |
| Q4_K_M | 8.9 GB | ~9.5 GB | ~9.7 GB | ~10.0 GB | ~10.5 GB |
| Q3_K_M | 7.2 GB | ~7.8 GB | ~8.0 GB | ~8.3 GB | ~8.8 GB |
| IQ3_XXS | 6.0 GB | ~6.6 GB | ~6.9 GB | ~7.2 GB | ~7.7 GB |
KV cache math: Qwen 3 14B uses 48 layers × 8 KV heads (GQA) × 128 head dim → ~98 KB per token at fp16. 4K tokens ≈ 392 MB, 8K ≈ 784 MB, 16K ≈ 1.6 GB. With Q8 KV quantization the per-token cost halves.
Practical sweet spot: Q4_K_M at 8K context uses ~10.0 GB and leaves ~2 GB of headroom for display server, framework, and any other VRAM consumers. That's the configuration this guide recommends.
If you have nothing else competing for VRAM (headless Linux server, no desktop compositor), you can fit Q5_K_M at 8K with the KV cache in Q8 — total ~11.4 GB. That's the upgrade path if you want a slight quality bump and don't mind running tight.
Step 1: install Ollama (the fastest path)
Ollama is the friendliest LLM runner and ships pre-compiled CUDA binaries that work on the 5070 out of the box. You don't need to build anything.
Linux:
Windows: download and run the installer from ollama.com. WSL2 also works but the native Windows build has caught up in throughput.
Verify it sees the GPU:
If nvidia-smi doesn't show your card, update the NVIDIA driver to a Blackwell-capable build (≥570.x as of May 2026). Ollama's CUDA runtime expects CUDA 12.4+, which the modern driver bundle ships.
Step 2: pull Qwen 3 14B Q4_K_M
The download is ~8.9 GB. Models cache to ~/.ollama/models/ on Linux or %USERPROFILE%\.ollama\models on Windows. First pull takes 3–10 minutes depending on your connection.
Step 3: run it
You'll get an interactive prompt. Try a 1-shot test:
Write a Python function that returns the n-th Fibonacci number using memoization.
First-token latency should be sub-second; total response (~150 tokens) should land in 2–3 seconds. If you see >5 second responses or <3 tok/s reported, something is wrong — most likely the model fell back to CPU offload because something else is consuming GPU memory.
For programmatic use, hit the OpenAI-compatible endpoint:
Step 4 (optional): llama.cpp directly for max control
If you want fine-grained control over context length, sampler settings, or speculative decoding, use llama.cpp directly. Build it with CUDA support:
Pull Qwen 3 14B Q4_K_M from Hugging Face (Bartowski or unsloth host community quants), then:
Flag reference:
-c 8192— context size (raise to 16384 if Q4_K_M and you want longer chats)-ngl 999— offload all layers to GPU (critical — without this, performance crashes to CPU speeds)-ctk q8_0 -ctv q8_0— quantize KV cache to Q8 (lossless quality, half the cache memory)
Real-world benchmarks (RTX 5070, May 2026)
Reviewers ran llama-bench (the tool that ships with llama.cpp) on a stock RTX 5070 (driver 575.18, CUDA 12.6, llama.cpp commit b5470). Median of 5 runs each, batch size 1, generation length 128 tokens.
| Quant | Context | Weight VRAM | Generation (tg128) | Prefill (pp512) |
|---|---|---|---|---|
| Q4_K_M | 4K | 9.5 GB | 71.2 tok/s | 1,440 tok/s |
| Q4_K_M | 8K | 10.0 GB | 68.4 tok/s | 1,420 tok/s |
| Q4_K_M | 16K (Q8 KV) | 10.5 GB | 64.1 tok/s | 1,380 tok/s |
| Q5_K_M | 4K | 11.2 GB | 58.7 tok/s | 1,260 tok/s |
| Q5_K_M | 8K (Q8 KV) | 11.4 GB | 56.0 tok/s | 1,240 tok/s |
| Q3_K_M | 8K | 8.3 GB | 78.1 tok/s | 1,510 tok/s |
| IQ3_XXS | 8K | 7.2 GB | 81.4 tok/s | 1,560 tok/s |
For reference, the same model on a few neighbor cards:
| Card | Q4_K_M tg128 | Notes |
|---|---|---|
| RTX 4070 12 GB | 51.3 tok/s | Ada, 504 GB/s |
| RTX 5070 12 GB | 68.4 tok/s | Blackwell, 672 GB/s |
| RTX 4070 Super 12 GB | 56.2 tok/s | Ada Super, 504 GB/s |
| RTX 4060 Ti 16 GB | 38.6 tok/s | Ada, 288 GB/s (memory-bound) |
| RTX 5080 16 GB | 88.7 tok/s | Blackwell, 960 GB/s |
The 5070 is ~33% faster than the 4070 it replaces, almost entirely due to GDDR7's memory bandwidth lift. Token generation on a 14B model is memory-bandwidth-bound, not compute-bound — that bandwidth ratio shows up as throughput.
Common pitfalls
1. Trying Q6_K or Q8_0. They don't fit in 12 GB. You'll see Ollama silently offload some layers to CPU and your throughput drops from 60+ tok/s to 4–8 tok/s. Watch for "X/49 layers on GPU" in Ollama's startup log — if X < 49, you have a memory problem.
2. A desktop compositor eating 1–2 GB. GNOME / KDE / Windows compositor each consume meaningful VRAM. If you're on a headed system and getting OOM at Q5_K_M, that's why. For max LLM headroom, run headless with systemctl set-default multi-user.target (Linux) or close other GPU-using apps (Windows).
3. Old NVIDIA driver. Anything older than 570.x for Blackwell will undercut throughput by 5–15% and may not enable FP8 paths llama.cpp tries to use. Update before benchmarking.
4. Skipping -ngl 999. llama.cpp defaults to CPU-only offload. The flag is mandatory. Ollama handles this automatically; raw llama.cpp does not.
5. Comparing tok/s without saying which quant. Different quants have different throughput. Q3 is faster than Q4, which is faster than Q5. Don't compare your Q4 run to someone's Q3 benchmark and conclude something is wrong.
6. Buying the laptop variant assuming desktop performance. Mobile RTX 5070 / 5070 Ti have lower TGPs (105–115 W vs 250 W desktop) and run 30–40% slower at sustained inference. They're fine for ad hoc local chat, but expect 38–48 tok/s on the same Q4_K_M model.
7. Running on PCIe Gen 3. The 5070 lives on Gen 5; in a Gen 3 slot the prefill stage is bottlenecked transferring weights between framework and GPU at startup and on memory-mapped reloads. Doesn't affect tg128 much but feels slow on first prompt. Use Gen 4+ if possible.
Real-world numbers: how does Qwen 3 14B feel?
A few practical observations from running this setup as a daily-driver chat / code-helper for a month:
- Code generation: Solid for Python, JavaScript, shell, SQL. About 80% of Claude Haiku 4.5 quality on isolated function-writing tasks, way faster (60+ tok/s local vs. 30–40 tok/s on a hosted Haiku). Falls off for multi-file refactoring where context-tracking matters.
- Summarization: Good. 8K context is enough for most articles or short docs. For long PDFs, switch to 16K context (Q4_K_M with Q8 KV cache).
- Reasoning: OK. Multi-step math is hit-or-miss; for serious reasoning, use Qwen 3 32B (different card needed) or DeepSeek-R1 14B distill.
- Tool use: Qwen 3 supports the
<tool_call>JSON pattern reliably at this size. Building agents on it locally works. - Streaming feel: at 60+ tok/s output, the model "feels" instant in a chat UI. Interactive latency is bounded by prefill (~150 ms for a 1K-token prompt).
Power, noise, and thermal expectations
The 5070's 250 W TGP is mild by Blackwell standards. Sustained inference at 60+ tok/s pushes the card to roughly 200–220 W with junction temps in the 65–72 °C range on a typical dual-fan AIB cooler. Coil whine on inference workloads is uncommon — the load is steady rather than transient like gaming, so the VRMs aren't being asked to deliver sharp current changes. Plan a 650 W ATX 3.0 PSU as the minimum; 750 W gives comfortable headroom for the rest of a typical i7 / Ryzen 7 build.
Fan noise during a long chat session is similar to gaming — audible but not loud. If you're running this as an always-on local agent on a desk, undervolt the card by 50–80 mV in MSI Afterburner; you'll lose 2–4% throughput and drop wattage by 30–40 W with a noticeable noise improvement.
When NOT to run Qwen 3 14B on the RTX 5070
- You need >16K context. The 12 GB budget runs out. Either go to a 16 GB / 24 GB card or use a hosted endpoint with a 128K context window.
- You need higher quality than Q4_K_M. The honest answer at this card is "step up to the RTX 5080 16 GB or RTX 5090 32 GB" for Q5/Q6 headroom.
- You're running batched multi-user inference. Single-card vLLM tops out around 50–100 concurrent users on this size model. For higher concurrency, rent an L40S or H100.
- You care about FP16 reference behavior. Quantized models drift slightly from FP16 reference. If you're doing research where bit-for-bit reproducibility matters, you need full-precision weights, which means data-center hardware.
When the RTX 5070 is the wrong card for local LLM work
- You want to run 32B or larger models. Stop at 14B on a 5070. The 32B class needs 24 GB minimum (RTX 4090 / 5090 / 3090) and the quality drop at IQ2_XS on 32B isn't worth it.
- You want to run multi-modal (vision-language) 14B models. The image encoder pushes VRAM use up another 1–2 GB. Q4_K_M of a 14B VLM may not fit cleanly. Choose a 16+ GB card.
- You expect to keep the card for a 3+ year LLM upgrade path. As models grow, 12 GB becomes the new 8 GB. A 5080 (16 GB) or used 4090 (24 GB) ages better.
Final recommendation
For Qwen 3 14B specifically on the RTX 5070, the setup that works in production:
- Runner: Ollama (easiest) or llama.cpp (max control)
- Quant: Q4_K_M
- Context: 8K default, 16K with
-ctk q8_0 -ctv q8_0if you need longer chats - Expected throughput: 60–75 tok/s generation, 1,400+ tok/s prefill
- Memory footprint: ~10 GB
Don't over-think the quant choice. Q4_K_M is the best-bang-for-buck quantization at this size — well-tested, supported by every runner, and the quality delta to Q5/Q6 is small enough that the throughput / VRAM trade isn't worth it on a 12 GB card. People sometimes obsess over Q5 vs. Q4 benchmark deltas on perplexity charts; in actual product use, the difference is well below the variance from your sampler settings.
Get the model running, point your local agent / IDE / chat UI at the OpenAI-compatible endpoint, and use it. The setup is mature enough in 2026 that the hardest part of running an LLM on your own machine is now picking which one.
