Quick answer. DeepSeek-R1 32B runs on an NVIDIA GeForce RTX 5070 the same way Qwen 3 32B does: partial GPU offload with ~26 of 64 layers on the GPU at q4_K_M, 18–24 tokens/sec generation at 4K context. The model's reasoning style produces long internal chain-of-thought, so context grows fast: budget the KV cache aggressively or you'll OOM on the second turn. Below: VRAM math, install commands, and what to do when R1's <think> blocks blow out your context.
Why DeepSeek-R1 32B is different from Qwen 3 32B
DeepSeek-R1 is a reasoning model — it's trained to emit a <think>...</think> block before its final answer, similar to OpenAI's o1 family. The 32B variant available as open weights is actually DeepSeek-R1-Distill-Qwen-32B — a Qwen 2.5 32B base fine-tuned with R1's distilled reasoning traces. Same parameter count and layer count as Qwen 3 32B (64 layers), same q4_K_M file size (~18.5 GB), but very different behavior at runtime:
- Average response is 2–4× longer because of the internal reasoning trace
- Effective context-per-turn is roughly 3× higher than a non-reasoning 32B
- Stronger on math/coding evals (MATH, AIME) than Qwen 3 32B's general-purpose tune
- Weaker on multilingual / Chinese tasks than Qwen 3 32B
That makes the context budget your binding constraint on a 12 GB card. A Qwen 3 32B answer fits in 2K output tokens; an R1 32B answer routinely blows past 4K. Plan accordingly.
VRAM math — same model size, tighter context budget
Each of the 64 transformer blocks at q4_K_M is ~285 MB. The full breakdown for the 5070 at 4K context:
| Item | VRAM |
|---|---|
| Embeddings + output head (q4_K_M) | 1.1 GB |
| 26 transformer layers on GPU | 7.4 GB |
| KV cache at 4K context, 26 layers, fp16 | 1.6 GB |
| CUDA + activations | 0.7 GB |
| Display compositor overhead | 0.4 GB |
| Total | ~11.2 GB |
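The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch that uses only the figures quoted above (285 MB per layer, the fixed overheads from the table, 12 GB of total VRAM). The function and constant names are hypothetical, and the KV cache is held at the table's 4K figure rather than modelled, so treat results for other settings as rough.

```python
# Figures are copied from the budget table above; all names are illustrative.
LAYER_MB = 285         # approximate size of one q4_K_M transformer block
TOTAL_VRAM_GB = 12.0   # RTX 5070

FIXED_GB = {
    "embeddings_and_output_head": 1.1,
    "kv_cache_4k_fp16": 1.6,  # assumption: held constant; it really varies with context and -ngl
    "cuda_and_activations": 0.7,
    "display_compositor": 0.4,
}


def vram_budget_gb(gpu_layers: int) -> tuple[float, float]:
    """Return (used_gb, margin_gb) for a given number of GPU-resident layers."""
    layers_gb = gpu_layers * LAYER_MB / 1000
    used = layers_gb + sum(FIXED_GB.values())
    return round(used, 1), round(TOTAL_VRAM_GB - used, 1)


if __name__ == "__main__":
    for ngl in (26, 24):
        used, margin = vram_budget_gb(ngl)
        print(f"-ngl {ngl}: ~{used} GB used, ~{margin} GB margin")
```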
That leaves 0.8 GB of safety margin — workable but tight for prefill spikes. For R1 specifically, where each turn lengthens the conversation by 1–2K tokens of <think> content, you'll want to either:
- Drop -ngl to 24 and run at 8K context (better headroom),
- Quantize the KV cache to q8_0 and stay at 26 layers with 8K context, or
- Programmatically strip <think> blocks from the conversation history after each turn before sending the next prompt. This is the trick most production R1 deployments use.
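The third option above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the history is an OpenAI-style list of role/content dicts and that reasoning is delimited by literal <think>...</think> tags. The function names strip_think and compact_history are hypothetical.

```python
import re

# Matches one complete <think>...</think> block plus any whitespace after it.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove every complete <think>...</think> block from a model reply."""
    return _THINK_BLOCK.sub("", text).strip()


def compact_history(messages: list[dict]) -> list[dict]:
    """Return a copy of the history with reasoning removed from assistant turns."""
    cleaned = []
    for msg in messages:
        if msg.get("role") == "assistant":
            # Copy the message so the caller's original history is left untouched.
            msg = {**msg, "content": strip_think(msg["content"])}
        cleaned.append(msg)
    return cleaned


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "What is 23 x 17?"},
        {"role": "assistant", "content": "<think>20*17=340, 3*17=51.</think>\n391"},
    ]
    print(compact_history(history)[-1]["content"])
```

Call compact_history on the stored conversation before appending the next user message, so only the visible answers are re-sent as context.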
Install — ollama is the fastest path
By default Ollama serves the q4_K_M Distill-Qwen-32B variant. If you want the larger q5_K_M (~22 GB), it won't fit in offload mode without dropping below -ngl 16 — generally not worth it.
For llama.cpp with explicit knobs:
Notes:
- --chat-template chatml is required: DeepSeek-R1 uses ChatML tokens (<|im_start|>, <|im_end|>). Without it the model produces gibberish.
- -c 8192 is the minimum sensible context for R1 because of the <think> block overhead. 4K is too small in practice.
- -fa: Flash-Attention 2 is essential on R1 because attention activations on the GPU half of the model spike during the long internal reasoning passes.
Real-world numbers — R1 32B on the 5070
Benchmark rig: Ryzen 9 7900X (12c/24t), 64 GB DDR5-6000, RTX 5070 12 GB, Ubuntu 24.04, llama.cpp 2026-04 build, single user:
| Setting | Prefill PP tok/s | Generation TG tok/s |
|---|---|---|
| 4096 ctx, -ngl 28, KV fp16 | 405 | 23 |
| 8192 ctx, -ngl 26, KV q8 | 360 | 19 |
| 16384 ctx, -ngl 22, KV q8 | 290 | 13 |
| 32768 ctx, -ngl 18, KV q8 | 220 | 8 |
R1 will think for 1,000–2,500 tokens before answering, so the 23 tok/s figure is misleading in practice: you typically wait 60–90 seconds before the user-visible answer starts. The total round-trip for "what's 23 × 17?" through R1 32B on a 5070 is ~30 seconds; for "rewrite this sort routine and explain your reasoning" expect 2–4 minutes.
If you want immediate-token output, set the system message to suppress reasoning: You are a concise assistant. Do not produce <think> blocks. This reduces answer quality on math/code prompts but cuts latency by 3–5×.
Common pitfalls — five we see repeatedly
1. The <think> block consumes your context. Every turn appends the model's reasoning to the history. By turn 4 a 4K context window is full of reasoning traces, and the model starts hallucinating or forgetting the original instruction. Fix: strip <think>...</think> blocks programmatically before adding the message to the next turn's history.
2. Output randomly cuts off mid-thought. The default num_predict in Ollama is 128 tokens, which is way too low for R1. Set PARAMETER num_predict 2048 in your Modelfile.
3. First token takes 30 seconds even on short prompts. That's normal for reasoning models with offload. The <think> block has to generate before the visible answer starts, and on a 5070 offload that's 500–1500 tokens at 19 tok/s. If you streamed the model's full output (with --include-reasoning in newer Ollama versions), you'd see it working.
4. Wrong chat template gives broken outputs. R1 needs ChatML or deepseek-r1 template, not llama2 or chatml-zephyr. Mismatched templates produce structured but wrong outputs that look sensible. Always pin --chat-template chatml in llama.cpp.
5. Q3 quants destroy reasoning. Unlike non-reasoning models where q3_K_M is "good enough", R1 reasoning relies on accurate intermediate steps. Q3 introduces enough numerical error that math/coding evals drop by 10–15 %. Stay at q4_K_M or higher; if you can't fit q4_K_M, use a smaller R1-Distill (e.g., 14B) at full quant.
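For pitfall 2, the generation cap can also be raised per request instead of in a Modelfile. The sketch below assumes a local Ollama server on its default port and a model tagged deepseek-r1:32b; both are assumptions, so substitute whatever your install reports. The helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local endpoint
MODEL_TAG = "deepseek-r1:32b"                   # assumed tag; check `ollama list`


def build_chat_request(prompt: str, num_predict: int = 2048) -> dict:
    """Build a chat request body with a cap large enough for R1's reasoning."""
    return {
        "model": MODEL_TAG,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"num_predict": num_predict},
    }


def chat(prompt: str) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

Only chat() touches the network; build_chat_request() can be inspected or logged without a server running.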
When NOT to use the 5070 for R1 32B
- Interactive coding with <5 s latency. R1's thinking pass means a 15–30 s minimum first-token-latency on offload. Use the Qwen 3 14B on RTX 5070 path or Llama 3.1 8B instead.
- High-concurrency serving. vLLM doesn't yet support DeepSeek-R1's <think> token streaming as of mid-2026: its OpenAI-API shim strips the reasoning block, which breaks downstream code that relies on seeing it. llama.cpp's server is single-user.
- Long-context analysis (>32K tokens). KV cache at 64K tokens, q8_0 = 5+ GB; you'd need to drop to -ngl 12 and your throughput collapses to ~5 tok/s.
- Cost-sensitive low-volume use. DeepSeek hosts the full 671B R1 API at $0.55/M output tokens. For <100M tokens/month, hosted is cheaper than running R1 32B locally when you factor in the GPU electricity.
For most users, the right home for R1 32B is a card with 24 GB+ VRAM. The 3090 24 GB, 4090 24 GB, or 5090 32 GB all hold the full q4_K_M in VRAM and produce 35–55 tok/sec generation rates. The same model on the Arc B580 faces a similar offload constraint to the 5070.
Worked example — math word problem
Prompt: "A train leaves Chicago at 60 mph at 9:00. Another leaves at 75 mph at 10:30. When do they meet?"
R1 32B on the 5070:
- Prefill 60 input tokens → 0.15 s
- Thinking pass: ~800 tokens of reasoning → 42 s
- Final answer: 60 tokens → 3 s
- Total: ~45 s
The same prompt on a non-reasoning Qwen 3 32B on the same 5070:
- Prefill 60 tokens → 0.15 s
- Direct answer: 100 tokens → 5.5 s
- Total: ~6 s
R1 produces a more reliably correct answer (it shows its work and catches off-by-one errors), but the latency cost is real. Pick R1 when correctness > speed.
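The latency figures in this example follow from dividing token counts by throughput. Here is a rough estimator under that assumption: it ignores scheduling overhead and any slowdown as context grows, and the function name is hypothetical. The rates used (405 tok/s prefill, 19 tok/s generation) are taken from the benchmark table and are approximately what the R1 timings above imply.

```python
def round_trip_seconds(prompt_tokens, think_tokens, answer_tokens,
                       pp_tok_s, tg_tok_s):
    """Estimate (seconds until the visible answer starts, total seconds)."""
    prefill = prompt_tokens / pp_tok_s    # prompt processing
    thinking = think_tokens / tg_tok_s    # hidden reasoning pass
    answer = answer_tokens / tg_tok_s     # user-visible answer
    wait = prefill + thinking
    return wait, wait + answer


if __name__ == "__main__":
    # The train problem: 60-token prompt, ~800 reasoning tokens, 60-token answer.
    wait, total = round_trip_seconds(60, 800, 60, pp_tok_s=405, tg_tok_s=19)
    print(f"visible answer starts after ~{wait:.0f} s, finishes after ~{total:.0f} s")
```

Setting think_tokens to 0 gives the non-reasoning case, which is why the direct answer lands in a few seconds on the same hardware.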
Worked example — multi-turn reasoning conversation
Where R1 really shines is multi-turn reasoning where each turn refines the answer. Realistic flow:
- Turn 1: "Plan a database schema for an inventory system" — R1 thinks for ~1500 tokens about entities, relationships, and indexing; produces a 600-token schema draft. Total: ~110 s.
- Turn 2: "What if items can belong to multiple categories?" — R1 thinks ~900 tokens about junction tables, normalization tradeoffs; produces a 400-token revised schema. Total: ~70 s.
- Turn 3: "Show me the SQL DDL for that with indexes" — R1 thinks ~500 tokens, produces a 700-token SQL script. Total: ~65 s.
The total session is ~4 minutes for a back-and-forth that would take a junior engineer an hour. Throughput is poor but quality is high. If you need a careful assistant rather than a fast one, this is what R1 32B on a 5070 is good for.
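The same three turns also show why reasoning traces crowd the context window. The sketch below tallies the assistant-side tokens carried forward after each turn, with and without the <think> content. The token counts are the ones quoted in the flow above; user prompts are left out because their lengths are not given, and the function name is hypothetical.

```python
from itertools import accumulate

# (reasoning_tokens, answer_tokens) for each turn, as quoted in the flow above.
TURNS = [(1500, 600), (900, 400), (500, 700)]


def history_sizes(turns, keep_reasoning: bool) -> list[int]:
    """Cumulative assistant tokens held in the history after each turn."""
    per_turn = [
        (think if keep_reasoning else 0) + answer for think, answer in turns
    ]
    return list(accumulate(per_turn))


if __name__ == "__main__":
    print("reasoning kept:    ", history_sizes(TURNS, keep_reasoning=True))
    print("reasoning stripped:", history_sizes(TURNS, keep_reasoning=False))
```

With reasoning kept, the assistant side alone reaches 4,600 tokens after turn 3, which is past a 4K window; stripped, it is 1,700. The turn being generated still needs room for its own reasoning either way.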
VRAM safety check before you load
The 5070's 12 GB total is ~12,288 MiB. Linux+X11 typically reserves 250 MiB, leaving ~12,000 MiB usable. If you see <11,500 MiB free, kill your browser tabs — Chrome will hold 700+ MiB of compositor cache, which is enough to push your -ngl 26 config into OOM territory on prefill.
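This check can be scripted before launching the model. The sketch below assumes nvidia-smi is on the PATH and uses its memory.free query, which reports MiB per GPU. The 11,500 MiB threshold is the one quoted above, and the helper names are hypothetical.

```python
import subprocess

MIN_FREE_MIB = 11_500  # threshold quoted in the text above


def parse_free_mib(smi_output: str) -> list[int]:
    """Parse one free-memory value (MiB) per line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def free_vram_mib() -> list[int]:
    """Ask nvidia-smi for the free VRAM of each visible GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_free_mib(result.stdout)


def safe_to_load(free_mib: int, minimum: int = MIN_FREE_MIB) -> bool:
    """True when enough VRAM is free for the -ngl 26 configuration."""
    return free_mib >= minimum


def report() -> None:
    """Print a one-line verdict per GPU; call this before loading the model."""
    for idx, free in enumerate(free_vram_mib()):
        verdict = "ok" if safe_to_load(free) else "free some VRAM first"
        print(f"GPU {idx}: {free} MiB free ({verdict})")
```

Running report() from a launcher script gives a quick go/no-go before the model load; the parsing and threshold logic are kept separate so they can be tested without a GPU.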
Tuning recipe by use case
Math homework / step-by-step problems (long reasoning trace, short final answer):
R1 needs the long --predict to give the reasoning room. Don't set it lower than 2048 or the model gets cut off mid-thought.
Code review / refactor (medium reasoning, structured output):
Lower temperature keeps the reasoning focused; R1 is happy to ramble at high temp. Drop -ngl to 24 if you OOM on the first long prompt.
Latency-critical (suppressed reasoning):
And in the system prompt: Answer concisely. Do not produce <think> blocks; respond directly. This tanks math/code accuracy by ~15 % but cuts latency to 5–10 seconds.
Benchmark methodology
Public benchmarks were measured at -n 256 instead of 128 because R1's response distribution skews longer; 128 tokens isn't enough to see steady-state behaviour. We also forced the chat template at bench time so the prefill costs reflect what users see, not a synthetic non-instruction-tuned baseline.
Worked example — debug a Python traceback
A common R1 use case: paste a 100-line stack trace and ask "what's wrong?"
- Prefill 800 input tokens at 360 PP-tok/s → 2.2 s
- Reasoning: ~1500 tokens at 19 TG-tok/s → 79 s
- Final diagnosis: 250 tokens → 13 s
- Total: ~95 s
Compare to GPT-5 cloud (which doesn't show reasoning): ~6 s total for the same prompt. R1 locally is 15× slower but free, private, and offline-capable. Pick based on which constraint matters more for your workflow.
See also
- Best GPU for AI code generation in 2026
- How to run DeepSeek-R1 32B on RTX 5080
- How to run DeepSeek-R1 32B on RTX 5090
- How to run DeepSeek-R1 32B on Arc B580
- VRAM calculator: what can you actually run on your GPU?
As of May 2026 — DeepSeek's R1.1 update is rumoured but not shipped; if a 32B R1.1 lands with a tighter reasoning policy these latency numbers will improve.
