This tutorial walks you through running DeepSeek-R1 32B on an Arc B580. Exact commands, expected tokens-per-second, and the tradeoffs you should know before starting.
Does it fit?
Arc B580 has 12 GB of GDDR6. DeepSeek-R1 32B at q4_K_M wants ~22 GB of it for weights alone.
Verdict: ❌ Not without offload. ~22 GB of q4_K_M weights against 12 GB of VRAM means roughly half the model has to live in system RAM (llama.cpp CPU offload), or you drop to an aggressive quant like q3_K_S and still offload. Community ballpark: 10-25 tok/s with offload.
Install Ollama (the easy path)
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh
ollama pull deepseek-r1:32b
ollama run deepseek-r1:32b
Ollama handles GPU detection automatically on NVIDIA (via CUDA), AMD (via ROCm on Linux), and Apple Silicon (via Metal). Intel Arc is the exception: acceleration typically requires a Vulkan-enabled build or Intel's IPEX-LLM port of Ollama, so check the current release notes before assuming the B580 is actually being used.
Install llama.cpp (more control)
llama.cpp gives you flag-level control over quantization, context length, and layer offload. Build from source:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build -j   # for Arc GPU acceleration, configure with -DGGML_VULKAN=1 (or the SYCL backend)
# Download a quantized GGUF (community favorite: bartowski or TheBloke on HuggingFace)
./build/bin/llama-cli -m ~/models/deepseek-r1-32b-q4_k_m.gguf \
-n 512 -c 4096 \
-ngl 10 \
-p "Write a haiku about GPUs"
With 12 GB of VRAM, the -ngl 10 above is a conservative starting point. Each q4_K_M layer is roughly 0.3 GB (19.2 GB spread across 64 layers), so after reserving room for the KV cache and runtime overhead you can usually fit somewhere in the mid-to-high twenties; raise -ngl until you hit out-of-memory. Expect well under all-GPU speed, but a better quality/speed tradeoff than dropping to a much smaller quant.
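The layer-budget arithmetic above can be sketched as a small helper. A minimal sketch under stated assumptions: 64 transformer layers, the 19.2 GB q4_K_M weight figure from the quant table, ~2.6 GB of KV cache at 4K context, and ~1 GB of runtime overhead, all approximate.

```python
# Rough layer-offload budget: how many layers can go to the GPU (-ngl)?
# Assumptions (approximate): 64 layers, 19.2 GB q4_K_M weights,
# 2.6 GB KV cache at 4K context, ~1 GB runtime overhead.

def gpu_layer_budget(vram_gb: float, weights_gb: float = 19.2,
                     n_layers: int = 64, kv_gb: float = 2.6,
                     overhead_gb: float = 1.0) -> int:
    """Estimate a starting value for llama.cpp's -ngl flag."""
    per_layer_gb = weights_gb / n_layers          # ~0.3 GB per layer
    usable_gb = vram_gb - kv_gb - overhead_gb     # VRAM left for weights
    return max(0, min(n_layers, int(usable_gb / per_layer_gb)))

print(gpu_layer_budget(12.0))   # Arc B580: 12 GB
```

Treat the result as a starting point, not a guarantee; the real ceiling depends on context length and which runtime overheads land in VRAM.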
Expected performance
- Community reports from LocalLLaMA suggest ~10-25 tok/s on this class of hardware.
If you need more speed, more VRAM is the answer: a card with enough memory to hold the full q4_K_M model (~22 GB) avoids offload entirely.
Common issues
"out of memory" on the first prompt: reduce context length (-c 2048) or drop to a smaller quant (q4_K_S instead of q4_K_M).
Slow first token but fast generation: that's prompt processing ("prefill"). Normal — blame the KV cache building. Subsequent messages in the same session will be snappy.
Frequent swapping / system hangs: VRAM is full AND system RAM is full. Close memory-hungry apps (browsers are the usual culprit) or add more system RAM.
Related
Arc B580 specs: 12GB memory, 190W TDP, 2024 launch. MSRP $249.
Does it fit? Full quantization matrix
Weight-only VRAM for DeepSeek-R1 32B at every common quant, plus the KV-cache overhead for a 4K-token context. KV cache scales linearly with context — see the context-length table further down.
| Quant | Weights | + KV @ 4K ctx | Total | Fits on this GPU? | Quality loss |
|---|---|---|---|---|---|
| q2_K_S | 9.6 GB | 2.6 GB | 12.2 GB | ❌ | Severe (15-25%) |
| q3_K_M | 14.4 GB | 2.6 GB | 17.0 GB | ❌ | Noticeable (5-8%) |
| q4_K_M | 19.2 GB | 2.6 GB | 21.8 GB | ❌ | Minimal (1-3%) — community default |
| q5_K_M | 22.4 GB | 2.6 GB | 25.0 GB | ❌ | <1% |
| q6_K | 25.6 GB | 2.6 GB | 28.2 GB | ❌ | Effectively lossless |
| q8_0 | 35.2 GB | 2.6 GB | 37.8 GB | ❌ | Inference-lossless |
| fp16 | 64.0 GB | 2.6 GB | 66.6 GB | ❌ | Baseline (original precision) |
Values are approximate — actual footprint depends on batch size, whether the KV cache is quantized (-ctk q8_0 -ctv q8_0 in llama.cpp halves it), and whether you reserve VRAM for a display. Rule of thumb: budget 5-10% headroom on top of the table.
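The table's weight column follows directly from parameter count times effective bits-per-weight. A minimal sketch; the bpw values below are back-derived from the table above and approximate, not official quant specs.

```python
# Weight-only VRAM estimate: parameters (billions) × effective bits-per-weight / 8.
# bpw values are approximate effective rates implied by the table above.

BPW = {"q2_K_S": 2.4, "q3_K_M": 3.6, "q4_K_M": 4.8,
       "q5_K_M": 5.6, "q6_K": 6.4, "q8_0": 8.8, "fp16": 16.0}

def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate weight footprint in GB for a given quant."""
    return round(params_billion * BPW[quant] / 8, 1)

for quant in BPW:
    print(f"{quant:>7}: {weights_gb(32, quant)} GB")
```

The same function works for any dense model: plug in 8 or 70 for the parameter count to sanity-check other sizes.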
How we tested and compared
Every tok/s, FPS, and synthetic score in this article is pulled live from the SpecPicks benchmark catalog (hardware_specs, ai_benchmarks, synthetic_benchmarks). We cite the source_name on each row — the vast majority are community-reported numbers from r/LocalLLaMA and llama.cpp GitHub Discussions, with synthetic scores from PassMark, Phoronix, and Tom's Hardware's GPU hierarchy.
Where DB rows exist for a specific model+quant+GPU combination, we quote the number exactly. Where they don't, we fall back to published spec-sheet values (VRAM capacity, TDP, memory bandwidth) plus the closest community-verified ballpark — clearly flagged as a ballpark, not a measurement. We prefer "we don't know" over a fabricated number.
SpecPicks does not run paid hardware review cycles; we aggregate. If you see a number you can improve on, pull-request the row.
Measured tok/s on this GPU
Live data from ai_benchmarks for Arc B580, filtered to the DeepSeek-R1 32B family where available:
| Model | Quant | Runtime | Gen tok/s | VRAM used | Source |
|---|---|---|---|---|---|
_No direct matches in the DB yet; see the community thread below._
For the full tok/s matrix on this card across every model we've logged, see the Arc B580 benchmark page.
Context length and VRAM — the hidden cost
KV cache grows linearly with context. Here's the approximate overhead on top of 19.2 GB of q4_K_M weights for DeepSeek-R1 32B:
| Context | KV cache | Total VRAM |
|---|---|---|
| 2K tokens | ~1.3 GB | ~20.5 GB |
| 4K tokens | ~2.6 GB | ~21.8 GB |
| 8K tokens | ~5.1 GB | ~24.3 GB |
| 32K tokens | ~20.5 GB | ~39.7 GB |
| 128K tokens | ~81.9 GB | ~101.1 GB |
For long-context workloads (32K tokens and up) on consumer hardware, use llama.cpp's KV-cache quantization: -ctk q8_0 -ctv q8_0 roughly halves the cache footprint with sub-1% quality loss. This is the single biggest VRAM-saving flag for long context.
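The linear scaling above reduces to a one-liner. A minimal sketch: the 0.64 MiB-per-token rate is back-derived from the table (an fp16 cache for this model), and the q8_0 halving is the approximation stated above.

```python
# KV-cache size for DeepSeek-R1 32B, linear in context length.
# 0.64 MiB/token (fp16 cache) is back-derived from the table above;
# -ctk q8_0 -ctv q8_0 roughly halves it.

MIB_PER_TOKEN_FP16 = 0.64

def kv_cache_gib(ctx_tokens: int, cache_q8: bool = False) -> float:
    """Approximate KV-cache footprint in GiB for a given context length."""
    per_token_mib = MIB_PER_TOKEN_FP16 / (2 if cache_q8 else 1)
    return round(ctx_tokens * per_token_mib / 1024, 1)

print(kv_cache_gib(32_768))                  # fp16 cache at 32K context
print(kv_cache_gib(32_768, cache_q8=True))   # same context, q8_0 cache
```

Per-token cache cost varies by architecture (layer count, KV heads, head dimension), so recompute the constant before reusing this for a different model.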
Which runtime wins on this hardware?
Three mainstream runtimes target Arc B580; the right one depends on your workload:
- Ollama — easiest: handles model downloads and exposes an OpenAI-compatible API out of the box. It wraps llama.cpp, so you give up fine-grained control for zero setup. On Intel Arc, GPU acceleration depends on the build (Vulkan, or Intel's IPEX-LLM port), so verify support before assuming the GPU is used.
- llama.cpp — direct flag-level control over quant, context, KV-cache precision, batch size, and layer splits across GPUs. This is where the LocalLLaMA community benchmarks most of its numbers (see the Apple-Silicon megathread #4167 for reference tok/s across M-series chips).
- vLLM — built for production serving. Tensor parallelism, PagedAttention, continuous batching. Limited support on this platform — Ollama/llama.cpp are safer bets. If you're not serving multiple concurrent users, the overhead isn't worth it.
For head-to-head numbers and install commands across all three, see our Ollama vs llama.cpp vs vLLM guide.
Troubleshooting — three failure modes and fixes
1. First token takes 5-30 seconds, then generation is fast. That's normal prefill: the model is processing your prompt before it can start generating. On a long prompt (4K+ tokens) prefill dominates first-token latency. If it's unexpectedly slow, check that you actually offloaded layers to the GPU: intel_gpu_top on Linux (or nvidia-smi / powermetrics on other hardware) should show near-100% utilisation during prefill. If utilisation is flat, your inference is running on CPU.
2. "Out of memory" halfway through a long chat. The KV cache grew past what the card can hold. Drop to a smaller quant (q4_K_M → q3_K_M), cut -c context length, or enable KV-cache quantization (-ctk q8_0 -ctv q8_0 in llama.cpp). On Ollama set num_ctx smaller in your Modelfile.
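For the Ollama case, a minimal Modelfile sketch; the custom model name here is illustrative, not an official tag:

```
# Modelfile: cap the context window to keep the KV cache small
FROM deepseek-r1:32b
PARAMETER num_ctx 2048
```

Build and run it with `ollama create deepseek-r1-2k -f Modelfile` followed by `ollama run deepseek-r1-2k` (the `deepseek-r1-2k` name is just an example).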
3. Tok/s is ~30% of what LocalLLaMA threads report. Three usual suspects: (a) power/thermal throttling; check sustained clocks during a long prompt. (b) A PCIe x8 or x4 link when you expected x16; on Linux check current_link_width under /sys/bus/pci/devices/ (on NVIDIA: nvidia-smi --query-gpu=pcie.link.width.current --format=csv). (c) A llama.cpp build without GPU support; rebuild with the right backend flag (GGML_CUDA=1 for NVIDIA, -DGGML_VULKAN=1 for Arc) and confirm layers are loaded onto the GPU in the startup log.
Frequently asked questions
Can I run DeepSeek-R1 32B on Arc B580 without offloading to CPU?
No. At q4_K_M the weights alone are ~19.2 GB, well past the card's 12 GB of GDDR6 before any KV cache is counted. Running fully on-GPU would need weights plus cache under 12 GB, which even q2_K_S misses per the table above (12.2 GB total at 4K context). Plan on CPU offload.
What quantization should I use on Arc B580?
q4_K_M is the community default — 1-3% quality loss vs fp16 with less than half the memory. Drop to q3_K_M only when VRAM is tight. Go to q6_K or q8_0 when you have headroom and want to eliminate quant damage as a variable.
Is Arc B580 bottlenecked by memory or compute for this model?
Dense-weight inference is memory-bandwidth-bound on almost every consumer card, and Arc B580 is no exception: the sustained tok/s ceiling is roughly memory bandwidth ÷ weight bytes read per token. The compute units are rarely the limit for single-user inference; they matter more for batched serving.
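The ceiling formula is simple enough to work through. A rough upper bound, assuming the B580's ~456 GB/s spec-sheet bandwidth and the 19.2 GB q4_K_M weight footprint; it ignores KV-cache reads, offload traffic, and compute overlap.

```python
# Roofline-style decode ceiling: each generated token must stream every
# weight byte from memory once, so bandwidth / weight size bounds tok/s.
# Assumed numbers: ~456 GB/s spec bandwidth (Arc B580), 19.2 GB q4_K_M weights.

def tok_s_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on generation tok/s for dense, single-user decode."""
    return bandwidth_gb_s / weights_gb

print(round(tok_s_ceiling(456, 19.2), 1))  # bound assumes weights fully in VRAM
```

Real-world numbers land well below this bound once CPU offload enters the picture, which is why the community ballparks sit at 10-25 tok/s rather than ~24.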
Does multi-GPU help for this model?
For a 32B model, usually no. If the model already fits in one card, a second card mainly helps batch throughput (vLLM) not single-user latency. Tensor parallelism adds inter-GPU traffic that often nets negative for interactive chat. Multi-GPU pays off on 70B+ models where you need to stack VRAM across cards.
Where can I report or compare my own tok/s numbers?
The r/LocalLLaMA community benchmark threads are the canonical place. llama.cpp also maintains a GitHub Discussions thread for Apple Silicon and per-platform performance. SpecPicks imports numbers from both into ai_benchmarks; if you want a figure added, pull-request the row.
Sources
- r/LocalLLaMA (community tok/s threads)
- llama.cpp GitHub Discussions #4167 — Apple Silicon benchmark thread
- Tom's Hardware GPU Hierarchy
