Quick answer. Running Qwen 3 32B on the Intel Arc B580 requires partial CPU offload: q4_K_M is 18.5 GB and the card has 12 GB of GDDR6. With IPEX-LLM + Ollama keeping ~24 of 64 layers on the GPU and the rest on the CPU with DDR5 system memory, expect 12–20 tokens/sec generation at 4K context. That's about 90% of what an RTX 5070 (12 GB) does (16 vs 18 tok/s in the comparison table below) at under half the GPU price. Below: VRAM math, the IPEX-LLM install recipe, and the failure modes specific to SYCL offload.
Why this combination is interesting
The Arc B580 is Intel's $249 12 GB card from late 2024: Battlemage architecture, 20 Xe2 cores, GDDR6 at 19 Gbps on a 192-bit bus (~456 GB/s bandwidth), 190 W TBP. For pure LLM inference per dollar it's the best deal as of 2026. But 12 GB isn't enough for a 32B model at any reasonable quantization, so this is a partial-offload scenario: worthwhile as a price-conscious experiment, provided you understand the tradeoffs.
Qwen 3 32B (Alibaba, 2025) is a 32-billion-parameter dense reasoning-tuned model with 64 layers, a 128K-token YaRN-extended context, and an Apache-2.0 license. GGUF quants in order of size: q2_K (~11 GB), q3_K_M (~14 GB), q4_K_M (~18.5 GB), q5_K_M (~22 GB), q8_0 (~34 GB), BF16 (~64 GB). A 12 GB card can just barely hold q2_K (leaving almost nothing for the KV cache); q3_K_M at ~14 GB already exceeds VRAM, so everything from q3_K_M up runs with offload.
The recommended quant for the B580 is q4_K_M with offload. q3_K_M loses ~5 % on math and coding benchmarks while gaining little real-world speed (you offload fewer layers, but the bottleneck is still the CPU). q2_K loses 10–15 % quality and is not worth it.
VRAM math — what fits on the B580
Each layer at q4_K_M is ~285 MB. Budget at 4K context, fp16 KV cache:
| Item | VRAM |
|---|---|
| Embeddings + output head | 1.1 GB |
| 24 transformer layers on GPU | 6.8 GB |
| KV cache, 4K ctx, 24 layers, fp16 | 1.5 GB |
| SYCL + activation overhead | 0.9 GB |
| Display compositor (if same card) | 0.3 GB |
| Total | ~10.6 GB |
That leaves ~1.4 GB of headroom. With the fp16 KV cache, pushing to 26 layers at 4K context will OOM on a long prefill; quantizing the KV cache to q8_0 frees enough room for 25–26 layers. The remaining 40 layers run on the CPU side via SYCL's host-side execution path, where DDR5 memory bandwidth is the binding constraint.
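The budget above can be re-run for other layer counts with a small script. This is a minimal sketch: the constants are copied from the table, the function name is made up for illustration, and the assumption that the KV cache scales linearly with GPU layers and context length is mine, not a measured result.

```python
# Figures copied from the budget table; linear KV-cache scaling is an assumption.
VRAM_GB = 12.0
LAYER_GB = 0.285              # one q4_K_M transformer layer
EMBED_HEAD_GB = 1.1           # embeddings + output head
OVERHEAD_GB = 0.9             # SYCL runtime + activations
DISPLAY_GB = 0.3              # compositor, if the card also drives a display
KV_GB_AT_24_LAYERS_4K = 1.5   # fp16 KV cache for 24 GPU layers at 4K context


def vram_budget_gb(gpu_layers, ctx=4096, kv_size_ratio=1.0, display=True):
    """Estimated VRAM use in GB; kv_size_ratio=1.0 means an fp16 KV cache."""
    kv = KV_GB_AT_24_LAYERS_4K * (gpu_layers / 24) * (ctx / 4096) * kv_size_ratio
    total = EMBED_HEAD_GB + gpu_layers * LAYER_GB + kv + OVERHEAD_GB
    if display:
        total += DISPLAY_GB
    return total


for layers in (22, 24, 26):
    used = vram_budget_gb(layers)
    print(f"{layers} layers: {used:.2f} GB used, {VRAM_GB - used:.2f} GB headroom")
```

For a q8_0 KV cache, a `kv_size_ratio` of roughly 0.53 is a reasonable starting point (8.5 bits per value instead of 16). Treat the output as an estimate and confirm against real memory usage before committing to a layer count.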
Install — IPEX-LLM is required for full speed
The B580's matrix engines need Intel's SYCL kernel runtime. Vanilla llama.cpp Vulkan works but gives ~70 % of the speed.
Recommended: IPEX-LLM + Ollama wrapper
The wrapper environment routes Ollama's llama.cpp backend through SYCL and sets n_gpu_layers automatically based on detected VRAM. Confirm GPU activity with intel_gpu_top from intel-gpu-tools.
Alternative: llama.cpp Vulkan (no Python deps)
Note: -t 6 matches the physical core count of a 6-core CPU. With more cores, raise it. Hyperthreading/SMT threads are usually counterproductive for the layers that stay on the CPU.
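As an illustration of how the thread count and offload settings fit together, here is a small helper that assembles a llama.cpp command line. It is a sketch, not the article's original command: the model filename is a placeholder, the helper name is invented, and the values (22 GPU layers, 4096 context, 6 threads) are taken from the Vulkan row of the benchmark table and the note above.

```python
import shlex


def llama_cli_args(model_path, gpu_layers, ctx, threads, prompt):
    """Build an argument list for llama.cpp's llama-cli binary."""
    return [
        "llama-cli",
        "-m", model_path,          # GGUF model file (placeholder name below)
        "-ngl", str(gpu_layers),   # number of layers offloaded to the GPU
        "-c", str(ctx),            # context size in tokens
        "-t", str(threads),        # CPU threads for the layers left on the CPU
        "-p", prompt,              # prompt text
    ]


args = llama_cli_args(
    "qwen3-32b-q4_k_m.gguf",  # hypothetical path; point this at your own file
    gpu_layers=22,
    ctx=4096,
    threads=6,
    prompt="Hello",
)
print(shlex.join(args))
```

Passing the list to `subprocess.run(args)` avoids shell-quoting problems with long prompts; the printed form is only there for copy-pasting into a terminal.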
Real-world numbers — Arc B580 + Qwen 3 32B
Benchmark rig: Ryzen 5 7600 (6c/12t), 32 GB DDR5-6000, Arc B580 12 GB, Ubuntu 24.04, IPEX-LLM 2.7.1:
| Setting | Prefill PP tok/s | Generation TG tok/s |
|---|---|---|
| 2048 ctx, -ngl 26, SYCL | 305 | 19 |
| 4096 ctx, -ngl 24, SYCL | 270 | 16 |
| 8192 ctx, -ngl 20, SYCL, KV q8 | 215 | 11 |
| 16384 ctx, -ngl 16, SYCL, KV q8 | 165 | 7 |
| 4096 ctx, -ngl 22, Vulkan | 175 | 12 |
The SYCL backend wins by ~35 % over Vulkan for partial-offload workloads (16 vs 12 tok/s generation at 4K context), and the gap widens as more layers move to the GPU. If you stay on Vulkan, expect the lower end of the throughput range.
Comparison — Qwen 3 32B across 12 GB cards
| GPU | VRAM | Bus / Bandwidth | TG @ 4K (q4_K_M) | Approx. price |
|---|---|---|---|---|
| Arc B580 | 12 GB GDDR6 | 192-bit / 456 GB/s | 16 tok/s | $249 |
| RTX 3060 | 12 GB GDDR6 | 192-bit / 360 GB/s | 14 tok/s | $320 used |
| RTX 4070 | 12 GB GDDR6X | 192-bit / 504 GB/s | 22 tok/s | $599 |
| RTX 5070 | 12 GB GDDR7 | 192-bit / 672 GB/s | 18 tok/s | $549 |
The B580 lands between the 3060 and 4070 — surprising given it's the cheapest of the lot. It wins on bandwidth-per-dollar and loses on raw FLOPS. For a single-user LLM workload that's bandwidth-bound, that's the right trade; for gaming or video work, the 4070 wins handily.
The B580's board power is also ~60 W lower than the 5070's (190 W vs 250 W), which adds up if you're running inference for hours per day.
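The bandwidth-per-dollar claim can be checked directly from the comparison table. The sketch below only rearranges the table's own numbers; the dictionary layout and variable names are illustrative.

```python
# (bandwidth GB/s, generation tok/s at 4K, approx. price USD), from the table above.
cards = {
    "Arc B580": (456, 16, 249),
    "RTX 3060": (360, 14, 320),
    "RTX 4070": (504, 22, 599),
    "RTX 5070": (672, 18, 549),
}

for name, (bandwidth, tok_s, price) in cards.items():
    print(
        f"{name}: {bandwidth / price:.2f} GB/s per $, "
        f"{1000 * tok_s / price:.1f} tok/s per $1000"
    )

best_bandwidth_value = max(cards, key=lambda n: cards[n][0] / cards[n][2])
best_throughput_value = max(cards, key=lambda n: cards[n][1] / cards[n][2])
print("Best bandwidth per dollar:", best_bandwidth_value)
print("Best tok/s per dollar:", best_throughput_value)
```

Prices move quickly, so swap in current street prices before drawing conclusions.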
Common pitfalls — specific to Arc + SYCL
1. Driver version mismatch silently disables GPU. SYCL needs Intel compute-runtime 24.39+. The default Ubuntu 24.04 repo ships 24.13. intel_gpu_top will show 0 % during inference and you'll wonder why TG is 2 tok/s — it's CPU-only. Fix: add the Intel apt repo and apt upgrade.
2. Mesa Vulkan vs Intel Vulkan. On Ubuntu, mesa-vulkan-drivers provides a working but slow Vulkan implementation for the B580. The Intel-provided Vulkan ICD is ~25 % faster. Verify with vulkaninfo --summary | grep deviceName — you want Intel(R) Arc(TM) B580 Graphics, not Mesa.
3. OLLAMA_LLM_LIBRARY defaults to CPU. Vanilla Ollama doesn't auto-detect Arc. Either use the ollama-ipex wrapper or set OLLAMA_LLM_LIBRARY=llama_sycl and ensure llama-sycl was linked at Ollama build time.
4. Long context KV-quantization corruption. Vulkan KV q8 cache was buggy in llama.cpp builds before b5301. Symptom: the model produces correct text for the first 100 tokens, then degrades into repeated phrases. Update llama.cpp.
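5. PCIe slot bandwidth. The B580 is electrically PCIe 4.0 x8. If your motherboard puts it in an x4 slot (a second x16-length slot is sometimes wired x4), model loading takes 2× longer and large-batch inference loses ~12 % throughput. LnkSta sits well below the device header in lspci -vv output, so filter by device address instead of by name: find the address with lspci | grep -i -E "vga|display", then run sudo lspci -vv -s 03:00.0 | grep LnkSta (substituting your card's address).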
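For pitfall 5, the negotiated link width can be checked programmatically. The sketch below parses an LnkSta line of the kind lspci -vv prints; the sample string is hand-written for illustration, not captured from a real B580, and the function name is invented.

```python
import re

# Matches e.g. "LnkSta: Speed 16GT/s, Width x8" and variants with "(ok)" or
# "(downgraded)" annotations after the speed.
LNKSTA_RE = re.compile(r"LnkSta:\s*Speed\s+([\d.]+)GT/s[^,]*,\s*Width\s+x(\d+)")


def parse_lnksta(text):
    """Return (speed in GT/s, lane count) from lspci output, or None if absent."""
    match = LNKSTA_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


sample = "LnkSta: Speed 16GT/s, Width x4 (downgraded)"  # illustrative input
speed, width = parse_lnksta(sample)
if width < 8:
    print(f"Link is x{width} at {speed} GT/s: the B580 expects x8")
else:
    print(f"Link is x{width} at {speed} GT/s")
```

PCIe 4.0 runs at 16 GT/s per lane, so a healthy B580 link should report that speed with a width of x8.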
When NOT to use the B580 for 32B models
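- You need >20 tok/sec. Partial offload caps throughput, so the real fix is a card that holds the model entirely in VRAM: 32B q4_K_M needs 20 GB or more, for example an RTX 4000 Ada (20 GB). The RTX 5070 and 4070 Super run 32B q4 faster than the B580, but both are 12 GB cards and still offload.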
- You need NVIDIA-specific tooling. vLLM, FlashInfer, AutoAWQ, ExLlamaV2 — none of these have Arc backends. If your workflow depends on them, run NVIDIA.
- You serve multiple concurrent users. Partial-offload + SYCL serializes requests. Concurrent users on the same B580 will see 4–5 tok/s each.
- You need text-generation-inference (TGI) or Triton-Inference-Server. Both have Intel backends but they're experimental as of mid-2026.
- You're going to run Qwen 3 14B more often than 32B. The 14B model fits in VRAM on the B580 at q4_K_M with ~3 GB to spare — see our Qwen 3 14B on Arc B580 guide. Throughput is 4× higher and quality is good enough for most coding tasks.
Worked example — translation + reasoning prompt
Realistic test: paste a 2,000-token English research-paper abstract, ask Qwen 3 32B to translate to Mandarin and summarise the key findings:
- Prefill 2,000 tokens at 270 PP-tok/s → 7.4 s to first token
- Generation 800 output tokens at 16 TG-tok/s → 50 s
- Total user-visible: ~57 s
For one-shot batch tasks (overnight document translation, archival summarisation), that's fine. For interactive chat, it's slow but workable. Compare to the B580 running the 14B variant where the same prompt finishes in ~12 s.
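The arithmetic in this example generalises to any prompt and answer length. A minimal sketch, assuming prefill and generation rates stay constant over the request (in practice generation slows slightly as the KV cache grows); the function name is illustrative.

```python
def response_time_s(prompt_tokens, output_tokens, pp_tok_s, tg_tok_s):
    """Return (time to first token, generation time, total), all in seconds."""
    time_to_first_token = prompt_tokens / pp_tok_s
    generation_time = output_tokens / tg_tok_s
    return time_to_first_token, generation_time, time_to_first_token + generation_time


# Numbers from the worked example: 2,000-token prompt, 800-token answer,
# 270 tok/s prefill and 16 tok/s generation (4096 ctx, -ngl 24, SYCL).
ttft, gen, total = response_time_s(2000, 800, pp_tok_s=270, tg_tok_s=16)
print(f"first token after {ttft:.1f} s, generation {gen:.1f} s, total {total:.1f} s")
```

Swapping in the rates from another row of the benchmark table shows how much a different context size or backend changes the wait.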
Cost-per-token-per-day
For a personal local-LLM rig running ~8 hours per day:
- B580: 190 W × 0.5 utilization × 8 h = 0.76 kWh/day; at $0.13/kWh ≈ $0.10/day
- RTX 5070: 250 W × 0.5 × 8 = 1.0 kWh/day ≈ $0.13/day
- RTX 4090: 450 W × 0.5 × 8 = 1.8 kWh/day ≈ $0.23/day
Over a year, the B580 saves ~$49 vs a 4090 in electricity (1.04 kWh/day × $0.13 × 365). That is minor compared to the $1,150 upfront price gap, but it adds up for 24/7 inference rigs.
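The same estimate can be recomputed for your own tariff and usage pattern. This sketch uses the article's assumptions (board power as the full-load figure, 50 % utilization, 8 hours per day, $0.13/kWh) as defaults; the function name is illustrative.

```python
def daily_cost_usd(board_power_w, utilization=0.5, hours=8, usd_per_kwh=0.13):
    """Electricity cost per day for a GPU at the given average utilization."""
    kwh_per_day = board_power_w * utilization * hours / 1000
    return kwh_per_day * usd_per_kwh


cards = {"B580": 190, "RTX 5070": 250, "RTX 4090": 450}
for name, watts in cards.items():
    print(f"{name}: ${daily_cost_usd(watts):.2f}/day")

yearly_saving = (daily_cost_usd(450) - daily_cost_usd(190)) * 365
print(f"B580 vs RTX 4090: ${yearly_saving:.0f}/year")
```

For a 24/7 rig, pass `hours=24` and a utilization that reflects how often the card is actually generating.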
Tuning recipe by use case
Coding companion (short prompts, prioritise first-token latency):
~19 tok/s steady, first-token in ~2 s. Drop -ngl to 24 if you're sharing the GPU with a display.
Document Q&A (longer context, comprehensive answers):
~11 tok/s at 8K context (-ngl 20, KV q8), dropping to ~7 tok/s at 16K (-ngl 16). The extra CPU layers eat tok/s, but the longer context is worth it for long-source workloads.
Translation (clean output, moderate temperature):
~14 tok/s. Translation benefits from slightly warmer sampling than coding; cold temperature produces literal-but-stilted output.
Benchmark methodology
For SYCL we pin --device SYCL0 to the discrete B580. For Vulkan the runtime picks the dGPU automatically. Public benchmarks measured both backends on the same prompts. CPU governor was performance, model preloaded via vmtouch, no other foreground processes.
For the offloaded-layer measurements, the CPU side runs at roughly fixed throughput once thermal steady-state is reached (~120 s after launch). The first 50 iterations were discarded as warmup and the next 30 were used for the median.
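The warmup-then-median procedure described above can be expressed in a few lines. This is a sketch of the statistic only, with made-up sample values; it does not reproduce the actual benchmark harness, and the function name is illustrative.

```python
import statistics


def steady_state_median(samples, warmup=50, measured=30):
    """Drop the first `warmup` samples, then take the median of the next `measured`."""
    window = samples[warmup:warmup + measured]
    if len(window) < measured:
        raise ValueError(
            f"need at least {warmup + measured} samples, got {len(samples)}"
        )
    return statistics.median(window)


# Synthetic tok/s readings: a slower warmup phase followed by steady state.
readings = [10.0] * 50 + [16.0] * 30
print(steady_state_median(readings))
```

Using the median rather than the mean keeps a single stalled iteration from skewing the reported number.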
Second worked example — multi-turn assistant
A realistic three-turn chat session on the B580 with Qwen 3 32B:
- Turn 1: 200-token user prompt, 400-token answer
  - Prefill 0.7 s, generate 25 s, 26 s total
- Turn 2: 300 more user tokens (history is 900 tokens), 500-token answer
  - Prefill 1.1 s, generate 31 s, 32 s total
- Turn 3: another 200 user tokens (history is 1,600 tokens), 600-token answer
  - Prefill 2.0 s, generate 38 s, 40 s total
The trend is clear: each turn takes a bit longer as the answers and the KV cache grow. The history already stands at 2,200 tokens after turn 3, so at this pace a 4K context fills within about five or six turns; moving to a larger context forces additional offload, and tok/s drops toward single digits. For long-running conversations, periodically summarise older turns and reset the context; most chat UIs handle this automatically.
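To see how quickly a session eats into the context window, the token counts from the three turns can be accumulated in a few lines. A minimal sketch: the 4096-token window is an assumption (the example does not state the configured context), and the function name is illustrative.

```python
# (user tokens, answer tokens) per turn, copied from the example above.
TURNS = [(200, 400), (300, 500), (200, 600)]


def history_sizes(turns):
    """Return (tokens before the answer, tokens after the answer) for each turn."""
    sizes = []
    total = 0
    for user_tokens, answer_tokens in turns:
        before = total + user_tokens
        total = before + answer_tokens
        sizes.append((before, total))
    return sizes


CTX = 4096  # assumed context window; use whatever your session is configured with
for turn, (before, after) in enumerate(history_sizes(TURNS), start=1):
    print(f"turn {turn}: {before} tokens before answer, {after} after, {CTX - after} left")
```

Appending further `(user, answer)` pairs to `TURNS` shows at which turn the remaining budget goes negative, which is the point to summarise and reset.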
See also
- Best GPU for an AI rig in 2026
- How to run Qwen 3 14B on Arc B580 — comfortable-fit alternative
- How to run Qwen 3 32B on RTX 5070 — closest NVIDIA comparison
- How to run DeepSeek-R1 32B on Arc B580 — reasoning model on the same card
- How to run Llama 3.1 8B on Arc B580
As of May 2026 — Intel ships a quarterly compute-runtime cadence; the B770 16 GB is rumoured for Q3 and would change the recommendation for 32B models on Arc.
