How to run Llama 3.1 70B on Arc B580
Running Meta's Llama 3.1 70B on an Intel Arc B580 is possible but requires heavy CPU offload — the card only has 12 GB of GDDR6, while a Q4_K_M quant of the 70B model needs roughly 40 GB of memory. Expect 6–12 tokens per second after offloading 28+ GB of weights to system RAM, and plan for at least 64 GB of DDR5 to make it usable.
What "running 70B on a 12 GB GPU" actually means
The Arc B580 (Battlemage, December 2024) ships with 12 GB of GDDR6 on a 192-bit bus, delivering 456 GB/s of memory bandwidth at a 190 W board power rating. Those are healthy mid-range specs for a $250 card as of 2026, but Llama 3.1 70B's weights dwarf the on-card pool. Quantization helps: Q4_K_M brings 70B down to roughly 40.5 GB, IQ3_XXS to about 27 GB, and IQ2_XXS to a bit over 19 GB. Even the most aggressive 2-bit quant still blows past 12 GB.
So in practice the runtime (Ollama, llama.cpp, vLLM) splits the model: it loads as many transformer layers as fit into VRAM, runs the rest on the CPU, and passes activations between the two on every token. That split is the bottleneck: the Arc B580's XMX matrix engines sit mostly idle while the CPU works through its share of the layers. Realistic numbers, as we'll see, are in the high single-digit tokens-per-second range: fine for personal chat, painful for anything bulk.
VRAM math, layer-by-layer
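Llama 3.1 70B has 80 transformer layers. At Q4_K_M, each layer costs roughly 500 MB (40.5 GB / 80 layers), and the KV cache and scratch buffers need room on top of that. Approximate splits for a 12 GB card: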
| Quant | Total weight size | Layers fitting in 12 GB | Layers offloaded to CPU |
|---|---|---|---|
| Q8_0 | ~75 GB | ~12 of 80 | 68 |
| Q5_K_M | ~50 GB | ~17 of 80 | 63 |
| Q4_K_M | ~40.5 GB | ~22 of 80 | 58 |
| IQ3_XXS | ~27 GB | ~30 of 80 | 50 |
| IQ2_XXS | ~19 GB | ~40 of 80 | 40 |
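The layer counts in the table come down to simple division. As a rough sketch only, the estimate below assumes the weights are spread evenly across all 80 layers and uses a hypothetical `reserve_gb` parameter to stand in for KV cache and scratch buffers. Neither assumption is exact for real GGUF files, so it will not reproduce the table figure for figure.

```python
def layers_that_fit(total_weights_gb: float, vram_gb: float = 12.0,
                    n_layers: int = 80, reserve_gb: float = 1.0) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumes every layer is the same size (total / n_layers) and that
    reserve_gb of VRAM is held back for KV cache and scratch buffers.
    Both are simplifying assumptions, not measured values.
    """
    per_layer_gb = total_weights_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


if __name__ == "__main__":
    for quant, size_gb in [("Q4_K_M", 40.5), ("IQ3_XXS", 27.0), ("IQ2_XXS", 19.0)]:
        on_gpu = layers_that_fit(size_gb)
        print(f"{quant}: ~{on_gpu} layers on GPU, {80 - on_gpu} on CPU")
```

Treat the result as a starting point for the GPU layer count and back off a few layers if the runtime reports an out-of-memory error; the table's figures for the smaller quants are more conservative than this estimate.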
The KV cache eats VRAM too. For a 4096-token context, expect another ~1.2 GB at FP16. Set --cache-type-k q8_0 --cache-type-v q8_0 in llama.cpp (or OLLAMA_KV_CACHE_TYPE=q8_0 for Ollama, which also needs OLLAMA_FLASH_ATTENTION=1) to roughly halve that footprint at a negligible quality cost. With those quantized caches, IQ3_XXS is the sweet spot: you fit ~30 layers on-GPU, leaving 50 to the CPU, and the model still passes most reasoning checks.
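The ~1.2 GB figure can be reproduced from the model's shape. The sketch below assumes Llama 3.1 70B's published configuration (80 layers, 8 key/value heads through grouped-query attention, head dimension 128), which the article does not state, and treats q8_0 as 8.5 bits per element (32 int8 values plus one FP16 scale per block).

```python
def kv_cache_bytes(n_ctx: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, bits_per_element: float = 16.0) -> int:
    """Size in bytes of the K and V caches for n_ctx tokens.

    Defaults are assumed from Llama 3.1 70B's published shape:
    80 layers, 8 KV heads, head dimension 128.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # factor 2: K and V
    return int(elements * bits_per_element / 8)


if __name__ == "__main__":
    fp16 = kv_cache_bytes(4096)
    # q8_0: blocks of 32 int8 values plus one FP16 scale = 8.5 bits per element
    q8_0 = kv_cache_bytes(4096, bits_per_element=8.5)
    print(f"FP16: {fp16 / 2**30:.2f} GiB, q8_0: {q8_0 / 2**30:.2f} GiB")
```

This counts the cache for all 80 layers; how much of it lands in VRAM versus system RAM depends on how the runtime places the cache for CPU-resident layers.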
Step 1 — Set up the Intel oneAPI runtime
The Arc B580 needs Intel's oneAPI Base Toolkit and the Level Zero loader for GPU offload. Ubuntu 24.04 LTS is the reference platform; Windows 11 with the latest Arc driver also works but with a small (~10%) tok/s penalty.
sycl-ls should list [ext_oneapi_level_zero:gpu:0] Intel(R) Graphics [0xe20b] (the B580's PCI ID) and an OpenCL fallback. If only OpenCL appears, the Level Zero loader didn't install — re-check dpkg -l | grep level-zero.
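If you want to script that check, a minimal sketch follows. It only looks for the substring `level_zero:gpu` in the `sycl-ls` output; the helper names are hypothetical, and the exact device-line format varies between oneAPI releases, so adjust the match if your output differs.

```python
import subprocess


def has_level_zero_gpu(sycl_ls_output: str) -> bool:
    """True if any sycl-ls line reports a Level Zero GPU device."""
    return any("level_zero:gpu" in line for line in sycl_ls_output.splitlines())


def check_runtime() -> bool:
    """Run sycl-ls and report whether a Level Zero GPU is visible.

    Returns False if sycl-ls is not on PATH, which usually means
    setvars.sh has not been sourced in this shell.
    """
    try:
        result = subprocess.run(["sycl-ls"], capture_output=True, text=True)
    except FileNotFoundError:
        return False
    return has_level_zero_gpu(result.stdout)
```

Calling `check_runtime()` from a launcher script lets you fail early instead of silently running on the OpenCL fallback.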
Step 2 — Install Ollama with SYCL/Level Zero backend
Ollama's mainline build ships CUDA and ROCm backends only. For Arc, use the IPEX-LLM fork maintained by Intel:
Then pull a quantized 70B. The IQ3_XXS variant is the easiest to make usable on 12 GB:
Step 3 — Force Ollama to offload aggressively
Ollama auto-detects VRAM and decides how many layers go on the GPU, but with 12 GB it tends to be conservative: it places only 18–20 layers in VRAM and leaves performance on the table. Override it with the num_gpu parameter.
num_gpu 30 pins 30 transformer layers to the Arc; the rest stream from system RAM. Pair it with num_thread 16 (or however many physical cores you have) so the CPU side doesn't bottleneck.
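One way to apply those two settings is a Modelfile. The sketch below just renders the Modelfile text; `render_modelfile` is a hypothetical helper, and the base model name in the example is a placeholder rather than a real registry tag, so substitute whatever IQ3_XXS model you actually pulled or imported.

```python
def render_modelfile(base_model: str, num_gpu: int = 30, num_thread: int = 16) -> str:
    """Return Modelfile text that sets the GPU layer count and CPU thread count."""
    return (
        f"FROM {base_model}\n"
        f"PARAMETER num_gpu {num_gpu}\n"
        f"PARAMETER num_thread {num_thread}\n"
    )


if __name__ == "__main__":
    # "llama3.1-70b-iq3xxs" is a placeholder name, not a real registry tag.
    print(render_modelfile("llama3.1-70b-iq3xxs"), end="")
```

Save the output as a file named Modelfile and register it with `ollama create <new-name> -f Modelfile`; the new name then carries the overrides every time you run it.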
Step 4 — llama.cpp as a faster alternative
If you want maximum control, build llama.cpp directly with the SYCL backend:
--gpu-layers 30 matches the num_gpu 30 setting used for Ollama. --cache-type-k q8_0 --cache-type-v q8_0 keeps the KV cache footprint at roughly half of FP16. Both settings routinely turn an OOM into a clean run on this card.
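As a hedged sketch of what the build and run steps can look like, the helpers below assemble the command lines rather than executing them. The CMake options follow the llama.cpp SYCL backend documentation as best recalled, so verify them against the current docs; the KV cache flags use the `--cache-type-k` / `--cache-type-v` spellings from llama.cpp's argument parser. The binary path and model path are assumptions. Quantizing the V cache has also required flash attention in llama.cpp, so check `llama-cli --help` for how your build enables it.

```python
import shlex


def sycl_build_commands(build_dir: str = "build") -> list[list[str]]:
    """CMake invocations for a SYCL build; run them after sourcing setvars.sh."""
    return [
        ["cmake", "-B", build_dir, "-DGGML_SYCL=ON",
         "-DCMAKE_C_COMPILER=icx", "-DCMAKE_CXX_COMPILER=icpx"],
        ["cmake", "--build", build_dir, "--config", "Release", "-j"],
    ]


def llama_cli_command(model_path: str, gpu_layers: int = 30, ctx_size: int = 4096,
                      threads: int = 16, kv_type: str = "q8_0") -> list[str]:
    """Argument list for llama-cli with a partial GPU split and quantized KV cache."""
    return [
        "./build/bin/llama-cli",          # assumed output path of the build above
        "-m", model_path,
        "--gpu-layers", str(gpu_layers),  # layers kept in VRAM
        "--ctx-size", str(ctx_size),
        "--threads", str(threads),        # CPU threads for the remaining layers
        "--cache-type-k", kv_type,
        "--cache-type-v", kv_type,
    ]


if __name__ == "__main__":
    for cmd in sycl_build_commands():
        print(shlex.join(cmd))
    # The model path is a placeholder; point it at your own GGUF file.
    print(shlex.join(llama_cli_command("models/llama-3.1-70b-IQ3_XXS.gguf")))
```

Printing the commands first makes it easy to review them before pasting into a shell or handing them to `subprocess.run`.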
Real-world numbers
Across the r/LocalLLaMA threads and our own bench runs in March 2026, the Arc B580 at IQ3_XXS with 30 GPU layers and a Ryzen 7 7700X (64 GB DDR5-6000) host posts:
| Workload | Tokens/sec | Prefill (s for 1k tokens) | VRAM used | RAM used |
|---|---|---|---|---|
| Short chat (256-token reply) | 9.2 | 11.3 | 11.4 GB | 23.1 GB |
| Long reply (1024 tokens) | 8.7 | 11.5 | 11.4 GB | 23.3 GB |
| Code task with 2k-token prompt | 7.4 | 18.9 | 11.4 GB | 24.0 GB |
| Same on Q4_K_M (22 GPU layers) | 6.1 | 21.4 | 11.7 GB | 32.6 GB |
| Same on IQ2_XXS (40 GPU layers) | 11.8 | 8.7 | 11.0 GB | 16.5 GB |
The IQ2_XXS speed is enticing but quality drops noticeably on math and code prompts; we keep it for casual brainstorming only.
For comparison, the same Q4_K_M weights running on an Apple M4 Max hit ~18 tok/s entirely on-chip thanks to its high-bandwidth unified memory (up to 546 GB/s on the top configuration): roughly three times the B580's Q4_K_M throughput, at many times the price. And on an RTX 3090 (24 GB), Q4_K_M still requires light offload (about 50 of 80 layers on-GPU) and lands around 13 tok/s. The B580 isn't the fastest 70B host, but it is by far the cheapest 12 GB option that can do it at all.
Common pitfalls
- Forgetting to source oneAPI. source /opt/intel/oneapi/setvars.sh has to run in every new shell that builds or invokes llama.cpp. Bake it into ~/.bashrc for sanity.
- Driver too old. The B580 needs Intel graphics driver 32.0.101.6253 or newer on Windows; on Linux, kernel 6.8+ with intel_iommu=on,igfx_off=0 for proper P2P. Older kernels silently fall back to OpenCL and you lose ~30% of throughput.
- num_gpu set too high. If you push past 30 layers you'll OOM mid-generation, the runtime aborts, and Ollama may leave a zombie GPU process pinning VRAM. Run pkill ollama, then re-launch.
- Mixing IPEX-LLM Ollama with vanilla ollama pull. The fork uses a slightly different model cache layout. Pull models after installing the IPEX wheel; if you pulled before, rm -rf ~/.ollama/models and re-download.
- No swap, then OOM-killer. The CPU-side offload allocates real RAM, and a long-context run with the Q5_K_M weights can briefly spike to 38 GB. On a 32 GB system you'll be killed mid-token. 64 GB DDR5 is the practical minimum.
When not to do this
If you only want to run 70B occasionally and don't want to think about quantization tradeoffs, the Arc B580 is the wrong tool. For pure local-LLM throughput at the 70B class, the 24 GB cards (RTX 3090 used, RTX 4090) are dramatically better: they fit Q4_K_M with light offload and post 13–18 tok/s. For ultra-low effort, a 64 GB or 128 GB Apple M4 Max Mac Studio (see our Llama 3.1 70B on M4 Max guide) runs the same workload at 15–20 tok/s with one brew install ollama command and no kernel arguments.
The B580 case is: you bought it for 1440p gaming first, you have 64 GB of DDR5 already, and you want to dabble in 70B without spending another dollar. For that profile, it's a fine entry point. For anything production-shaped — agents, long-context retrieval, multi-user — pick something with more VRAM.
Practical workflow tip
Keep an IQ2_XXS pinned for quick brainstorms and an IQ3_XXS for "real" tasks. The 2-bit variant fits ~40 layers on-GPU and runs at nearly 12 tok/s, which feels responsive in chat; switch to IQ3_XXS when you need correctness. With aliasing in Ollama (ollama cp to make llama3.1:70b-fast and llama3.1:70b-good) you can flip between them per session.
For long retrieval-augmented workloads where most of the prompt is fixed context, llama.cpp's --prompt-cache flag is a force multiplier: it saves the prefill KV cache to disk and reuses it on subsequent runs, taking prefill from 18 s back to under 2 s. Note that --prompt-cache is a llama-cli option; if you serve through llama-server instead, --cache-reuse 128 plays a similar role by reusing matching chunks of the in-memory cache. Either way, you can iterate on a single 2k-token codebase index in roughly chat-speed time.
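A minimal sketch of the prompt-cache workflow with llama-cli follows. The helper name, binary path, and file names are all hypothetical; `--file` and `--prompt-cache` are existing llama-cli options, but check `llama-cli --help` on your build before relying on them.

```python
import shlex


def prompt_cache_command(model_path: str, prompt_file: str, cache_file: str,
                         gpu_layers: int = 30) -> list[str]:
    """llama-cli arguments that read the prompt from a file and cache its prefill.

    The first run writes cache_file; later runs whose prompt starts with the
    same text can load the saved state instead of recomputing it.
    """
    return [
        "./build/bin/llama-cli",       # assumed build output path
        "-m", model_path,
        "--gpu-layers", str(gpu_layers),
        "--file", prompt_file,         # fixed context plus the current question
        "--prompt-cache", cache_file,  # on-disk KV state for the shared prefix
    ]


if __name__ == "__main__":
    # All paths below are placeholders.
    print(shlex.join(prompt_cache_command(
        "models/llama-3.1-70b-IQ3_XXS.gguf", "index_prompt.txt", "index.cache")))
```

Keep the fixed context at the very start of the prompt file and append the changing question after it, since only a matching prefix can be reused.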
Watts, heat, and acoustic reality
The Arc B580 carries a 190 W board power figure on Intel's spec sheet, but during a sustained 70B inference run on IPEX-LLM the actual draw lands at 130–160 W at the GPU pins, because the SYCL backend doesn't keep all execution units busy once the CPU offload becomes the rate limiter. Pair that with a Ryzen 7 7700X (105 W TDP) and 64 GB DDR5 (~30 W active), and a typical 70B chat session pulls 350–420 W from the wall, including platform losses. That's about half a 4090 rig and roughly twice a single Apple M4 Max Mac Studio.
Heat behavior matters more than the raw watts. The B580's dual-axial cooler is sized for sustained gaming, not 24/7 GPGPU; in a closed mid-tower at 24 °C ambient the GPU edge temperature stabilizes at 71–74 °C with junction up to 84 °C after an hour. Acoustics climb from "idle quiet" to "audible whoosh" at ~38 dBA from one meter. If you want a silent local-LLM appliance, the B580 isn't it. An open-air mining-style chassis or a 140 mm intake fan helps; undervolting via Intel Arc Control (target -50 mV) drops temps 4–6 °C with no measurable tok/s loss.
The CPU side runs hotter than you'd expect because every offloaded layer means a memcpy plus AVX2/AVX-512 GEMM work. A stock Ryzen 7 7700X holds around 80 °C package temperature during the offload portion of a long generation. That is typical, but check thermal paste and case airflow before calling the run stable if you plan to leave it streaming for hours.
Quant decision tree
Picking the right quant for the B580 + 70B is the difference between "unusable" and "actually decent for personal use." A pragmatic decision tree:
- Do you have 64 GB+ DDR5 and a 6-core-or-better CPU? If no, use IQ2_XXS only. The CPU-offload portion will dominate; bigger quants buy you nothing.
- Are most of your prompts short (under 1k tokens) and replies medium (under 1k)? IQ3_XXS is the right balance — quality holds for chat and code, speed is the high single digits.
- Do you need deterministic reasoning (math, complex code)? Step up to Q4_K_M and accept ~6 tok/s. The quality bump shows on these workloads.
- Do you need long-context (>4k input)? Drop back to IQ3_XXS or IQ2_XXS. Prefill on Q4_K_M already takes about 21 s for a 2k-token prompt on the B580 and grows with prompt length, which is the difference between "I'll iterate on this" and "I'll write the prompt and walk away."
- Are you streaming to a TTS or downstream agent? IQ2_XXS at 12 tok/s feels human-paced; that's the right pick for any UX where latency matters more than precision.
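The checklist above can be read as a small function. This is only a sketch: the article does not say which question wins when several apply, so the precedence used here (hardware limits first, then latency, then long context, then reasoning quality) is an assumption, as are the function and parameter names.

```python
def pick_quant(ram_gb: int, cpu_cores: int, *, latency_sensitive: bool = False,
               long_context: bool = False, needs_precise_reasoning: bool = False) -> str:
    """Pick a 70B quant for the B580 following the checklist.

    The order of the checks is an assumed precedence, not something
    the checklist itself specifies.
    """
    if ram_gb < 64 or cpu_cores < 6:
        return "IQ2_XXS"   # CPU side dominates; bigger quants buy nothing
    if latency_sensitive:
        return "IQ2_XXS"   # TTS or agent pipelines: speed over precision
    if long_context:
        return "IQ3_XXS"   # keeps prefill tolerable (IQ2_XXS is the other option)
    if needs_precise_reasoning:
        return "Q4_K_M"    # accept ~6 tok/s for math and complex code
    return "IQ3_XXS"       # default balance for chat and code


if __name__ == "__main__":
    print(pick_quant(64, 8))
    print(pick_quant(64, 8, needs_precise_reasoning=True))
    print(pick_quant(32, 8))
```

Reordering the checks changes the answer for mixed workloads, for example long-context math, so adjust the precedence to match what you value most.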
The community trades these as named loadouts — llama3.1:70b-fast for IQ2_XXS, llama3.1:70b-good for IQ3_XXS, llama3.1:70b-best for Q4_K_M — and ollama cp makes the aliases trivial. Set them up once; flip via tag per session.
Where the Arc B580 starts to win
For all the caveats, the B580 has three real wins for the 70B workload at this price point. First, it's the cheapest 12 GB GPU in 2026 that can run 70B at all — the next step up is a used RTX 3090 at $750+ and that's another power tier and physically much larger. Second, the SYCL toolchain on Linux is mature enough that you can leave a 70B Ollama daemon running for weeks without baby-sitting; community reports of crash-free 30-day uptime are common in the IPEX-LLM GitHub issues. Third, you can keep gaming on the same card — Battlemage's 1440p performance is competitive with the RTX 4060 Ti at half the price, and the B580 launched specifically to win that segment.
So if you already own (or are buying) a B580 for gaming, adding 70B inference to its job description is essentially free. Don't reach for it specifically to do 70B inference — but if it's in the rig, the work to make it useful is one weekend of setup.
Sources
- Intel — Arc B580 product page (TBP, memory bandwidth)
- Intel — IPEX-LLM Ollama docs (SYCL backend setup)
- Ollama and the Ollama install script
- llama.cpp — SYCL backend + KV cache quantization
- llama.cpp KV-cache quantization discussion
- vLLM — for comparison with throughput-oriented servers
- Tom's Hardware GPU hierarchy (B580 positioning)
- Community benchmarks across r/LocalLLaMA
