This tutorial walks you through running Llama 3.1 8B on an Intel Arc B580: which runtime to install, what tokens-per-second to expect, and the tradeoffs you should know before starting.
Does it fit?
Arc B580 has 12 GB of GDDR6. Llama 3.1 8B at q4_K_M is ~4.8 GB of weights alone.
Verdict: ✅ Fits natively. Community reports suggest roughly 50-80 tok/s sustained generation throughput after warm-up (a ballpark, not a measurement on this exact card); first-token latency depends on prompt length.
Install Ollama (the easy path)
Ollama auto-detects NVIDIA (CUDA), AMD (ROCm on Linux), and Apple Silicon (Metal). Intel Arc support in mainline Ollama is limited — for Arc B580 use the IPEX-LLM Ollama fork or a SYCL-built llama.cpp (see below).
Install llama.cpp (more control)
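llama.cpp gives you flag-level control over quantization, context length, and layer offload. Build from source with the SYCL backend enabled (-DGGML_SYCL=ON) so that inference runs on the Arc GPU rather than the CPU.
At run time, -ngl 999 offloads all layers to the GPU.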
Expected performance
- Community reports from LocalLLaMA suggest ~50-80 tok/s on this class of hardware.
For single-user chat these speeds feel instant. For RAG pipelines where the model re-reads long context, prefill throughput matters more than generation tok/s.
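To see why prefill dominates in RAG-style workloads, here is a minimal sketch that splits response time into prompt processing and generation. The prefill rate of 1000 tok/s is a hypothetical placeholder, not a measured Arc B580 figure; the generation rate of 60 tok/s is simply a value inside the community-reported range. Substitute your own measurements.

```python
def response_time_s(prompt_tokens, output_tokens, prefill_tok_s, gen_tok_s):
    """Return (time to first token, total time) in seconds.

    Prefill processes the whole prompt before the first token appears;
    generation then emits output tokens one at a time.
    """
    time_to_first_token = prompt_tokens / prefill_tok_s
    total = time_to_first_token + output_tokens / gen_tok_s
    return time_to_first_token, total


# Hypothetical rates for illustration only; measure your own.
PREFILL_TOK_S = 1000.0   # placeholder, NOT a benchmark result
GEN_TOK_S = 60.0         # inside the 50-80 tok/s community range

for label, prompt_tokens in (("short chat turn", 200), ("RAG with long context", 8000)):
    ttft, total = response_time_s(prompt_tokens, 300, PREFILL_TOK_S, GEN_TOK_S)
    print(f"{label}: first token after {ttft:.1f} s, done after {total:.1f} s")
```

With these placeholder rates the generation time is identical in both cases; only the prefill term grows with the prompt, which is why long-context pipelines are sensitive to prefill throughput.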
Common issues
"Out of memory" on the first prompt: reduce context length (-c 2048) or switch to a smaller quant (q4_K_S instead of q4_K_M).
Slow first token but fast generation: that's prompt processing ("prefill"). Normal — blame the KV cache building. Subsequent messages in the same session will be snappy.
Frequent swapping / system hangs: VRAM is full AND system RAM is full. Close Chrome. Add more DDR5.
Related
Arc B580 specs: 12GB memory, 190W TDP, 2024 launch. MSRP $249.
Does it fit? Full quantization matrix
Weight-only VRAM for Llama 3.1 8B at every common quant, plus the KV-cache overhead for a 4K-token context. KV cache scales linearly with context — see the context-length table further down.
| Quant | Weights | + KV @ 4K ctx | Total | Fits on this GPU? | Quality loss |
|---|---|---|---|---|---|
| q2_K_S | 2.4 GB | 0.6 GB | 3.0 GB | ✅ | Severe (15-25%) |
| q3_K_M | 3.6 GB | 0.6 GB | 4.2 GB | ✅ | Noticeable (5-8%) |
| q4_K_M | 4.8 GB | 0.6 GB | 5.4 GB | ✅ | Minimal (1-3%) — community default |
| q5_K_M | 5.6 GB | 0.6 GB | 6.2 GB | ✅ | <1% |
| q6_K | 6.4 GB | 0.6 GB | 7.0 GB | ✅ | Effectively lossless |
| q8_0 | 8.8 GB | 0.6 GB | 9.4 GB | ✅ | Inference-lossless |
| fp16 | 16.0 GB | 0.6 GB | 16.6 GB | ❌ | Baseline (original precision) |
Values are approximate — actual footprint depends on batch size, whether the KV cache is quantized (-ctk q8_0 -ctv q8_0 in llama.cpp halves it), and whether you reserve VRAM for a display. Rule of thumb: budget 5-10% headroom on top of the table.
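The fit check behind the table can be sketched in a few lines. This is an illustration using the table's approximate figures and the upper end of the headroom rule of thumb (10%); the function and constant names are made up for this example.

```python
# Approximate figures copied from the table above.
VRAM_GB = 12.0      # Arc B580 capacity
KV_4K_GB = 0.6      # KV cache at a 4K-token context
HEADROOM = 0.10     # upper end of the 5-10% rule of thumb

WEIGHTS_GB = {
    "q2_K_S": 2.4, "q3_K_M": 3.6, "q4_K_M": 4.8, "q5_K_M": 5.6,
    "q6_K": 6.4, "q8_0": 8.8, "fp16": 16.0,
}


def budget_gb(weights_gb, kv_gb=KV_4K_GB, headroom=HEADROOM):
    """Weights plus KV cache, padded by the headroom fraction."""
    return (weights_gb + kv_gb) * (1.0 + headroom)


def fits(quant, vram_gb=VRAM_GB):
    """True if the padded budget for this quant stays within VRAM."""
    return budget_gb(WEIGHTS_GB[quant]) <= vram_gb


for quant, weights in WEIGHTS_GB.items():
    print(f"{quant:8s} budget {budget_gb(weights):5.2f} GB  fits={fits(quant)}")
```

Even with 10% headroom, every quant up to q8_0 stays under 12 GB at 4K context; only fp16 does not.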
How public benchmarks are sourced and compared
Every tok/s, FPS, and synthetic score in this article is pulled live from the SpecPicks benchmark catalog (hardware_specs, ai_benchmarks, synthetic_benchmarks). We cite the source_name on each row — the vast majority are community-reported numbers from r/LocalLLaMA and llama.cpp GitHub Discussions, with synthetic scores from PassMark, Phoronix, and Tom's Hardware's GPU hierarchy.
Where DB rows exist for a specific model+quant+GPU combination, we quote the number exactly. Where they don't, we fall back to published spec-sheet values (VRAM capacity, TDP, memory bandwidth) plus the closest community-verified ballpark — clearly flagged as a ballpark, not a measurement. We prefer "we don't know" over a fabricated number.
SpecPicks does not run paid hardware review cycles; we aggregate. If you see a number you can improve on, pull-request the row.
Measured tok/s on this GPU
Live data from ai_benchmarks for Arc B580, filtered to the Llama 3.1 8B family where available:
| Model | Quant | Runtime | Gen tok/s | VRAM used | Source |
|---|---|---|---|---|---|
| _No direct matches in the DB yet; see community thread below_ | | | | | |
For the full tok/s matrix on this card across every model we've logged, see the Arc B580 benchmark page.
Context length and VRAM — the hidden cost
KV cache grows linearly with context. Here's the approximate overhead on top of 4.8 GB of q4_K_M weights for Llama 3.1 8B:
| Context | KV cache | Total VRAM |
|---|---|---|
| 2K tokens | ~0.3 GB | ~5.1 GB |
| 4K tokens | ~0.6 GB | ~5.4 GB |
| 8K tokens | ~1.3 GB | ~6.1 GB |
| 32K tokens | ~5.1 GB | ~9.9 GB |
| 128K tokens | ~20.5 GB | ~25.3 GB |
For long-context workloads (32K tokens and above) on consumer hardware, use llama.cpp's KV-cache quantization — -ctk q8_0 -ctv q8_0 roughly halves cache footprint with sub-1% quality loss. This is the single biggest VRAM-saving flag for long context.
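The linear scaling in the table can be reproduced with a short sketch. The per-1K-token constant (0.16 GB) is inferred from the table's own rows rather than taken from a measurement, and the halving for q8_0 KV-cache quantization is the rough figure quoted above; treat both as assumptions.

```python
WEIGHTS_GB = 4.8        # q4_K_M weights for Llama 3.1 8B
KV_GB_PER_1K = 0.16     # inferred from the table (~0.6 GB at 4K, ~5.1 GB at 32K)
VRAM_GB = 12.0          # Arc B580 capacity


def kv_cache_gb(context_k, quantized=False):
    """Approximate KV cache size for a context of `context_k` thousand tokens."""
    size = KV_GB_PER_1K * context_k
    # -ctk q8_0 -ctv q8_0 roughly halves the cache footprint.
    return size / 2 if quantized else size


def total_vram_gb(context_k, quantized=False):
    """Weights plus KV cache, without any extra headroom."""
    return WEIGHTS_GB + kv_cache_gb(context_k, quantized)


for ctx in (2, 4, 8, 32, 128):
    plain = total_vram_gb(ctx)
    quant = total_vram_gb(ctx, quantized=True)
    print(f"{ctx:>3}K  default KV: {plain:5.1f} GB (fits: {plain <= VRAM_GB})   "
          f"q8_0 KV: {quant:5.1f} GB (fits: {quant <= VRAM_GB})")
```

By this estimate, quantizing the cache buys comfortable room at 32K on a 12 GB card, while 128K stays out of reach either way.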
Which runtime wins on this hardware?
Three mainstream runtimes target Arc B580; the right one depends on your workload:
- Ollama: easiest. Handles model downloads and exposes an OpenAI-compatible API out of the box. Wraps llama.cpp; you give up fine-grained control for minimal setup. Mainline Ollama's Intel Arc support is limited, so on Arc B580 use the IPEX-LLM Ollama fork.
- llama.cpp: direct flag-level control over quant, context, KV-cache precision, batch size, and layer splitting across GPUs. This is where the LocalLLaMA community benchmarks its numbers (see the Apple Silicon megathread #4167 for reference tok/s across M-series chips). On Arc, build it with the SYCL backend.
- vLLM: built for production serving. Tensor parallelism, PagedAttention, continuous batching. Support on this platform is limited, so Ollama and llama.cpp are safer bets. If you're not serving multiple concurrent users, the overhead isn't worth it.
For head-to-head numbers and install commands across all three, see our Ollama vs llama.cpp vs vLLM guide.
Troubleshooting — three failure modes and fixes
1. First token takes 5-30 seconds, then generation is fast. That's normal prefill: the model is processing your prompt before it can start generating. On a long prompt (4K+ tokens) prefill dominates the first-token latency. If it's unexpectedly slow, check that you actually offloaded layers to the GPU — on Linux run intel_gpu_top (or xpu-smi dump) and confirm near-100% utilisation during prefill; on Windows, Task Manager's GPU tab works. If utilisation is flat, your inference is running on CPU.
2. "Out of memory" halfway through a long chat. The KV cache grew past what the card can hold. Drop to a smaller quant (q4_K_M → q3_K_M), cut -c context length, or enable KV-cache quantization (-ctk q8_0 -ctv q8_0 in llama.cpp). On Ollama set num_ctx smaller in your Modelfile.
3. Tok/s is ~30% of what LocalLLaMA threads report. Three usual suspects: (a) power or thermal throttling (check sustained clocks during a long prompt); (b) a degraded PCIe link (the B580 is natively a PCIe 4.0 x8 card, so x8 is normal but x4 or a lower generation is not; on Linux, check the negotiated width and speed under LnkSta in sudo lspci -vv); (c) a CPU-only llama.cpp build that never touches the Arc GPU. For (c), rebuild with -DGGML_SYCL=ON and confirm the SYCL backend is selected at startup.
Frequently asked questions
Can I run Llama 3.1 8B on Arc B580 without offloading to CPU?
Yes. At q4_K_M the weights (approximately 4.8 GB) plus the KV cache (roughly 0.3 GB at 2K context, 1.3 GB at 8K, 5.1 GB at 32K) stay inside the card's 12 GB of GDDR6, so every layer can live on the GPU.
What quantization should I use on Arc B580?
q4_K_M is the community default: 1-3% quality loss vs fp16 at roughly 30% of the memory (4.8 GB vs 16 GB). Drop to q3_K_M only when VRAM is tight. Go to q6_K or q8_0 when you have headroom and want to eliminate quant damage as a variable.
Is Arc B580 bottlenecked by memory or compute for this model?
Dense-weight inference is memory-bandwidth-bound on almost every consumer card. Arc B580's memory bandwidth is ~456 GB/s, so the sustained tok/s ceiling ≈ memory bandwidth ÷ weight bytes read per token. The compute units are rarely the limit for single-user inference; they matter more for batched serving.
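The ceiling formula above can be worked through directly. This is a back-of-the-envelope sketch that assumes every generated token reads the full weight file once and ignores KV-cache reads and other overhead, so real throughput lands below it.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, weights_gb):
    """Upper bound on generation speed if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


ARC_B580_BANDWIDTH_GB_S = 456.0   # approximate memory bandwidth quoted above

# Weight sizes in GB, taken from the quantization matrix.
for quant, weights_gb in {"q4_K_M": 4.8, "q8_0": 8.8}.items():
    ceiling = decode_ceiling_tok_s(ARC_B580_BANDWIDTH_GB_S, weights_gb)
    print(f"{quant}: ~{ceiling:.0f} tok/s theoretical ceiling")
```

For q4_K_M this works out to about 95 tok/s, which is consistent with community reports in the 50-80 tok/s range sitting somewhat below the theoretical limit.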
Does multi-GPU help for this model?
For an 8B model, usually no. If the model already fits in one card, a second card mainly helps batch throughput (vLLM), not single-user latency. Tensor parallelism adds inter-GPU traffic that often nets negative for interactive chat. Multi-GPU pays off on 70B+ models where you need to stack VRAM across cards.
Where can I report or compare my own tok/s numbers?
The r/LocalLLaMA community benchmark threads are the canonical place. llama.cpp also maintains a GitHub Discussions thread for Apple Silicon and per-platform performance. SpecPicks imports numbers from both into ai_benchmarks; if you want a figure added, pull-request the row.
Sources
- r/LocalLLaMA (community tok/s threads)
- llama.cpp GitHub Discussions #4167 — Apple Silicon benchmark thread
- Tom's Hardware GPU Hierarchy
