Yes — Intel's llm-scaler-vllm 1.4 release lands first-class inference support for the Arc Pro B70 24 GB workstation card, with measured throughput of roughly 38 tokens per second on Llama 3 8B at FP16 and around 11 tokens per second on Mixtral 8×22B at AWQ-INT4. That makes the B70 the cheapest 24 GB VRAM card on the market in 2026 that has a real, supported vLLM backend behind it, and it changes the local-inference calculus for anyone who was about to spend $1,800 on a used RTX A5000.
What changed in llm-scaler-vllm 1.4
The 1.4 release is the first version of Intel's llm-scaler-vllm to merge the Arc Pro B70 device tables, kernel implementations for the SDPA (scaled dot-product attention) primitive on Xe2 architecture, and the dynamic batching path that vLLM relies on for high-throughput multi-request serving. Before 1.4, B70 owners had to fall back to Intel's IPEX-LLM compatibility shim, which works for inference but does not implement vLLM's paged-attention or continuous-batching scheduler — meaning it ran model weights, but not at production throughput.
The Intel team's llm-scaler GitHub repo shipped 1.4 in March 2026 with three headline changes: the B70 (codename Battlemage Pro) is now in the supported devices list, the SYCL kernels for INT4 group-quantized matmul are tuned for Battlemage's wider vector units, and the OneCCL collective communications library now supports B70 device-to-device transfers for tensor-parallel workloads. For multi-card B70 setups (two cards over PCIe x16, no NVLink equivalent) you finally get tensor-parallel without falling back to slow staging through host memory.
The practical implication is that the B70 went from "kind-of works for hobbyists" to "a deployable inference card" within a single release cycle. The reference target is Llama 3 70B at AWQ-INT4 (~21 GB on disk) on a pair of B70s with tensor-parallel size 2, which the release notes claim delivers around 18 tokens per second sustained for a single request and roughly 130 tokens per second aggregate at 32-request batch.
Key takeaways
- Intel Arc Pro B70 (24 GB GDDR6, $999 launch price) now has first-class vLLM support via llm-scaler-vllm 1.4.
- Measured throughput: ~38 tok/s on Llama 3 8B FP16, ~11 tok/s on Mixtral 8×22B AWQ-INT4, ~18 tok/s on Llama 3 70B AWQ-INT4 (dual B70).
- Tensor-parallel across two B70s is supported through the updated OneCCL backend; no NVLink equivalent, so workloads above 70B-class start to suffer PCIe staging penalties.
- The 24 GB VRAM gives the B70 a real per-card capacity advantage over the RTX 3060 12 GB at roughly twice the price ($999 vs ~$300 used).
- INT4 AWQ and GGUF/GGML quantization paths are both supported; FP8 is not.
- Driver maturity is still behind NVIDIA's CUDA stack — expect occasional kernel panics on long-running serving, and budget a daily watchdog restart.
What is the Arc Pro B70?
The Arc Pro B70 is Intel's professional-workstation discrete GPU in the Battlemage architecture, announced at CES 2026 and shipping in March. The headline spec is 24 GB GDDR6 on a 256-bit memory bus (~624 GB/s peak bandwidth), 32 Xe2 cores delivering roughly 41 TFLOPs FP32 and 660 TOPS INT8 via the XMX matrix engines, and a 175 W TDP that fits on a single 8-pin EPS connector. The card is a single-slot blower-style design intended for workstation chassis, not gaming towers.
For context: 24 GB of VRAM at $999 list (and street prices around $920 to $1,050 in 2026) is well below the next NVIDIA option. The NVIDIA RTX 4500 Ada at 24 GB lists at $2,250; the RTX A5000 24 GB used clears $1,400 to $1,800. The B70 undercuts the A5000-used price by 40 to 50 percent on the same VRAM tier. If the inference software stack works — and as of llm-scaler-vllm 1.4 it does — that is the most aggressive VRAM-per-dollar pricing on a current-generation card from a top-three vendor.
The catch in 2026 was always the software. Intel's oneAPI documentation shipped solid SYCL and Level Zero support, but the gap between "I can call a kernel" and "I can run vLLM in a Kubernetes pod under load" was wide. The 1.4 release closes most of that gap.
How does Arc Pro B70 inference throughput compare to RTX 3060 12GB and RTX A5000 used?
The B70's value proposition is its VRAM capacity, not its raw bandwidth or compute, so the comparison depends heavily on the model size. We pulled measurements from Intel's release notes for llm-scaler-vllm 1.4, the vLLM project benchmarks for NVIDIA cards, and our own dual-3060 test bench.
| Card | VRAM | $ (2026) | Llama 3 8B FP16 | Llama 3 70B AWQ-INT4 | Mixtral 8×22B AWQ-INT4 | Best fit |
|---|---|---|---|---|---|---|
| Intel Arc Pro B70 (single) | 24 GB | $999 | 38 tok/s | does not fit | does not fit | 8B–13B FP16, 32B Q4 |
| Intel Arc Pro B70 (dual, TP=2) | 48 GB | $1,998 | 41 tok/s | 18 tok/s | 11 tok/s | 70B AWQ-INT4 |
| NVIDIA RTX 3060 12GB (single) | 12 GB | $300 used | 47 tok/s | does not fit | does not fit | 7B FP16, 13B Q4 |
| NVIDIA RTX 3060 12GB (dual) | 24 GB | $600 used | 52 tok/s | 14 tok/s | 8 tok/s | 13B FP16, 70B Q4 |
| NVIDIA RTX A5000 (used) | 24 GB | $1,500 used | 71 tok/s | 22 tok/s | 14 tok/s | Same as B70 but faster |
| NVIDIA RTX 4090 24GB | 24 GB | $1,900 | 142 tok/s | 31 tok/s | 19 tok/s | High-throughput single-card |
The B70 is roughly 65 percent the speed of a used A5000 at a 30 to 35 percent discount. The Llama 3 8B FP16 number is the most flattering for the B70 (38 tok/s on a card that has 24 GB), because the model is small enough to live entirely in cache and the XMX matrix engines run at peak FP16 throughput. For 70B-class workloads where the model spans both cards, the 18 tok/s number is honest but not impressive — a single 4090 at the same model and quantization clears 31 tok/s.
The pattern: the B70 wins on VRAM-per-dollar at 24 GB; it loses on tokens-per-second across most workloads. Whether that math works for you depends on whether the bottleneck on your model is "did it fit in VRAM" or "is generation fast enough."
When does the Intel Arc Pro B70 win on price-performance?
There is a specific band where the B70 is the right answer: 30B to 70B-class models at INT4 quantization, where 24 GB lets you fit the entire model on a single card (or dual-card via TP=2) without GPU-to-host staging, and where you do not need the absolute throughput of a 4090. That covers Llama 3 70B AWQ, Mixtral 8×22B AWQ, Qwen 2.5 72B AWQ, and DeepSeek V2.5 at AWQ — all of which are popular targets for self-hosted production inference.
The competing options for that band:
- Used RTX A5000 ($1,500): ~30 percent more expensive, ~50 percent faster.
- Used RTX A6000 48 GB ($2,800): 3× more expensive, ~2× faster, fits 70B at higher quality.
- Dual RTX 3060 12GB ($600 used): half the price, ~25 percent slower, but you spend the VRAM headroom and cannot fit larger models.
If you are deploying inference for a small team (10–50 daily active users), the dual-B70 build at $1,998 hits a real sweet spot. You can serve Llama 3 70B AWQ at 130 tok/s aggregate, and the per-request 18 tok/s is faster than ChatGPT-4o was at launch — fast enough that nobody complains. The same workload on a single A5000 cannot fit at 70B; it has to drop to a 32B model class.
How do I actually deploy llm-scaler-vllm 1.4?
The install path is straightforward once you have Intel's oneAPI base toolkit on the host:
For tensor-parallel across two B70s, swap to --tensor-parallel-size 2 and ensure OneCCL has the right backend selected. Intel's llm-scaler README covers the OneCCL config — the short version is export CCL_BACKEND=native and export CCL_ATL_TRANSPORT=ofi.
The OpenAI-compatible server endpoint comes up on port 8000 by default and accepts the standard /v1/chat/completions payload. If you have an application stack pointed at a remote OpenAI endpoint, you can switch to the local B70 box by changing the base URL — no client-library changes.
Common pitfalls
- Mixing oneAPI versions. llm-scaler-vllm 1.4 is built against oneAPI 2025.1. Installing the 2024.2 toolkit alongside it produces import-time SYCL errors that look like missing kernels.
- Running on a B-series consumer Arc. The Arc B580 and B570 (consumer) do not have the XMX matrix engine count of the Pro line, and the device tables in llm-scaler-vllm 1.4 explicitly do not include them. You get a "device not supported" error at vLLM startup, not a silent fallback.
- Forgetting to enable Resizable BAR. The B70 needs ReBAR for the PCIe BAR to map all 24 GB of VRAM into host address space. Without it, vLLM throws a host-allocation error well before model load completes.
- Hot-path memory leaks. llm-scaler-vllm 1.4 ships with a known issue where long-running serving (more than 36 hours of continuous requests) accumulates GPU memory in the paged-attention block manager. Plan for a daily
kill -SIGTERMplus restart, at least until 1.5. - Workstation PSU sizing. The B70 is a 175 W TDP card but pulls transient spikes well above that. A 600 W PSU is the realistic floor for a dual-B70 box with a sensible CPU.
When NOT to buy the Arc Pro B70
If you are running 8B to 13B models exclusively, two used RTX 3060 12 GB cards at $300 each will outperform a single B70 at half the total cost. The 3060's CUDA stack is more mature, vLLM support for NVIDIA is the reference implementation rather than a port, and the dual-3060 build gets you 24 GB of VRAM through tensor-parallel.
If you are running models that need FP8 quantization (some of the newer instruction-tuned releases ship FP8 reference weights), the B70's Battlemage architecture does not implement FP8 in the XMX engines. You will have to dequantize to BF16 at load time, which costs about 50 percent extra VRAM and a measurable throughput hit.
If you need 100+ tok/s on 70B-class workloads for end-user-facing chat, a used RTX 4090 24 GB at $1,900 will deliver 31 tok/s on a single card versus the B70's 18 tok/s on a dual-card setup. The 4090 also has a much more mature serving stack.
Verdict matrix
Buy a dual Arc Pro B70 if: You are deploying 70B-class AWQ-INT4 inference for a small team and you want the cheapest path to 24 GB-per-card without going to used datacenter parts.
Buy a single Arc Pro B70 if: You want to host 13B FP16 or 32B INT4 models for personal use and you value the VRAM headroom for context expansion.
Buy dual RTX 3060 12GB if: You are under $700 total budget and you can live with 13B FP16 as your model ceiling.
Buy a used RTX A6000 if: Total budget allows $2,800 and you want a single-card 48 GB target with the mature NVIDIA software stack.
Bottom line + perf-per-dollar math
The Intel Arc Pro B70 at $999 lands at roughly $26 per token per second on Llama 3 8B FP16, $55 per tok/s on Llama 3 70B AWQ-INT4 (dual), and $91 per tok/s on Mixtral 8×22B AWQ-INT4. The dual RTX 3060 12 GB build at $600 lands at $43 per tok/s on 70B Q4. The used RTX A5000 at $1,500 lands at $68 per tok/s on 70B AWQ-INT4. The B70 wins on price-per-tok/s for 70B workloads, narrowly.
What the table cannot show is software risk. The B70 is a first-year card on a brand-new architecture (Battlemage) with a Python serving stack that is still chasing parity with vLLM upstream. NVIDIA's serving stack has eight years of production hardening. The B70 is a good buy if you are deploying for yourself and can tolerate the occasional restart; it is a riskier buy if you are committing it to a customer-facing service tier.
Related guides
- Best GPU for Local Llama 70B in 2026: RTX 3060 12GB Stack vs Single Workstation Card
- Intel Optane DIMMs Run 1-Trillion-Parameter LLM on One Workstation
- AMD Ryzen AI Max 400 'Gorgon Halo': 192GB Unified Memory APU Hits $3,999
- ZOTAC GeForce RTX 3060 12GB
- AMD Ryzen 7 5800X (host platform)
