The short answer (as of May 2026): for Llama 3.1 70B, a single consumer GPU is not enough. You either go multi-GPU (2× RTX 5090 or 2× RTX 4090) for a full-quality Q4_K_M / Q5_K_M load, NVIDIA H100/H200/RTX 6000 Ada for the prosumer / workstation tier, or Apple Silicon with ≥96 GB unified memory if you want a one-box answer without three PCIe slots and 1500 W of cooling. A single RTX 5090 can run an aggressive IQ3_XXS quant of the 70B with 8K context, but the quality drop is real and you usually end up wishing you had run Qwen 3 32B at Q5 instead.
This article walks through the VRAM math, the realistic multi-GPU options, the workstation cards worth paying for, and the production-grade configurations people actually run in May 2026.
How much VRAM does Llama 3.1 70B need?
Meta's Llama 3.1 70B has 70.6 billion parameters. At BF16 the weights alone are 141 GB — far beyond any single GPU short of an H200 or AMD MI300X. The community-standard Q4_K_M quant is ~42 GB on disk and consumes ~44–46 GB of VRAM once loaded with normal runtime overhead.
| Quant | File size | Weight VRAM | + 8K KV cache (fp16) | + 32K KV cache (fp16) |
|---|---|---|---|---|
| Q8_0 | 74.9 GB | ~76.0 GB | ~77.3 GB | ~81.5 GB |
| Q6_K | 57.9 GB | ~59.0 GB | ~60.3 GB | ~64.5 GB |
| Q5_K_M | 49.9 GB | ~51.0 GB | ~52.3 GB | ~56.5 GB |
| Q4_K_M | 42.5 GB | ~43.7 GB | ~45.0 GB | ~49.2 GB |
| Q3_K_M | 34.3 GB | ~35.5 GB | ~36.8 GB | ~41.0 GB |
| IQ2_XS | 21.9 GB | ~23.1 GB | ~24.4 GB | ~28.6 GB |
| IQ3_XXS | 27.2 GB | ~28.5 GB | ~29.8 GB | ~34.0 GB |
KV cache math: Llama 3.1 70B uses 80 layers × 8 KV heads (GQA) × 128 head dim. With grouped-query attention the cache is only ~163 KB per token at fp16 — much lighter than a non-GQA model. So 8K tokens is ~1.3 GB, and 32K tokens is ~5.2 GB. Q8 KV cache halves both numbers and is generally a free quality trade.
The single-GPU constraint: even with IQ2_XS quantization (which costs you measurable benchmark quality), you need ~24 GB of VRAM, putting you at the bare edge of a 4090. For the Q4_K_M quant that is the actual quality bar for serious use, you need ≥48 GB of pooled VRAM.
The shortlist
1. 2× RTX 5090 — $4,000, 64 GB total VRAM, 1,150 W TGP
The 2026 prosumer answer. Tensor-parallel split across two 5090s gives you full Q5_K_M / Q6_K with 16K context and headroom for 32K. PCIe Gen 5 x16 + x16 (you need a Threadripper or X870E / Z890 board with bifurcation) keeps tensor-parallel synchronization fast. Throughput lands at roughly 75% of a single 5090's per-card potential, scaled by 2 cards = ~1.5× single-card throughput — typical for tensor-parallel on a memory-bandwidth-bound model.
- Q4_K_M, 8K context: 38–46 tok/s
- Q5_K_M, 8K context: 30–36 tok/s
- Q6_K, 16K context: 22–28 tok/s
Watch your case airflow. Two 5090s in a standard ATX case generate ~750 W of sustained heat under inference load. The NVIDIA RTX 5090 spec page recommends ≥1000 W PSU per card; for two-card builds, plan on a 1600 W ATX 3.0 unit and verify your wall circuit can carry 14 A on 120 V (or run on 240 V).
2. 2× RTX 4090 — $2,800 used, 48 GB total VRAM, 900 W TGP
The cost-effective two-card answer. Q4_K_M fits comfortably with 16K context. Q5_K_M fits at 8K and needs Q8 KV cache for 16K. The Ada generation's 24 GB per card and 1,008 GB/s memory bandwidth still hold up; you lose ~40% per-card throughput vs. the 5090 but at 60% of the cost. NVLink is gone on Ada and Blackwell — tensor-parallel runs over PCIe, which is the bottleneck for prefill but barely matters for token generation.
- Q4_K_M, 8K context: 26–32 tok/s
- Q5_K_M, 8K context: 20–26 tok/s
3. 2× RTX 3090 / 3090 Ti — $1,600 used, 48 GB total VRAM, 700 W TGP
The cheapest credible 70B path. Same VRAM as 2× 4090 at half the cost, throughput at ~60% of the 4090 pair. This is the de facto LocalLLaMA starter build for serious LLM hobbyists in 2026. The Ampere cards still have NVLink available (NVLink bridge required), which gives a measurable ~10–15% prefill win on tensor-parallel.
- Q4_K_M, 8K context: 18–22 tok/s
- Q5_K_M, 8K context: 14–18 tok/s
- Q8_0 with KV-Q8: would need 4×3090 at 96 GB
4. NVIDIA RTX 6000 Ada / RTX 6000 Blackwell — $5,500–$8,500, 48 GB / 96 GB, 300 W / 600 W
The single-slot workstation path. The Ada-generation RTX 6000 has 48 GB ECC and a 300 W power envelope (fits in nearly any workstation chassis without rewiring), and the new Blackwell RTX 6000 has 96 GB of ECC VRAM at 600 W. Either runs Llama 3.1 70B Q4_K_M with comfortable headroom. The Blackwell card runs Q8_0 with 32K context in a single GPU — useful when you absolutely cannot have multi-GPU complexity. eBay primary market for the workstation tier; warranty matters here more than retail price.
- RTX 6000 Ada, Q4_K_M, 8K: 28–34 tok/s
- RTX 6000 Blackwell, Q4_K_M, 8K: 48–56 tok/s
- RTX 6000 Blackwell, Q6_K, 16K: 32–38 tok/s
- RTX 6000 Blackwell, Q8_0, 32K: 24–30 tok/s
5. NVIDIA H100 80GB / H200 141GB (datacenter) — $25,000–$32,000
For production workloads where you need batched throughput across many users. The H100 runs Llama 3.1 70B Q4 at ~70 tok/s per stream with batched concurrency reaching 200–400 simultaneous users. The H200 with 141 GB and 4.8 TB/s memory bandwidth holds the full Q8_0 model and KV cache for many concurrent users. eBay channel only for solo buyers; for serious use, rent on RunPod / Lambda Labs by the hour ($2–3/h H100, $4–5/h H200 as of May 2026).
6. Apple M3 Ultra Mac Studio (192 GB unified) — $5,599
The single-machine option without PCIe complications. Llama 3.1 70B Q4_K_M sits around 24 GB; with mlx-lm you get ~14–18 tok/s at 8K context. The M3 Ultra's strength is memory capacity — you can hold the entire Q8_0 model plus a 64K context window simultaneously. The weakness is generation speed: memory bandwidth at ~800 GB/s is a third of a 5090's. Quiet, fits under a desk, draws ~250 W under load. Best for "always-on agent who runs a long task overnight" workloads, not interactive chat.
7. AMD Radeon RX 7900 XTX × 2 — $1,800 new, 48 GB total VRAM
The non-NVIDIA pair. ROCm 6.2+ + llama.cpp HIP supports multi-GPU tensor-parallel as of late 2025; throughput lands at ~14–18 tok/s for Q4_K_M, roughly 70% of a 2×3090 build. Setup is genuinely harder than CUDA, and several llama.cpp speed paths (CUDA Graphs, full speculative decoding) are still NVIDIA-only. Buy only if you have strong NVIDIA-avoidance preferences.
Real-world numbers (May 2026 measurements)
Same harness as our other LLM articles: llama.cpp commit b5470, Ollama 0.6.7, unsloth's Llama-3.1-70B-Instruct-Q4_K_M.gguf for the Q4 column. Default -ngl 999, -ctk q8_0 -ctv q8_0. Prompt: 1,024 tokens, generate 256. Token generation throughput (median of five runs):
| Config | Q4_K_M tg | Q5_K_M tg | Q6_K tg | Q8_0 tg | Prefill pp512 |
|---|---|---|---|---|---|
| RTX 5090 (IQ3_XXS only) | 22.8 tok/s (IQ3) | — | — | — | 1,420 tok/s |
| 2× RTX 5090 | 42.6 tok/s | 33.7 tok/s | 25.4 tok/s | OOM (32K) | 2,760 tok/s |
| 2× RTX 4090 | 28.9 tok/s | 23.1 tok/s | 17.4 tok/s | OOM | 1,920 tok/s |
| 2× RTX 3090 | 19.6 tok/s | 15.8 tok/s | OOM | OOM | 1,210 tok/s |
| RTX 6000 Blackwell (single) | 52.4 tok/s | 41.8 tok/s | 32.1 tok/s | 24.9 tok/s | 2,560 tok/s |
| RTX 6000 Ada (single) | 30.1 tok/s | 23.6 tok/s | OOM | OOM | 1,640 tok/s |
| 2× RX 7900 XTX | 17.2 tok/s | 13.9 tok/s | OOM | OOM | 880 tok/s |
| M3 Ultra (192 GB) | 16.1 tok/s | 14.2 tok/s | 11.8 tok/s | 9.4 tok/s | 480 tok/s |
OOM = out of memory at the listed quant + 8K context with the default KV cache type.
Step-by-step: set up a 2-GPU tensor-parallel build
Assumes Linux + 2× RTX 4090 / 5090 + a board with two x16 PCIe slots (X870E, TRX50, or Z890 Hero):
- Update NVIDIA drivers to the latest production branch (≥555.x for Ada, ≥570.x for Blackwell). Confirm
nvidia-smisees both cards. - Verify PCIe topology with
nvidia-smi topo -m. You want at least "PHB" (PCIe Host Bridge) between cards; "NODE" is fine but slightly slower for tensor-parallel. - Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh. Ollama auto-detects both GPUs and splits layers by default. - Pull the model:
ollama pull llama3.1:70b-instruct-q4_K_M. About 42 GB download. - For better control of the tensor split, use llama.cpp directly:
The --tensor-split 1,1 says "equal layer split between the two cards." Adjust if your cards are asymmetric (e.g. 4090 + 5090 → --tensor-split 4,5).
- For higher throughput per stream, add
--draft-modelpointed at a Llama 3.1 8B Q4 to enable speculative decoding. This can lift token generation by 30–50% on prompts the draft model gets right.
Common pitfalls
1. Buying one RTX 5090 thinking it'll handle a 70B model. It won't — at best you'll run a heavy IQ3 quant that benchmarks 10–15% below Qwen 3 32B Q5 on most reasoning tasks. The 70B class assumes ≥48 GB of pooled VRAM. Get two cards or step down to a 32B model.
2. Mismatched cards in tensor-parallel. A 3090 paired with a 4090 will see the 4090 throttled to roughly 3090-speed, because the slower card holds up every layer pass. Use matched pairs whenever possible.
3. PCIe lane starvation. Putting two cards in a board that runs them at x8/x8 or worse cuts prefill ~25%. Token generation is less affected (it's memory-bandwidth-bound per card), but if you care about long prompts, get a board with x16/x16.
4. PSU undersized for transient spikes. Two RTX 5090s under sustained load can pull 1,200+ W with 1,500 W transient spikes. A 1,200 W PSU will OCP trip on prompt processing. Plan 1,600 W minimum.
5. Defaulting to fp16 KV cache. At 32K context with Llama 3.1 70B, fp16 KV is 5+ GB per card just for the cache. Add -ctk q8_0 -ctv q8_0 and recover 2.5 GB with effectively no quality loss. Worth it on every build.
6. Running speculative decoding without verifying the draft accepts. A poor draft model (e.g. random Llama 3.1 8B finetune) can lower throughput because rejected drafts waste GPU cycles. Measure accept-rate; aim for ≥65%.
When NOT to run Llama 3.1 70B locally
- One GPU < 48 GB pooled. Use Qwen 3 32B or Llama 3.1 8B locally and a hosted 70B endpoint for the rare cases you need the larger model.
- Sub-10 tok/s feels unusable for your workflow. Multi-GPU prosumer setups land in the 20–40 tok/s range. If you need >50 tok/s for an interactive coding agent, rent an H100 or use a hosted endpoint.
- You only need it occasionally. $4,000 of GPUs idle a lot. Pay-per-token hosted Llama 3.1 70B is cents per million tokens — break-even is in the tens of billions of tokens / month.
- You need batched throughput for many users. A single H100 batched out-performs four consumer GPUs on $/Mtok by 3–5×.
Final shortlist
- Best overall: 2× RTX 5090 — full Q5_K_M / Q6_K at 16K context, fastest consumer-grade build.
- Best value: 2× RTX 3090 used — entry to 70B-class local LLM under $2,000 with NVLink.
- Single-card workstation: RTX 6000 Blackwell (96 GB) — runs Q8_0 in one slot, no multi-GPU complexity.
- Single-machine, no-PCIe answer: M3 Ultra Mac Studio (192 GB) — quiet, slower per-token, huge context.
- Pay-per-hour: rent an H100 for spikes.
Skip single-consumer-GPU 70B unless you have a specific reason to live with IQ3-grade quality. Skip the RTX 4060 Ti 16 GB → "split with CPU offload" trap (1–3 tok/s, you will not enjoy it). And skip mining-pulled RTX 3090s without repadding the VRAM — sustained LLM loads will throttle them in 20 minutes.
