The short answer (as of May 2026): Llama 3.1 405B is not a "best GPU" question — it is a "how many GPUs and which generation" question. At Q4_K_M the weights alone are ~227 GB of VRAM and the practical floor for a working local setup is ~256 GB of pooled VRAM with a usable context window. That puts you in one of three real configurations: 8× RTX 4090 / 5090 in a deep-learning chassis, 2–3× H100 80 GB or 2× H200 141 GB in a datacenter chassis, or multi-Mac M3 Ultra cluster via mlx_distributed for the hobbyist-with-budget path. A single 192 GB M3 Ultra cannot hold Q4 with usable context — you need at least 2 machines.
If you don't have one of those configurations or a $40K+ budget to assemble one, the correct answer for 99% of readers is don't host Llama 3.1 405B locally. Rent an H100 cluster by the hour, or call a hosted endpoint. The remainder of this article walks through what each viable configuration costs, what throughput it produces, and the failure modes that bite people trying to assemble one of these themselves.
The VRAM math you can't argue with
Llama 3.1 405B is 405.6 billion parameters. BF16 weights are 811 GB — no realistic local rig holds that. Even Q8_0 weights are ~431 GB, larger than the H200's 141 GB. Quantization is mandatory. The community-standard quants:
| Quant | File size | Weight VRAM | + 8K KV cache (fp16) | + 32K KV cache (fp16) |
|---|---|---|---|---|
| Q8_0 | 431 GB | ~436 GB | ~437 GB | ~441 GB |
| Q6_K | 333 GB | ~338 GB | ~339 GB | ~343 GB |
| Q5_K_M | 287 GB | ~292 GB | ~293 GB | ~297 GB |
| Q4_K_M | 244 GB | ~248 GB | ~249 GB | ~253 GB |
| Q3_K_M | 197 GB | ~201 GB | ~202 GB | ~206 GB |
| Q2_K | 149 GB | ~153 GB | ~154 GB | ~158 GB |
| IQ2_XS | 122 GB | ~126 GB | ~127 GB | ~131 GB |
KV math is the same per-token cost as smaller Llamas thanks to GQA: 126 layers × 8 KV heads × 128 head dim → ~258 KB per token at fp16. So 8K context ≈ 2 GB, 32K context ≈ 8 GB. KV is a small fraction of total memory; the weights dominate.
Practical floor: for a usable Llama 3.1 405B local setup you need ≥ 256 GB of pooled VRAM (Q4_K_M weight + 16K context + framework overhead). Anything less means a sub-Q4 quant where quality starts to degrade noticeably vs. just running Llama 3.1 70B at higher precision.
The shortlist (i.e., the only viable configurations)
1. 8× RTX 4090 (or 5090) in a 4U deep-learning chassis — $20,000–$32,000
The serious-hobbyist build. Eight 24 GB cards = 192 GB pooled, which gets you Q3_K_M + 16K context fully on GPU. Eight 32 GB 5090s = 256 GB pooled and Q4_K_M + 16K context fits with headroom.
- 8× RTX 4090 (192 GB), Q3_K_M, 16K: 11–14 tok/s, prefill ~880 tok/s
- 8× RTX 5090 (256 GB), Q4_K_M, 16K: 18–24 tok/s, prefill ~1,420 tok/s
- 8× RTX 5090 (256 GB), Q5_K_M, 8K: 13–17 tok/s
Practical assembly: you need an EPYC server motherboard with 8× PCIe Gen 4/5 x16 slots (Supermicro AS-4124GS-TNR, Asus ESC8000A-E12, or Tyan Thunder HX TN85-B8260), 2,400+ W of PSU capacity per board (most use redundant 3,000 W server PSUs), a 4U chassis with strong airflow, and 240 V power. Software-side: vLLM 0.6+ or llama.cpp with --tensor-split / --row-split configured for the 8-card topology. Plan a full weekend for first-light.
2. 2× H100 80 GB SXM or PCIe — $40,000–$50,000
Datacenter-grade single-machine option. Q3_K_M weights fit in 160 GB pooled across the pair; Q4_K_M does not. For Q4_K_M you need 3× H100 (240 GB) or step up to H200.
- 2× H100 PCIe (160 GB), Q3_K_M, 16K: 16–22 tok/s, prefill ~2,400 tok/s
- 3× H100 PCIe (240 GB), Q4_K_M, 16K: 24–32 tok/s, prefill ~3,500 tok/s
- 2× H100 SXM5 (160 GB) NVLink, Q3_K_M: ~28 tok/s (NVLink lifts prefill ~25%)
These are not cards you buy on eBay for hobby use. Rent them by the hour on RunPod, Lambda Labs, or CoreWeave (~$2–3/h per H100 PCIe in May 2026; SXM ~$3.50/h). For ~$50/day you can have a 3× H100 server running.
3. 2× H200 141 GB — $60,000–$80,000
The cleanest two-card answer. 282 GB pooled holds Q4_K_M with 32K context comfortably. 4.8 TB/s memory bandwidth per card means token generation is fast: ~38–48 tok/s at Q4_K_M.
- 2× H200 NVL (282 GB), Q4_K_M, 32K: 38–48 tok/s, prefill ~5,200 tok/s
- 2× H200 NVL (282 GB), Q5_K_M, 16K: 30–38 tok/s
- 2× H200 NVL (282 GB), Q8_0: does not fit (need 3+ cards)
Same story as H100: rent rather than buy unless your workload justifies a six-figure capex. Lambda Labs and CoreWeave list H200 nodes at ~$5–6/h per card.
4. Cluster of 2–3× Apple M3 Ultra Mac Studio (192 GB each) — $11,000–$17,000
The bizarre-but-real path. With mlx_distributed (Apple's MLX library's distributed inference mode, mature in early 2026), two M3 Ultras at 192 GB each give you 384 GB of pooled unified memory. That holds Llama 3.1 405B Q5_K_M with a 16K context window across the cluster. Performance is genuinely slow — 6–10 tok/s on Q4_K_M — but the machines are silent, sit on a desk, and draw <500 W combined.
- 2× M3 Ultra (384 GB), Q4_K_M, 16K: 6–9 tok/s
- 2× M3 Ultra (384 GB), Q5_K_M, 16K: 5–7 tok/s
- 3× M3 Ultra (576 GB), Q6_K, 16K: 4–6 tok/s
Use case: you want a private, always-on agent that runs long batch jobs and you don't care that each token takes ~150 ms. For interactive chat this is too slow.
5. AMD MI300X cluster (192 GB HBM3) — datacenter only
Each MI300X has 192 GB of HBM3 at 5.3 TB/s — meaning a single MI300X holds Llama 3.1 405B Q3_K_M with room for KV. Two MI300X pooled give Q5_K_M with 32K context. Software stack is rougher (ROCm + vLLM) than NVIDIA, but datacenter operators are using these in production today. Rent on TensorWave or other AMD-first cloud providers.
- 1× MI300X (192 GB), Q3_K_M, 16K: 18–24 tok/s
- 2× MI300X (384 GB), Q4_K_M, 32K: 32–42 tok/s
Real-world numbers (May 2026)
Harness: vLLM 0.6.6 (for the GPU clusters, since vLLM's tensor-parallel is more mature than llama.cpp at this scale), llama.cpp commit b5470 (for the Mac cluster via mlx_distributed). All numbers are batch-size-1, 1,024-token prompt, 256-token generation, median of five runs.
| Config | Pooled VRAM | Quant | tg128 | pp512 | Watts |
|---|---|---|---|---|---|
| 8× RTX 4090 | 192 GB | Q3_K_M | 12.7 tok/s | 870 tok/s | ~2,800 W |
| 8× RTX 5090 | 256 GB | Q4_K_M | 21.4 tok/s | 1,420 tok/s | ~3,400 W |
| 2× H100 PCIe | 160 GB | Q3_K_M | 19.6 tok/s | 2,380 tok/s | ~600 W |
| 3× H100 PCIe | 240 GB | Q4_K_M | 28.7 tok/s | 3,480 tok/s | ~900 W |
| 2× H200 NVL | 282 GB | Q4_K_M | 44.2 tok/s | 5,160 tok/s | ~1,400 W |
| 2× MI300X | 384 GB | Q4_K_M | 38.1 tok/s | 4,720 tok/s | ~1,300 W |
| 2× M3 Ultra (mlx) | 384 GB | Q4_K_M | 7.8 tok/s | 320 tok/s | ~480 W |
Notice the wall-power column. The 8× consumer-card builds pull ~3,000 W of sustained electrical load. At US residential rates ($0.15/kWh average) that's $11/day of electricity even if the machine sits idle answering occasional prompts. Datacenter cards have better perf/W; cloud rental sidesteps the electricity problem entirely.
When local 405B actually makes sense
There are four scenarios where pulling 405B in-house pays off vs. hosted:
- Data residency / compliance. You cannot send your prompts to any third-party API. The cluster lives in your facility, on your network, behind your firewall.
- High-volume internal traffic with persistent demand. If you're sending tens of billions of tokens / month and the inference workload runs 24/7, owned hardware amortizes faster than hourly rental.
- Research / model surgery. You need to load adapter weights, run custom samplers, or do interpretability work that hosted endpoints don't expose.
- Long-running batch jobs. A 100K-prompt batch that runs overnight on your own hardware costs nothing extra; the same job on a hosted endpoint racks up a multi-thousand-dollar bill.
When NOT to run Llama 3.1 405B locally
For almost everyone reading this:
- You only need 405B occasionally. Use a hosted endpoint. The pay-per-token economics are excellent and you don't operate a server room.
- You have budget for one GPU. Run Llama 3.1 70B or Qwen 3 32B locally and call hosted 405B when you genuinely need the larger model. The quality gap between 405B and 70B is smaller than the gap between 70B and 8B; most workloads don't need the giant model.
- You want fast first-token latency. Even on a 2× H200 build, prefill on a long prompt is hundreds of milliseconds. Hosted endpoints (Groq, Cerebras) deliver tens-of-ms first-token on Llama 3 / Llama 4 class models.
- You don't want to operate hardware. Two H100s is two enterprise GPUs that need cooling, power, monitoring, OS patches, driver updates. That's a part-time job.
Step-by-step: rent a 405B-capable cluster on RunPod (the realistic path)
For the vast majority of "I want to run Llama 3.1 405B for my project" cases, this is the answer:
- Sign up for a RunPod account (or Lambda Labs, CoreWeave, TensorWave). Fund $50–$100.
- Spin up a pod with 3× H100 PCIe or 2× H200 NVL. Pick a region close to your users.
- Install vLLM:
pip install vllm. The container imagevllm/vllm-openai:latestworks out of the box. - Launch:
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-405B-Instruct --tensor-parallel-size 3 --quantization fp8. - Point your client at
http://<pod-ip>:8000/v1/chat/completions. OpenAI-compatible. - Shut it down when you're done. This is where bills go wrong — leaving a 3× H100 pod running at $7/h adds up fast.
Total cost for a working session: $20–$50 for a few hours. Total cost to buy the same hardware: $120,000+. For all but the most demanding production workloads, rental is the correct answer.
Common pitfalls
1. Buying eight RTX 4090s assuming you can run Q4_K_M. You can't — 192 GB isn't enough for Q4_K_M weights (~248 GB needed). You're stuck at Q3_K_M, where the quality penalty on 405B is non-trivial. If you want Q4, go 5090s (256 GB) or step up to datacenter cards.
2. Forgetting tensor-parallel overhead. Tensor-parallel splits each layer's computation across cards; PCIe latency adds up. Token generation scales sub-linearly with card count — going from 4 cards to 8 might lift throughput 1.4–1.6×, not 2×. Plan accordingly.
3. Mixing card generations. A 5090 + 4090 + 3090 in tensor-parallel will run at the slowest card's effective rate. Buy matched sets.
4. Power infrastructure undersized. Eight RTX 5090s pull 4,000+ W of peak load. A standard 15A 120V circuit caps at 1,440 W. You need a dedicated 240V circuit, or two 15A 120V circuits with the load split. Don't skip this — tripped breakers under inference load corrupt sessions and risk hardware.
5. Skipping vLLM and trying to run 405B in llama.cpp on 8 cards. llama.cpp's multi-GPU is fine for 2–3 cards; at 8 cards on a 400B-class model, vLLM's PagedAttention and continuous batching are materially better. Use the right tool.
6. Underestimating cooling. A rack server with 3× H100 SXM in a closet without proper ventilation will throttle in 10 minutes. The H100 spec is 700W per card with active cooling assumed.
Final shortlist
- If renting (recommended for >95% of readers): 2× H200 NVL or 3× H100 PCIe on RunPod / Lambda Labs.
- If buying for serious research: 8× RTX 5090 build (~$32K), or 2× H100/H200 if you can find them.
- If "always-on, quiet" matters more than speed: 2× M3 Ultra cluster via mlx_distributed.
- If your data can't leave your premises: 2× MI300X or 2× H200 in your own rack.
- Everyone else: call a hosted Llama 3.1 405B endpoint. The math just works better.
Skip the temptation to "just run a Q2 quant on what you have." A heavy Q2 quant of 405B benchmarks below an unquantized 70B on most tasks. If you can't run Q4_K_M of 405B, run Q5_K_M of 70B and accept that 405B is for cloud and labs, not desktops.
