Running Llama 3.1 405B locally is a VRAM problem first and a bandwidth problem second. At the community-standard q4_K_M quantization, Llama 3.1 405B needs roughly 243 GB of memory for weights, plus a KV cache that grows with context length (roughly 0.5 GB per 1K tokens of context). No single GPU fits it natively in 2026; the realistic options are a Mac Studio with 512 GB of unified memory or a multi-GPU server build. This guide pulls real tokens-per-second numbers from the SpecPicks benchmark database for every option.
Frontier open model. ~240 GB+ VRAM at q4; realistically only a Mac Studio M3 Ultra 512GB or multi-GPU H100-class hardware.
Does Llama 3.1 405B fit on my GPU? (quantization matrix)
Quantization is the lever that decides whether Llama 3.1 405B fits on your hardware. Each quant below lists the approximate VRAM for weights and the quality tradeoff. The KV cache is a separate, quant-independent cost: it depends on context length and cache precision, not on the weight quant, and runs roughly 2 GB at a 4K-token context with the default fp16 cache (see the context-length table further down).
| Quant | Weights (VRAM) | Quality |
|---|---|---|
| q2_K_S | 126.6 GB | Severe — lose 15-25% on reasoning. Use only when desperate. |
| q3_K_M | 177.2 GB | Noticeable — lose 5-8% on HumanEval / MMLU. Fine for casual chat. |
| q4_K_M | 243 GB | Community default — 1-3% loss vs fp16. Almost free quality-wise. |
| q5_K_M | 278.4 GB | Minimal loss — <1%. Most users can't tell vs fp16. |
| q6_K | 329.1 GB | Effectively lossless. ~35% more VRAM than q4_K_M. |
| q8_0 | 430.3 GB | Lossless at the inference level. ~1.8x the weight of q4_K_M. |
| fp16 | 810 GB | Original training precision. Baseline, rarely needed for inference. |
For nearly every user, q4_K_M is the right default. It costs you maybe 1-3% on benchmark scores versus fp16 but cuts the memory footprint to less than a third of fp16 and roughly half of q8_0. Drop to q3_K_M only when VRAM is tight and you can tolerate a few percent more quality loss. q6_K and q8_0 are worth considering when you have the headroom and want to eliminate any question of quant damage.
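If you want to sanity-check these weight figures, or estimate a quant not listed, multiply the parameter count by the quant's average bits per weight. A minimal sketch using the approximate bits-per-weight values behind the table above; real GGUF files differ by a few percent depending on tensor layout:

```python
# Rough weight-size estimator: parameters x bits-per-weight / 8.
# The bpw values are the approximate averages behind the table above
# (an assumption, not exact per-tensor figures).
PARAMS = 405e9  # Llama 3.1 405B

BITS_PER_WEIGHT = {
    "q2_K_S": 2.5, "q3_K_M": 3.5, "q4_K_M": 4.8,
    "q5_K_M": 5.5, "q6_K": 6.5, "q8_0": 8.5, "fp16": 16.0,
}

def weight_gb(quant: str, params: float = PARAMS) -> float:
    """Approximate weight footprint in GB (decimal) for a given quant."""
    return params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"{quant:>7}: ~{weight_gb(quant):.0f} GB")
```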
What runs Llama 3.1 405B at q4_K_M? (the shortlist)
Every number in the table below comes from a live query against the SpecPicks benchmark database. Tok/s values are single-user generation speed ("output tokens per second after the first token"). Perf-per-dollar is tokens/sec per $1,000 of MSRP; perf-per-watt is tokens/sec per 100W of TDP.
| Hardware | VRAM | MSRP | TDP | Gen tok/s | tok/s/$1k | tok/s/100W |
|---|---|---|---|---|---|---|
| Apple M3 Ultra | 512 GB | — | — | — | — | — |
Note: no consumer GPU we track holds Llama 3.1 405B at q4_K_M natively in 2026, and there is no realistic single-card "stretch" option at this size; offloading 150+ GB of weights to system RAM drops speed to a crawl. Your best options are either a Mac Studio with enough unified memory or a multi-GPU build on enterprise-class cards like the RTX PRO 6000 Blackwell (96 GB each).
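For multi-GPU planning, the arithmetic is just total footprint divided by per-card VRAM. A rough sketch; the 5% overhead factor for runtime buffers and activations is an assumption, and real headroom needs vary by engine:

```python
import math

def gpus_needed(weights_gb: float, kv_cache_gb: float,
                per_gpu_vram_gb: float, overhead: float = 1.05) -> int:
    """Minimum card count so weights plus KV cache fit across GPUs.

    `overhead` (assumed ~5%) covers runtime context, activations, and
    fragmentation. Note: some runtimes also restrict the GPU count
    (tensor parallelism usually wants powers of two), so round up further
    if the result is awkward.
    """
    total = (weights_gb + kv_cache_gb) * overhead
    return math.ceil(total / per_gpu_vram_gb)

# q4_K_M weights (~243 GB) plus ~2 GB of KV cache at 4K context
print(gpus_needed(243, 2, 96))   # RTX PRO 6000 Blackwell class -> 3
print(gpus_needed(243, 2, 80))   # H100 80GB class -> 4
```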
How does quantization change tok/s?
Smaller quants don't just save VRAM — they also run faster. Memory bandwidth is the dominant bottleneck for dense-weight inference, so halving the bytes per weight roughly doubles the throughput (up to a point where compute becomes the limit).
Community llama.cpp benchmarks on dense Llama-family models show approximate deltas that carry over to 405B:
- q8_0 → baseline (call it 100%)
- q5_K_M → ~1.4x faster than q8_0
- q4_K_M → ~1.7x faster than q8_0
- q3_K_M → ~2.0x faster than q8_0
Quality loss vs speed gain is not linear — q4 is the last point on the Pareto frontier for most users. Below q4 you lose quality faster than you gain speed.
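You can turn the bandwidth argument into a rough speed ceiling: divide aggregate memory bandwidth by the bytes that must be read per generated token, which is roughly the quantized weight size. A sketch under that assumption; the bandwidth figures are spec-sheet values, and real throughput lands below this ceiling once compute and interconnect overheads enter:

```python
def ceiling_tok_s(total_bandwidth_gb_s: float, weights_gb: float) -> float:
    """Bandwidth-bound ceiling on single-user generation speed:
    every quantized weight byte is read once per generated token."""
    return total_bandwidth_gb_s / weights_gb

WEIGHTS_Q4_GB = 243  # Llama 3.1 405B at q4_K_M

# Assumed aggregate memory bandwidth (GB/s, spec-sheet figures).
print(f"M3 Ultra (~819 GB/s):       ~{ceiling_tok_s(819, WEIGHTS_Q4_GB):.1f} tok/s ceiling")
print(f"4x H100 SXM (~4x3350 GB/s): ~{ceiling_tok_s(4 * 3350, WEIGHTS_Q4_GB):.1f} tok/s ceiling")
```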
Prefill vs generation speed
Two numbers matter for different workloads:
- Prefill (prompt-processing) — how fast the model ingests your input before the first token comes out. Prefill runs many times faster per token than generation, but at 405B scale it still sets the first-token latency for long prompts.
- Generation — sustained output speed after prefill, which is what the table above measures.
For chat you'll feel generation speed. For RAG where the model re-ingests a long retrieved context on every turn, prefill is often the bottleneck. Code completion sits in between — prompts are short so prefill is fast, generation dominates.
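To see which number dominates your workload, a back-of-the-envelope latency model needs only two divisions. The prefill and generation rates below are placeholders to illustrate the chat-versus-RAG contrast, not measured 405B figures:

```python
def latency_s(prompt_tokens: int, output_tokens: int,
              prefill_tok_s: float, gen_tok_s: float) -> tuple[float, float]:
    """Return (time_to_first_token, total_time) in seconds for one request."""
    ttft = prompt_tokens / prefill_tok_s
    return ttft, ttft + output_tokens / gen_tok_s

# Chat turn: short prompt, generation dominates total time
print(latency_s(prompt_tokens=300, output_tokens=500, prefill_tok_s=200, gen_tok_s=5))
# RAG turn: long retrieved context, prefill dominates time-to-first-token
print(latency_s(prompt_tokens=8000, output_tokens=300, prefill_tok_s=200, gen_tok_s=5))
```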
Context length and VRAM — the hidden cost
The KV cache grows linearly with context length: with the default fp16 cache, Llama 3.1 405B needs roughly 0.5 MB per token (126 layers × 8 KV heads × 128-dim heads, keys plus values). Here's the approximate overhead on top of 243 GB of q4_K_M weights:
| Context | KV cache (fp16) | Total VRAM |
|---|---|---|
| 2K tokens | ~1.1 GB | ~244 GB |
| 4K tokens | ~2.1 GB | ~245 GB |
| 8K tokens | ~4.2 GB | ~247 GB |
| 32K tokens | ~16.9 GB | ~260 GB |
| 128K tokens | ~67.6 GB | ~311 GB |
For 128K-context workloads budget roughly 65-70 GB of VRAM on top of the weights. llama.cpp supports KV-cache quantization (-ctk q8_0 -ctv q8_0), which cuts cache size roughly in half with minimal quality loss; use it if you're pushing context limits.
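The arithmetic behind that table fits in a few lines. The constants are Llama 3.1 405B's published attention configuration (126 layers, 8 KV heads, 128-dim heads), and the q8_0 entry corresponds to the cache quantization flags mentioned above:

```python
# Approximate KV-cache size for Llama 3.1 405B.
LAYERS, KV_HEADS, HEAD_DIM = 126, 8, 128        # published 405B config
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 1.0625}   # q8_0: 32 int8 values + fp16 scale per block

def kv_cache_gb(context_tokens: int, cache_type: str = "f16") -> float:
    """Keys plus values for every layer, KV head, and position, in GB."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM[cache_type]
    return context_tokens * per_token / 1e9

for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens: f16 ~{kv_cache_gb(ctx):.1f} GB, "
          f"q8_0 ~{kv_cache_gb(ctx, 'q8_0'):.1f} GB")
```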
Which runtime should I use?
At 405B, tensor parallelism across multiple GPUs is the only viable path for most users. vLLM is the current best choice for multi-GPU inference; llama.cpp can do it too but is less optimized for >2 GPUs. Rule of thumb:
- Ollama — easiest. Good for anyone who doesn't want to manage models manually. Wraps llama.cpp.
- llama.cpp — direct control over quantization, offload, KV-cache precision. Where the LocalLLaMA community benchmarks its numbers.
- vLLM — production serving. Tensor parallelism, PagedAttention, continuous batching. CUDA-first.
For more: Ollama vs llama.cpp vs vLLM →.
Multi-GPU — does it help?
Yes; at this size it's mandatory rather than optional. Tensor parallelism across four or more high-VRAM cards (80-96 GB class) is the mainstream approach. vLLM handles this natively; llama.cpp has layer-split support but is less mature.
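Here is a minimal vLLM sketch of tensor-parallel serving, assuming an 8-GPU node; the checkpoint name is one published option, and the context and memory settings are illustrative rather than tuned values:

```python
from vllm import LLM, SamplingParams

# Shard the model across 8 GPUs with tensor parallelism.
llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct-FP8",  # example checkpoint; swap for yours
    tensor_parallel_size=8,        # must divide the model's attention-head count
    max_model_len=8192,            # cap context to bound KV-cache memory
    gpu_memory_utilization=0.90,   # leave headroom for activations
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in two sentences."], params)
print(outputs[0].outputs[0].text)
```

Note that the tensor-parallel size must divide the model's 128 attention heads, which is why multi-GPU 405B builds cluster around 4, 8, or 16 cards.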
Perf-per-dollar vs perf-per-watt
Shopping on pure tok/s is expensive. The "tok/s/$1k" column above is a better lens for budget-constrained builds. Apple Silicon dominates the perf-per-watt column by a wide margin: a Mac Studio draws a few hundred watts under sustained load, while a multi-GPU server able to hold 405B pulls well over a kilowatt.
- Best price-performance for a single box: Mac Studio M3 Ultra 512GB (the only one-machine setup that holds q4_K_M natively)
- Best max speed, cost ignored: a multi-GPU H100/H200-class node running vLLM with tensor parallelism
- Best power efficiency: Apple M3 Ultra (near-silent, a few hundred watts, and unified memory fits far more model than any consumer NVIDIA card)
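Both efficiency columns are simple ratios, so you can recompute them for any hardware quote or benchmark run you collect. The figures in this example are placeholders, not SpecPicks data:

```python
def tok_s_per_1k_dollars(tok_s: float, msrp_dollars: float) -> float:
    """Tokens/sec per $1,000 of MSRP (the tok/s/$1k column)."""
    return tok_s / (msrp_dollars / 1_000)

def tok_s_per_100w(tok_s: float, tdp_watts: float) -> float:
    """Tokens/sec per 100 W of TDP (the tok/s/100W column)."""
    return tok_s / (tdp_watts / 100)

# Placeholder numbers purely to show the arithmetic:
print(tok_s_per_1k_dollars(tok_s=3.0, msrp_dollars=9_500))  # ~0.32
print(tok_s_per_100w(tok_s=3.0, tdp_watts=300))             # ~1.0
```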
Getting started — concrete commands
With Ollama:
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:405b
ollama run llama3.1:405b
With llama.cpp (more control):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make GGML_CUDA=1 -j # NVIDIA; use GGML_METAL=1 for Mac, GGML_HIPBLAS=1 for AMD Linux
# Download a q4_K_M GGUF from Hugging Face (bartowski maintains good split GGUFs for 405B)
./llama-cli -m ~/models/llama-3-1-405b-q4_k_m.gguf -n 512 -c 4096 -ngl 999 \
-p "Write a haiku about GPUs"
Expect a noticeable first-token delay while the prompt is prefilled, then sustained generation at a rate set mainly by your hardware's memory bandwidth, as discussed above.
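Once Ollama is serving the model, you can script against its local REST API. A minimal sketch assuming the default endpoint on localhost:11434 and the pull above having finished:

```python
import json
import urllib.request

# Ollama's local REST API; /api/generate streams JSON lines by default,
# so disable streaming to get a single blocking response.
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llama3.1:405b",
        "prompt": "Write a haiku about GPUs",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
    print(body["response"])
```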
Bottom line
For Llama 3.1 405B at q4_K_M in 2026, a 512 GB Mac Studio M3 Ultra is the entry point for a single machine, a multi-GPU H100-class node running vLLM is the ceiling for raw speed, and no consumer GPU runs it natively at all.
Buy more memory than you think you need. Context-window growth, longer conversation histories, and KV-cache pressure eat memory on top of the static weight footprint. A configuration with comfortable headroom over the roughly 245 GB that q4_K_M plus a short context needs is a materially better long-term bet than one that only just fits the weights.
