Running Llama 3.1 70B locally is a VRAM problem first and a bandwidth problem second. At the community-standard q4_K_M quantization, Llama 3.1 70B needs roughly 42 GB of GPU memory for weights, plus 3-8 GB for the KV cache depending on context length — more than any single consumer GPU offers in 2026, which pushes you toward Apple Silicon with large unified memory or an enterprise-class card. This guide pulls real tokens-per-second numbers from the SpecPicks benchmark database for every option.
The 70B weight class — flagship open-weight quality. Needs ~42GB VRAM at q4_K_M.
Does Llama 3.1 70B fit on my GPU? (quantization matrix)
Quantization is the lever that decides whether Llama 3.1 70B fits on your card. Each quant below lists the approximate VRAM for weights, the extra VRAM used by the KV cache at a 4K-token context window, and the quality tradeoff.
| Quant | Weights (VRAM) | + KV cache | Quality |
|---|---|---|---|
| q2_K_S | 21.9 GB | +2.2 GB @ 4K ctx | Severe — lose 15-25% on reasoning. Use only when desperate. |
| q3_K_M | 30.6 GB | +3.1 GB @ 4K ctx | Noticeable — lose 5-8% on HumanEval / MMLU. Fine for casual chat. |
| q4_K_M | 42 GB | +4.2 GB @ 4K ctx | Community default — 1-3% loss vs fp16. Almost free quality-wise. |
| q5_K_M | 48.1 GB | +4.8 GB @ 4K ctx | Minimal loss — <1%. Most users can't tell vs fp16. |
| q6_K | 56.9 GB | +5.7 GB @ 4K ctx | Effectively lossless. ~35% more VRAM than q4_K_M. |
| q8_0 | 74.4 GB | +7.4 GB @ 4K ctx | Lossless at the inference level. ~1.8x the weight size of q4_K_M. |
| fp16 | 140 GB | +14 GB @ 4K ctx | Original training precision. Baseline, rarely needed for inference. |
For nearly every user, q4_K_M is the right default. It costs you maybe 1-3% on benchmark scores versus fp16 but halves the memory footprint. Drop to q3_K_M only when VRAM is tight and you can tolerate a few percent more quality loss. q6_K and q8_0 are worth considering when you have the headroom and want to eliminate any question of quant damage.
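As a sanity check on the table, weights VRAM is roughly parameter count times effective bits per weight. A back-of-the-envelope sketch (the ~70.6B parameter count and ~4.8 effective bits/weight for q4_K_M are approximations, not measured values):

```shell
# Rough weights-VRAM estimate: params * bits-per-weight / 8 bytes.
# 70.6e9 params and 4.8 bits/weight are approximations for Llama 3.1 70B at q4_K_M.
gb=$(awk 'BEGIN { printf "%.0f", 70.6e9 * 4.8 / 8 / 1e9 }')
echo "q4_K_M weights: ~${gb} GB"   # lands on ~42 GB, matching the table
```

Swap in another quant's bits/weight to estimate the other rows.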
What runs Llama 3.1 70B at q4_K_M? (the shortlist)
Every number in the table below comes from a live query against the SpecPicks benchmark database. Tok/s values are single-user generation speed ("output tokens per second after the first token"). Perf-per-dollar is tokens/sec per $1,000 of MSRP; perf-per-watt is tokens/sec per 100W of TDP.
| Hardware | VRAM | MSRP | TDP | Gen tok/s | tok/s/$1k | tok/s/100W |
|---|---|---|---|---|---|---|
| NVIDIA GeForce RTX 5090 | 32 GB | $1,999 | 575W | — | — | — |
| Apple M3 Ultra | 512 GB | — | — | — | — | — |
| Apple M4 Max | 128 GB | — | — | 16.9 tok/s | — | — |
| Apple M4 Pro | 64 GB | — | — | 16.9 tok/s | — | — |
Note: no consumer GPU we track holds Llama 3.1 70B at q4_K_M natively in 2026. The "stretch" cards above can run it with CPU offload at 30-50% of the listed speed. Your best options at this size are either a Mac Studio with enough unified memory or an enterprise-class card like the RTX PRO 6000 Blackwell (96GB).
How does quantization change tok/s?
Smaller quants don't just save VRAM — they also run faster. Memory bandwidth is the dominant bottleneck for dense-weight inference, so halving the bytes per weight roughly doubles the throughput (up to a point where compute becomes the limit).
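Because a dense model reads every weight once per generated token, the hard ceiling is roughly memory bandwidth divided by model size. A sketch using a hypothetical 1,000 GB/s card (the bandwidth figure is illustrative, not any specific GPU):

```shell
# Bandwidth-bound ceiling: tok/s <= bandwidth / bytes read per token
# (~= weights size for a dense model). Both inputs are illustrative.
bw=1000    # hypothetical memory bandwidth, GB/s
size=42    # q4_K_M weights, GB
ceiling=$(awk -v bw="$bw" -v sz="$size" 'BEGIN { printf "%.0f", bw / sz }')
echo "theoretical generation ceiling: ~${ceiling} tok/s"   # real throughput lands below this
```

Halving `size` (a smaller quant) roughly doubles the ceiling, which is the effect the deltas below measure.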
Community benchmarks across cards in this class show approximate deltas:
- q8_0 → baseline (call it 100%)
- q5_K_M → ~1.4x faster than q8_0
- q4_K_M → ~1.7x faster than q8_0
- q3_K_M → ~2.0x faster than q8_0
Quality loss vs speed gain is not linear — q4 is the last point on the Pareto frontier for most users. Below q4 you lose quality faster than you gain speed.
Prefill vs generation speed
Two numbers matter for different workloads:
- Prefill (prompt-processing) — how fast the model ingests your input before the first token comes out. For a 4K-token prompt, expect roughly 150-400 tok/s of prefill on hardware in this class.
- Generation — sustained output speed after prefill, which is what the table above measures.
For chat you'll feel generation speed. For RAG where the model re-ingests a long retrieved context on every turn, prefill is often the bottleneck. Code completion sits in between — prompts are short so prefill is fast, generation dominates.
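Total response time combines both numbers: prompt_tokens / prefill_rate + output_tokens / generation_rate. A sketch using mid-range figures from this guide (300 tok/s prefill, 17 tok/s generation — both assumed; your hardware will differ):

```shell
# Time to first token plus time to stream the answer, in seconds.
# 300 tok/s prefill and 17 tok/s generation are assumed illustrative rates.
seconds=$(awk 'BEGIN { printf "%.0f", 4096/300 + 512/17 }')
echo "4K prompt, 512-token answer: ~${seconds}s total"
```

Note that generation dominates here even with a 4K prompt, which is why chat feels gated by tok/s.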
Context length and VRAM — the hidden cost
The KV cache grows linearly with context length. Here's the approximate overhead on top of 42 GB of weights for Llama 3.1 70B at q4_K_M:
| Context | KV cache | Total VRAM |
|---|---|---|
| 2K tokens | ~2.1 GB | ~44.1 GB |
| 4K tokens | ~4.2 GB | ~46.2 GB |
| 8K tokens | ~8.4 GB | ~50.4 GB |
| 32K tokens | ~33.6 GB | ~75.6 GB |
| 128K tokens | ~134.4 GB | ~176.4 GB |
For 128K-context workloads you need 2-4x more VRAM than you'd expect from just the weights. llama.cpp supports KV-cache quantization (-ctk q8_0 -ctv q8_0) which cuts cache size roughly in half with minimal quality loss — use it if you're pushing context limits.
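The table's linear scaling can be reproduced from the per-4K figure above (42 GB of weights plus ~4.2 GB of KV cache per 4,096 tokens of context — both this guide's approximations):

```shell
# Approximate total VRAM for Llama 3.1 70B q4_K_M at a given context length,
# using this guide's ~4.2 GB-per-4K-tokens KV-cache estimate.
ctx=32768
total=$(awk -v c="$ctx" 'BEGIN { printf "%.1f", 42 + 4.2 * c / 4096 }')
echo "total VRAM at ${ctx}-token context: ~${total} GB"   # matches the 32K row
```

Halve the 4.2 coefficient to model a q8_0-quantized KV cache.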
Which runtime should I use?
For single-user chat on one GPU, all three runtimes (Ollama, llama.cpp, vLLM) produce similar numbers within 10-15% of each other. Rule of thumb:
- Ollama — easiest. Good for anyone who doesn't want to manage models manually. Wraps llama.cpp.
- llama.cpp — direct control over quantization, offload, KV-cache precision. Where the LocalLLaMA community benchmarks its numbers.
- vLLM — production serving. Tensor parallelism, PagedAttention, continuous batching. CUDA-only.
For more: Ollama vs llama.cpp vs vLLM →.
Multi-GPU — does it help?
Sometimes. Two 24GB cards (48GB combined) comfortably run Llama 3.1 70B with room for a bigger KV cache, but tok/s does not double — expect 1.3-1.5x single-card speed because of PCIe overhead.
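With llama.cpp, splitting across two cards is one flag. A sketch (the model path is the one used later in this guide; `--tensor-split` takes per-GPU proportions, not gigabytes):

```shell
# Split the model evenly across two GPUs; skew the ratio (e.g. 3,2) if the cards differ.
./llama-cli -m ~/models/llama-3-1-70b-q4_k_m.gguf \
  -ngl 999 -c 8192 --tensor-split 1,1 \
  -p "Write a haiku about GPUs"
```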
Perf-per-dollar vs perf-per-watt
Shopping on pure tok/s is expensive. The "tok/s/$1k" column above is a better lens for budget-constrained builds. Apple Silicon dominates the perf-per-watt column by a wide margin — an M4 Max sustaining ~12 tok/s at ~60W is roughly 3-4x more efficient per watt than an RTX 5090 pushing 34 tok/s at 575W.
- Best price-performance under $1,500: RTX 4060 Ti 16GB (requires aggressive CPU offload even at q3)
- Best max speed, cost ignored: RTX PRO 6000 Blackwell (96GB) — the enterprise card that holds q4_K_M natively
- Best power efficiency: Apple M4 Max — silent, 60W, unified memory fits far more model than any consumer NVIDIA
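Both lenses are simple divisions. Reproducing them from the RTX 5090 and M4 Max figures quoted in this section (34 tok/s / $1,999 / 575W, and 12 tok/s sustained at ~60W — treated here as illustrative inputs):

```shell
# tok/s per $1,000 of MSRP and per 100W of power draw, from the figures above.
per_dollar=$(awk 'BEGIN { printf "%.1f", 34 / (1999/1000) }')
per_watt_5090=$(awk 'BEGIN { printf "%.1f", 34 / (575/100) }')
per_watt_m4max=$(awk 'BEGIN { printf "%.1f", 12 / (60/100) }')
echo "RTX 5090: ${per_dollar} tok/s/\$1k, ${per_watt_5090} tok/s/100W"
echo "M4 Max:   ${per_watt_m4max} tok/s/100W"
```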
Getting started — concrete commands
With Ollama:
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:70b
ollama run llama3.1:70b
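Ollama also exposes a local REST API on port 11434, which is how scripts and editor plugins usually talk to it. A sketch (the tag `llama3.1:70b` is assumed — match it to whatever model tag you actually pulled):

```shell
# Non-streaming completion against the local Ollama server.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b",
  "prompt": "Write a haiku about GPUs",
  "stream": false
}'
```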
With llama.cpp (more control):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # NVIDIA; Metal is enabled by default on macOS — see the llama.cpp docs for AMD/HIP builds
cmake --build build --config Release -j
# Download q4_K_M from HuggingFace (bartowski or TheBloke maintain good GGUFs)
./build/bin/llama-cli -m ~/models/llama-3-1-70b-q4_k_m.gguf -n 512 -c 4096 -ngl 999 \
-p "Write a haiku about GPUs"
Expect first-token latency of 1-3 seconds (prefill), then sustained generation at the numbers in the perf table.
Bottom line
For Llama 3.1 70B at q4_K_M in 2026, a Mac with 64 GB or more of unified memory is the practical entry point, the RTX PRO 6000 Blackwell (96GB) is the ceiling for single-card builds, and the Apple M3 Ultra is the quiet-and-efficient alternative if you care more about power draw than raw tok/s.
Buy more VRAM than you think you need. Context-window growth, longer conversation histories, and KV-cache pressure eat VRAM faster than model weights do. A 32GB card is a materially better long-term bet than a 24GB card for anything in this class.
