This tutorial walks you through running Llama 3.1 8B on an AMD Radeon RX 7900 XTX. Exact commands, expected tokens-per-second, and the tradeoffs you should know before starting.
Does it fit?
AMD Radeon RX 7900 XTX has 24 GB of GDDR6. Llama 3.1 8B at q4_K_M wants ~4.9 GB of it for weights alone.
Verdict: ✅ Fits natively. Expect ~60–80 tok/s of generation throughput after warm-up (community-reported).
Install Ollama (the easy path)
Ollama auto-detects NVIDIA (CUDA), AMD (ROCm on Linux), and Apple Silicon (Metal). On AMD you need a working ROCm install; the RX 7900 XTX (gfx1100) is on Ollama's officially supported list, so no HSA_OVERRIDE_GFX_VERSION workaround is required.
Install llama.cpp (more control)
llama.cpp gives you flag-level control over quantization, context length, and layer offload. Build from source:
-ngl 999 offloads all layers to the GPU.
Expected performance
- Community reports from LocalLLaMA suggest ~50-80 tok/s on this class of hardware.
For single-user chat these speeds feel instant. For RAG pipelines where the model re-reads long context, prefill throughput matters more than generation tok/s.
Common issues
"out of memory" on the first prompt: reduce context length (-c 2048) or quantization (q4_K_S instead of q4_K_M).
Slow first token but fast generation: that's prompt processing ("prefill"). Normal — blame the KV cache building. Subsequent messages in the same session will be snappy.
Frequent swapping / system hangs: VRAM is full AND system RAM is full. Close Chrome. Add more DDR5.
Related
AMD Radeon RX 7900 XTX specs: 24GB memory, 355W TDP, 2022 launch. MSRP $999.
Does it fit? Full quantization matrix
Weight-only VRAM for Llama 3.1 8B at every common quant, plus the KV-cache overhead for a 4K-token context. KV cache scales linearly with context — see the context-length table further down.
| Quant | Weights | + KV @ 4K ctx | Total | Fits on this GPU? | Quality loss |
|---|---|---|---|---|---|
| q2_K_S | 2.4 GB | 0.6 GB | 3.0 GB | ✅ | Severe (15-25%) |
| q3_K_M | 3.6 GB | 0.6 GB | 4.2 GB | ✅ | Noticeable (5-8%) |
| q4_K_M | 4.8 GB | 0.6 GB | 5.4 GB | ✅ | Minimal (1-3%) — community default |
| q5_K_M | 5.6 GB | 0.6 GB | 6.2 GB | ✅ | <1% |
| q6_K | 6.4 GB | 0.6 GB | 7.0 GB | ✅ | Effectively lossless |
| q8_0 | 8.8 GB | 0.6 GB | 9.4 GB | ✅ | Inference-lossless |
| fp16 | 16.0 GB | 0.6 GB | 16.6 GB | ✅ | Baseline (original precision) |
Values are approximate — actual footprint depends on batch size, whether the KV cache is quantized (-ctk q8_0 -ctv q8_0 in llama.cpp halves it), and whether you reserve VRAM for a display. Rule of thumb: budget 5-10% headroom on top of the table.
How public benchmarks show and compared
Every tok/s, FPS, and synthetic score in this article is pulled live from the SpecPicks benchmark catalog (hardware_specs, ai_benchmarks, synthetic_benchmarks). We cite the source_name on each row — the vast majority are community-reported numbers from r/LocalLLaMA and llama.cpp GitHub Discussions, with synthetic scores from PassMark, Phoronix, and Tom's Hardware's GPU hierarchy.
Where DB rows exist for a specific model+quant+GPU combination, we quote the number exactly. Where they don't, we fall back to published spec-sheet values (VRAM capacity, TDP, memory bandwidth) plus the closest community-verified ballpark — clearly flagged as a ballpark, not a measurement. We prefer "we don't know" over a fabricated number.
SpecPicks does not run paid hardware review cycles; we aggregate. If you see a number you can improve on, pull-request the row.
Measured tok/s on this GPU
Live data from ai_benchmarks for AMD Radeon RX 7900 XTX, filtered to the Llama 3.1 8B family where available:
| Model | Quant | Runtime | Gen tok/s | VRAM used | Source |
|---|---|---|---|---|---|
| _No direct matches in the DB yet — see community thread below_ |
For the full tok/s matrix on this card across every model we've logged, see the AMD Radeon RX 7900 XTX benchmark page.
Context length and VRAM — the hidden cost
KV cache grows linearly with context. Here's the approximate overhead on top of 4.8 GB of q4_K_M weights for Llama 3.1 8B:
| Context | KV cache | Total VRAM |
|---|---|---|
| 2K tokens | ~0.3 GB | ~5.1 GB |
| 4K tokens | ~0.6 GB | ~5.4 GB |
| 8K tokens | ~1.3 GB | ~6.1 GB |
| 32K tokens | ~5.1 GB | ~9.9 GB |
| 128K tokens | ~20.5 GB | ~25.3 GB |
For long-context workloads (32K and above) on consumer hardware, use llama.cpp's KV-cache quantization — -ctk q8_0 -ctv q8_0 roughly halves cache footprint with sub-1% quality loss. This is the single biggest VRAM-saving flag for long context.
Which runtime wins on this hardware?
Three mainstream runtimes target AMD Radeon RX 7900 XTX; the right one depends on your workload:
- Ollama — easiest. Auto-detects CUDA, handles model downloads, exposes an OpenAI-compatible API out of the box. Wraps llama.cpp; you give up fine-grained control for zero setup.
- llama.cpp — direct flag-level control over quant, context, KV-cache precision, batch size, split layers across GPUs. Where the LocalLLaMA community benchmarks its numbers (see the Apple-Silicon megathread #4167 for reference tok/s across M-series chips).
- vLLM — built for production serving. Tensor parallelism, PagedAttention, continuous batching. Linux + NVIDIA CUDA primary target. If you're not serving multiple concurrent users, the overhead isn't worth it.
For head-to-head numbers and install commands across all three, see our Ollama vs llama.cpp vs vLLM guide.
Troubleshooting — three failure modes and fixes
1. First token takes 5-30 seconds, then generation is fast. That's normal prefill: the model is processing your prompt before it can start generating. On a long prompt (4K+ tokens) prefill dominates the first-token latency. If it's unexpectedly slow, check that you actually offloaded layers to the GPU — rocm-smi (or radeontop) should show near-100% utilisation during prefill. If utilisation is flat, your inference is running on CPU.
2. "Out of memory" halfway through a long chat. The KV cache grew past what the card can hold. Drop to a smaller quant (q4_K_M → q3_K_M), cut -c context length, or enable KV-cache quantization (-ctk q8_0 -ctv q8_0 in llama.cpp). On Ollama set num_ctx smaller in your Modelfile.
3. Tok/s is ~30% of what LocalLLaMA threads report. Three usual suspects: (a) power/thermal throttling — check sustained clocks during a long prompt; (b) PCIe x8 or x4 link when you expected x16 — check with lspci -vv | grep -i LnkSta or rocm-smi --showbus; (c) running a CPU-only llama.cpp binary by accident. Rebuild with make GGML_HIPBLAS=1 -j and confirm the HIP backend loads at startup (the banner line should mention ROCm/HIP and the gfx target).
Frequently asked questions
Can I run Llama 3.1 8B on AMD Radeon RX 7900 XTX without offloading to CPU?
Yes at q4_K_M if the model weights plus KV cache fit in the card's 24 GB GDDR6. For Llama 3.1 8B that's approximately 4.8 GB of weights plus 0.5-2 GB of KV cache depending on context length.
What quantization should I use on AMD Radeon RX 7900 XTX?
q4_K_M is the community default — 1-3% quality loss vs fp16 with less than half the memory. Drop to q3_K_M only when VRAM is tight. Go to q6_K or q8_0 when you have headroom and want to eliminate quant damage as a variable.
Is AMD Radeon RX 7900 XTX bottlenecked by memory or compute for this model?
Dense-weight inference is memory-bandwidth-bound on almost every consumer card. The RX 7900 XTX has ~960 GB/s of memory bandwidth, so the sustained tok/s ceiling ≈ memory bandwidth ÷ weight bytes read per token. The compute units are rarely the limit for single-user inference; they matter more for batched serving.
Does multi-GPU help for this model?
For a 8B model, usually no. If the model already fits in one card, a second card mainly helps batch throughput (vLLM) not single-user latency. Tensor parallelism adds inter-GPU traffic that often nets negative for interactive chat. Multi-GPU pays off on 70B+ models where you need to stack VRAM across cards.
Where can I report or compare my own tok/s numbers?
The r/LocalLLaMA community benchmark threads are the canonical place. llama.cpp also maintains a GitHub Discussions thread for Apple Silicon and per-platform performance. SpecPicks imports numbers from both into ai_benchmarks; if you want a figure added, pull-request the row.
Sources
- r/LocalLLaMA (community tok/s threads)
- llama.cpp GitHub Discussions #4167 — Apple Silicon benchmark thread
- Tom's Hardware GPU Hierarchy
