How to run Llama 3.1 70B on NVIDIA GeForce RTX 3090

This tutorial walks you through running Llama 3.1 70B on an NVIDIA GeForce RTX 3090. It covers the exact commands, the expected tokens per second, and the tradeoffs you should know before starting.

Does it fit?

The NVIDIA GeForce RTX 3090 has 24 GB of GDDR6X. Llama 3.1 70B at q4_K_M needs ~42 GB for weights alone, about 1.75× what the card holds.

Verdict: ⚠️ Does not fit in VRAM. You'll need heavy CPU offload (llama.cpp with roughly 20 of the model's 80 layers on the GPU and the rest on the CPU), or a smaller quant such as q3_K_S to fit more layers on the card. Expect 1-3 tok/s with offload.
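To make the split concrete, here is a rough sketch of the arithmetic behind the verdict. It assumes the ~42 GB of q4_K_M weights are spread evenly across the 80 layers, which is a simplification (the embedding and output tensors are not the same size as a transformer layer). The function name is illustrative only.

```python
def split_weights(total_weights_gb: float, n_layers: int, gpu_layers: int) -> tuple[float, float]:
    """Return (GB of weights on GPU, GB of weights on CPU) for an even per-layer split."""
    if not 0 <= gpu_layers <= n_layers:
        raise ValueError("gpu_layers must be between 0 and n_layers")
    per_layer_gb = total_weights_gb / n_layers   # assumption: all layers are the same size
    on_gpu = per_layer_gb * gpu_layers
    return on_gpu, total_weights_gb - on_gpu


gpu_gb, cpu_gb = split_weights(total_weights_gb=42.0, n_layers=80, gpu_layers=20)
print(f"GPU: {gpu_gb:.1f} GB, CPU: {cpu_gb:.1f} GB")
```

With 20 layers on the card, about 10.5 GB of weights sit in VRAM and about 31.5 GB have to live in system RAM, so the host needs comfortably more than 32 GB of memory.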

Install Ollama (the easy path)

bash
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:70b
ollama run llama3.1:70b

Ollama handles GPU detection automatically on NVIDIA (via CUDA), AMD (via ROCm on Linux), and Apple Silicon (via Metal). No manual configuration needed.
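Once `ollama run` works, the same model can be called over Ollama's local HTTP API, which also reports the counters needed to compute your own tok/s. The sketch below assumes the server is listening on its default port (11434) and uses the `/api/generate` endpoint with its `eval_count` and `eval_duration` (nanoseconds) response fields. The helper names are illustrative, and nothing is sent until you call `generate` yourself.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, model: str = "llama3.1:70b", num_ctx: int = 4096) -> urllib.request.Request:
    """Build a non-streaming generate request; num_ctx sets the context window."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(response: dict) -> float:
    """Generation speed from the eval_count / eval_duration (nanoseconds) fields."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(prompt: str) -> dict:
    """Send the prompt to a running Ollama server and return the parsed JSON reply."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)


# With the server running:
#   reply = generate("Write a haiku about GPUs")
#   print(reply["response"], f"{tokens_per_second(reply):.1f} tok/s")
```

Lowering `num_ctx` in the request is a quick way to test whether a smaller context window avoids memory pressure before changing anything permanently.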

Install llama.cpp (more control)

llama.cpp gives you flag-level control over quantization, context length, and layer offload. Build from source:

bash
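git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Current llama.cpp builds with CMake (the old `make GGML_CUDA=1` path is no longer supported)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Download a quantized GGUF first (community uploads such as bartowski's on Hugging Face)
# -ngl: layers on the GPU, -c: context length, -n: max tokens to generate
./build/bin/llama-cli -m ~/models/llama-3-1-70b-q4_k_m.gguf \
  -n 512 -c 4096 \
  -ngl 20 \
  -p "Write a haiku about GPUs"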

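With 24 GB of VRAM, start with ~20 layers on the GPU (-ngl 20); the remaining layers run on the CPU. Generation will be far slower than an all-GPU setup (see Expected performance), but q4_K_M output quality is noticeably better than what you get by dropping to a much smaller quant just to fit more on the card.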
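If you want to tune -ngl rather than take ~20 on faith, a rough upper bound can be estimated from the VRAM budget. This sketch assumes equal-sized layers, reserves the full 4K-context KV cache on the card (5.6 GB, the figure used in this article's quantization matrix), and keeps 10% headroom. All three are simplifying assumptions, and the function name is hypothetical.

```python
import math


def max_gpu_layers(vram_gb: float, weights_gb: float, n_layers: int,
                   kv_cache_gb: float, headroom_frac: float = 0.10) -> int:
    """Upper bound on -ngl: layers that fit after reserving KV cache and headroom."""
    usable_gb = vram_gb * (1.0 - headroom_frac) - kv_cache_gb
    per_layer_gb = weights_gb / n_layers          # assumption: all layers are the same size
    layers = math.floor(usable_gb / per_layer_gb)
    return max(0, min(n_layers, layers))          # clamp to a valid layer count


print(max_gpu_layers(vram_gb=24.0, weights_gb=42.0, n_layers=80, kv_cache_gb=5.6))
```

Under these assumptions the bound is about 30 layers, so ~20 is a conservative starting point. A reasonable approach is to raise -ngl a few layers at a time while watching nvidia-smi, and back off as soon as you hit an out-of-memory error.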

Expected performance

  • Community reports from LocalLLaMA for a single 3090 with CPU offload on 70B q4_K_M cluster around 1-3 tok/s. To reach 10+ tok/s on this model you generally need two 3090s (or a card with ≥48 GB VRAM).

If you need more speed, more VRAM is the answer: consider upgrading to a card with more headroom.

Common issues

"out of memory" on the first prompt: reduce context length (-c 2048), offload fewer layers (lower -ngl), or switch to a slightly smaller quant (q4_K_S instead of q4_K_M).

Slow first token but fast generation: that's prompt processing ("prefill"). Normal — blame the KV cache building. Subsequent messages in the same session will be snappy.

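Frequent swapping / system hangs: VRAM is full AND system RAM is full. At q4_K_M with 20 layers on the GPU, roughly 31 GB of weights stay in system RAM, so close Chrome and other memory-heavy apps, or add more RAM.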


NVIDIA GeForce RTX 3090 specs: 24GB memory, 350W TDP, 2020 launch. MSRP $1,499.

Does it fit? Full quantization matrix

Weight-only VRAM for Llama 3.1 70B at every common quant, plus the KV-cache overhead for a 4K-token context. KV cache scales linearly with context — see the context-length table further down.

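| Quant | Weights | + KV @ 4K ctx | Total | Fits on this GPU? | Quality loss |
|---|---|---|---|---|---|
| q2_K_S | 21.0 GB | 5.6 GB | 26.6 GB | No | Severe (15-25%) |
| q3_K_M | 31.5 GB | 5.6 GB | 37.1 GB | No | Noticeable (5-8%) |
| q4_K_M | 42.0 GB | 5.6 GB | 47.6 GB | No | Minimal (1-3%), community default |
| q5_K_M | 49.0 GB | 5.6 GB | 54.6 GB | No | <1% |
| q6_K | 56.0 GB | 5.6 GB | 61.6 GB | No | Effectively lossless |
| q8_0 | 77.0 GB | 5.6 GB | 82.6 GB | No | Inference-lossless |
| fp16 | 140.0 GB | 5.6 GB | 145.6 GB | No | Baseline (original precision) |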

Values are approximate — actual footprint depends on batch size, whether the KV cache is quantized (-ctk q8_0 -ctv q8_0 in llama.cpp halves it), and whether you reserve VRAM for a display. Rule of thumb: budget 5-10% headroom on top of the table.
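The rule of thumb can be applied mechanically. The sketch below takes the weight sizes and the 4K KV-cache figure straight from the table, pads each total by 10% headroom, and compares it with the 3090's 24 GB. The names are illustrative, and the output is only as accurate as the table's approximations.

```python
VRAM_GB = 24.0       # RTX 3090
KV_4K_GB = 5.6       # KV cache at 4K context, from the table
WEIGHTS_GB = {       # weight-only sizes, from the table
    "q2_K_S": 21.0, "q3_K_M": 31.5, "q4_K_M": 42.0, "q5_K_M": 49.0,
    "q6_K": 56.0, "q8_0": 77.0, "fp16": 140.0,
}


def budget_gb(quant: str, kv_gb: float = KV_4K_GB, headroom_frac: float = 0.10) -> float:
    """Weights plus KV cache, padded by the rule-of-thumb headroom."""
    return (WEIGHTS_GB[quant] + kv_gb) * (1.0 + headroom_frac)


for quant in WEIGHTS_GB:
    need = budget_gb(quant)
    verdict = "fits" if need <= VRAM_GB else f"short by {need - VRAM_GB:.1f} GB"
    print(f"{quant:8s} {need:6.1f} GB  {verdict}")
```

Passing `kv_gb=KV_4K_GB / 2` models the halved cache from `-ctk q8_0 -ctv q8_0`. With the 10% headroom, even q2_K_S still comes out a couple of GB over the card.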

How public benchmarks are sourced and compared

Every tok/s, FPS, and synthetic score in this article is pulled live from the SpecPicks benchmark catalog (hardware_specs, ai_benchmarks, synthetic_benchmarks). We cite the source_name on each row — the vast majority are community-reported numbers from r/LocalLLaMA and llama.cpp GitHub Discussions, with synthetic scores from PassMark, Phoronix, and Tom's Hardware's GPU hierarchy.

Where DB rows exist for a specific model+quant+GPU combination, we quote the number exactly. Where they don't, we fall back to published spec-sheet values (VRAM capacity, TDP, memory bandwidth) plus the closest community-verified ballpark — clearly flagged as a ballpark, not a measurement. We prefer "we don't know" over a fabricated number.

SpecPicks does not run paid hardware review cycles; we aggregate. If you see a number you can improve on, pull-request the row.

Measured tok/s on this GPU

Live data from ai_benchmarks for NVIDIA GeForce RTX 3090, filtered to the Llama 3.1 70B family where available:

| Model | Quant | Runtime | Gen tok/s | VRAM used | Source |
|---|---|---|---|---|---|

No direct matches in the DB yet; see the community thread below.

For the full tok/s matrix on this card across every model we've logged, see the NVIDIA GeForce RTX 3090 benchmark page.

Context length and VRAM — the hidden cost

KV cache grows linearly with context. Here's the approximate overhead on top of 42.0 GB of q4_K_M weights for Llama 3.1 70B:

| Context | KV cache | Total VRAM |
|---|---|---|
| 2K tokens | ~2.8 GB | ~44.8 GB |
| 4K tokens | ~5.6 GB | ~47.6 GB |
| 8K tokens | ~11.2 GB | ~53.2 GB |
| 32K tokens | ~44.8 GB | ~86.8 GB |
| 128K tokens | ~179.2 GB | ~221.2 GB |

For long-context workloads (32K-128K) on consumer hardware, use llama.cpp's KV-cache quantization — -ctk q8_0 -ctv q8_0 roughly halves cache footprint with sub-1% quality loss. This is the single biggest VRAM-saving flag for long context.
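The linear growth follows from the cache layout: every token stores one key and one value vector per layer. Here is a minimal sketch of that formula, assuming Llama 3.1 70B's published shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and treating q8_0 as roughly 1 byte per element. The function name and defaults are illustrative. Under these assumptions the estimate comes out lower than the table above (about 1.3 GB at 4K in fp16), so the table's figures are best read as a conservative budget.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    """K and V for every layer: 2 * layers * kv_heads * head_dim * bytes, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


for ctx in (2048, 4096, 8192, 32768, 131072):
    fp16_gb = kv_cache_bytes(ctx) / 1e9
    # q8_0 approximated as 1 byte per element (per-block scale factors ignored)
    q8_gb = kv_cache_bytes(ctx, bytes_per_elem=1.0) / 1e9
    print(f"{ctx:>7} tokens: fp16 ~{fp16_gb:.1f} GB, q8_0 ~{q8_gb:.1f} GB")
```

Whatever per-token constant you use, doubling the context doubles the cache and moving from fp16 to q8_0 roughly halves it. That is why the KV-cache flags matter most at 32K and beyond.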

Which runtime wins on this hardware?

Three mainstream runtimes target NVIDIA GeForce RTX 3090; the right one depends on your workload:

  • Ollama: easiest. Auto-detects CUDA, handles model downloads, and exposes an OpenAI-compatible API out of the box. It wraps llama.cpp; you give up fine-grained control for zero setup.
  • llama.cpp: direct flag-level control over quant, context, KV-cache precision, batch size, and how layers are split across GPUs. It is also where the LocalLLaMA community benchmarks its numbers (see the Apple-Silicon megathread #4167 for reference tok/s across M-series chips).
  • vLLM: built for production serving. Tensor parallelism, PagedAttention, continuous batching. Linux + NVIDIA CUDA is the primary target. If you're not serving multiple concurrent users, the overhead isn't worth it.

For head-to-head numbers and install commands across all three, see our Ollama vs llama.cpp vs vLLM guide.

Troubleshooting — three failure modes and fixes

1. First token takes 5-30 seconds, then generation is fast. That's normal prefill: the model is processing your prompt before it can start generating. On a long prompt (4K+ tokens) prefill dominates the first-token latency. If it's unexpectedly slow, check that you actually offloaded layers to the GPU — nvidia-smi dmon should show near-100% GPU utilisation during prefill. If utilisation is flat, your inference is running on CPU.

2. "Out of memory" halfway through a long chat. The KV cache grew past what the card can hold. Drop to a smaller quant (q4_K_M → q3_K_M), cut -c context length, or enable KV-cache quantization (-ctk q8_0 -ctv q8_0 in llama.cpp). On Ollama set num_ctx smaller in your Modelfile.

3. Tok/s is ~30% of what LocalLLaMA threads report. Three usual suspects: (a) power/thermal throttling — check sustained clocks during a long prompt; (b) PCIe x8 or x4 link when you expected x16 — nvidia-smi --query-gpu=pcie.link.width.current --format=csv; (c) running the wrong binary, e.g. a non-CUDA llama.cpp build on an NVIDIA card. Rebuild with GGML_CUDA=1 and confirm.
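The PCIe check in item 3 is easy to script. This sketch wraps the same nvidia-smi query, adding `noheader` to the format so that only the numbers are printed, and flags any GPU whose link is narrower than x16. The helper names are illustrative, and the query function needs the NVIDIA driver tools on PATH.

```python
import subprocess


def parse_link_widths(csv_text: str) -> list[int]:
    """Parse one PCIe link width per line, as printed with --format=csv,noheader."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def narrow_links(widths: list[int], expected: int = 16) -> list[int]:
    """Return the indices of GPUs whose current PCIe link is narrower than expected."""
    return [i for i, w in enumerate(widths) if w < expected]


def query_link_widths() -> list[int]:
    """Run nvidia-smi and return the current link width for each GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=pcie.link.width.current", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_link_widths(out)


# On a machine with the NVIDIA driver installed:
#   print(narrow_links(query_link_widths()))   # an empty list means every GPU is at x16
```

Link width matters more than usual with CPU offload, since the runtime may move data across the bus during inference.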

Frequently asked questions

Can I run Llama 3.1 70B on NVIDIA GeForce RTX 3090 without offloading to CPU?

No. 24 GB is not enough for q4_K_M weights alone (~42 GB), let alone the KV cache. Dropping to a smaller quant does not fix it: even q2_K_S totals ~26.6 GB with a 4K context. Full-GPU inference needs a card (or pair of cards) with substantially more VRAM. On the 3090 you must use CPU offload or step down to a much smaller model.

What quantization should I use on NVIDIA GeForce RTX 3090?

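q4_K_M is the community default: 1-3% quality loss vs fp16 at less than a third of the memory (42 GB vs 140 GB). Drop to q3_K_M only when VRAM is tight. Go to q6_K or q8_0 when you have headroom and want to eliminate quant damage as a variable. For Llama 3.1 70B on this card every quant in the table needs CPU offload, so a larger quant mainly means more layers on the CPU and lower tok/s.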

Is NVIDIA GeForce RTX 3090 bottlenecked by memory or compute for this model?

Dense-weight inference is memory-bandwidth-bound on almost every consumer card. The RTX 3090's memory bandwidth is 936 GB/s, so the sustained tok/s ceiling ≈ memory bandwidth ÷ weight bytes per token. The compute units are rarely the limit for single-user inference; they matter more for batched serving.
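The ceiling formula extends naturally to a GPU/CPU split: each token has to read the GPU-resident weights at GPU bandwidth and the CPU-resident weights at system-RAM bandwidth, and the two times add up. In the sketch below, the 936 GB/s and 42 GB figures come from this article, the 10.5/31.5 GB split assumes 20 of 80 equal-sized layers on the card, and the 60 GB/s system-RAM bandwidth is an assumed placeholder you should replace with your own platform's number.

```python
def ceiling_tok_s(gpu_weights_gb: float, gpu_bw_gbs: float,
                  cpu_weights_gb: float = 0.0, cpu_bw_gbs: float = 1.0) -> float:
    """Upper bound on tok/s when every weight byte is read once per generated token."""
    seconds_per_token = gpu_weights_gb / gpu_bw_gbs + cpu_weights_gb / cpu_bw_gbs
    return 1.0 / seconds_per_token


# Hypothetical card with 936 GB/s that could hold all 42 GB of q4_K_M weights
print(f"all on GPU:  {ceiling_tok_s(42.0, 936.0):.1f} tok/s")
# 20 of 80 layers on the 3090, the rest read from system RAM at an assumed 60 GB/s
print(f"20/80 split: {ceiling_tok_s(10.5, 936.0, 31.5, 60.0):.1f} tok/s")
```

These are ceilings, not predictions, but the split case landing near 2 tok/s is consistent with the 1-3 tok/s community reports. With offload, system-RAM bandwidth rather than the GPU sets the pace.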

Does multi-GPU help for this model?

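For Llama 3.1 70B, yes. The model does not fit in a single 24 GB card, so a second 3090 lets you stack VRAM across cards (48 GB total) and keep most or all of the q4_K_M weights on GPU instead of spilling to system RAM. That is the jump from 1-3 tok/s to 10+ tok/s noted under Expected performance. The picture is different when a model already fits in one card: a second card then mainly helps batch throughput (vLLM), not single-user latency, and tensor parallelism adds inter-GPU traffic that often nets negative for interactive chat.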

Where can I report or compare my own tok/s numbers?

The r/LocalLLaMA community benchmark threads are the canonical place. llama.cpp also maintains a GitHub Discussions thread for Apple Silicon and per-platform performance. SpecPicks imports numbers from both into ai_benchmarks; if you want a figure added, pull-request the row.

Sources

  1. r/LocalLLaMA (community tok/s threads)
  2. llama.cpp GitHub Discussions #4167 — Apple Silicon benchmark thread
  3. Tom's Hardware GPU Hierarchy

Frequently asked questions

What are the main tradeoffs of running Llama 3.1 70B on an NVIDIA GeForce RTX 3090?

Running Llama 3.1 70B on an RTX 3090 requires compromises because the card has only 24 GB of VRAM. You need to offload layers to the CPU, and you may also choose a lower quantization level such as q3_K_S. These adjustments reduce VRAM usage but cost speed and model quality. Expect slower token generation than on GPUs with more VRAM.

How does context length affect VRAM usage for Llama 3.1 70B?

VRAM usage increases linearly with context length due to the KV cache. For example, a 4K-token context adds ~5.6 GB to the model's base VRAM requirement. Longer contexts, such as 32K or 128K tokens, push the total to roughly 87 GB or 221 GB, making them impractical for consumer GPUs without optimizations like KV-cache quantization.

What are the advantages of using Ollama over llama.cpp for this setup?

Ollama simplifies setup by automatically detecting hardware, managing model downloads, and providing an OpenAI-compatible API. It is ideal for users prioritizing ease of use over fine-grained control. In contrast, llama.cpp offers detailed control over quantization, context length, and layer offloading, making it better suited for advanced users or benchmarking.

Why is token generation slower for the first token compared to subsequent ones?

The slower first token is due to the prefill phase, where the model processes the input prompt and builds the KV cache. This is computationally intensive and can take several seconds, especially for long prompts. Subsequent tokens are faster because the KV cache is reused, reducing processing overhead.

What should I do if I encounter 'out of memory' errors during inference?

To address 'out of memory' errors, reduce the context length (e.g., from 4K to 2K tokens), switch to a smaller quantization level (e.g., q4_K_M to q3_K_M), or enable KV-cache quantization in llama.cpp. These adjustments lower the memory footprint, allowing the model to run within the GPU's VRAM limits.

— SpecPicks Editorial · Last verified 2026-05-18
