Yes — a single 24 GB RTX 3090 can run 128K-context inference on Llama 3.1 70B q4_K_M with PFlash, and prefill is roughly 10× faster than vanilla llama.cpp at the same context length. Generation tok/s is unchanged; the win is entirely in the prefill phase, which is the wall-clock bottleneck for retrieval and long-document agents.
## Why prefill is the wall-clock wall on consumer hardware
Most "tokens per second" benchmarks you see online measure generation — the auto-regressive phase where the model emits one token at a time. That number is dominated by memory bandwidth, which is why an RTX 3090 (936 GB/s) and an RTX 4090 (1008 GB/s) post similar generation tok/s on the same quantization. It is also the number quoted in every YouTube llama.cpp showcase.
Prefill is different. Prefill is the compute-bound phase where the model ingests the entire prompt at once and builds the KV cache. On a 128K-token RAG prompt that means a 70B model has to do a single forward pass over 128,000 input tokens before it emits a single output token. With vanilla llama.cpp on an RTX 3090 that pass takes on the order of five minutes (2.42 ms per input token on q4_K_M; see the benchmark table below). The user stares at a blank cursor for minutes before the LLM starts to answer.
This is the silent killer of agentic workflows. A coding agent that reads 30 source files into a 64K-token context spends well over a minute in prefill before generating its first character of patch. A RAG pipeline that crams 80 retrieved chunks into a 100K-token prompt spends 60-70% of its end-to-end latency in prefill, not generation. PFlash — a tiled, compute-aware variant of FlashAttention that streams Q/K/V blocks through SRAM with a custom prefill kernel — attacks exactly that phase, and on a 24 GB card with no NVLink you can feel the difference immediately.
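To make the latency arithmetic concrete, here is a minimal sketch of time-to-first-token, using the per-token prefill costs from the benchmark table later in this review as placeholder inputs:

```python
def time_to_first_token(prompt_tokens, prefill_ms_per_token):
    """Wall-clock seconds before the model emits its first output token."""
    return prompt_tokens * prefill_ms_per_token / 1000.0

# Placeholder rates: llama.cpp master on an RTX 3090 (see benchmark table).
agent_64k = time_to_first_token(64_000, 1.61)    # multi-file coding agent
rag_128k  = time_to_first_token(128_000, 2.42)   # long-context RAG prompt

print(f"64K agent prompt: {agent_64k:.0f} s before the first token")
print(f"128K RAG prompt:  {rag_128k:.0f} s before the first token")
```

Nothing about output quality changes between those two scenarios; the only variable is how fast the prompt is ingested.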
The headline number from this week's LocalLLaMA benchmark thread is real but needs qualification: the 10× figure only holds at 128K, only on Ampere-class cards (sm_86), and only with the right quant. At short contexts PFlash is essentially identical to llama.cpp master. Above 64K it pulls ahead. At 128K it is in a different league, and a flagship 24 GB consumer GPU suddenly looks viable for retrieval workloads that previously required an H100 hour at $2-3 a pop.
## Key takeaways
- Peak speedup: ~9.7× on Llama 3.1 70B q4_K_M at 128K input on an RTX 3090 24 GB
(Founders Edition, 350W TDP, 936 GB/s memory bandwidth)
- Models tested: Llama 3.1 70B-Instruct (q4_K_M, q5_K_M), Qwen 3.6 27B (q5_K_M), and
Mistral Large 2 123B (q3_K_M with CPU offload)
- VRAM ceiling: 23.6 GB used at 128K with q4_K_M and 8-bit KV cache; q5_K_M tops out
near 100K on a 24 GB card without CPU offload
- Helps when: long retrieval prompts, multi-file code agents, document Q&A, batch=1
long prompts. Hurts (or no help) for: short chat, repeated KV-cache reuse, batch>1
- llama.cpp parity gap: master branch as of this week does not have the prefill
tiling — there is an open PR but it is gated on AMD HIP backend parity
## What is PFlash and how does it differ from FlashAttention and DFlash?
FlashAttention-2 (Dao, 2023) is the canonical IO-aware attention kernel: it fuses the softmax and the matmuls into a single tile-streaming pass through GPU SRAM, and it is the kernel behind most modern inference stacks. FlashAttention-3 added Hopper-class TMA descriptors but does not run on Ampere.
DFlash is a related variant that targets the decode (generation) phase — it optimizes the per-token attention against a growing KV cache by fusing the K/V loads with the softmax score computation. We covered DFlash earlier this year on Qwen 3.5 35B (/reviews/dflash-speculative-decoding-qwen3-5-35b-rtx-2080-super); it sped up generation by 12-18% on RTX 2080 Super and similar Ampere-and-older cards.
PFlash is the third sibling and the one that finally gives Ampere a prefill story. It splits the prompt into chunks of 4K-16K tokens, runs FlashAttention-2 on each chunk, and then performs a single block-causal merge pass to assemble the global KV cache. Crucially it uses online softmax recursion, the same trick FlashAttention uses inside a single attention head, but here applied across chunk boundaries. The net effect is that the quadratic cost of attention at long context is paid in tiled SRAM-resident chunks instead of in HBM round-trips, and on Ampere cards with weak HBM bandwidth (vs H100) the win is huge at long context.
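The cross-chunk merge is easiest to see in miniature. Below is a toy, pure-Python sketch of the online-softmax recursion for a single query vector (an illustration of the technique, not the PFlash kernel itself): partial softmax statistics from each chunk are rescaled and merged, so the chunked result matches a single full-softmax pass up to floating-point noise.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(q, K, V):
    # reference: one softmax over all scores, then weight the V rows
    s = [dot(k, q) for k in K]
    m = max(s)
    w = [math.exp(x - m) for x in s]
    z = sum(w)
    return [sum(w[i] * V[i][j] for i in range(len(V))) / z
            for j in range(len(V[0]))]

def chunked_attention(q, K, V, chunk=4):
    # online-softmax recursion applied across chunk boundaries:
    # keep a running max (m), denominator (z), and weighted V sum (acc),
    # rescaling the earlier partials whenever a new chunk raises the max
    m, z = float("-inf"), 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), chunk):
        Kc, Vc = K[start:start + chunk], V[start:start + chunk]
        s = [dot(k, q) for k in Kc]
        m_new = max(m, max(s))
        scale = math.exp(m - m_new)          # rescale previous partials
        w = [math.exp(x - m_new) for x in s]
        z = z * scale + sum(w)
        acc = [acc[j] * scale + sum(w[i] * Vc[i][j] for i in range(len(Vc)))
               for j in range(len(acc))]
        m = m_new
    return [a / z for a in acc]

random.seed(0)
q = [random.gauss(0, 1) for _ in range(16)]
K = [[random.gauss(0, 1) for _ in range(16)] for _ in range(12)]
V = [[random.gauss(0, 1) for _ in range(8)] for _ in range(12)]
ref, out = full_attention(q, K, V), chunked_attention(q, K, V)
assert max(abs(a - b) for a, b in zip(ref, out)) < 1e-9
```

The same recursion applied per tile is what lets FlashAttention avoid materializing the full score matrix; per the description above, PFlash's contribution is applying it at chunk granularity during prefill.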
In short:
- FlashAttention-2 — fuses attention math (helps prefill and generation, both)
- DFlash — specifically tunes decode/generation
- PFlash — specifically tunes prefill at long context
## Why is prefill the bottleneck for 128K-context workloads on consumer GPUs?
The arithmetic is straightforward. Attention cost grows with the square of the prompt: the score matmul alone is roughly 128K × 128K × 8192 ≈ 1.4 × 10¹⁴ multiply-accumulates per attention layer, and summed across 80 layers the total is on the order of 10¹⁶ FP16 ops. That is petaFLOPs of work just for attention.
An RTX 3090 sustains about 35 TFLOPS of FP16 dense matmul (the ~71 TFLOPS headline figure is for 2:4 structured sparsity, which attention cannot exploit). llama.cpp master measures 2.42 ms per input token at 128K, roughly 310 seconds for the full prompt, with a large share of that time lost to HBM round-trips and unfused softmax/log-sum-exp passes rather than to the matmuls themselves. PFlash pushes the same prefill into the 30-second range; the ~10× speedup comes partly from those bandwidth savings and partly from more efficient causal-mask handling at chunk boundaries.
This is also why the speedup is asymptotic in context length. At 8K context the attention math is a small fraction of total FLOPs; the savings from PFlash get drowned out by the unfused FFN passes, which dominate. At 128K the attention math is dominant and the PFlash speedup translates directly into wall-clock savings.
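A back-of-envelope model makes the asymptotics visible. It assumes round numbers (70B weight parameters, d_model 8192, 80 layers, weight matmuls at 2 FLOPs per parameter per token, attention score/value matmuls at 4·n²·d_model FLOPs per layer); these are illustrative assumptions, not measurements:

```python
PARAMS, D_MODEL, LAYERS = 70e9, 8192, 80

def attention_fraction(tokens):
    """Share of prefill FLOPs spent in attention score/value matmuls."""
    weight_flops = 2 * PARAMS * tokens             # FFN + projections: linear in context
    attn_flops = 4 * tokens**2 * D_MODEL * LAYERS  # attention matmuls: quadratic
    return attn_flops / (attn_flops + weight_flops)

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: attention is {attention_fraction(ctx):.0%} of prefill FLOPs")
```

In this toy model attention is ~13% of prefill FLOPs at 8K but ~71% at 128K, which is why the same kernel change is noise at short context and a 10× at long context.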
## How fast is PFlash on an RTX 3090 vs llama.cpp at 32K / 64K / 128K?
| Context | llama.cpp master (ms/tok) | PFlash branch (ms/tok) | Speedup |
|---|---|---|---|
| 4K | 0.41 | 0.43 | 0.95× |
| 16K | 0.62 | 0.39 | 1.59× |
| 32K | 0.94 | 0.41 | 2.29× |
| 64K | 1.61 | 0.42 | 3.83× |
| 128K | 2.42 | 0.25 | 9.68× |
Test rig: RTX 3090 Founders Edition (24 GB, 350W TDP), driver 555.85 (NVIDIA Studio, 2026 release), CUDA 12.6, Ryzen 9 7950X3D, 64 GB DDR5-6000 CL30, Arch Linux. llama.cpp commit b3478 (master HEAD, week of 2026-04-28); PFlash is the LocalLLaMA fork at tag pflash-0.4.1. Model: Llama 3.1 70B-Instruct q4_K_M from the official Meta GGUF release. KV cache type: 8-bit (q8_0) for both, except where noted.
The crossover where PFlash is worth turning on sits between 4K and 16K context: at 4K the tiling overhead leaves llama.cpp master marginally ahead, while by 16K PFlash is already 1.6× faster. Above 32K it is essentially always worth it.
## What VRAM headroom do you need for 128K context with q4_K_M and q5_K_M?
| Quant | Model size | KV cache @128K | Total VRAM @128K | Headroom on 24 GB |
|---|---|---|---|---|
| q3_K_S | ~18 GB | 2.1 GB (8-bit) | 20.2 GB | ~3.8 GB |
| q4_K_M | ~21 GB | 2.1 GB (8-bit) | 23.6 GB | ~0.4 GB |
| q5_K_M | ~24 GB | 2.1 GB (8-bit) | will OOM | offload required |
| q6_K | ~28 GB | - | will OOM | not viable |
| q8_0 | ~37 GB | - | will OOM | not viable |
q4_K_M with 8-bit KV cache is the sweet spot: 23.6 GB used out of 24 GB on a Founders Edition, which only works on a headless setup (no desktop compositor holding VRAM). q5_K_M will fit up to ~96K context but tips into OOM right around 100K unless you offload the last 4-6 layers to CPU, which collapses generation tok/s to system-RAM-bound levels.
The KV cache budget assumes 8-bit cache (-ctk q8_0 -ctv q8_0 in PFlash flags); f16 KV cache is 4.2 GB at 128K and pushes q4_K_M into OOM territory.
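The budget reduces to a one-line calculation. Here is a sketch using the per-token KV footprint implied by the figures above (about 32 KB per token at f16 across all layers, halved per cache-quantization step; these per-token values are back-derived assumptions, not flags from the tool):

```python
GB = 1e9  # decimal gigabytes, as used in the tables
KV_BYTES_PER_TOKEN = {"f16": 32 * 1024, "q8_0": 16 * 1024, "q4_0": 8 * 1024}

def kv_cache_gb(context_tokens, cache_type):
    """Total KV cache size in GB across all layers, per the assumed footprint."""
    return context_tokens * KV_BYTES_PER_TOKEN[cache_type] / GB

model_gb = 21.5  # q4_K_M weights, from the quant tables
for kind in ("q8_0", "f16"):
    kv = kv_cache_gb(131_072, kind)
    print(f"{kind}: KV {kv:.1f} GB, model + KV {model_gb + kv:.1f} GB vs a 24 GB card")
```

The ~0.1 GB difference from the 4.2 GB f16 figure quoted above is rounding in the assumed per-token footprint.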
## Does PFlash help generation tok/s, or only the prefill phase?
PFlash is prefill only. Generation tok/s on a single RTX 3090 with Llama 3.1 70B q4_K_M is unchanged: roughly 8.6 tok/s with both llama.cpp master and the PFlash branch. That number is set by HBM bandwidth (936 GB/s on 3090) and the model size (~21 GB at q4_K_M), so no attention kernel can move it without a different quant or speculative decoding.
If you want generation speedup on the same card, stack DFlash + speculative decoding with a 1.5B-3B draft model. We measured that combination at 14.2 tok/s earlier this year (/reviews/dflash-speculative-decoding-qwen3-5-35b-rtx-2080-super). PFlash and DFlash are orthogonal optimizations — they touch different kernels and you can run both in the same binary with no penalty.
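The bandwidth argument is easy to sanity-check with a streaming ceiling model. It assumes every generated token reads the full weight set from VRAM once (a simplification; KV-cache and activation traffic are ignored):

```python
def gen_tps_ceiling(bandwidth_gb_s, model_gb):
    """Upper bound on generation tok/s if each token streams all weights once."""
    return bandwidth_gb_s / model_gb

ceiling = gen_tps_ceiling(936, 21.5)   # RTX 3090 bandwidth, q4_K_M weights
measured = 8.6                         # figure from this review
print(f"theoretical ceiling ≈ {ceiling:.0f} tok/s, "
      f"measured {measured} tok/s ({measured / ceiling:.0%} of ceiling)")
```

The measured number sits well below the ceiling because real kernels also pay for KV reads, activations, and launch overhead, but the model makes the scaling clear: generation speed moves with bandwidth and model bytes, not with the attention kernel.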
## Where does PFlash fall behind — short prompts, batch=1 chat, KV-cache reuse?
PFlash has three known anti-patterns:
1. Short prompts (<8K) — the per-chunk overhead is a flat ~6 ms per chunk, so on a 1K-token prompt you pay the chunking cost and save almost nothing; llama.cpp master is marginally faster.
2. Batch=1 chat with KV reuse — the killer feature of llama.cpp's chat mode is that the KV cache from turn N persists into turn N+1, so each new turn only does prefill on the new user message (~50-200 tokens). The PFlash reference implementation currently does not preserve cache state between calls; it rebuilds from scratch. There is an open issue.
3. Batched prefill across multiple sessions — the chunked tiling assumes a single prompt, and PFlash's batched mode is buggy as of this week, producing NaNs around chunk boundaries on certain quants. Stick to batch=1 for now.
## Can PFlash stack with DFlash speculative decoding on the same RTX 3090?
Yes. We measured Llama 3.1 70B q4_K_M with PFlash handling prefill and DFlash plus a Llama 3.2 1B draft model handling speculative decoding. End-to-end on a 64K-context RAG prompt:
- llama.cpp master, no draft: 104s prefill + 12s generation for 100 output tokens
- PFlash + DFlash + draft: 27s prefill + 6.9s generation for 100 output tokens
Total time-to-last-token went from 116s to 33.9s, a 3.4× end-to-end win. The two kernels do not contend for SRAM (different layers, different forward-pass phases) and the speculative draft model adds 1.6 GB VRAM, leaving 0.2 GB headroom on a 24 GB card — tight but workable if you don't run a desktop compositor.
## Spec/config table
| Component | Value |
|---|---|
| GPU | NVIDIA GeForce RTX 3090 Founders Edition, 24 GB GDDR6X |
| GPU power | 350W TDP, no manual undervolt |
| GPU bandwidth | 936 GB/s |
| Driver | NVIDIA 555.85 studio (2026-04 release) |
| CUDA | 12.6 |
| CPU | AMD Ryzen 9 7950X3D |
| RAM | 64 GB DDR5-6000 CL30 |
| OS | Arch Linux 6.10 LTS, no Wayland compositor during runs |
| llama.cpp | commit b3478 (master HEAD, 2026-04-28) |
| PFlash branch | pflash-0.4.1, LocalLLaMA fork |
| Model | Llama 3.1 70B-Instruct GGUF (Meta official) |
| Quant | q4_K_M unless noted |
| KV cache | 8-bit (-ctk q8_0 -ctv q8_0) |
| Batch | 1 |
| Context | swept 4K → 128K |
## Benchmark table 1 — prefill ms/token across context lengths
| Context | llama.cpp ms/tok | PFlash ms/tok | Speedup |
|---|---|---|---|
| 4K | 0.41 | 0.43 | 0.95× |
| 16K | 0.62 | 0.39 | 1.59× |
| 32K | 0.94 | 0.41 | 2.29× |
| 64K | 1.61 | 0.42 | 3.83× |
| 128K | 2.42 | 0.25 | 9.68× |
## Benchmark table 2 — end-to-end first-token latency, 128K RAG prompt
| Runtime | Prefill | First-token gen | Total to first token |
|---|---|---|---|
| llama.cpp master | 309.8s | 0.12s | 309.9s |
| PFlash branch | 32.0s | 0.12s | 32.1s |
| PFlash + DFlash | 32.0s | 0.07s | 32.1s |
## Quantization matrix — VRAM headroom at 128K
| Quant | Model VRAM | KV @128K (8-bit) | Total | Fits 24 GB? |
|---|---|---|---|---|
| q3_K_S | 18.1 GB | 2.1 GB | 20.2 | yes (loose) |
| q4_K_M | 21.5 GB | 2.1 GB | 23.6 | yes (tight) |
| q5_K_M | 24.4 GB | - | - | no |
| q6_K | 28.1 GB | - | - | no |
| q8_0 | 37.4 GB | - | - | no |
q3_K_S is your safety net if you want headroom for a draft model + KV reuse and don't mind the ~3-4 MMLU point quality drop vs q4_K_M.
## Prefill vs generation — why agents and RAG profile differently than chat
A typical chat session in 2026 is ~2K tokens of input and ~500 tokens of output per turn, with KV reuse keeping per-turn prefill at 50-200 new tokens. In that regime prefill is under 1 second on any modern card and PFlash is irrelevant.
A RAG pipeline is the opposite: 50-100K tokens of retrieved context per query, often with no KV reuse because the retrieved chunks change per query. Prefill dominates by 5-10×. A coding agent is somewhere in between: 30-60K tokens of source files, with partial KV reuse across turns of the same task.
The rule of thumb: if your typical prompt is over 16K tokens or your KV cache is being torn down between calls, PFlash pays for itself. If you are 1-on-1 chatting with a model, just use llama.cpp master.
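That rule of thumb falls out of a two-term latency model. Here is a sketch with placeholder rates taken from this review (1.61 ms per prompt token at 64K prefill, 8.6 tok/s generation):

```python
def prefill_fraction(prompt_tokens, output_tokens,
                     prefill_ms_per_tok=1.61, gen_tps=8.6):
    """Share of end-to-end latency spent in prefill for one request."""
    prefill_s = prompt_tokens * prefill_ms_per_tok / 1000
    gen_s = output_tokens / gen_tps
    return prefill_s / (prefill_s + gen_s)

chat = prefill_fraction(200, 500)      # cached chat turn: ~200 new prompt tokens
rag  = prefill_fraction(64_000, 500)   # fresh 64K retrieval prompt
print(f"chat turn: prefill is {chat:.0%} of latency; RAG query: {rag:.0%}")
```

Under these assumptions a cached chat turn spends under 1% of its latency in prefill, while a fresh 64K retrieval prompt spends roughly two-thirds — which is the case for PFlash in one number.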
## Context-length impact on KV cache and PFlash tiling
KV cache size on 70B Llama 3.1 grows linearly with context: roughly 32 KB per token at f16 summed across all 80 layers, 16 KB at 8-bit, and 8 KB at 4-bit. At 128K that totals about 4.2 GB / 2.1 GB / 1.0 GB respectively. The 2.1 GB 8-bit figure is the one in the tables above; the 4.2 GB f16 figure is what tips q4_K_M into OOM.
PFlash's tiling chunk size defaults to 8K tokens; on 24 GB cards we found 4K chunks gave slightly better wall-clock at 128K because they kept SRAM working sets smaller and reduced spills. The flag is --pflash-chunk 4096. Past 64K context the chunk size has measurable impact; below 32K it's noise.
## Perf-per-dollar — used RTX 3090 vs renting an H100 hour
A used RTX 3090 in 2026 trades for $650-$750 on eBay (~$700 average). An H100 hour on RunPod or Lambda is $2.49-$3.50 (as of 2026, secure cloud tier). For a workload doing 40 long-context (>64K) RAG queries per day, a 3090 with PFlash comes within ~3× of the H100's prefill wall-clock — the H100 is faster, but not 10× faster — and the 3090's electricity cost at 350W and US$0.15/kWh is ~$0.05/hour. Break-even versus rental is ~200-290 H100-hours; since the 3090 needs roughly 3× the wall-clock for the same queries, that works out to a few months of daily 8-hour use. After that the 3090 is free.
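The break-even figure is simple division over the prices quoted above ($700 card, $2.49-$3.50/hr H100 rental, ~$0.05/hr electricity for the 3090):

```python
CARD_PRICE, POWER_COST_PER_HR = 700.0, 0.05

def break_even_hours(h100_rate_per_hr):
    """Hours of H100 rental the card price buys, net of the 3090's power bill."""
    return CARD_PRICE / (h100_rate_per_hr - POWER_COST_PER_HR)

for rate in (2.49, 3.50):
    print(f"${rate}/hr rental: break-even after ~{break_even_hours(rate):.0f} hours")
```

Roughly 200-290 rental hours at 2026 prices, before accounting for the 3090's slower wall-clock per query.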
This is the math that makes PFlash matter commercially. Before PFlash, a 24 GB card just could not do 128K-context retrieval in usable wall-clock time, and the practical answer was "rent an H100." After PFlash a $700 used card is competitive for the prefill-heavy workloads that have been driving cloud GPU bills since RAG took off in 2023-2024.
## Bottom line — when to switch from llama.cpp to PFlash today
Switch to PFlash if any of these are true:
- You routinely run prompts over 32K tokens
- Your workload is RAG, multi-file code agents, or long-document Q&A
- You can tolerate manual rebuild / branch tracking until the upstream PR merges
- You don't need batched prefill across multiple concurrent sessions
Stick with llama.cpp master if:
- You're chatting with a model in a single long-running session (KV reuse wins)
- Your typical prompt is under 16K tokens
- You need batched prefill across users
The PFlash branch is moving fast and we expect upstream merge into llama.cpp master within ~6 weeks (as of 2026-05-01) once the AMD HIP backend reaches parity. Until then, you build from source. The LocalLLaMA thread linked below has the exact build commands and a working Dockerfile for the Ampere path.
## Related guides
- Used RTX 3090 for local LLM inference in 2026
- DFlash + speculative decoding with Qwen 3.5 35B on RTX 2080 Super
- llama.cpp on Snapdragon Hexagon NPU
## Sources
- Original LocalLLaMA PFlash benchmark thread (week of 2026-04-28)
- llama.cpp PR #11487 — prefill tiling discussion (github.com/ggerganov/llama.cpp)
- Dao et al., FlashAttention-2 paper (arxiv.org/abs/2307.08691)
- GGML repository, K-quants documentation (github.com/ggerganov/ggml)
- Phoronix coverage of NVIDIA driver 555.85 release notes (phoronix.com)
- NVIDIA RTX 3090 Founders Edition official specs (nvidia.com)
- anandtech.com, RTX 3090 review and Ampere architecture deep-dive
— SpecPicks Editorial · Last verified 2026-05-01
