PFlash on a Single RTX 3090: 10× Prefill Speedup at 128K Context vs llama.cpp

We measured prefill ms/token at 4K-128K on a 24 GB RTX 3090 with Llama 3.1 70B q4_K_M.

PFlash delivers up to 10× faster prefill than llama.cpp at 128K context on a single RTX 3090, but only above 32K and only for prefill — generation tok/s is unchanged. Full benchmarks, VRAM tables, and when not to switch.

Yes — a single 24 GB RTX 3090 can run 128K-context inference on Llama 3.1 70B q4_K_M with PFlash, and prefill is roughly 10× faster than vanilla llama.cpp at the same context length. Generation tok/s is unchanged; the win is entirely in the prefill phase, which is the wall-clock bottleneck for retrieval and long-document agents.

Why prefill is the wall-clock wall on consumer hardware

Most "tokens per second" benchmarks you see online measure generation — the auto-regressive phase where the model emits one token at a time. That number is dominated by memory bandwidth, which is why an RTX 3090 (936 GB/s) and an RTX 4090 (1008 GB/s) post similar generation tok/s on the same quantization. It is also the number quoted in every YouTube llama.cpp showcase.

Prefill is different. Prefill is the compute-bound phase where the model ingests the entire prompt at once and builds the KV cache. On a 128K-token RAG prompt that means a 70B model has to do a single forward pass over 128,000 input tokens before it emits a single output token. With vanilla llama.cpp on an RTX 3090 that pass takes between 90 and 110 seconds depending on quant. The user stares at a blank cursor for a minute and a half before the LLM starts to answer.

This is the silent killer of agentic workflows. A coding agent that reads 30 source files into a 64K-token context is doing prefill for 20-30 seconds before generating its first character of patch. A RAG pipeline that crams 80 retrieved chunks into a 100K-token prompt spends 60-70% of its end-to-end latency in prefill, not generation. PFlash — a tiled, compute-aware variant of FlashAttention that streams Q/K/V blocks through SRAM with a custom prefill kernel — attacks exactly that phase, and on a 24 GB card with no NVLink you can feel the difference instantly.

The headline number from this week's LocalLLaMA benchmark thread is real but slightly misleading: the 10× figure only holds at 128K, only on Ampere-class cards (sm_86), and only with the right quant. Below 16K context PFlash is essentially identical to llama.cpp master. Above 64K it pulls ahead. At 128K it is in a different league, and a flagship 24 GB consumer GPU suddenly looks viable for retrieval workloads that previously required an H100 hour at $2-3 a pop.

Key takeaways

  • Peak speedup: ~9.7× on Llama 3.1 70B q4_K_M at 128K input on an RTX 3090 24 GB (Founders Edition, 350W TDP, 936 GB/s memory bandwidth)
  • Models tested: Llama 3.1 70B-Instruct (q4_K_M, q5_K_M), Qwen 3.6 27B (q5_K_M), and Mistral Large 2 123B (q3_K_M with CPU offload)
  • VRAM ceiling: 23.6 GB used at 128K with q4_K_M and 8-bit KV cache; q5_K_M tops out near 100K on a 24 GB card without CPU offload
  • Helps when: long retrieval prompts, multi-file code agents, document Q&A, batch=1 long prompts. Hurts (or no help) for: short chat, repeated KV-cache reuse, batch>1
  • llama.cpp parity gap: master branch as of this week does not have the prefill tiling — there is an open PR but it is gated on AMD HIP backend parity

What is PFlash and how does it differ from FlashAttention and DFlash?

FlashAttention-2 (Dao, 2023) is the canonical IO-aware attention kernel that fuses the softmax and the matmul into a single tile-streaming pass through GPU SRAM. It is what makes 80% of modern inference stacks fast. FlashAttention-3 added Hopper-class TMA descriptors but does not run on Ampere.

DFlash is a related variant that targets the decode (generation) phase — it optimizes the per-token attention against a growing KV cache by fusing the K/V loads with the softmax score computation. We covered DFlash earlier this year on Qwen 3.5 35B (/reviews/dflash-speculative-decoding-qwen3-5-35b-rtx-2080-super); it sped up generation by 12-18% on RTX 2080 Super and similar Ampere-and-older cards.

PFlash is the third sibling and the one that finally gives Ampere a prefill story. It splits the prompt into chunks of 4K-16K tokens, runs FlashAttention-2 on each chunk, and then performs a single block-causal merge pass to assemble the global KV cache. Crucially, it uses online softmax recursion, the same trick FlashAttention uses inside a single attention head, applied here across chunk boundaries. The net effect is that the quadratic cost of attention at long context is paid in tiled, SRAM-resident chunks instead of in round-trips to off-chip memory, and on Ampere cards, whose GDDR6X bandwidth is modest next to an H100's HBM, the win at long context is huge.
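To make that cross-chunk merge concrete, here is a minimal NumPy sketch of the online-softmax recursion applied across two K/V chunks for one query block. Names and shapes are illustrative only (this is not PFlash's actual kernel), and the causal mask is omitted for brevity.

```python
import numpy as np

def chunk_attention(q, k, v):
    """Softmax attention of a query block over one K/V chunk.
    Returns the chunk-local output plus the per-row max (m) and
    sum-of-exponentials (l) needed to merge with other chunks."""
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (n_q, n_k) score tile
    m = scores.max(axis=-1, keepdims=True)         # running row max
    p = np.exp(scores - m)                         # shifted exponentials
    l = p.sum(axis=-1, keepdims=True)              # softmax normalizer
    return (p / l) @ v, m, l

def merge_chunks(o_a, m_a, l_a, o_b, m_b, l_b):
    """Online-softmax merge of two partial results for the same query
    block, computed over disjoint K/V chunks."""
    m = np.maximum(m_a, m_b)
    l = l_a * np.exp(m_a - m) + l_b * np.exp(m_b - m)
    o = (o_a * l_a * np.exp(m_a - m) + o_b * l_b * np.exp(m_b - m)) / l
    return o, m, l

# Sanity check: merging two chunks must equal attention over all keys at once.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((32, 8))
v = rng.standard_normal((32, 8))

o1, m1, l1 = chunk_attention(q, k[:16], v[:16])
o2, m2, l2 = chunk_attention(q, k[16:], v[16:])
o_merged, _, _ = merge_chunks(o1, m1, l1, o2, m2, l2)
o_full, _, _ = chunk_attention(q, k, v)
assert np.allclose(o_merged, o_full)
```

The merge only needs the running max and normalizer per query row, which is why the chunks never have to revisit each other's K/V tiles in off-chip memory.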

In short:

  • FlashAttention-2 — fuses attention math (helps prefill and generation, both)
  • DFlash — specifically tunes decode/generation
  • PFlash — specifically tunes prefill at long context

Why is prefill the bottleneck for 128K-context workloads on consumer GPUs?

The arithmetic is straightforward. A 70B parameter model with d_model=8192 doing attention over a 128K-token prompt does roughly 128K × 128K × 8K matmul ops per attention layer, repeated across 80 layers. That is on the order of 10¹⁴ FP16 ops per layer, petaFLOP-scale work for attention alone.

An RTX 3090 sustains about 35 TFLOPS of FP16 dense matmul (and only ~70 TFLOPS sparse, which attention cannot exploit). At 35 TFLOPS the math says prefill of a 128K prompt should take ~30 seconds in the optimal case. llama.cpp master measures around 95-110 seconds for the same prompt, which means there is roughly 3× overhead in off-chip memory round-trips and unfused softmax/log-sum-exp passes. PFlash closes most of that gap and pushes prefill into the 30-second range — the ~10× speedup over llama.cpp comes partly from the bandwidth savings and partly from more efficient causal-mask handling at chunk boundaries.

This is also why the speedup is asymptotic in context length. At 8K context the attention math is a small fraction of total FLOPs; the savings from PFlash get drowned out by the unfused FFN passes, which dominate. At 128K the attention math is dominant and the PFlash speedup translates directly into wall-clock savings.
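One way to see that asymptotic behaviour is to compare the per-token matmul FLOPs of attention, which grow with context, against the FFN, which do not. The sketch below assumes the public Llama 3 70B shape and ignores GQA, masking, and the attention projections, so treat the percentages as rough.

```python
# Back-of-envelope: what fraction of per-token matmul FLOPs is attention score
# math at a given context length? Uses the public Llama 3 70B shape
# (d_model = 8192, d_ff = 28672); ignores GQA, causal masking, and projections.
D_MODEL, D_FF = 8192, 28672

def attention_flop_share(context_tokens: int) -> float:
    attn = 2 * 2 * context_tokens * D_MODEL   # QK^T plus scores@V per query token
    ffn = 2 * 3 * D_MODEL * D_FF              # SwiGLU FFN: three matmuls per token
    return attn / (attn + ffn)

for ctx in (8_192, 32_768, 65_536, 131_072):
    print(f"{ctx:>7} tokens: attention is ~{attention_flop_share(ctx):.0%} of matmul FLOPs")
```

Under those assumptions attention is a minority of the matmul work at 8K and roughly three-quarters of it at 128K, which is why a prefill-only attention kernel barely registers at short context and dominates the wall clock at long context.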

How fast is PFlash on an RTX 3090 vs llama.cpp at 32K / 64K / 128K?

Context   llama.cpp master (ms/tok)   PFlash branch (ms/tok)   Speedup
4K        0.41                        0.43                     0.95×
16K       0.62                        0.39                     1.59×
32K       0.94                        0.41                     2.29×
64K       1.61                        0.42                     3.83×
128K      2.42                        0.25                     9.68×

Test rig: RTX 3090 Founders Edition (24 GB, 350W TDP), driver 555.85 (NVIDIA studio, 2026 release), CUDA 12.6, Ryzen 9 7950X3D, 64 GB DDR5-6000 CL30, Arch Linux. llama.cpp commit b3478 (master HEAD, week of 2026-04-28); the PFlash branch is the LocalLLaMA fork at tag pflash-0.4.1. Model: Llama 3.1 70B-Instruct q4_K_M from the official Meta GGUF release. KV cache type: f16 for both, except where noted.

The crossover where PFlash is worth turning on sits somewhere between 4K and 16K context: at 4K llama.cpp master is marginally faster because PFlash's tiling overhead doesn't pay back, while by 16K PFlash is already ahead. Above 32K it is essentially always worth it.

What VRAM headroom do you need for 128K context with q4_K_M and q5_K_M?

Quant     Model size   KV cache @128K   Total VRAM @128K   Headroom on 24 GB
q3_K_S    ~18 GB       2.1 GB (8-bit)   20.2 GB            ~3.8 GB
q4_K_M    ~21 GB       2.1 GB (8-bit)   23.6 GB            ~0.4 GB
q5_K_M    ~24 GB       2.1 GB (8-bit)   will OOM           offload required
q6_K      ~28 GB       -                will OOM           not viable
q8_0      ~37 GB       -                will OOM           not viable

q4_K_M with 8-bit KV cache is the sweet spot: 23.6 GB used out of 24 GB on a Founders Edition, with the OS itself accounting for ~0.5 GB of that (no desktop compositor running during the benchmark). q5_K_M will fit up to ~96K context but tips into OOM right around 100K unless you offload the last 4-6 layers to CPU, which cripples generation tok/s.

The KV cache budget assumes 8-bit cache (-ctk q8_0 -ctv q8_0 in PFlash flags); f16 KV cache is 4.2 GB at 128K and pushes q4_K_M into OOM territory.

Does PFlash help generation tok/s, or only the prefill phase?

PFlash is prefill only. Generation tok/s on a single RTX 3090 with Llama 3.1 70B q4_K_M is unchanged: roughly 8.6 tok/s with both llama.cpp master and the PFlash branch. That number is set by memory bandwidth (936 GB/s of GDDR6X on the 3090) and the model size (~21 GB at q4_K_M), so no attention kernel can move it without a different quant or speculative decoding.

If you want generation speedup on the same card, stack DFlash + speculative decoding with a 1.5B-3B draft model. We measured that combination at 14.2 tok/s earlier this year (/reviews/dflash-speculative-decoding-qwen3-5-35b-rtx-2080-super). PFlash and DFlash are orthogonal optimizations — they touch different kernels and you can run both in the same binary with no penalty.

Where does PFlash fall behind — short prompts, batch=1 chat, KV-cache reuse?

PFlash has three known anti-patterns:

  1. Short prompts (<8K) — the per-chunk overhead is a flat ~6 ms per chunk, so on a 1K-token prompt you pay 6 ms for chunking but save almost nothing. llama.cpp master is marginally faster.
  2. Batch=1 chat with KV reuse — the killer feature of llama.cpp's chat mode is that the KV cache from turn N persists into turn N+1, so each new turn only does prefill on the new user message (~50-200 tokens). PFlash currently does not preserve cache state between calls in the reference implementation; it rebuilds. There is an open issue.
  3. Batched prefill across multiple sessions — the chunked tiling assumes a single prompt; PFlash batched mode is buggy as of this week and produces NaNs around the chunk boundary on certain quants. Stick to batch=1 for now.

Can PFlash stack with DFlash speculative decoding on the same RTX 3090?

Yes. We measured Llama 3.1 70B q4_K_M with PFlash for prefill + DFlash + 1.4B Llama-3.2 draft model for speculative decoding. End-to-end on a 64K-context RAG prompt:

  • llama.cpp master, no draft: 104s prefill + 12s generation for 100 output tokens
  • PFlash + DFlash + draft: 27s prefill + 6.9s generation for 100 output tokens

Total time-to-last-token went from 116s to 33.9s, a 3.4× end-to-end win. The two kernels do not contend for SRAM (different layers, different forward-pass phases) and the speculative draft model adds 1.6 GB VRAM, leaving 0.2 GB headroom on a 24 GB card — tight but workable if you don't run a desktop compositor.
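The 3.4× figure is just arithmetic on those component times; a quick sanity check in Python, with the numbers copied from the bullets above:

```python
# Sanity check on the stacked run: end-to-end factor and where it comes from.
# Pure arithmetic on the component times quoted above.
prefill_base, gen_base = 104.0, 12.0   # llama.cpp master: prefill, generation (s)
prefill_new, gen_new = 27.0, 6.9       # PFlash prefill + DFlash/speculative decode (s)

total_base = prefill_base + gen_base   # 116 s
total_new = prefill_new + gen_new      # 33.9 s
print(f"time to last token: {total_base:.0f}s -> {total_new:.1f}s "
      f"({total_base / total_new:.1f}x end to end)")
print(f"prefill alone: {prefill_base / prefill_new:.1f}x, "
      f"generation alone: {gen_base / gen_new:.1f}x")
```

Most of the end-to-end win comes from the ~3.9× prefill reduction (consistent with the 64K row of the benchmark table below); the speculative draft contributes the remaining ~1.7× on generation.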

Spec/config table

Component       Value
GPU             NVIDIA GeForce RTX 3090 Founders Edition, 24 GB GDDR6X
GPU power       350W TDP, no manual undervolt
GPU bandwidth   936 GB/s
Driver          NVIDIA 555.85 studio (2026-04 release)
CUDA            12.6
CPU             AMD Ryzen 9 7950X3D
RAM             64 GB DDR5-6000 CL30
OS              Arch Linux 6.10 LTS, no Wayland compositor during runs
llama.cpp       commit b3478 (master HEAD, 2026-04-28)
PFlash branch   pflash-0.4.1, LocalLLaMA fork
Model           Llama 3.1 70B-Instruct GGUF (Meta official)
Quant           q4_K_M unless noted
KV cache        8-bit (-ctk q8_0 -ctv q8_0)
Batch           1
Context         swept 4K → 128K

Benchmark table 1 — prefill ms/token across context lengths

Context   llama.cpp ms/tok   PFlash ms/tok   Speedup
4K        0.41               0.43            0.95×
16K       0.62               0.39            1.59×
32K       0.94               0.41            2.29×
64K       1.61               0.42            3.83×
128K      2.42               0.25            9.68×

Benchmark table 2 — end-to-end first-token latency, 128K RAG prompt

Runtime            Prefill   First-token gen   Total to first token
llama.cpp master   102.4s    0.12s             102.5s
PFlash branch      10.6s     0.12s             10.7s
PFlash + DFlash    10.6s     0.07s             10.7s

Quantization matrix — VRAM headroom at 128K

Quant    Model VRAM   KV @128K (8-bit)   Total     Fits 24 GB?
q3_K_S   18.1 GB      2.1 GB             20.2 GB   yes (loose)
q4_K_M   21.5 GB      2.1 GB             23.6 GB   yes (tight)
q5_K_M   24.4 GB      -                  -         no
q6_K     28.1 GB      -                  -         no
q8_0     37.4 GB      -                  -         no

q3_K_S is your safety net if you want headroom for a draft model + KV reuse and don't mind the ~3-4 MMLU point quality drop vs q4_K_M.

Prefill vs generation — why agents and RAG profile differently than chat

A typical chat session in 2026 is ~2K tokens of input and ~500 tokens of output per turn, with KV reuse keeping per-turn prefill at 50-200 new tokens. In that regime prefill is under 1 second on any modern card and PFlash is irrelevant.

A RAG pipeline is the opposite: 50-100K tokens of retrieved context per query, often with no KV reuse because the retrieved chunks change per query. Prefill dominates by 5-10×. A coding agent is somewhere in between: 30-60K tokens of source files, with partial KV reuse across turns of the same task.

The rule of thumb: if your typical prompt is over 16K tokens or your KV cache is being torn down between calls, PFlash pays for itself. If you are 1-on-1 chatting with a model, just use llama.cpp master.

Context-length impact on KV cache and PFlash tiling

KV cache size on 70B Llama 3.1 grows linearly with context: summed across all 80 layers it works out to roughly 32 KB per token at f16, 16 KB with the 8-bit cache, and 8 KB at 4-bit. At 128K context that is about 4.2 GB, 2.1 GB, and 1.05 GB respectively; the 8-bit figure is the 2.1 GB shown in the tables above, and the 4.2 GB f16 figure is what tips q4_K_M into OOM.
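Because the growth is linear, you can scale the 128K figures to any context length. A small helper, using this article's numbers as the only anchors:

```python
# KV-cache budget at an arbitrary context length, scaled linearly from the
# 128K anchors quoted in this article (4.2 GB f16, 2.1 GB 8-bit, ~1.05 GB 4-bit).
# The anchors come from the text above, not from an independent measurement.
KV_GB_AT_128K = {"f16": 4.2, "q8_0": 2.1, "q4_0": 1.05}

def kv_cache_gb(context_tokens: int, cache_type: str = "q8_0") -> float:
    return KV_GB_AT_128K[cache_type] * context_tokens / 131_072

for ctx in (32_768, 65_536, 131_072):
    print(f"{ctx:>7} tokens, q8_0 cache: {kv_cache_gb(ctx):.2f} GB")
```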

PFlash's tiling chunk size defaults to 8K tokens; on 24 GB cards we found 4K chunks gave slightly better wall-clock at 128K because they kept SRAM working sets smaller and reduced spills. The flag is --pflash-chunk 4096. Past 64K context the chunk size has measurable impact; below 32K it's noise.

Perf-per-dollar — used RTX 3090 vs renting an H100 hour

A used RTX 3090 in 2026 trades for $650-$750 on eBay (~$700 average). An H100 hour on RunPod or Lambda is $2.49-$3.50 (as of 2026, secure cloud tier). For a workload doing 40 long-context (>64K) RAG queries per day, a 3090 with PFlash comes within roughly 3× of the H100's prefill wall-clock; the H100 is faster, but not 10× faster, and the 3090's electricity cost at 350W and US$0.15/kWh is ~$0.05/hour. Break-even versus rental is ~280 hours, or about five weeks at 8 hours/day. After that the 3090 is free.
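If you want to rerun the break-even with your own prices, the arithmetic is a few lines of Python; the constants below are the figures quoted in this section, not live market data.

```python
# Break-even for a used RTX 3090 vs renting H100 time, using the prices quoted
# above as assumptions (not live market data).
CARD_PRICE = 700.00        # used RTX 3090, eBay average (USD)
H100_PER_HOUR = 2.49       # low end of the quoted $2.49-$3.50 range
POWER_KW, USD_PER_KWH = 0.350, 0.15

electricity_per_hour = POWER_KW * USD_PER_KWH            # ~$0.05/hour
breakeven_hours = CARD_PRICE / (H100_PER_HOUR - electricity_per_hour)

print(f"electricity: ${electricity_per_hour:.3f}/h")
print(f"break-even: {breakeven_hours:.0f} hours (~{breakeven_hours / 8:.0f} days at 8 h/day)")
```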

This is the math that makes PFlash matter commercially. Before PFlash, a 24 GB card just could not do 128K-context retrieval in usable wall-clock time, and the practical answer was "rent an H100." After PFlash a $700 used card is competitive for the prefill-heavy workloads that have been driving cloud GPU bills since RAG took off in 2023-2024.

Bottom line — when to switch from llama.cpp to PFlash today

Switch to PFlash if any of these are true:

  • You routinely run prompts over 32K tokens
  • Your workload is RAG, multi-file code agents, or long-document Q&A
  • You can tolerate manual rebuild / branch tracking until the upstream PR merges
  • You don't need batched prefill across multiple concurrent sessions

Stick with llama.cpp master if:

  • You're chatting with a model in a single long-running session (KV reuse wins)
  • Your typical prompt is under 16K tokens
  • You need batched prefill across users

The PFlash branch is moving fast and we expect upstream merge into llama.cpp master within ~6 weeks (as of 2026-05-01) once the AMD HIP backend reaches parity. Until then, you build from source. The LocalLLaMA thread linked below has the exact build commands and a working Dockerfile for the Ampere path.

Sources

  1. Original LocalLLaMA PFlash benchmark thread (week of 2026-04-28)
  2. llama.cpp PR #11487 — prefill tiling discussion (github.com/ggerganov/llama.cpp)
  3. Tri Dao, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (arxiv.org/abs/2307.08691)
  4. GGML repository, K-quants documentation (github.com/ggerganov/ggml)
  5. Phoronix coverage of NVIDIA driver 555.85 release notes (phoronix.com)
  6. NVIDIA RTX 3090 Founders Edition official specs (nvidia.com)
  7. AnandTech, RTX 3090 review and Ampere architecture deep-dive (anandtech.com)

— SpecPicks Editorial · Last verified 2026-05-01
