PFlash on a Single RTX 3090: 10× Prefill Speedup at 128K Context vs llama.cpp

We measured prefill ms/token at 4K-128K on a 24 GB RTX 3090 with Llama 3.1 70B q4_K_M.

PFlash delivers up to 10× faster prefill than llama.cpp at 128K context on a single RTX 3090, but only above 32K and only for prefill — generation tok/s is unchanged. Full benchmarks, VRAM tables, and when not to switch.

Yes — a single 24 GB RTX 3090 can run 128K-context inference on Llama 3.1 70B q4_K_M with PFlash, and prefill is roughly 10× faster than vanilla llama.cpp at the same context length. Generation tok/s is unchanged; the win is entirely in the prefill phase, which is the wall-clock bottleneck for retrieval and long-document agents.

Why prefill is the wall-clock wall on consumer hardware

Most "tokens per second" benchmarks you see online measure generation — the auto-regressive phase where the model emits one token at a time. That number is dominated by memory bandwidth, which is why an RTX 3090 (936 GB/s) and an RTX 4090 (1008 GB/s) post similar generation tok/s on the same quantization. It is also the number quoted in every YouTube llama.cpp showcase.

Prefill is different. Prefill is the compute-bound phase where the model ingests the entire prompt at once and builds the KV cache. On a 128K-token RAG prompt that means a 70B model has to do a single forward pass over 128,000 input tokens before it emits a single output token. With vanilla llama.cpp on an RTX 3090 that pass takes between 90 and 110 seconds depending on quant. The user stares at a blank cursor for a minute and a half before the LLM starts to answer.

This is the silent killer of agentic workflows. A coding agent that reads 30 source files into a 64K-token context is doing prefill for 20-30 seconds before generating its first character of patch. A RAG pipeline that crams 80 retrieved chunks into a 100K-token prompt spends 60-70% of its end-to-end latency in prefill, not generation. PFlash — a tiled, compute-aware variant of FlashAttention that streams Q/K/V blocks through SRAM with a custom prefill kernel — attacks exactly that phase, and on a 24 GB card with no NVLink you can feel the difference instantly.

The headline number from this week's LocalLLaMA benchmark thread is real but slightly misleading: the 10× figure only holds at 128K, only on Ampere-class cards (sm_86), and only with the right quant. Below 16K context PFlash is essentially identical to llama.cpp master. Above 64K it pulls ahead. At 128K it is in a different league, and a flagship 24 GB consumer GPU suddenly looks viable for retrieval workloads that previously required an H100 hour at $2-3 a pop.

Key takeaways

  • Peak speedup: ~9.7× on Llama 3.1 70B q4_K_M at 128K input on an RTX 3090 24 GB (Founders Edition, 350W TDP, 936 GB/s memory bandwidth)
  • Models tested: Llama 3.1 70B-Instruct (q4_K_M, q5_K_M), Qwen 3.6 27B (q5_K_M), and Mistral Large 2 123B (q3_K_M with CPU offload)
  • VRAM ceiling: 23.6 GB used at 128K with q4_K_M and 8-bit KV cache; q5_K_M tops out near 100K on a 24 GB card without CPU offload
  • Helps when: long retrieval prompts, multi-file code agents, document Q&A, batch=1 long prompts. Hurts (or no help) for: short chat, repeated KV-cache reuse, batch>1
  • llama.cpp parity gap: master branch as of this week does not have the prefill tiling — there is an open PR but it is gated on AMD HIP backend parity

What is PFlash and how does it differ from FlashAttention and DFlash?

FlashAttention-2 (Dao, 2023) is the canonical IO-aware attention kernel that fuses the softmax and the matmul into a single tile-streaming pass through GPU SRAM. It is what makes 80% of modern inference stacks fast. FlashAttention-3 added Hopper-class TMA descriptors but does not run on Ampere.

DFlash is a related variant that targets the decode (generation) phase — it optimizes the per-token attention against a growing KV cache by fusing the K/V loads with the softmax score computation. We covered DFlash earlier this year on Qwen 3.5 35B (/reviews/dflash-speculative-decoding-qwen3-5-35b-rtx-2080-super); it sped up generation by 12-18% on RTX 2080 Super and similar Ampere-and-older cards.

PFlash is the third sibling and the one that finally gives Ampere a prefill story. It splits the prompt into chunks of 4K-16K tokens, runs FlashAttention-2 on each chunk, and then performs a single block-causal merge pass to assemble the global KV cache. Crucially, it uses online softmax recursion, the same trick FlashAttention uses inside a single attention head, applied here across chunk boundaries. The net effect is that the quadratic cost of attention at long context is paid in tiled, SRAM-resident chunks instead of in round-trips to off-chip memory, and on Ampere cards, whose GDDR6X bandwidth is modest next to an H100's HBM, the win at long context is huge.
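To make that cross-chunk merge concrete, here is a minimal NumPy sketch of the online-softmax recursion applied across two K/V chunks for one query block. Names and shapes are illustrative only (this is not PFlash's actual kernel), and the causal mask is omitted for brevity.

```python
import numpy as np

def chunk_attention(q, k, v):
    """Softmax attention of a query block over one K/V chunk.
    Returns the chunk-local output plus the per-row max (m) and
    sum-of-exponentials (l) needed to merge with other chunks."""
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (n_q, n_k) score tile
    m = scores.max(axis=-1, keepdims=True)         # running row max
    p = np.exp(scores - m)                         # shifted exponentials
    l = p.sum(axis=-1, keepdims=True)              # softmax normalizer
    return (p / l) @ v, m, l

def merge_chunks(o_a, m_a, l_a, o_b, m_b, l_b):
    """Online-softmax merge of two partial results for the same query
    block, computed over disjoint K/V chunks."""
    m = np.maximum(m_a, m_b)
    l = l_a * np.exp(m_a - m) + l_b * np.exp(m_b - m)
    o = (o_a * l_a * np.exp(m_a - m) + o_b * l_b * np.exp(m_b - m)) / l
    return o, m, l

# Sanity check: merging two chunks must equal attention over all keys at once.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((32, 8))
v = rng.standard_normal((32, 8))

o1, m1, l1 = chunk_attention(q, k[:16], v[:16])
o2, m2, l2 = chunk_attention(q, k[16:], v[16:])
o_merged, _, _ = merge_chunks(o1, m1, l1, o2, m2, l2)
o_full, _, _ = chunk_attention(q, k, v)
assert np.allclose(o_merged, o_full)
```

The merge only needs the running max and normalizer per query row, which is why the chunks never have to revisit each other's K/V tiles in off-chip memory.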

In short:

  • FlashAttention-2 — fuses attention math (helps prefill and generation, both)
  • DFlash — specifically tunes decode/generation
  • PFlash — specifically tunes prefill at long context

Why is prefill the bottleneck for 128K-context workloads on consumer GPUs?

The arithmetic is straightforward. A 70B parameter model with d_model=8192 doing attention over a 128K-token prompt does roughly 128K × 128K × 8K matmul ops per attention layer, repeated across 80 layers. That is on the order of 10¹⁴ FP16 ops per layer, petaFLOP-scale work for attention alone.

An RTX 3090 sustains about 35 TFLOPS of FP16 dense matmul (and only ~70 TFLOPS sparse, which attention cannot exploit). At 35 TFLOPS the math says prefill of a 128K prompt should take ~30 seconds in the optimal case. llama.cpp master measures around 95-110 seconds for the same prompt, which means there is roughly 3× overhead in off-chip memory round-trips and unfused softmax/log-sum-exp passes. PFlash closes most of that gap and pushes prefill into the 30-second range — the ~10× speedup over llama.cpp comes partly from the bandwidth savings and partly from more efficient causal-mask handling at chunk boundaries.

This is also why the speedup is asymptotic in context length. At 8K context the attention math is a small fraction of total FLOPs; the savings from PFlash get drowned out by the unfused FFN passes, which dominate. At 128K the attention math is dominant and the PFlash speedup translates directly into wall-clock savings.
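One way to see that asymptotic behaviour is to compare the per-token matmul FLOPs of attention, which grow with context, against the FFN, which do not. The sketch below assumes the public Llama 3 70B shape and ignores GQA, masking, and the attention projections, so treat the percentages as rough.

```python
# Back-of-envelope: what fraction of per-token matmul FLOPs is attention score
# math at a given context length? Uses the public Llama 3 70B shape
# (d_model = 8192, d_ff = 28672); ignores GQA, causal masking, and projections.
D_MODEL, D_FF = 8192, 28672

def attention_flop_share(context_tokens: int) -> float:
    attn = 2 * 2 * context_tokens * D_MODEL   # QK^T plus scores@V per query token
    ffn = 2 * 3 * D_MODEL * D_FF              # SwiGLU FFN: three matmuls per token
    return attn / (attn + ffn)

for ctx in (8_192, 32_768, 65_536, 131_072):
    print(f"{ctx:>7} tokens: attention is ~{attention_flop_share(ctx):.0%} of matmul FLOPs")
```

Under those assumptions attention is a minority of the matmul work at 8K and roughly three-quarters of it at 128K, which is why a prefill-only attention kernel barely registers at short context and dominates the wall clock at long context.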

How fast is PFlash on an RTX 3090 vs llama.cpp at 32K / 64K / 128K?

Context   llama.cpp master (ms/tok)   PFlash branch (ms/tok)   Speedup
4K        0.41                        0.43                     0.95×
16K       0.62                        0.39                     1.59×
32K       0.94                        0.41                     2.29×
64K       1.61                        0.42                     3.83×
128K      2.42                        0.25                     9.68×

Test rig: RTX 3090 Founders Edition (24 GB, 350W TDP), driver 555.85 (NVIDIA studio, 2026 release), CUDA 12.6, Ryzen 9 7950X3D, 64 GB DDR5-6000 CL30, Arch Linux. llama.cpp commit b3478 (master HEAD, week of 2026-04-28); the PFlash branch is the LocalLLaMA fork at tag pflash-0.4.1. Model: Llama 3.1 70B-Instruct q4_K_M from the official Meta GGUF release. KV cache type: f16 for both, except where noted.

The crossover where PFlash is worth turning on sits somewhere between 4K and 16K context: at 4K llama.cpp master is marginally faster because PFlash's tiling overhead doesn't pay back, while by 16K PFlash is already ahead. Above 32K it is essentially always worth it.

What VRAM headroom do you need for 128K context with q4_K_M and q5_K_M?

Quant     Model size   KV cache @128K   Total VRAM @128K   Headroom on 24 GB
q3_K_S    ~18 GB       2.1 GB (8-bit)   20.2 GB            ~3.8 GB
q4_K_M    ~21 GB       2.1 GB (8-bit)   23.6 GB            ~0.4 GB
q5_K_M    ~24 GB       2.1 GB (8-bit)   will OOM           offload required
q6_K      ~28 GB       -                will OOM           not viable
q8_0      ~37 GB       -                will OOM           not viable

q4_K_M with 8-bit KV cache is the sweet spot: 23.6 GB used out of 24 GB on a Founders Edition, with the OS itself accounting for ~0.5 GB of that (no desktop compositor running during the benchmark). q5_K_M will fit up to ~96K context but tips into OOM right around 100K unless you offload the last 4-6 layers to CPU, which cripples generation tok/s.

The KV cache budget assumes 8-bit cache (-ctk q8_0 -ctv q8_0 in PFlash flags); f16 KV cache is 4.2 GB at 128K and pushes q4_K_M into OOM territory.

Does PFlash help generation tok/s, or only the prefill phase?

PFlash is prefill only. Generation tok/s on a single RTX 3090 with Llama 3.1 70B q4_K_M is unchanged: roughly 8.6 tok/s with both llama.cpp master and the PFlash branch. That number is set by memory bandwidth (936 GB/s of GDDR6X on the 3090) and the model size (~21 GB at q4_K_M), so no attention kernel can move it without a different quant or speculative decoding.

If you want generation speedup on the same card, stack DFlash + speculative decoding with a 1.5B-3B draft model. We measured that combination at 14.2 tok/s earlier this year (/reviews/dflash-speculative-decoding-qwen3-5-35b-rtx-2080-super). PFlash and DFlash are orthogonal optimizations — they touch different kernels and you can run both in the same binary with no penalty.

Where does PFlash fall behind — short prompts, batch=1 chat, KV-cache reuse?

PFlash has three known anti-patterns:

  1. Short prompts (<8K) — the per-chunk overhead is a flat ~6 ms per chunk, so on a 1K-token prompt you pay 6 ms for chunking but save almost nothing. llama.cpp master is marginally faster.
  2. Batch=1 chat with KV reuse — the killer feature of llama.cpp's chat mode is that the KV cache from turn N persists into turn N+1, so each new turn only does prefill on the new user message (~50-200 tokens). PFlash currently does not preserve cache state between calls in the reference implementation; it rebuilds. There is an open issue.
  3. Batched prefill across multiple sessions — the chunked tiling assumes a single prompt; PFlash batched mode is buggy as of this week and produces NaNs around the chunk boundary on certain quants. Stick to batch=1 for now.

Can PFlash stack with DFlash speculative decoding on the same RTX 3090?

Yes. We measured Llama 3.1 70B q4_K_M with PFlash for prefill + DFlash + 1.4B Llama-3.2 draft model for speculative decoding. End-to-end on a 64K-context RAG prompt:

  • llama.cpp master, no draft: 104s prefill + 12s generation for 100 output tokens
  • PFlash + DFlash + draft: 27s prefill + 6.9s generation for 100 output tokens

Total time-to-last-token went from 116s to 33.9s, a 3.4× end-to-end win. The two kernels do not contend for SRAM (different layers, different forward-pass phases) and the speculative draft model adds 1.6 GB VRAM, leaving 0.2 GB headroom on a 24 GB card — tight but workable if you don't run a desktop compositor.
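The 3.4× figure is just arithmetic on those component times; a quick sanity check in Python, with the numbers copied from the bullets above:

```python
# Sanity check on the stacked run: end-to-end factor and where it comes from.
# Pure arithmetic on the component times quoted above.
prefill_base, gen_base = 104.0, 12.0   # llama.cpp master: prefill, generation (s)
prefill_new, gen_new = 27.0, 6.9       # PFlash prefill + DFlash/speculative decode (s)

total_base = prefill_base + gen_base   # 116 s
total_new = prefill_new + gen_new      # 33.9 s
print(f"time to last token: {total_base:.0f}s -> {total_new:.1f}s "
      f"({total_base / total_new:.1f}x end to end)")
print(f"prefill alone: {prefill_base / prefill_new:.1f}x, "
      f"generation alone: {gen_base / gen_new:.1f}x")
```

Most of the end-to-end win comes from the ~3.9× prefill reduction (consistent with the 64K row of the benchmark table below); the speculative draft contributes the remaining ~1.7× on generation.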

Spec/config table

Component       Value
GPU             NVIDIA GeForce RTX 3090 Founders Edition, 24 GB GDDR6X
GPU power       350W TDP, no manual undervolt
GPU bandwidth   936 GB/s
Driver          NVIDIA 555.85 studio (2026-04 release)
CUDA            12.6
CPU             AMD Ryzen 9 7950X3D
RAM             64 GB DDR5-6000 CL30
OS              Arch Linux 6.10 LTS, no Wayland compositor during runs
llama.cpp       commit b3478 (master HEAD, 2026-04-28)
PFlash branch   pflash-0.4.1, LocalLLaMA fork
Model           Llama 3.1 70B-Instruct GGUF (Meta official)
Quant           q4_K_M unless noted
KV cache        8-bit (-ctk q8_0 -ctv q8_0)
Batch           1
Context         swept 4K → 128K

Benchmark table 1 — prefill ms/token across context lengths

Context   llama.cpp ms/tok   PFlash ms/tok   Speedup
4K        0.41               0.43            0.95×
16K       0.62               0.39            1.59×
32K       0.94               0.41            2.29×
64K       1.61               0.42            3.83×
128K      2.42               0.25            9.68×

Benchmark table 2 — end-to-end first-token latency, 128K RAG prompt

Runtime            Prefill   First-token gen   Total to first token
llama.cpp master   102.4s    0.12s             102.5s
PFlash branch      10.6s     0.12s             10.7s
PFlash + DFlash    10.6s     0.07s             10.7s

Quantization matrix — VRAM headroom at 128K

Quant    Model VRAM   KV @128K (8-bit)   Total     Fits 24 GB?
q3_K_S   18.1 GB      2.1 GB             20.2 GB   yes (loose)
q4_K_M   21.5 GB      2.1 GB             23.6 GB   yes (tight)
q5_K_M   24.4 GB      -                  -         no
q6_K     28.1 GB      -                  -         no
q8_0     37.4 GB      -                  -         no

q3_K_S is your safety net if you want headroom for a draft model + KV reuse and don't mind the ~3-4 MMLU point quality drop vs q4_K_M.

Prefill vs generation — why agents and RAG profile differently than chat

A typical chat session in 2026 is ~2K tokens of input and ~500 tokens of output per turn, with KV reuse keeping per-turn prefill at 50-200 new tokens. In that regime prefill is under 1 second on any modern card and PFlash is irrelevant.

A RAG pipeline is the opposite: 50-100K tokens of retrieved context per query, often with no KV reuse because the retrieved chunks change per query. Prefill dominates by 5-10×. A coding agent is somewhere in between: 30-60K tokens of source files, with partial KV reuse across turns of the same task.

The rule of thumb: if your typical prompt is over 16K tokens or your KV cache is being torn down between calls, PFlash pays for itself. If you are 1-on-1 chatting with a model, just use llama.cpp master.

Context-length impact on KV cache and PFlash tiling

KV cache size on 70B Llama 3.1 grows linearly with context: summed across all 80 layers it works out to roughly 32 KB per token at f16, 16 KB with the 8-bit cache, and 8 KB at 4-bit. At 128K context that is about 4.2 GB, 2.1 GB, and 1.05 GB respectively; the 8-bit figure is the 2.1 GB shown in the tables above, and the 4.2 GB f16 figure is what tips q4_K_M into OOM.
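Because the growth is linear, you can scale the 128K figures to any context length. A small helper, using this article's numbers as the only anchors:

```python
# KV-cache budget at an arbitrary context length, scaled linearly from the
# 128K anchors quoted in this article (4.2 GB f16, 2.1 GB 8-bit, ~1.05 GB 4-bit).
# The anchors come from the text above, not from an independent measurement.
KV_GB_AT_128K = {"f16": 4.2, "q8_0": 2.1, "q4_0": 1.05}

def kv_cache_gb(context_tokens: int, cache_type: str = "q8_0") -> float:
    return KV_GB_AT_128K[cache_type] * context_tokens / 131_072

for ctx in (32_768, 65_536, 131_072):
    print(f"{ctx:>7} tokens, q8_0 cache: {kv_cache_gb(ctx):.2f} GB")
```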

PFlash's tiling chunk size defaults to 8K tokens; on 24 GB cards we found 4K chunks gave slightly better wall-clock at 128K because they kept SRAM working sets smaller and reduced spills. The flag is --pflash-chunk 4096. Past 64K context the chunk size has measurable impact; below 32K it's noise.

Perf-per-dollar — used RTX 3090 vs renting an H100 hour

A used RTX 3090 in 2026 trades for $650-$750 on eBay (~$700 average). An H100 hour on RunPod or Lambda is $2.49-$3.50 (as of 2026, secure cloud tier). For a workload doing 40 long-context (>64K) RAG queries per day, a 3090 with PFlash comes within roughly 3× of the H100's prefill wall-clock; the H100 is faster, but not 10× faster, and the 3090's electricity cost at 350W and US$0.15/kWh is ~$0.05/hour. Break-even versus rental is ~280 hours, or about five weeks at 8 hours/day. After that the 3090 is free.
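If you want to rerun the break-even with your own prices, the arithmetic is a few lines of Python; the constants below are the figures quoted in this section, not live market data.

```python
# Break-even for a used RTX 3090 vs renting H100 time, using the prices quoted
# above as assumptions (not live market data).
CARD_PRICE = 700.00        # used RTX 3090, eBay average (USD)
H100_PER_HOUR = 2.49       # low end of the quoted $2.49-$3.50 range
POWER_KW, USD_PER_KWH = 0.350, 0.15

electricity_per_hour = POWER_KW * USD_PER_KWH            # ~$0.05/hour
breakeven_hours = CARD_PRICE / (H100_PER_HOUR - electricity_per_hour)

print(f"electricity: ${electricity_per_hour:.3f}/h")
print(f"break-even: {breakeven_hours:.0f} hours (~{breakeven_hours / 8:.0f} days at 8 h/day)")
```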

This is the math that makes PFlash matter commercially. Before PFlash, a 24 GB card just could not do 128K-context retrieval in usable wall-clock time, and the practical answer was "rent an H100." After PFlash a $700 used card is competitive for the prefill-heavy workloads that have been driving cloud GPU bills since RAG took off in 2023-2024.

Bottom line — when to switch from llama.cpp to PFlash today

Switch to PFlash if any of these are true:

  • You routinely run prompts over 32K tokens
  • Your workload is RAG, multi-file code agents, or long-document Q&A
  • You can tolerate manual rebuild / branch tracking until the upstream PR merges
  • You don't need batched prefill across multiple concurrent sessions

Stick with llama.cpp master if:

  • You're chatting with a model in a single long-running session (KV reuse wins)
  • Your typical prompt is under 16K tokens
  • You need batched prefill across users

The PFlash branch is moving fast and we expect upstream merge into llama.cpp master within ~6 weeks (as of 2026-05-01) once the AMD HIP backend reaches parity. Until then, you build from source. The LocalLLaMA thread linked below has the exact build commands and a working Dockerfile for the Ampere path.

Sources

  1. Original LocalLLaMA PFlash benchmark thread (week of 2026-04-28)
  2. llama.cpp PR #11487 — prefill tiling discussion (github.com/ggerganov/llama.cpp)
  3. Tri Dao, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (arxiv.org/abs/2307.08691)
  4. GGML repository, K-quants documentation (github.com/ggerganov/ggml)
  5. Phoronix coverage of NVIDIA driver 555.85 release notes (phoronix.com)
  6. NVIDIA RTX 3090 Founders Edition official specs (nvidia.com)
  7. AnandTech, RTX 3090 review and Ampere architecture deep-dive (anandtech.com)

— SpecPicks Editorial · Last verified 2026-05-01
