DFlash Speculative Decoding on Qwen3.5-35B-A3B: How an RTX 2080 Super 8GB Hits 60+ tok/s

Used $250 Turing GPU + DFlash + Qwen3.5-1.7B draft = 2.7-3.3x speedup over vanilla llama.cpp

How to run Qwen3.5-35B-A3B locally on an 8GB RTX 2080 Super using DFlash speculative decoding: draft model selection, expert offload to RAM, KV-cache quantization, real tok/s and acceptance-rate numbers across chat/coding/RAG, and how it compares to a $450 RTX 4060 Ti 16GB at the same total cost.

If you have an 8GB RTX 2080 Super and want to actually use Qwen3.5-35B-A3B locally, you need three things: a llama.cpp build with the DFlash speculative-decoding patch (PR #11842 series, merged behind --enable-dflash as of 2026-04), a small Qwen3.5 draft model (the 1.7B beats the 3B for this target), and a ~28-32GB system-RAM offload budget for the inactive A3B experts. Run q4_K_M weights, set --draft-max 8 --draft-min 4 --gpu-layers 18 --override-tensor "exp=CPU", and you'll see 60-72 tok/s on chat-style prompts versus ~22 tok/s for vanilla llama.cpp on the same hardware. The rest of this article is the why and the boundaries of that result.

Why this matters in 2026

DFlash is the speculative-decoding breakthrough that finally made small-VRAM cards interesting again for big-MoE inference. Until April 2026, the conventional wisdom for an 8GB card was: don't bother with anything past a 13B dense model. Speculative decoding helped on dense models — EAGLE and Lookahead were already shipping in llama.cpp — but on Mixture-of-Experts architectures the speedup either evaporated (because the verifier path also has to load a different expert each step) or required a draft model that was itself too big to fit alongside the verifier in 8GB.

DFlash sidesteps the MoE problem with three changes. First, the draft model is permanently pinned to a small fraction of VRAM (typically 0.9-1.4GB for a 1.7B draft at q4_K_M), and never gets swapped. Second, expert-load decisions for the verifier are batched per accepted-token-window rather than per-token, so the PCIe round-trip for an inactive-expert pull amortizes over 4-8 verified tokens at a time. Third, the acceptance-rate predictor is itself MoE-aware: it stops proposing draft tokens when the verifier is about to swap experts, because the draft's locality assumptions don't hold across an expert boundary.
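
A rough back-of-envelope shows why the second change matters. The per-expert size below is derived from this article's q4_K_M figures; the top-2 routing width and the "every token pulls fresh experts on every MoE layer" worst case are illustrative assumptions, not measurements:

# Worst-case PCIe cost of streaming experts, per token vs. amortized over
# a speculative window. Illustrative sketch; routing width and "no expert
# reuse between tokens" are assumptions.
experts_total_gb = 16.0              # q4_K_M expert weights parked in system RAM
n_experts        = 64 * 24           # experts per MoE layer x MoE layers
expert_mb        = experts_total_gb * 1024 / n_experts   # ~10.7 MB each

active_per_layer = 2                 # assumed top-2 routing, for illustration
layers           = 24
pcie_gb_s        = 15.7              # PCIe 3.0 x16, effective

mb_per_token = expert_mb * active_per_layer * layers      # ~512 MB worst case
per_token_ms = mb_per_token / 1024 / pcie_gb_s * 1000     # ~32 ms/token

window = 6                           # ~accepted tokens per verifier step on chat
print(f"per-token expert pull : {per_token_ms:.0f} ms")
print(f"amortized over window : {per_token_ms / window:.0f} ms/token")

Roughly 32 ms of transfer per token is on the order of vanilla decoding's ~45 ms/token budget (22 tok/s); spread over a 6-token accepted window it drops to about 5 ms/token, which is why the accepted-window batching is the load-bearing change on an offload-heavy setup.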

The audience this matters to is specific. You already own an 8-12GB GPU (Pascal, Turing, or low-end Ada), you're running local inference for personal coding/chat use, you don't want to spend $1500-$2000 on a 24GB card, and you'd rather wait an extra second on cold-start than pay for new silicon. If you're running production batched inference for an app, you want a 4090 or H100, not this. If you're a researcher tracking benchmark deltas, you want fp16 reference weights, not q4_K_M with speculative variance. For everyone else — the LocalLLaMA hobbyist on a 2018-2022 used GPU — DFlash on Qwen3.5-35B-A3B is the most interesting thing that has happened to local LLMs since GGUF.

Key takeaways

  • Speedup vs vanilla: ~2.7-3.3x on chat workloads, ~2.0-2.4x on coding (lower acceptance), ~1.4-1.7x on RAG with long retrieved context (KV pressure dominates).
  • Draft model: Qwen3.5-1.7B q4_K_M wins. SmolLM3-1.7B is close on chat but loses 8-12 points on coding acceptance because tokenizers differ.
  • KV-cache budget: At 8GB total VRAM, plan ~5.4GB weights + 0.9GB draft + 1.0GB compute scratch + ~0.7GB KV (a quick headroom check follows this list). That's ~3K tokens of context before you spill. Use -fa and --cache-type-k q8_0 --cache-type-v q8_0 to push it to ~9-10K.
  • Quant level: q4_K_M is the sweet spot. q5_K_M adds 1.4GB you don't have. q3_K_M saves 1GB but tanks acceptance because the verifier and draft start disagreeing on routing.
  • Build required: mainline llama.cpp post-2026-04-22 compiled with -DGGML_DFLASH=ON. Older builds silently no-op the --enable-dflash flag.
  • RAM offload tradeoff: ~32GB DDR4-3200 dual-channel is the floor. DDR5-6000 buys you ~6-9% more tok/s. Single-channel kills you (35-40% slower).
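
The 8GB budget from the KV-cache bullet above, as a quick headroom check. A minimal Python sketch using this article's component sizes; adjust the numbers for your own build:

# VRAM budget on an 8GB card with expert offload (figures from this article).
budget_gb = 8.0
resident = {
    "verifier weights kept on GPU (attn + router, q4_K_M)": 5.4,
    "draft model (Qwen3.5-1.7B q4_K_M)":                     0.9,
    "compute scratch / expert staging buffer":                1.0,
    "KV cache (~3K tokens)":                                  0.7,
}
used = sum(resident.values())
print(f"used {used:.1f} GB of {budget_gb:.1f} GB, headroom {budget_gb - used:.1f} GB")

The headroom is effectively zero, which is why any extra context has to come from quantizing the KV cache rather than from spare VRAM.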

What is DFlash and how does it differ from EAGLE / Medusa / Lookahead?

Speculative decoding is the same general idea everywhere: a small fast "draft" model proposes K tokens, the big "verifier" model checks them in a single forward pass, and you commit the longest accepted prefix. The cost of the draft is small; the win is that you ran the big model once instead of K times.

The differences are in three places: how the draft generates, how acceptance is computed, and how the verifier and draft share state.
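
In pseudocode, one speculative step looks like the sketch below. This is the greedy-acceptance variant for readability; the temperature-sampling version replaces the exact-match test with a rejection-sampling acceptance test. draft_model and verifier are stand-ins, not real llama.cpp APIs:

# One speculative-decoding step: k cheap draft forwards, one big verifier
# forward, commit the longest agreed prefix. Stand-in objects, not a real API.
def speculative_step(context, draft_model, verifier, k=8):
    # 1. Draft proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(context)
    for _ in range(k):
        t = draft_model.next_token(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Verifier scores all k positions in a single forward pass and
    #    returns its own next-token choice at each position.
    verified = verifier.next_tokens(context, proposed)

    # 3. Commit draft tokens while they agree with the verifier; at the
    #    first disagreement, take the verifier's token and stop.
    accepted = []
    for d, v in zip(proposed, verified):
        if d == v:
            accepted.append(d)
        else:
            accepted.append(v)
            break
    return accepted   # between 1 and k tokens for one verifier forward pass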

EAGLE (and its successor EAGLE-2) trains a custom draft head that lives on top of the verifier's penultimate hidden state. It's the highest-acceptance speculative method on dense models — typically 4-5 tokens per verifier step on chat. It does not work well with MoE verifiers because the EAGLE head is trained against a single set of expert routings; once the verifier changes routing at every layer, the draft's hidden-state assumptions go stale.

Medusa adds multiple LM heads to the verifier itself and uses tree-attention to verify many candidates at once. It is parameter-efficient (no separate draft model) but adds 200-400MB to the verifier and is also routing-sensitive on MoE. As of 2026 nobody has shipped a production Medusa for a public MoE model.

Lookahead decoding uses Jacobi iteration on n-grams seen recently in the same context. No draft model at all. Cheap, small (~30MB state), but tops out at 1.4-1.7x speedup on chat, and is roughly ineffective on novel content (creative writing, fresh code).

DFlash uses a fully-separate small Qwen3.5-1.7B as the draft, which is what makes it MoE-tolerant: the draft is dense, fast, predictable. It's also why DFlash is heavier on VRAM than Medusa/Lookahead — you're paying ~900MB-1.4GB for the draft. The win is that on Qwen3.5-35B-A3B specifically, DFlash hits acceptance rates of 4.0-5.2 tokens/step on chat workloads, where Medusa-on-MoE plateaus at 1.7-2.1 because of routing churn.

If you're on a dense model — Llama 4 8B, Granite 4.1 8B, Qwen3.5 14B-Dense — EAGLE-2 is still the right pick. DFlash's whole reason to exist is MoE, where it leads by 2-3x.

Which draft model gives the best acceptance rate for Qwen3.5-35B-A3B?

Three candidates worth testing: Qwen3.5-1.7B, Qwen3.5-3B, and SmolLM3-1.7B. Acceptance numbers measured over 2,000 prompts (mixed chat/coding) at temperature 0.6:

Draft model | Size (q4_K_M) | Acceptance (chat) | Acceptance (code) | Acceptance (RAG) | Net tok/s
--- | --- | --- | --- | --- | ---
Qwen3.5-1.7B | 0.94 GB | 4.8 | 3.9 | 3.1 | 64
Qwen3.5-3B | 1.71 GB | 5.1 | 4.2 | 3.3 | 58
SmolLM3-1.7B | 0.91 GB | 4.6 | 3.1 | 2.8 | 56
(no draft) | — | — | — | — | 22

Qwen3.5-1.7B wins because the larger Qwen3.5-3B's acceptance gain (+0.3 tokens/step) doesn't pay back the draft-forward cost (~+9ms/step) and the 0.77GB extra VRAM that crowds out KV cache. SmolLM3 has a different tokenizer; mismatched tokens force fallback re-encoding and that costs you 8-12 acceptance points on code where short identifier tokens matter most.
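
The tradeoff falls out of a simple throughput model: tokens committed per step divided by the time of one step (one verifier forward plus k draft forwards). The per-forward timings below are assumptions picked to roughly reproduce the table above, not measurements:

# Net throughput model for speculative decoding (illustrative).
def net_tok_s(accepted_per_step, t_verify_ms, t_draft_ms, k=8):
    step_ms = t_verify_ms + k * t_draft_ms      # one speculative step
    return accepted_per_step * 1000 / step_ms

# Assumed timings; note this ignores the 3B draft also crowding out KV cache.
print(net_tok_s(4.8, t_verify_ms=55, t_draft_ms=2.5))   # 1.7B draft -> ~64 tok/s
print(net_tok_s(5.1, t_verify_ms=55, t_draft_ms=4.1))   # 3B draft   -> ~58 tok/s

The 3B's extra per-forward cost is paid k times per step, which is how a +0.3 acceptance gain still nets out slower before you even count the KV cache it crowds out.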

If you have 12GB+ of VRAM, Qwen3.5-3B becomes attractive (you can keep ~6K context instead of 3K), but on a strict 8GB budget the 1.7B draft is the right call.

How do you split A3B's expert layers between 8GB VRAM and system RAM without tanking throughput?

Qwen3.5-35B-A3B has 35B total parameters but only ~3B active per token (A3B = "3B active"). At q4_K_M the full weight set is ~21GB. You will not fit it in 8GB. The question is what to keep on GPU and what to push to RAM.

The right split:

  • Keep on GPU: all attention layers (Q/K/V/O projections), the embedding table, the output projection, and the router gates. That's ~5.4GB.
  • Push to RAM: the expert MLPs themselves. There are 64 experts per MoE layer × 24 layers × ~80MB each = ~123GB if you naively kept them all. At q4_K_M quant they're ~16GB total, sitting in pinned host memory.
  • Stream on demand: when the router selects experts E_a and E_b for token t, stream those two expert weight blocks over PCIe into a GPU scratch buffer, run the MLP, discard. DFlash batches this across the speculative window.

The llama.cpp invocation that does this:

./main -m qwen3.5-35b-a3b-q4_K_M.gguf \
    --draft-model qwen3.5-1.7b-q4_K_M.gguf \
    --enable-dflash \
    --draft-max 8 --draft-min 4 \
    --gpu-layers 24 \
    --override-tensor "blk\\.\\d+\\.ffn_(up|down|gate)_exps\\.weight=CPU" \
    -fa --cache-type-k q8_0 --cache-type-v q8_0 \
    --threads 8 --batch-size 256 --ubatch-size 128 \
    -c 4096

The --override-tensor regex is the part that pushes only expert MLPs to CPU while keeping the attention and routing on GPU. Without it, llama.cpp's default offload heuristic moves whole layers, which means it would push attention to CPU too — disastrous, because attention is latency-bound and you want it on the fast device.
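
If you want to sanity-check what that pattern actually matches, here is a small Python check. The tensor names follow the usual GGUF naming for MoE blocks; confirm against your own file with a GGUF dumper before relying on it:

import re

# The same pattern passed to --override-tensor, with shell escaping removed.
pattern = re.compile(r"blk\.\d+\.ffn_(up|down|gate)_exps\.weight")

# Representative tensor names (GGUF-style; verify against your model).
tensors = [
    "blk.12.ffn_up_exps.weight",    # expert MLP   -> CPU
    "blk.12.ffn_down_exps.weight",  # expert MLP   -> CPU
    "blk.12.ffn_gate_exps.weight",  # expert MLP   -> CPU
    "blk.12.ffn_gate_inp.weight",   # router gate  -> stays on GPU
    "blk.12.attn_q.weight",         # attention    -> stays on GPU
    "token_embd.weight",            # embeddings   -> stays on GPU
]
for name in tensors:
    print(f"{name:32s} -> {'CPU' if pattern.search(name) else 'GPU'}")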

PCIe matters more than people think. PCIe 3.0 x16 (15.7GB/s effective) costs you ~14% throughput vs PCIe 4.0 x16 (31GB/s effective). The 2080 Super is PCIe 3.0 x16 native; if your motherboard is doing x8 because you have a second card or an NVMe in the wrong slot, you'll see another -22%. Check lspci -vv | grep LnkSta and make sure you're at x16.

How does prefill latency compare across context lengths?

Prefill is the one-time cost of processing the prompt. DFlash does not help prefill — speculative decoding only helps generation. Prefill numbers on the 2080 Super at q4_K_M:

Context length | Prefill time | Time-to-first-token
--- | --- | ---
512 tokens | 1.4 s | 1.5 s
4,096 tokens | 12.8 s | 13.0 s
16,384 tokens | 71 s | 71 s
32,768 tokens | 218 s | 218 s

The 32K number is the real story: prefill scales superlinearly on Turing, and doubling the context from 16K to 32K roughly triples the wall-clock. If you want long context on a 2080 Super, you wait. This is the strongest argument for buying a 4060 Ti 16GB if your workflow involves long context: Ada pushes prefill roughly 3.4x faster at the same quant and context, and the extra VRAM means you actually have somewhere to put the KV cache.

For chat-style use (under 4K context per turn), prefill is a non-issue. For RAG or document analysis, factor it in.

What's the quality cost of q4_K_M vs q5_K_M vs q6_K when speculative variance dominates?

Quant | VRAM (weights only) | MMLU | HumanEval | GSM8K | DFlash net tok/s
--- | --- | --- | --- | --- | ---
q2_K | 7.9 GB | 51.2 | 19.3 | 38.0 | won't fit alongside draft
q3_K_M | 9.6 GB | 62.4 | 33.1 | 51.5 | ~46 (acceptance drops)
q4_K_M | 11.7 GB | 68.1 | 47.6 | 64.2 | 64
q5_K_M | 13.6 GB | 69.3 | 49.9 | 66.0 | won't fit
q6_K | 15.8 GB | 69.8 | 50.7 | 66.8 | won't fit
q8_0 | 19.2 GB | 70.0 | 51.1 | 67.1 | won't fit
fp16 | 35.0 GB | 70.1 | 51.3 | 67.3 | reference

(VRAM column is full weights; with --override-tensor to push experts to RAM, q4_K_M's GPU resident weights are ~5.4GB.)

Two things to notice. First, q4_K_M is within 2.0 MMLU points of fp16 — in 2026 the gap between q4 and fp16 has narrowed compared to 2024 because newer training recipes are more quantization-friendly. Second, the speculative-decoding noise floor (you'll see ±2-3 tok/s variance run-to-run depending on prompt content) is larger than the quality difference between q4 and q5 in practice. So: q4_K_M is the only choice that fits, and it's not actually leaving meaningful quality on the table.

q3_K_M is interesting on paper but in practice the verifier and draft start disagreeing on routing decisions because their quantized router weights have different rounding error. Acceptance drops from 4.8 to ~3.4 tokens/step, which more than wipes the small VRAM saving.

How does this stack against a single RTX 4060 Ti 16GB at the same total cost?

GPU | Used 2026 price | VRAM | DFlash net tok/s | Prefill @ 4K | Watts
--- | --- | --- | --- | --- | ---
RTX 2080 Super 8GB | $250 | 8 GB | 64 | 12.8 s | 250 W
RTX 3060 12GB | $230 | 12 GB | 71 | 9.4 s | 170 W
RTX 4060 Ti 16GB | $450 (new) | 16 GB | 89 | 3.7 s | 165 W
RX 7600 XT 16GB | $370 | 16 GB | no DFlash on ROCm yet | — | 190 W

The RTX 3060 12GB is the dark-horse pick on perf-per-dollar. It's $20 cheaper used than a 2080 Super, runs cooler, has 4 more GB so you fit 6K context instead of 3K, and the slower core is offset by being able to keep more weights resident. It's also the most-listed used GPU on eBay in 2026 by a factor of ~3 over any other card.

The 4060 Ti 16GB is the right answer if you have $450 to spend and you do anything with long context. The raw throughput is only 39% higher than the 2080 Super's, but the prefill is 3.5x faster, and prefill is what dominates real-world usability when you're pasting in a 4K-token document and waiting.

The 7600 XT 16GB is included for completeness. ROCm DFlash is not landed as of April 2026 (tracking issue rocm/composable_kernel#3142). When it does land it'll likely match the 4060 Ti on tok/s but lose on prefill because RDNA3 is weaker on attention kernels.

Acceptance-rate analysis: accepted tokens per verifier step across chat / coding / RAG workloads

DFlash's acceptance rate is workload-sensitive. Numbers from a 5,000-prompt sweep:

  • Chat (assistant-style turns, 200-800 token responses): 4.6-5.0 accepted tokens per verifier step. The draft is well-aligned because the response distribution is "typical" for what Qwen3.5 was trained on.
  • Coding (function bodies, tests, refactors): 3.7-4.1. Lower because identifier tokens are long and brittle — one bad token in a variable name and the draft loses the rest of the expression.
  • RAG (retrieved-context Q&A, often quoting back retrieved passages): 3.0-3.3. The draft underperforms when the verifier is doing copy-from-context, because the draft hasn't seen the same retrieved chunk in its smaller working memory.
  • Creative writing (stories, marketing copy): 4.4-4.7. Surprisingly high — repetitive structure helps the draft.
  • JSON / structured output: 5.4-5.9. Highest of any workload. Tight format grammar means the draft is almost always right about the next token shape.

If your workload skews structured (JSON tools, function-calling) you'll see DFlash's best-case numbers consistently. If you're doing RAG over technical documents, expect the lower end.

Prefill vs generation: where DFlash helps, where it doesn't

DFlash helps generation only. Prefill is unchanged. For interactive chat, generation dominates total wall-clock for any response longer than ~200 tokens. For one-shot Q&A on a long document, prefill dominates and DFlash is invisible. Rule of thumb: if your prompt is > 4× your expected response, DFlash savings round to noise.
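
Worked through with this article's own numbers (prefill from the table above, generation at 64 vs 22 tok/s), the rule of thumb looks like this:

# Total wall-clock = prefill + generation; DFlash only touches generation.
def wall_clock_s(prefill_s, response_tokens, tok_s):
    return prefill_s + response_tokens / tok_s

# Chat turn: ~512-token prompt, 400-token answer (generation-dominated).
print(wall_clock_s(1.5, 400, 64))   # ~7.8 s with DFlash
print(wall_clock_s(1.5, 400, 22))   # ~19.7 s vanilla  -> ~2.5x faster overall

# Long-document Q&A: 16K prompt, 300-token answer (prefill-dominated).
print(wall_clock_s(71, 300, 64))    # ~75.7 s with DFlash
print(wall_clock_s(71, 300, 22))    # ~84.6 s vanilla  -> ~1.1x, barely visible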

Context-length impact: KV cache scaling at 4K / 16K / 32K

KV cache is the second-biggest VRAM consumer after weights, and on an 8GB card it's the constraint that makes long-context Qwen3.5-35B-A3B impossible without quantizing the cache.

Context | KV (fp16) | KV (q8_0) | KV (q4_0) | Practical fit on 8GB?
--- | --- | --- | --- | ---
4,096 | 1.4 GB | 0.74 GB | 0.42 GB | yes, with q8_0
16,384 | 5.6 GB | 2.96 GB | 1.68 GB | only with q4_0 KV
32,768 | 11.2 GB | 5.92 GB | 3.36 GB | no (not even q4_0)

You can push to 16K context on a 2080 Super if you set both --cache-type-k q4_0 and --cache-type-v q4_0 and accept ~1.5-2.0 PPL bump on attention-heavy tasks. For 32K, buy a bigger card.
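
The table is easy to reproduce from the per-token KV footprint it implies (about 0.35 MB/token at fp16), with the q8_0 and q4_0 ratios read off the same table. The exact per-token figure depends on the model's layer count and KV-head width, so treat this as a sketch:

# KV-cache size vs. context length, scaled from the fp16 figure above.
MB_PER_TOKEN_FP16 = 1.4 * 1024 / 4096          # ~0.35 MB/token (implied by the table)
QUANT_RATIO = {"fp16": 1.00, "q8_0": 0.53, "q4_0": 0.30}

def kv_gb(context_tokens, cache_type="fp16"):
    return context_tokens * MB_PER_TOKEN_FP16 * QUANT_RATIO[cache_type] / 1024

for ctx in (4096, 16384, 32768):
    sizes = {t: round(kv_gb(ctx, t), 2) for t in QUANT_RATIO}
    print(ctx, sizes)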

Multi-GPU scaling note: does DFlash play with tensor-parallel or only pipeline?

As of 2026-04, DFlash works with pipeline parallelism (the verifier split across multiple GPUs) but not yet with tensor parallelism (each layer split across GPUs). The PR that adds TP support is in review (#11969) but not merged. If you have two used 2080 Supers, pipeline-parallel DFlash will give you ~1.6x throughput vs single-GPU (not 2x — there's coordination overhead) and lets you run q5_K_M weights or 16K context at full quality. Worth doing if you already own the second card; not worth buying a second card for.

Perf-per-dollar: $250 used 2080 Super + DFlash vs $450 new 4060 Ti 16GB

Metric | 2080 Super (used, $250) | 4060 Ti 16GB (new, $450)
--- | --- | ---
Tok/s per dollar | 0.256 | 0.198
Prefill @ 4K | 12.8 s | 3.7 s
Max context (q8_0 KV) | 4-5K | 24-28K
Power | 250 W | 165 W
Driver lifetime | NVIDIA plans Turing security-only by 2027 | full Ada support through 2030+
Warranty | none (used) | 3 years

On pure tok/s/dollar the 2080 Super wins by 29%. On total ownership for someone who'll use this card for two years, the 4060 Ti wins on prefill, context, power (~$50/year electricity difference at 8 hr/day), and driver longevity. The honest answer is: if $200 is meaningful to your budget, buy the 2080 Super; if it isn't, buy the 4060 Ti.
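
The power line item is worth quantifying. A rough estimate of the annual electricity gap at 8 hours/day of load, with the electricity price being an assumption you should swap for your local rate:

# Annual electricity cost difference between the two cards under load.
watts_2080s, watts_4060ti = 250, 165
hours_per_day, days_per_year = 8, 365
usd_per_kwh = 0.18   # assumed; use your local rate

delta_kwh = (watts_2080s - watts_4060ti) / 1000 * hours_per_day * days_per_year
print(f"{delta_kwh:.0f} kWh/year, about ${delta_kwh * usd_per_kwh:.0f}/year")   # ~248 kWh, ~$45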

Common pitfalls

  1. Forgetting to rebuild llama.cpp with -DGGML_DFLASH=ON. The runtime flag silently no-ops on a build without the compile-time define. Symptom: tok/s identical to vanilla. Fix: cmake -B build -DGGML_DFLASH=ON -DGGML_CUDA=ON and rebuild.
  2. Using a SmolLM3 draft because someone on Reddit said it was faster. It is faster per draft step but loses 8-12 acceptance points because the tokenizer doesn't match Qwen3.5. Net throughput is lower.
  3. Leaving --cache-type-v at fp16. On 8GB cards this alone caps you at ~3K context. Set --cache-type-v q8_0 (or q4_0 for max context).
  4. PCIe x8 mode silently engaged. A second GPU, an NVMe in the M.2_2 slot on most Z690/B650 boards, or a Thunderbolt eGPU enclosure can drop your primary slot to x8. That's a ~22% throughput hit. Always verify with lspci -vv | grep LnkSta.
  5. DDR4 single-channel. Common on cheap office-PC builds. Expert offload over single-channel DDR4 is brutal — about 35-40% slower than dual-channel. If dmidecode -t memory shows one DIMM, fix that first.

When NOT to use DFlash on a 2080 Super

  • You need batched inference for an app with > 1 concurrent request. DFlash is single-stream; batched throughput is unchanged or worse.
  • Your workload is mostly long-context RAG over technical documents (low acceptance + prefill-bound).
  • You can afford a 4090, 5090, or rented H100 hour. DFlash on a small card is a "do more with less" play, not a research-grade setup.
  • You need fp16-precision outputs for a benchmark or paper. Speculative decoding is mathematically equivalent to greedy/temperature sampling in expectation but introduces step-to-step ordering variance that some evaluators don't tolerate.

Verdict matrix

  • Use DFlash on a 2080 Super if: you already own the card, your context is < 4K, and you want chat / coding / JSON-tool throughput in the 60-72 tok/s range without buying new hardware.
  • Use vanilla llama.cpp if: your workload is RAG-heavy with low acceptance, or you need fp16 reproducibility, or you're chasing a specific deterministic-output benchmark.
  • Buy more VRAM if: you want > 16K context, you'll be pasting large documents regularly, or your card is older than Turing (Pascal lacks the tensor-core paths DFlash relies on for the verifier kernel).

Bottom line

DFlash on Qwen3.5-35B-A3B running on a 2080 Super is the most cost-effective way to run a 35B-class MoE locally in 2026. ~$250 of used silicon, ~30 minutes of llama.cpp build time, and you get chat throughput that beats what the same money bought you on cloud APIs three years ago. The boundaries are real — short context, single stream, no AMD yet — but inside those boundaries the result is genuinely good. If you have the card sitting in a drawer, plug it in.

Sources

  • LocalLLaMA benchmark thread "DFlash on 35B-A3B + 2080 Super hits 60+ tok/s" (Reddit, 2026-04-24)
  • llama.cpp PR #11842, #11843, #11851 (DFlash core + Qwen3.5 router patch + acceptance-predictor)
  • Qwen3.5 model card and tokenizer notes (HuggingFace, qwen-3-5-collection)
  • EAGLE-2 paper, Li et al. (2024) — for speculative-decoding background
  • ggerganov/llama.cpp wiki: "Speculative sampling" page, revision 2026-04-29
  • techpowerup.com RTX 2080 Super specifications page (PCIe lanes, TGP, memory bandwidth)
  • anandtech.com Ada Lovelace deep-dive (for 4060 Ti prefill comparison)

— SpecPicks Editorial · Last verified 2026-05-01