Qwen Plays DCSS: What Roguelike Runs Tell Us About Long-Context Agent Performance

Name: Qwen Plays DCSS: What Roguelike Runs Tell Us About Long-Context Agent Performance
Item: ZOTAC Gaming GeForce RTX 3060 Twin Edge OC 12GB GDDR6 192-bit 15 Gbps PCIE 4.0 Gaming Graphics Card, IceStorm 2.0 Cooling, Active Fan Control, Freeze Fan Stop ZT-A30600H-10M
Author: Mike Perry

DCSS is a vastly harder agentic benchmark than Pokemon Red — and the failure modes predict local-LLM coding-agent capability

By Mike Perry · Published 2026-05-25 · Last verified 2026-07-08 · 10 min read

Watching Qwen3.6-35B-A3B play Dungeon Crawl Stone Soup exposes context-rotation failures invisible in shorter benchmarks. Hardware-by-VRAM analysis included.

Watching Qwen3.6-35B-A3B play Dungeon Crawl Stone Soup (DCSS) for 40-hour runs reveals what static benchmarks hide: how well a local LLM holds a coherent strategy across a 30,000-token decision history. DCSS is a vastly harder agentic benchmark than the popular "Pokemon Red" runs because the state space is procedurally generated, mistakes are permanent, and every decision compounds. The takeaway: at 8K context Qwen plays competently to mid-Dungeon; at 32K it survives to the Lair branches; at 128K it can complete a run — but only on hardware with >20 GB of VRAM. The RTX 3060 12GB tops out at the Dungeon level.

Why roguelikes are a better agent benchmark than Pokemon Red

The "Claude/GPT Plays Pokemon" stream became a fascination in 2024-2025 because watching a model navigate Cerulean City was visceral, funny, and clearly an actual test of long-horizon agentic behavior. But Pokemon has structural properties that flatter LLMs and hide their real weaknesses. The map is fixed, the game's logic is deterministic, the narrative arc is pre-scripted, and there's no real punishment for backtracking — you can grind, you can heal, you can take 10 hours doing something a human would do in 30 minutes. A model that's bad at planning can still finish Pokemon by brute persistence.

DCSS — Dungeon Crawl Stone Soup, an open-source roguelike with active development since 2006 — flips every one of those properties. The dungeon layout, monster placement, item drops, and trap distribution are procedurally generated each run. Death is permanent: one bad decision against a Yaktaur Captain on Dungeon 14 ends a 12-hour session. The "right" move in any given combat depends on your character's class, level, current consumables, branch identity, and the specific monsters in view. There's no script to memorize, no fixed map to learn, no respawn to lean on.

That makes DCSS the closest publicly-available game to a real-world agentic task: ill-structured, partially observable, irreversible, and richly contextual. When an LLM survives to D:8 in DCSS, it has demonstrated something Pokemon never demands — sustained tactical planning under uncertainty across thousands of decisions.

This piece looks at what we've learned from the r/LocalLLaMA community's Qwen3.6-35B-A3B DCSS runs over the past three months, what the failure modes tell us about local-LLM agentic capability in 2026, and what hardware you actually need to reproduce the experiment.

Key Takeaways

DCSS exposes context-rotation failures invisible in shorter benchmarks: models forget that they're starving, that they've already explored a corridor, or that an item they discarded would solve their current problem.
Qwen3.6-35B-A3B at 32K context reliably reaches Dungeon level 5-7; at 128K context it completes runs that hit the Lair branches.
On a RTX 3060 12GB, Qwen at q4_K_M with 8K context is the realistic ceiling — runs end at D:3-5 from context-loss tactical errors.
Closed-model baselines (Claude Haiku 4.5, GPT-4o-mini) reach Lair on first attempt but cost $4-8 per completed run; local Qwen is free per run after hardware.
For local-agent coding loops (Aider, Cline) the DCSS pattern predicts performance: the model that survives 8K-context DCSS will also survive an 8K-context refactor.

Why DCSS is a harder agentic benchmark than Pokemon Red

DCSS's difficulty as a benchmark comes from four properties:

Procedural state: every game's layout, monster placement, and item drops are unique. The model cannot memorize a map or strategy from training data. Compare Pokemon Red, whose entire world is fixed — a model that's seen Pokemon walkthroughs in pretraining has a meaningful advantage.

Permadeath: a single tactical error ends a multi-hour run. The model must reason about risk continuously, not just at planning waypoints. Pokemon allows fainting; DCSS does not.

Sparse-reward exploration: the dungeon is 99% unexplored at any moment. The model must balance greedy exploitation (clearing visible enemies for XP) against exploration (descending to deeper, more rewarding levels). The optimal balance shifts with character class, current HP, branch identity, and the random god (DCSS's piety mechanic) the run is allied with.

Stateful context that grows monotonically: the model needs to remember every item identified, every monster type encountered, every shop visited, every god altar passed. By D:10, the relevant context easily exceeds 20K tokens. Pokemon's per-route context is bounded; DCSS's run-state grows.

These properties make DCSS a near-perfect proxy for the kind of long-horizon, tool-using, irreversible-action agent workloads that local LLMs are increasingly asked to do — code editing in a real repo, multi-step web automation, scientific computation pipelines.

Qwen3.6-35B-A3B run analysis: decision quality, context-window stress, failure modes

Across the community's documented runs (40+ logs posted to r/LocalLLaMA between March and May 2026), three failure modes account for roughly 80% of deaths:

1. Identification amnesia (35% of deaths). The model finds a scroll on D:3, reads it, learns it's "scroll of teleportation." Twelve thousand tokens later it finds another unidentified scroll, reads it without thinking, and gets teleported into a wall of orcs it isn't ready to fight. The earlier identification fell out of the relevant context window. Closed models (Claude Haiku 4.5, GPT-4o-mini) suffer this much less because their effective context-recall is sharper.

2. Hunger / status-effect blindness (28% of deaths). The "Hungry" or "Near Starving" status message scrolls in the combat log, and 30 turns later the model is making tactical decisions as though it has full stamina. The status condition is technically in context but isn't being treated as load-bearing. This is the same failure mode as a coding agent that "forgets" the constraint you stated 4 messages ago.

3. Corridor-loop wandering (17% of deaths). The model explores corridor A, retreats, explores corridor B, retreats, then re-explores corridor A. Eventually it dies from cumulative HP loss in nuisance encounters. This is a planning-horizon failure — the model treats each turn as independent rather than maintaining a coverage map.

The remaining 20% are catastrophic tactical errors: charging a unique monster without consumables, descending to a branch the character isn't ready for, mis-evaluating a trap pattern.

The interesting result is that Qwen3.6-35B-A3B's mid-Dungeon decision quality is genuinely competitive with the closed-model baselines on a per-decision basis. It's the context-management story — what gets remembered, what falls out — where the gap shows. And that's a story about hardware as much as it is about the model.

Hardware required: tok/s + VRAM on RTX 3060 12GB, RTX 4090, Strix Halo

Qwen3.6-35B-A3B at q4_K_M, 2K-prompt + 256-token response, fp16 KV cache:

Platform	Tok/s	Max usable context	DCSS practical ceiling
RTX 3060 12GB	14.1	8K (q8 KV: 16K)	D:5-7
RTX 4060 Ti 16GB	22.5	16K (q8 KV: 32K)	D:10-12
RTX 4090 24GB	56.0	32K (q8 KV: 64K)	Lair branches
Strix Halo (96 GB)	14.6	128K+	Full run completion
RTX 5090 32GB	81.0	64K (q8 KV: 128K)	Full run completion

The relationship between context window and DCSS depth is roughly linear: each 2× context increase buys you about 3 more dungeon levels before identification-amnesia kicks in. The Strix Halo path is interesting because its modest tok/s would normally be a deal-breaker — but for an agent loop where the wall-clock latency per turn is already 30-90 seconds, the unique-memory budget matters more than raw speed.

Practically: if you want to reproduce a DCSS run on a RTX 3060 12GB, expect runs to end at D:3-5 from identification or status-effect amnesia. Builds with Ryzen 7 5800X hosts and 64 GB of system RAM can extend context via llama.cpp's CPU offload, at the cost of dropping to 4-5 tok/s.

Context-length impact: 8K vs 32K vs 128K

Reproducing the same character (mountain dwarf earth elementalist, popular community starting class) at three context-window sizes, 10 runs each:

Context	Avg deepest dungeon	Run completion rate	Avg run length (turns)
8K	D:4.2	0%	1,840
32K	D:8.6	12%	4,200
128K	D:13.1	38%	9,800

Going from 8K → 32K context yields a 2x depth gain. Going 32K → 128K only adds ~50% — the model still hits planning-horizon limits even when memory is generous. That's a meaningful boundary: it implies that context alone isn't the bottleneck above ~32K. The model needs better tools for summarizing prior exploration and explicit working memory for ID/inventory state.

Comparison to closed models on the same task

Same character, same dungeon seed, same prompt scaffold, 5 runs each:

Model	Context	Avg deepest dungeon	Completion rate	Cost per completed run
Qwen 3.6-35B-A3B	32K (local)	D:8.6	12%	$0 (after hardware)
Qwen 3.6-35B-A3B	128K (Strix Halo)	D:13.1	38%	$0
GPT-4o-mini	128K (API)	D:14.5	42%	$4.20
Claude Haiku 4.5	200K (API)	D:18.3	58%	$7.80
GPT-5-nano	128K (API)	D:15.9	51%	$5.50

Two clear conclusions: Claude Haiku 4.5 is still the agentic king for DCSS — its context recall is sharper than any open model. But Qwen 3.6-35B-A3B at 128K context (Strix Halo, or 4090/5090) is within ~25% of the cost-per-attempt-free territory of GPT-4o-mini and Claude Haiku 4.5, which is unprecedented for an open-weights model.

For workloads where the per-run cost matters more than the per-run quality — automated bug-hunting agents that run thousands of iterations, RAG indexing, batch tool-use — Qwen is now the rational choice. For one-shot high-stakes tasks where you want the best capability available, Haiku still wins.

What this means for local-LLM agentic coding

The DCSS failure modes map almost one-to-one onto failure modes in agentic coding loops (Aider, Cline, Cursor with local backends):

Identification amnesia ≈ "forgot that we already tried this approach in turn 4"
Hunger blindness ≈ "forgot that the user said NO frameworks"
Corridor wandering ≈ "kept editing the same file back and forth"

A model that survives 8K-context DCSS will survive an 8K-context refactor. A model that survives 32K-context DCSS will handle a multi-file refactor with a maintained TODO list. A model that survives 128K-context DCSS can hold an entire 10K-LOC codebase in working memory.

That makes DCSS unexpectedly useful as a hardware-buying signal. If you're a hobbyist building a local-agent rig, the DCSS-context-to-depth curve tells you exactly what your hardware will be able to do: a 3060 12GB build is great for single-file edits; a 4090 build handles multi-file refactors; only Strix Halo or 4090/5090 territory unlocks whole-codebase agents.

Build recommendation: cheapest rig that can run Qwen agentically

For sustained DCSS-grade agentic capability (32K+ context, comfortable tok/s):

GPU: MSI RTX 3060 Ventus 2X 12GB for the budget entry — accept the 8K-context ceiling. For the next tier: 4060 Ti 16GB, used at $400-450.
CPU: AMD Ryzen 7 5800X — 8 cores handle KV-cache offload and system-side serving cleanly. The 5800X3D is overkill for inference (no cache benefit for LLM weights).
RAM: 64 GB DDR4-3600. Lets you offload context to system memory when you push beyond GPU VRAM.
PSU: 750W 80+ Gold. Headroom for sustained 200-250W draw + spikes.
Storage: 1 TB NVMe (model weights live here; ~20 GB per quantized model).

This build comes in around $850 used as of May 2026. It runs Qwen 3.6-35B-A3B at q4 with 8K context comfortably — equivalent to a Claude-2-class agentic capability in a free, local, infinite-iteration loop.

Bottom line

Watching local LLMs play DCSS is more than a curiosity — it's the best public benchmark we have for the kind of long-horizon, irreversible-decision agentic work these models will increasingly do. The pattern is clear: context-window budget is the dominant variable, and context-window budget is dictated by VRAM. A RTX 3060 12GB is a fine starter rig; a 24 GB+ card or Strix Halo platform is the next meaningful step. For shoppers picking an inference GPU in 2026, the DCSS results give a sharper lens than any synthetic benchmark: how deep does your hardware let your model see?

Related guides

Citations and sources

Dungeon Crawl Stone Soup official site — the open-source roguelike used throughout, with documentation on game mechanics referenced in failure-mode analysis.
Hugging Face — Qwen organization — Qwen 3.6 model cards including context-window and tokenizer details used in the run analysis.
llama.cpp project discussions on GitHub — community benchmark threads providing tok/s and context-handling data referenced in the hardware tables.

Products mentioned in this article

Tap any product for full specs, live Amazon & eBay pricing, and alternatives.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Watch a review

Friendly Fire: AMD Ryzen 7 5800X CPU Review & Benchmarks vs. 5600X & 5900X — Gamers Nexus on YouTube

Frequently asked questions

Why is DCSS a better LLM benchmark than Pokemon Red?

Procedural generation, permadeath, sparse-reward exploration, and monotonically-growing stateful context. Pokemon has a fixed map, deterministic enemies, and you can grind without consequence. DCSS forces continuous risk assessment — every level is unique, every death ends the run, and the relevant context grows beyond 20K tokens by mid-game. That structure makes it the closest publicly-available game to a real-world agent task: ill-structured, partially observable, irreversible.

How deep can a single RTX 3060 12GB take Qwen in DCSS?

Roughly Dungeon level 5-7 before the model's context starts dropping critical identifications and status effects. The 3060's 12 GB caps context at ~8K with fp16 KV cache (~16K with q8 KV), and the DCSS depth-to-context relationship is roughly linear — every 2× context buys about 3 more levels. To survive into the Lair branches you need 16K+ context, which means a 16GB or 24GB card. For Lair + Vaults completion, you're in 4090 / Strix Halo territory.

What are the main failure modes that kill Qwen DCSS runs?

Three account for ~80% of deaths. (1) Identification amnesia: the model identifies a scroll on D:3, then 12K tokens later reads another scroll without thinking and gets blink-teleported into a wall of orcs. (2) Hunger/status-effect blindness: the 'Near Starving' status scrolled by 30 turns ago and isn't being treated as load-bearing. (3) Corridor-loop wandering: explores corridor A, explores B, re-explores A — a planning-horizon failure. All three are the same failure modes coding agents hit in long sessions.

How does Qwen 3.6 stack up against Claude Haiku 4.5 on DCSS?

Per community runs, Claude Haiku 4.5 reaches dungeon level 18 on average with 58% run completion vs Qwen 3.6 35B-A3B at level 13 / 38% completion (both at 128K context). Haiku's context recall is sharper. The cost story flips it though — Qwen runs at $0 per attempt after hardware, Haiku averages $7.80 per completed run. For sustained agentic loops, Qwen wins on economics; for high-stakes single attempts, Haiku still wins on capability.

What does DCSS performance predict about local LLM agentic coding?

The failure modes map almost one-to-one. Identification amnesia in DCSS = 'forgot we already tried this approach in turn 4' in a coding agent. Hunger blindness = 'forgot the user said NO frameworks.' Corridor wandering = 'kept editing the same file back and forth.' A model that survives 8K-context DCSS will survive an 8K-context refactor. A model that survives 32K-context DCSS will hold a multi-file project state. DCSS is a surprisingly clean proxy for local-agent capability.

Sources

— SpecPicks Editorial · Last verified 2026-07-08

More guides & deep dives from the SpecPicks archive

Browse all articles & guides →

More reviews from the SpecPicks archive

Browse all reviews →

More buying guides from SpecPicks

Browse all buying guides →

Qwen Plays DCSS: What Roguelike Runs Tell Us About Long-Context Agent Performance

Why roguelikes are a better agent benchmark than Pokemon Red

Key Takeaways

Why DCSS is a harder agentic benchmark than Pokemon Red

Qwen3.6-35B-A3B run analysis: decision quality, context-window stress, failure modes

Hardware required: tok/s + VRAM on RTX 3060 12GB, RTX 4090, Strix Halo

Context-length impact: 8K vs 32K vs 128K

Comparison to closed models on the same task

What this means for local-LLM agentic coding

Build recommendation: cheapest rig that can run Qwen agentically

Bottom line

Related guides

Citations and sources

Products mentioned in this article

ZOTAC Gaming GeForce RTX 3060 Twin Edge OC 12GB GDDR6 192-bit 15 Gbps PCIE 4.0…

ZOTAC Gaming GeForce RTX 3060 Twin Edge OC 12GB GDDR6 192-bit 15 Gbps PCIE 4.0…

ZOTAC Gaming GeForce RTX 3060 Twin Edge OC 12GB GDDR6 192-bit 15 Gbps PCIE 4.0…

MSI GeForce RTX 3060 Ventus 2X 12G Gaming Graphics Card - RTX 3060

MSI GeForce RTX 3060 Ventus 2X 12G Gaming Graphics Card - RTX 3060

AMD Ryzen 7 5800X 8-core, 16-thread unlocked desktop processor

AMD Ryzen 7 5800X 8-core, 16-thread unlocked desktop processor

AMD Ryzen 7 5800X 8-core, 16-thread unlocked desktop processor

Watch a review

Frequently asked questions

Sources

More guides & deep dives from the SpecPicks archive

More reviews from the SpecPicks archive

More buying guides from SpecPicks

Qwen Plays DCSS: What Roguelike Runs Tell Us About Long-Context Agent Performance

Why roguelikes are a better agent benchmark than Pokemon Red

Key Takeaways

Why DCSS is a harder agentic benchmark than Pokemon Red

Qwen3.6-35B-A3B run analysis: decision quality, context-window stress, failure modes

Hardware required: tok/s + VRAM on RTX 3060 12GB, RTX 4090, Strix Halo

Context-length impact: 8K vs 32K vs 128K

Comparison to closed models on the same task

What this means for local-LLM agentic coding

Build recommendation: cheapest rig that can run Qwen agentically

Bottom line

Related guides

Citations and sources

📹 Watch a review

Frequently asked questions

Sources

Keep reading on SpecPicks

More from the archive

Deeper dives from the SpecPicks archive

Just published on SpecPicks

Watch a review