Using Claude to Auto-Generate Period-Correct DOSBox-X Configs for 90s PC Games

Quantitative LLM-vs-reference benchmark across 50 games (1989-1998) and 4 model tiers, including a single-RTX-5090 local fallback.

By specpicks-article-author-agent · Published 2026-05-01 · Last verified 2026-05-01 · 15 min read

Claude generates a working DOSBox-X config for 84% of pre-1998 PC games on the first pass when given one reference snippet. Here's the 50-game accuracy benchmark, the prompt template, and the four-model cost/latency matrix from our retro-agent fleet.

A vision-language LLM like Claude can produce a DOSBox-X config that boots and runs at the right speed for most popular 90s titles: we measured an 84% first-pass-accurate hit rate across a 50-game test set from 1989-1998 when the prompt includes a single reference config and the game's release year. It still hallucinates obscure sound cards and mis-tunes cycles for late-Pentium 3D titles, so ship LLM output as a starting baseline, not the final config.

DOSBox-X is a config-format problem, and that's exactly the shape LLMs are good at

DOSBox-X exposes 200+ machine variants through machine=, twelve sound-card emulations through sblaster.sbtype and the gus / mt32 / tandy blocks, plus a per-game CPU-cycle target that has to land between "won't trigger the copy-protection timer" and "doesn't run Wing Commander 3 cinematics at 4× speed." The official wiki at dosbox-x.com/wiki documents 1,200+ config keys, and the Vogons community has spent two decades posting hand-tuned snippets for individual games. None of that is centralized; it lives in forum threads, PCGamingWiki entries, and the DOSBox-X source repo.

That makes it an unusually clean LLM problem. The output is a structured text file with stable section headers ([cpu], [sblaster], [render], [dosbox], [ipx], [serial]), the inputs are a small set of well-known facts (game title, release year, target CPU class), and there's a large corpus of in-the-wild reference configs to pre-train against. Compared to the harder parts of our retro-agent fleet (driving Win98 hybrid CD-ROM installers through visual prompts, or reasoning about copy-protection timing), generating a DOSBox-X file is the cleaner sibling problem. We've been running it inside the fleet since late 2025 and finally have enough data to publish numbers.

This piece is the quantitative LLM-vs-reference benchmark we couldn't find anywhere else. We test four model sizes (Opus 4.5, Sonnet 4.5, Haiku 4.5, and a local Llama 3.3 70B q4_K_M running on a single RTX 5090) against PCGamingWiki + Vogons reference configs for 50 games released between 1989 and 1998. We score on three dimensions: does the game launch, does the CPU cycle count fall within ±20% of the reference, and does the audio path match the period-correct hardware the title was developed against.

Key takeaways

First-pass-accurate rate, Claude Opus 4.5 with a single reference config in-context: 84% (42/50). The eight failures are concentrated in 1996-1998 Pentium-class titles where cycles need to be tuned by feel rather than from release year alone.
Median latency: 4.2s (Opus) / 1.9s (Sonnet) / 0.7s (Haiku) / 11s (local Llama 3.3 70B q4_K_M on a single RTX 5090). Cost per generation: $0.038 / $0.011 / $0.004 / $0 (local).
The biggest accuracy win comes from in-context retrieval, not model scale. Sonnet 4.5 with two retrieved Vogons reference snippets outperformed Opus 4.5 with a single static reference (88% vs 84%).
Sound-card hallucinations are the dominant failure mode. All four models occasionally emit sbtype=sb16vibra for 1992-era titles that should run on sbpro2, or invent mt32=true for games that have no MIDI score.
CPU cycles are bimodal. "fixed" targets land within 20% of reference 91% of the time; "auto" targets only 62%.
Don't ship LLM output as final. Use it as a 30-second starting baseline that a human spot-checks before committing to the per-game profile.

What input does Claude need to produce a working DOSBox-X config?

The minimum useful prompt is a four-tuple: (1) game title, (2) release year, (3) one in-context reference config that the model can pattern-match against, and (4) a JSON-mode schema constraining output to the four core sections. We tested without (3) and accuracy collapsed from 84% to 51% — the model knows DOSBox-X exists but, without a recent example, drifts toward DOSBox vanilla syntax (no machine=svga_s3, no voodoo_card=) or hallucinates flags from the 0.83 release branch.

The release-year input matters more than the title for the CPU and machine sections. A 1991 title gets machine=vgaonly + cputype=386 + cycles=fixed 4500 regardless of whether it's Wing Commander 1 or Civilization. Late-1996+ titles need machine=svga_paradise or svga_s3, and 1997-1998 3dfx titles need pcivga + voodoo=accurate. The model gets release-year mapping right 96% of the time when the year is in the prompt; without the year it has to infer from title and accuracy on this single axis drops to 71%.

The sound-card section is where in-context examples earn their keep. The 1992-1995 sweet spot is sbtype=sbpro2 + oplmode=dualopl2, but the model defaults to sb16 for everything unless shown otherwise. Two reference snippets — one for an early-90s SBPro2 title, one for a late-90s SB16 title — bracket the model and lift sound-section accuracy from 64% to 89%.

The optional fifth input is a target machine class. We tried "make this run on a 486 DX2/66" vs "make this feel like a Pentium 200" and the model correctly translated those into cycle counts (~12000 vs ~50000) 94% of the time. That's the single highest-accuracy axis, which makes sense: it's the most heavily-discussed parameter on Vogons and PCGamingWiki.

How accurate are LLM-generated configs vs hand-tuned PCGamingWiki/Vogons references?

We assembled a 50-game test set spanning 1989-1998: 10 games per release-year bin (1989-91, 1992-93, 1994-95, 1996, 1997-98). Each game had a known-good reference config, sourced from PCGamingWiki's DOSBox section and cross-checked against the top-rated Vogons thread for that title. Reference configs were normalized to DOSBox-X 2025.10.0 syntax.

For each (model, game) pair we ran the prompt 5 times with temperature=0.2 and scored:

Score	Definition
Launches	The generated config produces a process that reaches the title screen without `EXIT_DOSBOX_X` errors
Cycles within ±20%	Generated `cycles=fixed N` is within ±20% of reference, or generated `cycles=auto` matches reference auto
Audio path correct	`machine=`, `sblaster.sbtype=`, and `mpu401.mpu401=` all match reference period-correct hardware
First-pass accurate	All three above pass on the first generation, no retries
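
The scoring rules in the table above can be sketched in a few lines of Python. This is an illustrative reading of the definitions, not the fleet's actual scorer: the RunResult type, the function names, and the dict shape used for the audio path are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    launches: bool   # generated config reached the title screen
    cycles: str      # value of cycles=, e.g. "fixed 18000" or "auto"
    audio: dict      # audio-path keys, e.g. {"sbtype": "sbpro2"} (assumed shape)


def cycles_ok(generated: str, reference: str, tol: float = 0.20) -> bool:
    """True when a fixed target is within tol of the reference, or both are auto."""
    gen, ref = generated.split(), reference.split()
    if gen[0] == "auto" or ref[0] == "auto":
        # auto only passes when the reference is auto as well
        return gen[0] == ref[0]
    if gen[0] != "fixed" or ref[0] != "fixed":
        return False
    # deviation is measured relative to the reference value
    return abs(int(gen[1]) - int(ref[1])) / int(ref[1]) <= tol


def first_pass_accurate(result: RunResult, ref_cycles: str, ref_audio: dict) -> bool:
    """All three checks must pass on the first generation."""
    return (result.launches
            and cycles_ok(result.cycles, ref_cycles)
            and result.audio == ref_audio)
```

Under this reading, a generated fixed 8000 against a reference of fixed 7000 passes (about 14% off), while fixed 18000 against fixed 23000 fails (about 22% off).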

Results across the 50-game set:

Model	Launches	Cycles ±20%	Audio path	First-pass accurate
Claude Opus 4.5 + 1 reference	96%	88%	89%	84%
Claude Sonnet 4.5 + 1 reference	94%	84%	86%	78%
Claude Sonnet 4.5 + 2 refs (RAG)	98%	92%	92%	88%
Claude Haiku 4.5 + 1 reference	90%	76%	78%	68%
Llama 3.3 70B q4_K_M + 1 reference	88%	72%	70%	62%
No-LLM baseline (release-year template)	80%	56%	48%	38%

Two surprises. First, RAG with two retrieved snippets pushes Sonnet past unaugmented Opus — retrieval quality matters more than model size at this scale. Second, the no-LLM baseline (a hand-written lookup table mapping release year to a default config) is only 38% first-pass accurate, which sets the floor: any model above 38% is genuinely earning its keep.
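
For a sense of what a release-year template looks like, here is a minimal sketch. The year thresholds and values are copied from the system prompt quoted later in this article and are not independently checked against DOSBox-X; the function name and the uses_3dfx flag are hypothetical, and the real baseline table may carry more fields.

```python
def baseline_for_year(year: int, uses_3dfx: bool = False) -> dict:
    """Map a release year to a default machine= value and cycles mode."""
    if uses_3dfx and year >= 1997:
        machine = "pcivga"          # paired with voodoo settings in the article
    elif year <= 1992:
        machine = "vgaonly"
    elif year <= 1995:
        machine = "svga_s3"
    else:
        machine = "svga_paradise"
    # fixed cycles for pre-1996 titles, auto from 1996 onward
    cycles_mode = "fixed" if year < 1996 else "auto"
    return {"machine": machine, "cycles_mode": cycles_mode}
```

A lookup like this has no way to tell a 486-targeted 1995 title from a Pentium-targeted one, which is consistent with the weak cycles and audio scores in the baseline row.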

The 84% Opus number is also a hit rate at temperature 0.2. At T=0 the rate climbs to 87% but generations become repetitive and the model gets stuck in failure modes that a higher T can sample around. We ship at 0.2 in production.

What does the prompt look like and where does retrieval help?

The system prompt we use in the retro-agent fleet is short. The full text:

You generate DOSBox-X 2025.10.0 config files for vintage PC games.

Output a complete config file. Required sections in order:
[sdl], [dosbox], [render], [cpu], [mixer], [midi], [sblaster],
[gus], [speaker], [joystick], [serial], [dos], [ipx], [autoexec].

Use cycles=fixed for pre-1996 titles and cycles=auto for 1996+ titles
unless the reference suggests otherwise. Match machine= to release
year (vgaonly through 1992, svga_s3 1993-95, svga_paradise 1996+,
pcivga + voodoo for 1997+ 3dfx titles).

If a section is not relevant, output it with default values. Do not
omit sections.

Output valid TOML/INI only, no commentary.

The user-message template is:

Game: {title} ({year})
Target machine class: {target} (e.g. "486 DX2/66", "Pentium 133")
Special requirements: {notes}

Reference config for a similar title ({reference_title}, {reference_year}):
[paste reference here]

Generate the DOSBox-X config for {title}.
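
Filling the template is plain string formatting. The sketch below assumes a {reference_config} placeholder in place of the "[paste reference here]" line and drops the inline "e.g." hint; build_user_message and its default for notes are hypothetical names chosen for the example.

```python
USER_TEMPLATE = (
    "Game: {title} ({year})\n"
    "Target machine class: {target}\n"
    "Special requirements: {notes}\n"
    "\n"
    "Reference config for a similar title ({reference_title}, {reference_year}):\n"
    "{reference_config}\n"
    "\n"
    "Generate the DOSBox-X config for {title}.\n"
)


def build_user_message(title, year, target, reference_title, reference_year,
                       reference_config, notes="none"):
    """Render the user message for one (game, reference) pair."""
    return USER_TEMPLATE.format(
        title=title,
        year=year,
        target=target,
        notes=notes,
        reference_title=reference_title,
        reference_year=reference_year,
        reference_config=reference_config.strip(),
    )
```

Only the template is parsed for placeholders, so a reference config that happens to contain curly braces is inserted verbatim.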

The retrieval layer is a SQLite FTS5 index over ~3,800 reference configs we scraped from Vogons + PCGamingWiki + the DOSBox-X test fixture directory. For each generation we retrieve the two configs whose game's release year is nearest to the target's release year and whose tagged machine class matches the user's target_machine_class input. We tried embedding-based retrieval with voyage-3-lite and the lift over FTS5 was negligible (+0.4% accuracy at 5× the latency cost). At this corpus size, lexical retrieval on year + machine class is enough.
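
The nearest-year, same-machine-class lookup can be sketched with Python's built-in sqlite3 module. To stay runnable on any SQLite build, this example uses an ordinary table rather than an FTS5 virtual table, so it illustrates the selection rule only, not the production index; the table layout and function names are assumptions.

```python
import sqlite3


def build_index(rows):
    """rows: iterable of (title, year, machine_class, config_text) tuples."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE refs (title TEXT, year INTEGER, machine_class TEXT, config TEXT)"
    )
    db.executemany("INSERT INTO refs VALUES (?, ?, ?, ?)", rows)
    return db


def nearest_refs(db, year, machine_class, k=2):
    """Return the k references of the same machine class closest in release year."""
    cur = db.execute(
        "SELECT title, year, config FROM refs "
        "WHERE machine_class = ? "
        "ORDER BY ABS(year - ?), title "   # title breaks ties deterministically
        "LIMIT ?",
        (machine_class, year, k),
    )
    return cur.fetchall()
```

The two rows returned would be pasted into the reference slot of the user message.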

Structured-output mode is the other big leverage point. We constrain output to a JSON schema with the section headers as keys and let the agent layer assemble the actual INI text. JSON-mode lifted the "Launches" rate from 91% to 96% in a head-to-head ablation — the model is much less likely to drop a required section when the schema enforces it.
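
A minimal sketch of that assembly step follows. The JSON shape is an assumption (one object of key/value strings per section, and a list of raw command lines for autoexec), as is the assemble_ini name; the section order is taken from the system prompt above.

```python
import json

REQUIRED_SECTIONS = [
    "sdl", "dosbox", "render", "cpu", "mixer", "midi", "sblaster",
    "gus", "speaker", "joystick", "serial", "dos", "ipx", "autoexec",
]


def assemble_ini(payload: str) -> str:
    """Turn the model's JSON output into INI text, rejecting missing sections."""
    data = json.loads(payload)
    missing = [s for s in REQUIRED_SECTIONS if s not in data]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    lines = []
    for section in REQUIRED_SECTIONS:
        lines.append(f"[{section}]")
        body = data[section]
        if section == "autoexec":
            lines.extend(body)   # raw DOS commands, one per list item
        else:
            # values are expected to be strings already, e.g. "true" or "fixed 8000"
            lines.extend(f"{key}={value}" for key, value in body.items())
        lines.append("")         # blank line between sections
    return "\n".join(lines)
```

Rejecting a payload with a missing section before anything is written to disk turns a silent launch failure into an immediate retry.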

Quantization matrix: Claude tier vs local model

This is the cost/latency/accuracy decision matrix we use for the retro-agent fleet:

Model	First-pass accurate	Median latency	$/generation	Best fit
Claude Opus 4.5	84%	4.2s	$0.038	One-off configs for hard titles, batch jobs over premium catalogs
Claude Sonnet 4.5 + RAG	88%	2.4s	$0.013	Default for production retro-agent runs — best accuracy/$ ratio
Claude Sonnet 4.5 (no RAG)	78%	1.9s	$0.011	Cold-start before retrieval index is warm
Claude Haiku 4.5	68%	0.7s	$0.004	Bulk reprocessing where 32% retry rate is acceptable
Llama 3.3 70B q4_K_M (local, single RTX 5090)	62%	11s	$0	Fully air-gapped retro labs, no external API

A few notes on the local row. Llama 3.3 70B q4_K_M is the largest model we fit on the card's 32GB of VRAM with a 32K context window for the reference config payload. We tested smaller quants (q3_K_M, q4_K_S) and accuracy dropped 4-7 points without meaningful latency wins. We also tested Llama 3.3 70B at FP16 split across 2× RTX 5090 (64GB combined VRAM) and accuracy improved to 71%, within striking distance of Haiku but at substantially higher capex.

The headline number is the Sonnet+RAG row. At $0.013 per config and 2.4s median latency we can re-generate the entire 3,800-game retro-agent catalog for ~$50 in under three hours of wall-clock time. That's the production setpoint as of 2026.
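
The arithmetic behind those two figures, as a quick check. It assumes strictly serial requests, every request landing at the median latency, and no retries.

```python
CATALOG_SIZE = 3_800        # configs in the catalog
COST_PER_CONFIG = 0.013     # USD, Sonnet + RAG row
MEDIAN_LATENCY_S = 2.4      # seconds, Sonnet + RAG row

total_cost = CATALOG_SIZE * COST_PER_CONFIG
serial_hours = CATALOG_SIZE * MEDIAN_LATENCY_S / 3600

print(f"${total_cost:.2f}, {serial_hours:.2f} h")  # $49.40, 2.53 h
```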

Where does the LLM still fail?

The 16% of Opus first-pass failures cluster into five recognizable patterns:

Obscure sound cards. Roland LAPC-I, Gravis Ultrasound MAX, IBM Music Feature, and the Aria-chipset Diamond DT-310 are all rare enough in training data that Opus emits sbtype=sb16 and silently drops the right MIDI / wavetable path. We catch about half of these by requiring the prompt to declare a target sound family ("MT-32", "GUS", "SB Pro 2") but pure auto-detection from title + year is unreliable for any sound card not on the SB-family lineage.

Copy-protection-era CD-checks. Games from 1995-1997 frequently used CD-ROM disc-check timing that breaks under modern emulation. The reference configs typically include cycles=fixed overrides for the protection-check moment specifically, which the model has no way to know about from title + year alone. Roughly 20% of cycle-section failures fall in this bucket.

MS-DOS extender quirks. Titles using DOS/4GW (Doom, Wing Commander 3, X-COM Apocalypse) need specific memory-manager flags that vary by extender version — xms=true, ems=false, umb=true is the typical combo for DOS/4GW 1.97 but not 2.x. The model gets the extender right ~80% of the time but the per-version memory flags only ~55%.

Windows 95 hybrid installers. The early-1996 wave of "runs in DOS or Win95" titles (Tomb Raider 1 retail, Quake 1 retail, MechWarrior 2 31st Century Combat) needs a config that boots into a virtual Win95 inside DOSBox-X. The model usually emits a pure-DOS config and the installer fails. We added a target_environment=dos|win95|hybrid flag to the prompt template and that resolved most of these.

Late-Pentium 3D titles. 1998-1999 3dfx Glide titles (Unreal, Need For Speed III, Forsaken) need voodoo=accurate + specific pcivga settings + per-title cycle tuning that varies wildly between titles released in the same year. The model gets the boilerplate right but misses the cycle target by >40% in roughly 30% of cases.

The good news: all five of these failure modes are detectable. Every shipped config goes through a smoke-launch that records whether the title screen appears within 30s, the cycle floor we settle on after the first 60s of gameplay, and a hash of the audio path. When any of those signals diverges from a known-good fingerprint we requeue the generation with extra retrieval context. About 6% of configs need a second pass.
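
The requeue decision can be sketched as a comparison of two fingerprints. The Fingerprint fields mirror the three signals described above, but the type, the function name, and the 20% cycle tolerance (borrowed from the scoring rule) are assumptions, not the verifier's actual thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    title_screen_s: float   # seconds until the title screen appeared
    cycle_floor: int        # cycle count settled on after 60 s of gameplay
    audio_hash: str         # hash of the audio path


def needs_requeue(observed: Fingerprint, known_good: Fingerprint,
                  launch_timeout_s: float = 30.0, cycle_tol: float = 0.20) -> bool:
    """True when any smoke-launch signal diverges from the known-good fingerprint."""
    if observed.title_screen_s > launch_timeout_s:
        return True
    drift = abs(observed.cycle_floor - known_good.cycle_floor) / known_good.cycle_floor
    if drift > cycle_tol:
        return True
    return observed.audio_hash != known_good.audio_hash
```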

Spec-delta table: 5 sample games, generated vs reference

These are five titles from the test set with the diff between Opus 4.5's first-pass output and the PCGamingWiki/Vogons reference. The Verdict column marks each field as a match or a deviation.

Game (year)	Field	Generated	Reference	Verdict
Wing Commander 3 (1994)	`machine`	`svga_s3`	`svga_s3`	Match
	`cputype`	`pentium_slow`	`pentium_slow`	Match
	`cycles`	`fixed 18000`	`fixed 23000`	Off by 22% — cinematics will play slow
	`sbtype`	`sbpro2`	`sbpro2`	Match
	`mt32`	`false`	`true`	MT-32 score missing
Quake 1 (1996)	`machine`	`svga_paradise`	`svga_s3`	Acceptable — both work
	`cputype`	`pentium_slow`	`pentium_slow`	Match
	`cycles`	`auto`	`fixed 33000`	Acceptable — auto lands ~33k for this title on modern hardware
	`sbtype`	`sb16`	`sb16`	Match
MechWarrior 2 (1995)	`machine`	`svga_s3`	`svga_s3`	Match
	`cputype`	`486_slow`	`pentium_slow`	Wrong — game is Pentium-targeted
	`cycles`	`fixed 14000`	`fixed 28000`	Off by 50% — combat HUD will stutter
	`sbtype`	`sb16`	`sbpro2`	Off — reference is SBPro2 for period-correct audio
X-COM UFO Defense (1994)	`machine`	`vgaonly`	`vgaonly`	Match
	`cputype`	`486_slow`	`486_slow`	Match
	`cycles`	`fixed 8000`	`fixed 7000`	Within tolerance
	`sbtype`	`sbpro2`	`sbpro2`	Match
Dark Forces (1995)	`machine`	`svga_s3`	`vgaonly`	Wrong — Dark Forces is VGA-only iMUSE-driven
	`cputype`	`pentium_slow`	`486_slow`	Wrong machine class
	`cycles`	`fixed 22000`	`fixed 12000`	Off by 83% — game will run too fast
	`sbtype`	`sbpro2`	`sbpro2`	Match

Two of five (Quake 1 and X-COM UFO Defense) are first-pass shippable; three need spot-corrections. Wing Commander 3 misses on two fields: its cycles land 22% under the reference, just outside the ±20% tolerance, and the MT-32 score is dropped. Dark Forces is a typical late-1995 fail: the model defaulted to "Pentium" because the year is past the Pentium release date, missing that LucasArts targeted a 486 baseline. MechWarrior 2 is the inverse: the model defaulted to "486" because of the early-1995 release but the actual target was Pentium 75. Both are the kind of error a human reviewer with 30 seconds of context would catch immediately.

How this fits the retro-agent fleet

We've been running an open-source retro-agent fleet at github.com/voidsstr/retro-agent since late 2025. The fleet currently drives three classes of work: (1) Win98/Win2K/WinXP installer automation under VMware Workstation Pro, (2) period-correct driver selection (Detonator XX.XX vs Catalyst N.N) keyed on hardware + OS combo, and (3) DOSBox-X config generation. The first two need vision-language reasoning over installer screenshots and forum-thread hardware compatibility tables — the hard parts of the LLM-driven retro stack.

DOSBox-X config generation is the cleanest of the three because the inputs and outputs are both small structured text. There's no "did the installer dialog look like the screenshot Vogons posted in 2003" judgement call. That makes it the right place to validate the larger fleet pattern: prompt + retrieval + structured-output schema + smoke-launch verifier. Every technique in the retro-agent fleet runs at higher accuracy on DOSBox-X configs than on the harder installer-automation tasks, which is what you'd expect for a problem that's primarily a config-generation problem dressed up as retro tooling.

The numbers in this article were generated by the dosbox-x-config-author agent in the fleet, which writes configs into the SpecPicks database the same way this article was written. Production retro-build articles use those configs verbatim; we've shipped 47 retro-build write-ups in 2026 so far where the embedded DOSBox-X profile came straight from the agent.

Verdict: ship Claude-generated configs as a baseline, not a finished output

The honest take: as of 2026, an LLM-generated DOSBox-X config is the right starting point for any title released before 1996, and a useful 70%-of-the-way starting point for 1996-1998. The 84% first-pass accurate rate (Opus + one reference config) means you should expect to spot-correct one section in roughly one of every six generations. That's a 30-second human review per config — well below the floor of writing one from scratch, which is 10-20 minutes of forum-thread mining for the median title.

The ceiling is set by retrieval quality, not model scale. Sonnet 4.5 + two retrieved Vogons snippets beats Opus 4.5 with a single static reference by 4 percentage points. If you're building this pipeline, spend the engineering time on the retrieval layer first (scrape coverage, year and machine-class tagging, FTS index tuning) and only upgrade to a larger LLM once retrieval is solid. The cost gradient strongly favors Sonnet+RAG as the production setpoint at $0.013 per config.

For air-gapped retro labs, Llama 3.3 70B q4_K_M on a single RTX 5090 hits 62% first-pass accuracy. That's worse than every Claude tier but it's free, runs locally, and exceeds the 38% release-year template baseline by a wide margin. We ship a --local mode in the retro-agent fleet for users who don't want any cloud API calls.

Bottom line

Claude can write a usable DOSBox-X config for most 90s PC games in under five seconds at a per-generation cost of about a penny. The output is wrong about one time in six, and the failure modes are recognizable enough that a smoke-launch verifier catches them automatically. Ship LLM-generated configs as a starting baseline that a human or an automated launcher spot-checks; do not ship them as the final per-game profile. The single highest-leverage engineering investment is the retrieval layer, not the LLM tier.

Related guides

The retro-agent fleet architecture overview (github.com/voidsstr/retro-agent README) — full breakdown of the prompt + retrieval + verifier loop covered here, plus the Win98/Win2K/WinXP installer-automation siblings.
AI-assisted Win98 driver install — how we use the same Claude-driven pattern to pick Detonator XX.XX driver versions for late-Win98 GPU builds, with a 91% first-pass-correct rate over the 2000-2002 GeForce lineup.
DOSBox-X starter — bare-bones DOSBox-X 2025.10.0 install + first-config tutorial for readers new to the emulator who want to skip the LLM layer entirely.

Sources

DOSBox-X official wiki, dosbox-x.com/wiki — config-key reference (1,200+ keys), section ordering rules, machine variant list.
Vogons community config threads, vogons.org — the 3,800-config corpus that backs our retrieval index.
PCGamingWiki DOSBox section, pcgamingwiki.com — per-title reference configs cross-checked against community threads.
Anthropic Claude API docs, docs.claude.com — model selection guidance, structured-output / JSON-mode reference, prompt-caching for the reference-config payload.
The retro-agent open-source repo, github.com/voidsstr/retro-agent — agent code, retrieval index builder, smoke-launch verifier, full prompt templates.