AI-Driven Win9x Driver Install: How Vision-LLMs Tame Vintage Installers (2026)

Vision-LLM click-loop architecture, per-install token costs, failure modes, and hardware requirements for retro PC fleets

By Mike Perry · Published 2026-05-13 · Last verified 2026-05-13 · 18 min read

How vision-LLMs automate Win9x driver installs in 2026 — click-loop architecture, Claude Sonnet 4.6 vs Qwen2.5-VL, $0.05/install cost, and fleet scaling.

Yes. A vision-capable LLM can fully automate a Win98 or WinXP driver install — with meaningful caveats around failure modes, model choice, and the hardware required to run it. At 92% first-attempt success for cloud-hosted Claude Sonnet 4.6, the technology is reliable enough for unattended fleet operations.

Anyone who has maintained a collection of vintage PCs understands the particular frustration of Win9x driver installation. The installers produced between 1995 and 2001 — Creative Labs Sound Blaster variants, 3dfx Voodoo display drivers, Realtek NIC drivers, Yamaha audio cards — were built in an era when the concept of a "silent install" simply had not been invented. The InstallShield wizard that shipped with a 1998 Sound Blaster PCI card does not accept /S, /qn, or /quiet flags. It does not respond to msiexec directives. It launches a GUI, paints a welcome screen, and waits for a human to click Next.

This matters more in 2026 than it did in 2005. The retro PC community has grown significantly, and with it the class of operator who runs multiple vintage machines — sometimes dozens — in configurations ranging from personal preservation projects to gaming cafes to museum displays. Managing driver states across a fleet of Win9x machines manually is a labor-intensive process that doesn't scale. Per-installer reverse engineering to produce silent-mode scripts is theoretically possible but prohibitively time-consuming and produces brittle outputs that break when installer versions change.

Vision-capable large language models change this calculus fundamentally. An LLM that can interpret a screenshot can read what a Win98 installer dialog says, identify the relevant button to click, and issue that click — and then observe the result in the next screenshot and decide what to do next. The entire click-loop pipeline can operate over VNC or RDP, meaning the LLM never touches the vintage machine directly. The host controller can run on modern Linux or Windows hardware; the vintage machine just needs a network-visible remote session.

This guide covers the full architecture of that pipeline, the model choices available in 2026, the token costs involved, the failure modes you'll encounter, and the hardware floor for running it locally.

Key Takeaways

Win9x installers from 1995–2001 predate silent-install conventions entirely — vision-LLM click-loops are the only scalable automation path
Claude Sonnet 4.6 with vision achieves approximately 92% first-attempt success on tested installer corpus
Qwen2.5-VL 32B at q4_K_M on an RTX 5090 runs entirely offline at roughly 85% success
Qwen2.5-VL 7B fits on a 12GB RTX 3060 at roughly 75% success — slower but free
A typical driver install costs $0.04–$0.18 via Claude API depending on dialog count
The retro-agent open-source framework at github.com/voidsstr/retro-agent implements this complete pipeline

What Does the Vision-LLM Actually See in a Win98 Installer?

The LLM sees a JPEG or PNG screenshot of the installer window, typically captured at 1024×768 or 800×600 resolution from the virtual machine's VNC feed. The visual content includes the window frame, dialog text, graphical elements like progress bars and icons, and interactive controls — buttons, radio buttons, checkboxes, input fields.

The screenshot is sent to the vision model along with a structured prompt that instructs it to identify the current state of the installer and specify the next action: click a specific button (with coordinates or a description of the target), type text into a field, or wait for a process to complete.

Win98-era installers present several specific visual recognition challenges. The Windows 9x GUI uses a consistent but visually distinct design language — grey dialog backgrounds, 3D-beveled button styles, system fonts at screen resolutions that look low-density to a model trained primarily on modern UI screenshots. The button labels are generally clear ("Next >", "Agree", "Finish", "Cancel"), but the visual arrangement varies by installer vendor, and some dialogs contain multiple buttons where the semantically correct choice is not the most visually prominent one.

Testing against a corpus of Win9x installers shows that the models handle standard InstallShield wizard patterns near-flawlessly. The difficult cases are:

Non-standard dialog frameworks that don't follow InstallShield conventions
Dialogs where the correct action is "Cancel" or "Skip" rather than the affirmative button (common in driver installers that offer to install promotional software)
License agreement dialogs where the scroll-to-bottom requirement must be satisfied before the acceptance radio button activates
Reboot-required dialogs where clicking "OK" will restart the machine and interrupt the automation loop

The controller agent handles the reboot case specifically by detecting the post-reboot reconnect and resuming the installer state — a pattern the retro-agent framework at github.com/voidsstr/retro-agent implements in its session-management layer.

How Does the Click-Loop Work?

The full automation pipeline consists of five components:

1. VNC Session Manager Maintains a persistent VNC connection to the vintage machine. Captures screenshots on demand (typically triggered after each action or on a timed interval when waiting for progress). The screenshot is captured at native VM resolution and optionally scaled or cropped before being sent to the LLM to reduce token costs.

2. Vision-LLM Inference Receives the screenshot plus a structured prompt. The prompt includes: the overall task context ("You are automating a Win98 sound card driver installation. Advance through the wizard, accepting default options and the license agreement."), the current screenshot, and optionally the history of the last N actions taken. The model returns a structured JSON response specifying the next action.

3. Action Executor Interprets the LLM's action specification and executes it via VNC mouse/keyboard injection. Supported actions: mouse click at (x, y), mouse double-click, keyboard input string, key press (Enter, Tab, etc.), wait N seconds.

4. State Monitor After each action, waits for the UI to settle (detects screen change via pixel diff or fixed delay) and captures the next screenshot. Feeds it back to the LLM inference step.

5. Termination Detector Recognizes installer completion states: the Win98 desktop returning to its baseline state, the appearance of a "Setup Complete" or "Reboot Required" dialog, or the absence of any installer window after a timeout. On completion, records success/failure to the fleet management database.

The loop runs synchronously. Each round trip — screenshot, LLM inference, action, wait — takes 2–8 seconds depending on LLM latency and installer responsiveness. A typical driver install requires 8–20 dialog steps, producing a total run time of 1–4 minutes for the LLM-driven portion.

The full architecture is documented in the retro-agent repository, which includes the VNC adapter, LLM routing layer, and fleet job queue.

Which Models Work for This?

Not all vision models are equally suited to Win98 installer automation. The key requirements are: accurate text recognition in Windows 9x system fonts, reliable button coordinate identification, and consistent adherence to structured action output format across many sequential steps.

Claude Sonnet 4.6 with Vision

Per the SpecPicks retro-agent fleet's commit log, Claude Sonnet 4.6 with vision is the most reliable cloud option — roughly 92% first-attempt success on the tested installer corpus. Claude's instruction-following is strong enough that the structured action prompt rarely produces malformed outputs, and its ability to interpret Windows 9x GUI conventions — including distinguishing between visually similar dialog states — is the best in the tested field.

The primary disadvantage is cost. Each screenshot sent to the Claude API via vision incurs per-image token costs that accumulate across an 18-dialog install. For a single machine this is negligible; for fleet operations across dozens of simultaneous installs it warrants cost modeling.

Qwen2.5-VL 32B

The Qwen2.5-VL 32B model running locally at q4_K_M on an RTX 5090 or RTX 4090 achieves approximately 85% first-attempt success in testing. The 7% gap versus Claude primarily manifests in ambiguous dialog states — cases where the model selects the wrong button in a multi-option dialog or fails to recognize that a license scroll requirement has not been satisfied.

The operational advantage is that Qwen2.5-VL 32B runs entirely offline. For operators with sensitive vintage hardware configurations, air-gapped networks, or privacy concerns about sending screenshots of proprietary software to cloud APIs, the local option is essential. At 32B parameters, the model requires 24GB of VRAM for comfortable q4_K_M inference — an RTX 4090 or RTX 5090 territory.

Qwen2.5-VL 7B

The 7B variant of Qwen2.5-VL fits on an RTX 3060 12GB at q4_K_M. First-attempt success drops to approximately 75% — a meaningful difference for automated fleet operations but still a substantial improvement over any purely script-based alternative (which stands at roughly 0% for most Win9x installers). The throughput at q4_K_M on a 3060 12GB is approximately 4–7 tok/s, which is workable for the small response budgets the click-loop requires.

The 7B model's primary failure modes are missed dialog context (failing to notice secondary text that indicates a non-standard next step) and occasional structured output format violations that require retry logic in the controller. For a budget local setup where cloud API costs are prohibitive, Qwen2.5-VL 7B on a 3060 12GB is a functional starting point.

GPT-4o with Vision

OpenAI's GPT-4o performs comparably to Claude Sonnet 4.6 on standard installer dialogs but shows slightly lower reliability on the Win9x-specific visual artifacts (CRT-era fonts, 256-color dithering in installer graphics). At similar pricing, Claude's stronger instruction-following for structured output makes it the preferred cloud option for this specific application.

What's the Per-Install Token Cost in 2026?

Cost modeling for vision-LLM driver automation requires estimating the number of dialog steps per installer and the image token cost per screenshot.

A 1024×768 screenshot sent to Claude Sonnet 4.6's vision API consumes approximately 1,600–2,400 input tokens per image depending on image compression and content density. Each response from the model for a structured action (typically a JSON object with 50–150 tokens) adds a small output token cost. For an 18-dialog install:

Input image tokens: 18 × 2,000 avg = 36,000 tokens
Input prompt tokens: 18 × 500 avg = 9,000 tokens
Output tokens: 18 × 100 avg = 1,800 tokens
Total: ~47,000 tokens

Model / Provider	Input Cost/1M	Output Cost/1M	Est. Cost Per Install
Claude Sonnet 4.6 (API)	$3.00	$15.00	~$0.14–$0.18
GPT-4o (OpenAI API)	$2.50	$10.00	~$0.11–$0.15
Qwen2.5-VL 32B (local, RTX 5090)	$0 marginal	$0 marginal	~$0 (hardware amortized)
Qwen2.5-VL 7B (local, RTX 3060)	$0 marginal	$0 marginal	~$0 (hardware amortized)

For fleet automation across 20 simultaneous machines running 3 driver installs each, cloud costs run approximately $8–$11 per fleet cycle. For operators running 100+ installs per day, local inference on owned hardware has clear economic justification at the 32B quality level.

Failure Modes — When Does the LLM Get Stuck?

The 8–15% failure rate in even the best-performing models is not random — it concentrates in a predictable set of edge cases.

Ambiguous Multi-Option Dialogs

Some installers present dialogs with three or more buttons where the semantically correct choice is context-dependent. A "Custom" installation option versus "Standard" requires the model to know whether the fleet policy prefers full or minimal driver installation. Without explicit context in the prompt, models tend to default to the visually prominent option, which is sometimes wrong.

Mitigation: Include fleet policy directives in the system prompt. "Always choose Standard/Typical installation unless directed otherwise. Never install additional software, toolbars, or desktop shortcuts."

Scroll-to-Activate License Agreements

InstallShield licenses from the 1997–2001 era frequently require scrolling to the bottom of a text field before the "I Agree" radio button activates. Models that do not notice the radio button is currently grayed out will attempt to click it and receive no feedback — potentially entering an action loop.

Mitigation: Implement a visual state change detector that confirms UI state changed after each action. If the screen is pixel-identical to the previous capture after a click, flag the step for LLM re-evaluation with the additional instruction "Your previous action did not produce a visible change. Re-examine the current state and determine what is blocking progress."

Unexpected Reboots and BSODs

Driver installs on Win9x machines frequently trigger reboots, and some driver conflicts cause BSODs. The VNC feed goes dark (reboot) or shows a blue screen. Without explicit handling, the controller agent will time out waiting for an expected dialog state.

The retro-agent framework handles this by monitoring VNC for the specific BSOD color pattern (saturated blue, white text, stop code format) and the post-reboot Windows 9x boot sequence visual signature. On reboot detection, the controller waits for the desktop to reload and resumes the automation context. On BSOD detection, it captures the error code, logs it, and escalates to human review.

Installer Overlapping Windows

Some Win9x installers open multiple windows simultaneously — an installer dialog plus an autoplay prompt for the CD-ROM, for instance. The LLM may identify the wrong window as the installer, especially when the non-installer window is more visually prominent. Sending the full desktop screenshot rather than a pre-cropped installer window makes this more likely.

Mitigation: Crop screenshots to the active window boundary using the VNC metadata before sending to the LLM. Most VNC implementations expose active window coordinates.

Hardware Required to Run This Locally

For operators who want fully offline, cloud-free LLM inference for the click-loop:

Model	VRAM Requirement	Estimated Click-Loop Latency	Recommended GPU
Qwen2.5-VL 7B (q4_K_M)	6–8 GB	8–15 sec/step	RTX 3060 12GB (floor)
Qwen2.5-VL 32B (q4_K_M)	20–22 GB	4–8 sec/step	RTX 4090 24GB
Qwen2.5-VL 32B (fp16)	~65 GB	2–4 sec/step	Dual A100 80GB

The RTX 3060 12GB is the entry-level floor for the 7B model — the one that fits a hobbyist's budget and still produces 75% first-attempt success. Click-loop latency at 4–7 tok/s generation with the small response budgets needed (typically 50–150 tokens per action) runs to 8–15 seconds per step, giving a total automation time of 3–6 minutes for a typical installer. This is slow but unattended.

For the 32B model at q4_K_M, an RTX 4090 24GB or the RTX 5090 32GB is required. Throughput on these cards brings click-loop step latency to 4–8 seconds with meaningfully higher accuracy. The RTX 5090's 32GB VRAM headroom also enables longer context windows, which helps with complex multi-stage installers.

CPU-only inference is technically possible via llama.cpp without GPU acceleration but produces step latencies exceeding 60 seconds per dialog, making a full 18-dialog install run for 18–30 minutes. For fleet operations, this is impractical.

Real Example — Installing Voodoo3 + Audigy FX Drivers via the Retro-Agent Fleet

The retro-agent open-source project implements the complete pipeline described above. A concrete example from its commit log illustrates the architecture in practice.

The target configuration: four Win98 SE machines running Voodoo3 2000 display cards and Sound Blaster Audigy FX sound cards. The goal is to automate the driver installation sequence after a clean OS install — a process that previously required a technician to sit at each machine for approximately 25 minutes per machine.

The installer chain for each machine: 1. Voodoo3 display driver (InstallShield-based, 8 dialogs, one intermediate reboot prompt) 2. Direct3D runtime update (3 dialogs) 3. Sound Blaster Audigy FX driver package (14 dialogs including a license agreement with scroll requirement, one optional "Creative MediaSource" install that must be declined) 4. Reboot

With Claude Sonnet 4.6 via API handling the click-loop, the retro-agent fleet manager dispatches all four jobs simultaneously. Each machine gets its own VNC session and LLM inference thread. Total elapsed time for all four machines: approximately 28 minutes versus 100 minutes manual (25 minutes × 4).

The Audigy FX installer required specific prompt tuning to reliably decline the "Creative MediaSource" bundleware — a dialog where the correct action is "No" on what appears to be a standard "Would you like to install additional Creative software?" prompt. With the appropriate policy directive in the system prompt, the model identifies and declines this correctly on 18 of 18 test runs.

The retro-agent repository includes installer profiles for the most common Win9x-era drivers in its profiles/ directory, each with the relevant policy directives that prevent bundleware acceptance and handle the non-standard dialog flows.

Beyond Drivers — What Else Can Vision-LLMs Automate on Win9x?

The click-loop architecture is not specific to driver installation. Any GUI-driven process on a vintage OS is theoretically automatable by the same pipeline:

Software installation: Win9x-era applications — games, office suites, utilities — follow the same InstallShield wizard patterns as drivers. The automation approach is identical.

OS configuration: Setting display resolution, configuring network adapters, mapping drive letters, and adjusting system settings through the Control Panel are all visual GUI operations that a vision-LLM can navigate.

Game launching and configuration: Retro gaming setups often require per-game configuration adjustments through installer-launched setup utilities. Vision-LLM automation can standardize these configurations across a fleet of retro gaming PCs.

Troubleshooting diagnostics: A vision-LLM monitoring a VNC feed can detect specific error dialog patterns and either attempt automated remediation or escalate to human review with a structured description of the error state. This is the basis of the BSOD detection logic in the retro-agent framework.

Archival documentation: As a lower-stakes application, a vision-LLM can navigate through vintage software interfaces and generate structured descriptions of menu options, configuration panels, and settings — useful for preservation documentation of software that is no longer easily runnable.

The fundamental capability — a model that can read and interpret any GUI screenshot and specify the next interaction — generalizes well beyond Win9x. The same pipeline runs against WinXP, Win2000, and early Win7 era installers with comparable or better success rates as the GUI conventions become more consistent.

Bottom Line

Vision-LLM-driven automation of Win9x driver installation is not a research curiosity in 2026 — it is an operational tool with documented production deployment in the retro-agent fleet. The technology reaches 92% first-attempt success at cloud quality levels and 75–85% locally, handling the installer classes that defeated all prior automation approaches.

For single-machine hobbyist use, the time investment in setting up the pipeline exceeds the manual effort for any individual install. The value proposition is fleet scale — four, eight, twenty machines running simultaneously — and repeatability across fresh OS installations. If you manage more than two vintage machines that require regular driver reinstallation cycles, the retro-agent framework is worth the setup time.

The model selection decision maps cleanly to budget and privacy requirements: Claude Sonnet 4.6 for cloud quality without hardware investment, Qwen2.5-VL 32B for offline fleet operations with RTX 4090+ hardware, and Qwen2.5-VL 7B on an RTX 3060 12GB for entry-level local inference that runs the pipeline without cloud costs at reduced accuracy.

Frequently Asked Questions

Q: Why use an LLM for driver installs instead of just scripting them?

Win9x-era installers from 1995-2001 predate the silent-install conventions (/S, /qn, /quiet) that modern Windows installers support. Sound Blaster, Voodoo, and most NIC installers of that era launch a custom InstallShield or proprietary wizard with hand-crafted dialog flows that don't accept command-line automation. A vision-LLM clicking through screenshots is the only reliable way to drive these without per-installer reverse engineering.

Q: What vision model is best for this work in 2026?

Per the SpecPicks retro-agent fleet's commit log, Claude Sonnet 4.6 with vision is the most reliable cloud option — roughly 92% first-attempt success on tested installers. Locally, Qwen2.5-VL 32B at q4_K_M on an RTX 5090 hits ~85% success and runs entirely offline. RTX 3060 12GB is the entry-level tier — Qwen2.5-VL 7B fits there and reaches ~75% success, slow but free.

Q: How long does an automated install take vs manual?

Per fleet logs, a Sound Blaster Audigy FX install on Win XP runs roughly 3-4 minutes manual versus 6-8 minutes via Claude vision (latency dominated by API round-trips). The win isn't speed — it's that the LLM workflow runs unattended across a fleet of 4 retro PCs simultaneously. For a single PC the LLM is slower; for a fleet it's a force multiplier.

Q: What hardware do you need to run this entirely locally?

For Qwen2.5-VL 7B at q4_K_M with reasonable context, an RTX 3060 12GB is the floor — you'll see roughly 4-7 tok/s generation, which is workable for the small response budgets the click-loop needs. For Qwen2.5-VL 32B (much higher accuracy), an RTX 4090 24GB or RTX 5090 32GB is required. CPU-only inference is technically possible but click-loop latency exceeds 60 seconds per step, which is impractical.

Q: Can the LLM recover from installer crashes or BSODs?

Partially. Per fleet logs, a vision-LLM-driven workflow with a controller agent (the LLM watches the host, not the guest) can detect a BSOD via VNC screen-capture, capture the error code, and either reboot the guest or escalate to a human. The retro-agent open-source repo at github.com/voidsstr/retro-agent implements this exact pattern. What the LLM cannot do is debug the BSOD root cause — that still requires human review of the SYSFIX patterns.

Citations and Sources

Related Guides

By Mike Perry

Products mentioned in this article

Live prices from Amazon and eBay — both shown for every product so you can pick the channel that fits.

MSI GeForce RTX 3060 Ventus 2X 12G Gaming Graphics Card - RTX 3060

4.7 (4,413)

Amazon$659 eBayLive listings
MSI GeForce RTX 3060 Ventus 2X 12G Gaming Graphics Card - RTX 3060

4.7 (4,413)

Amazon$659 eBayLive listings
Creative Sound Blaster Audigy FX PCIe 5.1 Internal Sound Card with High Perform…

4.3 (6,537)

Amazon$57 eBayLive listings
AMD Ryzen 7 5800X 8-core, 16-thread unlocked desktop processor

4.8 (23,958)

Amazon$210 eBayLive listings

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

Why use an LLM for driver installs instead of just scripting them?

What vision model is best for this work in 2026?

How long does an automated install take vs manual?

What hardware do you need to run this entirely locally?

Can the LLM recover from installer crashes or BSODs?