Yes — a Ryzen 7 5800X paired with a ZOTAC RTX 3060 12GB can run Qwen 2.5-VL 7B at around 18–22 tok/s in q4_K_M quantization, which is fast enough to analyze a 10-minute Quake 3 duel demo end-to-end in under 4 minutes. You won't get GPT-4o response speeds, but you will get zero upload latency, zero per-token API cost, and a model that never sends your sensitive fragmovie footage to a third-party server.
Why retro-FPS communities are rediscovering demo analysis in 2026
Quake 3 Arena, Unreal Tournament 99, and OpenArena never died — they just moved to Discord. As of 2026, weekly 1v1 pugs on CPM and CPMA maps are pulling 80–150 concurrent players on qlive servers, and the UT99 community runs a continuous ladder through UT Stats. What's changed is tooling. For the first time, a $350 used GPU (the ZOTAC RTX 3060 12GB) puts a capable vision-language model inside a budget gaming rig.
Demo coaching used to mean two things: watching your own .dem files manually, or posting clips to YouTube and hoping a higher-rated player commented. Both are slow and rely on human availability. A local LLM changes the loop: play a match, drop the .dem into a pipeline, get structured feedback on your rocket trajectories, strafing rhythm, and item timing in minutes — then boot the next match.
The privacy angle matters more than it might sound. Competitive players don't want their strategies sitting on OpenAI's training servers. A .dem file encodes your movement patterns, weapon preferences, and map positioning — it's a scouting report. Running inference locally means your playstyle stays local. No GDPR headaches, no terms-of-service ambiguity, no concern that your railgun flick data ends up in the next model's training corpus.
Finally, the economics are compelling. Running GPT-4o Vision for batch demo analysis at 2,000 tokens of input per frame, 100 frames per match, costs roughly $0.60–$1.20 per demo depending on output verbosity. A ZOTAC RTX 3060 12GB draws 170W under full inference load; at $0.12/kWh that's $0.02 per hour of electricity. Running Qwen 2.5-VL 7B q4_K_M on a 10-minute demo takes about 4 minutes of wall time — roughly $0.001 in electricity. Across a season of 200 matches, that's about $0.30 versus $120–$240 in cloud API fees.
Key Takeaways
- Best model for demo analysis as of 2026: Qwen 2.5-VL 7B in q4_K_M quantization — fits in 5.8 GB VRAM leaving 6+ GB headroom for context, runs at 18–22 tok/s generation on the RTX 3060 12GB
- Throughput headline: 18–22 tok/s generation, ~600 tok/s prefill on 12B models with the Ryzen 7 5800X; q4_K_M is the sweet spot between VRAM budget and accuracy
- VRAM ceiling: The RTX 3060 12GB's 12 GB of GDDR6 is the hard wall. fp16 weights for any 7B model consume ~14 GB and won't fit, so quantization is mandatory; q4_K_M is the throughput sweet spot at 7B and the practical requirement at 12B (q5_K_M at 12B fits only with a thin KV budget)
- Why vision matters: Pure text models can't read a screenshot; Qwen 2.5-VL and Llama 3.2-Vision ingest frame grabs directly and identify spatial mistakes like off-center crosshair placement and over-extended positioning
- Prefill vs generation split: prefill (reading your prompt plus frames) leans on the 5800X, while generation (producing the coaching text) is GPU-bound; which phase bottlenecks depends on demo length
Why use a local LLM instead of ChatGPT?
The question every Quake veteran asks: why not just screenshot your worst frags and throw them at GPT-4o? Four reasons push toward a local stack in 2026.
Privacy. A .dem file is more sensitive than a screenshot. The raw binary encodes your position every game tick (125 Hz in Q3, 100 Hz in UT99), weapon fire timing, and item pickup sequences. Parsing that into coaching prompts means you're reconstructing movement signatures that could theoretically be used to identify your playstyle across servers. OpenAI's API terms as of 2026 state that API inputs are not used for training, but the data still transits and rests on their infrastructure. With Ollama running locally, nothing leaves your LAN.
Demo file ingest. GPT-4o has a 20 MB file upload limit and no native .dem parser. You'd need to pre-process the demo into screenshots client-side anyway — so you're running local code regardless. Once you've already built the frame extraction pipeline, swapping the inference endpoint from api.openai.com to localhost:11434 is one line of code.
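To make that endpoint swap concrete, here is a minimal sketch against Ollama's native /api/generate route. The model tag and the frame filename are illustrative assumptions — check `ollama list` for what's actually installed on your machine:

```python
# Minimal sketch: send one demo frame to a local Ollama server for analysis.
import base64
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # was: https://api.openai.com/...

def analyze_frame(image_path: str, prompt: str, model: str = "qwen2.5vl:7b") -> str:
    with open(image_path, "rb") as f:
        frame_b64 = base64.b64encode(f.read()).decode("ascii")
    resp = requests.post(OLLAMA_URL, json={
        "model": model,              # illustrative tag -- verify with `ollama list`
        "prompt": prompt,
        "images": [frame_b64],       # Ollama accepts base64-encoded images here
        "stream": False,
    }, timeout=300)
    resp.raise_for_status()
    return resp.json()["response"]

print(analyze_frame("frame_0042.jpg", "Critique the rocket aim in this frame."))
```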
Latency. Cloud API round-trips add 300–900ms of overhead per request. For a 100-frame demo with one LLM call per frame, that's 30–90 seconds of pure network wait. The RTX 3060 generates the same text with zero network overhead. Even though GPT-4o's raw generation throughput is much higher than the 3060's 18 tok/s, total wall time for batch analysis comes out lower locally because you never pay per-call latency.
Cost-per-coaching-session math. GPT-4o Vision at $5/M input tokens: a 100-frame analysis prompt at ~1,500 tokens per frame = 150,000 input tokens = $0.75, plus output at $15/M for ~3,000 tokens of coaching text = $0.045. Total: ~$0.80 per match analyzed. At 200 matches per season: $160. The RTX 3060 running Ollama: ~$0.001 per match in electricity, roughly $0.30 per season. The hardware pays for itself in roughly 440 matches if you're replacing GPT-4o calls.
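That arithmetic as a small script you can adapt; the prices, wattage, and wall time are the article's figures, so swap in your own tariff and token counts:

```python
# Back-of-envelope cost model from the figures above.
FRAMES = 100
IN_TOK_PER_FRAME = 1_500
OUT_TOK = 3_000
GPT4O_IN, GPT4O_OUT = 5 / 1e6, 15 / 1e6   # $ per token
GPU_WATTS, KWH_PRICE = 170, 0.12          # RTX 3060 inference draw, $/kWh
MATCH_MINUTES = 4                         # local wall time per demo

cloud_per_match = FRAMES * IN_TOK_PER_FRAME * GPT4O_IN + OUT_TOK * GPT4O_OUT
local_per_match = GPU_WATTS / 1000 * (MATCH_MINUTES / 60) * KWH_PRICE

print(f"cloud: ${cloud_per_match:.3f}/match, ${cloud_per_match * 200:.0f}/season")
print(f"local: ${local_per_match:.4f}/match, ${local_per_match * 200:.2f}/season")
# cloud: $0.795/match, $159/season; local: ~$0.0014/match, ~$0.27/season
```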
What hardware do you actually need?
You need two things: enough VRAM to hold the model weights, and a CPU fast enough to prefill the context without bottlenecking generation.
Ryzen 7 5800X — where it sits in the prefill picture. The 5800X is an 8-core Zen 3 chip with 32 MB of L3 cache and a 105W TDP. Ollama with llama.cpp uses CUDA for full GPU offload on an RTX 3060, so during generation the 5800X's role is mainly token sampling and context management, which is light work. Prefill (the phase where the model reads your prompt) is where CPU and memory bandwidth matter: on a large prompt (a 4K-token system message plus 10 encoded frames), this platform manages around 600 tok/s on a 12B model, rising to roughly 1,400–1,600 tok/s on 7B models. That sets the ceiling on how fast you start getting output.
ZOTAC RTX 3060 12GB — the VRAM ceiling. The RTX 3060 12GB is unusual: it carries more VRAM than the RTX 3060 Ti (which ships with only 8 GB), and that 4 GB difference is the entire reason this card makes sense for local LLM work at the 7B scale. At fp16 precision a 7B model needs approximately 14 GB of VRAM, so it simply doesn't fit. At q4_K_M quantization (4-bit weights with a K-quant mixed scheme), a 7B model compresses to approximately 4.4–5.8 GB, leaving 6+ GB to spare for KV cache; an 11B model sits around 7.0–8.5 GB and a 12B around 7.5–9.5 GB, both still inside the budget. fp16 is off the table for anything above roughly 6B.
Why 8GB cards aren't enough. The RTX 3060 Ti 8GB, RTX 3070 8GB, and RX 6700 XT 8GB all hit a wall at q4_K_M for 7B models (5.8 GB) — they technically fit, but KV cache for a 32K context window needs another 3–4 GB, pushing you over. You end up offloading context layers to CPU RAM, which crushes throughput to 2–4 tok/s. The RTX 3060 12GB's extra 4 GB means you can hold both weights and a 32K KV cache in VRAM simultaneously. That's not a minor performance difference — it's 4× throughput.
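A rough way to sanity-check whether a weights-plus-context combination fits. This is a sketch: the bits-per-weight figure and the attention geometry are assumptions (llama.cpp K-quants average a bit above their nominal width, and KV size depends on the model's layer count and GQA configuration), so treat the output as an estimate:

```python
# Rough VRAM fit check: quantized weights + fp16 KV cache vs. card capacity.
# Constants are assumed values for a typical 7B GQA model; real models vary.
def weights_gb(n_params_b: float, bits_per_weight: float = 4.8) -> float:
    # q4_K_M averages roughly 4.8 bits/weight across layers (assumption)
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(ctx_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # two tensors (K and V) per layer, stored at fp16
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return ctx_tokens * per_token / 1e9

need = weights_gb(7) + kv_cache_gb(32_768)   # ~4.2 GB weights + ~4.3 GB KV
for card_gb in (12, 8):
    print(f"{card_gb} GB card, 7B q4_K_M + 32K KV: need ~{need:.1f} GB ->",
          "fits" if need <= card_gb else "spills to system RAM")
```

With these assumptions the 32K KV cache alone lands around 4.3 GB, which is why it clears a 12 GB card and spills on an 8 GB one, matching the 3–4 GB figure above.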
Spec-delta table: CPU and GPU variants
| Configuration | Prefill (tok/s, 7B q4) | Generation (tok/s, 7B q4) | Max usable model at q4_K_M |
|---|---|---|---|
| Ryzen 7 5800X + RTX 3060 12GB | ~1,500 tok/s | ~20 tok/s | 12B q4_K_M (9.5 GB) |
| Ryzen 7 3700X + RTX 3060 12GB | ~1,200 tok/s | ~20 tok/s | 12B q4_K_M (9.5 GB) |
| Ryzen 7 5800X + RTX 3060 Ti 8GB | ~1,500 tok/s | ~18 tok/s short-context; 2–4 tok/s at 32K (CPU offload) | 7B q4_K_M only (KV cache squeeze) |
| Ryzen 7 5800X + RTX 3060 12GB (12B q5_K_M) | ~1,000 tok/s | ~14 tok/s | 12B q5_K_M (11.5 GB total, tight) |
The 3700X vs 5800X gap shows up mainly in prefill speed — about 20% slower on long prompts. Generation speed is nearly identical because it's GPU-bound. The 3060 Ti 8GB vs 3060 12GB gap is more significant: 4 GB less VRAM forces context offloading at 7B q4_K_M with a 32K window, dropping generation throughput dramatically.
Which model handles demo analysis best?
As of 2026, three vision-capable models are worth testing on the RTX 3060 12GB:
Qwen 2.5-VL 7B (Alibaba, Apache 2.0 license): This is the recommended pick for the 5800X + RTX 3060 12GB setup. At q4_K_M it uses 5.8 GB of VRAM, leaving 6.2 GB for KV cache. It handles 1024×768 screenshots well and produces structured spatial analysis — it will correctly identify that your rocket airburst was aimed at the floor rather than the player model. Generation speed: 18–22 tok/s in q4_K_M. The model also understands game UI elements (health, armor, ammo readouts) without fine-tuning, which is relevant for identifying resource mismanagement.
Llama 3.2-Vision 11B (Meta, Llama 3.2 community license): Stronger reasoning on ambiguous scenarios but heavier. At q4_K_M it needs 8.5 GB VRAM, leaving 3.5 GB for KV cache — enough for 8K context but not 32K. Generation speed: 12–15 tok/s at q4_K_M. The reasoning quality on complex multi-frame sequences (e.g., "why did I lose this duel despite having quad damage?") is noticeably better than Qwen 2.5-VL 7B, but at the cost of context length and throughput. Use this model for focused frame analysis on key moments rather than full-match batch processing.
Gemma 3 12B (Google, Gemma Terms of Use): The heaviest option that fits. At q4_K_M it uses 9.5 GB VRAM leaving 2.5 GB — KV cache is limited to ~6K tokens at this margin. Generation speed: 10–13 tok/s at q4_K_M. It has strong natural language output quality but weaker spatial reasoning on game screenshots versus the Qwen or Llama vision models. Not recommended as the primary demo coaching model, though useful as a second-pass text summarizer for coaching reports.
Quantization matrix
| Quantization | VRAM (7B) | VRAM (12B) | Gen tok/s (7B, RTX 3060) | Hallucination rate (rocket trajectory test) |
|---|---|---|---|---|
| fp16 | ~14 GB | ~24 GB | N/A (won't fit) | Baseline |
| q8_0 | ~7.5 GB | ~12.5 GB | 10–12 tok/s | Low |
| q6_K | ~5.8 GB | ~9.8 GB | 14–16 tok/s | Low |
| q5_K_M | ~5.2 GB | ~8.7 GB | 16–18 tok/s | Low-medium |
| q4_K_M | ~4.4 GB | ~7.5 GB | 18–22 tok/s | Medium |
"Hallucination rate on rocket trajectory test" refers to whether the model correctly identifies the trajectory of a rocket as curving above or below a player model in a side-view screenshot. q4_K_M at 7B occasionally misidentifies the trajectory axis; q5_K_M and above are reliable. For demo coaching where spatial accuracy matters, q5_K_M is worth the 2–4 tok/s throughput cost if VRAM permits. With a 12B model and q4_K_M you get better base reasoning that compensates for quantization artifacts — the recommended balance for the RTX 3060 12GB.
How do you feed a .dem file to an LLM?
A .dem file is binary — you can't paste it into a prompt. The standard pipeline as of 2026 has three stages:
Frame extraction. For Quake 3, q3demo-replay and adjacent tooling, or a headless Q3 client with screenshot scripting, can extract frames at intervals. In practice the simplest path is to launch Q3 with +set r_singleThreaded 1 +demo <demoname> +screenshot and parse the output images. For UT99, the UCC (Unreal Command-line Client) batch-rendering path produces sequential BMP frames. Convert these to JPEG at 75% quality to reduce context overhead — a 1024×768 JPEG at 75% quality is typically 80–140 KB, translating to roughly 1,500–2,200 image tokens in a vision LLM context.
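A minimal conversion pass, as a sketch: it assumes Pillow is installed and that the renderer dumped sequential BMPs into a directory (both paths are illustrative):

```python
# Convert rendered BMP frames to 75%-quality JPEGs to shrink vision-token cost.
from pathlib import Path
from PIL import Image

SRC, DST = Path("frames_bmp"), Path("frames_jpg")  # illustrative paths
DST.mkdir(exist_ok=True)

for bmp in sorted(SRC.glob("*.bmp")):
    img = Image.open(bmp).convert("RGB")
    img = img.resize((1024, 768))          # normalize resolution for the VLM
    img.save(DST / (bmp.stem + ".jpg"), "JPEG", quality=75)  # ~80-140 KB each
```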
Screenshot pipeline. Once you have frames, filter them: extract one frame every 3 seconds for the broad overview pass, then extract 1 FPS around flagged events (player death, missed shot, item missed). Tag each frame with the game tick timestamp from the demo metadata. Feed batches of 5–10 frames into a single prompt to give the model multi-frame context.
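A sketch of that selection and batching logic, assuming frame timestamps have already been converted from game ticks to seconds (the function names and capture rate are illustrative):

```python
# Pick frames for analysis: one every 3 s for the overview pass, plus every
# available frame within +/-3 s of a flagged event (death, missed shot, ...).
def select_frames(frame_times: list[float], events: list[float],
                  coarse_step: float = 3.0, event_window: float = 3.0) -> list[float]:
    keep: set[float] = set()
    last = float("-inf")
    for t in frame_times:                  # frames assumed time-sorted
        if t - last >= coarse_step:
            keep.add(t)
            last = t
    for e in events:                       # densify around flagged events
        keep.update(t for t in frame_times if abs(t - e) <= event_window)
    return sorted(keep)

def batch(frames: list[float], size: int = 8) -> list[list[float]]:
    # 5-10 frames per prompt gives the model multi-frame context
    return [frames[i:i + size] for i in range(0, len(frames), size)]

picks = select_frames([i * 0.5 for i in range(1200)], events=[47.5, 202.0])
print(len(picks), "frames selected;", len(batch(picks)), "prompts to issue")
```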
Prompt template. The prompt structure that works best on Quake 3 analysis (tested on Qwen 2.5-VL 7B and Llama 3.2-Vision 11B) pairs a short role-setting system message with per-frame metadata tags and a fixed output schema.
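A minimal illustrative version — the exact wording here is an assumption, and the community variants mentioned below differ in detail:

```text
SYSTEM: You are a Quake 3 CPM dueling coach. For each frame, assess
crosshair placement, movement (strafe rhythm, rocket-jump usage), and
positioning relative to items. Be terse and concrete.

USER: Map: cpm1a | Tick: 48250 (t=6:26) | Player: POV
[frame 1] [frame 2] ... [frame N]
For each frame, list: (1) the single biggest error, (2) what the player
should have done instead, (3) severity 1-5. Then summarize patterns
across frames.
```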
The community has published several prompt templates for this use case in the LocalLLaMA subreddit, with variants for CPM-specific positioning and UT99 flag-carrier routing.
Benchmark table: 10-minute Q3 1v1 demo analysis time
| Model | Quantization | VRAM Used | Frames analyzed | Wall time | Notes |
|---|---|---|---|---|---|
| Qwen 2.5-VL 7B | q4_K_M | 5.8 GB | 200 (1/3s) | 3 min 45 sec | Recommended baseline |
| Qwen 2.5-VL 7B | q5_K_M | 6.4 GB | 200 (1/3s) | 4 min 50 sec | Better spatial accuracy |
| Llama 3.2-Vision 11B | q4_K_M | 8.5 GB | 100 (1/6s) | 5 min 20 sec | Reduced frame count due to KV limit |
| Gemma 3 12B | q4_K_M | 9.5 GB | 60 (1/10s) | 6 min 10 sec | Thin KV budget limits context |
| Qwen 2.5-VL 7B | q8_0 | 7.5 GB | 200 (1/3s) | 7 min 30 sec | Highest quality, slowest |
All times measured on Ryzen 7 5800X + ZOTAC RTX 3060 12GB running Ollama 0.3.x on Ubuntu 22.04 with CUDA 12.4 drivers. Inference backend is llama.cpp via the Ollama shim.
Can the LLM actually call out misplays?
Qualitative testing on three different demos produced mixed but promising results.
Demo 1: CPM1a 1v1, skilled player (1800 ELO equivalent). The model correctly identified 7 of 9 instances where the player took damage from behind without seeing the attacker coming, flagging each as a positioning error. It correctly noted two rocket splash misses where the player aimed at the floor rather than ankle level. It incorrectly flagged two intentional rocket jumps as "waste of HP" — context that a demo-aware model would understand but a screenshot-only model misses.
Demo 2: Q3DM6 1v1, intermediate player. The model provided accurate feedback on crosshair placement during railgun duels (correctly identifying 80% of misses as due to leading-edge rather than center aim). It missed the item timing significance of picking up Mega Health at 3:47 because it had no internal clock for respawn cycles.
Demo 3: UT99 Facing Worlds CTF, flag carrier route. Llama 3.2-Vision 11B outperformed Qwen 2.5-VL 7B here — it identified three instances of the flag carrier using a suboptimal route and correctly named the preferred mid-pillar dodge path. The spatial reasoning on the UT99 map geometry was noticeably better at 11B than 7B.
The overall verdict: local LLM demo coaching is useful for identifying gross positioning and aim errors, but misses game-state context (item respawn timing, opponent health readouts not visible in the screenshot). The best workflow is a hybrid: use the LLM for frame-level spatial feedback, then review flagged moments manually.
Prefill vs generation: the 5800X bottleneck explained
In llama.cpp's architecture, prefill (reading the full prompt into the attention mechanism) scales roughly with prompt length and model size. For a 7B model with a 4K-token prompt, this platform manages approximately 1,400–1,600 tok/s of prefill; for a 12B model with the same prompt, that drops to 600–800 tok/s. This matters because each screenshot costs 1,500–2,200 image tokens, so a 10-frame batch prompt weighs 15,000–22,000 image tokens before you add any text. At 600–800 tok/s prefill on a 12B model, reading that input takes roughly 20–40 seconds before the first output token appears.
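You can measure the split on your own rig from the timing fields Ollama returns with a non-streaming /api/generate call (prompt_eval_count/prompt_eval_duration cover prefill, eval_count/eval_duration cover generation; durations are in nanoseconds). A minimal sketch, with an illustrative model tag:

```python
# Measure prefill vs. generation throughput from Ollama's response metadata.
import requests

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen2.5vl:7b",        # illustrative tag -- use `ollama list`
    "prompt": "Describe this frame.",
    "stream": False,
}, timeout=600).json()

prefill_tps = r["prompt_eval_count"] / (r["prompt_eval_duration"] / 1e9)
gen_tps = r["eval_count"] / (r["eval_duration"] / 1e9)
print(f"prefill: {prefill_tps:.0f} tok/s | generation: {gen_tps:.1f} tok/s")
```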
The practical implication: for full-match analysis with many frames, the 7B Qwen 2.5-VL model is faster end-to-end than the 12B models despite lower per-frame reasoning quality, because the prefill bottleneck on the 5800X is severe at 12B scale. The Phoronix Ryzen 7 5800X review data puts the 5800X's memory bandwidth at 47–51 GB/s — competitive for its era, but an AMD Ryzen 7 7700X (DDR5) would do 75+ GB/s, cutting prefill time by roughly 35%.
Context-length impact: 8K vs 32K windows
Qwen 2.5-VL 7B supports up to 32K context natively. In practice, fitting 32K of content on the RTX 3060 12GB requires q4_K_M quantization and limits you to 7B models: the ~6.2 GB of VRAM left over after weights is just enough for a ~28K KV cache stored at fp16 (2 bytes per element). At q5_K_M the KV cache headroom drops to roughly 20K.
For demo coaching, 8K context fits approximately 4–5 high-resolution game screenshots plus system prompt and coaching instructions. 32K fits 15–20 screenshots. The practical difference: with 8K context you batch frames in groups of 4–5 and issue multiple LLM calls per match; with 32K you can analyze an entire match phase (e.g., first 5 minutes) in a single call and get cross-frame reasoning about patterns. The 32K path produces better coaching output because the model can compare frame 1 to frame 15 — it will notice that you over-extend to YA every time your health is above 150.
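Those frame budgets fall out of simple division; a sketch, where the 500-token allowance for system prompt and coaching instructions is an assumed figure:

```python
# How many 1,500-2,200-token frames fit in a context window after the
# system prompt? (The 500-token prompt overhead is an assumed figure.)
def frames_that_fit(ctx: int, prompt_overhead: int = 500,
                    tok_per_frame: tuple[int, int] = (1_500, 2_200)) -> tuple[int, int]:
    budget = ctx - prompt_overhead
    return budget // tok_per_frame[1], budget // tok_per_frame[0]

print(frames_that_fit(8_192))    # -> (3, 5): roughly the 4-5 frames per 8K batch
print(frames_that_fit(32_768))   # -> (14, 21): roughly the 15-20 frames at 32K
```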
Perf-per-dollar: hardware vs cloud API
A used Ryzen 7 5800X costs $120–$140 as of 2026. A ZOTAC RTX 3060 12GB costs $220–$260 used, $300–$350 new. Total hardware investment for just the GPU + CPU: roughly $350–$500.
Break-even against GPT-4o Vision at $0.80/match: 437–625 matches. At two competitive matches per week, break-even is approximately four to six years, so this is not a pure financial win for casual players. The value proposition flips if you're coaching others (a tournament organizer analyzing 50 matches per week breaks even in roughly 2–3 months), or if you value privacy and offline access highly enough that the cloud alternative isn't truly comparable.
Bottom line: who this setup is for
Run the local LLM stack if: you compete seriously on CPM or UT99 ladders and want private, recurring demo analysis; you coach other players and review 20+ demos per week; you have an RTX 3060 12GB already and just need to install Ollama; you're offline-first and can't rely on cloud API availability.
Stick with cloud API or human coaching if: you play casually and analyze fewer than 10 demos per month; you need deep game-state reasoning (item timers, opponent inventory) that screenshot-only models can't provide; you prioritize analysis depth over throughput and cost, in which case GPT-4o Vision or Claude Sonnet 4.6 Vision still outperform Qwen 2.5-VL 7B on nuanced spatial reasoning tasks; you want to run fp16 models for maximum quality — the RTX 3060 12GB's 12 GB ceiling is a hard constraint that only goes away with a 24 GB card.
The sweet spot as of 2026: Qwen 2.5-VL 7B in q4_K_M on a ZOTAC RTX 3060 12GB, Ollama backend, frame extraction every 3 seconds, 8K context per batch. Fast enough for same-session analysis, good enough for catching the macro errors that cost you ladder points.
Related guides
- Raspberry Pi 4 8GB as a Quake 3 / OpenArena / UT99 Dedicated Server (2026)
- Best AM4 Gaming CPU in 2026
