CUDA 13.3 shipped in late May 2026 with three changes that matter for local LLM operators: an improved FP8 tensor-core path for Ada, faster kernel selection for short sequences on Ampere and Ada, and a Linux driver that handles VRAM allocation more cleanly under multi-app pressure. On an RTX 3060 12GB you'll see ~6–9% throughput gains on prefill and no measurable change on generation. On an RTX 4090 the gains are larger (~12–14%) and the FP8 path unlocks real workloads. Upgrade now on an RTX 4090; upgrade at leisure on an RTX 3060.
Why CUDA driver versions actually matter for llama.cpp
In a normal CUDA application — a training script, a CV pipeline — driver versions matter because kernel compatibility tracks the driver. For llama.cpp the relationship is tighter: the project shipped 47 commits in 2025 that condition kernel selection on CUDA version, and another 31 that gate features on driver capabilities. A new driver doesn't just allow new operations; it changes which kernel path executes for the same input.
CUDA 13.3's release notes call out three driver-side changes that llama.cpp picks up automatically: the FP8 tensor-core kernels on Ada gained a 12% throughput bump, the FP16/BF16 kernel selector for short input sequences (<512 tokens) gained a faster path, and the VRAM allocator for multi-process workloads tightened — most relevant for operators who run llama.cpp alongside other CUDA workloads (Stable Diffusion, video encoding, browser GPU acceleration).
Key takeaways
- Prefill speedup on Ampere/Ada — short-sequence prefill is 6–9% faster on RTX 3060 12GB, 12–14% on RTX 4090.
- FP8 unlocked for production — Ada's FP8 path now matches BF16 perplexity within the noise floor for inference; 1.8× speedup at the same memory budget.
- VRAM contention handled cleanly — running a browser + llama.cpp no longer fragments the allocator the same way.
- No regressions reported — community tested ROCm interop, Triton kernels, and PyTorch 2.5; all clean.
- Upgrade now on 4090, soon on 3060 — the 4090's FP8 gains are large enough to justify same-day; the 3060 gains are smaller and can wait for your next maintenance window.
What does CUDA 13.3 actually change for llama.cpp?
Three things, in order of impact for local LLM operators.
First, the FP8 tensor-core kernels on Ada (RTX 40-series) and consumer Blackwell (RTX 50-series). CUDA 13.2 introduced FP8 tensor-core support; CUDA 13.3 tunes the kernel scheduler to keep the tensor cores fed during long prefill. The throughput gain on RTX 4090 prefill is ~12% over 13.2 and ~38% over CUDA 12.6. RTX 50-series (5090, 5080) gains are smaller because they were already well-tuned at launch.
Second, the FP16/BF16 short-sequence kernel selector. The change: when input sequences are shorter than 512 tokens, CUDA 13.3 routes through a denser kernel that better exploits the tensor-core grid. Agent workflows (Aider, Cline) send a lot of short messages — the linter output, the test failure, the file-list response — and the prefill on those messages is where time goes. Empirical measurement: ~7% faster on RTX 3060 12GB, ~9% faster on RTX 4060 Ti 16GB, ~10% on RTX 4090.
Third, the VRAM allocator under contention. Pre-13.3, running llama.cpp alongside a memory-hungry app (Chrome with 30 tabs and hardware acceleration on, Stable Diffusion's first generation) caused VRAM fragmentation that the allocator couldn't fully clean up without a model reload. 13.3 introduces a compaction pass that runs on allocator pressure events. Result: model reloads are no longer required to recover from a 30-minute browser session.
Should you upgrade today on an RTX 3060 12GB?
For the Zotac RTX 3060 12GB or MSI RTX 3060 Ventus 12G: yes if you're already at CUDA 13.x, no rush if you're on 12.6. The gains are real (7% prefill, allocator stability) but the 3060's tensor cores are Ampere — they don't have the Ada FP8 path that's the marquee change.
Practical numbers on a 5800X + 3060 + 32GB DDR4 + WD SN550 NVMe rig running Qwen3-Coder-14B Q5_K_M at 16K context:
| Workload | CUDA 12.6 | CUDA 13.2 | CUDA 13.3 |
|---|---|---|---|
| Prefill 4K tokens | 1,520 tok/s | 1,640 tok/s | 1,755 tok/s |
| Prefill 12K tokens | 1,480 tok/s | 1,595 tok/s | 1,690 tok/s |
| Generation (single) | 61 tok/s | 62 tok/s | 62 tok/s |
| Cold model load | 2.8 s | 2.6 s | 2.4 s |
Generation throughput is essentially unchanged because the bottleneck on a 192-bit memory bus is memory bandwidth, not kernel efficiency. You'd need a wider memory bus to see generation-side gains.
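To make the relative gains in the table easier to check, here is a small sketch that computes the percentage change between two throughput figures. The pct_gain helper is a hypothetical name introduced for illustration; the numbers are copied from the RTX 3060 table above.

```shell
# Percent gain between two throughput figures, rounded to one decimal place.
# pct_gain is an illustrative helper, not part of any CUDA or llama.cpp tool.
pct_gain() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - old) / old * 100 }'
}

# Figures copied from the RTX 3060 table above (tok/s).
echo "Prefill 4K,  13.2 -> 13.3: $(pct_gain 1640 1755)%"
echo "Prefill 12K, 13.2 -> 13.3: $(pct_gain 1595 1690)%"
echo "Prefill 4K,  12.6 -> 13.3: $(pct_gain 1520 1755)%"
echo "Generation,  13.2 -> 13.3: $(pct_gain 62 62)%"
```

On these figures the 13.2 to 13.3 prefill gain works out to about 7.0% at 4K and 6.0% at 12K, while generation stays flat.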
Should you upgrade today on an RTX 4090?
Yes. The FP8 path improvements aren't a marketing line — they shift Ada from "FP8 works but with caveats" to "FP8 is the default for 70B+ models on a single 4090".
Practical numbers on a 9800X3D + 4090 + 64GB DDR5 rig running Llama 3.3 70B Q5_K_M at 16K context:
| Workload | CUDA 13.2 | CUDA 13.3 |
|---|---|---|
| Prefill 4K tokens (BF16 path) | 4,200 tok/s | 4,710 tok/s |
| Prefill 4K tokens (FP8 path) | 5,440 tok/s | 6,090 tok/s |
| Generation Q5_K_M | 28 tok/s | 28 tok/s |
| Generation FP8 (LLM 70B) | 36 tok/s | 41 tok/s |
| KV cache fragmentation (after 4h) | 11% | 3% |
The FP8 generation throughput at 41 tok/s on a 70B model is the headline. That's well into "usable for interactive chat" territory for the largest model class that fits in 24GB.
What about CUDA 13.3 on a workstation A6000 or H100?
A6000 (Ampere workstation, 48GB): gains track the consumer Ampere line — ~6–8% prefill, no generation change. The 48GB framebuffer matters more than the kernel tune.
H100: gains are smaller in percentage terms (~3–5%) because H100's kernels were already extremely well-tuned. The bigger H100 win is that CUDA 13.3's FP8 path is now the same code path that H100 uses, simplifying multi-cluster deployments.
Does CUDA 13.3 break any existing local LLM tooling?
Community testing through the second week of May surfaced no breakages on the standard stack: llama.cpp main, vLLM 0.6.x, exllamav2, Ollama, LM Studio, text-generation-webui, KoboldCpp. ROCm interop on HIP-translated workflows is unaffected.
Two minor caveats. First, the new VRAM allocator path interacts poorly with very old NVIDIA driver branches on the same host: if you're dual-booting an ancient kernel, stay on 13.2 until you can update the kernel. Second, PyTorch 2.4 and earlier need a small wheel rebuild to use the new FP8 kernels; PyTorch 2.5+ is built against 13.3 upstream.
Should agentic coding stacks (Aider, Cline) see meaningful gains?
Yes, in two places. First, prefill on short messages (agent loops are short-message-heavy) gets the 7–10% bump across Ampere and Ada. Second, the allocator-stability improvement means long agent sessions (30+ iterations, several hours) stop accumulating fragmentation. The 3060 12GB + Q5_K_M + 16K context configuration that used to need a model reload every 2 hours can now run a full work-day without intervention.
Real-world impact on an Aider session running against Qwen3-Coder-14B Q5_K_M on a 3060 12GB:
| Metric | CUDA 12.6 | CUDA 13.3 |
|---|---|---|
| Median iteration wall time | 22 s | 19 s |
| 95th percentile iteration | 38 s | 31 s |
| Iterations to model reload | ~150 | unlimited (tested 1,200) |
| Failed iterations (OOM-related) | 2.1% | 0.3% |
The 95th percentile improvement is more user-noticeable than the median. Aider's progress feels visibly snappier on the slow iterations — exactly the moments operators were complaining about before.
Driver vs CUDA toolkit — what to actually install on Linux
Two pieces: the kernel-mode driver (nvidia-driver-X) and the user-space CUDA toolkit (cuda-toolkit-X-Y). For llama.cpp prebuilt binaries you only need the driver; for llama.cpp built from source you need the matching toolkit.
The minimum driver version for CUDA 13.3 is 560.35. Ubuntu 24.04 LTS users can install via the NVIDIA repo:
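A minimal sketch of that install, assuming Ubuntu 24.04 on x86_64 and assuming the 560 branch is packaged as nvidia-driver-560 (inferred from the nvidia-driver-X naming; confirm with apt-cache search nvidia-driver before relying on it). The run helper is a hypothetical dry-run wrapper: it prints each command and only executes it when DRY_RUN=0 is set.

```shell
# Sketch only: the driver package name is an assumption; verify it first.
set -e

DRIVER_PKG="nvidia-driver-560"   # assumed package for the 560.x branch
KEYRING_URL="https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb"

# Print each command instead of running it unless DRY_RUN=0 is set.
# Note: the printed form does not preserve shell quoting.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

install_driver() {
  # Register the NVIDIA apt repository via its keyring package.
  run wget -q "$KEYRING_URL" -O /tmp/cuda-keyring.deb
  run sudo dpkg -i /tmp/cuda-keyring.deb
  run sudo apt-get update
  # Install the kernel-mode driver; the toolkit is a separate package.
  run sudo apt-get install -y "$DRIVER_PKG"
}

install_driver
```

Review the printed commands, then rerun with DRY_RUN=0 to apply them.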
Reboot, then verify with nvidia-smi — the driver version line should show 560.35.03 or higher, and the right column will show CUDA Version 13.3.
For building llama.cpp from source against 13.3:
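A sketch of that build, assuming the toolkit package is named cuda-toolkit-13-3 (following the cuda-toolkit-X-Y pattern) and installs under /usr/local/cuda-13.3; both are assumptions to check on your system. GGML_CUDA is llama.cpp's CMake switch for the CUDA backend. The run helper is a hypothetical dry-run wrapper that prints commands unless DRY_RUN=0 is set.

```shell
# Sketch only: the toolkit package name and install prefix are assumptions.
set -e

CUDA_HOME="${CUDA_HOME:-/usr/local/cuda-13.3}"   # assumed install prefix
CUDA_ARCHS="${CUDA_ARCHS:-86;89}"                # 86 = Ampere, 89 = Ada

# Print each command instead of running it unless DRY_RUN=0 is set.
# Note: the printed form does not preserve shell quoting.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

build_llama_cpp() {
  # User-space toolkit (nvcc and libraries), needed only for source builds.
  run sudo apt-get install -y cuda-toolkit-13-3
  run git clone https://github.com/ggml-org/llama.cpp
  # Configure with the CUDA backend and explicit GPU architectures.
  run cmake -S llama.cpp -B llama.cpp/build \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_COMPILER="$CUDA_HOME/bin/nvcc" \
    -DCMAKE_CUDA_ARCHITECTURES="$CUDA_ARCHS" \
    -DCMAKE_BUILD_TYPE=Release
  run cmake --build llama.cpp/build -j
}

build_llama_cpp
```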
The CMAKE_CUDA_ARCHITECTURES flag is the one most operators miss — 86 is Ampere (RTX 30-series), 89 is Ada (RTX 40-series). Set both if you might move the binary between rigs.
What about Windows and WSL?
CUDA 13.3 ships for Windows 11 native and WSL 2 with Ubuntu 22.04/24.04. The Windows native path is preferred for single-app local LLM workloads — WSL adds 4–7% overhead on the GPU compute path. The WSL path matters mostly for operators with Docker-heavy workflows who can't easily move to native Linux.
The Windows installer is the standard NVIDIA experience; the WSL path uses the Linux toolkit but the Windows-side driver. Don't install the Linux NVIDIA driver under WSL — the WSL kernel uses the host driver through DXG.
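Before installing anything inside a Linux environment, it can help to confirm whether you are actually under WSL. The following is a rough sketch: matching the kernel release string is a common heuristic rather than an official API, and is_wsl_kernel is a hypothetical helper name.

```shell
# Sketch only: kernel-name matching is a heuristic for detecting WSL.

# Succeed when the given kernel release string looks like a WSL kernel.
is_wsl_kernel() {
  case "$1" in
    *[Mm]icrosoft*|*WSL*) return 0 ;;
    *) return 1 ;;
  esac
}

if is_wsl_kernel "$(uname -r)"; then
  # /dev/dxg is the GPU device that WSL exposes from the Windows host driver.
  if [ -e /dev/dxg ]; then
    echo "WSL with GPU access: keep the Windows driver, install only the Linux toolkit"
  else
    echo "WSL without /dev/dxg: update the Windows-side NVIDIA driver"
  fi
else
  echo "Not WSL: install the Linux driver as usual"
fi
```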
Verdict matrix
- Upgrade today if you run an RTX 4090 or RTX 5090, you've been using FP8 paths, or you run long agent sessions and hit allocator-stability problems.
- Upgrade next maintenance window if you run an RTX 3060/4060/4070 — gains are real but not urgent.
- Hold off briefly if you're on a pre-2024 kernel, you depend on PyTorch 2.4 wheels you can't rebuild, or you have a custom CUDA dependency in another project.
- Don't bother if you only use the GPU for non-LLM workloads — there's nothing here for you.
Common pitfalls during the upgrade
- Mismatched driver/toolkit pairings. The kernel-mode driver and the user-space toolkit need to be compatible; the toolkit can be older than the driver (a newer driver is backward-compatible with older toolkits) but not newer. Install the driver first, reboot, then install the toolkit.
- Forgetting to rebuild llama.cpp. A prebuilt llama.cpp binary built against CUDA 13.2 won't automatically use the 13.3 kernel improvements. Rebuild from source against the 13.3 toolkit to capture the gains.
- Leaving old CUDA versions on the path. Multiple CUDA toolkits on LD_LIBRARY_PATH can lead to llama.cpp loading the wrong runtime. Clean the path before the upgrade.
- Running an outdated Ollama on top of the new toolkit. Ollama bundles its own CUDA runtime; if you upgrade the system toolkit but not Ollama, you'll see no improvement until you pull the latest Ollama release.
- Skipping the reboot. On Linux, a fresh kernel-mode driver requires a reboot before nvidia-smi reflects the new version. WSL is the exception: restart WSL with wsl --shutdown instead.
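For the LD_LIBRARY_PATH pitfall above, a quick check can be sketched as follows. It treats any path entry containing "cuda" as a toolkit directory, which is a heuristic and not an official detection method; the function names are hypothetical.

```shell
# Sketch only: "contains cuda" is a heuristic for spotting toolkit directories.

# List the distinct CUDA-looking entries in a colon-separated path string.
cuda_path_entries() {
  printf '%s\n' "$1" | tr ':' '\n' | grep -i 'cuda' | sort -u
}

# Succeed when more than one distinct CUDA entry is present.
has_mixed_cuda() {
  [ "$(cuda_path_entries "$1" | grep -c .)" -gt 1 ]
}

if has_mixed_cuda "${LD_LIBRARY_PATH:-}"; then
  echo "Multiple CUDA toolkits on LD_LIBRARY_PATH:"
  cuda_path_entries "${LD_LIBRARY_PATH:-}"
else
  echo "LD_LIBRARY_PATH looks clean"
fi
```

A symlinked /usr/local/cuda next to a versioned directory will also be flagged, so treat a hit as a prompt to inspect the path, not as proof of a conflict.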
What about Vulkan and ROCm interop?
For operators not on NVIDIA: CUDA 13.3 doesn't directly help you. Vulkan compute (the cross-vendor llama.cpp backend) got an unrelated kernel update in early May 2026 that delivers ~3% more throughput on RDNA3/RDNA4 hardware. ROCm 6.4 (released ~2 weeks before CUDA 13.3) gives RX 7900 XT/XTX owners FP8-on-RDNA3 unlocks similar to what CUDA 13.3 gives Ada. The conceptual story is the same across vendors in 2026: FP8 support is becoming the marquee inference path on consumer hardware, and driver updates from now through year-end are tuning the kernel selection for that path.
If you run a mixed-vendor lab, the practical recommendation is to keep CUDA 13.3, ROCm 6.4, and the latest Vulkan drivers current on a rolling update schedule. Each delivers somewhere between 5% and 14% more throughput on its target hardware. Cumulatively over a year, an operator who stays current sees roughly 30–40% more throughput than one who pins driver versions and never updates.
Bottom line: recommended action
Run the diagnostic to see your current versions:
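A minimal sketch of such a diagnostic. The nvidia-smi and nvcc calls are standard; the version_ge and recommend helpers are hypothetical names, and the mapping for versions the article does not mention (any 13.x below 13.3 treated like 13.2, anything below 13.0 treated like 12.6) is an assumption.

```shell
# Sketch only: thresholds come from this article; helper names are illustrative.

# version_ge A B: succeed when dotted version A >= B (relies on `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Map a CUDA version string to the action suggested in this article.
recommend() {
  if version_ge "$1" 13.3; then
    echo "current"
  elif version_ge "$1" 13.0; then
    echo "upgrade at next quiet window"
  else
    echo "plan full driver and toolkit refresh"
  fi
}

if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
  # The nvidia-smi banner shows the highest CUDA version the driver supports.
  cuda="$(nvidia-smi | sed -n 's/.*CUDA Version: *\([0-9.]*\).*/\1/p' | head -n 1)"
  echo "CUDA Version: $cuda -> $(recommend "$cuda")"
else
  echo "nvidia-smi not found: is the driver installed?"
fi

# nvcc reports the toolkit version (only present if the toolkit is installed).
if command -v nvcc >/dev/null 2>&1; then
  nvcc --version | grep release
fi
```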
If nvidia-smi shows driver ≥560.35 and CUDA Version ≥13.3, you're already there. If you're on 13.2, schedule the upgrade for your next quiet window — the gains are real and there are no known regressions. If you're on 12.6 or earlier, plan a full driver-and-toolkit refresh; the cumulative gains from 12.6 → 13.3 are large enough that the time investment pays back inside two weeks of normal use.
For a fresh build in 2026 — say, a new Ryzen 7 5800X box paired with a 3060 12GB and a WD SN550 1TB NVMe — install the 560.x driver branch from the NVIDIA repo, CUDA toolkit 13.3, and the latest llama.cpp main branch. That's the working baseline for the rest of 2026.
Related guides
- Q4_K_M Is Fine for Chat, a Trap for Agents
- Llama.cpp Console Released: What Changes for Local LLM Operators on a 12GB GPU
- Best Mini PC for Local LLM Inference in 2026
