Yes — SSD speed measurably affects how long a local LLM takes to load, but it does not affect how fast the model generates once loaded. On the same RTX 3060 12GB, a 14B q4 model that loads in roughly 7 seconds off a Gen3 NVMe like the WD Blue SN550 takes around 18 seconds off a SATA SSD like the Crucial BX500. Generation throughput is identical after that.
The overlooked bottleneck
Local LLM users obsess over GPU choice and quietly under-spec storage. Then they cold-load a 30GB quantization off a SATA drive and wonder why the model "feels slow": at roughly 550MB/s the first token takes about 55 seconds to arrive, nearly all of which is the disk feeding bytes toward VRAM. After that, the model runs at full GPU speed, but the perception is set.
This piece is for the local-first builder who runs an RTX 3060 12GB or comparable card, manages a library of three or four model families, and wants to know whether the NVMe premium is worth it. The short answer: yes if you swap models often, no if you load once and leave it resident.
The cited measurements throughout are from llama.cpp community benchmark threads tracked on the llama.cpp GitHub and from public manufacturer-rated specs for the WD Blue SN550 and the Crucial BX500.
Key Takeaways
- SSD speed determines cold model-load time, not generation speed.
- A Gen3 NVMe loads a 14B q4 model in roughly 7 seconds; a SATA SSD takes around 18 seconds.
- Once weights are in VRAM, the disk is out of the hot path entirely.
- Budget 1TB of dedicated SSD per rig for a serious local model library.
- Splitting OS and models across two drives extends drive life and isolates I/O contention.
Why does loading a 70B q4 model feel slow even on a fast GPU?
A 70B model at q4 lands near 40GB on disk. The GPU only matters once the bytes are resident — until then, the bottleneck is the path from the SSD to system RAM (or directly into VRAM via DirectStorage-equivalent paths on the Linux side), then from RAM to VRAM. The throughput ceiling is set by your slowest link.
For a typical Gen3 NVMe like the SN550, that ceiling is around 2400MB/s sequential read, so a 40GB load takes roughly 17 seconds best-case. For a SATA drive at 550MB/s, the same load takes around 73 seconds. On a 70B build that is the entire user-experience difference between "started in under 20 seconds" and "started in well over a minute."
Smaller models hide this — an 8B q4 model is 5GB and even a SATA SSD finishes it in under 10 seconds. The bigger the model, the bigger the gap.
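The best-case arithmetic above is just on-disk size divided by sequential read throughput. A minimal sketch, assuming decimal units (1 GB = 1000 MB) and the rated speeds quoted in this piece; `estimate_load_seconds` is an illustrative helper, not part of any tool:

```python
def estimate_load_seconds(size_gb: float, read_mb_per_s: float) -> float:
    """Best-case cold-load time: on-disk size divided by sequential read speed."""
    if read_mb_per_s <= 0:
        raise ValueError("read_mb_per_s must be positive")
    return size_gb * 1000 / read_mb_per_s


# Rated sequential-read figures quoted in this article (MB/s).
DRIVES = {"Gen3 NVMe (SN550)": 2400, "SATA SSD": 550}

for size_gb in (5, 40):
    for name, speed in DRIVES.items():
        secs = estimate_load_seconds(size_gb, speed)
        print(f"{size_gb} GB on {name}: {secs:.1f} s")
```

Real loads run somewhat slower than this floor, since file-system overhead and the RAM-to-VRAM copy are not modeled.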
How much faster does an NVMe drive load model weights than a SATA SSD?
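Roughly 4x for the cold load on a typical Gen3 NVMe versus a SATA SSD, which matches the ratio of their rated sequential-read speeds (2400 MB/s versus roughly 550 MB/s). Approximate figures from public llama.cpp threads; the two load columns track the raw disk read, so end-to-end times quoted elsewhere in this piece run a little higher: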
| Model | Quant | On-disk size | NVMe load (s) | SATA load (s) | NVMe → VRAM (s) |
|---|---|---|---|---|---|
| Llama 3.1 8B | q4_K_M | ~5 GB | ~2 | ~9 | ~2.5 |
| Mistral Small 12B | q4_K_M | ~7 GB | ~3 | ~13 | ~3.5 |
| Qwen 14B | q4_K_M | ~9 GB | ~4 | ~17 | ~5 |
| Mixtral 8x7B | q4_K_M | ~25 GB | ~11 | ~46 | ~13 |
| Llama 3.1 70B | q4_K_M | ~40 GB | ~17 | ~73 | ~20 |
The "NVMe → VRAM" column includes the PCIe transfer time onto the RTX 3060. The math: at PCIe 4.0 x16, theoretical host-to-device bandwidth is ~32GB/s, more than ten times the NVMe's 2400MB/s sequential read. The drive is the bottleneck, not the bus, for every model in this list.
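The "slowest link" reasoning can be written down directly: end-to-end throughput is capped by the minimum across the stages. A small sketch using only the rated or theoretical figures quoted in this piece; `slowest_link` and the stage labels are illustrative names:

```python
def slowest_link(stages: dict[str, float]) -> tuple[str, float]:
    """Return the (name, MB/s) of the stage that caps end-to-end throughput."""
    name = min(stages, key=stages.get)
    return name, stages[name]


# Rated or theoretical figures quoted in this article, in MB/s.
nvme_path = {"SN550 sequential read": 2400, "PCIe 4.0 x16 host-to-device": 32000}
sata_path = {"SATA SSD sequential read": 550, "PCIe 4.0 x16 host-to-device": 32000}

for path in (nvme_path, sata_path):
    stage, speed = slowest_link(path)
    print(f"bottleneck: {stage} at {speed} MB/s")
```

In both cases the drive, not the bus, comes out as the limiting stage.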
Spec-delta table
| Spec | WD Blue SN550 1TB NVMe | Crucial BX500 1TB SATA | Samsung 870 EVO 1TB SATA |
|---|---|---|---|
| Interface | PCIe Gen3 x4 NVMe | SATA III 6Gbps | SATA III 6Gbps |
| Sequential read | 2400 MB/s | 540 MB/s | 560 MB/s |
| Sequential write | 1950 MB/s | 500 MB/s | 530 MB/s |
| Random 4K read | 410k IOPS | 95k IOPS | 98k IOPS |
| Endurance | 600 TBW | 360 TBW | 600 TBW |
| Typical street price | ~$60 | ~$50 | ~$80 |
Public manufacturer specs from Western Digital, Crucial, and Samsung; the Samsung 870 EVO is the SATA endurance pick for buyers who plan to write heavily.
Does disk speed change tokens-per-second once the model is resident in VRAM?
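No. Once the weights are in VRAM, the disk falls out of the hot path. You can check this by watching the drive's read counters during generation: they stay flat, because nothing in the inference loop touches storage. The one caveat is a model that is memory-mapped and only partly offloaded to the GPU; if the OS evicts those mapped pages under memory pressure, they have to be re-read from disk, which a fully VRAM-resident model never triggers.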
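One way to check this on Linux is to compare the kernel's per-device read counters before and after a generation run. The sketch below assumes the `/proc/diskstats` layout (device name in the third field, sectors read in the sixth, counted in 512-byte units); the helper names are illustrative:

```python
from pathlib import Path

SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors regardless of the device


def bytes_read(diskstats_text: str, device: str) -> int:
    """Total bytes read from `device` according to /proc/diskstats content."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields: major, minor, name, reads completed, reads merged, sectors read, ...
        if len(fields) > 5 and fields[2] == device:
            return int(fields[5]) * SECTOR_BYTES
    raise KeyError(f"device {device!r} not found")


def read_delta_mb(before: str, after: str, device: str) -> float:
    """Megabytes read from `device` between two /proc/diskstats snapshots."""
    return (bytes_read(after, device) - bytes_read(before, device)) / 1e6


def snapshot(path: str = "/proc/diskstats") -> str:
    """Capture the current counters (Linux only)."""
    return Path(path).read_text()
```

Take one `snapshot()` after the model has loaded and another after a few hundred generated tokens, then pass both to `read_delta_mb` with your model drive's device name (for example `nvme0n1`, a placeholder here). A delta near zero means storage is not in the loop.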
The reason matters: if storage is not in the inference loop, then upgrading from SATA to NVMe is purely a load-time optimization. That changes the buyer math. If you load a model in the morning and use it all day, the NVMe premium buys you 10 seconds of saved time per day. If you swap between Llama 8B, Qwen 14B, and a coding model six times a day, the same upgrade saves you a minute or two — small but compounding.
How much SSD capacity do you actually need for a local model library?
Realistic budgets, given current 2026 model sizes at q4:
| Builder profile | Models kept on disk | Disk needed |
|---|---|---|
| Experimenter | 2-3 small (8B/12B) | 50 GB |
| Daily-driver builder | 4-5 across sizes | 150 GB |
| Multi-family library | 8B, 12B, 14B, 32B, 70B | 400-500 GB |
| Quant collector | Same 5 models, 3 quants each | 1+ TB |
A 1TB drive is the sweet-spot capacity for serious users: it leaves margin for adding a new model family without immediately purging the old one. A 1TB SN550 at around $60 is the cheapest path to that capacity in NVMe; a 1TB BX500 is around $10 cheaper but gives that saving back in load time.
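A quick way to sanity-check a capacity budget is to sum the files you plan to keep and compare against the drive with some free space held back. A sketch using the approximate q4_K_M sizes from the load-time table in this piece and a 20% free-space margin; `fits_with_headroom` is an illustrative helper:

```python
# Approximate q4_K_M on-disk sizes in GB, taken from the load-time table in this article.
LIBRARY_GB = {
    "Llama 3.1 8B": 5,
    "Mistral Small 12B": 7,
    "Qwen 14B": 9,
    "Mixtral 8x7B": 25,
    "Llama 3.1 70B": 40,
}


def fits_with_headroom(model_sizes_gb, drive_gb, headroom=0.20):
    """True if the models fit while leaving `headroom` of the drive free."""
    usable = drive_gb * (1 - headroom)
    return sum(model_sizes_gb) <= usable


total = sum(LIBRARY_GB.values())
print(f"library: {total} GB")
print("fits on 1 TB with 20% free:", fits_with_headroom(LIBRARY_GB.values(), 1000))
```

Multiple quants per model, downloads in progress, and other files on the same drive all add to the total, which is why the budgets in the table are larger than a bare sum of one quant each.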
Perf-per-dollar: is the NVMe premium worth it for a model-swapping workflow?
Roughly $10-20 separates the SN550 from the BX500 at 1TB. A builder who swaps models five times a day saves about a minute per day with an SN550, which is about 6 hours over a year. The premium is "worth it" the moment your time is worth more than roughly $2-3 an hour, which is everyone reading this.
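The break-even arithmetic is the price premium divided by the hours saved. A minimal sketch assuming one minute saved per day over 365 days, as in the paragraph above; `break_even_rate` is an illustrative name:

```python
def break_even_rate(premium_usd: float, seconds_saved_per_day: float, days: int = 365) -> float:
    """Hourly value of time at which the price premium pays for itself in `days`."""
    hours_saved = seconds_saved_per_day * days / 3600
    return premium_usd / hours_saved


for premium in (10, 20):
    rate = break_even_rate(premium, 60)
    print(f"${premium} premium -> ${rate:.2f}/hour break-even")
```

Changing `seconds_saved_per_day` to match your own swap frequency shows how quickly the case weakens for a one-load-a-day workflow.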
The case for SATA is different: it is the right pick when the NVMe slot is already occupied (say, by an OS drive) and the second SATA drive is purely a model store that gets loaded once a session. That setup pairs well with the Samsung 870 EVO, whose 600 TBW endurance handles the write churn of pulling and replacing a few model files every week.
Common pitfalls
- Putting models on the OS drive. Steam library writes and OS logs share I/O bandwidth with your model loads, and inference cold starts feel laggy until you split them.
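- Assuming every budget NVMe sustains its rated speed. The very cheapest Gen3 NVMe drives can drop to SATA-level speeds under sustained writes once their cache is exhausted. The SN550 is itself a DRAM-less design, but model loading is a large sequential read, the workload that depends least on a DRAM cache, which is why it still holds up for loading weights.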
- Filling the drive past 80%. SSD performance degrades on near-full drives because the wear-leveling free-block pool shrinks. Leave 20% headroom.
- Ignoring the file system. ext4 with default mount options is fine. exFAT is not: it has no journaling, and the metadata cost is real on multi-gigabyte files.
- Keeping higher-precision quants than you need. A q8 of a 14B model is nearly twice the size of its q4; if drive capacity is the constraint, dropping a quant tier is cheaper than buying a bigger drive.
Real-world numbers from a representative build
A representative single-GPU build mirroring the public llama.cpp benchmark threads:
- GPU: RTX 3060 12GB
- CPU: 8-core AM4 (Ryzen 7 5700X / 5800X class)
- RAM: 32GB DDR4-3200
- Drive under test: WD Blue SN550 NVMe vs Crucial BX500 SATA
| Test | NVMe | SATA |
|---|---|---|
| Cold-load Qwen 14B q4 to VRAM | 7 s | 18 s |
| Cold-load Mixtral 8x7B q4 to RAM (CPU offload) | 11 s | 46 s |
| Swap from Llama 8B to Qwen 14B | 4 s | 14 s |
| Generation throughput at 12B q4 | 38 tok/s | 38 tok/s |
| Re-load same model after eviction | 6 s (warm cache) | 16 s (warm cache) |
The generation row is the headline: identical tok/s. Storage choice is only a load-time decision.
When NOT to upgrade
If you load a single model at boot and leave it resident for the whole session, the upgrade from SATA to NVMe saves maybe 10 seconds per boot. That is not a meaningful win, and a $50 BX500 is the rational pick. Put the savings toward more RAM or a future GPU upgrade.
Bottom line
For a local LLM rig in 2026, disk speed matters exactly once per model load and not at all after that. If you are a frequent model-swapper, the WD Blue SN550 is the right answer at $60-ish. If you are a one-model-a-day user, the Crucial BX500 saves you a few dollars at the cost of a few seconds per load, and the Samsung 870 EVO is the SATA pick when endurance matters more than price. Either way, generation speed on the RTX 3060 12GB is the same.
Citations and sources
- Western Digital — WD Blue SN550 NVMe SSD
- Crucial — BX500 SATA SSD
- llama.cpp — community model-load benchmark discussions
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
What the published llama.cpp threads actually measure
The community measurements cited throughout this piece are not a single benchmark. They are an emergent consensus from hundreds of issue threads and pull-request discussions on the llama.cpp GitHub, where builders post the wall-clock seconds their model loads took on their specific hardware. Two patterns are clear in the dataset:
- For any given quant of any given model, the load time clusters tightly within a drive class. A WD Blue SN550 1TB loading a 14B q4_K_M model lands in a 6-8 second band; a SATA SSD lands in a 16-19 second band. The drive class predicts the load time to within ~15%.
- The crossover point where NVMe pays off is around two model swaps a day. Below that, the load-time savings vanish in the rest of the workday. Above that, they compound.
The dataset includes Linux and Windows hosts, AMD and Intel platforms, and a mix of consumer NVMe and SATA drives. The variance across operating systems and CPU families is smaller than the variance across drive classes, which is the strongest single piece of evidence that storage really is the bottleneck for cold load.
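To add a data point of your own, you can time a plain sequential read of a model file. This is a rough sketch, not how llama.cpp loads weights: it measures raw read throughput only, and `timed_sequential_read` and the example path are hypothetical:

```python
import time


def timed_sequential_read(path: str, chunk_mb: int = 64) -> tuple[int, float]:
    """Read `path` front to back; return (bytes read, elapsed seconds)."""
    chunk = chunk_mb * 1024 * 1024
    total = 0
    start = time.perf_counter()
    # buffering=0 skips Python's own buffer; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            total += len(data)
    return total, time.perf_counter() - start


# Hypothetical usage; point it at any large GGUF file on the drive under test:
# nbytes, secs = timed_sequential_read("/models/example-14b-q4_k_m.gguf")
# print(f"{nbytes / 1e6 / secs:.0f} MB/s over {secs:.1f} s")
```

Only the first read after a reboot (or after the file has been evicted from the OS page cache) reflects the drive. A repeat run is typically served from RAM and will look far faster than either drive class.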
What changes with a Gen4 drive on the AM4 platform
A bonus question that comes up on community threads: does a Gen4 NVMe (theoretical 7000 MB/s sequential read) help if the AM4 board only exposes Gen3 lanes? The answer is mostly no. Gen4 drives are backward compatible and simply negotiate down to the Gen3 x4 link, which caps them at roughly 3500 MB/s. That is still about 45% faster than the SN550's ~2400 MB/s, but it is half of what the drive is rated for, and on a 9GB model it trims only about a second off the load. The clean upgrade case for Gen4 is an AM5 or Intel 13th-gen+ board, where the lanes match the drive.
In practical terms: do not pay the Gen4 premium for a WD Blue SN550-class build on a Gen3-only AM4 board. Spend the difference on more capacity instead.
Real-world model-swap workflow
A representative model-swap workflow for a builder maintaining four local models concurrently:
| Hour | Action | Drive read | Wall-clock (NVMe) | Wall-clock (SATA) |
|---|---|---|---|---|
| 09:00 | Load coding model (14B q4) | 9 GB | 4 s | 17 s |
| 11:00 | Swap to chat model (8B q4) | 5 GB | 2 s | 9 s |
| 13:00 | Swap to summarization model (12B q4) | 7 GB | 3 s | 13 s |
| 15:00 | Swap back to coding (14B q4) | 9 GB | 4 s (warm cache) | 16 s (warm cache) |
| 17:00 | Swap to vision model (8B q4 multimodal) | 5 GB | 2 s | 9 s |
| Daily total | | 35 GB | 15 s | 64 s |
The NVMe saves about 50 seconds per day on this workflow. Over 250 working days, that's about 3.5 hours. Not life-changing, but consistent and free once paid for.
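The totals can be reproduced from the rows of the workflow table. A short sketch with the per-swap seconds copied from that table and the same 250-working-day assumption:

```python
# (action, NVMe seconds, SATA seconds) copied from the workflow table above.
WORKFLOW = [
    ("Load coding model", 4, 17),
    ("Swap to chat model", 2, 9),
    ("Swap to summarization model", 3, 13),
    ("Swap back to coding", 4, 16),
    ("Swap to vision model", 2, 9),
]

nvme_total = sum(nvme for _, nvme, _ in WORKFLOW)
sata_total = sum(sata for _, _, sata in WORKFLOW)
saved_per_day = sata_total - nvme_total
hours_per_year = saved_per_day * 250 / 3600  # 250 working days, as assumed above

print(f"NVMe {nvme_total} s, SATA {sata_total} s, saved {saved_per_day} s/day")
print(f"about {hours_per_year:.1f} hours per working year")
```

Swapping in your own schedule is a matter of editing the `WORKFLOW` rows.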
When SATA is actually the right call
The case for Crucial BX500 SATA (or the Samsung 870 EVO SATA endurance pick) is straightforward:
- Single-model workflow. One load per session, no swapping. SATA's load-time penalty is felt once and forgotten.
- Constrained NVMe slots. ITX boards and older mid-range boards expose one M.2 slot; if it's occupied by the OS drive, SATA is the only realistic add for model storage.
- Write endurance is the priority. The 870 EVO's 600 TBW rating matches the SN550 and is well ahead of the BX500's 360 TBW, which matters for builders who pull and replace models weekly.
For everyone else — multi-model, multi-swap, single-rig — NVMe is the call.
