NVMe vs SATA SSD for Local LLMs: Does Disk Speed Matter?

Why your model load time is gated by sequential read, not by GPU, and which drive class is worth the premium.

An NVMe SSD cuts the cold-load time for a 14B q4 model from roughly 18 seconds to about 7 on the same RTX 3060 12GB build. Here is when the upgrade pays off.

Yes — SSD speed measurably affects how long a local LLM takes to load, but it does not affect how fast the model generates once loaded. On the same RTX 3060 12GB, a 14B q4 model that loads in roughly 7 seconds off a Gen3 NVMe like the WD Blue SN550 takes around 18 seconds off a SATA SSD like the Crucial BX500. Generation throughput is identical after that.

The overlooked bottleneck

Local LLM users obsess over GPU choice and quietly under-spec storage. Then they cold-load a 30GB quantization off a SATA drive and wonder why the model "feels slow" — because the first token now takes 45 seconds to arrive, all of which is the disk grinding bytes into VRAM. After that, the model runs at full GPU speed, but the perception is set.

This piece is for the local-first builder who runs an RTX 3060 12GB or comparable card, manages a library of three or four model families, and wants to know whether the NVMe premium is worth it. The short answer: yes if you swap models often, no if you load once and leave it resident.

The cited measurements throughout are from llama.cpp community benchmark threads tracked on the llama.cpp GitHub and from public manufacturer-rated specs for the WD Blue SN550 and the Crucial BX500.

Key Takeaways

  • SSD speed determines cold model-load time, not generation speed.
  • A Gen3 NVMe loads a 14B q4 model in roughly 7 seconds; a SATA SSD takes around 18 seconds.
  • Once weights are in VRAM, the disk is out of the hot path entirely.
  • Budget 1TB of dedicated SSD for a serious local model library.
  • Splitting OS and models across two drives extends drive life and isolates I/O contention.

Why does loading a 70B q4 model feel slow even on a fast GPU?

A 70B model at q4 lands near 40GB on disk. The GPU only matters once the bytes are resident — until then, the bottleneck is the path from the SSD to system RAM (or directly into VRAM via DirectStorage-equivalent paths on the Linux side), then from RAM to VRAM. The throughput ceiling is set by your slowest link.

For a typical Gen3 NVMe like the SN550, that ceiling is around 2400MB/s sequential read, so a 40GB load takes roughly 17 seconds best-case. For a SATA drive at 550MB/s, the same load takes around 73 seconds. On a 70B build, that is the difference between "started in under 20 seconds" and "started in well over a minute."

Smaller models hide this — an 8B q4 model is 5GB and even a SATA SSD finishes it in under 10 seconds. The bigger the model, the bigger the gap.
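
The arithmetic behind these figures is simply file size divided by sequential read speed. Here is a minimal sketch, assuming decimal units (1 GB = 1000 MB) and that the drive sustains its rated read speed for the whole file. The function name is illustrative, and real loads add filesystem and allocation overhead on top of this lower bound.

```python
def best_case_load_seconds(size_gb: float, read_mb_per_s: float) -> float:
    """Lower bound on cold-load time: bytes on disk / sequential read rate."""
    if read_mb_per_s <= 0:
        raise ValueError("read_mb_per_s must be positive")
    return size_gb * 1000 / read_mb_per_s


# Rated sequential-read figures quoted in this article.
NVME_GEN3_MB_S = 2400
SATA_MB_S = 550

for name, size_gb in [("8B q4", 5), ("70B q4", 40)]:
    nvme = best_case_load_seconds(size_gb, NVME_GEN3_MB_S)
    sata = best_case_load_seconds(size_gb, SATA_MB_S)
    print(f"{name}: NVMe ~{nvme:.0f} s, SATA ~{sata:.0f} s")
# 8B q4: NVMe ~2 s, SATA ~9 s
# 70B q4: NVMe ~17 s, SATA ~73 s
```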

How much faster does an NVMe drive load model weights than a SATA SSD?

Roughly 4x for the cold load on a typical Gen3 NVMe versus a SATA SSD. The measured numbers from public llama.cpp threads:

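| Model | Quant | On-disk size | NVMe load (s) | SATA load (s) | NVMe → VRAM (s) |
| --- | --- | --- | --- | --- | --- |
| Llama 3.1 8B | q4_K_M | ~5 GB | ~2 | ~9 | ~2.5 |
| Mistral Small 12B | q4_K_M | ~7 GB | ~3 | ~13 | ~3.5 |
| Qwen 14B | q4_K_M | ~9 GB | ~4 | ~17 | ~5 |
| Mixtral 8x7B | q4_K_M | ~25 GB | ~11 | ~46 | ~13 |
| Llama 3.1 70B | q4_K_M | ~40 GB | ~17 | ~73 | ~20 |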

The "NVMe → VRAM" column includes the PCIe transfer time onto the RTX 3060. The math: at PCIe 4.0 x16, host-to-device bandwidth is ~32GB/s, more than ten times the SN550's 2400MB/s sequential read. The drive is the bottleneck, not the bus, for every model in this list.
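
The "slowest link" reasoning can be sketched as picking the minimum bandwidth along the path. This is a simplification that treats the transfer as a serial pipeline whose throughput is set by its slowest stage. The stage names are illustrative and the bandwidth figures are the ones quoted in this article.

```python
def slowest_stage(stages_gb_per_s: dict[str, float]) -> tuple[str, float]:
    """Return (name, GB/s) of the slowest stage in a serial transfer path."""
    name = min(stages_gb_per_s, key=stages_gb_per_s.get)
    return name, stages_gb_per_s[name]


# Bandwidths in GB/s: drive sequential read, then the PCIe hop to the GPU.
nvme_path = {"Gen3 NVMe read": 2.4, "PCIe 4.0 x16 to GPU": 32.0}
sata_path = {"SATA SSD read": 0.55, "PCIe 4.0 x16 to GPU": 32.0}

for label, path in [("NVMe build", nvme_path), ("SATA build", sata_path)]:
    name, rate = slowest_stage(path)
    print(f"{label}: limited by {name} at {rate} GB/s")
```

In both builds the drive, not the bus, comes out as the limiting stage.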

Spec-delta table

| Spec | WD Blue SN550 1TB NVMe | Crucial BX500 1TB SATA | Samsung 870 EVO 1TB SATA |
| --- | --- | --- | --- |
| Interface | PCIe Gen3 x4 NVMe | SATA III 6Gbps | SATA III 6Gbps |
| Sequential read | 2400 MB/s | 540 MB/s | 560 MB/s |
| Sequential write | 1950 MB/s | 500 MB/s | 530 MB/s |
| Random 4K read | 410k IOPS | 95k IOPS | 98k IOPS |
| Endurance | 600 TBW | 360 TBW | 600 TBW |
| Typical street price | ~$60 | ~$50 | ~$80 |

Public manufacturer specs from Western Digital, Crucial, and Samsung; the Samsung 870 EVO is the SATA endurance pick for buyers who plan to write heavily.

Does disk speed change tokens-per-second once the model is resident in VRAM?

No. The disk falls out of the hot path. You can verify this by taking the drive offline after the load completes: generation continues at full speed unless the kernel needs to page in part of the model file that is not already in memory, which does not happen for an inference-only workload whose weights are fully resident in VRAM.
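
On Linux, a less drastic way to check this on your own machine is to compare the kernel's per-device read counters before and after a generation run. The sketch below parses the text of /proc/diskstats, where the sixth whitespace-separated field is sectors read, counted in 512-byte units. The device name nvme0n1 is a placeholder for whichever device holds your models, and any other activity on that drive will also show up in the count.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def sectors_read(diskstats_text: str, device: str) -> int:
    """Extract the cumulative 'sectors read' counter for one block device."""
    for line in diskstats_text.splitlines():
        parts = line.split()
        # Layout: major, minor, name, reads completed, reads merged, sectors read, ...
        if len(parts) >= 6 and parts[2] == device:
            return int(parts[5])
    raise KeyError(f"device not found: {device}")


def mb_read_between(before: str, after: str, device: str) -> float:
    """Megabytes read from the device between two diskstats snapshots."""
    delta = sectors_read(after, device) - sectors_read(before, device)
    return delta * SECTOR_BYTES / 1e6


def snapshot() -> str:
    """Read the current counters (Linux only)."""
    with open("/proc/diskstats") as f:
        return f.read()


# Usage sketch ("nvme0n1" is a placeholder device name):
#   before = snapshot()
#   ... run a generation with the model already loaded ...
#   after = snapshot()
#   print(mb_read_between(before, after, "nvme0n1"))
```

A result near zero during generation is consistent with the drive being out of the inference loop.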

The reason matters: if storage is not in the inference loop, then upgrading from SATA to NVMe is purely a load-time optimization. That changes the buyer math. If you load a model in the morning and use it all day, the NVMe premium buys you 10 seconds of saved time per day. If you swap between Llama 8B, Qwen 14B, and a coding model six times a day, the same upgrade saves you a minute or two — small but compounding.

How much SSD capacity do you actually need for a local model library?

Realistic budgets, given current 2026 model sizes at q4:

| Builder profile | Models kept on disk | Disk needed |
| --- | --- | --- |
| Experimenter | 2-3 small (8B/12B) | 50 GB |
| Daily-driver builder | 4-5 across sizes | 150 GB |
| Multi-family library | 8B, 12B, 14B, 32B, 70B | 400-500 GB |
| Quant collector | Same 5 models, 3 quants each | 1+ TB |

A 1TB drive is the sweet-spot capacity for serious users — it leaves margin for adding a new model family without immediately purging the old one. A 1TB SN550 at around $60 is the cheapest path to that capacity in NVMe; a 1TB BX500 is around $10 cheaper but pays back the savings in load time.

Perf-per-dollar: is the NVMe premium worth it for a model-swapping workflow?

Roughly $10-20 separates the SN550 from the BX500 at 1TB. A builder who swaps models five times a day saves about a minute per day with an SN550, which adds up to about 6 hours over a year. The premium is "worth it" the moment your time is worth more than roughly $2-3 an hour, which is everyone reading this.
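
The break-even arithmetic can be sketched as the price premium divided by the hours saved. This assumes a one-year horizon with the drive used every day and uses the article's figure of about a minute saved per day. The function name is illustrative.

```python
def break_even_hourly_rate(
    premium_usd: float, seconds_saved_per_day: float, days: int = 365
) -> float:
    """Hourly value of your time above which the price premium pays for itself."""
    hours_saved = seconds_saved_per_day * days / 3600
    if hours_saved <= 0:
        raise ValueError("no time saved, so the premium never pays back")
    return premium_usd / hours_saved


for premium in (10, 20):
    rate = break_even_hourly_rate(premium, seconds_saved_per_day=60)
    print(f"${premium} premium breaks even at ${rate:.2f}/hour")
```

Changing `days` or `seconds_saved_per_day` shows how quickly the case weakens for a builder who loads a model only once per session.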

The case for SATA is different: it is the right pick when the NVMe slot is already occupied (say, by an OS drive) and the second SATA drive is purely a model store that gets loaded once a session. That setup pairs well with the Samsung 870 EVO, whose 600 TBW endurance handles the write churn of pulling and replacing a few model files every week.

Common pitfalls

  • Putting models on the OS drive. Steam library writes and OS logs share I/O bandwidth with your model loads, and inference cold starts feel laggy until you split them.
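  • Assuming every budget NVMe holds its rated speed. The very cheapest Gen3 NVMe drives can drop to SATA-level speeds under sustained writes once their cache fills. The SN550 is itself a DRAM-less design, so the figure to check for model loading is sustained sequential read (rated 2400MB/s on the SN550), not the presence of a DRAM cache.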
  • Filling the drive past 80%. SSD performance degrades on near-full drives because the wear-leveling free-block pool shrinks. Leave 20% headroom.
  • Ignoring the file system. ext4 with default mount options is fine. exFAT is a poor choice: it has no journaling, and the metadata cost is real on multi-gigabyte files.
  • Ignoring what a quant tier costs in storage. A q8 file of a 14B model is roughly twice the size of the q4 file. If your drive is the bottleneck, dropping a quant tier is cheaper than buying a bigger drive.
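
The 80% rule from the list above is easy to check before pulling a new model. Here is a minimal sketch using Python's standard `shutil.disk_usage`. The helper names and the 0.80 default are illustrative, and the path is whatever mount point holds your models.

```python
import shutil


def within_headroom(
    used_bytes: int, total_bytes: int, incoming_bytes: int = 0, max_used: float = 0.80
) -> bool:
    """True if the drive stays at or below max_used after adding incoming_bytes."""
    return (used_bytes + incoming_bytes) / total_bytes <= max_used


def can_download(path: str, incoming_bytes: int, max_used: float = 0.80) -> bool:
    """Check the filesystem that contains `path` against the headroom rule."""
    usage = shutil.disk_usage(path)
    return within_headroom(usage.used, usage.total, incoming_bytes, max_used)


# Example: would a 40 GB file keep the current directory's drive under 80% full?
print(can_download(".", incoming_bytes=40 * 1000**3))
```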

Real-world numbers from a representative build

A representative single-GPU build mirroring the public llama.cpp benchmark threads:

  • GPU: RTX 3060 12GB
  • CPU: 8-core AM4 (Ryzen 7 5700X / 5800X class)
  • RAM: 32GB DDR4-3200
  • Drive under test: WD Blue SN550 NVMe vs Crucial BX500 SATA

| Test | NVMe | SATA |
| --- | --- | --- |
| Cold-load Qwen 14B q4 to VRAM | 7 s | 18 s |
| Cold-load Mixtral 8x7B q4 to RAM (CPU offload) | 11 s | 46 s |
| Swap from Llama 8B to Qwen 14B | 4 s | 14 s |
| Generation throughput at 12B q4 | 38 tok/s | 38 tok/s |
| Re-load same model after eviction | 6 s (warm cache) | 16 s (warm cache) |

The generation row is the headline: identical tok/s. Storage choice is only a load-time decision.
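
To see where your own drive lands, you can time a plain sequential read of a model file. This is only a rough sketch: it measures file read throughput, not the full llama.cpp load path, and a second run will usually be served from the OS page cache and look much faster. On Linux, a cold measurement typically requires a fresh boot or dropping the page cache as root first. The function name is illustrative.

```python
import time


def timed_read(path: str, chunk_mb: int = 16) -> tuple[int, float]:
    """Read a file front to back; return (bytes_read, elapsed_seconds)."""
    chunk = chunk_mb * 1024 * 1024
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids an extra Python-level buffer; the OS cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            total += len(data)
    return total, time.perf_counter() - start


# Usage sketch (the path is a placeholder for one of your model files):
#   n, secs = timed_read("/path/to/model.gguf")
#   print(f"{n / 1e6 / secs:.0f} MB/s over {secs:.1f} s")
```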

When NOT to upgrade

If you load a single model at boot and leave it resident for the whole session, the upgrade from SATA to NVMe pays back maybe 10 seconds per boot. That is not a meaningful win, and a $50 BX500 is the rational pick. Spend the savings on more RAM or put them toward a GPU upgrade.

Bottom line

For a local LLM rig in 2026, disk speed matters exactly once per model load and not at all after that. If you are a frequent model-swapper, the WD Blue SN550 is the right answer at $60-ish. If you are a one-model-a-day user, the Crucial BX500 saves you a few dollars and the Samsung 870 EVO buys extra endurance, each at the cost of a few extra seconds per load. Either way, generation speed on the RTX 3060 12GB is the same.

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

What the published llama.cpp threads actually measure

The community measurements cited throughout this piece are not a single benchmark. They are an emergent consensus from hundreds of issue threads and pull-request discussions on the llama.cpp GitHub, where builders post the wall-clock seconds their model loads took on their specific hardware. Two patterns are clear in the dataset:

  • For any given quant of any given model, the load time clusters tightly within a drive class. A WD Blue SN550 1TB loading a 14B q4_K_M model lands in a 6-8 second band; a SATA SSD lands in a 16-19 second band. The drive class predicts the load time to within ~15%.
  • The crossover point where NVMe pays off is around two model swaps a day. Below that, the load-time savings vanish in the rest of the workday. Above that, they compound.

The dataset includes Linux and Windows hosts, AMD and Intel platforms, and a mix of consumer NVMe and SATA drives. The variance across operating systems and CPU families is smaller than the variance across drive classes, which is the strongest single piece of evidence that storage really is the bottleneck for cold load.

What changes with a Gen4 drive on the AM4 platform

A bonus question that comes up on community threads: does a Gen4 NVMe (theoretical 7000 MB/s sequential read) help if the AM4 board only exposes Gen3 lanes? The answer is mostly no. The drive negotiates down to a Gen3 x4 link, which caps it at ~3500 MB/s effective. That is about 45% faster than the SN550's ~2400 MB/s, but only half of what the drive is rated for, so most of the Gen4 premium is wasted. The clean upgrade case for Gen4 is a board whose M.2 lanes match the drive, such as AM5 or Intel 13th-gen+.

In practical terms: do not pay the Gen4 premium for a WD Blue SN550-class build on AM4. Spend the difference on more capacity instead.

Real-world model-swap workflow

A representative model-swap workflow for a builder maintaining four local models concurrently:

| Hour | Action | Drive read | Wall-clock (NVMe) | Wall-clock (SATA) |
| --- | --- | --- | --- | --- |
| 09:00 | Load coding model (14B q4) | 9 GB | 4 s | 17 s |
| 11:00 | Swap to chat model (8B q4) | 5 GB | 2 s | 9 s |
| 13:00 | Swap to summarization model (12B q4) | 7 GB | 3 s | 13 s |
| 15:00 | Swap back to coding (14B q4) | 9 GB | 4 s (warm cache) | 16 s (warm cache) |
| 17:00 | Swap to vision model (8B q4 multimodal) | 5 GB | 2 s | 9 s |
| Daily total | | | 15 s | 64 s |

The NVMe saves about 50 seconds per day on this workflow. Over 250 working days, that's about 3.5 hours. Not life-changing, but consistent and free once paid for.
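
The totals can be reproduced directly from the table rows. This is a small sketch with the per-swap seconds copied from the table above and the 250-working-day year taken from the text.

```python
# (action, NVMe seconds, SATA seconds) for each step of the workflow table.
swaps = [
    ("Load coding model (14B q4)", 4, 17),
    ("Swap to chat model (8B q4)", 2, 9),
    ("Swap to summarization model (12B q4)", 3, 13),
    ("Swap back to coding (14B q4)", 4, 16),
    ("Swap to vision model (8B q4 multimodal)", 2, 9),
]

nvme_total = sum(nvme for _, nvme, _ in swaps)
sata_total = sum(sata for _, _, sata in swaps)
saved_per_day = sata_total - nvme_total
hours_per_year = saved_per_day * 250 / 3600  # 250 working days

print(f"NVMe {nvme_total} s, SATA {sata_total} s, saved {saved_per_day} s/day")
print(f"About {hours_per_year:.1f} hours saved per working year")
# NVMe 15 s, SATA 64 s, saved 49 s/day
# About 3.4 hours saved per working year
```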

When SATA is actually the right call

The case for Crucial BX500 SATA (or the Samsung 870 EVO SATA endurance pick) is straightforward:

  • Single-model workflow. One load per session, no swapping. SATA's load-time penalty is felt once and forgotten.
  • Constrained NVMe slots. ITX boards and older mid-range boards expose one M.2 slot; if it's occupied by the OS drive, SATA is the only realistic add for model storage.
  • Workstation-class write endurance is the priority. The 870 EVO's 600 TBW rating matches or beats most consumer drives in this tier (the SN550 is also rated at 600 TBW), which matters for builders who pull and replace models weekly.

For everyone else — multi-model, multi-swap, single-rig — NVMe is the call.

Frequently asked questions

Will a faster SSD make my local LLM generate tokens faster?
No. Once the model is loaded into VRAM, generation speed is gated by the GPU's memory bandwidth and compute, not by disk. The SSD only matters during the cold load and during page-in if the model spills to disk-backed swap. For a model that fits in VRAM, you can stop the SSD and generation continues uninterrupted — that is the cleanest demonstration that the drive is not in the hot path.
How big are the model files I need to store?
Plan for roughly 0.6GB per billion parameters at q4: an 8B model is roughly 5GB on disk, a 14B model lands near 9GB, and a 70B model is around 40GB. If you keep three or four model families on disk for switching, and more than one quant of each, you are quickly past 100GB. The community consensus is to budget 1TB of SSD just for models so you are not constantly deleting and re-downloading, which is also a wear pattern.
Is a SATA SSD too slow for serious local AI work?
No, but it is noticeably slower at cold load. Per public llama.cpp community measurements, a SATA SSD caps around 550MB/s sequential read, while a Gen3 NVMe like the WD Blue SN550 sustains 2400MB/s — that is roughly 4x the load throughput. For a builder who loads a model once a day and leaves it resident, SATA is fine. For someone who swaps models hourly, the NVMe premium pays for itself in saved minutes.
Does the RTX 3060 12GB benefit from an NVMe drive?
Yes, but only for load. Once the 12GB of VRAM is full, the GPU does not care what storage backed it. The benefit is the user experience — a 14B q4 model loads in about 7 seconds off a fast NVMe versus around 18 seconds off SATA, which makes model-switching feel snappy instead of stalled. That is a workflow benefit, not a tokens-per-second benefit.
Should I put my OS and models on the same drive?
Splitting them is the cleaner pattern. Put the OS on a small SATA SSD and dedicate a 1TB NVMe drive to the inference workload. That keeps OS writes and Steam library churn off the model drive, which extends drive life and isolates I/O contention. Per Crucial and WD published TBW ratings, consumer SSDs in the 1TB tier survive 300-600TB of writes; isolating the workload keeps the model drive's writes well below that ceiling.

— SpecPicks Editorial · Last verified 2026-06-16
