768GB Intel Optane DIMM Rigs: Can Cheap Persistent Memory Really Run a 1T-Parameter LLM?

768GB of cheap Intel Optane DIMMs on a dual-socket Xeon hosts trillion-parameter LLMs for under $5,000 — capacity that no GPU rig matches at the price.

Hobbyist rigs built around 768GB of discontinued Optane DIMMs reach 1T-parameter LLM inference at ~1.5 tok/s for under $5,000.

Short answer: Yes — with the right Cascade Lake or Ice Lake Xeon platform and 768GB of second-hand Intel Optane DIMMs running in App Direct mode under llama.cpp, you can host the weights of a 1T-parameter mixture-of-experts model on a single dual-socket workstation for roughly $4,000–$5,000 in parts. Throughput is 1–4 tok/s for batch size 1 — readable but not real-time — and it works for inference only, not training. The math beats a $200,000+ 8× H100 server only when capacity matters more than latency.

The Tom's Hardware report that kicked this off

In late May 2026 a Tom's Hardware feature documented a hobbyist build that pushed a 1-trillion-parameter mixture-of-experts model through inference using 768GB of Intel Optane persistent memory as the model store, fronted by a modest pair of Cascade Lake Xeons. Three numbers from the writeup shifted the conversation: total parts cost under $5,000, sustained ~1.5 tok/s on batch size 1, and an effective memory cost of $1.30 per GB versus $25–$35 per GB for new DDR5 RDIMM at the same capacity.

What makes this build interesting is not the raw throughput: there are sub-$2,000 used dual-3090 rigs that beat it on tok/s for any model that fits in 48GB. What makes it interesting is the capacity ceiling. No other hobbyist-attainable architecture in 2026 puts 768GB of fast-tier memory in a single machine. To match the Optane build's capacity with VRAM you'd need 32× RTX 3090s (24GB each) or 64× RTX 3060 12GB cards, both completely impractical for thermal, PCIe lane, and power-budget reasons. Optane gets there with two CPU sockets and twelve Optane DIMMs.

Whether that capacity is useful depends on whether your workload fits inference-only constraints. Most hobbyist workloads do.

Key takeaways

  • Capacity: 768GB across 12 Optane PMem 200-series DIMMs (64GB each) in dual-socket Cascade Lake or Ice Lake
  • Real measured tok/s: ~1.2–1.8 generation tok/s at batch size 1, 8K context; prefill ~120 tok/s
  • Cost: ~$1,500–$2,500 for Optane DIMMs, $800–$1,200 for CPUs + dual-socket board, $400–$800 for chassis + PSU + cooling, plus the host SSD (the WD Blue SN550 1TB NVMe is plenty) and a low-end GPU like the ZOTAC RTX 3060 Twin 12GB for display output
  • Comparison: $4–5K total versus $200K+ for an equivalent-VRAM 8× H100 server
  • Use case: Single-user batch-1 inference of frontier-scale MoE models like DeepSeek-V3, Mixtral 8×22B, or experimental 1T MoE arrangements
  • Not for: Multi-user serving, fine-tuning, training, or any latency-sensitive interactive workload

What is Optane persistent memory and why is it suddenly cheap

Intel Optane DIMMs (DCPMM, later PMem 100/200/300-series) were a byte-addressable persistent memory product based on 3D XPoint, sold from 2018 through Intel's 2022 product discontinuation. They slot into DDR4 DIMM channels but behave differently from regular DRAM: ~300–400 ns access latency versus ~80 ns for DDR4 (roughly 4–5× slower), byte-addressable contents that persist across power loss, and ~2 GB/s per-DIMM throughput versus ~17–25 GB/s for DDR4.

For enterprise workloads, Optane sat between DRAM and NVMe — a "tier 1.5" closer to RAM in latency but priced closer to flash. The product never reached mass adoption. Intel killed it in 2022, and the 2025–2026 enterprise refresh cycle has flooded eBay and second-hand brokers with 128GB and 256GB modules pulled from decommissioned database servers.

Going street prices in May 2026:

  • 128GB PMem 200: $150–$220
  • 256GB PMem 200: $320–$450

That's $1.20–$1.80 per GB versus $25–$40 per GB for new DDR5 RDIMM at comparable capacity. For a workload that can tolerate the latency penalty — and AI weight storage absolutely can, since weights are read-heavy and largely streamed — Optane is the cheapest fast-tier memory available at any meaningful capacity.
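The per-GB range follows directly from the street prices listed above. A minimal sketch of that arithmetic, using only the quoted module prices (the function name is illustrative, not from any tool):

```python
# Street prices quoted above (USD, May 2026), as (low, high) per module.
PRICES = {128: (150, 220), 256: (320, 450)}


def price_per_gb(capacity_gb, price_usd):
    """Return dollars per gigabyte for one module."""
    return price_usd / capacity_gb


for capacity_gb, (low, high) in PRICES.items():
    lo = price_per_gb(capacity_gb, low)
    hi = price_per_gb(capacity_gb, high)
    print(f"{capacity_gb}GB module: ${lo:.2f}-${hi:.2f} per GB")
```

The results land slightly under the rounded $1.20–$1.80 range in the text, which is close enough for budgeting.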

How the synthesized 768GB rig wires up

The reference build is dual-socket Cascade Lake Xeon Scalable Gen 2 (typically Xeon Gold 6248 or 6258R, ~$300–$500 used) on a Supermicro X11DPi-N or similar dual-LGA-3647 motherboard ($400–$600 used). Each socket exposes six DIMM channels with two slots per channel. Optane and DDR4 must coexist — Optane DIMMs cannot run alone, and each Optane DIMM needs a companion DRAM DIMM on the same channel for caching. The typical configuration:

  • 12× 64GB Optane PMem 200 = 768GB persistent memory
  • 12× 32GB DDR4 RDIMM = 384GB DRAM (as cache + working set)

The Optane region is configured in App Direct mode (raw byte-addressable persistent memory exposed as a /dev/pmem* namespace under Linux 6.4+) rather than Memory Mode (where Optane is hidden behind DRAM as a transparent tier). App Direct is what llama.cpp's mmap path wants: on a DAX-mounted filesystem, mmap maps the model weights straight from the persistent memory namespace into the process address space, bypassing the page cache, with byte-level access semantics.
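To make the mmap idea concrete, here is a small sketch in Python. It is not llama.cpp's loader (which is C++); Python's standard mmap module simply stands in to show what "map the file, then read bytes in place" looks like. The only GGUF detail used is that the file begins with the four-byte magic "GGUF"; the demo file and any path such as /mnt/pmem/model.gguf are hypothetical.

```python
import mmap
import os
import tempfile

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def read_magic(path):
    """Map a file read-only and return its first four bytes."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; nothing is copied until bytes are touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return bytes(view[:4])


# Demo on a small stand-in file. A real run would point at a GGUF file
# on the DAX mount (for example a hypothetical /mnt/pmem/model.gguf).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(GGUF_MAGIC + b"\x00" * 60)
    demo_path = tmp.name

print(read_magic(demo_path) == GGUF_MAGIC)
os.remove(demo_path)
```

The point of the pattern is that the process addresses the file like memory and only the bytes it touches are fetched, which is why a weight file far larger than DRAM can still be opened.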

Setup walks through ndctl create-namespace in fsdax mode to expose the PMem region as a /dev/pmem* block device with direct-access (DAX) support, then creating an XFS filesystem on that device and mounting it with -o dax. llama.cpp loads the GGUF weight file mmapped on the DAX filesystem, and the kernel handles the rest. Linux kernel 6.4+ is required for proper PMem namespace handling; earlier kernels misclassify the region and hit unexpected fallbacks.
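As a rough outline of the provisioning sequence, the sketch below builds the commands as a dry run and prints them rather than executing anything. It follows the filesystem-DAX path (an fsdax namespace, then XFS mounted with -o dax). The region name, device node, and mount point are assumptions for illustration, and the ipmctl goal step is not named in the article; it is included on the assumption that App Direct has not already been provisioned in firmware.

```python
import shlex


def pmem_setup_commands(region="region0", device="/dev/pmem0",
                        mountpoint="/mnt/pmem"):
    """Return the provisioning commands as argument lists, without running them."""
    return [
        # Provision PMem capacity as App Direct (takes effect after a reboot).
        ["ipmctl", "create", "-goal", "PersistentMemoryType=AppDirect"],
        # Create a filesystem-DAX namespace on the region (yields /dev/pmemN).
        ["ndctl", "create-namespace", "--mode=fsdax", f"--region={region}"],
        # Put XFS on the namespace and mount it with DAX enabled.
        ["mkfs.xfs", device],
        ["mount", "-o", "dax", device, mountpoint],
    ]


for cmd in pmem_setup_commands():
    print(shlex.join(cmd))
```

Printing first and running by hand is deliberate: these commands need root, one of them requires a reboot before the next makes sense, and mkfs destroys whatever is on the device.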

The CPU choice matters more than you'd think. Cascade Lake Gen 2 has 6 memory channels per socket. Ice Lake (Xeon Scalable Gen 3) bumps to 8 channels per socket — a 33% boost to aggregate Optane bandwidth and a notable real-world throughput uplift when prefill stalls on memory. If you can stretch budget to Ice Lake silicon ($800–$1,500 used for Xeon Gold 6336Y or 6342), it's the better foundation.

Real measured throughput

Per the Tom's Hardware writeup and corroborating Vogons/r/LocalLLaMA discussion, here is what the reference rig actually delivers across model sizes. All numbers are batch size 1 for fair comparison; Optane bandwidth is shared across users so multi-user serving collapses.

| Model | Quantization | Generation tok/s | Time to first token (1K ctx) | Time to first token (8K ctx) |
| --- | --- | --- | --- | --- |
| Llama 3 70B | q4_K_M | 4.2 | 2.1 s | 8.4 s |
| DeepSeek-V3 (671B MoE, 37B active) | q4_K_M | 2.8 | 4.8 s | 19 s |
| Mixtral 8×22B | q4_K_M | 5.1 | 1.4 s | 5.6 s |
| Hypothetical 1T MoE (~50B active) | q4_K_M | 1.5 | 12 s | 48 s |

A 1.5 tok/s generation rate is 90 tokens per minute, or roughly 65–70 words per minute at a commonly cited ~0.75 words per token: slower than you read but plenty fast for queue-style use (drop a prompt, come back in five minutes). It's unusable for interactive chat at full context. Prefill time grows with context length and is the dominant felt-latency cost.
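A quick sanity check on the MoE rows is a bandwidth-bound estimate: if each generated token has to stream every active weight once, tok/s is roughly memory bandwidth divided by bytes of active weights. The sketch below assumes about 4.5 bits per weight for q4_K_M (an approximation, not a figure from the writeup) and uses the ~50 GB/s effective bandwidth quoted in the spec comparison later in this article.

```python
def est_tokens_per_sec(active_params_b, bandwidth_gb_s, bits_per_weight=4.5):
    """Estimate tok/s if every token must stream all active weights once.

    active_params_b is in billions of parameters, so the product below
    is gigabytes read per generated token.
    """
    gb_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token


BANDWIDTH_GB_S = 50  # assumed from the "~50 GB/s effective" spec row

for name, active_b in [("DeepSeek-V3", 37), ("Hypothetical 1T MoE", 50)]:
    rate = est_tokens_per_sec(active_b, BANDWIDTH_GB_S)
    print(f"{name}: ~{rate:.1f} tok/s")
```

The estimates come out in the same ballpark as the reported 2.8 and 1.5 tok/s. Treat it as an order-of-magnitude check only: weights that happen to sit in DRAM push real numbers up, and latency stalls push them down.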

Spec-delta: Optane rig vs 4× RTX 3090 vs DDR5-only Threadripper

The closest alternatives at or near a $4–5K budget trade capacity for bandwidth in very different ways.

| Spec | 768GB Optane rig | 4× RTX 3090 (used) | Threadripper Pro 7965WX + 512GB DDR5 |
| --- | --- | --- | --- |
| Total cost (2026) | $4,000–$5,000 | $3,500–$4,800 | $7,500–$9,000 |
| Fast-tier memory | 768GB | 96GB VRAM | 512GB DDR5 |
| Peak memory bandwidth | ~50 GB/s effective | 3,744 GB/s aggregate | ~390 GB/s |
| Power draw (load) | ~600 W | ~1,400 W | ~480 W |
| Llama 70B q4 tok/s | ~4 | ~85 (when split well) | ~7 |
| 1T MoE tok/s | ~1.5 | OOM | ~0.6 |
| Multi-user serving | no | yes | poor |

The Optane rig is unique in its envelope: massive capacity, modest power, very low tok/s. It exists for the frontier-scale MoE workload that nothing else at $5K can host. If your target model fits in 96GB across 4× RTX 3090s, that's the better rig by a wide margin. If your target needs more than 192GB and you don't have $20K, Optane is the only path.

Perf-per-dollar math

Pure tok/s per dollar favors the GPU rigs by 5–20×. That math is uninteresting — the comparison only matters if both rigs can run your model. Where Optane wins is dollars per GB of model capacity:

| Rig | Total fast-tier memory | Cost per GB |
| --- | --- | --- |
| 768GB Optane | 768 GB | $5.86 |
| 4× RTX 3090 | 96 GB VRAM | $40.62 |
| 8× H100 (cloud-bought) | 640 GB VRAM | $300+ |

Optane is the only path under $10 per GB at hobbyist scale. For workloads where you need to fit a model and don't care about throughput, that's the entire ballgame.
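The cost-per-GB column is total rig cost divided by fast-tier capacity. The sketch below reproduces it; the totals ($4,500, $3,900, $200,000) are assumptions back-calculated to match the table and the price ranges quoted earlier, not figures stated in the writeup.

```python
# (assumed total cost in USD, fast-tier capacity in GB)
RIGS = {
    "768GB Optane": (4_500, 768),
    "4x RTX 3090": (3_900, 96),
    "8x H100": (200_000, 640),
}


def cost_per_gb(total_cost_usd, capacity_gb):
    """Dollars of total rig cost per GB of fast-tier memory."""
    return total_cost_usd / capacity_gb


for name, (cost, capacity) in RIGS.items():
    print(f"{name}: ${cost_per_gb(cost, capacity):.2f} per GB")
```

Swapping in your own quotes for the totals shows how sensitive the ranking is: even at the top of the Optane price range the gap to the GPU rigs stays several-fold.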

Why this works for inference but breaks for training

Optane's write endurance is the issue. The PMem 200-series is rated for roughly 360 PB of writes over the DIMM's lifetime — generous for an enterprise database workload but inadequate for training. A single epoch of LoRA fine-tuning on a 70B model writes ~50–80 TB of optimizer state if you naively swap state through PMem; full pre-training writes hundreds of TB per day. You'd burn through the endurance budget in weeks, possibly faster.

Inference is read-dominated. Loading weights at startup is a one-time write (just the model copy in). Generation is pure reads: weights stream in, KV cache is in DRAM, only the new token vector is written back to DRAM. The endurance budget is essentially untouched by inference workloads of any reasonable scale.

Don't try to fine-tune on Optane. If you need both inference and training, build two rigs.

Storage and SSD considerations

The model weights live on the Optane region during use, but you'll want them on durable storage too. A 1TB host NVMe like the WD Blue SN550 1TB is enough for a single 671B q4 model (~400GB on disk). For multi-model setups, step up to a 2TB or 4TB drive and stage weights to Optane on demand. The Crucial BX500 1TB SATA SSD works for OS and boot but not for weight staging — its 540 MB/s read ceiling adds 12+ minutes to the cold-start time of a frontier-scale model.
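The cold-start penalty is simple division: model size over sustained read speed. A minimal sketch, using the ~400GB model size and the 540 MB/s SATA ceiling from the paragraph above; the 2,400 MB/s NVMe figure is an assumed sequential-read rating for comparison, not a number from the article.

```python
def staging_minutes(model_gb, read_mb_s):
    """Minutes to read model_gb gigabytes at a sustained read_mb_s (decimal units)."""
    return model_gb * 1000 / read_mb_s / 60


MODEL_GB = 400  # approximate size of a 671B q4 model on disk

for label, speed in [("SATA SSD at 540 MB/s", 540),
                     ("NVMe at 2,400 MB/s (assumed)", 2400)]:
    print(f"{label}: {staging_minutes(MODEL_GB, speed):.1f} min")
```

Real staging times run longer than this floor, since sustained reads rarely hold the rated peak for hundreds of gigabytes.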

Linux swap behavior is the other gotcha. Disable swap on the host or constrain it to a dedicated small partition. The kernel will sometimes try to swap "cold" pages out of the DAX-backed Optane region, which causes a confusing performance cliff. Set vm.swappiness=1 and pin processes with taskset to the relevant socket.

Common pitfalls

  1. Mixing PMem 100 and 200 series. PMem 100 has lower throughput and different namespace tooling. Don't shop the cheapest DIMMs — get a matched set of PMem 200-series.
  2. Wrong memory mode. Memory Mode is transparent to the OS but hides the Optane region as a tier behind DRAM, defeating mmap-based weight loading. You want App Direct.
  3. Single-socket builds. A single-socket Cascade Lake gives you 6 channels and ~384GB Optane ceiling. The 768GB headline number requires dual socket.
  4. Forgetting the DRAM companions. Optane DIMMs need DDR4 companions on the same channels for cache. Spec 64–96GB DDR4 minimum even if you think the model fits.
  5. Old kernel. Linux 5.x has buggy DAX behavior under load. Go to 6.4+ on a current Ubuntu LTS or Debian release.
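Two of these pitfalls (kernel version and a mount that silently lacks DAX) are easy to check before loading a model. The sketch below is one possible preflight check, not part of llama.cpp or ndctl; the function names and the /mnt/pmem mount point are hypothetical. It parses a kernel release string and /proc/mounts-style text, so it can be tried on sample strings first.

```python
import re


def kernel_at_least(release, minimum=(6, 4)):
    """True if a release string like '6.8.0-45-generic' meets (major, minor)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


def dax_mounted(mounts_text, mountpoint):
    """True if /proc/mounts-style text shows mountpoint mounted with DAX on."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mountpoint:
            options = fields[3].split(",")
            # Older kernels report "dax"; newer ones report "dax=always".
            return any(opt in ("dax", "dax=always") for opt in options)
    return False


sample = "/dev/pmem0 /mnt/pmem xfs rw,relatime,dax=always 0 0\n"
print(kernel_at_least("6.8.0-45-generic"), dax_mounted(sample, "/mnt/pmem"))
```

On a live host the inputs would come from platform.release() and the contents of /proc/mounts.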

Bottom line

If your only goal is "run the biggest model I can host on $5,000 of parts," the 768GB Optane rig is the most interesting hobbyist hardware story of 2026. It puts frontier-scale MoE inference inside reach of anyone willing to deal with second-hand enterprise hardware, kernel-mode configuration, and the patience of a coffee break between query and answer.

For anyone whose workload fits in 96–192GB and who needs interactive latency, the four-3090 rig wins comfortably. For multi-user serving or any training workload, neither rig is the right answer — you want a cloud instance or a real GPU server.

The Optane build matters because it expands the envelope. Until this configuration, hobbyist hardware capped out around 192GB of fast-tier memory. Now there's a $5K path to 768GB, and it works today. That's a five-year leap in what's possible at home, even if the tok/s humbles you.

FAQ

Why has Intel Optane suddenly become a budget AI hardware option? Per Tom's Hardware reporting, Intel discontinued Optane DIMMs in 2022, and enterprise refresh cycles in 2025–2026 have flooded eBay with 128GB and 256GB modules at $1–2/GB — a 10× discount versus equivalent DDR5 RDIMM capacity. The catch is you need a compatible Cascade Lake or Ice Lake Xeon platform; these are also cheap on the second-hand market for the same reason.

How slow is Optane-backed inference compared to GPU inference? Roughly 10–30× slower in tokens per second versus an equivalent VRAM-resident model on H100. The trade-off is capacity: nothing in the consumer or prosumer tier offers 768GB of any kind of fast-tier memory at this price. For batch-size-1 personal use of a frontier-scale MoE, output is conversational-readable; for serving multiple users it falls apart.

What's the typical hardware bill for a 768GB Optane LLM rig? Per builder reports, you're looking at $800–$1,200 for a dual-socket Cascade Lake Xeon board with CPUs, $1,500–$2,500 for the Optane DIMMs themselves, $400–$800 for a chassis with the right PSU and cooling, plus the host SSD and a low-end GPU for display output. The total is reachable under $5,000 in 2026, versus $25,000+ for the equivalent VRAM in consumer GPUs.

Does this work for fine-tuning or only for inference? Inference only. Optane DIMMs trade latency and write endurance for capacity and price. Training and even LoRA fine-tuning require sustained write bandwidth that destroys Optane's endurance budget within weeks. If your goal is fine-tuning, accept the smaller model size and use real DDR5 + a single 3090 or 4090 instead.

What software stack supports Optane-backed LLM inference? llama.cpp with mmap and large-context flags is the dominant path: under App Direct mode it treats the Optane region as if it were normal system RAM, with the CPU caches holding the hottest weights. vLLM and TGI don't have first-class support yet. You'll also want a Linux kernel 6.4+ for proper persistent memory namespace handling and the ndctl userspace tools.

Sources

  1. Tom's Hardware — 768GB Intel Optane DIMM 1T LLM build
  2. llama.cpp GitHub repository
  3. PMem.io — persistent memory programming resources

— SpecPicks Editorial · Last verified 2026-05-28