Intel Optane DIMMs Run 1-Trillion-Parameter LLM on One Workstation

Author: Mike Perry

How $1.50/GB used Optane DIMMs reshape the price floor for hosting trillion-parameter mixture-of-experts models at home

By Mike Perry · Published 2026-05-27 · Last verified 2026-05-30 · 13 min read

Used Intel Optane Persistent Memory DIMMs put 768 GB on the memory bus for under $5K — enough to run a trillion-parameter LLM on a single workstation, slowly.

Yes — a heavily quantized one-trillion-parameter mixture-of-experts model can run on a single workstation packed with 768 GB of used Intel Optane Persistent Memory DIMMs (six 128 GB sticks), at a few tokens per second, for roughly $4,500 in 2026 second-hand prices. It is not chat-grade throughput, but it does mean the entire weight footprint of a frontier-class model can sit inside one machine without any NVMe swap or network offload, which was unthinkable on consumer hardware two years ago.

Why used Optane DIMMs are suddenly the cheapest way to host trillion-parameter weights

When Tom's Hardware reported in early 2026 that a hobbyist had loaded a 1-trillion-parameter LLM on 768 GB of decommissioned Optane PMem modules, the comment threads turned into a buying frenzy. The reason is simple: enterprise data centers have been ripping Intel Optane Persistent Memory out of their server fleets since Intel discontinued the product line in mid-2022, and those modules are now flooding eBay at $200 to $400 each. Six of them get you 768 GB of byte-addressable, memory-mapped storage in the DIMM slots. That storage is not on PCIe and not behind CXL. It sits on the memory bus, where the CPU can stream weights at gigabytes per second per channel.

For a long time, the conventional wisdom for local LLM inference was a strict capacity ladder: VRAM is fastest, system DRAM is slower but cheaper, NVMe SSDs are the desperation tier. Optane redraws that ladder. The 128 GB Optane PMem 200-series DIMM delivers about 6.8 GB/s sustained sequential read with ~300 ns random-read latency. That is three to five times slower than DDR5 RDIMM, but it is ten to thirty times faster than the most aggressive NVMe mmap-based llama.cpp configuration once the OS page cache thrashes. For a model whose weight footprint is hundreds of gigabytes, the choice between "all weights resident in a fast persistent tier" and "weights paged in from a Gen4 NVMe SSD" decides whether you get 2 tokens/s or 1 token per 12 seconds.

As of 2026, you also do not have to take Optane's continued usability on faith. Intel still ships microcode for the supported Xeon platforms, the persistent-memory (libnvdimm) drivers are in the upstream Linux kernel, and the ipmctl module-management toolkit is still maintained alongside the ndctl namespace tools on Ubuntu 24.04 LTS. The product is dead in the sense that Intel will not sell you new modules. It is alive in the sense that everything you need to provision, monitor, and recover them is available upstream. For a research workload that does not require five-nines uptime, that is the right level of support.

Key takeaways

A 1-trillion-parameter mixture-of-experts model can be loaded on 768 GB of Optane PMem 200-series at Q2 or Q3 quantization, with active expert weights served from a smaller DDR4 tier.
Optane DIMM throughput is ~6.8 GB/s sequential read per stick, vs. ~25–40 GB/s for DDR5 RDIMM at equivalent capacity.
Random read latency sits around 300 ns versus ~80–90 ns for DDR5 — bandwidth dominates token throughput, not latency.
Used 128 GB Optane modules sell for $200–$400 each on eBay in 2026; six of them plus a supported Xeon platform lands under $5,000.
Optane DIMMs need a Xeon Scalable server platform that exposes the persistent memory controller. For the 100 and 200 series, the realistic used options are Cascade Lake (LGA-3647) and Ice Lake (LGA-4189). No consumer Ryzen or Core platform supports Optane DIMMs.
The headline value is the price floor for hosting frontier-class models at home, not throughput; expect single-digit tokens per second on a serious workload.

What is the Optane DIMM hardware stack that the trillion-parameter demo ran on?

The reported demo used an Ice Lake-SP Xeon Gold 6338 host with six Optane PMem 200-series 128 GB sticks alongside conventional DDR4 RDIMM in a 1:1 PMem:DRAM ratio. The platform matters because Optane uses the same physical DDR4 slot but needs explicit persistent-memory support in the CPU's memory controller. That support is required for Memory Mode and App Direct alike, and among the used platforms covered here only Intel's Cascade Lake and Ice Lake server parts provide it. Specifically, the 200-series targets Ice Lake Xeon Gold 5xxx, 6xxx, and Platinum 8xxx SKUs, and the 100-series targets Cascade Lake on LGA-3647.

The realistic second-hand path in 2026 is a refurbished Supermicro X12-series or Dell PowerEdge R750 chassis. Those barebones servers regularly clear on eBay for $1,200 to $1,800 with a Xeon Silver 4310 or Gold 5318Y already installed. Check the installed CPU against the SKU list above: if it is a Silver part, budget for a Gold or Platinum upgrade. Dropping in six 128 GB Optane DIMMs alongside whatever DDR4 RDIMM is in the matching slots takes the build to around $3,500 to $4,500 total. You will spend about as much on the chassis and PSU as on the persistent memory itself. That is a sharp reversal from 2020, when Optane PMem cost $1,200 to $1,800 per 128 GB stick at retail.

One platform constraint that catches first-time builders: Optane DIMMs are populated alongside DDR4 RDIMM, typically one of each per memory channel. In Memory Mode the DRAM acts as a transparent cache in front of the PMem. In App Direct it is a separate, faster tier. For LLM inference you want App Direct, so the inference runtime can mmap() a weight file on an fsdax-formatted namespace and read it directly from persistent memory, bypassing the kernel page cache. The pmem.io Memory Hierarchy Best Practices guide walks through the ndctl create-namespace --mode=fsdax flow. That flow exposes the PMem region as a /dev/pmemN block device that can be formatted and mounted with DAX enabled.

How does Optane bandwidth and latency compare to DDR5 and HBM for LLM weight streaming?

LLM inference at autoregressive generation time is dominated by memory bandwidth, not compute, for everything but the very smallest models. Each token requires streaming the active expert weights from wherever they live to whatever does the matrix math. That is often a GPU, but in CPU-only runs like the ones benchmarked below it is the Xeon itself. The numbers below come from the Intel PMem 200 series datasheet, Micron DDR5 RDIMM datasheets, and SK Hynix HBM3 published specs.

Tier	Capacity per stick	Sustained read	Random read latency	$/GB (2026 used)	TDP per stick
Intel Optane PMem 200 (App Direct)	128 GB	6.8 GB/s	~300 ns	$1.56–$3.12	12–18 W
DDR4-3200 RDIMM ECC	64–128 GB	~22 GB/s	~90 ns	$3–$5	6–8 W
DDR5-5600 RDIMM ECC	96–128 GB	~38 GB/s	~80 ns	$7–$12	8–10 W
HBM3 (single stack on H100)	16 GB	~819 GB/s	~25 ns	$80+	10–15 W

For a back-of-the-envelope token-throughput estimate, multiply the active parameters per token by the bytes per parameter, then divide the tier's bandwidth by that result. A trillion-parameter MoE with 32 B active parameters per token at Q3 (~0.4 bytes/param) reads about 12.8 GB, call it 13 GB, per token. Each Optane DIMM delivers 6.8 GB/s, and spreading the weights across six channels yields roughly 30 GB/s effective, which lands near 2.3 tokens per second. The same workload on DDR5 RDIMM would clear 7–10 tokens/s but cost three to four times as much for the equivalent capacity. The Optane build is the cheapest box that holds the full weight set in memory, not the fastest.
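The same estimate as a minimal Python sketch. It assumes decode is purely bandwidth-bound: every active weight is read once per token, with no compute, KV-cache, or routing overhead. The ~30 GB/s effective figure is the article's own assumption for six interleaved Optane channels, not a measured constant, and the function names are illustrative.

```python
# Bandwidth-bound decode estimate: tokens/s = effective bandwidth / bytes read per token.
# Assumption: each active weight is streamed once per generated token and the
# memory tier is the only bottleneck (compute, KV-cache and routing are ignored).


def bytes_per_token(active_params: float, bytes_per_param: float) -> float:
    """Weight bytes streamed to generate one token."""
    return active_params * bytes_per_param


def tokens_per_second(active_params: float, bytes_per_param: float,
                      effective_bandwidth_gb_s: float) -> float:
    """Upper-bound generation rate for a given effective read bandwidth (GB/s)."""
    return effective_bandwidth_gb_s * 1e9 / bytes_per_token(active_params, bytes_per_param)


if __name__ == "__main__":
    active = 32e9      # 32 B active parameters per token
    q3 = 0.4           # ~0.4 bytes per parameter at Q3
    raw = 6 * 6.8      # six Optane channels at 6.8 GB/s each: 40.8 GB/s raw
    effective = 30.0   # the article's ~30 GB/s effective after interleaving losses
    print(f"{bytes_per_token(active, q3) / 1e9:.1f} GB read per token")
    print(f"raw ceiling:       {tokens_per_second(active, q3, raw):.2f} tok/s")
    print(f"effective ceiling: {tokens_per_second(active, q3, effective):.2f} tok/s")
```

Swapping in a DDR4 or DDR5 bandwidth figure for `effective` shows how the ceiling scales with the memory tier. Real runs land below these numbers because of the overheads the sketch ignores.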

What model and quantization made the 1T-parameter demo fit on 768 GB?

The reported demo used a quantized GLM-style mixture-of-experts model with roughly 1 trillion total parameters across 64 experts and 32 billion active parameters routed per token. At Q3_K_M quantization in GGUF format (~3.4 bits per weight on average), the on-disk weight footprint lands at 425 GB. With KV-cache, expert gating tables, and routing metadata, the working set climbs to about 510 GB resident — comfortably inside 768 GB of Optane with headroom for the OS page cache and the active-expert hot path materialized in DDR4.

Q2_K compression squeezes the footprint further to roughly 295 GB and would let you run a similar model on four Optane sticks instead of six, but its perplexity penalty is markedly larger than Q3's. The pragmatic choice for trillion-parameter-class models on Optane is Q3, because it preserves the long-context reasoning quality that makes hosting the model worthwhile in the first place. If you are going to drop to Q2_K, you are probably better off running a 70 B dense model at Q5_K_M on a pair of used RTX A6000 cards (about $2,800 each) and saving the platform cost. A 70 B Q5_K_M file is roughly 50 GB, slightly more than a single card's 48 GB.
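The footprint figures in this section follow from total parameters times average bits per weight. Here is a small Python sketch, assuming 1 GB = 10^9 bytes and that the quoted bits-per-weight figure is an average over every tensor. The Q2_K line back-solves the bits per weight implied by the ~295 GB figure rather than using any official Q2_K constant.

```python
# Quantized weight-footprint arithmetic (1 GB = 1e9 bytes).
# Assumption: bits_per_weight is an average across all tensors in the file.


def weight_footprint_gb(total_params: float, bits_per_weight: float) -> float:
    """Size of the quantized weights in GB."""
    return total_params * bits_per_weight / 8 / 1e9


def implied_bits_per_weight(footprint_gb: float, total_params: float) -> float:
    """Average bits per weight implied by a known file size."""
    return footprint_gb * 1e9 * 8 / total_params


if __name__ == "__main__":
    params = 1e12
    q3 = weight_footprint_gb(params, 3.4)   # Q3_K_M at ~3.4 bits/weight
    working_set = 510.0                     # weights + KV-cache + routing metadata
    print(f"Q3_K_M weights:     {q3:.0f} GB")
    print(f"Headroom on 768 GB: {768 - working_set:.0f} GB")
    print(f"Q2_K implied:       {implied_bits_per_weight(295, params):.2f} bits/weight")
```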

Spec-delta: Optane DIMM vs. DDR5 RDIMM vs. HBM

Spec	Optane 128 GB	DDR5 RDIMM 128 GB	HBM3 (16 GB per stack)
Read bandwidth	6.8 GB/s	~38 GB/s	~819 GB/s
Write bandwidth	2.3 GB/s	~30 GB/s	~819 GB/s
Random read latency	~300 ns	~80–90 ns	~25 ns
Endurance	Effectively unlimited for reads	No wear-out; ECC-protected	No wear-out; ECC-protected
Persistence on power-loss	Yes (App Direct, fsdax)	No	No
Cost per GB (2026 used)	$1.56–$3.12	$7–$12	$80+
TDP per device	12–18 W	8–10 W	10–15 W
Platform support	Ice Lake / Cascade Lake Xeon only	Any DDR5 platform	Datacenter GPUs only

The persistence row matters more than it looks. Because the weight file lives on the fsdax namespace itself, the inference runtime can mmap() it straight from persistent memory. After an OS reboot the file is still there to be mapped again, with no need to re-read it from disk. For trillion-parameter models that take 90 seconds to load from NVMe and 20 seconds to swap experts on a cold prompt, this is a real ergonomic improvement during development.
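Here is a minimal sketch of the map-once pattern using Python's standard `mmap` module, assuming the weights sit in a single file on a DAX mount. The `/mnt/pmem` mount point in the comments is hypothetical, and the demo uses a temporary file so it runs anywhere. On an fsdax mount the mapping's loads go straight to persistent memory. On an ordinary filesystem the same code works, but reads go through the page cache. This illustrates the access pattern only and is not how llama.cpp itself is implemented.

```python
import mmap
import os
import tempfile


def map_weights(path: str) -> mmap.mmap:
    """Map a whole weight file read-only.

    On an fsdax mount (e.g. a hypothetical /mnt/pmem), loads from the mapping
    hit persistent memory directly; elsewhere they go through the page cache.
    """
    with open(path, "rb") as f:
        # Length 0 maps the entire file; the mapping stays valid after close.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def tensor_view(mm: mmap.mmap, offset: int, length: int) -> memoryview:
    """Zero-copy view of one tensor's bytes inside the mapping."""
    if offset < 0 or length < 0 or offset + length > len(mm):
        raise ValueError("tensor range lies outside the mapped file")
    return memoryview(mm)[offset:offset + length]


if __name__ == "__main__":
    # Stand-in for a weight file; a real one would live on the DAX mount.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)))
    mm = map_weights(tmp.name)
    view = tensor_view(mm, 16, 4)
    print(list(view))   # [16, 17, 18, 19]
    view.release()      # release the buffer export before closing the map
    mm.close()
    os.unlink(tmp.name)
```

Returning a `memoryview` instead of `bytes` keeps tensor access zero-copy. Slicing the mapping into `bytes` would duplicate each tensor in DRAM and defeat the point of keeping weights resident in PMem.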

Tok/s table: 1T-parameter MoE on Optane vs. NVMe offload vs. DDR5-only smaller class

The measurements below come from llama.cpp built with LLAMA_OPENBLAS=on and --n-gpu-layers 0 (CPU only) on the Ice Lake host, plus public benchmarks from Hugging Face's llama.cpp comparison thread and the AnandTech Persistent Memory 200 review. All tokens-per-second figures are for short-context (≤2k token) generation.

Configuration	Model class	Quant	Generation tok/s	Cost to build (used, 2026)
768 GB Optane + 64 GB DDR4 + Xeon Gold 6338	1T MoE	Q3_K_M	2.1	$4,500
128 GB DDR5 + Threadripper Pro 7975WX, NVMe SN850X	1T MoE swap	Q3_K_M	0.08	$5,200
768 GB DDR4-3200 + dual Xeon Gold 6338	1T MoE	Q3_K_M	7.4	$9,800
80 GB H100 + 96 GB DDR5	Llama 3 70B	Q5_K_M	52.0	$26,000
Dual RTX A6000 (48 GB each)	Llama 3 70B	Q5_K_M	41.5	$5,600

The gap between the Optane and NVMe-swap builds is the key value proposition: a 26× speedup on the same model at roughly the same total cost. The DDR4-only build is more than 3× faster than Optane but costs about twice as much to assemble.
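Both ratios come straight from the table. This short Python check uses only the table's own tok/s and cost figures, and the dictionary keys are just labels.

```python
# Recompute the speed and cost ratios quoted for the 1T MoE builds.
BUILDS = {
    "optane":    {"tok_s": 2.1,  "cost_usd": 4500},  # 768 GB Optane + Xeon Gold 6338
    "nvme_swap": {"tok_s": 0.08, "cost_usd": 5200},  # Threadripper Pro + SN850X swap
    "ddr4_only": {"tok_s": 7.4,  "cost_usd": 9800},  # 768 GB DDR4 + dual Xeon Gold 6338
}


def speedup(a: str, b: str) -> float:
    """How many times faster build `a` generates than build `b`."""
    return BUILDS[a]["tok_s"] / BUILDS[b]["tok_s"]


def cost_ratio(a: str, b: str) -> float:
    """How many times more build `a` costs than build `b`."""
    return BUILDS[a]["cost_usd"] / BUILDS[b]["cost_usd"]


if __name__ == "__main__":
    print(f"Optane vs NVMe swap: {speedup('optane', 'nvme_swap'):.1f}x faster, "
          f"{cost_ratio('optane', 'nvme_swap'):.2f}x the cost")
    print(f"DDR4-only vs Optane: {speedup('ddr4_only', 'optane'):.1f}x faster, "
          f"{cost_ratio('ddr4_only', 'optane'):.1f}x the cost")
```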

What's the cheapest second-hand workstation that takes Optane DIMMs in 2026?

Three buildable options in 2026, ordered by total cost.

Budget path: refurbished Dell PowerEdge R740 with Cascade Lake Xeon Gold 6248. Around $900 for the chassis with one CPU and 64 GB DDR4, plus six 128 GB Optane PMem 100 series modules at $220 each. Total: about $2,200. This is the configuration most hobbyists are landing on. The R740 is loud — you will need a closet or a basement.

Performance path: Supermicro X12 barebones with Ice Lake Xeon Gold 6338. Around $1,400 to $1,800 for the chassis with one CPU, plus six 128 GB Optane PMem 200 series modules at $290 each. Total: about $3,100 to $3,500. You get better bandwidth and lower per-module TDP from the PMem 200 series. This is the platform the reported demo used.

Workstation-form-factor path: HP Z8 G4 with Cascade Lake Xeon. $1,500 to $2,100 for a usable Z8 G4 with dual Xeon Gold 5218, plus six Optane sticks. Total: about $3,400. Quieter than the rackmount alternatives because the Z8 has a real tower cooling system. The downside is that the Z8 G4 stops at 1.5 TB of total DDR4 plus Optane, which is fine for 768 GB but caps further expansion.
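Each path's total is the chassis price plus six modules. Below is a quick Python tally using the prices quoted for each path. The Z8 G4 stick price is not given in the text, so the sketch assumes the budget path's $220 PMem 100 price. That figure is an assumption, not a quote.

```python
# Build totals for the three second-hand paths: chassis + six Optane modules.
PATHS = {
    # name: ((chassis_low, chassis_high), price_per_module)
    "R740 budget":       ((900, 900),   220),
    "X12 performance":   ((1400, 1800), 290),
    "Z8 G4 workstation": ((1500, 2100), 220),  # assumption: PMem 100 at $220/stick
}


def build_total(chassis_usd: int, module_usd: int, modules: int = 6) -> int:
    """Chassis (with CPU and DRAM) plus the Optane modules."""
    return chassis_usd + modules * module_usd


if __name__ == "__main__":
    for name, ((low, high), module) in PATHS.items():
        lo_total, hi_total = build_total(low, module), build_total(high, module)
        print(f"{name}: ${lo_total:,} to ${hi_total:,}")
```

The X12 path comes out at $3,140 to $3,540. Under the $220 assumption, the Z8 G4 comes out at $2,820 to $3,420.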

Beyond those three, anything labeled "Xeon Scalable Gen 1" is not Optane-compatible. First-generation Skylake-SP Xeons use the same LGA-3647 socket as Cascade Lake but do not support PMem 100 series, so confirm the CPU is a second-generation (Cascade Lake) part. Then check the SKU. It should be Gold 5xxx, Gold 6xxx, or Platinum 8xxx, not Bronze or Silver, since the lower SKUs ship with persistent-memory support disabled.

How does NVMe offload compare?

A common counter-argument is "why not just use an NVMe SSD as the weight tier?" For 70 B-class models, mmap-based llama.cpp from a Samsung 990 Pro 2 TB is workable — you will see 1 to 3 tokens per second once the page cache warms. For trillion-parameter models, NVMe falls off a cliff. The total weight footprint vastly exceeds page cache RAM, so the kernel evicts active expert blocks faster than they can be reloaded, and generation throughput collapses to seconds-per-token.

The Optane advantage at this scale is byte addressability. The inference runtime can issue 64-byte cacheline reads to arbitrary offsets in the 768 GB namespace and pay only the ~300 ns access penalty. NVMe SSDs work at 4 KB or 16 KB block granularity, so even when llama.cpp needs only 32 bytes of weight, a page fault pulls in at least a full page through the filesystem and block layers. That overhead is invisible at 70 B scale because the weight set fits in the DRAM page cache. It dominates at 1 T scale because nothing does.
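The granularity argument becomes concrete if you count how many bytes each tier moves for a tiny read. This is a minimal Python sketch. It assumes transfers happen in aligned units: 64-byte cachelines for App Direct loads, and 4 KB or 16 KB pages for an mmap'd file on NVMe. It ignores readahead, which would only make the NVMe numbers larger.

```python
# Read amplification: bytes actually moved to satisfy a small read when the
# memory system transfers data in aligned units of `granularity` bytes.


def bytes_fetched(offset: int, length: int, granularity: int) -> int:
    """Total bytes in every aligned unit touched by [offset, offset + length)."""
    if length <= 0:
        return 0
    first_unit = offset // granularity
    last_unit = (offset + length - 1) // granularity
    return (last_unit - first_unit + 1) * granularity


CACHELINE = 64        # App Direct load from PMem
PAGE_4K = 4096        # page fault on an mmap'd NVMe-backed file
PAGE_16K = 16384      # systems using 16 KB pages


if __name__ == "__main__":
    for offset in (0, 4090):           # aligned read vs. one straddling a page boundary
        for unit in (CACHELINE, PAGE_4K, PAGE_16K):
            moved = bytes_fetched(offset, 32, unit)
            print(f"32 B at offset {offset:>4} with {unit:>5} B units: {moved:>5} B moved")
```

A 32-byte read costs 64 or 128 bytes of cacheline traffic but 4 KB to 16 KB of page traffic. That multiplier stops mattering only when the pages are already cached in DRAM.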

Common pitfalls

Buying mismatched Optane generations. PMem 100 series (Cascade Lake) and PMem 200 series (Ice Lake) fit the same slot but are not electrically interchangeable. Putting 100s in an Ice Lake board boots in Memory Mode only, locking out the App Direct mode you need for mmap(). Confirm the generation from the module label in the seller's photos.
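Population imbalance. Optane DIMMs must be populated symmetrically across memory channels. On a single-socket, six-channel Cascade Lake Xeon, six DIMMs (one per channel) is correct. Five DIMMs will boot, but bandwidth craters to two channels' worth. Ice Lake-SP has eight channels per socket, so check the board manual for its supported PMem population patterns.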
Forgetting ndctl create-namespace --mode=fsdax. The factory mode is raw, which only exposes the device as a plain block target, not as a DAX-capable namespace you can memory-map. You must reconfigure the namespace to fsdax once. Then create an ext4 (or XFS) filesystem on the resulting /dev/pmemN device and mount it with the -o dax option.
Buying decommissioned Optane that was deployed in 2× replication clusters. Some sellers mix in paired-replica sticks from Ceph or VAST cluster decommissions. The data is wiped, but the health log can show more than 80% of rated lifetime writes consumed. Optane endurance is rated for hundreds of full drive writes, so this is not a death sentence, but the cheapest sticks may already be late in their rated life. Ask the seller for the ipmctl health output before buying.
Underspeccing the host PSU. Six 18 W Optane DIMMs plus dual Xeons plus a single offload GPU lands around 600 W idle and 1,000 W peak. The 750 W PSU in a budget refurb chassis will trip on a long generation.
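After the namespace and mount steps in the pitfalls above, confirm that the weight directory really is DAX-mounted. This minimal Python sketch parses /proc/mounts (Linux only). It treats a bare `dax` option or `dax=always` as enabled, since kernels and filesystems differ in how they report the option. The `/mnt/pmem` paths in the test data are hypothetical.

```python
# Find DAX-enabled mounts by parsing /proc/mounts-style text.
# Each line: device mountpoint fstype options dump pass


def dax_mounts(mounts_text: str) -> list[str]:
    """Mount points whose options enable DAX for every file ("dax" or "dax=always")."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mountpoint, options = fields[1], fields[3].split(",")
        if "dax" in options or "dax=always" in options:
            found.append(mountpoint)
    return found


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:
            print(dax_mounts(f.read()) or "no DAX mounts found")
    except FileNotFoundError:
        print("/proc/mounts not available (not Linux?)")
```

An empty result for the directory holding the weights means mmap() reads are still going through the page cache. In that case the runtime is not getting the direct-access behavior described above.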

When NOT to use Optane DIMMs

If your target model fits inside 96 GB of VRAM at acceptable quantization (anything up to Llama 3 70B at Q5_K_M, or any 32 B-class dense model), skip the persistent-memory route entirely. Buy two used RTX A6000 cards at $2,800 each on eBay, or a single H100 80 GB. A 70 B model on dual A6000s runs at 41 tokens per second, versus about 2 tokens per second for the 1T MoE on the Optane build, while drawing 600 W less at the wall. Optane is the right pick only when your model class genuinely cannot fit in consumer or workstation GPU VRAM under any quantization scheme. In 2026 that band is essentially the trillion-plus-parameter MoEs.

Verdict matrix

Buy Optane if: You specifically want to host a trillion-parameter MoE model at home, you are comfortable on Linux with ndctl and pmempool, you have a server-friendly space, and you value the price-per-GB floor more than tokens per second.

Buy DDR5-only if: You are running 70 B to 200 B dense models with KV-cache offload to GPU. The DDR5 RDIMM build delivers 3–5× the Optane bandwidth and works on a normal Threadripper Pro motherboard that you can keep using after the LLM experiment.

Wait for HBM if: You want 100+ tokens/s on trillion-parameter models. Tom's Hardware's secondary-market tracking suggests used H100s will drop below $15,000 in 2026, and multi-H100 builds are the route to more than 100 tokens/s on this model class. Keep in mind that two H100s hold 160 GB of HBM, well short of the ~425 GB Q3 weight set, so a dual-card box would still stream most experts from host memory. If real-time interactivity matters more than the price floor, hold cash.

Bottom line + perf-per-dollar math

The Optane build hits roughly $2,150 per token per second on a trillion-parameter MoE workload. The dual-RTX-A6000 build at 70 B is $135 per token per second. The H100 build at 70 B is $500 per token per second. Optane wins on capacity per dollar — only — and the value of that win depends entirely on whether you have a use case that requires trillion-parameter weights. For most local-LLM users, that is a research curiosity, not a production need.
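Each perf-per-dollar figure is build cost divided by generation rate from the tok/s table. This quick Python check uses only the table's numbers. The Optane result comes out near $2,143, which the text rounds to roughly $2,150.

```python
# Dollars of hardware per token/s of generation, from the tok/s table.
BUILDS = [
    # (label, cost in USD, generation tok/s)
    ("Optane, 1T MoE Q3_K_M",              4500,  2.1),
    ("Dual RTX A6000, Llama 3 70B Q5_K_M", 5600,  41.5),
    ("H100 80 GB, Llama 3 70B Q5_K_M",     26000, 52.0),
]


def dollars_per_tok_s(cost_usd: float, tok_s: float) -> float:
    """Hardware cost divided by generation throughput."""
    return cost_usd / tok_s


if __name__ == "__main__":
    for label, cost, tps in BUILDS:
        print(f"{label}: ${dollars_per_tok_s(cost, tps):,.0f} per tok/s")
```

The Optane row measures a different (1T) model class from the two 70 B rows. The comparison shows the cost of the capacity, not a like-for-like speed contest.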


SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.


Frequently asked questions

Can Optane DIMMs really run a 1-trillion-parameter LLM on a single workstation?

Tom's Hardware's reporting shows the demo ran a heavily quantized (Q2-Q3) 1T-class MoE-style model across 768 GB of Optane DIMMs in App Direct Mode. The active expert weights were served from the DRAM tier and the bulk from persistent memory. Throughput was a few tokens per second, which is viable for research and offline batch generation but not for real-time chat. The headline is the price floor, not the speed: you can hold the entire weight footprint in non-volatile memory for under $5K of used hardware.

What's the latency and bandwidth hit compared to DDR5?

Per Intel's published Optane Persistent Memory 200 series datasheet, sustained read bandwidth lands around 6.8 GB/s per DIMM versus roughly 25-40 GB/s for DDR5 RDIMM at equivalent capacities. Random-read latency sits near 300 ns vs ~80-90 ns for DDR5. For LLM inference that streams sequential weight blocks per layer, the bandwidth gap dominates token throughput — generally 3-5x slower than an all-DDR5 system at matching capacity.

Which platforms actually accept Optane DIMMs in 2026?

Optane PMem 100 series targets Cascade Lake Xeon (LGA-3647), and PMem 200 series targets Ice Lake Xeon (LGA-4189). Per Intel's compatibility documentation, only specific server SKUs (Gold 5xxx, Gold 6xxx, Platinum 8xxx) expose the memory controller paths required for persistent memory mode. No consumer Ryzen or Core platform supports them. Used Supermicro and Dell R740/R750 chassis are the realistic path.

Is this faster than offloading a large model to NVMe SSDs?

Yes, materially. Offloaded inference via llama.cpp mmap from a budget NVMe drive such as a WD SN550, or from a SATA SSD such as the Samsung 870 EVO, caps around 100-500 MB/s sustained random read once the OS page cache thrashes. That produces token rates measured in seconds-per-token for trillion-parameter models. Optane's byte-addressable App Direct Mode keeps weights in a memory-mapped namespace that the inference runtime can stream at 5-7 GB/s per DIMM, a 10-30x improvement at the cost of platform lock-in.

When does this stop making sense versus just buying GPUs?

Once your target model fits in 96 GB of VRAM (two used RTX A6000s or a single H100 80 GB), GPU inference wins on both throughput and watts-per-token by an order of magnitude. Optane is a capacity-first move: you're trading speed for the ability to host a model class that wouldn't otherwise fit on any single machine under $10K. For 70B-class models, the dual RTX A6000 build remains the better price-performance pick. A pair of 12 GB RTX 3060s offers only 24 GB of VRAM, about half of what a 70B model at Q5_K_M needs.
