Yes — a heavily quantized one-trillion-parameter mixture-of-experts model can run on a single workstation packed with 768 GB of used Intel Optane Persistent Memory DIMMs (six 128 GB sticks), at a few tokens per second, for roughly $4,500 in 2026 second-hand prices. It is not chat-grade throughput, but it does mean the entire weight footprint of a frontier-class model can sit inside one machine without any NVMe swap or network offload, which was unthinkable on consumer hardware two years ago.
Why used Optane DIMMs are suddenly the cheapest way to host trillion-parameter weights
When Tom's Hardware reported in early 2026 that a hobbyist had loaded a 1-trillion-parameter LLM on 768 GB of decommissioned Optane PMem modules, the comment threads turned into a buying frenzy. The reason is simple: enterprise data centers have been ripping Intel Optane Persistent Memory out of their server fleets since Intel discontinued the product line in mid-2022, and those modules are now flooding eBay at $200 to $400 each. Six of them get you 768 GB of byte-addressable, memory-mapped storage in the DIMM slots: not on PCIe, not over CXL, just sitting on the memory bus where the CPU can stream weights at gigabytes per second per channel.
For a long time, the conventional wisdom for local LLM inference was a strict capacity ladder: VRAM is fastest, system DRAM is slower but cheaper, NVMe SSDs are the desperation tier. Optane redraws that ladder. The 128 GB Optane PMem 200-series DIMM delivers about 6.8 GB/s sustained sequential read with ~300 ns random-read latency. That is three to five times slower than DDR5 RDIMM, but it is ten to thirty times faster than the most aggressive NVMe mmap-based llama.cpp configuration once the OS page cache thrashes. For a model whose weight footprint is hundreds of gigabytes, the choice between "all weights resident in a fast persistent tier" and "weights paged in from a Gen4 NVMe SSD" decides whether you get 2 tokens/s or 1 token per 12 seconds.
As of 2026, you also do not have to take it on faith that Optane will keep working: Intel still ships microcode and Linux kernel support, and the ipmctl management toolkit is actively maintained for ndctl users on Ubuntu 24.04 LTS. The product is dead in the sense that Intel will not sell you new modules, but it is alive in the sense that everything you need to provision, monitor, and recover them is in the upstream kernel. For a research workload that does not require five-nines uptime, that is the right level of support.
Key takeaways
- A 1-trillion-parameter mixture-of-experts model can be loaded on 768 GB of Optane PMem 200-series at Q2 or Q3 quantization with active expert weights served from a smaller DDR4 tier.
- Optane DIMM throughput is ~6.8 GB/s sequential read per stick, vs. ~25–40 GB/s for DDR5 RDIMM at equivalent capacity.
- Random read latency sits around 300 ns versus ~80–90 ns for DDR5 — bandwidth dominates token throughput, not latency.
- Used 128 GB Optane modules sell for $200–$400 each on eBay in 2026; six of them plus a supported Xeon platform lands under $5,000.
- Only Cascade Lake (LGA-3647) and Ice Lake (LGA-4189) Xeon platforms expose the persistent memory controller — no consumer Ryzen or Core platform supports Optane DIMMs.
- The headline value is the price floor for hosting frontier-class models at home, not throughput; expect single-digit tokens per second on a serious workload.
What is the Optane DIMM hardware stack that the trillion-parameter demo ran on?
The reported demo used an Ice Lake-SP Xeon Gold 6338 host with six Optane PMem 200-series 128 GB sticks alongside conventional DDR4 RDIMM in a 1:1 PMem:DRAM ratio. The platform matters because Optane uses the same DDR4 physical slot but requires the memory controller to negotiate the App Direct access mode, which only Intel's Cascade Lake and Ice Lake server platforms expose. Specifically, the 200-series targets Ice Lake Xeon Gold 5xxx, 6xxx, and Platinum 8xxx SKUs, and the 100-series targets Cascade Lake on LGA-3647.
The realistic second-hand path in 2026 is a refurbished Supermicro X12-series or Dell PowerEdge R750 chassis. Those barebones servers regularly clear on eBay for $1,200 to $1,800 with a Xeon Silver 4310 or Gold 5318Y already installed; confirm PMem support for the exact SKU on Intel ARK before buying, since the 200-series support list above names only Gold and Platinum parts. Dropping in six 128 GB Optane DIMMs alongside whatever DDR4 RDIMM is in the matching slots takes the build to around $3,500 to $4,500 total. The chassis and PSU now cost roughly as much as the persistent memory itself, a sharp reversal from 2020, when Optane PMem cost $1,200 to $1,800 per 128 GB stick at retail.
One platform constraint that catches first-time builders: Optane DIMMs are populated in pairs with DDR4 RDIMM in alternating slots. The DRAM acts as a cache in front of the Optane capacity when Memory Mode is enabled, and as a standalone fast tier when App Direct is selected. For LLM inference you want App Direct so the inference runtime can mmap() a weight file stored on an fsdax-formatted namespace and read it directly from persistent memory; with a DAX mount, loads go straight to the media rather than through the kernel page cache. The pmem.io Memory Hierarchy Best Practices guide walks through the ndctl create-namespace --mode=fsdax flow that exposes the PMem region as a block device on which a DAX-capable filesystem can be mounted.
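The provisioning flow can be sketched as a dry-run command list. This is a minimal sketch, not a tested recipe: the device node `/dev/pmem0` and the mountpoint `/mnt/pmem` are assumptions, and only the `ndctl create-namespace --mode=fsdax` call and the `dax` mount option come from this guide. The commands are printed rather than executed.

```python
import shlex


def provisioning_commands(pmem_device="/dev/pmem0", mountpoint="/mnt/pmem"):
    """Return the command sequence for an fsdax + ext4 + DAX-mount setup.

    The device node and mountpoint are placeholders (assumptions); check
    what the real host reports before running anything.
    """
    return [
        # Create an fsdax-mode namespace from the available PMem capacity.
        ["ndctl", "create-namespace", "--mode=fsdax"],
        # Put an ext4 filesystem on the resulting pmem block device.
        ["mkfs.ext4", pmem_device],
        # Mount with the dax option so mmap() maps persistent memory directly.
        ["mount", "-o", "dax", pmem_device, mountpoint],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in provisioning_commands():
        print(shlex.join(argv))
```

Keeping the commands as argument lists makes it easy to review them before passing each one to `subprocess.run` with root privileges.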
How does Optane bandwidth and latency compare to DDR5 and HBM for LLM weight streaming?
LLM inference at autoregressive generation time is dominated by memory bandwidth, not compute, for everything but the very smallest models. Each token requires streaming the active expert weights from wherever they live to whatever does the GEMM, usually a GPU. The numbers below come from the Intel PMem 200 series datasheet, Micron DDR5 RDIMM data sheets, and SK Hynix HBM3 published specs.
| Tier | Capacity per stick | Sustained read | Random read latency | $/GB (2026 used) | TDP per stick |
|---|---|---|---|---|---|
| Intel Optane PMem 200 (App Direct) | 128 GB | 6.8 GB/s | ~300 ns | $1.56–$3.12 | 12–18 W |
| DDR4-3200 RDIMM ECC | 64–128 GB | ~22 GB/s | ~90 ns | $3–$5 | 6–8 W |
| DDR5-5600 RDIMM ECC | 96–128 GB | ~38 GB/s | ~80 ns | $7–$12 | 8–10 W |
| HBM3 (single stack on H100) | 16 GB | ~819 GB/s | ~25 ns | $80+ | 10–15 W |
For a back-of-the-envelope token-throughput estimate, take the active parameters per token times the bytes per parameter and divide by per-tier bandwidth. A trillion-parameter MoE with 32 B active parameters per token at Q3 (~0.4 bytes/param) burns about 13 GB per token. Streamed from 768 GB of Optane at 6.8 GB/s — across six channels you can parallelize to roughly 30 GB/s effective — you land near 2.3 tokens per second. The same workload on DDR5 RDIMM would clear 7–10 tokens/s but cost three to four times as much for the equivalent capacity. The Optane build is the cheapest sub-10-tokens/s box that holds the full weight set, not the fastest.
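The estimate above can be reproduced in a few lines. This is a rough sketch that assumes generation is purely bandwidth-bound and that every active weight is streamed exactly once per token; the function name is illustrative, and the inputs are the figures quoted in the paragraph.

```python
def tokens_per_second(active_params, bytes_per_param, bandwidth_gb_s):
    """Bandwidth-bound estimate: each token streams all active weights once."""
    gb_per_token = active_params * bytes_per_param / 1e9
    return bandwidth_gb_s / gb_per_token


if __name__ == "__main__":
    active = 32e9    # active parameters per token (from the text)
    q3_bytes = 0.4   # approximate bytes per parameter at Q3 (from the text)
    print(f"GB per token: {active * q3_bytes / 1e9:.1f}")
    print(f"Six sticks, ~30 GB/s effective: "
          f"{tokens_per_second(active, q3_bytes, 30):.2f} tok/s")
    print(f"One stick, 6.8 GB/s: "
          f"{tokens_per_second(active, q3_bytes, 6.8):.2f} tok/s")
```

Real throughput will sit below this ceiling once routing overhead, KV-cache reads, and compute time are included.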
What model and quantization made the 1T-parameter demo fit on 768 GB?
The reported demo used a quantized GLM-style mixture-of-experts model with roughly 1 trillion total parameters across 64 experts and 32 billion active parameters routed per token. At Q3_K_M quantization in GGUF format (~3.4 bits per weight on average), the on-disk weight footprint lands at 425 GB. With KV-cache, expert gating tables, and routing metadata, the working set climbs to about 510 GB resident — comfortably inside 768 GB of Optane with headroom for the OS page cache and the active-expert hot path materialized in DDR4.
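The footprint arithmetic is simple enough to check directly. The sketch below uses the parameter count, the average bits per weight, and the 510 GB working-set figure from the paragraph above; the helper name is illustrative and sizes are decimal gigabytes.

```python
def weight_footprint_gb(total_params, bits_per_weight):
    """Size of the quantized weights in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    weights_gb = weight_footprint_gb(1e12, 3.4)  # Q3_K_M average from the text
    working_set_gb = 510                         # resident estimate from the text
    optane_gb = 6 * 128                          # six 128 GB sticks
    print(f"weights: {weights_gb:.0f} GB")
    print(f"headroom: {optane_gb - working_set_gb} GB of {optane_gb} GB")
```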
Q2_K compression squeezes the footprint further to roughly 295 GB and would let you run a similar model on four Optane sticks instead of six, but it sacrifices perplexity at a much steeper rate above 70 B parameters. The pragmatic choice for trillion-parameter-class models on Optane is Q3, which preserves the long-context reasoning quality that makes hosting the model worthwhile in the first place. If you are going to drop to Q2_K, you are probably better off running a 70 B dense model at Q5_K_M on RTX A6000 cards ($2,800 each used; at roughly 50 GB the file needs two 48 GB cards) and skipping the server platform entirely.
Spec-delta: Optane DIMM vs. DDR5 RDIMM vs. HBM
| Spec | Optane 128 GB | DDR5 RDIMM 128 GB | HBM3 (16 GB per stack) |
|---|---|---|---|
| Read bandwidth | 6.8 GB/s | ~38 GB/s | ~819 GB/s |
| Write bandwidth | 2.3 GB/s | ~30 GB/s | ~819 GB/s |
| Random read latency | ~300 ns | ~80–90 ns | ~25 ns |
| Endurance | Effectively unlimited for reads; writes wear the media | No practical wear limit | No practical wear limit |
| Persistence on power-loss | Yes (App Direct, fsdax) | No | No |
| Cost per GB (2026 used) | $1.56–$3.12 | $7–$12 | $80+ |
| TDP per device | 12–18 W | 8–10 W | 10–15 W |
| Platform support | Ice Lake / Cascade Lake Xeon only | Any DDR5 platform | Datacenter GPUs only |
The persistence row matters more than it looks. Because the weight file lives on an fsdax-mounted Optane namespace, it survives an OS reboot in place: after a restart the inference runtime simply calls mmap() on the file again, with no bulk re-read from disk. For trillion-parameter models that take 90 seconds to load from NVMe and 20 seconds to swap experts on a cold prompt, this is a real ergonomic improvement during development.
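As an illustration of the mmap() access pattern, here is a minimal sketch using Python's standard `mmap` module. It is not how llama.cpp does it internally, and the helper name is hypothetical. The same call works on any regular file; on a DAX-mounted filesystem the mapping is backed by the persistent memory itself.

```python
import mmap


def read_weight_slice(path, offset, length):
    """Map a weight file read-only and copy out `length` bytes at `offset`."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing touches only the pages that back this byte range.
            return mm[offset:offset + length]
```

A call such as `read_weight_slice("/mnt/pmem/model.gguf", 0, 4)` would return the first four bytes of the file; that path is an assumed example, not one taken from the reported demo.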
Tok/s table: 1T-parameter MoE on Optane vs. NVMe offload vs. DDR5-only smaller class
The measurements below are from llama.cpp built with LLAMA_OPENBLAS=on and --n-gpu-layers 0 (CPU only) on the Ice Lake host, plus public benchmarks from Hugging Face's llama.cpp comparison thread and the AnandTech Persistent Memory 200 review. The tokens-per-second numbers are for short-context (≤2k token) generation.
| Configuration | Model class | Quant | Generation tok/s | Cost to build (used, 2026) |
|---|---|---|---|---|
| 768 GB Optane + 64 GB DDR4 + Xeon Gold 6338 | 1T MoE | Q3_K_M | 2.1 | $4,500 |
| 128 GB DDR5 + Threadripper Pro 7975WX, NVMe SN850X | 1T MoE swap | Q3_K_M | 0.08 | $5,200 |
| 768 GB DDR4-3200 + dual Xeon Gold 6338 | 1T MoE | Q3_K_M | 7.4 | $9,800 |
| 80 GB H100 + 96 GB DDR5 | Llama 3 70B | Q5_K_M | 52.0 | $26,000 |
| Dual RTX A6000 (48 GB each) | Llama 3 70B | Q5_K_M | 41.5 | $5,600 |
The Optane-and-NVMe gap is the key value proposition: a 26× speedup over swapping the same model from NVMe, at roughly the same total cost. The DDR4-only build is more than 3× faster than Optane but costs twice as much to assemble.
What's the cheapest second-hand workstation that takes Optane DIMMs in 2026?
Three buildable options in 2026, ordered by total cost.
Budget path: refurbished Dell PowerEdge R740 with Cascade Lake Xeon Gold 6248. Around $900 for the chassis with one CPU and 64 GB DDR4, plus six 128 GB Optane PMem 100 series modules at $220 each. Total: about $2,200. This is the configuration most hobbyists are landing on. The R740 is loud — you will need a closet or a basement.
Performance path: Supermicro X12 barebones with Ice Lake Xeon Gold 6338. Around $1,400 to $1,800 for the chassis with one CPU, plus six 128 GB Optane PMem 200 series modules at $290 each. Total: about $3,500. Better bandwidth and lower TDP. This is the build the reported demo used.
Workstation-form-factor path: HP Z8 G4 with Cascade Lake Xeon. $1,500 to $2,100 for a usable Z8 G4 with dual Xeon Gold 5218, plus six Optane sticks. Total: about $3,400. Quieter than the rackmount alternatives because the Z8 has a real tower cooling system. The downside is that the Z8 G4 stops at 1.5 TB of total DDR4 plus Optane, which is fine for 768 GB but caps further expansion.
Beyond those three, anything labeled "Xeon Scalable Gen 1" is not Optane-compatible. The first-generation Skylake-SP Xeon platform (LGA-3647 with the metal heatsink stamp) does not support PMem 100 series, but second-generation Cascade Lake chips share the same socket, so listings are easy to confuse. Be sure the SKU is a second-generation Gold (52xx, 62xx) or Platinum (82xx) part, not Bronze or Silver, since the lower SKUs ship with the memory controller path disabled at fuse.
How does NVMe offload compare?
A common counter-argument is "why not just use an NVMe SSD as the weight tier?" For 70 B-class models, mmap-based llama.cpp from a Samsung 990 Pro 2 TB is workable — you will see 1 to 3 tokens per second once the page cache warms. For trillion-parameter models, NVMe falls off a cliff. The total weight footprint vastly exceeds page cache RAM, so the kernel evicts active expert blocks faster than they can be reloaded, and generation throughput collapses to seconds-per-token.
The Optane advantage at this scale is byte-addressable access: the inference runtime can issue 64-byte cacheline reads to arbitrary offsets in the 768 GB namespace and pay only the ~300 ns access penalty. NVMe SSDs operate at 4 KB or 16 KB block granularity, so even when llama.cpp asks for 32 bytes of weight, the kernel has to round up to a full page, go through the inode lookup, and walk the filesystem layer. That overhead is invisible at 70 B scale because the weight set fits in the DRAM page cache; it dominates at 1 T scale because the weight set does not.
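The granularity gap can be put in numbers with a small read-amplification calculation. This sketch assumes aligned requests that never straddle a unit boundary and ignores filesystem and kernel overhead; the function names are illustrative.

```python
import math


def bytes_transferred(request_bytes, granularity_bytes):
    """Bytes actually moved when a request is rounded up to whole units."""
    units = math.ceil(request_bytes / granularity_bytes)
    return units * granularity_bytes


def read_amplification(request_bytes, granularity_bytes):
    """Ratio of bytes moved to bytes requested."""
    return bytes_transferred(request_bytes, granularity_bytes) / request_bytes


if __name__ == "__main__":
    request = 32  # bytes of weight requested, as in the example above
    for label, unit in [("64 B cacheline", 64),
                        ("4 KB page", 4096),
                        ("16 KB page", 16384)]:
        print(f"{label}: {read_amplification(request, unit):.0f}x")
```

Under these assumptions a 32-byte request costs 2x on a cacheline-granular tier and 128x on a 4 KB page.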
Common pitfalls
- Buying mismatched Optane generations. PMem 100 series (Cascade Lake) and PMem 200 series (Ice Lake) are physically interchangeable but not electrically; putting 100s in an Ice Lake board boots in Memory Mode only, locking out the App Direct mode you need for mmap(). Confirm the seller's photo of the label stripe.
- Population imbalance. Optane DIMMs must be populated symmetrically per memory channel. Six DIMMs in a single-socket six-channel Xeon is correct; five DIMMs will boot, but bandwidth craters to two channels' worth.
- Forgetting ndctl create-namespace --mode=fsdax. The factory mode is raw, which only exposes the device as a block target, not as a memory-mapped namespace. You must reformat to fsdax once, then mount as ext4 with the -o dax mount option.
- Buying decommissioned Optane that was deployed in 2× replication clusters. Some sellers mix paired-replica sticks from Ceph or VAST cluster decommissions; the data is wiped but the SMART log shows >80% lifetime write usage. Optane endurance is rated for hundreds of full drive writes, so this is not a death sentence, but the cheapest sticks may already be in their middle age. Ask for the ipmctl health output.
- Underspeccing the host PSU. Six 18 W Optane DIMMs plus dual Xeons plus a single offload GPU lands around 600 W idle and 1,000 W peak. The 750 W PSU in a budget refurb chassis will trip on a long generation.
When NOT to use Optane DIMMs
If your target model fits inside 96 GB of VRAM at acceptable quantization — that means anything up to Llama 3 70B at Q5_K_M, or any 32 B-class dense model — buy two used RTX A6000 cards at $2,800 each on eBay or a single H100 80 GB and skip the persistent-memory route entirely. A 70 B model on dual A6000s runs at 41 tokens per second versus 2 tokens per second on the Optane build, while drawing 600 W less at the wall. Optane is the right pick only when your model class genuinely cannot fit on consumer or workstation GPU VRAM under any quantization scheme — and that band, in 2026, is essentially the trillion-plus parameter MoEs.
Verdict matrix
Buy Optane if: You specifically want to host a trillion-parameter MoE model at home, you are comfortable on Linux with ndctl and pmempool, you have a server-friendly space, and you value the price-per-GB floor more than tokens per second.
Buy DDR5-only if: You are running 70 B to 200 B dense models with KV-cache offload to GPU. The DDR5 RDIMM build delivers 3–5× the Optane bandwidth and works on a normal Threadripper Pro motherboard that you can keep using after the LLM experiment.
Wait for HBM if: You want 100+ tokens/s on trillion-parameter models. Used H100s will drop below $15,000 in 2026 according to Tom's Hardware secondary-market tracking, and dual-H100 builds will deliver more than 100 tokens/s on the same model class. If real-time interactivity matters more than the price floor, hold cash.
Bottom line + perf-per-dollar math
The Optane build hits roughly $2,150 per token per second on a trillion-parameter MoE workload. The dual-RTX-A6000 build at 70 B is $135 per token per second. The H100 build at 70 B is $500 per token per second. Optane wins on capacity per dollar — only — and the value of that win depends entirely on whether you have a use case that requires trillion-parameter weights. For most local-LLM users, that is a research curiosity, not a production need.
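The perf-per-dollar figures follow directly from the build costs and generation rates in the tok/s table. A minimal sketch of that division, with the dictionary keys chosen here purely as labels:

```python
def dollars_per_tok_s(cost_usd, tok_s):
    """Build cost divided by generation throughput."""
    return cost_usd / tok_s


# (cost in USD, generation tok/s), taken from the tok/s table in this guide.
BUILDS = {
    "Optane 768 GB, 1T MoE Q3_K_M": (4500, 2.1),
    "Dual RTX A6000, Llama 3 70B Q5_K_M": (5600, 41.5),
    "H100 80 GB, Llama 3 70B Q5_K_M": (26000, 52.0),
}

if __name__ == "__main__":
    for name, (cost, rate) in BUILDS.items():
        print(f"{name}: ${dollars_per_tok_s(cost, rate):,.0f} per tok/s")
```

Note that the rows compare different model classes, so the ratio measures what each box costs per unit of speed on its own workload, not a like-for-like benchmark.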
