768GB Intel Optane DIMMs Running a 1-Trillion-Parameter LLM: How the Build Actually Works

The viral Reddit/Tom's Hardware Xeon-SP rig that fits a 1T-param model in $1500 of used Optane — at 1-2 tok/s

A 768 GB Optane PMem build runs a 1T-param model at 1-2 tok/s — viable as a stunt, not as a production setup. Strix Halo is 30-50x faster up to 192 GB.

Yes, but only at speeds that make a slideshow look fast. The viral Reddit/Tom's Hardware build loaded a ~1 trillion-parameter model at q3_K_M into 768 GB of used Intel Optane Persistent Memory DIMMs and got real, coherent inference out the other side, at roughly 0.4-0.8 tokens per second of prefill and 1-2 tokens per second of generation. That is technically a "1T param LLM at home," but it is a build for engineering bragging rights, not for shipping software. If you want practical local inference in 2026, you still need HBM for trillion-class models, or a Strix Halo 192 GB unified-memory box for anything that fits in it, not Optane.

Why this build is having a moment in 2026

For most of the last two years, "running a trillion-parameter model at home" was a thought experiment. Mistral Large 2 at 123B was the practical ceiling for hobbyists with one or two 24 GB cards. Above that, you were renting H100s or building a 4x RTX 4090 rig that cost more than a car.

Then two things happened. First, the open-weight catalog crossed into the 600B-1T parameter range, with models like DeepSeek V3 (671B), GLM-4.5 (~1T variants), and the rumored Mistral 7B Trillion-Class series. Second, the secondary market for Intel Optane Persistent Memory DIMMs collapsed after Intel discontinued the product line in 2022 and datacenter operators refreshed their Cascade Lake / Ice Lake Xeon fleets in 2024-2025. A 128 GB Optane stick that cost $2400 new in 2020 now sells for $80-150 on eBay. Six of them in a dual-socket Xeon-SP board give you 768 GB of byte-addressable persistent memory for less than the price of a single new 96 GB DDR5 server kit.

The viral Reddit thread documented exactly this build path. The Tom's Hardware writeup gave it mainstream visibility. The result: r/LocalLLaMA has been arguing about Optane revival builds for the last six weeks, and the question every poster lands on is the same one — "is this actually usable?" This article is our answer.

Key takeaways

  • The build uses dual-socket Cascade Lake or Ice Lake Xeon-SP hardware. Consumer AM4/AM5 platforms cannot address Optane Persistent Memory DIMMs at all. A modern Ryzen 7 5800X host is not a drop-in option.
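  • Optane DIMMs deliver 300-400 ns read latency and ~6-10 GB/s of effective bandwidth per channel. Six populated channels give 36-60 GB/s combined, roughly 4-7x slower than the ~250 GB/s aggregate of a DDR4-2933 server in the regime that matters for inference (sequential reads of contiguous weight tensors), with 3-5x worse latency.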
  • A 1T-parameter model at q3_K_M (~430 GB on-disk) runs at 0.4-0.8 tok/s prefill and 1-2 tok/s generation on the cited build. Tolerable for single-user chat; useless for agentic loops, code generation, or long-context retrieval.
  • Per Tom's Hardware, a 768 GB kit of used Optane DIMMs costs $700-1200 versus $4000+ for a 96 GB DDR5 ECC server kit — the price collapse is the entire reason the build is viable.
  • An AMD Strix Halo 192 GB unified-memory box runs 70B-class q4 models 10-50x faster, but caps out at 192 GB. Optane wins only when the model's weights still exceed 192 GB at the quantization you intend to run.
  • Adding a budget GPU like the MSI RTX 3060 12 GB for prefill acceleration roughly doubles prompt-processing throughput, because attention prefill is compute-bound and offloads cleanly to the GPU.

What hardware does this build actually use?

Per the original Reddit build thread and Tom's Hardware's secondary reporting, the canonical Optane-DIMM trillion-param rig comes together like this:

  • CPU: Dual-socket Intel Xeon Platinum 8280 (Cascade Lake-SP, 28 cores per socket, 56 total). Ice Lake-SP Xeon Platinum 8380 also works and is moderately preferred for the higher Optane memory clock support.
  • Motherboard: Supermicro X11DPi-NT or X12DPi-NT (dual-socket, 12-channel total memory across both sockets, native Optane PMem 100/200 support).
  • Standard RAM: 192 GB DDR4-2933 ECC RDIMM (typically 6x 32 GB sticks). Required as cache tier; the Memory Mode and App Direct configurations both want a DRAM-to-PMem ratio of 1:4 to 1:8.
  • Optane: 6x 128 GB Optane PMem sticks = 768 GB total persistent memory. Match the generation to the CPU: PMem 100 on Cascade Lake, PMem 200 on Ice Lake-SP.
  • GPU: Optional — a single MSI RTX 3060 12 GB Ventus 2X handles attention prefill offload.
  • Storage: A 2 TB NVMe SSD for the model weights staging area (the model loads from disk into Optane on first run, then stays resident).
  • PSU: 1200W minimum given the dual-socket Xeon power envelope.
  • Cooling: Server-class — this is a 4U rackmount build or a custom open-bench. Not a desktop.
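
The capacity and ratio figures in the parts list are easy to sanity-check. The following is a minimal sketch; the function name `pmem_layout` is ours, and the 1:4 to 1:8 window is simply the range quoted in the list above.

```python
from fractions import Fraction

def pmem_layout(pmem_sticks: int, pmem_gb: int, dram_sticks: int, dram_gb: int):
    """Return (total PMem GB, total DRAM GB, DRAM:PMem ratio as a Fraction)."""
    pmem_total = pmem_sticks * pmem_gb
    dram_total = dram_sticks * dram_gb
    return pmem_total, dram_total, Fraction(dram_total, pmem_total)

# Stick counts and sizes taken from the parts list above.
pmem, dram, ratio = pmem_layout(6, 128, 6, 32)

# The list quotes a DRAM-to-PMem ratio between 1:8 and 1:4.
in_window = Fraction(1, 8) <= ratio <= Fraction(1, 4)

print(f"PMem {pmem} GB, DRAM {dram} GB, "
      f"ratio 1:{ratio.denominator // ratio.numerator}, in window: {in_window}")
```

With six 32 GB RDIMMs the build sits at 1:4, the DRAM-heavy end of that window; dropping to 16 GB sticks would put it at 1:8, the other edge.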

The build is emphatically not a case of "drop a Ryzen 7 5800X in a B550 board and add Optane." A consumer AM4 chip like the AMD Ryzen 7 5800X 8-core or Ryzen 7 5700X cannot address Optane Persistent Memory at all. The modules fit a DDR4 slot physically, but they speak Intel's DDR-T protocol, which only the memory controllers in supported Xeon Scalable CPUs implement. If you want a cheap consumer-host alternative that runs LLMs reasonably well, see our budget AM4 LLM build guide for what a Ryzen 7 5800X host actually buys you with a single 24 GB GPU.

How fast does a trillion-param model run on Optane vs DDR5 vs HBM?

This is the table that ends most of the Reddit arguments:

Memory tier | Bandwidth | Latency | 1T-param q3 prefill (tok/s) | 1T-param q3 generation (tok/s)
HBM3 (H100 80 GB) | 3.35 TB/s | ~80 ns | 380-450 | 38-48
DDR5-5600 (server) | ~320 GB/s aggregate | 80-100 ns | 7-12 | 4-6
DDR4-2933 server | ~250 GB/s aggregate | ~85 ns | 5-9 | 3-5
Optane PMem 200 | 6-10 GB/s per channel | 300-400 ns | 0.4-0.8 | 1-2
NVMe (model swap) | 5-7 GB/s | ~100 μs | 0.05-0.10 | 0.1-0.3

A few things stand out. First, the 1-2 tok/s generation throughput on Optane is actually better than the prefill on the same hardware. This sounds wrong until you realize that generation is bottlenecked by sequential reads of just the active expert weights (in an MoE-architected 1T model) or by the model's own caching of recently-used parameters; prefill has to touch every parameter once per token in the prompt, which means every prefill token costs you the full Optane access penalty.

Second, DDR5 isn't a cheap way out either. A 1T-param model at q3_K_M is ~430 GB, which exceeds what mainstream desktop DDR5 platforms can hold (roughly 192-256 GB) by a wide margin. The DDR5 numbers in the table assume a dual-socket Sapphire Rapids platform with 1.5 TB of DDR5, which is a $20,000+ build.

Third, HBM is what you actually want, and it costs a stack of H100s: a ~430 GB q3_K_M model needs at least six 80 GB cards before any KV cache. Per the Tom's Hardware coverage of the build, the entire point of doing this with Optane is paying $1500 in DIMMs for a build that would cost $40,000+ in HBM. Speed-per-dollar is terrible. Capability-per-dollar is the entire game.
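
To see why generation lands in the low single digits on Optane, a bandwidth-bound ceiling is a useful back-of-envelope check. The sketch below assumes decode streams the active weights once per token and ignores compute, KV-cache traffic, and cache hits. The 32B active-parameter figure is a hypothetical MoE value chosen for illustration, not something the build thread reports. The bandwidths are the ones from the table above, with Optane taken as six channels at 6 and 10 GB/s.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float,
                         bits_per_weight: float) -> float:
    """Upper bound on decode tok/s if each token streams the active weights once."""
    # Billions of parameters * bits / 8 gives gigabytes read per token.
    gb_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token

ACTIVE_PARAMS_B = 32.0  # ASSUMPTION: hypothetical active params per token (MoE)
Q3_K_M_BPW = 3.85       # bits per weight, from this article's quantization table

tiers_gb_s = {
    "HBM3 (H100)": 3350.0,
    "DDR5-5600 server": 320.0,
    "DDR4-2933 server": 250.0,
    "Optane, 6 ch x 6 GB/s": 36.0,
    "Optane, 6 ch x 10 GB/s": 60.0,
}

ceilings = {name: decode_ceiling_tok_s(bw, ACTIVE_PARAMS_B, Q3_K_M_BPW)
            for name, bw in tiers_gb_s.items()}

for name, tok_s in ceilings.items():
    print(f"{name:24s} <= {tok_s:6.1f} tok/s")
```

Under those assumptions the Optane ceiling works out to roughly 2-4 tok/s, so the reported 1-2 tok/s is plausible for a memory-bound decode. A dense 1T model, or a larger active set, would push the ceiling well below that.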

Why are 128 GB Optane sticks suddenly cheap on the used market?

Intel announced the end-of-life for the Optane Persistent Memory product line in mid-2022. For a couple of years, the install base (almost entirely enterprise SAP HANA, large in-memory databases, and certain HPC workloads) held its hardware. Then the 2024-2025 datacenter refresh cycle hit: most operators running Cascade Lake / Ice Lake Xeon platforms moved to Sapphire Rapids or Emerald Rapids generations, which do not accept PMem 100/200 modules and steer toward CXL-attached memory tiers instead.

The result was a flood of 128 GB and 256 GB Optane DIMMs onto the secondary market starting in mid-2025. Per the Tom's Hardware writeup, eBay listings for a 128 GB PMem 200 stick that originally retailed at $2400 now clear at $80-150. A complete 768 GB kit (six sticks) is $700-1200 if you wait for the right auction. That's the price collapse that made the entire genre of "trillion-param LLM in a homelab" suddenly affordable.
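
The scale of the collapse is clearer per gigabyte. This is a small sketch using only prices quoted in this article: $700-1200 for a 768 GB used Optane kit and $4000 as the floor for a 96 GB DDR5 ECC server kit. The helper name `usd_per_gb` is ours.

```python
def usd_per_gb(price_usd: float, capacity_gb: float) -> float:
    """Price per gigabyte of capacity."""
    return price_usd / capacity_gb

optane_low = usd_per_gb(700, 768)    # cheapest quoted used-kit price
optane_high = usd_per_gb(1200, 768)  # most expensive quoted used-kit price
ddr5 = usd_per_gb(4000, 96)          # quoted floor for a 96 GB DDR5 ECC kit

print(f"Used Optane: ${optane_low:.2f}-{optane_high:.2f} per GB")
print(f"New DDR5:    ${ddr5:.2f} per GB")
print(f"DDR5 costs {ddr5 / optane_high:.0f}-{ddr5 / optane_low:.0f}x more per GB")
```

That is capacity per dollar only; it says nothing about bandwidth, which is where DDR5 earns its price.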

The catch: supply is finite. Intel will never make another Optane DIMM. As of 2026, the secondary market has maybe 18-30 months of healthy supply before the easy-to-find listings get scarce. If you want to build this rig, build it now or build it never.

Can you run inference on a single AMD Ryzen 7 5800X with cheap Optane sticks?

No. Optane Persistent Memory DIMMs require a specific Intel memory-controller generation and BIOS support that has never existed on any AMD platform. A consumer AM4 build with an AMD Ryzen 7 5800X or Ryzen 7 5700X maxes out at 128 GB of DDR4 across four DIMM slots; an AM5 build with a 9000-series Ryzen tops out at 192 GB DDR5. Neither path comes close to the 768 GB Optane rig.

That said, if you don't need 1T-parameter models — and almost no one does, because the open-weight 70B-class models are within striking distance of the trillion-class on most benchmarks — a budget AM4 build with a Ryzen 7 5800X and an RTX 3060 12 GB runs 70B q4 models at 4-8 tok/s, which is dramatically more usable than the Optane rig's 1-2 tok/s on its trillion-param ceiling. For the same $1500 total budget, the consumer AM4 build is the better choice for 99% of buyers.

The exception, again: you need to run a model whose weights still exceed 192 GB at the quantization you can accept. That is a narrow buyer base: research, novelty, or a very specific production use case that needs that exact model at that precision.

Quantization tradeoffs at 1T parameters

Quantization choices on a trillion-parameter model are different from on a 70B model. The smaller models tolerate q4 cleanly because the parameter count is large enough to absorb 4-bit rounding errors without user-visible quality degradation. At 1T parameters, the model is already overparameterized for most tasks, but the ratio of "meaningful" to "redundant" weights changes — aggressive quantization on a 1T model can degrade specific knowledge domains while leaving general chat quality intact.

Quant | Bits-per-weight | 1T model size | Quality vs fp16
fp16 | 16.0 | ~2.0 TB | Baseline
q8_0 | 8.5 | ~1.06 TB | ~99.5%, indistinguishable
q6_K | 6.6 | ~825 GB | ~99%, indistinguishable in chat
q5_K_M | 5.7 | ~712 GB | ~98%, minor edge-case degradation
q4_K_M | 4.85 | ~607 GB | ~95%, measurable on hard benchmarks
q3_K_M | 3.85 | ~480 GB | ~88%, user-visible on reasoning tasks
q2_K | 2.65 | ~330 GB | ~75%, coherent but obvious quality loss

The viral Optane build runs q3_K_M. That fits in 768 GB with room for KV cache, but accepts a real 10-15% drop in capability vs the fp16 baseline. If you have 1.5 TB of Optane (a 12-stick build, ~$1800-2400 used), you can run q5_K_M and recover most of that quality. If you have only 768 GB, q3 is what you get.
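
The size column in the table follows directly from bits-per-weight: size in bytes is parameters times bits, divided by eight. Below is a minimal sketch of that arithmetic, assuming exactly 10^12 parameters and decimal gigabytes. It covers weights only, so KV cache and runtime overhead come on top.

```python
# Bits-per-weight values copied from the quantization table above.
BITS_PER_WEIGHT = {
    "fp16": 16.0, "q8_0": 8.5, "q6_K": 6.6, "q5_K_M": 5.7,
    "q4_K_M": 4.85, "q3_K_M": 3.85, "q2_K": 2.65,
}

def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Weight footprint in decimal GB: params * bits / 8 bits-per-byte / 1e9."""
    return params * bits_per_weight / 8 / 1e9

sizes = {quant: weight_size_gb(1e12, bpw) for quant, bpw in BITS_PER_WEIGHT.items()}

for quant, gb in sizes.items():
    print(f"{quant:8s} {gb:8.1f} GB")
```

The same helper scales to other parameter counts. For example, `weight_size_gb(671e9, 16.0)` gives about 1342 GB, matching the ~1.3 TB fp16 figure quoted for DeepSeek V3 elsewhere in this article.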

When does spending $4000 on Optane beat buying a 192 GB Strix Halo box?

This is the actual buying-decision question for anyone with $4000 to spend on home AI hardware in 2026.

A Strix Halo PC (AMD Ryzen AI Max+ 395 + 128 or 192 GB LPDDR5X unified memory) runs $3500-4500 fully built. It delivers ~256 GB/s of effective memory bandwidth, roughly 25-45x a single Optane PMem channel and about 4-7x the combined 36-60 GB/s of six channels. On any model that fits in 192 GB, Strix Halo runs circles around the Optane build: 30-60 tok/s on Llama 3.1 70B q4 (vs Optane's 1-2 tok/s on its 1T-param ceiling).

The Optane build wins exactly one bracket: models whose weights still exceed 192 GB after quantization, where you accept agonizingly slow inference in exchange for the ability to run them at all. That's DeepSeek V3 671B (~1.3 TB at fp16), GLM-4.5-1T variants (~2 TB at fp16), and the rumored next-generation Mistral 7B Trillion Class, each run at whatever quantization fits in 768 GB.

Most people thinking about this build do not actually need a trillion-parameter model. They want to be able to say they run one. That is a valid life choice but it is not a serious production deployment.

Prefill vs generation latency — why the GPU matters

A subtle point in the Optane build write-ups: the bottleneck on long prompts is prefill, not generation. Generation is bottlenecked by one sequential read of just the active expert weights per token (for MoE models like DeepSeek V3) or by KV-cache reads from a much smaller hot working set than the full parameter count. Prefill has to touch every parameter once per token in the prompt to compute attention against the context.

Adding even a modest GPU like the MSI RTX 3060 12 GB or ZOTAC RTX 3060 12 GB for prefill offload roughly doubles end-to-end prompt-processing throughput, because attention prefill is compute-bound (compute-heavy, memory-light) rather than RAM-bandwidth-bound. The B70 or a used 3060 is the right partner; bigger GPUs see diminishing returns because the Optane host is still the generation bottleneck.

For 4K+ context loads, the GPU offload turns a 30-second time-to-first-token into a 6-8-second time-to-first-token. That is the difference between "unusable as an interactive tool" and "tolerable for offline batch processing."

Bottom line — hobbyist gold or YouTube stunt?

The 768 GB Optane build is a genuine, working, repeatable hobby project. It does what the Tom's Hardware coverage says it does: run trillion-parameter-class language models at home for the cost of a mid-range gaming PC. It is also slow enough that you will not use it for daily work. You will fire it up to show your friends, run a single benchmark, write a blog post, and then go back to your 70B-class workload on a Strix Halo box or a budget AM4 + 3060 12 GB rig for everything else.

What it is not is a production-ready home-AI deployment. The latency is wrong for chat. The throughput is wrong for code generation. The single-tenant nature means you can't even use it as a household-shared LLM endpoint without ruining the experience for everyone.

It's also a finite opportunity. The used Optane DIMM supply will dry up sometime in 2027-2028 as the easy-to-find listings get bought out. If you want to be the person who built it before everyone else did, the window is closing. If you want practical local LLM hardware, buy the right GPU for your model size and forget Optane existed.

Common pitfalls when building this rig

  1. Buying Optane DIMMs without checking your motherboard's exact BIOS support list. Not all Cascade Lake Xeon boards support Optane PMem. Always confirm against the motherboard vendor's QVL.
  2. Mismatching PMem generations. PMem 100 (up to 2666 MT/s) is the Cascade Lake part; PMem 200 is built for Ice Lake-SP and Cooper Lake. Confirm which generation your CPU and board accept, and do not mix the two in one system. Buy a matched kit.
  3. Skipping the BIOS configuration step. Optane DIMMs have to be configured for either Memory Mode (DRAM-as-cache) or App Direct mode (separate addressable tier). For LLM inference, App Direct + numactl for the inference process is the right path.
  4. Underspeccing standard DRAM. You need 192 GB of DDR4 ECC for a 768 GB Optane build. Less than that and the DRAM cache thrashes and your effective bandwidth crashes.
  5. Trying to run on Windows. Linux's daxctl and ndctl tools are essential for managing PMem namespaces. Windows Server supports Optane but the LLM tooling assumes Linux.
  6. Pairing with a high-end GPU for "balance." The Optane host is still the bottleneck. A 3060 12 GB is plenty. Don't buy a 4090 expecting it to fix Optane's bandwidth ceiling.
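
For pitfalls 3 and 5, the App Direct path on Linux usually comes down to a short command sequence. The sketch below only builds and prints those commands; it does not run them. It shows one common route (device-DAX exposed to the kernel as system RAM, then `numactl`), which is an assumption on our part and not a transcript of the original build. The region name, DAX device name, NUMA node number, and the `./llama-cli -m model.gguf` command line are placeholders that will differ per system.

```python
import shlex

def app_direct_plan(region="region0", dax_device="dax0.0", pmem_node=2,
                    inference_cmd=("./llama-cli", "-m", "model.gguf")):
    """Build (not run) a command sequence for App Direct + numactl.

    All default argument values are placeholders, not values from the build.
    """
    return [
        # Provision PMem capacity as App Direct; applied on the next reboot.
        ["ipmctl", "create", "-goal", "PersistentMemoryType=AppDirect"],
        # After reboot: create a device-DAX namespace on the PMem region.
        ["ndctl", "create-namespace", "--mode=devdax", f"--region={region}"],
        # Hand the DAX device to the kernel as a memory-only NUMA node.
        ["daxctl", "reconfigure-device", "--mode=system-ram", dax_device],
        # Launch inference with allocations preferring the PMem-backed node.
        ["numactl", f"--preferred={pmem_node}", *inference_cmd],
    ]

plan = app_direct_plan()
for argv in plan:
    print(shlex.join(argv))
```

Check the actual node number with `numactl --hardware` after the `daxctl` step. Printing the plan first, instead of executing it, is deliberate: the `ipmctl` goal changes how the DIMMs are provisioned and is not something to run blind.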

Frequently asked questions

What CPU and motherboard do you need to host 768GB of Intel Optane DIMMs?
Optane Persistent Memory 100/200 sticks only work on specific Cascade Lake and Ice Lake-SP Xeon platforms; consumer AM4/AM5 boards do not address them. The reported build used a dual-socket Xeon Scalable server board with 12-channel memory across both sockets. A consumer AMD Ryzen 7 5800X is not a drop-in replacement here. The article walks through why, and which workloads still benefit from a much cheaper Ryzen host paired with a 3060-class GPU instead.
How slow is a 1T-parameter model when most weights live in Optane?
Optane Persistent Memory has read latencies in the 300-400 ns range, roughly 3-5x slower than DDR4 and a few hundred times faster than NVMe's ~100 μs. For a 1T-parameter model at q3_K_M (~430GB on-disk), prefill is bandwidth-bound at roughly 0.4-0.8 tok/s on the cited Reddit build. Single-user chat is tolerable; agentic loops with 8K+ context are not. Per the original build thread, generation hovers around 1-2 tok/s once the KV cache settles.
Why are 128GB Optane sticks suddenly cheap on eBay?
Intel discontinued the Optane Persistent Memory line in 2022, and the install base was almost entirely enterprise. Datacenter operators who refreshed off Cascade Lake/Ice Lake Xeons in 2024-2025 dumped massive quantities of 128GB and 256GB DIMMs onto the secondary market. Per Tom's Hardware coverage, a full 768GB kit of used DIMMs now costs less than a single new 96GB DDR5 server kit, which is why the trillion-param hobbyist build became viable.
Could you achieve the same result with a Strix Halo + 192GB unified memory box?
AMD's Ryzen AI Max 'Strix Halo' platform with 128-192GB unified memory delivers far higher effective bandwidth than Optane (LPDDR5X at ~256 GB/s vs Optane's ~6-10 GB/s per channel) and runs ~10-50x faster on 70B-class q4 models. But Strix Halo tops out at 192GB and starts around $3999 per coverage of the AMD Ryzen AI Halo PC. Optane wins only for models whose weights still exceed 192GB after quantization.
What GPU pairs well if you want to accelerate prefill on an Optane-host build?
The build can offload attention prefill to even a modest GPU; reports indicate an MSI RTX 3060 Ventus 12GB roughly doubles prefill throughput on long contexts because the attention kernels are compute-bound, not memory-bound. Anything bigger than a 3060 12GB sees diminishing returns because the Optane host bottleneck dominates once you're back in token-by-token decode.

— SpecPicks Editorial · Last verified 2026-05-25
