Running a 1-Trillion-Parameter LLM on 768GB of Cheap Optane

What a 768 GB clearance-bin Optane build can — and can't — do for local inference, and where a $300 GPU still wins.

Cheap discontinued Optane DIMMs let you fit a 1T-parameter model in RAM — but throughput, latency, and the math vs. an RTX 3060 12GB tell a more complicated story.

Can you run a 1-trillion-parameter LLM on 768 GB of clearance-bin Intel Optane DIMMs? Yes — you can technically load the weights and produce tokens. But generation throughput is a small fraction of what even a $300 RTX 3060 12GB delivers on a model that fits in its VRAM, prefill latency is brutal, and the build only makes sense as an experiment, not a serving platform.

That qualifier is doing a lot of work. The viral Tom's Hardware build that put 768 GB of discontinued Optane Persistent Memory into a single Xeon server and ran a 1T-parameter model on it is interesting for one reason: Optane DIMMs are being dumped on the clearance market at prices that work out to a tiny fraction of what equivalent registered DDR5 costs. Hobbyists doing the arithmetic noticed that capacity, not bandwidth, is the bottleneck for fitting frontier-class LLM weights into a non-GPU machine, and Optane has more of it per dollar than any other option in 2026.

This piece is an editorial synthesis of the public build reports, Intel's official Optane Persistent Memory documentation, and the llama.cpp issue tracker, where the CPU-offload code paths that make these builds work first landed. No independent first-party benchmarking is reported.

Key takeaways

  • Optane is a capacity play. 768 GB at clearance pricing can fit a quantized 1T-parameter model that a single consumer GPU cannot.
  • Generation throughput is in the single digits of tokens per second on the largest models, per public build reports. That is several times slower than a GPU running a model that fits in its VRAM.
  • Prefill (the prompt pass that sets first-token latency) is dramatically slower than on a GPU: a CPU has far less compute for the batched prompt pass, and Optane bandwidth is well under DDR5 and far under GDDR6/HBM.
  • For 7B-13B models, a 12GB RTX 3060 at 4-bit beats this build on every axis that matters: latency, throughput, watts, and complexity.
  • Build complexity is real: you need a compatible Xeon Scalable platform with App Direct mode enabled in the BIOS. That is not a Saturday afternoon project.

What did the experiment actually demonstrate?

The headline-grabbing build was straightforward in concept: take a dual-socket Intel Xeon Scalable server (the kind that turns up on the used market for the price of a mid-range gaming PC), fill it with 768 GB of Optane Persistent Memory modules next to the DDR4 DIMMs the platform still requires, configure App Direct mode in the BIOS so the OS can expose the Optane as a separate memory tier alongside DDR4, and load a 1-trillion-parameter quantized model via llama.cpp running entirely on the CPU.

The build was not new science: llama.cpp's CPU inference path has supported memory-mapped weights for years. What was new was the capacity ceiling: 768 GB in a single chassis at a price that would barely buy 192 GB of registered DDR5. Once you can fit the weights, throughput becomes an engineering question rather than a feasibility question. The catch is that the engineering question has an unflattering answer.

How does Optane DIMM bandwidth compare to DDR5 and to GPU VRAM?

The bandwidth gap is the entire story. Per Intel's published specs and independent measurements that surfaced when the platform was still in active development, Optane DCPMM in App Direct mode delivers a fraction of DDR4 bandwidth on the same channels, and a much smaller fraction of GDDR6 or HBM3 bandwidth. The numbers below are order-of-magnitude — exact figures vary by generation, channel count, and access pattern, but the relative scale holds.

| Memory medium | Approx bandwidth (GB/s) | Latency (ns) | Capacity ceiling per channel | Approx $/GB (2026 clearance) |
|---|---|---|---|---|
| Optane DCPMM (App Direct) | 5–10 | 350–500 | 512 GB | $1–3 |
| DDR4-3200 ECC RDIMM | 25 per channel | 90–120 | 256 GB | $5–8 |
| DDR5-5600 ECC RDIMM | 44 per channel | 80–110 | 256 GB | $8–14 |
| GDDR6 (RTX 3060 12GB) | 360 (aggregate) | ~150 | 12 GB | n/a (soldered) |
| GDDR6X / GDDR7 (RTX 5090 32GB) | 1700+ | ~120 | 32 GB | n/a (soldered) |
| HBM3e (MI300X / H200) | 5000+ | ~80 | 192 GB | n/a (soldered) |

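On the table's figures, Optane's 5–10 GB/s is roughly 36 to 72 times slower than the RTX 3060's 360 GB/s, and two to three orders of magnitude slower than the RTX 5090 and HBM3e rows. That gap drives every downstream throughput number you will see in build reports.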

What token throughput can you realistically expect on CPU+Optane vs a GPU?

During generation, LLM token throughput is bounded by memory bandwidth divided by the number of bytes the model has to read per token. On a dense quantized model of size $W$ bytes, generation reads roughly $W$ bytes per token (autoregressive decode reads every parameter once); a mixture-of-experts model reads only the experts it activates, so its per-token traffic is smaller than its file size. With bandwidth $B$ in bytes per second, $B / W$ is an upper bound on tokens per second.
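The bandwidth ceiling can be sketched in a few lines of Python. This is a back-of-the-envelope estimate, not a benchmark. The function name is illustrative. The 0.6 bytes per parameter figure is an assumption (roughly what a ~42 GB 70B q4_K_M file works out to), and the 12-module Optane configuration is hypothetical.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on generation speed when decode is memory-bandwidth bound."""
    if bandwidth_gb_s <= 0 or gb_read_per_token <= 0:
        raise ValueError("bandwidth and bytes per token must be positive")
    return bandwidth_gb_s / gb_read_per_token


# RTX 3060 12GB: 360 GB/s aggregate, taken from the bandwidth table.
# Assumed model: 8B parameters at ~0.6 bytes/parameter = 4.8 GB.
gpu_ceiling = decode_ceiling_tok_s(360.0, 4.8)

# Hypothetical Optane setup: 12 modules at 5 GB/s each (low end of the table),
# reading a dense 1T-parameter model at ~0.6 bytes/parameter = 600 GB per token.
optane_ceiling = decode_ceiling_tok_s(12 * 5.0, 600.0)

print(f"GPU ceiling:    {gpu_ceiling:.1f} tok/s")     # 75.0 tok/s
print(f"Optane ceiling: {optane_ceiling:.2f} tok/s")  # 0.10 tok/s
```

Real throughput lands below these ceilings because compute, cache misses, and NUMA traffic all take their share. A mixture-of-experts model would read fewer bytes per token than its full file size.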

Public reports of CPU+Optane builds on very large models land at single-digit to low-double-digit tok/s for generation. Prefill (the initial pass over the prompt) adds a long wait on top of that: the whole prompt has to be processed before the first token appears, and a CPU has far less compute for that batched pass than a GPU. The numbers below are gathered from build threads and the llama.cpp benchmark database; treat them as ranges, not guarantees.

| Model size | Quant | RTX 3060 12GB tok/s (generation) | Xeon+Optane tok/s (generation) | Notes |
|---|---|---|---|---|
| Llama 3.1 8B | q4_K_M | 50–70 | 8–15 | Fits in 12GB VRAM, GPU wins decisively. |
| Qwen3-14B | q4_K_M | 30–45 | 5–10 | Fits in 12GB with tight context, GPU still wins. |
| Qwen3-72B | q4_K_M | n/a (OOM) | 3–6 | Doesn't fit in 12GB; Optane wins by default. |
| Mixtral 8x22B | q4_K_M | n/a (OOM) | 2–4 | Doesn't fit in 12GB; Optane wins by default. |
| Llama 405B | q4_K_M | n/a (OOM) | 1–2 | Optane is the only option at this size on consumer budgets. |
| Hypothetical 1T | q4_K_M | n/a (OOM) | <1 | Capacity demonstration, not a usable workload. |

A 1 tok/s generation rate is conversationally unusable for most workflows. It is, however, non-zero on a used Xeon build costing a few thousand dollars, which is remarkable on its own.

Quantization matrix: q2/q3/q4/q5/q6/q8/fp16

Quantization shrinks the weight footprint and so directly raises tokens per second on memory-bound platforms. For a 70B model the matrix looks roughly like this, again synthesized from llama.cpp benchmark threads and Tom's Hardware coverage of recent build reports:

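| Quant | 70B size (GB) | Fits 12GB GPU? | Fits 12GB + 64GB DDR? | Optane required? | Quality loss vs fp16 |
|---|---|---|---|---|---|
| q2_K | ~26 | No | Yes (partial) | No | High, often unusable |
| q3_K_M | ~32 | No | Yes (partial) | No | Noticeable |
| q4_K_M | ~42 | No | Yes (partial) | No | Small, recommended |
| q5_K_M | ~50 | No | Yes (tight) | Marginal | Very small |
| q6_K | ~57 | No | No | Yes | Near-lossless |
| q8_0 | ~75 | No | No | Yes | Effectively lossless |
| fp16 | ~140 | No | No | Yes | Reference |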

Once a quantized model fits in a single RTX 3060 12GB's VRAM, the conversation is over — the GPU wins. Once it does not, the question becomes how much RAM you can throw at it, and Optane provides the cheapest path to "a lot."
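To see how the 70B sizes in the matrix scale to other parameter counts, here is a rough sketch. It assumes file size grows linearly with parameter count, which is only approximately true. The helper names are illustrative. It also ignores KV cache, the OS, and runtime overhead, so borderline cases come out more optimistic than the matrix.

```python
# Bytes per parameter implied by the 70B sizes in the matrix (size in GB / 70).
BYTES_PER_PARAM = {
    "q2_K": 26 / 70,
    "q3_K_M": 32 / 70,
    "q4_K_M": 42 / 70,
    "q5_K_M": 50 / 70,
    "q6_K": 57 / 70,
    "q8_0": 75 / 70,
    "fp16": 140 / 70,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate weight footprint in GB (weights only, no KV cache)."""
    return params_billion * BYTES_PER_PARAM[quant]


def smallest_tier(size_gb: float) -> str:
    """Smallest memory tier discussed in the article that can hold the weights."""
    tiers = [("12GB GPU", 12), ("12GB GPU + 64GB DDR", 76), ("768GB Optane", 768)]
    for name, capacity_gb in tiers:
        if size_gb <= capacity_gb:
            return name
    return "does not fit"


for params in (8, 70, 405, 1000):
    size = weights_gb(params, "q4_K_M")
    print(f"{params}B q4_K_M: ~{size:.0f} GB -> {smallest_tier(size)}")
```

Under these assumptions a 1T-parameter model at q4_K_M needs about 600 GB, which is why 768 GB is enough to hold it while fp16 at the same size is not.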

Prefill vs generation: why Optane hurts first-token latency even more than generation speed

Generation is sequential: the model reads each weight tensor once per token. On a bandwidth-limited platform like Optane, that gives a strict bandwidth ÷ parameter-bytes ceiling. Prefill, by contrast, processes the prompt tokens in batches, so the weights are reused across many tokens and the limit shifts from memory bandwidth to raw compute, where a CPU trails a GPU by a wide margin. Every batch still needs a full pass over weights that sit on slow Optane. A 4K-token system prompt that takes 200 ms on a GPU can easily take 30+ seconds on an Optane build.

This asymmetry is why these rigs feel slow in chat use even when steady-state tok/s looks OK on paper: every long prompt incurs a noticeable wait before the first token streams out. For agent-style workloads where a long system prompt is reused across calls, KV-cache reuse can mitigate this; for one-shot chat it cannot.
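The effect of KV-cache reuse on first-token latency can be sketched with simple arithmetic. The prefill rate below is derived from the article's own example (a 4K-token prompt taking about 30 seconds, with 4K taken as 4096). The 200-token user message and the function name are hypothetical.

```python
def time_to_first_token_s(prompt_tokens: int, cached_prefix_tokens: int,
                          prefill_tok_s: float) -> float:
    """Seconds of prefill needed before the first generated token appears."""
    if prefill_tok_s <= 0:
        raise ValueError("prefill_tok_s must be positive")
    # Only tokens not already covered by a reused KV cache need a prefill pass.
    uncached = max(prompt_tokens - cached_prefix_tokens, 0)
    return uncached / prefill_tok_s


# Prefill rate implied by the example above: 4096 tokens in about 30 s.
OPTANE_PREFILL_TOK_S = 4096 / 30.0

system_prompt = 4096  # long prompt reused across calls
user_message = 200    # hypothetical new tokens per call

total = system_prompt + user_message
cold = time_to_first_token_s(total, 0, OPTANE_PREFILL_TOK_S)
warm = time_to_first_token_s(total, system_prompt, OPTANE_PREFILL_TOK_S)
print(f"cold: {cold:.1f} s, warm: {warm:.1f} s")  # cold: 31.5 s, warm: 1.5 s
```

The warm case only helps when the cached prefix is byte-for-byte identical across calls, which is why it suits agent loops with a fixed system prompt and does little for one-shot chat.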

Context-length impact: how KV-cache growth changes the RAM-vs-VRAM math

KV-cache size scales linearly with context length. At 128K context on a 70B model the KV cache alone can exceed 40 GB. On a 12GB GPU that is impossible — you offload the KV cache to system DDR and pay a large bandwidth penalty per token. On an Optane build the KV cache typically lives in DDR4 (which has acceptable bandwidth), with the weights on Optane (slow), so long contexts hurt less than you would expect — the weight reads dominate either way.
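The linear growth is easy to check with a small calculator. The model shape below (80 layers, 8 KV heads, head dimension 128, 16-bit cache values) is an assumed 70B-class configuration with grouped-query attention, not a figure from the build reports. Models without grouped-query attention need several times more.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of the key and value caches for a single sequence."""
    # Factor 2 covers one key vector and one value vector per layer per KV head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token


# Assumed 70B-class shape (illustrative values, not taken from the article).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

for context in (4096, 32768, 131072):
    gib = kv_cache_bytes(context, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{context:>7} tokens: {gib:.2f} GiB")
```

With these assumptions the cache costs 320 KiB per token, so a 128K context comes to 40 GiB, in line with the figure above.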

Where a 12GB GPU like the RTX 3060 still wins

For the model sizes most local-LLM users actually run — 7B, 8B, 13B, 14B quantized — a single 12GB RTX 3060 is comprehensively the better answer. Per the llama.cpp benchmarks, it delivers 30–70 tok/s on those models at q4_K_M with low first-token latency, and it costs less than the BIOS-compatible motherboard alone on a Xeon+Optane build. It also draws ~170 W under load instead of 400+ W for a fully populated dual-socket Xeon.

The perf-per-dollar math is brutal:

| Build | Approx cost | Best-case gen tok/s (8B q4) | $/tok/s |
|---|---|---|---|
| RTX 3060 12GB + B450 + Ryzen 5 5600 + 32GB DDR4 | $700 | 60 | $11.7 |
| Used Xeon + 768GB Optane + DDR4 | $2,500–3,500 | 12 | $208–292 |

That's roughly 18 to 25 times worse cost-efficiency for the Optane build on the workload most people care about, depending on where in the price range the server lands. Optane only catches up when the model genuinely doesn't fit on the GPU at any quant.
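The cost-efficiency figures follow directly from the table. A short sketch of the arithmetic, using the table's approximate costs and best-case throughput (the function name is illustrative):

```python
def dollars_per_tok_s(build_cost_usd: float, tok_s: float) -> float:
    """Hardware cost divided by generation throughput."""
    if tok_s <= 0:
        raise ValueError("tok_s must be positive")
    return build_cost_usd / tok_s


gpu = dollars_per_tok_s(700, 60)
optane_low = dollars_per_tok_s(2500, 12)   # low end of the build cost range
optane_high = dollars_per_tok_s(3500, 12)  # high end of the build cost range

print(f"GPU build:    ${gpu:.1f} per tok/s")
print(f"Optane build: ${optane_low:.0f} to ${optane_high:.0f} per tok/s")
print(f"Ratio:        {optane_low / gpu:.1f}x to {optane_high / gpu:.1f}x")
```

The low end of the Optane price range gives the roughly 18× gap. The high end pushes it to 25×. Power draw is not included, which would widen the gap further.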

Real-world gotchas you will hit

These are the failure modes that show up consistently in build threads — none of them are dealbreakers, but all of them eat days.

  • BIOS App Direct mode is not enabled by default. You need a server-class motherboard with current firmware. Many used eBay listings ship with older firmware that doesn't expose the option.
  • NUMA layout matters enormously. A 1T-parameter model that crosses sockets without NUMA pinning can run 2-3× slower than the same build with proper numactl invocations.
  • Optane wears out. Persistent memory has a finite write endurance. For inference (read-heavy) this is fine, but if you fine-tune on the same machine you will burn through DIMMs faster than you expect.
  • Power supply sizing. A dual-socket Xeon with 24 populated DIMM slots pulls 400-600 W at load. Used 2U servers often ship with PSUs that just barely cover that envelope.
  • Cooling. Optane DIMMs run hot under sustained inference. 2U rack chassis with stock fan profiles are designed for short bursts; sustained 90% utilization will throttle.

When NOT to build this

Don't build this if you want to actually use a local LLM day-to-day. Don't build this if you don't already have a comfortable budget for a "throwaway experiment." Don't build this if your workload is 7-14B model chat — a single RTX 3060 12GB is faster, cheaper, quieter, and uses 3-5× less power. And don't build this if you don't already know how to navigate NUMA, App Direct mode, and used server hardware — the platform is unforgiving for first-time server builders.

Build it if you want bragging rights on running a frontier-class model locally and you accept that "running" will mean "watching one token appear every few seconds." That is a legitimate hobbyist goal, and Optane is the cheapest way to get there.

Who should actually build this

A narrow audience: persistent-memory tinkerers, researchers who specifically need to inspect activations on very large models without the cost of a multi-GPU H100 cluster, and homelab enthusiasts who already own a Xeon Scalable platform and want to give it a second life. For everyone else, the entry-level path is a budget GPU like the MSI RTX 3060 Ventus 2X 12G plus a fast NVMe drive like the WD Blue SN550 1TB to hold model files, and a Crucial BX500 1TB for the OS.

Bottom line

A 768 GB Optane build is the cheapest known path to fitting a 1T-parameter quantized model in a single chassis in 2026. It is also one of the slowest ways to actually use one. The numbers favor a $300 GPU for almost every real workload below ~70B parameters, and the engineering tax (server motherboard, NUMA tuning, App Direct BIOS, sustained-load cooling) is significant. Treat this build as a demonstration of what clearance-bin persistent memory can do, not as a serving platform.

Frequently asked questions

How slow is CPU+Optane inference compared to a GPU?
Generation throughput on persistent-memory CPU rigs is typically a small fraction of a discrete GPU's. Public reports of these builds describe single-digit to low-double-digit tokens per second on very large models, where a 12GB GPU like the RTX 3060 runs 7-14B models at 30–70 tok/s. Optane's advantage is capacity, not speed, and you should set expectations around that gap before you commit a four-figure budget to clearance-bin memory.
Why use Optane instead of just buying more DDR5?
Optane persistent-memory DIMMs hit clearance pricing after Intel discontinued the line, so 768GB can cost far less than the equivalent registered DDR5. The tradeoff is lower bandwidth and higher latency than DDR5, which caps token throughput. It is a capacity play for experimenters and tinkerers, not a performance build, and you should not expect it to feel anything like a GPU.
Do I need a special motherboard for Optane DIMMs?
Yes. Optane persistent-memory DIMMs require a compatible Intel Xeon Scalable platform with a chipset and BIOS that enable App Direct or Memory Mode. Consumer AM4/AM5 and mainstream LGA boards will not accept them, which is why these builds use older second-hand server platforms. Plan on a Supermicro or used Dell/HPE 2U as part of the bill of materials.
Would an RTX 3060 12GB be a better starting point?
For most people, yes. An RTX 3060 12GB runs 7-13B quantized models comfortably at interactive speeds and costs far less than a Xeon-plus-Optane platform. The Optane route only makes sense if you specifically need to load models far larger than any single affordable GPU's VRAM, and you are willing to accept dramatically lower tokens per second to do it.
Does quantization change the Optane-vs-GPU decision?
Heavily. Aggressive 4-bit quantization shrinks a model's footprint enough that mid-size models fit in 12GB of VRAM, removing the need for Optane entirely. Optane only wins once a model — even quantized — exceeds what your GPU and system DDR can hold together, which starts well above the 70B class. For anyone running 7B-13B Llama derivatives or Qwen3-14B at q4_K_M, a single consumer GPU is the right answer.

— SpecPicks Editorial · Last verified 2026-06-05