Can you run a 1-trillion-parameter LLM on 768 GB of clearance-bin Intel Optane DIMMs? Yes — you can technically load the weights and produce tokens. But generation throughput is a small fraction of what even a $300 RTX 3060 12GB delivers on a model that fits in its VRAM, prefill latency is brutal, and the build only makes sense as an experiment, not a serving platform.
That qualifier is doing a lot of work. The viral Tom's Hardware build that put 768 GB of discontinued Optane Persistent Memory into a single Xeon server and ran a 1T-parameter model on it is interesting for one reason: Optane DIMMs are being dumped on the clearance market at prices that work out to a tiny fraction of what equivalent registered DDR5 costs. Hobbyists doing the arithmetic noticed that capacity, not bandwidth, is the bottleneck for fitting frontier-class LLM weights into a non-GPU machine, and Optane has more of it per dollar than any other option in 2026.
This piece is an editorial synthesis of the public build reports, Intel's official Optane Persistent Memory documentation, and the llama.cpp issue tracker, where the CPU-offload code paths that make these builds work first landed. No independent first-party benchmarking is reported.
Key takeaways
- Optane is a capacity play. 768 GB at clearance pricing can fit a quantized 1T-parameter model that a single consumer GPU cannot.
- Generation throughput is in the single digits of tokens per second on the largest models, per public build reports — orders of magnitude slower than GPU inference.
- Prefill (processing the prompt before the first token appears) gets dramatically worse than on GPU, because both CPU compute throughput and Optane bandwidth sit far below what a GDDR6 or HBM-backed GPU delivers.
- For 7B-13B models, a 12GB RTX 3060 at 4-bit beats this build on every axis that matters: latency, throughput, watts, and complexity.
- Build complexity is real: you need a compatible Xeon Scalable platform with App Direct mode available in the BIOS. That is not a Saturday afternoon project.
What did the experiment actually demonstrate?
The headline-grabbing build was straightforward in concept: take a dual-socket Intel Xeon Scalable server (the kind that turns up on the used market for the price of a mid-range gaming PC), populate every DIMM slot with the largest available Optane Persistent Memory modules, configure App Direct mode in the BIOS so the OS sees the Optane as a flat memory tier alongside DDR4, and load a 1-trillion-parameter quantized model via llama.cpp with full CPU offload.
The build was not new science: llama.cpp's CPU inference path has supported memory-mapped weights for years. What was new was the capacity ceiling: 768 GB in a single chassis at a price that would barely buy 192 GB of registered DDR5. Once you can fit the weights, throughput becomes an engineering question rather than a feasibility question. The catch is that the engineering question has an unflattering answer.
How does Optane DIMM bandwidth compare to DDR5 and to GPU VRAM?
The bandwidth gap is the entire story. Per Intel's published specs and independent measurements that surfaced when the platform was still in active development, Optane DCPMM in App Direct mode delivers a fraction of DDR4 bandwidth on the same channels, and a much smaller fraction of GDDR6 or HBM3 bandwidth. The numbers below are order-of-magnitude — exact figures vary by generation, channel count, and access pattern, but the relative scale holds.
| Memory medium | Approx bandwidth (GB/s) | Latency (ns) | Capacity ceiling (per module or device) | Approx $/GB (2026 clearance) |
|---|---|---|---|---|
| Optane DCPMM (App Direct) | 5–10 per module | 350–500 | 512 GB | $1–3 |
| DDR4-3200 ECC RDIMM | 25 per channel | 90–120 | 256 GB | $5–8 |
| DDR5-5600 ECC RDIMM | 44 per channel | 80–110 | 256 GB | $8–14 |
| GDDR6 (RTX 3060 12GB) | 360 (aggregate) | ~150 | 12 GB | n/a (soldered) |
| GDDR6X / GDDR7 (RTX 5090 32GB) | 1700+ | ~120 | 32 GB | n/a (soldered) |
| HBM3e (MI300X / H200) | 5000+ | ~80 | 192 GB | n/a (soldered) |
The Optane row is between one and three orders of magnitude slower than the GPU rows, depending on which GPU you compare against. That gap drives every downstream throughput number you will see in build reports.
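How large the gap is depends on which rows are compared. A minimal sketch, using only the approximate figures from the table above (the function name is illustrative, not from any library):

```python
import math

# Approximate bandwidth figures copied from the table above (GB/s).
OPTANE_GB_S = (5.0, 10.0)  # low and high end of the Optane row
GPU_GB_S = {
    "RTX 3060 12GB (GDDR6)": 360.0,
    "RTX 5090 32GB": 1700.0,
    "MI300X / H200 (HBM3e)": 5000.0,
}


def gap_in_orders_of_magnitude(fast_gb_s: float, slow_gb_s: float) -> float:
    """Return log10 of the bandwidth ratio between two memory media."""
    return math.log10(fast_gb_s / slow_gb_s)


for name, gpu_bw in GPU_GB_S.items():
    # Smallest gap uses the best Optane figure, largest gap the worst.
    smallest = gap_in_orders_of_magnitude(gpu_bw, OPTANE_GB_S[1])
    largest = gap_in_orders_of_magnitude(gpu_bw, OPTANE_GB_S[0])
    print(f"{name}: {smallest:.1f} to {largest:.1f} orders of magnitude")
```

The table figures are themselves order-of-magnitude estimates, so the output is only a rough guide to relative scale.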
What token throughput can you realistically expect on CPU+Optane vs a GPU?
LLM generation throughput is bounded by memory bandwidth divided by the number of bytes the model has to read per token. For a dense quantized model of size $W$ bytes, generation reads roughly $W$ bytes per token (autoregressive decode reads every parameter once); mixture-of-experts models read only the experts routed to each token, so their effective $W$ is smaller. Bandwidth ÷ parameter-bytes gives an upper bound on tokens per second.
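The bound is easy to compute. The sketch below applies it to a dense 70B model at q4_K_M (about 42 GB, per the quantization matrix later in this piece) across a few bandwidth figures from the table above. It is a ceiling only: it ignores compute, cache effects, and KV-cache reads, and the GPU rows are hypothetical for this model, since 42 GB does not fit in 12 GB of VRAM.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode tokens/s: bandwidth divided by bytes read per token.

    Assumes a dense model whose full weight set is read once per token.
    """
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / weights_gb


WEIGHTS_70B_Q4_GB = 42.0  # approximate figure from the quantization matrix

# Bandwidth figures (GB/s) taken from the bandwidth table above.
for label, bandwidth in [
    ("Optane, top of the quoted range", 10.0),
    ("DDR4-3200, one channel", 25.0),
    ("GDDR6 (RTX 3060, if it fit)", 360.0),
    ("HBM3e", 5000.0),
]:
    ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_70B_Q4_GB)
    print(f"{label}: at most {ceiling:.2f} tok/s")
```

Real builds interleave several modules and channels, so their aggregate bandwidth, and therefore their ceiling, is higher than any single-row figure suggests.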
Public reports of CPU+Optane builds land anywhere from under 1 tok/s on the largest models to low double digits on small ones. Prefill (the initial pass over the prompt) adds a long wait before the first token: the prompt is processed in batches, so the work is dominated by matrix-multiply compute that grows with prompt length, and a CPU has far less of that than a GPU. The numbers below are gathered from build threads and the llama.cpp benchmark database; treat them as ranges, not guarantees.
| Model size | Quant | RTX 3060 12GB tok/s (generation) | Xeon+Optane tok/s (generation) | Notes |
|---|---|---|---|---|
| Llama 3.1 8B | q4_K_M | 50–70 | 8–15 | Fits in 12GB VRAM, GPU wins decisively. |
| Qwen3-14B | q4_K_M | 30–45 | 5–10 | Fits in 12GB with tight context, GPU still wins. |
| Qwen3-72B | q4_K_M | n/a (OOM) | 3–6 | Doesn't fit in 12GB; Optane wins by default. |
| Mixtral 8x22B | q4_K_M | n/a (OOM) | 2–4 | Doesn't fit in 12GB; Optane wins by default. |
| Llama 405B | q4_K_M | n/a (OOM) | 1–2 | Optane is the only option at this size on consumer budgets. |
| Hypothetical 1T | q4_K_M | n/a (OOM) | <1 | Capacity demonstration, not a usable workload. |
A 1 tok/s generation rate is conversationally unusable for most workflows. It is, however, non-zero on a used Xeon build in the $2,500–3,500 range, which is remarkable in its own right.
Quantization matrix: q2/q3/q4/q5/q6/q8/fp16
Quantization shrinks the weight footprint and so directly raises tokens per second on memory-bound platforms. For a 70B model the matrix looks roughly like this, again synthesized from llama.cpp benchmark threads and Tom's Hardware coverage of recent build reports:
| Quant | 70B size (GB) | Fits 12GB GPU? | Fits 12GB + 64GB DDR? | Optane required? | Quality loss vs fp16 |
|---|---|---|---|---|---|
| q2_K | ~26 | No | Yes (partial) | No | High, often unusable |
| q3_K_M | ~32 | No | Yes (partial) | No | Noticeable |
| q4_K_M | ~42 | No | Yes (partial) | No | Small, recommended |
| q5_K_M | ~50 | No | Yes (tight) | Marginal | Very small |
| q6_K | ~57 | No | No | Yes | Near-lossless |
| q8_0 | ~75 | No | No | Yes | Effectively lossless |
| fp16 | ~140 | No | No | Yes | Reference |
Once a quantized model fits in a single RTX 3060 12GB's VRAM, the conversation is over — the GPU wins. Once it does not, the question becomes how much RAM you can throw at it, and Optane provides the cheapest path to "a lot."
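The 70B sizes in the matrix can be rescaled to other parameter counts to check what fits where. The sketch below assumes that footprint scales linearly with parameter count at a fixed quant, which is a simplification: it ignores architecture differences, the KV cache, and runtime overhead. The function names are illustrative.

```python
# Approximate 70B footprints in GB, copied from the matrix above.
SIZES_70B_GB = {
    "q2_K": 26, "q3_K_M": 32, "q4_K_M": 42, "q5_K_M": 50,
    "q6_K": 57, "q8_0": 75, "fp16": 140,
}
REFERENCE_PARAMS_B = 70  # the matrix is for a 70B-parameter model


def effective_bits_per_weight(quant: str) -> float:
    """Bits per parameter implied by the table's 70B footprint."""
    return SIZES_70B_GB[quant] * 8 / REFERENCE_PARAMS_B


def estimated_size_gb(quant: str, params_billion: float) -> float:
    """Linearly rescale the 70B footprint to another parameter count."""
    return SIZES_70B_GB[quant] * params_billion / REFERENCE_PARAMS_B


OPTANE_CAPACITY_GB = 768
for quant in SIZES_70B_GB:
    size = estimated_size_gb(quant, 1000)  # a 1T-parameter model
    verdict = "fits" if size < OPTANE_CAPACITY_GB else "does not fit"
    print(f"{quant}: ~{size:.0f} GB, {verdict} in {OPTANE_CAPACITY_GB} GB")
```

Under that assumption a 1T-parameter model at q4_K_M comes to roughly 600 GB, which is why 768 GB is enough for the capacity demonstration while q6_K and above are not.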
Prefill vs generation: why prefill hurts more than steady-state generation
Generation is sequential: the model reads each weight tensor once per token. On a bandwidth-limited platform like Optane, that gives a strict bandwidth ÷ parameter-bytes ceiling. Prefill works differently. The prompt tokens are processed in batches, so each weight read is reused across many tokens and the bottleneck shifts from memory bandwidth to matrix-multiply compute, which grows with prompt length and which a CPU has far less of than a GPU. A 4K-token system prompt that takes 200 ms on a GPU can easily take 30+ seconds on an Optane build.
This asymmetry is why these rigs feel slow in chat use even when steady-state tok/s looks OK on paper: every long prompt incurs a noticeable wait before the first token streams out. For agent-style workloads where a long system prompt is reused across calls, KV-cache reuse can mitigate this; for one-shot chat it cannot.
Context-length impact: how KV-cache growth changes the RAM-vs-VRAM math
KV-cache size scales linearly with context length. At 128K context on a 70B model the KV cache alone can exceed 40 GB. On a 12GB GPU that is impossible — you offload the KV cache to system DDR and pay a large bandwidth penalty per token. On an Optane build the KV cache typically lives in DDR4 (which has acceptable bandwidth), with the weights on Optane (slow), so long contexts hurt less than you would expect — the weight reads dominate either way.
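The 40 GB figure can be reproduced with a short calculation. The architecture numbers below (80 layers, 8 key/value heads via grouped-query attention, head dimension 128, 16-bit cache entries) are assumptions typical of Llama-style 70B models, not values taken from the build reports.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_element: int = 2,  # assumption: fp16 cache entries
) -> int:
    """Size of the KV cache: one key and one value vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens


# Assumed Llama-style 70B shape; adjust for the model you actually run.
size = kv_cache_bytes(
    n_layers=80, n_kv_heads=8, head_dim=128, context_tokens=128 * 1024
)
print(f"{size / 1e9:.1f} GB ({size / 2**30:.0f} GiB) at 128K context")

# Linear scaling: half the context needs half the cache.
half = kv_cache_bytes(
    n_layers=80, n_kv_heads=8, head_dim=128, context_tokens=64 * 1024
)
print(f"{half / 1e9:.1f} GB at 64K context")
```

Models without grouped-query attention, or with a quantized KV cache, will land on very different numbers, so treat the shape parameters as the main thing to check.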
Where a 12GB GPU like the RTX 3060 still wins
For the model sizes most local-LLM users actually run — 7B, 8B, 13B, 14B quantized — a single 12GB RTX 3060 is comprehensively the better answer. Per the llama.cpp benchmarks, it delivers 30–70 tok/s on those models at q4_K_M with low first-token latency, and it costs less than the BIOS-compatible motherboard alone on a Xeon+Optane build. It also draws ~170 W under load instead of 400+ W for a fully populated dual-socket Xeon.
The perf-per-dollar math is brutal:
| Build | Approx cost | Best-case gen tok/s (8B q4) | $/tok/s |
|---|---|---|---|
| RTX 3060 12GB + B450 + Ryzen 5 5600 + 32GB DDR4 | $700 | 60 | $11.7 |
| Used Xeon + 768GB Optane + DDR4 | $2,500–3,500 | 12 | $208–292 |
That's roughly 18–25× worse cost efficiency for the Optane build on the workload most people care about. Optane only catches up when the model genuinely doesn't fit on the GPU at any quant.
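The cost-efficiency column is plain division of build cost by best-case throughput. A minimal sketch using the table's own figures, with both ends of the quoted Optane cost range:

```python
def dollars_per_tok_s(build_cost_usd: float, tok_s: float) -> float:
    """Build cost divided by best-case generation throughput."""
    return build_cost_usd / tok_s


# Figures from the table above (8B model at q4, best-case tok/s).
gpu = dollars_per_tok_s(700, 60)
optane_low = dollars_per_tok_s(2500, 12)
optane_high = dollars_per_tok_s(3500, 12)

print(f"GPU build:    ${gpu:.1f} per tok/s")
print(f"Optane build: ${optane_low:.0f} to ${optane_high:.0f} per tok/s")
print(f"Ratio:        {optane_low / gpu:.0f}x to {optane_high / gpu:.0f}x worse")
```

The ratio only describes the 8B workload; for models that do not fit in 12 GB the GPU column has no throughput figure to divide by.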
Real-world gotchas you will hit
These are the failure modes that show up consistently in build threads — none of them are dealbreakers, but all of them eat days.
- BIOS App Direct mode is not enabled by default. You need a server-class motherboard with current firmware. Many used eBay listings ship with older firmware that doesn't expose the option.
- NUMA layout matters enormously. A 1T-parameter model that crosses sockets without NUMA pinning can run 2-3× slower than the same build with proper `numactl` invocations.
- Optane wears out. Persistent memory has a finite write endurance. For inference (read-heavy) this is fine, but if you fine-tune on the same machine you will burn through DIMMs faster than you expect.
- Power supply sizing. A dual-socket Xeon with 24 populated DIMM slots pulls 400-600 W at load. Used 2U servers often ship with PSUs that just barely cover that envelope.
- Cooling. Optane DIMMs run hot under sustained inference. 2U rack chassis with stock fan profiles are designed for short bursts; sustained 90% utilization will throttle.
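For the NUMA item above, pinning usually means prefixing the inference command with `numactl`. The sketch below only builds the command line and does not run it. The `--cpunodebind`, `--membind`, and `--interleave` options are standard `numactl` flags, but the inference command, model path, and thread count are placeholders, and the build reports do not say which policy they used.

```python
import shlex
from typing import Optional, Sequence


def numactl_wrap(
    command: Sequence[str],
    cpu_node: Optional[int] = None,
    mem_node: Optional[int] = None,
    interleave_all: bool = False,
) -> list:
    """Prefix a command with numactl CPU and memory placement options."""
    if interleave_all and mem_node is not None:
        raise ValueError("choose either interleave_all or mem_node, not both")
    prefix = ["numactl"]
    if cpu_node is not None:
        prefix.append(f"--cpunodebind={cpu_node}")
    if interleave_all:
        prefix.append("--interleave=all")
    elif mem_node is not None:
        prefix.append(f"--membind={mem_node}")
    return prefix + list(command)


# Placeholder inference command; substitute your own binary and model path.
inference = ["./llama-cli", "-m", "model.gguf", "-t", "32"]

# Keep threads and memory on socket 0 (only works if the model fits there).
print(shlex.join(numactl_wrap(inference, cpu_node=0, mem_node=0)))

# Spread pages across all nodes when the model must span both sockets.
print(shlex.join(numactl_wrap(inference, interleave_all=True)))
```

Which policy wins depends on the platform and model size, so it is worth benchmarking both before settling on one.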
When NOT to build this
Don't build this if you want to actually use a local LLM day-to-day. Don't build this if you don't already have a comfortable budget for a "throwaway experiment." Don't build this if your workload is 7-14B model chat: a single RTX 3060 12GB is faster, cheaper, quieter, and draws roughly 2-3× less power (about 170 W for the card against 400-600 W for the server). And don't build this if you don't already know how to navigate NUMA, App Direct mode, and used server hardware; the platform is unforgiving for first-time server builders.
Build it if you want bragging rights on running a frontier-class model locally and you accept that "running" will mean "watches one token appear every few seconds." That is a legitimate hobbyist goal, and Optane is the cheapest way to get there.
Who should actually build this
A narrow audience: persistent-memory tinkerers, researchers who specifically need to inspect activations on very large models without the cost of a multi-GPU H100 cluster, and homelab enthusiasts who already own a Xeon Scalable platform and want to give it a second life. For everyone else, the entry-level path is a budget GPU like the MSI RTX 3060 Ventus 2X 12G plus a fast NVMe drive like the WD Blue SN550 1TB to hold model files, and a Crucial BX500 1TB for the OS.
Bottom line
A 768 GB Optane build is the cheapest known path to fitting a 1T-parameter quantized model in a single chassis in 2026. It is also one of the slowest ways to actually use one. The numbers favor a $300 GPU for almost every real workload below ~70B parameters, and the engineering tax (server motherboard, NUMA tuning, App Direct BIOS, sustained-load cooling) is significant. Treat this build as a demonstration of what clearance-bin persistent memory can do, not as a serving platform.
Citations and sources
- Tom's Hardware — PC components / RAM coverage
- Intel — Optane Persistent Memory product page
- llama.cpp — GitHub repository (CPU inference reference)
