Yes, you can technically load a trillion-parameter LLM on a 768GB Intel Optane DIMM rig, and yes, it will produce coherent output, but no, you should not buy one in 2026 if your goal is responsive local inference. Generation throughput is gated by memory bandwidth, not capacity, so an RTX 3060 12GB at $300 delivers roughly 4-5× the tokens-per-second of an Optane capacity build on the 8B-class models that fit a single consumer card (35-55 tok/s versus 8-12 tok/s), and well over 10× what the Optane rig manages on the trillion-parameter model it was built to load.
The viral build, in context
Tom's Hardware ran the headline earlier this month: a builder stuffed six 128GB Intel Optane Persistent Memory 200-series DIMMs into a dual-socket Xeon platform, totalling 768GB of byte-addressable, persistent storage exposed to the OS as system memory. Cost: roughly $1,400 for the DIMMs on the secondary market, plus the host platform. That capacity, paired with llama.cpp's mmap-everything loader, was enough to bring a 1-trillion-parameter mixture-of-experts model into address space. Cool demo. Also, completely the wrong tool if you actually want to use a model day-to-day.
The story is going viral because the headline number, "trillion-parameter LLM on a dead memory product", is irresistible. It speaks to readers who are really asking a different question: can I run something serious locally for under $2,000? The answer is almost always yes, but the rig that does it has a GPU in it, not 768GB of Optane.
We are taking the cross-shop seriously here because both halves represent real reader intent. The Optane build represents the curiosity tail. The 12GB GPU build represents the buyer tail. We will walk through both, with numbers, and end with a verdict matrix that is honest about which side of the line you are on.
Key takeaways
- Capacity ≠ speed. Optane DIMMs give you 768GB of address space; they generate tokens at roughly 1-3 tok/s on a trillion-parameter MoE model, because memory bandwidth and CPU compute, not capacity, set the ceiling.
- The RTX 3060 12GB is the value pick for 8B-14B local LLMs. Expect 35-55 tok/s on 8B-class models at q4_K_M, and 12-22 tok/s on 13-14B models at q4 with light offload.
- Bandwidth math, not marketing math. Optane 200-series DIMMs run at roughly 6-8 GB/s sustained per module; GDDR6 on a 3060 12GB runs at 360 GB/s. Module against card, that is a 45-60× bandwidth gap, and even the rig's roughly 80 GB/s aggregate across all DIMMs is about 4.5× short of the GPU.
- $/token favors the GPU by an order of magnitude. At electricity prices around $0.15/kWh and realistic throughput, a 3060 12GB generates roughly 10-15× more tokens per dollar of operating cost than a dual-socket Xeon + Optane stack.
- Most readers want a GPU, not a memory rig. Unless you specifically need to inspect or batch-evaluate a model larger than a single GPU can hold, capacity rigs are a research tool, not a daily driver.
What the 768GB Optane rig actually ran, and how fast
According to the Tom's Hardware writeup, the build loaded a 1T-parameter MoE checkpoint quantized to roughly 4-bit, which lands the weights at a footprint just under the rig's 768GB. The OS treats Optane DIMMs as memory in App Direct mode, so llama.cpp's mmap loader pulls weight pages on demand. Reported throughput was in the 1-3 tokens-per-second range for generation, with first-token latency dominated by the prefill phase touching huge swaths of the weight set.
That is enough to verify that the model runs — sentences come out, the model demonstrably has more knowledge than a 70B distilled model — but it is not enough for anything an end user would call interactive. Three tokens per second is roughly 180 tokens per minute; a typical 600-token response takes three to four minutes. For agentic chains, that compounds into the tens of minutes per turn.
The viral framing buried the part of the story that matters: capacity unlocks loading, but bandwidth and compute determine latency. Optane's read bandwidth is approximately one order of magnitude lower than DDR5, and almost two orders of magnitude lower than the GDDR6 on a midrange consumer GPU. Tokens come out at whatever speed your matmul backend can pull weight pages through that bandwidth bottleneck. No clever software changes this.
Why memory bandwidth, not capacity, decides tokens-per-second
LLM inference at generation time is dominated by reading the full set of attention and feed-forward weights for each layer, every single token. For a dense model, that read happens once per token, regardless of how much VRAM or system RAM you have available. The throughput ceiling is therefore set by:
tokens_per_second_ceiling ≈ memory_bandwidth_GB_s / model_weight_size_GB
For a 7B model in q4_K_M (roughly 4.5GB of weights), an RTX 3060 12GB at 360 GB/s gives a theoretical ceiling of roughly 80 tok/s, of which real backends like llama.cpp hit 35-55 depending on context length. For a quantized 1T MoE model whose weights approach 768GB, the same math at the Optane rig's roughly 80 GB/s aggregate gives a theoretical ceiling of roughly 0.1 tok/s for dense access. MoE routing means only a small subset of experts fires per token, though, so far fewer bytes are read per token and effective throughput climbs into the 1-3 tok/s range observed in the wild.
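The ceiling formula is simple enough to sanity-check in a few lines. The sketch below is illustrative only: the function names are hypothetical, and the 80 GB/s aggregate Optane figure is an assumption taken from the spec table later in this article, not a measurement. The second function inverts the formula to ask how many gigabytes of weights the rig can be reading per token if it really sustains 1-3 tok/s.

```python
def tokens_per_second_ceiling(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Upper bound on generation speed when each token streams
    `weights_read_gb` of weights over a bus of `bandwidth_gb_s`."""
    if bandwidth_gb_s <= 0 or weights_read_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / weights_read_gb


def max_gb_read_per_token(bandwidth_gb_s: float, observed_tok_s: float) -> float:
    """Invert the ceiling: the most weight data (GB) that can be read per
    token if the rig sustains `observed_tok_s` at this bandwidth."""
    if bandwidth_gb_s <= 0 or observed_tok_s <= 0:
        raise ValueError("bandwidth and throughput must be positive")
    return bandwidth_gb_s / observed_tok_s


# RTX 3060 12GB: 360 GB/s over ~4.5 GB of q4_K_M weights.
gpu_ceiling = tokens_per_second_ceiling(360.0, 4.5)
print(f"RTX 3060 12GB ceiling: {gpu_ceiling:.0f} tok/s")

# Assumed aggregate bandwidth for the Optane rig (from the spec table).
OPTANE_AGGREGATE_GB_S = 80.0
for observed in (1.0, 3.0):
    gb = max_gb_read_per_token(OPTANE_AGGREGATE_GB_S, observed)
    print(f"{observed:.0f} tok/s on Optane implies at most {gb:.0f} GB read per token")
```

Under those assumptions, the observed 1-3 tok/s only works if each token reads tens of gigabytes rather than the full checkpoint, which is exactly what MoE routing provides.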
The intuition reverses what marketing copy implies. The Optane rig is "more memory," but every additional GB of model costs you tokens per second, because the bandwidth budget per token is fixed. The GPU is "less memory," but every byte of model sits on a far faster bus. For models that fit, the GPU is always faster. The only thing capacity buys you is the ability to run models that won't fit anywhere else — at unusable latency.
Spec-delta: Optane DIMM rig vs RTX 3060 12GB
| Spec | 768GB Optane rig (dual Xeon) | MSI RTX 3060 Ventus 2X 12G |
|---|---|---|
| Usable memory for weights | 768 GB | 12 GB |
| Memory bandwidth (sustained) | ~80 GB/s aggregate | 360 GB/s |
| Idle power draw | ~280 W (full system) | ~15 W (card only) |
| Load power draw | ~600 W (full system) | ~170 W (card only) |
| Approx. used hardware cost | ~$2,000 (DIMMs plus host) | ~$300 (card only) |
| Models that fit | up to ~1T params at q4 | up to ~14B params at q4 |
| Gen throughput on the 1T MoE at ~4-bit | 1-3 tok/s | n/a (does not fit in 12 GB) |
| Gen throughput on a shared 8B q4 model | ~8-12 tok/s (CPU+Optane) | 35-55 tok/s |
| $/100k tokens (electricity only, $0.15/kWh) | ~$0.50 | ~$0.04 |
The numbers above are conservative. Real measurements vary by model, quantization, runtime, and CPU SKU. The shape, however, is robust: the GPU is roughly 4-5× faster on the shared 8B workload and roughly an order of magnitude cheaper per token.
Quantization matrix for the RTX 3060 12GB
For the buyer end of the cross-shop, what matters is which model sizes fit comfortably in 12GB at which quantization. Numbers below are for a single-user chat workload at 4k-8k context, llama.cpp / Ollama defaults, KV cache included.
| Model | q2_K | q3_K_M | q4_K_M | q5_K_M | q6_K | q8_0 | fp16 |
|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 3.0 GB | 3.7 GB | 4.6 GB | 5.5 GB | 6.4 GB | 8.1 GB | 15.0 GB |
| Mistral Small 12B | 4.4 GB | 5.6 GB | 7.0 GB | 8.3 GB | 9.7 GB | 12.7 GB | 24.0 GB |
| Qwen 2.5 14B | 5.4 GB | 6.7 GB | 8.4 GB | 10.0 GB | 11.6 GB | 14.9 GB | 28.0 GB |
| Phi-4 14B | 5.5 GB | 6.8 GB | 8.5 GB | 10.1 GB | 11.7 GB | 15.0 GB | 28.2 GB |
| Gemma 4 27B | ~9.5 GB | ~12.0 GB | ~15.5 GB | ~18.4 GB | offload | offload | offload |
The 12GB buffer comfortably covers 8B at every quantization through q6, and 13-14B class at q4_K_M with room for an 8k context window. Above 14B parameters, you start spilling layers to system RAM, which collapses throughput from 25 tok/s to 4-6 tok/s — usable for batch, miserable for chat.
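To make the fit rule concrete, here is a minimal sketch that picks the largest quantization fitting a VRAM budget. The footprints are copied from two rows of the matrix above. The 1 GB reserve for driver and desktop overhead, and the names `FOOTPRINT_GB` and `largest_quant_that_fits`, are assumptions for illustration.

```python
from typing import Optional

# Footprints in GB, copied from the quantization matrix (KV cache included).
FOOTPRINT_GB = {
    "Llama 3.1 8B": {
        "q2_K": 3.0, "q3_K_M": 3.7, "q4_K_M": 4.6, "q5_K_M": 5.5,
        "q6_K": 6.4, "q8_0": 8.1, "fp16": 15.0,
    },
    "Qwen 2.5 14B": {
        "q2_K": 5.4, "q3_K_M": 6.7, "q4_K_M": 8.4, "q5_K_M": 10.0,
        "q6_K": 11.6, "q8_0": 14.9, "fp16": 28.0,
    },
}


def largest_quant_that_fits(
    model: str, vram_gb: float = 12.0, reserve_gb: float = 1.0
) -> Optional[str]:
    """Return the biggest quantization whose footprint leaves `reserve_gb`
    of VRAM free, or None if nothing fits without offload."""
    budget = vram_gb - reserve_gb
    candidates = [
        (size_gb, quant)
        for quant, size_gb in FOOTPRINT_GB[model].items()
        if size_gb <= budget
    ]
    if not candidates:
        return None
    # Tuples sort by size first, so max() picks the largest footprint.
    return max(candidates)[1]


for name in FOOTPRINT_GB:
    print(name, "->", largest_quant_that_fits(name))
```

With the assumed 1 GB reserve, the 14B model's q6_K (11.6 GB) is treated as too tight even though it is nominally under 12 GB. Lower `reserve_gb` if the card is not driving a display.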
Quality loss across quantization tiers is smaller than you may remember from a year ago. On modern 8B-14B checkpoints, q4_K_M loses 1-3 points of MMLU vs fp16; q5_K_M is within noise. The era of "quantization wrecks the model" is over for the K-quants in this range.
Prefill vs generation: where each architecture wins and loses
Prefill, the first pass through your prompt, is dominated by compute and parallelism, not bandwidth. On a long prompt (say, a 4k-token instruction with a code attachment), the Optane build chokes harder than its already poor generation throughput suggests: the prompt tokens are processed as a batch that activates most of the model's experts, so prefill touches a large share of the weight set, and two Xeons have far less parallel matmul throughput than a GPU. Expect 30-90 seconds of first-token latency on the 768GB rig before a single output token appears.
The GPU eats prefill for breakfast. A 3060 12GB will plow through a 4k prompt in 1-3 seconds, then generate at 35-55 tok/s. That is the difference between a tool that feels responsive and a tool that feels like compiling C++ in 2002.
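Putting prefill and generation together gives a rough end-to-end wait for one response. The sketch below uses only figures quoted in this article (a 600-token reply, 1-3 s versus 30-90 s of prefill, 35-55 versus 1-3 tok/s). The function name `response_seconds` is hypothetical, and the model ignores everything except those two phases.

```python
def response_seconds(prefill_s: float, output_tokens: int, gen_tok_s: float) -> float:
    """Wall-clock estimate: time to first token plus time to stream the rest."""
    if gen_tok_s <= 0:
        raise ValueError("generation throughput must be positive")
    return prefill_s + output_tokens / gen_tok_s


REPLY_TOKENS = 600  # the "typical response" length used earlier in the article

# (label, prefill seconds, generation tok/s) for best and worst cases.
scenarios = [
    ("RTX 3060 12GB, best case", 1.0, 55.0),
    ("RTX 3060 12GB, worst case", 3.0, 35.0),
    ("768GB Optane rig, best case", 30.0, 3.0),
    ("768GB Optane rig, worst case", 90.0, 1.0),
]

for label, prefill_s, tok_s in scenarios:
    total = response_seconds(prefill_s, REPLY_TOKENS, tok_s)
    print(f"{label}: {total:.0f} s ({total / 60:.1f} min)")
```

Even the Optane best case lands near four minutes per reply, while the GPU worst case stays around twenty seconds.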
For batch jobs — offline summarization, dataset labelling, embedding extraction — neither architecture's prefill profile particularly matters; both will eventually finish. Pick whichever fits your model. For anything you sit and wait for, the GPU wins so decisively that the Optane rig is not in the same product category.
What you can actually run on an RTX 3060 12GB today
Going by community-measured llama.cpp throughput on Ampere, in 2026:
| Model | Quantization | VRAM used | Gen throughput |
|---|---|---|---|
| Llama 3.1 8B | q4_K_M | 4.6 GB | 50-58 tok/s |
| Llama 3.1 8B | q6_K | 6.4 GB | 42-50 tok/s |
| Mistral Small 12B | q4_K_M | 7.0 GB | 32-38 tok/s |
| Qwen 2.5 14B | q4_K_M | 8.4 GB | 22-28 tok/s |
| Phi-4 14B | q4_K_M | 8.5 GB | 24-30 tok/s |
| Gemma 4 27B | q4_K_M (offload) | 11.5 GB + 4 GB sys | 6-9 tok/s |
For agentic coding (Continue, Aider, Roo Code), the sweet spot is Qwen 2.5 14B Instruct at q4_K_M: fast enough for back-and-forth, smart enough for non-trivial diffs. For chat and summarization, Llama 3.1 8B at q5 or q6 is the comfort pick. llama.cpp also supports KV-cache quantization, which on a 14B model frees ~1GB and lets you grow context to 8k-12k cleanly.
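The KV-cache saving can be estimated from the model's attention shape. This is a rough sketch, not a llama.cpp measurement: the layer and head counts for Qwen 2.5 14B (48 layers, 8 KV heads, head dimension 128) are assumptions that should be checked against the model's published config, and an 8-bit cache is approximated as exactly one byte per element.

```python
def kv_cache_gib(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_element: float,
) -> float:
    """Size of the key and value caches in GiB for a given context length."""
    # Factor of 2: one tensor for keys and one for values, per layer.
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element
    )
    return total_bytes / 2**30


# Assumed Qwen 2.5 14B attention shape; verify before relying on it.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for context in (8192, 12288):
    fp16 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, context, 2)
    int8 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, context, 1)
    print(f"{context} tokens: fp16 {fp16:.2f} GiB, 8-bit {int8:.2f} GiB, "
          f"saves {fp16 - int8:.2f} GiB")
```

Under these assumptions the saving is in the neighborhood of 1 GB at 8k-12k context, in line with the figure quoted above.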
Perf-per-dollar and perf-per-watt math
Let's run the actual numbers on what each rig costs to operate at typical 2026 US electricity prices.
| Metric | 768GB Optane rig | RTX 3060 12GB rig |
|---|---|---|
| Idle wattage | ~280 W | ~80 W (full system) |
| Load wattage | ~600 W | ~270 W (full system) |
| Useful tokens/sec at load | 2 tok/s (1T MoE) | 50 tok/s (8B) |
| Tokens per kWh at load | ~12,000 | ~666,000 |
| Electricity cost per 100k tok | ~$1.25 | ~$0.023 |
| Used hardware acquisition | ~$2,000 | ~$300 |
| Payback period if running 24/7 chat | n/a (capacity stunt) | 1-2 years |
The 3060 rig generates roughly 55× more tokens per kWh than the Optane rig when each runs the workload in the table above: an 8B model on the GPU, the 1T MoE on Optane. That is not a like-for-like comparison, because the trillion-parameter workload only runs on the Optane rig and the GPU is irrelevant there. The more honest framing is that almost nobody needs to run a 1T model locally. If you do, you have a research budget and you are not reading a value-bracket article.
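The tokens-per-kWh and cost-per-100k rows follow directly from throughput and load wattage. The sketch below reproduces them from the table inputs. The function names are hypothetical, and the $0.15/kWh price is the one this article assumes.

```python
def tokens_per_kwh(tok_s: float, load_watts: float) -> float:
    """Tokens generated while the system consumes one kilowatt-hour at load."""
    if tok_s <= 0 or load_watts <= 0:
        raise ValueError("throughput and wattage must be positive")
    tokens_per_hour = tok_s * 3600.0
    kwh_per_hour = load_watts / 1000.0
    return tokens_per_hour / kwh_per_hour


def electricity_cost_per_100k(
    tok_s: float, load_watts: float, usd_per_kwh: float = 0.15
) -> float:
    """Electricity cost in USD to generate 100,000 tokens."""
    return 100_000 / tokens_per_kwh(tok_s, load_watts) * usd_per_kwh


# Inputs from the table: (tok/s at load, full-system load watts).
optane = (2.0, 600.0)   # 1T MoE on the Optane rig
gpu = (50.0, 270.0)     # 8B model on the RTX 3060 12GB rig

optane_kwh = tokens_per_kwh(*optane)
gpu_kwh = tokens_per_kwh(*gpu)

print(f"Optane: {optane_kwh:,.0f} tok/kWh, ${electricity_cost_per_100k(*optane):.2f} per 100k")
print(f"GPU:    {gpu_kwh:,.0f} tok/kWh, ${electricity_cost_per_100k(*gpu):.3f} per 100k")
print(f"Ratio:  {gpu_kwh / optane_kwh:.1f}x")
```

Swapping in your own wall-meter wattage and measured throughput is the quickest way to check whether these estimates hold for your hardware.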
When the Optane capacity route actually makes sense
There is a narrow but legitimate case for the 768GB build:
- Research access to a very large checkpoint. You need to inspect attention patterns or do layer-by-layer analysis on a model that genuinely will not fit any consumer or even prosumer GPU configuration. Optane is a cheap way to get the weights into address space.
- Offline batch generation where latency is irrelevant. You are generating training data, distilling, or doing exhaustive prompt sweeps and you do not care if a single response takes five minutes. The capacity rig will eventually finish; cloud rental for the same model could exceed the hardware cost in weeks.
- You already had the hardware. Optane 200-series DIMMs are cheap on the secondary market because the product line was discontinued. If a dual-socket Xeon is gathering dust in a homelab, the marginal cost of stuffing it with 768GB of Optane and trying a stunt build is genuinely low.
If you are reading this and you do not match one of those three cases, you want a GPU. Almost certainly an RTX 3060 12GB if the budget is tight, an RTX 5070 or RTX 5080 if it is not.
Verdict matrix
Get the 768GB Optane route if:
- You need to inspect, evaluate, or generate from a model that genuinely exceeds 80GB of weights
- Latency is irrelevant to your workflow
- You already own a compatible dual-socket Xeon platform
- Your budget for hardware is tight but your budget for electricity over 18 months is not
Get the RTX 3060 12GB if:
- You want a responsive chat assistant on local hardware
- You are running agentic coding tools (Continue, Aider, Roo Code) and need <3s first-token latency
- You will use the rig as a normal PC the rest of the time
- You want the best dollars-per-token operating economics in the budget bracket
- You expect to live in the 8B-14B parameter band for the foreseeable future
For most readers, the recommendation is unambiguous. The MSI RTX 3060 Ventus 2X 12G or ZOTAC RTX 3060 Twin Edge on an AM4 Ryzen 7 5800X platform with an SN550 NVMe SSD gets you a complete, modern-feeling local-LLM workstation under $700 that runs the models 95% of buyers want to run, at speeds that feel like real software.
Real-world benchmark notes
Throughput numbers above are sourced from the llama.cpp project's GPU benchmark threads and supplemented by TechPowerUp's RTX 3060 specifications for theoretical bandwidth ceilings. Optane 200-series bandwidth is per Intel's product documentation; real-world sustained reads with App Direct mode and llama.cpp's mmap loader trend toward the lower end of the rated range because access patterns are random rather than sequential.
If you want to reproduce these numbers, the cleanest baseline is ollama run llama3.1:8b-instruct-q4_K_M --verbose, which prints a tokens-per-second readout after each response. To force every layer into VRAM, set the num_gpu parameter to 999, for example with /set parameter num_gpu 999 inside the interactive session. On a stock RTX 3060 12GB at default power limits, expect within 5% of the table above on Linux with the open NVIDIA driver.
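For scripted runs, the same measurement can be taken through Ollama's local HTTP API, whose non-streaming /api/generate response reports eval_count (tokens generated) and eval_duration (nanoseconds). The following is a minimal sketch: the helper names are hypothetical, it assumes a server on the default localhost:11434 port, and passing num_gpu as a request option to ask for full offload is an assumption worth verifying against your Ollama version.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation throughput from an /api/generate response.
    eval_duration is reported in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_once(
    prompt: str,
    model: str = "llama3.1:8b-instruct-q4_K_M",
    host: str = "http://localhost:11434",
) -> dict:
    """Send one non-streaming generation request to a local Ollama server."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Assumption: a large num_gpu asks Ollama to offload every layer.
        "options": {"num_gpu": 999},
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Example with a canned response: 500 tokens generated in 10 seconds.
sample = {"eval_count": 500, "eval_duration": 10_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")
```

With a server running, `tokens_per_second(run_once("Explain mmap in two paragraphs."))` gives one sample. Averaging several runs after a warm-up request smooths out model-load time.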
Common pitfalls
A few failure modes we see new local-LLM builders hit on each side of the cross-shop:
- Buying Optane DIMMs for a board that does not support App Direct mode. Optane PMem requires a Xeon Scalable platform with explicit firmware support — most consumer or single-socket Xeon E-class boards will not see the DIMMs at all. Check the QVL before clicking buy.
- Buying a $300 RTX 3060 12GB and pairing it with 8GB of system RAM. Even with a model fully in VRAM, llama.cpp uses host memory for the KV cache spillover and the model loader scratch space. Pair the card with at least 32GB of DDR4.
- Mistaking the 8GB RTX 3060 for the 12GB variant. They are different products with different memory buses (192-bit on the 12GB card vs 128-bit on the 8GB card). The 8GB card cannot run a 14B model at q4 without aggressive offload and is a bad value for local LLM work in 2026. The 12GB SKU's part number ends in -12G and is what you want.
- Expecting MoE models to behave like dense models. On an Optane rig, MoE actually helps because only a few experts fire per token. On a 12GB GPU, MoE can be worse than a comparable dense model because the expert weight set is much larger than the active subset and you cannot cache it all.
Bottom line
The 768GB Optane build is a great story and a poor product. It demonstrates that capacity is not the bottleneck most people think it is, then makes the bottleneck — bandwidth — painfully visible. The 1T-parameter model runs; nobody using it would call it usable.
The RTX 3060 12GB remains the best on-ramp to local LLMs in 2026 because the 8B-14B band is genuinely useful, the card runs them quickly, and the total system cost is under $700 even with a fresh AM4 build. Buy the GPU. If you specifically need to load a model that does not fit, rent cloud GPUs by the hour instead of building a capacity rig — the math almost always favors rental until your usage hits dozens of hours per month.
