ChatGPT Now Saves Dossiers About You: Build a Private Local LLM Box

A 2026 editorial synthesis on building a private local LLM box.

To run a private local LLM instead of ChatGPT for privacy in 2026, the practical floor is an RTX 3060 12GB host paired with a runner like Ollama or llama.cpp, a modern 8-core CPU, 32GB of system RAM, and an NVMe SSD for fast model loads. Prompts never leave the box, there is no server-side memory store, and 7B-to-14B-class open models at 4-bit quantization handle daily drafting, coding, and summarization with no recurring subscription.

Why this matters right now

Per reporting collected by The Decoder, ChatGPT's memory feature has expanded into narrative dossiers organized by buckets such as work, hobbies, and travel — a structured profile that persists across sessions and informs future responses. For users who value confidentiality on legal, medical, financial, or simply personal queries, the existence of such a profile is the trigger to look at local inference instead of a hosted chatbot.

The economics shifted in parallel. A used NVIDIA RTX 3060 12GB, one of the cheapest GeForce cards with enough VRAM to host a useful assistant, has been hovering in the low hundreds of dollars so far in 2026, and open-weight model families like Llama 3, Qwen 3, Mistral, and DeepSeek have closed much of the quality gap for routine work. The combination means a one-time hardware purchase plus a few dollars per month of electricity now replaces a recurring cloud subscription for daily users, and the recurring privacy cost (your prompts being read, stored, and synthesized into a profile by a third party) drops to zero.

This synthesis covers what the cited reporting says ChatGPT is storing, what hardware actually clears the bar for a private assistant box, how to think about quantization on 12GB of VRAM, where a Raspberry Pi 4 Model B 8GB fits as an always-on companion, and when the cloud is still the right answer.

Key Takeaways

  • What leaves your machine: with Ollama or llama.cpp running on local hardware, prompt text and completions stay on-device. There is no server-side memory accumulating into a profile, no telemetry to a vendor, and no terms-of-service clause licensing your inputs for training.
  • Minimum practical hardware: an RTX 3060 12GB-class GPU, an 8-core/16-thread host CPU like the AMD Ryzen 7 5800X, 32GB of DDR4-3200 or faster, and a 1TB NVMe SSD such as the WD Blue SN550 to keep model load times to a few seconds.
  • Realistic model sizes: 7B-class models at q4 fit with headroom for context. 13B-to-14B-class fits at q4 with shorter context windows. 30B-class models require offloading to system RAM and run slowly — usable, not snappy.
  • Always-on companion: a Raspberry Pi 4 Model B 8GB handles small quantized models (1B-3B parameters) for narrow lightweight tasks while the main box sleeps.
  • Break-even math: a $300-$450 used 3060 build plus single-digit-dollar monthly power pays for itself against a $20/month subscription in roughly one and a half to two and a half years for daily users, and takes longer for casual users.
  • Quality envelope: flagship cloud models still pull ahead on the hardest reasoning and long-context jobs. For daily drafting, coding help, RAG over your documents, and summarization, modern open models running locally are close enough that most users will not notice on routine work.

What exactly is ChatGPT storing, per the reporting?

The Decoder has documented that ChatGPT's memory feature went beyond simple "remember my name" facts and now compiles narrative-style dossiers about each user. Per that reporting, the stored entries are organized into structured categories — work, hobbies, travel are explicitly cited groupings — and the model uses those entries to condition future responses without the user repeating context.

The mechanism is not unique to OpenAI; any hosted assistant that supports persistent memory operates similarly. The relevant points for a privacy-minded user are: (a) the dossier persists server-side, (b) it is editable from the settings UI but the contents are derived from the cumulative chat history the vendor already holds, and (c) the dossier informs answers across unrelated sessions, meaning a prompt about, say, a medical topic can be flavored by prior mentions of an employer or location.

None of that is alleged abuse — it is the documented design. The point is that the design itself is what some users now want to opt out of. Running inference locally is the cleanest opt-out: no dossier exists because no third party sees the prompts to build one.
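
To make "no third party sees the prompts" concrete, here is a minimal sketch of a prompt sent to a local runner. It assumes Ollama is installed and listening on its default address (localhost, port 11434) and that a model has already been pulled; the model tag `llama3.1:8b` and the helper names `build_request` and `ask_local` are illustrative choices, not something the cited reporting specifies.

```python
import json
import urllib.request

# Assumption: Ollama is running on this machine on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build a non-streaming generate request aimed at the local runner."""
    # Assumption: the model tag is whatever you pulled locally.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send the prompt to localhost and return the completion text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask_local("Draft a polite follow-up email")` would return the completion as a string. The only network hop is to the loopback interface, so the prompt and the response stay on the box.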

What hardware do you need for a private 12GB-class inference box?

The "12GB-class" frame matters because most private-assistant workloads — a 7B-to-14B-class chat model at 4-bit quantization with an 8k-16k context window — fit cleanly in 12GB of VRAM with room for KV cache. Below that, you compromise either model size or context length. Above that (16GB-24GB cards) gives more headroom but does not unlock a fundamentally different class of assistant for routine work.
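
The "room for KV cache" part of that claim can be sanity-checked with a rough estimate. The sketch below assumes an unquantized fp16 cache and uses the published Llama 3 8B shape (32 layers, 8 key/value heads under grouped-query attention, head dimension 128) as an example; other models and runner settings will give different numbers, and the function name is illustrative.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for a full context window.

    The leading factor of 2 covers one key tensor and one value tensor
    per layer; bytes_per_value=2 assumes an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Assumed model shape: Llama 3 8B (32 layers, 8 KV heads, head_dim 128).
for ctx in (8192, 16384):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"{ctx} tokens -> {gib:.1f} GiB of KV cache")
```

Under those assumptions an 8k window costs about 1 GiB and a 16k window about 2 GiB, which sits comfortably beside a q4 model of roughly 5GB on a 12GB card.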

The reference spec the rest of this synthesis assumes:

Component | Recommended part | Why
GPU | ZOTAC Gaming GeForce RTX 3060 Twin Edge 12GB or MSI GeForce RTX 3060 Ventus 2X 12G | 12GB of VRAM at a low used-market price point; CUDA support across every major runner. Per TechPowerUp, the card is a GA106 part with a 170W TDP, 360GB/s of memory bandwidth, and 3,584 CUDA cores.
CPU | AMD Ryzen 7 5800X | 8 cores / 16 threads on AM4 keeps platform cost low; sufficient for prompt preprocessing, embedding, and CPU-offload of layers when a model spills past 12GB.
RAM | 32GB DDR4-3200 (2×16GB) | Enough to hold quantized weights when offloading to CPU, plus OS, browser, and a vector DB for RAG.
Storage | WD Blue SN550 1TB NVMe | A 7B q4 model is ~4GB on disk; a 13B q4 is ~7GB. NVMe keeps cold-load to a few seconds rather than tens of seconds on SATA.
PSU | 650W 80+ Gold | Comfortably feeds a 170W GPU + 105W CPU with margin.
Always-on companion | Raspberry Pi 4 Model B 8GB | Per the Raspberry Pi Foundation, the Pi 4 B 8GB uses a quad-core Cortex-A72 at 1.5GHz with LPDDR4-3200 RAM, which is adequate for 1B-3B-class quantized models handling narrow always-on tasks.

A realistic 2026 used-market build hits roughly $300-$450 depending on GPU sourcing. Pair it with a $75-ish Raspberry Pi for the always-on layer and the total outlay of roughly $375-$525 equals about 19 to 26 months of a $20/month cloud chatbot subscription.

Which local models match a private-assistant use case?

The model decision is downstream of the use case. The two questions are: how big a model can fit in 12GB at the quantization you can tolerate, and what tokens-per-second feels responsive enough that the assistant stays in your workflow rather than out of it.

Community measurements across r/LocalLLaMA threads, llama.cpp benchmark sweeps, and ggml-org issue discussions converge on roughly the following picture for an RTX 3060 12GB. The numbers below are synthesized from community-published runs, not first-party measurements, and your results will vary with runner version, context length, batch size, and host CPU.

Model family | Params | Quantization | Approx. VRAM | Approx. tok/s | Best for
Llama 3.x Instruct | 8B | q4_K_M | ~5.5GB | 40-60 | General assistant, coding
Qwen 3 | 8B | q4_K_M | ~5GB | 45-65 | Multilingual, structured output
Mistral 7B Instruct | 7B | q4_K_M | ~5GB | 45-65 | Lean and fast for chat
14B-class (e.g., Qwen 3 14B, Phi-4) | 14B | q4_K_M | ~9-10GB | 20-30 | Better reasoning at lower throughput
DeepSeek | 14B (distilled) | q4_K_M | ~9.5GB | 18-28 | Coding heavy
Phi-3.x Mini | 3.8B | q4 | ~2.5GB | 80-110 | Snappy narrow jobs on the 3060 (far slower on the Pi)
Llama 3.x | 1B-3B | q4 | ~1-2GB | 5-15 at 1B, 1-4 at 3B (Pi 4 8GB) | Always-on Raspberry Pi tier

The pattern: 7B-class models give a genuinely useful generalist at 40-60 tok/s, which feels close to a cloud chatbot for short turns. 13B-14B-class trades half the throughput for better instruction following on harder prompts. The Pi tier exists for low-stakes always-on jobs — classification, routing, brief summarization — not as a replacement for the main box.

Quantization matrix for 12GB VRAM

Quantization is the lever that decides what fits and at what quality cost. The intuition: VRAM use scales roughly linearly with bits per weight, so halving the precision roughly halves the weight footprint, while the quality regression is small at first and grows non-linearly below q4. Community-published perplexity sweeps on llama.cpp and Hugging Face model cards consistently show that q4_K_M is the sweet spot for chat workloads and that going below q3 starts to produce visible degradation on instruction-following benchmarks.

Quant | Bits/weight | 7B VRAM | 13B VRAM | Approx. tok/s (7B, 3060) | Quality vs fp16
fp16 | 16.0 | ~13GB | ~26GB | N/A (does not fit) | Baseline
q8_0 | 8.5 | ~7.5GB | ~14GB | 30-40 | Near-baseline
q6_K | 6.6 | ~6GB | ~11GB | 35-50 | Very close to baseline
q5_K_M | 5.7 | ~5.2GB | ~9.5GB | 40-55 | Minor loss on hard prompts
q4_K_M | 4.8 | ~4.5GB | ~8GB | 45-65 | Sweet spot for chat
q3_K_M | 3.9 | ~3.6GB | ~6.5GB | 50-70 | Visible quality regression
q2_K | 2.6 | ~2.6GB | ~5GB | 55-75 | Often unusable for instruction following

The takeaway for a 12GB card: stay at q4 or q5 unless you have a specific reason to drop lower. Going to q8 buys very little perceptible quality on routine prompts and costs throughput. Going below q4 mostly serves cases where you absolutely need a 30B-class model to fit; for 7B and 13B work, q4_K_M is the default.
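
The fit question reduces to simple arithmetic: weight footprint is roughly parameters times bits per weight divided by eight. The sketch below uses the bits-per-weight column from the matrix above and an assumed 2.5GB reserve for KV cache and runtime overhead; that reserve and the function names are illustrative, and the results are estimates rather than measurements.

```python
# Bits per weight, taken from the quantization matrix above.
BITS_PER_WEIGHT = {
    "fp16": 16.0, "q8_0": 8.5, "q6_K": 6.6, "q5_K_M": 5.7,
    "q4_K_M": 4.8, "q3_K_M": 3.9, "q2_K": 2.6,
}


def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate weight footprint in GB (1e9 bytes), ignoring overhead."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def best_quant_that_fits(params_billion: float, vram_gb: float = 12.0,
                         reserve_gb: float = 2.5):
    """Return the highest-precision quant whose weights fit in the budget.

    reserve_gb is an assumed allowance for KV cache and runtime overhead.
    Returns None when even the smallest quant does not fit.
    """
    budget = vram_gb - reserve_gb
    for quant in sorted(BITS_PER_WEIGHT, key=BITS_PER_WEIGHT.get, reverse=True):
        if weight_gb(params_billion, quant) <= budget:
            return quant
    return None


for size in (7, 13, 30):
    print(f"{size}B -> {best_quant_that_fits(size)}")
```

With these assumptions a 7B model fits up to q8_0, a 13B model up to q5_K_M, and a 30B model does not fit at any listed quant without offloading. The function reports what fits, not what is advisable: as noted above, q4_K_M usually remains the better default because the extra precision costs throughput and context headroom.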

Prefill vs generation: why a private assistant feels snappy or sluggish

Two separate phases dominate perceived latency. Prefill — the time to process the prompt and fill the KV cache — scales with prompt length and is highly parallel; on a 3060 it is fast for short prompts and noticeably slower as context grows past 8k tokens. Generation — emitting tokens one at a time — is memory-bandwidth-bound and is the number quoted as "tok/s" in benchmarks.

Per the TechPowerUp specs for the GA106, the RTX 3060 has 360GB/s of memory bandwidth, which is the ceiling on generation speed at a given quantization. That ceiling is why a 7B q4 model on a 3060 lands in the 40-60 tok/s band rather than the hundreds-of-tok/s a higher-bandwidth card achieves.
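
A back-of-envelope version of that ceiling: generating one token requires reading roughly all of the model's weights once, so bandwidth divided by weight size bounds tokens per second. This sketch ignores KV cache reads, kernel overhead, and caching effects, so treat the result as an optimistic upper bound; the weight sizes are the approximate figures from the quantization matrix.

```python
def generation_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed when every token reads all weights once."""
    return bandwidth_gb_s / weights_gb


RTX_3060_BANDWIDTH_GB_S = 360.0  # per the TechPowerUp figure cited above

# Approximate q4_K_M weight sizes from the quantization matrix.
for label, size_gb in (("7B q4_K_M", 4.5), ("13B q4_K_M", 8.0)):
    ceiling = generation_ceiling_tok_s(RTX_3060_BANDWIDTH_GB_S, size_gb)
    print(f"{label}: at most ~{ceiling:.0f} tok/s")
```

The bound works out to about 80 tok/s for a 7B q4 model and about 45 tok/s for a 13B q4 model, which is consistent with the community-reported 45-65 and 20-30 tok/s bands sitting somewhat below the theoretical limit.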

For a private assistant the practical implications are: keep the system prompt short, use RAG instead of pasting full documents when possible, and accept that very long contexts (32k+) on a 12GB card will feel slow during prefill. For chat, code completion, and document Q&A with retrieval, the latency profile is comfortably within "snappy enough" for daily use.

Can a low-power Raspberry Pi 4 8GB host small models for always-on tasks?

Yes, with caveats. Per the Raspberry Pi Foundation specifications, the Raspberry Pi 4 Model B 8GB ships with a quad-core 1.5GHz Cortex-A72, LPDDR4-3200 system memory, gigabit Ethernet, and dual-band Wi-Fi. Community measurements on llama.cpp running on the Pi 4 8GB indicate that 1B-class models at q4 land in the 5-15 tok/s range, 3B-class models drop to 1-4 tok/s, and anything larger spills into swap and becomes unusable.

The right framing is the Pi as a low-power always-on agent host, not as a chat replacement. Use cases that work:

  • Classifying inbound email or RSS items by topic and routing to a queue.
  • Summarizing daily logs or news feeds during idle hours.
  • Running a small embedding model for a local document index that the main box queries.
  • Hosting a tiny model that answers narrow factual questions ("when is trash day", "what is on my calendar tomorrow") without spinning up the GPU.

The main 3060 box stays asleep until a job warrants it, the Pi handles always-on coordination at single-digit watts, and total system idle power stays low.
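
One way to picture that split is a small routing rule that runs on the Pi. The sketch below is purely illustrative: the task names, the 512-token threshold, and the `Job` and `pick_host` names are hypothetical, and a real setup would tune them to the models actually deployed.

```python
from dataclasses import dataclass

# Hypothetical task kinds that a 1B-3B model on the Pi can handle.
PI_TASKS = {"classify", "route", "summarize_feed", "embed", "narrow_qa"}


@dataclass
class Job:
    kind: str            # what the job asks for, e.g. "classify" or "chat"
    prompt_tokens: int   # rough size of the input


def pick_host(job: Job, pi_max_prompt_tokens: int = 512) -> str:
    """Keep small, narrow jobs on the Pi; send everything else to the GPU box."""
    if job.kind in PI_TASKS and job.prompt_tokens <= pi_max_prompt_tokens:
        return "pi"
    return "gpu_box"


print(pick_host(Job("classify", 200)))   # narrow and short: stays on the Pi
print(pick_host(Job("chat", 200)))       # open-ended chat: wake the main box
print(pick_host(Job("classify", 4000)))  # narrow but long: too slow on the Pi
```

The token cap matters because prompt processing on the Pi is slow; a long input that is nominally a "narrow" task can still be worth waking the main box for.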

Perf-per-dollar and perf-per-watt: one-time box vs ongoing cloud subscription

Headline numbers for a 2026 build, using the reference spec above and a US average electricity rate near $0.16/kWh:

  • Capital outlay: roughly $300-$450 for the main 3060 box (used GPU, used or new other parts) plus roughly $75 for the Pi 4 8GB. Total well under $600.
  • Idle power: roughly 30-50W for the main box at idle and 3-5W for the Pi. Continuous Pi operation costs single-digit dollars per year; intermittent GPU use adds a few dollars more.
  • Active power: the RTX 3060 reference is a 170W TDP card. Heavy use a few hours a day lands at single-digit dollars per month in incremental power.
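
The power figures above follow from watts times hours times the electricity rate. The sketch below assumes the $0.16/kWh rate used in this section, a 4W Pi running continuously, and the GPU drawing its full 170W TDP for three hours a day; the usage pattern is an assumption, and whole-system draw under load will be higher than the GPU alone.

```python
def energy_cost_usd(watts: float, hours: float, usd_per_kwh: float = 0.16) -> float:
    """Cost of drawing `watts` for `hours` at the given electricity rate."""
    return watts / 1000 * hours * usd_per_kwh


# Assumed usage: Pi always on at 4W; GPU at 170W for 3 hours a day, 30 days.
pi_per_year = energy_cost_usd(4, 24 * 365)
gpu_per_month = energy_cost_usd(170, 3 * 30)

print(f"Pi, always on: ${pi_per_year:.2f} per year")
print(f"GPU, 3 h/day:  ${gpu_per_month:.2f} per month")
```

Under those assumptions the Pi costs about $5.61 per year and the GPU about $2.45 per month, in line with the single-digit-dollar figures quoted above.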

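For daily users, the break-even on capital depends on the subscription being replaced: against a $20/month plan, a $300-$450 box with a few dollars of monthly power pays for itself in roughly one and a half to two and a half years, and proportionally faster against higher-priced tiers. For casual users (a handful of queries a week) a metered API may still be cheaper in dollar terms, but the local box also buys the privacy guarantee, which is the original reason for the build.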
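
The break-even point can be checked by dividing the capital outlay by the monthly saving, where the saving is the subscription price minus the local power bill. The sketch below assumes a $20/month subscription and $4/month of power; both inputs are illustrative and the result moves quickly as they change.

```python
import math


def break_even_months(capital_usd: float, subscription_usd_month: float,
                      power_usd_month: float):
    """Whole months until the box has paid for itself, or None if it never does."""
    saving = subscription_usd_month - power_usd_month
    if saving <= 0:
        return None  # local power costs as much as the subscription
    return math.ceil(capital_usd / saving)


# Assumed inputs: $20/month subscription, $4/month of electricity.
for capital in (300, 450):
    months = break_even_months(capital, 20, 4)
    print(f"${capital} build -> break-even after {months} months")
```

With those inputs a $300 build breaks even after 19 months and a $450 build after 29 months. Rerunning with a higher subscription price shortens the payback period sharply, which is why the tier being replaced matters as much as the hardware price.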

The perf-per-watt picture is less favorable than newer architectures: the GA106 is two generations old as of 2026, and current-generation cards deliver more tok/s per watt at higher capital cost. For users who care primarily about lowest-watt-per-token and have budget, a current-generation 16GB-class card is the next step up. For users who care primarily about lowest-dollar-to-private-assistant, the 3060 12GB remains the value floor.

Common pitfalls when building the private box

  • Buying an 8GB card to save money. 8GB forces either smaller models or aggressive quantization that hurts quality. The 4GB delta on the 3060 12GB is the difference between a useful assistant and a constant juggling act.
  • Skimping on system RAM. 16GB works until you try to offload a 13B model partially to CPU, at which point swap kills throughput. 32GB is the comfortable floor.
  • Slow storage. A SATA SSD adds tens of seconds to cold model loads. NVMe keeps the workflow fast.
  • Running fp16 because "quality." On a 12GB card, fp16 limits you to small models with no headroom. q4_K_M is the well-trodden default for a reason.
  • Treating the Pi as a replacement. The Pi is a companion for always-on narrow jobs; trying to host a 7B chat model on it will produce a frustrating experience.

When NOT to build a local box

A local box is the wrong answer if any of the following hold:

  • You need flagship-tier long-context reasoning (200k+ tokens of nuanced instruction following) for a small number of high-value queries. Cloud models still pull ahead there.
  • Your query volume is genuinely low — a few prompts a week — and the dollar math favors a metered API.
  • You are unwilling or unable to maintain the runner, manage model updates, and handle backups.
  • You need multi-user concurrent serving at scale; a single 3060 saturates fast under concurrent load.
  • You explicitly want the convenience of a hosted product that updates itself and works identically from any device.

The verdict

Stay on cloud if your query patterns are infrequent, you need the absolute frontier of reasoning capability on the hardest tasks, you do not want to maintain a runner and model library, and you are comfortable with the dossier-style memory model documented by The Decoder.

Build a local box if you query daily, you want documented privacy (no third party reads or stores your prompts), you can tolerate a 7B-13B-class assistant for routine work, and you have $300-$600 in build budget. The reference build — ZOTAC RTX 3060 Twin Edge 12GB or MSI RTX 3060 Ventus 2X 12G, Ryzen 7 5800X, 32GB DDR4, WD Blue SN550 1TB NVMe, 650W PSU — is the well-trodden value floor.

Go hybrid if your routine work is privacy-sensitive but you occasionally need frontier reasoning. Run a local model for daily drafting, code help, and summarization on private inputs; reserve cloud calls (ideally via metered API rather than a memory-enabled chatbot) for the small subset of tasks that genuinely require the frontier.

Bottom line

As of 2026, the cheapest defensible private-assistant build is a used RTX 3060 12GB, a Ryzen 7 5800X host, 32GB of DDR4, a WD Blue SN550 1TB NVMe, and a Raspberry Pi 4 Model B 8GB as the always-on companion. Run a 7B-to-14B-class open model at q4_K_M via Ollama or llama.cpp on the main box, run a 1B-3B model on the Pi for narrow always-on jobs, and the dossier problem documented by The Decoder becomes a non-issue because no third party ever sees the prompts. Total spend under $600, single-digit dollars per month in power, and your assistant's memory is yours.

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.


Frequently asked questions

Does a local LLM really keep my data private?
Yes. When you run inference on your own hardware with a local runner like Ollama or llama.cpp, your prompts never leave the machine. There is no telemetry to a vendor and no server-side memory store building a profile of you. The tradeoff is that you maintain the model, updates, and backups yourself, and you lose the convenience of a hosted product that just works in a browser.

What is the cheapest GPU that makes this practical?
The RTX 3060 12GB remains the value floor for local inference because the 12GB of VRAM lets you run useful 7B-to-14B-class models at q4 without offloading to system RAM. Cheaper 8GB cards force smaller models or heavier quantization that hurts quality. For a private assistant that has to feel responsive, the extra 4GB on the 3060 is the difference between fitting a model and constantly swapping.

Can I run an always-on assistant on a Raspberry Pi instead?
A Raspberry Pi 4 8GB can run small quantized models for narrow always-on tasks (summarizing, routing, light Q&A), but token throughput is modest and large models are out of reach. The sensible pattern is a Pi handling lightweight jobs around the clock while the RTX 3060 box wakes for heavier reasoning, keeping idle power draw low.

How much does the electricity cost compared with a subscription?
An RTX 3060 pulls roughly 170W under load and far less at idle, so an occasionally-used local box costs a few dollars a month in power. Against a $20/month subscription, a $300-$450 rig breaks even in roughly one and a half to two and a half years for daily users, and casual users who query infrequently may still find a metered API cheaper overall.

Do I lose quality going local versus ChatGPT?
For everyday drafting, coding help, and summarization, modern open models running locally are close enough that most users won't notice on routine tasks. Where flagship cloud models still pull ahead is long-context reasoning and the hardest agentic problems. A practical setup keeps a local model for private daily work and reserves cloud calls for the occasional task that genuinely needs the frontier.

— SpecPicks Editorial · Last verified 2026-06-05