The clean 2026 self-hosted local AI stack is Open-WebUI as the front end, Ollama as the model server, and an RTX 3060 12 GB as the inference card. Per community benchmarks on r/LocalLLaMA, the stack hosts 7-14B models at usable speeds with daily-driver reliability. This piece walks through the spec sheet, the quant choices, the install pattern, and the honest tradeoffs.
Why this specific stack
Three reasons this combination keeps showing up in community recommendations:
- Ollama is the easiest mature model server. It wraps llama.cpp, manages model downloads through a registry, and exposes a REST API that everything else integrates with. Setup is one command.
- Open-WebUI is the polished front end for Ollama. It supports chat, RAG over uploaded documents, web search integration, and tool use - the full feature set a self-hosted user wants.
- The RTX 3060 12 GB is the sweet spot card. Cheaper than 16 GB cards, dramatically more capable than 8 GB cards, and the 12 GB of VRAM hosts the 7-14B model class that delivers genuinely useful daily chat.
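To make the "REST API that everything else integrates with" point concrete, here is a minimal Python sketch of a non-streaming call to Ollama's `/api/generate` endpoint. It assumes Ollama is listening on its default port (11434) on the same machine; the helper names `build_generate_request` and `generate` are hypothetical, chosen for this illustration.

```python
import json
import urllib.request

# Assumption: Ollama is reachable on its default port on this machine.
OLLAMA_URL = "http://localhost:11434"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming POST to /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("llama3.1:8b", "Say hello in five words.")` should return the model's reply as a string, provided that tag (an example, not a requirement) has already been pulled.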
Key takeaways
- 12 GB of VRAM is enough for daily 14B-class local LLM use at q4_K_M quantization.
- Ollama + Open-WebUI is the consensus stack on r/LocalLLaMA for first-time self-hosters.
- A complete rig at ~$1,000 pays back against cloud AI spend within about a year only for heavy users; against a single $20/month subscription the monthly saving over electricity is small.
- NVMe storage matters for model swap latency, not steady-state inference speed.
- 8 GB GPU variants are not adequate; spend the extra $40-60 for the 12 GB SKU.
The hardware build
A clean self-hosted stack:
| Component | Specification | Approx cost |
|---|---|---|
| GPU | MSI RTX 3060 Ventus 2X 12G | $300 |
| CPU | Ryzen 7 5800X | $200 |
| Primary SSD (model store) | WD Blue SN550 1 TB NVMe | $70 |
| Secondary SSD (boot/logs) | Crucial BX500 1 TB SATA | $60 |
| Motherboard | B550 mid-tier ATX | $130 |
| RAM | 32 GB DDR4-3600 | $80 |
| PSU | 650 W 80+ Gold | $80 |
| Case + fans | mid-tower with good airflow | $80 |
| Total | | ~$1,000 |
Adding a CPU cooler ($35-70, needed for the 5800X to behave) lands the total closer to $1,050. For builders who already have a desktop, only the GPU + NVMe need to be added (~$370).
Why 12 GB matters specifically
The 12 GB threshold is where most modern open-source LLM workflows become unconstrained. Below it, you choose between model size, context length, and additional features (vision encoders, embedders). At 12 GB, a 14B q4_K_M model fits with an 8K context and leaves room for everything else.
| VRAM | Practical model ceiling | Reasonable use |
|---|---|---|
| 4 GB | 3B q4 | basic chat only |
| 6 GB | 7B q4_0 | chat, no big context |
| 8 GB | 7B q4_K_M / 8B q4 | daily chat, short documents |
| 12 GB | 14B q4_K_M, 8K context | daily driver tier |
| 16 GB | 14B q6 or 24B q3 | quality bump, no big leap |
| 24 GB | 32B q4_K_M, 4K context | small leap to higher quality |
| 48 GB+ | 70B q4 | frontier-adjacent |
The interesting takeaways from this curve: 8 GB to 12 GB is the most impactful single jump. 12 GB to 16 GB is small. 24 GB to 48 GB is large but expensive. The 12 GB RTX 3060 sits on the right side of the most-impactful boundary.
Model picks that work well on the stack
Community recommendations from r/LocalLLaMA threads, tested on RTX 3060 12 GB hardware:
| Model | Quant | VRAM | Use case | Notes |
|---|---|---|---|---|
| Llama 3.1 8B | q4_K_M | ~5.5 GB | general chat | strong default |
| Qwen 2.5 14B | q4_K_M | ~9.5 GB | chat + reasoning | best general 14B |
| Qwen 2.5 Coder 14B | q4_K_M | ~9.5 GB | code generation | tool-use friendly |
| DeepSeek-Coder-V2 16B | q4_K_M | ~10.5 GB | code | tight but works |
| Mistral Small 22B | q3_K_S | ~10.5 GB | reasoning | very tight, lower quant |
| Llama 3.1 8B Instruct | q5_K_M | ~6.5 GB | quality chat | slower but cleaner |
| Nomic Embed | f16 | ~0.5 GB | embeddings | RAG-pair model |
Pair a 14B chat model with a small embed model and you have a complete chat + RAG stack on a single 12 GB card.
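The pairing claim is a simple VRAM budget. The sketch below sums the approximate figures from the table above and reports what is left on a 12 GB card. The function name `headroom_gb` is hypothetical, and the remaining headroom still has to cover KV cache growth and runtime buffers.

```python
# Approximate VRAM figures (GB) copied from the model table above.
MODEL_VRAM_GB = {
    "Llama 3.1 8B q4_K_M": 5.5,
    "Qwen 2.5 14B q4_K_M": 9.5,
    "Qwen 2.5 Coder 14B q4_K_M": 9.5,
    "DeepSeek-Coder-V2 16B q4_K_M": 10.5,
    "Mistral Small 22B q3_K_S": 10.5,
    "Llama 3.1 8B Instruct q5_K_M": 6.5,
    "Nomic Embed f16": 0.5,
}


def headroom_gb(models: list[str], card_gb: float = 12.0) -> float:
    """VRAM left over if every listed model is resident at the same time."""
    return card_gb - sum(MODEL_VRAM_GB[name] for name in models)


# Chat + embedding pair: about 2 GB left on a 12 GB card.
print(headroom_gb(["Qwen 2.5 14B q4_K_M", "Nomic Embed f16"]))
# Two 14B models at once: negative headroom, so they cannot both stay loaded.
print(headroom_gb(["Qwen 2.5 14B q4_K_M", "Qwen 2.5 Coder 14B q4_K_M"]))
```

A negative result means the models have to be swapped rather than kept resident together.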
Performance benchmark synthesis
Per benchmarks published on r/LocalLLaMA and the Ollama Discord:
| Model | Quant | Prompt tok/s | Gen tok/s | Realistic turn latency (8K context) |
|---|---|---|---|---|
| Llama 3.1 8B | q4_K_M | ~1100 | ~60 | ~14 s |
| Qwen 2.5 14B | q4_K_M | ~600 | ~28 | ~26 s |
| Qwen 2.5 Coder 14B | q4_K_M | ~600 | ~28 | ~26 s |
| Mistral Small 22B | q3_K_S | ~480 | ~22 | ~32 s |
| Llama 3.1 8B | q5_K_M | ~900 | ~50 | ~17 s |
For interactive chat, the 8B q4_K_M model is the responsiveness sweet spot. For quality work, the 14B q4_K_M models are worth the longer turn latency.
Software install pattern
The clean install workflow on Ubuntu 24.04:
- Install NVIDIA driver 550+ via the official Ubuntu repository.
- Install Docker Engine with NVIDIA container toolkit.
- Pull Ollama: `docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama`
- Pull Open-WebUI: `docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main`
- Open `http://localhost:3000` in a browser. Create an admin account. Pull a model from the Open-WebUI UI.
- Verify GPU utilization with `nvidia-smi` while a model is generating.
Total setup time: 15-30 minutes on a fresh Ubuntu install. The model download (4-9 GB per model) typically dominates the wall-clock time.
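As a complement to watching `nvidia-smi`, the GPU check can also be scripted against Ollama itself. The sketch below assumes the `/api/ps` endpoint (list of currently loaded models) reports `size` and `size_vram` in bytes for each model, which is my understanding of the Ollama API rather than something stated in this piece. The helper names and the 99 percent threshold are hypothetical.

```python
import json
import urllib.request

# Assumption: default port published by the docker command in the install steps.
OLLAMA_URL = "http://localhost:11434"


def gpu_fraction(running_model: dict) -> float:
    """Share of a loaded model's memory that sits in VRAM (1.0 = fully on GPU)."""
    size = running_model.get("size", 0)
    return running_model.get("size_vram", 0) / size if size else 0.0


def fetch_running_models(timeout: float = 10.0) -> list[dict]:
    """Ask Ollama which models are currently loaded."""
    with urllib.request.urlopen(f"{OLLAMA_URL}/api/ps", timeout=timeout) as response:
        return json.loads(response.read()).get("models", [])


def report(models: list[dict]) -> list[str]:
    """One human-readable line per loaded model."""
    lines = []
    for m in models:
        frac = gpu_fraction(m)
        if frac >= 0.99:
            status = "GPU"
        elif frac > 0:
            status = "partial offload"
        else:
            status = "CPU only"
        lines.append(f"{m.get('name', '?')}: {frac:.0%} in VRAM ({status})")
    return lines
```

Running `print("\n".join(report(fetch_running_models())))` while a chat is generating should show the model fully in VRAM; a "CPU only" line points to a driver or container toolkit problem.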
Quantization choice for 14B models
| Quant | VRAM | Tok/s on 3060 | Quality (vs fp16) | Use case |
|---|---|---|---|---|
| q2_K | ~5 GB | ~35 | -20 to -30 percent | avoid |
| q3_K_M | ~7 GB | ~32 | -10 to -15 percent | fits comfortably, lossy |
| q4_K_M | ~9 GB | ~28 | -3 to -6 percent | the right pick |
| q5_K_M | ~10 GB | ~24 | -1 to -3 percent | quality-first |
| q6_K | ~11.5 GB | ~21 | -1 percent | almost no headroom |
| q8_0 | ~14 GB | does not fit | - | needs 16 GB+ card |
q4_K_M is the consistent recommendation. Bumping to q5_K_M gives a small quality improvement but reduces the context budget meaningfully. q6_K is technically possible but leaves no room for the KV cache to grow.
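The VRAM column can be roughly reproduced from parameter count and bits per weight. The sketch below uses approximate bits-per-weight values for llama.cpp quant types; these are rounded assumptions, not figures from this piece, and real GGUF files vary by model. The 1.5 GB reserve for KV cache and buffers is likewise a hypothetical default.

```python
# Approximate bits per weight for llama.cpp quant types (assumed, rounded).
BITS_PER_WEIGHT = {
    "q3_K_M": 3.9,
    "q4_K_M": 4.85,
    "q5_K_M": 5.7,
    "q6_K": 6.56,
    "q8_0": 8.5,
}


def weight_size_gb(params_billion: float, quant: str) -> float:
    """Estimated size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion: float, quant: str, card_gb: float = 12.0,
         reserve_gb: float = 1.5) -> bool:
    """True if the weights leave `reserve_gb` free for KV cache and buffers."""
    return weight_size_gb(params_billion, quant) + reserve_gb <= card_gb


for quant in BITS_PER_WEIGHT:
    size = weight_size_gb(14, quant)
    print(f"{quant}: ~{size:.1f} GB, fits with reserve: {fits(14, quant)}")
```

Under these assumptions a nominal 14B model at q4_K_M and q5_K_M clears the reserve, while q6_K and q8_0 do not, which lines up with the table's recommendation.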
Prefill vs generation profiling
A typical chat turn: 2-4K tokens of prompt (system instructions + chat history + new user message), 300-800 tokens of model response. Consumer GPUs handle this profile well because prefill is much faster than generation.
On the RTX 3060 with a 14B q4_K_M model, prefill rates land near 600 tok/s versus 28 tok/s generation. A 3K-token prompt processes in about 5 seconds; the 500-token response that follows takes about 18 seconds. Roughly 23 seconds round trip is a realistic per-turn figure for this profile.
For RAG workloads with longer prompts (5-8K tokens after document retrieval), prefill grows but generation still takes the larger share: at 8K tokens, about 13 seconds of prefill plus 18 seconds of generation for a 500-token response, roughly 31 seconds in total.
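The arithmetic behind these per-turn figures is tokens divided by rate for each phase. A minimal sketch, using the 600 tok/s prefill and 28 tok/s generation rates quoted above as defaults (the function name `turn_latency_s` is hypothetical):

```python
def turn_latency_s(prompt_tokens: int, response_tokens: int,
                   prefill_tok_s: float = 600.0,
                   gen_tok_s: float = 28.0) -> dict:
    """Split one chat turn into prefill and generation time, in seconds."""
    prefill = prompt_tokens / prefill_tok_s
    generation = response_tokens / gen_tok_s
    return {
        "prefill": prefill,
        "generation": generation,
        "total": prefill + generation,
    }


chat = turn_latency_s(prompt_tokens=3000, response_tokens=500)
rag = turn_latency_s(prompt_tokens=8000, response_tokens=500)
print({k: round(v, 1) for k, v in chat.items()})
print({k: round(v, 1) for k, v in rag.items()})
```

Swapping in the rates for another model from the benchmark table gives a quick estimate of how a change of model or prompt length affects responsiveness.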
Context length impact
A 14B q4_K_M model with 8K context uses roughly 9.5 GB at idle. Stretching to 16K context pushes VRAM near 11 GB, with the KV cache accounting for most of the growth. Push past 16K and the model no longer fits in the card's 12 GB, so generation speed drops sharply.
The practical move: keep context at 8K, improve retrieval quality so the relevant chunks fit cleanly rather than dumping more raw context at the model.
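Why context length eats VRAM: the KV cache stores one key and one value vector per layer for every token in the context, so it grows linearly with context length. The sketch below uses illustrative architecture values for a 14B-class model with grouped-query attention (48 layers, 8 KV heads, head dimension 128, 16-bit cache). These numbers are assumptions for the example, not taken from a specific model card.

```python
def kv_cache_gib(context_tokens: int, layers: int = 48, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB: keys and values for every layer and every token."""
    # Factor of 2 covers the separate key and value tensors.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 2**30


for ctx in (8192, 16384, 32768):
    print(f"{ctx} tokens: {kv_cache_gib(ctx):.1f} GiB of KV cache")
```

With these assumed values, doubling the context from 8K to 16K adds about 1.5 GiB, in line with the jump from roughly 9.5 GB to 11 GB described above.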
Local vs cloud economic comparison
| Dimension | RTX 3060 12 GB local | ChatGPT Plus / Claude Pro |
|---|---|---|
| Monthly cost | electricity (~$15) | $20 |
| Annual cost | ~$180 | $240 |
| Per-token cost | ~$0.0004 per 1K | bundled |
| Privacy | full | provider-dependent |
| Model choice | any open-weight model | provider's models only |
| Reasoning depth | 14B class | frontier |
| Setup time | ~30 minutes | instant |
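The local rig wins on privacy and on flexibility. The cloud subscriptions win on reasoning depth and on instant readiness. On cost, the table's figures put the saving against a single subscription at about $5 a month ($60 a year), so the ~$370 GPU + NVMe spend pays back inside 12 months only for builders whose daily AI workloads would otherwise need several subscriptions or metered API usage.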
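How quickly the rig pays back depends mostly on how much monthly cloud spend it replaces. A minimal sketch using the table's ~$15/month electricity figure; the $60/month case is a hypothetical heavier user (for example several subscriptions or metered API usage), not a number from this piece.

```python
import math


def payback_months(upfront_usd: float, cloud_usd_per_month: float,
                   electricity_usd_per_month: float = 15.0) -> float:
    """Months to recover the hardware spend; inf if there is no monthly saving."""
    saving = cloud_usd_per_month - electricity_usd_per_month
    return upfront_usd / saving if saving > 0 else math.inf


# GPU + NVMe upgrade (~$370) against one $20/month subscription.
print(payback_months(370, 20))
# Same upgrade against a hypothetical $60/month of cloud spend.
print(round(payback_months(370, 60), 1))
```

The result is very sensitive to the gap between cloud spend and electricity, so it is worth plugging in your own numbers before buying.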
Storage choice matters - here is why
A 1 TB NVMe drive for the model store is not about steady-state inference speed. Once loaded, the model lives in VRAM and the SSD is idle. NVMe matters for cold-start time - loading a 9 GB model file into RAM takes ~5 seconds on NVMe versus ~30 seconds on SATA.
For builders who swap between multiple models per session, that delta multiplies: at five model swaps per day, NVMe saves about 2 minutes daily. For pure single-model users, the Crucial BX500 SATA SSD is a perfectly adequate budget pick.
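The swap-time saving is easy to recompute for a different usage pattern. The sketch below takes the ~5 second NVMe and ~30 second SATA load times quoted above as defaults; the function names are hypothetical.

```python
def implied_throughput_gb_s(model_size_gb: float, load_s: float) -> float:
    """Effective read throughput implied by a measured load time."""
    return model_size_gb / load_s


def daily_saving_min(swaps_per_day: int, nvme_load_s: float = 5.0,
                     sata_load_s: float = 30.0) -> float:
    """Minutes saved per day by loading models from NVMe instead of SATA."""
    return swaps_per_day * (sata_load_s - nvme_load_s) / 60


print(round(implied_throughput_gb_s(9, 5), 1))   # NVMe case, GB/s
print(round(implied_throughput_gb_s(9, 30), 1))  # SATA case, GB/s
print(round(daily_saving_min(5), 1))             # minutes saved at 5 swaps/day
```

At one or two swaps a day the saving shrinks to well under a minute, which is why the SATA drive remains a reasonable choice for single-model users.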
Common pitfalls
- Running both Ollama and Open-WebUI as host services rather than Docker containers. Works but harder to upgrade cleanly.
- Pulling too many models. The 1 TB store fills fast at 5-10 GB per model.
- Skipping the GPU verification step. First-time setups occasionally end up running on CPU when the NVIDIA driver isn't loaded properly. Confirm with `nvidia-smi` during generation.
- Using a SATA SSD for the model store. Works fine for steady-state but adds ~25 seconds per model swap.
- Trying to run frontier-class 70B models. Will not work. Pick a model class the GPU can host.
When to skip self-hosting
Use a cloud subscription if your usage is bursty or low-volume, if you need frontier reasoning depth for one-off complex tasks, if you cannot tolerate occasional setup-and-maintenance burden, or if your privacy needs are met by the provider's terms. The local rig wins on volume, on privacy-critical workloads, and on long-term cost economics.
Bottom line
Open-WebUI plus Ollama on an RTX 3060 12 GB is the 2026 sweet spot for self-hosted local AI. Pair it with a Ryzen 7 5800X, a 1 TB NVMe drive for the model store, and a secondary 1 TB SATA SSD for boot and logs. The complete build lands near $1,000 and runs 14B-class models at usable speeds with full daily-driver reliability.
Citations and sources
- Open-WebUI on GitHub - canonical project repository and documentation.
- Ollama official website - canonical model server documentation and model registry.
- TechPowerUp - GeForce RTX 3060 specifications - GPU specifications reference.
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
