LM Studio runs cleanly on a 12GB RTX 3060 in 2026. With CUDA wheels installed and a quantised 7-14B GGUF loaded, expect 35-50 tok/s on Llama 3.1 8B q4_K_M and 20-25 tok/s on Qwen 3 14B q4_K_M — fast enough for interactive chat, coding completions and small RAG. Here is the exact setup and the numbers you should see.
What this guide gets you
A working LM Studio install on a 12GB RTX 3060 system, with one chat model loaded, one coding model on standby, the local OpenAI-compatible server running on port 1234, and a published expectation for what tokens-per-second you should see on each common model. Setup time is roughly 20 minutes if you already have an NVIDIA driver installed.
Why LM Studio on a 12GB card is the easy on-ramp
LM Studio is the closest thing local LLMs have to a "double-click installer" experience in 2026. It ships GGUF-only, bundles llama.cpp under the hood, exposes a local OpenAI-compatible API, and has a Hugging Face model browser built in. On a 3060 12GB, the CUDA backend is the only one you need to touch — Vulkan is the fallback for non-NVIDIA boxes.
The card itself — the same MSI Ventus 2X or ZOTAC Twin Edge sold all through 2022-2026 — is well characterised: TechPowerUp lists it at 192-bit GDDR6, 360 GB/s, 170W TGP. That bandwidth is the practical ceiling on tok/s for memory-bound chat workloads.
Step 1 — host prerequisites
- GPU driver: NVIDIA 550+ on Linux, 560+ on Windows. Older drivers work but skip the newer compute-capability paths.
- CUDA runtime: LM Studio bundles its own CUDA libs; you do not need a system-wide CUDA toolkit install.
- System RAM: 16GB minimum, 32GB recommended. Spillover for larger models hits system RAM hard.
- Storage: 100GB free for a starter model collection. NVMe is strongly preferred — see Best SSD for Local LLM Model Storage in 2026.
- CPU: anything from a Ryzen 5 3600 era onward is fine; offload performance is the only thing CPU choice influences for inference.
Quick sanity check before you install anything:
You want to see your 3060 listed and the driver version. If nvidia-smi fails, fix the driver before you go further.
Step 2 — install LM Studio
Download the official installer from lmstudio.ai — Linux AppImage, Windows .exe, or macOS .dmg. The Linux AppImage runs without root.
On first launch:
- Skip the optional telemetry prompts unless you specifically want to share usage.
- In Settings → Hardware, confirm the CUDA runtime is selected. The dropdown should list your 3060 with the GPU memory total.
- Set GPU layers to a sensible default — leave at "auto" for now; we will override per-model.
- Set flash attention to "on". This costs nothing and saves ~10% KV-cache memory at 4K context on the 3060.
Step 3 — pull the right starter models
Open the Discover tab and search Hugging Face. The starter pack for a 12GB card:
| Use | Model | Quant | Disk | VRAM |
|---|---|---|---|---|
| Chat | Llama 3.1 8B Instruct | q4_K_M | ~4.6 GB | ~7 GB |
| Coding | Qwen 2.5 Coder 7B | q4_K_M | ~4.6 GB | ~7 GB |
| Big reasoning | Qwen 3 14B | q4_K_M | ~8.4 GB | ~11.5 GB |
| Summarisation | Phi-3 Mini 3.8B | q5_K_M | ~2.8 GB | ~4 GB |
| Embeddings | bge-large-en | f16 | ~1.3 GB | ~2 GB |
Pull them one at a time. The Discover tab shows you exact disk size before you click download — useful for the 100GB starter SSD budget.
Step 4 — load Llama 3.1 8B and verify throughput
In the Chat tab:
- Click the model selector dropdown at the top, pick Llama 3.1 8B Instruct q4_K_M.
- In the right-hand load dialog set:
- GPU offload layers: max (LM Studio will fit as many as possible)
- Context length: 8192
- Flash attention: on
- Batch size: 512
- Hit Load. Loading takes 8-12 seconds on an NVMe SSD; ~25-40 seconds from a SATA drive.
Once loaded, send a short prompt ("Write a haiku about the RTX 3060") and watch the bottom-right stats. You should see:
- Time to first token: 0.3-0.5 s
- Tokens per second: 35-50 tok/s
If you see materially less than 35 tok/s, the most common cause is layers not fully offloaded to GPU. Open the load dialog again and confirm "GPU offload layers" matches the model's total layer count (32 for Llama 3.1 8B).
Step 5 — load Qwen 3 14B at the 12GB ceiling
Same flow, with Qwen 3 14B q4_K_M selected and context length set to 4096 (not 8K — the KV cache at 8K + the larger weights will not fit cleanly on a 12GB card).
Expected numbers:
- Time to first token: 0.6-0.9 s
- Tokens per second: 20-25 tok/s
This is the practical ceiling for resident 14B chat on a 12GB card. If you see numbers significantly below 20 tok/s, you are almost certainly hitting partial offload. Drop the context length to 2048 and try again — if throughput jumps back into the 20-25 zone, you were over the VRAM budget.
Step 6 — enable the OpenAI-compatible server
LM Studio's killer feature for builders is the local server. Open the Developer tab and click Start Server.
You now have an OpenAI-compatible endpoint at http://localhost:1234/v1. Hit it from any OpenAI SDK by swapping the base URL — Aider, Continue, Roo Code, Codex CLI, OpenWebUI, AnythingLLM all "just work" against this endpoint.
A trivial curl test:
The response should land in under 2 seconds for a one-token answer.
What tok/s you should actually see — published reference
These are the numbers from LM Studio's own benchmark notes and community measurements on r/LocalLLaMA's monthly 3060 throughput threads. They are not first-party — use them as a calibration target, not a guarantee.
| Model | Quant | Context | Expected tok/s on 3060 12GB |
|---|---|---|---|
| Llama 3.1 8B Instruct | q4_K_M | 8K | 35-50 |
| Llama 3.1 8B Instruct | q5_K_M | 8K | 30-42 |
| Llama 3.1 8B Instruct | q8_0 | 8K | 22-30 |
| Qwen 2.5 Coder 7B | q4_K_M | 8K | 38-52 |
| Qwen 3 14B | q4_K_M | 4K | 20-25 |
| Phi-3 Medium 14B | q4 | 4K | 22-28 |
| Mistral Nemo 12B | q4_K_M | 8K | 24-30 |
If you are 30%+ below any of these numbers, walk through the troubleshooting section.
Common pitfalls when running LM Studio on a 3060 12GB
- Desktop session eating 1-2GB. Linux KDE/GNOME or a Windows desktop with a couple of browser tabs eats GPU memory before LM Studio loads anything. Run headless via SSH if you can, or use integrated graphics for the desktop and the 3060 only for inference.
- Partial offload. If LM Studio shows you 28/32 layers offloaded, you are paying a 4-8x throughput penalty on those 4 layers. Reduce context length or pick a smaller quant.
- Wrong flash-attention setting on older models. A few Llama 1 / Llama 2 era GGUFs do not support flash attention. If load fails, toggle it off and reload.
servermode dropping responses. If you see the server hang on long prompts, check the load-dialog "n_predict" and "max tokens" caps — LM Studio's default 4K answer cap will truncate longer agent loops silently.- Loading a fp16 GGUF "to be safe". fp16 takes 2-4x the VRAM of q4 and is rarely worth the quality bump for chat. Stay on q4_K_M / q5_K_M unless you have a measurable need.
Real-world workflows that fit
- Local chat replacement for ChatGPT free tier. Llama 3.1 8B q5 + 8K context handles 90% of casual questions.
- Inline code completions in VS Code (Continue). Qwen 2.5 Coder 7B q4 against the LM Studio server, 4K context.
- Aider / Codex CLI pair-programming. Qwen 3 14B q4 at 4K context, or a remote-hosted Qwen 2.5 Coder 32B if 4K is too tight.
- RAG over a personal note vault. bge-large embeddings + Llama 3.1 8B chat, served from the same LM Studio instance.
Troubleshooting the most common slowdowns
If you load a 7-8B model on your 3060 12GB and see less than 30 tok/s, walk through this checklist:
- Check GPU offload layer count. The load dialog should report all 32 layers (Llama 3.1 8B) offloaded. Anything less and CPU is running residual layers at 5-10x slower than GPU.
- Look at VRAM pressure. Open
nvidia-smi -l 1in a terminal while loading. If usage climbs to 11.5+ GB and stalls, you are over the budget — reduce context length or quant. - Confirm flash attention is on. A 5-10% throughput hit and 10-15% extra KV memory on long contexts.
- Verify the model is GGUF, not GPTQ or AWQ. LM Studio is GGUF-first; other quant formats fall back to slower paths.
- Check Windows power profile. "Balanced" caps the 3060 around 130W; switch to "High performance" for the full 170W.
- Update the NVIDIA driver. Driver 550+ on Linux / 560+ on Windows includes CUDA 12.4 optimisations that matter for the 3060's GA106 silicon.
If none of those move the needle, the most common deeper issue is PCIe slot allocation. Some B450/B550 boards drop the primary GPU slot from x16 to x8 when an NVMe drive is populated in M.2_2. PCIe Gen3 x8 still works fine for inference, but if the slot has been mis-detected as x4 or x1, throughput collapses. Check via nvidia-smi -q | grep "PCI link width".
Real workflows that fit comfortably
A solid one-week LM Studio rotation on a 3060 12GB:
- Daily driver chat: Llama 3.1 8B q5_K_M, 8K context, flash attention on. ~35-45 tok/s.
- Inline VS Code autocomplete via Continue: Qwen 2.5 Coder 7B q4_K_M, 4K context, called via the LM Studio OpenAI-compatible server. <300 ms first-token latency.
- Long-form coding via Aider: Qwen 3 14B q4_K_M, 4K context. ~22 tok/s, fits in 11.5 GB.
- Weekend RAG over personal note vault: bge-large-en embeddings (loaded server-side) + Llama 3.1 8B for query expansion. Combined ~10 GB VRAM.
- Saturday image generation: unload chat model, load SDXL Turbo via ComfyUI. The 3060 handles 1024px renders in 12-18 seconds.
That rotation covers the practical 12GB ceiling without ever pushing the card past the swap-thrash threshold.
Updating LM Studio and your model library
LM Studio ships fast — expect minor updates every 2-4 weeks during 2026. The update flow:
- Settings → Software updates → Check now.
- If a new llama.cpp backend is available, accept the prompt to update; older GGUFs continue to work.
- After major updates, re-test your most-used model loads — occasionally a flash-attention or KV-cache flag changes and your existing throughput numbers shift.
For model library hygiene, prune unused GGUFs monthly. Hugging Face hosts the originals; you can always re-download. Keeping a 1TB drive at 60-70% full is a meaningful uplift over a 90%+ full drive.
When LM Studio is the wrong tool
- Multi-GPU tensor parallel. LM Studio's CUDA backend does single-GPU well. For dual-3060 or stacked 3090 builds, drop to llama.cpp or vLLM directly.
- Production agent serving multiple users. LM Studio is single-user friendly; vLLM is the choice for concurrency.
- Headless servers. Possible, but LM Studio expects a desktop session; on pure servers prefer Ollama or llama.cpp's
llama-server.
Related guides on SpecPicks
- Ollama vs llama.cpp vs vLLM on an RTX 3060
- Best Coding LLM on RTX 3060 12GB + 32GB RAM in 2026
- Is 12GB VRAM Still Enough for Local LLMs in 2026?
- Best GPU for Local LLMs Under $300
- Best SSD for Local LLM Model Storage in 2026
Citations and sources
- LM Studio official site and docs — supported runtimes, CUDA bundle, OpenAI-compatible server endpoint.
- TechPowerUp — GeForce RTX 3060 specs — memory bus, bandwidth, TGP used in throughput math.
- llama.cpp project — GGUF and quantisation notes — quant naming, KV-cache flags, flash-attention support matrix.
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
