Yes, you can run a useful text-to-SQL model locally on a 12GB GPU as of 2026. A 7B SQL-specialist quantized to Q4_K_M lands around 5GB of VRAM and serves 50–80 tokens per second on an RTX 3060 12GB, leaving room for a multi-thousand-token schema. A 13B at Q4 fits closer to 9–10GB and runs 25–35 tok/s. Frontier hosted systems still win on complex joins, but a 12GB card covers most analyst workloads.
Why analysts want on-prem text-to-SQL after Gemini-SQL2
Google's Gemini-SQL2 announcement in 2026 lit up text-to-SQL benchmark leaderboards again, and every data team you talk to is asking the same follow-up: do we have to send our schema to a hosted API to get those numbers? For a regulated bank, a hospital data warehouse, or any analytics team whose customers table has PII, the answer has to be no. The schema itself is sensitive — column names like applicant_ssn_last4 or prior_auth_denial_reason describe the business in ways legal teams do not want leaking into a vendor's training pipeline, even with an enterprise no-train flag.
That is why the open-weight text-to-SQL stack matters more in 2026 than it did a year ago. Defog's SQLCoder family, NaturalSQL, Snowflake's Arctic Text2SQL-R1, and a wave of 7B and 13B fine-tunes have all caught up enough to handle daily analyst work without leaving the building. The question for the hardware-conscious reader is whether you need a $1,600 RTX 4090 or whether the same RTX 3060 12GB that has anchored value PC builds for years is enough.
It is. With sensible quantization, a ZOTAC Gaming GeForce RTX 3060 12GB paired with an AMD Ryzen 7 5700X and a fast NVMe like the WD Blue SN550 1TB NVMe SSD runs a 13B SQL specialist at usable latency. This guide walks through which models fit, how accurate they actually are versus Gemini-SQL2, where the 12GB ceiling bites, and when you should still reach for a hosted model. The Zotac and MSI 3060 cards remain the bargain of the local-LLM era because of that 12GB pool — and text-to-SQL is one of the cleanest examples of why those four extra gigabytes matter.
Key takeaways
- A 7B SQL-tuned model at Q4_K_M needs roughly 5GB of VRAM and serves 50–80 tok/s on an RTX 3060 12GB.
- A 13B SQL-tuned model at Q4_K_M needs roughly 9–10GB of VRAM and serves 25–35 tok/s on the same card.
- Schema-in-prompt is the dominant cost: 40 tables can mean 4–8k context tokens, and prefill is what makes your first token feel slow.
- Open SQL specialists beat GPT-3.5-turbo on Spider and close most of the gap to GPT-4 class on internal join benchmarks; Gemini-SQL2 still leads on hard multi-table reasoning.
- Break-even versus Gemini-SQL2 pricing (~$0.002 per 1k tokens) lands around 50 queries a day for a small team.
- vLLM with Marlin kernels gives the smoothest single-user latency on a 3060; llama.cpp is the simplest deploy.
Step 0 — diagnose your schema size first
Before you buy any hardware, count tables. The first decision is not which model to download, it is whether your schema fits in context at all. A 7B local model is enough when your working schema is roughly 20 tables or fewer and your queries usually touch one to three joins. That is the sweet spot where Defog SQLCoder-7B and NaturalSQL-7B routinely match analyst-written SQL on internal tests.
If your warehouse exposes 200 tables and analysts roam across all of them, no 12GB local model is going to be reliable without a retrieval step in front. You either need schema-link retrieval (more on that below) to trim the prompt to the 5–10 relevant tables, or you stay on a hosted frontier model that has the context budget and reasoning depth to handle the whole graph. Pretending a 13B at Q4 will plan a six-way join across 200 tables sets you up for silently wrong SQL.
A useful rule of thumb as of 2026: if your prompt budget (system + schema + few-shot + user question) stays under 4,000 tokens, you have lots of headroom on a 12GB card. Under 8,000 is workable with KV cache attention. Above that, plan for retrieval.
What Gemini-SQL2 actually claimed, and the open-weight gap
The Gemini-SQL2 benchmark numbers Google published — see the Google Research blog — are headline-grabbing because they push exact-match execution accuracy on BIRD-bench into the high 70s, with strong gains on multi-hop joins and ambiguous column references. That is a genuine step up from the prior generation. It is also a hosted, closed-weights specialist that costs roughly $0.002 per 1k tokens through the Gemini API, with the usual caveats about data residency and training opt-outs.
The honest summary of the open-weight gap as of 2026 looks like this:
- On Spider (single-database, school-textbook style), well-tuned 7B SQL specialists hit 75–80% execution accuracy, roughly matching GPT-3.5-turbo and within a few points of GPT-4-class performance.
- On BIRD-bench (real-world warehouses with messy column names), open 13B models land in the high 50s to low 60s, several points behind Gemini-SQL2.
- On internal multi-table benchmarks with 6+ way joins, the gap widens. Local models hallucinate join keys; frontier hosted models infer them.
Translation: for the 80% of analyst queries that read like "show me orders by region last quarter where status was shipped," a 7B local model is fine. For the 20% that involve nested CTEs, window functions, and cross-database reasoning, you should still expect to fall back to a hosted model or to a human analyst.
Which open text-to-SQL models fit a 12GB card
Almost all of the interesting open SQL specialists today come in 7B, 13B, or 32B sizes. The 32Bs do not fit on a 12GB card even at aggressive quantization — they belong on a 24GB RTX 3090 or 4090. The 7B and 13B classes are the relevant universe for a 3060.
| Model | Params | Quant | VRAM (model only) | Fits with 8k schema? |
|---|---|---|---|---|
| Defog SQLCoder-7B-2 | 7B | Q4_K_M | ~4.8GB | Yes, comfortably |
| NaturalSQL-7B | 7B | Q5_K_M | ~5.6GB | Yes |
| Arctic Text2SQL-R1-7B | 7B | Q4_K_M | ~5.0GB | Yes |
| SQLCoder-13B | 13B | Q4_K_M | ~9.0GB | Yes, tight |
| SQLCoder-13B | 13B | Q5_K_M | ~10.4GB | Marginal — drop schema or quant |
| Arctic Text2SQL-R1-32B | 32B | Q4_K_M | ~20GB | No (needs 24GB+) |
The numbers above assume a single user and a KV cache budget of 1–2GB for context up to 8k tokens. Push to 16k context and the 13B Q4 starts to spill — either step down to Q4_0 or trim the schema.
For framing, the RTX 3060 12GB TechPowerUp spec sheet confirms the 12GB GDDR6 pool and 360GB/s memory bandwidth that gates throughput at these sizes. That bandwidth, not raw FLOPs, is what makes the 3060 a sane local-LLM card.
Why 12GB matters more here than for image gen
Image generation pipelines tend to load a fixed-size model and run it many times. Text-to-SQL is the opposite: the model is small but the prompt is huge because the whole schema goes in. Every kilobyte of schema you add eats KV cache. An 8GB card can technically load a Q4 7B SQL model, but it runs out of room the moment you paste a real warehouse schema. The 12GB card on the MSI GeForce RTX 3060 Ventus 2X 12GB is the difference between "I can demo this" and "I can put this in front of analysts."
Quantization matrix: what gives, what breaks
Quantization trades VRAM for accuracy. For a 13B SQL specialist the curve looks roughly like this in 2026:
| Quant | VRAM (13B) | Spider exec acc (approx) | Verdict on 12GB card |
|---|---|---|---|
| Q3_K_S | ~6.5GB | 71% | Fits easily, accuracy drop is noticeable |
| Q4_0 | ~7.5GB | 75% | Good balance, lots of room for context |
| Q4_K_M | ~9.0GB | 77% | Recommended default |
| Q5_K_M | ~10.4GB | 78% | Squeezed; drop context |
| Q6_K | ~11.6GB | 79% | Will OOM with any real schema |
| Q8_0 | ~14GB | 79% | Does not fit |
The practical takeaway: Q4_K_M is the sweet spot for a 13B on a 3060. The accuracy delta to Q6 or Q8 on SQL tasks is in the single percentage points, and you would happily trade those points for the 2–5GB of headroom that buys you a larger schema and a faster KV cache.
Prefill vs generation on long schema prompts
This is the part most local-LLM guides skip and the part that matters most for text-to-SQL. Generation throughput (the 50–80 tok/s or 25–35 tok/s numbers) describes how fast the model emits SQL tokens after it has read your prompt. Prefill describes how long it takes to read the prompt in the first place.
On an RTX 3060 with a 13B Q4 model, prefill runs at roughly 800–1,200 tokens per second. So an 8,000-token schema prompt takes around 7–10 seconds before you see your first SQL keyword. The actual SQL output is usually 50–200 tokens, which generates in 2–6 seconds. The user perceives the whole query as 10–15 seconds, and most of that is prefill.
That has two consequences. First, KV cache reuse is huge — if you send the same schema 20 times in a session and your inference server caches it, the second through twentieth queries feel instant. Both llama.cpp and vLLM support this. Second, schema trimming via retrieval pays for itself fast: cutting an 8k prompt to a 2k prompt cuts prefill from 8 seconds to 2.
vLLM vs llama.cpp on the 3060
For a single-user analyst tool, vLLM with Marlin kernels delivers about 10–15% better tokens/second than llama.cpp at the same quantization on the 3060, and it handles continuous batching better if you ever scale to a small team. llama.cpp wins on simplicity — one binary, GGUF files, runs anywhere — and is the right choice for a personal box. If you are putting this behind an internal API, run vLLM.
Context-length math: fitting 40 tables
A typical CREATE TABLE statement with 10 columns, descriptive names, and a few comments runs about 150–250 tokens. Forty tables is therefore 6,000–10,000 tokens of schema. Add a system prompt (200 tokens), a few-shot of three example queries (1,500 tokens), and the user question (50 tokens), and you are at 8,000–12,000 tokens before the model generates a single character.
On a 12GB card running a 13B Q4 model, you have realistically 2–2.5GB of VRAM for KV cache after the model loads. That is enough for 8–12k tokens of context depending on attention layout. The conclusion: you can fit a 40-table schema, but you are at the edge. At 80 tables you must either prune the schema, swap to a 7B (more VRAM left for context), or run an embedding-based retrieval step that picks the 5–10 most likely relevant tables before the prompt is built.
Perf-per-dollar versus Gemini-SQL2
This is where the local rig wins for high-volume teams. A reasonable build looks like:
- ZOTAC RTX 3060 12GB: ~$280 as of 2026
- AMD Ryzen 7 5700X: ~$160
- 32GB DDR4: ~$70
- WD Blue SN550 1TB NVMe: ~$60
- Motherboard, PSU, case, cooler: ~$280
That is roughly $850 for a complete on-prem text-to-SQL box. At Gemini-SQL2 pricing of about $0.002 per 1k tokens and a typical query (schema + question + answer) of 8,000 tokens, you spend about $0.016 per query through the API. Fifty queries a day for a year is roughly $290 in API spend. A hundred queries a day for a year is about $580.
So the break-even is not "use local for everything." It is: if your team runs more than about 50 schema-heavy queries per day, the hardware pays for itself inside a year, and every subsequent year is essentially free except for power. If you run fewer than that, the API is cheaper than amortizing the box.
Verdict matrix
Run it local if:
- Your schema or sample rows contain regulated data (HIPAA, GDPR, SOC2 scope).
- You run more than 50 schema-heavy queries per day across the team.
- Your schema fits comfortably in 20–40 tables or you have a retrieval layer.
- You can tolerate occasional join hallucinations and you validate generated SQL before execution.
Stay on a hosted model if:
- Query volume is low (a handful per day) and there is no data-residency rule.
- Your workload regularly involves 200+ tables, six-way joins, or window functions over CTEs.
- You need the absolute top of the benchmark, not 90% of it.
- You do not have anyone on the team who can babysit an inference server.
A hybrid is often the right answer in 2026: local SQLCoder-13B Q4 on a 3060 for the 80% of bread-and-butter queries, with a Gemini-SQL2 fallback button for the gnarly ones.
Related guides
- Gemini-SQL2 vs Local Text-to-SQL on an RTX 3060
- vLLM vs llama.cpp for Single-User on RTX 3060 12GB
- Per-LLM Model Hardware Requirements Guide
- OpenAI Codex Price War and the Case for Local Coding on a 3060
Bottom line
A 12GB GPU runs useful text-to-SQL today, and an RTX 3060 12GB paired with a Ryzen 7 5700X is the cheapest sensible way to get there as of 2026. A 7B SQL specialist at Q4 leaves headroom for an 8k-token schema and hits 50–80 tok/s; a 13B at Q4 sits at the edge of the 12GB pool but matches GPT-3.5-class accuracy on Spider. You will lose to Gemini-SQL2 on six-way joins across a 200-table warehouse, and you should treat every generated query as a draft to validate. For schema-sensitive teams running tens to hundreds of queries a day, the local box pays for itself inside a year, and your data never leaves the building.
