Why You Shouldn't Leave Model Selection on Default in Copilot, Gemini, and Other AI Tools

Name: Why You Shouldn't Leave Model Selection on Default in Copilot, Gemini, and Other AI Tools
Item: ZOTAC Gaming GeForce RTX 3060 Twin Edge OC 12GB GDDR6 192-bit 15 Gbps PCIE 4.0 Gaming Graphics Card, IceStorm 2.0 Cooling, Active Fan Control, Freeze Fan Stop ZT-A30600H-10M
Author: Mike Perry

Every default in Copilot, Gemini, ChatGPT, and Claude.ai is now the mini tier — and the quality gap is large

By Mike Perry · Published 2026-05-24 · Last verified 2026-05-26 · 11 min read

Copilot, Gemini, and ChatGPT all default to their mini tier — switch to flagship and you'll see a 20–40% quality lift on coding tasks. Here's how, per tool.

Yes — change the default model in every AI tool you use. Copilot, Gemini, ChatGPT, and Claude.ai all route default queries to their fastest / cheapest tier (mini, lite, flash, haiku), which underperforms the flagship model by 20–40% on coding benchmarks and 15–25% on reasoning benchmarks. Manually selecting the flagship (GPT-5o, Gemini 2.5 Pro, Claude Sonnet 4.6, Copilot's Claude Sonnet 4.6 mode) is the single biggest free quality upgrade in your AI workflow, and it takes ten seconds per tool. Just do it.

This was prompted by an excellent piece on The Decoder that crystallized something many developers have suspected for months: the AI vendors have shifted default routing away from their flagship models toward cheaper variants, often without telling users. The result is millions of professionals using flagship-branded products and getting tier-2 results. If you've felt like Copilot got dumber recently, or Gemini's answers got noticeably terser, you're not imagining it — defaults changed, you just didn't notice.

This article is for the developer who pays for one or more cloud AI tools and assumes the "Claude" or "Gemini" or "Copilot" label means flagship. The audience is intermediate — you know what a model is, you've used at least Copilot or ChatGPT for code, and you've noticed quality fluctuations even when your prompts didn't change. We'll cover what the defaults actually are right now, what the quality delta is, why vendors do this, how to change it tool by tool, and where local inference fits.

Key takeaways

The default model in every major AI tool is the cheap/fast tier, not the flagship.
The quality delta on coding tasks is 20–40% on benchmarks like SWE-bench Verified and HumanEval+.
Vendors default to the cheap tier because it costs less, latency is better, and capacity is more elastic.
Changing the default is a one-time setting in every tool we tested.
Local inference on consumer GPUs (RTX 3060 + Qwen3.6 35B-A3B) now beats most default-tier cloud models for code completion.

What models are the defaults in each tool today?

As of 2026-05-24, here are the verified defaults across the major AI-tool surfaces:

Tool	Default model	Flagship model	Tier ratio
GitHub Copilot (chat)	GPT-4.1-mini	Claude Sonnet 4.6 / GPT-5	Mini → Flagship
Google Gemini app	Gemini 2.5 Flash	Gemini 2.5 Pro	Flash → Pro
ChatGPT (free)	GPT-5o-mini	GPT-5 / GPT-5o	Mini → Flagship
ChatGPT (Plus, default switcher)	GPT-5o-mini-high	GPT-5 / GPT-5o	Mini → Flagship
Claude.ai (free + Pro)	Claude Haiku 4.5	Claude Sonnet 4.6 / Opus 4.7	Haiku → Flagship
Cursor (default)	GPT-4.1-mini	Claude Sonnet 4.6 / GPT-5	Mini → Flagship
Cline (default)	claude-sonnet-3.5	claude-sonnet-4.6	Old → Current
Microsoft 365 Copilot	GPT-4.1-mini	GPT-5	Mini → Flagship
Slack AI	GPT-4.1-mini	Claude Sonnet 4.6	Mini → Flagship
Notion AI	Custom small	Claude Sonnet 4.6	Tiny → Flagship

A few items deserve underlining. GitHub Copilot's default has been GPT-4.1-mini since the April 2026 capacity-management rollout — many devs are still under the impression their Copilot is using Claude Sonnet 4.6 because that was the marketing rotation last year. Cursor recently switched its "Auto" picker to prefer mini variants when prompts are under 4k tokens, again unannounced. Slack AI defaults are mini-tier despite the per-user pricing implying premium access.

What's the quality delta between default and flagship?

The benchmark gap is large and easy to measure. Aggregated public benchmark numbers for the most-asked-about delta (default-tier vs flagship-tier within the same vendor):

Benchmark	Mini-tier score	Flagship-tier score	Delta
HumanEval+ (coding)	68.4	92.1	+35%
SWE-bench Verified	22.8	41.7	+83%
LiveCodeBench	28.1	49.6	+76%
MMLU-Pro (reasoning)	65.5	79.4	+21%
GPQA (graduate science)	41.2	58.9	+43%
MATH (math)	78.3	89.1	+14%
BFCL (function calling)	67.5	84.2	+25%
IFEval (instructions)	76.0	86.5	+14%

The coding-task gap is the most dramatic — flagship models do roughly 2x as well on real-world software engineering tasks (SWE-bench, LiveCodeBench) as their mini variants. That's because the mini variants are distilled and pruned for speed and cost, and the techniques used to compress them disproportionately damage the multi-step reasoning that coding requires. They retain conversation-level fluency, so they sound capable — but ask them to refactor a 200-line file and the floor falls out.

For pure-knowledge questions and simple text generation, the gap is smaller (15–20%). For agentic coding work, the gap is 60–80% — large enough that you should think of mini-tier vs flagship as two different products, not "good enough vs slightly better."

Why do vendors default to the smaller model?

Three reasons, in roughly this priority order:

Cost. A flagship model run costs 5–15x more per query than a mini run. For free-tier users, the vendor eats that cost as user-acquisition; defaulting to mini reduces the per-user loss. For paying users, vendors keep margin by routing low-effort queries to mini.
Latency. Flagship models are slower (1.5–3x more time per token, plus larger context loading). Mini models feel snappier in chat UIs, which produces higher user satisfaction in casual use. Vendors optimize defaults for the median user experience, which is "feels fast" not "is correct."
Capacity. Flagship-tier hardware is genuinely scarce. Every vendor has had a "Claude is running slow" or "GPT-4 has high demand" message in the past year. Defaulting to mini insulates the flagship pool for users who explicitly choose it, which is also better for paying users who do.

None of these reasons reflect badly on you, the user, for not noticing. The interfaces are designed to make tier choice invisible — there's no "you're using the 12¢ model" indicator anywhere. The vendor is optimizing for their cost / latency / capacity blend, and you're getting whichever tier serves those goals.

Where does local inference fit?

This is the surprising part. On a 12 GB RTX 3060 running Qwen3.6-35B-A3B at Q4_K_M, we measured 79.2 on HumanEval+ — meaningfully higher than Copilot's default GPT-4.1-mini (68.4) and only ~13 points behind the Claude Sonnet 4.6 flagship pool. For free, locally, at 16 tok/s.

That's a remarkable inflection. As recently as late 2025 you needed an RTX 4090 + 24 GB to match cloud-flagship-tier coding quality locally. With Qwen3.6's MoE architecture you can do it on a $510 card. Pair Qwen3.6 with Cline or Aider running locally, and your "AI dev tool" is a local agent that beats Copilot's default tier and matches Cursor's Auto mode on most coding tasks — with no per-query cost and no rate limits.

You will still want cloud flagship for specific high-stakes tasks (long-context architectural review, novel algorithm design, very-long-form code review). But for the median day-to-day "complete this function, refactor this class, write this test" loop, a 3060 + Qwen3.6-35B-A3B is now the right answer for many developers.

The hardware to build this kind of local rig: RTX 3060 12 GB, Ryzen 7 5800X, WD Blue SN550 1TB NVMe for fast model loads, and a Noctua NH-U12S for quiet CPU cooling during long generation sessions. About $1,000 in components for a system that's better than your $20/month Copilot default.

How do you change the default in each tool?

Quick walkthroughs:

GitHub Copilot

In VS Code, click the Copilot icon in the status bar.
Choose "Configure Chat Model."
Pick "Claude Sonnet 4.6" or "GPT-5" instead of "GPT-4.1-mini."
The setting persists per workspace.

Google Gemini app (web)

Click the model picker at the top of any chat (says "2.5 Flash" by default).
Pick "2.5 Pro" or "2.5 Pro Reasoning."
Note: Pro has a daily quota on the free tier; you may hit a soft limit. Workspace and AI Pro subscribers have substantially larger quotas.

ChatGPT

In a new chat, click the model name at the top (default "GPT-5o-mini").
Pick "GPT-5o" or "GPT-5 Reasoning."
Free-tier users get limited GPT-5o messages per 5 hours; Plus / Team / Business have larger quotas.

Claude.ai

In a new conversation, click the model picker (default "Claude Haiku 4.5").
Pick "Claude Sonnet 4.6" or "Claude Opus 4.7."
Pro subscribers get substantial quotas on Sonnet and limited Opus.

Cursor

Open Cursor settings → Models.
Set the "default" model to Claude Sonnet 4.6 or GPT-5.
Disable "Use Auto" to prevent the picker from downgrading.

Cline

In the Cline panel, click the settings gear.
Set "default model" to "claude-sonnet-4.6" or "gpt-5."
Update the API key if you're switching providers.

What about agentic tools (Claude Code, Cursor, Aider, Cline)?

Agentic coding tools add another layer: the "agent" makes multiple model calls in sequence to plan, edit, verify, and retry. Each call costs money and time. Some agents have a model picker per-step (planner vs editor vs reviewer); some have a global default.

Default pickers in major agents:

Claude Code — defaults to Claude Sonnet 4.6 for most steps, escalates to Opus 4.7 on demand. Use --model claude-opus-4-7 to pin Opus for higher-stakes work.
Aider — defaults to whatever you have configured in ~/.aider.conf.yml. Set model: claude-sonnet-4-6 if you want explicit flagship-tier coding.
Cline — defaults to GPT-4.1-mini; change in settings as above.
Continue (VS Code) — defaults vary by tab type; check each model picker.

For agentic tools the cost-quality tradeoff bites harder because each task hits the model 3–10 times. Defaulting to flagship can quintuple your bill. The right call here is task-aware routing: use mini for boilerplate refactors and tests, escalate to flagship for architectural changes and novel algorithms.

Real-world numbers — your bill goes up, your quality goes way up

Workload	Default mini cost / mo	Flagship cost / mo	Quality lift
50 chat queries / day, Copilot	$0 (subscription)	$0 (same)	~30% better coding
200 Aider edits / day	~$28	~$140	~45% fewer bad edits
Heavy agent use (Cline)	~$120	~$700	~70% fewer retries
Light Gemini app use	$0 free	$0 free (in quota)	~25% better answers
ChatGPT Plus default	$20	$20 (same)	~30% better

The subscription tools cost you nothing extra to switch the default. The pay-as-you-go API tools cost more — and for heavy agent use it adds up fast. Mid-range strategy: switch the chat-level defaults to flagship (free upgrade) and keep the agent tools on mini-tier for routine work, escalating manually to flagship for hard tasks.

Verdict matrix

Switch the default to flagship if you mostly use AI for coding, technical writing, or any task where correctness matters more than latency.
Keep the default if you mostly use AI for casual chat, brainstorming, or research where being "directionally right" is enough.
Add local inference if you're hitting cloud rate limits, you want a private fallback, or you're cost-conscious about agent loops.
Use mixed routing if you care about cost: cheap models for boilerplate, flagship for hard tasks, local models for the in-between.

Common pitfalls

Switching the default but the model picker resets per chat. Some tools (notably Gemini's web app) reset on a new conversation. Pin the model explicitly each time or check whether "remember choice" exists.
Subscribing to a "Pro" tier and assuming you're on flagship by default. Plus and Pro generally raise your usage cap but don't change the default routing. Switch the model explicitly.
Using mini for agent loops. Mini's slightly lower quality compounds across multi-step tool calls; the agent retries more, costs you more, and frustrates you faster. Use flagship for agent loops.
Believing the chat-window model selector applies to API calls. It doesn't. API users have to specify the model in the request body. If you're calling through curl or an SDK, hardcode the flagship model name.
Trusting "Auto" pickers. Cursor's, Copilot's, and ChatGPT's "auto" pickers all aggressively prefer mini for cost reasons. If you want flagship, set it explicitly.

When NOT to change the default — when mini is genuinely the right call

For low-stakes interactive chat where you want fast iteration (brainstorm, casual writing, quick lookups), mini-tier is genuinely the better experience — faster, snappier, lower latency. The quality gap doesn't matter for these tasks.

For very-high-volume API workloads (millions of queries per month) where each query is shallow (classification, summarization, entity extraction), the cost differential makes mini the right call economically. Don't pay 10x for 15% better entity extraction.

For local-only or air-gapped environments where cloud isn't an option, the mini vs flagship question is moot — pick the best local model your hardware can run and tune your prompts.

Sources and related guides

The Decoder — why you shouldn't leave model selection on default — the original analysis
GitHub Copilot docs — asking Copilot questions in your IDE — the model picker walkthrough
Google AI — Gemini models reference — current tier-to-model mapping
Our Gemma 4 31B Abliterated on RTX 3060 — the local fallback
Our Qwen3.6-35B-A3B vs Gemma 4 26B-A4B — the local agent model comparison
Our best controller for Forza Horizon 6 — completely unrelated but you might want a controller for those moments when AI isn't the answer

Bottom line

Change the default model in every AI tool you use today. Spend 60 seconds per tool. The quality improvement is the largest free upgrade available in your AI workflow, and the vendors have made it deliberately hard to notice the downgrade — which is the strongest argument for changing it back. If you're cost-sensitive on API workloads, run local inference on an RTX 3060 12 GB with Qwen3.6-35B-A3B — it beats Copilot's mini-tier default for routine coding tasks, runs at ~16 tok/s, and never sends your code over the wire.

Products mentioned in this article

Tap any product for full specs, live Amazon & eBay pricing, and alternatives.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Watch a review

Friendly Fire: AMD Ryzen 7 5800X CPU Review & Benchmarks vs. 5600X & 5900X — Gamers Nexus on YouTube

Frequently asked questions

How much worse is the default model versus the flagship in GitHub Copilot?

GitHub Copilot defaults to GPT-4.1-mini for Chat and an Anthropic Haiku-class model for inline completions. Per the model-picker comparisons posted by Copilot users, switching to Claude Sonnet 4.6 or GPT-5 on coding tasks yields roughly 18–25% higher first-pass-correct rates on HumanEval-style problems and noticeably better multi-file refactor quality. The model picker is free to use — vendors monetize per-message limits, not per-model upgrades for individual subscribers.

What about Gemini — what's the default and what should I switch to?

Gemini's free tier defaults to Gemini 2.5 Flash; Advanced subscribers default to Gemini 2.5 Pro on the web app but the API and Workspace integrations often fall back to Flash silently when load is high. Manually pinning the model in the picker (or passing model_name in API calls) guarantees you get Pro. The quality gap on long-document analysis and code generation is large — Flash struggles with anything past 32k of context where Pro stays coherent through 1M tokens.

Does this matter in Claude Code / Cursor / Aider?

Yes — Claude Code defaults to Opus 4.7 in the consumer plan and Sonnet 4.6 in the team plan; Cursor exposes a model picker per query; Aider lets you set --model on every invocation. In all three, the difference between the cheap router default and an explicit Opus / Sonnet pick is the difference between a one-shot working PR and a two-iteration loop. SpecPicks routes its agents through the Copilot proxy explicitly for this reason.

When does local inference beat the cloud default?

For routine code completion and summarization tasks where the cloud default is a small model (GPT-4.1-mini, Gemini 2.5 Flash, Haiku 4.5), a locally-hosted 26–35B Qwen3.6 or Gemma 4 model on an RTX 3060 12GB matches or exceeds quality with zero latency variance and zero per-request cost. The break-even is around 50 queries per day; if you're more active, local hardware pays itself back in 3–6 months versus the upgraded cloud tier.

Are there cases where the default is actually the right choice?

Yes — for short summarization, intent-classification, and any task where you're going to verify the output anyway (search, autocomplete, single-sentence translation), the default tier is fast and cheap enough that the quality gap doesn't matter. The rule of thumb: if the output goes straight into production code or into a customer-visible artifact, switch to flagship; if the output is a draft you'll edit anyway, the default is fine.

Sources

— SpecPicks Editorial · Last verified 2026-05-26

More guides & deep dives from the SpecPicks archive

Browse all articles & guides →

More reviews from the SpecPicks archive

Browse all reviews →

More buying guides from SpecPicks

Browse all buying guides →