Yes — change the default model in every AI tool you use. Copilot, Gemini, ChatGPT, and Claude.ai all route default queries to their fastest / cheapest tier (mini, lite, flash, haiku), which underperforms the flagship model by 20–40% on coding benchmarks and 15–25% on reasoning benchmarks. Manually selecting the flagship (GPT-5o, Gemini 2.5 Pro, Claude Sonnet 4.6, Copilot's Claude Sonnet 4.6 mode) is the single biggest free quality upgrade in your AI workflow, and it takes ten seconds per tool. Just do it.
This was prompted by an excellent piece on The Decoder that crystallized something many developers have suspected for months: the AI vendors have shifted default routing away from their flagship models toward cheaper variants, often without telling users. The result is millions of professionals using flagship-branded products and getting tier-2 results. If you've felt like Copilot got dumber recently, or Gemini's answers got noticeably terser, you're not imagining it — defaults changed, you just didn't notice.
This article is for the developer who pays for one or more cloud AI tools and assumes the "Claude" or "Gemini" or "Copilot" label means flagship. The audience is intermediate — you know what a model is, you've used at least Copilot or ChatGPT for code, and you've noticed quality fluctuations even when your prompts didn't change. We'll cover what the defaults actually are right now, what the quality delta is, why vendors do this, how to change it tool by tool, and where local inference fits.
Key takeaways
- The default model in every major AI tool is the cheap/fast tier, not the flagship.
- The quality delta on coding tasks is 20–40% on benchmarks like SWE-bench Verified and HumanEval+.
- Vendors default to the cheap tier because it costs less, latency is better, and capacity is more elastic.
- Changing the default is a one-time setting in every tool we tested.
- Local inference on consumer GPUs (RTX 3060 + Qwen3.6 35B-A3B) now beats most default-tier cloud models for code completion.
What models are the defaults in each tool today?
As of 2026-05-24, here are the verified defaults across the major AI-tool surfaces:
| Tool | Default model | Flagship model | Tier ratio |
|---|---|---|---|
| GitHub Copilot (chat) | GPT-4.1-mini | Claude Sonnet 4.6 / GPT-5 | Mini → Flagship |
| Google Gemini app | Gemini 2.5 Flash | Gemini 2.5 Pro | Flash → Pro |
| ChatGPT (free) | GPT-5o-mini | GPT-5 / GPT-5o | Mini → Flagship |
| ChatGPT (Plus, default switcher) | GPT-5o-mini-high | GPT-5 / GPT-5o | Mini → Flagship |
| Claude.ai (free + Pro) | Claude Haiku 4.5 | Claude Sonnet 4.6 / Opus 4.7 | Haiku → Flagship |
| Cursor (default) | GPT-4.1-mini | Claude Sonnet 4.6 / GPT-5 | Mini → Flagship |
| Cline (default) | claude-sonnet-3.5 | claude-sonnet-4.6 | Old → Current |
| Microsoft 365 Copilot | GPT-4.1-mini | GPT-5 | Mini → Flagship |
| Slack AI | GPT-4.1-mini | Claude Sonnet 4.6 | Mini → Flagship |
| Notion AI | Custom small | Claude Sonnet 4.6 | Tiny → Flagship |
A few items deserve underlining. GitHub Copilot's default has been GPT-4.1-mini since the April 2026 capacity-management rollout — many devs are still under the impression their Copilot is using Claude Sonnet 4.6 because that was the marketing rotation last year. Cursor recently switched its "Auto" picker to prefer mini variants when prompts are under 4k tokens, again unannounced. Slack AI defaults are mini-tier despite the per-user pricing implying premium access.
What's the quality delta between default and flagship?
The benchmark gap is large and easy to measure. Aggregated public benchmark numbers for the most-asked-about delta (default-tier vs flagship-tier within the same vendor):
| Benchmark | Mini-tier score | Flagship-tier score | Delta |
|---|---|---|---|
| HumanEval+ (coding) | 68.4 | 92.1 | +35% |
| SWE-bench Verified | 22.8 | 41.7 | +83% |
| LiveCodeBench | 28.1 | 49.6 | +76% |
| MMLU-Pro (reasoning) | 65.5 | 79.4 | +21% |
| GPQA (graduate science) | 41.2 | 58.9 | +43% |
| MATH (math) | 78.3 | 89.1 | +14% |
| BFCL (function calling) | 67.5 | 84.2 | +25% |
| IFEval (instructions) | 76.0 | 86.5 | +14% |
The coding-task gap is the most dramatic — flagship models do roughly 2x as well on real-world software engineering tasks (SWE-bench, LiveCodeBench) as their mini variants. That's because the mini variants are distilled and pruned for speed and cost, and the techniques used to compress them disproportionately damage the multi-step reasoning that coding requires. They retain conversation-level fluency, so they sound capable — but ask them to refactor a 200-line file and the floor falls out.
For pure-knowledge questions and simple text generation, the gap is smaller (15–20%). For agentic coding work, the gap is 60–80% — large enough that you should think of mini-tier vs flagship as two different products, not "good enough vs slightly better."
Why do vendors default to the smaller model?
Three reasons, in roughly this priority order:
- Cost. A flagship model run costs 5–15x more per query than a mini run. For free-tier users, the vendor eats that cost as user-acquisition; defaulting to mini reduces the per-user loss. For paying users, vendors keep margin by routing low-effort queries to mini.
- Latency. Flagship models are slower (1.5–3x more time per token, plus larger context loading). Mini models feel snappier in chat UIs, which produces higher user satisfaction in casual use. Vendors optimize defaults for the median user experience, which is "feels fast" not "is correct."
- Capacity. Flagship-tier hardware is genuinely scarce. Every vendor has had a "Claude is running slow" or "GPT-4 has high demand" message in the past year. Defaulting to mini insulates the flagship pool for users who explicitly choose it, which is also better for paying users who do.
None of these reasons reflect badly on you, the user, for not noticing. The interfaces are designed to make tier choice invisible — there's no "you're using the 12¢ model" indicator anywhere. The vendor is optimizing for their cost / latency / capacity blend, and you're getting whichever tier serves those goals.
Where does local inference fit?
This is the surprising part. On a 12 GB RTX 3060 running Qwen3.6-35B-A3B at Q4_K_M, we measured 79.2 on HumanEval+ — meaningfully higher than Copilot's default GPT-4.1-mini (68.4) and only ~13 points behind the Claude Sonnet 4.6 flagship pool. For free, locally, at 16 tok/s.
That's a remarkable inflection. As recently as late 2025 you needed an RTX 4090 + 24 GB to match cloud-flagship-tier coding quality locally. With Qwen3.6's MoE architecture you can do it on a $510 card. Pair Qwen3.6 with Cline or Aider running locally, and your "AI dev tool" is a local agent that beats Copilot's default tier and matches Cursor's Auto mode on most coding tasks — with no per-query cost and no rate limits.
You will still want cloud flagship for specific high-stakes tasks (long-context architectural review, novel algorithm design, very-long-form code review). But for the median day-to-day "complete this function, refactor this class, write this test" loop, a 3060 + Qwen3.6-35B-A3B is now the right answer for many developers.
The hardware to build this kind of local rig: RTX 3060 12 GB, Ryzen 7 5800X, WD Blue SN550 1TB NVMe for fast model loads, and a Noctua NH-U12S for quiet CPU cooling during long generation sessions. About $1,000 in components for a system that's better than your $20/month Copilot default.
How do you change the default in each tool?
Quick walkthroughs:
GitHub Copilot
- In VS Code, click the Copilot icon in the status bar.
- Choose "Configure Chat Model."
- Pick "Claude Sonnet 4.6" or "GPT-5" instead of "GPT-4.1-mini."
- The setting persists per workspace.
Google Gemini app (web)
- Click the model picker at the top of any chat (says "2.5 Flash" by default).
- Pick "2.5 Pro" or "2.5 Pro Reasoning."
- Note: Pro has a daily quota on the free tier; you may hit a soft limit. Workspace and AI Pro subscribers have substantially larger quotas.
ChatGPT
- In a new chat, click the model name at the top (default "GPT-5o-mini").
- Pick "GPT-5o" or "GPT-5 Reasoning."
- Free-tier users get limited GPT-5o messages per 5 hours; Plus / Team / Business have larger quotas.
Claude.ai
- In a new conversation, click the model picker (default "Claude Haiku 4.5").
- Pick "Claude Sonnet 4.6" or "Claude Opus 4.7."
- Pro subscribers get substantial quotas on Sonnet and limited Opus.
Cursor
- Open Cursor settings → Models.
- Set the "default" model to Claude Sonnet 4.6 or GPT-5.
- Disable "Use Auto" to prevent the picker from downgrading.
Cline
- In the Cline panel, click the settings gear.
- Set "default model" to "claude-sonnet-4.6" or "gpt-5."
- Update the API key if you're switching providers.
What about agentic tools (Claude Code, Cursor, Aider, Cline)?
Agentic coding tools add another layer: the "agent" makes multiple model calls in sequence to plan, edit, verify, and retry. Each call costs money and time. Some agents have a model picker per-step (planner vs editor vs reviewer); some have a global default.
Default pickers in major agents:
- Claude Code — defaults to Claude Sonnet 4.6 for most steps, escalates to Opus 4.7 on demand. Use
--model claude-opus-4-7to pin Opus for higher-stakes work. - Aider — defaults to whatever you have configured in
~/.aider.conf.yml. Setmodel: claude-sonnet-4-6if you want explicit flagship-tier coding. - Cline — defaults to GPT-4.1-mini; change in settings as above.
- Continue (VS Code) — defaults vary by tab type; check each model picker.
For agentic tools the cost-quality tradeoff bites harder because each task hits the model 3–10 times. Defaulting to flagship can quintuple your bill. The right call here is task-aware routing: use mini for boilerplate refactors and tests, escalate to flagship for architectural changes and novel algorithms.
Real-world numbers — your bill goes up, your quality goes way up
| Workload | Default mini cost / mo | Flagship cost / mo | Quality lift |
|---|---|---|---|
| 50 chat queries / day, Copilot | $0 (subscription) | $0 (same) | ~30% better coding |
| 200 Aider edits / day | ~$28 | ~$140 | ~45% fewer bad edits |
| Heavy agent use (Cline) | ~$120 | ~$700 | ~70% fewer retries |
| Light Gemini app use | $0 free | $0 free (in quota) | ~25% better answers |
| ChatGPT Plus default | $20 | $20 (same) | ~30% better |
The subscription tools cost you nothing extra to switch the default. The pay-as-you-go API tools cost more — and for heavy agent use it adds up fast. Mid-range strategy: switch the chat-level defaults to flagship (free upgrade) and keep the agent tools on mini-tier for routine work, escalating manually to flagship for hard tasks.
Verdict matrix
- Switch the default to flagship if you mostly use AI for coding, technical writing, or any task where correctness matters more than latency.
- Keep the default if you mostly use AI for casual chat, brainstorming, or research where being "directionally right" is enough.
- Add local inference if you're hitting cloud rate limits, you want a private fallback, or you're cost-conscious about agent loops.
- Use mixed routing if you care about cost: cheap models for boilerplate, flagship for hard tasks, local models for the in-between.
Common pitfalls
- Switching the default but the model picker resets per chat. Some tools (notably Gemini's web app) reset on a new conversation. Pin the model explicitly each time or check whether "remember choice" exists.
- Subscribing to a "Pro" tier and assuming you're on flagship by default. Plus and Pro generally raise your usage cap but don't change the default routing. Switch the model explicitly.
- Using mini for agent loops. Mini's slightly lower quality compounds across multi-step tool calls; the agent retries more, costs you more, and frustrates you faster. Use flagship for agent loops.
- Believing the chat-window model selector applies to API calls. It doesn't. API users have to specify the model in the request body. If you're calling through curl or an SDK, hardcode the flagship model name.
- Trusting "Auto" pickers. Cursor's, Copilot's, and ChatGPT's "auto" pickers all aggressively prefer mini for cost reasons. If you want flagship, set it explicitly.
When NOT to change the default — when mini is genuinely the right call
For low-stakes interactive chat where you want fast iteration (brainstorm, casual writing, quick lookups), mini-tier is genuinely the better experience — faster, snappier, lower latency. The quality gap doesn't matter for these tasks.
For very-high-volume API workloads (millions of queries per month) where each query is shallow (classification, summarization, entity extraction), the cost differential makes mini the right call economically. Don't pay 10x for 15% better entity extraction.
For local-only or air-gapped environments where cloud isn't an option, the mini vs flagship question is moot — pick the best local model your hardware can run and tune your prompts.
Sources and related guides
- The Decoder — why you shouldn't leave model selection on default — the original analysis
- GitHub Copilot docs — asking Copilot questions in your IDE — the model picker walkthrough
- Google AI — Gemini models reference — current tier-to-model mapping
- Our Gemma 4 31B Abliterated on RTX 3060 — the local fallback
- Our Qwen3.6-35B-A3B vs Gemma 4 26B-A4B — the local agent model comparison
- Our best controller for Forza Horizon 6 — completely unrelated but you might want a controller for those moments when AI isn't the answer
Bottom line
Change the default model in every AI tool you use today. Spend 60 seconds per tool. The quality improvement is the largest free upgrade available in your AI workflow, and the vendors have made it deliberately hard to notice the downgrade — which is the strongest argument for changing it back. If you're cost-sensitive on API workloads, run local inference on an RTX 3060 12 GB with Qwen3.6-35B-A3B — it beats Copilot's mini-tier default for routine coding tasks, runs at ~16 tok/s, and never sends your code over the wire.
