Every LLM app you write can pretend it's calling OpenAI. LiteLLM makes that true by routing OpenAI-shaped requests to whatever backend you actually want.
Why you want this
- One API shape in your code.
model="gpt-4o"vsmodel="claude-opus-4"vsmodel="ollama/qwen3:32b"— same client library, same request format. - Rate limiting + spend tracking in one place.
- Fallback routing: primary fails → secondary picks up.
- Cost routing: cheap questions hit Ollama, expensive questions hit Claude.
Install
Create config.yaml:
Now any OpenAI client pointed at http://localhost:4000 with your master key works — regardless of which backend answers.
What this unlocks
Dev/prod parity: write code against gpt-4o, test against local-qwen for free while developing. Flip one config line for production.
Spend alerts: LiteLLM's built-in Prometheus exporter lets you alert on "$50 of Claude Opus in the last hour."
Multi-tenant: give different teams different master keys, meter each one independently.
Related
Production config — multi-provider with fallbacks and spend caps
A hardened config.yaml that our SpecPicks pipeline runs in production (minus secrets):
Key decisions:
- Postgres for spend persistence. Without it, restarts wipe your spend history.
- Redis for caching. 300-second TTL matches Anthropic's
ephemeralcache window. - Fallback ladder goes cloud → cloud → local. Primary-model Slack alert at $300 spend over 30 days.
- Virtual keys per team. Issue via
/key/generateendpoint; track spend per key.
How public benchmarks show and compared
The SpecPicks pipeline runs exactly this config (minus the Slack alerting URL) against ~50k generations/month. Patterns are cross-referenced with the LiteLLM documentation and community discussions. For the self-hosted-Claude-proxy complement, see our self-hosted Claude proxy guide.
Virtual-key workflow (issuing per-team access)
Now team-a's developers set OPENAI_API_KEY=sk-proxy-xxx and OPENAI_BASE_URL=http://litellm-internal:4000. They see no change in their code; LiteLLM handles everything.
Caching patterns that actually work
- Long system prompt, short user turns. The biggest wins. Stable system prompt becomes a cache hit on every turn. 40-60% cost reduction common.
- Document Q&A with stable corpus. If the retrieved chunks are stable across requests, cache hits are high. Implement retrieval deterministically to maximise this.
- Agent tool definitions. Lots of agents have 500-2000 tokens of tool schema. Perfect cache-hit candidate.
Cache misses are common in:
- Agent loops with timestamps or random IDs embedded in prompts.
- Multi-user workflows where user ID is interpolated early in the system prompt.
- Streaming use cases that hash differently per chunk — unintuitive but observed.
Three failure modes
1. LiteLLM shows zero cache hits. Redis is unreachable or misconfigured. Check docker logs litellm | grep -i cache; expected: Cache: initialized Redis connection.
2. Spend tracking shows 0.0 USD forever. database_url isn't set. LiteLLM silently runs in-memory without it.
3. Stream responses hang. Some backends don't support streaming with all params. Set drop_params: true in litellm_settings — lets unsupported params (e.g. OpenAI-specific seed on Anthropic) be silently dropped instead of erroring.
Alternatives
- OpenRouter — hosted multi-provider proxy, pay-per-token markup. Zero ops, higher $/token.
- Portkey — hosted + open-source gateway, observability-focused.
- Hand-rolled FastAPI proxy — see our self-hosted Claude proxy guide for skeleton.
- Anthropic / OpenAI SDK direct — zero proxy overhead, no caching / fallback / spend tracking.
Frequently asked questions
Is LiteLLM production-grade?
Yes in 2026. It powers inference routing at several well-known LLM-app companies. Watch for memory-leak behavior in versions with Redis caching enabled pre-v1.50 — pin a recent version.
What's the latency overhead?
In same-host deployment: <20 ms median, <200 ms p99. Cross-region can add 50-200 ms.
Can I use LiteLLM with Ollama?
Yes — model: "ollama/<model>" + api_base: "http://host:11434". LiteLLM treats Ollama as just another OpenAI-compatible backend.
Does LiteLLM handle streaming?
Yes, for every supported backend. Set stream: true in the request; LiteLLM passes chunks through transparently.
Can I chain LiteLLM + another proxy?
Yes. Common pattern: LiteLLM in front of a mix of Anthropic, OpenAI, and a local Claude proxy. The Claude proxy handles Claude Max; LiteLLM handles everything else + fallback.
Sources
- LiteLLM documentation — authoritative reference.
- r/LocalLLaMA — community deployment patterns and issues.
- Anthropic — Claude Code best practices — pairing reference.
- Open WebUI GitHub — typical downstream UI for LiteLLM.
Related guides
- Self-hosted Claude proxy
- Open WebUI — self-hosted ChatGPT for your local models
- AI coding assistants ranked
- Best GPU for an AI rig
— SpecPicks Editorial · Last verified 2026-04-21
