Every LLM app you write can pretend it's calling OpenAI. LiteLLM makes that true by routing OpenAI-shaped requests to whatever backend you actually want.
Why you want this
- One API shape in your code. `model="gpt-4o"` vs `model="claude-opus-4"` vs `model="ollama/qwen3:32b"` — same client library, same request format.
- Rate limiting + spend tracking in one place.
- Fallback routing: primary fails → secondary picks up.
- Cost routing: cheap questions hit Ollama, expensive questions hit Claude.
Install
```shell
pip install "litellm[proxy]"
```
Create `config.yaml`:

```yaml
model_list:
  - model_name: "gpt-4o"
    litellm_params:
      model: "openai/gpt-4o"
      api_key: "sk-..."
  - model_name: "claude-opus-4"
    litellm_params:
      model: "anthropic/claude-opus-4-20250514"
      api_key: "sk-ant-..."
  - model_name: "local-qwen"
    litellm_params:
      model: "ollama/qwen3:32b"
      api_base: "http://localhost:11434"
router_settings:
  fallbacks:
    - gpt-4o: [claude-opus-4, local-qwen]
general_settings:
  master_key: "sk-proxy-your-secret-here"
```
```shell
litellm --config config.yaml --port 4000
```
Now any OpenAI client pointed at http://localhost:4000 with your master key works — regardless of which backend answers.
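To make the "one API shape" point concrete, here is a minimal stdlib-only sketch of the request a client sends the proxy. The `build_request` helper and `PROXY_URL` constant are illustrative, not part of LiteLLM: the point is that only the `model` field changes between backends.

```python
import json

PROXY_URL = "http://localhost:4000/v1/chat/completions"  # assumed local proxy

def build_request(model: str, prompt: str, api_key: str) -> dict:
    """Build an OpenAI-shaped chat completion request.

    Only the `model` field changes between backends; the proxy maps
    it to OpenAI, Anthropic, or Ollama according to config.yaml.
    """
    return {
        "url": PROXY_URL,
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }),
    }

# Same shape, three different backends behind the same endpoint:
for m in ("gpt-4o", "claude-opus-4", "local-qwen"):
    req = build_request(m, "Hello", "sk-proxy-your-secret-here")
```

In practice you would not hand-roll this: any OpenAI SDK pointed at the proxy produces exactly this request.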
What this unlocks
- Dev/prod parity: write code against gpt-4o, test against local-qwen for free while developing. Flip one config line for production.
- Spend alerts: LiteLLM's built-in Prometheus exporter lets you alert on "$50 of Claude Opus in the last hour."
- Multi-tenant: give different teams different master keys, meter each one independently.
Related
Production config — multi-provider with fallbacks and spend caps
A hardened `config.yaml` that our SpecPicks pipeline runs in production (minus secrets):

```yaml
model_list:
  - model_name: "gpt-4o"
    litellm_params:
      model: "openai/gpt-4o"
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500       # per-key rate limit
      tpm: 300000
  - model_name: "claude-opus"
    litellm_params:
      model: "anthropic/claude-opus-4-20250514"
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: "claude-sonnet"
    litellm_params:
      model: "anthropic/claude-sonnet-4-20250514"
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: "local-qwen-32b"
    litellm_params:
      model: "ollama/qwen3:32b"
      api_base: "http://ollama:11434"
router_settings:
  fallbacks:
    - gpt-4o: [claude-opus, claude-sonnet, local-qwen-32b]
    - claude-opus: [claude-sonnet, gpt-4o, local-qwen-32b]
  retry_policy:
    num_retries: 2
    retry_after: 5
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: "postgresql://litellm:secret@postgres/litellm"
  alerting: ["slack"]
  alerting_threshold: 300
  budget_duration: "30d"
  max_budget: 500    # USD per month, per virtual key
litellm_settings:
  cache: true
  cache_params:
    type: "redis"
    host: "redis"
    port: 6379
    ttl: 300
  drop_params: true  # ignore unsupported params silently
  set_verbose: false
```
Key decisions:
- Postgres for spend persistence. Without it, restarts wipe your spend history.
- Redis for caching. The 300-second TTL matches Anthropic's `ephemeral` cache window.
- Fallback ladder goes cloud → cloud → local. Primary-model Slack alert at $300 spend over 30 days.
- Virtual keys per team. Issue via the `/key/generate` endpoint; track spend per key.
How we tested and compared
The SpecPicks pipeline runs exactly this config (minus the Slack alerting URL) against ~50k generations/month. Patterns are cross-referenced with the LiteLLM documentation and community discussions. For the complementary self-hosted setup, see our self-hosted Claude proxy guide.
Virtual-key workflow (issuing per-team access)
```shell
# Create a key with a $50/month cap for "team-a"
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer ${MASTER_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "metadata": {"team": "team-a"},
    "max_budget": 50,
    "budget_duration": "30d",
    "models": ["gpt-4o", "claude-sonnet"]
  }'
# → returns sk-proxy-xxx that team-a uses as their "OpenAI API key"
```
Now team-a's developers set `OPENAI_API_KEY=sk-proxy-xxx` and `OPENAI_BASE_URL=http://litellm-internal:4000`. They see no change in their code; LiteLLM handles everything.
Caching patterns that actually work
- Long system prompt, short user turns. The biggest win: a stable system prompt becomes a cache hit on every turn. 40-60% cost reductions are common.
- Document Q&A with stable corpus. If the retrieved chunks are stable across requests, cache hits are high. Implement retrieval deterministically to maximise this.
- Agent tool definitions. Lots of agents have 500-2000 tokens of tool schema. Perfect cache-hit candidate.
Cache misses are common in:
- Agent loops with timestamps or random IDs embedded in prompts.
- Multi-user workflows where user ID is interpolated early in the system prompt.
- Streaming use cases that hash differently per chunk — unintuitive but observed.
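The first two miss patterns above come down to prompt bytes that vary per request. A minimal sketch of cache-stable prompt assembly follows; the function name and the timestamp/UUID regexes are illustrative assumptions, not LiteLLM features:

```python
import hashlib
import re

def cache_stable_prompt(system: str, chunks: list, user_turn: str) -> str:
    """Assemble a prompt so identical inputs always serialize identically.

    - ISO timestamps and UUID-like tokens are stripped from the system
      prompt (the usual cache-killers in agent loops).
    - Retrieved chunks are ordered by content hash rather than retrieval
      order, so the same corpus slice always produces the same bytes.
    """
    stable_system = re.sub(
        r"\d{4}-\d{2}-\d{2}T[\d:.]+Z?"
        r"|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
        "",
        system,
    )
    ordered = sorted(chunks, key=lambda c: hashlib.sha256(c.encode()).hexdigest())
    return "\n\n".join([stable_system.strip(), *ordered, user_turn])
```

Two requests that differ only in timestamp or chunk ordering now hash the same, so the proxy's Redis cache (and Anthropic's prompt cache) can hit.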
Three failure modes
1. LiteLLM shows zero cache hits. Redis is unreachable or misconfigured. Check `docker logs litellm | grep -i cache`; expected: `Cache: initialized Redis connection`.
2. Spend tracking shows 0.0 USD forever. `database_url` isn't set. LiteLLM silently runs in-memory without it.
3. Stream responses hang. Some backends don't support streaming with all params. Set `drop_params: true` in `litellm_settings` — lets unsupported params (e.g. OpenAI-specific `seed` on Anthropic) be silently dropped instead of erroring.
Alternatives
- OpenRouter — hosted multi-provider proxy, pay-per-token markup. Zero ops, higher $/token.
- Portkey — hosted + open-source gateway, observability-focused.
- Hand-rolled FastAPI proxy — see our self-hosted Claude proxy guide for a skeleton.
- Anthropic / OpenAI SDK direct — zero proxy overhead, no caching / fallback / spend tracking.
Frequently asked questions
Is LiteLLM production-grade?
Yes in 2026. It powers inference routing at several well-known LLM-app companies. Watch for memory-leak behavior with Redis caching enabled in versions before v1.50 — pin a recent version.
What's the latency overhead?
In same-host deployment: <20 ms median, <200 ms p99. Cross-region can add 50-200 ms.
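You can verify those numbers for your own deployment by timing identical requests sent direct versus through the proxy. A stdlib-only measurement harness (the `measure` helper is our sketch, not a LiteLLM utility):

```python
import statistics
import time

def measure(fn, n: int = 200):
    """Call `fn` n times; return (median_ms, p99_ms) of wall-clock latency.

    Point `fn` at a proxied request and at a direct backend request;
    the difference between the two medians is the proxy overhead.
    """
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000)
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return statistics.median(samples), cuts[98]  # p50, p99
```

Run it against a cheap model with a one-token prompt so model latency does not swamp the routing overhead you are trying to isolate.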
Can I use LiteLLM with Ollama?
Yes — `model: "ollama/<model>"` + `api_base: "http://host:11434"`. LiteLLM treats Ollama as just another OpenAI-compatible backend.
Does LiteLLM handle streaming?
Yes, for every supported backend. Set `stream: true` in the request; LiteLLM passes chunks through transparently.
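The chunks arrive as OpenAI-style server-sent events, one `data: {json}` line per delta, terminated by `data: [DONE]`. A minimal stdlib sketch of reassembling the streamed text (the `collect_stream` helper is illustrative; SDKs do this for you):

```python
import json

def collect_stream(lines):
    """Assemble assistant text from OpenAI-style SSE lines.

    Each event is `data: {json}` carrying a `choices[0].delta`;
    the stream ends with `data: [DONE]`. LiteLLM relays these
    chunks in this shape regardless of which backend answered.
    """
    out = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank separator lines
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        out.append(delta.get("content", ""))  # first delta has role, no content
    return "".join(out)
```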
Can I chain LiteLLM + another proxy?
Yes. Common pattern: LiteLLM in front of a mix of Anthropic, OpenAI, and a local Claude proxy. The Claude proxy handles Claude Max; LiteLLM handles everything else + fallback.
Sources
- LiteLLM documentation — authoritative reference.
- r/LocalLLaMA — community deployment patterns and issues.
- Anthropic — Claude Code best practices — pairing reference.
- Open WebUI GitHub — typical downstream UI for LiteLLM.
Related guides
- Self-hosted Claude proxy
- Open WebUI — self-hosted ChatGPT for your local models
- AI coding assistants ranked
- Best GPU for an AI rig
— SpecPicks Editorial · Last verified 2026-04-21
