Skip to main content
Self-hosting an OpenAI-compatible LLM gateway with LiteLLM

Self-hosting an OpenAI-compatible LLM gateway with LiteLLM

One API shape in your code. model="gpt-4o" vs model="claude-opus-4" vs model="ollama/qwen3:32b" — same client library, same request format. Rate

Every LLM app you write can pretend it's calling OpenAI. LiteLLM makes that true by routing OpenAI-shaped requests to whatever backend you actually want.

Why you want this

  • One API shape in your code. model="gpt-4o" vs model="claude-opus-4" vs model="ollama/qwen3:32b" — same client library, same request format.
  • Rate limiting + spend tracking in one place.
  • Fallback routing: primary fails → secondary picks up.
  • Cost routing: cheap questions hit Ollama, expensive questions hit Claude.

Install

bash
pip install "litellm[proxy]"

Create config.yaml:

yaml
model_list:
  - model_name: "gpt-4o"
    litellm_params:
      model: "openai/gpt-4o"
      api_key: "sk-..."
  - model_name: "claude-opus-4"
    litellm_params:
      model: "anthropic/claude-opus-4-20250514"
      api_key: "sk-ant-..."
  - model_name: "local-qwen"
    litellm_params:
      model: "ollama/qwen3:32b"
      api_base: "http://localhost:11434"

router_settings:
  fallbacks:
    - gpt-4o: [claude-opus-4, local-qwen]

general_settings:
  master_key: "sk-proxy-your-secret-here"
bash
litellm --config config.yaml --port 4000

Now any OpenAI client pointed at http://localhost:4000 with your master key works — regardless of which backend answers.

What this unlocks

Dev/prod parity: write code against gpt-4o, test against local-qwen for free while developing. Flip one config line for production.

Spend alerts: LiteLLM's built-in Prometheus exporter lets you alert on "$50 of Claude Opus in the last hour."

Multi-tenant: give different teams different master keys, meter each one independently.

Related

Production config — multi-provider with fallbacks and spend caps

A hardened config.yaml that our SpecPicks pipeline runs in production (minus secrets):

yaml
model_list:
  - model_name: "gpt-4o"
    litellm_params:
      model: "openai/gpt-4o"
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500    # per-key rate limit
      tpm: 300000
  - model_name: "claude-opus"
    litellm_params:
      model: "anthropic/claude-opus-4-20250514"
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: "claude-sonnet"
    litellm_params:
      model: "anthropic/claude-sonnet-4-20250514"
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: "local-qwen-32b"
    litellm_params:
      model: "ollama/qwen3:32b"
      api_base: "http://ollama:11434"

router_settings:
  fallbacks:
    - gpt-4o: [claude-opus, claude-sonnet, local-qwen-32b]
    - claude-opus: [claude-sonnet, gpt-4o, local-qwen-32b]
  retry_policy:
    num_retries: 2
    retry_after: 5

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: "postgresql://litellm:secret@postgres/litellm"
  alerting: ["slack"]
  alerting_threshold: 300
  budget_duration: "30d"
  max_budget: 500   # USD per month, per virtual key

litellm_settings:
  cache: true
  cache_params:
    type: "redis"
    host: "redis"
    port: 6379
    ttl: 300
  drop_params: true   # ignore unsupported params silently
  set_verbose: false

Key decisions:

  • Postgres for spend persistence. Without it, restarts wipe your spend history.
  • Redis for caching. 300-second TTL matches Anthropic's ephemeral cache window.
  • Fallback ladder goes cloud → cloud → local. Primary-model Slack alert at $300 spend over 30 days.
  • Virtual keys per team. Issue via /key/generate endpoint; track spend per key.

How public benchmarks show and compared

The SpecPicks pipeline runs exactly this config (minus the Slack alerting URL) against ~50k generations/month. Patterns are cross-referenced with the LiteLLM documentation and community discussions. For the self-hosted-Claude-proxy complement, see our self-hosted Claude proxy guide.

Virtual-key workflow (issuing per-team access)

bash
# Create a key with a $50/month cap for "team-a"
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer ${MASTER_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "metadata": {"team": "team-a"},
    "max_budget": 50,
    "budget_duration": "30d",
    "models": ["gpt-4o", "claude-sonnet"]
  }'
# → returns sk-proxy-xxx that team-a uses as their "OpenAI API key"

Now team-a's developers set OPENAI_API_KEY=sk-proxy-xxx and OPENAI_BASE_URL=http://litellm-internal:4000. They see no change in their code; LiteLLM handles everything.

Caching patterns that actually work

  • Long system prompt, short user turns. The biggest wins. Stable system prompt becomes a cache hit on every turn. 40-60% cost reduction common.
  • Document Q&A with stable corpus. If the retrieved chunks are stable across requests, cache hits are high. Implement retrieval deterministically to maximise this.
  • Agent tool definitions. Lots of agents have 500-2000 tokens of tool schema. Perfect cache-hit candidate.

Cache misses are common in:

  • Agent loops with timestamps or random IDs embedded in prompts.
  • Multi-user workflows where user ID is interpolated early in the system prompt.
  • Streaming use cases that hash differently per chunk — unintuitive but observed.

Three failure modes

1. LiteLLM shows zero cache hits. Redis is unreachable or misconfigured. Check docker logs litellm | grep -i cache; expected: Cache: initialized Redis connection.

2. Spend tracking shows 0.0 USD forever. database_url isn't set. LiteLLM silently runs in-memory without it.

3. Stream responses hang. Some backends don't support streaming with all params. Set drop_params: true in litellm_settings — lets unsupported params (e.g. OpenAI-specific seed on Anthropic) be silently dropped instead of erroring.

Alternatives

  • OpenRouter — hosted multi-provider proxy, pay-per-token markup. Zero ops, higher $/token.
  • Portkey — hosted + open-source gateway, observability-focused.
  • Hand-rolled FastAPI proxy — see our self-hosted Claude proxy guide for skeleton.
  • Anthropic / OpenAI SDK direct — zero proxy overhead, no caching / fallback / spend tracking.

Frequently asked questions

Is LiteLLM production-grade?

Yes in 2026. It powers inference routing at several well-known LLM-app companies. Watch for memory-leak behavior in versions with Redis caching enabled pre-v1.50 — pin a recent version.

What's the latency overhead?

In same-host deployment: <20 ms median, <200 ms p99. Cross-region can add 50-200 ms.

Can I use LiteLLM with Ollama?

Yes — model: "ollama/<model>" + api_base: "http://host:11434". LiteLLM treats Ollama as just another OpenAI-compatible backend.

Does LiteLLM handle streaming?

Yes, for every supported backend. Set stream: true in the request; LiteLLM passes chunks through transparently.

Can I chain LiteLLM + another proxy?

Yes. Common pattern: LiteLLM in front of a mix of Anthropic, OpenAI, and a local Claude proxy. The Claude proxy handles Claude Max; LiteLLM handles everything else + fallback.

Sources

  1. LiteLLM documentation — authoritative reference.
  2. r/LocalLLaMA — community deployment patterns and issues.
  3. Anthropic — Claude Code best practices — pairing reference.
  4. Open WebUI GitHub — typical downstream UI for LiteLLM.

Related guides


— SpecPicks Editorial · Last verified 2026-04-21

Products mentioned in this article

Live prices from Amazon and eBay — both shown for every product so you can pick the channel that fits.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

What are the main benefits of using LiteLLM as a self-hosted proxy?
LiteLLM offers a unified API shape for multiple LLM backends, enabling developers to use the same client library regardless of the model. It includes features like rate limiting, spend tracking, fallback routing, and cost-based routing. These capabilities make it easier to manage multiple models and optimize costs while maintaining consistent application behavior.
How does LiteLLM handle fallback routing between models?
LiteLLM allows users to configure fallback routes in its `config.yaml`. If the primary model fails, requests are automatically routed to secondary models in the specified order. This ensures high availability and reliability, even if one backend becomes temporarily unavailable.
What caching strategies work best with LiteLLM?
Effective caching strategies include using long, stable system prompts and deterministic retrieval for document Q&A tasks. These patterns maximize cache hits, reducing costs by up to 40-60%. However, cache misses are more common in scenarios like agent loops with dynamic prompts or streaming use cases with variable chunk hashes.
What are the common failure modes when using LiteLLM?
Three common issues include zero cache hits due to Redis misconfiguration, spend tracking showing 0.0 USD if `database_url` is unset, and stream responses hanging when unsupported parameters are passed to certain backends. These can be resolved by checking logs, setting a persistent database, and enabling the `drop_params` option.
Can LiteLLM be used in multi-tenant environments?
Yes, LiteLLM supports multi-tenancy by allowing the creation of virtual keys for different teams. Each key can have its own budget, duration, and model access restrictions. This enables independent spend tracking and access control for multiple teams sharing the same LiteLLM instance.

Sources

— SpecPicks Editorial · Last verified 2026-05-20

More guides & deep dives from the SpecPicks archive

Browse all articles & guides →

More reviews from the SpecPicks archive

Browse all reviews →