Self-hosting an OpenAI-compatible LLM gateway with LiteLLM

Point your apps at a single endpoint; LiteLLM routes to Claude, OpenAI, Ollama, or vLLM based on model name.

Every LLM app you write can pretend it's calling OpenAI. LiteLLM makes that true by routing OpenAI-shaped requests to whatever backend you actually want.

Why you want this

  • One API shape in your code. model="gpt-4o" vs model="claude-opus-4" vs model="ollama/qwen3:32b" — same client library, same request format.
  • Rate limiting + spend tracking in one place.
  • Fallback routing: primary fails → secondary picks up.
  • Cost routing: cheap questions hit Ollama, expensive questions hit Claude.
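As a sketch of that last point, a client-side routing heuristic might look like this. The token estimate and keyword list are invented for illustration; the model names match the aliases defined in the config below:

```python
def pick_model(prompt: str, cheap_limit_tokens: int = 200) -> str:
    """Hypothetical heuristic: route short, simple prompts to the free
    local model and everything else to a frontier model."""
    approx_tokens = len(prompt) / 4  # rough rule of thumb for English text
    hard = any(w in prompt.lower() for w in ("prove", "analyze", "refactor"))
    if approx_tokens <= cheap_limit_tokens and not hard:
        return "local-qwen"      # served by Ollama, effectively free
    return "claude-opus-4"       # frontier model for hard questions

print(pick_model("What's 2+2?"))                          # → local-qwen
print(pick_model("Analyze this trace for race conditions"))  # → claude-opus-4
```

Because LiteLLM exposes one endpoint, this routing is just a string choice; the client code stays identical either way.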

Install

pip install "litellm[proxy]"

Create config.yaml:

model_list:
  - model_name: "gpt-4o"
    litellm_params:
      model: "openai/gpt-4o"
      api_key: "sk-..."
  - model_name: "claude-opus-4"
    litellm_params:
      model: "anthropic/claude-opus-4-20250514"
      api_key: "sk-ant-..."
  - model_name: "local-qwen"
    litellm_params:
      model: "ollama/qwen3:32b"
      api_base: "http://localhost:11434"

router_settings:
  fallbacks:
    - gpt-4o: [claude-opus-4, local-qwen]

general_settings:
  master_key: "sk-proxy-your-secret-here"

Start the proxy:

litellm --config config.yaml --port 4000

Now any OpenAI client pointed at http://localhost:4000 with your master key works — regardless of which backend answers.
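A minimal stdlib sketch of such a client, assuming the proxy above on localhost:4000 and the master key from the config (real code would typically use the official openai package instead of urllib):

```python
import json
import urllib.request

BASE_URL = "http://localhost:4000"           # the LiteLLM proxy started above
MASTER_KEY = "sk-proxy-your-secret-here"     # general_settings.master_key

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-shaped chat request; LiteLLM routes on the model name."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Authorization": f"Bearer {MASTER_KEY}",
                 "Content-Type": "application/json"},
    )

if __name__ == "__main__":
    # Same request shape whether the backend is OpenAI, Anthropic, or Ollama.
    with urllib.request.urlopen(build_request("local-qwen", "Say hi.")) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

Swapping "local-qwen" for "gpt-4o" or "claude-opus-4" is the only change needed; the request and response shapes stay the same.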

What this unlocks

Dev/prod parity: write code against gpt-4o, test against local-qwen for free while developing. Flip one config line for production.

Spend alerts: LiteLLM's built-in Prometheus exporter lets you alert on "$50 of Claude Opus in the last hour."

Multi-tenant: give different teams different master keys, meter each one independently.

Production config — multi-provider with fallbacks and spend caps

A hardened config.yaml that our SpecPicks pipeline runs in production (minus secrets):

model_list:
  - model_name: "gpt-4o"
    litellm_params:
      model: "openai/gpt-4o"
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500    # per-key rate limit
      tpm: 300000
  - model_name: "claude-opus"
    litellm_params:
      model: "anthropic/claude-opus-4-20250514"
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: "claude-sonnet"
    litellm_params:
      model: "anthropic/claude-sonnet-4-20250514"
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: "local-qwen-32b"
    litellm_params:
      model: "ollama/qwen3:32b"
      api_base: "http://ollama:11434"

router_settings:
  fallbacks:
    - gpt-4o: [claude-opus, claude-sonnet, local-qwen-32b]
    - claude-opus: [claude-sonnet, gpt-4o, local-qwen-32b]
  retry_policy:
    num_retries: 2
    retry_after: 5

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: "postgresql://litellm:secret@postgres/litellm"
  alerting: ["slack"]
  alerting_threshold: 300
  budget_duration: "30d"
  max_budget: 500   # USD per month, per virtual key

litellm_settings:
  cache: true
  cache_params:
    type: "redis"
    host: "redis"
    port: 6379
    ttl: 300
  drop_params: true   # ignore unsupported params silently
  set_verbose: false

Key decisions:

  • Postgres for spend persistence. Without it, restarts wipe your spend history.
  • Redis for caching. 300-second TTL matches Anthropic's ephemeral cache window.
  • Fallback ladder goes cloud → cloud → local. Budget caps are $500 per virtual key per 30-day window, with Slack alerting enabled for budget and failure events.
  • Virtual keys per team. Issue via /key/generate endpoint; track spend per key.

How we tested and compared

The SpecPicks pipeline runs exactly this config (minus the Slack alerting URL) against ~50k generations/month. Patterns are cross-referenced with the LiteLLM documentation and community discussions. For the complementary setup, see our self-hosted Claude proxy guide.

Virtual-key workflow (issuing per-team access)

# Create a key with a $50/month cap for "team-a"
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer ${MASTER_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "metadata": {"team": "team-a"},
    "max_budget": 50,
    "budget_duration": "30d",
    "models": ["gpt-4o", "claude-sonnet"]
  }'
# → returns sk-proxy-xxx that team-a uses as their "OpenAI API key"

Now team-a's developers set OPENAI_API_KEY=sk-proxy-xxx and OPENAI_BASE_URL=http://litellm-internal:4000. They see no change in their code; LiteLLM handles everything.

Caching patterns that actually work

  • Long system prompt, short user turns. The biggest win: the stable system prompt becomes a cache hit on every turn, and 40-60% cost reductions are common.
  • Document Q&A with stable corpus. If the retrieved chunks are stable across requests, cache hits are high. Implement retrieval deterministically to maximise this.
  • Agent tool definitions. Lots of agents have 500-2000 tokens of tool schema. Perfect cache-hit candidate.
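The first pattern can be sketched like this (the system prompt content is a placeholder; the point is that it is a module-level constant, so every request sends a byte-identical prefix):

```python
# The long instructions live in one constant that never changes between
# requests; only the short user turn and history vary.
SYSTEM_PROMPT = "You are the SpecPicks support bot. <2k tokens of stable policy>"

def make_messages(history: list[dict], user_turn: str) -> list[dict]:
    """Stable prefix first, volatile content last: the prefix caches,
    the final user turn does not need to."""
    return [{"role": "system", "content": SYSTEM_PROMPT},
            *history,
            {"role": "user", "content": user_turn}]
```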

Cache misses are common in:

  • Agent loops with timestamps or random IDs embedded in prompts.
  • Multi-user workflows where user ID is interpolated early in the system prompt.
  • Streaming use cases that hash differently per chunk — unintuitive but observed.
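A hedged sketch of a fix for the first miss cause: strip volatile timestamps out of the cached prefix and pass them in the final (uncached) user turn instead. The ISO-8601 regex is a simplification for illustration:

```python
import re

# Matches ISO-8601 timestamps like 2026-04-21T10:00:00Z (simplified).
_TS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:Z|[+-]\d{2}:\d{2})?")

def split_volatile(system_prompt: str) -> tuple[str, str]:
    """Replace timestamps with a fixed placeholder so the system prompt
    hashes identically across requests; return the extracted values so
    they can be appended to the final user turn instead."""
    volatile = " ".join(_TS.findall(system_prompt))
    stable = _TS.sub("<now>", system_prompt)
    return stable, volatile
```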

Three failure modes

1. LiteLLM shows zero cache hits. Redis is unreachable or misconfigured. Check docker logs litellm | grep -i cache; expected: Cache: initialized Redis connection.

2. Spend tracking shows 0.0 USD forever. database_url isn't set. LiteLLM silently runs in-memory without it.

3. Stream responses hang. Some backends don't support streaming with all params. Set drop_params: true in litellm_settings — lets unsupported params (e.g. OpenAI-specific seed on Anthropic) be silently dropped instead of erroring.

Alternatives

  • OpenRouter — hosted multi-provider proxy, pay-per-token markup. Zero ops, higher $/token.
  • Portkey — hosted + open-source gateway, observability-focused.
  • Hand-rolled FastAPI proxy — see our self-hosted Claude proxy guide for skeleton.
  • Anthropic / OpenAI SDK direct — zero proxy overhead, no caching / fallback / spend tracking.

Frequently asked questions

Is LiteLLM production-grade?

Yes, as of 2026. It powers inference routing at several well-known LLM-app companies. Versions before v1.50 showed memory-leak behavior with Redis caching enabled, so pin a recent release.

What's the latency overhead?

In same-host deployment: <20 ms median, <200 ms p99. Cross-region can add 50-200 ms.

Can I use LiteLLM with Ollama?

Yes — model: "ollama/<model>" + api_base: "http://host:11434". LiteLLM translates requests to Ollama's API, so to your clients it behaves like any other backend.

Does LiteLLM handle streaming?

Yes, for every supported backend. Set stream: true in the request; LiteLLM passes chunks through transparently.

Can I chain LiteLLM + another proxy?

Yes. Common pattern: LiteLLM in front of a mix of Anthropic, OpenAI, and a local Claude proxy. The Claude proxy handles Claude Max; LiteLLM handles everything else + fallback.

Sources

  1. LiteLLM documentation — authoritative reference.
  2. r/LocalLLaMA — community deployment patterns and issues.
  3. Anthropic — Claude Code best practices — pairing reference.
  4. Open WebUI GitHub — typical downstream UI for LiteLLM.

— SpecPicks Editorial · Last verified 2026-04-21