Self-hosting an OpenAI-compatible LLM gateway with LiteLLM

Name: Self-hosting an OpenAI-compatible LLM gateway with LiteLLM
Item: New brothread 63 Colors Compatible with Brother Polyester Embroidery Machine Thread Kit 500M (550Y) Each Spool Compatible with Brother Babylock Janome Singer Pfaff Bernina Embroidery Sewing Machine
Author: SpecPicks Editorial

By SpecPicks Editorial · Published 2026-04-21 · Last verified 2026-07-06 · 2 min read

One API shape in your code. model="gpt-4o" vs model="claude-opus-4" vs model="ollama/qwen3:32b" — same client library, same request format. Rate

Every LLM app you write can pretend it's calling OpenAI. LiteLLM makes that true by routing OpenAI-shaped requests to whatever backend you actually want.

Why you want this

One API shape in your code. model="gpt-4o" vs model="claude-opus-4" vs model="ollama/qwen3:32b" — same client library, same request format.
Rate limiting + spend tracking in one place.
Fallback routing: primary fails → secondary picks up.
Cost routing: cheap questions hit Ollama, expensive questions hit Claude.

Install

bash

pip install "litellm[proxy]"

Create config.yaml:

yaml

model_list:
 - model_name: "gpt-4o"
 litellm_params:
 model: "openai/gpt-4o"
 api_key: "sk-..."
 - model_name: "claude-opus-4"
 litellm_params:
 model: "anthropic/claude-opus-4-20250514"
 api_key: "sk-ant-..."
 - model_name: "local-qwen"
 litellm_params:
 model: "ollama/qwen3:32b"
 api_base: "http://localhost:11434"

router_settings:
 fallbacks:
 - gpt-4o: [claude-opus-4, local-qwen]

general_settings:
 master_key: "sk-proxy-your-secret-here"

bash

litellm --config config.yaml --port 4000

Now any OpenAI client pointed at http://localhost:4000 with your master key works — regardless of which backend answers.

What this unlocks

Dev/prod parity: write code against gpt-4o, test against local-qwen for free while developing. Flip one config line for production.

Spend alerts: LiteLLM's built-in Prometheus exporter lets you alert on "$50 of Claude Opus in the last hour."

Multi-tenant: give different teams different master keys, meter each one independently.

Production config — multi-provider with fallbacks and spend caps

A hardened config.yaml that our SpecPicks pipeline runs in production (minus secrets):

yaml

model_list:
 - model_name: "gpt-4o"
 litellm_params:
 model: "openai/gpt-4o"
 api_key: os.environ/OPENAI_API_KEY
 rpm: 500 # per-key rate limit
 tpm: 300000
 - model_name: "claude-opus"
 litellm_params:
 model: "anthropic/claude-opus-4-20250514"
 api_key: os.environ/ANTHROPIC_API_KEY
 - model_name: "claude-sonnet"
 litellm_params:
 model: "anthropic/claude-sonnet-4-20250514"
 api_key: os.environ/ANTHROPIC_API_KEY
 - model_name: "local-qwen-32b"
 litellm_params:
 model: "ollama/qwen3:32b"
 api_base: "http://ollama:11434"

router_settings:
 fallbacks:
 - gpt-4o: [claude-opus, claude-sonnet, local-qwen-32b]
 - claude-opus: [claude-sonnet, gpt-4o, local-qwen-32b]
 retry_policy:
 num_retries: 2
 retry_after: 5

general_settings:
 master_key: os.environ/LITELLM_MASTER_KEY
 database_url: "postgresql://litellm:secret@postgres/litellm"
 alerting: ["slack"]
 alerting_threshold: 300
 budget_duration: "30d"
 max_budget: 500 # USD per month, per virtual key

litellm_settings:
 cache: true
 cache_params:
 type: "redis"
 host: "redis"
 port: 6379
 ttl: 300
 drop_params: true # ignore unsupported params silently
 set_verbose: false

Key decisions:

Postgres for spend persistence. Without it, restarts wipe your spend history.
Redis for caching. 300-second TTL matches Anthropic's ephemeral cache window.
Fallback ladder goes cloud → cloud → local. Primary-model Slack alert at $300 spend over 30 days.
Virtual keys per team. Issue via /key/generate endpoint; track spend per key.

How public benchmarks show and compared

The SpecPicks pipeline runs exactly this config (minus the Slack alerting URL) against ~50k generations/month. Patterns are cross-referenced with the LiteLLM documentation and community discussions. For the self-hosted-Claude-proxy complement, see our self-hosted Claude proxy guide.

Virtual-key workflow (issuing per-team access)

bash

# Create a key with a $50/month cap for "team-a"
curl -X POST http://localhost:4000/key/generate \
 -H "Authorization: Bearer ${MASTER_KEY}" \
 -H "Content-Type: application/json" \
 -d '{
 "metadata": {"team": "team-a"},
 "max_budget": 50,
 "budget_duration": "30d",
 "models": ["gpt-4o", "claude-sonnet"]
 }'
# → returns sk-proxy-xxx that team-a uses as their "OpenAI API key"

Now team-a's developers set OPENAI_API_KEY=sk-proxy-xxx and OPENAI_BASE_URL=http://litellm-internal:4000. They see no change in their code; LiteLLM handles everything.

Caching patterns that actually work

Long system prompt, short user turns. The biggest wins. Stable system prompt becomes a cache hit on every turn. 40-60% cost reduction common.
Document Q&A with stable corpus. If the retrieved chunks are stable across requests, cache hits are high. Implement retrieval deterministically to maximise this.
Agent tool definitions. Lots of agents have 500-2000 tokens of tool schema. Perfect cache-hit candidate.

Cache misses are common in:

Agent loops with timestamps or random IDs embedded in prompts.
Multi-user workflows where user ID is interpolated early in the system prompt.
Streaming use cases that hash differently per chunk — unintuitive but observed.

Three failure modes

1. LiteLLM shows zero cache hits. Redis is unreachable or misconfigured. Check docker logs litellm | grep -i cache; expected: Cache: initialized Redis connection.

2. Spend tracking shows 0.0 USD forever. database_url isn't set. LiteLLM silently runs in-memory without it.

3. Stream responses hang. Some backends don't support streaming with all params. Set drop_params: true in litellm_settings — lets unsupported params (e.g. OpenAI-specific seed on Anthropic) be silently dropped instead of erroring.

Alternatives

OpenRouter — hosted multi-provider proxy, pay-per-token markup. Zero ops, higher $/token.
Portkey — hosted + open-source gateway, observability-focused.
Hand-rolled FastAPI proxy — see our self-hosted Claude proxy guide for skeleton.
Anthropic / OpenAI SDK direct — zero proxy overhead, no caching / fallback / spend tracking.

Frequently asked questions

Is LiteLLM production-grade?

Yes in 2026. It powers inference routing at several well-known LLM-app companies. Watch for memory-leak behavior in versions with Redis caching enabled pre-v1.50 — pin a recent version.

What's the latency overhead?

In same-host deployment: <20 ms median, <200 ms p99. Cross-region can add 50-200 ms.

Can I use LiteLLM with Ollama?

Yes — model: "ollama/<model>" + api_base: "http://host:11434". LiteLLM treats Ollama as just another OpenAI-compatible backend.

Does LiteLLM handle streaming?

Yes, for every supported backend. Set stream: true in the request; LiteLLM passes chunks through transparently.

Can I chain LiteLLM + another proxy?

Yes. Common pattern: LiteLLM in front of a mix of Anthropic, OpenAI, and a local Claude proxy. The Claude proxy handles Claude Max; LiteLLM handles everything else + fallback.

Sources

LiteLLM documentation — authoritative reference.
r/LocalLLaMA — community deployment patterns and issues.
Anthropic — Claude Code best practices — pairing reference.
Open WebUI GitHub — typical downstream UI for LiteLLM.

Related guides

— SpecPicks Editorial · Last verified 2026-04-21

Products mentioned in this article

Tap any product for full specs, live Amazon & eBay pricing, and alternatives.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Watch a review

One CPU To Rule Them All - Ryzen 7 9800X3D Review — Linus Tech Tips on YouTube

Frequently asked questions

What are the main benefits of using LiteLLM as a self-hosted proxy?

LiteLLM offers a unified API shape for multiple LLM backends, enabling developers to use the same client library regardless of the model. It includes features like rate limiting, spend tracking, fallback routing, and cost-based routing. These capabilities make it easier to manage multiple models and optimize costs while maintaining consistent application behavior.

How does LiteLLM handle fallback routing between models?

LiteLLM allows users to configure fallback routes in its `config.yaml`. If the primary model fails, requests are automatically routed to secondary models in the specified order. This ensures high availability and reliability, even if one backend becomes temporarily unavailable.

What caching strategies work best with LiteLLM?

Effective caching strategies include using long, stable system prompts and deterministic retrieval for document Q&A tasks. These patterns maximize cache hits, reducing costs by up to 40-60%. However, cache misses are more common in scenarios like agent loops with dynamic prompts or streaming use cases with variable chunk hashes.

What are the common failure modes when using LiteLLM?

Three common issues include zero cache hits due to Redis misconfiguration, spend tracking showing 0.0 USD if `database_url` is unset, and stream responses hanging when unsupported parameters are passed to certain backends. These can be resolved by checking logs, setting a persistent database, and enabling the `drop_params` option.

Can LiteLLM be used in multi-tenant environments?

Yes, LiteLLM supports multi-tenancy by allowing the creation of virtual keys for different teams. Each key can have its own budget, duration, and model access restrictions. This enables independent spend tracking and access control for multiple teams sharing the same LiteLLM instance.

Sources

— SpecPicks Editorial · Last verified 2026-07-06

More guides & deep dives from the SpecPicks archive

Browse all articles & guides →

More reviews from the SpecPicks archive

Browse all reviews →

More buying guides from SpecPicks

Browse all buying guides →

Self-hosting an OpenAI-compatible LLM gateway with LiteLLM

Why you want this

Install

What this unlocks

Related

Production config — multi-provider with fallbacks and spend caps

How public benchmarks show and compared

Virtual-key workflow (issuing per-team access)

Caching patterns that actually work

Three failure modes

Alternatives

Frequently asked questions

Is LiteLLM production-grade?

What's the latency overhead?

Can I use LiteLLM with Ollama?

Does LiteLLM handle streaming?

Can I chain LiteLLM + another proxy?

Sources

Related guides

Products mentioned in this article

New brothread 63 Colors Compatible with Brother Polyester Embroidery Machine…

AMD RYZEN 7 9800X3D 8-Core, 16-Thread Desktop Processor

AMD RYZEN 7 9800X3D 8-Core, 16-Thread Desktop Processor

AMD RYZEN 7 9800X3D 8-Core, 16-Thread Desktop Processor

Intel Core i5-9600K Desktop Processor 6 Cores up to 4.6 GHz Turbo unlocked…

Intel Core i5-9600K Desktop Processor 6 Cores up to 4.6 GHz Turbo unlocked…

Intel Core i9-9900K Desktop Processor 8 Cores up to 5.0 GHz Turbo Unlocked…

Watch a review

Frequently asked questions

Sources

More guides & deep dives from the SpecPicks archive

More reviews from the SpecPicks archive

More buying guides from SpecPicks

Self-hosting an OpenAI-compatible LLM gateway with LiteLLM

Why you want this

Install

What this unlocks

Related

Production config — multi-provider with fallbacks and spend caps

How public benchmarks show and compared

Virtual-key workflow (issuing per-team access)

Caching patterns that actually work

Three failure modes

Alternatives

Frequently asked questions

Is LiteLLM production-grade?

What's the latency overhead?

Can I use LiteLLM with Ollama?

Does LiteLLM handle streaming?

Can I chain LiteLLM + another proxy?

Sources

Related guides

New brothread 63 Colors Compatible with Brother Polyester Embroidery Machine…

AMD RYZEN 7 9800X3D 8-Core, 16-Thread Desktop Processor

AMD RYZEN 7 9800X3D 8-Core, 16-Thread Desktop Processor

AMD RYZEN 7 9800X3D 8-Core, 16-Thread Desktop Processor

Intel Core i5-9600K Desktop Processor 6 Cores up to 4.6 GHz Turbo unlocked…

Intel Core i5-9600K Desktop Processor 6 Cores up to 4.6 GHz Turbo unlocked…

Intel Core i9-9900K Desktop Processor 8 Cores up to 5.0 GHz Turbo Unlocked…

📹 Watch a review

Frequently asked questions

Sources

Keep reading on SpecPicks

More from the archive

Deeper dives from the SpecPicks archive

Just published on SpecPicks

Watch a review