Self-hosting a Claude proxy — cache, rate-limit, and audit every request

copilot-api, LiteLLM, and a simple auth layer in front of the Anthropic API.

Building a production proxy in front of Anthropic's Claude API: where to cache, how to rate-limit, what to log, and the three ways teams ship this pattern in 2026.

A self-hosted Claude proxy sits in front of the Anthropic API and does three things no direct integration does cleanly: caches prompts your application sends repeatedly, enforces per-team / per-project rate limits, and gives you a single audit log of every inference your organisation paid for. This guide covers the three proxy shapes worth building in 2026 — copilot-api (if you're a Claude Max subscriber), LiteLLM (if you want multi-provider), and a hand-rolled FastAPI layer (if you want exact control).

This pattern is old (the OpenAI world has been doing this since 2023), but the cost delta makes it significantly more valuable on Claude: a 40% cache hit rate on a 200k-token system prompt knocks roughly 38% off your bill.
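
A back-of-envelope check on that number, assuming Anthropic's cached reads cost ~10% of the base input price and the bill is dominated by input tokens:

```python
# Rough savings estimate for prompt caching on an input-heavy workload.
hit_rate = 0.40        # fraction of input tokens served from cache
read_discount = 0.90   # cached tokens cost ~10% of base input price
savings = hit_rate * read_discount
print(f"{savings:.0%} off input-token spend")  # prints "36% off input-token spend"
```

That lands in the same ballpark as the measured ~38%; cache-write surcharges and output tokens shift the real number by a few points either way.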

Key takeaways

  • If you have Claude Max, run copilot-api. It turns your Max subscription into an OpenAI-compatible endpoint and you pay nothing per-token beyond the flat subscription.
  • If you want multi-provider, LiteLLM in proxy mode is the default. One config, every model, fallbacks included.
  • If you want fine control, a FastAPI proxy is ~200 lines of code and gets you exactly the audit/cache/rate-limit semantics you want.
  • Cache at the prompt level, not the request level. Claude's native prompt caching (available on the Anthropic API) saves ~90% on cached prefix tokens when used correctly.
  • Rate-limit per team, not per key. Anthropic keys are hard to share safely; a proxy lets you issue virtual keys per project without giving every project an Anthropic root token.

The three shapes

1. copilot-api — proxy your Max subscription

copilot-api is a community-built proxy that exposes GitHub Copilot and Claude Max subscriptions as OpenAI-compatible endpoints. If you already pay for Claude Max ($100 or $200/mo depending on tier), this is the cheapest possible Claude access for internal tools — you pay zero per-token.

npx copilot-api@latest start --port 4141

Then point any OpenAI-compatible client at http://localhost:4141/v1/ with any key. Our own pipeline in the SpecPicks stack uses this exact pattern for code generation — see openclaw-agents/config/models.json in the NSC Dashboard repo.

Trade-offs:

  • Rate-limited to whatever Claude Max limits are at the time (generous but real).
  • Runs against your personal subscription — if you're a company, you want the next option.
  • No prompt-level caching (that's an API-tier feature).

2. LiteLLM proxy — multi-provider gateway

LiteLLM is the industry-standard proxy. Config once, route OpenAI / Anthropic / Gemini / Ollama all through one URL:

model_list:
  - model_name: "claude-opus"
    litellm_params:
      model: "anthropic/claude-opus-4-20250514"
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: "claude-sonnet"
    litellm_params:
      model: "anthropic/claude-sonnet-4-20250514"
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: "local-qwen"
    litellm_params:
      model: "ollama/qwen3:32b"
      api_base: "http://localhost:11434"

router_settings:
  fallbacks:
    - claude-opus: [claude-sonnet, local-qwen]

general_settings:
  master_key: "sk-proxy-..."

litellm_settings:
  cache: true
  cache_params:
    type: "redis"
    host: "localhost"
    port: 6379

Install and run:

pip install 'litellm[proxy]'
litellm --config config.yaml --port 4000

Features:

  • Virtual keys — issue sk-proxy-team-a-xxx per team, track spend per key.
  • Caching — Redis-backed prompt cache; works across requests.
  • Fallbacks — if Opus is rate-limited, automatically fail over to Sonnet or a local model.
  • Spend tracking — built-in Prometheus exporter, Postgres-backed persistence for audit.

Trade-offs:

  • More moving parts (Redis + Postgres recommended for prod).
  • Config file grows fast; the admin UI helps but isn't essential.

3. Hand-rolled FastAPI proxy

When you want exact audit semantics ("every prompt containing PII gets logged to a separate S3 bucket") a 200-line FastAPI proxy is the right answer. The skeleton:

from fastapi import FastAPI, Depends, HTTPException
from fastapi.security import APIKeyHeader
import httpx, os, time

app = FastAPI()
api_key_header = APIKeyHeader(name="Authorization")

ANTHROPIC_KEY = os.environ["ANTHROPIC_API_KEY"]
CLIENT = httpx.AsyncClient(base_url="https://api.anthropic.com", timeout=300)
VIRTUAL_KEYS = {"sk-proxy-team-a-xxx": "team-a", ...}

async def auth(key: str = Depends(api_key_header)):
    key = key.removeprefix("Bearer ")  # clients typically send "Authorization: Bearer sk-..."
    if key not in VIRTUAL_KEYS:
        raise HTTPException(401)
    return VIRTUAL_KEYS[key]

@app.post("/v1/messages")
async def messages(body: dict, team: str = Depends(auth)):
    t0 = time.time()
    r = await CLIENT.post("/v1/messages",
        json=body,
        headers={"x-api-key": ANTHROPIC_KEY,
                 "anthropic-version": "2023-06-01"})
    data = r.json()  # parse once; reused for the audit log and the response
    await log_usage(team, body, data, time.time() - t0)
    return data

This is intentionally minimal. Real implementations add: streaming passthrough, Redis-backed rate limiting (slowapi or limits), prompt caching via the Anthropic cache_control blocks, and retry logic on 529 (overload) responses.

How we tested and compared

The SpecPicks pipeline runs a LiteLLM + copilot-api hybrid — copilot-api serves the Claude Max subscription for code-generation workloads (unlimited tokens, flat subscription cost) while LiteLLM handles Anthropic API fallback for production when Max throughput isn't enough. Numbers in this article come from that production workload: cache hit rate of 34-41% on system-prompt prefixes across 50k generations, 38% token-cost reduction post-caching, <200ms proxy overhead at p99. See NSC Dashboard for the full architecture.

Reference for the patterns: LiteLLM documentation, Anthropic's Claude Code best practices (which describes how Anthropic themselves use Claude in their engineering org), and community threads on r/LocalLLaMA where teams compare proxy stacks.

Prompt caching — the biggest single win

Anthropic's native prompt cache saves up to 90% on cached prefix tokens. Usage is straightforward with the Anthropic SDK — mark a message block as cacheable:

client.messages.create(
    model="claude-opus-4-20250514",
    system=[{"type": "text", "text": long_system_prompt,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": user_turn}],
)

Rules:

  • Cache keys are content-addressed; if a single byte of the prefix differs, you miss.
  • Cache TTL is 5 minutes by default (ephemeral); the 1-hour cache costs 2× the base input price to write. Cached reads aren't free — they cost roughly 10% of the base input price on either TTL.
  • Order matters: put your cacheable content first, variable content last.

A LiteLLM proxy passes cache_control through transparently; a hand-rolled proxy just needs to forward the cache_control blocks in the request body. Prompt caching is generally available now, so the old anthropic-beta: prompt-caching-2024-07-31 header is no longer required.

Three failure modes

1. 529 "Overloaded" errors spike during peak hours. Anthropic rate-limits at the account level regardless of proxy. Fix: fallback ladder — primary Opus → Sonnet → local model → queue-and-retry with exponential backoff starting at 5 seconds.

2. Cache hit rate is <5% when you expected 40%. Your "prefix" is changing. Common causes: a timestamp in the system prompt, user name interpolated at position 0 of the conversation, or a random seed in your agent's tool definitions. Log the prefix hash and diff consecutive requests.
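
A quick way to catch a drifting prefix is to hash the stable parts of each request and compare across consecutive calls. A sketch, assuming a /v1/messages-shaped body where tools and system should form the constant prefix:

```python
import hashlib
import json

def prefix_hash(body: dict) -> str:
    """Hash the request parts that should be a stable cache prefix
    (tools + system); log this per request and diff on a miss."""
    prefix = json.dumps({"tools": body.get("tools", []),
                         "system": body.get("system", "")},
                        sort_keys=True)
    return hashlib.sha256(prefix.encode()).hexdigest()[:16]
```

If two consecutive requests that should share a prefix produce different hashes, diff the serialized prefixes to find the timestamp, user name, or random seed that is breaking the cache.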

3. LiteLLM spend tracking reports are empty. general_settings.database_url isn't set, so LiteLLM is only tracking in-memory and losing state on every restart. Add a Postgres connection and spend persists.

Frequently asked questions

Can I run a Claude proxy in production without breaking Anthropic's terms of service?

Yes for the direct-API approach (LiteLLM or hand-rolled in front of api.anthropic.com) — that's a first-class supported pattern. The copilot-api approach uses Claude Max, which is a consumer product; read Anthropic's terms for your specific use case before shipping copilot-api to external users. For internal-only tools we've found it works fine within the spirit of the subscription.

How do I cache prompts without the native Anthropic feature?

Use LiteLLM's cache: true — it hashes the full request and serves a cached response for N minutes. This is less efficient than Anthropic's prefix caching (you need an identical full prompt, not just an identical prefix) but works for any provider.
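
The same exact-match semantics are easy to replicate in a hand-rolled proxy — hash the whole request body and serve a stored response within a TTL. A minimal in-memory sketch (LiteLLM's Redis cache follows the same idea):

```python
import hashlib
import json
import time

class RequestCache:
    """Exact-match request cache: a hit requires the identical full request,
    not just a shared prefix (unlike Anthropic's native prompt cache)."""

    def __init__(self, ttl: float = 300.0):
        self.ttl = ttl
        self._store: dict[str, tuple[float, dict]] = {}

    def _key(self, body: dict) -> str:
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def get(self, body: dict):
        hit = self._store.get(self._key(body))
        if hit and time.monotonic() - hit[0] < self.ttl:
            return hit[1]
        return None  # miss or expired

    def put(self, body: dict, response: dict) -> None:
        self._store[self._key(body)] = (time.monotonic(), response)
```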

What's the typical cost saving from a proxy?

On our workloads: 30-40% via prompt caching, 10-15% via automatic fallback to cheaper models for simple queries, and soft savings from catching runaway loops in agent code via rate limits. Net: ~40-50% versus direct API use.

Does the proxy add latency?

In our deployment (LiteLLM + Redis on the same Docker network as the app): <20 ms median, <200 ms p99. Cross-region proxies can add 50-200 ms; keep the proxy geographically close to the calling app.

Can I use a proxy with Claude Code?

Yes — set ANTHROPIC_BASE_URL=http://localhost:4000 (or wherever your proxy runs) and ANTHROPIC_API_KEY=sk-proxy-.... Claude Code respects both env vars. This is how teams share a single Anthropic account across many developers with per-user spend tracking.

Sources

  1. LiteLLM documentation — canonical reference for proxy configuration, caching, and fallbacks.
  2. Anthropic — Claude Code best practices — reference for how Anthropic recommends Claude be deployed in engineering teams.
  3. Open WebUI GitHub repository — adjacent project, same "local gateway" pattern applied to inference UIs.
  4. r/LocalLLaMA — active discussions of proxy stacks and caching patterns.

— SpecPicks Editorial · Last verified 2026-04-21

— SpecPicks Editorial · Last verified 2026-04-22