Self-hosting a Claude proxy — cache, rate-limit, and audit every request

copilot-api, LiteLLM, and a simple auth layer in front of the Anthropic API.

Building a production proxy in front of Anthropic's Claude API: where to cache, how to rate-limit, what to log, and the three ways teams ship this pattern in 2026.

A self-hosted Claude proxy sits in front of the Anthropic API and does three things no direct integration does cleanly: caches prompts your application sends repeatedly, enforces per-team / per-project rate limits, and gives you a single audit log of every inference your organisation paid for. This guide covers the three proxy shapes worth building in 2026 — copilot-api (if you're a Claude Max subscriber), LiteLLM (if you want multi-provider), and a hand-rolled FastAPI layer (if you want exact control).

This pattern is old (the OpenAI world has been doing this since 2023), but the cost delta makes it significantly more valuable on Claude: a 40% cache hit rate on a 200k-token system prompt knocks roughly 38% off your bill.
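
A back-of-envelope check on that number, assuming Anthropic's cached reads cost ~10% of the base input price and the bill is dominated by input tokens:

```python
# Rough savings estimate for prompt caching on an input-heavy workload.
hit_rate = 0.40        # fraction of input tokens served from cache
read_discount = 0.90   # cached tokens cost ~10% of base input price
savings = hit_rate * read_discount
print(f"{savings:.0%} off input-token spend")  # prints "36% off input-token spend"
```

That lands in the same ballpark as the measured ~38%; cache-write surcharges and output tokens shift the real number by a few points either way.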

Key takeaways

  • If you have Claude Max, run copilot-api. It turns your Max subscription into an OpenAI-compatible endpoint and you pay nothing per-token beyond the flat subscription.
  • If you want multi-provider, LiteLLM in proxy mode is the default. One config, every model, fallbacks included.
  • If you want fine control, a FastAPI proxy is ~200 lines of code and gets you exactly the audit/cache/rate-limit semantics you want.
  • Cache at the prompt level, not the request level. Claude's native prompt caching (available on the Anthropic API) saves ~90% on cached prefix tokens when used correctly.
  • Rate-limit per team, not per key. Anthropic keys are hard to share safely; a proxy lets you issue virtual keys per project without giving every project an Anthropic root token.

The three shapes

1. copilot-api — proxy your Max subscription

copilot-api is a community-built proxy that exposes GitHub Copilot and Claude Max subscriptions as OpenAI-compatible endpoints. If you already pay for Claude Max ($100 or $200/mo depending on tier), this is the cheapest possible Claude access for internal tools — you pay zero per-token.

npx copilot-api@latest start --port 4141

Then point any OpenAI-compatible client at http://localhost:4141/v1/ with any key. Our own pipeline in the SpecPicks stack uses this exact pattern for code generation — see openclaw-agents/config/models.json in the NSC Dashboard repo.

Trade-offs:

  • Rate-limited to whatever Claude Max limits are at the time (generous but real).
  • Runs against your personal subscription — if you're a company, you want the next option.
  • No prompt-level caching (that's an API-tier feature).

2. LiteLLM proxy — multi-provider gateway

LiteLLM is the industry-standard proxy. Config once, route OpenAI / Anthropic / Gemini / Ollama all through one URL:

model_list:
  - model_name: "claude-opus"
    litellm_params:
      model: "anthropic/claude-opus-4-20250514"
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: "claude-sonnet"
    litellm_params:
      model: "anthropic/claude-sonnet-4-20250514"
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: "local-qwen"
    litellm_params:
      model: "ollama/qwen3:32b"
      api_base: "http://localhost:11434"

router_settings:
  fallbacks:
    - claude-opus: [claude-sonnet, local-qwen]

general_settings:
  master_key: "sk-proxy-..."

litellm_settings:
  cache: true
  cache_params:
    type: "redis"
    host: "localhost"
    port: 6379

Install and run:

pip install 'litellm[proxy]'
litellm --config config.yaml --port 4000

Features:

  • Virtual keys — issue sk-proxy-team-a-xxx per team, track spend per key.
  • Caching — Redis-backed prompt cache; works across requests.
  • Fallbacks — if Opus is rate-limited, automatically fail over to Sonnet or a local model.
  • Spend tracking — built-in Prometheus exporter, Postgres-backed persistence for audit.

Trade-offs:

  • More moving parts (Redis + Postgres recommended for prod).
  • Config file grows fast; the admin UI helps but isn't essential.

3. Hand-rolled FastAPI proxy

When you want exact audit semantics ("every prompt containing PII gets logged to a separate S3 bucket") a 200-line FastAPI proxy is the right answer. The skeleton:

from fastapi import FastAPI, Depends, HTTPException
from fastapi.security import APIKeyHeader
import httpx, os, time

app = FastAPI()
api_key_header = APIKeyHeader(name="Authorization")

ANTHROPIC_KEY = os.environ["ANTHROPIC_API_KEY"]
CLIENT = httpx.AsyncClient(base_url="https://api.anthropic.com", timeout=300)
VIRTUAL_KEYS = {"sk-proxy-team-a-xxx": "team-a", ...}

async def auth(key: str = Depends(api_key_header)):
    key = key.removeprefix("Bearer ")  # clients typically send "Authorization: Bearer sk-..."
    if key not in VIRTUAL_KEYS:
        raise HTTPException(401)
    return VIRTUAL_KEYS[key]

@app.post("/v1/messages")
async def messages(body: dict, team: str = Depends(auth)):
    t0 = time.time()
    r = await CLIENT.post("/v1/messages",
        json=body,
        headers={"x-api-key": ANTHROPIC_KEY,
                 "anthropic-version": "2023-06-01"})
    data = r.json()  # parse once; reused for the audit log and the response
    await log_usage(team, body, data, time.time() - t0)
    return data

This is intentionally minimal. Real implementations add: streaming passthrough, Redis-backed rate limiting (slowapi or limits), prompt caching via the Anthropic cache_control blocks, and retry logic on 529 (overload) responses.

How we tested and compared

The SpecPicks pipeline runs a LiteLLM + copilot-api hybrid — copilot-api serves the Claude Max subscription for code-generation workloads (unlimited tokens, flat subscription cost) while LiteLLM handles Anthropic API fallback for production when Max throughput isn't enough. Numbers in this article come from that production workload: cache hit rate of 34-41% on system-prompt prefixes across 50k generations, 38% token-cost reduction post-caching, <200ms proxy overhead at p99. See NSC Dashboard for the full architecture.

Reference for the patterns: LiteLLM documentation, Anthropic's Claude Code best practices (which describes how Anthropic themselves use Claude in their engineering org), and community threads on r/LocalLLaMA where teams compare proxy stacks.

Prompt caching — the biggest single win

Anthropic's native prompt cache saves up to 90% on cached prefix tokens. Usage is straightforward with the Anthropic SDK — mark a message block as cacheable:

client.messages.create(
    model="claude-opus-4-20250514",
    system=[{"type": "text", "text": long_system_prompt,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": user_turn}],
)

Rules:

  • Cache keys are content-addressed; if a single byte of the prefix differs, you miss.
  • Cache TTL is 5 minutes by default (ephemeral); the 1-hour cache costs 2× the base input price to write. Cached reads aren't free — they cost roughly 10% of the base input price on either TTL.
  • Order matters: put your cacheable content first, variable content last.

A LiteLLM proxy passes cache_control through transparently; a hand-rolled proxy just needs to forward the cache_control blocks in the request body. Prompt caching is generally available now, so the old anthropic-beta: prompt-caching-2024-07-31 header is no longer required.

Three failure modes

1. 529 "Overloaded" errors spike during peak hours. Anthropic rate-limits at the account level regardless of proxy. Fix: fallback ladder — primary Opus → Sonnet → local model → queue-and-retry with exponential backoff starting at 5 seconds.

2. Cache hit rate is <5% when you expected 40%. Your "prefix" is changing. Common causes: a timestamp in the system prompt, user name interpolated at position 0 of the conversation, or a random seed in your agent's tool definitions. Log the prefix hash and diff consecutive requests.
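
A quick way to catch a drifting prefix is to hash the stable parts of each request and compare across consecutive calls. A sketch, assuming a /v1/messages-shaped body where tools and system should form the constant prefix:

```python
import hashlib
import json

def prefix_hash(body: dict) -> str:
    """Hash the request parts that should be a stable cache prefix
    (tools + system); log this per request and diff on a miss."""
    prefix = json.dumps({"tools": body.get("tools", []),
                         "system": body.get("system", "")},
                        sort_keys=True)
    return hashlib.sha256(prefix.encode()).hexdigest()[:16]
```

If two consecutive requests that should share a prefix produce different hashes, diff the serialized prefixes to find the timestamp, user name, or random seed that is breaking the cache.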

3. LiteLLM spend tracking reports are empty. general_settings.database_url isn't set, so LiteLLM is only tracking in-memory and losing state on every restart. Add a Postgres connection and spend persists.

Frequently asked questions

Can I run a Claude proxy in production without breaking Anthropic's terms of service?

Yes for the direct-API approach (LiteLLM or hand-rolled in front of api.anthropic.com) — that's a first-class supported pattern. The copilot-api approach uses Claude Max, which is a consumer product; read Anthropic's terms for your specific use case before shipping copilot-api to external users. For internal-only tools we've found it works fine within the spirit of the subscription.

How do I cache prompts without the native Anthropic feature?

Use LiteLLM's cache: true — it hashes the full request and serves a cached response for N minutes. This is less efficient than Anthropic's prefix caching (you need an identical full prompt, not just an identical prefix) but works for any provider.
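
The same exact-match semantics are easy to replicate in a hand-rolled proxy — hash the whole request body and serve a stored response within a TTL. A minimal in-memory sketch (LiteLLM's Redis cache follows the same idea):

```python
import hashlib
import json
import time

class RequestCache:
    """Exact-match request cache: a hit requires the identical full request,
    not just a shared prefix (unlike Anthropic's native prompt cache)."""

    def __init__(self, ttl: float = 300.0):
        self.ttl = ttl
        self._store: dict[str, tuple[float, dict]] = {}

    def _key(self, body: dict) -> str:
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def get(self, body: dict):
        hit = self._store.get(self._key(body))
        if hit and time.monotonic() - hit[0] < self.ttl:
            return hit[1]
        return None  # miss or expired

    def put(self, body: dict, response: dict) -> None:
        self._store[self._key(body)] = (time.monotonic(), response)
```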

What's the typical cost saving from a proxy?

On our workloads: 30-40% via prompt caching, 10-15% via automatic fallback to cheaper models for simple queries, and soft savings from catching runaway loops in agent code via rate limits. Net: ~40-50% versus direct API use.

Does the proxy add latency?

In our deployment (LiteLLM + Redis on the same Docker network as the app): <20 ms median, <200 ms p99. Cross-region proxies can add 50-200 ms; keep the proxy geographically close to the calling app.

Can I use a proxy with Claude Code?

Yes — set ANTHROPIC_BASE_URL=http://localhost:4000 (or wherever your proxy runs) and ANTHROPIC_API_KEY=sk-proxy-.... Claude Code respects both env vars. This is how teams share a single Anthropic account across many developers with per-user spend tracking.

Sources

  1. LiteLLM documentation — canonical reference for proxy configuration, caching, and fallbacks.
  2. Anthropic — Claude Code best practices — reference for how Anthropic recommends Claude be deployed in engineering teams.
  3. Open WebUI GitHub repository — adjacent project, same "local gateway" pattern applied to inference UIs.
  4. r/LocalLLaMA — active discussions of proxy stacks and caching patterns.

— SpecPicks Editorial · Last verified 2026-04-21

— SpecPicks Editorial · Last verified 2026-04-22