LLM Proxy: One OpenAI-Compatible Endpoint, Keys, Fallbacks, and Cost Tracking

An LLM proxy sits between your application and provider APIs, exposing one OpenAI-compatible endpoint so any client built for OpenAI works unchanged. It centralizes virtual keys, rate limits, fallbacks, caching, cost tracking, and load balancing across providers. A proxy handles access and governance; a router decides which model to use. Morph exposes both at api.morphllm.com.

100+ LLM APIs callable through LiteLLM Proxy

1,600+ models routed by Portkey

400+ models on OpenRouter from 50+ providers

~25ms OpenRouter edge overhead

What an LLM Proxy Does

An LLM proxy is one network hop that all of your model traffic passes through. Per LiteLLM's docs, it exposes a single OpenAI-compatible endpoint so any client built for OpenAI works without code changes. You point your SDK at the proxy's base URL instead of the provider's, and the proxy handles the rest.

The proxy translates each request to the target provider's native format and translates the response back into OpenAI Chat Completions format. This is why a codebase written against the OpenAI SDK can call Anthropic, Bedrock, VertexAI, Cohere, or a self-hosted vLLM endpoint without touching application code.
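To make the translation step concrete, here is a minimal sketch of the two mappings for one provider, assuming plain-text messages only. The function names (toAnthropic, toOpenAI), the 1024-token default, and the prefix stripping are illustrative assumptions, not any proxy's actual internals; real translation layers also handle tools, streaming, and images.

```typescript
// Simplified request/response shapes. Real provider schemas carry many more
// fields; this sketch covers plain text only.
type Role = "system" | "user" | "assistant";

interface OpenAIChatRequest {
  model: string;
  messages: { role: Role; content: string }[];
  max_tokens?: number;
}

interface AnthropicMessagesRequest {
  model: string;
  max_tokens: number;
  system?: string;
  messages: { role: "user" | "assistant"; content: string }[];
}

interface AnthropicMessagesResponse {
  content: { type: "text"; text: string }[];
  stop_reason: string;
  usage: { input_tokens: number; output_tokens: number };
}

interface OpenAIChatResponse {
  choices: {
    index: number;
    message: { role: "assistant"; content: string };
    finish_reason: string;
  }[];
  usage: { prompt_tokens: number; completion_tokens: number; total_tokens: number };
}

// OpenAI carries system prompts inside `messages`; Anthropic takes a
// top-level `system` string and requires `max_tokens`.
function toAnthropic(req: OpenAIChatRequest): AnthropicMessagesRequest {
  const systemParts: string[] = [];
  const messages: AnthropicMessagesRequest["messages"] = [];
  for (const m of req.messages) {
    const role = m.role;
    if (role === "system") systemParts.push(m.content);
    else messages.push({ role, content: m.content });
  }
  const out: AnthropicMessagesRequest = {
    model: req.model.replace(/^anthropic\//, ""), // assumed prefix convention
    max_tokens: req.max_tokens ?? 1024,           // assumed default
    messages,
  };
  if (systemParts.length > 0) out.system = systemParts.join("\n");
  return out;
}

// Map the provider's reply back into the OpenAI Chat Completions shape.
function toOpenAI(res: AnthropicMessagesResponse): OpenAIChatResponse {
  const text = res.content.map((block) => block.text).join("");
  const { input_tokens, output_tokens } = res.usage;
  return {
    choices: [{
      index: 0,
      message: { role: "assistant", content: text },
      // "max_tokens" maps to "length"; every other stop reason is treated as "stop" here.
      finish_reason: res.stop_reason === "max_tokens" ? "length" : "stop",
    }],
    usage: {
      prompt_tokens: input_tokens,
      completion_tokens: output_tokens,
      total_tokens: input_tokens + output_tokens,
    },
  };
}
```

The client only ever sees the OpenAIChatResponse shape, which is why application code does not change when the provider does.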

On top of translation, the proxy is where governance lives. It is the single place to issue per-project virtual keys, enforce rate limits and budgets, cache identical requests, log every call, and track spend. Without a proxy, each of these has to be configured separately in every client against every provider.
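The key and budget checks can be sketched as a single authorization step that runs before any provider call. This is an illustration under stated assumptions: VirtualKeyStore, its method names, and the sliding one-minute window are hypothetical, and a production proxy would persist this state in a database rather than in memory.

```typescript
type Reason = "ok" | "invalid_key" | "budget_exceeded" | "rate_limited";

interface Decision {
  allowed: boolean;
  reason: Reason;
}

interface VirtualKey {
  id: string;
  project: string;
  maxBudgetUsd: number;      // spend cap for this key
  spentUsd: number;          // running total, updated after each call
  requestsPerMinute: number; // rate limit for this key
  revoked: boolean;
}

// Hypothetical in-memory store; names and structure are assumptions.
class VirtualKeyStore {
  private keys = new Map<string, VirtualKey>();
  private recentRequests = new Map<string, number[]>(); // timestamps per key

  issue(id: string, project: string, maxBudgetUsd: number, requestsPerMinute: number): void {
    this.keys.set(id, { id, project, maxBudgetUsd, spentUsd: 0, requestsPerMinute, revoked: false });
  }

  // Revoking a virtual key never touches the provider credentials behind it.
  revoke(id: string): void {
    const key = this.keys.get(id);
    if (key) key.revoked = true;
  }

  recordSpend(id: string, usd: number): void {
    const key = this.keys.get(id);
    if (key) key.spentUsd += usd;
  }

  authorize(id: string, nowMs: number): Decision {
    const key = this.keys.get(id);
    if (!key || key.revoked) return { allowed: false, reason: "invalid_key" };
    if (key.spentUsd >= key.maxBudgetUsd) return { allowed: false, reason: "budget_exceeded" };

    // Sliding window: keep only requests from the last 60 seconds.
    const recent = (this.recentRequests.get(id) ?? []).filter((t) => nowMs - t < 60000);
    if (recent.length >= key.requestsPerMinute) {
      this.recentRequests.set(id, recent);
      return { allowed: false, reason: "rate_limited" };
    }
    recent.push(nowMs);
    this.recentRequests.set(id, recent);
    return { allowed: true, reason: "ok" };
  }
}
```

Because every client request carries a virtual key, this one check replaces per-client, per-provider limit configuration.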

Unified endpoint

One OpenAI-compatible base URL fronts every provider. Clients written for OpenAI work unchanged. Swap providers by changing a model string, not the SDK.

Virtual keys and budgets

Issue a separate key per project or team. Attach rate limits and spend caps to each key. Revoke one key without rotating provider credentials.

Fallbacks and load balancing

Load balance across primary deployments, then fall back to a different provider when all primaries are unavailable. When one provider returns a 429 or 503, the proxy sends the request to the next.

Caching and cost tracking

Serve identical requests from cache instead of the provider. Track cost, latency, and token usage per request for chargeback and observability.
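Exact-match caching and per-request cost tracking can be sketched together, since a cache hit is also a request that costs nothing. Everything named here is an assumption for illustration: the CachingProxy class, the example-model entry, and its prices are hypothetical, and the provider call is synchronous only to keep the sketch short (a real proxy call would be asynchronous).

```typescript
interface ChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  temperature?: number;
}
interface Usage { prompt_tokens: number; completion_tokens: number }
interface Completion { content: string; usage: Usage }
interface CallRecord {
  virtualKey: string;
  model: string;
  cacheHit: boolean;
  costUsd: number;
  latencyMs: number;
}

// Hypothetical prices in USD per million tokens, for illustration only.
const PRICE_PER_MTOK: Record<string, { input: number; output: number }> = {
  "example-model": { input: 1.0, output: 4.0 },
};

// Two requests are "identical" when model, temperature, and messages all match.
function cacheKey(req: ChatRequest): string {
  return JSON.stringify([
    req.model,
    req.temperature ?? null,
    req.messages.map((m) => [m.role, m.content]),
  ]);
}

function costUsd(model: string, usage: Usage): number {
  const price = PRICE_PER_MTOK[model];
  if (!price) return 0; // unknown model: a real proxy would need a price entry
  return (usage.prompt_tokens * price.input + usage.completion_tokens * price.output) / 1000000;
}

class CachingProxy {
  private cache = new Map<string, Completion>();
  readonly log: CallRecord[] = [];

  constructor(
    private callProvider: (req: ChatRequest) => Completion,
    private now: () => number = () => Date.now(),
  ) {}

  handle(virtualKey: string, req: ChatRequest): Completion {
    const started = this.now();
    const key = cacheKey(req);
    const cached = this.cache.get(key);
    const completion = cached ?? this.callProvider(req);
    if (!cached) this.cache.set(key, completion);
    this.log.push({
      virtualKey,
      model: req.model,
      cacheHit: cached !== undefined,
      costUsd: cached ? 0 : costUsd(req.model, completion.usage), // hits are free
      latencyMs: this.now() - started,
    });
    return completion;
  }

  // Per-key spend is the basis for chargeback.
  spendFor(virtualKey: string): number {
    return this.log
      .filter((r) => r.virtualKey === virtualKey)
      .reduce((sum, r) => sum + r.costUsd, 0);
  }
}
```

The log is what feeds chargeback and observability: each record ties a virtual key to a model, a cost, and a latency.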

Proxy vs Router vs Gateway

These three terms get used interchangeably, which hides a real distinction. The proxy and the gateway describe the same layer (unified access plus governance). The router is a different layer: it decides which model a request should use.

Layer	Job	Decides the model?	Example
Proxy / Gateway	Unified OpenAI-compatible access plus keys, rate limits, fallbacks, caching, cost tracking	No	LiteLLM Proxy, Portkey, Cloudflare AI Gateway
Router	Picks which model to send each request to, by difficulty or policy	Yes	Morph Model Router
Provider	Serves the actual model inference behind the proxy	No	OpenAI, Anthropic, Bedrock, vLLM

A proxy can run without a router: you tell it the exact model, and it just handles access and governance. A router needs something to send its decision to, which is often a proxy. The two compose cleanly. The router answers "which model," the proxy answers "how do I reach it, with what key, and what happens if it fails."

Morph documents both layers separately

The LLM gateway page covers the unified access concept. The LLM router page covers the model-selection decision. This page is specifically about the proxy pattern: one endpoint, keys, fallbacks, caching, and cost tracking. Read all three to see how access, decision, and inference separate.

The OpenAI-Compatible Base-URL Swap

The defining feature of an LLM proxy is that adopting it changes one value: the base URL. Your existing OpenAI client code stays the same. You repoint it at the proxy, pass a proxy key, and the proxy fans out to whichever provider the model string names.

Point the OpenAI SDK at a proxy instead of OpenAI

import OpenAI from "openai"

// Before: talking directly to OpenAI
const direct = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
})

// After: same SDK, same calls, pointed at the proxy
const viaProxy = new OpenAI({
  baseURL: "https://your-proxy.example.com/v1",  // the LLM proxy
  apiKey: process.env.PROXY_VIRTUAL_KEY,         // a virtual key issued by the proxy
})

// The proxy routes by the model string. No SDK change.
const res = await viaProxy.chat.completions.create({
  model: "anthropic/claude-sonnet-4",  // proxy translates to Anthropic
  messages: [{ role: "user", content: "Summarize this PR" }],
})

// Switch providers by changing the model string only:
//   "gpt-5"                     -> OpenAI
//   "anthropic/claude-sonnet-4" -> Anthropic
//   "gemini/gemini-pro"         -> Google
//   "hosted-vllm/my-model"      -> self-hosted vLLM

Because the response always comes back in OpenAI Chat Completions format, downstream code that parses choices[0].message.content works no matter which provider served the request. LiteLLM exposes the same unified contract across /chat/completions, /responses, /embeddings, /images, /audio, /batches, /rerank, and /messages.
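How a proxy reads the model string can be sketched in a few lines. This assumes the prefix convention shown in the comments above (a bare name means OpenAI, otherwise provider/model); the resolveTarget function and its provider list are hypothetical and cover only the four examples from this page.

```typescript
type Provider = "openai" | "anthropic" | "gemini" | "hosted-vllm";

const KNOWN_PROVIDERS: readonly Provider[] = ["openai", "anthropic", "gemini", "hosted-vllm"];

interface Target {
  provider: Provider;
  model: string; // model name with the provider prefix removed
}

// Split "provider/model" into its parts. A bare name such as "gpt-5"
// is assumed to mean OpenAI, matching the examples above.
function resolveTarget(modelString: string): Target {
  const slash = modelString.indexOf("/");
  if (slash === -1) return { provider: "openai", model: modelString };

  const prefix = modelString.slice(0, slash);
  const provider = KNOWN_PROVIDERS.find((p) => p === prefix);
  if (!provider) throw new Error(`Unknown provider prefix: ${prefix}`);
  return { provider, model: modelString.slice(slash + 1) };
}
```

Rejecting unknown prefixes early gives the client a clear error instead of a failed upstream call.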

Proxy Options Compared

The main proxy options split along two axes: open-source versus hosted, and how many providers and models each reaches. The feature sets overlap on the basics (unified endpoint, keys, fallbacks) and differ on guardrails, observability depth, and where the proxy runs.

Tool	Open source?	Providers / models	Key features
LiteLLM Proxy	Yes (Python SDK + Proxy Server)	100+ LLM APIs	Cost tracking, guardrails, load balancing, logging, virtual keys, spend tracking, admin dashboard
Portkey	Yes (open-source gateway)	1,600+ models	50+ guardrails, simple + semantic caching, observability on 40+ metrics per request
Cloudflare AI Gateway	No (managed edge service)	Workers AI, Anthropic, Gemini, OpenAI, Replicate	One line of code, analytics, logging, caching, rate limiting, retries with fallbacks
OpenRouter	No (hosted API)	400+ models, 50+ providers	OpenAI-compatible schema, automatic fallbacks, :floor (cheapest) and :nitro (fastest) variants, ~25ms overhead

LiteLLM Proxy is the open-source default when you want to self-host the translation layer. Per its repo, its providers include Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, SageMaker, HuggingFace, vLLM, and NVIDIA NIM. Portkey runs 50+ guardrails (request and response filters, jailbreak detection, PII redaction, policy enforcement) on top of its open-source gateway. Cloudflare AI Gateway trades configurability for an edge deployment that starts with one line of code. OpenRouter is the hosted option that normalizes every model's schema to the OpenAI Chat API so you learn only one interface.

50+ guardrails Portkey runs on its gateway

40+ metrics Portkey tracks per request

1 line to start Cloudflare AI Gateway

OpenRouter routing variants: :floor, :nitro

LiteLLM Proxy

LiteLLM is an open-source Python SDK and Proxy Server that calls 100+ LLM APIs in OpenAI (or native) format. The SDK is the in-process library; the Proxy Server is the standalone gateway you run as a service and point clients at. Both let you swap providers without rewriting code.

The Proxy Server adds the governance layer: cost tracking, guardrails, load balancing, logging, virtual keys, spend tracking, per-project customization (logging, guardrails, caching), and an admin dashboard. It supports simultaneous load balancing and fallbacks, so you load balance across primary deployments and then fall back to a different provider if all primaries are unavailable.

LiteLLM Proxy config with load balancing and fallbacks

# config.yaml for the LiteLLM Proxy Server
model_list:
  - model_name: chat            # the name clients request
    litellm_params:
      model: openai/gpt-5
      api_key: os.environ/OPENAI_API_KEY
  - model_name: chat            # second deployment, same client-facing name
    litellm_params:
      model: anthropic/claude-sonnet-4
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: chat-fallback   # fallback group, referenced by name below
    litellm_params:
      model: gemini/gemini-pro
      api_key: os.environ/GEMINI_API_KEY

# Load balance across the two "chat" deployments,
# then fall back to "chat-fallback" if all primaries are unavailable.
# Fallback targets are model_name entries from model_list.
litellm_settings:
  fallbacks: [{ "chat": ["chat-fallback"] }]

# Clients call one endpoint and one model name ("chat").
# Every response comes back in OpenAI Chat Completions format.

Every response is returned in OpenAI Chat Completions format regardless of which provider served it. That uniformity is what makes the fallback transparent: when the proxy fails over from OpenAI to Anthropic to Gemini, the client never sees a schema change.
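The control flow behind "load balance, then fall back" can be sketched as follows. This is not LiteLLM's implementation: round-robin is an assumed balancing strategy, the Deployment and ModelGroup types are hypothetical, and calls are synchronous only to keep the example short.

```typescript
interface CallResult {
  status: number;  // 200 on success, provider status code otherwise
  content: string; // empty on failure
}

interface Deployment {
  name: string;
  call: (prompt: string) => CallResult;
}

interface ModelGroup {
  primaries: Deployment[];
  fallbacks: Deployment[];
  cursor: number; // round-robin position across primaries
}

// Rate limits and server errors are worth retrying on another deployment;
// a 400 would fail everywhere, so it is surfaced immediately.
function isRetryable(status: number): boolean {
  return status === 429 || status >= 500;
}

function complete(group: ModelGroup, prompt: string): { servedBy: string; content: string } {
  const n = group.primaries.length;
  const start = n > 0 ? group.cursor % n : 0;
  group.cursor += 1;

  // Try every primary once, starting at the round-robin position,
  // then the fallbacks in their configured order.
  const ordered: Deployment[] = [];
  for (let i = 0; i < n; i++) {
    const d = group.primaries[(start + i) % n];
    if (d) ordered.push(d);
  }
  for (const d of ordered.concat(group.fallbacks)) {
    const result = d.call(prompt);
    if (result.status === 200) return { servedBy: d.name, content: result.content };
    if (!isRetryable(result.status)) {
      throw new Error(`${d.name} returned non-retryable status ${result.status}`);
    }
  }
  throw new Error("All primaries and fallbacks are unavailable");
}
```

Fallbacks are only reached after every primary has failed with a retryable status, which mirrors the ordering the config above describes.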

Claude Code Proxy

A common reason people search for an LLM proxy is to point Claude Code at a different model. Claude Code talks to the Anthropic /messages API, and it reads its endpoint from the ANTHROPIC_BASE_URL environment variable. Set that to a proxy and Claude Code sends its traffic there instead of to Anthropic directly.

Point Claude Code at a proxy via environment variables

# Point Claude Code at a proxy that speaks the Anthropic /messages format.
# The proxy (for example a LiteLLM Proxy Server) can then route to other
# providers or models while keeping the Anthropic request/response contract.

export ANTHROPIC_BASE_URL="https://your-proxy.example.com"
export ANTHROPIC_AUTH_TOKEN="your-proxy-virtual-key"

# Now run Claude Code as usual. Its requests hit the proxy,
# which handles keys, fallbacks, caching, and cost tracking.
claude

The proxy keeps the Anthropic-shaped request and response intact, so Claude Code does not know it is talking to a proxy. The proxy is free to add fallbacks (retry on a second provider when the first is rate-limited), caching, per-project keys, and spend tracking on top. This is the same governance the proxy gives any OpenAI client, applied to the Anthropic /messages contract.

The proxy is access, not model choice

Pointing Claude Code at a proxy changes where requests go and what governance wraps them. It does not by itself decide which model is best for a given turn. That is a routing decision. For automatic model selection by prompt difficulty, see the LLM router and using a different LLM in Claude Code.

What a Proxy Costs You

A proxy is not free. It adds a network hop, an operational dependency, and a place where a misconfiguration can break every model call at once. The tradeoffs are worth stating plainly before adopting one.

Added latency

Every request passes through the proxy. OpenRouter reports roughly 25ms of edge overhead. A self-hosted proxy adds the network hop to wherever you run it plus minimal processing. Caching can offset this on repeat requests.

Single point of failure

All model traffic funnels through one layer. If the proxy is down, every provider is unreachable, even ones that are healthy. Self-hosted proxies need their own redundancy and monitoring.

Feature lag and lowest-common-denominator

A unified OpenAI-compatible schema can lag behind a provider's newest native features. Provider-specific parameters sometimes need passthrough handling the proxy may not expose yet.

For multi-provider applications the tradeoffs usually favor the proxy: one integration, centralized keys and budgets, and automatic fallbacks beat hand-rolling provider clients and retry logic. For a single-provider app that never plans to add a second model, a proxy adds a hop and a dependency you do not need yet.
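The latency tradeoff is easy to put into numbers. The sketch below uses the ~25ms overhead OpenRouter reports; the 2,000ms provider time, the 30ms cache response time, and the 20% hit rate in the usage note are assumptions chosen for illustration, not measurements.

```typescript
interface LatencyInputs {
  overheadMs: number;     // extra time the proxy hop adds on a cache miss
  providerMs: number;     // time the provider takes to answer
  cacheHitRate: number;   // fraction of requests served from cache, 0 to 1
  cacheLatencyMs: number; // total time to serve a cached response
}

// Share of a cache-miss request's total time that is spent in the proxy.
function overheadShare(overheadMs: number, providerMs: number): number {
  return overheadMs / (overheadMs + providerMs);
}

// Average latency through the proxy: hits cost cacheLatencyMs,
// misses cost the provider time plus the proxy overhead.
function expectedLatencyMs(x: LatencyInputs): number {
  const missMs = x.overheadMs + x.providerMs;
  return x.cacheHitRate * x.cacheLatencyMs + (1 - x.cacheHitRate) * missMs;
}

// Hit rate at which the proxy's average latency equals calling the provider directly.
function breakEvenHitRate(x: Omit<LatencyInputs, "cacheHitRate">): number {
  return x.overheadMs / (x.overheadMs + x.providerMs - x.cacheLatencyMs);
}
```

With the assumed numbers, overheadShare(25, 2000) is a little over 1% of the request, and a 20% hit rate brings the average to roughly 1,626ms against 2,000ms direct. Under these assumptions the break-even hit rate is also a little over 1%, so even light caching offsets the hop.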

Where Morph Fits

Morph exposes one OpenAI-compatible endpoint at https://api.morphllm.com/v1. Any client built for OpenAI works by changing the base URL, the same contract as a general-purpose LLM proxy. Through that endpoint Morph serves its own fleet (glm51-754b, qwen35-397b at ~120 tok/s, qwen36-27b, minimax27-230b at ~140 tok/s, dsv4flash, warp-grep-v2.1) plus its specialized models.

Morph also runs the decision layer the proxy pattern leaves out. The Model Router classifies prompt difficulty into four tiers (easy, medium, hard, needs_info) in ~430ms. It offers balanced and aggressive modes and costs about $0.001 per classification, for 40-70% API cost savings. The proxy answers "reach the model with this key and fail over if it breaks." The router answers "which model should this turn use."
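How a router's tier feeds the proxy can be sketched as a small lookup. The tier names come from the description above; everything else is an assumption made for illustration: the tier-to-model mapping is not Morph's actual policy, planTurn is a hypothetical helper, and treating needs_info as "ask the user for more detail" is one plausible reading rather than documented behavior.

```typescript
type Tier = "easy" | "medium" | "hard" | "needs_info";

type Plan =
  | { kind: "call"; model: string }       // send the turn to this model via the proxy
  | { kind: "clarify"; message: string }; // assumed handling for needs_info

// Hypothetical mapping. The model names appear on this page,
// but which tier uses which model is an assumption.
const MODEL_FOR_TIER: Record<Exclude<Tier, "needs_info">, string> = {
  easy: "qwen36-27b",
  medium: "minimax27-230b",
  hard: "glm51-754b",
};

// The router supplies the tier; this step turns it into a model string
// that an OpenAI-compatible client can pass to the proxy unchanged.
function planTurn(tier: Tier): Plan {
  if (tier === "needs_info") {
    return { kind: "clarify", message: "Could you share more detail about what you need?" };
  }
  return { kind: "call", model: MODEL_FOR_TIER[tier] };
}
```

The output of planTurn is just a model string, so the proxy side stays exactly as it was: same endpoint, same key, same fallbacks.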

api.morphllm.com: one OpenAI-compatible endpoint

~430ms router difficulty classification

40-70% cost savings from routing

$0.001 per router classification

On Morph's fleet, the specialized models that ride the same endpoint show where dedicated inference beats a generic proxy hop. morph-v3-fast applies code edits at ~10,500 tok/s with ngram speculative decoding (k=64), and morph-compactor compresses a payload that takes 1.5 minutes elsewhere down to ~2.5 seconds at ~33k tok/s. A proxy gives you access; a tuned model behind it gives you speed.

Frequently Asked Questions

What is an LLM proxy?

An LLM proxy (AI gateway) sits between your application and provider APIs, exposing a single OpenAI-compatible endpoint so any client built for OpenAI works without code changes. It adds key management, rate limiting, fallbacks, caching, cost tracking, and load balancing across providers. You change the base URL and keep your existing code.

What is the difference between an LLM proxy and an LLM router?

A proxy provides unified OpenAI-compatible access and governance (virtual keys, rate limits, fallbacks, caching, cost tracking) across many providers. A router decides which model to send a given request to. The proxy is the access layer; the router is the decision layer. Morph runs both: one OpenAI-compatible endpoint at api.morphllm.com plus a router that classifies prompt difficulty in ~430ms.

Is an LLM proxy the same as an AI gateway?

The terms are used interchangeably. LiteLLM, Portkey, and Cloudflare all describe their proxies as AI gateways. Both terms describe the layer that exposes one endpoint and centralizes keys, rate limits, fallbacks, caching, and logging across providers. Some teams reserve gateway for the hosted product and proxy for the self-hosted process, but the function is identical.

How do I use LiteLLM proxy?

LiteLLM Proxy is an open-source server that calls 100+ LLM APIs in OpenAI format. You run it as a server, configure your provider keys, and point your client's base URL at the proxy. Every response comes back in OpenAI Chat Completions format regardless of provider. It exposes /chat/completions, /responses, /embeddings, /images, /audio, /batches, /rerank, and /messages, plus virtual keys, spend tracking, load balancing, and fallbacks.

How do I point Claude Code at a proxy?

Claude Code reads the ANTHROPIC_BASE_URL environment variable. Set it to your proxy's URL (for example a LiteLLM Proxy server exposing the Anthropic /messages format) and set ANTHROPIC_AUTH_TOKEN to your proxy key. Claude Code then sends its requests to the proxy, which can fan out to other providers or models while keeping the Anthropic-shaped request and response contract intact.

Does an LLM proxy add latency?

A small amount. OpenRouter reports roughly 25ms of overhead from its edge architecture. A self-hosted proxy like LiteLLM adds the network hop to wherever you run it plus minimal processing. Caching can make repeat requests faster than calling the provider directly, since cached responses are served from the proxy. For most workloads the added latency is small relative to multi-second model response times.

Can an LLM proxy fail over to another provider?

Yes. Fallbacks are a core proxy feature. LiteLLM can load balance across primary deployments and then fall back to a different provider if all primaries are unavailable. OpenRouter automatically fails over to a backup model or provider when the primary is down or rate-limited. Cloudflare AI Gateway supports request retries with model fallbacks.

One Endpoint for Access. One Router for the Decision.

Morph exposes an OpenAI-compatible endpoint at api.morphllm.com and a router that classifies prompt difficulty in ~430ms into four tiers, at $0.001 per classification, for 40-70% API cost savings. Change one base URL to start.
