Lovable processes 1.8 billion tokens per minute across five LLM providers. At that volume, every provider fails eventually. Rate limits hit. Streams die mid-generation. Entire regions go dark. Their users rarely notice, because the routing layer absorbs it all: a PID controller recalculating provider health every 30 seconds, probabilistic fallback chains preserving prompt caching, and weighted traffic distribution that recovers from failures without anyone touching a config file.
1.8 Billion Tokens Per Minute
Lovable is an AI-powered software development platform. Every user interaction generates multiple LLM calls: planning, code generation, debugging, iteration. At peak traffic, this produces 1.8 billion tokens per minute flowing through their infrastructure across five providers.
At this scale, the probability of at least one provider having issues at any given moment approaches certainty. Rate limits trigger during traffic spikes. Streaming responses fail mid-generation when a provider's load balancer sheds connections. Regional outages take an entire provider offline for minutes. Without routing, each of these events becomes a user-visible error.
The conventional approach is a ranked fallback list: try Provider A, if it fails try Provider B, then C. This works for low traffic. At 1.8 billion tokens per minute, it creates a stampede. When Provider A goes down, all traffic floods Provider B, which immediately hits its own rate limits. The failure cascades through every provider on the list.
Why simple fallbacks fail at scale
A ranked fallback list turns a single provider outage into a multi-provider outage. All traffic shifts to the second provider simultaneously, overwhelming its rate limits. The system needs proportional distribution, not binary failover.
What Model Routing Actually Is
Model routing is a decision layer between your application and your LLM providers. It evaluates each request and determines where to send it based on some combination of cost, latency, quality, task type, and provider health. The term covers a broad spectrum of implementations.
At one end: multi-provider load balancing. Lovable's system distributes traffic across providers proportionally, maintaining prompt caching and handling failover transparently. The model stays the same (Claude Sonnet, say), but the provider serving it changes based on real-time health signals.
At the other end: per-request model selection. Given a prompt, which model should handle it? A TODO comment does not need Opus. A complex refactor does not work well on Haiku. A router classifies the task and picks the right model tier.
In between: hybrid systems that do both. Route to the right model, then route that model to the right provider. The two decisions are independent and compose naturally.
Provider routing
Same model, different providers. Distribute traffic based on health, latency, cost, and rate limit headroom. Preserves prompt caching through provider affinity.
Model routing
Different models, matched to task difficulty. Easy prompts go to cheap models. Hard prompts go to frontier models. Classification determines which.
Hybrid routing
Select the model based on the task, then select the provider based on real-time availability. Two independent routing decisions that compose naturally.
Lovable's PID Controller Approach
Lovable's routing system centers on a PID controller that continuously adjusts how much traffic each provider receives. Every 30 seconds, it computes a score for each provider:
Provider health scoring
// Computed every 30 seconds per provider
score = successful_responses - 200 * errors + 1
// PID controller adjusts availability to keep score near zero
// If score < 0 (error rate > 0.5%): reduce availability
// If score > 0: increase availability
// The +1 bias prevents permanently abandoning recovered providers
// Availability capped at [0, 1]

The 200x error penalty means a single error costs the equivalent of 200 successful responses. This makes the system extremely sensitive to failures. At a 0.5% error rate, the score crosses zero and the controller starts reducing that provider's traffic share. The +1 bias ensures that a provider with zero traffic still gets a small positive score, so it eventually receives test traffic and can recover.
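The control loop can be sketched in a few lines. This is a proportional-only simplification, not Lovable's actual PID implementation; the function name and the `GAIN` constant are illustrative assumptions:

```python
# Proportional-only sketch of the availability controller.
# GAIN and the function name are illustrative, not Lovable's code.
GAIN = 0.0001  # how aggressively availability reacts to the score

def update_availability(availability: float, successes: int, errors: int) -> float:
    # Score crosses zero at a 0.5% error rate (200 * errors == successes).
    score = successes - 200 * errors + 1
    # Nudge availability in the direction of the score's sign, clamped to [0, 1].
    return min(1.0, max(0.0, availability + GAIN * score))

# Healthy provider (0.1% errors): score is large and positive, availability rises.
healthy = update_availability(0.5, 10_000, 10)
# Degraded provider (1% errors): score is negative, availability falls.
degraded = update_availability(0.5, 10_000, 100)
```

A full PID controller adds integral and derivative terms on top of this proportional step, which is what damps the oscillation described below.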
Weight calculation
With multiple providers, total availability usually exceeds 1. Lovable converts availabilities to weights that sum to 1, prioritizing preferred providers:
Provider weight distribution
// First provider: weight = its availability
weight_1 = availability_1
// Second provider: weight = min(availability, remaining capacity)
weight_2 = min(availability_2, 1 - weight_1)
// Third provider: fills remaining gap
weight_3 = min(availability_3, 1 - weight_1 - weight_2)
// Example with availabilities [0.8, 0.7, 0.9]:
// weight_1 = 0.8 (preferred provider gets 80%)
// weight_2 = 0.2 (second provider fills remaining 20%)
// weight_3 = 0.0 (third provider is pure fallback)
// If provider 1 degrades to availability 0.3:
// weight_1 = 0.3 (reduced share)
// weight_2 = 0.7 (automatically absorbs traffic)
// weight_3 = 0.0 (still fallback)

This greedy allocation means the preferred provider always gets first claim on traffic, up to its current availability. When it degrades, traffic flows to the next provider without any manual intervention. Recovery is also automatic: as the preferred provider's health improves, its availability rises, and it reclaims traffic from downstream providers.
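The greedy allocation above reduces to a short loop. A minimal sketch (function name is an assumption):

```python
def allocate_weights(availabilities: list[float]) -> list[float]:
    # Greedy allocation: each provider, in preference order, claims up to
    # its availability from whatever traffic share remains.
    weights = []
    remaining = 1.0
    for a in availabilities:
        w = min(a, remaining)
        weights.append(w)
        remaining -= w
    return weights

# Matches the worked example above:
# allocate_weights([0.8, 0.7, 0.9]) -> approximately [0.8, 0.2, 0.0]
# allocate_weights([0.3, 0.7, 0.9]) -> approximately [0.3, 0.7, 0.0]
```

Note that the weights sum to 1 only while total availability covers the traffic; if every provider degrades at once, the weights sum to less than 1 and some requests have nowhere healthy to go.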
Why a PID controller, not simple thresholds
Simple threshold-based routing (if error rate > 5%, switch providers) creates oscillation. Traffic shifts away, the provider recovers because it has no load, traffic shifts back, it fails again. A PID controller smooths this with proportional, integral, and derivative terms that converge on a stable traffic distribution.
Prompt Caching Preservation
Prompt caching is the single largest cost and latency optimization available in LLM inference. Anthropic's implementation caches prompt prefixes and serves subsequent requests with identical prefixes at 90% reduced cost and up to 85% reduced latency. A 100K-token prompt that takes 11.5 seconds on first call drops to 2.4 seconds on cache hit.
The constraint: caches are per-provider. If a project's request hits Anthropic direct on one call and Anthropic via AWS Bedrock on the next, the cache misses. Both calls pay full price. At Lovable's scale, a few percent of cache misses from unnecessary provider switches costs thousands of dollars per hour.
Lovable solves this by generating multiple fallback chains with different primary providers, then assigning each project to one chain for a short duration. Consecutive requests from the same project stay on the same provider, preserving the cache. When a provider fails, the project switches to the next provider in its chain and starts building a new cache there.
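One common way to implement this kind of sticky assignment is to hash the project ID together with a time window. This is a sketch of the general pattern, not Lovable's published configuration; the chain contents and the 5-minute window are assumptions:

```python
import hashlib

# Illustrative fallback chains; provider names and the window length
# are assumptions, not Lovable's published configuration.
CHAINS = [
    ["anthropic-direct", "aws-bedrock", "gcp-vertex"],
    ["aws-bedrock", "gcp-vertex", "anthropic-direct"],
    ["gcp-vertex", "anthropic-direct", "aws-bedrock"],
]
WINDOW_SECONDS = 300

def chain_for(project_id: str, now: float) -> list[str]:
    # Hash the project id with the current time window: consecutive requests
    # in the same window land on the same chain, keeping the prompt cache warm.
    window = int(now // WINDOW_SECONDS)
    digest = hashlib.sha256(f"{project_id}:{window}".encode()).hexdigest()
    return CHAINS[int(digest, 16) % len(CHAINS)]
```

Pass `time.time()` as `now`. Within a window, a project stays on one primary provider; when the window rolls over, it may be reassigned, which spreads cache rebuilds out over time instead of concentrating them.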
| Scenario | Cache hit rate | Cost per request | Latency (100K prefix) |
|---|---|---|---|
| No routing (single provider) | ~95% | Baseline | ~2.4s |
| Naive routing (random provider) | ~20% | 4-5x higher | ~10s |
| Affinity routing (Lovable's approach) | ~90% | ~1.1x baseline | ~2.8s |
The difference between naive routing and affinity routing is dramatic. Random provider selection destroys the cache on most requests, inflating both cost and latency. Affinity routing preserves nearly all cache hits while still providing failover capability. The small cache hit rate reduction (95% to 90%) comes from the inevitable provider switches during actual failures.
Cache invalidation during provider switching
Anthropic's cache operates on exact prefix matching. Even minor changes (different tool_choice, presence of images) invalidate the cache. When routing switches providers, all cached prefixes for that project are lost. The project must rebuild its cache on the new provider, which means the first few requests after a switch pay full price. Good routing minimizes these switches.
Routing Strategies
There is no single correct routing strategy. The right choice depends on what you are optimizing for and what constraints you face. Five strategies cover most production use cases.
Cost-based routing
Route each request to the cheapest model or provider that meets a quality threshold. RouteLLM achieves 95% of GPT-4 quality at 25% of the cost. Best for high-volume applications where quality requirements are well-defined.
Latency-based routing
Route to the fastest available provider. OpenRouter's :nitro variants prioritize throughput. Best for interactive applications where time-to-first-token matters more than per-request cost.
Quality-based routing
Always use the best model regardless of cost. Route across providers for reliability, not savings. Best for high-stakes tasks where incorrect output costs more than the model price.
Task-based routing
Match the model to the task type. Code generation to code-tuned models, summarization to efficient models, reasoning to frontier models. Classification adds ~430ms but routes 60% of coding prompts to 60x cheaper models.
Reliability-based routing
The fifth strategy is what Lovable implements: route for reliability. The primary goal is not to save money on models but to maintain 99.95%+ uptime when individual providers have multiple outages per week. Cost optimization is a secondary benefit; the primary benefit is that users never see a provider error.
OpenRouter's default behavior illustrates a hybrid approach: prioritize providers with no recent outages, weight by inverse square of price among healthy providers, and use the remaining providers as fallbacks. This combines reliability (avoid recently-failed providers) with cost optimization (prefer cheaper among healthy).
| Strategy | Optimizes for | Typical savings | Tradeoff |
|---|---|---|---|
| Cost-based | Lowest cost per quality | 40-85% | May increase latency if cheap models are slower |
| Latency-based | Fastest response | 0-20% | May cost more; fast providers are often more expensive |
| Quality-based | Best output quality | 0% | No cost savings; pure reliability improvement |
| Task-based | Right model per task | 40-70% | Requires accurate classification; adds ~430ms |
| Reliability-based | Maximum uptime | Varies | May route to more expensive providers during outages |
RouteLLM: Routing from Preference Data
RouteLLM, published at ICLR 2025 by researchers from UC Berkeley, Anyscale, and Canva, is the most rigorous open-source routing framework available. It trains router models on human preference data from Chatbot Arena to predict when a strong model (GPT-4) would outperform a weak model (Mixtral 8x7B) on a given prompt.
The key result: the matrix factorization router achieves 95% of GPT-4's performance on MT Bench while using only 26% GPT-4 calls. With data augmentation from an LLM judge, that drops to 14% GPT-4 calls at the same quality level. That is a 75% cost reduction without meaningful quality degradation.
RouteLLM provides five router implementations:
| Router | Method | Best for |
|---|---|---|
| Matrix Factorization (mf) | Preference data factorization | General use. Recommended default. |
| SW Ranking | Elo-weighted prompt similarity | When query similarity to training data is high |
| BERT | Trained classifier on preferences | Environments where BERT inference is fast |
| Causal LLM | Language model fine-tuned on preferences | When accuracy matters more than latency |
| Random | Random model selection | Baseline comparison only |
A finding that matters for production: the routers transfer across model pairs. A router trained on GPT-4 vs. Mixtral maintains its performance when routing between Claude vs. Llama, without retraining. The routing decision is about query complexity, not about specific model capabilities. This means you can swap underlying models as pricing and performance shift without rebuilding the router.
Using RouteLLM
import openai
# RouteLLM uses threshold-based routing via the model field
# Format: router-[ROUTER_NAME]-[THRESHOLD]
client = openai.OpenAI(
base_url="http://localhost:6060/v1", # local RouteLLM server
api_key="not-needed"
)
# Threshold controls strong/weak model split
# Lower threshold = more requests to strong model
response = client.chat.completions.create(
model="router-mf-0.11593", # matrix factorization, calibrated threshold
messages=[{"role": "user", "content": "Debug this race condition..."}]
)
# Internally:
# 1. Router calculates win-rate probability for strong model
# 2. If probability > threshold: route to GPT-4
# 3. If probability <= threshold: route to Mixtral
// 4. Threshold 0.11593 = ~50% GPT-4 calls on Chatbot Arena data

Cascade Routing
Standard routing makes a single model decision before generation. Cascade routing makes multiple: start with the cheapest model, evaluate the response, and escalate to more expensive models only when the response is inadequate.
A research team at ETH Zurich unified routing and cascading into a single theoretical framework, published at ICML 2025. Their proof: cascade routing is optimal under certain cost-performance tradeoffs, meaning no single-step router can achieve the same cost-quality frontier.
The intuition: a single-step router must predict difficulty before seeing any output. A cascade router gets to see the cheap model's actual attempt. If the cheap model produces a confident, well-structured response, accept it. If it hedges, contradicts itself, or produces low-confidence tokens, escalate. The cascade uses actual evidence instead of prediction.
Cascade routing pattern
async function cascadeRoute(prompt: string) {
// Step 1: Try cheapest model
const cheapResponse = await llm.generate({
model: "haiku",
prompt,
logprobs: true // need confidence signals
})
// Step 2: Evaluate response quality
const confidence = evaluateConfidence(cheapResponse)
if (confidence > THRESHOLD) {
return cheapResponse // Cheap model handled it. Done.
}
// Step 3: Escalate to mid-tier
const midResponse = await llm.generate({
model: "sonnet",
prompt
})
const midConfidence = evaluateConfidence(midResponse)
if (midConfidence > THRESHOLD) {
return midResponse
}
// Step 4: Final escalation to frontier model
return await llm.generate({
model: "opus",
prompt
})
}
// Tradeoff: cascade adds latency from sequential calls
// but saves money when the cheap model succeeds (60%+ of the time)

The tradeoff is latency. Each escalation step adds a full model generation round-trip. For a prompt that needs the frontier model, cascade routing takes 2-3x longer than routing directly to it. The savings come from the majority of prompts that never escalate. A 2026 survey of routing strategies found that data-augmented cascade approaches achieve up to 16x better efficiency compared to always-escalate baselines.
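The `evaluateConfidence` step in the cascade above is deliberately unspecified; one simple heuristic is the geometric mean of token probabilities, recovered from log-probabilities. A sketch, assuming the provider returns per-token logprobs (the function and threshold are illustrative, not a standard API):

```python
import math

def evaluate_confidence(logprobs: list[float]) -> float:
    # Geometric mean of token probabilities: exp of the mean log-probability.
    # Maps to (0, 1]; higher means the model was consistently confident.
    # An illustrative heuristic, not the only way to score a response.
    if not logprobs:
        return 0.0
    return math.exp(sum(logprobs) / len(logprobs))

# Confident generation: tokens near logprob 0 -> score near 1.
high = evaluate_confidence([-0.01, -0.05, -0.02])
# Hedging generation: low-probability tokens drag the score down.
low = evaluate_confidence([-1.2, -2.5, -0.9])
```

Production cascades often combine this with structural checks (did the response parse, did it answer the question) since logprob confidence alone can be miscalibrated.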
When to cascade vs. single-step route
Use single-step routing when latency matters and you have good classification accuracy (85%+). Use cascade routing when cost dominates, latency tolerance is high (batch processing, async tasks), or when task difficulty is genuinely hard to predict before seeing output. Most interactive coding agents use single-step. Most batch pipelines benefit from cascading.
Task-Based Routing for Coding Agents
Coding agents are the strongest use case for task-based model routing. A single coding session generates 50-200 LLM calls with wildly varying difficulty. Analysis of millions of coding prompts shows a consistent distribution: roughly 60% easy, 25% medium, 15% hard. The ratio shifts with the task (greenfield coding skews harder, maintenance skews easier), but the pattern holds across languages and frameworks.
The cost difference between tiers is 15-60x. Claude Haiku at $0.25/M tokens versus Claude Opus at $15/M tokens. Even the middle tier (Sonnet at $3/M) is 5x cheaper than frontier. Routing the 60% of easy tasks to Haiku alone saves more than half the bill.
| Difficulty | Examples | Model tier | Cost/M tokens |
|---|---|---|---|
| Easy (60%) | Add import, rename variable, write docstring, fix lint error | Haiku / GPT-5-mini | $0.25-1 |
| Medium (25%) | Multi-file refactor, standard patterns, moderate logic | Sonnet / GPT-5-low | $3-5 |
| Hard (15%) | Architectural design, race condition debugging, novel algorithms | Opus / GPT-5-high | $15 |
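The "more than half the bill" claim follows from simple blended-cost arithmetic, using the tier prices and difficulty mix from the table above:

```python
# Blended cost per million tokens under task-based routing,
# using the difficulty mix and tier prices from the table above.
MIX = {
    "easy":   (0.60, 0.25),   # 60% of prompts at Haiku's $0.25/M
    "medium": (0.25, 3.00),   # 25% at Sonnet's $3/M
    "hard":   (0.15, 15.00),  # 15% at Opus's $15/M
}

blended = sum(share * price for share, price in MIX.values())
# 0.60*0.25 + 0.25*3.00 + 0.15*15.00 = 3.15 $/M tokens
baseline = 15.00                      # everything on Opus
savings = 1 - blended / baseline      # ~79% cheaper than all-Opus
```

Even against an all-Sonnet baseline ($3/M), routing the easy 60% down to Haiku keeps the blended rate near $3.15/M only because the hard 15% escalates up to Opus; the mix determines the math, which is why classification accuracy matters so much.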
The challenge is classification accuracy. A prompt like "add error handling" might be easy (wrap in try-catch) or hard (design retry logic with circuit breakers and graceful degradation). Keyword matching cannot distinguish these. Length heuristics fail because long prompts can be easy (large blocks of boilerplate) and short prompts can be hard ("refactor to visitor pattern").
Amazon Science found that task decomposition with smaller LLMs significantly reduces cost while maintaining quality. Their insight: LLMs perform better reasoning over smaller, well-defined problems. Decompose a complex task into subtasks, route each subtask to the appropriate model tier, and the total cost drops while quality stays constant or improves.
This is where multi-agent coding architectures intersect with routing. An orchestrator decomposes the task and delegates subtasks to specialized agents. Each subtask gets classified independently and routed to the right model. The agent adding a test file uses Haiku. The agent designing the API schema uses Opus. The orchestrator that coordinates them uses Sonnet. Three tiers, three cost levels, matched to actual difficulty.
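In an orchestrator, that tier assignment can be as simple as a lookup with a sensible default. The subtask categories and model names below are hypothetical, chosen only to mirror the example in the paragraph above:

```python
# Hypothetical subtask -> model-tier mapping for a decomposing orchestrator.
# Categories and model names are illustrative.
TIER_FOR_SUBTASK = {
    "write_test": "claude-haiku",
    "fix_lint": "claude-haiku",
    "refactor_module": "claude-sonnet",
    "design_api_schema": "claude-opus",
}

def route_subtasks(subtasks: list[str]) -> dict[str, str]:
    # Unknown subtask types default to the middle tier rather than the
    # cheapest: misrouting hard work down is costlier than easy work up.
    return {s: TIER_FOR_SUBTASK.get(s, "claude-sonnet") for s in subtasks}
```

The default-to-middle choice encodes the asymmetry of routing errors: sending an easy task to Sonnet wastes cents, while sending a hard task to Haiku produces a wrong answer that costs a retry at a higher tier anyway.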
Implementation Patterns
Model routing implementations fall into three categories: managed services, self-hosted proxies, and application-level routing. Each has different tradeoff profiles.
Managed routing services
OpenRouter, Martian, and similar services sit between your application and the providers. You send requests to their endpoint, and they handle model selection, provider routing, failover, and retry logic. The advantage: zero infrastructure. The cost: a margin on every request and a dependency on another service in your critical path.
OpenRouter's default load balancing strategy prioritizes providers with no outages in the last 30 seconds, then selects among healthy providers weighted by inverse square of price. It automatically filters providers that do not support your request's features (tools, max_tokens requirements). Failover happens transparently when a provider returns an error.
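The inverse-square-of-price weighting is easy to sketch. This is an illustration of the weighting scheme, not OpenRouter's code, and the prices are made up:

```python
def price_weights(prices: list[float]) -> list[float]:
    # Weight healthy providers by 1/price^2, normalized to sum to 1.
    # Squaring means a provider at half the price gets 4x the traffic share,
    # a stronger cheap-provider preference than linear inverse pricing.
    raw = [1.0 / p ** 2 for p in prices]
    total = sum(raw)
    return [r / total for r in raw]

# $1/M vs $2/M: the cheaper provider gets 4x the share (~0.8 vs ~0.2)
w = price_weights([1.0, 2.0])
```

In OpenRouter's scheme this weighting applies only after filtering out providers with recent outages, so price never overrides reliability.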
Self-hosted proxies
LiteLLM and similar proxies run in your infrastructure. You control the routing logic, provider credentials, and failover configuration. The advantage: full control, no third-party dependency, no markup. The cost: operational overhead for deployment, monitoring, and updates.
Application-level routing
The lightest approach: the routing decision happens in your application code. A classifier returns a model recommendation, and your application sends the request directly to the provider. No proxy in the critical path. No additional infrastructure.
Application-level routing with Morph
import { morph } from 'morph'
import Anthropic from '@anthropic-ai/sdk'
const anthropic = new Anthropic()
async function routedChat(userQuery: string, context: string[]) {
// Classification and context preparation run in parallel
const [routerResult, preparedContext] = await Promise.all([
morph.routers.anthropic.selectModel({
input: userQuery,
mode: 'balanced'
}),
prepareContext(context) // file reads, search, etc.
])
// Router returns model name, not a proxied response
// Your app calls the provider directly
const response = await anthropic.messages.create({
model: routerResult.model, // "claude-haiku-4" or "claude-opus-4"
messages: [
{ role: 'user', content: preparedContext + userQuery }
],
max_tokens: 4096,
})
return response
}
// Classification: ~430ms (hidden behind parallel context prep)
// Cost: $0.001 per classification
// Net savings: 40-70% on typical coding workloads

The parallel execution pattern is critical. Classification takes ~430ms. Context preparation (reading files, running search, assembling the system prompt) typically takes 200-800ms. By running them in parallel with Promise.all, the classification completes before the context is ready. Zero added latency in the common case.
Multi-provider routing with fallback chains
For teams running at Lovable-like scale, the full pattern combines model selection with provider routing:
Combined model + provider routing
// Step 1: Select the right model for the task
const { model } = await router.selectModel({ input: prompt })
// → "claude-sonnet-4.5"
// Step 2: Select the right provider for that model
const provider = selectProvider({
model: "claude-sonnet-4.5",
providers: [
{ name: "anthropic-direct", weight: 0.6, healthy: true },
{ name: "aws-bedrock", weight: 0.3, healthy: true },
{ name: "gcp-vertex", weight: 0.1, healthy: true },
],
// Preserve prompt caching: sticky to last provider for this project
projectAffinity: project.lastProvider,
})
// Step 3: Send request to selected provider
const response = await provider.complete({
model: "claude-sonnet-4.5",
messages,
// Fallback: if this provider fails, try next in chain
fallback: provider.nextInChain,
})

When Routing Breaks Down
Routing is not always the right answer. Some scenarios are better served by a fixed model and provider.
Homogeneous difficulty
If all prompts are genuinely hard (theorem proving, novel research), task-based routing has nothing to route. Everything goes to the frontier model anyway. The classification overhead adds cost without savings.
Low volume
Routing infrastructure, whether managed or self-hosted, has a fixed cost. Below ~1,000 LLM calls per month, the absolute dollar savings from routing may not justify the integration effort.
Single-provider lock-in
If you depend on provider-specific features (Anthropic's tool use format, OpenAI's function calling schema), multi-provider routing adds compatibility complexity. Provider routing still works, but model routing across providers requires translation layers.
Latency-critical autocomplete
For inline code completion with sub-200ms latency budgets, even the ~430ms classification overhead is too much. These workloads are better served by a fixed fast model. The savings from routing do not compensate for doubled latency.
For most production LLM applications, routing is worth it once you cross two thresholds: more than 30% of requests are below maximum difficulty, and you make more than a few thousand calls per month. Below those thresholds, a fixed model with a simple retry-on-failure strategy is simpler and sufficient.
Frequently Asked Questions
What is model routing?
Model routing is the practice of dynamically selecting which LLM handles each request based on task difficulty, cost, latency, quality requirements, and provider availability. Instead of hardcoding a single model, a router evaluates each request and sends it to the optimal model or provider. At scale, this includes multi-provider load balancing with automatic failover, prompt caching preservation, and rate limit management.
How does Lovable route 1.8 billion tokens per minute?
Lovable uses a PID controller that recalculates provider availability every 30 seconds. The scoring formula is: score = successful_responses - 200 * errors + 1. When error rate exceeds 0.5%, traffic shifts away automatically. They generate multiple fallback chains with different primary providers and assign each project to one chain for a fixed duration, preserving prompt caching by keeping consecutive requests on the same provider.
What is RouteLLM?
RouteLLM is an open-source framework from UC Berkeley, published at ICLR 2025. It trains router models on human preference data to predict when a strong model would outperform a weak model. The matrix factorization router achieves 95% of GPT-4 quality using 26% GPT-4 calls, cutting costs by ~75%. Routers transfer across model pairs without retraining.
How does prompt caching interact with model routing?
Prompt caching relies on consecutive requests hitting the same provider with identical prefixes. Anthropic's cache provides 90% cost reduction and up to 85% latency reduction. When a router switches providers, the cache is invalidated. Effective routing preserves caching through project-level provider affinity, keeping consecutive requests on the same provider.
What are the main model routing strategies?
Five primary strategies. Cost-based: route to the cheapest model meeting quality thresholds (40-85% savings). Latency-based: route to the fastest provider. Quality-based: always use the best model. Task-based: match model tier to task difficulty (40-70% savings on coding workloads). Reliability-based: distribute traffic for uptime, handle failover transparently.
How much does model routing save on LLM costs?
40% to 85% depending on implementation and workload. RouteLLM achieves up to 85% on benchmarks. Task-based routing for coding agents saves 40-70% because 60% of prompts are simple tasks where cheap models produce identical output. Multi-provider routing adds reliability savings by preventing costly retries during outages.
What is cascade routing?
Cascade routing tries a cheap model first, evaluates the response quality, and escalates to expensive models only when needed. A unified framework from ETH Zurich (ICML 2025) proves this optimal under certain conditions. The tradeoff: cascading adds latency from sequential calls but saves money when the cheap model succeeds, which is 60%+ of the time for typical workloads.
Related Resources
Skip the Routing Infrastructure
Morph's inference layer handles model routing automatically. Prompt difficulty classification in ~430ms, provider-level routing with failover, and prompt caching preservation. Use the right model for each subtask without building routing infrastructure yourself.