Lovable processes 1.8 billion tokens per minute across five LLM providers. At that volume, every provider fails eventually. Rate limits hit. Streams die mid-generation. Entire regions go dark. Their users rarely notice, because the routing layer absorbs it all: a PID controller recalculating provider health every 30 seconds, probabilistic fallback chains preserving prompt caching, and weighted traffic distribution that recovers from failures without anyone touching a config file.
1.8 Billion Tokens Per Minute
Lovable is an AI-powered software development platform. Every user interaction generates multiple LLM calls: planning, code generation, debugging, iteration. At peak traffic, this produces 1.8 billion tokens per minute flowing through their infrastructure across five providers.
At this scale, the probability of at least one provider having issues at any given moment approaches certainty. Rate limits trigger during traffic spikes. Streaming responses fail mid-generation when a provider's load balancer sheds connections. Regional outages take an entire provider offline for minutes. Without routing, each of these events becomes a user-visible error.
The conventional approach is a ranked fallback list: try Provider A, if it fails try Provider B, then C. This works for low traffic. At 1.8 billion tokens per minute, it creates a stampede. When Provider A goes down, all traffic floods Provider B, which immediately hits its own rate limits. The failure cascades through every provider on the list.
Why simple fallbacks fail at scale
A ranked fallback list turns a single provider outage into a multi-provider outage. All traffic shifts to the second provider simultaneously, overwhelming its rate limits. The system needs proportional distribution, not binary failover.
What Model Routing Actually Is
Model routing is a decision layer between your application and your LLM providers. It evaluates each request and determines where to send it based on some combination of cost, latency, quality, task type, and provider health. The term covers a broad spectrum of implementations.
At one end: multi-provider load balancing. Lovable's system distributes traffic across providers proportionally, maintaining prompt caching and handling failover transparently. The model stays the same (Claude Sonnet, say), but the provider serving it changes based on real-time health signals.
At the other end: per-request model selection. Given a prompt, which model should handle it? A TODO comment does not need Opus. A complex refactor does not work well on Haiku. A router classifies the task and picks the right model tier.
In between: hybrid systems that do both. Route to the right model, then route that model to the right provider. The two decisions are independent and compose naturally.
Provider routing
Same model, different providers. Distribute traffic based on health, latency, cost, and rate limit headroom. Preserves prompt caching through provider affinity.
Model routing
Different models, matched to task difficulty. Easy prompts go to cheap models. Hard prompts go to frontier models. Classification determines which.
Hybrid routing
Select the model based on the task, then select the provider based on real-time availability. Two independent routing decisions that compose naturally.
Lovable's PID Controller Approach
Lovable's routing system centers on a PID controller that continuously adjusts how much traffic each provider receives. Every 30 seconds, it computes a score for each provider:
Provider health scoring
// Computed every 30 seconds per provider
score = successful_responses - 200 * errors + 1
// PID controller adjusts availability to keep score near zero
// If score < 0 (error rate > 0.5%): reduce availability
// If score > 0: increase availability
// The +1 bias prevents permanently abandoning recovered providers
// Availability capped at [0, 1]

The 200x error penalty means a single error costs the equivalent of 200 successful responses. This makes the system extremely sensitive to failures. At a 0.5% error rate, the score crosses zero and the controller starts reducing that provider's traffic share. The +1 bias ensures that a provider with zero traffic still gets a small positive score, so it eventually receives test traffic and can recover.
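The control loop can be sketched in a few lines. This is a proportional-only simplification, not Lovable's actual PID implementation; the function name and the `GAIN` constant are illustrative assumptions:

```python
# Proportional-only sketch of the availability controller.
# GAIN and the function name are illustrative, not Lovable's code.
GAIN = 0.0001  # how aggressively availability reacts to the score

def update_availability(availability: float, successes: int, errors: int) -> float:
    # Score crosses zero at a 0.5% error rate (200 * errors == successes).
    score = successes - 200 * errors + 1
    # Nudge availability in the direction of the score's sign, clamped to [0, 1].
    return min(1.0, max(0.0, availability + GAIN * score))

# Healthy provider (0.1% errors): score is large and positive, availability rises.
healthy = update_availability(0.5, 10_000, 10)
# Degraded provider (1% errors): score is negative, availability falls.
degraded = update_availability(0.5, 10_000, 100)
```

A full PID controller adds integral and derivative terms on top of this proportional step, which is what damps the oscillation described below.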
Weight calculation
With multiple providers, total availability usually exceeds 1. Lovable converts availabilities to weights that sum to 1, prioritizing preferred providers:
Provider weight distribution
// First provider: weight = its availability
weight_1 = availability_1
// Second provider: weight = min(availability, remaining capacity)
weight_2 = min(availability_2, 1 - weight_1)
// Third provider: fills remaining gap
weight_3 = min(availability_3, 1 - weight_1 - weight_2)
// Example with availabilities [0.8, 0.7, 0.9]:
// weight_1 = 0.8 (preferred provider gets 80%)
// weight_2 = 0.2 (second provider fills remaining 20%)
// weight_3 = 0.0 (third provider is pure fallback)
// If provider 1 degrades to availability 0.3:
// weight_1 = 0.3 (reduced share)
// weight_2 = 0.7 (automatically absorbs traffic)
// weight_3 = 0.0 (still fallback)

This greedy allocation means the preferred provider always gets first claim on traffic, up to its current availability. When it degrades, traffic flows to the next provider without any manual intervention. Recovery is also automatic: as the preferred provider's health improves, its availability rises, and it reclaims traffic from downstream providers.
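The greedy allocation above reduces to a short loop. A minimal sketch (function name is an assumption):

```python
def allocate_weights(availabilities: list[float]) -> list[float]:
    # Greedy allocation: each provider, in preference order, claims up to
    # its availability from whatever traffic share remains.
    weights = []
    remaining = 1.0
    for a in availabilities:
        w = min(a, remaining)
        weights.append(w)
        remaining -= w
    return weights

# Matches the worked example above:
# allocate_weights([0.8, 0.7, 0.9]) -> approximately [0.8, 0.2, 0.0]
# allocate_weights([0.3, 0.7, 0.9]) -> approximately [0.3, 0.7, 0.0]
```

Note that the weights sum to 1 only while total availability covers the traffic; if every provider degrades at once, the weights sum to less than 1 and some requests have nowhere healthy to go.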
Why a PID controller, not simple thresholds
Simple threshold-based routing (if error rate > 5%, switch providers) creates oscillation. Traffic shifts away, the provider recovers because it has no load, traffic shifts back, it fails again. A PID controller smooths this with proportional, integral, and derivative terms that converge on a stable traffic distribution.
Prompt Caching Preservation
Prompt caching is the single largest cost and latency optimization available in LLM inference. Anthropic's implementation caches prompt prefixes and serves subsequent requests with identical prefixes at 90% reduced cost and up to 85% reduced latency. A 100K-token prompt that takes 11.5 seconds on first call drops to 2.4 seconds on cache hit.
The constraint: caches are per-provider. If a project's request hits Anthropic direct on one call and Anthropic via AWS Bedrock on the next, the cache misses. Both calls pay full price. At Lovable's scale, a few percent of cache misses from unnecessary provider switches costs thousands of dollars per hour.
Lovable solves this by generating multiple fallback chains with different primary providers, then assigning each project to one chain for a short duration. Consecutive requests from the same project stay on the same provider, preserving the cache. When a provider fails, the project switches to the next provider in its chain and starts building a new cache there.
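One common way to implement this kind of sticky assignment is to hash the project ID together with a time window. This is a sketch of the general pattern, not Lovable's published configuration; the chain contents and the 5-minute window are assumptions:

```python
import hashlib

# Illustrative fallback chains; provider names and the window length
# are assumptions, not Lovable's published configuration.
CHAINS = [
    ["anthropic-direct", "aws-bedrock", "gcp-vertex"],
    ["aws-bedrock", "gcp-vertex", "anthropic-direct"],
    ["gcp-vertex", "anthropic-direct", "aws-bedrock"],
]
WINDOW_SECONDS = 300

def chain_for(project_id: str, now: float) -> list[str]:
    # Hash the project id with the current time window: consecutive requests
    # in the same window land on the same chain, keeping the prompt cache warm.
    window = int(now // WINDOW_SECONDS)
    digest = hashlib.sha256(f"{project_id}:{window}".encode()).hexdigest()
    return CHAINS[int(digest, 16) % len(CHAINS)]
```

Pass `time.time()` as `now`. Within a window, a project stays on one primary provider; when the window rolls over, it may be reassigned, which spreads cache rebuilds out over time instead of concentrating them.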
| Scenario | Cache hit rate | Cost per request | Latency (100K prefix) |
|---|---|---|---|
| No routing (single provider) | ~95% | Baseline | ~2.4s |
| Naive routing (random provider) | ~20% | 4-5x higher | ~10s |
| Affinity routing (Lovable's approach) | ~90% | ~1.1x baseline | ~2.8s |
The difference between naive routing and affinity routing is dramatic. Random provider selection destroys the cache on most requests, inflating both cost and latency. Affinity routing preserves nearly all cache hits while still providing failover capability. The small cache hit rate reduction (95% to 90%) comes from the inevitable provider switches during actual failures.
Cache invalidation during provider switching
Anthropic's cache operates on exact prefix matching. Even minor changes (different tool_choice, presence of images) invalidate the cache. When routing switches providers, all cached prefixes for that project are lost. The project must rebuild its cache on the new provider, which means the first few requests after a switch pay full price. Good routing minimizes these switches.
Routing Strategies
There is no single correct routing strategy. The right choice depends on what you are optimizing for and what constraints you face. Five strategies cover most production use cases.
Cost-based routing
Route each request to the cheapest model or provider that meets a quality threshold. RouteLLM achieves 95% of GPT-4 quality at 25% of the cost. Best for high-volume applications where quality requirements are well-defined.
Latency-based routing
Route to the fastest available provider. OpenRouter's :nitro variants prioritize throughput. Best for interactive applications where time-to-first-token matters more than per-request cost.
Quality-based routing
Always use the best model regardless of cost. Route across providers for reliability, not savings. Best for high-stakes tasks where incorrect output costs more than the model price.
Task-based routing
Match the model to the task type. Code generation to code-tuned models, summarization to efficient models, reasoning to frontier models. Classification adds ~430ms but routes 60% of coding prompts to 60x cheaper models.
Reliability-based routing
The fifth strategy is what Lovable implements: route for reliability. The primary goal is not to save money on models but to maintain 99.95%+ uptime when individual providers have multiple outages per week. Cost optimization is a secondary benefit; the primary benefit is that users never see a provider error.
OpenRouter's default behavior illustrates a hybrid approach: prioritize providers with no recent outages, weight by inverse square of price among healthy providers, and use the remaining providers as fallbacks. This combines reliability (avoid recently-failed providers) with cost optimization (prefer cheaper among healthy).
| Strategy | Optimizes for | Typical savings | Tradeoff |
|---|---|---|---|
| Cost-based | Lowest cost per quality | 40-85% | May increase latency if cheap models are slower |
| Latency-based | Fastest response | 0-20% | May cost more; fast providers are often more expensive |
| Quality-based | Best output quality | 0% | No cost savings; pure reliability improvement |
| Task-based | Right model per task | 40-70% | Requires accurate classification; adds ~430ms |
| Reliability-based | Maximum uptime | Varies | May route to more expensive providers during outages |
RouteLLM: Routing from Preference Data
RouteLLM, published at ICLR 2025 by researchers from UC Berkeley, Anyscale, and Canva, is the most rigorous open-source routing framework available. It trains router models on human preference data from Chatbot Arena to predict when a strong model (GPT-4) would outperform a weak model (Mixtral 8x7B) on a given prompt.
The key result: the matrix factorization router achieves 95% of GPT-4's performance on MT Bench while using only 26% GPT-4 calls. With data augmentation from an LLM judge, that drops to 14% GPT-4 calls at the same quality level. That is a 75% cost reduction without meaningful quality degradation.
RouteLLM provides five router implementations:
| Router | Method | Best for |
|---|---|---|
| Matrix Factorization (mf) | Preference data factorization | General use. Recommended default. |
| SW Ranking | Elo-weighted prompt similarity | When query similarity to training data is high |
| BERT | Trained classifier on preferences | Environments where BERT inference is fast |
| Causal LLM | Language model fine-tuned on preferences | When accuracy matters more than latency |
| Random | Random model selection | Baseline comparison only |
A finding that matters for production: the routers transfer across model pairs. A router trained on GPT-4 vs. Mixtral maintains its performance when routing between Claude vs. Llama, without retraining. The routing decision is about query complexity, not about specific model capabilities. This means you can swap underlying models as pricing and performance shift without rebuilding the router.
Using RouteLLM
import openai
# RouteLLM uses threshold-based routing via the model field
# Format: router-[ROUTER_NAME]-[THRESHOLD]
client = openai.OpenAI(
base_url="http://localhost:6060/v1", # local RouteLLM server
api_key="not-needed"
)
# Threshold controls strong/weak model split
# Lower threshold = more requests to strong model
response = client.chat.completions.create(
model="router-mf-0.11593", # matrix factorization, calibrated threshold
messages=[{"role": "user", "content": "Debug this race condition..."}]
)
# Internally:
# 1. Router calculates win-rate probability for strong model
# 2. If probability > threshold: route to GPT-4
# 3. If probability <= threshold: route to Mixtral
// 4. Threshold 0.11593 = ~50% GPT-4 calls on Chatbot Arena data

Cascade Routing
Standard routing makes a single model decision before generation. Cascade routing makes multiple: start with the cheapest model, evaluate the response, and escalate to more expensive models only when the response is inadequate.
A research team at ETH Zurich unified routing and cascading into a single theoretical framework, published at ICML 2025. Their proof: cascade routing is optimal under certain cost-performance tradeoffs, meaning no single-step router can achieve the same cost-quality frontier.
The intuition: a single-step router must predict difficulty before seeing any output. A cascade router gets to see the cheap model's actual attempt. If the cheap model produces a confident, well-structured response, accept it. If it hedges, contradicts itself, or produces low-confidence tokens, escalate. The cascade uses actual evidence instead of prediction.
Cascade routing pattern
async function cascadeRoute(prompt: string) {
// Step 1: Try cheapest model
const cheapResponse = await llm.generate({
model: "haiku",
prompt,
logprobs: true // need confidence signals
})
// Step 2: Evaluate response quality
const confidence = evaluateConfidence(cheapResponse)
if (confidence > THRESHOLD) {
return cheapResponse // Cheap model handled it. Done.
}
// Step 3: Escalate to mid-tier
const midResponse = await llm.generate({
model: "sonnet",
prompt
})
const midConfidence = evaluateConfidence(midResponse)
if (midConfidence > THRESHOLD) {
return midResponse
}
// Step 4: Final escalation to frontier model
return await llm.generate({
model: "opus",
prompt
})
}
// Tradeoff: cascade adds latency from sequential calls
// but saves money when the cheap model succeeds (60%+ of the time)

The tradeoff is latency. Each escalation step adds a full model generation round-trip. For a prompt that needs the frontier model, cascade routing takes 2-3x longer than routing directly to it. The savings come from the majority of prompts that never escalate. A 2026 survey of routing strategies found that data-augmented cascade approaches achieve up to 16x better efficiency compared to always-escalate baselines.
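The `evaluateConfidence` step in the cascade above is deliberately unspecified; one simple heuristic is the geometric mean of token probabilities, recovered from log-probabilities. A sketch, assuming the provider returns per-token logprobs (the function and threshold are illustrative, not a standard API):

```python
import math

def evaluate_confidence(logprobs: list[float]) -> float:
    # Geometric mean of token probabilities: exp of the mean log-probability.
    # Maps to (0, 1]; higher means the model was consistently confident.
    # An illustrative heuristic, not the only way to score a response.
    if not logprobs:
        return 0.0
    return math.exp(sum(logprobs) / len(logprobs))

# Confident generation: tokens near logprob 0 -> score near 1.
high = evaluate_confidence([-0.01, -0.05, -0.02])
# Hedging generation: low-probability tokens drag the score down.
low = evaluate_confidence([-1.2, -2.5, -0.9])
```

Production cascades often combine this with structural checks (did the response parse, did it answer the question) since logprob confidence alone can be miscalibrated.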
When to cascade vs. single-step route
Use single-step routing when latency matters and you have good classification accuracy (85%+). Use cascade routing when cost dominates, latency tolerance is high (batch processing, async tasks), or when task difficulty is genuinely hard to predict before seeing output. Most interactive coding agents use single-step. Most batch pipelines benefit from cascading.
Task-Based Routing for Coding Agents
Coding agents are the strongest use case for task-based model routing. A single coding session generates 50-200 LLM calls with wildly varying difficulty. Analysis of millions of coding prompts shows a consistent distribution: roughly 60% easy, 25% medium, 15% hard. The ratio shifts with the task (greenfield coding skews harder, maintenance skews easier), but the pattern holds across languages and frameworks.
The cost difference between tiers is 15-60x. Claude Haiku at $0.25/M tokens versus Claude Opus at $15/M tokens. Even the middle tier (Sonnet at $3/M) is 5x cheaper than frontier. Routing the 60% of easy tasks to Haiku alone saves more than half the bill.
| Difficulty | Examples | Model tier | Cost/M tokens |
|---|---|---|---|
| Easy (60%) | Add import, rename variable, write docstring, fix lint error | Haiku / GPT-5-mini | $0.25-1 |
| Medium (25%) | Multi-file refactor, standard patterns, moderate logic | Sonnet / GPT-5-low | $3-5 |
| Hard (15%) | Architectural design, race condition debugging, novel algorithms | Opus / GPT-5-high | $15 |
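The "more than half the bill" claim follows from simple blended-cost arithmetic, using the tier prices and difficulty mix from the table above:

```python
# Blended cost per million tokens under task-based routing,
# using the difficulty mix and tier prices from the table above.
MIX = {
    "easy":   (0.60, 0.25),   # 60% of prompts at Haiku's $0.25/M
    "medium": (0.25, 3.00),   # 25% at Sonnet's $3/M
    "hard":   (0.15, 15.00),  # 15% at Opus's $15/M
}

blended = sum(share * price for share, price in MIX.values())
# 0.60*0.25 + 0.25*3.00 + 0.15*15.00 = 3.15 $/M tokens
baseline = 15.00                      # everything on Opus
savings = 1 - blended / baseline      # ~79% cheaper than all-Opus
```

Even against an all-Sonnet baseline ($3/M), routing the easy 60% down to Haiku keeps the blended rate near $3.15/M only because the hard 15% escalates up to Opus; the mix determines the math, which is why classification accuracy matters so much.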
The challenge is classification accuracy. A prompt like "add error handling" might be easy (wrap in try-catch) or hard (design retry logic with circuit breakers and graceful degradation). Keyword matching cannot distinguish these. Length heuristics fail because long prompts can be easy (large blocks of boilerplate) and short prompts can be hard ("refactor to visitor pattern").
Amazon Science found that task decomposition with smaller LLMs significantly reduces cost while maintaining quality. Their insight: LLMs perform better reasoning over smaller, well-defined problems. Decompose a complex task into subtasks, route each subtask to the appropriate model tier, and the total cost drops while quality stays constant or improves.
This is where multi-agent coding architectures intersect with routing. An orchestrator decomposes the task and delegates subtasks to specialized agents. Each subtask gets classified independently and routed to the right model. The agent adding a test file uses Haiku. The agent designing the API schema uses Opus. The orchestrator that coordinates them uses Sonnet. Three tiers, three cost levels, matched to actual difficulty.
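In an orchestrator, that tier assignment can be as simple as a lookup with a sensible default. The subtask categories and model names below are hypothetical, chosen only to mirror the example in the paragraph above:

```python
# Hypothetical subtask -> model-tier mapping for a decomposing orchestrator.
# Categories and model names are illustrative.
TIER_FOR_SUBTASK = {
    "write_test": "claude-haiku",
    "fix_lint": "claude-haiku",
    "refactor_module": "claude-sonnet",
    "design_api_schema": "claude-opus",
}

def route_subtasks(subtasks: list[str]) -> dict[str, str]:
    # Unknown subtask types default to the middle tier rather than the
    # cheapest: misrouting hard work down is costlier than easy work up.
    return {s: TIER_FOR_SUBTASK.get(s, "claude-sonnet") for s in subtasks}
```

The default-to-middle choice encodes the asymmetry of routing errors: sending an easy task to Sonnet wastes cents, while sending a hard task to Haiku produces a wrong answer that costs a retry at a higher tier anyway.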
Implementation Patterns
Model routing implementations fall into three categories: managed services, self-hosted proxies, and application-level routing. Each has different tradeoff profiles.
Managed routing services
OpenRouter, Martian, and similar services sit between your application and the providers. You send requests to their endpoint, and they handle model selection, provider routing, failover, and retry logic. The advantage: zero infrastructure. The cost: a margin on every request and a dependency on another service in your critical path.
OpenRouter's default load balancing strategy prioritizes providers with no outages in the last 30 seconds, then selects among healthy providers weighted by inverse square of price. It automatically filters providers that do not support your request's features (tools, max_tokens requirements). Failover happens transparently when a provider returns an error.
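The inverse-square-of-price weighting is easy to sketch. This is an illustration of the weighting scheme, not OpenRouter's code, and the prices are made up:

```python
def price_weights(prices: list[float]) -> list[float]:
    # Weight healthy providers by 1/price^2, normalized to sum to 1.
    # Squaring means a provider at half the price gets 4x the traffic share,
    # a stronger cheap-provider preference than linear inverse pricing.
    raw = [1.0 / p ** 2 for p in prices]
    total = sum(raw)
    return [r / total for r in raw]

# $1/M vs $2/M: the cheaper provider gets 4x the share (~0.8 vs ~0.2)
w = price_weights([1.0, 2.0])
```

In OpenRouter's scheme this weighting applies only after filtering out providers with recent outages, so price never overrides reliability.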
Self-hosted proxies
LiteLLM and similar proxies run in your infrastructure. You control the routing logic, provider credentials, and failover configuration. The advantage: full control, no third-party dependency, no markup. The cost: operational overhead for deployment, monitoring, and updates.
Application-level routing
The lightest approach: the routing decision happens in your application code. A classifier returns a model recommendation, and your application sends the request directly to the provider. No proxy in the critical path. No additional infrastructure.
Application-level routing with Morph
import { morph } from 'morph'
import Anthropic from '@anthropic-ai/sdk'
const anthropic = new Anthropic()
async function routedChat(userQuery: string, context: string[]) {
// Classification and context preparation run in parallel
const [routerResult, preparedContext] = await Promise.all([
morph.routers.anthropic.selectModel({
input: userQuery,
mode: 'balanced'
}),
prepareContext(context) // file reads, search, etc.
])
// Router returns model name, not a proxied response
// Your app calls the provider directly
const response = await anthropic.messages.create({
model: routerResult.model, // "claude-haiku-4" or "claude-opus-4"
messages: [
{ role: 'user', content: preparedContext + userQuery }
],
max_tokens: 4096,
})
return response
}
// Classification: ~430ms (hidden behind parallel context prep)
// Cost: $0.001 per classification
// Net savings: 40-70% on typical coding workloads

The parallel execution pattern is critical. Classification takes ~430ms. Context preparation (reading files, running search, assembling the system prompt) typically takes 200-800ms. By running them in parallel with Promise.all, the classification completes before the context is ready. Zero added latency in the common case.
Multi-provider routing with fallback chains
For teams running at Lovable-like scale, the full pattern combines model selection with provider routing:
Combined model + provider routing
// Step 1: Select the right model for the task
const { model } = await router.selectModel({ input: prompt })
// → "claude-sonnet-4.5"
// Step 2: Select the right provider for that model
const provider = selectProvider({
model: "claude-sonnet-4.5",
providers: [
{ name: "anthropic-direct", weight: 0.6, healthy: true },
{ name: "aws-bedrock", weight: 0.3, healthy: true },
{ name: "gcp-vertex", weight: 0.1, healthy: true },
],
// Preserve prompt caching: sticky to last provider for this project
projectAffinity: project.lastProvider,
})
// Step 3: Send request to selected provider
const response = await provider.complete({
model: "claude-sonnet-4.5",
messages,
// Fallback: if this provider fails, try next in chain
fallback: provider.nextInChain,
})

When Routing Breaks Down
Routing is not always the right answer. Some scenarios are better served by a fixed model and provider.
Homogeneous difficulty
If all prompts are genuinely hard (theorem proving, novel research), task-based routing has nothing to route. Everything goes to the frontier model anyway. The classification overhead adds cost without savings.
Low volume
Routing infrastructure, whether managed or self-hosted, has a fixed cost. Below ~1,000 LLM calls per month, the absolute dollar savings from routing may not justify the integration effort.
Single-provider lock-in
If you depend on provider-specific features (Anthropic's tool use format, OpenAI's function calling schema), multi-provider routing adds compatibility complexity. Provider routing still works, but model routing across providers requires translation layers.
Latency-critical autocomplete
For inline code completion with sub-200ms latency budgets, even the ~430ms classification overhead is too much. These workloads are better served by a fixed fast model. The savings from routing do not compensate for doubled latency.
For most production LLM applications, routing is worth it once you cross two thresholds: more than 30% of requests are below maximum difficulty, and you make more than a few thousand calls per month. Below those thresholds, a fixed model with a simple retry-on-failure strategy is simpler and sufficient.
Frequently Asked Questions
What is model routing?
Model routing is the practice of dynamically selecting which LLM handles each request based on task difficulty, cost, latency, quality requirements, and provider availability. Instead of hardcoding a single model, a router evaluates each request and sends it to the optimal model or provider. At scale, this includes multi-provider load balancing with automatic failover, prompt caching preservation, and rate limit management.
How does Lovable route 1.8 billion tokens per minute?
Lovable uses a PID controller that recalculates provider availability every 30 seconds. The scoring formula is: score = successful_responses - 200 * errors + 1. When error rate exceeds 0.5%, traffic shifts away automatically. They generate multiple fallback chains with different primary providers and assign each project to one chain for a fixed duration, preserving prompt caching by keeping consecutive requests on the same provider.
What is RouteLLM?
RouteLLM is an open-source framework from UC Berkeley, published at ICLR 2025. It trains router models on human preference data to predict when a strong model would outperform a weak model. The matrix factorization router achieves 95% of GPT-4 quality using 26% GPT-4 calls, cutting costs by ~75%. Routers transfer across model pairs without retraining.
How does prompt caching interact with model routing?
Prompt caching relies on consecutive requests hitting the same provider with identical prefixes. Anthropic's cache provides 90% cost reduction and up to 85% latency reduction. When a router switches providers, the cache is invalidated. Effective routing preserves caching through project-level provider affinity, keeping consecutive requests on the same provider.
What are the main model routing strategies?
Five primary strategies. Cost-based: route to the cheapest model meeting quality thresholds (40-85% savings). Latency-based: route to the fastest provider. Quality-based: always use the best model. Task-based: match model tier to task difficulty (40-70% savings on coding workloads). Reliability-based: distribute traffic for uptime, handle failover transparently.
How much does model routing save on LLM costs?
40% to 85% depending on implementation and workload. RouteLLM achieves up to 85% on benchmarks. Task-based routing for coding agents saves 40-70% because 60% of prompts are simple tasks where cheap models produce identical output. Multi-provider routing adds reliability savings by preventing costly retries during outages.
What is cascade routing?
Cascade routing tries a cheap model first, evaluates the response quality, and escalates to expensive models only when needed. A unified framework from ETH Zurich (ICML 2025) proves this optimal under certain conditions. The tradeoff: cascading adds latency from sequential calls but saves money when the cheap model succeeds, which is 60%+ of the time for typical workloads.
Related Resources
Skip the Routing Infrastructure
Morph's inference layer handles model routing automatically. Prompt difficulty classification in ~430ms, provider-level routing with failover, and prompt caching preservation. Use the right model for each subtask without building routing infrastructure yourself.