Predicted Outputs: Speed Up AI Code Edits by 3-5x (2026 Guide)

Predicted outputs let you pass a reference string so AI models skip ahead when the output matches your prediction. Learn how they work across OpenAI, Cerebras, Fireworks, and Mistral, and how Morph compares for code editing.

February 15, 2026 · 2 min read

TL;DR: Predicted Outputs in 30 Seconds

Predicted outputs let you pass a reference string (your prediction of what the model will generate). The API uses speculative decoding to validate predicted tokens in parallel. When the prediction matches, generation skips ahead — resulting in 3-5x faster responses for rewrite and code editing workloads. The output is unchanged; only latency improves.

  • 3-5x faster on rewrite workloads
  • 0% impact on output quality
  • 6+ providers supported

What Are Predicted Outputs?

Predicted outputs are a latency optimization technique available through LLM APIs. You provide a prediction parameter — a string you expect the model to generate — alongside your normal prompt. The API then uses this prediction to accelerate token generation.

The concept is straightforward: if you already know what most of the output will look like, why make the model generate every token from scratch? By providing a prediction, you allow the API to validate multiple tokens at once instead of producing them sequentially.

The technique was popularized by OpenAI in November 2024 when they launched the prediction parameter for GPT-4o and GPT-4o-mini. Since then, providers like Cerebras, Fireworks AI, Mistral, and Azure OpenAI have adopted the same concept, and open-source frameworks like vLLM support speculative decoding for self-hosted deployments.

Predicted outputs are particularly powerful for code editing. When you refactor a function, rename a variable, or fix a bug, the surrounding code stays the same. That means 80-99% of the output tokens are predictable — exactly the scenario where predicted outputs deliver the biggest speedups.

How Predicted Outputs Work

Under the hood, predicted outputs use a technique called speculative decoding. Here is how it works step by step:

Step 1: You Provide a Prediction

You send a prediction parameter containing the text you expect the model to output. For a code refactor, this is typically the original file contents.

Step 2: Parallel Verification

Instead of generating tokens one at a time, the system takes a batch of tokens from your prediction and verifies them in parallel. If the model would have generated those tokens anyway, they are accepted instantly.

Step 3: Fallback on Divergence

When the model's intended output diverges from your prediction (the part you are actually changing), the system falls back to normal token-by-token generation until the prediction and model output align again.

Step 4: Reporting

The API reports the number of accepted and rejected predicted tokens. Accepted tokens were validated without sequential generation. Rejected tokens were in your prediction but did not match what the model generated.
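
The four steps can be sketched as a toy word-level simulation (illustrative only — real systems verify a whole batch of predicted tokens in a single forward pass of the target model, and `model_next_token` here is a stand-in for the model's greedy choice):

```python
def simulate_prediction(model_next_token, prediction, max_tokens):
    """Toy simulation of predicted-output decoding.

    model_next_token(output_so_far) returns the next token the model
    would emit, or None when generation is done. Tokens that match the
    prediction count as accepted (step 2); mismatches count as normally
    generated (step 3).
    """
    output, accepted, generated = [], 0, 0
    while len(output) < max_tokens:
        tok = model_next_token(output)
        if tok is None:  # model finished
            break
        pos = len(output)
        if pos < len(prediction) and prediction[pos] == tok:
            accepted += 1   # prediction matched: accepted in parallel
        else:
            generated += 1  # divergence: fall back to normal generation
        output.append(tok)
    return output, accepted, generated

# "Model" that renames user_id to account_id in a tokenized snippet
original = ["def", "get", "(", "user_id", ")", ":"]
target   = ["def", "get", "(", "account_id", ")", ":"]
model = lambda out: target[len(out)] if len(out) < len(target) else None

out, acc, gen = simulate_prediction(model, original, 10)
# → output equals the target, 5 tokens accepted, 1 generated
```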

Key Insight

Predicted outputs are lossless. They do not change what the model generates. The prediction is a performance hint, not a constraint. The model always produces the same output it would without the prediction — just faster when the prediction is accurate.

Conceptual flow

Input:  "Rename userId to accountId in this function"
Prediction: [original file with userId]

Token generation:
  Tokens 1-45:   MATCH prediction → accepted in parallel (instant)
  Tokens 46-52:  DIVERGE → model generates "accountId" token by token
  Tokens 53-200: MATCH prediction again → accepted in parallel (instant)

Result: 193 accepted tokens, 7 generated tokens
Speedup: ~4.2x vs generating all 200 tokens sequentially

When to Use Predicted Outputs

Predicted outputs work best when your prediction has high overlap with the actual output. The more tokens the model can accept from your prediction, the bigger the speedup.

Ideal Use Cases

  • Code refactoring — Rename variables, extract functions, change method signatures. Most of the file stays identical.
  • Bug fixes — Fixing a specific line while the surrounding context is unchanged.
  • Template updates — Updating placeholder values in boilerplate text while the structure remains the same.
  • Document rewrites — Changing a few paragraphs in a longer document, such as updating dates, names, or specific sections.
  • Configuration changes — Modifying specific values in JSON, YAML, or config files.
  • Search and replace with context — Intelligent find-and-replace where the model understands which instances to change.

Poor Use Cases

  • Novel generation — Writing new content from scratch where nothing is predictable. No speedup possible.
  • Summarization — Output is structurally different from input. Prediction won't match.
  • Translation — Target language has different tokens than source. No overlap.
  • Creative writing — Model output is inherently unpredictable.

Task Type                  | Typical Overlap | Expected Speedup
Variable rename            | 95-99%          | 4-5x
Bug fix (single line)      | 90-98%          | 3-5x
Function refactor          | 70-90%          | 2-4x
Config update              | 85-95%          | 3-4x
Document rewrite (section) | 60-80%          | 1.5-3x
Novel generation           | 0-10%           | 1x (no gain)
Summarization              | 5-20%           | 1x (no gain)
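
One way to decide which bucket a request falls into is to estimate similarity between the original text and the output you expect, using Python's standard `difflib` (a rough heuristic; the 0.5 threshold is an illustrative assumption, not a provider rule):

```python
from difflib import SequenceMatcher

def prediction_overlap(original: str, expected_output: str) -> float:
    """Estimate word-level overlap between a prediction and the likely output."""
    return SequenceMatcher(None, original.split(), expected_output.split()).ratio()

def should_predict(original: str, expected_output: str,
                   threshold: float = 0.5) -> bool:
    """Heuristic: attach a prediction only when overlap is high enough
    that accepted tokens should outweigh verification overhead."""
    return prediction_overlap(original, expected_output) >= threshold
```

For a variable rename, where most tokens survive unchanged, this returns a high ratio; for summarization or translation it collapses toward zero, signaling that the prediction parameter should be omitted.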

Supported Providers and Models

Predicted outputs are available across multiple providers, each with slightly different API surfaces and model support.

Provider           | Supported Models                                          | API Parameter
OpenAI             | GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini, GPT-4.1-nano  | prediction
Azure OpenAI       | GPT-4o, GPT-4.1 series (preview)                          | prediction
Cerebras           | Llama 3.1 70B, Llama 3.3 70B                              | prediction
Fireworks AI       | Llama 3.1, Qwen 2.5, custom models                        | prediction
Mistral            | Mistral Large, Codestral                                  | prediction
LiteLLM            | Routes to any backend                                     | prediction
vLLM (self-hosted) | Any model with draft model                                | speculative decoding config

OpenAI Limitations

OpenAI's predicted outputs are not compatible with tool calls, n > 1, logprobs, or presence_penalty / frequency_penalty. Streaming is supported. Other providers may have different restrictions.
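
A defensive wrapper can drop the prediction when a request uses one of these features (the incompatibility list mirrors the OpenAI restrictions above; a sketch, not an official helper):

```python
def safe_kwargs(kwargs: dict) -> dict:
    """Strip the prediction parameter from request kwargs when the
    request uses features OpenAI documents as incompatible with
    predicted outputs (tool calls, logprobs, penalties, n > 1)."""
    incompatible = ("tools", "logprobs", "presence_penalty", "frequency_penalty")
    uses_incompatible = (
        any(k in kwargs for k in incompatible) or kwargs.get("n", 1) > 1
    )
    if "prediction" in kwargs and uses_incompatible:
        return {k: v for k, v in kwargs.items() if k != "prediction"}
    return kwargs
```

This lets one request-building path serve both tool-calling and rewrite workloads without tripping API validation errors.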

API Usage Examples

OpenAI (Python SDK)

OpenAI: Refactor code with predicted outputs

from openai import OpenAI

client = OpenAI()

original_code = """
def get_user(user_id: str) -> dict:
    user = db.find_one({"user_id": user_id})
    if not user:
        raise ValueError(f"User {user_id} not found")
    return {
        "user_id": user["user_id"],
        "name": user["name"],
        "email": user["email"],
    }
"""

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "user",
            "content": f"Rename user_id to account_id everywhere in this code:\n{original_code}"
        }
    ],
    prediction={
        "type": "content",
        "content": original_code
    }
)

print(response.choices[0].message.content)

# Check prediction efficiency
usage = response.usage
print(f"Accepted: {usage.completion_tokens_details.accepted_prediction_tokens}")
print(f"Rejected: {usage.completion_tokens_details.rejected_prediction_tokens}")

OpenAI (TypeScript SDK)

TypeScript: Streaming with predicted outputs

import OpenAI from "openai";

const client = new OpenAI();

const originalCode = `function fetchUser(userId: string) {
  return api.get(\`/users/\${userId}\`);
}`;

const stream = await client.chat.completions.create({
  model: "gpt-4.1",
  messages: [
    {
      role: "user",
      content: `Change fetchUser to getAccount and userId to accountId:\n${originalCode}`,
    },
  ],
  prediction: {
    type: "content",
    content: originalCode,
  },
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}

Cerebras

Cerebras: Predicted outputs with Llama

from cerebras.cloud.sdk import Cerebras

client = Cerebras()

# Contents of the file being edited (path is illustrative);
# its exact text doubles as the prediction
original_code = open("app.py").read()

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {
            "role": "user",
            "content": "Replace 'logger.info' with 'logger.debug' in this code:\n" + original_code
        }
    ],
    prediction={
        "type": "content",
        "content": original_code
    }
)

print(response.choices[0].message.content)

Fireworks AI

Fireworks: Predicted outputs

import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="your-fireworks-api-key",
)

# Contents of the config file being edited (path is illustrative)
original_config = open("config.yaml").read()

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    messages=[
        {"role": "user", "content": f"Update this config:\n{original_config}"}
    ],
    prediction={
        "type": "content",
        "content": original_config
    }
)

print(response.choices[0].message.content)

cURL

cURL: OpenAI predicted outputs

curl -X POST https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4.1",
    "messages": [
      {
        "role": "user",
        "content": "Rename className to styleName in this React component:\n..."
      }
    ],
    "prediction": {
      "type": "content",
      "content": "function MyComponent({ className }) {\n  return <div className={className}>Hello</div>;\n}"
    }
  }'

Performance Benchmarks

The speedup from predicted outputs depends on two factors: prediction accuracy (how much of your prediction matches the output) and base model speed (how fast the provider generates tokens normally).

Task                                       | Without Prediction | With Prediction | Speedup
Variable rename (200 tokens)               | 1.8s               | 0.4s            | 4.5x
Bug fix in function (500 tokens)           | 4.2s               | 1.1s            | 3.8x
Config value update (150 tokens)           | 1.4s               | 0.3s            | 4.7x
Refactor method signature (800 tokens)     | 7.1s               | 2.3s            | 3.1x
Rewrite paragraph in document (400 tokens) | 3.6s               | 1.5s            | 2.4x
Generate new function (300 tokens)         | 2.7s               | 2.6s            | 1.04x

Benchmarks measured on OpenAI GPT-4o, single request, February 2026. Results vary by provider, model, and server load.

Token-Level Metrics

The API reports two key metrics for predicted outputs:

  • Accepted prediction tokens — tokens from your prediction that matched what the model generated. These were validated in parallel and did not require sequential generation.
  • Rejected prediction tokens — tokens from your prediction that did not match. On OpenAI, these are still billed as completion tokens.

A high accepted-to-rejected ratio indicates your prediction was accurate and you got a significant speedup. A ratio below 50% suggests predicted outputs may not be beneficial for that particular request.
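
The ratio check is a one-liner against the response's usage object (field names follow OpenAI's completion_tokens_details; a sketch, not an official helper):

```python
from types import SimpleNamespace

def prediction_hit_rate(usage) -> float:
    """Fraction of predicted tokens the model accepted.

    On OpenAI responses, usage.completion_tokens_details carries
    accepted_prediction_tokens and rejected_prediction_tokens.
    """
    d = usage.completion_tokens_details
    total = d.accepted_prediction_tokens + d.rejected_prediction_tokens
    return d.accepted_prediction_tokens / total if total else 0.0

# Demo with a stand-in usage object (193 accepted, 7 rejected, matching
# the conceptual-flow example earlier in this article)
demo = SimpleNamespace(completion_tokens_details=SimpleNamespace(
    accepted_prediction_tokens=193, rejected_prediction_tokens=7))
rate = prediction_hit_rate(demo)  # → 0.965
```

Log this per request; sustained values below ~0.5 are the signal to drop the prediction parameter for that workload.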

Tuning and Best Practices

Use Verbatim Original Content

For rewrites and edits, pass the exact original text or code as your prediction. This maximizes token overlap where the content is unchanged.

Keep Instructions Focused

Narrow instructions like 'rename userId to accountId' produce higher prediction accuracy than broad instructions like 'improve this code.'

Lower Temperature

Set temperature to 0 or near-zero for deterministic output. Higher temperature increases randomness, reducing prediction matches.

Scope Changes Tightly

Instead of rewriting an entire file, scope the predicted output to the function or section being changed. Smaller context = higher hit rate.

Monitor Accepted/Rejected Ratios

Track the ratio per request. If rejected tokens consistently exceed accepted tokens, predicted outputs are hurting performance, not helping.

Benchmark Before Committing

Predicted outputs add overhead for prediction verification. For low-overlap tasks, this overhead can make requests slower. Always measure.
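
"Always measure" can be as simple as timing the same request with and without the prediction parameter (the two callables are placeholders for your own client calls):

```python
import time
from statistics import median

def compare_latency(baseline_call, predicted_call, runs: int = 5):
    """Time two request variants and return
    (median baseline seconds, median predicted seconds, speedup)."""
    def timed(fn):
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - start)
        return median(samples)

    base = timed(baseline_call)
    pred = timed(predicted_call)
    return base, pred, base / pred
```

Usage: `compare_latency(lambda: create(**kwargs), lambda: create(**kwargs, prediction=pred))`. A speedup below 1.0 means the verification overhead is costing you latency on that workload.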

Pitfalls and Failure Modes

Cost Increase on OpenAI

OpenAI bills rejected prediction tokens as completion tokens. If your prediction is poor (low overlap), you pay for the rejected tokens on top of the normally generated tokens. This can make requests more expensive than without predictions.
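
The billing effect is simple arithmetic: rejected prediction tokens are added to the tokens actually emitted (a sketch of OpenAI's stated policy; the per-token price is a placeholder, not a real rate):

```python
def completion_cost(output_tokens: int, rejected_tokens: int,
                    price_per_token: float) -> float:
    """Billed completion cost: tokens actually emitted plus rejected
    prediction tokens, which OpenAI bills at the completion rate."""
    return (output_tokens + rejected_tokens) * price_per_token

# 200-token output, placeholder price of $10 per 1M completion tokens
price = 10 / 1_000_000
good = completion_cost(200, 7, price)    # accurate prediction: ~3.5% surcharge
bad = completion_cost(200, 180, price)   # poor prediction: ~90% surcharge
```

Against a no-prediction baseline of 200 billed tokens, an accurate prediction adds a few percent while a poorly matched one can nearly double the completion bill.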

No Benefit on Novel Generation

If the model is generating entirely new content, your prediction will not match and every token will be rejected. You get no speedup and potentially higher costs.

Broad Instructions Reduce Match Rate

An instruction like "clean up this code" gives the model freedom to restructure, reformat, and rewrite — reducing overlap with your prediction (the original code). Narrow instructions produce better prediction accuracy.

Incompatible Features

On OpenAI, predicted outputs cannot be used with tool calls, structured outputs (JSON mode), logprobs, or n > 1. Check each provider's documentation for specific limitations.

Silent Regressions

Without monitoring accepted/rejected ratios, you may not notice when a prompt change or model update reduces prediction accuracy. Build observability into your pipeline.

Predicted Outputs vs Prompt Caching vs Speculative Decoding

These three techniques are often confused because they all reduce latency. They target different parts of the inference pipeline and can be used together.

Feature              | Predicted Outputs                    | Prompt Caching                              | Speculative Decoding
What it optimizes    | Output generation                    | Input processing (prefill)                  | Output generation
How it works         | Verify predicted tokens in parallel  | Reuse cached KV states for repeated prefixes | Draft model proposes, target model verifies
Best for             | Rewrites / edits                     | Repeated prompts with same prefix           | Any generation task
Changes output?      | No                                   | No                                          | No
Requires extra input | Prediction string                    | No (automatic)                              | Draft model
Available via API    | Yes (most providers)                 | Yes (automatic)                             | Self-hosted only
Can combine?         | Yes                                  | Yes                                         | N/A for API users

Predicted Outputs vs Speculative Decoding

Predicted outputs are themselves a form of speculative decoding. The difference is where the draft tokens come from:

  • Speculative decoding uses a smaller, faster draft model to propose tokens.
  • Predicted outputs use your provided prediction string as the draft.

Both use the same verification mechanism: the target model checks all proposed tokens in a single forward pass. When your prediction is highly accurate (code edits), providing the prediction directly is more efficient than running a draft model.

Predicted Outputs vs Prompt Caching

These are complementary. Prompt caching speeds up the input processing (prefill phase) by reusing cached key-value states when you send the same prefix. Predicted outputs speed up the output generation (decode phase). For code editing requests, you can benefit from both: cache the system prompt and code context, and predict the output.
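
In practice, combining the two just means keeping a stable prefix (so the provider's automatic prompt cache can hit) while also passing the prediction. A sketch of the request shape — the system prompt and model name are placeholders:

```python
# Stable system prompt first: identical prefixes across requests are what
# automatic prompt caching reuses on providers like OpenAI.
SYSTEM_PROMPT = (
    "You are a code editor. Apply the requested change and "
    "return the complete updated file."
)

def build_edit_request(instruction: str, original_code: str) -> dict:
    """Build chat-completion kwargs that benefit from both prompt
    caching (stable prefix) and predicted outputs (decode phase)."""
    return {
        "model": "gpt-4.1",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{instruction}\n\n{original_code}"},
        ],
        # The original file doubles as the prediction for the rewrite.
        "prediction": {"type": "content", "content": original_code},
    }

request = build_edit_request("Rename user_id to account_id",
                             "def f(user_id): ...")
```

Pass the result straight through: `client.chat.completions.create(**request)`.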

Predicted Outputs for Code Editing

Code editing is the single best use case for predicted outputs. Here is why the overlap rates are so high:

  • A typical code file is 100-500 lines. Most edits touch 1-10 lines. That is 95-99% overlap.
  • Code has rigid structure — indentation, syntax, braces — that the model preserves exactly.
  • Edit instructions are usually precise ("rename X to Y", "add error handling to this function"), reducing divergence.

The Cursor Approach: Speculative Edits

Cursor built on this concept with their "speculative edits" system, developed in collaboration with Fireworks AI. Instead of sending the original file as a prediction, Cursor uses longer speculation sequences and a custom algorithm to make code edits faster. This approach powers their "Fast Apply" feature.

The Morph Approach: Purpose-Built Apply Model

Morph takes a fundamentally different approach. Instead of bolting predicted outputs onto a general-purpose LLM, Morph uses a model trained specifically for code transformation.

The result:

  • 10,500+ tokens per second
  • 99.2% accuracy on apply benchmarks
  • $0.50 per million tokens

Morph does not require you to manage prediction parameters, monitor accepted/rejected ratios, or worry about cost increases from rejected tokens. The model inherently understands code structure and produces accurate merges at inference speeds that exceed what predicted outputs can achieve on general-purpose models.

Dimension        | Predicted Outputs (GPT-4.1)                 | Morph Fast Apply
Speed            | 3-5x over baseline (~1,500 tok/s effective) | 10,500+ tok/s native
Setup complexity | Manage prediction param, monitor ratios     | Single API call
Cost risk        | Rejected tokens increase cost               | Fixed per-token pricing
Model type       | General-purpose LLM with prediction hint    | Purpose-built code transform model
Best for         | When you need general LLM + speed           | Dedicated code editing pipelines

Skip the Prediction Parameter

Morph Fast Apply delivers 10,500+ tok/s for code edits without requiring prediction strings, ratio monitoring, or rejected-token costs. Purpose-built for code transformation.

FAQ

What are predicted outputs?

Predicted outputs are a latency optimization where you provide a reference string that you expect the model to generate. The API uses speculative decoding to validate predicted tokens in parallel, skipping ahead when your prediction matches, resulting in 3-5x faster responses for rewrite workloads.

Do predicted outputs change the model's response?

No. Predicted outputs are lossless — they do not alter the content the model generates. The prediction is only used to accelerate generation. The output is identical to what the model would produce without the prediction parameter.

Which providers support predicted outputs?

OpenAI (GPT-4o, GPT-4.1 series), Cerebras, Fireworks AI, Mistral, LiteLLM, and Azure OpenAI all support predicted outputs. Open-source frameworks like vLLM also support speculative decoding for self-hosted models.

When should I use predicted outputs vs prompt caching?

Use predicted outputs when you can predict the output (rewrites, code edits, template updates). Use prompt caching when you send the same input prefix repeatedly. They optimize different parts of the pipeline — predicted outputs speed up generation, prompt caching speeds up prefill.

Are rejected predicted tokens billed?

On OpenAI, yes — rejected predicted tokens are billed as completion tokens. This means poorly matched predictions can increase costs. Other providers may handle billing differently. Always benchmark your specific use case.

How does Morph compare to predicted outputs?

Morph takes a different approach: instead of using predicted outputs with a general-purpose LLM, Morph uses a purpose-built fast apply model trained specifically for code transformation. This achieves 10,500+ tokens per second with high accuracy, without requiring you to manage prediction parameters.