Manual prompt engineering works for simple tasks. It breaks when you need to optimize across models, handle edge cases at scale, or tune multi-step pipelines. This guide covers every major automated prompt optimization framework, with benchmarks, code examples, and a decision matrix for choosing the right one.
Why Manual Prompting Fails at Scale
A developer can write a good prompt for a single task on a single model in an afternoon. The problems start when that prompt needs to work reliably across thousands of inputs, transfer to a different model, or coordinate with other prompts in a pipeline.
Brittleness
Prompts that work for 90% of inputs fail catastrophically on edge cases. Semantically identical phrasings ('Calculate' vs 'Determine') produce different outputs. A prompt tuned on 100 examples breaks on a different data distribution.
Model Coupling
A GPT-4 prompt degrades on Claude. A Claude prompt fails on Llama. Each model switch requires manual rewriting and re-testing. With optimization, you recompile instead of rewrite.
Combinatorial Explosion
A multi-step pipeline has instructions, few-shot examples, formatting, and chain-of-thought structure to tune at each step. 5 choices per variable across 4 steps = 625 combinations. No developer tests 625 variants.
The systematic evidence backs this up. Research shows that prompts tuned on small samples frequently produce errors when scaled, with models failing to focus on relevant information. Chain-of-thought alone can improve math reasoning by 30-60%, but choosing the right chain-of-thought phrasing requires testing dozens of candidates. On well-defined tasks, systematic optimization yields 15-40% improvement over naive prompts.
The optimization gap
An OPRO experiment with PaLM 2-L found that optimized instructions outperformed "Let's think step by step" by over 5% on 19 of 23 BBH tasks. The best human-designed prompt was rarely the best possible prompt. An optimizer that tests 100 candidates in 10 rounds will almost always find something better than what a developer writes in an afternoon.
Taxonomy of Prompt Optimization Approaches
Every prompt optimization method shares the same loop: generate candidate prompts, evaluate them on data, update the search based on results. They differ in how they generate candidates and how they update.
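That shared loop can be sketched in a few lines of Python. This is a minimal sketch, not any framework's API: `propose` and `evaluate` are placeholder callables standing in for whatever a given method plugs in (an LLM meta-prompt for OPRO, a metric over a trainset for DSPy, and so on).

```python
def optimize(propose, evaluate, seed_prompt, rounds=10, width=10):
    """Generic propose-evaluate-update loop shared by all methods.

    propose(history)  -> list of new candidate prompts, given scored history
    evaluate(prompt)  -> scalar score on held-out data
    """
    history = [(seed_prompt, evaluate(seed_prompt))]
    for _ in range(rounds):
        # Generate: ask the method for new candidates based on what scored well
        for prompt in propose(history):
            history.append((prompt, evaluate(prompt)))
        # Update: keep only the best candidates as context for the next round
        history.sort(key=lambda pair: pair[1], reverse=True)
        history = history[:width]
    return history[0]  # (best_prompt, best_score)
```

Every method in the taxonomy below is a different answer to two questions: what does `propose` look at, and how much signal does each `evaluate` call extract.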
1. Gradient-Free Search (OPRO, APE)
The simplest approach. Use the LLM itself to propose better prompts based on prior scores. OPRO (Google DeepMind, ICLR 2024) runs 10 rounds of 10 candidates, evaluating each on a training set and feeding scores back to the LLM as context for the next round. APE (Automatic Prompt Engineer) generates task instructions from input-output examples without any initial prompt, then selects the highest-scoring one.
Strengths: no infrastructure, works with any API. Weaknesses: no few-shot optimization, limited to single-prompt tasks, high variance across runs.
2. Gradient-Based Feedback (TextGrad)
TextGrad (Stanford, published in Nature) applies the concept of backpropagation to text. Instead of numerical gradients, it uses an LLM to generate natural language feedback ("textual gradients") on outputs, then uses that feedback to refine the prompt or solution iteratively. It excels at instance-level optimization: taking a single hard problem (a LeetCode question, a scientific paper review) and refining the solution through multiple passes.
Strengths: strong on hard individual problems, produces interpretable feedback. Weaknesses: expensive per-instance (each iteration costs LLM calls for both forward and backward passes), not designed for batch optimization across datasets.
3. Compiled Pipelines (DSPy)
DSPy (Stanford NLP, ICLR 2024) treats LM pipelines like neural networks: parameterized systems compiled against a loss function. You define what each step should do (Signatures), compose steps into a pipeline (Modules), and run an optimizer (MIPROv2, BootstrapFewShot) that searches for the best instructions and demonstrations. The compiled program serializes to JSON and runs without recompilation.
Strengths: handles multi-step pipelines, optimizes both instructions and demonstrations, portable across models. Weaknesses: requires a metric function and training data (20+ examples minimum), higher setup overhead.
4. Evolutionary Methods (EvoPrompt, PromptBreeder, GEPA)
These treat prompt optimization as an evolutionary search. EvoPrompt combines LLMs with genetic algorithms (crossover, mutation) to evolve prompt populations, outperforming human prompts by up to 25% on BBH. PromptBreeder adds self-referential improvement: it evolves not just prompts but the mutation operators themselves, achieving 83.9% zero-shot on commonsense reasoning.
GEPA (ICLR 2026, Oral) is the current state of the art in this category. It samples execution trajectories (reasoning, tool calls, outputs), reflects on them to diagnose failures, and combines complementary lessons from the Pareto frontier of its own attempts. GEPA improved GPT-4.1 Mini on AIME 2025 from 46.6% to 56.6% (+10 points) and outperformed reinforcement learning (GRPO) by 6% on average while using 35x fewer rollouts.
5. RL and Search-Based (PromptAgent, GRPO)
PromptAgent (ICLR 2024) uses Monte Carlo Tree Search (MCTS) to navigate the prompt space, reflecting on model errors to generate constructive feedback. It outperformed APE by 9.1% on GPT-3.5 and 7.7% on GPT-4 across 12 tasks. GRPO (Group Relative Policy Optimization, from DeepSeek) treats prompt generation as a policy optimization problem, training the model to prefer prompts that produce better outputs.
Strengths: can discover non-obvious prompt strategies through deep search. Weaknesses: computationally expensive, requires many rollouts, less interpretable than evolutionary methods.
| Category | Key Methods | Best For | Data Required |
|---|---|---|---|
| Gradient-Free Search | OPRO, APE | Quick baseline improvement, single prompts | Input-output pairs |
| Gradient-Based Feedback | TextGrad, metaTextGrad | Hard individual problems (coding, science) | None (instance-level) |
| Compiled Pipelines | DSPy (MIPROv2, BootstrapFewShot) | Multi-step production pipelines | 20+ labeled examples + metric |
| Evolutionary | EvoPrompt, PromptBreeder, GEPA | Maximum performance, novel strategies | 10+ examples (GEPA), varies |
| RL / Search | PromptAgent, GRPO | Expert-level prompts via deep exploration | Task examples + many rollouts |
Framework Deep Dives
DSPy: Compiled Prompt Pipelines
DSPy is the most complete framework for production prompt optimization. It has 32,700+ GitHub stars, 1,500 dependent projects, and ships 8 optimizers covering different tradeoffs. The key abstraction: you write a program using Signatures (what each LM call should do) and Modules (how it should reason), then compile it against a metric. The compiled program stores optimized instructions and demonstrations in JSON.
DSPy: Define, optimize, deploy
```python
import dspy
from dspy.teleprompt import MIPROv2

# 1. Configure the model
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# 2. Define the task as a module
class CodeReviewer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.review = dspy.ChainOfThought(
            "code: str, language: str -> issues: list[str], severity: str"
        )

    def forward(self, code: str, language: str):
        return self.review(code=code, language=language)

# 3. Define your metric
def review_quality(example, pred, trace=None):
    found = sum(1 for issue in example.issues if issue in pred.issues)
    return found / max(len(example.issues), 1)  # guard against empty labels

# 4. Optimize with MIPROv2 (Bayesian search over instructions + demos)
optimizer = MIPROv2(metric=review_quality, auto="medium")
optimized = optimizer.compile(CodeReviewer(), trainset=train_examples)

# 5. Save the compiled program (runs without re-optimization)
optimized.save("code_reviewer_compiled.json")
```

DSPy's MIPROv2 optimizer uses Bayesian optimization to search the instruction space efficiently. It bootstraps few-shot demonstrations from your data, generates data-aware instruction proposals, and evaluates candidates on mini-batches. The "medium" preset runs 25 trials, and "heavy" runs 50+. For production systems, MIPROv2 with 40+ trials consistently finds better prompts than manual engineering. See the full DSPy prompt optimization guide for all optimizer types and code patterns.
TextGrad: Gradient-Based Prompt Refinement
TextGrad treats LLM outputs as variables in a computational graph and uses natural language feedback as "gradients" to improve them. The forward pass generates an output. The backward pass asks an LLM to critique the output and suggest improvements. The update step revises the prompt or solution based on that critique.
TextGrad: Iterative refinement via textual gradients
```python
import textgrad as tg

# Engines: one runs the model, one generates the feedback ("gradients")
llm_engine = tg.get_engine("gpt-4o-mini")
tg.set_backward_engine("gpt-4o", override=True)

# Define the variable to optimize (a system prompt)
system_prompt = tg.Variable(
    "You are a code reviewer. Find bugs and suggest fixes.",
    role_description="system prompt for code review",
    requires_grad=True,  # this variable will be optimized
)

# Forward pass: generate output using the prompt
model = tg.BlackboxLLM(llm_engine, system_prompt=system_prompt)
code_input = tg.Variable(
    "char buf[8]; strcpy(buf, user_input);",  # snippet with a buffer overflow
    role_description="code to review",
    requires_grad=False,
)
output = model(code_input)

# Backward pass: get textual feedback
loss_fn = tg.TextLoss(
    "Evaluate if the review catches the buffer overflow bug"
)
loss = loss_fn(output)
loss.backward()  # generates natural language gradients

# Update: revise the prompt based on feedback
optimizer = tg.TGD(parameters=[system_prompt])
optimizer.step()  # applies textual gradients to improve the prompt
```

TextGrad improved GSM8K reasoning from 72.9% to 81.1%, boosted GPQA accuracy from 51% to 55%, and increased LeetCode solution quality by 20%. The metaTextGrad extension (2025) optimizes the optimizer itself, yielding 6-11% additional gains. The main tradeoff: TextGrad excels at refining individual hard problems but is expensive for batch optimization across large datasets.
GEPA: Reflective Prompt Evolution
GEPA (Genetic-Pareto) reflects on execution trajectories to learn high-level rules from trial and error. For each candidate prompt, GEPA runs it on test cases, examines the full trajectory (reasoning steps, tool calls, outputs), diagnoses what went wrong in natural language, and proposes targeted updates. It maintains a Pareto frontier of complementary prompt variants and combines lessons from multiple attempts.
GEPA's advantage is efficiency. Reinforcement learning methods like GRPO need thousands of rollouts to converge. GEPA achieves better results with 35x fewer because each reflection step extracts more signal per rollout. The paper was accepted as an Oral at ICLR 2026, the highest distinction. Read the full GEPA prompt optimization guide for implementation details.
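The Pareto-frontier bookkeeping at the core of GEPA can be illustrated with a small sketch. This is an illustration of the concept, not GEPA's actual implementation: each candidate is represented here as a prompt plus a tuple of per-task scores, and a candidate survives only if no other candidate beats it on every task.

```python
def dominates(a, b):
    """Score vector a dominates b: at least as good everywhere, better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(candidates):
    """Keep prompts whose per-task score vectors are not dominated.

    candidates: list of (prompt, scores) pairs, scores a tuple of per-task floats.
    """
    return [
        (prompt, scores)
        for prompt, scores in candidates
        if not any(dominates(other, scores) for _, other in candidates if other != scores)
    ]
```

The point of keeping a frontier rather than a single best prompt is that two prompts can fail on disjoint subsets of tasks; reflection can then merge their complementary lessons into a candidate that dominates both.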
OPRO: LLM-as-Optimizer
OPRO from Google DeepMind treats the LLM itself as the optimizer. You provide a "meta-prompt" containing prior candidate prompts and their scores. The LLM generates new candidates that aim to score higher. Over 10 rounds of 10 candidates each, performance converges to a good solution.
OPRO: Using the LLM to optimize its own prompts
```python
# OPRO meta-prompt structure (simplified)
meta_prompt = """
Below are prompts and their scores on a math reasoning task.
Generate a new prompt that will score higher.

Prompt: "Solve this step by step." Score: 0.72
Prompt: "Break this into sub-problems, solve each, combine." Score: 0.79
Prompt: "Identify what's given, what's asked, then solve." Score: 0.81

New prompt:"""

# The LLM generates a new candidate prompt.
# Evaluate it on the training set, add its score to the history,
# and repeat for 10 rounds of 10 candidates each.
```

OPRO improved BBH performance by up to 50% over human-designed prompts and GSM8K by 8%. The instructions it discovers are often non-obvious: one of its best math prompts was "Take a deep breath and work on this problem step-by-step," which outperformed more elaborate human-designed instructions. The simplicity is the appeal: no framework, no infrastructure, just API calls and a scoring function.
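The meta-prompt is a single proposal step; wrapping it in the full loop takes a dozen lines. This is a hedged sketch, not Google's implementation: `llm` and `score` are placeholder callables for your completion call and your evaluation function.

```python
def opro(llm, score, seed_prompts, rounds=10, per_round=10):
    """Minimal OPRO-style loop: the LLM proposes, the trainset scores.

    llm(meta_prompt)  -> str, a new candidate prompt
    score(prompt)     -> float, accuracy on the training set
    """
    history = [(p, score(p)) for p in seed_prompts]
    for _ in range(rounds):
        # Show prior candidates worst-to-best so the best appear last (recency helps)
        history.sort(key=lambda pair: pair[1])
        shown = "\n".join(f'Prompt: "{p}" Score: {s:.2f}' for p, s in history[-20:])
        meta = (
            "Below are prompts and their scores on a math reasoning task.\n"
            f"{shown}\n"
            "Generate a new prompt that will score higher.\nNew prompt:"
        )
        for _ in range(per_round):
            candidate = llm(meta).strip().strip('"')
            history.append((candidate, score(candidate)))
    return max(history, key=lambda pair: pair[1])
```

Truncating to the last 20 candidates keeps the meta-prompt short; the original paper similarly caps how much history the optimizer LLM sees per round.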
SPO: Cost-Efficient Optimization
Self-Supervised Prompt Optimization (SPO, EMNLP 2025) achieves performance comparable to TextGrad and other expensive methods at 1.1% to 5.6% of the cost. Instead of requiring external reference outputs, SPO selects better prompts through pairwise output comparisons evaluated by an LLM, followed by an LLM optimizer that aligns outputs with task requirements. It needs as few as 3 examples to start improving.
SPO is the right choice when your budget is limited and you need quick improvements without the infrastructure overhead of DSPy or the per-instance cost of TextGrad.
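SPO's core trick, replacing ground-truth labels with pairwise judgments, is easy to sketch. Assumed, hypothetical helpers here: `generate(prompt, x)` runs the task model and `judge(x, out_a, out_b)` is an LLM call returning "A" or "B" for the output that better meets the task requirements.

```python
def spo_select(judge, generate, prompt_a, prompt_b, inputs):
    """Return whichever prompt wins more pairwise output comparisons.

    No reference outputs needed: an LLM judge compares the two prompts'
    outputs head-to-head on each input.
    """
    wins_a = 0
    for x in inputs:
        out_a = generate(prompt_a, x)
        out_b = generate(prompt_b, x)
        if judge(x, out_a, out_b) == "A":
            wins_a += 1
    return prompt_a if wins_a * 2 >= len(inputs) else prompt_b
```

Because each comparison costs two generations plus one judge call on a handful of examples, the per-iteration cost stays a small fraction of metric-based methods that score every candidate on a full trainset.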
Head-to-Head Benchmarks
Benchmark numbers come from the original papers, peer-reviewed and published at top venues (ICLR 2024, ICLR 2026, EMNLP 2025, Nature). All comparisons are on the same task and model unless noted.
| Method | Base Accuracy | Optimized Accuracy | Gain |
|---|---|---|---|
| TextGrad (GPT-4o) | 72.9% | 81.1% | +8.2 points |
| OPRO (PaLM 2-L) | Baseline | +8% over human prompts | +8% |
| DSPy BootstrapFewShot (GPT-3.5) | Standard few-shot | +25% over baseline | +25% |
| SPO (GPT-4o-mini) | Comparable to TextGrad | Comparable to TextGrad | 1.1% of the cost |
| Method | vs Human Prompts | vs Chain-of-Thought | Notable |
|---|---|---|---|
| OPRO | Up to +50% | +5% on 19/23 tasks | ICLR 2024 |
| EvoPrompt | Up to +25% | Consistent across 31 datasets | Genetic + LLM crossover |
| PromptBreeder | 83.9% zero-shot | Beats CoT and Plan-and-Solve | Self-referential mutation |
| PromptAgent (MCTS) | +9.1% over APE (GPT-3.5) | +7.7% over APE (GPT-4) | ICLR 2024 |
| Method | Model | AIME 2025 Accuracy | Efficiency |
|---|---|---|---|
| GEPA | GPT-4.1 Mini | 56.6% | 35x fewer rollouts than GRPO |
| MIPROv2 | GPT-4.1 Mini | ~44.6% | Bayesian search |
| GRPO (RL) | GPT-4.1 Mini | ~50.6% | Thousands of rollouts |
| Baseline | GPT-4.1 Mini | 46.6% | No optimization |
The cost-performance frontier
SPO consistently ranks among the top two methods across all benchmarks while costing 1.1-5.6% of TextGrad and other methods. GEPA achieves the highest absolute performance but requires more compute than OPRO or SPO. DSPy MIPROv2 offers the best balance of performance, infrastructure, and production readiness. The right choice depends on whether you are optimizing for accuracy, cost, or deployment simplicity.
Framework Comparison Table
| Dimension | DSPy | TextGrad | GEPA | OPRO | Manual |
|---|---|---|---|---|---|
| What it optimizes | Instructions + demos | Prompt or solution text | Prompt via reflection | Instruction text | Everything (by hand) |
| Multi-step pipelines | Yes (core strength) | Single-variable focus | Yes (tool calls, reasoning) | Single prompt only | Possible but brittle |
| Data requirement | 20+ labeled examples | None (instance-level) | 10+ examples | Input-output pairs | None |
| Cost per run | $0.50-$20 | $1-$10 per instance | Moderate (fewer rollouts) | $3-$15 | Developer hours |
| Model portability | Recompile for new model | Model-agnostic | Model-agnostic | Model-agnostic | Manual rewrite |
| Best benchmark gain | +65% (Llama 2 13B) | +8.2 points (GSM8K) | +12% over MIPROv2 | +50% (BBH) | Baseline |
| Production readiness | High (JSON export) | Medium (research) | Medium (research) | Low (no framework) | High (but manual) |
| Publication | ICLR 2024 | Nature | ICLR 2026 (Oral) | ICLR 2024 | N/A |
Decision Matrix: Which Framework to Use
| Scenario | Best Choice | Why |
|---|---|---|
| Production multi-step pipeline (RAG, agents) | DSPy + MIPROv2 | Handles pipeline composition, serializes to JSON, 8 optimizer options |
| Hard individual problem (coding, science) | TextGrad | Instance-level refinement through iterative feedback, +20% on LeetCode |
| Maximum accuracy on math/reasoning | GEPA | +12% over MIPROv2, +6% over GRPO, ICLR 2026 Oral |
| Quick improvement with zero infrastructure | OPRO | Just API calls and a scoring function, +50% on BBH |
| Budget-constrained optimization | SPO | 1.1-5.6% of TextGrad cost, comparable accuracy |
| Optimizing coding agent prompts | DSPy or GEPA | DSPy for pipeline structure, GEPA for trajectory reflection |
| Evolving prompt populations | EvoPrompt or PromptBreeder | Genetic crossover finds non-obvious strategies |
| Expert-level prompt discovery | PromptAgent | MCTS explores deep search tree, +9.1% over APE |
| Distilling to a smaller model | DSPy BetterTogether | Combines prompt optimization + fine-tuning in sequence |
Start with prompt optimization when:
- You have a measurable metric for success
- You can collect 10-50 labeled examples
- Manual prompt iteration has stalled at a plateau
- You need to support multiple models
- Your pipeline has 2+ LLM calls in sequence
- Prompt brittleness is causing production failures
Skip prompt optimization when:
- A simple prompt already achieves 95%+ accuracy
- You have no training data and cannot create it
- The task has no quantifiable success metric
- You need a working prototype in under an hour
- You're doing open-ended creative generation
- The model and task will never change
Integration with Coding Agents
Coding agents like Claude Code, Cursor, and Windsurf rely on system prompts and tool-use instructions to navigate codebases, plan edits, and apply changes. These prompts are exactly the kind that benefit from optimization: they run thousands of times, need to handle diverse inputs, and have measurable outcomes (did the edit apply correctly? did the test pass?).
Several production systems already apply prompt optimization to coding agents:
Arize Phoenix: Prompt Learning
Automatically optimizes coding agent rulesets. Generates rules that lead to better accuracy on coding tasks, tuned specifically for the user's codebase and task distribution.
Opik Agent Optimizer
Uses MIPROv2 to optimize agent instructions with and without tools. Includes a Meta Prompter that uses reasoning models to critique and iteratively refine instruction prompts.
Databricks: 90x Cost Reduction
Databricks reported that automated prompt optimization matched supervised fine-tuning on performance while reducing serving costs by 20%, enabling enterprise agents at 90x lower cost than manual prompt engineering.
GEPA for Agent Trajectories
GEPA reflects on full agent trajectories (reasoning, tool calls, outputs) to diagnose problems. Naturally suited to coding agents that produce multi-step execution traces.
The optimization loop for coding agents follows a pattern: run the agent on a set of coding tasks, measure pass rate or edit accuracy, feed failures into the optimizer, and get back improved system prompts and tool-use instructions. GEPA is particularly suited here because it reflects on full trajectories rather than just final outputs, catching issues like "the agent searched the wrong directory" or "the agent applied the edit in the wrong file."
Optimizing a coding agent's system prompt with DSPy
```python
import dspy
from dspy.teleprompt import MIPROv2

class CodingAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        self.plan = dspy.ChainOfThought(
            "task: str, codebase_context: str -> plan: str, files_to_edit: list[str]"
        )
        self.edit = dspy.Predict(
            "plan: str, file_content: str -> edited_content: str"
        )

    def forward(self, task, codebase_context):
        plan = self.plan(task=task, codebase_context=codebase_context)
        # Apply edits to each file the plan names
        results = []
        for file_path in plan.files_to_edit:
            content = read_file(file_path)  # your file I/O helper
            edited = self.edit(plan=plan.plan, file_content=content)
            results.append(edited)
        return dspy.Prediction(edits=results)

# Metric: does the edited code pass the test suite?
def edit_passes_tests(example, pred, trace=None):
    return run_tests(pred.edits)  # your test harness; returns 1.0 if tests pass

# Optimize the agent's prompts
optimizer = MIPROv2(metric=edit_passes_tests, auto="heavy")
optimized_agent = optimizer.compile(CodingAgent(), trainset=coding_tasks)
```

Infrastructure for Optimization Loops
Prompt optimization loops generate and evaluate hundreds of prompt candidates. Each evaluation may involve running an agent that searches a codebase, generates edits, and applies them. Two bottlenecks emerge: search speed (how fast can the agent find relevant code?) and edit speed (how fast can it apply changes?).
Fast Apply: 10,500 tok/s Code Editing
When an optimization loop evaluates 40+ candidate prompts, each producing code edits, edit application speed becomes the bottleneck. Morph's Fast Apply model processes edits at 10,500 tokens/second with 97.3% accuracy, keeping evaluation cycles short.
WarpGrep: Semantic Codebase Search
Coding agents in optimization loops need to find relevant code fast. WarpGrep runs 8 parallel tool calls per turn across 4 turns in under 6 seconds, giving agents the context they need without the 60% search overhead that Cognition measured in unoptimized agents.
The interaction between prompt optimization and infrastructure is multiplicative. A 10x faster edit model means 10x more candidates evaluated per optimization run. More candidates means better prompts. Better prompts mean fewer wasted agent turns. The infrastructure and the optimization reinforce each other.
Anthropic's research showed 90% improvement in multi-agent systems when agents specialize into subagent hierarchies. FlashCompact addresses the context management side, compressing conversation history to keep optimization-heavy agents within context windows. Together, these tools form the infrastructure layer that makes automated prompt optimization practical at scale.
Frequently Asked Questions
What is prompt optimization?
Prompt optimization is the process of automatically finding the best instructions, demonstrations, and formatting for a language model on a specific task. Instead of manually iterating through prompt variants, an optimizer evaluates candidates against a metric function and returns the prompt that scores highest. Methods include DSPy (compiled pipelines), TextGrad (gradient-based feedback), GEPA (reflective evolution), OPRO (LLM-as-optimizer), and evolutionary approaches like EvoPrompt and PromptBreeder.
Which prompt optimization framework should I use?
For production pipelines with clear metrics and 20+ training examples, use DSPy with MIPROv2. For single-instance optimization on hard problems (coding, scientific QA), use TextGrad. For maximum performance with minimal data, use GEPA. For quick baseline improvement without infrastructure, use OPRO. For cost-sensitive optimization, use SPO which achieves comparable results at 1.1-5.6% of the cost.
How much does prompt optimization cost?
DSPy BootstrapFewShot runs cost approximately $0.50-$2 and take 5-10 minutes. MIPROv2 with 40+ trials costs $5-$20 depending on the model. OPRO requires 10 rounds of 10 candidates, costing $3-$15 with GPT-4. SPO achieves comparable results at 1.1-5.6% of the cost of TextGrad and other methods, making it the most cost-efficient option available.
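These ranges can be sanity-checked with a back-of-envelope token model. Every number in the example call is an assumption to replace with your own measurements and your provider's actual pricing:

```python
def run_cost(candidates, evals_per_candidate, tokens_per_eval, usd_per_mtok):
    """Estimate total spend for an optimization run.

    Every candidate prompt is scored on a mini-batch of examples;
    tokens_per_eval covers prompt + completion for one example.
    """
    total_tokens = candidates * evals_per_candidate * tokens_per_eval
    return total_tokens / 1_000_000 * usd_per_mtok

# e.g. an OPRO-style run: 10 rounds x 10 candidates, 50 examples per
# evaluation, ~2k tokens per example, at a hypothetical $1 per 1M tokens
print(run_cost(100, 50, 2_000, 1.0))  # -> 10.0 (dollars)
```

The dominant lever is `evals_per_candidate`: scoring candidates on mini-batches instead of the full trainset, as MIPROv2 does, cuts cost roughly in proportion.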
Does prompt optimization replace fine-tuning?
Not entirely. Prompt optimization changes what you say to the model. Fine-tuning changes the model's weights. Prompt optimization is cheaper, faster to iterate, and requires less data (20-50 examples vs thousands). Fine-tuning achieves higher ceilings on narrow tasks and produces faster inference. DSPy's BetterTogether combines both: optimize the prompt first, then distill into a fine-tuned model. Databricks reported automated prompt optimization delivered performance matching supervised fine-tuning while reducing serving costs by 20%.
What is the difference between OPRO and DSPy?
OPRO uses the LLM itself to generate and score candidate prompts iteratively. It requires no infrastructure beyond API access. DSPy is a full framework that treats prompts as compiled parameters in a pipeline, supporting multi-step programs, few-shot demonstration selection, and serialized compiled programs. OPRO is simpler to start with. DSPy is more powerful for complex pipelines.
What is TextGrad and how does it work?
TextGrad applies the concept of automatic differentiation to text. Instead of numerical gradients, it uses an LLM to generate natural language feedback ("textual gradients") on outputs, then uses that feedback to refine the prompt or solution. It improved GSM8K accuracy from 72.9% to 81.1% and boosted LeetCode problem-solving by 20%. It excels at instance-level optimization for hard individual problems, not batch optimization.
What is GEPA prompt optimization?
GEPA (Genetic-Pareto) uses reflective evolution to learn high-level rules from trial and error. It samples execution trajectories, reflects on them to diagnose problems, proposes targeted prompt updates, and combines complementary lessons from its Pareto frontier. It outperformed MIPROv2 by 12% on AIME 2025 and GRPO by 6% on average while using 35x fewer rollouts. Accepted as an Oral at ICLR 2026.
Can prompt optimization work with coding agents?
Yes. Prompt optimization is used to tune system prompts, tool-use instructions, and planning strategies for coding agents. Arize Phoenix's Prompt Learning optimizes coding agent rulesets automatically. Opik's Agent Optimizer uses MIPROv2 for agent instructions. The key requirement is a measurable metric (test pass rate, edit accuracy) that the optimizer can score against.
Related Guides
Infrastructure for Prompt Optimization Loops
Prompt optimization generates hundreds of candidate edits per run. Morph Fast Apply processes them at 10,500 tok/s with 97.3% accuracy. WarpGrep gives agents the codebase context they need in under 6 seconds. Together, they cut evaluation cycle time so your optimizer converges faster.