GEPA Prompt Optimization: Evolutionary Prompt Evolution for DSPy

GEPA (Genetic-Pareto) uses reflective prompt evolution to outperform MIPROv2 by 13% and GRPO by 20% with 35x fewer rollouts. The definitive guide to DSPy's most sample-efficient optimizer: how it works, when to use it, code examples, and benchmark results from the ICLR 2026 oral paper.

March 14, 2026 · 2 min read

GEPA (Genetic-Pareto) replaces scalar-reward optimization with reflective prompt evolution. It reads execution traces, diagnoses failures in natural language, and maintains a Pareto frontier of diverse prompt candidates. The result: +13% over MIPROv2, +20% over GRPO, with 35x fewer rollouts. Accepted as an ICLR 2026 oral.

  • +13% over MIPROv2 (aggregate across all benchmarks)
  • +20% over GRPO on best task
  • 35x fewer rollouts than RL methods
  • 93% MATH accuracy (vs 67% baseline)

What GEPA Is

GEPA (Genetic-Pareto) is a prompt optimizer from the paper "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning" (Agrawal et al., 2025, arxiv:2507.19457), accepted as an oral presentation at ICLR 2026. It is integrated into DSPy as dspy.GEPA and also ships as a standalone library (pip install gepa).

The core problem GEPA solves: conventional optimizers collapse rich execution data into a single number. They know a prompt scored 0.4 out of 1.0, but they do not know why. Was it a formatting error? A reasoning failure? A missing constraint? Without this diagnostic information, the optimizer has to brute-force its way through the search space, trying random variations until something works.

GEPA takes a different approach. Evaluators return Actionable Side Information (ASI) alongside the score. ASI can be error messages, reasoning traces, profiling data, constraint violations, or any text that helps diagnose the failure. A strong LLM (the "reflector") reads this feedback, identifies what went wrong, and proposes a new prompt that addresses the specific failure mode. Each mutation inherits accumulated lessons from all ancestors in the search tree.

Why 'Genetic-Pareto'?

The name captures both mechanisms. "Genetic" refers to the evolutionary search: a population of prompt candidates is mutated, evaluated, and selected across generations. "Pareto" refers to the selection strategy: rather than keeping only the single best candidate, GEPA maintains the Pareto frontier of candidates that each excel on at least one evaluation instance. This prevents premature convergence and preserves diverse strategies.

Reflective, not random

GEPA reads full execution traces to diagnose failures, not just scores. Each mutation is a targeted fix, not a random perturbation.

Pareto-aware selection

Maintains a frontier of complementary candidates. A prompt survives if it's the best at anything, preserving diverse strategies.

35x more sample-efficient

100-500 evaluations vs 10,000+ for RL. Reflective feedback eliminates the need for brute-force search.

How GEPA Works

GEPA's optimization loop has five stages, repeated for each generation:

1. Candidate Selection from the Pareto Frontier

The optimizer selects a parent candidate from the Pareto frontier. Selection probability is proportional to coverage: candidates that are the sole best on many evaluation instances get sampled more often. This balances exploitation (improving strong candidates) with exploration (giving underrepresented strategies a chance).
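This coverage-weighted selection rule can be sketched in a few lines. The dict layout and function name below are illustrative, not GEPA's internals:

```python
import random

def select_parent(frontier, rng=random.Random(0)):
    """Sample a parent from the Pareto frontier with probability
    proportional to coverage: the number of evaluation instances
    on which a candidate holds the best score.

    frontier: dict mapping candidate id -> list of per-instance scores
    (illustrative data layout).
    """
    n_instances = len(next(iter(frontier.values())))
    coverage = {cid: 0 for cid in frontier}
    for i in range(n_instances):
        # Credit the candidate that is best on instance i
        best = max(frontier, key=lambda cid: frontier[cid][i])
        coverage[best] += 1
    ids = list(coverage)
    weights = [coverage[cid] for cid in ids]
    return rng.choices(ids, weights=weights, k=1)[0]
```

A candidate that is best on many instances is sampled often; one that is best on a single instance still gets occasional turns, which is what keeps underrepresented strategies alive.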

2. Minibatch Evaluation with Trace Capture

The selected candidate runs on a small minibatch from the training set (default size: 2 examples). GEPA captures the full execution trace for each example: the prompt sent, the model's reasoning steps, tool calls and their outputs, the final answer, and why the evaluator scored it the way it did. The trace includes ASI returned by the evaluator.
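The captured record can be thought of as a small structured object. The field names in this sketch are illustrative, not GEPA's internal schema:

```python
from dataclasses import dataclass, field

@dataclass
class ExecutionTrace:
    """One captured rollout (illustrative fields, not GEPA's schema)."""
    prompt: str                 # the prompt sent to the task model
    reasoning: str              # the model's intermediate reasoning steps
    answer: str                 # final answer produced
    score: float                # evaluator's score for this example
    asi: str                    # Actionable Side Information (diagnostic text)
    tool_calls: list = field(default_factory=list)  # (name, output) pairs
```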

3. LLM Reflection

A strong LLM (the reflection_lm) receives the execution traces, both successes and failures, and produces a diagnosis. The reflector identifies failure modes, logic breakdowns, and causal relationships. It sees what worked in successful examples and what broke in failures, building a complete picture of the candidate's strengths and weaknesses.

4. Targeted Mutation

Based on the reflection, GEPA proposes a new prompt that addresses the diagnosed issues. This is not a random mutation. The new prompt inherits all accumulated lessons from its ancestors in the search tree, plus the specific fixes identified by the reflector. GEPA also supports system-aware merge: combining the strengths of two Pareto-optimal candidates that excel on different problem types.

5. Pareto Validation

The new candidate is evaluated on the full validation set. If it achieves the best score on any evaluation instance, it joins the Pareto frontier. If an existing candidate is now dominated (beaten on every instance), it drops off the frontier. This ensures the frontier only grows with genuinely complementary candidates.
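The join-and-prune logic above can be sketched directly. This is an illustrative implementation under the strict-dominance reading of the rule (ties are ignored), not GEPA's actual code:

```python
def update_frontier(frontier, cand_id, scores):
    """Pareto validation step: the new candidate joins only if it beats
    the current best on at least one evaluation instance; any existing
    candidate it now beats on every instance drops off.

    frontier: dict candidate id -> per-instance score list (illustrative).
    """
    current_best = [
        max(s[i] for s in frontier.values()) if frontier else float("-inf")
        for i in range(len(scores))
    ]
    # Not best anywhere: discard the new candidate, frontier unchanged
    if not any(new > best for new, best in zip(scores, current_best)):
        return frontier
    # Drop candidates now beaten on every instance (dominated)
    survivors = {
        cid: s for cid, s in frontier.items()
        if not all(new > old for new, old in zip(scores, s))
    }
    survivors[cand_id] = scores
    return survivors
```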

Actionable Side Information (ASI)

ASI is GEPA's core differentiator. It is the text-optimization analogue of a gradient. While numerical gradients tell you which direction to move in weight space, ASI tells you in natural language what went wrong and how to fix it. Good ASI can be error messages, profiling traces, rendered images via VLMs, constraint violations, or agent reasoning logs. The richer the ASI, the fewer evaluations GEPA needs.

GEPA vs Other DSPy Optimizers

DSPy ships with eight optimizers covering different tradeoffs. GEPA occupies the high-quality, instruction-only end of the spectrum. Here is how it compares to the other major options:

| Dimension | BootstrapFewShot | MIPROv2 | COPRO | GEPA |
|---|---|---|---|---|
| Optimization target | Few-shot demonstrations | Instructions + demonstrations | Instructions only | Instructions only |
| Search strategy | Teacher-generated demos, metric filtering | Bayesian optimization | Coordinate ascent (hill-climbing) | Pareto-aware evolutionary search |
| Feedback signal | Binary pass/fail on metric | Score from metric | Score from metric | Score + ASI (execution traces, error messages) |
| Sample efficiency | Good (10+ examples) | Good (40+ trials, 200+ examples) | Moderate (20-50 evals) | Best (20-100 evals, 10+ examples) |
| Diversity preservation | No (single best) | Moderate | No (single best) | Yes (Pareto frontier) |
| Quality (relative) | Baseline | +5.6% over baseline | Moderate | +13% over baseline |
| Compute per iteration | Low | Medium | Low | Higher (reflection LM calls) |
| Best for | Quick start, small data | Large datasets, balanced | Simple instruction tuning | Maximum quality, complex reasoning |

BootstrapFewShot: The Starting Point

BootstrapFewShot generates demonstrations using a teacher model and filters them by your metric. It is fast, cheap, and requires minimal data. But it only optimizes few-shot examples, not the instructions themselves. For tasks where the instruction quality matters more than the examples (complex reasoning, math, agentic workflows), BootstrapFewShot hits a ceiling quickly.

MIPROv2: The Previous Best

MIPROv2 uses Bayesian optimization to search over both instructions and demonstrations. It is data-aware and demo-aware, producing strong results with enough data (200+ examples) and compute (40+ trials). GEPA outperforms MIPROv2 by 13% in aggregate and by 12% on AIME 2025. The key difference: MIPROv2 treats the metric score as the only feedback signal. GEPA reads the full execution trace.

COPRO: Coordinate Ascent

COPRO generates and refines instructions via hill-climbing. It is simple and interpretable but lacks diversity preservation. Once COPRO finds a local optimum, it has no mechanism to escape. GEPA's Pareto frontier maintains multiple candidates simultaneously, each representing a different local optimum. The merge operation can combine their strengths.

SIMBA: Failure-Focused

SIMBA identifies examples with high output variability (where the model sometimes succeeds and sometimes fails) and uses LLM introspection to generate improvement rules. GEPA and SIMBA share the idea of LLM-driven diagnosis, but GEPA operates on a population level with Pareto selection, while SIMBA focuses on individual hard examples.

GEPA vs TextGrad

TextGrad and GEPA both use LLM feedback to optimize prompts, but their architectures are different enough that they excel in different scenarios.

| Dimension | GEPA | TextGrad |
|---|---|---|
| Search strategy | Population-based Pareto evolution | Single-candidate gradient descent |
| Feedback source | Full execution traces + ASI | Critic LLM feedback |
| Candidate diversity | Maintains Pareto frontier of diverse candidates | Single candidate, sequential refinement |
| Local optima risk | Low (population + merge operations) | Higher (single trajectory) |
| Compute profile | More LLM calls per generation, fewer generations needed | Fewer calls per step, more steps needed |
| Works without gradients | Yes (fully gradient-free) | Yes (uses textual "gradients", not numerical) |
| Multi-task transfer | Yes (Pareto frontier captures per-task specialization) | Limited (single candidate optimized for aggregate) |
| Best for | Complex tasks, diverse evaluation instances | Simple tasks, consistent evaluation criteria |

The fundamental difference: TextGrad refines a single prompt iteratively, like gradient descent on a single trajectory. GEPA maintains a population of diverse candidates and combines their strengths, like evolutionary search with Pareto selection. For tasks where different evaluation instances require different strategies (multi-hop QA, math with varied problem types, agentic workflows with branching logic), GEPA's population approach avoids the local optima that single-candidate methods converge to.

TextGrad is simpler to set up and can work well on tasks with uniform difficulty. GEPA is the better choice when your evaluation set contains heterogeneous problem types that benefit from specialized strategies.

Benchmark Results

The ICLR 2026 paper evaluated GEPA across six tasks against GRPO (a top RL method) and MIPROv2. All results are on Qwen3 8B unless noted otherwise.

  • +12% over MIPROv2 on AIME 2025
  • 93% MATH accuracy (vs 67% baseline ChainOfThought)
  • +6% average gain over GRPO across 6 tasks
  • 97.8% structured extraction (2.93/3 score)
| Comparison | GEPA Advantage | Sample Efficiency |
|---|---|---|
| vs GRPO (RL) | Up to +20% accuracy | 35x fewer rollouts (100-500 vs 24,000) |
| vs MIPROv2 | +13% aggregate, +12% on AIME 2025 | Comparable or fewer evaluations |
| vs BootstrapFewShot | +26 points on MATH (67% to 93%) | More compute per iteration |
| vs COPRO | Higher quality, better diversity | Similar eval count, richer feedback |
| vs baseline prompting | +42.5 points (37.5% to 80%) | 100-500 evaluations total |

MATH Benchmark

GEPA achieves 93% accuracy on the MATH benchmark using a DSPy ChainOfThought program, compared to 67% with the unoptimized ChainOfThought baseline. The 26-point improvement comes from instruction refinement alone, with no few-shot examples, no architectural changes, and no model fine-tuning.

AIME 2025

On AIME 2025 (competition-level math), GEPA with GPT-4.1 Mini achieves 10% gains over the unoptimized baseline. Against MIPROv2 on the same task, GEPA achieves +12% accuracy. Competition math problems are heterogeneous by design, which means different problems reward different reasoning strategies. GEPA's Pareto frontier preserves specialized strategies for geometry, algebra, and number theory rather than averaging them into a single "best" prompt.

Enterprise Structured Extraction

On a facility support analyzer task (extracting structured information from enterprise documents), GEPA achieves a metric of 2.93/3 (97.8%). On financial entity extraction, GEPA delivers +14 points over the DSPy baseline and +22 points over the raw OpenAI baseline.

Why GEPA excels on heterogeneous tasks

Most real-world evaluation sets contain a mix of problem types. A math benchmark includes algebra, geometry, and combinatorics. A QA benchmark includes factoid, multi-hop, and comparative questions. Methods that maintain a single best prompt compromise on all types. GEPA's Pareto frontier keeps a specialist for each type, and the best specialist is selected at inference time. This is the source of GEPA's consistent advantage over single-candidate optimizers.

Code Examples

Basic GEPA Usage in DSPy

The minimal setup requires a DSPy program, a metric function that returns ASI feedback, and a reflection LLM:

GEPA with DSPy (basic)

import dspy
from dspy import GEPA

# Configure your task LM
lm = dspy.LM("openai/gpt-5.4-mini")
dspy.configure(lm=lm)

# Define your program
class MathSolver(dspy.Module):
    def __init__(self):
        super().__init__()
        self.solve = dspy.ChainOfThought("question: str -> answer: str")

    def forward(self, question: str):
        return self.solve(question=question)

# Define a metric with feedback (ASI)
def math_metric(example, pred, trace=None):
    correct = example.answer.strip() == pred.answer.strip()
    score = 1.0 if correct else 0.0
    # Return the score plus feedback as ASI for GEPA to reflect on
    if not correct:
        feedback = f"Expected '{example.answer}' but got '{pred.answer}'. "
        feedback += f"The reasoning was: {getattr(pred, 'reasoning', 'N/A')}"
        return score, feedback
    return score, "Correct answer."

# Initialize GEPA
optimizer = GEPA(
    metric=math_metric,
    auto="light",              # "light", "medium", or "heavy"
    num_threads=8,
    reflection_lm=dspy.LM(
        model="openai/gpt-5.4",  # Strong model for reflection
        temperature=1.0,
        max_tokens=16000
    )
)

# Compile
optimized = optimizer.compile(
    MathSolver(),
    trainset=train_examples,
    valset=val_examples,
)

# Save the optimized program
optimized.save("math_solver_gepa.json")

GEPA with Rich ASI Feedback

The quality of GEPA's optimization scales with the quality of ASI feedback. Here is an example with detailed diagnostic information:

GEPA with rich ASI

def extraction_metric(example, pred, trace=None):
    """Metric that returns rich ASI for GEPA reflection."""
    score = 0.0
    feedback_parts = []

    # Check each expected field
    for field in example.expected_fields:
        predicted = getattr(pred, field, None)
        expected = getattr(example, field)

        if predicted == expected:
            score += 1.0
            feedback_parts.append(f"[CORRECT] {field}: '{predicted}'")
        elif predicted is None:
            score += 0.0
            feedback_parts.append(
                f"[MISSING] {field}: expected '{expected}', got None. "
                f"The model did not extract this field at all."
            )
        else:
            score += 0.0
            feedback_parts.append(
                f"[WRONG] {field}: expected '{expected}', got '{predicted}'. "
                f"Possible cause: the model may have confused this with "
                f"a similar field or extracted from the wrong section."
            )

    normalized = score / len(example.expected_fields)
    feedback = "\n".join(feedback_parts)
    return normalized, feedback

GEPA Configuration Options

GEPA advanced configuration

optimizer = GEPA(
    metric=metric_with_feedback,

    # Preset: "light" (fast), "medium" (balanced), "heavy" (thorough)
    auto="medium",

    # Reflection model (use a strong LLM here)
    reflection_lm=dspy.LM(
        model="openai/gpt-5.4",
        temperature=1.0,
        max_tokens=32000
    ),

    # How many examples per reflection step (default: 2)
    # Smaller = more focused reflections, larger = broader view
    reflection_minibatch_size=3,

    # Candidate selection strategy
    # "pareto" = sample from Pareto frontier (default)
    candidate_selection_strategy="pareto",

    # Parallelism
    num_threads=16,

    # Track optimization statistics
    track_stats=True,
)

optimized = optimizer.compile(
    program,
    trainset=train_set,
    valset=val_set,  # If not provided, uses trainset for both
)

Choosing a reflection model

The reflection model does the heavy lifting in GEPA: analyzing traces, diagnosing failures, and proposing fixes. Use a strong model here (GPT-5.4, Claude Sonnet 4, or better). The task model (student) can be cheaper. Reflection calls are infrequent relative to task evaluations, so the cost of a strong reflector is amortized over many candidate evaluations.

When to Use GEPA

Use GEPA when:

  • You need the highest possible accuracy on a specific task
  • Your evaluation set has diverse problem types (math, multi-hop QA, mixed formats)
  • You can provide rich diagnostic feedback (ASI) beyond a scalar score
  • You are using a black-box model with no gradient access
  • Sample efficiency matters (limited budget for rollouts)
  • You want instruction-only optimization (no few-shot demos)
  • Your task involves complex reasoning, agentic workflows, or multi-step pipelines

Use a different optimizer when:

  • You need results in under 5 minutes (use BootstrapFewShot)
  • You only need few-shot example selection, not instruction optimization
  • Your task is simple enough that MIPROv2 or COPRO already saturates
  • You lack a meaningful feedback signal beyond pass/fail
  • You have 200+ examples and want joint instruction + demo optimization (MIPROv2)
  • Compute budget is extremely tight (GEPA's reflection calls cost more per iteration)

The Decision Flow

| Scenario | Best Optimizer | Why |
|---|---|---|
| Quick baseline, 10 examples | BootstrapFewShot | Fast, cheap, no instruction optimization needed |
| 200+ examples, balanced quality | MIPROv2 | Bayesian search over instructions + demos |
| Complex reasoning, rich feedback available | GEPA | Reflective evolution with ASI |
| Heterogeneous eval set, diverse problem types | GEPA | Pareto frontier preserves specialists |
| Repeatedly failing on specific hard examples | SIMBA | Targeted introspection on failure cases |
| Simple instruction tuning, no demos | COPRO | Coordinate ascent, interpretable |
| Production deployment on smaller model | BootstrapFinetune | Distills prompt behavior into weights |

GEPA Beyond DSPy: optimize_anything

GEPA is not limited to prompt optimization within DSPy. The optimize_anything API, available via pip install gepa, can optimize any text artifact against any evaluation function. You declare the artifact, the evaluator, and optional background knowledge. GEPA handles the rest.

optimize_anything: optimizing code

from gepa import optimize_anything

# Optimize a Python function for performance
result = optimize_anything(
    artifact="""
def process_data(records):
    results = []
    for record in records:
        if record['status'] == 'active':
            results.append(transform(record))
    return results
""",
    evaluator=benchmark_function,  # returns (score, feedback)
    background="Python 3.12, target: minimize runtime on 100k records",
    mode="single_task",
    reflection_lm="openai/gpt-5.4",
)

print(result.optimized_artifact)  # Optimized code
print(result.score)               # Final score

The three optimization modes cover different use cases:

Single-Task Search

Optimize one artifact for one problem. Use cases: circle packing, blackbox mathematical optimization, optimizing a single function.

Multi-Task Search

Optimize across a batch of related problems with cross-task transfer. Use cases: CUDA kernel generation, multi-aspect SVG optimization.

Generalization

Build a skill that transfers to unseen problems. Use cases: prompt optimization for AIME math, agent architecture evolution for ARC-AGI.

Integration with Coding Agents

GEPA-optimized prompts are only as useful as the infrastructure that executes them. In coding agent workflows, two bottlenecks dominate: finding the right code context and applying generated edits.

A GEPA-optimized coding agent might produce better instructions for code modification, but if the agent spends 60% of its time searching for the right files, the optimized prompt's improvements are diluted. Similarly, if code edits are applied slowly or inaccurately, the optimization loop's iteration speed drops.

WarpGrep: Faster Context Retrieval

Semantic codebase search in sub-6 seconds. A GEPA-optimized agent with WarpGrep spends less time searching and more time reasoning, making each optimization iteration faster.

Morph Fast Apply: 10,500 tok/s

GEPA optimizes the agent's reasoning prompts. Morph Fast Apply handles the code edits those agents produce at 10,500 tok/s with 97.3% accuracy, closing the loop from optimized prompt to applied change.

The optimization feedback loop works like this: GEPA evolves the agent's system prompt. Each candidate prompt is evaluated by running the agent on a set of coding tasks. WarpGrep handles the context retrieval step within each evaluation. Morph Fast Apply handles the code edit application. The evaluator returns ASI describing what the agent did wrong (missed a file, applied the edit to the wrong location, failed to handle an edge case), and GEPA uses this to propose the next prompt variant.
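An evaluator for such a loop might look like the sketch below. The `task` and `result` dict fields are illustrative placeholders, not a Morph or WarpGrep API; the point is returning (score, ASI) so GEPA can diagnose retrieval misses versus edit failures:

```python
def coding_agent_metric(task, result, trace=None):
    """Score one agent rollout and return ASI describing what went wrong.
    All field names here are hypothetical, for illustration only."""
    if result.get("files_touched") != task["expected_files"]:
        return 0.0, (
            f"Edited {result.get('files_touched')} but the fix belongs in "
            f"{task['expected_files']}: the retrieval step missed the right file."
        )
    if not result.get("tests_passed"):
        return 0.3, "Right file was edited, but the test suite still fails."
    return 1.0, "Correct file, tests pass."
```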

Frequently Asked Questions

What is GEPA in DSPy?

GEPA (Genetic-Pareto) is a reflective prompt optimizer in DSPy that evolves instructions using natural language reflection on execution traces. Instead of collapsing feedback into a scalar reward, GEPA reads error messages, reasoning logs, and profiling data to diagnose why a prompt failed and propose targeted fixes. It was accepted as an ICLR 2026 oral and outperforms both MIPROv2 (+13% aggregate) and reinforcement learning methods like GRPO (+20% with 35x fewer rollouts).

How does GEPA differ from MIPROv2?

MIPROv2 uses Bayesian optimization to search over instruction and demonstration combinations, requiring 40+ trials and 200+ examples for best results. GEPA uses reflective prompt evolution with a Pareto frontier, requiring as few as 10 examples and 20-100 evaluations. GEPA focuses on instruction-only optimization without few-shot demonstrations, while MIPROv2 optimizes both. In head-to-head benchmarks, GEPA outperforms MIPROv2 by 13% aggregate across all tasks and models.

What is Actionable Side Information (ASI)?

ASI is diagnostic feedback returned by the evaluator alongside the score. It can include error messages, profiling traces, constraint violations, agent reasoning logs, or rendered images via VLMs. ASI serves as the text-optimization analogue of a gradient: it tells the LLM reflector why a candidate failed and how to fix it. The richer the ASI, the fewer evaluations GEPA needs to converge.

What is the GEPA Pareto frontier?

The Pareto frontier is the set of prompt candidates where each one achieves the highest score on at least one evaluation instance. A prompt survives on the frontier as long as no other single prompt beats it on every test case. This preserves diverse strategies that excel on different problem types, preventing the optimizer from converging to a single local optimum.

Can GEPA be used outside of DSPy?

Yes. GEPA ships as a standalone library (pip install gepa). The optimize_anything API can optimize any text artifact: prompts, code, agent architectures, configurations, and more. It supports three modes: single-task search, multi-task search with cross-task transfer, and generalization to unseen problems.

How many examples does GEPA need?

GEPA is highly sample-efficient. It can produce improvements with as few as 10 training examples and 20-100 total evaluations. RL methods like GRPO typically need 10,000+ rollouts. MIPROv2 recommends 200+ examples with 40+ trials. GEPA achieves this efficiency through reflective feedback and Pareto-aware candidate selection, which eliminates most of the random exploration that other methods require.

What benchmarks has GEPA been tested on?

The ICLR 2026 paper evaluated GEPA across six tasks including AIME 2025 (math), HotpotQA (multi-hop QA), PUPA, and HoVer. On Qwen3 8B, it outperformed GRPO by up to 20% and MIPROv2 by 13% aggregate. On the MATH benchmark, GEPA-optimized programs reached 93% accuracy vs 67% with basic ChainOfThought. On structured extraction tasks, GEPA achieved 97.8% quality scores.

What is the relationship between GEPA and SIMBA?

Both are instruction-level optimizers in DSPy. SIMBA uses stochastic mini-batch sampling to find examples with high output variability, then applies LLM introspection to generate improvement rules. GEPA operates on a population level with Pareto selection and reflective evolution. GEPA achieves higher quality but uses more compute per generation. SIMBA is better suited for targeted improvement on specific failure cases within an otherwise working system.

How does GEPA compare to TextGrad?

Both use LLM feedback, but their search strategies differ. TextGrad iteratively refines a single prompt using textual feedback from a critic LLM, like gradient descent on a single trajectory. GEPA maintains a population on a Pareto frontier and combines complementary strategies through evolutionary merge operations. GEPA's population-based approach avoids local optima that single-candidate methods like TextGrad can converge to, especially on heterogeneous evaluation sets.

What LLM should I use as the GEPA reflection model?

Use a strong model for reflection: GPT-5.4, Claude Sonnet 4, or better. The reflector analyzes execution traces, diagnoses failure modes, and proposes targeted prompt improvements. The task model (the "student") can be cheaper since it handles more frequent calls. Reflection calls are infrequent relative to task evaluations, so the cost of a stronger reflector is amortized.

Related Articles

10,500 tok/s Code Edits for Your Optimized Agent

GEPA optimizes the instructions your agent uses. Morph handles the code edits those agents produce. The morph-v3-fast model applies LM-generated edits at 10,500 tok/s with 97.3% accuracy, so your GEPA-optimized pipeline can ship changes at scale.