GEPA Prompt Optimization: Evolutionary Prompt Evolution for DSPy

GEPA (Genetic-Pareto) uses reflective prompt evolution to outperform MIPROv2 by 13% and GRPO by 20% with 35x fewer rollouts. The definitive guide to DSPy's most sample-efficient optimizer: how it works, when to use it, code examples, and benchmark results from the ICLR 2026 oral paper.

March 14, 2026 · 2 min read

GEPA (Genetic-Pareto) replaces scalar-reward optimization with reflective prompt evolution. It reads execution traces, diagnoses failures in natural language, and maintains a Pareto frontier of diverse prompt candidates. The result: +13% over MIPROv2, +20% over GRPO, with 35x fewer rollouts. Accepted as an ICLR 2026 oral.

  • +13% over MIPROv2 (aggregate across all benchmarks)
  • +20% over GRPO on best task
  • 35x fewer rollouts than RL methods
  • 93% MATH accuracy (vs 67% baseline)

What GEPA Is

GEPA (Genetic-Pareto) is a prompt optimizer from the paper "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning" (Agrawal et al., 2025, arxiv:2507.19457), accepted as an oral presentation at ICLR 2026. It is integrated into DSPy as dspy.GEPA and also ships as a standalone library (pip install gepa).

The core problem GEPA solves: conventional optimizers collapse rich execution data into a single number. They know a prompt scored 0.4 out of 1.0, but they do not know why. Was it a formatting error? A reasoning failure? A missing constraint? Without this diagnostic information, the optimizer has to brute-force its way through the search space, trying random variations until something works.

GEPA takes a different approach. Evaluators return Actionable Side Information (ASI) alongside the score. ASI can be error messages, reasoning traces, profiling data, constraint violations, or any text that helps diagnose the failure. A strong LLM (the "reflector") reads this feedback, identifies what went wrong, and proposes a new prompt that addresses the specific failure mode. Each mutation inherits accumulated lessons from all ancestors in the search tree.

Why 'Genetic-Pareto'?

The name captures both mechanisms. "Genetic" refers to the evolutionary search: a population of prompt candidates is mutated, evaluated, and selected across generations. "Pareto" refers to the selection strategy: rather than keeping only the single best candidate, GEPA maintains the Pareto frontier of candidates that each excel on at least one evaluation instance. This prevents premature convergence and preserves diverse strategies.

Reflective, not random

GEPA reads full execution traces to diagnose failures, not just scores. Each mutation is a targeted fix, not a random perturbation.

Pareto-aware selection

Maintains a frontier of complementary candidates. A prompt survives if it's the best at anything, preserving diverse strategies.

35x more sample-efficient

100-500 evaluations vs 10,000+ for RL. Reflective feedback eliminates the need for brute-force search.

How GEPA Works

GEPA's optimization loop has five stages, repeated for each generation:

1. Candidate Selection from the Pareto Frontier

The optimizer selects a parent candidate from the Pareto frontier. Selection probability is proportional to coverage: candidates that are the sole best on many evaluation instances get sampled more often. This balances exploitation (improving strong candidates) with exploration (giving underrepresented strategies a chance).
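This coverage-weighted selection rule can be sketched in a few lines. The dict layout and function name below are illustrative, not GEPA's internals:

```python
import random

def select_parent(frontier, rng=random.Random(0)):
    """Sample a parent from the Pareto frontier with probability
    proportional to coverage: the number of evaluation instances
    on which a candidate holds the best score.

    frontier: dict mapping candidate id -> list of per-instance scores
    (illustrative data layout).
    """
    n_instances = len(next(iter(frontier.values())))
    coverage = {cid: 0 for cid in frontier}
    for i in range(n_instances):
        # Credit the candidate that is best on instance i
        best = max(frontier, key=lambda cid: frontier[cid][i])
        coverage[best] += 1
    ids = list(coverage)
    weights = [coverage[cid] for cid in ids]
    return rng.choices(ids, weights=weights, k=1)[0]
```

A candidate that is best on many instances is sampled often; one that is best on a single instance still gets occasional turns, which is what keeps underrepresented strategies alive.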

2. Minibatch Evaluation with Trace Capture

The selected candidate runs on a small minibatch from the training set (default size: 2 examples). GEPA captures the full execution trace for each example: the prompt sent, the model's reasoning steps, tool calls and their outputs, the final answer, and why the evaluator scored it the way it did. The trace includes ASI returned by the evaluator.
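The captured record can be thought of as a small structured object. The field names in this sketch are illustrative, not GEPA's internal schema:

```python
from dataclasses import dataclass, field

@dataclass
class ExecutionTrace:
    """One captured rollout (illustrative fields, not GEPA's schema)."""
    prompt: str                 # the prompt sent to the task model
    reasoning: str              # the model's intermediate reasoning steps
    answer: str                 # final answer produced
    score: float                # evaluator's score for this example
    asi: str                    # Actionable Side Information (diagnostic text)
    tool_calls: list = field(default_factory=list)  # (name, output) pairs
```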

3. LLM Reflection

A strong LLM (the reflection_lm) receives the execution traces, both successes and failures, and produces a diagnosis. The reflector identifies failure modes, logic breakdowns, and causal relationships. It sees what worked in successful examples and what broke in failures, building a complete picture of the candidate's strengths and weaknesses.

4. Targeted Mutation

Based on the reflection, GEPA proposes a new prompt that addresses the diagnosed issues. This is not a random mutation. The new prompt inherits all accumulated lessons from its ancestors in the search tree, plus the specific fixes identified by the reflector. GEPA also supports system-aware merge: combining the strengths of two Pareto-optimal candidates that excel on different problem types.

5. Pareto Validation

The new candidate is evaluated on the full validation set. If it achieves the best score on any evaluation instance, it joins the Pareto frontier. If an existing candidate is now dominated (beaten on every instance), it drops off the frontier. This ensures the frontier only grows with genuinely complementary candidates.
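The join-and-prune logic above can be sketched directly. This is an illustrative implementation under the strict-dominance reading of the rule (ties are ignored), not GEPA's actual code:

```python
def update_frontier(frontier, cand_id, scores):
    """Pareto validation step: the new candidate joins only if it beats
    the current best on at least one evaluation instance; any existing
    candidate it now beats on every instance drops off.

    frontier: dict candidate id -> per-instance score list (illustrative).
    """
    current_best = [
        max(s[i] for s in frontier.values()) if frontier else float("-inf")
        for i in range(len(scores))
    ]
    # Not best anywhere: discard the new candidate, frontier unchanged
    if not any(new > best for new, best in zip(scores, current_best)):
        return frontier
    # Drop candidates now beaten on every instance (dominated)
    survivors = {
        cid: s for cid, s in frontier.items()
        if not all(new > old for new, old in zip(scores, s))
    }
    survivors[cand_id] = scores
    return survivors
```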

Actionable Side Information (ASI)

ASI is GEPA's core differentiator. It is the text-optimization analogue of a gradient. While numerical gradients tell you which direction to move in weight space, ASI tells you in natural language what went wrong and how to fix it. Good ASI can be error messages, profiling traces, rendered images via VLMs, constraint violations, or agent reasoning logs. The richer the ASI, the fewer evaluations GEPA needs.

GEPA vs Other DSPy Optimizers

DSPy ships with eight optimizers covering different tradeoffs. GEPA occupies the high-quality, instruction-only end of the spectrum. Here is how it compares to the other major options:

| Dimension | BootstrapFewShot | MIPROv2 | COPRO | GEPA |
|---|---|---|---|---|
| Optimization target | Few-shot demonstrations | Instructions + demonstrations | Instructions only | Instructions only |
| Search strategy | Teacher-generated demos, metric filtering | Bayesian optimization | Coordinate ascent (hill-climbing) | Pareto-aware evolutionary search |
| Feedback signal | Binary pass/fail on metric | Score from metric | Score from metric | Score + ASI (execution traces, error messages) |
| Sample efficiency | Good (10+ examples) | Good (40+ trials, 200+ examples) | Moderate (20-50 evals) | Best (20-100 evals, 10+ examples) |
| Diversity preservation | No (single best) | Moderate | No (single best) | Yes (Pareto frontier) |
| Quality (relative) | Baseline | +5.6% over baseline | Moderate | +13% over baseline |
| Compute per iteration | Low | Medium | Low | Higher (reflection LM calls) |
| Best for | Quick start, small data | Large datasets, balanced | Simple instruction tuning | Maximum quality, complex reasoning |

BootstrapFewShot: The Starting Point

BootstrapFewShot generates demonstrations using a teacher model and filters them by your metric. It is fast, cheap, and requires minimal data. But it only optimizes few-shot examples, not the instructions themselves. For tasks where the instruction quality matters more than the examples (complex reasoning, math, agentic workflows), BootstrapFewShot hits a ceiling quickly.

MIPROv2: The Previous Best

MIPROv2 uses Bayesian optimization to search over both instructions and demonstrations. It is data-aware and demo-aware, producing strong results with enough data (200+ examples) and compute (40+ trials). GEPA outperforms MIPROv2 by 13% in aggregate and by 12% on AIME 2025. The key difference: MIPROv2 treats the metric score as the only feedback signal. GEPA reads the full execution trace.

COPRO: Coordinate Ascent

COPRO generates and refines instructions via hill-climbing. It is simple and interpretable but lacks diversity preservation. Once COPRO finds a local optimum, it has no mechanism to escape. GEPA's Pareto frontier maintains multiple candidates simultaneously, each representing a different local optimum. The merge operation can combine their strengths.

SIMBA: Failure-Focused

SIMBA identifies examples with high output variability (where the model sometimes succeeds and sometimes fails) and uses LLM introspection to generate improvement rules. GEPA and SIMBA share the idea of LLM-driven diagnosis, but GEPA operates on a population level with Pareto selection, while SIMBA focuses on individual hard examples.

GEPA vs TextGrad

TextGrad and GEPA both use LLM feedback to optimize prompts, but their architectures are different enough that they excel in different scenarios.

| Dimension | GEPA | TextGrad |
|---|---|---|
| Search strategy | Population-based Pareto evolution | Single-candidate gradient descent |
| Feedback source | Full execution traces + ASI | Critic LLM feedback |
| Candidate diversity | Maintains Pareto frontier of diverse candidates | Single candidate, sequential refinement |
| Local optima risk | Low (population + merge operations) | Higher (single trajectory) |
| Compute profile | More LLM calls per generation, fewer generations needed | Fewer calls per step, more steps needed |
| Works without gradients | Yes (fully gradient-free) | Yes (uses textual "gradients", not numerical) |
| Multi-task transfer | Yes (Pareto frontier captures per-task specialization) | Limited (single candidate optimized for aggregate) |
| Best for | Complex tasks, diverse evaluation instances | Simple tasks, consistent evaluation criteria |

The fundamental difference: TextGrad refines a single prompt iteratively, like gradient descent on a single trajectory. GEPA maintains a population of diverse candidates and combines their strengths, like evolutionary search with Pareto selection. For tasks where different evaluation instances require different strategies (multi-hop QA, math with varied problem types, agentic workflows with branching logic), GEPA's population approach avoids the local optima that single-candidate methods converge to.

TextGrad is simpler to set up and can work well on tasks with uniform difficulty. GEPA is the better choice when your evaluation set contains heterogeneous problem types that benefit from specialized strategies.

Benchmark Results

The ICLR 2026 paper evaluated GEPA across six tasks against GRPO (a top RL method) and MIPROv2. All results are on Qwen3 8B unless noted otherwise.

  • +12% over MIPROv2 on AIME 2025
  • 93% MATH accuracy (vs 67% baseline ChainOfThought)
  • +6% average gain over GRPO across 6 tasks
  • 97.8% structured extraction (2.93/3 score)
| Comparison | GEPA Advantage | Sample Efficiency |
|---|---|---|
| vs GRPO (RL) | Up to +20% accuracy | 35x fewer rollouts (100-500 vs 24,000) |
| vs MIPROv2 | +13% aggregate, +12% on AIME 2025 | Comparable or fewer evaluations |
| vs BootstrapFewShot | +26 points on MATH (67% to 93%) | More compute per iteration |
| vs COPRO | Higher quality, better diversity | Similar eval count, richer feedback |
| vs baseline prompting | +42.5 points (37.5% to 80%) | 100-500 evaluations total |

MATH Benchmark

GEPA achieves 93% accuracy on the MATH benchmark using a DSPy ChainOfThought program, compared to 67% with the unoptimized ChainOfThought baseline. The 26-point improvement comes from instruction refinement alone, with no few-shot examples, no architectural changes, and no model fine-tuning.

AIME 2025

On AIME 2025 (competition-level math), GEPA with GPT-4.1 Mini achieves 10% gains over the unoptimized baseline. Against MIPROv2 on the same task, GEPA achieves +12% accuracy. Competition math problems are heterogeneous by design, which means different problems reward different reasoning strategies. GEPA's Pareto frontier preserves specialized strategies for geometry, algebra, and number theory rather than averaging them into a single "best" prompt.

Enterprise Structured Extraction

On a facility support analyzer task (extracting structured information from enterprise documents), GEPA achieves a metric of 2.93/3 (97.8%). On financial entity extraction, GEPA delivers +14 points over the DSPy baseline and +22 points over the raw OpenAI baseline.

Why GEPA excels on heterogeneous tasks

Most real-world evaluation sets contain a mix of problem types. A math benchmark includes algebra, geometry, and combinatorics. A QA benchmark includes factoid, multi-hop, and comparative questions. Methods that maintain a single best prompt compromise on all types. GEPA's Pareto frontier keeps a specialist for each type, and the best specialist is selected at inference time. This is the source of GEPA's consistent advantage over single-candidate optimizers.

Code Examples

Basic GEPA Usage in DSPy

The minimal setup requires a DSPy program, a metric function that returns ASI feedback, and a reflection LLM:

GEPA with DSPy (basic)

import dspy
from dspy import GEPA

# Configure your task LM
lm = dspy.LM("openai/gpt-5.4-mini")
dspy.configure(lm=lm)

# Define your program
class MathSolver(dspy.Module):
    def __init__(self):
        super().__init__()
        self.solve = dspy.ChainOfThought("question: str -> answer: str")

    def forward(self, question: str):
        return self.solve(question=question)

# Define a metric with feedback (ASI)
def math_metric(example, pred, trace=None):
    correct = example.answer.strip() == pred.answer.strip()
    score = 1.0 if correct else 0.0
    # Return the score plus feedback as ASI for GEPA to reflect on
    if not correct:
        feedback = f"Expected '{example.answer}' but got '{pred.answer}'. "
        feedback += f"The reasoning was: {getattr(pred, 'reasoning', 'N/A')}"
        return score, feedback
    return score, "Correct answer."

# Initialize GEPA
optimizer = GEPA(
    metric=math_metric,
    auto="light",              # "light", "medium", or "heavy"
    num_threads=8,
    reflection_lm=dspy.LM(
        model="openai/gpt-5.4",  # Strong model for reflection
        temperature=1.0,
        max_tokens=16000
    )
)

# Compile
optimized = optimizer.compile(
    MathSolver(),
    trainset=train_examples,
    valset=val_examples,
)

# Save the optimized program
optimized.save("math_solver_gepa.json")

GEPA with Rich ASI Feedback

The quality of GEPA's optimization scales with the quality of ASI feedback. Here is an example with detailed diagnostic information:

GEPA with rich ASI

def extraction_metric(example, pred, trace=None):
    """Metric that returns rich ASI for GEPA reflection."""
    score = 0.0
    feedback_parts = []

    # Check each expected field
    for field in example.expected_fields:
        predicted = getattr(pred, field, None)
        expected = getattr(example, field)

        if predicted == expected:
            score += 1.0
            feedback_parts.append(f"[CORRECT] {field}: '{predicted}'")
        elif predicted is None:
            score += 0.0
            feedback_parts.append(
                f"[MISSING] {field}: expected '{expected}', got None. "
                f"The model did not extract this field at all."
            )
        else:
            score += 0.0
            feedback_parts.append(
                f"[WRONG] {field}: expected '{expected}', got '{predicted}'. "
                f"Possible cause: the model may have confused this with "
                f"a similar field or extracted from the wrong section."
            )

    normalized = score / len(example.expected_fields)
    feedback = "\n".join(feedback_parts)
    return normalized, feedback

GEPA Configuration Options

GEPA advanced configuration

optimizer = GEPA(
    metric=metric_with_feedback,

    # Preset: "light" (fast), "medium" (balanced), "heavy" (thorough)
    auto="medium",

    # Reflection model (use a strong LLM here)
    reflection_lm=dspy.LM(
        model="openai/gpt-5.4",
        temperature=1.0,
        max_tokens=32000
    ),

    # How many examples per reflection step (default: 2)
    # Smaller = more focused reflections, larger = broader view
    reflection_minibatch_size=3,

    # Candidate selection strategy
    # "pareto" = sample from Pareto frontier (default)
    candidate_selection_strategy="pareto",

    # Parallelism
    num_threads=16,

    # Track optimization statistics
    track_stats=True,
)

optimized = optimizer.compile(
    program,
    trainset=train_set,
    valset=val_set,  # If not provided, uses trainset for both
)

Choosing a reflection model

The reflection model does the heavy lifting in GEPA: analyzing traces, diagnosing failures, and proposing fixes. Use a strong model here (GPT-5.4, Claude Sonnet 4, or better). The task model (student) can be cheaper. Reflection calls are infrequent relative to task evaluations, so the cost of a strong reflector is amortized over many candidate evaluations.

When to Use GEPA

Use GEPA when:

  • You need the highest possible accuracy on a specific task
  • Your evaluation set has diverse problem types (math, multi-hop QA, mixed formats)
  • You can provide rich diagnostic feedback (ASI) beyond a scalar score
  • You are using a black-box model with no gradient access
  • Sample efficiency matters (limited budget for rollouts)
  • You want instruction-only optimization (no few-shot demos)
  • Your task involves complex reasoning, agentic workflows, or multi-step pipelines

Use a different optimizer when:

  • You need results in under 5 minutes (use BootstrapFewShot)
  • You only need few-shot example selection, not instruction optimization
  • Your task is simple enough that MIPROv2 or COPRO already saturates
  • You lack a meaningful feedback signal beyond pass/fail
  • You have 200+ examples and want joint instruction + demo optimization (MIPROv2)
  • Compute budget is extremely tight (GEPA's reflection calls cost more per iteration)

The Decision Flow

| Scenario | Best Optimizer | Why |
|---|---|---|
| Quick baseline, 10 examples | BootstrapFewShot | Fast, cheap, no instruction optimization needed |
| 200+ examples, balanced quality | MIPROv2 | Bayesian search over instructions + demos |
| Complex reasoning, rich feedback available | GEPA | Reflective evolution with ASI |
| Heterogeneous eval set, diverse problem types | GEPA | Pareto frontier preserves specialists |
| Repeatedly failing on specific hard examples | SIMBA | Targeted introspection on failure cases |
| Simple instruction tuning, no demos | COPRO | Coordinate ascent, interpretable |
| Production deployment on smaller model | BootstrapFinetune | Distills prompt behavior into weights |

GEPA Beyond DSPy: optimize_anything

GEPA is not limited to prompt optimization within DSPy. The optimize_anything API, available via pip install gepa, can optimize any text artifact against any evaluation function. You declare the artifact, the evaluator, and optional background knowledge. GEPA handles the rest.

optimize_anything: optimizing code

from gepa import optimize_anything

# Optimize a Python function for performance
result = optimize_anything(
    artifact="""
def process_data(records):
    results = []
    for record in records:
        if record['status'] == 'active':
            results.append(transform(record))
    return results
""",
    evaluator=benchmark_function,  # returns (score, feedback)
    background="Python 3.12, target: minimize runtime on 100k records",
    mode="single_task",
    reflection_lm="openai/gpt-5.4",
)

print(result.optimized_artifact)  # Optimized code
print(result.score)               # Final score

The three optimization modes cover different use cases:

Single-Task Search

Optimize one artifact for one problem. Use cases: circle packing, blackbox mathematical optimization, optimizing a single function.

Multi-Task Search

Optimize across a batch of related problems with cross-task transfer. Use cases: CUDA kernel generation, multi-aspect SVG optimization.

Generalization

Build a skill that transfers to unseen problems. Use cases: prompt optimization for AIME math, agent architecture evolution for ARC-AGI.

Integration with Coding Agents

GEPA-optimized prompts are only as useful as the infrastructure that executes them. In coding agent workflows, two bottlenecks dominate: finding the right code context and applying generated edits.

A GEPA-optimized coding agent might produce better instructions for code modification, but if the agent spends 60% of its time searching for the right files, the optimized prompt's improvements are diluted. Similarly, if code edits are applied slowly or inaccurately, the optimization loop's iteration speed drops.

WarpGrep: Faster Context Retrieval

Semantic codebase search in sub-6 seconds. A GEPA-optimized agent with WarpGrep spends less time searching and more time reasoning, making each optimization iteration faster.

Morph Fast Apply: 10,500 tok/s

GEPA optimizes the agent's reasoning prompts. Morph Fast Apply handles the code edits those agents produce at 10,500 tok/s with 97.3% accuracy, closing the loop from optimized prompt to applied change.

The optimization feedback loop works like this: GEPA evolves the agent's system prompt. Each candidate prompt is evaluated by running the agent on a set of coding tasks. WarpGrep handles the context retrieval step within each evaluation. Morph Fast Apply handles the code edit application. The evaluator returns ASI describing what the agent did wrong (missed a file, applied the edit to the wrong location, failed to handle an edge case), and GEPA uses this to propose the next prompt variant.
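An evaluator for such a loop might look like the sketch below. The `task` and `result` dict fields are illustrative placeholders, not a Morph or WarpGrep API; the point is returning (score, ASI) so GEPA can diagnose retrieval misses versus edit failures:

```python
def coding_agent_metric(task, result, trace=None):
    """Score one agent rollout and return ASI describing what went wrong.
    All field names here are hypothetical, for illustration only."""
    if result.get("files_touched") != task["expected_files"]:
        return 0.0, (
            f"Edited {result.get('files_touched')} but the fix belongs in "
            f"{task['expected_files']}: the retrieval step missed the right file."
        )
    if not result.get("tests_passed"):
        return 0.3, "Right file was edited, but the test suite still fails."
    return 1.0, "Correct file, tests pass."
```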

Frequently Asked Questions

What is GEPA in DSPy?

GEPA (Genetic-Pareto) is a reflective prompt optimizer in DSPy that evolves instructions using natural language reflection on execution traces. Instead of collapsing feedback into a scalar reward, GEPA reads error messages, reasoning logs, and profiling data to diagnose why a prompt failed and propose targeted fixes. It was accepted as an ICLR 2026 oral and outperforms both MIPROv2 (+13% aggregate) and reinforcement learning methods like GRPO (+20% with 35x fewer rollouts).

How does GEPA differ from MIPROv2?

MIPROv2 uses Bayesian optimization to search over instruction and demonstration combinations, requiring 40+ trials and 200+ examples for best results. GEPA uses reflective prompt evolution with a Pareto frontier, requiring as few as 10 examples and 20-100 evaluations. GEPA focuses on instruction-only optimization without few-shot demonstrations, while MIPROv2 optimizes both. In head-to-head benchmarks, GEPA outperforms MIPROv2 by 13% aggregate across all tasks and models.

What is Actionable Side Information (ASI)?

ASI is diagnostic feedback returned by the evaluator alongside the score. It can include error messages, profiling traces, constraint violations, agent reasoning logs, or rendered images via VLMs. ASI serves as the text-optimization analogue of a gradient: it tells the LLM reflector why a candidate failed and how to fix it. The richer the ASI, the fewer evaluations GEPA needs to converge.

What is the GEPA Pareto frontier?

The Pareto frontier is the set of prompt candidates where each one achieves the highest score on at least one evaluation instance. A prompt survives on the frontier as long as no other single prompt beats it on every test case. This preserves diverse strategies that excel on different problem types, preventing the optimizer from converging to a single local optimum.

Can GEPA be used outside of DSPy?

Yes. GEPA ships as a standalone library (pip install gepa). The optimize_anything API can optimize any text artifact: prompts, code, agent architectures, configurations, and more. It supports three modes: single-task search, multi-task search with cross-task transfer, and generalization to unseen problems.

How many examples does GEPA need?

GEPA is highly sample-efficient. It can produce improvements with as few as 10 training examples and 20-100 total evaluations. RL methods like GRPO typically need 10,000+ rollouts. MIPROv2 recommends 200+ examples with 40+ trials. GEPA achieves this efficiency through reflective feedback and Pareto-aware candidate selection, which eliminates most of the random exploration that other methods require.

What benchmarks has GEPA been tested on?

The ICLR 2026 paper evaluated GEPA across six tasks including AIME 2025 (math), HotpotQA (multi-hop QA), PUPA, and HoVer. On Qwen3 8B, it outperformed GRPO by up to 20% and MIPROv2 by 13% aggregate. On the MATH benchmark, GEPA-optimized programs reached 93% accuracy vs 67% with basic ChainOfThought. On structured extraction tasks, GEPA achieved 97.8% quality scores.

What is the relationship between GEPA and SIMBA?

Both are instruction-level optimizers in DSPy. SIMBA uses stochastic mini-batch sampling to find examples with high output variability, then applies LLM introspection to generate improvement rules. GEPA operates on a population level with Pareto selection and reflective evolution. GEPA achieves higher quality but uses more compute per generation. SIMBA is better suited for targeted improvement on specific failure cases within an otherwise working system.

How does GEPA compare to TextGrad?

Both use LLM feedback, but their search strategies differ. TextGrad iteratively refines a single prompt using textual feedback from a critic LLM, like gradient descent on a single trajectory. GEPA maintains a population on a Pareto frontier and combines complementary strategies through evolutionary merge operations. GEPA's population-based approach avoids local optima that single-candidate methods like TextGrad can converge to, especially on heterogeneous evaluation sets.

What LLM should I use as the GEPA reflection model?

Use a strong model for reflection: GPT-5.4, Claude Sonnet 4, or better. The reflector analyzes execution traces, diagnoses failure modes, and proposes targeted prompt improvements. The task model (the "student") can be cheaper since it handles more frequent calls. Reflection calls are infrequent relative to task evaluations, so the cost of a stronger reflector is amortized.

Related Articles

10,500 tok/s Code Edits for Your Optimized Agent

GEPA optimizes the instructions your agent uses. Morph handles the code edits those agents produce. The morph-v3-fast model applies LM-generated edits at 10,500 tok/s with 97.3% accuracy, so your GEPA-optimized pipeline can ship changes at scale.