DSPy replaces hand-written prompts with compiled ones. You declare what your LM module should do, define a metric, provide training examples, and an optimizer finds the best instructions and demonstrations automatically. The original DSPy paper showed 25%+ improvement over standard few-shot prompting on GPT-3.5 and 65% on Llama 2 13B.
What DSPy Is
DSPy (Declarative Self-improving Python) is a framework from Stanford NLP published at ICLR 2024. It treats LM pipelines the way PyTorch treats neural networks: as parameterized systems compiled against a loss function. The "parameters" are prompts and few-shot demonstrations. The "loss function" is your metric. The "compiler" is an optimizer.
The problem DSPy solves is that prompts are brittle. A carefully crafted few-shot prompt for GPT-4 often degrades when you switch to Claude. Adding a new reasoning step requires rewriting every downstream prompt. Evaluating which of 20 candidate prompts actually works requires manual comparison. DSPy automates all of this.
The core shift
Traditional prompt engineering: you write what to say to the model. DSPy: you write what the model should do (Signature) and how to measure success (metric), and the optimizer figures out what to say. The prompts become an implementation detail.
DSPy programs are modular Python classes. Each module wraps an LM call with a specific prompting strategy (basic prediction, chain-of-thought, tool use). Modules compose into larger pipelines. Once you have a working pipeline, you run an optimizer to compile it into a high-quality prompt. The compiled program can be saved to JSON and loaded back without rerunning optimization.
Portable across models
Switch from GPT-4 to Claude to Llama without rewriting prompts. Recompile against the new model and the optimizer finds the right instructions for it.
Data-driven
Optimization uses your actual training examples and metric function. The resulting prompt is tailored to your task, not a generic template.
Composable
Build complex pipelines from simple modules. Each module is independently optimizable. The full pipeline trains end-to-end.
Signatures: Declaring LM Behavior
A Signature is a declarative specification of what a language model module takes as input and produces as output. It replaces the role of a hand-written prompt: instead of telling the model how to respond, you tell DSPy what the task is.
Field names carry semantic meaning. Naming a field question versus sql_query changes how the optimizer generates instructions. The names are not just documentation; they guide the optimization process.
Inline signatures (simple tasks)
import dspy
# Basic question answering
qa = dspy.Predict("question -> answer")
# With type annotations
sentiment = dspy.Predict("sentence -> sentiment: bool")
# Multi-field with types
rag = dspy.Predict("context: list[str], question: str -> answer: str")
# Run the module
result = qa(question="What year was Python created?")
print(result.answer)  # "1991"

Class-based signatures (complex tasks)
import dspy
class GenerateSearchQuery(dspy.Signature):
    """Generate a search query to help answer a factual question."""
    context: list[str] = dspy.InputField(desc="May contain relevant facts")
    question: str = dspy.InputField()
    query: str = dspy.OutputField(desc="A targeted search query, 3-8 words")

class ClassifyDocument(dspy.Signature):
    """Classify a document into one of the predefined categories."""
    document: str = dspy.InputField(desc="The full text to classify")
    category: str = dspy.OutputField(
        desc="One of: finance, legal, medical, technical, other"
    )
    confidence: float = dspy.OutputField(
        desc="Confidence score between 0 and 1"
    )

# Use in a module
generate_query = dspy.Predict(GenerateSearchQuery)
result = generate_query(
    context=["Python was created by Guido van Rossum"],
    question="Who created Python?"
)
print(result.query)  # "Python creator Guido van Rossum"

Class-based signatures are what the optimizer uses to generate instructions. The docstring becomes the basis for instruction proposals. Field descriptions constrain what the model should produce. During optimization, the optimizer varies these instructions systematically to find what works best on your training data.
Modules: Prompting Strategies
Modules implement specific prompting strategies on top of Signatures. Each module is a learnable unit: it stores the optimized instructions and demonstrations produced by the optimizer.
Predict
The base module. Calls the LM with the signature, no additional strategy. All other modules build on Predict. Use it when you want direct input-output behavior without explicit reasoning steps.
ChainOfThought
Injects a reasoning field before the output fields. The model thinks step-by-step before committing to an answer. This consistently improves accuracy on tasks requiring multi-step reasoning, math, or logical deduction.
ChainOfThought vs Predict
import dspy
lm = dspy.LM("openai/gpt-5.4-mini")
dspy.configure(lm=lm)
# Without reasoning
direct = dspy.Predict("question -> answer: int")
# With step-by-step reasoning
reasoned = dspy.ChainOfThought("question -> answer: int")
question = "A train travels 120 miles at 60 mph. How many hours does it take?"
direct_result = direct(question=question)
print(direct_result.answer) # might be wrong
reasoned_result = reasoned(question=question)
print(reasoned_result.reasoning) # "Distance = 120 miles, Speed = 60 mph. Time = 120/60 = 2 hours."
print(reasoned_result.answer)  # 2

ReAct
An agent that interleaves reasoning and tool calls. It implements the Reason-Act loop: reason about what to do, call a tool, observe the result, reason again. Optimization raised ReAct agent accuracy from 24% to 51% on HotPotQA question answering.
ReAct agent with tools
import dspy
def search_wikipedia(query: str) -> str:
    """Search Wikipedia and return a summary."""
    # Your search implementation
    return f"Results for: {query}"

def calculate(expression: str) -> float:
    """Evaluate a mathematical expression."""
    # Note: eval is unsafe on untrusted input; use a real expression parser in production
    return eval(expression)
# Build a ReAct agent with two tools
agent = dspy.ReAct(
    "question -> answer",
    tools=[search_wikipedia, calculate]
)

result = agent(
    question="What is the population of Tokyo divided by 1 million?"
)
print(result.answer)

Custom Modules
Any Python class that inherits from dspy.Module and implements forward() is a module. This lets you compose multiple LM calls, add control flow, integrate retrieval, or build multi-step pipelines. The optimizer can compile any custom module as long as it uses DSPy primitives internally.
Custom RAG module
import dspy
class RAG(dspy.Module):
    def __init__(self, num_passages: int = 3):
        super().__init__()
        self.num_passages = num_passages
        self.generate_query = dspy.ChainOfThought("question -> search_query")
        self.generate_answer = dspy.ChainOfThought(
            "context: list[str], question: str -> answer: str"
        )

    def forward(self, question: str) -> dspy.Prediction:
        # Step 1: Generate a search query
        query_result = self.generate_query(question=question)
        # Step 2: Retrieve relevant passages (retrieve is your own retriever function)
        passages = retrieve(query_result.search_query, k=self.num_passages)
        # Step 3: Generate answer from context
        return self.generate_answer(
            context=passages,
            question=question
        )

# This entire pipeline is optimizable as a unit
rag = RAG(num_passages=3)
result = rag(question="What is the speed of light?")
print(result.answer)

How Optimization Works
DSPy optimization requires three inputs: a program, a metric function, and training data. The optimizer searches for the combination of instructions and few-shot demonstrations that maximizes your metric on the training set.
The optimization loop
import dspy
from dspy.evaluate import Evaluate
# 1. Configure your LM
lm = dspy.LM("openai/gpt-5.4-mini")
dspy.configure(lm=lm)
# 2. Define your program
rag = RAG()
# 3. Define your metric
def answer_correctness(example, pred, trace=None):
    """Returns 1 if the predicted answer matches the expected answer."""
    return float(example.answer.lower() in pred.answer.lower())

# 4. Prepare training data (DSPy recommends 20% train, 80% dev to prevent overfitting)
trainset = [
    dspy.Example(question="Who wrote Hamlet?", answer="Shakespeare").with_inputs("question"),
    dspy.Example(question="When was Python created?", answer="1991").with_inputs("question"),
    # ... more examples
]

# 5. Run optimization
from dspy.teleprompt import BootstrapFewShot
optimizer = BootstrapFewShot(metric=answer_correctness, max_bootstrapped_demos=4)
optimized_rag = optimizer.compile(rag, trainset=trainset)

# 6. Save for reuse: no need to recompile every run
optimized_rag.save("optimized_rag.json")

# 7. Evaluate on a held-out set (devset is built the same way as trainset)
evaluate = Evaluate(devset=devset, metric=answer_correctness, num_threads=4)
score = evaluate(optimized_rag)
print(f"Accuracy: {score:.1%}")

The metric function has access to the expected output (example), the predicted output (pred), and an optional trace (trace). When trace is not None, the optimizer is in compilation mode and can inspect intermediate LM calls. This allows metrics to enforce constraints on reasoning steps, not just final outputs.
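As an illustrative sketch, a metric can branch on trace to enforce extra constraints during compilation. Plain SimpleNamespace objects stand in here for real DSPy examples and predictions:

```python
from types import SimpleNamespace

def strict_correctness(example, pred, trace=None):
    """Containment check; during compilation (trace is not None),
    additionally require a non-trivial reasoning string."""
    correct = example.answer.lower() in pred.answer.lower()
    if trace is not None:
        # Compilation mode: also constrain the intermediate reasoning step
        has_reasoning = len(getattr(pred, "reasoning", "")) > 20
        return correct and has_reasoning
    return float(correct)

# Stand-in objects to show the two modes
example = SimpleNamespace(answer="1991")
pred = SimpleNamespace(
    answer="Python was created in 1991.",
    reasoning="Guido van Rossum released Python 0.9.0 in February 1991.",
)

print(strict_correctness(example, pred))            # 1.0 (evaluation mode)
print(strict_correctness(example, pred, trace=[]))  # True (compilation mode)
```

Returning a bool during compilation and a float during evaluation is a common pattern: the optimizer uses the bool as a hard filter on bootstrapped demonstrations.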
Data split recommendation
DSPy recommends an inverted split from standard ML: use 20% for training and 80% for validation. This is intentional. Prompt optimizers can overfit on training examples, so you want a large validation set to catch this. With small datasets, use 10-20 examples for training and the rest for evaluation.
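A minimal sketch of that inverted split, assuming your examples live in a plain Python list (inverted_split is a hypothetical helper written for this article, not a DSPy API):

```python
import random

def inverted_split(examples, train_frac=0.2, seed=0):
    """Shuffle, then split: a small train set for the optimizer,
    a large dev set to catch overfitting."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = max(1, int(len(shuffled) * train_frac))
    return shuffled[:cut], shuffled[cut:]

data = list(range(50))  # stand-ins for dspy.Example objects
trainset, devset = inverted_split(data)
print(len(trainset), len(devset))  # 10 40
```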
Compiled programs store their optimized parameters in JSON. Load them back with .load() and the program runs with the optimized prompts without any additional LM calls for optimization. The compilation cost is paid once; inference runs at normal speed.
Optimizer Types
DSPy ships with a family of optimizers covering different tradeoffs between cost, dataset size, and optimization depth. Choosing the right one for your scenario matters more than fine-tuning any single optimizer's parameters.
Few-Shot Optimizers
| Optimizer | How It Works | Best For | Data Needed |
|---|---|---|---|
| LabeledFewShot | Selects k examples directly from labeled data | Baseline, no LM calls for optimization | 5+ examples |
| BootstrapFewShot | Generates demonstrations via a teacher LM, filters by metric | Small datasets, quick results | ~10 examples |
| BootstrapFewShotWithRandomSearch | Runs BootstrapFewShot multiple times, picks best via eval | Better coverage, moderate cost | 50+ examples |
| KNNFewShot | K-nearest neighbors selects training examples per query | Tasks where example relevance varies | 100+ examples |
BootstrapFewShot (most common starting point)
from dspy.teleprompt import BootstrapFewShot
optimizer = BootstrapFewShot(
    metric=your_metric,
    max_labeled_demos=4,        # demos drawn directly from trainset
    max_bootstrapped_demos=4,   # demos generated by teacher LM
    teacher_settings={"lm": dspy.LM("openai/gpt-5.4")}  # use stronger model as teacher
)
optimized_program = optimizer.compile(program, trainset=trainset)

Instruction Optimizers
| Optimizer | Strategy | Strengths | Trials Needed |
|---|---|---|---|
| COPRO | Coordinate ascent (hill-climbing) on instructions | Simple, interpretable | depth × breadth |
| MIPROv2 | Bayesian optimization over instructions + demos | Best quality, data-aware | 40+ recommended |
| SIMBA | Stochastic mini-batch, LLM introspection on failures | Hard examples, failure analysis | 20+ |
| GEPA | LM reflection on execution trajectories | Iterative refinement, domain feedback | 10+ |
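The coordinate-ascent strategy behind COPRO can be sketched in plain Python. Here propose and evaluate are toy stand-ins for the LM-driven instruction proposer and your metric over the trainset:

```python
def hill_climb_instruction(seed_instruction, propose, evaluate, depth=3, breadth=4):
    """At each depth, propose `breadth` variants of the current best
    instruction and climb if any of them scores higher."""
    best, best_score = seed_instruction, evaluate(seed_instruction)
    for _ in range(depth):
        candidates = [propose(best) for _ in range(breadth)]
        for cand in candidates:
            score = evaluate(cand)
            if score > best_score:
                best, best_score = cand, score
    return best, best_score

# Mock stand-ins: a real run would propose variants with an LM and
# evaluate each candidate instruction with your metric on training data.
def propose(instr):
    return instr + "!"

def evaluate(instr):
    return min(len(instr), 30)  # pretend longer instructions score better, up to a cap

best, score = hill_climb_instruction("Answer the question.", propose, evaluate)
print(best, score)
```

The depth × breadth cost in the table falls straight out of this loop: every level of depth evaluates breadth fresh candidates.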
MIPROv2 (highest quality for instruction optimization)
from dspy.teleprompt import MIPROv2
# MIPROv2 is data-aware and demo-aware
# It uses Bayesian optimization to efficiently search the instruction space
optimizer = MIPROv2(
    metric=your_metric,
    auto="medium",    # "light" (10 trials), "medium" (25), "heavy" (50+)
    num_threads=8,    # parallelize candidate evaluation
)

# Zero-shot mode: optimize instructions only, no few-shot examples
optimized_0shot = optimizer.compile(
    program,
    trainset=trainset,
    num_trials=25,
    requires_permission_to_run=False
)

# Few-shot mode: optimize both instructions and demonstrations
optimized_fewshot = optimizer.compile(
    program,
    trainset=trainset,
    num_trials=40,
    max_labeled_demos=5,
    max_bootstrapped_demos=4,
    requires_permission_to_run=False
)

Fine-Tuning and Meta-Optimizers
Once you have a working prompt-based program, two additional optimizers can push performance further.
BootstrapFinetune
Converts a prompt-based program into model weight updates. Builds a training dataset from successful executions, then runs fine-tuning. The resulting model is smaller and faster than prompting a larger model.
BetterTogether
Combines prompt optimization and fine-tuning in sequence. Each pass builds on the prior. Often outperforms either approach alone, particularly on specialized domains where both instruction quality and model weights matter.
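The core move in BootstrapFinetune, harvesting metric-passing executions into a fine-tuning dataset, can be sketched generically. The program and metric below are mocks for illustration; the real optimizer operates on DSPy execution traces:

```python
def harvest_finetune_data(program, metric, examples):
    """Run the program on each example and keep only the
    (input, output) pairs that the metric accepts."""
    dataset = []
    for ex in examples:
        pred = program(ex["input"])
        if metric(ex, pred):
            dataset.append({"input": ex["input"], "output": pred})
    return dataset

# Mock stand-ins for illustration
def mock_program(text):
    return text.upper()

def exact_metric(ex, pred):
    return pred == ex["expected"]

examples = [
    {"input": "good", "expected": "GOOD"},
    {"input": "bad", "expected": "oops"},  # fails the metric, filtered out
]
print(harvest_finetune_data(mock_program, exact_metric, examples))
# [{'input': 'good', 'output': 'GOOD'}]
```

Only successful executions make it into the dataset, so the fine-tuned model is trained on behavior your metric has already endorsed.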
Optimizer selection guide
Start with BootstrapFewShot for any new task. If you have 50+ examples and want better instructions, move to MIPROv2 with auto="medium". If you have a production system with a fixed model and want maximum performance, run BetterTogether. If you need to deploy on a smaller model, use BootstrapFinetune to distill the prompt-based behavior into weights.
Benchmarks: DSPy vs Manual Prompting
The ICLR 2024 DSPy paper tested compiled programs against two baselines, standard few-shot prompting and expert-written prompt chains, both evaluated on complex reasoning and retrieval tasks.
| Model | vs Standard Few-Shot | vs Expert Prompts |
|---|---|---|
| GPT-3.5 | +25% or more | +5% to +46% |
| Llama 2 13B | +65% | +16% to +40% |
| T5 770M | Competitive with GPT-3.5 expert prompts | Outperforms hand-crafted chains |
The T5 result is the most striking. A 770M-parameter open model compiled with DSPy matched GPT-3.5 performance on expert-written prompt chains. This illustrates the core DSPy argument: with the right optimized prompts and demonstrations, smaller models can reach parity with larger models on specific tasks.
The ReAct improvement from 24% to 51% is particularly relevant for agent developers. ReAct agents make multiple tool calls per question, and each call accumulates context. The optimized version found better search strategies and more accurate reasoning patterns, doubling accuracy without any architectural changes.
Where DSPy gains are largest
DSPy shows the largest improvements on: (1) multi-hop reasoning tasks requiring multiple LM calls, (2) smaller models where good few-shot examples have outsized impact, and (3) tasks with clear success metrics where the optimizer can reliably signal progress. Simple single-call classification tasks with large models show smaller gains because baseline prompting is already near ceiling.
DSPy vs LangChain
LangChain and DSPy solve different parts of the LLM pipeline problem. The choice between them is rarely about which is better overall; it is about what you are trying to do.
| Dimension | DSPy | LangChain |
|---|---|---|
| Prompt source | Generated by optimizer from your data | Pre-written templates, you maintain them |
| Model portability | Recompile for new model, prompts adapt automatically | Rewrite prompts manually when switching |
| Setup time | Higher upfront: need metric, training data | Lower: start with a template immediately |
| Performance ceiling | Higher: optimizer finds task-specific prompts | Capped by quality of generic templates |
| Pre-built integrations | Fewer, more generic | Extensive: 200+ integrations, vector stores, etc. |
| Best for | Novel tasks needing maximum performance | Standard patterns needing fast integration |
DSPy's documentation states the distinction directly: LangChain provides "batteries-included, pre-built application modules." DSPy provides "a small set of much more powerful and general-purpose modules that can learn to prompt your LM." These are not competing for the same use case. LangChain is faster to get working on common patterns. DSPy is better when you have a specific task and want the highest possible accuracy.
Many production systems use both. LangChain handles retrieval, memory, and tool orchestration. DSPy handles the LM calls within those pipelines, with optimized prompts for each step.
When to Use DSPy
Use DSPy when:
- You have a specific task with a measurable metric
- You can collect at least 20-30 labeled examples
- You expect to change models or need portability
- You are building a multi-step pipeline (RAG, agents)
- Manual prompt iteration has stalled
- You want the highest possible accuracy on a narrow task
Skip DSPy when:
- You need a working prototype in under an hour
- You lack labeled training data and cannot create it
- The task has no clear success metric
- You are using a standard pattern (summarization, basic Q&A) where templates work fine
- Your LM provider is fixed and will never change
The minimum viable DSPy setup is small: a Signature, a Predict or ChainOfThought module, 20 examples, and a metric function. BootstrapFewShot will run in under 5 minutes and typically cost under $0.50 on a mid-size model. If it does not improve your baseline, you have lost 5 minutes and $0.50. If it does, you have a data-driven, portable, self-documenting prompt pipeline.
Minimal end-to-end DSPy pipeline
import dspy
from dspy.teleprompt import BootstrapFewShot
# 1. Configure LM
dspy.configure(lm=dspy.LM("openai/gpt-5.4-mini"))
# 2. Define the task
class SentimentClassifier(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classify = dspy.ChainOfThought(
            "review: str -> sentiment: str, confidence: float"
        )

    def forward(self, review: str) -> dspy.Prediction:
        return self.classify(review=review)

# 3. Define the metric
def exact_match(example, pred, trace=None):
    return example.sentiment.lower() == pred.sentiment.lower()

# 4. Training data (minimum viable: 10-20 examples)
trainset = [
    dspy.Example(review="This is amazing!", sentiment="positive").with_inputs("review"),
    dspy.Example(review="Terrible product.", sentiment="negative").with_inputs("review"),
    dspy.Example(review="It's okay.", sentiment="neutral").with_inputs("review"),
    # ... more examples
]

# 5. Optimize
optimizer = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=3)
optimized = optimizer.compile(SentimentClassifier(), trainset=trainset)

# 6. Save and use
optimized.save("sentiment_classifier.json")
result = optimized(review="Best purchase I've ever made!")
print(result.sentiment)   # "positive"
print(result.confidence)  # 0.97

Frequently Asked Questions
What is DSPy prompt optimization?
DSPy prompt optimization is the process of automatically generating and refining LM prompts using optimizers like BootstrapFewShot, MIPROv2, and COPRO. Instead of writing prompts by hand, you define your task with a Signature, specify a metric, provide training examples, and run an optimizer that compiles the best prompt for your specific model and data.
What is the difference between DSPy and manual prompt engineering?
Manual prompt engineering requires hand-crafting instructions, selecting few-shot examples, and re-writing prompts every time you change models or data distributions. DSPy treats prompts as learned parameters. You define what the task is (Signatures) and how to measure success (a metric function), and the optimizer finds the best instructions and demonstrations automatically. DSPy programs also transfer across models without manual rewriting.
Which DSPy optimizer should I use?
For small datasets under 20 examples, start with BootstrapFewShot. For 50+ examples, use BootstrapFewShotWithRandomSearch. For instruction-level optimization without few-shot examples, use MIPROv2 in zero-shot mode. For the highest quality on large datasets with 40+ trials, use MIPROv2 with 200+ examples. SIMBA works well for hard examples that repeatedly fail during evaluation.
How much does DSPy optimization cost?
Typical runs cost approximately $2 and take about 10 minutes. Costs range from cents (small dataset, BootstrapFewShot, small model) to tens of dollars (large dataset, MIPROv2 with 50+ trials, GPT-4). You can cap costs by limiting num_candidate_programs or num_trials in the optimizer config.
Does DSPy work with all LLM providers?
Yes. DSPy supports OpenAI, Anthropic, Google Gemini, Cohere, Together AI, Groq, local models via Ollama, and any OpenAI-compatible API. Because prompts are compiled per-model, switching providers means recompiling rather than rewriting. This is a key advantage when evaluating model upgrades.
Can DSPy fine-tune models, not just prompts?
Yes. BootstrapFinetune converts a prompt-based DSPy program into weight updates by building a training dataset from successful executions. BetterTogether combines prompt optimization and fine-tuning in sequence. This lets you prototype with prompting (faster iteration) and then distill into a smaller, faster fine-tuned model for production.
How does DSPy compare to LangChain for prompt optimization?
LangChain provides pre-built prompt templates and chains but does not optimize prompts automatically. If you want better performance, you rewrite the prompts yourself. DSPy has no pre-built prompts: it generates them through optimization against your data. LangChain is better for rapid prototyping with standard patterns. DSPy is better when you have a specific task, a clear metric, and want to find the best possible prompt without manual iteration.
What is a DSPy Signature?
A DSPy Signature is a declarative specification of what an LM module takes as input and produces as output. Simple inline signatures look like "question -> answer". Class-based signatures support docstrings and field descriptions that guide optimization. Signatures replace hand-written prompts: you declare the task, DSPy figures out how to prompt the model to do it.
Build LLM Pipelines That Apply Edits at 10,500 tok/s
DSPy optimizes the prompts your agent uses. Morph handles the code edits those agents produce. The morph-v3-fast model applies LM-generated edits with 10,500 tok/s throughput and 97.3% accuracy, so your optimized agent pipeline can actually ship changes at scale.