DSPy replaces hand-written prompts with compiled ones. You declare what your LM module should do, define a metric, provide training examples, and an optimizer finds the best instructions and demonstrations automatically. The original DSPy paper showed 25%+ improvement over standard few-shot prompting on GPT-3.5 and 65% on Llama 2 13B.
What DSPy Is
DSPy (Declarative Self-improving Python) is a framework from Stanford NLP published at ICLR 2024. It treats LM pipelines the way PyTorch treats neural networks: as parameterized systems compiled against a loss function. The "parameters" are prompts and few-shot demonstrations. The "loss function" is your metric. The "compiler" is an optimizer.
The problem DSPy solves is that prompts are brittle. A carefully crafted few-shot prompt for GPT-4 often degrades when you switch to Claude. Adding a new reasoning step requires rewriting every downstream prompt. Evaluating which of 20 candidate prompts actually works requires manual comparison. DSPy automates all of this.
The core shift
Traditional prompt engineering: you write what to say to the model. DSPy: you write what the model should do (Signature) and how to measure success (metric), and the optimizer figures out what to say. The prompts become an implementation detail.
DSPy programs are modular Python classes. Each module wraps an LM call with a specific prompting strategy (basic prediction, chain-of-thought, tool use). Modules compose into larger pipelines. Once you have a working pipeline, you run an optimizer to compile it into a high-quality prompt. The compiled program can be saved to JSON and loaded back without rerunning optimization.
Portable across models
Switch from GPT-4 to Claude to Llama without rewriting prompts. Recompile against the new model and the optimizer finds the right instructions for it.
Data-driven
Optimization uses your actual training examples and metric function. The resulting prompt is tailored to your task, not a generic template.
Composable
Build complex pipelines from simple modules. Each module is independently optimizable. The full pipeline trains end-to-end.
Signatures: Declaring LM Behavior
A Signature is a declarative specification of what a language model module takes as input and produces as output. It replaces the role of a hand-written prompt: instead of telling the model how to respond, you tell DSPy what the task is.
Field names carry semantic meaning. Naming a field question versus sql_query changes how the optimizer generates instructions. The names are not just documentation; they guide the optimization process.
Inline signatures (simple tasks)
import dspy
# Basic question answering
qa = dspy.Predict("question -> answer")
# With type annotations
sentiment = dspy.Predict("sentence -> sentiment: bool")
# Multi-field with types
rag = dspy.Predict("context: list[str], question: str -> answer: str")
# Run the module
result = qa(question="What year was Python created?")
print(result.answer)  # "1991"

Class-based signatures (complex tasks)
import dspy
class GenerateSearchQuery(dspy.Signature):
    """Generate a search query to help answer a factual question."""
    context: list[str] = dspy.InputField(desc="May contain relevant facts")
    question: str = dspy.InputField()
    query: str = dspy.OutputField(desc="A targeted search query, 3-8 words")

class ClassifyDocument(dspy.Signature):
    """Classify a document into one of the predefined categories."""
    document: str = dspy.InputField(desc="The full text to classify")
    category: str = dspy.OutputField(
        desc="One of: finance, legal, medical, technical, other"
    )
    confidence: float = dspy.OutputField(
        desc="Confidence score between 0 and 1"
    )

# Use in a module
generate_query = dspy.Predict(GenerateSearchQuery)
result = generate_query(
    context=["Python was created by Guido van Rossum"],
    question="Who created Python?"
)
print(result.query)  # "Python creator Guido van Rossum"

Class-based signatures are what the optimizer uses to generate instructions. The docstring becomes the basis for instruction proposals. Field descriptions constrain what the model should produce. During optimization, the optimizer varies these instructions systematically to find what works best on your training data.
Modules: Prompting Strategies
Modules implement specific prompting strategies on top of Signatures. Each module is a learnable unit: it stores the optimized instructions and demonstrations produced by the optimizer.
Predict
The base module. Calls the LM with the signature, no additional strategy. All other modules build on Predict. Use it when you want direct input-output behavior without explicit reasoning steps.
ChainOfThought
Injects a reasoning field before the output fields. The model thinks step-by-step before committing to an answer. This consistently improves accuracy on tasks requiring multi-step reasoning, math, or logical deduction.
ChainOfThought vs Predict
import dspy
lm = dspy.LM("openai/gpt-5.4-mini")
dspy.configure(lm=lm)
# Without reasoning
direct = dspy.Predict("question -> answer: int")
# With step-by-step reasoning
reasoned = dspy.ChainOfThought("question -> answer: int")
question = "A train travels 120 miles at 60 mph. How many hours does it take?"
direct_result = direct(question=question)
print(direct_result.answer) # might be wrong
reasoned_result = reasoned(question=question)
print(reasoned_result.reasoning) # "Distance = 120 miles, Speed = 60 mph. Time = 120/60 = 2 hours."
print(reasoned_result.answer)  # 2

ReAct
An agent that interleaves reasoning and tool calls. It implements the Reason-Act loop: reason about what to do, call a tool, observe the result, reason again. Optimization raised ReAct agent accuracy from 24% to 51% on HotPotQA question answering.
ReAct agent with tools
import dspy
def search_wikipedia(query: str) -> str:
    """Search Wikipedia and return a summary."""
    # Your search implementation
    return f"Results for: {query}"

def calculate(expression: str) -> float:
    """Evaluate a mathematical expression."""
    # Note: eval is unsafe on untrusted input; use a real expression parser in production
    return eval(expression)
# Build a ReAct agent with two tools
agent = dspy.ReAct(
    "question -> answer",
    tools=[search_wikipedia, calculate]
)

result = agent(
    question="What is the population of Tokyo divided by 1 million?"
)
print(result.answer)

Custom Modules
Any Python class that inherits from dspy.Module and implements forward() is a module. This lets you compose multiple LM calls, add control flow, integrate retrieval, or build multi-step pipelines. The optimizer can compile any custom module as long as it uses DSPy primitives internally.
Custom RAG module
import dspy
class RAG(dspy.Module):
    def __init__(self, num_passages: int = 3):
        super().__init__()
        self.num_passages = num_passages
        self.generate_query = dspy.ChainOfThought("question -> search_query")
        self.generate_answer = dspy.ChainOfThought(
            "context: list[str], question: str -> answer: str"
        )

    def forward(self, question: str) -> dspy.Prediction:
        # Step 1: Generate a search query
        query_result = self.generate_query(question=question)
        # Step 2: Retrieve relevant passages (retrieve is your own retriever function)
        passages = retrieve(query_result.search_query, k=self.num_passages)
        # Step 3: Generate answer from context
        return self.generate_answer(
            context=passages,
            question=question
        )

# This entire pipeline is optimizable as a unit
rag = RAG(num_passages=3)
result = rag(question="What is the speed of light?")
print(result.answer)

How Optimization Works
DSPy optimization requires three inputs: a program, a metric function, and training data. The optimizer searches for the combination of instructions and few-shot demonstrations that maximizes your metric on the training set.
The optimization loop
import dspy
from dspy.evaluate import Evaluate
# 1. Configure your LM
lm = dspy.LM("openai/gpt-5.4-mini")
dspy.configure(lm=lm)
# 2. Define your program
rag = RAG()
# 3. Define your metric
def answer_correctness(example, pred, trace=None):
    """Returns 1 if the predicted answer matches the expected answer."""
    return float(example.answer.lower() in pred.answer.lower())

# 4. Prepare training data (DSPy recommends 20% train, 80% dev to prevent overfitting)
trainset = [
    dspy.Example(question="Who wrote Hamlet?", answer="Shakespeare").with_inputs("question"),
    dspy.Example(question="When was Python created?", answer="1991").with_inputs("question"),
    # ... more examples
]

# 5. Run optimization
from dspy.teleprompt import BootstrapFewShot
optimizer = BootstrapFewShot(metric=answer_correctness, max_bootstrapped_demos=4)
optimized_rag = optimizer.compile(rag, trainset=trainset)

# 6. Save for reuse: no need to recompile every run
optimized_rag.save("optimized_rag.json")

# 7. Evaluate on a held-out set (devset is built the same way as trainset)
evaluate = Evaluate(devset=devset, metric=answer_correctness, num_threads=4)
score = evaluate(optimized_rag)
print(f"Accuracy: {score:.1%}")

The metric function has access to the expected output (example), the predicted output (pred), and an optional trace (trace). When trace is not None, the optimizer is in compilation mode and can inspect intermediate LM calls. This allows metrics to enforce constraints on reasoning steps, not just final outputs.
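As an illustrative sketch, a metric can branch on trace to enforce extra constraints during compilation. Plain SimpleNamespace objects stand in here for real DSPy examples and predictions:

```python
from types import SimpleNamespace

def strict_correctness(example, pred, trace=None):
    """Containment check; during compilation (trace is not None),
    additionally require a non-trivial reasoning string."""
    correct = example.answer.lower() in pred.answer.lower()
    if trace is not None:
        # Compilation mode: also constrain the intermediate reasoning step
        has_reasoning = len(getattr(pred, "reasoning", "")) > 20
        return correct and has_reasoning
    return float(correct)

# Stand-in objects to show the two modes
example = SimpleNamespace(answer="1991")
pred = SimpleNamespace(
    answer="Python was created in 1991.",
    reasoning="Guido van Rossum released Python 0.9.0 in February 1991.",
)

print(strict_correctness(example, pred))            # 1.0 (evaluation mode)
print(strict_correctness(example, pred, trace=[]))  # True (compilation mode)
```

Returning a bool during compilation and a float during evaluation is a common pattern: the optimizer uses the bool as a hard filter on bootstrapped demonstrations.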
Data split recommendation
DSPy recommends an inverted split from standard ML: use 20% for training and 80% for validation. This is intentional. Prompt optimizers can overfit on training examples, so you want a large validation set to catch this. With small datasets, use 10-20 examples for training and the rest for evaluation.
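A minimal sketch of that inverted split, assuming your examples live in a plain Python list (inverted_split is a hypothetical helper written for this article, not a DSPy API):

```python
import random

def inverted_split(examples, train_frac=0.2, seed=0):
    """Shuffle, then split: a small train set for the optimizer,
    a large dev set to catch overfitting."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = max(1, int(len(shuffled) * train_frac))
    return shuffled[:cut], shuffled[cut:]

data = list(range(50))  # stand-ins for dspy.Example objects
trainset, devset = inverted_split(data)
print(len(trainset), len(devset))  # 10 40
```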
Compiled programs store their optimized parameters in JSON. Load them back with .load() and the program runs with the optimized prompts without any additional LM calls for optimization. The compilation cost is paid once; inference runs at normal speed.
Optimizer Types
DSPy ships with a family of optimizers covering different tradeoffs between cost, dataset size, and optimization depth. Choosing the right one for your scenario matters more than fine-tuning any single optimizer's parameters.
Few-Shot Optimizers
| Optimizer | How It Works | Best For | Data Needed |
|---|---|---|---|
| LabeledFewShot | Selects k examples directly from labeled data | Baseline, no LM calls for optimization | 5+ examples |
| BootstrapFewShot | Generates demonstrations via a teacher LM, filters by metric | Small datasets, quick results | ~10 examples |
| BootstrapFewShotWithRandomSearch | Runs BootstrapFewShot multiple times, picks best via eval | Better coverage, moderate cost | 50+ examples |
| KNNFewShot | K-nearest neighbors selects training examples per query | Tasks where example relevance varies | 100+ examples |
BootstrapFewShot (most common starting point)
from dspy.teleprompt import BootstrapFewShot
optimizer = BootstrapFewShot(
    metric=your_metric,
    max_labeled_demos=4,        # demos drawn directly from trainset
    max_bootstrapped_demos=4,   # demos generated by teacher LM
    teacher_settings={"lm": dspy.LM("openai/gpt-5.4")}  # use stronger model as teacher
)
optimized_program = optimizer.compile(program, trainset=trainset)

Instruction Optimizers
| Optimizer | Strategy | Strengths | Trials Needed |
|---|---|---|---|
| COPRO | Coordinate ascent (hill-climbing) on instructions | Simple, interpretable | depth × breadth |
| MIPROv2 | Bayesian optimization over instructions + demos | Best quality, data-aware | 40+ recommended |
| SIMBA | Stochastic mini-batch, LLM introspection on failures | Hard examples, failure analysis | 20+ |
| GEPA | LM reflection on execution trajectories | Iterative refinement, domain feedback | 10+ |
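The coordinate-ascent strategy behind COPRO can be sketched in plain Python. Here propose and evaluate are toy stand-ins for the LM-driven instruction proposer and your metric over the trainset:

```python
def hill_climb_instruction(seed_instruction, propose, evaluate, depth=3, breadth=4):
    """At each depth, propose `breadth` variants of the current best
    instruction and climb if any of them scores higher."""
    best, best_score = seed_instruction, evaluate(seed_instruction)
    for _ in range(depth):
        candidates = [propose(best) for _ in range(breadth)]
        for cand in candidates:
            score = evaluate(cand)
            if score > best_score:
                best, best_score = cand, score
    return best, best_score

# Mock stand-ins: a real run would propose variants with an LM and
# evaluate each candidate instruction with your metric on training data.
def propose(instr):
    return instr + "!"

def evaluate(instr):
    return min(len(instr), 30)  # pretend longer instructions score better, up to a cap

best, score = hill_climb_instruction("Answer the question.", propose, evaluate)
print(best, score)
```

The depth × breadth cost in the table falls straight out of this loop: every level of depth evaluates breadth fresh candidates.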
MIPROv2 (highest quality for instruction optimization)
from dspy.teleprompt import MIPROv2
# MIPROv2 is data-aware and demo-aware
# It uses Bayesian optimization to efficiently search the instruction space
optimizer = MIPROv2(
    metric=your_metric,
    auto="medium",    # "light" (10 trials), "medium" (25), "heavy" (50+)
    num_threads=8,    # parallelize candidate evaluation
)

# Zero-shot mode: optimize instructions only, no few-shot examples
optimized_0shot = optimizer.compile(
    program,
    trainset=trainset,
    num_trials=25,
    requires_permission_to_run=False
)

# Few-shot mode: optimize both instructions and demonstrations
optimized_fewshot = optimizer.compile(
    program,
    trainset=trainset,
    num_trials=40,
    max_labeled_demos=5,
    max_bootstrapped_demos=4,
    requires_permission_to_run=False
)

Fine-Tuning and Meta-Optimizers
Once you have a working prompt-based program, two additional optimizers can push performance further.
BootstrapFinetune
Converts a prompt-based program into model weight updates. Builds a training dataset from successful executions, then runs fine-tuning. The resulting model is smaller and faster than prompting a larger model.
BetterTogether
Combines prompt optimization and fine-tuning in sequence. Each pass builds on the prior. Often outperforms either approach alone, particularly on specialized domains where both instruction quality and model weights matter.
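The core move in BootstrapFinetune, harvesting metric-passing executions into a fine-tuning dataset, can be sketched generically. The program and metric below are mocks for illustration; the real optimizer operates on DSPy execution traces:

```python
def harvest_finetune_data(program, metric, examples):
    """Run the program on each example and keep only the
    (input, output) pairs that the metric accepts."""
    dataset = []
    for ex in examples:
        pred = program(ex["input"])
        if metric(ex, pred):
            dataset.append({"input": ex["input"], "output": pred})
    return dataset

# Mock stand-ins for illustration
def mock_program(text):
    return text.upper()

def exact_metric(ex, pred):
    return pred == ex["expected"]

examples = [
    {"input": "good", "expected": "GOOD"},
    {"input": "bad", "expected": "oops"},  # fails the metric, filtered out
]
print(harvest_finetune_data(mock_program, exact_metric, examples))
# [{'input': 'good', 'output': 'GOOD'}]
```

Only successful executions make it into the dataset, so the fine-tuned model is trained on behavior your metric has already endorsed.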
Optimizer selection guide
Start with BootstrapFewShot for any new task. If you have 50+ examples and want better instructions, move to MIPROv2 with auto="medium". If you have a production system with a fixed model and want maximum performance, run BetterTogether. If you need to deploy on a smaller model, use BootstrapFinetune to distill the prompt-based behavior into weights.
Benchmarks: DSPy vs Manual Prompting
The ICLR 2024 DSPy paper tested compiled programs against two baselines, standard few-shot prompting and expert-written prompt chains, both evaluated on complex reasoning and retrieval tasks.
| Model | vs Standard Few-Shot | vs Expert Prompts |
|---|---|---|
| GPT-3.5 | +25% or more | +5% to +46% |
| Llama 2 13B | +65% | +16% to +40% |
| T5 770M | Competitive with GPT-3.5 expert prompts | Outperforms hand-crafted chains |
The T5 result is the most striking. A 770M-parameter open model compiled with DSPy matched GPT-3.5 performance on expert-written prompt chains. This illustrates the core DSPy argument: with the right optimized prompts and demonstrations, smaller models can reach parity with larger models on specific tasks.
The ReAct improvement from 24% to 51% is particularly relevant for agent developers. ReAct agents make multiple tool calls per question, and each call accumulates context. The optimized version found better search strategies and more accurate reasoning patterns, doubling accuracy without any architectural changes.
Where DSPy gains are largest
DSPy shows the largest improvements on: (1) multi-hop reasoning tasks requiring multiple LM calls, (2) smaller models where good few-shot examples have outsized impact, and (3) tasks with clear success metrics where the optimizer can reliably signal progress. Simple single-call classification tasks with large models show smaller gains because baseline prompting is already near ceiling.
DSPy vs LangChain
LangChain and DSPy solve different parts of the LLM pipeline problem. The choice between them is rarely about which is better overall; it is about what you are trying to do.
| Dimension | DSPy | LangChain |
|---|---|---|
| Prompt source | Generated by optimizer from your data | Pre-written templates, you maintain them |
| Model portability | Recompile for new model, prompts adapt automatically | Rewrite prompts manually when switching |
| Setup time | Higher upfront: need metric, training data | Lower: start with a template immediately |
| Performance ceiling | Higher: optimizer finds task-specific prompts | Capped by quality of generic templates |
| Pre-built integrations | Fewer, more generic | Extensive: 200+ integrations, vector stores, etc. |
| Best for | Novel tasks needing maximum performance | Standard patterns needing fast integration |
DSPy's documentation states the distinction directly: LangChain provides "batteries-included, pre-built application modules." DSPy provides "a small set of much more powerful and general-purpose modules that can learn to prompt your LM." These are not competing for the same use case. LangChain is faster to get working on common patterns. DSPy is better when you have a specific task and want the highest possible accuracy.
Many production systems use both. LangChain handles retrieval, memory, and tool orchestration. DSPy handles the LM calls within those pipelines, with optimized prompts for each step.
When to Use DSPy
Use DSPy when:
- You have a specific task with a measurable metric
- You can collect at least 20-30 labeled examples
- You expect to change models or need portability
- You are building a multi-step pipeline (RAG, agents)
- Manual prompt iteration has stalled
- You want the highest possible accuracy on a narrow task
Skip DSPy when:
- You need a working prototype in under an hour
- You lack labeled training data and cannot create it
- The task has no clear success metric
- You are using a standard pattern (summarization, basic Q&A) where templates work fine
- Your LM provider is fixed and will never change
The minimum viable DSPy setup is small: a Signature, a Predict or ChainOfThought module, 20 examples, and a metric function. BootstrapFewShot will run in under 5 minutes and typically cost under $0.50 on a mid-size model. If it does not improve your baseline, you have lost 5 minutes and $0.50. If it does, you have a data-driven, portable, self-documenting prompt pipeline.
Minimal end-to-end DSPy pipeline
import dspy
from dspy.teleprompt import BootstrapFewShot
# 1. Configure LM
dspy.configure(lm=dspy.LM("openai/gpt-5.4-mini"))
# 2. Define the task
class SentimentClassifier(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classify = dspy.ChainOfThought(
            "review: str -> sentiment: str, confidence: float"
        )

    def forward(self, review: str) -> dspy.Prediction:
        return self.classify(review=review)

# 3. Define the metric
def exact_match(example, pred, trace=None):
    return example.sentiment.lower() == pred.sentiment.lower()

# 4. Training data (minimum viable: 10-20 examples)
trainset = [
    dspy.Example(review="This is amazing!", sentiment="positive").with_inputs("review"),
    dspy.Example(review="Terrible product.", sentiment="negative").with_inputs("review"),
    dspy.Example(review="It's okay.", sentiment="neutral").with_inputs("review"),
    # ... more examples
]

# 5. Optimize
optimizer = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=3)
optimized = optimizer.compile(SentimentClassifier(), trainset=trainset)

# 6. Save and use
optimized.save("sentiment_classifier.json")
result = optimized(review="Best purchase I've ever made!")
print(result.sentiment)   # "positive"
print(result.confidence)  # 0.97

Frequently Asked Questions
What is DSPy prompt optimization?
DSPy prompt optimization is the process of automatically generating and refining LM prompts using optimizers like BootstrapFewShot, MIPROv2, and COPRO. Instead of writing prompts by hand, you define your task with a Signature, specify a metric, provide training examples, and run an optimizer that compiles the best prompt for your specific model and data.
What is the difference between DSPy and manual prompt engineering?
Manual prompt engineering requires hand-crafting instructions, selecting few-shot examples, and re-writing prompts every time you change models or data distributions. DSPy treats prompts as learned parameters. You define what the task is (Signatures) and how to measure success (a metric function), and the optimizer finds the best instructions and demonstrations automatically. DSPy programs also transfer across models without manual rewriting.
Which DSPy optimizer should I use?
For small datasets under 20 examples, start with BootstrapFewShot. For 50+ examples, use BootstrapFewShotWithRandomSearch. For instruction-level optimization without few-shot examples, use MIPROv2 in zero-shot mode. For the highest quality on large datasets with 40+ trials, use MIPROv2 with 200+ examples. SIMBA works well for hard examples that repeatedly fail during evaluation.
How much does DSPy optimization cost?
Typical runs cost approximately $2 and take about 10 minutes. Costs range from cents (small dataset, BootstrapFewShot, small model) to tens of dollars (large dataset, MIPROv2 with 50+ trials, GPT-4). You can cap costs by limiting num_candidate_programs or num_trials in the optimizer config.
Does DSPy work with all LLM providers?
Yes. DSPy supports OpenAI, Anthropic, Google Gemini, Cohere, Together AI, Groq, local models via Ollama, and any OpenAI-compatible API. Because prompts are compiled per-model, switching providers means recompiling rather than rewriting. This is a key advantage when evaluating model upgrades.
Can DSPy fine-tune models, not just prompts?
Yes. BootstrapFinetune converts a prompt-based DSPy program into weight updates by building a training dataset from successful executions. BetterTogether combines prompt optimization and fine-tuning in sequence. This lets you prototype with prompting (faster iteration) and then distill into a smaller, faster fine-tuned model for production.
How does DSPy compare to LangChain for prompt optimization?
LangChain provides pre-built prompt templates and chains but does not optimize prompts automatically. If you want better performance, you rewrite the prompts yourself. DSPy has no pre-built prompts: it generates them through optimization against your data. LangChain is better for rapid prototyping with standard patterns. DSPy is better when you have a specific task, a clear metric, and want to find the best possible prompt without manual iteration.
What is a DSPy Signature?
A DSPy Signature is a declarative specification of what an LM module takes as input and produces as output. Simple inline signatures look like "question -> answer". Class-based signatures support docstrings and field descriptions that guide optimization. Signatures replace hand-written prompts: you declare the task, DSPy figures out how to prompt the model to do it.
Build LLM Pipelines That Apply Edits at 10,500 tok/s
DSPy optimizes the prompts your agent uses. Morph handles the code edits those agents produce. The morph-v3-fast model applies LM-generated edits with 10,500 tok/s throughput and 97.3% accuracy, so your optimized agent pipeline can actually ship changes at scale.