TextGrad: Backpropagation Through Text Using LLM-Generated Gradients

TextGrad performs automatic differentiation via text, using LLMs to generate textual gradients that optimize prompts, code, and even molecules. Published in Nature. Here is how it works, how it performs on benchmarks, and when to choose it over DSPy.

March 14, 2026 · 2 min read

What TextGrad Does

Neural networks learn by backpropagating numerical gradients through differentiable functions. TextGrad applies the same principle to text. It builds a computation graph where nodes are text variables and edges are LLM calls, then backpropagates natural language feedback to optimize those variables.

The framework was built by James Zou's group at Stanford (Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, and Carlos Guestrin). It was published in Nature after an initial arXiv preprint in June 2024. The open-source implementation mirrors PyTorch's API: if you can write a training loop, you can use TextGrad.

  • 51% → 55% — GPQA accuracy (GPT-4o)
  • +20% — LeetCode Hard (relative gain)
  • Published in Nature

The key insight: if an LLM can critique an output and explain what is wrong, that critique is a usable gradient. "The answer is incorrect because it confuses velocity with acceleration in step 3" tells the optimizer exactly what to change. This is richer signal than a scalar loss. Traditional optimization gets a number. TextGrad gets a paragraph.

Core Abstraction

TextGrad maps PyTorch concepts to text optimization:

  • torch.Tensor → tg.Variable: Holds text content that can be optimized
  • nn.Module → tg.BlackboxLLM: Wraps any LLM call as a differentiable function
  • loss function → tg.TextLoss: Natural language evaluation criteria
  • loss.backward() → loss.backward(): LLM generates textual gradients
  • SGD → TGD: Textual Gradient Descent applies feedback to update variables

How It Works

TextGrad constructs a computation graph at runtime, tracking which variables flow through which LLM calls. When you call loss.backward(), it walks the graph in reverse, asking an LLM (the "backward engine") to generate textual gradients at each node.

Variables

A tg.Variable holds a string and metadata. Setting requires_grad=True marks it as optimizable. The role_description tells the backward engine what this variable represents (e.g., "system prompt for a math tutor" or "Python solution to a coding problem").

Loss Functions

tg.TextLoss takes a natural language instruction describing how to evaluate quality. Unlike numerical loss functions that output a scalar, TextLoss outputs a text critique. "Evaluate whether this code handles edge cases correctly and produces the right output for the given test cases" is a valid loss function.

Textual Gradients

The backward engine (typically a strong model like GPT-5.4 or Claude) reads the loss output and generates improvement suggestions for each variable in the computation graph. These suggestions are the textual gradients. They flow backward through the graph, accumulating context at each node, analogous to the chain rule in calculus.

The Optimizer (TGD)

Textual Gradient Descent takes the accumulated gradients and applies them to update variables. The optimizer prompt instructs the LLM to revise the variable's content based on the feedback while preserving aspects that work well. This is the analog of a weight update step.
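To make the update step concrete, here is a toy sketch of a TGD-style optimizer. Everything below (`revise`, `ToyVariable`, `ToyTGD`) is illustrative stand-in code, not the TextGrad internals — in particular, `revise` stubs out the LLM call that performs the rewrite:

```python
def revise(text: str, feedback: str) -> str:
    """Stub for the optimizer LLM: rewrite `text` according to `feedback`.

    A real implementation would prompt an LLM with the variable's content,
    its role description, and the accumulated textual gradient.
    """
    return f"{text} [revised per feedback: {feedback}]"

class ToyVariable:
    def __init__(self, value: str, requires_grad: bool = True):
        self.value = value
        self.requires_grad = requires_grad
        self.gradients: list[str] = []  # accumulated textual feedback

class ToyTGD:
    def __init__(self, parameters: list[ToyVariable]):
        self.parameters = parameters

    def step(self) -> None:
        # Apply accumulated textual gradients to each optimizable variable.
        for p in self.parameters:
            if p.requires_grad and p.gradients:
                p.value = revise(p.value, "; ".join(p.gradients))

    def zero_grad(self) -> None:
        # Clear feedback before the next iteration, like optimizer.zero_grad().
        for p in self.parameters:
            p.gradients.clear()

answer = ToyVariable("d/dx[x^2 sin x] = 2x sin x")
answer.gradients.append("missing the x^2 cos x term from the product rule")
opt = ToyTGD([answer])
opt.step()
opt.zero_grad()
print(answer.value)
```

The structural point is the one the section makes: the "weight update" is a feedback-conditioned rewrite of the variable's text, followed by clearing the gradient buffer.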

Architecture: The Optimization Loop

  1. Forward Pass: Input variables flow through LLM calls to produce outputs
  2. Loss Computation: TextLoss evaluates the output using natural language criteria, producing a critique
  3. Backward Pass: The backward engine generates textual gradients at each node by asking "how should this input change to improve the output?"
  4. Variable Update: TGD applies the gradients, rewriting the variable text based on the accumulated feedback
  5. Repeat: Steps 1-4 iterate until convergence or a fixed budget

The Optimization Loop in Detail

A single TextGrad optimization step involves 3 LLM calls minimum: one for the forward pass (generating output), one for the loss evaluation (critiquing output), and one for gradient generation (producing improvement suggestions). The optimizer step may require an additional call. For a typical 3-iteration optimization, that is 9-12 LLM calls per problem instance.
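The call accounting above can be written as a quick back-of-the-envelope formula. Nothing here touches an API; the per-step counts (3 required calls plus an optional optimizer call) come straight from the text:

```python
def calls_per_optimization(iterations: int, optimizer_needs_call: bool = True) -> int:
    """Count LLM calls for a single-node TextGrad loop.

    Per iteration: 1 forward + 1 loss evaluation + 1 gradient generation,
    plus an optional extra call for the optimizer's variable rewrite.
    """
    per_step = 3 + (1 if optimizer_needs_call else 0)
    return iterations * per_step

print(calls_per_optimization(3, optimizer_needs_call=False))  # 9  (lower bound)
print(calls_per_optimization(3, optimizer_needs_call=True))   # 12 (upper bound)
```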

| Step | PyTorch Analog | TextGrad Implementation |
|---|---|---|
| Forward | model(x) | BlackboxLLM generates text output from input variables |
| Loss | criterion(output, target) | TextLoss evaluates output against natural language criteria |
| Backward | loss.backward() | Backward engine LLM generates textual improvement suggestions |
| Update | optimizer.step() | TGD rewrites variable text based on accumulated gradients |
| Zero Grad | optimizer.zero_grad() | Clear previous gradients before next iteration |

The computation graph can be arbitrarily deep. A multi-step reasoning chain might have 5 nodes, each representing an LLM call. Gradients propagate backward through all of them. In practice, graphs deeper than 3-4 nodes start hitting diminishing returns due to gradient degradation.
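The backward walk over such a chain can be sketched in a few lines. This is a toy model of the idea, not TextGrad's implementation: `critique` stubs the backward engine, and the gradient text visibly accumulates context as it flows toward earlier nodes:

```python
def critique(node: str, downstream_feedback: str) -> str:
    """Stub backward engine: how should this node's input change?"""
    return f"adjust {node} given: {downstream_feedback}"

def backward(nodes: list[str], loss_critique: str) -> dict[str, str]:
    """Walk a linear computation graph in reverse, accumulating feedback."""
    gradients = {}
    feedback = loss_critique
    for node in reversed(nodes):
        feedback = critique(node, feedback)  # chain-rule analog: condition on downstream feedback
        gradients[node] = feedback
    return gradients

chain = ["draft_plan", "refine_plan", "final_answer"]
grads = backward(chain, "answer confuses velocity with acceleration")
for node in chain:
    print(node, "->", grads[node])
```

Note that the gradient reaching `draft_plan` wraps every downstream critique — exactly the context growth that causes the depth limits discussed in the Limitations section.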

Code Example

Here is a minimal TextGrad program that optimizes an answer to a question. The forward model generates an initial answer. TextLoss evaluates it. The backward pass produces textual gradients. TGD applies the gradients to improve the answer.

Answer Optimization with TextGrad

import textgrad as tg

# Configure engines
tg.set_backward_engine("gpt-5.4", override=True)
model = tg.BlackboxLLM("gpt-5.4")

# Define the question (not optimizable)
question = tg.Variable(
    "What is the derivative of x^2 * sin(x)?",
    role_description="question to answer",
    requires_grad=False
)

# Get initial answer (optimizable)
answer = model(question)
answer.set_role_description("answer to the math question")
answer.requires_grad = True

# Define loss function with natural language criteria
loss_fn = tg.TextLoss(
    "Evaluate this answer for mathematical correctness. "
    "Check each step of the derivation. "
    "Identify any errors in applying the product rule."
)

# Optimization loop
optimizer = tg.TGD(parameters=[answer])

for i in range(3):
    loss = loss_fn(answer)
    loss.backward()       # Generate textual gradients
    optimizer.step()      # Apply gradients to update answer
    optimizer.zero_grad() # Clear gradients for next step
    print(f"Step {i+1}: {answer.value[:100]}...")

For prompt optimization (improving a system prompt rather than a single answer), the pattern is similar but the optimizable variable is the system prompt itself:

System Prompt Optimization

import textgrad as tg

tg.set_backward_engine("gpt-5.4", override=True)

# System prompt is the variable being optimized
system_prompt = tg.Variable(
    "You are a helpful math tutor. Show your work step by step.",
    role_description="system prompt for a math tutoring LLM",
    requires_grad=True
)

# Wrap the model with the optimizable system prompt
model = tg.BlackboxLLM("gpt-3.5-turbo", system_prompt=system_prompt)
optimizer = tg.TGD(parameters=list(model.parameters()))

# Evaluate on a batch of questions
questions = [
    "Solve: 2x + 5 = 13",
    "What is the integral of e^(2x)?",
    "Factor: x^2 - 9x + 20"
]

for q in questions:
    question = tg.Variable(q, role_description="math question", requires_grad=False)
    answer = model(question)
    loss = tg.TextLoss(
        f"Question: {q}\nEvaluate the answer for correctness "
        "and clarity of explanation."
    )(answer)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(f"Optimized prompt: {system_prompt.value}")

Benchmark Results

TextGrad was evaluated across four domains: scientific reasoning, coding, molecular design, and medical treatment planning. The numbers come from the original paper using GPT-4o as both the forward and backward engine.


GPQA (Google-Proof Question Answering)

GPQA contains PhD-level science questions designed to resist simple search. GPT-4o scored 51% zero-shot. After 3 TextGrad optimization iterations with majority voting, accuracy rose to 55%. This was the best-known result on GPQA at the time of publication.

LeetCode Hard

TextGrad optimized code solutions by using test case results as feedback. The backward engine analyzed failing test cases and generated gradients suggesting specific code changes. This produced a 20% relative performance gain over the baseline, with a 36% overall completion rate on hard problems.

Molecule Optimization

Starting from simple molecular fragments (benzene), TextGrad optimized SMILES strings to improve binding affinity (Vina score) and druglikeness (QED score). Across 29 protein targets and 3 initial fragments, the optimized molecules matched the property distributions of clinically approved drugs in DrugBank. On 58 targets from the DOCKSTRING benchmark, significant improvement occurred at each optimization step.

Radiation Oncology

TextGrad optimized radiation treatment plans by adjusting beam parameters based on textual feedback about target coverage and organ sparing. The resulting plans achieved high target specificity without manual tuning by a medical physicist.

| Benchmark | Baseline | TextGrad Result | Improvement |
|---|---|---|---|
| GPQA (GPT-4o zero-shot) | 51% | 55% | +4 percentage points |
| LeetCode Hard | ~30% completion | 36% completion | +20% relative |
| Molecule Design (29 targets) | Starting fragments | DrugBank-level properties | Matched clinical distributions |
| Radiation Oncology | Manual planning | Auto-optimized plans | High target specificity |

TextGrad vs DSPy

TextGrad and DSPy are the two main frameworks for systematic prompt optimization. They share a goal (better LLM outputs through automated optimization) but differ in approach, scope, and ideal use case.

| Dimension | TextGrad | DSPy |
|---|---|---|
| Optimization Target | Individual instances at inference time | Prompt pipelines at compile time |
| Gradient Source | LLM-generated textual feedback | Bootstrapped few-shot examples from data |
| Backward Pass | Chain rule through text: LLM explains how to improve each variable | Rewards modules that contributed to correct final answers |
| Training Data Required | No (uses LLM as judge) | Yes (needs labeled examples for compilation) |
| Best For | Hard single-instance problems (PhD Q&A, code debugging) | Reusable, scalable prompt pipelines |
| API Style | PyTorch (Variables, backward, optimizer) | Declarative (Signatures, Modules, Compilers) |
| Cost Per Optimization | 3-12 LLM calls per instance per iteration | One compile pass, then fast inference |
| Published In | Nature (2025) | ICLR 2024 |
| Framework Maturity | Research-stage, stable API | Production-used, active community |

When to Use Which

  • Use TextGrad when: You have a hard problem instance (a specific coding challenge, a specific molecule to optimize, a specific question to answer) and want to squeeze maximum performance out of the LLM by iteratively refining the output
  • Use DSPy when: You are building a multi-step prompt pipeline that needs to work reliably across many inputs, and you have labeled training data to compile against
  • Use both when: Compile a pipeline with DSPy for the general case, then apply TextGrad at inference time for the hardest instances that need extra optimization

DSPy has seen broader production adoption because compiled pipelines run at normal inference cost after the initial compile step. TextGrad's per-instance cost makes it impractical for high-throughput production use, but unbeatable for high-stakes individual problems where each percentage point of accuracy matters.

TextGrad vs Manual Prompt Engineering

Manual prompt engineering is guesswork with fast feedback. You rewrite the prompt, test it on a few examples, observe the result, adjust, and repeat. The iteration is fast (minutes per cycle) but unsystematic. You optimize for the examples you test on, not for the distribution. Two prompt engineers given the same task produce different prompts with different failure modes.

TextGrad automates the feedback loop. The LLM evaluates its own output, generates specific critiques, and applies targeted improvements. Each iteration is more expensive (3+ LLM calls) but more principled. The optimization targets the actual evaluation criteria, not the engineer's intuition.

| Dimension | Manual Engineering | TextGrad |
|---|---|---|
| Iteration Speed | Minutes (human in the loop) | Seconds per step (fully automated) |
| Cost Per Iteration | Human time | 3-12 LLM API calls |
| Reproducibility | Low (depends on the engineer) | Higher (criteria are explicit, though LLM sampling adds variance) |
| Optimization Signal | Engineer intuition + spot-checking | Systematic LLM evaluation |
| Ceiling | Limited by human creativity | Limited by LLM judgment quality |
| Domain Expertise Needed | Yes (must understand the task) | Minimal (define evaluation criteria) |
| Scalability | One prompt at a time | Batch optimization across instances |

For most teams, the practical workflow is: start with manual engineering to get 80% of the way there, then apply TextGrad or DSPy to close the remaining gap systematically.

Practical Applications

Prompt Optimization

Optimize system prompts for specific tasks by defining evaluation criteria in natural language. TextGrad iteratively refines the prompt based on performance across test examples. No labeled dataset required.

Code Solution Refinement

Feed failing test cases as loss signal. TextGrad's backward engine analyzes error messages, identifies bugs, and generates targeted code patches. Achieved 36% completion on LeetCode Hard.

Molecular Design

Optimize SMILES strings for binding affinity and druglikeness. TextGrad interfaces with chemoinformatics tools (RDKit, AutoDock Vina) as external evaluators, using their scores as loss signal.

Multi-Step Reasoning

Build computation graphs with multiple reasoning steps. Each step is a node that receives targeted feedback. Useful for chain-of-thought optimization on math, science, and logic problems.

Medical Reasoning

Optimize clinical reasoning chains and treatment parameters. TextGrad has been applied to radiation oncology planning and medical Q&A tasks where correctness is safety-critical.

Agent Workflow Optimization

EvoAgentX (2025) integrates TextGrad with multi-agent system optimization, yielding up to 20% improvement on complex agentic tasks. Optimizes agent prompts and routing logic simultaneously.

Limitations

TextGrad is a research framework with real constraints. Understanding these helps you decide when to use it and when to pick alternatives.

Cost Scales with Depth

Each optimization step requires multiple LLM calls. For a computation graph with 5 nodes and 3 optimization iterations, you are looking at 15-45 LLM calls per problem instance. At GPT-5.4 pricing, a single optimization run on a hard problem can cost $0.50-$5.00. This makes TextGrad impractical for high-throughput production use.

Exploding Textual Gradients

In deep computation graphs, feedback accumulates across layers. Each backward step adds context to the gradient, growing prompt sizes. By layer 5+, the gradient prompt may exceed the backward engine's effective context window, triggering "lost in the middle" effects where critical corrections get buried.

Vanishing Textual Gradients

Compression strategies that try to manage growing gradients strip specificity. Early nodes in the graph receive diluted, generic feedback rather than targeted improvement suggestions. This limits effective optimization depth to 3-4 nodes in practice.

No Convergence Guarantees

Unlike numerical gradient descent with proven convergence for convex functions, textual gradients offer no formal convergence guarantees. The optimization may oscillate, overfit to the loss function's phrasing, or plateau. Setting a fixed iteration budget (3-5 steps) is standard practice.
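The standard guard is a fixed budget plus early stopping when the critique stops raising issues. A minimal sketch of that control flow, with `evaluate` and `improve` as stubs for the TextLoss call and the optimizer step (neither is the real API):

```python
def evaluate(answer: str) -> str:
    """Stub loss: return a critique; 'OK' means no issues found."""
    return "OK" if "x^2 cos x" in answer else "missing the x^2 cos x term"

def improve(answer: str, critique: str) -> str:
    """Stub optimizer step; a real step would rewrite per the critique."""
    return answer + " + x^2 cos x"

answer = "d/dx[x^2 sin x] = 2x sin x"
budget = 5  # fixed iteration budget, per the 3-5 step guideline above
for step in range(budget):
    critique = evaluate(answer)
    if critique == "OK":  # converged: critique raises no further issues
        break
    answer = improve(answer, critique)
print(step, answer)
```

The budget bounds cost even when the loop oscillates, and the early-stop check avoids paying for iterations after the critique plateaus.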

Feedback Misalignment

TextGrad sometimes assigns feedback to the wrong variable. If the loss function criticizes the final output, the backward engine may blame the wrong intermediate step, sending unproductive gradients to variables that were not the actual problem.

Mitigations

  • Keep graphs shallow: 2-3 nodes works well. Beyond 4, consider restructuring the computation.
  • Use strong backward engines: GPT-5.4 or Claude as the backward engine produces better gradients than smaller models.
  • Set iteration budgets: 3-5 steps typically captures most gains. Diminishing returns set in fast.
  • Cache aggressively: TextGrad supports caching via litellm. Reuse gradients for similar problems.

Accelerating Optimization Loops with Fast Infrastructure

TextGrad's bottleneck is LLM call latency. Each optimization step requires a full round-trip to the backward engine, and each variable update requires applying textual patches. Faster infrastructure directly reduces total optimization time.

Fast Apply for Variable Updates

When TextGrad's optimizer applies gradients to code or structured text, Morph's Fast Apply engine processes the patch at 10,500 tok/s. A variable update that takes 3 seconds with standard generation completes in under 500ms.

WarpGrep for Context Retrieval

TextGrad optimization over codebases needs relevant context at each step. WarpGrep's semantic search finds related code in 2-3 seconds across repositories of any size, feeding better context into the forward and backward passes.

For code optimization workflows, the combination matters: WarpGrep provides the context that makes gradients specific, and Fast Apply makes the update step near-instant. A 5-iteration TextGrad loop over a code solution drops from ~45 seconds to ~15 seconds with both in the pipeline.

Frequently Asked Questions

What is TextGrad?

TextGrad is an open-source Python framework from Stanford's Zou group that performs automatic "differentiation" via text. It uses LLMs to generate natural language feedback (textual gradients) that optimize text-based variables like prompts, code solutions, and molecular structures. The work was published in Nature.

How does TextGrad differ from DSPy?

DSPy optimizes entire prompt pipelines at compile time using training data. TextGrad optimizes individual instances at inference time using LLM-generated feedback. DSPy builds reusable pipelines. TextGrad squeezes maximum performance out of specific, hard problems. They are complementary: compile with DSPy, then refine hard cases with TextGrad.

What benchmarks has TextGrad achieved?

GPQA accuracy went from 51% to 55% (GPT-4o zero-shot). LeetCode Hard saw a 20% relative performance gain with 36% overall completion. Molecule optimization matched clinical drug distributions. Radiation oncology plans achieved high target specificity without manual tuning.

How much does it cost to run TextGrad?

Cost depends on the backward engine model and optimization steps. A 3-step optimization on a single problem typically costs $0.05 to $0.50 with GPT-5.4. Costs scale linearly with iteration count and computation graph depth. For batch optimization across many instances, costs add up fast.
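The scaling described above is simple multiplication. Here is a rough cost model; the token counts and per-token price in the example are placeholder assumptions for illustration, not actual GPT-5.4 pricing:

```python
def optimization_cost(iterations: int, calls_per_step: int,
                      tokens_per_call: int, usd_per_1k_tokens: float) -> float:
    """Estimate API cost for one TextGrad run: calls x tokens x unit price."""
    total_tokens = iterations * calls_per_step * tokens_per_call
    return total_tokens / 1000 * usd_per_1k_tokens

# Hypothetical numbers: 3 iterations, 4 calls/step, 2k tokens/call, $0.01/1k tokens
print(f"${optimization_cost(3, 4, 2000, 0.01):.2f}")  # $0.24
```

Doubling either the iteration count or the graph depth (which multiplies calls per step) doubles the cost, which is why batch optimization across many instances adds up fast.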

Can TextGrad optimize any type of text?

Any text where quality can be evaluated by an LLM. This includes system prompts, code, molecular SMILES strings, treatment plans, and multi-step reasoning chains. You define the evaluation criteria as a natural language loss function.

What LLMs does TextGrad support?

Any LLM supported by litellm: OpenAI (GPT-5.4, GPT-4), Anthropic (Claude), Google (Gemini), AWS Bedrock, Together AI, and more. You can use different models for forward passes and backward engines.

Is TextGrad production-ready?

For offline optimization tasks (improving prompts, refining solutions before deployment), yes. For real-time production systems with high throughput requirements, not yet. The per-instance cost (3-12 LLM calls per optimization) and lack of convergence guarantees make it better suited for research and offline optimization workflows.

What is Textual Gradient Descent (TGD)?

TGD is TextGrad's optimizer. After the backward pass generates textual gradients (natural language improvement suggestions), TGD applies those gradients to update the variable's text content. It is the text-space analog of SGD: instead of subtracting a scaled gradient from a weight, it rewrites the text based on the feedback.

Related Pages

Faster Optimization Loops for AI Systems

Morph's Fast Apply engine processes code patches at 10,500 tok/s, making TextGrad-style optimization loops 3x faster. WarpGrep provides semantic context retrieval for better gradient signals.