What TextGrad Does
Neural networks learn by backpropagating numerical gradients through differentiable functions. TextGrad applies the same principle to text. It builds a computation graph where nodes are text variables and edges are LLM calls, then backpropagates natural language feedback to optimize those variables.
The framework was built by James Zou's group at Stanford (Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, and Carlos Guestrin). It was published in Nature in 2025, following an arXiv preprint in June 2024. The open-source implementation mirrors PyTorch's API: if you can write a training loop, you can use TextGrad.
The key insight: if an LLM can critique an output and explain what is wrong, that critique is a usable gradient. "The answer is incorrect because it confuses velocity with acceleration in step 3" tells the optimizer exactly what to change. This is richer signal than a scalar loss. Traditional optimization gets a number. TextGrad gets a paragraph.
Core Abstraction
TextGrad maps PyTorch concepts to text optimization:
- torch.Tensor → tg.Variable: Holds text content that can be optimized
- nn.Module → tg.BlackboxLLM: Wraps any LLM call as a differentiable function
- loss function → tg.TextLoss: Natural language evaluation criteria
- loss.backward() → loss.backward(): LLM generates textual gradients
- SGD → TGD: Textual Gradient Descent applies feedback to update variables
How It Works
TextGrad constructs a computation graph at runtime, tracking which variables flow through which LLM calls. When you call loss.backward(), it walks the graph in reverse, asking an LLM (the "backward engine") to generate textual gradients at each node.
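The reverse walk can be sketched with a toy graph and a stubbed critique function. This is an illustrative sketch, not TextGrad's implementation: in the real framework the critique comes from the backward-engine LLM, and `critique` here is a hypothetical stand-in.

```python
# Toy sketch of the backward walk: visit nodes in reverse order and attach
# a textual gradient to each. `critique` stands in for the backward-engine
# LLM call.

def critique(node_text, downstream_feedback):
    # Stand-in for the backward-engine LLM call.
    return f"To improve '{node_text}': address -> {downstream_feedback}"

def backward(nodes, loss_feedback):
    """nodes: node texts in forward order; returns a textual gradient per node."""
    gradients = {}
    feedback = loss_feedback
    for node in reversed(nodes):       # walk the graph in reverse
        gradients[node] = critique(node, feedback)
        feedback = gradients[node]     # context accumulates as feedback flows back
    return gradients

grads = backward(["draft plan", "final answer"], "answer omits edge cases")
```

Note how the feedback string grows as it moves backward: this accumulation is what makes gradients specific, and also what causes the prompt-growth problems discussed under Limitations.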
Variables
A tg.Variable holds a string and metadata. Setting requires_grad=True marks it as optimizable. The role_description tells the backward engine what this variable represents (e.g., "system prompt for a math tutor" or "Python solution to a coding problem").
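A minimal sketch of what such a variable carries (illustrative only; the real `tg.Variable` class has a larger API than this):

```python
# Illustrative sketch of a text variable: content, a role description for
# the backward engine, an optimizable flag, and accumulated textual
# gradients. Not the actual TextGrad class.
from dataclasses import dataclass, field

@dataclass
class Variable:
    value: str
    role_description: str
    requires_grad: bool = False
    gradients: list = field(default_factory=list)  # textual feedback lands here

prompt = Variable(
    "You are a helpful math tutor.",
    role_description="system prompt for a math tutor",
    requires_grad=True,
)
prompt.gradients.append("Add an instruction to show intermediate steps.")
```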
Loss Functions
tg.TextLoss takes a natural language instruction describing how to evaluate quality. Unlike numerical loss functions that output a scalar, TextLoss outputs a text critique. "Evaluate whether this code handles edge cases correctly and produces the right output for the given test cases" is a valid loss function.
Textual Gradients
The backward engine (typically a strong model such as GPT-4o or Claude) reads the loss output and generates improvement suggestions for each variable in the computation graph. These suggestions are the textual gradients. They flow backward through the graph, accumulating context at each node, analogous to the chain rule in calculus.
The Optimizer (TGD)
Textual Gradient Descent takes the accumulated gradients and applies them to update variables. The optimizer prompt instructs the LLM to revise the variable's content based on the feedback while preserving aspects that work well. This is the analog of a weight update step.
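The update step can be sketched as a prompt-and-rewrite cycle. Everything here is a hypothetical stand-in (`call_llm`, the prompt wording): a real TGD step sends a revision prompt to an LLM and replaces the variable's content with the rewrite.

```python
# Sketch of a TGD-style update: build a revision prompt from the variable's
# content and its accumulated gradients, then replace the content with the
# engine's rewrite. `call_llm` is a stub for the optimizer's LLM call.

def call_llm(prompt):
    # Stand-in: a real optimizer engine returns a full rewrite; this stub
    # just tags the final line of the prompt.
    return "[revised] " + prompt.splitlines()[-1]

def tgd_step(value, role, gradients):
    prompt = (
        f"You are improving a {role}. Preserve what already works.\n"
        "Feedback to incorporate:\n"
        + "\n".join(f"- {g}" for g in gradients)
        + f"\nRewrite this content:\n{value}"
    )
    return call_llm(prompt)

new_value = tgd_step("Show your work.", "system prompt",
                     ["Ask the student for units in final answers."])
```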
Architecture: The Optimization Loop
1. Forward Pass: Input variables flow through LLM calls to produce outputs
2. Loss Computation: TextLoss evaluates the output using natural language criteria, producing a critique
3. Backward Pass: The backward engine generates textual gradients at each node by asking "how should this input change to improve the output?"
4. Variable Update: TGD applies the gradients, rewriting the variable text based on the accumulated feedback
5. Repeat: Steps 1-4 iterate until convergence or a fixed budget is exhausted
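The steps above can be sketched end-to-end with stubbed engines so the loop structure is visible without API calls. In real TextGrad each stub is an LLM call (forward model, TextLoss, backward engine, TGD optimizer); the function bodies here are placeholders.

```python
def forward(variable):                 # step 1: forward pass
    return f"output derived from: {variable}"

def text_loss(output):                 # step 2: natural-language critique
    return f"the output ({output}) should be more specific"

def backward_pass(variable, critique): # step 3: textual gradient
    return f"to improve '{variable}': {critique}"

def tgd_update(variable, gradient):    # step 4: apply the gradient
    return variable + " [+specificity]"

variable = "initial draft"
for step in range(3):                  # step 5: iterate under a fixed budget
    output = forward(variable)
    critique = text_loss(output)
    gradient = backward_pass(variable, critique)
    variable = tgd_update(variable, gradient)
```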
The Optimization Loop in Detail
A single TextGrad optimization step involves at least three LLM calls: one for the forward pass (generating the output), one for the loss evaluation (critiquing it), and one for gradient generation (producing improvement suggestions). The optimizer step may require a fourth call. For a typical 3-iteration optimization, that is 9-12 LLM calls per problem instance.
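The call count follows directly from the loop structure; a quick sanity check:

```python
# LLM calls per optimization run: each iteration needs a forward call, a
# loss call, and a gradient call, plus optionally one optimizer call.
def total_calls(iterations, optimizer_call=True):
    per_iteration = 3 + (1 if optimizer_call else 0)
    return iterations * per_iteration
```

For three iterations this gives the 9-12 range quoted above, depending on whether the optimizer step needs its own call.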
| Step | PyTorch Analog | TextGrad Implementation |
|---|---|---|
| Forward | model(x) | BlackboxLLM generates text output from input variables |
| Loss | criterion(output, target) | TextLoss evaluates output against natural language criteria |
| Backward | loss.backward() | Backward engine LLM generates textual improvement suggestions |
| Update | optimizer.step() | TGD rewrites variable text based on accumulated gradients |
| Zero Grad | optimizer.zero_grad() | Clear previous gradients before next iteration |
The computation graph can be arbitrarily deep. A multi-step reasoning chain might have 5 nodes, each representing an LLM call. Gradients propagate backward through all of them. In practice, graphs deeper than 3-4 nodes start hitting diminishing returns due to gradient degradation.
Code Example
Here is a minimal TextGrad program that optimizes an answer to a question. The forward model generates an initial answer. TextLoss evaluates it. The backward pass produces textual gradients. TGD applies the gradients to improve the answer.
Answer Optimization with TextGrad
```python
import textgrad as tg

# Configure engines
tg.set_backward_engine("gpt-4o", override=True)
model = tg.BlackboxLLM("gpt-4o")

# Define the question (not optimizable)
question = tg.Variable(
    "What is the derivative of x^2 * sin(x)?",
    role_description="question to answer",
    requires_grad=False
)

# Get initial answer (optimizable)
answer = model(question)
answer.set_role_description("answer to the math question")
answer.requires_grad = True

# Define loss function with natural language criteria
loss_fn = tg.TextLoss(
    "Evaluate this answer for mathematical correctness. "
    "Check each step of the derivation. "
    "Identify any errors in applying the product rule."
)

# Optimization loop
optimizer = tg.TGD(parameters=[answer])
for i in range(3):
    loss = loss_fn(answer)
    loss.backward()        # Generate textual gradients
    optimizer.step()       # Apply gradients to update answer
    optimizer.zero_grad()  # Clear gradients for next step
    print(f"Step {i+1}: {answer.value[:100]}...")
```

For prompt optimization (improving a system prompt rather than a single answer), the pattern is similar but the optimizable variable is the system prompt itself:
System Prompt Optimization
```python
import textgrad as tg

tg.set_backward_engine("gpt-4o", override=True)

# System prompt is the variable being optimized
system_prompt = tg.Variable(
    "You are a helpful math tutor. Show your work step by step.",
    role_description="system prompt for a math tutoring LLM",
    requires_grad=True
)

# Wrap the model with the optimizable system prompt
model = tg.BlackboxLLM("gpt-3.5-turbo", system_prompt=system_prompt)
optimizer = tg.TGD(parameters=list(model.parameters()))

# Evaluate on a batch of questions
questions = [
    "Solve: 2x + 5 = 13",
    "What is the integral of e^(2x)?",
    "Factor: x^2 - 9x + 20"
]
for q in questions:
    question = tg.Variable(q, role_description="math question", requires_grad=False)
    answer = model(question)
    loss_fn = tg.TextLoss(
        f"Question: {q}\nEvaluate the answer for correctness "
        "and clarity of explanation."
    )
    loss = loss_fn(answer)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(f"Optimized prompt: {system_prompt.value}")
```

Benchmark Results
TextGrad was evaluated across four domains: scientific reasoning, coding, molecular design, and medical treatment planning. The numbers come from the original paper using GPT-4o as both the forward and backward engine.
GPQA (Google-Proof Question Answering)
GPQA contains PhD-level science questions designed to resist simple search. GPT-4o scored 51% zero-shot. After 3 TextGrad optimization iterations with majority voting, accuracy rose to 55%. This was the best-known result on GPQA at the time of publication.
LeetCode Hard
TextGrad optimized code solutions by using test case results as feedback. The backward engine analyzed failing test cases and generated gradients suggesting specific code changes. This produced a 20% relative performance gain over the baseline, with a 36% overall completion rate on hard problems.
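The tests-as-loss pattern can be sketched without any LLM in the loop: run the candidate against test cases and turn failures into the textual critique that the backward engine would consume. The `candidate` function here is a hypothetical buggy solution, not from the paper.

```python
# Sketch of test results as loss signal: failing cases become a textual
# critique for the backward engine to turn into code-change suggestions.

def run_tests(solution_fn, cases):
    failures = []
    for args, expected in cases:
        got = solution_fn(*args)
        if got != expected:
            failures.append(f"input={args}: expected {expected}, got {got}")
    return failures

def loss_from_tests(failures):
    if not failures:
        return "All tests pass."
    return "Failing cases:\n" + "\n".join(failures)

def candidate(x):
    # Hypothetical buggy solution: should return x squared.
    return x * 2

critique = loss_from_tests(run_tests(candidate, [((2,), 4), ((3,), 9)]))
```

The resulting critique names the exact failing input and the expected/actual values, which is what makes the gradient specific enough to suggest a concrete patch.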
Molecule Optimization
Starting from simple molecular fragments (benzene), TextGrad optimized SMILES strings to improve binding affinity (Vina score) and druglikeness (QED score). Across 29 protein targets and 3 initial fragments, the optimized molecules matched the property distributions of clinically approved drugs in DrugBank. On 58 targets from the DOCKSTRING benchmark, significant improvement occurred at each optimization step.
Radiation Oncology
TextGrad optimized radiation treatment plans by adjusting beam parameters based on textual feedback about target coverage and organ sparing. The resulting plans achieved high target specificity without manual tuning by a medical physicist.
| Benchmark | Baseline | TextGrad Result | Improvement |
|---|---|---|---|
| GPQA (GPT-4o zero-shot) | 51% | 55% | +4 percentage points |
| LeetCode Hard | ~30% completion | 36% completion | +20% relative |
| Molecule Design (29 targets) | Starting fragments | DrugBank-level properties | Matched clinical distributions |
| Radiation Oncology | Manual planning | Auto-optimized plans | High target specificity |
TextGrad vs DSPy
TextGrad and DSPy are the two main frameworks for systematic prompt optimization. They share a goal (better LLM outputs through automated optimization) but differ in approach, scope, and ideal use case.
| Dimension | TextGrad | DSPy |
|---|---|---|
| Optimization Target | Individual instances at inference time | Prompt pipelines at compile time |
| Gradient Source | LLM-generated textual feedback | Bootstrapped few-shot examples from data |
| Backward Pass | Chain rule through text: LLM explains how to improve each variable | Rewards modules that contributed to correct final answers |
| Training Data Required | No (uses LLM as judge) | Yes (needs labeled examples for compilation) |
| Best For | Hard single-instance problems (PhD Q&A, code debugging) | Reusable, scalable prompt pipelines |
| API Style | PyTorch (Variables, backward, optimizer) | Declarative (Signatures, Modules, Compilers) |
| Cost Per Optimization | 3-12 LLM calls per instance per iteration | One compile pass, then fast inference |
| Published In | Nature (2025) | ICLR 2024 |
| Framework Maturity | Research-stage, stable API | Production-used, active community |
When to Use Which
- Use TextGrad when: You have a hard problem instance (a specific coding challenge, a specific molecule to optimize, a specific question to answer) and want to squeeze maximum performance out of the LLM by iteratively refining the output
- Use DSPy when: You are building a multi-step prompt pipeline that needs to work reliably across many inputs, and you have labeled training data to compile against
- Use both when: Compile a pipeline with DSPy for the general case, then apply TextGrad at inference time for the hardest instances that need extra optimization
DSPy has seen broader production adoption because compiled pipelines run at normal inference cost after the initial compile step. TextGrad's per-instance cost makes it impractical for high-throughput production use, but unbeatable for high-stakes individual problems where each percentage point of accuracy matters.
TextGrad vs Manual Prompt Engineering
Manual prompt engineering is guesswork with fast feedback. You rewrite the prompt, test it on a few examples, observe the result, adjust, and repeat. The iteration is fast (minutes per cycle) but unsystematic. You optimize for the examples you test on, not for the distribution. Two prompt engineers given the same task produce different prompts with different failure modes.
TextGrad automates the feedback loop. The LLM evaluates its own output, generates specific critiques, and applies targeted improvements. Each iteration is more expensive (3+ LLM calls) but more principled. The optimization targets the actual evaluation criteria, not the engineer's intuition.
| Dimension | Manual Engineering | TextGrad |
|---|---|---|
| Iteration Speed | Minutes (human in the loop) | Seconds per step (fully automated) |
| Cost Per Iteration | Human time | 3-12 LLM API calls |
| Reproducibility | Low (depends on the engineer) | High (automated; same criteria applied each run) |
| Optimization Signal | Engineer intuition + spot-checking | Systematic LLM evaluation |
| Ceiling | Limited by human creativity | Limited by LLM judgment quality |
| Domain Expertise Needed | Yes (must understand the task) | Minimal (define evaluation criteria) |
| Scalability | One prompt at a time | Batch optimization across instances |
For most teams, the practical workflow is: start with manual engineering to get 80% of the way there, then apply TextGrad or DSPy to close the remaining gap systematically.
Practical Applications
Prompt Optimization
Optimize system prompts for specific tasks by defining evaluation criteria in natural language. TextGrad iteratively refines the prompt based on performance across test examples. No labeled dataset required.
Code Solution Refinement
Feed failing test cases as loss signal. TextGrad's backward engine analyzes error messages, identifies bugs, and generates targeted code patches. Achieved 36% completion on LeetCode Hard.
Molecular Design
Optimize SMILES strings for binding affinity and druglikeness. TextGrad interfaces with chemoinformatics tools (RDKit, AutoDock Vina) as external evaluators, using their scores as loss signal.
Multi-Step Reasoning
Build computation graphs with multiple reasoning steps. Each step is a node that receives targeted feedback. Useful for chain-of-thought optimization on math, science, and logic problems.
Medical Reasoning
Optimize clinical reasoning chains and treatment parameters. TextGrad has been applied to radiation oncology planning and medical Q&A tasks where correctness is safety-critical.
Agent Workflow Optimization
EvoAgentX (2025) integrates TextGrad with multi-agent system optimization, yielding up to 20% improvement on complex agentic tasks. Optimizes agent prompts and routing logic simultaneously.
Limitations
TextGrad is a research framework with real constraints. Understanding these helps you decide when to use it and when to pick alternatives.
Cost Scales with Depth
Each optimization step requires multiple LLM calls. For a computation graph with 5 nodes and 3 optimization iterations, you are looking at 15-45 LLM calls per problem instance. At GPT-4o pricing, a single optimization run on a hard problem can cost $0.50-$5.00. This makes TextGrad impractical for high-throughput production use.
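A back-of-envelope model makes the scaling explicit. The token count and per-token price below are illustrative assumptions, not published rates:

```python
# Rough cost model for one optimization run: calls scale with graph depth
# and iteration count. Token volume and price per 1k tokens are assumed.
def run_cost(nodes, iterations, calls_per_node=3,
             tokens_per_call=2000, usd_per_1k_tokens=0.01):
    calls = nodes * iterations * calls_per_node
    return calls, calls * tokens_per_call / 1000 * usd_per_1k_tokens

calls, usd = run_cost(nodes=5, iterations=3)  # the 45-call case from the text
```

Doubling either graph depth or iteration count doubles the bill, which is why shallow graphs and fixed budgets are the standard mitigations.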
Exploding Textual Gradients
In deep computation graphs, feedback accumulates across layers. Each backward step adds context to the gradient, growing prompt sizes. By layer 5+, the gradient prompt may exceed the backward engine's effective context window, triggering "lost in the middle" effects where critical corrections get buried.
Vanishing Textual Gradients
Compression strategies that try to manage growing gradients strip specificity. Early nodes in the graph receive diluted, generic feedback rather than targeted improvement suggestions. This limits effective optimization depth to 3-4 nodes in practice.
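Both failure modes can be seen in a toy illustration: the gradient prompt grows as each layer adds context (exploding), and naive truncation to control that growth strips the specific correction (vanishing). This is purely illustrative; real mitigations summarize rather than truncate, but lose specificity in the same way.

```python
# Toy illustration of gradient growth and lossy compression.
grad = "step 3 confuses velocity with acceleration"
for layer in range(5):
    # Each backward hop prepends layer context to the feedback.
    grad = "[upstream layer context] " * 20 + "\n" + grad

compressed = grad[:200]              # crude compression before the next hop
fix_survives = "velocity" in compressed
```

After five layers the specific fix is buried at the end of a long prompt, and the compressed version no longer contains it at all.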
No Convergence Guarantees
Unlike numerical gradient descent with proven convergence for convex functions, textual gradients offer no formal convergence guarantees. The optimization may oscillate, overfit to the loss function's phrasing, or plateau. Setting a fixed iteration budget (3-5 steps) is standard practice.
Feedback Misalignment
TextGrad sometimes assigns feedback to the wrong variable. If the loss function criticizes the final output, the backward engine may blame the wrong intermediate step, sending unproductive gradients to variables that were not the actual problem.
Mitigations
- Keep graphs shallow: 2-3 nodes works well. Beyond 4, consider restructuring the computation.
- Use strong backward engines: GPT-4o or Claude as the backward engine produces better gradients than smaller models.
- Set iteration budgets: 3-5 steps typically captures most gains. Diminishing returns set in fast.
- Cache aggressively: TextGrad supports caching via litellm. Reuse gradients for similar problems.
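The caching idea can be sketched with `functools.lru_cache` and a stubbed engine (TextGrad's own caching sits at the litellm layer; `backward_engine` here is a hypothetical stand-in):

```python
# Memoize the backward-engine call on its exact prompt so repeated
# subproblems reuse prior feedback instead of paying for a new LLM call.
from functools import lru_cache

llm_calls = {"count": 0}

@lru_cache(maxsize=1024)
def backward_engine(prompt):
    llm_calls["count"] += 1            # counts real (non-cached) invocations
    return f"gradient for: {prompt}"

backward_engine("improve step 3 of the proof")
backward_engine("improve step 3 of the proof")  # identical prompt: cache hit
```

Exact-match caching only helps when prompts repeat verbatim; for near-duplicate problems, semantic caching is needed, at the cost of occasionally serving stale gradients.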
Accelerating Optimization Loops with Fast Infrastructure
TextGrad's bottleneck is LLM call latency. Each optimization step requires a full round-trip to the backward engine, and each variable update requires applying textual patches. Faster infrastructure directly reduces total optimization time.
Fast Apply for Variable Updates
When TextGrad's optimizer applies gradients to code or structured text, Morph's Fast Apply engine processes the patch at 10,500 tok/s. A variable update that takes 3 seconds with standard generation completes in under 500ms.
WarpGrep for Context Retrieval
TextGrad optimization over codebases needs relevant context at each step. WarpGrep's semantic search finds related code in 2-3 seconds across repositories of any size, feeding better context into the forward and backward passes.
For code optimization workflows, the combination matters: WarpGrep provides the context that makes gradients specific, and Fast Apply makes the update step near-instant. A 5-iteration TextGrad loop over a code solution drops from ~45 seconds to ~15 seconds with both in the pipeline.
Frequently Asked Questions
What is TextGrad?
TextGrad is an open-source Python framework from Stanford's Zou group that performs automatic "differentiation" via text. It uses LLMs to generate natural language feedback (textual gradients) that optimize text-based variables like prompts, code solutions, and molecular structures. The work was published in Nature.
How does TextGrad differ from DSPy?
DSPy optimizes entire prompt pipelines at compile time using training data. TextGrad optimizes individual instances at inference time using LLM-generated feedback. DSPy builds reusable pipelines. TextGrad squeezes maximum performance out of specific, hard problems. They are complementary: compile with DSPy, then refine hard cases with TextGrad.
What benchmarks has TextGrad achieved?
GPQA accuracy went from 51% to 55% (GPT-4o zero-shot). LeetCode Hard saw a 20% relative performance gain with 36% overall completion. Molecule optimization matched clinical drug distributions. Radiation oncology plans achieved high target specificity without manual tuning.
How much does it cost to run TextGrad?
Cost depends on the backward engine model and the number of optimization steps. A 3-step optimization on a single problem typically costs $0.05 to $0.50 with GPT-4o. Costs scale linearly with iteration count and computation graph depth. For batch optimization across many instances, costs add up fast.
Can TextGrad optimize any type of text?
Any text where quality can be evaluated by an LLM. This includes system prompts, code, molecular SMILES strings, treatment plans, and multi-step reasoning chains. You define the evaluation criteria as a natural language loss function.
What LLMs does TextGrad support?
Any LLM supported by litellm: OpenAI (GPT-4o, GPT-4), Anthropic (Claude), Google (Gemini), AWS Bedrock, Together AI, and more. You can use different models for forward passes and backward engines.
Is TextGrad production-ready?
For offline optimization tasks (improving prompts, refining solutions before deployment), yes. For real-time production systems with high throughput requirements, not yet. The per-instance cost (3-12 LLM calls per optimization) and lack of convergence guarantees make it better suited for research and offline optimization workflows.
What is Textual Gradient Descent (TGD)?
TGD is TextGrad's optimizer. After the backward pass generates textual gradients (natural language improvement suggestions), TGD applies those gradients to update the variable's text content. It is the text-space analog of SGD: instead of subtracting a scaled gradient from a weight, it rewrites the text based on the feedback.
Related Pages
Faster Optimization Loops for AI Systems
Morph's Fast Apply engine processes code patches at 10,500 tok/s, making TextGrad-style optimization loops 3x faster. WarpGrep provides semantic context retrieval for better gradient signals.