Every production coding agent compresses context. Claude Code auto-compacts at 95% capacity. OpenAI Codex runs server-side compaction after every turn. Cursor truncates old history. The methods differ, but the premise is shared: context fills up, so you compress it. This page compares eight approaches and explains why prevention beats compression.
The Problem: Why Coding Agents Need Compaction
A coding agent working on a real codebase reads files, greps for patterns, runs tests, reads error output, edits files, and reads more files. Each operation dumps tokens into the context window. A single file read can add 2,000-10,000 tokens. A grep that returns 50 matches adds 5,000-15,000 tokens. After 20-30 tool calls, the agent has consumed 100,000+ tokens, and most of it is noise: sections of files that weren't relevant, grep matches that didn't help, full-file rewrites that echoed 95% unchanged code back into context.
Cognition (the team behind Devin) measured this directly: agents spend 60% of their time searching for code. In token terms, a comparable share of what a typical session consumes is search results, most of which turn out to be irrelevant. Those tokens push the agent toward compaction faster, and when compaction fires, it summarizes away details the agent still needs.
The consequence is a loop: search fills context with noise, compaction summarizes away signal along with the noise, the agent re-searches to find what it lost, those results fill context again, and compaction fires sooner. Context rot research shows that LLM accuracy degrades as input length increases, even before the window is full. Noise doesn't just waste space. It actively makes the model worse at attending to the tokens that matter.
The two sources of context waste
Search waste: Reading entire files to find 10-line functions. Grep returns 500 lines to find 5 relevant ones. File listings of entire directories to locate one path.
Write waste: Full-file rewrites that echo 95% unchanged code back into context. A 3-line edit to a 200-line file consumes 200 lines of context, not 3.
Method 1: LLM Summarization (Claude Code, Cursor)
The most common approach. An LLM reads the full conversation history and produces a structured summary. Claude Code organizes this into sections: completed work, current state, pending tasks, file modifications, and key decisions. The summary replaces the full history, freeing context space.
How Claude Code Auto-Compact Works
Claude Code triggers auto-compact when context usage reaches approximately 95% of usable capacity (reported at roughly 167K tokens of the 200K window). The threshold was raised from ~77-78% in mid-2025, giving users roughly 12K additional tokens per cycle. When triggered, Claude:
- Analyzes the full conversation for key information
- Creates a structured summary with sections for completed work, decisions, and next steps
- Discards old tool outputs and intermediate reasoning
- Restarts the conversation with the compressed state
Users see a "context left until auto-compact" warning as the threshold approaches. After compaction, the agent continues with the summary as its only memory of previous work.
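The trigger itself is a simple threshold check. A minimal sketch, assuming a usable budget of ~167K tokens as described above (the names here are illustrative, not Claude Code internals):

```typescript
// Minimal sketch of an auto-compact trigger. The ~167K usable-token
// figure comes from the text above; everything else is illustrative.
const USABLE_CONTEXT_TOKENS = 167_000;

interface Turn {
  role: "user" | "assistant" | "tool";
  tokens: number;
}

function contextUsed(history: Turn[]): number {
  return history.reduce((sum, turn) => sum + turn.tokens, 0);
}

// Fires once the running total crosses the usable-capacity line,
// mirroring the "context left until auto-compact" warning behavior.
function shouldAutoCompact(history: Turn[]): boolean {
  return contextUsed(history) >= USABLE_CONTEXT_TOKENS;
}
```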
Claude Code: /compact vs /clear
/compact manually triggers context compression, summarizing the conversation while preserving key information. You can pass a custom prompt: /compact focus on the auth refactor. /clear wipes the entire conversation and starts fresh. Use /compact when you want to continue current work with less overhead. Use /clear when switching to a completely different task.
Strengths
- High compression ratio (70-90% reduction)
- Organized output with clear structure
- Preserves high-level intent and progress
- Human-readable result you can inspect
Weaknesses
- Paraphrasing corrupts code: src/lib/auth.ts:47 becomes "the auth module"
- Error messages get shortened or altered
- Line numbers, variable names, and exact values are lost
- Costs a full LLM inference call (slow, expensive)
- Triggers the re-reading loop: agent re-searches for details that were summarized away
What summarization loses
```
# BEFORE compaction (exact context):
Error at src/api/webhooks/stripe.ts:98
TypeError: Cannot read property 'retryCount' of undefined
subscription.metadata.retryCount is null when
customer has no prior failed payments.
Fix: null check at line 98. Update test at
test/webhooks.test.ts:156.

# AFTER LLM summarization:
"Found bug in Stripe webhook retry logic.
Need to add null check for subscription metadata."

→ Lost: file path, line number, exact error, test location
→ Agent will re-read the file to find the line again
```
Method 2: Opaque Compression (OpenAI Codex)
OpenAI's Codex uses server-side compression through the /responses/compact endpoint. The model produces a compressed representation that's not human-readable. You send the full conversation, get back a compressed blob, and pass that blob as context for the next turn.
Factory.ai's evaluation found that Codex's opaque compression achieves a remarkable 99.3% compression ratio, but scored 3.35/5 on information retention versus 3.70/5 for structured summarization. The compression is aggressive.
Strengths
- Extreme compression ratio (99.3%)
- No user-side implementation needed
- Built into the Codex API flow
Weaknesses
- Black box: cannot inspect what was preserved or lost
- Lower quality score than structured summarization (3.35 vs 3.70)
- Vendor lock-in to OpenAI's endpoint
- Same hallucination class as summarization (rewrites, doesn't preserve)
OpenAI's own recommendation
OpenAI's Codex team explicitly recommends compaction as a "default long-run primitive", not an emergency fallback. They run compaction continuously, not just at the context limit. This is the right instinct (compress early, not late), but the opaque method sacrifices inspectability.
Method 3: Verbatim Compaction (Morph Compact)
Verbatim compaction deletes low-signal tokens while keeping every surviving sentence word-for-word identical to the original. No paraphrasing. No rewriting. The output is a strict subset of the input.
Morph Compact uses a purpose-built model that scores each token for relevance to the task, then removes tokens below a threshold. The critical property: if a file path, error message, or code block survives compression, it's identical to the original. Zero hallucination risk in the compressed output.
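A toy sketch of the deletion-based approach. Morph Compact uses a trained relevance model to score tokens; a keyword-overlap heuristic stands in for it here, but the defining property is the same: every surviving line is copied verbatim, so nothing can be hallucinated.

```typescript
// Toy deletion-based compactor. Output lines are a strict subset of
// input lines, byte-for-byte identical — the property that gives
// verbatim compaction its zero-hallucination guarantee.
function verbatimCompact(context: string, taskKeywords: string[]): string {
  const lowered = taskKeywords.map((k) => k.toLowerCase());
  const keep = (line: string) =>
    lowered.some((k) => line.toLowerCase().includes(k)) ||
    /:\d+/.test(line); // always keep file:line references
  return context.split("\n").filter(keep).join("\n");
}
```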
Strengths
- Zero hallucination: every surviving token is identical to the original
- Preserves exact file paths, line numbers, error messages, code blocks
- Breaks the re-reading loop (no details are paraphrased away)
- Fast (3,300+ tok/s), cheap (no full LLM call)
- Inspectable output (you can diff against the original)
Weaknesses
- Lower compression ratio than summarization (50-70% vs 70-90%)
- No reorganization: output preserves original order, not structured sections
- Cannot infer progress or intent (only preserves what was explicitly stated)
Verbatim compaction preserves what matters
```
# BEFORE (1,200 tokens):
Tool output: Read file src/api/webhooks/stripe.ts (247 lines)
Lines 1-41: imports, config, types
Lines 42-89: handleSubscriptionUpdated() - working correctly
Lines 90-134: handlePaymentFailed() - has retry logic bug
Lines 135-180: handleInvoicePaid() - working correctly
Lines 181-247: handleRefund(), exports
Error at line 98: TypeError: Cannot read property 'retryCount'
of undefined. subscription.metadata.retryCount is null when
the customer has no prior failed payments.
Agent note: Need to add null check at line 98 before accessing
retryCount. Also update the test at test/webhooks.test.ts:156.

# AFTER verbatim compaction (480 tokens, 60% reduction):
Tool output: Read file src/api/webhooks/stripe.ts (247 lines)
Lines 90-134: handlePaymentFailed() - has retry logic bug
Error at line 98: TypeError: Cannot read property 'retryCount'
of undefined. subscription.metadata.retryCount is null when
the customer has no prior failed payments.
Agent note: Need to add null check at line 98 before accessing
retryCount. Also update the test at test/webhooks.test.ts:156.

→ Kept: exact file, line numbers, error, fix instructions, test path
→ Removed: irrelevant functions ("working correctly" sections)
```
Method 4: Token-Level Pruning (LLMlingua)
LLMlingua (Microsoft, EMNLP 2023) and LLMlingua-2 (ACL 2024) use a small language model to score each token's importance, then remove low-importance tokens. The idea: a cheap model (GPT-2 or BERT) identifies which tokens the expensive model actually needs, and you skip the rest.
How It Works
LLMlingua uses perplexity scoring from a small LM. Tokens that are highly predictable given their context (low perplexity) are candidates for removal, since the target LLM can reconstruct them. Tokens with high information content (high perplexity) are preserved. LLMlingua-2 replaced this with a BERT-based classifier trained via data distillation from GPT-4, achieving 3-6x faster compression.
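A crude illustration of the scoring idea, with within-text word frequency standing in for a small LM's perplexity (real LLMlingua operates on model token probabilities, not counts):

```typescript
// Toy perplexity-style pruner. Words that repeat are treated as
// "predictable" (low surprisal) and dropped first; rare words are
// treated as high-information and kept. Frequency is only a stand-in
// for a small LM's perplexity score.
function pruneByPredictability(text: string, keepRatio: number): string {
  const words = text.split(/\s+/).filter(Boolean);
  const freq = new Map<string, number>();
  for (const w of words) freq.set(w, (freq.get(w) ?? 0) + 1);

  // Rarer word => higher "surprisal" => more likely to survive.
  const scored = words.map((w, i) => ({ i, score: 1 / (freq.get(w) ?? 1) }));
  const keepCount = Math.max(1, Math.ceil(words.length * keepRatio));
  const kept = new Set(
    [...scored].sort((a, b) => b.score - a.score).slice(0, keepCount).map((s) => s.i)
  );
  return words.filter((_, i) => kept.has(i)).join(" ");
}
```

Note how this also demonstrates the code-corruption failure mode discussed below: a frequency proxy happily drops repeated brackets or keywords that syntax depends on.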
The Problem for Code
Token-level pruning was designed for natural language, not code. Code has different information density properties. A variable name like retryCount might look low-importance to a language model (it's just a word), but it's critical for the coding agent. Semicolons, brackets, and indentation are low-perplexity tokens that are essential for syntax. Removing them corrupts the code.
Strengths
- Extreme compression ratios (up to 20x)
- Uses cheap models (GPT-2, BERT) for scoring
- Good for natural language (documentation, READMEs, comments)
- Preserves high-information tokens
Weaknesses
- Corrupts code: removes syntactically critical tokens (brackets, keywords)
- Requires a separate model (GPT-2/BERT) running alongside the agent
- Not designed for multi-turn conversation context
- Compression quality degrades at extreme ratios
Method 5: Observation Masking (JetBrains Junie)
JetBrains tested a simpler approach for their Junie agent: after the model has processed a tool output, replace it with a placeholder. The model has already seen the file contents, grep results, or error output. It doesn't need them sitting in context occupying tokens for every subsequent turn.
On SWE-bench, observation masking matched full LLM summarization quality while costing nothing: no LLM call, no hallucination risk. Just replace processed observations with a placeholder like [output masked, 2,847 tokens].
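A minimal sketch of the masking step (keeping the last N tool outputs intact is an assumption of this sketch, not part of the published JetBrains setup):

```typescript
// Observation masking: tool outputs the model has already processed
// are replaced with a short placeholder. The keepLastN window is a
// heuristic — recent outputs are likely still being acted on.
interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
}

function maskOldObservations(history: Message[], keepLastN = 2): Message[] {
  const toolIdx = history.flatMap((m, i) => (m.role === "tool" ? [i] : []));
  const recent = new Set(toolIdx.slice(-keepLastN));
  return history.map((m, i) =>
    m.role === "tool" && !recent.has(i)
      ? { ...m, content: `[output masked, ${m.content.length} chars]` }
      : m
  );
}
```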
The ACON Extension
The ACON framework extends observation masking with adaptive control. Instead of masking all observations equally, ACON uses gradient-free optimization to learn which observations can be safely masked and which must be preserved. Result: 26-54% peak token reduction while preserving 95%+ task accuracy. It also achieved 20-46% performance improvement for smaller LMs, suggesting that removing noise actually helps the model focus.
Strengths
- Zero cost (no model call for compression)
- Zero hallucination risk
- Matched summarization quality on SWE-bench (JetBrains)
- Simple to implement
Weaknesses
- Lossy: if the agent needs to re-reference a masked observation, it must re-run the tool
- No selective preservation (entire observations are masked or kept)
- Requires careful tuning of when to mask vs. keep
Method 6: Selective Attention (Activation Beacon, SnapKV)
These methods modify the model's attention mechanism rather than the input. Activation Beacon condenses old context into compact "beacon" activations in the KV cache. SnapKV identifies which keys in the KV cache are important per attention head and evicts the rest.
The idea: instead of compressing the text, compress the model's internal representation of the text. The model keeps attending to important information while shedding the KV cache entries for irrelevant tokens.
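A toy version of the eviction idea, with plain objects standing in for per-head key/value tensors (real SnapKV runs inside the inference engine, not in application code):

```typescript
// SnapKV-style eviction sketch: keep the cache entries whose keys
// received the most attention mass, evict the rest, and preserve
// sequence order among survivors.
interface KVEntry {
  position: number;   // token position in the sequence
  attention: number;  // accumulated attention this key received
}

function evictKV(cache: KVEntry[], budget: number): KVEntry[] {
  return [...cache]
    .sort((a, b) => b.attention - a.attention) // most-attended first
    .slice(0, budget)                          // keep within budget
    .sort((a, b) => a.position - b.position);  // restore sequence order
}
```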
Strengths
- No changes to input text (preserves full context semantically)
- Can extend effective context well beyond the nominal window
- Works transparently without changing the agent loop
Weaknesses
- Requires model-level access: only works with self-hosted models
- Cannot use with API-based models (Claude, GPT-4, etc.)
- Complex implementation requiring custom inference code
- Quality degrades at extreme compression ratios
API-based models only
If you're using Claude, GPT-4, or any API-based model, selective attention methods are not available to you. They require modifying the model's inference engine, which is only possible with self-hosted or open-source models.
Method 7: Context Distillation
Context distillation takes a fundamentally different approach: instead of compressing context at runtime, train the model to internalize the context so it doesn't need to be included at all. Anthropic used this technique in Constitutional AI training: they distilled alignment principles into the model's weights rather than including them in every prompt.
For coding agents, this means fine-tuning the model on your codebase, coding standards, and architectural patterns so it doesn't need to read them from context every session. The context is already in the weights.
Strengths
- Zero runtime overhead (knowledge is in the weights)
- No compression artifacts or hallucination
- Model behaves as if it "knows" the codebase natively
Weaknesses
- Requires fine-tuning (expensive, slow, complex)
- Knowledge is frozen at training time (doesn't adapt to code changes)
- Only works for stable, slowly-changing context (coding standards, not active files)
- Not available for API-based models
Method 8: Subagent Isolation
Instead of compressing one agent's context, split the work across multiple agents with isolated context windows. Each subagent handles a focused subtask (search, test, refactor) and returns only its result to the parent. The parent never sees the intermediate search results, grep outputs, or debugging steps.
This is not compression in the traditional sense. It's context avoidance. The noise never enters the parent context because it stays in the subagent's isolated window. Anthropic reports 90% improvement in agent task completion when using multi-agent architectures versus single-agent approaches.
How WarpGrep Uses This
WarpGrep is a subagent optimized for codebase search. Instead of the main agent running 12 grep calls and dumping all results into its context, WarpGrep runs those searches in an isolated context window, reasons about the results, and returns only the relevant code snippets. The main agent gets 10 lines of relevant code instead of 500 lines of search results.
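The isolation boundary can be sketched as a function whose raw matches never escape its scope (runGrep is an injected stand-in for the real search tool, and the naive keyword filter here stands in for WarpGrep's trained reasoning):

```typescript
// Subagent isolation sketch: all raw matches live only inside this
// function. Only the distilled handful of lines crosses back to the
// parent agent's context.
async function searchSubagent(
  query: string,
  runGrep: (pattern: string) => Promise<string[]>
): Promise<string[]> {
  const rawMatches = await runGrep(query); // may be hundreds of lines
  const terms = query.toLowerCase().split(/\s+/);
  return rawMatches
    .filter((line) => terms.some((t) => line.toLowerCase().includes(t)))
    .slice(0, 10); // the parent sees at most a few relevant lines
}
```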
Strengths
- Parent context stays clean (noise never enters)
- Enables parallel execution
- Each subagent gets a fresh, focused context window
- 90% improvement in task completion (Anthropic)
Weaknesses
- Higher total token cost (multiple agent calls)
- Coordination overhead between agents
- Not a solution for within-session compression
Head-to-Head: All 8 Methods Compared
| Dimension | LLM Summary | Opaque | Verbatim | LLMlingua | Masking | Sel. Attn | Distillation | Subagents |
|---|---|---|---|---|---|---|---|---|
| Compression | 70-90% | 99% | 50-70% | Up to 20x | 40-70% | 3.5x KV | 50-70% | N/A |
| Hallucination | Medium | Medium | Zero | Low* | Zero | None | None | None |
| Code fidelity | Low | Unknown | High | Low | N/A | High | High | High |
| Speed | Slow | Variable | 3,300 t/s | Moderate | Instant | Inline | Offline | Parallel |
| Cost | High | Included | Low | Low | $0 | Infra | Training | Higher |
| API models? | Yes | OpenAI | Yes | Yes | Yes | No | No | Yes |
| Real-time? | Yes | Yes | Yes | Moderate | Yes | Yes | No | Yes |
*LLMlingua has low hallucination risk for prose but can corrupt code by removing syntactically critical tokens.
No single method wins on every dimension
Summarization compresses the most but hallucinates. Verbatim compaction preserves fidelity but compresses less. Masking is free but lossy. Selective attention is transparent but requires self-hosting. The right choice depends on your constraints. Or you avoid the choice entirely by preventing context waste in the first place.
FlashCompact: Prevention Over Compression
Every method above treats compaction as damage control: context fills up, so you compress it. FlashCompact asks a different question: what if the context didn't fill up in the first place?
The two largest sources of context waste in coding agents are search (reading entire files to find small sections) and writing (rewriting entire files to make small edits). FlashCompact eliminates both, then applies verbatim compaction to clean up whatever noise remains. Three components:
WarpGrep: Targeted Search
RL-trained subagent that returns only relevant code snippets. 0.73 F1 in 3.8 steps vs grep's 0.19 F1 in 12 steps. The main agent sees 10 lines of relevant code, not 500 lines of search results.
Fast Apply: Compact Diffs
10,500 tok/s code editing model that uses compact diffs instead of full file rewrites. A 3-line edit consumes 3 lines of context, not the entire 200-line file echoed back.
Morph Compact: Verbatim Cleanup
Deletion-based compaction at 3,300+ tok/s. 50-70% reduction with zero hallucination. Cleans up whatever noise remains after WarpGrep and Fast Apply have done their work.
Why Prevention Beats Compression
Compression has an unavoidable tradeoff: you lose information. Even the best methods sacrifice something. Summarization loses exact details. Verbatim compaction loses some context. Masking loses entire observations. The more you compress, the more you lose.
Prevention has no such tradeoff. When WarpGrep returns 10 relevant lines instead of 500 irrelevant ones, nothing is lost. The agent never needed those 490 lines. When Fast Apply uses a 200-token diff instead of a 2,000-token full rewrite, no information is sacrificed. The unchanged code was never relevant to context.
The math: a typical coding session runs 30-50 tool calls. Without FlashCompact, each tool call averages 3,000-5,000 tokens of context consumption (file reads, grep results, file writes). With WarpGrep and Fast Apply, each call averages 800-1,500 tokens. That's a 3-4x reduction in context consumption rate, which means auto-compact fires 3-4x less often. The agent completes more work per session, with better accuracy, because it never had to choose between detail and space.
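A back-of-envelope check of that claim, using midpoints of the per-call ranges quoted above:

```typescript
// Back-of-envelope check of the 3-4x claim, using midpoints of the
// per-call token ranges from the text above.
const toolCalls = 40;              // typical session: 30-50 tool calls
const baselinePerCall = 4_000;     // midpoint of 3,000-5,000 tokens/call
const flashCompactPerCall = 1_150; // midpoint of 800-1,500 tokens/call

const baselineTotal = toolCalls * baselinePerCall;       // 160,000 tokens
const optimizedTotal = toolCalls * flashCompactPerCall;  //  46,000 tokens
const reductionFactor = baselineTotal / optimizedTotal;  // ≈ 3.5x
```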
| Dimension | Traditional (compress after) | FlashCompact (prevent waste) |
|---|---|---|
| When it acts | After context fills (reactive) | Before context fills (proactive) |
| Information loss | Always (tradeoff is inherent) | None (noise never entered context) |
| Re-reading loop | Common (agent re-searches for lost details) | Rare (details were never lost) |
| Context lifetime | Baseline | 3-4x longer |
| Compaction events | Every 20-30 tool calls | Every 60-100 tool calls |
| Code fidelity | Depends on method | 100% (verbatim or never-entered) |
The Compound Effect
Each FlashCompact component helps independently. Together they compound. WarpGrep reduces search waste by ~80%. Fast Apply reduces write waste by ~90%. Morph Compact reduces remaining noise by 50-70%. The compound effect: a session that would normally hit auto-compact after 25 tool calls can run for 80-100 tool calls before needing compression. And when compression does fire, there's less noise to remove, so less information is lost.
Benchmarks
Factory.ai: 36K Real Engineering Messages
Factory.ai evaluated compression methods on 36,000 real software engineering messages from production sessions across debugging, code review, and feature implementation.
| Method | Overall Score | Information Retention | Code Fidelity |
|---|---|---|---|
| Factory Structured Summary | 3.70/5 | High | Medium (paraphrased) |
| OpenAI Compact (Codex) | 3.35/5 | Medium | Unknown (opaque) |
| Morph Verbatim Compaction | N/A* | Highest (verbatim) | Highest (exact) |
*Factory's evaluation measured summarization quality. Verbatim compaction optimizes for a different metric (fidelity over compression ratio), so direct comparison requires a fidelity-focused benchmark.
ACON: Adaptive Context Optimization
Tested on AppWorld, OfficeBench, and Multi-objective QA benchmarks:
- 26-54% peak token reduction
- 95%+ task accuracy preserved
- 20-46% performance improvement for smaller LMs (removing noise helps)
JetBrains: Masking vs Summarization
On SWE-bench, simple observation masking matched LLM summarization quality at zero cost. The expensive summarization call was unnecessary. This finding supports the FlashCompact thesis: if you can avoid putting noise into context in the first place, you don't need expensive methods to remove it later.
WarpGrep: Search Efficiency
| Metric | Standard Grep | WarpGrep |
|---|---|---|
| F1 Score | 0.19 | 0.73 |
| Steps per query | 12 | 3.8 |
| Tokens consumed | ~5,000-15,000 | ~500-1,500 |
| Context waste | ~80% irrelevant | ~10% irrelevant |
Integration Guide
FlashCompact components are available through standard interfaces: WarpGrep ships as an MCP server, while Fast Apply and Morph Compact use the OpenAI SDK format.
1. WarpGrep: Targeted code search (MCP Server)
Install the WarpGrep MCP server in your Claude Code settings or MCP config (the package name below is illustrative; check Morph's documentation for the current one):

```json
{
  "mcpServers": {
    "warpgrep": {
      "command": "npx",
      "args": ["-y", "@anthropic/warpgrep-mcp"],
      "env": { "MORPH_API_KEY": "your-key" }
    }
  }
}
```

WarpGrep replaces grep/ripgrep with targeted search. The agent calls warpgrep_codebase_search("retry logic in webhooks") and gets back the exact file, line numbers, and relevant code snippet, instead of 500 lines of grep matches across 30 files.

2. Fast Apply: Compact code diffs
```typescript
import OpenAI from "openai";

const morph = new OpenAI({
  apiKey: process.env.MORPH_API_KEY,
  baseURL: "https://api.morphllm.com/v1",
});

// Fast Apply: send only the diff, not the full file
const response = await morph.chat.completions.create({
  model: "morph-v3-fast",
  messages: [{
    role: "user",
    content: `Apply this edit to the file:
<<<< ORIGINAL
if (subscription.metadata.retryCount > 3) {
==== MODIFIED
if ((subscription.metadata?.retryCount ?? 0) > 3) {
>>>> END`,
  }],
});

// Returns the updated file. Agent context sees only the diff,
// not the entire 247-line file echoed back.
```

3. Morph Compact: Verbatim context cleanup
```typescript
import OpenAI from "openai";

const morph = new OpenAI({
  apiKey: process.env.MORPH_API_KEY,
  baseURL: "https://api.morphllm.com/v1",
});

// Compact long tool outputs inline
async function compactIfNeeded(content: string): Promise<string> {
  const tokens = content.split(/\s+/).length; // rough word-count estimate
  if (tokens < 500) return content; // short outputs pass through

  const response = await morph.chat.completions.create({
    model: "morph-compact",
    messages: [{ role: "user", content }],
  });
  return response.choices[0].message.content ?? content;
}

// In your agent loop:
for (const toolCall of pendingToolCalls) {
  const result = await executeTool(toolCall);
  const compacted = await compactIfNeeded(result.output);
  conversation.addToolResult(toolCall.id, compacted);
  // Agent context stays clean - only high-signal tokens
}
```

Frequently Asked Questions
What is context compaction in coding agents?
Context compaction reduces the number of tokens in an LLM's context window while preserving the information needed for correct output. Eight approaches exist: LLM summarization, opaque compression, verbatim compaction, token-level pruning (LLMlingua), observation masking, selective attention, context distillation, and subagent isolation. Each trades off compression ratio, fidelity, speed, and hallucination risk differently.
What is auto-compact in Claude Code?
Claude Code auto-compact triggers when context usage reaches roughly 95% of usable capacity (reported at about 167K tokens of the 200K window). Claude summarizes the conversation into structured sections and discards old tool outputs. The summarization is lossy: exact file paths, line numbers, and error messages may be paraphrased. Users see a "context left until auto-compact" warning as the threshold approaches.
What is the difference between /compact and /clear in Claude Code?
/compact manually triggers context compression, creating a summary while preserving key information. You can pass a custom prompt (/compact focus on auth work). /clear wipes everything and starts fresh. Use /compact to continue current work with less overhead. Use /clear when switching tasks entirely.
How does Morph FlashCompact work?
FlashCompact prevents context waste rather than compressing it after the fact. Three components: WarpGrep returns only relevant code snippets instead of entire files (0.73 F1 in 3.8 steps). Fast Apply uses compact diffs instead of full file rewrites (10,500 tok/s). Morph Compact cleans up remaining noise with verbatim deletion (50-70% reduction, zero hallucination). Together they extend context life by 3-4x.
What is LLMlingua?
LLMlingua is Microsoft's prompt compression framework that uses a small language model to score token importance and remove low-importance tokens. It achieves up to 20x compression on natural language. LLMlingua-2 uses a BERT-based classifier for 3-6x faster compression. Both struggle with code (removing syntactically critical tokens corrupts programs).
Does context compression cause hallucinations?
Summarization-based compression (Claude Code, Codex) rewrites context, which can alter code, file paths, and error messages. This is a form of hallucination. Verbatim compaction and observation masking avoid this by deleting tokens rather than rewriting. Token-level pruning (LLMlingua) can corrupt code syntax. Selective attention and context distillation operate on model internals and don't change the input text.
Which context compaction method is best for coding agents?
Prevention beats compression. Morph FlashCompact (WarpGrep + Fast Apply + Morph Compact) extends context life by 3-4x by eliminating waste at the source. When compression is unavoidable, verbatim compaction preserves code fidelity better than summarization. Observation masking is the cheapest option. Compaction vs summarization depends on whether you need structured progress tracking (summarization) or exact code preservation (compaction).
Can I disable auto-compact in Claude Code?
You cannot fully disable auto-compact. You can delay it by reducing context waste: use WarpGrep instead of reading entire files, use Fast Apply for compact diffs, and run /compact manually at 50-60% capacity to maintain quality. You can also customize the compact prompt in Claude Code settings to preserve specific types of information.
Stop Compressing. Start Preventing.
FlashCompact extends your agent's context life by 3-4x. WarpGrep for targeted search, Fast Apply for compact diffs, Morph Compact for verbatim cleanup. Three tools, one goal: auto-compact fires less often because context waste never happens.