Coding agents burn through context windows fast. Every file read, every grep result, every full-file rewrite pushes the agent closer to the compaction threshold. When compaction fires, the question is: does the agent rewrite its memory (risking hallucination) or delete the noise and keep the signal intact? Context compaction takes the second path.
What Is Context Compaction
Context compaction is a deletion-based approach to reducing the number of tokens in an LLM's context window. Instead of rewriting content into a summary or encoding it into an opaque format, compaction identifies low-signal content and removes it. What survives is unchanged from the original input.
The term has become overloaded. Anthropic uses "compaction" to describe their server-side summarization API, which actually generates new text (a summary). In this guide, we use compaction in its strict sense: deletion-based reduction where no new tokens are generated.
The core property that makes compaction valuable for coding agents is verbatim fidelity. A file path like src/api/webhooks/stripe.ts:98 survives compaction exactly as written, or it gets deleted entirely. It never becomes "the webhook handler" or "the Stripe file." For agents that need to navigate codebases, edit specific lines, and match exact error strings, this property is more important than raw compression ratio.
Compaction in one sentence
Compaction removes tokens. Summarization rewrites tokens. Compression encodes tokens. Only compaction guarantees that surviving output is identical to the input.
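The difference is easy to demonstrate with the Python standard library. In this toy sketch, zlib stands in for opaque compression and a line filter stands in for deletion-based compaction; the deleted line is chosen by hand here, whereas a real compactor scores relevance:

```python
import zlib

context = """User: Fix the rate limiting bug
[read_file src/middleware/rateLimit.ts]
[read_file src/middleware/cors.ts]
Agent: Found the issue on line 47"""

# Compression: an encoded representation. You cannot read,
# grep, or diff the result without decompressing it first.
compressed = zlib.compress(context.encode())

# Compaction: delete one low-signal line (chosen by hand here).
# Everything that survives is character-for-character from the input.
lines = context.splitlines()
compacted = [line for line in lines if "cors.ts" not in line]

assert all(line in lines for line in compacted)  # strict subset of the input
```

The assertion at the end is the defining property: compacted output is always a subset of the input lines, which no summarizer can guarantee.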
Compaction vs Compression vs Summarization
These three terms get used interchangeably in LLM tooling, but they describe fundamentally different operations with different trade-offs. The distinction matters because each approach fails differently when applied to code.
| Property | Compaction (Deletion) | Compression (Encoding) | Summarization (Rewriting) |
|---|---|---|---|
| Mechanism | Delete low-signal tokens | Encode into opaque format | LLM rewrites into summary |
| Compression Ratio | 50-70% | 99.3% (OpenAI) | 70-90% |
| Hallucination Risk | None | Unknown (opaque) | Moderate to high |
| Output Inspectable | Yes, fully | No | Yes, but rewritten |
| Verbatim Fidelity | 98% (Morph) | N/A | Low |
| File Paths Preserved | Exact or deleted | Unknown | Often paraphrased |
| Error Messages Preserved | Exact or deleted | Unknown | Often shortened |
| Vendor Lock-in | None | Full (OpenAI only) | None |
| Speed | 3,300+ tok/s | Server-side | Requires full LLM call |
| Portable Across Providers | Yes | No | Yes |
For a deeper comparison including benchmark scores from Factory.ai's evaluation, see our compaction vs summarization guide.
Why Coding Agents Need Verbatim Fidelity
Coding agents operate on exact tokens. A summarizer that paraphrases Error: ECONNREFUSED 127.0.0.1:5432 as "a database connection error" destroys the debugging context. The agent can no longer grep for that error string, can no longer match it against known issues, can no longer include it in a fix commit message.
Compaction either keeps that error message verbatim or deletes it entirely. If the error is still relevant to the current task, the compaction model scores it as high-signal and preserves it. If the agent has already fixed the issue and moved on, the error gets deleted. Either way, no new text is generated, and no detail is corrupted.
Verbatim Compaction: Delete, Never Rewrite
Verbatim compaction is the purest form of context compaction. The model reads the full context, scores each line or block by relevance to the current task, and removes low-signal content. The output is a strict subset of the input.
Morph Compact implements verbatim compaction at production scale. The model processes context at 3,300+ tokens per second with 98% verbatim accuracy, meaning 98% of output lines are character-for-character matches with input lines.
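A figure like 98% verbatim accuracy is mechanically checkable. The validator below is a hypothetical helper (not part of any SDK) that counts how many output lines appear character-for-character in the input:

```python
def verbatim_accuracy(input_text: str, output_text: str) -> float:
    """Fraction of output lines that match some input line exactly."""
    input_lines = set(input_text.splitlines())
    output_lines = output_text.splitlines()
    if not output_lines:
        return 1.0
    exact = sum(1 for line in output_lines if line in input_lines)
    return exact / len(output_lines)

# A deletion-only compactor scores 1.0; any rewriting drops the score.
original = (
    "path: src/api/webhooks/stripe.ts:98\n"
    "noise line the agent no longer needs\n"
    "Error: ECONNREFUSED 127.0.0.1:5432"
)
compacted = (
    "path: src/api/webhooks/stripe.ts:98\n"
    "Error: ECONNREFUSED 127.0.0.1:5432"
)
print(verbatim_accuracy(original, compacted))  # → 1.0
```

Running the same check against a summarizer's output, where paths and errors are paraphrased, is what produces scores well below 1.0.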
What Gets Deleted
The compaction model learns which context types are high-signal vs expendable through training on real agent conversations. In practice, the following categories are consistently scored as low-signal:
- Redundant tool outputs: File reads the agent already acted on. Grep results already processed. Test output from passing tests.
- Exploratory dead ends: Files read but found irrelevant. Search queries that returned nothing useful.
- Verbose boilerplate: License headers, import blocks the agent isn't modifying, configuration files read for reference.
- Superseded information: Earlier versions of files that have since been edited. Old error messages from bugs already fixed.
What Survives
- Active file paths and line numbers: Any reference the agent might need to navigate
- Current error messages: Unresolved bugs and their exact text
- Reasoning decisions: Why the agent chose approach A over B
- Recent tool calls and their results: The last 3-5 operations
- User instructions: The original task and any clarifications
Verbatim compaction: input vs output
```
# INPUT: 1,247 tokens across 15 tool calls
User: Fix the rate limiting bug in the API gateway
Agent: I'll investigate the rate limiter...
[read_file src/middleware/rateLimit.ts]        # 340 tokens
[read_file src/middleware/cors.ts]             # 280 tokens ← DELETED (irrelevant)
[grep "rateLimitExceeded" src/]                # 190 tokens
[read_file src/config/limits.json]             #  95 tokens ← DELETED (already processed)
[read_file tests/rateLimit.test.ts]            # 410 tokens
Agent: Found the issue on line 47...
[edit_file src/middleware/rateLimit.ts:47-52]
Agent: Running tests...
[test output - 200 lines, all passing]         # 380 tokens ← DELETED (all pass)
Agent: Tests pass. The fix handles the edge case where...

# OUTPUT: 612 tokens — 51% reduction
# Every surviving line: unchanged from input
# File paths: exact. Error text: exact. Line numbers: exact.
```

The re-reading loop, broken
Summarization creates a failure mode where the agent loses a detail, re-searches for it, fills context with new results, summarizes again, and loses it again. Verbatim compaction breaks this loop because surviving content is exact. If the agent needs src/middleware/rateLimit.ts:47, it either still has that exact string in context or it was deleted. No paraphrased "the rate limit middleware" to chase down.
Token-Level Pruning (LLMlingua)
LLMlingua takes compaction to the individual token level. Instead of deleting entire lines or blocks, it scores each token's importance using perplexity from a small language model (GPT-2 or LLaMA-7B), then removes low-importance tokens one by one.
LLMlingua-2 improved on the original with a BERT-based encoder for faster scoring (3-6x speedup) and better generalization to out-of-domain data. Both achieve impressive compression ratios on natural language benchmarks: up to 20x compression with minimal accuracy loss on reasoning tasks.
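As a rough sketch of the idea, the toy pruner below ranks tokens by frequency as a crude stand-in for the perplexity scores a real LLMlingua pipeline gets from its small language model, then drops the most predictable ones:

```python
from collections import Counter

def prune_tokens(text: str, keep_ratio: float = 0.7) -> str:
    """Toy token-level pruning: drop the most frequent (lowest-surprise)
    tokens. Real LLMlingua scores each token by perplexity from a small
    LM; word frequency is only an illustrative proxy here."""
    tokens = text.split()
    freq = Counter(tokens)
    # Rare tokens are treated as high-signal, frequent ones as expendable.
    ranked = sorted(range(len(tokens)), key=lambda i: freq[tokens[i]])
    keep = set(ranked[: int(len(tokens) * keep_ratio)])
    return " ".join(t for i, t in enumerate(tokens) if i in keep)

print(prune_tokens("the error is that the handler is missing the retry logic"))
# → "error is that handler missing retry logic"
```

On prose this mostly sheds filler words, which is exactly why the same move is risky on code, as the next section shows.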
The Code Problem
Token-level pruning works well for natural language, where removing filler words ("the," "is," "that") rarely changes meaning. Code is different. Removing a single token can break syntax, change semantics, or corrupt identifiers:
Token-level pruning risks with code
```
# ORIGINAL:
if (user.role !== "admin" && !user.permissions.includes("write")) {
  throw new ForbiddenError("Insufficient permissions");
}

# AFTER aggressive token pruning (conceptual):
if (user.role "admin" && user.permissions.includes("write")) {
  throw ForbiddenError("Insufficient permissions");
}

# Removed: !==, !, new — completely changed the logic
```

For natural language reasoning tasks, LLMlingua is effective. For coding agent context where syntax integrity matters, line-level or block-level compaction (verbatim deletion) is safer. The granularity of deletion maps to the granularity of meaning: in code, the meaningful unit is a line or a block, not a token.
| Dimension | Token-Level (LLMlingua) | Line-Level (Verbatim) |
|---|---|---|
| Compression Ratio | Up to 20x | 50-70% |
| Code Safety | Low (can break syntax) | High (lines stay intact) |
| Speed | Requires separate model pass | 3,300+ tok/s inline |
| Best For | Natural language, RAG prompts | Code, agent conversations |
| Hallucination Risk | Low but syntax corruption | None |
Observation Masking (JetBrains Approach)
JetBrains published research on their Junie agent showing that you can get surprisingly far without any sophisticated compression at all. Their technique, observation masking, replaces old tool outputs with a placeholder while keeping the tool call itself visible.
The agent remembers what it did (read a file, ran a grep, executed a test) but does not carry the full output in context. The output is simply replaced with [masked]. On SWE-bench, this matched the quality of full LLM summarization while using zero extra compute for the masking step.
Observation masking in practice
```
# BEFORE masking: full tool outputs in context
Tool: read_file("src/middleware/rateLimit.ts")
Output: [340 tokens of file content]
Tool: grep("rateLimitExceeded", "src/")
Output: [190 tokens of grep results]
Tool: read_file("tests/rateLimit.test.ts")
Output: [410 tokens of test file]

# AFTER masking: tool calls preserved, old outputs removed
Tool: read_file("src/middleware/rateLimit.ts")
Output: [masked — 340 tokens freed]
Tool: grep("rateLimitExceeded", "src/")
Output: [masked — 190 tokens freed]
Tool: read_file("tests/rateLimit.test.ts")
Output: [still visible — most recent, likely still relevant]

# Agent knows it read rateLimit.ts and grepped for the error.
# If it needs the file content again, it re-reads (targeted).
```

The insight behind observation masking aligns with context rot research: most tool outputs are consumed once and never referenced again. The agent reads a file to understand it, makes a decision, and moves on. Carrying 340 tokens of that file through the rest of the conversation is pure waste. Masking removes the waste while preserving the agent's action history.
When masking falls short
Observation masking works best when tool outputs are consumed once. It fails when the agent needs to reference old output later, such as comparing two versions of a file or correlating errors across multiple test runs. In these cases, verbatim compaction is more appropriate: it selectively removes low-signal content while preserving the specific lines the agent still needs.
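A minimal masking pass over an OpenAI-style message list might look like the sketch below; the role names and placeholder format are assumptions, not JetBrains' implementation:

```python
def mask_old_observations(messages: list[dict], keep_last: int = 1) -> list[dict]:
    """Replace all but the most recent tool outputs with a placeholder.
    Tool *calls* stay visible, so the agent retains its action history."""
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    to_mask = tool_indices[:-keep_last] if keep_last else tool_indices
    return [
        {**m, "content": f"[masked: {len(m['content'])} chars freed]"}
        if i in to_mask else m
        for i, m in enumerate(messages)
    ]

history = [
    {"role": "assistant", "content": 'read_file("src/middleware/rateLimit.ts")'},
    {"role": "tool", "content": "x" * 340},
    {"role": "assistant", "content": 'grep("rateLimitExceeded", "src/")'},
    {"role": "tool", "content": "y" * 190},
]
print(mask_old_observations(history)[1]["content"])  # → [masked: 340 chars freed]
```

Note that `keep_last` controls the recency window: the most recent outputs stay intact because they are the ones most likely to still be relevant.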
Adaptive Compaction (ACON Framework)
The ACON (Adaptive Context Optimization for Agents) framework demonstrated that uniform compression is suboptimal. Different parts of an agent's context have different information density and different relevance to the current task. ACON treats each context segment independently, applying different compression levels based on content type.
The Key Insight: Not All Context Is Equal
ACON found that aggressive compression of tool outputs is safe because the reasoning trace matters more than the raw data. An agent's chain of thought (why it chose this file, what pattern it noticed, what fix it decided on) carries more information-per-token than the raw grep output or file content that informed those decisions.
This maps to a hierarchy of compaction safety:
Safe to Compress Aggressively
Raw tool outputs (file reads, grep results, test output). High token count, low information density after initial consumption. 60-80% of context in typical agent sessions.
Compress Carefully
Reasoning traces and decision records. Medium token count, high information density. The agent's chain of thought about why it chose a particular approach.
Never Compress
User instructions, active error messages, current file paths, pending task state. Low token count, irreplaceable information. Loss causes agent failure.
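The hierarchy can be sketched as a per-segment policy. The tier names, keep ratios, and helper functions below are illustrative assumptions, not ACON's actual configuration:

```python
from typing import Callable

# Keep ratios mirror the three tiers above (illustrative values).
POLICY: dict[str, float] = {
    "tool_output": 0.2,   # safe to compress aggressively
    "reasoning": 0.8,     # compress carefully
    "instruction": 1.0,   # never compress
}

def adaptive_compact(segments: list[tuple[str, str]],
                     compact_fn: Callable[[str, float], str]) -> str:
    """Apply a per-type keep ratio to each (segment_type, text) pair."""
    out = []
    for seg_type, text in segments:
        ratio = POLICY.get(seg_type, 1.0)  # unknown types: preserve fully
        out.append(text if ratio >= 1.0 else compact_fn(text, ratio))
    return "\n".join(out)

# Stand-in compactor: keep the first `ratio` fraction of lines verbatim.
def keep_head(text: str, ratio: float) -> str:
    lines = text.splitlines()
    return "\n".join(lines[: max(1, int(len(lines) * ratio))])
```

A real system would pass a relevance-scoring compactor as `compact_fn`; the point of the sketch is that the budget varies by segment type, not that head-truncation is a good scoring rule.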
Morph Compact implicitly applies this hierarchy. The model is trained on real agent conversations and learns which content types are expendable vs critical. The result is similar to ACON's adaptive approach but without requiring explicit segment classification.
Server-Side Compaction APIs
Both Anthropic and OpenAI now offer server-side compaction as API features. Their implementations differ significantly in mechanism and trade-offs.
Anthropic's Compaction API
Anthropic's server-side compaction (beta, available for Claude Opus 4.6 and Sonnet 4.6) generates a structured summary when input tokens exceed a configurable threshold. The API detects when context approaches the limit, generates a compaction block containing the summary, and continues the conversation from that compressed state.
This is technically summarization, not compaction in the strict sense. The output is new text generated by the model, not a subset of the input. The summary is human-readable and structured, but carries the risks of any generative approach: paraphrased file paths, shortened error messages, lost line numbers.
OpenAI's Opaque Compression
OpenAI's approach produces a server-side, non-human-readable compressed representation. It achieves 99.3% compression ratio, the most aggressive of any approach, but the output is not inspectable. You cannot verify what was preserved or dropped. The compressed state only works with OpenAI's models, creating full vendor lock-in.
| Property | Anthropic Compaction | OpenAI Compression | Morph Compact |
|---|---|---|---|
| Mechanism | LLM summary generation | Opaque encoding | Verbatim deletion |
| Compression Ratio | ~70-90% | 99.3% | 50-70% |
| Output Inspectable | Yes (summary) | No | Yes (verbatim) |
| Hallucination Risk | Moderate | Unknown | None |
| Vendor Lock-in | Anthropic API | OpenAI API | None (OpenAI SDK compatible) |
| Speed | Full LLM call | Server-side | 3,300+ tok/s |
| Verbatim Fidelity | Low (rewritten) | Unknown | 98% |
Benchmarks: How Compaction Methods Compare
Evaluating compaction methods requires measuring more than compression ratio. The metrics that matter for coding agents are: does the agent still complete the task correctly after compaction? Does it hallucinate file paths? Does it re-search for information it already had?
Factory.ai Memory Layer Evaluation
Factory.ai evaluated context management approaches across 36,000 real engineering messages. Their structured summarization approach scored 3.70/5 overall, while OpenAI's opaque compression scored 3.35/5. The evaluation measured accuracy, information retention, and consistency across sessions.
The most revealing metric was multi-session information retention: only 37%. When agents used summarization to carry context across sessions, nearly two-thirds of information was lost or corrupted. This highlights why verbatim compaction's guarantee of exact preservation matters more than higher compression ratios.
| Metric | LLM Summary | Opaque (OpenAI) | Verbatim (Morph) | Observation Masking |
|---|---|---|---|---|
| Overall Score | 3.70/5 | 3.35/5 | N/A (not in eval) | Matched summary |
| Accuracy | 3.74-4.04/5 | 3.43/5 | 98% verbatim | No new errors |
| Compression | 70-90% | 99.3% | 50-70% | 60-80% |
| Multi-Session Retention | 37% | Not measured | N/A (stateless) | N/A |
| Hallucination Risk | Moderate | Unknown | None | None |
| Extra Compute Cost | Full LLM call | Included | 3,300+ tok/s | $0 |
SWE-Bench Pro Results
FlashCompact's prevention-first approach, combining WarpGrep targeted search, Fast Apply compact diffs, and Morph Compact verbatim cleanup, achieved state-of-the-art results on SWE-Bench Pro. The key factor was not better compaction but less need for compaction: agents using FlashCompact consumed 3-4x fewer tokens per session, meaning they hit the compaction threshold 3-4x less often.
Prevention-First: Reducing the Need for Compaction
The best compaction is the one you never run. Cognition measured that agents spend 60% of their time searching for code, dumping entire files into context to find 10-line functions. Every full-file read accelerates the countdown to compaction.
FlashCompact attacks context waste at two sources before compaction becomes necessary:
Search Waste: WarpGrep
RL-trained semantic codebase search returns only relevant code snippets, not entire files. 0.73 F1 in 3.8 steps vs grep's 0.19 F1 in 12 steps. Prevents 60%+ of context waste from search operations.
Write Waste: Fast Apply
10,500 tok/s compact diffs instead of full file rewrites. A 3-line edit to a 200-line file consumes 3 lines of context, not 200. Prevents write operations from echoing unchanged code back into context.
Residual Noise: Morph Compact
Verbatim deletion at 3,300+ tok/s cleans up whatever noise remains. 50-70% compression with 98% verbatim accuracy. Zero hallucination. The last line of defense after prevention.
The combination extends effective context life by 3-4x. An agent that would normally hit compaction after 25 tool calls can run 75-100 tool calls before needing cleanup. When compaction does fire, the input is already higher-signal (because prevention removed the noise), so the compaction output is more useful too.
Prevention-first vs compaction-only
```
# WITHOUT prevention (standard agent):
# Tool call 1-10: file reads, greps → ~60K tokens consumed
# Tool call 11-20: more reads, edits → ~120K tokens consumed
# Tool call 21-25: approaching 167K limit
# → Auto-compact fires, summarizes everything
# → Agent loses file paths, re-searches, fills context again
# → Second compaction at tool call 35. Third at 45.
# Total compactions for 50-tool-call task: 3-4

# WITH FlashCompact (prevention-first):
# Tool call 1-10: WarpGrep returns snippets → ~15K tokens consumed
# Tool call 11-20: Fast Apply uses diffs → ~25K tokens consumed
# Tool call 21-50: steady, efficient growth → ~80K tokens consumed
# → Morph Compact runs once at tool call 40 for cleanup
# → Agent retains exact file paths, no re-searching
# Total compactions for 50-tool-call task: 0-1
```

Why prevention compounds
Each compaction cycle has a cost: the agent loses some context fidelity, may re-search for lost details, and burns tokens on the re-search. Preventing one compaction doesn't just save the compaction time. It prevents the downstream re-searching and re-reading that follows. Three prevented compactions might save 30-50K tokens of redundant re-searching.
Integration Guide: Using Morph Compact
Morph Compact is OpenAI SDK compatible. Point the base URL at api.morphllm.com, use morph-compact as the model, and send standard chat completion requests. The response contains the compacted version of your input where every surviving line is verbatim from the original.
Morph Compact via OpenAI SDK (Python)
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.morphllm.com/v1",
    api_key="your-morph-api-key",
)

# Compact a conversation that's approaching the context limit
response = client.chat.completions.create(
    model="morph-compact",
    messages=[
        {
            "role": "user",
            "content": full_conversation_text,
            # Can be the entire agent conversation as a single string
        }
    ],
)

compacted = response.choices[0].message.content
# compacted contains only high-signal lines from the input
# Every surviving line is character-for-character identical
```

Morph Compact via OpenAI SDK (TypeScript)
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.morphllm.com/v1",
  apiKey: "your-morph-api-key",
});

const response = await client.chat.completions.create({
  model: "morph-compact",
  messages: [
    {
      role: "user",
      content: fullConversationText,
    },
  ],
});

const compacted = response.choices[0].message.content;
// Every surviving line is verbatim from the input
```

When to Trigger Compaction
Trigger compaction at 80% context capacity, not at the limit. This gives the agent room to finish its current operation before compacting. If you wait until 95% (the Claude Code default), the agent has almost no room to work after compaction fires, and the compaction itself adds tokens.
For a belt-and-suspenders approach, store pre-compaction context as a virtual file. If compaction drops something the agent needs later, it can re-read it from the stored snapshot rather than re-running the original tool call.
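Both recommendations combine into a small trigger function. In this sketch, `count_tokens` and `compact_fn` are caller-supplied hooks (the latter could wrap the Morph Compact call shown earlier), and the snapshot path is an assumption:

```python
import json
import pathlib

def maybe_compact(messages, count_tokens, compact_fn,
                  limit=167_000, trigger=0.8,
                  snapshot_path="pre_compaction_snapshot.json"):
    """Compact once usage crosses `trigger` of the window, snapshotting
    first so anything compaction drops can be re-read, not re-derived."""
    used = sum(count_tokens(m["content"]) for m in messages)
    if used < limit * trigger:
        return messages  # still under threshold: leave context untouched
    # Belt-and-suspenders: persist the full context before deleting anything.
    pathlib.Path(snapshot_path).write_text(json.dumps(messages))
    return compact_fn(messages)
```

Calling this after every tool-call round keeps the check cheap: token counting is the only work done until the 80% threshold is actually crossed.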
Frequently Asked Questions
What is context compaction?
Context compaction is a deletion-based strategy for reducing tokens in an LLM's context window. Instead of rewriting content into a summary or encoding it into a compressed format, compaction identifies low-signal content and removes it. Every surviving line is character-for-character identical to the original input. Morph Compact achieves 50-70% compression with 98% verbatim accuracy at 3,300+ tokens per second.
How does compaction differ from compression and summarization?
Compression encodes content into a smaller representation, like OpenAI's opaque format (99.3% ratio, not inspectable, vendor-locked). Summarization rewrites content into a condensed version (70-90% ratio, but introduces hallucination risk). Compaction deletes tokens without rewriting (50-70% ratio, zero hallucination). Only compaction guarantees verbatim fidelity of surviving content. For a detailed comparison, see compaction vs summarization.
What is verbatim compaction?
Verbatim compaction means every surviving line in the output is identical to the input. No paraphrasing, no rewriting, no new text generation. The model scores lines by relevance and deletes low-signal ones. This eliminates hallucination risk entirely because the model never generates new content during the compaction process.
What is token-level pruning?
Token-level pruning (LLMlingua) removes individual tokens based on perplexity scores rather than entire lines. It achieves up to 20x compression on natural language, but risks breaking code syntax by removing syntactically important tokens. For coding agents, line-level verbatim compaction is safer.
What is observation masking?
Observation masking (from JetBrains' Junie research) replaces old tool outputs with placeholders while keeping tool calls visible. The agent remembers what it did but doesn't carry full outputs in context. On SWE-bench, it matched LLM summarization quality while costing zero extra compute. It works best when tool outputs are consumed once and not referenced later.
Does context compaction cause hallucinations?
Verbatim compaction does not cause hallucinations because no new text is generated. Every surviving line is identical to the input. By contrast, summarization-based approaches (used by Claude Code and Cursor) rewrite context in the model's own words, which can alter file paths, error messages, and line numbers. OpenAI's opaque compression produces non-inspectable output, making hallucination risk unknown.
How does ACON adaptive compaction work?
ACON (Adaptive Context Optimization for Agents) applies different compression levels to different context segments. Tool outputs get compressed aggressively (they're high-token, low-information-density after consumption). Reasoning traces get preserved more carefully. The framework achieved 26-54% token reduction while preserving 95%+ accuracy on agent benchmarks.
Is Morph Compact compatible with the OpenAI SDK?
Yes. Point base_url at api.morphllm.com/v1 and use morph-compact as the model. Standard chat completion requests work without modification. No new SDK, no new dependencies, no custom client.
Related Pages
Verbatim Compaction at 3,300+ tok/s
Morph Compact deletes noise and keeps signal. 98% verbatim accuracy, zero hallucination, OpenAI SDK compatible. Every surviving line is character-for-character from your original context.