Every frontier coding agent now compresses context as a core primitive. OpenAI recommends compaction as the default long-run primitive, not an emergency fallback. Anthropic's Claude Code runs automatic compaction when conversations approach limits. The question is not whether to compress, but which method loses the least signal.
What Is Context Compression (and Why It Matters for AI Agents)
Context compression reduces the number of tokens an LLM processes while preserving the information it needs to produce correct output. For coding agents that run multi-step tasks over long sessions, context accumulates fast: file reads, grep results, tool outputs, error traces, and conversation history all compete for the same finite window.
The problem is not running out of space. It's signal dilution. As context rot research shows, LLM performance degrades as input length increases, even when the window is nowhere near full. Every irrelevant token makes the model worse at attending to the tokens that matter.
Jason Liu framed the value precisely: "If in-context learning is gradient descent, then compaction is momentum." It preserves the trajectory of the conversation while shedding the weight of irrelevant history. The model keeps its direction without dragging dead tokens forward.
Multi-session memory is still broken
The best approaches to cross-session memory achieve only ~37% retention. This makes within-session compression critical: if context cannot be carried forward reliably between sessions, you need to maximize the signal that fits in the current window.
Three Approaches to Context Compression
Three distinct methods have emerged, each with different tradeoffs between compression ratio, fidelity, and hallucination risk.
Structured Summarization
Rewrites conversation history into organized sections: completed work, current state, pending tasks. Used by Anthropic's Claude Code. High compression, but rewrites the original text.
Opaque Compression
Uses the model itself to produce a compressed representation via API. OpenAI's Codex /responses/compact endpoint. The compression logic is a black box you cannot inspect.
Verbatim Compaction
Deletes low-signal tokens while keeping every surviving sentence word-for-word identical. Morph Compact. Zero hallucination risk because nothing is rewritten.
| Dimension | Structured Summary | Opaque Compression | Verbatim Compaction |
|---|---|---|---|
| How it works | LLM rewrites into sections | Model-internal compression | Deletes noise, keeps text verbatim |
| Hallucination risk | Medium (paraphrasing) | Medium (black box) | Zero (no rewriting) |
| Code fidelity | Low (code may be altered) | Unknown (opaque) | High (exact original text) |
| Inspectability | High (readable summary) | None (opaque blob) | High (subset of original) |
| Compression ratio | High (70-90%) | Variable | 50-70% |
| Speed | Slow (full LLM call) | Variable | 3,300+ tok/s |
The fundamental tradeoff: summarization gives you more compression but rewrites the text. Compaction gives you less compression but guarantees fidelity. For coding agents where exact file paths, error messages, and code snippets must survive compression intact, this distinction matters.
Benchmarks: How Compression Methods Compare
Factory.ai's 36K Message Evaluation
Factory.ai ran the most rigorous public evaluation of context compression to date. They tested three compression methods on 36,000 real software engineering messages from production coding sessions, not synthetic benchmarks.
| Method | Overall Score | Information Retention | Actionability |
|---|---|---|---|
| Structured Summary (Factory) | 3.70 | High | High |
| Opaque Compression (OpenAI) | 3.35 | Medium | Medium |
| Verbatim Compaction (Morph) | N/A (different axis) | Highest (verbatim) | Highest (exact code) |
Factory's structured summaries scored 3.70 overall versus OpenAI's 3.35. The key advantage: structured summaries organize compressed output into sections (what was done, what failed, what's next), making it easier for the model to pick up where it left off.
But both approaches share a deeper problem: the re-reading loop. When an agent fills context with search results or file reads, summarization compresses those results into a paraphrase. The paraphrase loses exact details (line numbers, error strings, specific values), so the agent re-runs the search. The new output refills context. The agent summarizes again. In the worst case, this becomes a near-infinite loop where the agent oscillates between searching and summarizing without making progress.
Verbatim compaction operates on a different axis. It does not aim to reorganize information. It aims to preserve exact content with zero rewriting. Factory's evaluation measured summarization quality. Morph Compact optimizes for a different metric: fidelity.
ACON: Adaptive Context Optimization
The ACON framework from academic research targets the largest source of context bloat: tool call observations. File reads, grep results, and command outputs often fill 60-80% of an agent's context window.
ACON adaptively controls observation length based on task requirements. Short observations pass through. Long ones get compressed. The approach preserves 95%+ task accuracy while cutting peak token consumption by up to 54%.
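ACON's short-pass-through, long-compress policy reduces to a small gate in front of the context. A minimal sketch of that idea; the 4-characters-per-token estimate, the 400-token threshold, and the pluggable `compress` callable are illustrative assumptions, not ACON's actual implementation:

```python
def manage_observation(observation: str, compress, max_tokens: int = 400) -> str:
    """Adaptive observation control: short observations pass through
    verbatim, long ones are compressed before entering agent context.

    `compress` is any compression callable (e.g. a call to a compaction
    model). Threshold and token estimate are illustrative assumptions.
    """
    est_tokens = len(observation) // 4  # rough heuristic: ~4 chars per token
    if est_tokens <= max_tokens:
        return observation            # short: keep as-is
    return compress(observation)      # long: compress before adding to context
```

The gate itself costs nothing; only observations that exceed the threshold pay for a compression call.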
JetBrains: Masking Matches Summarization
JetBrains tested observation masking against full LLM summarization on SWE-bench. Simple masking, replacing tool outputs with placeholders after the model has processed them, matched LLM summarization quality at a fraction of the cost. You do not always need an expensive generation call to compress context.
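Masking needs no model call at all. A minimal sketch of the idea, assuming OpenAI-style message dicts with a `role` field; the placeholder string and keep-last-N policy are assumptions for illustration, not JetBrains' exact setup:

```python
def mask_old_tool_outputs(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Replace all but the most recent tool outputs with a placeholder.

    The model has already acted on older observations, so their full
    text is low-signal; masking reclaims those tokens without any
    generation cost.
    """
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    to_mask = set(tool_indices[:-keep_last] if keep_last else tool_indices)
    return [
        {**m, "content": "[tool output elided: already processed]"}
        if i in to_mask else m
        for i, m in enumerate(messages)
    ]
```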
The cost of summarization
Every summarization call is itself an LLM inference. If you compress context by calling GPT-4 or Claude to summarize, you pay for that call in tokens and latency. Masking and deletion-based approaches skip that cost entirely. For high-frequency compression (every few turns), the savings compound fast.
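The gap is easy to quantify. A back-of-envelope sketch, assuming illustrative prices of $3 per million input tokens and $15 per million output tokens; actual pricing varies by provider and model:

```python
def summarization_cost_usd(context_tokens: int, summary_tokens: int,
                           in_price: float = 3.0, out_price: float = 15.0) -> float:
    """Cost of one summarization call: you pay to read the full context
    and to generate the summary. Prices are illustrative ($ per million
    tokens); check your provider's current rates."""
    return (context_tokens * in_price + summary_tokens * out_price) / 1_000_000

# Summarizing a 100K-token context down to a 5K-token summary:
per_call = summarization_cost_usd(100_000, 5_000)  # → 0.375 USD per call
# Masking and deletion pay no generation cost; at every-few-turns
# frequency, the difference compounds across a long session.
```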
Verbatim Compaction: The Zero-Hallucination Approach
Morph Compact takes a fundamentally different approach: deletion, not rewriting. The model identifies which tokens carry signal and which are noise, then removes the noise. Every sentence that survives compression is verbatim from the original.
Summarization-based compression can alter code snippets, mangle file paths, or introduce subtle factual errors during the rewriting process. When a coding agent compresses context that contains src/lib/auth.ts:47 and the summary paraphrases it as "the auth module," the agent loses the exact reference it needs for the next edit.
Verbatim compaction sidesteps this entirely. The output is a strict subset of the input. If a file path, error message, or code block survives compression, it's identical to what was originally there. No paraphrasing. No "close enough." And because surviving lines are exact matches, the agent doesn't need to re-search for details that got paraphrased away. This breaks the re-reading loop that plagues summarization.
The historical problem with deletion-based compaction was speed. Scoring every token for relevance is expensive, and early approaches took 8-15 seconds per compaction. Too slow for inline use. Morph built custom inference engines optimized specifically for the compaction workload, bringing latency under 3 seconds. Fast enough to run before every LLM call, not just as an emergency measure at the context limit.
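Because compacted output is a strict subset of the input, fidelity is mechanically checkable without any model in the loop. A minimal sketch of such a check; the line-level granularity here is an assumption (the guarantee itself is per surviving sentence):

```python
def is_verbatim_subset(original: str, compacted: str) -> bool:
    """True if every non-empty line of `compacted` appears verbatim in
    `original`. Summarized output generally fails this check; verbatim
    compaction passes it by construction."""
    original_lines = {line.strip() for line in original.splitlines()}
    return all(
        line.strip() in original_lines
        for line in compacted.splitlines()
        if line.strip()
    )
```

No equivalent check exists for summarized or opaquely compressed output, which is precisely the inspectability gap in the comparison table above.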
Summarization vs. verbatim compaction
# Original context (from a coding agent session):
Tool output: Read file src/api/webhooks/stripe.ts (247 lines)
Lines 42-89: handleSubscriptionUpdated()
Lines 90-134: handlePaymentFailed() — has retry logic bug
Lines 135-180: handleInvoicePaid()
Error at line 98: TypeError: Cannot read property 'retryCount'
of undefined. subscription.metadata.retryCount is null when
the customer has no prior failed payments.
Agent note: Need to add null check at line 98 before accessing
retryCount. Also update the test at test/webhooks.test.ts:156.
# After SUMMARIZATION:
"Found a bug in the Stripe webhook handler related to retry
logic. The subscription metadata needs a null check."
→ Lost: exact file path, line numbers, error message, test location
# After VERBATIM COMPACTION:
Tool output: Read file src/api/webhooks/stripe.ts
Lines 90-134: handlePaymentFailed() — has retry logic bug
Error at line 98: TypeError: Cannot read property 'retryCount'
of undefined. subscription.metadata.retryCount is null when
the customer has no prior failed payments.
Agent note: Need to add null check at line 98 before accessing
retryCount. Also update the test at test/webhooks.test.ts:156.
→ Kept: exact file, line numbers, error, test path. Removed: irrelevant functions.

The summarized version is shorter but lost the information the agent actually needs: the file path, line number, error details, and test file location. The compacted version is longer but preserves every actionable detail.
When to Compress: Inline vs. Threshold-Based
Two strategies govern when compression runs during an agent session.
| Strategy | How It Works | Best For |
|---|---|---|
| Threshold-based | Triggers at 70-80% of window limit | Long sessions with gradual accumulation |
| Inline / continuous | Compresses tool outputs as they arrive | High-throughput agents with many tool calls |
Threshold-Based (Claude Code, Codex)
Most production agents use threshold-based compression. Claude Code triggers compaction when the conversation approaches the context window limit. The model summarizes the full conversation history, commits progress to git with descriptive messages, and restarts with the compressed state.
The downside: context quality degrades before the threshold triggers. An agent at 60% capacity with 40% noise tokens is already performing worse than the same agent with clean context. The compression happens too late.
Inline Compression (Recommended)
Inline compression runs continuously. Tool outputs get compacted as they arrive. Observation data from file reads and grep results is reduced before it enters the main conversation context. The agent never accumulates noise in the first place.
Inline compression with Morph Compact
// Threshold-based: compress when you're almost out of space
// Problem: context already degraded by the time you compress
agent.on('contextLimit', async () => {
  const compressed = await compact(conversation);
  agent.restart(compressed);
});

// Inline: compress tool outputs before they enter context
// The agent's context stays clean throughout the session
agent.on('toolResult', async (result) => {
  if (result.tokens > 500) {
    result.content = await morph.compact(result.content);
  }
  agent.addToContext(result);
});

OpenAI agrees
OpenAI's Codex team explicitly recommends compaction as a default long-run primitive, not an emergency fallback. This aligns with inline compression: run it continuously as part of the normal agent loop, not just when you hit a wall.
Code Example: Morph Compact SDK
Morph Compact is available through the standard OpenAI SDK. Point the base URL at Morph's API and use the morph-compact model.
Basic usage with OpenAI SDK (Python)
from openai import OpenAI
client = OpenAI(
    api_key="your-morph-api-key",
    base_url="https://api.morphllm.com/v1"
)

# Compact a long context
response = client.chat.completions.create(
    model="morph-compact",
    messages=[
        {
            "role": "user",
            "content": long_context_string  # Your agent's accumulated context
        }
    ]
)

compacted = response.choices[0].message.content
# Every surviving sentence is verbatim from the original
# 50-70% smaller, zero hallucination risk

Inline compression in an agent loop (TypeScript)
import OpenAI from "openai";
const morph = new OpenAI({
  apiKey: process.env.MORPH_API_KEY,
  baseURL: "https://api.morphllm.com/v1",
});

// Rough heuristic: ~4 characters per token
function estimateTokens(content: string): number {
  return Math.ceil(content.length / 4);
}

async function compactIfNeeded(content: string): Promise<string> {
  // Only compact long outputs (short ones pass through)
  if (estimateTokens(content) < 500) return content;

  const response = await morph.chat.completions.create({
    model: "morph-compact",
    messages: [{ role: "user", content }],
  });
  return response.choices[0].message.content ?? content;
}

// In your agent loop:
for (const toolCall of pendingToolCalls) {
  const result = await executeTool(toolCall);
  const compacted = await compactIfNeeded(result.output);
  conversation.addToolResult(toolCall.id, compacted);
  // Agent never sees the noise, only high-signal tokens
}

Streaming support
# Streaming for large context compression
response = client.chat.completions.create(
    model="morph-compact",
    messages=[{"role": "user", "content": large_context}],
    stream=True
)

compacted_chunks = []
for chunk in response:
    if chunk.choices[0].delta.content:
        compacted_chunks.append(chunk.choices[0].delta.content)
compacted = "".join(compacted_chunks)

Frequently Asked Questions
What is context compression for LLMs?
Context compression reduces the number of tokens an LLM processes while preserving the information needed for correct output. Three main approaches exist: structured summarization (rewriting into organized sections), opaque compression (model-internal reduction), and verbatim compaction (deleting noise while keeping surviving text identical to the original).
What is the difference between summarization and compaction?
Summarization rewrites context into shorter form using natural language generation. This can introduce paraphrasing errors, altered code snippets, or hallucinated details. Compaction deletes low-signal tokens while keeping every surviving sentence word-for-word identical. Summarization is better for high-level progress tracking. Compaction is better when exact code and error messages must survive intact.
How much can context compression reduce token usage?
The ACON framework achieves 26-54% peak token reduction while preserving 95%+ task accuracy. Morph Compact achieves 50-70% reduction with 98% verbatim accuracy. Factory.ai's evaluation showed structured summarization can compress 36K real engineering messages while maintaining a 3.70/5 quality score.
Does context compression cause hallucinations?
Summarization-based compression can introduce hallucinations because the model rewrites context in its own words. File paths, line numbers, and code snippets may be altered. Verbatim compaction avoids this entirely: every surviving token is identical to the original input. Zero hallucination risk because nothing is rewritten.
When should I compress context in a coding agent?
OpenAI recommends compaction as a default long-run primitive, not an emergency fallback. Inline compression (compacting tool outputs as they arrive) prevents noise accumulation. Threshold-based compression (triggering at 70-80% capacity) is simpler but lets noise degrade performance before it kicks in.
How does Morph Compact work?
Morph Compact uses a purpose-built model that identifies which tokens carry signal and which are noise, then deletes the noise. Every sentence that survives is verbatim from the original. It runs at 3,300+ tokens per second with 50-70% reduction. Compatible with the OpenAI SDK by pointing the base URL at api.morphllm.com/v1 and using the morph-compact model.
Compress Context Without Losing Signal
Morph Compact is verbatim compaction for coding agents. 50-70% token reduction, 3,300+ tok/s, and zero hallucination risk. Every surviving sentence is word-for-word identical to the original.