Compaction vs Summarization: How AI Agents Should Manage Context

Three approaches to agent context management: LLM summarization, opaque compression, and verbatim compaction. Real benchmarks from Factory.ai, JetBrains, and ACON research on what works, what hallucinates, and when to use each.

February 27, 2026 · 3 min read

Every coding agent hits a wall. The context window fills up, and the agent needs to decide what to keep, what to cut, and how to compress. Three approaches have emerged, each with fundamentally different trade-offs between compression ratio, accuracy, and inspectability.

37%
Multi-session retention with summarization
99.3%
Compression ratio (OpenAI opaque)
98%
Verbatim accuracy (Morph Compact)
3,300+
Tokens/sec compaction speed

The Three Approaches to Agent Context Management

When a coding agent's context window approaches capacity, it has three options. Each one makes a trade-off that shapes whether the agent remembers file paths accurately, preserves error messages verbatim, or even recalls what it was working on.

LLM Summarization

An LLM rewrites conversation history into a natural language summary. High compression, human-readable, but lossy. Can hallucinate or paraphrase critical details like file paths and error codes.

Opaque Compression

Server-side, non-human-readable compression. Highest compression ratio (99.3%), fully automated. Not inspectable, not portable, locked to a single vendor.

Verbatim Compaction

Deletes tokens rather than rewriting. Every surviving line is character-for-character from the original. Zero hallucination risk, fully inspectable. Lower compression ratio.

The core tension

Higher compression means more information loss. The question is not which approach compresses best. It is which information loss is acceptable for your use case. Losing a reasoning chain is recoverable. Losing an exact file path or error code is not.

LLM Summarization: How Claude Code and Factory Handle It

LLM summarization is the most common approach. When the context window fills up, an LLM reads the conversation history and produces a condensed natural language summary.

Anthropic's Approach (Claude Code)

Anthropic's context engineering docs describe Claude Code's compaction as generating detailed structured summaries of 7,000 to 12,000 characters. These include sections for analysis completed, files modified, key decisions, and pending tasks. The summaries are designed to be comprehensive enough that the agent can continue working after a context reset.

Claude Code structured summary (simplified)

## Analysis Completed
- Investigated auth middleware in src/middleware/auth.ts
- Found JWT validation bypassed for /api/health endpoints
- Confirmed issue reproduces with expired tokens

## Files Modified
- src/middleware/auth.ts (lines 47-89)
- src/tests/auth.test.ts (added 3 test cases)

## Pending Tasks
- Update API documentation for auth changes
- Run full integration test suite

Factory.ai's Anchored Iterative Summaries

Factory.ai's memory layer research uses an anchored iterative approach: instead of regenerating the summary from scratch, each new piece of information merges into a persistent summary state. This reduces drift but doesn't eliminate it. Their evaluation found accuracy scores of 3.74 to 4.04 out of 5, meaning roughly 1 in 5 facts gets distorted or lost.

The most damaging finding: multi-session information retention was only 37%. When an agent used summarization to carry context across sessions, nearly two-thirds of the information was lost or corrupted.
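The anchored-iterative idea can be sketched with a small piece of state (hypothetical data shapes, not Factory's implementation): instead of regenerating the whole summary, each new fact merges into a persistent per-section state, which is what reduces drift between compaction cycles.

```javascript
// Hypothetical anchored summary state: section names are stable "anchors",
// and new facts merge in rather than triggering a full rewrite.
function createSummaryState() {
  return { sections: new Map() }; // section name -> array of fact strings
}

function mergeFact(state, section, fact) {
  const facts = state.sections.get(section) ?? [];
  if (!facts.includes(fact)) facts.push(fact); // idempotent merge, no rewrite
  state.sections.set(section, facts);
}

function renderSummary(state) {
  return [...state.sections.entries()]
    .map(([name, facts]) => `## ${name}\n` + facts.map(f => `- ${f}`).join("\n"))
    .join("\n\n");
}

const state = createSummaryState();
mergeFact(state, "Files Modified", "src/middleware/auth.ts (lines 47-89)");
mergeFact(state, "Files Modified", "src/middleware/auth.ts (lines 47-89)"); // duplicate ignored
mergeFact(state, "Pending Tasks", "Run full integration test suite");
console.log(renderSummary(state));
```

The merge is idempotent, so re-summarizing the same events cannot duplicate or reword existing facts. Drift still creeps in at the point where an LLM decides *which* fact string to merge, which is why the approach reduces but does not eliminate distortion.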

The Re-Reading Loop

Summarization creates a subtle failure mode that benchmarks miss. When an agent searches for information and fills context with results, summarization compresses those results into a paraphrase. The paraphrase loses the specific details the agent needed, so the agent searches again. The new results fill context again. The agent summarizes again. This is the re-reading loop: search, fill, summarize, lose details, search again.

The loop is worst when context was filled by search or tool outputs. A 50K-token grep result gets summarized to 2K tokens. The agent needs a specific line number from those results, can't find it in the summary, and re-runs the grep. The new output refills context. In the worst case, the agent oscillates between searching and summarizing without making progress.
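A simple guard catches the loop in practice. This is a sketch with hypothetical names, not a production implementation: track the searches the agent has already run, and flag a repeat as a signal to recover details from a snapshot instead of re-searching.

```javascript
// Hypothetical loop guard: flag when the agent re-runs a search it already
// ran before the last summarization pass — the signature of the re-reading loop.
function createLoopGuard() {
  return { seen: new Set(), repeats: 0 };
}

function recordSearch(guard, query) {
  if (guard.seen.has(query)) {
    guard.repeats += 1; // the summary lost details; the agent is searching again
    return true;        // caller can surface a snapshot instead of re-searching
  }
  guard.seen.add(query);
  return false;
}

const guard = createLoopGuard();
recordSearch(guard, 'grep "jwt.verify" src/');                 // first run: fine
const looping = recordSearch(guard, 'grep "jwt.verify" src/'); // repeat after summarize
console.log(looping, guard.repeats); // prints: true 1
```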

What summarization loses

  • Exact file paths become paraphrased or hallucinated
  • Error messages get reworded, losing grep-able strings
  • Configuration values get rounded or approximated
  • Line numbers shift or disappear entirely
  • Multi-step reasoning chains get flattened into conclusions
  • Search results get collapsed, triggering re-searches that refill context

Opaque Compression: OpenAI's /responses/compact

OpenAI took a different path with their Codex /responses/compact endpoint. Instead of producing a human-readable summary, it generates a server-side compressed representation that only OpenAI's models can interpret.

99.3%
Compression ratio
3.35/5
Overall score (Factory eval)
3.43/5
Accuracy score (Factory eval)

The compression ratio is extraordinary. A 100K-token conversation compresses to roughly 700 tokens. But the trade-offs are significant:

  • Not inspectable. You cannot read, debug, or verify what the compressed representation contains.
  • Not portable. The compressed state only works with OpenAI's API. Switching providers means losing all compressed context.
  • Vendor lock-in. Your agent's memory is stored in a format only one company can decode.
  • Lower accuracy. Factory's eval scored it below both summarization approaches at 3.35/5 overall.

OpenAI opaque compression (conceptual)

// Before: 100,000 tokens of conversation history
const response = await openai.responses.create({
  model: "codex-mini",
  previous_response_id: lastResponseId,
  truncation: "auto"  // triggers server-side compression
});

// After: ~700 tokens of opaque compressed state
// You cannot inspect what was preserved or dropped
// Only OpenAI models can interpret this representation

Verbatim Compaction: The Delete-Not-Rewrite Approach

Verbatim compaction takes a fundamentally different approach: instead of rewriting or compressing, it deletes tokens. The model identifies which lines are least important and removes them. Every line that survives is character-for-character identical to the original input.

Morph Compact implements this approach. The model reads the full context, scores each line by relevance, and removes low-signal content while preserving the structure and exact text of everything that remains.

50-70%
Compression ratio
98%
Verbatim accuracy
3,300+
Tokens per second
0%
Hallucination risk

Verbatim compaction: what goes in, what comes out

# INPUT: 847 lines of agent conversation
User: Fix the JWT validation bug in src/middleware/auth.ts
Agent: I'll investigate the auth middleware...
[reads src/middleware/auth.ts]
[reads src/middleware/cors.ts]         ← removed (irrelevant)
[reads src/tests/auth.test.ts]
[grep results for "jwt.verify"]
[grep results for "express.Router"]   ← removed (irrelevant)
Agent: Found the issue on line 52...
[20 lines of reasoning about the fix]
[applies edit to auth.ts]
Agent: The fix is deployed. Running tests...
[full test output - 200 lines]        ← removed (low signal)
Agent: All 47 tests pass.

# OUTPUT: 312 lines — every surviving line unchanged
# File paths: exact. Error messages: exact. Line numbers: exact.

The compression ratio (50-70%) is lower than summarization or opaque compression. That is the trade-off. You keep less context in the window, but what you keep is guaranteed accurate. For coding agents that rely on exact file paths, error strings, and configuration values, this matters more than raw compression.

Verbatim compaction also breaks the re-reading loop. Because surviving lines are exact matches from the original, the agent can still find the specific file path, line number, or error code it needs. No re-search required. The context shrinks, but the signal stays intact.
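The delete-not-rewrite contract can be illustrated with a toy compactor (an illustrative heuristic scorer, not Morph's model): filter lines by a relevance predicate and never touch the survivors.

```javascript
// Toy deletion-based compactor: score each line with a predicate, drop
// low-signal lines, and never rewrite what survives.
function compactVerbatim(context, keepPredicate) {
  const lines = context.split("\n");
  const kept = lines.filter(keepPredicate);
  return kept.join("\n"); // every surviving line is byte-identical to the input
}

// Example heuristic: keep lines that mention paths, errors, or turns.
const highSignal = line =>
  /\.(ts|js|py)|Error:|line \d+|Agent:|User:/.test(line);

const input = [
  "User: Fix the JWT bug in src/middleware/auth.ts",
  "[full test output - 200 lines]",
  "Agent: Found the issue on line 52",
].join("\n");

const out = compactVerbatim(input, highSignal);
// Survivors are verbatim: each output line exists unchanged in the input.
console.log(out.split("\n").every(l => input.includes(l))); // prints: true
```

The real system replaces the regex with a model-driven relevance score, but the guarantee comes from the structure: filtering can only delete, so hallucination is impossible by construction.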

The hard part was speed. Deletion-based compaction requires the model to score every token for relevance, which historically took 8-15 seconds per compaction. That is too slow for inline use. Morph solved this with custom inference engines purpose-built for the compaction workload, bringing latency under 3 seconds. Fast enough to run before every LLM call, not just at the 95% capacity cliff.

Why zero hallucination matters for code

When a summarizer paraphrases src/middleware/auth.ts:52 as "the auth middleware file," the agent loses the ability to navigate directly to the right location. When it rewrites Error: ECONNREFUSED 127.0.0.1:5432 as "a database connection error," the debugging context is destroyed. Verbatim compaction preserves these details exactly.

Head-to-Head: Factory's Compression Benchmark

Factory.ai ran a systematic evaluation of context compression approaches across multiple dimensions: overall quality, accuracy, information retention, and multi-session persistence. Their benchmark used real coding agent conversations with ground-truth verification of preserved facts.

| Metric | LLM Summarization | Opaque (OpenAI) | Verbatim Compaction |
| --- | --- | --- | --- |
| Overall Score | 3.74-4.04 / 5 | 3.35 / 5 | N/A (not tested) |
| Accuracy Score | 3.74-4.04 / 5 | 3.43 / 5 | 98% verbatim |
| Compression Ratio | ~80-90% | 99.3% | 50-70% |
| Multi-Session Retention | 37% | Not measured | N/A (stateless) |
| Inspectable Output | Yes | No | Yes |
| Hallucination Risk | Moderate | Unknown | None |
| Vendor Lock-in | None | Full | None |
| Speed | Depends on model | Server-side | 3,300+ tok/s |

Factory's key finding: no single approach dominates across all dimensions. Summarization scored highest on overall quality because human-readable summaries are easier to evaluate. Opaque compression achieved the highest compression ratio. Verbatim compaction wins on accuracy and inspectability. The right choice depends on what your agent needs to preserve.

Observation Masking: The Surprising JetBrains Finding

JetBrains published research on their Junie agent that challenges the assumption that context management requires sophisticated compression at all. Their technique, observation masking, simply hides old tool outputs from the agent's context.

When a tool call becomes stale (the agent has moved past it), observation masking replaces the output with a placeholder. The tool call itself stays visible so the agent remembers what it did, but the potentially large output is removed. On SWE-bench, this approach matched the quality of full LLM summarization while using less compute.

Observation masking vs summarization

# OBSERVATION MASKING:
Tool call: read_file("src/middleware/auth.ts")
Output: [masked — 847 tokens removed]

Tool call: grep("jwt.verify", "src/")
Output: [masked — 234 tokens removed]

Tool call: read_file("src/tests/auth.test.ts")
Output: [still visible — most recent, still relevant]

# The agent remembers WHAT it did (read auth.ts, searched for jwt.verify)
# but doesn't carry the full output in context anymore.

# LLM SUMMARIZATION of same history:
"Investigated auth middleware and found JWT validation issue.
 Searched for jwt.verify usage across the codebase.
 Currently reviewing test file for auth module."
# ↑ Lost: exact file content, line numbers, grep matches
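In code, the masking rule above is simple. This is a sketch with a hypothetical message shape (not Junie's implementation): keep every tool call visible, but replace all but the most recent tool outputs with a placeholder.

```javascript
// Sketch of observation masking: tool *calls* stay visible so the agent
// remembers what it did; older tool *outputs* are replaced by placeholders.
function maskOldObservations(messages, keepRecent = 1) {
  const outputIdx = messages
    .map((m, i) => (m.role === "tool_output" ? i : -1))
    .filter(i => i >= 0);
  const keep = new Set(outputIdx.slice(-keepRecent)); // newest outputs survive
  return messages.map((m, i) => {
    if (m.role !== "tool_output" || keep.has(i)) return m;
    return { ...m, content: `[masked — ${m.tokens} tokens removed]` };
  });
}

const history = [
  { role: "tool_call", content: 'read_file("src/middleware/auth.ts")' },
  { role: "tool_output", content: "...847 tokens of file content...", tokens: 847 },
  { role: "tool_call", content: 'grep("jwt.verify", "src/")' },
  { role: "tool_output", content: "...234 tokens of matches...", tokens: 234 },
];

const masked = maskOldObservations(history);
console.log(masked[1].content); // prints: [masked — 847 tokens removed]
console.log(masked[3].content); // prints: ...234 tokens of matches...
```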

The ACON framework reached a similar conclusion through a different methodology. It demonstrated 26-54% peak token reduction while preserving 95%+ accuracy by aggressively compressing tool outputs. The key insight: the reasoning trace matters more than the raw data. The agent needs to remember what it decided, not necessarily the full output that informed the decision.

26-54%
ACON peak token reduction
95%+
Accuracy preserved
$0
Extra compute for masking

Practical implication

If your agent spends most of its tokens on tool outputs (file reads, grep results, API responses), observation masking or verbatim compaction of those outputs may be more effective than summarizing the entire conversation. Target the source of bloat, not the whole context.

Which Approach to Use When

The right approach depends on your agent's failure modes. If your agent hallucinates file paths after compaction, it needs higher fidelity. If it runs out of context too quickly, it needs higher compression. If you can't debug what it forgot, it needs inspectability.

| Scenario | Recommended Approach | Why |
| --- | --- | --- |
| Coding agent editing files | Verbatim compaction | Exact file paths and line numbers must survive compression |
| Long chat with general reasoning | LLM summarization | Reasoning chains compress well; exact tokens matter less |
| Maximum context window usage | Opaque compression | 99.3% ratio when you need every token of headroom |
| Debugging agent failures | Verbatim compaction | Inspectable output lets you see exactly what was preserved |
| Multi-provider compatibility | Verbatim compaction or summarization | Opaque compression is locked to OpenAI |
| Tool-output-heavy workflows | Observation masking + compaction | Target the bloat source directly; JetBrains showed this matches summarization quality |
| Context exhaustion on long tasks | Agent hand-off (Sourcegraph approach) | Spawn a new agent with a task summary instead of compressing |

Will Larson's Production Advice

Will Larson recommends triggering compaction at 80% context capacity, not at the limit. This gives the agent room to finish its current operation before compacting. He also suggests storing pre-compaction context as a virtual file. If compaction drops something critical, the agent can re-read it from the stored snapshot.

Production compaction strategy

// Trigger compaction at 80% capacity, not 100%
const COMPACTION_THRESHOLD = 0.8;

if (contextTokens / maxContextTokens > COMPACTION_THRESHOLD) {
  // 1. Save full context as a recoverable snapshot
  await writeVirtualFile(".context-snapshot", fullContext);

  // 2. Compact the context
  const compacted = await morphCompact(fullContext, {
    preserveRecentTurns: 3,
    targetRatio: 0.5
  });

  // 3. Agent can re-read snapshot if needed
  // "Read .context-snapshot to recover details about X"
}

The Sourcegraph Alternative: Hand-Off Instead of Compression

Sourcegraph took the most radical approach. They retired compaction in Amp entirely. When an agent's context fills up, instead of compressing, Amp spawns a new agent instance with a structured task summary. The new agent starts fresh with a clean context window, carrying only the essential task state.

This reframes context exhaustion from a compression problem to a coordination problem. The question shifts from "how do I compress this conversation?" to "how do I hand off this task to a fresh agent?"

Compression Approach

Keep the same agent, compress its context. Works when the agent has accumulated state that's hard to transfer. Risk: lossy compression drops critical details.

Hand-Off Approach

Spawn a new agent with a task summary. Works for long tasks with clear milestones. Risk: handoff summary must capture all essential state.

Hybrid Approach

Use verbatim compaction for the first compression cycle. If the agent needs a second compaction, hand off to a fresh agent instead. Best of both worlds.
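The hybrid policy reduces to a small decision function. This is a sketch with hypothetical names and thresholds, assuming you track how many compactions the current agent has already survived; the 80% trigger follows Larson's advice above.

```javascript
// Hybrid policy sketch: compact on the first capacity breach, hand off on
// the second, and do nothing while context is comfortably under threshold.
function nextAction(contextTokens, maxTokens, compactionsSoFar) {
  if (contextTokens / maxTokens < 0.8) return "continue"; // 80% trigger
  return compactionsSoFar === 0 ? "compact" : "handoff";
}

console.log(nextAction(50_000, 200_000, 0));  // prints: continue
console.log(nextAction(170_000, 200_000, 0)); // prints: compact
console.log(nextAction(170_000, 200_000, 1)); // prints: handoff
```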

Frequently Asked Questions

What is the difference between context compaction and summarization?

Summarization uses an LLM to rewrite conversation history into a condensed natural language summary. It can introduce hallucinations, paraphrase exact details, and lose technical specifics. Compaction deletes tokens from the original text without rewriting anything. Every surviving line is character-for-character from the input. Summarization achieves higher compression ratios but is lossy. Compaction preserves exact technical details at the cost of lower compression.

How does OpenAI's /responses/compact work?

OpenAI's compact endpoint produces a server-side, opaque compressed representation of conversation history. It achieves a 99.3% compression ratio, the most aggressive of any approach, but the output is not human-readable. You cannot inspect, debug, or verify what was preserved. The compressed state only works with OpenAI's API. Factory.ai's benchmark scored it 3.35/5 overall.

What is observation masking?

Observation masking is a technique from JetBrains' Junie agent research that replaces old tool outputs with placeholders while keeping tool calls visible. The agent remembers what actions it took without carrying the full output. On SWE-bench, it matched LLM summarization quality while using less compute.

When should I use verbatim compaction instead of summarization?

Use verbatim compaction when preserving exact technical details is critical: file paths, error codes, configuration values, API responses, and line numbers. Summarization tends to paraphrase or hallucinate these specifics. Morph Compact guarantees every surviving line is unchanged from the original, with 98% verbatim accuracy at 3,300+ tok/s.

What is the ACON framework?

ACON (Adaptive Context Optimization for Agents) demonstrated 26-54% peak token reduction while preserving 95%+ accuracy. The key insight: aggressive compression of tool outputs is safe because the reasoning trace matters more than the raw data the tools returned.

How does Sourcegraph Amp handle context limits?

Sourcegraph retired context compaction in their Amp agent. When context fills up, Amp spawns a new agent with a structured task summary instead of compressing the existing conversation. They treat context exhaustion as a coordination problem, not a compression problem.

What compression ratio does verbatim compaction achieve?

Morph Compact achieves 50-70% compression with 98% verbatim accuracy at 3,300+ tokens per second. The compression ratio is lower than LLM summarization (~80-90%) or opaque compression (99.3%), but the output is fully inspectable, diffable, and portable across any LLM provider.

Try Verbatim Compaction

Morph Compact deletes tokens instead of rewriting them. 98% verbatim accuracy, 3,300+ tok/s, zero hallucination risk. Every surviving line is character-for-character from your original context.