Context Compaction: The Technical Guide to Deletion-Based Context Management

Context compaction reduces LLM context windows by deleting low-signal tokens rather than rewriting them. This guide covers verbatim compaction, token-level pruning, observation masking, adaptive frameworks like ACON, and why prevention-first approaches outperform all of them. Includes real benchmarks from Factory.ai, JetBrains, and SWE-Bench Pro.

March 13, 2026 · 2 min read

Coding agents burn through context windows fast. Every file read, every grep result, every full-file rewrite pushes the agent closer to the compaction threshold. When compaction fires, the question is: does the agent rewrite its memory (risking hallucination) or delete the noise and keep the signal intact? Context compaction takes the second path.

  • 50-70%: Compression via verbatim deletion
  • 98%: Verbatim accuracy (Morph Compact)
  • 3,300+: Tokens/sec compaction speed
  • 0%: Hallucination risk

What Is Context Compaction

Context compaction is a deletion-based approach to reducing the number of tokens in an LLM's context window. Instead of rewriting content into a summary or encoding it into an opaque format, compaction identifies low-signal content and removes it. What survives is unchanged from the original input.

The term has become overloaded. Anthropic uses "compaction" to describe their server-side summarization API, which actually generates new text (a summary). In this guide, we use compaction in its strict sense: deletion-based reduction where no new tokens are generated.

The core property that makes compaction valuable for coding agents is verbatim fidelity. A file path like src/api/webhooks/stripe.ts:98 survives compaction exactly as written, or it gets deleted entirely. It never becomes "the webhook handler" or "the Stripe file." For agents that need to navigate codebases, edit specific lines, and match exact error strings, this property is more important than raw compression ratio.

Compaction in one sentence

Compaction removes tokens. Summarization rewrites tokens. Compression encodes tokens. Only compaction guarantees that surviving output is identical to the input.
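That guarantee is mechanically checkable. The sketch below (the function name is ours, not part of any API) measures what fraction of a compacted output's lines survive character-for-character from the input; pure deletion scores 1.0, while any rewriting drops the score:

```python
def verbatim_accuracy(original: str, compacted: str) -> float:
    """Fraction of non-empty compacted lines that appear character-for-
    character in the original. 1.0 means pure deletion: nothing rewritten."""
    original_lines = set(original.splitlines())
    survivors = [ln for ln in compacted.splitlines() if ln.strip()]
    if not survivors:
        return 1.0
    return sum(ln in original_lines for ln in survivors) / len(survivors)

context = (
    "read src/api/webhooks/stripe.ts:98\n"
    "Error: ECONNREFUSED 127.0.0.1:5432\n"
    "license header boilerplate"
)
# Deletion keeps surviving lines exact; summarization rewrites them.
assert verbatim_accuracy(context, "Error: ECONNREFUSED 127.0.0.1:5432") == 1.0
assert verbatim_accuracy(context, "a database connection error") == 0.0
```

A check like this is how a "98% verbatim accuracy" figure can be computed over a corpus of real compactions.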

Compaction vs Compression vs Summarization

These three terms get used interchangeably in LLM tooling, but they describe fundamentally different operations with different trade-offs. The distinction matters because each approach fails differently when applied to code.

| Property | Compaction (Deletion) | Compression (Encoding) | Summarization (Rewriting) |
|---|---|---|---|
| Mechanism | Delete low-signal tokens | Encode into opaque format | LLM rewrites into summary |
| Compression Ratio | 50-70% | 99.3% (OpenAI) | 70-90% |
| Hallucination Risk | None | Unknown (opaque) | Moderate to high |
| Output Inspectable | Yes, fully | No | Yes, but rewritten |
| Verbatim Fidelity | 98% (Morph) | N/A | Low |
| File Paths Preserved | Exact or deleted | Unknown | Often paraphrased |
| Error Messages Preserved | Exact or deleted | Unknown | Often shortened |
| Vendor Lock-in | None | Full (OpenAI only) | None |
| Speed | 3,300+ tok/s | Server-side | Requires full LLM call |
| Portable Across Providers | Yes | No | Yes |

For a deeper comparison including benchmark scores from Factory.ai's evaluation, see our compaction vs summarization guide.

Why Coding Agents Need Verbatim Fidelity

Coding agents operate on exact tokens. A summarizer that paraphrases Error: ECONNREFUSED 127.0.0.1:5432 as "a database connection error" destroys the debugging context. The agent can no longer grep for that error string, can no longer match it against known issues, can no longer include it in a fix commit message.

Compaction either keeps that error message verbatim or deletes it entirely. If the error is still relevant to the current task, the compaction model scores it as high-signal and preserves it. If the agent has already fixed the issue and moved on, the error gets deleted. Either way, no new text is generated, and no detail is corrupted.

Verbatim Compaction: Delete, Never Rewrite

Verbatim compaction is the purest form of context compaction. The model reads the full context, scores each line or block by relevance to the current task, and removes low-signal content. The output is a strict subset of the input.

Morph Compact implements verbatim compaction at production scale. The model processes context at 3,300+ tokens per second with 98% verbatim accuracy, meaning 98% of output lines are character-for-character matches with input lines.

  • 50-70%: Compression ratio
  • 98%: Verbatim accuracy
  • 3,300+: Tokens per second
  • 0%: Hallucination risk

What Gets Deleted

The compaction model learns which context types are high-signal vs expendable through training on real agent conversations. In practice, the following categories are consistently scored as low-signal:

  • Redundant tool outputs: File reads the agent already acted on. Grep results already processed. Test output from passing tests.
  • Exploratory dead ends: Files read but found irrelevant. Search queries that returned nothing useful.
  • Verbose boilerplate: License headers, import blocks the agent isn't modifying, configuration files read for reference.
  • Superseded information: Earlier versions of files that have since been edited. Old error messages from bugs already fixed.

What Survives

  • Active file paths and line numbers: Any reference the agent might need to navigate
  • Current error messages: Unresolved bugs and their exact text
  • Reasoning decisions: Why the agent chose approach A over B
  • Recent tool calls and their results: The last 3-5 operations
  • User instructions: The original task and any clarifications

Verbatim compaction: input vs output

# INPUT: 1,247 tokens across 15 tool calls
User: Fix the rate limiting bug in the API gateway
Agent: I'll investigate the rate limiter...
[read_file src/middleware/rateLimit.ts]        # 340 tokens
[read_file src/middleware/cors.ts]             # 280 tokens ← DELETED (irrelevant)
[grep "rateLimitExceeded" src/]               # 190 tokens
[read_file src/config/limits.json]            # 95 tokens  ← DELETED (already processed)
[read_file tests/rateLimit.test.ts]           # 410 tokens
Agent: Found the issue on line 47...
[edit_file src/middleware/rateLimit.ts:47-52]
Agent: Running tests...
[test output - 200 lines, all passing]        # 380 tokens ← DELETED (all pass)
Agent: Tests pass. The fix handles the edge case where...

# OUTPUT: 612 tokens — 51% reduction
# Every surviving line: unchanged from input
# File paths: exact. Error text: exact. Line numbers: exact.

The re-reading loop, broken

Summarization creates a failure mode where the agent loses a detail, re-searches for it, fills context with new results, summarizes again, and loses it again. Verbatim compaction breaks this loop because surviving content is exact. If the agent needs src/middleware/rateLimit.ts:47, it either still has that exact string in context or it was deleted. No paraphrased "the rate limit middleware" to chase down.

Token-Level Pruning (LLMlingua)

LLMlingua takes compaction to the individual token level. Instead of deleting entire lines or blocks, it scores each token's importance using perplexity from a small language model (GPT-2 or LLaMA-7B), then removes low-importance tokens one by one.

  • 20x: Max compression ratio (LLMlingua)
  • 3-6x: Speed improvement (LLMlingua-2)
  • 77.9%: GSM8K accuracy at 20x compression

LLMlingua-2 improved on the original with a BERT-based encoder for faster scoring (3-6x speedup) and better generalization to out-of-domain data. Both achieve impressive compression ratios on natural language benchmarks: up to 20x compression with minimal accuracy loss on reasoning tasks.
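The mechanics of token-level pruning can be sketched without loading a model. LLMlingua scores tokens with a small LM's perplexity; the toy below substitutes a crude proxy (filler words count as low-importance) so only the select-and-drop loop is shown. The filler list and function are illustrative, not from LLMlingua:

```python
# Toy token-level pruner. Real LLMlingua uses LM perplexity scores;
# this stand-in treats common filler words as low-importance.
FILLER = {"the", "a", "an", "is", "that", "of", "to", "in", "it"}

def prune_tokens(text: str, keep_ratio: float = 0.7) -> str:
    tokens = text.split()
    budget = max(1, round(len(tokens) * keep_ratio))
    # Rank indices: non-filler tokens first, earlier positions break ties
    ranked = sorted(range(len(tokens)),
                    key=lambda i: (tokens[i].lower() in FILLER, i))
    keep = set(ranked[:budget])
    # Emit survivors in their original order
    return " ".join(t for i, t in enumerate(tokens) if i in keep)

print(prune_tokens(
    "the answer to the question is that it depends on the data", 0.5))
# → "the answer question depends on data"
```

On prose this degrades gracefully; run the same loop over a line of code and it happily drops `!==` or `new`, which is exactly the failure mode discussed next.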

The Code Problem

Token-level pruning works well for natural language, where removing filler words ("the," "is," "that") rarely changes meaning. Code is different. Removing a single token can break syntax, change semantics, or corrupt identifiers:

Token-level pruning risks with code

# ORIGINAL:
if (user.role !== "admin" && !user.permissions.includes("write")) {
  throw new ForbiddenError("Insufficient permissions");
}

# AFTER aggressive token pruning (conceptual):
if (user.role "admin" && user.permissions.includes("write")) {
  throw ForbiddenError("Insufficient permissions");
}
# Removed: !==, !, new — completely changed the logic

For natural language reasoning tasks, LLMlingua is effective. For coding agent context where syntax integrity matters, line-level or block-level compaction (verbatim deletion) is safer. The granularity of deletion maps to the granularity of meaning: in code, the meaningful unit is a line or a block, not a token.

| Dimension | Token-Level (LLMlingua) | Line-Level (Verbatim) |
|---|---|---|
| Compression Ratio | Up to 20x | 50-70% |
| Code Safety | Low (can break syntax) | High (lines stay intact) |
| Speed | Requires separate model pass | 3,300+ tok/s inline |
| Best For | Natural language, RAG prompts | Code, agent conversations |
| Hallucination Risk | Low, but can corrupt syntax | None |

Observation Masking (JetBrains Approach)

JetBrains published research on their Junie agent showing that you can get surprisingly far without any sophisticated compression at all. Their technique, observation masking, replaces old tool outputs with a placeholder while keeping the tool call itself visible.

The agent remembers what it did (read a file, ran a grep, executed a test) but does not carry the full output in context. The output is simply replaced with [masked]. On SWE-bench, this matched the quality of full LLM summarization while using zero extra compute for the masking step.

Observation masking in practice

# BEFORE masking: full tool outputs in context
Tool: read_file("src/middleware/rateLimit.ts")
Output: [340 tokens of file content]

Tool: grep("rateLimitExceeded", "src/")
Output: [190 tokens of grep results]

Tool: read_file("tests/rateLimit.test.ts")
Output: [410 tokens of test file]

# AFTER masking: tool calls preserved, old outputs removed
Tool: read_file("src/middleware/rateLimit.ts")
Output: [masked — 340 tokens freed]

Tool: grep("rateLimitExceeded", "src/")
Output: [masked — 190 tokens freed]

Tool: read_file("tests/rateLimit.test.ts")
Output: [still visible — most recent, likely still relevant]

# Agent knows it read rateLimit.ts and grepped for the error.
# If it needs the file content again, it re-reads (targeted).
  • $0: Extra compute for masking
  • 60-80%: Token reduction (typical)
  • SWE-bench: Matched summarization quality

The insight behind observation masking aligns with context rot research: most tool outputs are consumed once and never referenced again. The agent reads a file to understand it, makes a decision, and moves on. Carrying 340 tokens of that file through the rest of the conversation is pure waste. Masking removes the waste while preserving the agent's action history.

When masking falls short

Observation masking works best when tool outputs are consumed once. It fails when the agent needs to reference old output later, such as comparing two versions of a file or correlating errors across multiple test runs. In these cases, verbatim compaction is more appropriate: it selectively removes low-signal content while preserving the specific lines the agent still needs.
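Masking is simple enough to implement in a few lines over an OpenAI-style message list. A minimal sketch, assuming tool outputs carry `"role": "tool"` and that keeping the last few outputs visible is the desired policy:

```python
def mask_observations(messages, keep_recent=2, placeholder="[masked]"):
    """Replace older tool outputs with a placeholder; the tool calls
    themselves and the most recent outputs stay intact."""
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    protected = set(tool_idx[-keep_recent:]) if keep_recent else set()
    return [
        {**m, "content": placeholder}
        if m["role"] == "tool" and i not in protected else m
        for i, m in enumerate(messages)
    ]

history = [
    {"role": "assistant", "content": 'read_file("src/middleware/rateLimit.ts")'},
    {"role": "tool", "content": "340 tokens of file content..."},
    {"role": "assistant", "content": 'grep("rateLimitExceeded", "src/")'},
    {"role": "tool", "content": "190 tokens of grep results..."},
    {"role": "assistant", "content": 'read_file("tests/rateLimit.test.ts")'},
    {"role": "tool", "content": "410 tokens of test file..."},
]
masked = mask_observations(history, keep_recent=1)
assert masked[1]["content"] == "[masked]"       # old output freed
assert masked[3]["content"] == "[masked]"
assert masked[5]["content"].startswith("410")   # most recent survives
```

The zero-compute claim follows directly: no model call is involved, only list manipulation.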

Adaptive Compaction (ACON Framework)

The ACON (Adaptive Context Optimization for Agents) framework demonstrated that uniform compression is suboptimal. Different parts of an agent's context have different information density and different relevance to the current task. ACON treats each context segment independently, applying different compression levels based on content type.

  • 26-54%: Peak token reduction
  • 95%+: Accuracy preserved
  • Adaptive: Per-segment compression

The Key Insight: Not All Context Is Equal

ACON found that aggressive compression of tool outputs is safe because the reasoning trace matters more than the raw data. An agent's chain of thought (why it chose this file, what pattern it noticed, what fix it decided on) carries more information-per-token than the raw grep output or file content that informed those decisions.

This maps to a hierarchy of compaction safety:

Safe to Compress Aggressively

Raw tool outputs (file reads, grep results, test output). High token count, low information density after initial consumption. 60-80% of context in typical agent sessions.

Compress Carefully

Reasoning traces and decision records. Medium token count, high information density. The agent's chain of thought about why it chose a particular approach.

Never Compress

User instructions, active error messages, current file paths, pending task state. Low token count, irreplaceable information. Loss causes agent failure.
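The hierarchy above reduces to a per-segment keep ratio. The sketch below is illustrative of the idea, not of ACON's actual implementation; the category names and ratios are our assumptions:

```python
# Tiered keep ratios in the spirit of ACON's hierarchy. The values are
# illustrative assumptions, not numbers from the paper.
KEEP_RATIO = {
    "tool_output":      0.20,  # compress aggressively
    "reasoning":        0.80,  # compress carefully
    "user_instruction": 1.00,  # never compress
    "active_error":     1.00,  # never compress
}

def token_budget(segment_type: str, token_count: int) -> int:
    """Tokens a segment may keep after compaction under this policy."""
    return round(token_count * KEEP_RATIO.get(segment_type, 0.50))

assert token_budget("tool_output", 1000) == 200
assert token_budget("user_instruction", 80) == 80
```

Since tool outputs are 60-80% of a typical session's context, an aggressive ratio on that one tier alone accounts for most of the 26-54% overall reduction.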

Morph Compact implicitly applies this hierarchy. The model is trained on real agent conversations and learns which content types are expendable vs critical. The result is similar to ACON's adaptive approach but without requiring explicit segment classification.

Server-Side Compaction APIs

Both Anthropic and OpenAI now offer server-side compaction as API features. Their implementations differ significantly in mechanism and trade-offs.

Anthropic's Compaction API

Anthropic's server-side compaction (beta, available for Claude Opus 4.6 and Sonnet 4.6) generates a structured summary when input tokens exceed a configurable threshold. The API detects when context approaches the limit, generates a compaction block containing the summary, and continues the conversation from that compressed state.

This is technically summarization, not compaction in the strict sense. The output is new text generated by the model, not a subset of the input. The summary is human-readable and structured, but carries the risks of any generative approach: paraphrased file paths, shortened error messages, lost line numbers.

OpenAI's Opaque Compression

OpenAI's approach produces a server-side, non-human-readable compressed representation. It achieves 99.3% compression ratio, the most aggressive of any approach, but the output is not inspectable. You cannot verify what was preserved or dropped. The compressed state only works with OpenAI's models, creating full vendor lock-in.

| Property | Anthropic Compaction | OpenAI Compression | Morph Compact |
|---|---|---|---|
| Mechanism | LLM summary generation | Opaque encoding | Verbatim deletion |
| Compression Ratio | ~70-90% | 99.3% | 50-70% |
| Output Inspectable | Yes (summary) | No | Yes (verbatim) |
| Hallucination Risk | Moderate | Unknown | None |
| Vendor Lock-in | Anthropic API | OpenAI API | None (OpenAI SDK compatible) |
| Speed | Full LLM call | Server-side | 3,300+ tok/s |
| Verbatim Fidelity | Low (rewritten) | Unknown | 98% |

Benchmarks: How Compaction Methods Compare

Evaluating compaction methods requires measuring more than compression ratio. The metrics that matter for coding agents are: does the agent still complete the task correctly after compaction? Does it hallucinate file paths? Does it re-search for information it already had?

Factory.ai Memory Layer Evaluation

Factory.ai evaluated context management approaches across 36,000 real engineering messages. Their structured summarization approach scored 3.70/5 overall, while OpenAI's opaque compression scored 3.35/5. The evaluation measured accuracy, information retention, and consistency across sessions.

The most revealing metric was multi-session information retention: only 37%. When agents used summarization to carry context across sessions, nearly two-thirds of information was lost or corrupted. This highlights why verbatim compaction's guarantee of exact preservation matters more than higher compression ratios.

| Metric | LLM Summary | Opaque (OpenAI) | Verbatim (Morph) | Observation Masking |
|---|---|---|---|---|
| Overall Score | 3.70/5 | 3.35/5 | N/A (not in eval) | Matched summary |
| Accuracy | 3.74-4.04/5 | 3.43/5 | 98% verbatim | No new errors |
| Compression | 70-90% | 99.3% | 50-70% | 60-80% |
| Multi-Session Retention | 37% | Not measured | N/A (stateless) | N/A |
| Hallucination Risk | Moderate | Unknown | None | None |
| Extra Compute Cost | Full LLM call | Included | 3,300+ tok/s | $0 |

SWE-Bench Pro Results

FlashCompact's prevention-first approach, combining WarpGrep targeted search, Fast Apply compact diffs, and Morph Compact verbatim cleanup, achieved state-of-the-art results on SWE-Bench Pro. The key factor was not better compaction but less need for compaction: agents using FlashCompact consumed 3-4x fewer tokens per session, meaning they hit the compaction threshold 3-4x less often.

  • 3.70/5: Factory.ai summary score
  • 3.35/5: Factory.ai OpenAI score
  • 37%: Multi-session retention
  • 3-4x: Fewer compactions (FlashCompact)

Prevention-First: Reducing the Need for Compaction

The best compaction is the one you never run. Cognition measured that agents spend 60% of their time searching for code, dumping entire files into context to find 10-line functions. Every full-file read accelerates the countdown to compaction.

FlashCompact attacks context waste at two sources before compaction becomes necessary:

Search Waste: WarpGrep

RL-trained semantic codebase search returns only relevant code snippets, not entire files. 0.73 F1 in 3.8 steps vs grep's 0.19 F1 in 12 steps. Prevents 60%+ of context waste from search operations.

Write Waste: Fast Apply

10,500 tok/s compact diffs instead of full file rewrites. A 3-line edit to a 200-line file consumes 3 lines of context, not 200. Prevents write operations from echoing unchanged code back into context.

Residual Noise: Morph Compact

Verbatim deletion at 3,300+ tok/s cleans up whatever noise remains. 50-70% compression with 98% verbatim accuracy. Zero hallucination. The last line of defense after prevention.

The combination extends effective context life by 3-4x. An agent that would normally hit compaction after 25 tool calls can run 75-100 tool calls before needing cleanup. When compaction does fire, the input is already higher-signal (because prevention removed the noise), so the compaction output is more useful too.

Prevention-first vs compaction-only

# WITHOUT prevention (standard agent):
# Tool call 1-10: file reads, greps → ~60K tokens consumed
# Tool call 11-20: more reads, edits → ~120K tokens consumed
# Tool call 21-25: approaching 167K limit
# → Auto-compact fires, summarizes everything
# → Agent loses file paths, re-searches, fills context again
# → Second compaction at tool call 35. Third at 45.
# Total compactions for 50-tool-call task: 3-4

# WITH FlashCompact (prevention-first):
# Tool call 1-10: WarpGrep returns snippets → ~15K tokens consumed
# Tool call 11-20: Fast Apply uses diffs → ~25K tokens consumed
# Tool call 21-50: steady, efficient growth → ~80K tokens consumed
# → Morph Compact runs once at tool call 40 for cleanup
# → Agent retains exact file paths, no re-searching
# Total compactions for 50-tool-call task: 0-1

Why prevention compounds

Each compaction cycle has a cost: the agent loses some context fidelity, may re-search for lost details, and burns tokens on the re-search. Preventing one compaction doesn't just save the compaction time. It prevents the downstream re-searching and re-reading that follows. Three prevented compactions might save 30-50K tokens of redundant re-searching.

Integration Guide: Using Morph Compact

Morph Compact is OpenAI SDK compatible. Point the base URL at api.morphllm.com/v1, use morph-compact as the model, and send standard chat completion requests. The response contains the compacted version of your input, where every surviving line is verbatim from the original.

Morph Compact via OpenAI SDK (Python)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.morphllm.com/v1",
    api_key="your-morph-api-key"
)

# Compact a conversation that's approaching the context limit
response = client.chat.completions.create(
    model="morph-compact",
    messages=[
        {
            "role": "user",
            "content": full_conversation_text
            # Can be the entire agent conversation as a single string
        }
    ]
)

compacted = response.choices[0].message.content
# compacted contains only high-signal lines from the input
# Every surviving line is character-for-character identical

Morph Compact via OpenAI SDK (TypeScript)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.morphllm.com/v1",
  apiKey: "your-morph-api-key",
});

const response = await client.chat.completions.create({
  model: "morph-compact",
  messages: [
    {
      role: "user",
      content: fullConversationText,
    },
  ],
});

const compacted = response.choices[0].message.content;
// Every surviving line is verbatim from the input

When to Trigger Compaction

Trigger compaction at 80% context capacity, not at the limit. This gives the agent room to finish its current operation before compacting. If you wait until 95% (the Claude Code default), the agent has almost no room to work after compaction fires, and the compaction itself adds tokens.

For a belt-and-suspenders approach, store pre-compaction context as a virtual file. If compaction drops something the agent needs later, it can re-read it from the stored snapshot rather than re-running the original tool call.
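Both recommendations fit in one small wrapper. This is a minimal sketch, assuming a 200K-token window and a caller-supplied `compact_fn` (for example, a request to morph-compact as shown earlier); the names are illustrative:

```python
CONTEXT_WINDOW = 200_000   # assumption: set to your model's real window
THRESHOLD = 0.80           # fire at 80% capacity, not at the hard limit

def maybe_compact(context: str, token_count: int, snapshots: dict,
                  compact_fn) -> str:
    """Run compact_fn once usage crosses the threshold, snapshotting the
    full context first so anything compaction drops can be re-read later."""
    if token_count < CONTEXT_WINDOW * THRESHOLD:
        return context                        # plenty of headroom
    snapshots["pre_compaction"] = context     # belt-and-suspenders copy
    return compact_fn(context)
```

Triggering at 80% rather than 95% leaves the agent roughly 40K tokens of working room after compaction instead of 10K, under the window size assumed above.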

Frequently Asked Questions

What is context compaction?

Context compaction is a deletion-based strategy for reducing tokens in an LLM's context window. Instead of rewriting content into a summary or encoding it into a compressed format, compaction identifies low-signal content and removes it. Every surviving line is character-for-character identical to the original input. Morph Compact achieves 50-70% compression with 98% verbatim accuracy at 3,300+ tokens per second.

How does compaction differ from compression and summarization?

Compression encodes content into a smaller representation, like OpenAI's opaque format (99.3% ratio, not inspectable, vendor-locked). Summarization rewrites content into a condensed version (70-90% ratio, but introduces hallucination risk). Compaction deletes tokens without rewriting (50-70% ratio, zero hallucination). Only compaction guarantees verbatim fidelity of surviving content. For a detailed comparison, see compaction vs summarization.

What is verbatim compaction?

Verbatim compaction means every surviving line in the output is identical to the input. No paraphrasing, no rewriting, no new text generation. The model scores lines by relevance and deletes low-signal ones. This eliminates hallucination risk entirely because the model never generates new content during the compaction process.

What is token-level pruning?

Token-level pruning (LLMlingua) removes individual tokens based on perplexity scores rather than entire lines. It achieves up to 20x compression on natural language, but risks breaking code syntax by removing syntactically important tokens. For coding agents, line-level verbatim compaction is safer.

What is observation masking?

Observation masking (from JetBrains' Junie research) replaces old tool outputs with placeholders while keeping tool calls visible. The agent remembers what it did but doesn't carry full outputs in context. On SWE-bench, it matched LLM summarization quality while costing zero extra compute. It works best when tool outputs are consumed once and not referenced later.

Does context compaction cause hallucinations?

Verbatim compaction does not cause hallucinations because no new text is generated. Every surviving line is identical to the input. By contrast, summarization-based approaches (used by Claude Code and Cursor) rewrite context in the model's own words, which can alter file paths, error messages, and line numbers. OpenAI's opaque compression produces non-inspectable output, making hallucination risk unknown.

How does ACON adaptive compaction work?

ACON (Adaptive Context Optimization for Agents) applies different compression levels to different context segments. Tool outputs get compressed aggressively (they're high-token, low-information-density after consumption). Reasoning traces get preserved more carefully. The framework achieved 26-54% token reduction while preserving 95%+ accuracy on agent benchmarks.

Is Morph Compact compatible with the OpenAI SDK?

Yes. Point base_url at api.morphllm.com/v1 and use morph-compact as the model. Standard chat completion requests work without modification. No new SDK, no new dependencies, no custom client.

Related Pages

Verbatim Compaction at 3,300+ tok/s

Morph Compact deletes noise and keeps signal. 98% verbatim accuracy, zero hallucination, OpenAI SDK compatible. Every surviving line is character-for-character from your original context.