What Is a Compaction API?
A compaction API takes a conversation history, agent transcript, or any accumulated context and returns a shorter version that preserves the information the model needs to keep working. The agent continues with a compressed context instead of restarting or degrading.
The core problem: LLMs perform worse as context grows. Stanford's "Lost in the Middle" research demonstrated a U-shaped performance curve. Models recall information at the beginning and end of context well, but accuracy drops 15-47% for information in the middle. This is not a bug in any particular model. Chroma tested 18 frontier models in 2025, including GPT-4.1, Claude Opus 4, and Gemini 2.5, and every one exhibited this behavior at every context length tested.
The Compaction Trade-off
Every compaction method makes the same fundamental trade-off: compression ratio vs. information preservation. Aggressive compression (90%+) requires rewriting content, which introduces hallucination risk. Conservative compression (50-70%) can keep every surviving line verbatim, but retains more tokens. The right choice depends on whether your agent can tolerate fabricated details in its compressed history.
Three approaches exist today: LLM summarization (rewrite the context as a summary), opaque compression (encode context into a non-human-readable format), and verbatim compaction (delete low-value lines while keeping everything else word-for-word). Each has real trade-offs in speed, accuracy, and portability.
Why Agents Need Compaction
A coding agent working on a non-trivial task accumulates context fast. Each file read, tool call, command output, and reasoning step adds tokens. An agent editing code across 10 files can burn through 100K tokens of context in 15 minutes. At that rate, even a 200K-token window lasts about 30 minutes before quality starts falling off.
Context Rot
LLM accuracy degrades as context grows, even before hitting the window limit. A model with 200K capacity can start degrading at 50K. Compaction keeps the effective context small and dense.
Lost in the Middle
Information placed in the middle of long context is recalled 30%+ less accurately than information at the start or end. Compaction removes filler so critical details stay closer to the attention window's edges.
Cost Accumulation
Every token in the context window is billed on every subsequent API call. An agent making 50 calls with 150K tokens of context costs 3x more than one running at 50K tokens. Compaction cuts costs proportionally.
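The 3x figure follows from simple linear scaling. A minimal sketch, assuming each call resends roughly the same amount of context (a simplification; real sessions grow per call):

```typescript
// Cumulative input tokens billed across a session, assuming each call
// resends roughly the same context size (a simplifying assumption).
function sessionInputTokens(contextTokens: number, numCalls: number): number {
  return contextTokens * numCalls;
}

const bloated = sessionInputTokens(150_000, 50); // 7,500,000 input tokens
const compact = sessionInputTokens(50_000, 50); // 2,500,000 input tokens
console.log(bloated / compact); // 3, the 3x cost difference above
```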
The 95% Cliff
Most agent frameworks trigger compaction only when the context window is nearly full. Claude Code, for example, auto-compacts at roughly 95% of its 200K window. By that point the agent has already spent thousands of tokens operating with degraded context and has made decisions based on information it can no longer recall accurately. It contradicts earlier reasoning, re-reads files it already analyzed, and loops on problems it already solved.
Proactive compaction, running compression before the cliff rather than at it, keeps the agent operating in its highest-accuracy range. This requires a compaction method fast enough to run frequently without stalling the agent's workflow.
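A proactive loop can be sketched as a threshold check before each model call. The threshold, the token estimator, and the `compact` callback below are all illustrative, not part of any SDK:

```typescript
// Proactive compaction sketch: compress well before the window fills,
// not at a 95% emergency threshold. All names here are illustrative.
const COMPACT_AT_TOKENS = 60_000; // well below the model's hard limit

function estimateTokens(text: string): number {
  // Rough heuristic: ~4 characters per token for English text.
  return Math.ceil(text.length / 4);
}

async function maybeCompact(
  history: string,
  compact: (input: string) => Promise<string>, // e.g. a Morph Compact call
): Promise<string> {
  if (estimateTokens(history) < COMPACT_AT_TOKENS) return history;
  return compact(history); // compress early, keep accuracy high
}
```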
How Morph Compact Works
Morph Compact is verbatim compaction. It reads the input, identifies which lines carry information relevant to the agent's continued operation, and keeps those lines unchanged. Everything else is deleted. The output is a subset of the input, not a rewrite.
Architecture
The morph-compactor model runs on a custom inference engine purpose-built for compaction workloads. It processes 33,000 tokens per second and completes 100K-token inputs in under 2 seconds. The model supports a 1M-token context window, so it can handle the longest agent transcripts without chunking.
Two Modes of Operation
Objective compaction removes filler without specific guidance. Pass the conversation history, get back a compressed version. The model identifies boilerplate, redundant tool outputs, repeated information, and verbose formatting, then strips them while keeping decisions, code changes, file paths, error messages, and reasoning chains.
Query-based compaction biases retention toward a specific objective. Pass a query parameter describing what the agent needs next, and the model weights retention accordingly. If the next task is "fix the authentication middleware," auth-related context is preserved more aggressively than unrelated tool outputs.
What Gets Preserved
The model prioritizes: file paths and code references, error messages and stack traces, architectural decisions and their reasoning, function signatures and type definitions, explicit instructions from the user, and state that the agent will need for its next action. It removes: verbose tool output (full file contents when only a few lines matter), repeated information, exploratory dead ends the agent already abandoned, and formatting noise.
Zero Hallucination Guarantee
Because Morph Compact only deletes lines and never generates new content, the output cannot contain hallucinated file paths, fabricated function names, or invented error codes. Every line in the compacted output existed in the original input. This is not true of summarization-based approaches, where the model rewrites context and can introduce subtle inaccuracies.
API Usage
Morph Compact exposes a POST /v1/compact endpoint. It also works through the OpenAI-compatible /v1/chat/completions endpoint with model morph-compactor. Use whichever fits your existing stack.
Native Compact Endpoint
TypeScript (Morph SDK)
```typescript
import { MorphClient } from "@morphllm/morphsdk";

const morph = new MorphClient({ apiKey: process.env.MORPH_API_KEY });

// Objective compaction — no query needed
const result = await morph.compact({
  input: conversationHistory,
  compression_ratio: 0.5, // keep ~50% of tokens
  preserve_recent: 2, // last 2 messages stay uncompressed
});

console.log(result.usage.compression_ratio); // e.g. 0.48
console.log(result.output); // compressed context
```

Query-Based Compaction
```typescript
// Bias retention toward what the agent needs next
const result = await morph.compact({
  input: conversationHistory,
  query: "Fix the JWT validation in auth middleware",
  compression_ratio: 0.4,
});
// Result preserves auth-related context more aggressively
```

OpenAI-Compatible Endpoint
Works with any OpenAI SDK client
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.morphllm.com/v1",
  apiKey: process.env.MORPH_API_KEY,
});

const response = await client.chat.completions.create({
  model: "morph-compactor",
  messages: [{ role: "user", content: agentTranscript }],
});

const compactedContext = response.choices[0].message.content;
```

cURL
Direct HTTP request
```bash
curl -X POST "https://api.morphllm.com/v1/compact" \
  -H "Authorization: Bearer $MORPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "... conversation history ...",
    "query": "authentication middleware",
    "compression_ratio": 0.5,
    "preserve_recent": 2
  }'
```

Response Format
Compact API response
```json
{
  "id": "cmpr-7373faf8af65",
  "object": "compact",
  "model": "morph-compactor",
  "output": "... compressed context ...",
  "messages": [
    {
      "role": "user",
      "content": "... compressed content ...",
      "compacted_line_ranges": [{ "start": 5, "end": 10 }],
      "kept_line_ranges": [{ "start": 1, "end": 4 }, { "start": 11, "end": 20 }]
    }
  ],
  "usage": {
    "input_tokens": 98432,
    "output_tokens": 47890,
    "compression_ratio": 0.487,
    "processing_time_ms": 1840
  }
}
```

The response includes compacted_line_ranges and kept_line_ranges so you can see exactly which parts of the input were removed and which survived. This makes the compaction fully inspectable and debuggable.
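Because the ranges index lines of the original input, you can verify a compaction offline. A minimal sketch, assuming 1-indexed inclusive ranges as in the example response (the helper name is our own, not part of any SDK):

```typescript
interface LineRange {
  start: number;
  end: number;
}

// Reassemble the kept lines from the original input, assuming
// 1-indexed, inclusive line ranges as in the example response.
function extractKept(original: string, kept: LineRange[]): string {
  const lines = original.split("\n");
  return kept.flatMap((r) => lines.slice(r.start - 1, r.end)).join("\n");
}

// Verbatim guarantee check: every surviving line existed in the input.
const input = ["a", "b", "c", "d", "e"].join("\n");
console.log(extractKept(input, [{ start: 1, end: 2 }, { start: 4, end: 5 }]));
// a
// b
// d
// e
```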
Comparison to Alternatives
Five approaches to context management exist in production today. Each makes a different trade-off.
LLM Summarization
The model rewrites the conversation into a shorter natural-language summary. Claude Code and Cursor both use this approach. Anthropic's implementation produces structured summaries (7K-12K characters) with sections for analysis, files modified, and pending tasks. Factory.ai scored this approach 3.7/5 for accuracy across 36,000 real engineering messages. The failure mode: summaries can hallucinate file paths, drop exact error codes, and lose function signatures. Multi-session information retention was only 37%.
Opaque Compression (OpenAI)
OpenAI's /responses/compact endpoint produces a server-side compressed representation that is not human-readable. It achieves 99.3% compression, the most aggressive of any approach. But the output cannot be inspected, debugged, or used with any model other than OpenAI's. Factory.ai scored it 3.35/5 overall.
Sliding Window
Drop the oldest messages when context exceeds a threshold. Simple to implement, but discards information indiscriminately. An error message from early in the conversation might be the key to solving the current problem. Sliding windows have no way to distinguish valuable old context from noise.
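The appeal of a sliding window is that it is a few lines of code; the sketch below also shows the failure mode, since the drop is purely positional:

```typescript
// Naive sliding window: keep only the most recent N messages.
// Anything older is dropped regardless of how important it was.
function slidingWindow<T>(messages: T[], maxMessages: number): T[] {
  return messages.slice(-maxMessages);
}

const history = [
  "error: JWT secret missing", // the clue to the current bug
  "read file A",
  "read file B",
  "edit file B",
];
console.log(slidingWindow(history, 3));
// ["read file A", "read file B", "edit file B"]
// The key error message is the first thing dropped.
```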
RAG / Retrieval
Store conversation chunks in a vector database and retrieve relevant pieces when needed. This works for knowledge bases but is poorly suited to agent transcripts, where the relationship between messages is sequential and contextual. Retrieving individual chunks from a debugging session loses the causal chain of "I tried X, it failed because Y, so I switched to Z."
Verbatim Compaction (Morph Compact)
Delete low-value lines, keep everything else unchanged. No rewriting, no hallucination, no vendor lock-in. The compressed output works with any downstream model. The trade-off is a lower compression ratio (50-70% vs. 90%+ for summarization), but the output is trustworthy.
| | LLM Summarization | OpenAI Opaque | Sliding Window | Morph Compact |
|---|---|---|---|---|
| Compression ratio | 70-90% | 99.3% | Fixed window | 50-70% |
| Hallucination risk | Medium | Medium | None (drops data) | Zero |
| Factory.ai accuracy | 3.70/5 | 3.35/5 | N/A | Verbatim (no rewrite) |
| Speed | 1-2 min (LLM call) | Seconds | Instant | 33,000 tok/s (<3s) |
| Output inspectable | Yes (summary) | No (opaque) | Yes (truncated) | Yes (verbatim lines) |
| Vendor lock-in | Model-dependent | OpenAI only | None | None |
| Works with any model | Yes | No | Yes | Yes |
Benchmarks
Performance data from Factory.ai's systematic evaluation (36,000 real engineering messages), JetBrains' SWE-bench experiments, and the ACON research framework.
Factory.ai Evaluation
Factory.ai tested compaction approaches across real coding agent sessions. Structured summarization scored 3.70/5. OpenAI's opaque compression scored 3.35/5. The key finding: no approach scored above 4/5, meaning all methods lose some information. The question is whether the lost information is fabricated (summarization) or simply absent (compaction).
JetBrains Observation Masking
JetBrains tested a simple technique on SWE-bench: hide old tool outputs entirely rather than summarizing them. The result matched the quality of full LLM summarization while eliminating the compute cost. This validates the core insight behind verbatim compaction: for many agent workloads, you do not need to summarize. Removing noise is sufficient.
ACON Framework
The ACON research framework (October 2025) treats context compression as an optimization problem. Results: 26-54% peak token reduction while preserving 95%+ task accuracy. For smaller models, removing context noise actually improved performance by 20-46%. More context is not always better.
Morph Compact Performance
Morph Compact processes 33,000 tokens per second. 100K tokens compress in under 2 seconds. Every compaction completes in under 3 seconds. The model supports a 1M-token context window, handling the longest agent sessions without chunking or multi-pass processing.
Compression quality: 50-70% token reduction with 98% verbatim accuracy. Zero hallucination risk. Output is inspectable, diffable, and works with any downstream model (Claude, GPT, Gemini, open-source).
Pricing
Morph Compact is available on all plans, including the free tier. Pricing is credit-based, scaling with usage.
| | Free | Starter | Pro | Scale |
|---|---|---|---|---|
| Monthly price | $0 | $20 | $60 | $400 |
| Credits included | 250K | 2M | 8M | 80M |
| Compact access | Yes | Yes | Yes | Yes |
| Rate limits | Low | Generous | Generous | Unlimited |
Cost Impact
Compaction pays for itself by reducing downstream token costs. An agent running at 150K tokens of context makes every subsequent LLM call expensive. Compacting to 60K tokens (60% reduction) cuts the cost of every future call by 60%. For agents making dozens of calls per task, the savings compound quickly.
Example: an agent making 50 LLM calls per task with Claude Sonnet 4 at $3/M input tokens. At 150K tokens per call, that is $22.50 in input costs. Compacting to 60K tokens drops it to $9.00. The compaction itself costs a fraction of the savings.
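The example above works out as follows (plugging in the $3/M input price; the helper is our own illustration):

```typescript
// Input-token cost for a task: tokens per call x calls x price per token.
function inputCostUSD(
  tokensPerCall: number,
  calls: number,
  usdPerMillionTokens: number,
): number {
  return (tokensPerCall * calls * usdPerMillionTokens) / 1_000_000;
}

console.log(inputCostUSD(150_000, 50, 3)); // 22.5 — uncompacted
console.log(inputCostUSD(60_000, 50, 3)); // 9 — after 60% compaction
```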
Frequently Asked Questions
What is a compaction API?
A compaction API takes a conversation history or context window and returns a compressed version that preserves critical information while reducing token count. Unlike summarization, verbatim compaction like Morph Compact keeps every surviving line character-for-character from the original, eliminating hallucination risk.
How is compaction different from summarization?
Summarization rewrites your context into a shorter natural-language summary. Factory.ai scored summarization accuracy at 3.7/5 across 36,000 real engineering messages. Compaction deletes tokens rather than rewriting them. Every line in the output existed in the input. Summarization achieves higher compression (70-90%) but introduces hallucination risk. Compaction achieves 50-70% compression with zero hallucination.
What is context rot?
Context rot is the degradation in LLM performance as the context window fills up. Stanford's "Lost in the Middle" research showed 15-47% accuracy drops as context grows. Performance is highest for information at the beginning or end of context, and drops significantly for information in the middle. Compaction prevents context rot by removing low-value tokens before the window fills. See our full context rot guide for details.
How fast is Morph Compact?
33,000 tokens per second on a custom inference engine. 100K tokens compress in under 2 seconds. Every compaction finishes in under 3 seconds regardless of input size. Fast enough to run inline before every LLM call, not just as an emergency measure at 95% capacity.
Does Morph Compact hallucinate or rewrite content?
No. Morph Compact deletes lines from the input but never generates new text. Every line in the output is a character-for-character match from the input. This eliminates the hallucination risk present in summarization approaches, where the model might fabricate file paths, error codes, or function signatures.
What compression ratio does Morph Compact achieve?
50-70% token reduction. The compression_ratio parameter (0.3-0.7 typical range) controls how aggressively to compress. You can also pass a query parameter to bias retention toward information relevant to the agent's next task.
Can I use Morph Compact with the OpenAI SDK?
Yes. Point the baseURL to https://api.morphllm.com/v1 and use model morph-compactor with the standard chat completions endpoint. It also works with the Anthropic SDK and Vercel AI SDK.
How does Morph Compact compare to OpenAI's compaction?
OpenAI's /responses/compact endpoint produces opaque, non-human-readable compressed representations. It achieves 99.3% compression but the output is not inspectable and is locked to OpenAI's infrastructure. Morph Compact produces readable, verbatim output that works with any downstream model. Factory.ai scored OpenAI's approach 3.35/5 overall.
Try Morph Compact API
33,000 tok/s verbatim compaction. 50-70% token reduction. Zero hallucination. Every surviving line is character-for-character from the original input.