What Is a Compaction API?
A compaction API takes a conversation history, agent transcript, or any accumulated context and returns a shorter version that preserves the information the model needs to keep working. The agent continues with a compressed context instead of restarting or degrading.
The core problem: LLMs perform worse as context grows. Stanford's "Lost in the Middle" research demonstrated a U-shaped performance curve. Models recall information at the beginning and end of context well, but accuracy drops 15-47% for information in the middle. This is not a bug in any particular model. Chroma tested 18 frontier models in 2025, including GPT-4.1, Claude Opus 4, and Gemini 2.5, and every one exhibited this behavior at every context length tested.
The Compaction Trade-off
Every compaction method makes the same fundamental trade-off: compression ratio vs. information preservation. Aggressive compression (90%+) requires rewriting content, which introduces hallucination risk. Conservative compression (50-70%) can keep every surviving line verbatim, but retains more tokens. The right choice depends on whether your agent can tolerate fabricated details in its compressed history.
Three approaches exist today: LLM summarization (rewrite the context as a summary), opaque compression (encode context into a non-human-readable format), and verbatim compaction (delete low-value lines while keeping everything else word-for-word). Each has real trade-offs in speed, accuracy, and portability.
Why Agents Need Compaction
A coding agent working on a non-trivial task accumulates context fast. Each file read, tool call, command output, and reasoning step adds tokens. An agent editing code across 10 files can burn through 100K tokens of context in 15 minutes. At that rate, even a 200K-token window lasts about 30 minutes before quality starts falling off.
Context Rot
LLM accuracy degrades as context grows, even before hitting the window limit. A model with 200K capacity can start degrading at 50K. Compaction keeps the effective context small and dense.
Lost in the Middle
Information placed in the middle of long context is recalled 30%+ less accurately than information at the start or end. Compaction removes filler so critical details stay closer to the attention window's edges.
Cost Accumulation
Every token in the context window is billed on every subsequent API call. An agent making 50 calls with 150K tokens of context costs 3x more than one running at 50K tokens. Compaction cuts costs proportionally.
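The 3x figure follows from simple linear scaling. A minimal sketch, assuming each call resends roughly the same amount of context (a simplification; real sessions grow per call):

```typescript
// Cumulative input tokens billed across a session, assuming each call
// resends roughly the same context size (a simplifying assumption).
function sessionInputTokens(contextTokens: number, numCalls: number): number {
  return contextTokens * numCalls;
}

const bloated = sessionInputTokens(150_000, 50); // 7,500,000 input tokens
const compact = sessionInputTokens(50_000, 50); // 2,500,000 input tokens
console.log(bloated / compact); // 3, the 3x cost difference above
```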
The 95% Cliff
Most agent frameworks trigger compaction only when the context window is nearly full. Claude Code, for example, auto-compacts at roughly 95% of its 200K window. By that point the agent has already spent thousands of tokens operating with degraded context and has made decisions based on information it can no longer recall accurately. It contradicts earlier reasoning, re-reads files it already analyzed, and loops on problems it already solved.
Proactive compaction, running compression before the cliff rather than at it, keeps the agent operating in its highest-accuracy range. This requires a compaction method fast enough to run frequently without stalling the agent's workflow.
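A proactive loop can be sketched as a threshold check before each model call. The threshold, the token estimator, and the `compact` callback below are all illustrative, not part of any SDK:

```typescript
// Proactive compaction sketch: compress well before the window fills,
// not at a 95% emergency threshold. All names here are illustrative.
const COMPACT_AT_TOKENS = 60_000; // well below the model's hard limit

function estimateTokens(text: string): number {
  // Rough heuristic: ~4 characters per token for English text.
  return Math.ceil(text.length / 4);
}

async function maybeCompact(
  history: string,
  compact: (input: string) => Promise<string>, // e.g. a Morph Compact call
): Promise<string> {
  if (estimateTokens(history) < COMPACT_AT_TOKENS) return history;
  return compact(history); // compress early, keep accuracy high
}
```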
How Morph Compact Works
Morph Compact is verbatim compaction. It reads the input, identifies which lines carry information relevant to the agent's continued operation, and keeps those lines unchanged. Everything else is deleted. The output is a subset of the input, not a rewrite.
Architecture
The morph-compactor model runs on a custom inference engine purpose-built for compaction workloads. It processes 33,000 tokens per second and completes 100K-token inputs in under 2 seconds. The model supports a 1M-token context window, so it can handle the longest agent transcripts without chunking.
Two Modes of Operation
Objective compaction removes filler without specific guidance. Pass the conversation history, get back a compressed version. The model identifies boilerplate, redundant tool outputs, repeated information, and verbose formatting, then strips them while keeping decisions, code changes, file paths, error messages, and reasoning chains.
Query-based compaction biases retention toward a specific objective. Pass a query parameter describing what the agent needs next, and the model weights retention accordingly. If the next task is "fix the authentication middleware," auth-related context is preserved more aggressively than unrelated tool outputs.
What Gets Preserved
The model prioritizes: file paths and code references, error messages and stack traces, architectural decisions and their reasoning, function signatures and type definitions, explicit instructions from the user, and state that the agent will need for its next action. It removes: verbose tool output (full file contents when only a few lines matter), repeated information, exploratory dead ends the agent already abandoned, and formatting noise.
Zero Hallucination Guarantee
Because Morph Compact only deletes lines and never generates new content, the output cannot contain hallucinated file paths, fabricated function names, or invented error codes. Every line in the compacted output existed in the original input. This is not true of summarization-based approaches, where the model rewrites context and can introduce subtle inaccuracies.
API Usage
Morph Compact exposes a POST /v1/compact endpoint. It also works through the OpenAI-compatible /v1/chat/completions endpoint with model morph-compactor. Use whichever fits your existing stack.
Native Compact Endpoint
TypeScript (Morph SDK)
```typescript
import { MorphClient } from "@morphllm/morphsdk";

const morph = new MorphClient({ apiKey: process.env.MORPH_API_KEY });

// Objective compaction — no query needed
const result = await morph.compact({
  input: conversationHistory,
  compression_ratio: 0.5, // keep ~50% of tokens
  preserve_recent: 2, // last 2 messages stay uncompressed
});

console.log(result.usage.compression_ratio); // e.g. 0.48
console.log(result.output); // compressed context
```

Query-Based Compaction
```typescript
// Bias retention toward what the agent needs next
const result = await morph.compact({
  input: conversationHistory,
  query: "Fix the JWT validation in auth middleware",
  compression_ratio: 0.4,
});
// Result preserves auth-related context more aggressively
```

OpenAI-Compatible Endpoint
Works with any OpenAI SDK client
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.morphllm.com/v1",
  apiKey: process.env.MORPH_API_KEY,
});

const response = await client.chat.completions.create({
  model: "morph-compactor",
  messages: [{ role: "user", content: agentTranscript }],
});

const compactedContext = response.choices[0].message.content;
```

cURL
Direct HTTP request
```bash
curl -X POST "https://api.morphllm.com/v1/compact" \
  -H "Authorization: Bearer $MORPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "... conversation history ...",
    "query": "authentication middleware",
    "compression_ratio": 0.5,
    "preserve_recent": 2
  }'
```

Response Format
Compact API response
```json
{
  "id": "cmpr-7373faf8af65",
  "object": "compact",
  "model": "morph-compactor",
  "output": "... compressed context ...",
  "messages": [
    {
      "role": "user",
      "content": "... compressed content ...",
      "compacted_line_ranges": [{ "start": 5, "end": 10 }],
      "kept_line_ranges": [{ "start": 1, "end": 4 }, { "start": 11, "end": 20 }]
    }
  ],
  "usage": {
    "input_tokens": 98432,
    "output_tokens": 47890,
    "compression_ratio": 0.487,
    "processing_time_ms": 1840
  }
}
```

The response includes compacted_line_ranges and kept_line_ranges so you can see exactly which parts of the input were removed and which survived. This makes the compaction fully inspectable and debuggable.
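Because the ranges index lines of the original input, you can verify a compaction offline. A minimal sketch, assuming 1-indexed inclusive ranges as in the example response (the helper name is our own, not part of any SDK):

```typescript
interface LineRange {
  start: number;
  end: number;
}

// Reassemble the kept lines from the original input, assuming
// 1-indexed, inclusive line ranges as in the example response.
function extractKept(original: string, kept: LineRange[]): string {
  const lines = original.split("\n");
  return kept.flatMap((r) => lines.slice(r.start - 1, r.end)).join("\n");
}

// Verbatim guarantee check: every surviving line existed in the input.
const input = ["a", "b", "c", "d", "e"].join("\n");
console.log(extractKept(input, [{ start: 1, end: 2 }, { start: 4, end: 5 }]));
// a
// b
// d
// e
```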
Comparison to Alternatives
Five approaches to context management exist in production today. Each makes a different trade-off.
LLM Summarization
The model rewrites the conversation into a shorter natural-language summary. Claude Code and Cursor both use this approach. Anthropic's implementation produces structured summaries (7K-12K characters) with sections for analysis, files modified, and pending tasks. Factory.ai scored this approach 3.7/5 for accuracy across 36,000 real engineering messages. The failure mode: summaries can hallucinate file paths, drop exact error codes, and lose function signatures. Multi-session information retention was only 37%.
Opaque Compression (OpenAI)
OpenAI's /responses/compact endpoint produces a server-side compressed representation that is not human-readable. It achieves 99.3% compression, the most aggressive of any approach. But the output cannot be inspected, debugged, or used with any model other than OpenAI's. Factory.ai scored it 3.35/5 overall.
Sliding Window
Drop the oldest messages when context exceeds a threshold. Simple to implement, but discards information indiscriminately. An error message from early in the conversation might be the key to solving the current problem. Sliding windows have no way to distinguish valuable old context from noise.
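The appeal of a sliding window is that it is a few lines of code; the sketch below also shows the failure mode, since the drop is purely positional:

```typescript
// Naive sliding window: keep only the most recent N messages.
// Anything older is dropped regardless of how important it was.
function slidingWindow<T>(messages: T[], maxMessages: number): T[] {
  return messages.slice(-maxMessages);
}

const history = [
  "error: JWT secret missing", // the clue to the current bug
  "read file A",
  "read file B",
  "edit file B",
];
console.log(slidingWindow(history, 3));
// ["read file A", "read file B", "edit file B"]
// The key error message is the first thing dropped.
```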
RAG / Retrieval
Store conversation chunks in a vector database and retrieve relevant pieces when needed. This works for knowledge bases but is poorly suited to agent transcripts, where the relationship between messages is sequential and contextual. Retrieving individual chunks from a debugging session loses the causal chain of "I tried X, it failed because Y, so I switched to Z."
Verbatim Compaction (Morph Compact)
Delete low-value lines, keep everything else unchanged. No rewriting, no hallucination, no vendor lock-in. The compressed output works with any downstream model. The trade-off is a lower compression ratio (50-70% vs. 90%+ for summarization), but the output is trustworthy.
| | LLM Summarization | OpenAI Opaque | Sliding Window | Morph Compact |
|---|---|---|---|---|
| Compression ratio | 70-90% | 99.3% | Fixed window | 50-70% |
| Hallucination risk | Medium | Medium | None (drops data) | Zero |
| Factory.ai accuracy | 3.70/5 | 3.35/5 | N/A | Verbatim (no rewrite) |
| Speed | 1-2 min (LLM call) | Seconds | Instant | 33,000 tok/s (<3s) |
| Output inspectable | Yes (summary) | No (opaque) | Yes (truncated) | Yes (verbatim lines) |
| Vendor lock-in | Model-dependent | OpenAI only | None | None |
| Works with any model | Yes | No | Yes | Yes |
Benchmarks
Performance data from Factory.ai's systematic evaluation (36,000 real engineering messages), JetBrains' SWE-bench experiments, and the ACON research framework.
Factory.ai Evaluation
Factory.ai tested compaction approaches across real coding agent sessions. Structured summarization scored 3.70/5. OpenAI's opaque compression scored 3.35/5. The key finding: no approach scored above 4/5, meaning all methods lose some information. The question is whether the lost information is fabricated (summarization) or simply absent (compaction).
JetBrains Observation Masking
JetBrains tested a simple technique on SWE-bench: hide old tool outputs entirely rather than summarizing them. The result matched the quality of full LLM summarization while eliminating the compute cost. This validates the core insight behind verbatim compaction: for many agent workloads, you do not need to summarize. Removing noise is sufficient.
ACON Framework
The ACON research framework (October 2025) treats context compression as an optimization problem. Results: 26-54% peak token reduction while preserving 95%+ task accuracy. For smaller models, removing context noise actually improved performance by 20-46%. More context is not always better.
Morph Compact Performance
Morph Compact processes 33,000 tokens per second. 100K tokens compress in under 2 seconds. Every compaction completes in under 3 seconds. The model supports a 1M-token context window, handling the longest agent sessions without chunking or multi-pass processing.
Compression quality: 50-70% token reduction with 98% verbatim accuracy. Zero hallucination risk. Output is inspectable, diffable, and works with any downstream model (Claude, GPT, Gemini, open-source).
Pricing
Morph Compact is available on all plans, including the free tier. Pricing is credit-based, scaling with usage.
| | Free | Starter | Pro | Scale |
|---|---|---|---|---|
| Monthly price | $0 | $20 | $60 | $400 |
| Credits included | 250K | 2M | 8M | 80M |
| Compact access | Yes | Yes | Yes | Yes |
| Rate limits | Low | Generous | Generous | Unlimited |
Cost Impact
Compaction pays for itself by reducing downstream token costs. An agent running at 150K tokens of context makes every subsequent LLM call expensive. Compacting to 60K tokens (60% reduction) cuts the cost of every future call by 60%. For agents making dozens of calls per task, the savings compound quickly.
Example: an agent making 50 LLM calls per task with Claude Sonnet 4 at $3/M input tokens. At 150K tokens per call, that is $22.50 in input costs. Compacting to 60K tokens drops it to $9.00. The compaction itself costs a fraction of the savings.
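The example above works out as follows (plugging in the $3/M input price; the helper is our own illustration):

```typescript
// Input-token cost for a task: tokens per call x calls x price per token.
function inputCostUSD(
  tokensPerCall: number,
  calls: number,
  usdPerMillionTokens: number,
): number {
  return (tokensPerCall * calls * usdPerMillionTokens) / 1_000_000;
}

console.log(inputCostUSD(150_000, 50, 3)); // 22.5 — uncompacted
console.log(inputCostUSD(60_000, 50, 3)); // 9 — after 60% compaction
```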
Frequently Asked Questions
What is a compaction API?
A compaction API takes a conversation history or context window and returns a compressed version that preserves critical information while reducing token count. Unlike summarization, verbatim compaction like Morph Compact keeps every surviving line character-for-character from the original, eliminating hallucination risk.
How is compaction different from summarization?
Summarization rewrites your context into a shorter natural-language summary. Factory.ai scored summarization accuracy at 3.7/5 across 36,000 real engineering messages. Compaction deletes tokens rather than rewriting them. Every line in the output existed in the input. Summarization achieves higher compression (70-90%) but introduces hallucination risk. Compaction achieves 50-70% compression with zero hallucination.
What is context rot?
Context rot is the degradation in LLM performance as the context window fills up. Stanford's "Lost in the Middle" research showed 15-47% accuracy drops as context grows. Performance is highest for information at the beginning or end of context, and drops significantly for information in the middle. Compaction prevents context rot by removing low-value tokens before the window fills. See our full context rot guide for details.
How fast is Morph Compact?
33,000 tokens per second on a custom inference engine. 100K tokens compress in under 2 seconds. Every compaction finishes in under 3 seconds regardless of input size. Fast enough to run inline before every LLM call, not just as an emergency measure at 95% capacity.
Does Morph Compact hallucinate or rewrite content?
No. Morph Compact deletes lines from the input but never generates new text. Every line in the output is a character-for-character match from the input. This eliminates the hallucination risk present in summarization approaches, where the model might fabricate file paths, error codes, or function signatures.
What compression ratio does Morph Compact achieve?
50-70% token reduction. The compression_ratio parameter (0.3-0.7 typical range) controls how aggressively to compress. You can also pass a query parameter to bias retention toward information relevant to the agent's next task.
Can I use Morph Compact with the OpenAI SDK?
Yes. Point the baseURL to https://api.morphllm.com/v1 and use model morph-compactor with the standard chat completions endpoint. It also works with the Anthropic SDK and Vercel AI SDK.
How does Morph Compact compare to OpenAI's compaction?
OpenAI's /responses/compact endpoint produces opaque, non-human-readable compressed representations. It achieves 99.3% compression but the output is not inspectable and is locked to OpenAI's infrastructure. Morph Compact produces readable, verbatim output that works with any downstream model. Factory.ai scored OpenAI's approach 3.35/5 overall.
Try Morph Compact API
33,000 tok/s verbatim compaction. 50-70% token reduction. Zero hallucination. Every surviving line is character-for-character from the original input.