What Is Token Compression
Token compression reduces the number of tokens processed by an LLM while preserving the information needed for correct output. Every token has a cost: input tokens burn money and prefill latency, output tokens burn more money and decode latency. Compression targets the gap between tokens sent and tokens that actually matter.
The gap is large. In a typical coding agent session, a 10-file codebase read pulls 30,000-50,000 tokens into context. The agent needs maybe 2,000-5,000 of those tokens to make its next edit. The remaining 90% is structural boilerplate, import blocks, comments, and code paths unrelated to the current task. Every one of those tokens increases cost, increases latency, and dilutes the model's attention across irrelevant content.
Token Compression vs. Prompt Engineering
Prompt engineering reduces tokens by writing better prompts. Token compression reduces tokens automatically, operating on existing context that has already accumulated. Prompt engineering is preventive. Token compression is reactive. Production systems need both: careful prompt design to minimize waste at the source, and compression to handle the context that builds up over multi-turn interactions.
The field has matured from research papers to production APIs. Microsoft's LLMLingua series (2023-2024) proved that automated compression preserves task accuracy. The Token Company (YC W26) demonstrated that compression middleware can work as a drop-in API layer. Morph Compact pushed throughput to 33,000 tok/s with verbatim fidelity, making inline compression fast enough to run before every LLM call.
Why Token Count Drives LLM Cost and Latency
LLM pricing is per-token. Claude Sonnet 4.6 costs $3/M input, $15/M output. GPT-4o costs $2.50/M input, $10/M output. At these rates, a coding agent running 500 tasks/day at 100K tokens/task spends $150-500/day on input alone. Cut input tokens by 60%, and that drops to $60-200/day.
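The arithmetic works out as a small helper. This is an illustrative sketch, not part of any SDK; the function name and signature are invented, and the rates are the ones quoted above:

```typescript
// Hypothetical cost helper. compressionRatio is the fraction of input
// tokens removed (0.6 = a 60% cut).
function dailyInputCostUSD(
  tasksPerDay: number,
  tokensPerTask: number,
  pricePerMillionUSD: number,
  compressionRatio = 0
): number {
  const tokens = tasksPerDay * tokensPerTask * (1 - compressionRatio);
  return (tokens / 1_000_000) * pricePerMillionUSD;
}

// 500 tasks/day at 100K tokens/task on $3/M input:
dailyInputCostUSD(500, 100_000, 3);      // 150 ($/day)
dailyInputCostUSD(500, 100_000, 3, 0.6); // ~60 ($/day) after a 60% cut
```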
Cost is the straightforward argument. The subtler argument is quality. Liu et al. (2023) demonstrated that LLMs exhibit a U-shaped accuracy curve: they access information well at the beginning and end of a prompt but lose track of content in the middle. This "lost in the middle" effect means 100K tokens of context is not 5x more useful than 20K tokens. It is often worse, because the model's attention is spread across irrelevant content that actively interferes with retrieval of the relevant parts.
Compression removes the noise. A well-compressed 40K token context often outperforms the uncompressed 100K version because the signal-to-noise ratio is higher. The model spends its attention budget on content that matters.
Latency: Prefill Is the Bottleneck
LLM inference has two phases: prefill (process all input tokens) and decode (generate output tokens one at a time). Prefill time scales linearly with input token count. A 100K token prompt takes roughly 2-4x longer to prefill than a 30K token prompt, depending on the model and hardware. For interactive applications, that difference is the difference between responsive and sluggish.
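That linear relationship can be sketched in one line, assuming a fixed prefill throughput (the 10,000 tok/s figure is an illustrative assumption, not a measured number for any particular model):

```typescript
// Prefill time scales linearly with input token count at a given throughput.
function prefillSeconds(inputTokens: number, tokensPerSecond = 10_000): number {
  return inputTokens / tokensPerSecond;
}

prefillSeconds(100_000); // 10s
prefillSeconds(30_000);  // 3s, so ~3.3x faster at this assumed throughput
```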
Context caching (Anthropic, Google) mitigates this for repeated prefixes, charging 90% less for cached tokens. But caching only helps with static content: system prompts, reference docs, few-shot examples. The dynamic context that agents accumulate turn by turn (tool outputs, file reads, conversation history) changes every request and cannot be cached. Compression is the only way to reduce its cost and latency.
Five Token Compression Techniques
Each technique makes fundamentally different tradeoffs. The right choice depends on your content type, latency budget, and tolerance for information loss.
1. Token Pruning
Token pruning scores individual tokens by information content and removes the lowest-scoring ones. LLMLingua (Microsoft, EMNLP 2023) uses a small model (GPT-2 or LLaMA-7B) to compute perplexity scores, then drops tokens that contribute the least information. A budget controller allocates compression capacity across prompt segments. Token-level iterative compression models interdependencies between adjacent tokens.
LLMLingua achieves up to 20x compression on reasoning benchmarks (GSM8K, BBH) with only 1.5-point accuracy loss. LLMLingua-2 (ACL 2024) reformulated the problem as token classification using a BERT-class encoder, running 3-6x faster than the original with comparable quality.
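A toy sketch of the budget-driven selection step: real LLMLingua scores tokens with a small language model's perplexity, but a stopword list is enough to show how the lowest-information tokens get dropped against a budget (everything here, including the scoring rule, is illustrative):

```typescript
// Toy stand-in for LLMLingua-style pruning. Real LLMLingua uses perplexity
// from a small LM; here a stopword list marks low-information tokens so the
// budget-driven selection step can be shown end to end.
const STOPWORDS = new Set(["the", "a", "an", "of", "to", "and", "on", "in", "is"]);

function pruneTokens(text: string, keepRatio: number): string {
  const tokens = text.split(/\s+/);
  const budget = Math.ceil(tokens.length * keepRatio);
  const scored = tokens.map((tok, i) => ({
    tok,
    i,
    // Low score = low information = first to be dropped.
    score: STOPWORDS.has(tok.toLowerCase()) ? 0 : 1,
  }));
  return scored
    .sort((a, b) => b.score - a.score || a.i - b.i) // highest-scoring first
    .slice(0, budget)                               // enforce the budget
    .sort((a, b) => a.i - b.i)                      // restore original order
    .map((s) => s.tok)
    .join(" ");
}

pruneTokens("the cat sat on the mat", 0.5); // "cat sat mat"
```

At a 50% budget only the stopwords go; at tighter budgets content words start getting cut too, which is where accuracy begins to slip.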
The Code Problem with Token Pruning
Token pruning operates below the semantic level. Removing a single token from a file path (src/api/webhook instead of src/api/webhooks/stripe.ts) or a JSON key produces malformed output. Perplexity-based scoring works well on natural language, where removing a filler word preserves meaning. It fails on code, where every character in an identifier is load-bearing.
2. Soft-Prompt Compression
Soft-prompt methods encode long contexts into compact learned vectors that replace the original text. AutoCompressors (EMNLP 2023) fine-tune models to produce "summary vectors" that function as compressed soft prompts. 500xCompressor pushes this to extreme ratios, compressing contexts into as few as one special token, with models retaining 62-73% of capabilities at 480x compression.
The advantage is extreme compression ratios impossible with text-based methods. The disadvantage is that these vectors are model-specific, opaque (you cannot inspect what was preserved), and require fine-tuning or specialized architectures. They are research results, not production APIs.
3. Verbatim Compaction
Verbatim compaction deletes low-signal content while guaranteeing that every surviving token is identical to the input. Nothing is rewritten. Nothing is paraphrased. The output is a strict subset of the input tokens.
This is the approach Morph Compact uses. The model identifies filler (greetings, redundant explanations, stale tool outputs, boilerplate) and removes it. What remains is character-for-character from the original: file paths, error codes, function signatures, and reasoning steps are either present exactly or absent entirely. 50-70% compression at 33,000 tok/s.
The tradeoff: verbatim compaction cannot achieve the extreme ratios of token pruning or soft-prompt methods. It compresses less aggressively because it refuses to corrupt content. For production systems where a corrupted file path means a failed agent task, that tradeoff is correct.
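The subset guarantee is mechanically checkable. Here is a minimal sketch (line-level granularity is a simplifying assumption of this example, not a statement of how Morph implements the check):

```typescript
// Every non-empty line of the compressed output must appear
// character-for-character somewhere in the original.
function isVerbatimSubset(original: string, compressed: string): boolean {
  return compressed
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .every((line) => original.includes(line));
}
```

A summarizer that paraphrases even one line fails this check; a verbatim compactor passes it by construction.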
4. Embedding-Based Retrieval (RAG)
Retrieval-augmented generation sidesteps compression by never putting the full corpus into context. Chunk documents, embed them, store in a vector database. At query time, retrieve only the top-k relevant chunks. The model sees 2,000 tokens of relevant context instead of 200,000 tokens of everything.
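The retrieval step reduces to a similarity ranking. A minimal sketch using cosine similarity over precomputed embedding vectors (the embedding model call itself is elided; the vectors and chunk texts below are stand-ins):

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank chunks by similarity to the query vector, keep the top k.
function topK(
  query: number[],
  chunks: { text: string; vector: number[] }[],
  k: number
): string[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.vector) - cosine(query, x.vector))
    .slice(0, k)
    .map((c) => c.text);
}
```

In production the vectors come from an embedding model and live in a vector database; the ranking logic is the same.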
RAG is not strictly compression, but it solves the same problem: reducing token count to what the model actually needs. RECOMP (2023) trains dedicated compressors for retrieved documents, selecting relevant sentences (extractive) or generating summaries (abstractive). At 6% compression rates, it achieves minimal performance loss on QA tasks.
The limitation: RAG works for static knowledge bases. For dynamic context (conversation history, tool outputs, evolving agent state), there is no corpus to index. You need a compression method that operates on the context as it accumulates.
5. Selective Summarization
Summarization rewrites context into fewer tokens. This is the default in most agent frameworks. Claude Code auto-compacts at 95% context capacity. OpenAI Codex runs server-side compaction after every turn. The approach is flexible and produces natural-language summaries that models handle well.
The problem is fidelity. Factory.ai tested three summarization approaches on 36,000 messages from real Claude Code sessions. Structured summaries scored 3.70/5, Anthropic summaries 3.44/5, and OpenAI opaque summaries 3.35/5. The biggest accuracy gap was in preserving file paths, error codes, stack traces, and debugging context: the exact content coding agents need to avoid re-reading loops.
Summarization also introduces hallucination risk. The compressor is an LLM generating new text. It can paraphrase imprecisely, drop specific details, or confabulate connections between unrelated context items.
| | Token Pruning | Soft Prompts | Verbatim Compaction | RAG | Summarization |
|---|---|---|---|---|---|
| Compression ratio | Up to 20x | Up to 480x | 50-70% | 90%+ | 60-98% |
| Fidelity | Can corrupt structure | Opaque vectors | Byte-identical | Chunk-level | Can hallucinate |
| Latency | 100-500ms | Model-dependent | <3s for 100K | ~50ms retrieval | 2-30s |
| Code safety | Dangerous | N/A | Safe | Chunk boundaries | Risky |
| Production-ready | LLMLingua-2 | Research only | Morph Compact | Many providers | Built into agents |
| Dynamic context | Yes | Requires fine-tuning | Yes | No (static corpus) | Yes |
Benchmarks: Compression Ratio vs. Accuracy
Compression is a quality-cost tradeoff. More aggressive compression saves more tokens but risks losing signal. The key question: at what compression ratio does task accuracy start to degrade?
LLMLingua on Reasoning Tasks
On GSM8K (math reasoning), LLMLingua achieves 4x compression with less than 1 point accuracy loss. At 10x compression, accuracy drops 3-5 points. At 20x, the drop is measurable but still within usable range for many applications. Chain-of-thought prompts compress well because the reasoning structure is redundant: once the model sees the pattern, intermediate steps contribute less new information per token.
Verbatim Compaction on Agent Workloads
Morph Compact targets 50-70% compression on agent conversation history. On Factory.ai's benchmark of 36,000 real engineering messages, verbatim compaction preserved 98% of content accuracy while reducing token count by 60%. The key metric is re-read rate: how often the agent needs to re-read a file because the compressed context lost the reference. With verbatim compaction, the re-read rate stays near baseline because paths and identifiers are preserved exactly.
Summarization Quality Scores
Factory.ai scored summarization-based compression on five dimensions: accuracy, completeness, readability, relevance, and action items. The overall scores (3.35-3.70/5) hide a sharper problem: accuracy specifically scored lower on technical content. Summarizers consistently paraphrased file paths, approximated error messages, and merged distinct issues into single summary bullets. Each of these failures triggers downstream costs as the agent re-reads to recover exact information.
Token Compression for Coding Agents
Coding agents have a unique relationship with token compression. They are the heaviest context consumers (reading files, running tools, accumulating outputs) and the most sensitive to compression errors (one corrupted path or identifier breaks the task).
Cognition (the team behind Devin) measured that 60% of their agent's time went to searching for code. Every grep that returns 500 lines, every full-file read to find a 10-line function, every tool output echoed back into context accelerates the countdown to context exhaustion. Compression is not optional for long-running agent sessions. The context window fills, compaction fires, and the quality of that compaction determines whether the agent continues or starts looping.
What Coding Agents Cannot Afford to Lose
File Paths and Line Numbers
src/api/webhooks/stripe.ts:98 must survive compression exactly. Approximations ('the webhook handler') force the agent to search again, wasting tokens and time.
Error Messages and Stack Traces
TypeError: Cannot read properties of undefined (reading 'id') at line 47. The exact error text is needed for diagnosis. Paraphrased errors ('there was a type error') are useless.
Function Signatures and Types
async function processWebhook(event: Stripe.Event): Promise<void>. A compressed version that drops the type annotation or alters the function name breaks downstream edits.
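One way to enforce this in practice is a post-compression guard: extract path-like references from the original context and confirm each one survives exactly. A sketch (the regex is illustrative and far from exhaustive):

```typescript
// Matches file paths with common extensions, optionally with a :line suffix,
// e.g. "src/api/webhooks/stripe.ts:98". Illustrative, not exhaustive.
const PATH_RE = /[\w./-]+\.(ts|js|py|go|rs)(:\d+)?/g;

// Returns every path-like reference from the original that is missing
// (in exact form) from the compressed output.
function lostReferences(original: string, compressed: string): string[] {
  const refs = original.match(PATH_RE) ?? [];
  return refs.filter((ref) => !compressed.includes(ref));
}
```

If `lostReferences` returns anything, the compression dropped a reference the agent may need next, and the caller can fall back to the uncompressed context.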
The Prevention-First Approach
The most effective token compression is the compression you never need to run. If the agent accumulates less waste per turn, the context window fills slower, and compaction fires less often.
This is the idea behind Morph FlashCompact. WarpGrep returns only relevant code snippets (0.73 F1 in 3.8 steps) instead of dumping entire files. Fast Apply uses compact diffs at 10,500 tok/s instead of echoing full file rewrites back into context. Morph Compact cleans up whatever noise remains through verbatim deletion. The combination extends effective context life by 3-4x, so compaction fires 3-4x less often.
Context Life Extension
Agent frameworks typically trigger compaction at 95% context capacity. If your agent fills 200K tokens in 50 turns, that is one compaction every 50 turns. With prevention-first compression (targeted search, compact diffs, inline compaction), the same agent reaches 200K tokens in 150-200 turns. Fewer compaction events means less information loss over long sessions.
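The turn arithmetic above can be sketched directly; the 95% trigger and the per-turn token counts are the assumptions stated in the text:

```typescript
// Turns until the compaction trigger fires, given a context window size,
// tokens accumulated per turn, and the trigger threshold.
function turnsUntilCompaction(
  windowTokens: number,
  tokensPerTurn: number,
  triggerFraction = 0.95
): number {
  return Math.floor((windowTokens * triggerFraction) / tokensPerTurn);
}

turnsUntilCompaction(200_000, 4_000); // 47 turns at ~4K tokens/turn
turnsUntilCompaction(200_000, 1_200); // 158 turns with prevention-first tooling
```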
Production Implementation with Morph Compact
Morph Compact is an OpenAI-compatible API. Point your SDK at https://api.morphllm.com/v1, use model morph-compactor, and send the context you want compressed.
Basic Compaction
Objective compaction strips filler with no additional guidance. Send the text, get back a shorter version with every surviving sentence identical to the original.
Morph Compact: Objective Compaction
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.morphllm.com/v1",
  apiKey: process.env.MORPH_API_KEY,
});

const response = await client.chat.completions.create({
  model: "morph-compactor",
  messages: [
    {
      role: "user",
      content: agentConversationHistory, // 80K tokens of context
    },
  ],
});

// ~30K tokens, every sentence byte-identical to original
const compressed = response.choices[0].message.content;
```

Query-Based Compaction
Pass the agent's next task as a query. Compact weights its keep/drop decisions against what the agent needs next, preserving more context that is relevant to the upcoming work.
Morph Compact: Query-Based Compaction
```typescript
const response = await client.chat.completions.create({
  model: "morph-compactor",
  messages: [
    {
      role: "system",
      content: "Fix the authentication middleware to handle expired tokens",
    },
    {
      role: "user",
      content: agentConversationHistory,
    },
  ],
});

// Preserves auth-related context, drops unrelated tool outputs
const compressed = response.choices[0].message.content;
```

Inline Compression in Agent Loops
At 33,000 tok/s, Compact is fast enough to run before every LLM call, not just at the 95% capacity cliff. This keeps the context window lean throughout the session rather than waiting for a crisis-mode compaction at near-capacity.
Inline Compression in an Agent Loop
```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// Rough heuristic: ~4 characters per token. Swap in a real tokenizer
// for production use.
const estimateTokens = (msgs: Message[]) =>
  Math.ceil(msgs.reduce((sum, m) => sum + m.content.length, 0) / 4);

const formatHistory = (msgs: Message[]) =>
  msgs.map((m) => `${m.role}: ${m.content}`).join("\n");

// llm and compactClient are OpenAI-compatible clients configured elsewhere:
// llm points at your frontier model, compactClient at api.morphllm.com/v1.
async function agentStep(history: Message[], task: string) {
  // Compress if history exceeds threshold
  if (estimateTokens(history) > 50_000) {
    const compressed = await compact(history, task);
    history = [{ role: "system", content: compressed }];
  }
  // LLM call with lean context
  const response = await llm.chat(history);
  return response;
}

async function compact(messages: Message[], query: string) {
  const res = await compactClient.chat.completions.create({
    model: "morph-compactor",
    messages: [
      { role: "system", content: query },
      { role: "user", content: formatHistory(messages) },
    ],
  });
  return res.choices[0].message.content;
}
```

Cost Analysis: Token Compression at Scale
The math is straightforward. If you process 10M tokens/day through Claude Sonnet at $3/M input, that is $30/day in input costs. Compressing by 60% reduces input to 4M tokens, saving $18/day or $540/month. The compression step itself costs a fraction of the savings because Morph Compact pricing is substantially lower than frontier model pricing.
| | No Compression | 60% Compression |
|---|---|---|
| Daily input tokens | 10M | 4M |
| Monthly input cost (Sonnet) | $900 | $360 |
| Compression cost | $0 | ~$50 |
| Net monthly cost | $900 | ~$410 |
| Monthly savings | - | ~$490 (54%) |
The savings compound with model cost. Compressing before Claude Opus ($15/M input) saves 60x more per token than compressing before Haiku ($0.25/M input). Token compression is most valuable when used with the most expensive models, which are also the models that benefit most from cleaner context.
Latency savings are harder to quantify but equally real. 60% fewer input tokens means roughly 60% less prefill time. For an interactive agent that makes 20 LLM calls per task, shaving 1-2 seconds of prefill per call saves 20-40 seconds per task. Over hundreds of tasks, that is hours of wall-clock time recovered.
Frequently Asked Questions
What is token compression?
Token compression reduces the number of tokens in an LLM prompt or context while preserving the information needed for accurate output. Methods include token pruning (removing low-information tokens via perplexity scoring), verbatim compaction (deleting low-signal blocks while keeping surviving text identical), embedding-based retrieval (replacing large contexts with relevant chunks), and selective summarization (rewriting context into fewer tokens).
How much can token compression reduce LLM costs?
Typical compression achieves 50-70% token reduction, translating directly to 50-70% cost savings on input tokens. LLMLingua achieves up to 20x compression on reasoning tasks. Morph Compact achieves 50-70% at 33,000 tok/s with zero hallucination risk. The exact savings depend on content type: conversational context compresses more aggressively than structured code.
Does token compression reduce output quality?
It depends on the method. Summarization can introduce errors (Factory.ai scored it 3.4-3.7/5 on accuracy). Token pruning can corrupt structured content by removing individual tokens from identifiers. Verbatim compaction preserves exact text, so surviving content is identical to the original. Research shows compressed prompts sometimes improve quality by removing noise that causes "lost in the middle" degradation.
What is the difference between token compression and prompt compression?
They overlap significantly. Prompt compression specifically targets the input prompt before it reaches the model. Token compression is broader, covering prompt compression, context compaction during multi-turn conversations, KV-cache compression during inference, and output-side techniques. In practice, the terms are often used interchangeably.
Can you compress tokens for code without breaking syntax?
Standard token pruning methods can corrupt code by removing individual tokens from paths or identifiers. Verbatim compaction avoids this by operating on semantic units: a file path is either present exactly as written or removed entirely. Morph Compact is built for coding agent workloads where preserving exact paths, error codes, and line numbers is non-negotiable.
When should I compress tokens vs. use a larger context window?
Larger windows help with capacity but not with quality or cost. LLMs exhibit "lost in the middle" degradation where accuracy drops for content in the middle of long prompts. Compressing 100K tokens to 40K often produces better results than feeding all 100K, while cutting cost by 60%. Compress when you have more context than signal.
Is token compression the same as RAG?
No, but they solve overlapping problems. RAG retrieves small relevant chunks so the full corpus never enters context. Token compression reduces tokens already in context. They are complementary: use RAG to select what enters context, then compression to trim what remains.
How does Morph Compact compare to LLMLingua?
LLMLingua prunes individual tokens using perplexity scores. It achieves higher compression ratios (up to 20x) but can corrupt structured content. Morph Compact uses verbatim compaction at 33,000 tok/s, achieving 50-70% compression with zero hallucination risk. LLMLingua is better for prose-heavy NLP tasks. Morph Compact is better for coding agent workloads where fidelity matters more than compression ratio.
Related Pages
Try Morph Compact API
33,000 tok/s verbatim compaction. 50-70% token reduction with zero hallucination risk. Every surviving sentence is byte-for-byte identical to the original. OpenAI-compatible API.