LLM Context Management: How Production Agents Handle Memory, Compression, and Retrieval

A practical guide to LLM context management for production agents. Tool-by-tool comparison of how Claude Code, Codex, Cursor, and Devin handle context windows, with Anthropic's four pillars (Write, Select, Compress, Isolate), compression benchmarks, and decision frameworks for RAG vs compaction vs summarization vs subagents.

February 27, 2026 · 3 min read

57% of organizations have agents in production. 32% cite quality as the top barrier to scaling. The root cause is not the model. It's what the model sees. Context management is the engineering discipline that separates agents that degrade after 20 exchanges from agents that sustain quality across hundreds.

  • 57% of orgs have agents in production (LangChain 2025)
  • 5.5x fewer tokens: Claude Code vs Cursor
  • 32% cite quality as the top scaling barrier
  • 80% of context problems solved by 3 patterns

What Is LLM Context Management

LLM context management is the discipline of controlling what enters and exits a language model's context window across agent sessions. It covers four areas: memory persistence (writing state to files that survive resets), just-in-time retrieval (loading only what's needed for the current step), compression (reducing token count while preserving signal), and isolation (using separate context windows for parallel subtasks).

This is different from prompt engineering. Prompt engineering is about what you say to the model. Context engineering is about what the model sees when it reasons. A perfectly written prompt degrades if it competes with 80K tokens of irrelevant tool output for the model's attention.

Chroma's research found that all 18 frontier models it tested degrade as context grows. This is not a context window size problem. A model with 200K tokens of capacity still performs worse at 100K than at 20K if most of those tokens are noise. The problem is signal dilution: every irrelevant token makes the model worse at attending to the tokens that matter.

The token efficiency gap is real

Claude Code uses 5.5x fewer tokens than Cursor for equivalent coding tasks. That gap comes from context management, not model capability. Structured memory files, selective loading, automatic compaction, and subagent isolation compound into dramatically better token efficiency.

How Production Agents Handle Context

No two production agents manage context the same way. The differences explain why some sustain quality over long sessions while others force users to start fresh every 20 minutes. Here is how four major tools approach the problem.

Claude Code

Claude Code has the most explicit context management system of any production agent. Six mechanisms work together:

  • CLAUDE.md: A project-level memory file that is always loaded into context. Developers write architectural decisions, file paths, conventions, and task state here. It persists across sessions and acts as the agent's long-term memory.
  • .claudeignore: Functions like .gitignore but for the agent's context. Excluding node_modules, build artifacts, and generated files saves 30-100K tokens per session.
  • Auto-compaction: Triggers automatically when the conversation approaches the context window limit. Compresses history while preserving the working state.
  • Subagent isolation: Spawns separate context windows for exploration and research tasks. The main agent's context stays clean while subagents search, read files, and gather information in isolation.
  • /compact command: Manual trigger for compaction when the developer knows context has accumulated noise.
  • On-demand skill loading: Skills (specialized capabilities) load into context only when triggered, not at startup. This keeps the base context lean.

OpenAI Codex

Codex takes an opaque approach. Its /responses/compact endpoint returns encrypted_content fields, and the compression logic is internal: an auto_compact_limit parameter triggers compression that reaches a 99.3% ratio. Developers cannot inspect or influence what gets compressed. The tradeoff: maximum compression, zero visibility into what was preserved or discarded.

Cursor

Cursor operates with a 120K context window and no auto-compaction. It uses .cursorrules for project-level instructions (similar to CLAUDE.md). The problem shows up in extended sessions: quality degrades after 20-30 exchanges as noise accumulates with no automatic cleanup. Users learn to start new sessions proactively, which means losing in-progress reasoning and established context.

Devin

Devin degrades after roughly 2.5 hours of continuous operation. It exhibits what researchers call "context anxiety": the model begins prematurely summarizing its own context to avoid hitting limits, discarding details it may still need. Mechanisms like CHANGELOG.md and SUMMARY.md help but are insufficient alone because the model must decide what to persist before it knows what will matter later.

| Feature | Claude Code | Codex | Cursor | Devin |
| --- | --- | --- | --- | --- |
| Persistent memory | CLAUDE.md (always loaded) | Opaque (encrypted) | .cursorrules | CHANGELOG.md / SUMMARY.md |
| File exclusion | .claudeignore (30-100K saved) | N/A | N/A | N/A |
| Auto-compaction | Yes (at capacity) | Yes (auto_compact_limit) | No | Premature self-summarization |
| Manual compaction | /compact command | N/A | N/A | N/A |
| Subagent isolation | Yes (separate windows) | Sandboxed tasks | No | Limited |
| Compression ratio | Configurable | 99.3% | N/A | Uncontrolled |
| Degradation point | Gradual (managed) | Unknown | 20-30 exchanges | ~2.5 hours |
| Token efficiency | 5.5x vs Cursor baseline | Unknown | Baseline | Below baseline at scale |

Anthropic's Four Pillars: Write, Select, Compress, Isolate

Anthropic's context engineering framework defines four operations that cover the full lifecycle of agent memory. These are not theoretical abstractions. Each maps directly to production features in Claude Code.

Write: Persist to External Memory

Save state to files that survive context resets. CLAUDE.md stores project context, conventions, and architectural decisions. Git commits capture progress checkpoints. The agent writes its own memory rather than relying on what fits in the window.

Select: Just-in-Time Retrieval

Load context only when needed, not upfront. Skills activate on demand. Subagents fetch specific information and return only the relevant results. .claudeignore prevents irrelevant files from ever entering context. Every token must earn its place.

Compress: Compaction Before Summarization

Reduce token count while preserving signal. Anthropic recommends compaction (reversible, lossless) as the first lever, then summarization (lossy) only when compaction is insufficient. Keep the last 3 turns raw for immediate context continuity.
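That ordering can be sketched in a few lines of Python. This is a minimal illustration, not Claude Code's implementation: the `summarize` callback stands in for an LLM call, and the 4-characters-per-token estimate is a rough heuristic.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token
    return len(text) // 4

def compact(turns: list[str]) -> list[str]:
    # Compaction is reversible deletion: drop known-noise turns
    # (here, empty tool outputs) without rewriting anything
    return [t for t in turns if not t.startswith("[tool-output: empty]")]

def compress_history(turns: list[str], budget: int, summarize) -> list[str]:
    # Lever 1: compaction — every surviving turn stays verbatim
    turns = compact(turns)
    if sum(estimate_tokens(t) for t in turns) <= budget:
        return turns
    # Lever 2: summarization (lossy) — keep the last 3 turns raw
    head, tail = turns[:-3], turns[-3:]
    return [summarize(head)] + tail
```

The key property: summarization only ever touches history older than the last 3 turns, and only after lossless compaction has failed to fit the budget.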

Isolate: Separate Context Windows

Run subtasks in separate context windows. Exploration, research, and file searching happen in subagent threads that do not pollute the main agent's reasoning context. Results flow back as summaries, not raw data.

The order matters. Write first: persist anything you might need later. Select carefully: only load what the current step requires. Compress when needed: compaction before summarization because compaction is reversible. Isolate by default: subtasks get their own windows.

Most developers start with compression because it feels like the obvious lever. But the highest-leverage intervention is usually selection: preventing irrelevant tokens from entering context in the first place. A well-configured .claudeignore that excludes node_modules, build outputs, and lock files can save 30-100K tokens per session without any compression at all.

The four pillars in practice (Claude Code configuration)

# 1. WRITE: Persist state to CLAUDE.md
# This file is always loaded — it's the agent's long-term memory
# CLAUDE.md
## Architecture
- API routes in src/app/api/
- Drizzle ORM with PostgreSQL
- Clerk auth on all /dashboard routes

## Current Sprint
- Migrating from REST to tRPC
- Payment webhook handler needs retry logic

# 2. SELECT: Exclude irrelevant files from context
# .claudeignore
node_modules/
dist/
.next/
*.lock
coverage/
# Saves 30-100K tokens per session

# 3. COMPRESS: Auto-compaction at capacity + manual trigger
# Automatic: triggers when approaching context limit
# Manual: /compact command when you know context is noisy

# 4. ISOLATE: Subagents for exploration
# "Search the codebase for all Stripe webhook handlers"
# → Runs in a separate context window
# → Returns structured results to main agent
# Main agent's context stays clean

The Compression Landscape: Five Approaches Compared

Not all context compression is the same. Five distinct approaches have emerged, each with different tradeoffs between compression ratio, quality preservation, and interpretability.

| Approach | Compression | Quality | Interpretable | Source |
| --- | --- | --- | --- | --- |
| Factory (anchored summaries) | 98.6% | 3.70/5 (4.04 accuracy) | Yes | 36K production SE msgs |
| Anthropic (structured summaries) | 98.7% | 3.44/5 | Yes | Claude Code auto-compact |
| OpenAI (opaque) | 99.3% | 3.35/5 | No | Codex /responses/compact |
| Morph Compact (verbatim) | 50-70% | 98% verbatim accuracy | Yes | Deletion, not rewriting |
| Observation masking (JetBrains) | ~50% | Matches summarization 4/5 | N/A | Junie agent system |

Factory.ai: Anchored Summaries from 36K Real Messages

Factory.ai ran the most rigorous public evaluation of compression to date. They tested three methods on 36,611 real software engineering messages from production coding sessions. Their anchored summary approach scored 3.70 overall (4.04 for accuracy), outperforming Anthropic's structured summaries at 3.44 and OpenAI's opaque compression at 3.35.

The key design: summaries are anchored to specific sections (completed work, current state, pending tasks), which gives the model structured reference points for picking up where it left off. Unanchored summaries scored lower because the model had to rediscover the organizational structure.
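The anchor idea reduces to a prompt template with fixed section headings. The section names below follow the article's description (completed work, current state, pending tasks); the surrounding wording is illustrative, not Factory.ai's actual prompt.

```python
ANCHORED_SUMMARY_PROMPT = """\
Summarize the session history into exactly these sections:

## Completed Work
(what was finished; keep file paths and identifiers verbatim)

## Current State
(what is in progress, including open errors)

## Pending Tasks
(what remains, in priority order)

History:
{history}
"""

def build_anchored_prompt(history: str) -> str:
    # Fixed anchors give the model stable reference points
    # for resuming work after compression
    return ANCHORED_SUMMARY_PROMPT.format(history=history)
```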

OpenAI Codex: Maximum Compression, Zero Visibility

Codex's /responses/compact endpoint achieves 99.3% compression, the highest of any method. But the output is an opaque encrypted_content blob that developers cannot inspect, debug, or verify. If the compression discards something important, you have no way to know until the agent makes a mistake downstream.

Morph Compact: Verbatim Deletion

Morph Compact takes the opposite approach: deletion, not rewriting. The model identifies which tokens carry signal and which are noise, then removes the noise. Every sentence that survives is verbatim from the original. 50-70% reduction at 3,300+ tokens per second with 98% verbatim accuracy and zero hallucination risk.

The compression ratio is lower than summarization approaches. That's the tradeoff. You get less compression but absolute fidelity. For coding agents where file paths, error messages, and code snippets must survive intact, this tradeoff often favors verbatim compaction.
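A toy version of deletion-based compaction makes the verbatim guarantee concrete. Morph uses a model to classify signal vs noise; the regexes below are crude stand-ins for that step.

```python
import re

# Stand-in noise classifier — illustrative patterns only
NOISE_PATTERNS = [
    re.compile(r"^\s*$"),             # blank lines
    re.compile(r"^(DEBUG|TRACE)\b"),  # low-signal log levels
    re.compile(r"^\[progress\]"),     # progress spinners
]

def compact_verbatim(text: str) -> str:
    # Deletion, not rewriting: every surviving line is
    # character-for-character identical to the original,
    # so file paths and error messages cannot be mangled
    kept = [ln for ln in text.splitlines()
            if not any(p.search(ln) for p in NOISE_PATTERNS)]
    return "\n".join(kept)
```

Because output lines are a strict subset of input lines, hallucination is structurally impossible; the only failure mode is deleting something that mattered.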

Observation Masking: The Cost-Free Option

JetBrains' Junie agent tested simple observation masking against full LLM summarization on SWE-bench. Masking replaces tool outputs with placeholders after the model has processed them. The results: masking matched summarization quality in 4 of 5 cases while cutting costs by 50%. An unexpected finding: agents with masking ran 13-15% shorter trajectories, because summarization had been hiding "you should stop" signals present in raw tool outputs.
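Masking is simple enough to show in full. The placeholder format below is an assumption for illustration, not Junie's actual implementation.

```python
def mask_observations(messages: list[dict], keep_last: int = 3) -> list[dict]:
    # Replace tool outputs with placeholders once the model has
    # processed them; the last `keep_last` messages stay raw so
    # the model can still act on fresh results. No LLM call needed.
    cutoff = len(messages) - keep_last
    masked = []
    for i, msg in enumerate(messages):
        if msg["role"] == "tool" and i < cutoff:
            masked.append({
                "role": "tool",
                "content": f"[tool output elided: {len(msg['content'])} chars]",
            })
        else:
            masked.append(msg)
    return masked
```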

You don't always need an LLM call to compress

Every summarization call is itself an LLM inference, costing tokens and latency. Masking and deletion-based approaches skip that cost entirely. For high-frequency compression (every few turns), the savings compound. JetBrains proved that the simplest approach often works as well as the most expensive one.

Practical Patterns That Work

80% of context management problems in production agents can be solved with three patterns: file exclusion, structured memory, and deliberate planning. These require no custom infrastructure and work with any agent framework.

Pattern 1: File Exclusion (.claudeignore)

The single highest-leverage context management intervention. A .claudeignore file that excludes node_modules, build outputs, lock files, and generated code prevents 30-100K tokens from ever entering the context window. This is not compression. It's prevention.

.claudeignore: Prevent noise from entering context

# .claudeignore — saves 30-100K tokens per session
node_modules/
dist/
.next/
build/
coverage/
*.lock
*.min.js
*.min.css
*.map
.git/
__pycache__/
*.pyc
.env*

# Project-specific exclusions
generated/
prisma/migrations/
public/assets/vendor/

Pattern 2: Structured Memory (CLAUDE.md)

CLAUDE.md is the agent's long-term memory. Write architectural decisions, key file paths, project conventions, and current task state here. It loads automatically at the start of every session, giving the agent immediate context without consuming window space on rediscovery.

Effective CLAUDE.md structure

# CLAUDE.md

## Commands
bun run dev          # Development server on port 3002
bun run build        # Production build
bun run db:push      # Push schema changes

## Architecture
- Next.js 15 App Router, React 19, TypeScript
- PostgreSQL with Drizzle ORM (schema in src/lib/db/schema.ts)
- Clerk auth on /dashboard routes
- Stripe webhooks in src/app/api/webhooks/stripe/

## Key Decisions
- Server components by default, client only for interactivity
- All mutations through server actions in actions.ts files
- API keys stored in apiKeys table, usage tracked in usageRecords

## Current State
- Migrating webhook handler to use idempotency keys
- Payment retry logic has a bug in handlePaymentFailed() at line 98

Pattern 3: Plan Mode for Complex Tasks

Plan mode forces the agent to outline its approach before executing. This prevents a common failure pattern: the agent reads 15 files, fills its context with exploration data, then has no room left for the actual implementation. Planning first means the agent knows which files it needs before it starts reading them.

Pattern 4: Subagent Isolation for Exploration

Exploration is the biggest source of context pollution. When an agent searches for "all files that import the auth module," the results from 30 files flood the main context. Subagent isolation runs this exploration in a separate window and returns only the relevant findings.

Subagent isolation: exploration without context pollution

// Without isolation: exploration floods main context
// Main agent reads 30 files searching for auth imports
// Context now 60% exploration data, 40% actual task

// With isolation: exploration runs in a separate window
// Subagent: "Find all files that import from @/lib/auth"
// Subagent reads 30 files in its own context window
// Returns to main agent: "3 files import auth:
//   src/app/api/webhooks/stripe.ts (line 12)
//   src/app/dashboard/billing/page.tsx (line 5)
//   src/middleware.ts (line 3)"
// Main agent's context: clean, focused, ready to edit

Pattern 5: Compaction Triggers

Do not wait for context to hit capacity. Use inline compaction on tool outputs as they arrive, and manual compaction (/compact) when you notice the agent repeating itself or losing track of earlier decisions. Both signals indicate that noise has diluted the context beyond what the model can effectively attend to.
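Both triggers can be approximated with cheap heuristics. In the sketch below, the 80% capacity threshold and the 4-chars-per-token estimate are illustrative assumptions, not documented values from any of the tools above.

```python
def should_compact(turns: list[str], window: int = 200_000,
                   threshold: float = 0.8) -> bool:
    # Trigger 1: approaching capacity (rough token estimate)
    used = sum(len(t) // 4 for t in turns)
    if used >= threshold * window:
        return True
    # Trigger 2: the agent is repeating itself — duplicated recent
    # turns suggest noise has diluted the context beyond what the
    # model can effectively attend to
    recent = turns[-6:]
    return len(set(recent)) < len(recent)
```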

The 80% rule

.claudeignore + CLAUDE.md + plan mode solve 80% of context management problems with zero infrastructure changes. Start here before building custom compression pipelines. The remaining 20% is where compaction, summarization, and subagent isolation become necessary.

When to Use What: The Decision Framework

Six context management strategies exist. Each is optimal for different situations. Using the wrong one wastes tokens or loses signal.

RAG (Retrieval-Augmented Generation)

Finding specific evidence from large knowledge bases. Best when the answer exists in a corpus too large to fit in context. Not for managing session history.

Compaction (First Lever)

Long-running sessions where information already exists in the environment. Removes noise while keeping surviving text verbatim. Pull this lever first because it's reversible.

Summarization (When Compaction Isn't Enough)

When you need higher compression than deletion can achieve. Keep the last 3 turns raw for continuity, summarize everything older. Accept the tradeoff: more compression, less fidelity.

Subagents (Decomposable Tasks)

Naturally decomposable tasks where exploration and research shouldn't bloat the main context. File searches, codebase exploration, and information gathering run in isolated windows.

Observation Masking (Cost Reduction)

Reduce costs without quality loss. Replace tool outputs with placeholders after the model processes them. JetBrains proved this matches summarization quality in 4/5 cases at 50% lower cost.

External Memory (Multi-Session)

State that must survive across session resets. CLAUDE.md files, git commits, database records. The only strategy that persists beyond the current context window.

| Situation | Best Strategy | Why |
| --- | --- | --- |
| Large knowledge base, specific queries | RAG | Corpus too large for context window |
| Long session, accumulating noise | Compaction (first), then summarization | Reversible reduction preserves fidelity |
| Agent repeating itself or losing track | Manual compaction + CLAUDE.md update | Reset signal-to-noise ratio, persist key state |
| Need to search 30+ files | Subagent isolation | Exploration doesn't pollute main context |
| High tool-call volume, cost-sensitive | Observation masking | 50% cost reduction, matches summarization quality |
| Multi-day task spanning sessions | External memory (CLAUDE.md, git) | Only strategy that survives context resets |
| Code must survive compression exactly | Verbatim compaction (Morph Compact) | Zero hallucination risk, exact file paths preserved |

Jason Liu's Insight: Compaction as Momentum

Jason Liu framed the value of compaction with a precise analogy: "If in-context learning is gradient descent, then compaction is momentum."

The analogy holds up. In gradient descent, momentum preserves the direction of optimization while damping oscillations from noisy gradients. In an agent session, compaction preserves the trajectory of reasoning while shedding the weight of irrelevant history. The model keeps its direction without dragging dead tokens forward.
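For readers who want the optimizer side of the analogy made concrete, the classical momentum update is:

```latex
v_{t+1} = \beta\, v_t + \nabla L(\theta_t), \qquad
\theta_{t+1} = \theta_t - \eta\, v_{t+1}
```

The velocity v is a running average of recent gradients that smooths out noise while preserving direction; compacted context plays the same role, carrying forward a distilled record of where the session has been heading rather than every raw observation.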

Without compaction, every irrelevant tool output, every superseded error message, every file read that was ultimately not needed acts like a noisy gradient: it pulls the model's attention in random directions. The model oscillates between old context and current context, spending attention budget on tokens that no longer matter.

With compaction, the context window contains only tokens that contribute to the current task. The model's "optimization trajectory" stays clean. It reasons from a high-signal, low-noise representation of the session state.

This is why Anthropic recommends compaction before summarization. Compaction preserves the original tokens (direction). Summarization rewrites them (potentially changing direction). Compaction is momentum. Summarization is a new gradient estimate from a different vantage point.

Frequently Asked Questions

What is LLM context management?

LLM context management is the discipline of controlling what enters and exits a language model's context window during agent sessions. It covers four areas: memory persistence (writing state to external files), just-in-time retrieval (loading only what's needed), compression (reducing token count while preserving signal), and isolation (using separate context windows for subtasks). Effective context management determines whether an agent degrades after 20 exchanges or sustains quality across hundreds.

Why does agent quality degrade with longer contexts?

Research from Chroma found that all 18 frontier models degrade as context grows, even when the window is not full. The problem is signal dilution: irrelevant tokens make the model worse at attending to tokens that matter. Tool outputs, old conversation turns, and verbose file reads accumulate noise that competes with actionable information. This is context rot in action.

What are Anthropic's four pillars of context engineering?

Write (persist state to external files like CLAUDE.md), Select (load context just-in-time rather than all at once), Compress (use compaction before summarization to reduce tokens), and Isolate (spawn separate context windows for subtasks via subagents). These map directly to Claude Code features: memory files, .claudeignore, auto-compaction, and subagent spawning.

How does Claude Code manage context compared to Cursor?

Claude Code uses 5.5x fewer tokens than Cursor for equivalent tasks. It achieves this through CLAUDE.md files (always-loaded memory), .claudeignore (excluding irrelevant files saves 30-100K tokens), automatic compaction at capacity, subagent isolation for exploration, and on-demand skill loading. Cursor has a 120K context window with no auto-compaction, leading to degradation after 20-30 exchanges that forces new sessions.

When should I use RAG vs compaction vs summarization?

Use RAG for finding specific evidence from large knowledge bases. Use compaction as the first lever for long-running sessions where information exists in the environment. Use summarization when compaction alone is not enough, keeping the last 3 turns raw for continuity. Use subagents for naturally decomposable tasks. Use observation masking for cost reduction without quality loss. Start with .claudeignore + CLAUDE.md + plan mode, which solve 80% of context problems with zero infrastructure.

What is the difference between compaction and summarization?

Compaction deletes low-signal tokens while keeping every surviving sentence verbatim from the original. Summarization rewrites context into shorter form, which can alter code snippets, file paths, and error messages. Factory.ai's evaluation of 36K production messages scored its anchored summaries at 3.70, vs 3.44 for Anthropic's structured summaries and 3.35 for OpenAI's opaque compression. Morph Compact achieves 50-70% reduction with 98% verbatim accuracy and zero hallucination risk.

Ship Agents That Don't Degrade

Morph Compact is verbatim compaction for production agents. 50-70% token reduction, 3,300+ tok/s, zero hallucination risk. Every surviving sentence is word-for-word identical to the original. The first lever to pull for context management.