AI Agent Memory: How Agents Retain, Retrieve, and Reason Across Sessions

AI agent memory is the system an agent uses to retain, retrieve, and reason over information. This guide covers the three types (working memory, short-term, long-term), the challenges each presents, and practical solutions including context compression for working memory optimization.

February 27, 2026 · 3 min read

AI agent memory is the system an agent uses to retain, retrieve, and reason over information across its operation. It maps to three tiers borrowed from cognitive science: working memory (the context window), short-term memory (session history), and long-term memory (persistent cross-session storage). Each tier has different capacity, speed, and failure modes. For coding agents, working memory is the bottleneck.

~37%: Cross-session memory retention (best case)
30%+: Performance drop from context rot
50-70%: Working memory freed by compression
3,300+: Tokens/sec (Morph Compact)

Three Types of AI Agent Memory

The field draws from Endel Tulving's 1972 taxonomy of human memory. A 2025 survey of agent memory systems identified three dominant forms: token-level (context window), parametric (model weights), and latent memory. For practical purposes, agent builders work with three tiers.

Working Memory

The context window. Everything the model can reason over at inference time: the system prompt, conversation history, retrieved documents, tool outputs. Capacity is finite and every token costs attention.

Short-Term Memory

The current session. Conversation turns, tool call results, intermediate reasoning, and accumulated state. Persists only until the session ends or context is compacted.

Long-Term Memory

Persistent storage across sessions. User preferences, project knowledge, learned patterns, past decisions. Requires external systems: databases, vector stores, or files like CLAUDE.md.

Recent research further subdivides long-term memory into episodic (specific past events and their outcomes), semantic (accumulated facts and user preferences), and procedural (learned workflows and decision logic). These map to different storage backends and retrieval strategies.
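The episodic/semantic/procedural split maps naturally to three container shapes: an append-only log, a keyed fact store, and named step sequences. A minimal sketch, with all names illustrative rather than any real library's API:

```python
from dataclasses import dataclass, field

@dataclass
class LongTermMemory:
    # Episodic: specific past events and their outcomes (append-only log).
    episodic: list = field(default_factory=list)
    # Semantic: accumulated facts and user preferences (keyed lookups).
    semantic: dict = field(default_factory=dict)
    # Procedural: learned workflows (named step sequences).
    procedural: dict = field(default_factory=dict)

memory = LongTermMemory()
memory.episodic.append({"event": "fixed webhook retry", "outcome": "tests passed"})
memory.semantic["preferred_language"] = "TypeScript"
memory.procedural["release"] = ["run tests", "build", "tag", "push"]
```

Each field suggests its own retrieval strategy: time-range queries for episodic records, key or similarity lookup for semantic facts, and exact-name lookup for procedures.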

| Dimension | Working Memory | Short-Term Memory | Long-Term Memory |
|---|---|---|---|
| Analogy | CPU registers / RAM | Desktop workspace | Hard drive / database |
| Capacity | 128K-1M tokens | Unbounded (within session) | Unlimited |
| Speed | Instant (in-context) | Instant (in-context) | Retrieval required |
| Persistence | Single inference call | Single session | Across sessions |
| Failure mode | Context rot, attention dilution | Noise accumulation, compaction loss | Low retention (~37%), retrieval errors |
| Key challenge | Signal-to-noise ratio | Compression timing | What to store, how to retrieve |

Working Memory: The Context Window Is All You Get

LLMs are stateless functions. Their weights are frozen by the time they reach inference. The model does not learn from your conversation, does not remember your last session, and does not update its parameters based on your feedback. The only information it can reason over is what you put into the context window.

This makes the context window the agent's working memory: the bottleneck through which all reasoning must pass. Context engineering is the discipline of managing this scarce resource. Every token matters. Every irrelevant token degrades performance.

Attention budget is finite

Like humans who can hold roughly 7 items in working memory, LLMs have a finite "attention budget." Every new token depletes this budget. Chroma's research tested 18 frontier models and found that all of them degrade as input length increases, even on simple retrieval tasks. The degradation follows three mechanisms: the lost-in-the-middle effect, attention dilution at scale, and distractor interference.

The lost-in-the-middle effect (Liu et al., Stanford/TACL 2024) showed that LLM performance drops over 30% when relevant information sits in the middle of the context. Transformer attention follows a U-shaped curve: strong at the start and end, weak in the middle. For an agent that reads 8 files and finds relevant code in file #4, that code sits in the model's blind spot.

100M: Pairwise attention relationships at 10K tokens
10B: Pairwise attention relationships at 100K tokens
1T: Pairwise attention relationships at 1M tokens

Attention scales quadratically. At 100K tokens, the model tracks 10 billion pairwise relationships. Adding more context does not just dilute relevance. It makes the model physically worse at attending to what matters.
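The quadratic growth is plain arithmetic: self-attention scores every token against every other token, so the pair count is n squared. The numbers above fall straight out:

```python
def pairwise_relationships(n_tokens: int) -> int:
    # Self-attention relates every token to every token: n * n pairs.
    return n_tokens * n_tokens

print(pairwise_relationships(10_000))     # 100M at 10K tokens
print(pairwise_relationships(100_000))    # 10B at 100K tokens
print(pairwise_relationships(1_000_000))  # 1T at 1M tokens
```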

Short-Term Memory: Session State and Its Decay

Short-term memory is everything the agent accumulates during a single session: conversation turns, tool outputs, file contents, error traces, and its own reasoning. Unlike working memory (which is the window for a single inference call), short-term memory spans the full session and feeds into working memory at each step.

The problem is accumulation. A typical coding task generates thousands of tokens per step:

Token accumulation in a multi-step coding session

Step 1: Read issue description                          500 tokens
Step 2: Search codebase, read 5 candidate files       8,000 tokens
Step 3: Read related tests, config files               6,000 tokens
Step 4: Backtrack, explore alternative approach         5,000 tokens
Step 5: Found correct file, ready to edit              ----------
Total context: ~20,000 tokens (60%+ is noise)

The agent now has the right information buried in 20K tokens
of search traces, dead ends, and irrelevant file contents.
Most of this hurts performance. It does not help.

Every production agent handles this differently. Claude Code triggers auto-compaction when context approaches the window limit, summarizing history into a structured format. OpenAI recommends compaction as a "default long-run primitive," not an emergency fallback. The question is not whether to compress short-term memory, but when and how.
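The "when" decision can be sketched as a capacity-fraction trigger. The 80% threshold and keep-last-3 policy below are illustrative assumptions, not any agent's documented defaults, and the compaction itself is deliberately naive:

```python
def maybe_compact(history: list[str], total_tokens: int,
                  window: int = 200_000, threshold: float = 0.8,
                  keep_recent: int = 3) -> list[str]:
    # Below the trigger point, keep the full session history.
    if total_tokens < window * threshold:
        return history
    # Naive compaction: collapse older turns into one marker line,
    # keep only the most recent turns verbatim.
    older = history[:-keep_recent]
    marker = f"[compacted {len(older)} earlier turns]"
    return [marker] + history[-keep_recent:]
```

A production agent would replace the marker line with a real summary (or a verbatim compaction), but the trigger shape is the same.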

Compaction is lossy

When Claude Code compacts, it produces documentation-style summaries that capture the gist of what happened but lose specific events, decisions, and exact code references. Users report losing early conversation detail after compaction. This is why compaction vs. summarization matters: the compression method determines what survives.

Long-Term Memory: The Cross-Session Problem

When a session ends, the context window clears. The agent starts fresh. Unless you have built an external memory system, every decision, every discovery, and every learned preference is gone.

Cross-session memory retention is the hardest unsolved problem in agent memory. The best approaches achieve only around 37% retention across sessions according to compression benchmarks. Mem0's research on the LOCOMO benchmark showed that retrieval-augmented memory achieved 26% higher accuracy than OpenAI's native memory (66.9% vs. 52.9%). A Letta agent reached 74% on the same benchmark with GPT-4o mini. These numbers are improving, but they are far from solved.

| Approach | How It Works | Tradeoffs |
|---|---|---|
| Config files (CLAUDE.md) | Always-loaded text files with project instructions | Manual maintenance; limited to what you write down |
| Vector stores / RAG | Embed past interactions, retrieve by similarity | Math ceiling at ~500K docs; code structure is hard to embed |
| Structured databases | Store facts, preferences, decisions in relational/KV stores | Requires schema design; retrieval queries add latency |
| Auto-memory (Claude) | Agent writes notes to MEMORY.md during sessions | First 200 lines loaded per session; can drift or bloat |
| MCP memory servers | SQLite-backed tools the agent reads/writes at runtime | Flexible but requires integration; no standard protocol yet |

Coding agents have converged on a practical pattern: files as memory. Claude Code uses CLAUDE.md for project instructions and MEMORY.md for auto-discovered patterns. This is simple, inspectable, and version-controlled. The first 200 lines of MEMORY.md load into every session automatically.
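The selective-loading behavior is easy to mirror: read the memory file and truncate to the 200-line budget. The function name here is ours, not part of any agent's API:

```python
from pathlib import Path

def load_auto_memory(path: str = "MEMORY.md", max_lines: int = 200) -> str:
    # Only the first `max_lines` lines reach the session's working memory;
    # anything below that stays on disk until explicitly read.
    p = Path(path)
    if not p.exists():
        return ""
    return "\n".join(p.read_text().splitlines()[:max_lines])
```

The truncation is the point: it caps the always-loaded cost of the file, which is why bloat past the budget silently stops paying off.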

The deeper research direction is memory management, not just memory storage. A-Mem (Agentic Memory) treats memory as a living system that merges related memories, marks outdated ones as invalid, and resolves contradictions. This mirrors how human memory consolidates and forgets. Agents need to forget strategically, not just accumulate.

How Coding Agents Manage Memory in Practice

Each major coding agent handles the memory problem differently. The differences are not theoretical. They directly determine session length, accuracy over time, and token cost.

| Feature | Claude Code | OpenAI Codex | Cursor | Devin |
|---|---|---|---|---|
| Working memory | 200K context window | 200K context window | 120K context window | 200K context window |
| Persistent memory | CLAUDE.md + MEMORY.md | None (stateless) | Cursor rules files | Task lists, to-do files |
| Auto-compaction | Yes (at capacity) | Yes (/compact endpoint) | No | Partial (premature) |
| Context isolation | Subagent Task tool | Sandboxed execution | Background indexing | Parallel sandboxes |
| Degradation onset | Gradual (compaction helps) | Gradual | 20-30 exchanges | ~2.5 hours |
| Token efficiency | 5.5x fewer than Cursor | Baseline | High token usage | Variable |

Claude Code uses 5.5x fewer tokens than Cursor for equivalent coding tasks. That gap comes from better context management, not a better base model. Structured memory files, selective loading via .claudeignore, auto-compaction, and subagent isolation each contribute.

Devin takes a different approach for long-running tasks. It maintains a persistent to-do list and iterates over hours or days, using parallel sandboxes for isolation. But it exhibits "context anxiety" where the model prematurely summarizes to avoid hitting limits, losing detail before it needs to.

The memory hierarchy is real

Production coding agents now implement a recognizable memory hierarchy: always-loaded config files (L1 cache), on-demand file loading (L2), session history with compaction (L3), and external retrieval for rare queries (L4). The pattern mirrors CPU memory hierarchies from computer architecture: fast/small at the top, slow/large at the bottom.
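In code, the hierarchy is just a fastest-first lookup. The tier arguments map to the L1-L4 analogy above, and the callable for L4 stands in for whatever retrieval backend you use; this is a sketch of the pattern, not any framework's interface:

```python
def resolve(key, always_loaded, on_demand_files, session_history, retrieve):
    # L1 -> L3: in-context tiers, checked fastest-first.
    for tier in (always_loaded, on_demand_files, session_history):
        if key in tier:
            return tier[key]
    # L4: fall through to external retrieval (vector store, database, ...).
    return retrieve(key)
```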

Memory Architectures: MemGPT, Letta, and Virtual Context

MemGPT (Packer et al., 2023) introduced the most influential agent memory architecture. It treats the LLM as an operating system where the context window is "main memory" (RAM) and external storage is "disk." The agent uses function calls to page information in and out of context, just like an OS manages virtual memory.

MemGPT-style virtual context management

# Illustrative pseudocode: `agent` stands in for a MemGPT-style runtime.
# The agent's context window (main memory) is limited;
# external storage (disk) holds everything else.

# Agent decides what to keep in "RAM" (context window):
core_memory = {
    "persona": "I am a coding assistant working on project X",
    "user": "Prefers TypeScript, uses Next.js, strict on types",
    "current_task": "Fix the webhook retry logic in stripe.ts"
}

# When the agent needs old information, it "pages in" from disk:
agent.call_function("archival_search", query="previous webhook fixes")
# Returns relevant memories from vector store into context

# When context gets full, agent "pages out" to disk:
agent.call_function("archival_insert",
    content="Discovered retryCount can be null for new customers"
)
# Saves to long-term storage, frees context window space

# The result: effectively unlimited memory through intelligent paging

MemGPT now lives as part of the Letta framework, which extends the pattern with memory blocks: dedicated modules for core memory, episodic memory, semantic memory, and procedural memory. Each module uses data structures suited to its content type.

Virtual Context Management

Inspired by OS virtual memory. The agent pages information between the context window (fast, limited) and external storage (slow, unlimited) using function calls. Enables working with information that far exceeds the context window.

Memory Blocks

Letta's extension of MemGPT. Dedicated modules for different memory types: core (persona + user facts), episodic (time-series events), semantic (abstract knowledge), and procedural (step-by-step workflows). Each block has its own update and retrieval logic.

The key insight from MemGPT: the agent itself should manage its memory. Rather than relying on fixed rules (compact at 80% capacity, retrieve top-5 documents), the agent decides what to remember, what to forget, and when to retrieve. This agentic approach to memory is now a core research direction, with papers like Agentic Memory (2026) proposing unified frameworks for short-term and long-term memory learning.

Optimizing Working Memory with Compression

Long-term memory is a hard research problem. Working memory is an engineering problem you can solve today. The approach: remove noise tokens from the context window so the model's attention budget goes to high-signal information.

Three compression approaches have emerged, each with different tradeoffs:

| Method | Mechanism | Hallucination Risk | Best For |
|---|---|---|---|
| Structured summarization | LLM rewrites into organized sections | Medium (paraphrasing can alter details) | High-level progress tracking |
| Opaque compression | Model-internal compression (black box) | Medium (unverifiable) | API-level simplicity |
| Verbatim compaction | Deletes noise, keeps text word-for-word | Zero (no rewriting) | Code, errors, file paths |

For coding agents where exact file paths, error messages, and code snippets must survive compression, the distinction between summarization and compaction is critical. Summarization might compress src/api/webhooks/stripe.ts:98 into "the Stripe webhook handler," losing the exact reference the agent needs for its next edit.
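The distinction is mechanically checkable: a compaction that only deletes must leave every surviving line findable word-for-word in the original, while a paraphrasing summary generally fails the check. A minimal verifier:

```python
def is_verbatim_subset(original: str, compressed: str) -> bool:
    # Every non-empty surviving line must appear word-for-word in the original.
    return all(line in original
               for line in compressed.splitlines() if line.strip())
```

A line-level check like this is coarse (real compaction may operate on sentences or spans), but it catches exactly the failure described above: a file:line reference rewritten into prose no longer matches.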

50-70%: Token reduction (Morph Compact)
98%: Verbatim accuracy
3,300+: Tokens per second
0%: Hallucination risk

Morph Compact takes the deletion approach. The model identifies which tokens carry signal and which are noise, then removes the noise. Every sentence that survives is verbatim from the original. No paraphrasing. No summarization. This means the agent's working memory after compression contains a strict subset of the original content with zero risk of the compression step introducing errors.

Working memory optimization with Morph Compact

from openai import OpenAI

client = OpenAI(
    api_key="your-morph-api-key",
    base_url="https://api.morphllm.com/v1"
)

# Agent's working memory is getting noisy after many tool calls.
# Compact it before the next reasoning step:
response = client.chat.completions.create(
    model="morph-compact",
    messages=[{
        "role": "user",
        "content": accumulated_context  # 20K tokens of session history
    }]
)

compacted = response.choices[0].message.content
# Result: 6-10K tokens, every surviving sentence verbatim
# The agent's next reasoning step sees only high-signal tokens

The ACON framework from academic research validated this direction: adaptive compression of agent observations achieved 26-54% peak token reduction while preserving 95%+ task accuracy. JetBrains found that simple observation masking matched full LLM summarization quality at a fraction of the cost. The evidence is consistent: most of what fills an agent's working memory is noise, and removing it improves performance.
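Observation masking of the kind JetBrains describes can be approximated in a few lines: blank out tool outputs older than the last few, leaving the reasoning and action turns intact. The message shape and placeholder string are our assumptions:

```python
def mask_old_observations(messages: list[dict], keep_recent: int = 2,
                          placeholder: str = "[tool output elided]") -> list[dict]:
    # Indices of tool-output messages, oldest first.
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    masked = set(tool_idx[:-keep_recent]) if keep_recent else set(tool_idx)
    return [{**m, "content": placeholder} if i in masked else m
            for i, m in enumerate(messages)]
```

Because tool outputs dominate token counts in coding sessions (the 20K-token example above is mostly file dumps), even this crude policy recovers a large fraction of the window.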

Compaction is momentum

Jason Liu framed the value precisely: "If in-context learning is gradient descent, then compaction is momentum." It preserves the trajectory of the conversation while shedding the weight of irrelevant history. The agent keeps its direction without dragging dead tokens forward. This is the practical path to better working memory: not bigger windows, but cleaner ones.

Frequently Asked Questions

What is AI agent memory?

AI agent memory is the system an agent uses to retain, retrieve, and reason over information across its operation. It includes three tiers: working memory (the context window), short-term memory (session history), and long-term memory (persistent cross-session storage). Each tier has different capacity, speed, and failure modes.

What is the difference between working memory and long-term memory in AI agents?

Working memory is the context window, the only information the model can reason over during inference. It is fast but capacity-limited (128K-1M tokens). Long-term memory persists across sessions using external storage like vector databases or files. It has unlimited capacity but requires retrieval mechanisms to load relevant information back into working memory.

Why do AI agents forget between sessions?

LLMs are stateless. Their weights are frozen and do not update during use. The only information the model knows about your task is what's in the context window. When a session ends, the context clears. Cross-session retention requires external memory systems, and the best current approaches achieve only around 37% retention accuracy.

How does MemGPT manage agent memory?

MemGPT treats the LLM like an operating system with main memory (context window) and disk storage (external databases). The agent uses function calls to page information in and out of its context, similar to how an OS manages virtual memory. This now forms the basis of the Letta agent framework.

How do coding agents like Claude Code handle memory?

Claude Code uses CLAUDE.md files as always-loaded project memory, auto-compaction to summarize history at context limits, subagent isolation through its Task tool, and .claudeignore to exclude irrelevant files. These strategies map to Anthropic's four pillars of context engineering: Write, Select, Compress, and Isolate.

How does context compression improve agent working memory?

Context compression removes noise tokens from the context window, freeing attention budget for high-signal information. Morph Compact achieves 50-70% token reduction with 98% verbatim accuracy by deleting low-signal content rather than rewriting it. Every surviving sentence is identical to the original, eliminating hallucination risk from the compression step.

Optimize Your Agent's Working Memory

Morph Compact removes noise tokens from the context window so your agent's attention budget goes to what matters. 50-70% reduction, 3,300+ tok/s, and zero hallucination risk. Every surviving sentence is verbatim from the original.