Recursive Language Models: How RLMs Process 10M+ Tokens Without Context Rot

RLM-Qwen3-8B outperforms its base model by 28.3%. MIT's recursive framework processes 10 million tokens by storing context in a Python REPL instead of the attention window.

March 4, 2026 · 2 min read

Key Numbers

91.33%
RLM-GPT-5 on BrowseComp-Plus (base model: 0%)
10M+
Tokens processed without context rot
28.3%
RLM-Qwen3-8B improvement over base
~2-3K
Tokens per RLM query vs 95K+ for direct ingestion

What This Page Covers

Recursive Language Models (RLMs) are a new inference paradigm from MIT CSAIL. They store the input in a Python REPL instead of the attention window, letting models process 10M+ token inputs by writing code to inspect and decompose them. This page explains how they work, their benchmark results, and how they relate to context rot and subagent architectures.

The Problem: Context Rot at Scale

Every transformer-based LLM degrades as its context window fills up. Chroma Research measured this: retrieval accuracy drops from 99% at 2K tokens to below 60% at 128K. The model does not suddenly forget. Information competes for attention weight, and the signal-to-noise ratio collapses as more tokens pile up.

The industry response has been to make context windows bigger. Gemini went to 2M tokens. GPT-5 hit 1M. But bigger windows do not fix the underlying problem. A model with 1M tokens of context still performs worse than the same model with 10K tokens of context on the same retrieval task. The attention mechanism was not designed for information-dense reasoning over millions of tokens.

Three approaches have emerged to work around this:

  1. RAG (Retrieval-Augmented Generation): Retrieve relevant chunks, feed only those to the model. Works for lookup tasks. Fails when the reasoning requires cross-referencing distant parts of the input.
  2. Summarization and compaction: Compress old context into shorter representations. Loses detail. The model cannot recover information it discarded.
  3. Subagent architectures: Split work across multiple model instances with dedicated context windows. Each agent focuses on a subset. Works for multi-task workflows but adds coordination overhead.

RLMs take a fourth approach: do not put the input in the context window at all.

How RLMs Work

The core insight of RLMs is simple: treat the prompt as data in an external environment rather than tokens in the attention window. The model gets a Python REPL with the full input stored as a string variable. Instead of reading the input through attention, it writes code to explore it.

The RLM Processing Loop

  1. Input externalization: The user's prompt (potentially millions of tokens) is stored as a string variable in a Python REPL. The model receives only metadata: length, structure hints, and the task description.
  2. Programmatic inspection: The model writes Python code to examine the input. It might read the first 500 characters, run regex searches, count occurrences of keywords, or split the text into sections.
  3. Recursive decomposition: For complex tasks, the root model launches sub-LM calls on specific chunks. A "root" model (GPT-5) might orchestrate while a faster "recursive" model (GPT-5-mini) processes individual sections.
  4. Aggregation: Results from sub-calls are collected back into the REPL environment. The root model synthesizes them into a final answer, stored in a designated output variable.

The model's active context window never holds the full input. It holds code, intermediate results, and the current reasoning step. A typical RLM query uses 2-3K tokens of context per turn, regardless of whether the input is 100K or 10M tokens.

RLM Pseudocode: Processing a 10M Token Codebase

# The REPL environment stores the input externally
import re  # used below for programmatic inspection

prompt = load_input("codebase_10M_tokens.txt")  # stored in REPL memory, not context

# Root model receives only metadata
root_context = f"""
Input: string variable 'prompt' ({len(prompt)} chars)
Task: Find all authentication bypass vulnerabilities
Available: prompt[:N], prompt[N:M], re.search(), sub_llm_call()
"""

# Model writes inspection code (runs in REPL)
# Step 1: Find relevant files
auth_files = re.findall(r'### FILE: (.*auth.*)', prompt)
# Found: ["src/auth/login.py", "src/auth/middleware.py", ...]

# Step 2: Extract each file's content
findings = []
for f in auth_files:
    chunk = extract_file_content(prompt, f)
    # Step 3: Recursive sub-LM call on each chunk
    result = sub_llm_call(
        model="gpt-5-mini",
        prompt=f"Analyze for auth bypass:\n{chunk}"
    )
    findings.append(result)

# Step 4: Root model aggregates findings
answer = synthesize(findings)  # written to output variable

Context-as-Variable: The Key Abstraction

The RLM paper introduces a concept called "Context-as-Variable." Traditional LLM usage loads text directly into the attention window. RLMs load it into the memory of a runtime environment as a string variable.

This distinction matters for two reasons:

Memory vs. attention. The Python REPL's memory is bounded only by system RAM, not by quadratic attention cost. Storing 10M tokens as a string in Python costs a few hundred MB. Running attention over 10M tokens is computationally intractable for current hardware.
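The gap between the two costs is easy to check with back-of-envelope arithmetic. A minimal sketch, assuming ~4 characters per token and naive fp16 attention scores (both are illustrative assumptions, not figures from the paper):

```python
import sys

# Assume ~4 characters per token, so 10M tokens ≈ 40M characters.
chars = 10_000_000 * 4
text = "x" * chars  # stand-in for a 10M-token input held in REPL memory

size_mb = sys.getsizeof(text) / 1024**2
print(f"string in RAM: {size_mb:.0f} MB")  # tens of MB for ASCII text

# Naive attention over the same input: one n x n score matrix,
# here for a single head in a single layer, fp16 (2 bytes per score).
n = 10_000_000
attn_tb = n * n * 2 / 1024**4
print(f"one attention matrix: {attn_tb:.0f} TB")  # hundreds of TB
```

Even this deliberately generous accounting (one head, one layer) puts the attention matrix several orders of magnitude beyond system RAM, while the string itself is trivial to hold.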

Selective access vs. full scan. Attention processes all tokens on every forward pass. The RLM model accesses only the bytes it requests through code. If the task requires finding a specific function in a 10M token codebase, the model can use re.search() to jump directly to it instead of scanning through millions of irrelevant tokens.
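The selective-access pattern is plain Python. A minimal sketch with a synthetic "codebase" (the filler text and function name are made up for illustration):

```python
import re

# Synthetic codebase: one interesting function buried in megabytes of filler.
filler = "def unrelated(): pass\n" * 500_000
prompt = filler + "def verify_token(tok):\n    return check(tok)\n" + filler

# Jump straight to the definition instead of attending over everything.
m = re.search(r"def verify_token\(.*?\):", prompt)
snippet = prompt[m.start() : m.start() + 200]  # read only the bytes needed
```

The model's context only ever sees `snippet`, a few hundred characters, even though `prompt` holds tens of megabytes.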

The context engineering community has been working toward this insight from the opposite direction, optimizing what goes into the attention window. RLMs bypass the problem entirely. The attention window becomes a workspace for reasoning about data, not a container for the data itself.

Benchmark Results

The MIT paper evaluates RLMs across four long-context benchmarks. The numbers are stark.

91.33%
BrowseComp-Plus (6-11M tokens). Base GPT-5: 0%, Summary Agent: 70.47%
58% F1
OOLONG-Pairs. Base GPT-5: 0.04% F1
62%
CodeQA accuracy. Base GPT-5: 24%, Summary Agent: 41.33%
100x
Input length beyond model context window

BrowseComp-Plus is the headline result. The benchmark requires reasoning over 6-11 million tokens of web content. Base GPT-5 literally cannot process inputs that large, scoring 0%. The Summary Agent, which compresses inputs before processing, reaches 70.47%. The RLM reaches 91.33% by treating the input as an external dataset to be explored programmatically.

OOLONG-Pairs tests information-dense reasoning where difficulty scales quadratically with input length. Base GPT-5 collapses to 0.04% F1 on this task. The RLM reaches 58% F1. The gap is not marginal. The attention mechanism fails catastrophically on this type of task. The RLM's programmatic approach handles it.

CodeQA measures code understanding across large codebases. The RLM more than doubles base GPT-5's accuracy (62% vs 24%). The Summary Agent sits between them at 41.33%, confirming that summarization loses too much detail for code comprehension tasks.

Cost Comparison

RLMs reduce costs by 40-67% compared to direct ingestion on long-context tasks. The model processes 2-3K tokens per REPL turn instead of paying for attention over the full input. On a 10M token task, direct ingestion would cost hundreds of dollars in API calls. The RLM approach costs a fraction of that by only reading the bytes it needs.
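The shape of that saving can be sketched with rough arithmetic. The price, turn count, and sub-call volume below are illustrative assumptions, not figures from the paper; only the ~2-3K tokens-per-turn figure comes from the text above:

```python
# Back-of-envelope cost comparison (all dollar figures hypothetical).
price_per_mtok = 1.25            # $/1M input tokens, assumed flat rate
input_tokens = 10_000_000

# Direct ingestion: pay for attention over the full input.
direct_cost = input_tokens / 1e6 * price_per_mtok

# RLM: small per-turn contexts plus sub-LM calls over selected chunks.
turns = 40                       # hypothetical number of REPL interactions
tokens_per_turn = 2_500          # ~2-3K tokens of context per turn
sub_call_tokens = 5_000_000      # hypothetical: sub-LMs read ~half the input
rlm_cost = (turns * tokens_per_turn + sub_call_tokens) / 1e6 * price_per_mtok

savings = 1 - rlm_cost / direct_cost
```

Under these assumptions the saving lands in the 40-67% band the paper reports; in practice it depends on how much of the input the sub-calls actually need to read.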

RLM-Qwen3-8B: Training Models for Recursive Context Management

The original MIT paper uses RLMs as inference-time scaffolding. Prime Intellect took it further: they trained a model to use the scaffolding natively.

RLMEnv is Prime Intellect's training environment where models learn to manage their own context through reinforcement learning. The model is rewarded for correctly answering questions about long inputs using the RLM paradigm. Over training, it learns when to inspect, when to decompose, and when to launch sub-calls.

RLM-Qwen3-8B, the resulting model, outperforms base Qwen3-8B by 28.3% on average. On OOLONG-Pairs, the improvement is even larger: the base model stays below 0.10 F1, while the trained RLM reaches 23.11 F1. The model has learned to write better inspection code, decompose more effectively, and aggregate results more accurately.

This is the part that aligns with Rich Sutton's "Bitter Lesson." RLMs are not a fixed scaffold, a static set of rules the model follows. They are a learned capability. As training scales, models get better at managing their own context. Prime Intellect calls this "learned context folding": the input is never actually summarized, which avoids the information loss that plagues compression-based approaches.

28.3%
Average improvement over base Qwen3-8B
23.11 F1
OOLONG-Pairs (base: <0.10 F1)
Open Source
Available on HuggingFace (mit-oasys/rlm-qwen3-8b-v0.1)

RLMs vs Subagents: Two Solutions to the Same Problem

RLMs and subagent architectures both solve context degradation. They attack it from different angles.

RLMs externalize the input. One model interacts with its data through code. The context window holds reasoning, not data. Best for: single-document tasks over massive inputs (legal review, codebase analysis, research synthesis).

Subagents split the work. Multiple model instances each get a dedicated context window for a specific subtask. A coordinator dispatches and aggregates. Best for: multi-file, multi-task workflows where different subtasks require different context (refactoring auth while writing tests while updating docs).

The distinction is not academic. A 10M token legal contract is one document. An RLM can process it by programmatically navigating between sections. A 40-file codebase refactor is many tasks. Subagents can work on different files in parallel without polluting each other's context.

Anthropic's own research found multi-agent setups improve performance by up to 90% on complex coding tasks. Cognition measured that their agents spend 60% of their time searching through context. Both numbers point to the same root cause: context rot is the bottleneck, and the solution is keeping the attention window focused on the current reasoning step.

Where They Converge

RLMs and subagents are not mutually exclusive. An RLM can launch sub-LM calls as part of its recursive decomposition. A subagent can use RLM-style REPL access for its specific subtask. The MIT paper already demonstrates this: the root model uses GPT-5 for planning while launching GPT-5-mini sub-calls for chunk processing. That is a subagent pattern inside an RLM framework.
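The converged pattern can be sketched in a few lines. Here `call_model` is a stand-in for any chat-completion client (stubbed out below so the sketch is self-contained); the chunking strategy and model names follow the root/recursive split described above:

```python
def call_model(model: str, prompt: str) -> str:
    """Stub for an LLM API call; a real system would hit an inference endpoint."""
    return f"[{model}] analyzed {len(prompt)} chars"

def rlm_root(prompt: str, task: str, chunk_size: int = 1000) -> str:
    # Recursive decomposition: a cheap model per chunk (the subagent pattern),
    # then the root model synthesizes the findings.
    chunks = [prompt[i:i + chunk_size] for i in range(0, len(prompt), chunk_size)]
    findings = [call_model("gpt-5-mini", f"{task}:\n{c}") for c in chunks]
    return call_model("gpt-5", f"{task}. Synthesize:\n" + "\n".join(findings))
```

The root model's context never holds the raw chunks, only the sub-call findings, which is exactly the workspace-not-warehouse property both paradigms share.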

How This Connects to Morph

Morph's fast apply infrastructure is built on the premise that context management is the core engineering challenge for AI coding tools. When a coding agent needs to apply changes across 40 files, the model that makes the edit does not need the entire codebase in its context window. It needs the specific file, the change description, and enough surrounding context to apply the diff correctly.

This is the same insight that drives RLMs: separate the data from the reasoning. Morph's approach uses dedicated, fast models for the apply step, each operating on a focused context window. The orchestrating agent handles the reasoning about what to change. The apply model handles the execution.

The combination of subagent architectures (for task decomposition) and context-focused inference (for execution) is where production AI coding is heading. RLMs demonstrate that the paradigm works even for single-model systems. Subagents demonstrate that it works for multi-model systems. Both validate that the attention window should be a workspace, not a warehouse.

Context-Optimized Apply Models

Morph's fast apply API gives each edit operation a focused context window, avoiding the context rot that degrades quality at scale. 10,500 tok/s output speed, built for coding agent workflows.

Frequently Asked Questions

What are Recursive Language Models (RLMs)?

RLMs are an inference paradigm from MIT CSAIL where the LLM stores its input in a Python REPL as a variable instead of loading it into the attention window. The model writes code to inspect, decompose, and recursively process the input. This lets models handle inputs 100x beyond their context window. On BrowseComp-Plus (6-11M tokens), RLM-GPT-5 scored 91.33% while the base model scored 0%.

How do RLMs solve context rot?

Context rot occurs when LLM accuracy degrades as context length increases. RLMs avoid it entirely by keeping the input outside the attention window. The prompt is stored as a string variable in a Python REPL. The model interacts with it through code (slicing, regex, chunking) rather than attending to it directly. The model's working context stays at 2-3K tokens per turn regardless of input size.

What is RLM-Qwen3-8B?

A post-trained Qwen3-8B from Prime Intellect's RLMEnv. Trained with reinforcement learning to manage its own context using RLM scaffolding. It outperforms base Qwen3-8B by 28.3% on average. Available on HuggingFace at mit-oasys/rlm-qwen3-8b-v0.1.

How do RLMs compare to subagent architectures?

They solve the same problem (context degradation) differently. RLMs externalize context to a REPL for single-model, single-document processing. Subagents split work across multiple model instances for multi-task workflows. They are complementary: an RLM can launch sub-LM calls, and a subagent can use RLM-style REPL access.

Can I use RLMs with any language model?

Yes. RLMs are an inference-time strategy, not an architecture change. The open-source library at github.com/alexzhang13/rlm supports API-based models (GPT-5, Claude) and local models. Prime Intellect's RLMEnv provides a training environment for fine-tuning models specifically for RLM usage.