MCP Output Too Large: Why Tool Results Exceed Token Limits and How to Fix It

MCP tools like Playwright, database servers, and web scrapers return massive outputs that fill the agent's context window. A single DOM snapshot can be 100K+ tokens. This page explains why MCP tool results exceed maximum allowed tokens, the impact on agent performance, and how Flash Compact reduces output size 50-70% in under 500ms with zero hallucination.

March 21, 2026 · 2 min read

"result (112,937 characters) exceeds maximum allowed tokens." You connected a Playwright MCP server to Claude Code, ran a page snapshot, and the tool dumped an entire DOM tree into the context window. Or a database MCP server returned 50K tokens of query results. Or a web scraper pulled a full page. Either way, the output filled the agent's context window, triggered auto-compact, and destroyed the working memory the agent needed to finish the task.

100K+: tokens from a single Playwright snapshot
50-70%: reduction with Flash Compact
<500ms: compaction latency
0%: hallucination rate

The Problem: MCP Tools Return Too Much Context

MCP (Model Context Protocol) servers connect coding agents to external tools: browsers, databases, file systems, APIs, web scrapers. Each tool call returns a result that goes into the agent's context window. The problem is that these results are often massive.

A Playwright browser_snapshot call returns the full accessibility tree of the page. For a moderately complex web app, that is 50,000-200,000 tokens of nested DOM elements, ARIA attributes, and text content. A single call can consume the agent's entire usable context.

Database MCP servers return complete query results as JSON. A SELECT * FROM users LIMIT 100 with 20 columns produces 10,000-50,000 tokens depending on field lengths. Web scrapers return full page HTML including navigation, footers, ads, and script tags. GitHub code search returns matching files with surrounding context, easily 5,000-20,000 tokens per search.

When you chain multiple MCP tool calls in a debugging session, the context fills fast. Two Playwright snapshots and a database query can push a session past the auto-compact threshold in minutes. The agent was mid-task, had files open, had a hypothesis. After compaction, it starts over.

MCP output is the fastest way to fill a context window

Regular file reads add 1-5K tokens per file. MCP tool outputs can add 50-200K tokens in a single call. One Playwright snapshot can consume more context than 50 file reads. This makes MCP output management the single highest-leverage optimization for agents that use external tools.
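A back-of-envelope comparison makes the gap concrete. The midpoint figures below are illustrative assumptions drawn from the ranges above, not measurements:

```typescript
// Midpoints of the ranges cited above (illustrative, not measured)
const fileReadTokens = 3_000;   // typical file read: 1-5K tokens
const snapshotTokens = 150_000; // Playwright snapshot: 50-200K tokens

const equivalentReads = snapshotTokens / fileReadTokens;
// equivalentReads = 50: one snapshot costs as much context as 50 file reads
```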

Why MCP Output Gets So Large

The root cause is a protocol-level design gap. MCP servers and MCP clients operate with different constraints, and the protocol does not bridge them.

No context budget negotiation

MCP servers don't know how much context the agent has left. A Playwright server returns the full DOM whether the agent has 100K tokens free or 5K. The protocol has no mechanism for the client to say 'I have 20K tokens of budget, give me the most important 20K.'

Servers return everything by default

MCP servers are built for completeness, not efficiency. A database server returns all columns and all rows matching the query. A file system server returns entire file contents. The assumption is that more data is better. For LLM agents with fixed context windows, the opposite is true.

No distinction between signal and noise

A Playwright DOM snapshot includes every aria-label, every data-testid, every style class, every hidden element. The agent might need 200 tokens of text content from a 100K token DOM. But the server cannot distinguish what matters for the agent's current task.

Unstructured output dominates

Structured data (JSON, CSV) can be filtered by fields. Unstructured data (DOM trees, log files, HTML pages, search results) cannot. Most of the worst offenders for output size produce unstructured text that resists simple filtering.

MCP specification version 2025-03-26 introduced _meta.truncated to signal when a client has truncated a response. But truncation is a blunt instrument. The client does not know which parts of the output are important, so it cuts at an arbitrary character limit. The first 10,000 characters of a DOM tree are mostly <head> metadata and style definitions, not the page content the agent needs.

The Impact on Agent Performance

Large MCP outputs trigger a cascade of problems that compound over the session.

Immediate: context window pressure

A 100K token Playwright snapshot in a 200K context window leaves 100K tokens for everything else. Subtract the system prompt (~20K), tool definitions (5-50K), and CLAUDE.md, and roughly 60K tokens or less remain for the actual conversation. Two large tool calls and the session is out of space.
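The arithmetic, using rough figures from this section (all values are illustrative assumptions; tool definitions vary from 5K to 50K):

```typescript
// Context budget after one large Playwright snapshot (illustrative figures)
const CONTEXT_WINDOW = 200_000;
const snapshot = 100_000;       // one Playwright page snapshot
const systemPrompt = 20_000;
const toolDefinitions = 20_000; // assumed midpoint of the 5-50K range

const remaining =
  CONTEXT_WINDOW - snapshot - systemPrompt - toolDefinitions;
// remaining = 60_000 tokens left for CLAUDE.md and the conversation
```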

Secondary: auto-compact fires prematurely

When context hits ~95% capacity, auto-compact fires. The agent summarizes everything: file paths, error messages, debugging hypotheses, code snippets. A 100K token conversation gets compressed to 5-10K tokens. The summary captures the gist but loses precision. After compaction, the agent does not remember which files it modified or what the specific error message said.

Tertiary: the re-reading loop

After compaction, the agent re-reads files to recover lost context. Each re-read adds tokens, pushing toward the next compaction. Large MCP outputs make this loop worse because they consume so much context per call. The agent might compact, make one more MCP tool call, and immediately need to compact again. This is context rot accelerated by tool output volume.

| MCP operation | Typical output size | % of usable context (140K) |
|---|---|---|
| Playwright page snapshot | 50,000-200,000 tokens | 35-140% |
| Database query (100 rows) | 10,000-50,000 tokens | 7-35% |
| Web scraper (full page) | 20,000-80,000 tokens | 14-57% |
| GitHub file search (10 results) | 5,000-20,000 tokens | 3-14% |
| File system read (large file) | 2,000-10,000 tokens | 1-7% |
| API response (verbose JSON) | 1,000-15,000 tokens | 1-10% |

A Playwright snapshot alone can consume more than the entire usable context. Even smaller outputs like database queries stack up quickly across a debugging session.

Solution 1: Server-Side Truncation

The simplest approach: limit output at the MCP server. Set a maximum character count and cut the response when it exceeds that limit.

Truncating MCP tool output at the server

// In your MCP server implementation
const MAX_OUTPUT_CHARS = 50000;

function handleToolResult(result: string): string {
  if (result.length > MAX_OUTPUT_CHARS) {
    return result.slice(0, MAX_OUTPUT_CHARS) +
      "\n\n[TRUNCATED: output exceeded " + MAX_OUTPUT_CHARS + " characters]";
  }
  return result;
}

Pros: Simple to implement. Guaranteed output size. No external dependencies.

Cons: Truncation is unintelligent. It cuts at a character boundary, not a semantic boundary. The first N characters of a DOM tree are often metadata, not content. You might keep 50K characters of <head> elements and lose all the <body> content the agent actually needs. You are guessing what the agent needs, and the guess is often wrong.

Solution 2: Output Schemas and Filtering

For structured data (JSON, database results), you can define output schemas that extract only specific fields. Instead of returning all 20 columns from a user table, return only id, email, and created_at.

Filtering structured MCP output with schemas

// Define what fields the agent needs
const outputSchema = {
  type: "object",
  properties: {
    id: { type: "string" },
    email: { type: "string" },
    created_at: { type: "string" }
  }
};

// In your MCP server, filter the result
function filterResult(
  rows: Record<string, unknown>[],
  schema: { properties: Record<string, unknown> }
): Record<string, unknown>[] {
  const allowedFields = Object.keys(schema.properties);
  return rows.map(row =>
    Object.fromEntries(
      Object.entries(row).filter(([key]) => allowedFields.includes(key))
    )
  );
}

// 100 rows x 3 fields = ~2K tokens instead of 100 rows x 20 fields = ~30K tokens

Pros: Precise control over structured data. Significant reduction for wide tables or verbose JSON APIs.

Cons: Only works for structured data. DOM trees, log files, HTML pages, and search results are unstructured text. You cannot define a schema for "the important parts of this web page." Most of the worst offenders for MCP output size (Playwright, web scrapers, log aggregators) produce unstructured output.

Solution 3: Flash Compact (State of the Art)

Flash Compact takes a different approach. Instead of truncating (losing the end) or filtering by schema (only works on structured data), it reads the entire output and removes noise through verbatim deletion. Boilerplate, redundant attributes, decorative markup, repeated patterns, and structural scaffolding get removed. Data, error messages, content, and meaningful identifiers survive.

50-70%: output size reduction
<500ms: compaction latency
3,300+: tokens per second
0%: hallucination (verbatim deletion)

The key property is zero hallucination. Flash Compact does not summarize, paraphrase, or rewrite. Every token in the output is copied verbatim from the input. Sentences are either kept entirely or deleted entirely. This means the agent can trust the compacted output the same way it trusts the original. No invented details, no altered numbers, no changed variable names.
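Verbatim deletion is a checkable property: every surviving line must appear character-for-character in the original. A minimal line-level sketch of such a check (`checkVerbatimSubset` is an illustrative helper, not part of any SDK):

```typescript
// Returns true if every non-empty line of `compacted` appears verbatim
// in `original`. Line-level granularity; illustrative helper only.
function checkVerbatimSubset(original: string, compacted: string): boolean {
  const sourceLines = new Set(
    original.split("\n").map(line => line.trim())
  );
  return compacted
    .split("\n")
    .map(line => line.trim())
    .filter(line => line.length > 0)
    .every(line => sourceLines.has(line));
}
```

A summarizer that rewrote even one sentence would fail this check; a compactor that only deletes passes it by construction.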

Flash Compact API usage

import Morph from "morphllm";

const morph = new Morph({ apiKey: process.env.MORPH_API_KEY });

// Compact any MCP tool output before passing to the agent
async function compactToolOutput(toolOutput: string): Promise<string> {
  const response = await morph.compact.create({
    model: "morph-compact-latest",
    input: toolOutput,
  });
  return response.output;
}

// Example: compact a Playwright DOM snapshot
const domSnapshot = await playwrightMcp.callTool("browser_snapshot", {});
const compacted = await compactToolOutput(domSnapshot.content);

// domSnapshot.content: 112,937 characters (100K+ tokens)
// compacted:            39,528 characters (~35K tokens)
// Reduction: 65%
// Latency: 340ms

A 112K character Playwright snapshot compacts to ~39K characters. The agent gets the page structure, text content, interactive elements, and form fields. It loses the aria-describedby chains, redundant class lists, hidden metadata divs, and style attributes that added no value for the task.

Why compaction beats truncation

Truncation keeps the first N characters and drops everything after. For a DOM tree, the first N characters are <head> elements, meta tags, and CSS. The actual page content is later in the document. Truncation keeps the noise and drops the signal. Flash Compact keeps the signal regardless of position and drops noise from everywhere in the document.

Flash Compact works on any text. DOM trees, JSON responses, log files, search results, error traces, HTML pages. The model identifies structural patterns, redundancy, and boilerplate specific to each format and removes them. No format-specific configuration required.
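The position bias of naive truncation is easy to see with a toy page string (the HTML below is invented for illustration):

```typescript
// Head-heavy toy page: metadata first, the useful message buried in <body>
const page =
  "<head>" + "<meta charset='utf-8'/>".repeat(10) + "</head>" +
  "<body><h1>Checkout failed: card declined</h1></body>";

// Naive truncation keeps the first N characters: all head noise
const truncated = page.slice(0, 80);
const keptSignal = truncated.includes("Checkout failed"); // stays false
```

The signal exists in the document but sits past the cutoff, so truncation discards it while faithfully preserving the metadata.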

Comparison: Truncation vs Summarization vs Flash Compact

| Dimension | Truncation | LLM Summarization | Flash Compact |
|---|---|---|---|
| Compression ratio | Fixed (you choose the limit) | 60-80% reduction | 50-70% reduction |
| Information loss | High (drops everything after cutoff) | Medium (summary loses detail) | Low (removes noise, keeps signal) |
| Hallucination risk | None (just cuts) | High (summarizer invents/alters details) | None (verbatim deletion only) |
| Latency | <1ms | 2-10 seconds | <500ms |
| Works on unstructured text | Yes (but badly) | Yes | Yes |
| Position-aware | No (always keeps the start) | Yes | Yes (removes noise from anywhere) |
| Best for | Hard size limits, fallback | Human-readable summaries | Preserving tool output fidelity for agents |

Summarization seems like the obvious choice, but it introduces a critical problem for agents: hallucination. An LLM summarizing a database result might round numbers, change column names, or invent a relationship between fields. The agent then acts on fabricated data. Flash Compact avoids this entirely. The output is a strict subset of the input.

Implementation Guide

Add Flash Compact to your MCP pipeline by intercepting tool results before they reach the agent. The implementation has three steps: check the output size, compact if it exceeds your threshold, and pass the compacted result to the agent.

MCP tool output middleware with Flash Compact

import Morph from "morphllm";

const morph = new Morph({ apiKey: process.env.MORPH_API_KEY });

// Threshold: compact outputs larger than 10K characters
const COMPACT_THRESHOLD = 10000;

interface ToolResult {
  content: string;
  isError?: boolean;
}

async function processToolResult(result: ToolResult): Promise<ToolResult> {
  // Don't compact errors or small outputs
  if (result.isError || result.content.length <= COMPACT_THRESHOLD) {
    return result;
  }

  const compacted = await morph.compact.create({
    model: "morph-compact-latest",
    input: result.content,
  });

  return {
    ...result,
    content: compacted.output,
  };
}

// Usage in your MCP client wrapper
async function callMcpTool(
  server: McpServer,
  toolName: string,
  args: Record<string, unknown>
): Promise<ToolResult> {
  const rawResult = await server.callTool(toolName, args);
  return processToolResult(rawResult);
}

The threshold is configurable. 10K characters is a good starting point. Outputs below this size are small enough that compaction overhead is not worth the savings. Outputs above this size benefit significantly: a 100K character DOM snapshot compacted to 35K saves 65K characters of context, freeing tens of thousands of tokens for actual reasoning.

Selective compaction by MCP server type

// Different servers benefit from different thresholds
const COMPACT_CONFIG: Record<string, { threshold: number; enabled: boolean }> = {
  "playwright":     { threshold: 5000,  enabled: true },   // Always large, always noisy
  "postgres-mcp":   { threshold: 20000, enabled: true },   // Large query results
  "web-scraper":    { threshold: 10000, enabled: true },   // Full page content
  "github":         { threshold: 15000, enabled: true },   // Search results, file contents
  "filesystem":     { threshold: 30000, enabled: true },   // Only compact very large files
};

async function processToolResultByServer(
  serverName: string,
  result: ToolResult
): Promise<ToolResult> {
  const config = COMPACT_CONFIG[serverName];

  if (!config?.enabled || result.content.length <= config.threshold) {
    return result;
  }

  const compacted = await morph.compact.create({
    model: "morph-compact-latest",
    input: result.content,
  });

  return { ...result, content: compacted.output };
}

Which MCP Servers Are Most Affected?

Not all MCP servers produce large outputs. Some, like a time/date server or a calculator, return a few tokens. The following servers consistently produce outputs that strain agent context windows.

| MCP Server | Typical output size | What makes it large | Flash Compact reduction |
|---|---|---|---|
| Playwright | 50,000-200,000 tokens | Full DOM accessibility tree with every attribute | 60-70% (removes redundant attrs, hidden elements) |
| Database (Postgres, MySQL) | 10,000-50,000 tokens | Complete result sets as JSON, all columns included | 40-60% (removes redundant keys, formatting) |
| Web scraper | 20,000-80,000 tokens | Full page HTML including nav, footer, scripts, ads | 55-70% (removes non-content markup) |
| GitHub | 5,000-20,000 tokens | Search results with full file contents and metadata | 45-60% (removes boilerplate, keeps code) |
| File system | 2,000-10,000 tokens | Entire file contents, including generated/minified code | 30-50% (varies by file type) |

Playwright is the worst offender by a wide margin. A single page snapshot of a complex web application can exceed the agent's entire usable context. If your workflow involves Playwright MCP calls, Flash Compact is not optional. Without it, every snapshot forces an auto-compact cycle.

For a broader look at MCP server selection and configuration for coding agents, see best MCP servers for coding and MCP best practices.

Frequently Asked Questions

Why does my MCP tool output say "exceeds maximum allowed tokens"?

The tool returned more text than the agent's context window can handle. This happens with Playwright DOM snapshots (50-200K tokens per page), database query results (full result sets as JSON), web scraper output (complete page HTML), and file search results. The MCP protocol has no built-in size negotiation. Servers return everything regardless of the agent's remaining context budget.

How do I reduce MCP tool output size?

Three options: truncate on the server (loses information at an arbitrary cutoff), filter with output schemas (works for structured data like JSON but not for unstructured text like DOM trees or logs), or compact with Flash Compact (preserves all signal, 50-70% reduction, under 500ms, zero hallucination through verbatim deletion).

Does Flash Compact change the meaning of tool output?

No. Flash Compact uses verbatim deletion. Every sentence that survives is word-for-word identical to the original. There is zero paraphrasing and zero hallucination. It removes noise (boilerplate HTML, redundant attributes, decorative markup, repeated structural patterns) and keeps signal (data, error messages, text content, meaningful identifiers).

Which MCP servers produce the largest outputs?

Playwright produces the largest outputs: DOM snapshots of 50,000-200,000 tokens per page. Database servers return full result sets as JSON (10,000-50,000 tokens depending on query scope). Web scrapers return complete page content including navigation, footers, and script blocks. GitHub code search returns matching files with surrounding context (5,000-20,000 tokens per search).

Can I set a size limit on MCP tool responses?

Some MCP clients let you truncate responses at a character limit, but truncation is lossy and position-biased. It keeps the beginning of the output and drops the rest. For DOM trees, the beginning is usually <head> metadata, not the page content you need. Flash Compact is better because it removes noise from every part of the document, keeping signal regardless of position.

How does large MCP output cause context rot?

Each large tool output fills the context window. When the window fills, auto-compact fires and summarizes everything. The summary loses file paths, error messages, and debugging state. The agent then re-reads files to recover lost context, filling the window again. Large MCP outputs accelerate this context rot cycle because a single tool call can consume 35-140% of the usable context budget.

Does Flash Compact work with streaming MCP responses?

Flash Compact processes the complete response. For streaming MCP scenarios, buffer the full response from the server, pass it through Flash Compact, then forward the compacted result to the agent. The compaction step adds under 500ms of latency. For a 100K token output that would otherwise trigger auto-compact, that 500ms saves minutes of re-reading and recovery time.
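The buffer-then-compact step described above can be sketched like this; the `AsyncIterable<string>` chunk type is an assumption, so substitute whatever chunk shape your MCP client actually emits:

```typescript
// Accumulate streamed chunks into one string before compaction
async function bufferStream(chunks: AsyncIterable<string>): Promise<string> {
  let full = "";
  for await (const chunk of chunks) {
    full += chunk;
  }
  return full;
}

// Usage: buffer the stream, compact once, forward the result
// const full = await bufferStream(streamedResult);
// const compacted = await compactToolOutput(full);
```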

Related Resources

Stop MCP Outputs From Filling Your Context

Flash Compact reduces MCP tool output 50-70% in under 500ms. Zero hallucination. Works on DOM trees, database results, web scraper output, and any unstructured text. Your agent keeps more context for reasoning instead of losing it to tool output noise.