MCP Best Practices: Building Servers That Don't Waste Agent Context

Most MCP servers dump raw, unstructured output into the agent's context window. A database query returns 500 rows as JSON. A web scrape returns the full DOM. A file search returns entire files. This wastes context tokens, triggers compaction earlier, and degrades agent performance. This guide covers tool description best practices, output formatting, security, error handling, performance, and the state-of-the-art strategy for handling large tool outputs: compacting them with Flash Compact in under 500ms with zero hallucination.

March 21, 2026 · 1 min read

MCP servers are the plumbing behind every coding agent that talks to external tools. Databases, browsers, APIs, file systems, CI pipelines. But most servers treat tool output like an afterthought: dump the full result and let the agent sort it out. The agent can't sort it out. It has a fixed context window, and every wasted token in a tool response displaces tokens the agent needs for reasoning. This guide covers the practices that separate context-efficient MCP servers from context-destroying ones.

60%
Agent time spent searching for context (Cognition)
50-70%
Context reduction with Flash Compact
<500ms
Compaction latency
0%
Hallucination rate

Why MCP Best Practices Matter

MCP connects agents to external tools through a standardized protocol. An agent calls a tool, the MCP server executes it, and the result flows back into the agent's context window. The protocol itself is clean. The problem is what servers put in that response.

A Playwright MCP returns the full DOM of a webpage: 30K-100K tokens of HTML with CSS classes, data attributes, and nested divs. The agent needed one paragraph. A database MCP returns 500 rows of JSON when the agent needed a count. A GitHub MCP returns an entire file when the agent needed a function signature on line 47.

These aren't edge cases. Cognition (the team behind Devin) measured that coding agents spend 60% of their time searching for context. Most of that search returns far more data than the agent needs. The extra tokens fill the context window, trigger auto-compact sooner, degrade the quality of compaction summaries, and force the agent to re-read information it already processed.

Good MCP practices reduce this waste at the source. The goal is not minimalism for its own sake. The goal is giving the agent exactly the information it needs, in a format it can act on, without filling its working memory with noise.

Context is a zero-sum game

Every token in a tool response competes with tokens for reasoning, planning, and code editing. A 50K-token database dump doesn't just waste 50K tokens of space. It displaces 50K tokens of the agent's ability to think about the problem. This is the mechanism behind context rot: low-signal tokens crowd out high-signal ones until the agent can no longer maintain coherent state.

Tool Description Best Practices

Tool descriptions are the first thing the agent reads. Before calling any tool, the agent evaluates all available tool schemas to decide which one to use and what parameters to pass. Poor descriptions lead to wrong tool selection, wrong parameters, failed calls, and wasted context on error recovery.

Write descriptions for the agent, not humans

The agent reads tool descriptions as part of its system prompt. Every word counts against the context budget. Be precise and concise. Include what the tool does, what it returns, and any constraints.

Bad vs good tool descriptions

// BAD: Vague, no constraints, no return format
{
  name: "search",
  description: "Search for things in the database"
}

// GOOD: Specific, with constraints and return format
{
  name: "search_users",
  description: "Search users by name or email. Returns max 20 results. Each result includes id, name, email, created_at. Use exact email for single-user lookup. Supports partial name matching (case-insensitive).",
  inputSchema: {
    type: "object",
    properties: {
      query: {
        type: "string",
        description: "User name (partial match) or exact email address"
      },
      limit: {
        type: "number",
        description: "Max results to return. Default 10, max 20.",
        default: 10
      }
    },
    required: ["query"]
  }
}

Include parameter constraints

Agents infer parameter values from descriptions. If a parameter has a maximum value, valid formats, or enum constraints, state them in the description. The agent will pass invalid values if it has to guess.

Use distinct tool names

Tools named search, query, and find on the same server force the agent to read all three descriptions carefully to distinguish them. Tools named search_users_by_email, query_orders_by_date, and find_product_by_sku are self-documenting. The agent can pick the right tool from the name alone.

Document what the tool does NOT do

If your search tool does not support regex, say so. If your database tool cannot write data, say so. Negative constraints prevent the agent from attempting operations that will fail.

One tool, one job

A tool that searches users AND creates users AND deletes users is three tools masquerading as one. Split them. The agent picks between three clear descriptions faster than it parses a complex multi-mode tool.

Include return format

Tell the agent what to expect: 'Returns JSON array of {id, name, email}' or 'Returns markdown table with columns: file, line, match.' The agent plans its next step based on what it expects to receive.

Output Formatting Best Practices

The output of an MCP tool goes directly into the agent's context window. Every byte of formatting, every redundant field, every verbose error message consumes tokens the agent needs for reasoning.

Return structured, minimal responses

Don't return entire database records when the agent asked for a name. Don't return nested objects five levels deep when the agent needs two fields. Shape the response to match what the tool description promised.

Minimal vs verbose tool output

// VERBOSE: 800+ tokens for 3 users
{
  "status": "success",
  "message": "Query executed successfully",
  "metadata": {
    "query_time_ms": 42,
    "total_count": 3,
    "page": 1,
    "page_size": 100,
    "has_more": false
  },
  "data": [
    {
      "id": "usr_abc123",
      "name": "Alice Chen",
      "email": "alice@example.com",
      "created_at": "2025-01-15T08:30:00Z",
      "updated_at": "2025-06-20T14:22:00Z",
      "profile_image_url": "https://cdn.example.com/...",
      "preferences": { "theme": "dark", "locale": "en-US" },
      "role": "admin",
      "last_login": "2025-06-20T14:22:00Z",
      "login_count": 247
    }
    // ... 2 more objects with same verbosity
  ]
}

// MINIMAL: ~200 tokens for the same information
{
  "users": [
    { "id": "usr_abc123", "name": "Alice Chen", "email": "alice@example.com", "role": "admin" },
    { "id": "usr_def456", "name": "Bob Park", "email": "bob@example.com", "role": "member" },
    { "id": "usr_ghi789", "name": "Carol Wu", "email": "carol@example.com", "role": "member" }
  ],
  "total": 3
}

Prefer flat structures over nested ones

Each level of nesting adds brackets, keys, and whitespace that consume tokens. Flatten when possible. Instead of {"user": {"address": {"city": "NYC"}}}, consider {"user_city": "NYC"} if the agent doesn't need the full address object.
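The flattening above can be done generically. Here is a hypothetical helper (the underscore-joined key convention is an assumption, not a standard) that collapses nested objects before serialization:

```typescript
// Hypothetical helper: flatten a nested object into single-level keys.
// Keys are joined with "_" (e.g. user.address.city -> user_address_city).
function flattenObject(
  obj: Record<string, unknown>,
  prefix = ""
): Record<string, unknown> {
  const flat: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    const path = prefix ? `${prefix}_${key}` : key;
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      // Recurse into plain objects; arrays and primitives are kept as-is
      Object.assign(flat, flattenObject(value as Record<string, unknown>, path));
    } else {
      flat[path] = value;
    }
  }
  return flat;
}
```

Apply it only when the agent doesn't need the nested shape; flattening a structure the agent must traverse trades one kind of noise for another.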

Use markdown for human-readable output

When the output is meant to be read (not parsed), markdown is more token-efficient than JSON. A markdown table uses fewer tokens than an array of objects with repeated keys.

| Format | Token cost (10 items) | Agent parseability |
| --- | --- | --- |
| JSON (verbose) | 800-1,200 tokens | High (structured) |
| JSON (minimal) | 200-400 tokens | High (structured) |
| Markdown table | 150-300 tokens | Medium (readable, pattern-matchable) |
| CSV | 100-200 tokens | Medium (compact, less readable) |
| HTML | 2,000-5,000 tokens | Low (noise from tags/attributes) |
| Raw logs | Variable, often 1,000+ | Low (unstructured) |

Handling Large Tool Outputs

This is where most MCP servers fail. Some tools inherently produce large outputs. A code search returns dozens of matching files. A database query returns hundreds of rows. A web scrape returns the full page. A log search returns thousands of lines. You can't always predict or limit the size at the server level.

When a tool dumps 50K+ tokens into the agent's context, three things happen: (1) the agent's remaining context budget drops sharply, (2) earlier relevant context gets displaced or lost in the middle, and (3) auto-compact fires sooner, summarizing away the details the agent just spent tokens to acquire.

Three approaches to large outputs

| Strategy | How it works | Tradeoff |
| --- | --- | --- |
| Server-side truncation | Cut output at N rows/lines/chars | Simple, but loses information beyond the cutoff. The agent cannot access truncated data even if it contains the answer. |
| LLM summarization | Summarize the output before returning | Preserves the gist, but introduces hallucination risk and is slow (1-3 seconds). The summary may invent details not in the original. |
| Flash Compact | Verbatim deletion of noise tokens | 50-70% reduction in under 500ms with zero hallucination. Every surviving sentence is copied verbatim from the original; redundant formatting, repeated patterns, and low-signal metadata are deleted. |

Truncation is a valid first pass. If your database query returns 500 rows, returning only the first 50 with a "truncated": true flag is better than dumping all 500. But the agent loses access to rows 51-500. If the answer was in row 300, truncation fails silently.
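If you do truncate, make the cutoff explicit so the agent knows data exists beyond what it received. A minimal sketch, with illustrative field names:

```typescript
// Truncate a result set honestly: include a flag and the true total
// so the agent can decide whether to request more via pagination.
interface TruncatedResult<T> {
  rows: T[];
  truncated: boolean;
  total: number;
}

function truncateRows<T>(rows: T[], maxRows = 50): TruncatedResult<T> {
  return {
    rows: rows.slice(0, maxRows),
    truncated: rows.length > maxRows,
    total: rows.length,
  };
}
```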

LLM summarization preserves the gist but introduces hallucination. A summary of a database result set might say "most users signed up in January" when the actual data shows February. The agent trusts the summary and makes decisions on fabricated details.

Flash Compact avoids both failure modes. It operates by verbatim deletion: removing tokens that don't contribute signal (redundant JSON keys, repeated formatting patterns, verbose metadata, boilerplate) while keeping every surviving sentence exactly as it appeared in the original. The result is 50-70% smaller, arrives in under 500ms, and contains zero hallucinated content.

Compacting MCP output with Flash Compact

import Morph from "morphllm";

const morph = new Morph({ apiKey: process.env.MORPH_API_KEY });

// In your MCP server's tool handler:
async function handleToolCall(params: ToolParams) {
  // Execute the tool (database query, web scrape, etc.)
  const rawOutput = await executeQuery(params);
  const outputText = JSON.stringify(rawOutput);

  // If output is large, compact it before returning
  if (outputText.length > 10000) {
    const compacted = await morph.compact({
      model: "morph-compact-v1",
      messages: [{ role: "user", content: outputText }],
      // Optional: provide context about what matters
      system: "Preserve all data values, IDs, and relationships. Remove redundant formatting and metadata.",
    });

    return compacted.choices[0].message.content;
  }

  return outputText;
}

Flash Compact is not summarization

Summarization rewrites content in the model's own words, introducing hallucination risk. Flash Compact performs verbatim deletion: it selects which tokens to keep and copies them exactly. No paraphrasing. No invented details. The output is a strict subset of the input, character for character. This is why the hallucination rate is 0%, not "low." See Compaction vs Summarization for the full technical comparison.

When to compact vs when to paginate

Use pagination when the agent might need to iterate through results (browsing a list, scanning search results). Use Flash Compact when the tool output contains inherent noise that the agent will never need (verbose formatting, metadata, repeated structures). Most real-world cases benefit from both: paginate first, then compact each page.

Security Best Practices

MCP servers execute operations on behalf of the agent. A database MCP runs queries. A file system MCP reads and writes files. A shell MCP executes commands. Each is an attack surface.

Input validation

Validate every parameter the agent sends. The agent generates tool parameters from natural language instructions, which means the parameters can contain anything the user typed, including injection payloads. A database MCP that passes user-provided strings directly into SQL queries is vulnerable to injection through the agent.

Input validation for MCP tools

// BAD: Agent-provided input directly in query
async function searchUsers(query: string) {
  return db.query(`SELECT * FROM users WHERE name LIKE '%${query}%'`);
}

// GOOD: Parameterized query with validation
async function searchUsers(query: string) {
  // Validate input
  if (typeof query !== "string" || query.length > 200) {
    return { error: "Query must be a string under 200 characters" };
  }

  // Parameterized query prevents injection
  return db.query(
    "SELECT id, name, email FROM users WHERE name ILIKE $1 LIMIT 20",
    [`%${query}%`]
  );
}

Credential management

Never include credentials, API keys, or connection strings in tool outputs. The tool output goes into the agent's context window, which may be logged, summarized, or visible to the user. Use environment variables for credentials and strip them from any output.

Principle of least privilege

A database MCP for a coding agent should have read-only access unless writes are explicitly needed. A file system MCP should be scoped to the project directory. A shell MCP should restrict which commands can run. Grant the minimum permissions needed for the tool's purpose.

Validate all inputs

Type-check, length-limit, and sanitize every parameter. Agents generate parameters from natural language. Treat them like untrusted user input.

Scope permissions

Read-only database access. Project-directory-only file access. Allowlisted shell commands. Don't give tools more power than they need.

Strip secrets from output

Scrub connection strings, API keys, tokens, and internal URLs from tool responses. These end up in the agent's context and potentially in logs.
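A sketch of an output scrubber, assuming a few common secret shapes. The patterns below are illustrative, not exhaustive; a real server should match its own credential formats:

```typescript
// Illustrative secret patterns (assumptions, not a complete list):
const SECRET_PATTERNS: RegExp[] = [
  /(postgres|mysql|mongodb):\/\/[^\s"']+/gi,        // connection strings
  /\b(sk|pk|api|key|token)[-_][A-Za-z0-9]{16,}\b/g, // key-like tokens
  /\bBearer\s+[A-Za-z0-9._-]+/g,                    // bearer auth headers
];

// Redact matches before the tool response leaves the server.
function scrubSecrets(output: string): string {
  return SECRET_PATTERNS.reduce(
    (text, pattern) => text.replace(pattern, "[REDACTED]"),
    output
  );
}
```

Run this as the last step of every tool handler, after formatting and compaction, so nothing added along the way slips through.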

Error Handling Best Practices

When an MCP tool fails, the error response goes into the agent's context window. A 200-line stack trace wastes tokens. A vague "something went wrong" wastes the agent's next turn trying to figure out what happened. Good error responses are structured, concise, and actionable.

Structured error responses

// BAD: Full stack trace in tool output (200+ tokens of noise)
{
  "error": "Error: ENOENT: no such file or directory, open '/path/to/file.ts'\n    at Object.openSync (node:fs:603:3)\n    at Object.readFileSync (node:fs:471:35)\n    at readConfig (/app/src/config.ts:15:22)\n    at ..."
}

// GOOD: Structured error with actionable context (30 tokens)
{
  "error": {
    "code": "FILE_NOT_FOUND",
    "message": "File not found: /path/to/file.ts",
    "suggestion": "Check if the file path is correct. Use list_files to see available files."
  }
}

Use MCP's isError flag

The MCP protocol supports an isError field in tool responses. Set it to true for errors. This tells the agent the tool failed without the agent having to parse the response to figure that out. It also lets the host application handle errors differently (retry logic, fallback tools, user notification).
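A minimal sketch of result builders that set the flag consistently. The helper names and error payload shape are assumptions; the content/isError structure follows the MCP tool result format:

```typescript
// MCP tool result shape: text content blocks plus an optional isError flag.
interface McpToolResult {
  content: { type: "text"; text: string }[];
  isError?: boolean;
}

function ok(text: string): McpToolResult {
  return { content: [{ type: "text", text }] };
}

// Failures carry a structured, machine-readable error body and isError: true,
// so the agent detects failure without parsing the payload.
function fail(code: string, message: string, suggestion?: string): McpToolResult {
  const body = JSON.stringify({ error: { code, message, suggestion } });
  return { content: [{ type: "text", text: body }], isError: true };
}
```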

Include recovery hints

An error that says "Permission denied" is less useful than one that says "Permission denied: the database user lacks SELECT access to the orders table. Grant access or use a different tool." The agent can act on the second error without additional investigation.

Catch and translate exceptions

Don't let raw exceptions bubble up as tool output. Catch them, extract the relevant message, and return a structured error. A Python traceback or Node.js stack trace can consume 500+ tokens for information the agent does not need (internal module paths, line numbers in dependencies).
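A sketch of that translation layer. The ENOENT mapping is illustrative; a real server would maintain its own table of known error shapes:

```typescript
// Catch raw exceptions and keep only the first line of the message,
// never the multi-line stack trace.
function translateError(err: unknown): { code: string; message: string } {
  if (err instanceof Error) {
    const firstLine = err.message.split("\n")[0];
    // Illustrative mapping: recognize one known pattern, default the rest
    const code = firstLine.includes("ENOENT") ? "FILE_NOT_FOUND" : "TOOL_ERROR";
    return { code, message: firstLine };
  }
  return { code: "TOOL_ERROR", message: String(err) };
}
```

Wrap every handler body in try/catch and return `translateError(e)` as a structured error instead of letting the exception propagate into the tool output.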

Performance Best Practices

MCP tool calls are synchronous from the agent's perspective. The agent sends a request and waits for the response. A tool that takes 5 seconds to respond adds 5 seconds to the agent's turn. Over a session with 50 tool calls, slow tools add minutes of wall-clock time.

Minimize latency

Keep database connections warm. Cache frequently accessed data. Pre-compute expensive results where possible. A database MCP that opens a new connection for every query adds 100-500ms of connection overhead per call. A connection pool eliminates this.

Cache when the data allows it

File system metadata, schema information, and configuration data change infrequently. Cache them. A tool that reads the project's package.json on every call wastes I/O on data that changes once a week.
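A minimal TTL cache sketch for this kind of slow-changing data. The loader signature and TTL values are assumptions for illustration:

```typescript
// Wrap a slow loader so repeated calls within ttlMs reuse the cached value.
function ttlCache<T>(loader: () => Promise<T>, ttlMs: number) {
  let value: T | undefined;
  let expires = 0;
  return async (): Promise<T> => {
    const now = Date.now();
    if (value === undefined || now >= expires) {
      value = await loader();
      expires = now + ttlMs;
    }
    return value;
  };
}
```

Wrap schema or config loaders once at startup, e.g. a hypothetical `const getSchema = ttlCache(() => loadSchema(), 60_000);` — repeated tool calls then hit memory instead of I/O.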

Stream for long-running operations

MCP supports streaming responses via the notifications/progress mechanism. For operations that take more than a few seconds (large file processing, complex queries, API calls to slow services), stream partial results so the agent can start reasoning before the full response arrives.

Limit result sizes server-side

Set hard maximums on result sizes. A database tool should never return more than 100 rows by default. A file search should cap at 20 matches. A log reader should limit to the last 50 lines. Let the agent request more if needed by providing pagination parameters, but default to small.
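Enforcing the cap can be as simple as clamping the agent-supplied limit server-side. The default and maximum here are illustrative:

```typescript
// Illustrative caps: never trust the agent-supplied limit as-is.
const DEFAULT_LIMIT = 20;
const MAX_LIMIT = 100;

function clampLimit(requested?: number): number {
  // Missing, non-numeric, or nonsensical values fall back to the default
  if (requested === undefined || !Number.isFinite(requested) || requested < 1) {
    return DEFAULT_LIMIT;
  }
  // Valid requests are honored up to the hard server-side maximum
  return Math.min(Math.floor(requested), MAX_LIMIT);
}
```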

<100ms
Target tool response time
20
Default max results
10K
Max tokens per response (guideline)
50+
Tool calls per session

Testing and Debugging MCP Servers

Most MCP server bugs manifest as agent performance problems, not crashes. The server returns valid data, but too much of it, or in the wrong format, or with missing fields. The agent silently degrades. You won't see an error. You'll see the agent making worse decisions.

Test with real agents

Unit testing MCP tools verifies that they execute correctly. It does not verify that their output is useful to an agent. Run your MCP server with Claude Code, Cursor, or another agent and watch what happens. Does the agent pick the right tool? Does it pass correct parameters? Does the output fit in context without triggering compaction?

Monitor output sizes

Log the token count of every tool response. Set alerts for responses above 10K tokens. Track the distribution over time. A tool that averages 2K tokens but occasionally spikes to 50K will cause intermittent context problems that are hard to diagnose without monitoring.

Logging output sizes in an MCP server

import { get_encoding } from "tiktoken";

// "cl100k_base" is an encoding name, so use get_encoding
// (encoding_for_model expects a model name like "gpt-4")
const encoder = get_encoding("cl100k_base");

function logToolOutput(toolName: string, output: string) {
  const tokenCount = encoder.encode(output).length;

  console.log({
    tool: toolName,
    output_tokens: tokenCount,
    output_chars: output.length,
    timestamp: new Date().toISOString(),
  });

  if (tokenCount > 10000) {
    console.warn(
      `Tool ${toolName} returned ${tokenCount} tokens. \
Consider compacting with Flash Compact or adding pagination.`
    );
  }
}

Test edge cases for output size

Query your database with no filters. Search for a common term. Read a large file. These are the cases where output explodes. If your tool handles the worst case well (pagination, truncation, or compaction), it handles everything.

Use the MCP Inspector

The official MCP Inspector tool lets you interact with your server manually, send tool calls, and inspect responses. This is the fastest way to verify output format, size, and structure before connecting the server to an agent.

Real-World Examples

The output size problem is not theoretical. Here is what popular MCP servers return and how to handle it.

| MCP Server | Typical output | Context cost | Mitigation |
| --- | --- | --- | --- |
| Playwright / Browser | Full DOM snapshot of a webpage | 30K-100K tokens | Strip to text content, or compact with Flash Compact (50-70% reduction) |
| PostgreSQL / Database | Full result set as JSON | 5K-50K tokens (depends on rows) | Default LIMIT 20. Return only requested columns. Compact large results. |
| GitHub | Full file contents, PR diffs | 2K-20K tokens per file | Return line ranges instead of full files. Use Flash Compact for large diffs. |
| Filesystem / file search | Multiple full file contents | 10K-100K tokens | Return snippets with line numbers. Use semantic search (WarpGrep) instead of full reads. |
| Slack / messaging | Channel history with metadata | 5K-30K tokens | Limit to last 20 messages. Strip formatting metadata. Return text only. |
| Log readers | Thousands of log lines | 10K-50K tokens | Tail last 50 lines. Filter by severity. Compact with Flash Compact for pattern analysis. |

Playwright MCP: DOM snapshots

The Playwright MCP server returns accessibility tree snapshots of web pages. A simple page produces 5K-10K tokens. A complex web application can produce 100K+ tokens. The agent typically needs one element, one table, or one paragraph.

Two mitigations: (1) Use selector-based extraction to return only the relevant element instead of the full page. (2) Run Flash Compact on the snapshot to strip noise (ARIA attributes, empty elements, decorative markup) while preserving the content structure. A 50K-token snapshot compresses to 15K-25K tokens without losing any text content.

Database MCPs: result sets

A SELECT * FROM orders WHERE status = 'pending' on a production database might return 10,000 rows. Even with a LIMIT clause, 100 rows of JSON with 15 fields each is 5K-10K tokens. The agent asked "how many pending orders are there?" and needed a COUNT, not the rows.

Best practice: implement query-aware output shaping. If the query is a count, return the count. If it's a lookup, return matching rows with only the requested columns. If the agent needs to scan results, paginate at 20 rows per page and let the agent request the next page.
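A sketch of query-aware shaping. The naive COUNT detection and column allowlist are illustrative simplifications of what a real server would do:

```typescript
type Row = Record<string, unknown>;

// Shape the response to match what the query actually asked for.
function shapeResult(sql: string, rows: Row[], columns?: string[]): string {
  // COUNT queries: return the number, not the rows
  if (/\bcount\s*\(/i.test(sql)) {
    const count = rows.length ? Object.values(rows[0])[0] : 0;
    return JSON.stringify({ count });
  }
  // Otherwise return only the requested columns
  const shaped = columns
    ? rows.map((r) => Object.fromEntries(columns.map((c) => [c, r[c]])))
    : rows;
  return JSON.stringify({ rows: shaped, total: rows.length });
}
```

A production version would parse the query properly rather than pattern-match it, but the principle holds: the shape of the answer should follow the shape of the question.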

The compaction layer

Regardless of how well you design your server, some tool calls will produce large outputs. A code search that matches 30 files. A database join across three tables. A web scrape of a content-heavy page. For these cases, a compaction layer between the tool output and the agent's context is the safety net.

Adding a Flash Compact layer to any MCP server

import Morph from "morphllm";

const morph = new Morph({ apiKey: process.env.MORPH_API_KEY });

// Generic middleware for any MCP tool handler
async function withCompaction(
  handler: (params: any) => Promise<string>,
  options: { maxTokens?: number; system?: string } = {}
) {
  return async (params: any) => {
    const output = await handler(params);
    const { maxTokens = 10000, system } = options;

    // Estimate tokens (rough: 1 token ≈ 4 chars)
    const estimatedTokens = Math.ceil(output.length / 4);

    if (estimatedTokens <= maxTokens) {
      return output; // Small enough, return as-is
    }

    // Compact large outputs
    const compacted = await morph.compact({
      model: "morph-compact-v1",
      messages: [{ role: "user", content: output }],
      ...(system && { system }),
    });

    return compacted.choices[0].message.content;
  };
}

// Usage: wrap any tool handler
const searchFiles = withCompaction(
  async (params) => {
    const results = await doFileSearch(params.query);
    return JSON.stringify(results);
  },
  {
    maxTokens: 8000,
    system: "Preserve file paths, line numbers, and matching code. Remove surrounding context that doesn't match the query.",
  }
);

Frequently Asked Questions

What is the biggest mistake when building MCP servers?

Returning too much output. A database query MCP that returns 500 rows as JSON dumps 50K+ tokens into the agent's context window. A Playwright MCP returning full DOM snapshots can consume 30K-100K tokens per page. The agent needed a few rows or a single element. Server-side filtering, pagination, and output compaction with Flash Compact (50-70% reduction, under 500ms, zero hallucination) solve this.

How do I handle MCP tools that return large results?

Compact the output with Flash Compact before it enters the agent's context window. Flash Compact reduces output by 50-70% in under 500ms with zero hallucination. Every surviving sentence is preserved verbatim. This is better than truncation (loses information) or summarization (hallucinates details). You can also implement server-side pagination and filtering as a first pass, then compact the remaining output.

Should I truncate MCP output on the server side?

Truncation is a valid first pass but loses information. If a database query returns 500 rows and you truncate to 50, the agent cannot access rows 51-500 even if they contain the answer. Flash Compact is better because it preserves all signal while removing noise: redundant formatting, repeated patterns, verbose metadata.

What output format should MCP tools use?

Structured JSON or markdown. Avoid HTML, raw logs, or verbose stack traces. JSON is parseable and compact. Markdown is readable by both agents and humans. HTML contains massive amounts of noise (CSS classes, attributes, nested divs) that waste context tokens. If your tool must return HTML, strip it to text or structured data.

How do MCP tool descriptions affect agent performance?

Poor descriptions cause the agent to call the wrong tool or pass wrong parameters, wasting context on failed attempts and error recovery. Each failed tool call consumes tokens for the request, the error response, and the agent's reasoning about what went wrong. Clear descriptions with parameter types, constraints, and example values prevent this.

How many MCP servers should I connect to a coding agent?

As few as needed. Each MCP server adds tool schemas to the system prompt, consuming 900 to 51,000 tokens per server depending on tool count. Five servers with 20 tools each can consume 30K-50K tokens for tool definitions alone. Claude Code automatically defers tool schemas when they exceed 10% of the context window. Disable servers you're not actively using. See best MCP servers for coding for which ones are worth the context cost.

Does Flash Compact work with any MCP server?

Yes. Flash Compact operates on the output text, not the server itself. Any MCP tool that returns text (JSON, markdown, logs, HTML, code) can have its output compacted. You call the Flash Compact API with the tool output and get back a reduced version at 3,300+ tok/s. No changes to the MCP server are needed.

Related Resources

Stop MCP Tools From Filling Your Context Window

Flash Compact reduces MCP tool output by 50-70% in under 500ms with zero hallucination. Every surviving sentence is verbatim from the original. Works with any MCP server, any tool, any output format. One API call between your tool and the agent's context.