A coding agent is a model in a loop with tools. The architecture is simple. The hard parts are choosing the right edit format, solving code search, and engineering what goes into the context window. This guide covers all three, with real data from Pi (the engine behind OpenClaw's 145K GitHub stars), Spotify's Honk (1,500+ merged PRs), and the research behind WarpGrep and SWE-grep.
What Is an AI Coding Agent?
An AI coding agent is software that autonomously reads, writes, and executes code. Unlike autocomplete or chat assistants, an agent plans multi-step tasks, navigates a codebase, runs terminal commands, executes tests, and iterates on failures without manual intervention.
The bar for "real agent" is higher than most marketing suggests. A 2025 RAND Corporation study found that 80-90% of tools marketed as "AI agents" are glorified chatbots. A real coding agent passes four tests:
Takes initiative
Decides what to do next without being told each step. Reads error output, identifies the fix, applies it, and re-runs the test.
Handles errors
When something fails, it doesn't stop and ask. It reads the traceback, reasons about the cause, and tries a different approach.
Uses tools
Reads files, writes files, runs commands, searches code. Not just generating text, but acting on the environment.
Completes multi-step tasks
Chains together file reads, edits, test runs, and git operations to finish a task end-to-end.
Every coding agent, from Claude Code to Pi to a 50-line script, follows the same core pattern: the ReAct loop. The model thinks about what to do (Thought), calls a tool (Action), reads the result (Observation), and repeats until the task is done.
Why Build Your Own Coding Agent?
Claude Code, Cursor, and Copilot work well out of the box for most tasks. So why would you build a custom agent? Five reasons keep coming up across the teams that have done it.
Context control
Existing tools inject content into the context window that you can't see or control. Pi's creator Mario Zechner built his agent specifically because "exactly controlling what goes into the model's context yields better outputs."
Compliance and data sovereignty
Healthcare, finance, and government organizations need code to stay on-premise. EU AI Act, GDPR, and industry regulations mandate security measures that cloud-only tools can't guarantee.
Domain knowledge
Enterprise codebases have internal frameworks, proprietary APIs, and conventions that generic agents don't understand. Renault uses custom Gemini Code Assist agents tuned to their standards.
Cost at scale
Heavy Claude Code usage runs $50-150/dev/month. Custom agents using the API directly can be cheaper. Sean Goedecke showed you can get a working agent in ~50 lines of code using free GitHub Actions.
Vendor independence
Pi supports Anthropic, OpenAI, Google, xAI, Groq, and OpenRouter with cross-provider handoffs. Commercial platforms lock you into one provider's pricing and availability.
The commoditization argument
Sean Goedecke (GitHub) argues that coding agents are already commoditized: "All you need is a slightly smarter base model." The scaffolding is simple. Agent hackers in 2023 were correct about the architecture; they just lacked capable models. This means the differentiation isn't in the loop itself. It's in context engineering, edit format, and domain knowledge.
Build vs Buy vs Customize
Not every team needs to build from scratch. The decision depends on your constraints and where existing tools fall short.
| Approach | When to Choose | Effort | Examples |
|---|---|---|---|
| Use off-the-shelf | Standard dev workflows, no compliance constraints, team < 20 | Zero setup | Claude Code, Cursor, Copilot |
| Customize existing tools | Need project-specific instructions, internal tool integration, better search | Hours to days | CLAUDE.md + MCP servers + WarpGrep |
| Fork and extend | Need a different edit format, custom UI, or novel tool combinations | Days to weeks | oh-my-pi (forked Pi, added hash-anchored edits) |
| Build from scratch | Strict compliance, deeply custom domain, the agent IS your product | Weeks to months | Spotify Honk, Devin, Cursor |
Most teams land in the "customize existing tools" bucket. A well-written CLAUDE.md file, a couple of MCP servers for internal APIs, and WarpGrep for code search get you 80% of the benefit of a custom agent at 10% of the cost.
Anatomy of a Coding Agent
Strip away the branding and every coding agent has four components. Understanding them is the first step to building or customizing your own.
1. The Agent Loop
The model receives a task, reasons about what to do, calls a tool, reads the output, and decides whether to continue or stop. This is the ReAct (Reasoning + Acting) pattern. The loop itself is trivial:
The core agent loop (pseudocode)
while not done:
    response = llm.generate(messages, tools)
    messages.append(response)  # keep the model's turn in the history
    if response.has_tool_calls:
        results = execute_tools(response.tool_calls)
        messages.append(results)  # feed tool output back as observations
    else:
        done = True

Sean Goedecke demonstrated a production-usable agent in roughly 50 lines of code. The loop is not where the complexity lives.
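To make that concrete, here is a minimal runnable version of the same loop in TypeScript using the Anthropic SDK, with bash as the only tool. This is a sketch of the pattern, not any product's source; the model name is the one used in the SDK quickstart later in this guide.

A runnable agent loop (TypeScript, Anthropic SDK)

import Anthropic from "@anthropic-ai/sdk";
import { execSync } from "node:child_process";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// One tool is enough for a demo: bash is the universal escape hatch.
const tools: Anthropic.Tool[] = [
  {
    name: "bash",
    description: "Run a shell command and return its output.",
    input_schema: {
      type: "object",
      properties: { command: { type: "string" } },
      required: ["command"],
    },
  },
];

async function runAgent(task: string): Promise<void> {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: task }];
  while (true) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 4096,
      messages,
      tools,
    });
    messages.push({ role: "assistant", content: response.content }); // keep the model's turn
    if (response.stop_reason !== "tool_use") break; // no tool call: the task is done

    // Execute each tool call and feed the output back as an observation.
    const results: Anthropic.ToolResultBlockParam[] = [];
    for (const block of response.content) {
      if (block.type === "tool_use") {
        const { command } = block.input as { command: string };
        results.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: execSync(command, { encoding: "utf8" }),
        });
      }
    }
    messages.push({ role: "user", content: results });
  }
}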
2. The Tool System
Tools are functions the model can call. At minimum: read files, write files, edit files, and run shell commands. More sophisticated agents add LSP integration, browser automation, image analysis, and third-party integrations via MCP.
3. The Edit Format
How the model expresses code changes. This is where most agents diverge, and where the biggest performance differences hide. Search-and-replace, unified diffs, whole-file rewrites, hash-anchored edits, and semantic Fast Apply all solve the same problem differently.
4. The Search System
How the agent finds relevant code in a large codebase. Naive grep, structural repo maps, vector search, or RL-trained subagents like WarpGrep. This component has the most room for improvement in current tools.
The Four Tools You Actually Need
Pi, built by Mario Zechner (creator of the libGDX game framework), is the strongest argument for minimalism in agent design. It gives the model exactly four tools and a system prompt under 1,000 tokens. It became the engine behind OpenClaw, which reached 145,000 GitHub stars in a single week.
read
Read file contents (text + images). Defaults to the first 2,000 lines with configurable offset and limit.
write
Create or overwrite files. Auto-generates parent directories. Simple and predictable.
edit
Surgical replacements using exact text matching. The model specifies old text and new text.
bash
Execute shell commands with optional timeout. This is the escape hatch: anything the model can't do with the other three tools, it does through bash.
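For a sense of what this surface area looks like, here is how the four tools might be declared for a tool-calling API. The parameter shapes follow the descriptions above but are illustrative; they are not Pi's actual definitions.

Four tool definitions (TypeScript, illustrative)

const tools = [
  {
    name: "read",
    description: "Read file contents. Defaults to the first 2,000 lines.",
    input_schema: {
      type: "object",
      properties: {
        path: { type: "string" },
        offset: { type: "number" }, // line to start from
        limit: { type: "number" }, // max lines to return
      },
      required: ["path"],
    },
  },
  {
    name: "write",
    description: "Create or overwrite a file, creating parent directories.",
    input_schema: {
      type: "object",
      properties: { path: { type: "string" }, content: { type: "string" } },
      required: ["path", "content"],
    },
  },
  {
    name: "edit",
    description: "Replace an exact text match in a file.",
    input_schema: {
      type: "object",
      properties: {
        path: { type: "string" },
        old_text: { type: "string" }, // must match the file byte-for-byte
        new_text: { type: "string" },
      },
      required: ["path", "old_text", "new_text"],
    },
  },
  {
    name: "bash",
    description: "Execute a shell command with an optional timeout.",
    input_schema: {
      type: "object",
      properties: {
        command: { type: "string" },
        timeout: { type: "number" }, // seconds
      },
      required: ["command"],
    },
  },
];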
Zechner's design philosophy: "If I don't need it, it won't be built." Pi deliberately excludes plan modes, to-do lists, sub-agents, MCP support, and permission prompts. Plans live in PLAN.md files. Sub-agents are launched via bash when needed. MCP servers, which can inject 13,000-18,000 tokens of tool definitions, are replaced by CLI tools with README files.
Why minimalism works
Frontier models have been RL-trained extensively on coding tasks. They already know what a coding agent is. Specialized tools and elaborate system prompts add tokens without adding capability. Pi's sub-1,000-token prompt means more of the context window is available for actual code.
Extending Beyond Four
Pi supports extension through external files rather than internal plugins. AGENTS.md for project context. SYSTEM.md to customize the system prompt. Skills as on-demand capability packages. TypeScript hooks for context, session_before_compact, and tool_call events.
The oh-my-pi fork by Can Boluk extended Pi significantly: hash-anchored edits, LSP integration across 40+ languages, a persistent IPython kernel, headless browser automation, six bundled subagents, and a native Rust engine in 7,500 lines. The fork proves the architecture scales. Start minimal, add what you need.
The Harness Problem: Why Edit Format Matters More Than the Model
Can Boluk (creator of oh-my-pi) published "The Harness Problem" in February 2026. It's the most important finding in coding agent research this year. He tested 16 models across different edit interfaces and found that changing only the edit format, not the model, improved coding performance by +8% on average, with some models improving by as much as 10x.
The insight: "The model isn't flaky at understanding the task. It's flaky at expressing itself." When you ask a model to produce a search-and-replace block, it has to reproduce the exact original text character-for-character, match whitespace, handle quoting, and get line boundaries right. The model knows what code to write. The rigid format is what breaks.
Edit Formats Compared
| Format | Used By | Accuracy | Tradeoff |
|---|---|---|---|
| Search/Replace | Claude Code, Gemini | ~80% | Intuitive but whitespace-sensitive |
| Unified Diff | Aider, Codex CLI | 80-85% | Token-efficient but fragile line numbers |
| Whole File | Simple agents | 60-75% | No matching issues but wastes tokens on large files |
| Hash-Anchored | oh-my-pi | ~90%+ | Eliminates ambiguity; new format, less ecosystem support |
| Fast Apply (Semantic) | Morph, Cursor | ~98% | Best accuracy; requires a dedicated apply model |
Fast Apply works by letting the model generate code naturally, then using a specialized smaller model to merge the changes into the target file. The model doesn't need to reproduce exact text or count line numbers. It just writes the code it wants. Morph Fast Apply achieves 98% accuracy at 10,500 tokens per second.
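In code, the pattern looks roughly like this, assuming an OpenAI-compatible endpoint. The base URL, model name, and <code>/<update> message format below are assumptions for illustration; check Morph's docs for the exact contract.

Fast Apply call (TypeScript, sketch)

import OpenAI from "openai";

// Assumption: the apply model sits behind an OpenAI-compatible endpoint.
const morph = new OpenAI({
  apiKey: process.env.MORPH_API_KEY,
  baseURL: "https://api.morphllm.com/v1", // assumed endpoint
});

// The big model writes editSnippet freely, abbreviating unchanged regions;
// the small apply model merges it into the full original file.
async function fastApply(originalFile: string, editSnippet: string) {
  const response = await morph.chat.completions.create({
    model: "morph-v3-large", // assumed model name
    messages: [
      {
        role: "user",
        content: `<code>${originalFile}</code>\n<update>${editSnippet}</update>`,
      },
    ],
  });
  return response.choices[0].message.content; // the merged file
}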
Hash-anchored edits (from oh-my-pi) take a different approach: every line gets a 2-3 character content hash. The model references hashes instead of reproducing text. This eliminated "string not found" errors and reduced output tokens by 61% on some models.
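To illustrate the anchoring idea (the hash scheme and edit shape here are hypothetical, not oh-my-pi's actual format): show the model every line tagged with a short content hash, and let its edits reference hashes instead of quoting text.

Hash-anchored lines (TypeScript, illustrative)

import { createHash } from "node:crypto";

// Short per-line content hash, e.g. "a3f| const total = 0;"
const lineHash = (line: string) =>
  createHash("sha256").update(line).digest("hex").slice(0, 3);

// What the model sees: every line tagged with its hash.
function anchorLines(source: string): string {
  return source.split("\n").map((l) => `${lineHash(l)}| ${l}`).join("\n");
}

// What the model sends back: { hash -> replacement line }. There is no
// exact-text matching, so "string not found" errors can't happen.
// (A real implementation must also disambiguate duplicate lines.)
function applyHashEdits(source: string, edits: Record<string, string>): string {
  return source
    .split("\n")
    .map((l) => (lineHash(l) in edits ? edits[lineHash(l)] : l))
    .join("\n");
}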
Code Search: The Real Bottleneck
Cognition (the team behind Devin) analyzed their agent trajectories and found that agents spend over 60% of their first turn just retrieving context. Not writing code. Not reasoning. Searching. This is the single biggest performance bottleneck in coding agents today.
The naive approach is giving the agent grep and letting it search. This works on small codebases. On large ones, raw grep results flood the context window with irrelevant matches, causing "context rot": the model's performance degrades as noise accumulates in its context.
RL-Trained Search Subagents
The solution that both Cognition and Morph converged on: train a separate model specifically for code search, run it in its own context window, and return only the relevant code to the main agent.
| Approach | Speed | Context Impact | Setup |
|---|---|---|---|
| Naive grep | Fast | High context rot | Zero |
| Repo mapping (Aider) | Moderate | Moderate | Auto-generated |
| Vector/semantic search | Moderate | Low | Requires embedding index |
| WarpGrep (Morph) | Sub-6 sec | 70% less context rot | Zero (no index needed) |
| SWE-grep (Cognition) | Fast | Low | Internal to Devin/Windsurf |
WarpGrep uses reinforcement learning with a reward function that prioritizes precision over recall (an F-beta score with beta = 0.5). It executes up to 24 tool calls (8 parallel calls in each of 3 exploration turns), plus 1 answer turn, all in under 6 seconds. No embeddings, no vector database, no index setup. It needs only ripgrep installed locally.
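For reference, with precision P and recall R, the F-beta score is

$$F_\beta = (1+\beta^2)\,\frac{P\,R}{\beta^2 P + R}, \qquad F_{0.5} = \frac{1.25\,P\,R}{0.25\,P + R}$$

With beta = 0.5, precision is weighted twice as heavily as recall, which pushes the subagent to return a few highly relevant files rather than everything that might match.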
Integration is straightforward. WarpGrep is available as an MCP server for Claude Code, Cursor, and any MCP-enabled agent. It also ships as SDK tools for Anthropic, OpenAI, Gemini, and Vercel AI SDK integrations.
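For Claude Code, registering an MCP server is a small entry in the project's .mcp.json. The package name below is a placeholder; use the one from Morph's docs.

Registering an MCP server (.mcp.json)

{
  "mcpServers": {
    "warpgrep": {
      "command": "npx",
      "args": ["-y", "warpgrep-mcp"]
    }
  }
}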
Context Engineering: The Skill That Separates Good Agents from Great Ones
Prompt engineering is what you write inside the context window. Context engineering is how you decide what fills it. Anthropic's engineering team frames context engineering as the successor to prompt engineering: "curating the smallest set of high-signal tokens that maximize the likelihood of some desired outcome."
Most agent failures are not model failures. They are context failures. The model had enough capability to solve the problem, but the wrong information was in the context window, or the right information was buried under noise.
CLAUDE.md and AGENTS.md
These files are the agent's "constitution." They go into every conversation and provide build commands, project structure, coding conventions, known gotchas, and tool-specific instructions.
AGENTS.md (released by OpenAI, August 2025) is now adopted by 60,000+ open-source projects and supported by Claude Code, Cursor, Copilot, Gemini CLI, Windsurf, Aider, and more. GitHub published a guide based on analysis of 2,500+ repositories showing what makes a good AGENTS.md file.
MCP: The USB-C of AI Tools
Model Context Protocol (MCP) standardizes how AI apps connect to external tools. It solves the M × N problem: without a standard, connecting M models to N tools requires M × N custom integrations; with a shared protocol, it takes M + N. MCP uses JSON-RPC 2.0 with three primitives: Tools (executable functions), Resources (data access), and Prompts (templated workflows).
In December 2025, the Linux Foundation formed the Agentic AI Foundation with MCP as a founding project. OpenAI, Google DeepMind, and Microsoft all adopted it. For custom agents, MCP means you can build a tool once and it works with every agent that supports the protocol.
Subagent Isolation
Instead of one agent doing everything (and filling its context with intermediate search results, error logs, and exploration artifacts), split the work. A search subagent explores in its own context window and returns a condensed summary. A review subagent checks code quality independently. The main agent's context stays clean.
This is the pattern behind WarpGrep, Claude Code's built-in subagents, and VS Code Copilot's multi-agent support. It's also how Spotify's Honk agent orchestrates complex migrations: separate agents for planning, execution, and verification.
The Practical Path: Start by Customizing, Build When You Must
The teams that get the most from coding agents rarely build from scratch. They customize systematically, measure where tools fall short, and build only the components where existing options fail.
Level 1: Project Instructions (15 minutes)
Create a CLAUDE.md or AGENTS.md file in your repo root. Include build and test commands, project architecture overview, coding conventions, and known gotchas. This alone can double agent effectiveness by preventing the most common context failures.
Minimal CLAUDE.md example
# Project: billing-service
## Commands
- `npm test` - Run tests
- `npm run build` - Build for production
- `npm run lint` - Run ESLint
## Architecture
- Express API with PostgreSQL (Drizzle ORM)
- Stripe webhooks handle subscription changes
- All mutations through server actions in actions.ts files
## Conventions
- Use Zod for input validation at API boundaries
- Never import from @internal packages in tests
- Subscription tier checks use lightweight DB reads, not Stripe API calls

Level 2: MCP Servers for Internal Tools (hours)
Build MCP servers that expose your internal APIs, databases, or CI/CD systems to the agent. A custom MCP server for your deployment pipeline means the agent can check build status, trigger deploys, and roll back, all through a standardized protocol that works with any MCP-enabled tool.
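A minimal sketch of such a server using the official TypeScript MCP SDK. The CI endpoint and tool name are hypothetical; the SDK calls follow its documented quickstart pattern.

A deploy-pipeline MCP server (TypeScript, sketch)

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "deploy", version: "1.0.0" });

// Hypothetical internal CI endpoint; swap in your own API.
server.tool(
  "check_build_status",
  { branch: z.string().describe("Git branch to check") },
  async ({ branch }) => {
    const res = await fetch(`https://ci.internal.example/status/${branch}`);
    return { content: [{ type: "text", text: await res.text() }] };
  }
);

// Serve over stdio so any MCP-enabled agent can launch it as a subprocess.
await server.connect(new StdioServerTransport());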
Level 3: Specialized Subagents (days)
Add WarpGrep for code search. Create custom subagent definitions for domain-specific tasks like database migration review, security audit, or API compatibility checking. Claude Code supports custom subagents via .claude/agents/ markdown files with YAML frontmatter, tool restrictions, and custom system prompts.
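As a sketch, a subagent definition is a markdown file with YAML frontmatter; the name, tool list, and prompt below are illustrative.

Example subagent (.claude/agents/migration-reviewer.md)

---
name: migration-reviewer
description: Reviews database migrations for destructive operations. Use when a change touches migration files.
tools: Read, Grep, Glob
---

You are a database migration reviewer. Check each migration for
destructive operations (DROP, irreversible ALTER, data loss), confirm a
matching down migration exists, and report findings as a short,
prioritized list.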
Level 4: Custom Harness (weeks)
When levels 1-3 aren't enough, build the harness. Use the Claude Agent SDK (Python or TypeScript) or the OpenAI Agents SDK. These give you the same agent loop, tools, and context management that power Claude Code and Codex, but with full control over every component.
Agent SDK quickstart (Claude Agent SDK, TypeScript)
import { query } from "@anthropic-ai/claude-agent-sdk";

// The SDK exposes an async query() iterator; options follow its documented shape.
for await (const message of query({
  prompt: "Fix the failing test in billing.test.ts",
  options: {
    model: "claude-sonnet-4-6",
    systemPrompt: "You are a coding agent for billing-service.",
    allowedTools: ["Read", "Write", "Edit", "Bash"],
    mcpServers: {
      deploy: { command: "node", args: ["./mcp-servers/deploy.js"] },
    },
  },
})) {
  if (message.type === "result" && message.subtype === "success") {
    console.log(message.result);
  }
}

Real-World Examples
Spotify: Honk (1,500+ Merged PRs)
Spotify built a custom internal CLI called Honk for background coding agents. It handles complex migrations: Java records, data pipeline upgrades, UI component migrations. The team reports 60-90% time savings versus manual changes. Honk orchestrates prompt execution, MCP formatting, and LLM-based diff evaluation. It can switch between AI models seamlessly based on task requirements.
Key insight from Spotify's engineering team: "Claude Code performs better with prompts that describe the end state and leave room for figuring out how to get there."
Pi and OpenClaw (145K Stars in One Week)
Mario Zechner built Pi as a reaction to the complexity accumulating in existing agents. Four tools, sub-1,000-token system prompt, zero permission prompts. Pi became the engine behind OpenClaw, a personal AI assistant that runs on your own devices across WhatsApp, Telegram, Slack, Discord, and more. The ecosystem spawned forks (PiClaw, GitClaw, MimiClaw) within days.
oh-my-pi: The Fork That Proved the Architecture Scales
Can Boluk forked Pi and extended it with hash-anchored edits (eliminating "string not found" errors), LSP integration across 40+ languages, a persistent IPython kernel, headless browser automation, six bundled subagents, and a native Rust engine. The result: a full-featured development environment built on Pi's minimal foundation, implemented in 7,500 lines of Rust on top of the original TypeScript core.
The 50-Line Agent
Sean Goedecke (GitHub) demonstrated that a working coding agent fits in roughly 50 lines of code using free GitHub Actions. The point: the architecture is commoditized. "Agent hackers in 2023 were correct about the architecture; they simply lacked access to sufficiently capable models." The frontier has moved from "can we build an agent?" to "how do we engineer its context?"
| Project | Approach | Key Result |
|---|---|---|
| Spotify Honk | Custom CLI, model-agnostic, MCP integration | 1,500+ production PRs, 60-90% time savings |
| Pi | 4 tools, <1K token prompt, zero bloat | Powered OpenClaw to 145K GitHub stars |
| oh-my-pi | Pi fork + hash edits, LSP, Rust engine | +8% avg accuracy, 61% fewer output tokens |
| Devin (Cognition) | Full sandbox, RL-trained SWE-grep | First commercial autonomous coding agent |
| 50-line agent | GitHub Actions, minimal loop | Proved architecture is commoditized |
Frequently Asked Questions
How many tools does a coding agent need?
Four at minimum: read files, write files, edit files, and run shell commands. Pi proved this architecture by powering OpenClaw to 145,000 GitHub stars with exactly four tools and a system prompt under 1,000 tokens. Additional tools like code search (WarpGrep) and LSP integration improve performance but aren't strictly required.
Should I build a coding agent from scratch or customize an existing one?
Most teams should customize first. Start with CLAUDE.md or AGENTS.md files for project instructions, add MCP servers for internal tools, and use WarpGrep for code search. Build a custom harness only when you hit measurable limitations, like strict compliance requirements or deeply custom domain knowledge that can't be expressed in configuration files.
What is the harness problem?
The finding that the edit format matters more than the underlying model. Can Boluk tested 16 models with different edit interfaces and found that changing only the format improved performance by 8% on average. Some models improved up to 10x. The model understands the task but struggles to express edits in rigid formats. Fast Apply models solve this with a dedicated merge model.
Why is code search the biggest bottleneck?
Agents spend over 60% of their first turn just finding relevant code. Raw grep results flood the context window with noise, degrading performance on longer tasks. RL-trained search subagents like WarpGrep run search in a separate context window and return only relevant sections, cutting context rot by 70%.
What is context engineering?
How you decide what fills the model's context window, as opposed to what you write inside it. CLAUDE.md files, MCP servers, subagent isolation, and structured retrieval (WarpGrep) are all context engineering techniques. Most agent failures are context failures, not model failures.
How much does a custom coding agent cost to run?
Commercial tools range from $10/month (Copilot) to $200+/month (heavy Claude Code). Custom agents on the API run $5-50/month per developer for moderate usage. The Claude Agent SDK and OpenAI Agents SDK provide production-grade agent loops at standard API pricing. A minimal agent can run on free GitHub Actions credits.
Build Better Agents with WarpGrep and Fast Apply
WarpGrep solves code search. Fast Apply solves edit accuracy. Both integrate with any coding agent through MCP or SDK.