How to Build Your Own AI Coding Agent from Scratch

Step-by-step guide to building a custom AI coding agent. Learn the 4-tool architecture behind Pi (the engine that took OpenClaw to 145K GitHub stars), why edit format matters more than the model, and when to build vs customize.

March 1, 2026 · 3 min read

A coding agent is a model in a loop with tools. The architecture is simple. The hard parts are choosing the right edit format, solving code search, and engineering what goes into the context window. This guide covers all three, with real data from Pi (the engine that powered OpenClaw to 145K GitHub stars), Spotify's Honk (1,500+ merged PRs), and the research behind WarpGrep and SWE-grep.

4 tools: minimum viable agent (Pi)
+8%: average improvement from a better edit format
60%+: agent time spent on code search
1,500+: PRs merged by Spotify's Honk

What Is an AI Coding Agent?

An AI coding agent is software that autonomously reads, writes, and executes code. Unlike autocomplete or chat assistants, an agent plans multi-step tasks, navigates a codebase, runs terminal commands, executes tests, and iterates on failures without manual intervention.

The bar for "real agent" is higher than most marketing suggests. A 2025 RAND Corporation study found that 80-90% of tools marketed as "AI agents" are glorified chatbots. A real coding agent passes four tests:

Takes initiative

Decides what to do next without being told each step. Reads error output, identifies the fix, applies it, and re-runs the test.

Handles errors

When something fails, it doesn't stop and ask. It reads the traceback, reasons about the cause, and tries a different approach.

Uses tools

Reads files, writes files, runs commands, searches code. Not just generating text, but acting on the environment.

Completes multi-step tasks

Chains together file reads, edits, test runs, and git operations to finish a task end-to-end.

Every coding agent, from Claude Code to Pi to a 50-line script, follows the same core pattern: the ReAct loop. The model thinks about what to do (Thought), calls a tool (Action), reads the result (Observation), and repeats until the task is done.

Why Build Your Own Coding Agent?

Claude Code, Cursor, and Copilot work well out of the box for most tasks. So why would you build a custom agent? Five reasons keep coming up across the teams that have done it.

Context control

Existing tools inject content into the context window that you can't see or control. Pi's creator Mario Zechner built his agent specifically because "exactly controlling what goes into the model's context yields better outputs."

Compliance and data sovereignty

Healthcare, finance, and government organizations need code to stay on-premise. EU AI Act, GDPR, and industry regulations mandate security measures that cloud-only tools can't guarantee.

Domain knowledge

Enterprise codebases have internal frameworks, proprietary APIs, and conventions that generic agents don't understand. Renault uses custom Gemini Code Assist agents tuned to their standards.

Cost at scale

Heavy Claude Code usage runs $50-150/dev/month. Custom agents using the API directly can be cheaper. Sean Goedecke showed you can get a working agent in ~50 lines of code using free GitHub Actions.

Vendor independence

Pi supports Anthropic, OpenAI, Google, xAI, Groq, and OpenRouter with cross-provider handoffs. Commercial platforms lock you into one provider's pricing and availability.

The commoditization argument

Sean Goedecke (GitHub) argues that coding agents are already commoditized: "All you need is a slightly smarter base model." The scaffolding is simple. Agent hackers in 2023 were correct about the architecture; they just lacked capable models. This means the differentiation isn't in the loop itself. It's in context engineering, edit format, and domain knowledge.

Build vs Buy vs Customize

Not every team needs to build from scratch. The decision depends on your constraints and where existing tools fall short.

| Approach | When to Choose | Effort | Examples |
| --- | --- | --- | --- |
| Use off-the-shelf | Standard dev workflows, no compliance constraints, team < 20 | Zero setup | Claude Code, Cursor, Copilot |
| Customize existing tools | Need project-specific instructions, internal tool integration, better search | Hours to days | CLAUDE.md + MCP servers + WarpGrep |
| Fork and extend | Need a different edit format, custom UI, or novel tool combinations | Days to weeks | oh-my-pi (forked Pi, added hash-anchored edits) |
| Build from scratch | Strict compliance, deeply custom domain, the agent IS your product | Weeks to months | Spotify Honk, Devin, Cursor |

Most teams land in the "customize existing tools" bucket. A well-written CLAUDE.md file, a couple of MCP servers for internal APIs, and WarpGrep for code search gets you 80% of the benefit of a custom agent at 10% of the cost.

Anatomy of a Coding Agent

Strip away the branding and every coding agent has four components. Understanding them is the first step to building or customizing your own.

1. The Agent Loop

The model receives a task, reasons about what to do, calls a tool, reads the output, and decides whether to continue or stop. This is the ReAct (Reasoning + Acting) pattern. The loop itself is trivial:

The core agent loop (pseudocode)

messages = [{"role": "user", "content": task}]
done = False
while not done:
    response = llm.generate(messages, tools)
    if response.tool_calls:
        # Feed tool output (the observation) back into the context
        messages.append(response.message)
        messages.extend(execute_tools(response.tool_calls))
    else:
        done = True  # no tool calls: the model considers the task finished

Sean Goedecke demonstrated a production-usable agent in roughly 50 lines of code. The loop is not where the complexity lives.

2. The Tool System

Tools are functions the model can call. At minimum: read files, write files, edit files, and run shell commands. More sophisticated agents add LSP integration, browser automation, image analysis, and third-party integrations via MCP.
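A tool is just a function plus a schema the model can see. A minimal bash tool might look like this (a sketch; the schema shape follows common function-calling APIs, and all names are illustrative):

```python
import subprocess

def bash(command: str, timeout: float = 60) -> str:
    """Run a shell command and return combined stdout/stderr."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

# The schema the model sees when deciding which tool to call.
bash_tool = {
    "name": "bash",
    "description": "Execute a shell command with an optional timeout.",
    "parameters": {
        "type": "object",
        "properties": {
            "command": {"type": "string"},
            "timeout": {"type": "number"},
        },
        "required": ["command"],
    },
}
```

The agent loop matches the model's tool calls against schemas like this, runs the function, and appends the return value as an observation.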

3. The Edit Format

How the model expresses code changes. This is where most agents diverge, and where the biggest performance differences hide. Search-and-replace, unified diffs, whole-file rewrites, hash-anchored edits, and semantic Fast Apply all solve the same problem differently.

4. The Search System

How the agent finds relevant code in a large codebase. Naive grep, structural repo maps, vector search, or RL-trained subagents like WarpGrep. This component has the most room for improvement in current tools.

The Four Tools You Actually Need

Pi, built by Mario Zechner (creator of the libGDX game framework), is the strongest argument for minimalism in agent design. It gives the model exactly four tools and a system prompt under 1,000 tokens. It became the engine behind OpenClaw, which reached 145,000 GitHub stars in a single week.

read

Read file contents (text + images). Defaults to the first 2,000 lines with configurable offset and limit.

write

Create or overwrite files. Auto-generates parent directories. Simple and predictable.

edit

Surgical replacements using exact text matching. The model specifies old text and new text.

bash

Execute shell commands with optional timeout. This is the escape hatch: anything the model can't do with the other three tools, it does through bash.
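Pi's edit tool depends on exact, unambiguous matching. The core idea can be sketched in a few lines of Python (illustrative only; Pi itself is written in TypeScript and its error handling may differ):

```python
from pathlib import Path

def edit_file(path: str, old_text: str, new_text: str) -> None:
    """Replace old_text with new_text, requiring exactly one match."""
    content = Path(path).read_text()
    count = content.count(old_text)
    if count == 0:
        raise ValueError("old_text not found; it must match the file exactly")
    if count > 1:
        raise ValueError("old_text is ambiguous; include more surrounding context")
    Path(path).write_text(content.replace(old_text, new_text))
```

The two error cases are exactly where search-and-replace edits fail in practice, which is why the edit format matters so much (see the next section).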

Zechner's design philosophy: "If I don't need it, it won't be built." Pi deliberately excludes plan modes, to-do lists, sub-agents, MCP support, and permission prompts. Plans live in PLAN.md files. Sub-agents are launched via bash when needed. MCP servers, which can inject 13,000-18,000 tokens of tool definitions, are replaced by CLI tools with README files.

Why minimalism works

Frontier models have been RL-trained extensively on coding tasks. They already know what a coding agent is. Specialized tools and elaborate system prompts add tokens without adding capability. Pi's sub-1,000-token prompt means more of the context window is available for actual code.

Extending Beyond Four

Pi supports extension through external files rather than internal plugins. AGENTS.md for project context. SYSTEM.md to customize the system prompt. Skills as on-demand capability packages. TypeScript hooks for context, session_before_compact, and tool_call events.

The oh-my-pi fork by Can Boluk extended Pi significantly: hash-anchored edits, LSP integration across 40+ languages, a persistent IPython kernel, headless browser automation, six bundled subagents, and a native Rust engine in 7,500 lines. The fork proves the architecture scales. Start minimal, add what you need.

The Harness Problem: Why Edit Format Matters More Than the Model

Can Boluk (creator of oh-my-pi) published "The Harness Problem" in February 2026. It's the most important finding in coding agent research this year. He tested 16 models across different edit interfaces and found that changing only the edit format, not the model, improved coding performance by +8% on average. Some models improved dramatically:

+8%: average improvement across 16 models
10x: max improvement (Grok Code Fast 1)
6.7% → 68.3%: Grok Code Fast 1 score change
16: models tested

The insight: "The model isn't flaky at understanding the task. It's flaky at expressing itself." When you ask a model to produce a search-and-replace block, it has to reproduce the exact original text character-for-character, match whitespace, handle quoting, and get line boundaries right. The model knows what code to write. The rigid format is what breaks.
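The brittleness is easy to demonstrate: a search block that differs from the file by a single whitespace character never matches (a toy example):

```python
# The model emits a search block indented with four spaces...
target = "def add(a, b):\n    return a + b"
# ...but the file on disk uses a tab.
file_content = "def add(a, b):\n\treturn a + b"

# Exact-match editing fails even though the code is semantically identical.
assert target not in file_content
```

The edit is correct; only its serialization is wrong. That gap is what alternative formats try to close.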

Edit Formats Compared

| Format | Used By | Accuracy | Tradeoff |
| --- | --- | --- | --- |
| Search/Replace | Claude Code, Gemini | ~80% | Intuitive but whitespace-sensitive |
| Unified Diff | Aider, Codex CLI | 80-85% | Token-efficient but fragile line numbers |
| Whole File | Simple agents | 60-75% | No matching issues but wastes tokens on large files |
| Hash-Anchored | oh-my-pi | ~90%+ | Eliminates ambiguity; new format, less ecosystem support |
| Fast Apply (Semantic) | Morph, Cursor | ~98% | Best accuracy; requires a dedicated apply model |

Fast Apply works by letting the model generate code naturally, then using a specialized smaller model to merge the changes into the target file. The model doesn't need to reproduce exact text or count line numbers. It just writes the code it wants. Morph Fast Apply achieves 98% accuracy at 10,500 tokens per second.

Hash-anchored edits (from oh-my-pi) take a different approach: every line gets a 2-3 character content hash. The model references hashes instead of reproducing text. This eliminated "string not found" errors and reduced output tokens by 61% on some models.
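The mechanism is easy to sketch: assign each line a short digest, and let the model address lines by digest instead of reproducing them verbatim (an illustrative sketch; oh-my-pi's actual hashing scheme may differ):

```python
import hashlib

def hash_lines(source: str) -> list[tuple[str, str]]:
    # Each line gets a short content hash the model can reference
    # instead of reproducing the line character-for-character.
    return [
        (hashlib.sha1(line.encode()).hexdigest()[:3], line)
        for line in source.splitlines()
    ]
```

An edit then names a hash rather than quoting text, so whitespace mismatches and "string not found" errors disappear by construction.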

Context Engineering: The Skill That Separates Good Agents from Great Ones

Prompt engineering is what you write inside the context window. Context engineering is how you decide what fills it. Anthropic's engineering team frames context engineering as the successor to prompt engineering: "curating the smallest set of high-signal tokens that maximize the likelihood of some desired outcome."

Most agent failures are not model failures. They are context failures. The model had enough capability to solve the problem, but the wrong information was in the context window, or the right information was buried under noise.

CLAUDE.md and AGENTS.md

These files are the agent's "constitution." They go into every conversation and provide build commands, project structure, coding conventions, known gotchas, and tool-specific instructions.

AGENTS.md (released by OpenAI, August 2025) is now adopted by 60,000+ open-source projects and supported by Claude Code, Cursor, Copilot, Gemini CLI, Windsurf, Aider, and more. GitHub published a guide based on analysis of 2,500+ repositories showing what makes a good AGENTS.md file.

MCP: The USB-C of AI Tools

Model Context Protocol (MCP) standardizes how AI apps connect to external tools. It solves the M × N problem: without a standard, M models times N tools requires M × N custom integrations. MCP uses JSON-RPC 2.0 with three primitives: Tools (executable functions), Resources (data access), and Prompts (templated workflows).
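On the wire, a tool invocation is a JSON-RPC 2.0 request using MCP's tools/call method. The envelope below follows the spec; the tool name and arguments are hypothetical:

```python
import json

# JSON-RPC 2.0 envelope for an MCP tool call. The "check_build_status"
# tool and its arguments are invented for illustration.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "check_build_status",
        "arguments": {"pipeline": "billing-service"},
    },
}

wire = json.dumps(request)
```

The server replies with a JSON-RPC result containing the tool's output, which the host feeds back to the model as an observation.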

In December 2025, the Linux Foundation formed the Agentic AI Foundation with MCP as a founding project. OpenAI, Google DeepMind, and Microsoft all adopted it. For custom agents, MCP means you can build a tool once and it works with every agent that supports the protocol.

Subagent Isolation

Instead of one agent doing everything (and filling its context with intermediate search results, error logs, and exploration artifacts), split the work. A search subagent explores in its own context window and returns a condensed summary. A review subagent checks code quality independently. The main agent's context stays clean.

This is the pattern behind WarpGrep, Claude Code's built-in subagents, and VS Code Copilot's multi-agent support. It's also how Spotify's Honk agent orchestrates complex migrations: separate agents for planning, execution, and verification.
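The isolation boundary is just a function call: the subagent reasons in its own message list, and only a condensed summary returns to the parent (a minimal sketch; generate stands in for any LLM call, and the SUMMARY convention is illustrative):

```python
def run_subagent(task: str, generate, max_turns: int = 10) -> str:
    # Fresh context: search noise and tool output stay inside this list
    # and never enter the parent agent's context window.
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})
        if reply.startswith("SUMMARY:"):
            # Only the condensed summary crosses the boundary.
            return reply[len("SUMMARY:"):].strip()
    return "Subagent produced no summary."
```

Whatever the subagent read, grepped, or broke along the way is discarded with its message list; the parent pays only for the summary tokens.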

The Practical Path: Start by Customizing, Build When You Must

The teams that get the most from coding agents rarely build from scratch. They customize systematically, measure where tools fall short, and build only the components where existing options fail.

Level 1: Project Instructions (15 minutes)

Create a CLAUDE.md or AGENTS.md file in your repo root. Include build and test commands, project architecture overview, coding conventions, and known gotchas. This alone can double agent effectiveness by preventing the most common context failures.

Minimal CLAUDE.md example

# Project: billing-service
## Commands
- `npm test` - Run tests
- `npm run build` - Build for production
- `npm run lint` - Run ESLint

## Architecture
- Express API with PostgreSQL (Drizzle ORM)
- Stripe webhooks handle subscription changes
- All mutations through server actions in actions.ts files

## Conventions
- Use Zod for input validation at API boundaries
- Never import from @internal packages in tests
- Subscription tier checks use lightweight DB reads, not Stripe API calls

Level 2: MCP Servers for Internal Tools (hours)

Build MCP servers that expose your internal APIs, databases, or CI/CD systems to the agent. A custom MCP server for your deployment pipeline means the agent can check build status, trigger deploys, and roll back, all through a standardized protocol that works with any MCP-enabled tool.

Level 3: Specialized Subagents (days)

Add WarpGrep for code search. Create custom subagent definitions for domain-specific tasks like database migration review, security audit, or API compatibility checking. Claude Code supports custom subagents via .claude/agents/ markdown files with YAML frontmatter, tool restrictions, and custom system prompts.
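A custom subagent definition is a markdown file with YAML frontmatter. A sketch of one (the name, tools list, and prompt are illustrative; check the Claude Code docs for the current frontmatter fields):

```markdown
---
name: migration-reviewer
description: Reviews database migration changes for safety and reversibility.
tools: Read, Grep, Bash
---

You review database migrations. Check that every migration is reversible,
avoids locking large tables, and follows this repo's ORM conventions.
Report blocking issues before style nits.
```

Restricting the tools list keeps the subagent focused and limits blast radius: a reviewer that can only read and grep cannot accidentally edit files.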

Level 4: Custom Harness (weeks)

When levels 1-3 aren't enough, build the harness. Use the Claude Agent SDK (Python or TypeScript) or the OpenAI Agents SDK. These give you the same agent loop, tools, and context management that power Claude Code and Codex, but with full control over every component.

Agent SDK quickstart (Claude Agent SDK, TypeScript)

import { query } from "@anthropic-ai/claude-agent-sdk";

for await (const message of query({
  prompt: "Fix the failing test in billing.test.ts",
  options: {
    model: "claude-sonnet-4-6",
    systemPrompt: "You are a coding agent for billing-service.",
    allowedTools: ["Read", "Write", "Edit", "Bash"],
    // MCP servers can be attached via the mcpServers option.
  },
})) {
  if (message.type === "result") console.log(message.result);
}

Real-World Examples

Spotify: Honk (1,500+ Merged PRs)

Spotify built a custom internal CLI called Honk for background coding agents. It handles complex migrations: Java records, data pipeline upgrades, UI component migrations. The team reports 60-90% time savings versus manual changes. Honk orchestrates prompt execution, MCP formatting, and LLM-based diff evaluation. It can switch between AI models seamlessly based on task requirements.

Key insight from Spotify's engineering team: "Claude Code performs better with prompts that describe the end state and leave room for figuring out how to get there."

Pi and OpenClaw (145K Stars in One Week)

Mario Zechner built Pi as a reaction to the complexity accumulating in existing agents. Four tools, sub-1,000-token system prompt, zero permission prompts. Pi became the engine behind OpenClaw, a personal AI assistant that runs on your own devices across WhatsApp, Telegram, Slack, Discord, and more. The ecosystem spawned forks (PiClaw, GitClaw, MimiClaw) within days.

oh-my-pi: The Fork That Proved the Architecture Scales

Can Boluk forked Pi and extended it with hash-anchored edits (eliminating "string not found" errors), LSP integration across 40+ languages, a persistent IPython kernel, headless browser automation, six bundled subagents, and a native Rust engine. The result: a full-featured development environment built on Pi's minimal foundation, implemented in 7,500 lines of Rust on top of the original TypeScript core.

The 50-Line Agent

Sean Goedecke (GitHub) demonstrated that a working coding agent fits in roughly 50 lines of code using free GitHub Actions. The point: the architecture is commoditized. "Agent hackers in 2023 were correct about the architecture; they simply lacked access to sufficiently capable models." The frontier has moved from "can we build an agent?" to "how do we engineer its context?"

| Project | Approach | Key Result |
| --- | --- | --- |
| Spotify Honk | Custom CLI, model-agnostic, MCP integration | 1,500+ production PRs, 60-90% time savings |
| Pi | 4 tools, <1K token prompt, zero bloat | Powered OpenClaw to 145K GitHub stars |
| oh-my-pi | Pi fork + hash edits, LSP, Rust engine | +8% avg accuracy, 61% fewer output tokens |
| Devin (Cognition) | Full sandbox, RL-trained SWE-grep | First commercial autonomous coding agent |
| 50-line agent | GitHub Actions, minimal loop | Proved architecture is commoditized |

Frequently Asked Questions

How many tools does a coding agent need?

Four at minimum: read files, write files, edit files, and run shell commands. Pi proved this architecture by powering OpenClaw to 145,000 GitHub stars with exactly four tools and a system prompt under 1,000 tokens. Additional tools like code search (WarpGrep) and LSP integration improve performance but aren't strictly required.

Should I build a coding agent from scratch or customize an existing one?

Most teams should customize first. Start with CLAUDE.md or AGENTS.md files for project instructions, add MCP servers for internal tools, and use WarpGrep for code search. Build a custom harness only when you hit measurable limitations, like strict compliance requirements or deeply custom domain knowledge that can't be expressed in configuration files.

What is the harness problem?

The finding that the edit format matters more than the underlying model. Can Boluk tested 16 models with different edit interfaces and found that changing only the format improved performance by 8% on average. Some models improved up to 10x. The model understands the task but struggles to express edits in rigid formats. Fast Apply models solve this with a dedicated merge model.

Why is code search the biggest bottleneck?

Agents spend over 60% of their first turn just finding relevant code. Raw grep results flood the context window with noise, degrading performance on longer tasks. RL-trained search subagents like WarpGrep run search in a separate context window and return only relevant sections, cutting context rot by 70%.

What is context engineering?

How you decide what fills the model's context window, as opposed to what you write inside it. CLAUDE.md files, MCP servers, subagent isolation, and structured retrieval (WarpGrep) are all context engineering techniques. Most agent failures are context failures, not model failures.

How much does a custom coding agent cost to run?

Commercial tools range from $10/month (Copilot) to $200+/month (heavy Claude Code). Custom agents on the API run $5-50/month per developer for moderate usage. The Claude Agent SDK and OpenAI Agents SDK provide production-grade agent loops at standard API pricing. A minimal agent can run on free GitHub Actions credits.

Build Better Agents with WarpGrep and Fast Apply

WarpGrep solves code search. Fast Apply solves edit accuracy. Both integrate with any coding agent through MCP or SDK.