2025 proved coding agents could work. 2026 is about making them work reliably. The model is the brain. The harness is the body. Meta acquired Manus for $2B, not for the model, but for the harness. This guide covers the frameworks, patterns, and architecture behind every serious coding agent shipping today.
What Is Agent Engineering?
Agent engineering is the discipline of building reliable systems around language models. swyx coined the term at the 2025 AI Engineer Summit. His core argument: a minimal agent definition of "LLM + tools + loop" is too simplistic to be useful. It leaves out memory, planning, and authority, which are exactly the components that separate agents that ship from agents that demo well.
Simon Willison defines agentic engineering more precisely: building software using coding agents where the defining feature is that they can both generate and execute code. The execution part is what makes this different from chat-based assistants. The agent writes code, runs it, sees the error, fixes it, and runs it again. No copy-paste.
The shift from "build a better model" to "build a better harness" happened fast. Phil Schmid frames it as an analogy: the model is the CPU providing raw processing power. The context window is RAM. The harness is the operating system, curating context, handling tool dispatch, and providing drivers. The agent is the application running on top. You do not ship a CPU to end users. You ship an operating system.
Why the harness is worth billions
Meta acquired Manus for ~$2B in December 2025. Not for the model (Manus uses foundation models from Anthropic, OpenAI, and others). For the harness. Manus rebuilt their agent harness five times in six months, each architecture improving reliability and task completion. The harness is the product.
The IMPACT Framework
swyx's IMPACT framework defines the six components that make agents work. He created it because simplified definitions ("LLM + tools + loop") make engineers forget the components that actually determine agent quality: planning, memory, and authority.
I - Intent
Goals encoded via multimodal I/O and verified through evals. The agent must understand what success looks like before acting. Without clear intent specification, agents wander.
M - Memory
Long-running memory creates coherence and self-improvement. Not just conversation history, but skill libraries and reusable workflow patterns that persist across sessions.
P - Planning
Multi-step editable plans. Devin and Deep Research prove that letting users modify agent plans mid-execution significantly improves outcomes. Static plans fail. Editable plans adapt.
A - Authority
The most overlooked element. Trust between humans and agents. Permission models, approval gates, sandbox boundaries. "Stutter-step agents get old fast": too many approval prompts kill productivity.
C - Control Flow
The more agentic an application, the more the LLM decides the control flow. This distinguishes real agents from preset workflows. Dynamic execution paths versus hardcoded sequences.
T - Tools
RAG and search, sandboxed code execution, browser automation. Everyone agrees on tools. The disagreement is how to manage them: static registration vs. dynamic discovery vs. logit masking.
Notably, OpenAI's official agent definition (TRIM: Tools, Reasoning, Instructions, Memory) omits Planning and Authority despite their importance in production systems. The IMPACT framework captures what practitioners have learned from shipping agents at scale.
The Agent Loop
Every coding agent implements some version of the same core loop. Princeton and Google's ReAct (Reason + Act) pattern from 2022 formalized it: the model alternates between generating a reasoning trace and selecting an action. In practice, the loop looks like this:
The agent loop (pseudocode)

```python
while task_not_complete:
    # 1. READ — gather relevant context
    state = read_files() + read_test_output() + read_errors()
    context = harness.select_context(state, task)

    # 2. PLAN — decide what to do next
    plan = model.reason(context, task)

    # 3. ACT — execute via tools
    result = harness.dispatch_tool(plan.next_action)

    # 4. OBSERVE — check the outcome
    outcome = harness.evaluate(result)
    if outcome.needs_retry:
        context = harness.add_error_trace(outcome.error)
        continue
    if outcome.needs_human:
        harness.escalate(outcome)
        break

    harness.checkpoint(result)  # git commit, progress file
```

The differences between Claude Code, Cursor, Codex CLI, Cline, and Aider are not in this loop. They all implement it. The differences are in how the harness manages each step: how it selects context, how it dispatches tools, how it handles failures, and how it knows when to stop.
The loop cannot run without feedback
That loop falls apart without reliable feedback from tests, linters, and type checkers. Addy Osmani is direct about this: "You absolutely have to test what it writes." Agents with access to a test suite can "fly" through a project. Agents without one hallucinate progress.
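In practice, that feedback only helps if the harness turns raw test-runner output into something structured the model can act on. A minimal sketch of that step, assuming pytest-style output (the `parse_test_output` helper and its output shape are illustrative, not from any particular tool):

```python
import re

def parse_test_output(output: str) -> dict:
    """Turn raw pytest-style output into structured feedback for the agent."""
    failed = re.findall(r"FAILED (\S+)", output)
    any_passed = bool(re.search(r"(\d+) passed", output))
    return {"failed_tests": failed, "all_green": any_passed and not failed}

# The harness feeds this structured result back into the agent's context:
raw = "FAILED tests/test_auth.py::test_refresh - AssertionError\n1 failed, 3 passed"
feedback = parse_test_output(raw)
```

Without this step, the agent sees a wall of text; with it, the harness can route failures directly into the retry branch of the loop.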
Harness Engineering
Martin Fowler and Birgitta Boeckeler at ThoughtWorks define harness engineering as the tooling and practices used to keep AI agents in check when maintaining large applications. Their framework has three components:
Context Engineering
A continuously refined knowledge base embedded in the codebase, supplemented by dynamic sources like observability data and browser navigation for agents.
Architectural Constraints
Guardrails enforced through both LLM-based agents and deterministic custom linters and structural tests (like ArchUnit) that monitor code quality.
Entropy Management
Periodic 'garbage collection' agents that identify documentation inconsistencies and architectural constraint violations over time. The codebase degrades without active cleanup.
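The deterministic side of those guardrails can be very small. A sketch of an ArchUnit-style structural test in Python, checking that one layer never imports another (the `app.domain` / `app.ui` layer names and the `layer_violations` helper are hypothetical):

```python
import ast

# Assumed layering rule: domain code must never import UI code.
FORBIDDEN = {"app.domain": {"app.ui"}}

def layer_violations(module_name: str, source: str) -> list:
    """Deterministic structural check: flag imports that cross layer boundaries."""
    banned = {b for layer, targets in FORBIDDEN.items()
              if module_name.startswith(layer) for b in targets}
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            if any(node.module.startswith(b) for b in banned):
                violations.append(node.module)
    return violations

bad = layer_violations("app.domain.orders", "from app.ui.widgets import Button")
```

Because the check parses the AST rather than asking an LLM, it never hallucinates a violation, which is exactly why Fowler and Boeckeler pair LLM-based agents with deterministic linters.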
Anthropic's harness guidance for long-running agents solves a specific problem: agents that need to work across multiple context windows. Their pattern uses an initializer agent that sets up the environment (init.sh, progress file, initial git commit, feature list) and a coding agent that reads progress files and git logs to understand context, selects one high-priority feature, implements it, and updates documentation before the session ends.
Anthropic's harness pattern for multi-session agents

```
# Session 1: Initializer agent sets up the scaffold
harness.create("init.sh")              # environment setup script
harness.create("progress.json")        # structured feature list
harness.create("claude-progress.txt")  # human-readable status
git commit -m "Initialize project scaffold"

# Session N: Coding agent picks up where it left off
progress = read("claude-progress.txt")
git_log = run("git log --oneline -10")
features = read("progress.json")

# Select highest-priority incomplete feature
next_feature = features.find(f => f.status == "incomplete")

# Implement ONE feature per session (prevents context exhaustion)
implement(next_feature)
run_tests()
update("claude-progress.txt")
git commit -m "Implement {next_feature.name}"
```

Key principle: one feature per session. Trying to do too much in a single context window leads to context exhaustion mid-implementation. Clean commit states ensure code is production-ready before the context window ends. This mirrors how human engineering teams work across shifts: each person inherits clear documentation, not incomplete mental state.
Harness Patterns Across Tools
Every serious coding agent has published details about its harness. The specifics reveal how different teams solved the same core problems.
| Component | Claude Code | Cursor | Manus | OpenCode |
|---|---|---|---|---|
| Loop pattern | Initializer + coding agent | Model-specific harness per model | KV-cache-optimized ReAct | Server-client with LSP |
| Context strategy | CLAUDE.md + just-in-time retrieval | Repo indexing + reasoning trace preservation | Filesystem-as-context + todo.md recitation | Built-in LSP for immediate feedback |
| Tool management | MCP + lazy loading (95% context reduction) | Renamed tools per model + explicit dispatch | Logit masking via state machines | 75+ provider-agnostic tool layer |
| Error recovery | Git checkpoints + progress files | Reasoning trace alerting (30% drop if lost) | Error trace preservation in context | Approval-based execution gates |
| Multi-agent | Agent Teams via MCP | Subagent system for parallel tasks | Sub-agents for context isolation | Primary agents + subagents |
| Permission model | Configurable levels + hooks | Sandbox with filesystem/network boundaries | Sandboxed environment | Plan-first with approval gates |
Cursor: Training the Harness Into the Model
Cursor's approach is unique. Instead of training a generic model and wrapping it in a harness, they trained their Composer model on tool-use trajectories: sequences of actions showing the model how and when to use tools. Each frontier model gets a tailored harness with model-specific instructions and tool definitions, measured against Cursor Bench (internal eval suite) on success rate, tool-calling ability, and user adoption.
A critical discovery: dropping reasoning traces from GPT-5-Codex caused a 30% performance drop. The harness must preserve reasoning continuity across multi-turn interactions. Cursor implemented alerting to catch when reasoning traces are accidentally discarded, preventing models from losing track of subgoals.
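The guard against this failure mode can live in the harness itself: before each model call, compare the reasoning items in the outgoing message list against the previous turn and alert if any were stripped. A sketch under assumed message shapes (the `type: "reasoning"` field and the `check_reasoning_continuity` helper are illustrative, not Cursor's actual implementation):

```python
def check_reasoning_continuity(prev_messages: list, next_messages: list) -> dict:
    """Alert if reasoning traces present in the previous turn were dropped."""
    prev_reasoning = [m for m in prev_messages if m.get("type") == "reasoning"]
    kept = [m for m in next_messages if m.get("type") == "reasoning"]
    dropped = len(prev_reasoning) - len(kept)
    return {"dropped": max(dropped, 0), "alert": dropped > 0}

history = [{"type": "reasoning", "content": "plan: refactor auth"},
           {"type": "tool_call", "name": "edit_file"}]
trimmed = [{"type": "tool_call", "name": "edit_file"}]  # trace accidentally stripped
status = check_reasoning_continuity(history, trimmed)
```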
Manus: Five Harnesses in Six Months
Manus chose in-context learning over fine-tuning, betting that context engineering would iterate faster (hours vs. weeks) while keeping the system model-agnostic. Their production insights:
- Optimize for KV-cache hit rate. Agent workloads have ~100:1 prefill-to-decode ratios. Cache efficiency is the single most important production metric. Cached tokens can be 10x cheaper than uncached (Claude Sonnet pricing).
- Mask tools, do not remove them. Dynamically modifying tool definitions invalidates the KV-cache and confuses the model about prior actions. Use logit masking and state machines to constrain action selection instead.
- Use the filesystem as extended context. Large observations (web pages, PDFs) overflow the context window. Offload to sandbox storage while preserving restorable references.
- Recite objectives into the end of context. A todo.md file updated throughout execution keeps goals in the model's recent attention span. Without this, agents drift after ~50 tool calls.
- Preserve error traces. Do not clean up failed attempts. Keeping error messages in context helps the model avoid repeating mistakes.
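The mask-don't-remove insight is concrete enough to sketch. Every tool stays registered (so the serialized prompt prefix, and therefore the KV-cache, never changes), while a state machine masks the logits of disallowed tools to negative infinity before sampling. The state names and `mask_logits` helper below are hypothetical, not Manus's actual code:

```python
import math

# State machine: which tools are selectable in each agent state (assumed names)
STATE_TOOLS = {
    "planning":  {"update_todo", "read_file"},
    "executing": {"read_file", "edit_file", "run_tests"},
}

def mask_logits(logits: dict, state: str) -> dict:
    """All tools stay defined (stable KV-cache); disallowed ones get -inf logits."""
    allowed = STATE_TOOLS[state]
    return {tool: (score if tool in allowed else -math.inf)
            for tool, score in logits.items()}

logits = {"update_todo": 1.2, "edit_file": 2.0, "run_tests": 0.5, "read_file": 0.1}
masked = mask_logits(logits, "planning")
chosen = max(masked, key=masked.get)
```

Note that `edit_file` had the highest raw score, but in the planning state it can never be sampled; the model is steered without its tool definitions ever changing.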
Test-First & Spec-Driven Workflows
Two complementary practices have emerged as the most effective ways to work with coding agents. Both shift work before the agent writes code.
Test-First Development (Simon Willison)
Simon Willison's red/green pattern: write failing tests, then let the agent make them pass. The agent sees the test failure output, diagnoses the issue, writes code, and runs the tests again. This tight loop produces better code with less human intervention than open-ended "implement this feature" prompts.
Test-first workflow with a coding agent

```python
# Step 1: Human writes the failing test
def test_token_refresh_handles_expiry():
    """When refresh token is expired, redirect to /login"""
    expired_token = create_expired_refresh_token()
    response = client.post("/api/refresh", token=expired_token)
    assert response.status_code == 302
    assert response.headers["Location"] == "/login"

# Step 2: Run tests — they fail (red)
# FAILED test_token_refresh_handles_expiry - AssertionError

# Step 3: Tell the agent: "Make this test pass"
# Agent reads test, understands the contract, implements:
async def refresh_handler(request):
    try:
        new_token = await refresh(request.token)
        return JsonResponse({"token": new_token})
    except RefreshTokenExpired:
        return RedirectResponse("/login", status_code=302)

# Step 4: Tests pass (green). Commit.
```

Spec-Driven Development (Addy Osmani)
Addy Osmani's workflow starts with a spec.md containing requirements, architecture decisions, and testing strategy. He calls this "waterfall in 15 minutes": rapid structured planning that prevents the agent from going off the rails. Key principles:
- One function, one feature at a time. LLMs produce a "jumbled mess" when asked for too much simultaneously.
- Commit after each chunk. Commits are save points for rollback.
- Quality gates at every step. Linters, type checkers, and test suites run after each implementation chunk.
- Cross-check with multiple models. If one model gets stuck, try another. Each has different strengths.
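The "quality gates at every step" principle can be sketched as a small pipeline the harness runs after each implementation chunk, stopping at the first failure so the agent knows exactly which gate to fix. The gates below are stubs; in a real harness each would shell out to a linter, type checker, or test runner:

```python
def run_quality_gates(chunk_id: str, gates: dict) -> dict:
    """Run each gate after an implementation chunk; stop at the first failure."""
    for name, gate in gates.items():
        if not gate():
            return {"chunk": chunk_id, "passed": False, "failed_gate": name}
    return {"chunk": chunk_id, "passed": True, "failed_gate": None}

# Stub gates; real ones would invoke e.g. a linter, a type checker, and pytest.
gates = {"lint": lambda: True, "typecheck": lambda: True, "tests": lambda: False}
result = run_quality_gates("auth-refresh", gates)
```

Only when the result reports `passed: True` does the harness commit the chunk, which is what makes each commit a reliable rollback point.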
Kiro from AWS formalizes spec-driven development into an IDE. The agent generates user stories with acceptance criteria, a technical design document, and a task list before writing code. It is the first tool to make specification a first-class part of the agent workflow rather than something the human writes separately.
Both patterns solve the same problem
Test-first and spec-driven development both solve the intent specification problem from the IMPACT framework. Vague prompts produce vague code. A failing test is an unambiguous specification of desired behavior. A detailed spec.md is an unambiguous specification of desired architecture. Both give the agent a clear definition of done.
Context Engineering in the Harness
Context engineering is not a separate discipline from agent engineering. It is the core competency within agent engineering. Every harness decision is a context engineering decision: what tokens enter the window, when, and in what order.
Anthropic defines it as "thinking in context: considering the holistic state available to the LLM at any given time and what potential behaviors that state might yield." The key enemy is context rot: as token count increases, the model's ability to accurately recall information from that context decreases, because attention must track n² pairwise relationships among n tokens.
Five Harness-Level Context Strategies
CLAUDE.md / .cursorrules
Project-level context files loaded into every session. The minimum viable context for the entire project. The #1 move. See the full guide.
Just-in-Time Retrieval
Maintain lightweight references (file paths, URLs) and load data on demand. Never preload everything. Claude Code's MCP tool search achieves 95% context reduction.
Compaction
Summarize conversation history near context limits. Anthropic recommends combining with git commits and progress files so the agent can reconstruct state after compaction.
Sub-Agent Isolation
Each sub-agent gets its own context window. The main agent stays clean for orchestration. Specialized agents handle focused tasks with exactly the context they need.
Manus's fifth strategy is the most production-hardened: use the filesystem as extended context. When an observation is too large for the context window (a full web page, a PDF, a large file), write it to the sandbox filesystem and keep only a reference (file path, URL) in context. The model can read it back on demand. This prevents information loss from aggressive compaction while keeping the active context lean.
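A minimal sketch of that offloading decision, assuming a size threshold and helper names (`MAX_INLINE_CHARS`, `to_context`) chosen for illustration:

```python
import os
import tempfile

MAX_INLINE_CHARS = 2000  # assumed threshold; real harnesses tune this per model

def to_context(observation: str, label: str) -> dict:
    """Inline small observations; offload large ones, keeping a restorable path."""
    if len(observation) <= MAX_INLINE_CHARS:
        return {"type": "inline", "content": observation}
    path = os.path.join(tempfile.gettempdir(), f"{label}.txt")
    with open(path, "w") as f:
        f.write(observation)
    # Only the lightweight reference enters the context window
    return {"type": "reference", "path": path, "chars": len(observation)}

small = to_context("HTTP 200 OK", "probe")
big = to_context("x" * 50_000, "webpage")
```

The model sees the path and size, decides whether it needs the content, and reads it back with a file tool on demand; nothing is irreversibly summarized away.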
For a deep dive on context engineering techniques, see our complete context engineering guide.
Multi-Agent Orchestration
Multi-agent orchestration is a harness engineering problem, not a model problem. The orchestrator coordinates specialized agents in parallel, each with dedicated context, then synthesizes results. In February 2026, every major tool shipped multi-agent support in the same two-week window.
| Tool | Architecture | Parallelism |
|---|---|---|
| Claude Code | Agent Teams via MCP | Specialized roles, message passing |
| Cursor | Subagent system | Discrete parallel subtasks from main agent |
| Windsurf | Cascade via git worktrees | 5 agents on 5 bugs simultaneously |
| Grok Build | Direct parallelism | 8 agents working simultaneously |
| Codex CLI | Agents SDK + worktrees | Parallel tasks across isolated branches |
| Devin | Parallel sandboxed sessions | Each Devin in its own cloud IDE |
| Cline CLI 2.0 | Parallel terminal agents | BYOM multi-agent for open source |
The fundamental benefit is context isolation. A single agent trying to refactor three modules simultaneously pollutes its context with details from all three. Three parallel agents, each focused on one module, maintain clean context for their specific task. The orchestrator only needs high-level status, not file-level details.
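The shape of that isolation can be sketched in a few lines: each sub-agent is seeded only with its own task, and the orchestrator keeps just the status summaries, never the sub-agents' working context. The `run_subagent` and `orchestrate` helpers are illustrative stand-ins for a real agent loop:

```python
def run_subagent(task: str) -> dict:
    """Each sub-agent gets a fresh context seeded only with its own task."""
    context = [f"task: {task}"]       # isolated — no sibling tasks, no shared files
    context.append(f"done: {task}")   # stand-in for the real read/plan/act loop
    return {"task": task, "status": "done", "context_len": len(context)}

def orchestrate(tasks: list) -> list:
    """Orchestrator retains only high-level status, never file-level detail."""
    return [{"task": r["task"], "status": r["status"]}
            for r in (run_subagent(t) for t in tasks)]

summary = orchestrate(["refactor auth", "refactor billing", "refactor search"])
```

In production these sub-agents run in parallel (worktrees, sandboxes, or separate sessions); the structure is the same, only the scheduling differs.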
For more on how different tools implement multi-agent, see our AI coding agent comparison and subagent architecture guide.
Error Recovery & Permission Models
The harness defines what happens when things go wrong. Reliable agents do not just retry. They have a hierarchy of recovery strategies:
Error recovery hierarchy

```
# Level 1: Retry with context
# Agent sees the error, adjusts approach, tries again
# Most errors resolve here (wrong file path, syntax error)

# Level 2: Rollback to checkpoint
# Agent reverts to last known good state via git
git reset --soft HEAD~1
# Re-attempt with different strategy

# Level 3: Decompose the task
# Break the failing task into smaller subtasks
# Delegate to sub-agents with focused context

# Level 4: Escalate to human
# Agent writes a clear summary of what it tried,
# what failed, and what it needs from the human
harness.escalate({
    attempted: ["approach_a", "approach_b"],
    errors: [error_trace_a, error_trace_b],
    suggested_next: "Need REDIS_URL env var to proceed"
})
```

Permission Models
The authority component from IMPACT manifests as the permission model. Every tool makes a different tradeoff between safety and speed:
- Cursor: Sandbox-aware harness with explicit filesystem paths, network access controls, and permission elevation requests when the agent needs to step outside boundaries.
- Claude Code: Configurable permission levels from full approval to dangerously skip permissions. Hooks enable custom automation on tool calls.
- Cline: Explicit approval for every file change. Safe but slower.
- Devin: Full sandboxed cloud environment. The agent can do anything within the sandbox. Nothing escapes without explicit PR submission.
swyx identifies the key tradeoff: "stutter-step agents get old fast." Requiring approval for every action frustrates developers and kills the flow state that makes agents valuable. But skipping all approvals risks unintended modifications. The best harnesses provide granular control: auto-approve reads and tests, require approval for writes to specific paths, block destructive operations entirely.
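That granular middle ground can be expressed as a small policy table the harness consults before every tool call. A sketch with glob-style patterns (the `verb:target` action format and `decide` helper are assumptions, not any tool's actual config schema):

```python
import fnmatch

# Assumed policy format: "verb:target" patterns, most restrictive verdict wins
POLICY = {
    "auto_approve": ["read:*", "run:pytest*"],
    "require_approval": ["write:src/*"],
    "block": ["run:rm -rf*", "write:.env"],
}

def decide(action: str) -> str:
    """Check block rules first, then approval rules, then auto-approve rules."""
    for verdict in ("block", "require_approval", "auto_approve"):
        if any(fnmatch.fnmatch(action, pat) for pat in POLICY[verdict]):
            return verdict
    return "require_approval"  # safe default for anything unmatched

verdicts = [decide("read:src/app.py"),   # reads flow freely
            decide("write:src/app.py"),  # writes pause for a human
            decide("write:.env")]        # secrets are never touched
```

Checking `block` before `auto_approve` matters: a destructive action should lose even if a broad auto-approve pattern also matches it.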
The Apply Layer
Every harness faces the same final bottleneck: applying edits to files. The LLM generates an edit intent, but merging that intent into existing code is where things break. Diffs fail when context shifts. Search-and-replace misses when code has moved. Full-file rewrites waste tokens and risk overwriting concurrent changes.
Cursor invested in training a custom model specifically for this step. Manus optimizes for KV-cache efficiency to make repeated applies fast. Claude Code uses Morph's Fast Apply model for deterministic merges.
The apply step needs exactly three pieces of context: the original file, the edit intent, and the update snippet. Too little and the merge fails. Too much and the model gets confused. Morph Fast Apply is purpose-built for this: instruction + code + update in, merged file out at over 10,500 tokens per second. OpenAI-compatible API, so it drops into any agent pipeline.
Morph Fast Apply integration

```typescript
import { OpenAI } from 'openai';

const morph = new OpenAI({
  apiKey: process.env.MORPH_API_KEY,
  baseURL: 'https://api.morphllm.com/v1'
});

// The harness sends exactly 3 pieces of context:
// 1. The instruction (what to change)
// 2. The original code (what exists)
// 3. The update snippet (the LLM's edit)
const result = await morph.chat.completions.create({
  model: 'morph-v3-fast',
  messages: [{
    role: 'user',
    content: `<instruction>Add retry with exponential backoff</instruction>
<code>${originalFile}</code>
<update>${agentEditSnippet}</update>`
  }],
  stream: true
});
// Returns: complete merged file, deterministically
```

By isolating the merge in a specialized model, the coding agent's primary context window stays clean for planning and reasoning. The apply layer handles reliability. The reasoning model handles intelligence. Clean separation.
For more on how the apply layer works, see our Fast Apply deep dive and edit format comparison.
Frequently Asked Questions
What is agent engineering?
Agent engineering is building reliable systems around language models. It encompasses the harness (tool dispatch, context management, error recovery, permission models), the agent loop (read, plan, act, observe, repeat), and the workflow practices that make agents work in production. swyx coined the term and published the IMPACT framework: Intent, Memory, Planning, Authority, Control Flow, Tools.
What is the IMPACT framework?
Six essential components of AI agents, created by swyx at the 2025 AI Engineer Summit: Intent (goals verified via evals), Memory (coherence through skill libraries), Planning (editable multi-step plans), Authority (trust and permission models), Control Flow (LLM-driven execution paths), Tools (RAG, sandboxes, browser automation). It corrects overly simplified definitions that leave out planning, memory, and authority.
What is an agent harness?
The non-LLM infrastructure wrapping a model to manage long-running tasks. It handles tool dispatch, context management, permissions, error recovery, and state. Phil Schmid's analogy: model = CPU, context window = RAM, harness = operating system, agent = application. Meta paid ~$2B for Manus's harness. Manus rebuilt it five times in six months, each time improving reliability.
What is harness engineering?
A discipline defined by Martin Fowler and Birgitta Boeckeler at ThoughtWorks. Three components: context engineering (curating what the model sees), architectural constraints (guardrails via linters and structural tests), and entropy management (periodic cleanup of inconsistencies). It emerged as a distinct discipline in 2026.
What is test-first development with coding agents?
Simon Willison's pattern: write failing tests (red), let the agent make them pass (green). The agent sees test failure output, diagnoses issues, writes code, and runs tests again. This creates a tight feedback loop that produces better code than open-ended prompts. Addy Osmani's spec-driven approach complements this: detailed specification before any code is written.
How do multi-agent systems work for coding?
An orchestrator decomposes tasks and delegates to specialized agents, each with isolated context windows. Claude Code Agent Teams, Cursor subagents, Windsurf's 5 parallel Cascade agents, and Grok Build's 8 parallel agents all implement this. The key benefit is context isolation: each agent focuses on its task without pollution from other work.
The Apply Layer for Your Agent Harness
Every agent harness needs a reliable apply step. Morph Fast Apply merges LLM edits deterministically at 10,500+ tokens per second. OpenAI-compatible API. Drop it into your harness.