Agentic Engineering: The Post-Vibe-Coding Paradigm

Andrej Karpathy coined agentic engineering as the successor to vibe coding. You orchestrate agents that write code. You own the architecture, tests, and reviews. This guide covers the conductor-to-orchestrator evolution, core practices, tools, and the data behind the shift.

March 4, 2026 · 8 min read

Andrej Karpathy coined vibe coding in February 2025. A year later, he replaced it. Agentic engineering: you orchestrate agents that write code. You own the architecture, the tests, and the reviews. Same shift that took "hacking" to "software engineering."

- 84% of developers use or plan to use AI coding tools (Stack Overflow 2025)
- 3% highly trust AI-generated code
- 90.2% multi-agent gain over single-agent (Anthropic)
- 100K lines of Rust from 16 agents for $20K (Anthropic C compiler)

What Is Agentic Engineering?

Agentic engineering is building software by orchestrating AI coding agents. You define the architecture, decompose work into agent-sized tasks, and review everything. The agents handle implementation: reading files, running commands, writing tests, committing code, and iterating on failures.

Karpathy's framing: "Agentic because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight. Engineering because there is an art and science and expertise to it."

The defining feature of the tools behind it: they can both generate and execute code. Claude Code, OpenAI Codex, Cursor, and Devin do not merely suggest code. They run it, test it, observe the results, and iterate. That loop is what makes them agents rather than autocomplete.

Agentic engineering vs. agent engineering

Agent engineering is about building the harness around an LLM: tool dispatch, context management, error recovery. Agentic engineering is about using those agents to build software. One builds agents. The other builds with agents.

From Vibe Coding to Agentic Engineering

Vibe coding was the consumer phase of AI-assisted development. Karpathy coined it in February 2025 to describe "fully giving in to the vibes" and accepting AI-generated code without reading it. 92 percent of US developers started using AI coding tools daily. Collins Dictionary made it Word of the Year.

The problems showed up fast. 45 percent of AI-generated code contains security vulnerabilities (arXiv:2505.19443). Code duplication increased 48 percent. Refactoring activity dropped 60 percent. Daniel Stenberg shut down cURL's bug bounty because of AI-generated spam reports.

The GLM-5 paper titled "From Vibe Coding to Agentic Engineering" formalized the transition as a paradigm shift: from intuitive, human-in-the-loop prompting to goal-driven agents capable of planning, executing, testing, and iterating with minimal human intervention.

| Aspect | Vibe Coding | Agentic Engineering |
| --- | --- | --- |
| Human role | Describe what you want | Architect, decompose, review |
| Code review | None or minimal | Every diff reviewed |
| Testing | Run it and see | Test-first, agent iterates until green |
| Agent autonomy | Single-turn generation | Multi-step: plan, code, test, commit |
| Best for | Prototypes, MVPs, learning | Production systems, team codebases |
| Failure mode | Technical debt, security holes | Coordination overhead, review bottleneck |

Simon Willison drew the clearest line: if an LLM wrote the code but you reviewed it, tested it, and can explain how it works, that is software development. Vibe coding means accepting code you do not understand. Agentic engineering formalizes that distinction into a repeatable practice.

Conductor to Orchestrator: The Role Evolution

O'Reilly and Addy Osmani frame the engineer's role evolution in two stages. Understanding which stage you are in determines which tools, workflows, and skills matter most.

| Aspect | Conductor | Orchestrator |
| --- | --- | --- |
| Agents | One agent, one task | Multiple agents in parallel |
| Synchronicity | Real-time, interactive | Asynchronous delegation |
| Human effort | Continuous throughout | Front-loaded (specs) + back-loaded (review) |
| Artifacts | Ephemeral; limited docs | Persistent; tracked in version control |
| Output | Line-by-line suggestions | Completed pull requests |

The Conductor Model

You work closely with a single AI agent on a specific task, like a conductor guiding a soloist. You remain in the loop at each step: steering the agent's behavior, tweaking prompts, intervening when needed, iterating in real time. This is how most developers work with AI today. Synchronous, interactive sessions in an IDE or CLI.

The Orchestrator Model

You oversee multiple AI agents working in parallel on different parts of a project. You set high-level goals, define tasks, and let specialized agents carry out implementation independently. Results arrive as completed pull requests. Some early adopters already delegate 10 or more PRs daily to AI agents. Anthropic projects that agents could generate 80 to 90 percent of code, with humans providing the remaining critical architectural guidance.

Most teams are still conductors

The orchestrator model is where the field is heading, but it is not where most teams are today. Start as a conductor: one agent, one task, tight feedback loops. Build trust in the workflow before scaling to multi-agent orchestration. The practices that make conductors effective (context files, test-first, checkpoint discipline) are the same ones orchestrators need.

Core Practices of Agentic Engineering

Four practices define the discipline. Each solves a specific failure mode of unstructured AI coding.

Context Engineering

Curate what the agent sees. CLAUDE.md, AGENTS.md, and .cursorrules files encode your project's architecture, conventions, and constraints. Short, focused, loaded progressively.

Task Decomposition

Break work into agent-sized pieces. One feature, one function, one module per session. Too broad: agent loses coherence. Too narrow: coordination overhead dominates.

Verification Loops

The agent tests its own work. Write failing tests first, let the agent iterate until green. The test suite is the spec. The agent is the implementer. You are the reviewer.

Checkpoint Discipline

Commit working states before every significant change. If the agent fails, roll back instead of fixing forward. Starting fresh has a higher success rate than correcting mistakes.
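A minimal sketch of that rollback loop in plain git (the throwaway repo, file name, and commit message here are illustrative stand-ins for your project):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "agent@example.com"
git config user.name "agent"

# Checkpoint: commit the known-good state before a risky agent session
echo "working" > app.txt
git add -A && git commit -qm "checkpoint: known-good state"
good=$(git rev-parse HEAD)

# Simulate a failed agent session that mangles the file
echo "broken agent output" > app.txt

# Roll back instead of fixing forward
git reset --hard -q "$good"
cat app.txt    # prints: working
```

The same pattern works mid-session: one commit per completed task, one `git reset --hard` per failed one.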

Context Engineering: The Foundation

Context engineering is the art of filling the agent's context window with the right information. If the agent does not understand your project's architecture, testing conventions, or business constraints, no amount of prompt engineering fixes it.

The primary artifact is a project-level instruction file. CLAUDE.md for Claude Code, AGENTS.md for the cross-tool standard, .cursorrules for Cursor. Anthropic's guidance: keep it under 300 lines, focus on what causes mistakes if missing, and use progressive disclosure so the agent loads details only when needed. Never send an LLM to do a linter's job. Reserve context tokens for architecture, business logic, and project-specific knowledge.

Example CLAUDE.md for an agentic engineering workflow

# CLAUDE.md

## Architecture
Next.js 15 App Router. Server components by default.
Client components only for interactivity.

## Commands
bun run dev          # Dev server on port 3000
bun run typecheck    # TypeScript strict mode
bun run test         # Vitest suite

## Conventions
- All mutations through server actions in actions.ts
- Database: Drizzle ORM, schema in src/lib/db/schema.ts
- Auth: Clerk middleware on /dashboard/* routes
- Never commit .env files

## Agent Workflow
- Run typecheck after every code change
- Commit working states before risky changes
- Write tests before implementation
- If stuck for >3 attempts, ask for human input

For code retrieval during agent execution, WarpGrep provides semantic search that returns only the files relevant to the current task. Instead of loading a 500-file repository into context, the agent searches for exactly what it needs. Agentic context engineering covers the runtime side of this problem in depth.

Task Decomposition: Agent-Sized Work

The granularity of tasks determines whether agentic engineering works or wastes money. Too broad, and the agent loses context midway. Too narrow, and you spend more time coordinating agents than you save.

Addy Osmani's workflow: start with a spec.md containing requirements, architecture, and testing strategy. Break work into one function or one feature at a time. Commit after each chunk. He calls this "waterfall in 15 minutes" because it compresses planning into rapid structured cycles. The spec is the contract. The agent works against it.

Task decomposition for a feature (rate limiting)

# spec.md: Add rate limiting to API routes

## Tasks (each = one agent session)

1. Create Redis rate limiter module
   - src/lib/rate-limiter.ts
   - Sliding window algorithm
   - Tests: src/lib/__tests__/rate-limiter.test.ts

2. Add rate limit middleware
   - src/middleware/rate-limit.ts
   - Read limits from DB per API key tier
   - Tests: middleware integration tests

3. Wire into API routes
   - Update src/app/api/chat/route.ts
   - Update src/app/api/usage/route.ts
   - Tests: e2e rate limit behavior

4. Add rate limit headers to responses
   - X-RateLimit-Remaining, X-RateLimit-Reset
   - Tests: header presence assertions

# Each task: self-contained, testable, one commit

Kiro from AWS formalizes this in an IDE: the agent generates user stories with acceptance criteria, a technical design document, and a task list before writing any code. This is task decomposition as a first-class workflow step, not an afterthought.

The parallelism question

Tasks 1 and 4 above are independent and can run in parallel on separate agents. Tasks 2 and 3 depend on Task 1. Good decomposition identifies these dependencies upfront. Over-parallelization causes merge conflicts and wasted work. Anthropic learned this during the C compiler project: 16 agents hitting the same bug overwrote each other's fixes.
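That dependency analysis can be made mechanical. A small sketch (task numbers mirror the spec.md above; the dependency edges follow the prose, and the wave grouping is the point, not this particular scheduler):

```typescript
// Group tasks into parallel "waves": a task runs only after every task it
// depends on has completed. Tasks with no pending dependencies share a wave.
type Task = { id: number; deps: number[] };

const tasks: Task[] = [
  { id: 1, deps: [] },    // Redis rate limiter module
  { id: 2, deps: [1] },   // middleware (needs the limiter)
  { id: 3, deps: [1] },   // wire into API routes (needs the limiter)
  { id: 4, deps: [] },    // response headers (independent)
];

function parallelWaves(tasks: Task[]): number[][] {
  const done = new Set<number>();
  const waves: number[][] = [];
  let remaining = [...tasks];
  while (remaining.length > 0) {
    // Everything whose dependencies are all done can run in parallel now.
    const ready = remaining.filter(t => t.deps.every(d => done.has(d)));
    if (ready.length === 0) throw new Error("dependency cycle");
    waves.push(ready.map(t => t.id));
    ready.forEach(t => done.add(t.id));
    remaining = remaining.filter(t => !done.has(t.id));
  }
  return waves;
}

console.log(parallelWaves(tasks)); // [[1, 4], [2, 3]]
```

Two waves: tasks 1 and 4 go to separate agents immediately; tasks 2 and 3 wait for the limiter to land.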

Verification Loops: Tests as Specifications

Testing is the single biggest differentiator between agentic engineering and vibe coding. With a solid test suite, an AI agent can iterate in a loop until tests pass. This turns an unreliable generator into a reliable system.

Simon Willison's pattern is the simplest form: write failing tests (red), then let the agent make them pass (green). The developer writes the specification as tests. The agent writes the implementation. The tests are the contract.

Test-first agentic workflow

// Step 1: Human writes the test (the specification)
describe("rate limiter", () => {
  it("allows requests under the limit", async () => {
    const limiter = createRateLimiter({ max: 10, window: "1m" });
    const result = await limiter.check("user-123");
    expect(result.allowed).toBe(true);
    expect(result.remaining).toBe(9);
  });

  it("blocks requests over the limit", async () => {
    const limiter = createRateLimiter({ max: 2, window: "1m" });
    await limiter.check("user-123");
    await limiter.check("user-123");
    const result = await limiter.check("user-123");
    expect(result.allowed).toBe(false);
  });
});

// Step 2: Agent implements until tests pass
// Step 3: Human reviews the implementation
// Step 4: Commit if review passes

CodeScene found that teams targeting a Code Health score of 9.5 or higher see 2 to 3x productivity gains from agentic coding. AI needs healthy code to reduce defect risk. It will happily modify spaghetti code and make it worse. High code quality is not a casualty of agent-assisted development. It is a prerequisite for it.

Beyond unit tests, Claude Code best practices include running typecheck, lint, and build after every change. These are automated verification loops that catch drift before it compounds.
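One way to make that loop non-optional is a git pre-commit hook, so a broken state can never become a checkpoint. A sketch, assuming the bun scripts from the CLAUDE.md example above; substitute your project's real commands:

```shell
#!/bin/sh
# .git/hooks/pre-commit: refuse to commit if verification fails.
# The bun scripts are assumptions from the CLAUDE.md example.
bun run typecheck || { echo "typecheck failed; commit blocked"; exit 1; }
bun run test      || { echo "tests failed; commit blocked"; exit 1; }
```

Make it executable with `chmod +x .git/hooks/pre-commit`. The agent then cannot record a checkpoint that does not pass verification.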

Tools and Platforms for Agentic Engineering

Each major tool implements the core patterns differently. They all share the same loop: read state, plan action, execute with tools, observe results, iterate.

| Tool | Architecture | Multi-Agent | Key Feature |
| --- | --- | --- | --- |
| Claude Code | Initializer + coding agent | Agent Teams (peer communication) | Git-native checkpoints, auto-compaction |
| OpenAI Codex | Sandboxed container, no internet | Parallel task runners | Cloud + CLI + mobile delegation |
| Cursor | Custom Composer model + harness | Background Agents | Real-time dashboard, concurrent tasks |
| Devin | Multi-model swarm | Native swarm (Planner/Coder/Critic) | Full VM with browser access |
| Conductor | Multi-agent Git worktrees | Dashboard for agent status | Isolated workspaces per agent |
| GitHub Copilot Agent | Issue-to-PR automation | Ephemeral environments | Assigns tasks via GitHub Issues |
| Google Jules | Cloud VM per task | Async execution with approval | Clones repos, presents plan first |
| Kiro (AWS) | Spec-driven IDE agent | Single-agent with structure | Auto-generates specs before coding |

Claude Code Agent Teams deserve special attention. Unlike subagents, which run within a single session and can only report back, Agent Teams members communicate directly with each other. One session acts as team lead. Teammates work independently, each in its own context window, and share discoveries mid-task.

Conductor by Melty Labs manages multiple Claude agents in isolated Git worktrees, displaying agent status on a dashboard. Claude Squad is an open-source terminal multiplexer for running Claude instances in parallel panes.

All these tools benefit from fast context management. Morph Compact compresses agent context to the minimum viable token set at 10,500+ tokens per second. WarpGrep provides surgical code retrieval so agents load only what they need.

Multi-Agent Coordination

Anthropic's 2026 Agentic Coding Trends Report identifies multi-agent coordination as a defining trend: organizations are moving from single agents to specialized agent groups working in parallel under an orchestrator.

Anthropic reported that their multi-agent system (Claude Opus as lead, Claude Sonnet subagents) outperformed single-agent Claude Opus by 90.2 percent on research evaluations. The gain comes from context isolation: each subagent maintains only the context relevant to its task, avoiding the context rot that degrades long single-agent sessions.

Parallel Research

Multiple agents investigate different aspects of a problem simultaneously, then share and challenge findings. Faster than serial exploration.

Module Ownership

Each agent owns a separate module: frontend, backend, tests. Each in its own context window. No stepping on each other's changes.

Cross-Layer Changes

Changes spanning frontend, backend, and database, each handled by a different agent. Orchestrator manages the integration points.

The sequential conductor approach: implement backend with AI, then frontend, then tests. Each step involves active human participation. The parallel orchestrator approach: delegate backend to Agent A, frontend to Agent B, tests to Agent C. Human reviews the resulting PRs and integrates. The orchestrator approach is faster but requires clearer task boundaries and better test coverage to catch integration issues.

Case Study: The 100K-Line C Compiler

In early 2026, Anthropic researcher Nicholas Carlini tasked 16 Claude agents with writing a dependency-free C compiler in Rust, capable of compiling the Linux kernel. The result demonstrates both the power and the limits of agentic engineering.

- 100K lines of Rust produced
- 16 parallel Claude agents
- ~2K Claude Code sessions
- $20K API compute cost

The compiler builds Linux 6.9 on x86, ARM, and RISC-V. It compiles QEMU, FFmpeg, SQLite, Postgres, and Redis. It passes 99 percent of the GCC torture test suite. It is a clean-room implementation: no internet access during development, depending only on the Rust standard library.

The project exposed a critical coordination failure. When agents started compiling the kernel (one monolithic task, not parallelizable), all 16 hit the same bug, produced the same fix, and overwrote each other's changes. More agents made it worse, not better. Human intervention restructured the task into parallelizable pieces. Good task decomposition is the difference between 16 agents multiplying your output and 16 agents multiplying your problems.

What the $20K does not include

The compute cost excludes the human engineering time: designing the workflow, decomposing compiler components into parallelizable tasks, managing inter-agent communication, reviewing output, and resolving integration conflicts. The human orchestration was essential. This is agentic engineering in its truest form: the human is the architect, the agents are the builders.

When Agentic Engineering Fails

Agentic engineering is not always the right approach. Understanding when it fails prevents wasted time and money.

Coordination Overhead Exceeds Benefits

Every additional agent adds messages, latency, and drift risk. If the task is simple enough for one focused developer, coordinating three agents costs more than it saves. Research found that even strong models with 98 percent per-agent success rates degrade to 90 percent or lower at the system level, with each unchecked hop multiplying failure probability.
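The compounding is easy to see with a one-liner. A sketch: 0.98 per hop is the per-agent figure cited above, while the hop counts are illustrative:

```typescript
// End-to-end reliability of a chain of unchecked agent hops is the product
// of per-hop success rates: p^n for n hops at per-hop probability p.
function chainSuccess(perStep: number, steps: number): number {
  return Math.pow(perStep, steps);
}

for (const steps of [1, 3, 5, 10]) {
  console.log(steps, chainSuccess(0.98, steps).toFixed(3));
}
// 5 hops: 0.904, 10 hops: 0.817
```

A 2 percent per-hop failure rate looks negligible until the chain is long, which is why each hop needs its own verification gate.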

Review Bottleneck

Individual output surged 98 percent in high-adoption teams, but PR review time increased by up to 91 percent. The time saved writing code was consumed by reviewing more code. If your team cannot review fast enough, agents create a backlog instead of delivering value.

Cost Explosion at Scale

A three-agent workflow costing $5 to $50 in demos can generate $18,000 to $90,000 monthly in production. Token multiplication, retry loops, and expanded context windows compound costs. Pilot accuracy of 95 to 98 percent drops to 80 to 87 percent under real-world pressure.
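The arithmetic is mundane but worth running before scaling. The volumes below are illustrative assumptions chosen to land inside the article's range, not measured figures:

```typescript
// Back-of-envelope cost scaling: a per-run cost inside the demo range
// becomes the article's monthly range at roughly 100 runs per day.
const runsPerDay = 100;                    // assumption
const daysPerMonth = 30;
const costPerRun = { low: 6, high: 30 };   // USD per workflow run, assumed

const monthly = (c: number) => c * runsPerDay * daysPerMonth;

console.log(monthly(costPerRun.low), monthly(costPerRun.high)); // 18000 90000
```

The multiplier is run volume, so retry loops and expanded context windows (which raise cost per run) compound on top of it.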

Non-Parallelizable Work

The C compiler kernel compilation failure illustrates this. When multiple agents hit the same sequential bottleneck, parallelism hurts. Debugging a single race condition, understanding deeply interconnected legacy code, making a judgment call about product direction: these tasks do not benefit from agent parallelism. Sometimes a single developer typing code is faster than coordinating three agents.

The Skill Shift

The transition is misunderstood as "from writing code to writing prompts." That describes vibe coding. The actual shift: from writing code to designing agent systems.

The valuable skills in 2026: knowing what to build (product thinking), how to decompose it into parallelizable tasks (systems design), how to set up verification (testing strategy), and how to evaluate agent output (code review). These are the same skills that made senior engineers valuable before AI, applied to a new medium.

IBM frames the paradox: AI-assisted development rewards rigorous engineering practices more than traditional coding. Better specs yield better agent output. Stronger tests enable confident delegation. Cleaner architecture reduces hallucinations. The developers who benefit most from agentic engineering are the ones who were already good at engineering.

| Dimension | Traditional Engineering | Agentic Engineering |
| --- | --- | --- |
| Primary output | Code | Reviewed, tested agent output |
| Time allocation | 80% implementation, 20% design | 20% design, 20% context setup, 60% review |
| Bottleneck | Typing speed, domain knowledge | Task decomposition, review throughput |
| 10x factor | Writes code faster | Decomposes work better, reviews faster |

Anthropic's report identifies eight trends reshaping software engineering in 2026, including the shift toward agent supervision, agents going end-to-end on multi-day tasks, and non-engineers using agents to build their own tools: sales, legal, marketing, and operations teams building small automations themselves. The bottleneck shifts from technical ability to clarity of thought.

Getting Started With Agentic Engineering

You do not need multi-agent teams or expensive infrastructure. A single agent plus these practices will get you most of the benefit.

1. Write a Context File

Create a CLAUDE.md or AGENTS.md for your project. Include architecture, commands, conventions, and common mistakes. Keep it under 300 lines.

2. Decompose Your Next Feature

Break the next thing you need to build into 3 to 5 independent tasks. Each task should be completable in one agent session. Define acceptance criteria.

3. Write Tests First

For each task, write failing tests that define the expected behavior. Hand the tests to the agent. Let it iterate until they pass. Review the implementation.

4. Commit After Each Task

Every completed task gets a commit. If the next task fails, roll back to a known-good state. Never try to fix forward from a broken agent session.

5. Review Like a Teammate

Every piece of agent output gets the same review rigor as a human teammate's PR. Check for security, correctness, maintainability, and spec alignment.

6. Scale Gradually

Once comfortable with single-agent workflows, try running two agents on independent tasks. Then explore Agent Teams for coordinated multi-agent work.

Recommended tools for starting out

Claude Code is the most straightforward entry point: terminal-based, uses git natively, runs verification commands automatically. Pair with WarpGrep for codebase search and Morph Compact for context compression on long sessions.

Frequently Asked Questions

What is agentic engineering?

Agentic engineering is the discipline of building software by orchestrating AI coding agents. You define the architecture, decompose tasks, set up verification loops, and review all output. The agents handle implementation, testing, and iteration. Andrej Karpathy coined the term as the professional successor to vibe coding.

How is it different from vibe coding?

Vibe coding means accepting AI-generated code without reading it. Agentic engineering means orchestrating agents with specs, tests, reviews, and checkpoint discipline. Vibe coding works for prototypes. Agentic engineering works for production.

What is the conductor-to-orchestrator model?

The conductor model: one developer, one agent, real-time collaboration. The orchestrator model: one developer, multiple agents, asynchronous delegation. Most teams start as conductors and evolve into orchestrators as their workflows and trust mature. O'Reilly covers this evolution in depth.

What tools support agentic engineering?

Claude Code (with Agent Teams), OpenAI Codex, Cursor (Background Agents), Devin, Conductor (Melty Labs), Claude Squad, GitHub Copilot Coding Agent, Google Jules, and Kiro (AWS). WarpGrep handles semantic code search. Morph Compact compresses context for long-running sessions.

What is context engineering?

Context engineering is curating the information an agent sees. It includes CLAUDE.md files with project architecture, .claudeignore rules to exclude noise, and progressive disclosure so agents load details only when relevant.

What did Karpathy say about agentic engineering?

Karpathy said: "Agentic because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight. Engineering because there is an art and science and expertise to it. It's something you can learn and become better at, with its own depth of a different kind."

When should I not use agentic engineering?

When coordination costs exceed the benefits. Simple tasks where a focused developer is faster. Non-parallelizable debugging. Judgment calls about product direction. When the review bottleneck creates a backlog instead of delivering value.

How do I get started?

Write a CLAUDE.md for your project. Break your next feature into 3 to 5 tasks. Write tests first. Let the agent implement. Commit after each task. Review everything. Start with a single agent. Scale to subagents and Agent Teams once the single-agent workflow is solid.

Infrastructure for Agentic Engineering

WarpGrep gives your agents surgical codebase search. Morph Compact keeps context clean on long sessions. Both run fast enough for inline use during autonomous agent execution.