Andrej Karpathy coined vibe coding in February 2025. A year later, he replaced it. Agentic engineering: you orchestrate agents that write code. You own the architecture, the tests, and the reviews. Same shift that took "hacking" to "software engineering."
What Is Agentic Engineering?
Agentic engineering is building software by orchestrating AI coding agents. You define the architecture, decompose work into agent-sized tasks, and review everything. The agents handle implementation: reading files, running commands, writing tests, committing code, and iterating on failures.
Karpathy's framing: "Agentic because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight. Engineering because there is an art and science and expertise to it."
The defining feature of the tools behind it: they can both generate and execute code. Claude Code, OpenAI Codex, Cursor, and Devin do not suggest code. They run it, test it, observe the results, and iterate. That loop is what makes them agents rather than autocomplete.
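That loop can be sketched in miniature. This is a toy, not any tool's actual control flow: `propose_fix` stands in for the model call and is rigged to succeed on the third try, so the generate-execute-observe structure is runnable as written.

```shell
# Toy generate-execute-observe loop. propose_fix stands in for the model
# call; here it succeeds on the third attempt so the structure is runnable.
attempts=0
propose_fix() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]   # simulate: the first two proposals fail verification
}

until propose_fix; do
  echo "attempt $attempts failed; feeding test output back to the agent"
done
echo "verification green after $attempts attempts"
```

The point is the structure: the agent's output is executed, the result is observed, and the failure is fed back in, until verification passes or a budget runs out.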
Agentic engineering vs. agent engineering
Agent engineering is about building the harness around an LLM: tool dispatch, context management, error recovery. Agentic engineering is about using those agents to build software. One builds agents. The other builds with agents.
From Vibe Coding to Agentic Engineering
Vibe coding was the consumer phase of AI-assisted development. Karpathy coined the term in February 2025 to describe "fully giving in to the vibes" and accepting AI-generated code without reading it. Adoption was immediate: surveys reported 92 percent of US developers using AI coding tools daily, and Collins Dictionary made "vibe coding" its Word of the Year.
The problems showed up fast. One study found that 45 percent of AI-generated code contains security vulnerabilities (arXiv:2505.19443). Code duplication increased 48 percent while refactoring activity dropped 60 percent. Daniel Stenberg shut down cURL's bug bounty under a flood of AI-generated spam reports.
The GLM-5 paper titled "From Vibe Coding to Agentic Engineering" formalized the transition as a paradigm shift: from intuitive, human-in-the-loop prompting to goal-driven agents capable of planning, executing, testing, and iterating with minimal human intervention.
| Aspect | Vibe Coding | Agentic Engineering |
|---|---|---|
| Human role | Describe what you want | Architect, decompose, review |
| Code review | None or minimal | Every diff reviewed |
| Testing | Run it and see | Test-first, agent iterates until green |
| Agent autonomy | Single-turn generation | Multi-step: plan, code, test, commit |
| Best for | Prototypes, MVPs, learning | Production systems, team codebases |
| Failure mode | Technical debt, security holes | Coordination overhead, review bottleneck |
Simon Willison drew the clearest line: if an LLM wrote the code but you reviewed it, tested it, and can explain how it works, that is software development. Vibe coding means accepting code you do not understand. Agentic engineering formalizes that distinction into a repeatable practice.
Conductor to Orchestrator: The Role Evolution
O'Reilly and Addy Osmani frame the engineer's role evolution in two stages. Understanding which stage you are in determines which tools, workflows, and skills matter most.
| Aspect | Conductor | Orchestrator |
|---|---|---|
| Agents | One agent, one task | Multiple agents in parallel |
| Synchronicity | Real-time, interactive | Asynchronous delegation |
| Human effort | Continuous throughout | Front-loaded (specs) + back-loaded (review) |
| Artifacts | Ephemeral; limited docs | Persistent; tracked in version control |
| Output | Line-by-line suggestions | Completed pull requests |
The Conductor Model
You work closely with a single AI agent on a specific task, like a conductor guiding a soloist. You remain in the loop at each step: steering the agent's behavior, tweaking prompts, intervening when needed, iterating in real time. This is how most developers work with AI today: synchronous, interactive sessions in an IDE or CLI.
The Orchestrator Model
You oversee multiple AI agents working in parallel on different parts of a project. You set high-level goals, define tasks, and let specialized agents carry out implementation independently. Results arrive as completed pull requests. Some early adopters already delegate 10 or more PRs daily to AI agents. Anthropic projects that agents could generate 80 to 90 percent of code, with humans providing the remaining critical architectural guidance.
Most teams are still conductors
The orchestrator model is where the field is heading, but it is not where most teams are today. Start as a conductor: one agent, one task, tight feedback loops. Build trust in the workflow before scaling to multi-agent orchestration. The practices that make conductors effective (context files, test-first, checkpoint discipline) are the same ones orchestrators need.
Core Practices of Agentic Engineering
Four practices define the discipline. Each solves a specific failure mode of unstructured AI coding.
Context Engineering
Curate what the agent sees. CLAUDE.md, AGENTS.md, and .cursorrules files encode your project's architecture, conventions, and constraints. Short, focused, loaded progressively.
Task Decomposition
Break work into agent-sized pieces. One feature, one function, one module per session. Too broad: agent loses coherence. Too narrow: coordination overhead dominates.
Verification Loops
The agent tests its own work. Write failing tests first, let the agent iterate until green. The test suite is the spec. The agent is the implementer. You are the reviewer.
Checkpoint Discipline
Commit working states before every significant change. If the agent fails, roll back instead of fixing forward. Starting fresh has a higher success rate than correcting mistakes.
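Rolling back is an ordinary git move. A minimal, runnable sketch in a throwaway repository (all file names and commit messages are hypothetical):

```shell
# Checkpoint discipline in miniature: commit a known-good state, let a
# change go wrong, then roll back instead of fixing forward.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email "agent@example.com"
git config user.name "agent"

echo "stable implementation" > feature.ts
git add -A
git commit -qm "checkpoint: working state before agent task"

echo "broken by agent session" > feature.ts   # the risky change fails
git checkout -- feature.ts                    # roll back to the checkpoint
cat feature.ts                                # prints "stable implementation"
```

For multi-file damage, `git reset --hard` back to the checkpoint commit does the same thing at repository scope.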
Context Engineering: The Foundation
Context engineering is the art of filling the agent's context window with the right information. If the agent does not understand your project's architecture, testing conventions, or business constraints, no amount of prompt engineering fixes it.
The primary artifact is a project-level instruction file. CLAUDE.md for Claude Code, AGENTS.md for the cross-tool standard, .cursorrules for Cursor. Anthropic's guidance: keep it under 300 lines, focus on what causes mistakes if missing, and use progressive disclosure so the agent loads details only when needed. Never send an LLM to do a linter's job. Reserve context tokens for architecture, business logic, and project-specific knowledge.
Example CLAUDE.md for an agentic engineering workflow
```markdown
# CLAUDE.md

## Architecture
Next.js 15 App Router. Server components by default.
Client components only for interactivity.

## Commands
bun run dev        # Dev server on port 3000
bun run typecheck  # TypeScript strict mode
bun run test       # Vitest suite

## Conventions
- All mutations through server actions in actions.ts
- Database: Drizzle ORM, schema in src/lib/db/schema.ts
- Auth: Clerk middleware on /dashboard/* routes
- Never commit .env files

## Agent Workflow
- Run typecheck after every code change
- Commit working states before risky changes
- Write tests before implementation
- If stuck for >3 attempts, ask for human input
```

For code retrieval during agent execution, WarpGrep provides semantic search that returns only the files relevant to the current task. Instead of loading a 500-file repository into context, the agent searches for exactly what it needs. Agentic context engineering covers the runtime side of this problem in depth.
Task Decomposition: Agent-Sized Work
The granularity of tasks determines whether agentic engineering works or wastes money. Too broad, and the agent loses context midway. Too narrow, and you spend more time coordinating agents than you save.
Addy Osmani's workflow: start with a spec.md containing requirements, architecture, and testing strategy. Break work into one function or one feature at a time. Commit after each chunk. He calls this "waterfall in 15 minutes" because it compresses planning into rapid structured cycles. The spec is the contract. The agent works against it.
Task decomposition for a feature (rate limiting)
```markdown
# spec.md: Add rate limiting to API routes

## Tasks (each = one agent session)
1. Create Redis rate limiter module
   - src/lib/rate-limiter.ts
   - Sliding window algorithm
   - Tests: src/lib/__tests__/rate-limiter.test.ts
2. Add rate limit middleware
   - src/middleware/rate-limit.ts
   - Read limits from DB per API key tier
   - Tests: middleware integration tests
3. Wire into API routes
   - Update src/app/api/chat/route.ts
   - Update src/app/api/usage/route.ts
   - Tests: e2e rate limit behavior
4. Add rate limit headers to responses
   - X-RateLimit-Remaining, X-RateLimit-Reset
   - Tests: header presence assertions

# Each task: self-contained, testable, one commit
```

Kiro from AWS formalizes this in an IDE: the agent generates user stories with acceptance criteria, a technical design document, and a task list before writing any code. This is task decomposition as a first-class workflow step, not an afterthought.
The parallelism question
Tasks 1 and 4 above are independent and can run in parallel on separate agents. Tasks 2 and 3 depend on Task 1. Good decomposition identifies these dependencies upfront. Over-parallelization causes merge conflicts and wasted work. Anthropic learned this during the C compiler project: 16 agents hitting the same bug overwrote each other's fixes.
Verification Loops: Tests as Specifications
Testing is the single biggest differentiator between agentic engineering and vibe coding. With a solid test suite, an AI agent can iterate in a loop until tests pass. This turns an unreliable generator into a reliable system.
Simon Willison's pattern is the simplest form: write failing tests (red), then let the agent make them pass (green). The developer writes the specification as tests. The agent writes the implementation. The tests are the contract.
Test-first agentic workflow
```typescript
// Step 1: Human writes the test (the specification)
describe("rate limiter", () => {
  it("allows requests under the limit", async () => {
    const limiter = createRateLimiter({ max: 10, window: "1m" });
    const result = await limiter.check("user-123");
    expect(result.allowed).toBe(true);
    expect(result.remaining).toBe(9);
  });

  it("blocks requests over the limit", async () => {
    const limiter = createRateLimiter({ max: 2, window: "1m" });
    await limiter.check("user-123");
    await limiter.check("user-123");
    const result = await limiter.check("user-123");
    expect(result.allowed).toBe(false);
  });
});

// Step 2: Agent implements until tests pass
// Step 3: Human reviews the implementation
// Step 4: Commit if review passes
```

CodeScene found that teams targeting a Code Health score of 9.5 or higher see 2 to 3x productivity gains from agentic coding. AI needs healthy code to reduce defect risk. It will happily modify spaghetti code and make it worse. High code quality is not a casualty of agent-assisted development. It is a prerequisite for it.
Beyond unit tests, Claude Code best practices include running typecheck, lint, and build after every change. These are automated verification loops that catch drift before it compounds.
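That chain is easy to express as a small shell helper. The real commands (typecheck, lint, build) are project-specific, so this runnable sketch uses `true` and `false` as stand-ins for passing and failing checks:

```shell
# Run each verification step in order; stop at the first failure so the
# agent gets one clear signal to iterate on.
verify() {
  for step in "$@"; do
    $step || { echo "FAILED: $step"; return 1; }
  done
  echo "all checks green"
}

verify true true                             # prints "all checks green"
verify true false || echo "agent iterates"   # prints "FAILED: false" then "agent iterates"
```

Wire the same idea into a pre-commit hook or the agent's workflow instructions, and drift gets caught on every change rather than at review time.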
Tools and Platforms for Agentic Engineering
Each major tool implements the core patterns differently. They all share the same loop: read state, plan action, execute with tools, observe results, iterate.
| Tool | Architecture | Multi-Agent | Key Feature |
|---|---|---|---|
| Claude Code | Initializer + coding agent | Agent Teams (peer communication) | Git-native checkpoints, auto-compaction |
| OpenAI Codex | Sandboxed container, no internet | Parallel task runners | Cloud + CLI + mobile delegation |
| Cursor | Custom Composer model + harness | Background Agents | Real-time dashboard, concurrent tasks |
| Devin | Multi-model swarm | Native swarm (Planner/Coder/Critic) | Full VM with browser access |
| Conductor | Multi-agent Git worktrees | Dashboard for agent status | Isolated workspaces per agent |
| GitHub Copilot Agent | Issue-to-PR automation | Ephemeral environments | Assigns tasks via GitHub Issues |
| Google Jules | Cloud VM per task | Async execution with approval | Clones repos, presents plan first |
| Kiro (AWS) | Spec-driven IDE agent | Single-agent with structure | Auto-generates specs before coding |
Claude Code Agent Teams deserve special attention. Unlike subagents, which run within a single session and can only report back, Agent Teams members communicate directly with each other. One session acts as team lead. Teammates work independently, each in its own context window, and share discoveries mid-task.
Conductor by Melty Labs manages multiple Claude agents in isolated Git worktrees, displaying agent status on a dashboard. Claude Squad is an open-source terminal multiplexer for running Claude instances in parallel panes.
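The worktree pattern these tools build on is plain git. A runnable sketch in a throwaway repository (branch and task names are hypothetical):

```shell
# Each agent gets an isolated working copy of the same repository, so
# parallel sessions never touch each other's files.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "demo"
git commit -q --allow-empty -m "initial"

git worktree add -b task/rate-limiter "$repo-agent-a"   # Agent A's workspace
git worktree add -b task/rl-headers "$repo-agent-b"     # Agent B's workspace
git worktree list                                       # main repo + two worktrees

# When a task lands: merge its branch, then remove the worktree.
# git merge task/rate-limiter && git worktree remove "$repo-agent-a"
```

Each worktree has its own checkout and branch but shares the underlying object store, which is what makes per-agent workspaces cheap.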
All these tools benefit from fast context management. Morph Compact compresses agent context to the minimum viable token set at 10,500+ tokens per second. WarpGrep provides surgical code retrieval so agents load only what they need.
Multi-Agent Coordination
Anthropic's 2026 Agentic Coding Trends Report identifies multi-agent coordination as a defining trend: organizations are moving from single agents to specialized agent groups working in parallel under an orchestrator.
Anthropic reported that their multi-agent system (Claude Opus as lead, Claude Sonnet subagents) outperformed single-agent Claude Opus by 90.2 percent on research evaluations. The gain comes from context isolation: each subagent maintains only the context relevant to its task, avoiding the context rot that degrades long single-agent sessions.
Parallel Research
Multiple agents investigate different aspects of a problem simultaneously, then share and challenge findings. Faster than serial exploration.
Module Ownership
Each agent owns a separate module: frontend, backend, tests. Each in its own context window. No stepping on each other's changes.
Cross-Layer Changes
Changes spanning frontend, backend, and database, each handled by a different agent. Orchestrator manages the integration points.
The sequential conductor approach: implement backend with AI, then frontend, then tests. Each step involves active human participation. The parallel orchestrator approach: delegate backend to Agent A, frontend to Agent B, tests to Agent C. Human reviews the resulting PRs and integrates. The orchestrator approach is faster but requires clearer task boundaries and better test coverage to catch integration issues.
Case Study: The 100K-Line C Compiler
In early 2026, Anthropic researcher Nicholas Carlini tasked 16 Claude agents with writing a dependency-free C compiler in Rust, capable of compiling the Linux kernel. The result demonstrates both the power and the limits of agentic engineering.
The compiler builds Linux 6.9 on x86, ARM, and RISC-V. It compiles QEMU, FFmpeg, SQLite, Postgres, and Redis. It passes 99 percent of the GCC torture test suite. A clean-room implementation developed without internet access, it depends only on the Rust standard library.
The project exposed a critical coordination failure. When agents started compiling the kernel (one monolithic task, not parallelizable), all 16 hit the same bug, produced the same fix, and overwrote each other's changes. More agents made it worse, not better. Human intervention restructured the task into parallelizable pieces. Good task decomposition is the difference between 16 agents multiplying your output and 16 agents multiplying your problems.
What the $20K does not include
The $20,000 figure covers compute alone. It excludes the human engineering time: designing the workflow, decomposing compiler components into parallelizable tasks, managing inter-agent communication, reviewing output, and resolving integration conflicts. That orchestration was essential. This is agentic engineering in its truest form: the human is the architect, the agents are the builders.
When Agentic Engineering Fails
Agentic engineering is not always the right approach. Understanding when it fails prevents wasted time and money.
Coordination Overhead Exceeds Benefits
Every additional agent adds messages, latency, and drift risk. If the task is simple enough for one focused developer, coordinating three agents costs more than it saves. Research found that even strong models with 98 percent per-agent success rates degrade to 90 percent or lower at the system level, because success probabilities multiply across every unchecked hop.
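The arithmetic is worth seeing. Assuming an illustrative 98 percent per-hop success rate and independent hops, system-level success is 0.98 raised to the number of hops:

```shell
# Per-hop success of 98% compounds: system success = 0.98^n for n
# independent, unchecked hops.
for hops in 1 3 5 10; do
  awk -v n="$hops" 'BEGIN { printf "%2d hops: %.1f%% system success\n", n, 100 * 0.98 ^ n }'
done
# prints 98.0%, 94.1%, 90.4%, 81.7%
```

By five unchecked hops the system is already at roughly 90 percent, which is why verification checkpoints between hops matter more as agent chains grow.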
Review Bottleneck
Individual output surged 98 percent in high-adoption teams, but PR review time increased by up to 91 percent. The time saved writing code was consumed by reviewing more code. If your team cannot review fast enough, agents create a backlog instead of delivering value.
Cost Explosion at Scale
A three-agent workflow costing $5 to $50 in demos can generate $18,000 to $90,000 monthly in production. Token multiplication, retry loops, and expanded context windows compound costs. Pilot accuracy of 95 to 98 percent drops to 80 to 87 percent under real-world pressure.
Non-Parallelizable Work
The C compiler kernel compilation failure illustrates this. When multiple agents hit the same sequential bottleneck, parallelism hurts. Debugging a single race condition, understanding deeply interconnected legacy code, making a judgment call about product direction: these tasks do not benefit from agent parallelism. Sometimes a single developer typing code is faster than coordinating three agents.
The Skill Shift
The transition is misunderstood as "from writing code to writing prompts." That describes vibe coding. The actual shift: from writing code to designing agent systems.
The valuable skills in 2026: knowing what to build (product thinking), how to decompose it into parallelizable tasks (systems design), how to set up verification (testing strategy), and how to evaluate agent output (code review). These are the same skills that made senior engineers valuable before AI, applied to a new medium.
IBM frames the paradox: AI-assisted development rewards rigorous engineering practices more than traditional coding. Better specs yield better agent output. Stronger tests enable confident delegation. Cleaner architecture reduces hallucinations. The developers who benefit most from agentic engineering are the ones who were already good at engineering.
| Dimension | Traditional Engineering | Agentic Engineering |
|---|---|---|
| Primary output | Code | Reviewed, tested agent output |
| Time allocation | 80% implementation, 20% design | 20% design, 20% context setup, 60% review |
| Bottleneck | Typing speed, domain knowledge | Task decomposition, review throughput |
| 10x factor | Writes code faster | Decomposes work better, reviews faster |
Anthropic's report identifies eight trends reshaping software engineering in 2026, including the shift toward agent supervision, agents going end-to-end on multi-day tasks, and non-engineers using agents to build their own tools. Sales, legal, marketing, and operations teams building small automations themselves. The bottleneck shifts from technical ability to clarity of thought.
Getting Started With Agentic Engineering
You do not need multi-agent teams or expensive infrastructure. A single agent plus these practices will get you most of the benefit.
1. Write a Context File
Create a CLAUDE.md or AGENTS.md for your project. Include architecture, commands, conventions, and common mistakes. Keep it under 300 lines.
2. Decompose Your Next Feature
Break the next thing you need to build into 3 to 5 independent tasks. Each task should be completable in one agent session. Define acceptance criteria.
3. Write Tests First
For each task, write failing tests that define the expected behavior. Hand the tests to the agent. Let it iterate until they pass. Review the implementation.
4. Commit After Each Task
Every completed task gets a commit. If the next task fails, roll back to a known-good state. Never try to fix forward from a broken agent session.
5. Review Like a Teammate
Every piece of agent output gets the same review rigor as a human teammate's PR. Check for security, correctness, maintainability, and spec alignment.
6. Scale Gradually
Once comfortable with single-agent workflows, try running two agents on independent tasks. Then explore Agent Teams for coordinated multi-agent work.
Recommended tools for starting out
Claude Code is the most straightforward entry point: terminal-based, uses git natively, runs verification commands automatically. Pair with WarpGrep for codebase search and Morph Compact for context compression on long sessions.
Frequently Asked Questions
What is agentic engineering?
Agentic engineering is the discipline of building software by orchestrating AI coding agents. You define the architecture, decompose tasks, set up verification loops, and review all output. The agents handle implementation, testing, and iteration. Andrej Karpathy coined the term as the professional successor to vibe coding.
How is it different from vibe coding?
Vibe coding means accepting AI-generated code without reading it. Agentic engineering means orchestrating agents with specs, tests, reviews, and checkpoint discipline. Vibe coding works for prototypes. Agentic engineering works for production.
What is the conductor-to-orchestrator model?
The conductor model: one developer, one agent, real-time collaboration. The orchestrator model: one developer, multiple agents, asynchronous delegation. Most teams start as conductors and evolve into orchestrators as their workflows and trust mature. O'Reilly covers this evolution in depth.
What tools support agentic engineering?
Claude Code (with Agent Teams), OpenAI Codex, Cursor (Background Agents), Devin, Conductor (Melty Labs), Claude Squad, GitHub Copilot Coding Agent, Google Jules, and Kiro (AWS). WarpGrep handles semantic code search. Morph Compact compresses context for long-running sessions.
What is context engineering?
Context engineering is curating the information an agent sees. It includes CLAUDE.md files with project architecture, .claudeignore rules to exclude noise, and progressive disclosure so agents load details only when relevant.
What did Karpathy say about agentic engineering?
Karpathy said: "Agentic because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight. Engineering because there is an art and science and expertise to it. It's something you can learn and become better at, with its own depth of a different kind."
When should I not use agentic engineering?
When coordination costs exceed the benefits. Simple tasks where a focused developer is faster. Non-parallelizable debugging. Judgment calls about product direction. When the review bottleneck creates a backlog instead of delivering value.
How do I get started?
Write a CLAUDE.md for your project. Break your next feature into 3 to 5 tasks. Write tests first. Let the agent implement. Commit after each task. Review everything. Start with a single agent. Scale to subagents and Agent Teams once the single-agent workflow is solid.
Infrastructure for Agentic Engineering
WarpGrep gives your agents surgical codebase search. Morph Compact keeps context clean on long sessions. Both run fast enough for inline use during autonomous agent execution.