Spec-Driven Development: How Kiro and AI Agents Build From Specs

Vibe coding breaks past 500 lines. Spec-driven development fixes this: write the spec, not the code. Kiro, GitHub Spec Kit, and CLAUDE.md files turn agents from autocomplete into engineers.

March 4, 2026 · 1 min read

Spec-Driven Development in 30 Seconds

  • The problem: Vibe coding (freeform prompts to AI agents) produces unreliable output past ~500 lines because agents guess at unstated requirements
  • The fix: Write a formal spec first. Define behavior, constraints, and acceptance criteria. Then let the agent implement against that contract.
  • The tools: Kiro (AWS) enforces a 3-phase spec workflow. GitHub Spec Kit (71K stars) works with 20+ agents. CLAUDE.md files act as lightweight specs.
  • The tradeoff: More upfront work, fewer downstream surprises. Not worth it for quick scripts. Essential for anything production-grade.
  • 71K: GitHub Spec Kit stars in 6 months
  • 3: Phases in Kiro's spec workflow
  • 20+: AI agents supported by Spec Kit
  • 25%: Average feature time reduction with SDD (McKinsey)

Spec-driven development emerged because vibe coding hit a wall. Typing "build me an auth system" into Claude or Copilot works for demos. It does not work for systems that need to handle edge cases, satisfy compliance requirements, or survive a code review. The fix is not better models. It is better inputs.

What Is Spec-Driven Development

Spec-driven development treats specifications as first-class artifacts. Instead of writing code and documenting it later, you write a detailed spec first, then use AI agents to generate an implementation that satisfies the spec. The spec becomes the source of truth. Code is the output.

The arXiv paper "Spec-Driven Development: From Code to Contract in the Age of AI Coding Assistants" (February 2026) organizes SDD into three levels of rigor:

Spec-First

Write the spec before coding. Use it to guide AI output. The spec may be discarded after implementation. This is where most teams start. CLAUDE.md files and .cursorrules fall into this category.

Spec-Anchored

The spec lives alongside code and evolves with it. Changes to code trigger spec updates, and spec changes regenerate tasks. Kiro and GitHub Spec Kit target this level. The spec never goes stale.

Spec-as-Source

The spec IS the source code. Humans edit specs, never generated code. Tessl (private beta) targets this level with 'DO NOT EDIT' comments on generated files. The most radical approach, closest to formal methods.

Most teams operate at level 1. The tools are pushing everyone toward level 2. Level 3 remains experimental.

The Key Insight

"Writing code is easy. Deciding what code should exist is the real work." Spec-driven development forces you to make those decisions explicitly, in writing, before the agent touches a file. This is why it works: not because it constrains the agent, but because it constrains you.

Why Vibe Coding Breaks Past 500 Lines

Vibe coding works when the entire system fits in a single prompt. A landing page. A CLI script. A data pipeline with one input and one output. Past roughly 500 lines, three failure modes emerge consistently:

Implicit Requirements

You ask for 'user registration.' The agent picks bcrypt or argon2 at random. It decides whether email verification is required. It chooses a session strategy. None of these decisions were in your prompt, so the agent guesses. Each guess is a potential bug.

Context Window Pollution

As the codebase grows, the agent reads more files to understand the system. By file 40, it has forgotten the patterns from file 1. Cognition measured this: coding agents spend 60% of their time on search and context gathering, not writing code.

Drift Without Anchor

Without a spec to check against, each agent session drifts slightly from the intended architecture. After 10 sessions, the codebase looks like it was written by 10 developers who never talked to each other. Because it was.

The Kiro team at AWS built their entire product around this observation. As one developer wrote after switching to spec-driven workflows: "Most of the time, I was not fixing bugs. I was fixing unclear decisions I never made explicitly."

Specs solve all three problems. Implicit requirements become explicit acceptance criteria. Context pollution shrinks because the agent reads the spec instead of the entire codebase. Drift stops because every session checks against the same contract.

The Three-Phase Workflow

Every major SDD tool converges on the same three phases, with minor naming differences. Kiro calls them Requirements, Design, Tasks. GitHub Spec Kit calls them Specify, Plan, Tasks. The structure is the same.

Phase 1: Requirements

Define what the system should do in behavioral terms. Not "use PostgreSQL" but "as an admin, I can query user activity for the last 90 days and receive results in under 2 seconds." Kiro uses EARS notation (Easy Approach to Requirements Syntax) for acceptance criteria. GitHub Spec Kit uses a freeform spec.md.

Kiro: requirements.md example

# Feature: Organization Registration

## User Story
As an entity owner, I want to register my organization
so that employees can access the platform under a shared account.

## Acceptance Criteria
- GIVEN a valid business email, WHEN the owner submits registration,
  THEN a new organization is created with the owner as admin
- GIVEN an email domain already registered, WHEN the owner submits,
  THEN the system rejects with "Domain already in use"
- GIVEN registration succeeds, THEN a verification email is sent
  within 30 seconds
- GIVEN no verification within 72 hours, THEN the organization
  is soft-deleted

## Out of Scope
- SSO configuration (separate spec)
- Billing setup (handled by onboarding flow)
- Custom domain mapping
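The criteria above use GIVEN/WHEN/THEN phrasing. Kiro's EARS notation expresses the same criteria as SHALL statements; here is a rough translation of the first two (phrasing is illustrative, not Kiro output):

```text
WHEN the owner submits registration with a valid business email,
the system SHALL create a new organization with the owner as admin.

IF the email domain is already registered,
THEN the system SHALL reject the submission with "Domain already in use".
```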

Phase 2: Design

Translate requirements into technical architecture. Data schemas, API endpoints, sequence diagrams, error handling strategy. The agent generates this from Phase 1, but you review and edit it. This is where bad architectural decisions get caught before they become 2,000 lines of code.

Kiro: design.md excerpt

## Data Model

Organization {
  id: UUID (primary key)
  name: string (3-100 chars)
  domain: string (unique, validated)
  owner_id: UUID (references users.id)
  status: enum [pending_verification, active, suspended, deleted]
  created_at: timestamp
  verified_at: timestamp | null
}

## API Endpoints

POST /api/organizations
  - Body: { name, domain, owner_email }
  - Returns: 201 with organization object
  - Errors: 409 if domain exists, 422 if validation fails

## Sequence: Registration Flow

Owner -> API: POST /api/organizations
API -> DB: Check domain uniqueness
API -> DB: Insert organization (status: pending_verification)
API -> Email Service: Send verification link
API -> Owner: 201 Created
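The endpoint's decision logic can be sketched as a pure function, which makes the 409/422/201 branches directly testable. Everything here (the `registerOrganization` name, the domain regex, the in-memory `existingDomains` set standing in for the DB check) is an illustrative assumption, not generated Kiro output:

```typescript
// Sketch of the POST /api/organizations decision logic from design.md.
// Persistence and email sending are omitted; existingDomains stands in
// for the "Check domain uniqueness" DB query.

type RegistrationInput = { name: string; domain: string; owner_email: string };

type RegistrationResult =
  | { status: 201; organization: { name: string; domain: string; state: "pending_verification" } }
  | { status: 409 | 422; error: string };

function registerOrganization(
  input: RegistrationInput,
  existingDomains: Set<string>
): RegistrationResult {
  // 422: validation rules from the data model (name 3-100 chars, plausible domain)
  if (input.name.length < 3 || input.name.length > 100) {
    return { status: 422, error: "name must be 3-100 characters" };
  }
  if (!/^[a-z0-9.-]+\.[a-z]{2,}$/i.test(input.domain)) {
    return { status: 422, error: "invalid domain" };
  }
  // 409: domain uniqueness check
  if (existingDomains.has(input.domain.toLowerCase())) {
    return { status: 409, error: "Domain already in use" };
  }
  // 201: organization created in pending_verification, matching the sequence diagram
  return {
    status: 201,
    organization: { name: input.name, domain: input.domain, state: "pending_verification" },
  };
}
```

Keeping the branch logic pure means each acceptance criterion maps to one assertion, with the HTTP and DB layers wired around it separately.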

Phase 3: Tasks

Break the design into discrete, trackable implementation steps. Each task maps back to an acceptance criterion. Each task is small enough for an agent to complete in one session without context pollution.

Kiro: tasks.md excerpt

## Tasks

- [x] Task 1: Create Organization database migration
  - Add organizations table with all fields from design.md
  - Add unique index on domain column
  - Linked to: Acceptance Criteria 1

- [x] Task 2: Implement POST /api/organizations endpoint
  - Validate input, check domain uniqueness, insert record
  - Return 409 for duplicate domains, 422 for invalid input
  - Linked to: Acceptance Criteria 1, 2

- [ ] Task 3: Email verification service
  - Send verification link with signed token
  - Token expires after 72 hours
  - Linked to: Acceptance Criteria 3

- [ ] Task 4: Soft-delete cron job
  - Run every hour, delete unverified orgs older than 72h
  - Linked to: Acceptance Criteria 4
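Task 3's "signed token" could be implemented in a few lines with an HMAC. This sketch is one possible design, not the spec's mandated one: the payload is `<orgId>.<expiryMillis>`, signed with HMAC-SHA256, and the 72-hour TTL comes from acceptance criterion 4:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// 72 hours, per the spec's verification window
const TOKEN_TTL_MS = 72 * 60 * 60 * 1000;

function sign(payload: string, secret: string): string {
  return createHmac("sha256", secret).update(payload).digest("hex");
}

// Token format: "<orgId>.<expiryMillis>.<hexSignature>"
function issueToken(orgId: string, secret: string, now = Date.now()): string {
  const payload = `${orgId}.${now + TOKEN_TTL_MS}`;
  return `${payload}.${sign(payload, secret)}`;
}

// Returns the orgId on success, null if tampered with or expired.
function verifyToken(token: string, secret: string, now = Date.now()): string | null {
  const lastDot = token.lastIndexOf(".");
  const payload = token.slice(0, lastDot);
  const mac = token.slice(lastDot + 1);
  const expected = sign(payload, secret);
  if (mac.length !== expected.length ||
      !timingSafeEqual(Buffer.from(mac), Buffer.from(expected))) {
    return null; // signature mismatch
  }
  const [orgId, expiry] = payload.split(".");
  if (now > Number(expiry)) return null; // past 72h: org is eligible for soft-delete
  return orgId;
}
```

Note how the constant maps directly back to the acceptance criterion, which is exactly the traceability the tasks file is tracking.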

Why Three Phases, Not One

You could dump all this into a single document. The three-phase structure exists because each phase is a review checkpoint. Requirements catch scope errors. Design catches architectural errors. Tasks catch sequencing errors. Skipping phases means catching those errors in code review, where they cost 5-10x more to fix.

Spec-Driven Development Tools Compared

Three tools dominate the SDD space in March 2026. They target different levels of rigor and different workflows.

| Aspect | Kiro (AWS) | GitHub Spec Kit | CLAUDE.md / Rules Files |
| --- | --- | --- | --- |
| SDD Level | Spec-first | Spec-first (aspiring anchored) | Spec-first (lightweight) |
| Workflow | Requirements -> Design -> Tasks | Specify -> Plan -> Tasks (cyclical) | Single file, no phases |
| File Structure | 3 markdown files per spec | Multiple files per spec + constitution.md | 1 file at repo root |
| Agent Support | Kiro IDE only | 20+ agents (Copilot, Claude, Cursor, Gemini...) | Any agent that reads project files |
| Open Source | No (proprietary IDE) | Yes (MIT, 71K stars) | N/A (convention, not a tool) |
| Best For | Greenfield features with clear requirements | Teams using multiple AI agents | Quick constraints on existing projects |
| Overhead | Medium (3 docs per feature) | High (many markdown files) | Low (1 file, few minutes) |

Kiro

AWS's bet on spec-driven development. A VS Code fork that makes specs the default workflow, not an afterthought. The three-phase process (requirements.md, design.md, tasks.md) is enforced by the IDE. Agent hooks trigger automated actions on file events, so the agent can auto-update specs when code changes.

The limitation: Kiro applies the same three-phase process to everything, including bug fixes. Martin Fowler's team called it a "sledgehammer for a nut" on small problems. Also locked to the Kiro IDE, which means your spec workflow cannot follow you to Cursor or Claude Code.

GitHub Spec Kit

The open-source alternative. 71,000 GitHub stars, 2,300+ forks, MIT license. Works with 20+ agents through slash commands (/specify, /plan, /tasks). Uses a constitution.md file for immutable project principles, similar to how CLAUDE.md works.
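One cycle through those slash commands looks roughly like this inside the agent; the command names are Spec Kit's, but the arguments and generated-file notes are illustrative, not verbatim tool output:

```text
/specify Organization registration with domain uniqueness and email verification
  -> drafts spec.md for human review

/plan PostgreSQL for storage, transactional email provider for verification
  -> drafts plan.md, checked against the principles in constitution.md

/tasks
  -> breaks the plan into an ordered, checkable task list
```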

The limitation: verbose. The Marmelab team tested Spec Kit on a simple date-display feature and got 8 files and 1,300+ lines of spec text. The Martin Fowler team noted that reviewing the generated markdown was more tedious than reviewing code directly. The overhead pays off on large features but is disproportionate for small changes.

CLAUDE.md and Rules Files

The lightweight option. A single markdown file at the repo root that tells the agent how to behave: coding conventions, architecture constraints, testing requirements. Not a full SDD workflow, but it captures the most important benefit (explicit constraints) with minimal overhead.

CLAUDE.md for Claude Code, .cursorrules for Cursor, copilot-instructions.md for GitHub Copilot. Same concept, different filenames. These files are spec-first at level 1: they guide the agent but do not enforce a multi-phase workflow.

Which Tool to Pick

  • Starting from scratch, greenfield project: Kiro or Spec Kit for full workflow
  • Existing codebase, adding features: CLAUDE.md for constraints + Spec Kit for complex features
  • Small team, moving fast: CLAUDE.md only. Add Spec Kit when features start exceeding 500 lines.

SDD vs TDD vs BDD

Spec-driven development does not replace test-driven development. It operates at a different layer. Understanding where each practice fits prevents the "is this just waterfall?" confusion.

| Dimension | TDD | BDD | SDD |
| --- | --- | --- | --- |
| Scope | Unit correctness | Cross-functional behavior | Architectural contracts |
| Primary artifact | Test files | Feature files (Gherkin) | Spec documents (markdown) |
| Validation | Automated test runner | Automated acceptance tests | Agent checks spec before implementing |
| When it runs | During/after coding | Before/during coding | Before coding |
| AI governance | None | None | Constitutional constraints on agent behavior |
| Question answered | Does this function return the right value? | Does this feature behave correctly? | Should this function exist? What should it accept? |

The practical relationship: SDD generates the design that BDD features validate, which TDD tests implement. A spec says "registration must verify email within 30 seconds." A BDD feature file turns that into a Gherkin scenario. A TDD test validates the email service returns within 30 seconds.
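Concretely, the 30-second criterion flows through the three layers like this (the file names and exact phrasing are illustrative):

```text
# SDD: spec.md (the contract)
- GIVEN registration succeeds, THEN a verification email is sent within 30 seconds

# BDD: registration.feature (the behavior, in Gherkin)
Scenario: Verification email after registration
  Given a valid registration has been submitted
  When the organization is created
  Then a verification email is sent within 30 seconds

# TDD: email-service test (the unit correctness)
test: "verification email is enqueued immediately on organization creation"
```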

You do not need all three. Most teams using SDD with AI agents skip BDD entirely because the spec already captures behavior. TDD remains valuable for unit-level correctness that specs do not cover.

When SDD Works (and When It Does Not)

SDD Pays Off

  • Features over 500 lines. The spec investment amortizes across implementation, testing, and future maintenance.
  • Multi-developer or multi-agent work. Specs align everyone (human and AI) on the same contract before they start writing code.
  • Compliance-sensitive systems. The EU AI Act (enforcement starts August 2026) requires documentation of AI-generated code decisions. Specs provide that audit trail.
  • Legacy modernization. Writing a spec for the target state before touching legacy code prevents agents from replicating old patterns.
  • Complex refactors. When 15 files need coordinated changes, a spec prevents the agent from forgetting the plan by file 8.

SDD Adds Drag

  • Exploratory prototyping. If you do not know what you are building yet, writing a spec for it is premature. Vibe code first, spec later.
  • Bug fixes under 50 lines. The fix is often obvious from the error message. A three-phase workflow for a null check is overhead.
  • Hackathons and spikes. Time-boxed experiments prioritize speed over reliability. Specs slow you down when speed is the point.
  • Solo scripts and automation. One-off scripts that run once and get deleted do not need acceptance criteria.
  • Small teams with frequent pivots. If requirements change daily, maintaining specs becomes a second job.

The Marmelab critique crystallizes the risk: SDD can become "systematic bureaucracy" if applied uniformly. Their test showed a simple date feature generating 1,300+ lines of spec. The fix is not abandoning specs but calibrating rigor to task size. A CLAUDE.md file for small tasks. A full Kiro workflow for features that will live in production for years.

How to Start with Spec-Driven Development

Do not adopt all three levels at once. Start at level 1 (spec-first) and move to level 2 (spec-anchored) only when you have evidence it helps.

Week 1: Add a CLAUDE.md (or equivalent)

Create a markdown file at your repo root that tells your AI agent the project's constraints: tech stack, coding conventions, testing requirements, architecture rules. This takes 15 minutes and immediately reduces agent drift.

Minimal CLAUDE.md for spec-first development

# Project: Acme Dashboard

## Stack
- Next.js 15, TypeScript strict, Tailwind CSS
- PostgreSQL with Drizzle ORM
- Auth: Clerk

## Rules
- Server components by default. Client components only for interactivity.
- All database access through Drizzle ORM, never raw SQL.
- Every API route must validate input with Zod.
- No console.log in production code. Use structured logging.

## Testing
- Unit tests: Vitest
- E2E tests: Playwright
- Run `bun test` before committing.

## Architecture
- /app/(auth)/ routes require Clerk middleware
- /app/api/ routes return JSON, never HTML
- Shared types in /types/, never co-located

Week 2-3: Try Spec Kit on One Feature

Pick a feature that is clearly defined and will take more than a day to implement. Run through the Specify, Plan, Tasks cycle. Measure: did the spec catch design errors before coding? Did the agent stay closer to plan?

Week 4+: Evaluate and Calibrate

If the overhead paid off, expand to more features. If it felt like busywork, drop back to CLAUDE.md and only use full specs for features over 1,000 lines. The right level of SDD depends on team size, feature complexity, and how often requirements change.

The Implementation Gap: Specs Are Only Half the Problem

A good spec tells the agent what to build. It does not guarantee the agent builds it well. The Martin Fowler team observed that "despite comprehensive workflows and larger context windows, agents frequently ignore instructions or over-interpret them, creating duplicates or diverging from specifications."

This is the implementation gap. Specs reduce the problem space from "anything the agent can imagine" to "what the spec defines." But within that reduced space, the agent still needs to produce correct, efficient code. Two bottlenecks remain:

Code Search Speed

Even with a spec, agents need to find existing code patterns to stay consistent. Cognition measured that agents spend 60% of their time searching. Subagent architectures with dedicated context windows solve this by giving each task its own focused search space.

Edit Accuracy

Agents that generate entire files waste tokens and introduce errors. A spec says 'add a field to the Organization model.' The agent should edit 3 lines, not regenerate the entire 200-line schema file. Fast, surgical edits make spec-to-code translation reliable.
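Against the earlier design.md schema, a surgical edit for a spec change like "add a billing email to the Organization model" should look like a one-line diff, not a regenerated file (`billing_email` is a hypothetical field used only for illustration):

```diff
 Organization {
   id: UUID (primary key)
   name: string (3-100 chars)
   domain: string (unique, validated)
+  billing_email: string | null
   owner_id: UUID (references users.id)
   status: enum [pending_verification, active, suspended, deleted]
   created_at: timestamp
   verified_at: timestamp | null
 }
```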

This is where the underlying model infrastructure matters. Spec-driven agents still need fast, accurate code edits to translate specs into working code. A spec that takes 3 phases to write and 45 seconds to implement is productive. A spec that takes 3 phases to write and 10 minutes of hallucinated edits to implement is not.

Specs need fast, accurate edits

Morph's fast-apply model handles the implementation side of spec-driven development. 10,500 tok/s edit speed. Agents write specs, Morph applies them. The spec-to-code pipeline only works when the last mile is reliable.

FAQ

What is spec-driven development?

A methodology where you write formal specifications before code. The spec defines behavior, constraints, and acceptance criteria. AI agents implement against that contract instead of interpreting freeform prompts. GitHub Spec Kit (71K stars), Kiro (AWS), and CLAUDE.md files are the primary tools.

How does Kiro use spec-driven development?

Kiro enforces a three-phase workflow: Requirements (user stories with EARS-notation acceptance criteria), Design (architecture, schemas, sequence diagrams), and Tasks (discrete implementation steps with completion tracking). Each phase builds on the previous one. Kiro automatically includes all spec files in the agent's context.

Is spec-driven development just waterfall?

The criticism is fair but incomplete. Waterfall assumed specs were final before coding. SDD treats specs as living documents. Kiro allows updating specs mid-development. Spec Kit's workflow is explicitly cyclical. The real difference: in waterfall, humans implemented specs. In SDD, AI agents implement specs, and the cost of regenerating implementation from an updated spec is near zero.

What is the difference between SDD and TDD?

TDD validates unit correctness (does this function return the right value?). SDD defines architectural contracts (should this function exist? what should it accept?). They operate at different layers and complement each other. SDD generates the design. TDD tests the implementation.

What tools support spec-driven development?

Kiro (AWS, proprietary IDE), GitHub Spec Kit (open source, 71K stars, 20+ agent support), CLAUDE.md / .cursorrules / copilot-instructions.md (lightweight spec-first files), and Tessl (private beta, spec-as-source). Beads integrates with Spec Kit as a persistent memory layer for long-running work.

When should I NOT use spec-driven development?

Exploratory prototyping, bug fixes under 50 lines, hackathons, one-off scripts, and environments where requirements change daily. SDD adds overhead that only pays off when features are large enough and stable enough to justify the upfront investment. The threshold is roughly 500 lines of implementation.