Summary
Opus 4.6 writes the best code on every benchmark. Sonnet 4.6 writes code that is 97-99% as good for 40% less money and 17% faster output. Haiku 4.5 writes code that is good enough for most automated pipelines at one-fifth the cost and roughly 3x the speed. The right choice depends on task complexity, not on which model is "best."
Coding Benchmarks
| Benchmark | Haiku 4.5 | Sonnet 4.6 | Opus 4.6 |
|---|---|---|---|
| SWE-bench Verified | 73.3% | 79.6% | 80.8% |
| Terminal-Bench 2.0 | 41.0% | 59.1% | 65.4% |
| HumanEval | 92.0% | 96.8% | 97.6% |
| OSWorld-Verified | ~60% | 72.5% | 72.7% |
| Cost/MTok (output) | $5 | $15 | $25 |
| Speed (tok/s) | 95-150 | 52.8 | 45.3 |
Two patterns stand out. First, the Sonnet-Opus gap is small on structured coding benchmarks (1.2 points on SWE-bench) and larger on agentic benchmarks (6.3 points on Terminal-Bench). Second, Haiku is much closer to Sonnet on code generation (HumanEval: 92% vs 96.8%) than on autonomous problem-solving (Terminal-Bench: 41% vs 59.1%). This is why Terminal-Bench matters more than HumanEval for model choice: multi-step agentic work is where the tiers actually separate.
Code Generation Quality
For single-function code generation, all three models produce correct code the vast majority of the time. The quality gap shows up in how they handle edge cases, error handling, and type safety.
| Aspect | Haiku 4.5 | Sonnet 4.6 | Opus 4.6 |
|---|---|---|---|
| Correct on first try | ~85% | ~93% | ~95% |
| Handles edge cases | Sometimes | Usually | Almost always |
| Type safety (TS) | Good | Strong | Strong |
| Error handling | Basic | Thorough | Thorough |
| Code style consistency | Adequate | Good | Good |
Haiku occasionally misses null checks and boundary conditions that Sonnet and Opus handle automatically. In a 100-function test suite, expect Haiku to need manual correction on 10-15 functions, Sonnet on 5-7, and Opus on 3-5.
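The kind of miss described above is easiest to see in a concrete example. The function below is hypothetical and illustrative, not drawn from any benchmark; the two guard clauses at the top are exactly the boundary checks the faster models sometimes omit on a first pass:

```python
def percentile(values, p):
    """Nearest-rank percentile with explicit edge-case handling.

    The guard clauses (empty input, p out of range) are the kind of
    boundary checks a weaker model sometimes forgets to emit.
    """
    if not values:
        raise ValueError("percentile() arg is an empty sequence")
    if not 0 <= p <= 100:
        raise ValueError("p must be in [0, 100]")
    ordered = sorted(values)
    # Nearest-rank index, clamped to the valid range.
    k = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
    return ordered[k]
```

Without the guards, an empty list or `p=150` fails silently or with an opaque `IndexError` deep inside the function, which is precisely the class of bug that shows up as "needs manual correction" in the numbers above.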
Refactoring and Multi-File Edits
This is where the model tiers diverge most. Multi-file refactoring requires holding a dependency graph in context, tracking type propagation, and maintaining consistency across changes. These tasks favor models with larger reasoning budgets.
| Task | Haiku 4.5 | Sonnet 4.6 | Opus 4.6 |
|---|---|---|---|
| Rename type across 5 files | 3/5 correct | 5/5 correct | 5/5 correct |
| Rename type across 15 files | Not recommended | 12/15 correct | 14/15 correct |
| Extract service from monolith | Not recommended | Partial success | Full success |
| Migrate API v1 to v2 (8 endpoints) | 4/8 correct | 7/8 correct | 8/8 correct |
For refactoring tasks touching 5 or fewer files, Sonnet is sufficient. Beyond 10 files, Opus's deeper reasoning starts to show measurable advantages. Haiku should not be used for multi-file refactoring, as it loses track of cross-file dependencies quickly.
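The thresholds above can be sketched as a trivial routing rule. The model labels are placeholders, not real API identifiers:

```python
def model_for_refactor(files_touched: int) -> str:
    """Pick a model tier by refactor scope, per the thresholds above.

    Haiku is deliberately never returned: the section above advises
    against it for multi-file refactoring.
    """
    if files_touched <= 10:
        # Sonnet is sufficient up to ~5 files and still reasonable
        # up to ~10; beyond that, Opus's deeper reasoning pays off.
        return "sonnet"
    return "opus"
```

In practice the cutoff is a tuning knob: teams that see Sonnet drop dependencies in the 6-10 file gray zone can lower it.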
Test Generation
| Metric | Haiku 4.5 | Sonnet 4.6 | Opus 4.6 |
|---|---|---|---|
| Tests generated (avg) | 42 | 48 | 50 |
| Tests passing | 38/42 (90%) | 46/48 (96%) | 49/50 (98%) |
| Edge cases covered | Basic | Good | Thorough |
| Time to generate 50 tests | ~8s | ~18s | ~22s |
| Cost to generate 50 tests | ~$0.05 | ~$0.15 | ~$0.25 |
Sonnet hits the sweet spot for test generation: 96% pass rate, good edge case coverage, and one-third the cost of Opus. Unless you need the absolute most thorough test coverage (security-critical code, financial systems), Sonnet is the right choice for test generation.
Speed vs Accuracy Trade-off
The relationship between speed and accuracy is not linear across the three tiers. Haiku is 3x faster than Opus but only 7.5 points behind on SWE-bench. The per-point cost of improvement increases sharply at the top end.
| Model | SWE-bench Score | Speed (tok/s) | Cost/MTok Out | Cost per SWE-bench Point |
|---|---|---|---|---|
| Haiku 4.5 | 73.3% | 95-150 | $5 | $0.068 |
| Sonnet 4.6 | 79.6% | 52.8 | $15 | $0.188 |
| Opus 4.6 | 80.8% | 45.3 | $25 | $0.309 |
Haiku delivers 73.3 SWE-bench points per $5 of output cost. Opus delivers 80.8 points for $25. Each additional SWE-bench point above Haiku's baseline costs progressively more. The marginal cost of going from 79.6% (Sonnet) to 80.8% (Opus) is $10/MTok for 1.2 additional points.
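The per-point figures in the table are simply output price divided by benchmark score; a quick sanity check:

```python
# (SWE-bench score %, $ per MTok output) from the table above
models = {
    "Haiku 4.5":  (73.3, 5),
    "Sonnet 4.6": (79.6, 15),
    "Opus 4.6":   (80.8, 25),
}

for name, (score, price) in models.items():
    print(f"{name}: ${price / score:.3f}/MTok per SWE-bench point")

# Marginal cost of the Sonnet -> Opus step:
# ($25 - $15) / (80.8 - 79.6) ≈ $8.33/MTok per additional point,
# far above Haiku's $0.068 average.
```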
Decision Matrix
| Your Priority | Best Model | Why |
|---|---|---|
| Lowest cost per task | Haiku 4.5 | 5x cheaper than Opus, adequate for most automated tasks |
| Best quality per dollar | Sonnet 4.6 | 97-99% of Opus quality at 60% of the cost |
| Maximum accuracy | Opus 4.6 | Leads every coding benchmark, best for hard problems |
| Fastest response | Haiku 4.5 | 95-150 tok/s, 1s TTFT |
| Multi-file refactoring | Opus 4.6 | Maintains consistency across 15+ files |
| High-volume pipeline | Haiku 4.5 | Cheapest + fastest for parallel subagent tasks |
| Interactive coding | Sonnet 4.6 | Good speed (52.8 tok/s) + high quality |
| Code review at scale | Haiku 4.5 | 100 PRs/hour at $0.25/review |
FAQ
Which Claude model writes the best code?
Opus 4.6 leads on every coding benchmark: 80.8% SWE-bench Verified, 65.4% Terminal-Bench 2.0, 97.6% HumanEval. But Sonnet 4.6 is within 1.2 points on SWE-bench at 40% lower cost. For routine coding tasks, Sonnet produces equivalent quality. Opus pulls ahead on complex multi-file changes and autonomous terminal workflows.
Is Claude Haiku good enough for writing code?
Yes, for many tasks. Haiku 4.5 scores 73.3% on SWE-bench Verified, matching the previous-generation Sonnet 4. It handles code completion, simple bug fixes, test generation, and documentation well. Where it falls short is multi-file reasoning (41.0% on Terminal-Bench vs Opus's 65.4%) and complex architectural decisions.
How much faster is Haiku than Opus for coding?
Haiku 4.5 runs at 95-150 tokens per second, roughly 3x faster than Opus 4.6's 45.3 tok/s. Time to first token is ~1 second for Haiku vs ~12 seconds for Opus. For code completion and inline suggestions, Haiku's speed makes it the only practical choice.
What is the cost difference for a coding session?
A typical coding session generating 50K output tokens costs $0.25 with Haiku, $0.75 with Sonnet, and $1.25 with Opus (output tokens only). Over 1,000 sessions, that is $250 vs $750 vs $1,250. With prompt caching, input costs drop 90% for all three models.
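The quoted session costs are easy to reproduce; the helper below counts output tokens only, matching the caveat above (input and caching excluded):

```python
def output_cost(output_tokens: int, price_per_mtok: float) -> float:
    """Dollar cost of a session's output tokens only."""
    return output_tokens / 1_000_000 * price_per_mtok

for name, price in [("Haiku", 5), ("Sonnet", 15), ("Opus", 25)]:
    print(f"{name}: ${output_cost(50_000, price):.2f} per 50K-token session")
```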
Should I use different Claude models for different coding tasks?
Yes. The optimal approach is routing by task complexity: Haiku for code completion, reviews, and documentation; Sonnet for feature implementation and bug fixes; Opus for multi-file refactoring and architecture decisions. This approach saves 40-60% compared to using Opus for everything with less than 2% quality loss on aggregate.
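The split described above can be expressed as a simple lookup; the task categories and model labels are illustrative, not a real API:

```python
# Task-to-tier routing table, following the split described above.
ROUTES = {
    "completion":   "haiku",   # inline suggestions, autocompletion
    "review":       "haiku",   # PR review at scale
    "docs":         "haiku",   # docstrings, README updates
    "feature":      "sonnet",  # feature implementation
    "bugfix":       "sonnet",  # targeted fixes
    "refactor":     "opus",    # multi-file refactoring
    "architecture": "opus",    # design decisions
}

def route(task: str) -> str:
    """Route a task to a model tier; default to Sonnet when unsure."""
    return ROUTES.get(task, "sonnet")
```

Defaulting unknown categories to the middle tier is a deliberate choice: it caps the quality loss of a misroute at the Sonnet-Opus gap rather than the Haiku-Opus gap.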
Which Claude model is best for Claude Code?
Claude Code defaults to Opus 4.6 for maximum capability. You can switch to Sonnet 4.6 for faster, cheaper sessions on routine tasks. Haiku 4.5 is used internally as a subagent for file search and code indexing. For most Claude Code users, Sonnet handles 80% of tasks adequately.
How do the models compare on test generation?
All three models generate valid tests. Opus produces the most thorough coverage, especially for edge cases and error paths. Sonnet matches Opus on standard unit and integration tests. Haiku generates correct tests but sometimes misses boundary conditions. For a 50-test suite, expect 49-50 passing with Opus, 46-48 with Sonnet, and 45-47 with Haiku.