
TL;DR: Our code editing model now processes 10,500 tokens/sec per request on a single B200 (up from 4,500). Even with network latency and real-world conditions factored in, most file edits complete in 1-3 seconds.
Here's how we got there and why it matters for AI coding tools.
Reproducible benchmarks: All timing data from single-request, cold-start measurements. No batch tricks or cherry-picked examples.
## Why This Speed Matters (It's Not Just Marketing)
Search and replace requires a separate tool call for each chunk being edited. Multiple edits = multiple round trips.
FastApply handles all edits to a file in a single call—the model describes the changes, and we merge everything at once.
Based on our benchmarks, Morph delivers ~35% faster end-to-end task completion compared to search & replace.
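To see where that difference comes from, here is a back-of-envelope sketch of the round-trip arithmetic. Every number in it (round-trip time, per-chunk generation time, edit count, merged-apply time) is an assumption for illustration, not a measurement:

```typescript
// Back-of-envelope comparison: N separate search & replace calls vs. one merged
// apply call. Every number below is an illustrative assumption, not a benchmark.
const roundTripMs = 100;   // assumed network round trip per tool call
const perChunkGenMs = 400; // assumed generation time per search & replace chunk
const editChunks = 4;      // assumed number of edit chunks in the change
const mergedGenMs = 900;   // assumed generation time to merge all edits at once

const searchReplaceMs = editChunks * (roundTripMs + perChunkGenMs); // 2,000 ms
const fastApplyMs = roundTripMs + mergedGenMs;                      // 1,000 ms

console.log({ searchReplaceMs, fastApplyMs });
```

The per-call overhead is paid once instead of once per chunk, which is where most of the end-to-end savings come from.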
| Model | Tokens/sec | Memory (GB) | Use Case |
|---|---|---|---|
| morph-v3-fast | 10,500+ | 40 | Speed-critical edits |
| morph-v3-large | 5,000 | 80 | Complex architectural changes |
## The Technical Problem
The standard approach is search and replace. It works, but requires a separate tool call for each chunk being edited. When you need multiple edits, the round trips add up.
We built a model that takes original code + partial snippets (with `// ... existing code ...` markers) and merges all edits in one call. Think `git merge`, but for AI-generated code.
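For concreteness, here is a hypothetical example of that input shape: the full original file plus an update snippet, shown as strings. The function and the edit are made up for illustration:

```typescript
// Hypothetical example of the input shape: the full original file, plus an
// update snippet that uses // ... existing code ... markers to elide the
// regions that should stay untouched.
const originalFile = `
export function add(a: number, b: number): number {
  return a + b;
}

export function subtract(a: number, b: number): number {
  return a - b;
}
`;

const updateSnippet = `
// ... existing code ...
export function add(a: number, b: number): number {
  if (!Number.isFinite(a) || !Number.isFinite(b)) {
    throw new Error("add() expects finite numbers");
  }
  return a + b;
}
// ... existing code ...
`;
```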
## How We Hit 10,500 Tok/Sec
### 1. Custom CUDA Kernels
- Fused attention + feedforward operations eliminate 3 memory roundtrips
- Custom FlashAttention variant optimized for code's hierarchical structure
- Memory bandwidth: 2.1TB/s utilization on H100 (vs 1.6TB/s with standard kernels)
### 2. Speculative Execution
- While processing tokens 1-N, speculatively compute likely continuations for tokens N+1 to N+8
- Code has predictable patterns (indentation, brackets, etc.) → 70% speculation hit rate
- When speculation hits: effective 3x speedup. When it misses: 5% penalty.
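Putting those numbers together gives a rough sense of the net effect. A minimal sketch of the arithmetic, assuming hits and misses are independent per decoding step (a simplification):

```typescript
// Back-of-envelope estimate of the net speedup from speculation, using the
// numbers above: 70% hit rate, 3x speedup on a hit, 5% penalty on a miss.
const hitRate = 0.7;
const hitSpeedup = 3.0;   // time per token shrinks to 1/3 on a hit
const missPenalty = 1.05; // time per token grows by 5% on a miss

const expectedTimePerToken = hitRate * (1 / hitSpeedup) + (1 - hitRate) * missPenalty;
const effectiveSpeedup = 1 / expectedTimePerToken;

console.log(effectiveSpeedup.toFixed(2)); // ≈ 1.82x under these assumptions
```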
### 3. Architecture Changes
- Smaller model (7B vs 34B) trained specifically on the code-merging task
- Removed unused vocab (no need for Chinese characters in code editing)
- Custom positional encoding for hierarchical code structure
Trade-offs:
- Lower general reasoning ability vs GPT-4/Claude
- Optimized specifically for code editing (won't write your emails)
- Requires structured input format (`// ... existing code ...` markers)
Benchmarked against:
- vLLM + Llama 3.1 8B: 3,200 tok/sec
- TensorRT-LLM + CodeLlama 7B: 4,800 tok/sec
- Our previous version: 4,500 tok/sec
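All throughput numbers above come from single-request, cold-start runs. A minimal client-side harness along these lines reproduces the measurement; note that it includes network time, so it lower-bounds the server-side figure, and the request/response field names are assumptions rather than the documented schema:

```typescript
// Single-request, cold-start throughput measurement (sketch).
// Run once per fresh process; no batching, no warm-up requests.
// Request/response field names are illustrative assumptions.
async function measureTokPerSec(code: string, update: string): Promise<number> {
  const start = performance.now();
  const res = await fetch("https://api.morphllm.com/v1/apply", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.MORPH_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ code, update }), // hypothetical field names
  });
  const body = await res.json();
  const elapsedSec = (performance.now() - start) / 1000;
  const outputTokens = body.usage?.completion_tokens ?? 0; // hypothetical usage field
  return outputTokens / elapsedSec;
}
```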
## Real-World Performance Data
In our benchmarks, Morph FastApply completes end-to-end tasks ~35% faster than search & replace methods. Typical apply latency is 1-3 seconds, depending on file size and network conditions.
Example use cases:
- Live refactoring: Make edits across multiple files with real-time feedback
- Agent workflows: Coordinated edits with high accuracy
- Speculative editing: Apply suggestions as the model generates them
- Interactive changes: Quick feedback loop for iterative development
## Limitations & Future Work
Current limitations:
- Works best on structured languages (Python, JS, Go). Struggles with heavily macro-based C++
- Requires specific input format (`// ... existing code ...` markers)
- Single-request only (no conversation context)
Next targets:
- 15k+ tok/sec: New Blackwell B200 kernels in development
- Multi-file edits: Atomic operations across entire repositories
- Sub-100ms p50: Moving inference closer to edge locations
## Try It Yourself
API endpoint: `POST https://api.morphllm.com/v1/apply`
Pricing: $0.80/1M input, $1.20/1M output
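A minimal request sketch, assuming JSON body fields named `code` and `update` and a `merged_code` response field (check the docs for the exact schema):

```typescript
// Sketch of a single apply call. The request/response field names are
// assumptions for illustration; the endpoint and pricing are from this post.
const code = `export function add(a: number, b: number) { return a + b; }`;
const update = `
// ... existing code ...
export function add(a: number, b: number) {
  if (!Number.isFinite(a) || !Number.isFinite(b)) throw new Error("expected numbers");
  return a + b;
}
// ... existing code ...
`;

const res = await fetch("https://api.morphllm.com/v1/apply", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.MORPH_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ code, update }), // hypothetical field names
});
console.log((await res.json()).merged_code); // hypothetical response field
```

At the prices above, an apply with, say, 3,000 input tokens and 2,000 output tokens costs about $0.005.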
Get API key • Full docs • Benchmarks
Want to help us hit 15k tok/sec? Join us: info@morphllm.com
