We Hit 10,500 Tokens/Sec Applying Code Edits on a Single H100
TL;DR: Our code editing model now applies edits at 10,500 tokens/sec per request on a single H100 (up from 4,500). That puts most file edits under 300ms, fast enough that the bottleneck shifts from inference to network latency.
Here's how we got there and why it matters for AI coding tools.
Reproducible benchmarks: All timing data from single-request, cold-start measurements. No batch tricks or cherry-picked examples.
Why This Speed Matters (It's Not Just Marketing)
The problem: Current AI coding tools either rewrite entire files (slow, wasteful) or use brittle search-and-replace (fails on formatting/whitespace). We're solving a different problem—intelligently merging partial code snippets.
Real performance comparison (averaged across 1,000 production edits):
- Full file rewrite (GPT-4/Claude): ~15s for 5k tokens, 40% failure rate on large files
- Search-and-replace (regex-based): ~2s, 15% failure rate on whitespace/formatting
- morph-v3-fast: ~0.5s, 0.8% failure rate
Actual customer data: Fortune 100 bank replaced their search-and-replace pipeline. Previous: 20s per edit, 15% silent failures. Now: 400ms per edit, <1% failures.
| Model | Tokens/sec | Memory (GB) | Use Case |
|---|---|---|---|
| morph-v3-fast | 10,500+ | 40 | Speed-critical edits |
| morph-v3-large | 5,000 | 80 | Complex architectural changes |
The Technical Problem
Traditional approaches suck:
- Full file rewrite: LLMs output entire files even for single-line changes. Wastes 10-100x tokens, scales poorly.
- Search-and-replace: Brittle matching fails on whitespace, similar variable names, refactored code.
- Git patches: Models are terrible at generating valid diffs. High failure rates.
We built a model that takes original code + partial snippets (with `// ... existing code ...` markers) and intelligently merges them. Think `git merge`, but for AI-generated code.
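To make the contract concrete, here's a hypothetical before/after (illustration only, not the model's internals; the marker syntax matches the API example at the end of this post):

```python
# Hypothetical illustration of the merge contract, not the model internals.
original_code = """\
def add(a, b):
    return a + b

def sub(a, b):
    return a - b
"""

# The edit snippet spells out only what changes; everything unchanged is
# elided with the marker.
edit_snippet = """\
// ... existing code ...
def sub(a, b):
    return abs(a - b)
"""

# Expected merged result: `add` untouched, `sub` replaced.
merged = """\
def add(a, b):
    return a + b

def sub(a, b):
    return abs(a - b)
"""
```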
How We Hit 10,500 Tok/Sec
1. Custom CUDA Kernels
- Fused attention + feedforward operations eliminate 3 memory roundtrips
- Custom FlashAttention variant optimized for code's hierarchical structure
- Memory bandwidth: 2.1TB/s utilization on H100 (vs 1.6TB/s with standard kernels)
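We can't publish the kernels here, but the motivation is standard: every unfused op materializes its output in HBM and the next op reads it back. As a rough PyTorch analogy (torch.compile's automatic fusion stands in for our hand-written CUDA; the ops below are simplified, not our actual layers):

```python
# Illustrative analogy only: torch.compile's op fusion stands in for
# Morph's hand-written CUDA kernels. The point is the saved HBM traffic.
import torch

def attn_then_ffn(x, w1, w2):
    # Eager mode materializes every intermediate (scores, h, the relu
    # output) in HBM and reads it back for the next op.
    scores = torch.softmax(x @ x.transpose(-1, -2), dim=-1)
    h = scores @ x
    return torch.relu(h @ w1) @ w2

# A fused version keeps intermediates in registers/shared memory where
# possible, cutting memory roundtrips.
fused = torch.compile(attn_then_ffn)

x = torch.randn(4, 256, 512, device="cuda", dtype=torch.float16)
w1 = torch.randn(512, 2048, device="cuda", dtype=torch.float16)
w2 = torch.randn(2048, 512, device="cuda", dtype=torch.float16)
out = fused(x, w1, w2)  # first call compiles; later calls run fused kernels
```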
2. Speculative Execution
- While processing tokens 1-N, speculatively compute likely continuations for tokens N+1 to N+8
- Code has predictable patterns (indentation, brackets, etc.) → 70% speculation hit rate
- When speculation hits: effective 3x speedup. When it misses: 5% penalty.
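A minimal sketch of the greedy draft-and-verify loop (the real implementation runs inside the serving engine; `draft_model`, `target_model`, and their methods are hypothetical stand-ins, not our actual interface):

```python
# Hypothetical sketch of greedy speculative decoding. `draft_model` and
# `target_model` are stand-ins; neither API is Morph's real interface.
def speculative_step(prefix: list[int], draft_model, target_model, k: int = 8) -> list[int]:
    # The cheap model drafts k continuation tokens (code is predictable:
    # indentation, brackets, boilerplate).
    draft = draft_model.greedy_continue(prefix, k)

    # The target model scores all k draft positions in ONE forward pass,
    # returning its own greedy choice at each position.
    target_choice = target_model.greedy_at(prefix, draft)

    accepted = []
    for drafted, wanted in zip(draft, target_choice):
        if drafted == wanted:
            accepted.append(drafted)  # hit: token verified essentially for free
        else:
            accepted.append(wanted)   # miss: keep the target's token, discard the rest
            break
    return accepted
```

With the ~70% per-token hit rate above, each verification pass accepts several tokens on average, which is where the roughly 3x effective speedup comes from.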
3. Architecture Changes
- Smaller model (7B vs 34B) trained specifically on the code-merging task
- Removed unused vocab (no need for Chinese characters in code editing)
- Custom positional encoding for hierarchical code structure
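The vocab trimming is conceptually simple. A minimal sketch of the general idea, assuming `keep_ids` comes from token-frequency counts over a code corpus (this is not our exact procedure):

```python
import torch

# Hypothetical sketch: drop embedding rows for tokens that never appear
# in the code-editing corpus, then remap token ids.
def prune_embedding(emb: torch.nn.Embedding, keep_ids: list[int]):
    keep = torch.tensor(sorted(set(keep_ids)))
    pruned = torch.nn.Embedding(len(keep), emb.embedding_dim)
    pruned.weight.data.copy_(emb.weight.data[keep])
    remap = {int(old): new for new, old in enumerate(keep)}  # old id -> new id
    return pruned, remap
```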
Trade-offs:
- Lower general reasoning ability vs GPT-4/Claude
- Optimized specifically for code editing (won't write your emails)
- Requires structured input format (`// ... existing code ...` markers)
Benchmarked against:
- vLLM + Llama 3.1 8B: 3,200 tok/sec
- TensorRT-LLM + CodeLlama 7B: 4,800 tok/sec
- Our previous version: 4,500 tok/sec
Real-World Performance Data
Measured latencies (p50/p95, including network):
| File Size | Traditional | morph-v3-fast | Improvement |
|---|---|---|---|
| 1-3k tokens | 2.5s / 7.5s | 0.5s / 0.8s | 5-9x faster |
| 5-10k tokens | 8s / 15s | 0.9s / 1.2s | 8-12x faster |
| 15k+ tokens | 25s+ / 160s+ | 1.4s / 2.1s | 17-40x faster |
Example use cases now feasible:
- Live refactoring: Make edits across 50+ files in real-time as you type
- Agent swarms: 20+ agents making coordinated edits without conflicts
- Speculative editing: Apply suggestions before the model finishes generating them
- Interactive architecture changes: Restructure entire codebases with sub-second feedback
Limitations & Future Work
Current limitations:
- Works best on structured languages (Python, JS, Go). Struggles with heavily macro-based C++
- Requires specific input format (`// ... existing code ...` markers)
- Single-request only (no conversation context)
Next targets:
- 15k+ tok/sec: New Blackwell B200 kernels in development
- Multi-file edits: Atomic operations across entire repositories
- Sub-100ms p50: Moving inference closer to edge locations
Try It Yourself
API endpoint: `POST https://api.morphllm.com/v1/apply`

```bash
curl -X POST https://api.morphllm.com/v1/apply \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "morph-v3-fast",
    "original_code": "def hello():\n    print(\"world\")",
    "edit_snippet": "def hello():\n    print(\"universe\")\n// ... existing code ..."
  }'
```
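The same call from Python, for convenience. A minimal sketch using the `requests` library; the exact response fields are documented in the full docs:

```python
import requests

# Minimal sketch of the same request in Python. Replace YOUR_KEY; see the
# API docs for the response schema.
resp = requests.post(
    "https://api.morphllm.com/v1/apply",
    headers={"Authorization": "Bearer YOUR_KEY"},
    json={
        "model": "morph-v3-fast",
        "original_code": 'def hello():\n    print("world")',
        "edit_snippet": 'def hello():\n    print("universe")\n// ... existing code ...',
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```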
Pricing: $0.50 per 1M tokens (4x cheaper than rewriting full files with GPT-4)
Get API key • Full docs • Benchmarks

Want to help us hit 15k tok/sec? Join us: info@morphllm.com