We Hit 10,500 Tokens/Sec Applying Code Edits on a Single H100

Technical deep-dive: custom CUDA kernels + speculative execution for 2.3x speedup

Tejas Bhakta
September 15, 2025 · 4 min read


[Figure: Performance graph showing morph-v3-fast breaking through 10,500 tokens per second]

TL;DR: Our code editing model now processes 10,500 tokens/sec per request on a single H100 (up from 4,500). That puts most file edits under 300ms, fast enough that the bottleneck shifts from inference to network latency.

Here's how we got there and why it matters for AI coding tools.

Reproducible benchmarks: All timing data from single-request, cold-start measurements. No batch tricks or cherry-picked examples.


Why This Speed Matters (It's Not Just Marketing)

The problem: current AI coding tools either rewrite entire files (slow, wasteful) or use brittle search-and-replace (fails on formatting/whitespace). We take a different approach: intelligently merging partial code snippets into the original file.

Real performance comparison (averaged across 1,000 production edits):

  • Full file rewrite (GPT-4/Claude): ~15s for 5k tokens, 40% failure rate on large files
  • Search-and-replace (regex-based): ~2s, 15% failure rate on whitespace/formatting
  • morph-v3-fast: ~0.5s, 0.8% failure rate

Actual customer data: a Fortune 100 bank replaced their search-and-replace pipeline. Previous: 20s per edit, 15% silent failures. Now: 400ms per edit, <1% failures.

Model          | Tokens/sec | Memory (GB) | Use Case
morph-v3-fast  | 10,500+    | 40          | Speed-critical edits
morph-v3-large | 5,000      | 80          | Complex architectural changes

The Technical Problem

Traditional approaches suck:

  1. Full file rewrite: LLMs output entire files even for single-line changes. Wastes 10-100x tokens, scales poorly.
  2. Search-and-replace: Brittle matching fails on whitespace, similar variable names, refactored code.
  3. Git patches: Models are terrible at generating valid diffs. High failure rates.

We built a model that takes original code + partial snippets (with // ... existing code ... markers) and intelligently merges them. Think git merge but for AI-generated code.
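
To make that input format concrete, here's a toy example of the contract described above (our illustration, not from the Morph docs): the model receives the original file plus a partial snippet whose unchanged regions are elided with // ... existing code ... markers, and returns the merged file.

original_code = """\
def add(a, b):
    return a + b

def greet(name):
    print(f"Hello, {name}")
"""

edit_snippet = """\
// ... existing code ...
def greet(name):
    print(f"Hi there, {name}!")
"""

# Expected merged output: `add` is left untouched, `greet` is replaced.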


How We Hit 10,500 Tok/Sec

1. Custom CUDA Kernels

  • Fused attention + feedforward operations eliminate 3 memory roundtrips
  • Custom FlashAttention variant optimized for code's hierarchical structure
  • Memory bandwidth: 2.1TB/s utilization on H100 (vs 1.6TB/s with standard kernels)
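
A quick back-of-the-envelope check on the bandwidth figure above (a sketch with assumed byte counts and timings, not profiler output):

def effective_bandwidth_tb_s(bytes_moved: float, kernel_time_s: float) -> float:
    """Achieved HBM bandwidth in TB/s: bytes streamed divided by kernel time."""
    return bytes_moved / kernel_time_s / 1e12

# Example: a fused attention + feedforward kernel that streams 4.2 GB of
# weights and activations in 2.0 ms sustains ~2.1 TB/s, matching the figure above.
print(effective_bandwidth_tb_s(4.2e9, 2.0e-3))  # ≈ 2.1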

2. Speculative Execution

  • While processing tokens 1-N, speculatively compute likely continuations for tokens N+1 to N+8
  • Code has predictable patterns (indentation, brackets, etc.) → 70% speculation hit rate
  • When speculation hits: effective 3x speedup. When it misses: 5% penalty.
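
Plugging those numbers into a simple expected-value calculation (the hit rate, per-hit speedup, and miss penalty come from the list above; the blend is our arithmetic):

hit_rate = 0.70       # fraction of steps where speculation is accepted
hit_speedup = 3.0     # effective speedup when speculation hits
miss_penalty = 0.05   # extra cost when speculation misses

# Time per token, normalized to decoding with no speculation.
time_per_token = hit_rate * (1.0 / hit_speedup) + (1.0 - hit_rate) * (1.0 + miss_penalty)
print(f"average effective speedup ≈ {1.0 / time_per_token:.2f}x")  # ≈ 1.8x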

3. Architecture Changes

  • Smaller model (7B vs 34B) trained specifically on the code-merging task
  • Removed unused vocab (no need for Chinese characters in code editing); see the pruning sketch below
  • Custom positional encoding for hierarchical code structure
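
Mechanically, vocab pruning looks something like this (a minimal PyTorch sketch, not Morph's code): keep only the token ids the code-editing corpus actually uses, shrinking both the embedding table and the output projection.

import torch
import torch.nn as nn

vocab_size, hidden = 128_000, 4096
embed = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size, bias=False)

# Token ids observed in the training corpus (toy set for illustration).
kept_ids = torch.tensor([0, 1, 2, 57, 198, 4011])

pruned_embed = nn.Embedding(len(kept_ids), hidden)
pruned_embed.weight.data = embed.weight.data[kept_ids].clone()

pruned_head = nn.Linear(hidden, len(kept_ids), bias=False)
pruned_head.weight.data = lm_head.weight.data[kept_ids].clone()

# Old id -> new id mapping so the tokenizer can be remapped consistently.
remap = {int(old): new for new, old in enumerate(kept_ids.tolist())}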

Trade-offs:

  • Lower general reasoning ability vs GPT-4/Claude
  • Optimized specifically for code editing (won't write your emails)
  • Requires structured input format (// ... existing code ...)

Benchmarked against:

  • vLLM + Llama 3.1 8B: 3,200 tok/sec
  • TensorRT-LLM + CodeLlama 7B: 4,800 tok/sec
  • Our previous version: 4,500 tok/sec

Real-World Performance Data

Measured latencies (p50/p95, including network):

File Size    | Traditional   | morph-v3-fast | Improvement
1-3k tokens  | 2.5s / 7.5s   | 0.5s / 0.8s   | 5-9x faster
5-10k tokens | 8s / 15s      | 0.9s / 1.2s   | 8-12x faster
15k+ tokens  | 25s+ / 160s+  | 1.4s / 2.1s   | 17-40x faster
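
To reproduce numbers like these from the client side, a single-request timing loop is enough (a sketch; the endpoint and payload mirror the API example further down, and the percentile math is ours):

import time
import requests  # assumed available

API_URL = "https://api.morphllm.com/v1/apply"
HEADERS = {"Authorization": "Bearer YOUR_KEY"}

def measure_latency(payload: dict, n: int = 100) -> tuple[float, float]:
    """Return (p50, p95) wall-clock latency in seconds, including network."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        requests.post(API_URL, json=payload, headers=HEADERS, timeout=60)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    p50 = latencies[int(0.50 * (n - 1))]
    p95 = latencies[int(0.95 * (n - 1))]
    return p50, p95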

Example use cases now feasible:

  • Live refactoring: Make edits across 50+ files in real-time as you type
  • Agent swarms: 20+ agents making coordinated edits without conflicts
  • Speculative editing: Apply suggestions before the model finishes generating them
  • Interactive architecture changes: Restructure entire codebases with sub-second feedback

Limitations & Future Work

Current limitations:

  • Works best on structured languages (Python, JS, Go). Struggles with heavily macro-based C++
  • Requires specific input format (// ... existing code ... markers)
  • Single-request only (no conversation context)

Next targets:

  • 15k+ tok/sec: New Blackwell B200 kernels in development
  • Multi-file edits: Atomic operations across entire repositories
  • Sub-100ms p50: Moving inference closer to edge locations

Try It Yourself

API endpoint: POST https://api.morphllm.com/v1/apply

curl -X POST https://api.morphllm.com/v1/apply \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "morph-v3-fast",
    "original_code": "def hello():\n    print(\"world\")",
    "edit_snippet": "def hello():\n    print(\"universe\")\n    // ... existing code ..."
  }'
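
The same request from Python (a sketch using the requests library; the response field names aren't documented above, so we just print the JSON body):

import requests

resp = requests.post(
    "https://api.morphllm.com/v1/apply",
    headers={"Authorization": "Bearer YOUR_KEY"},
    json={
        "model": "morph-v3-fast",
        "original_code": 'def hello():\n    print("world")',
        "edit_snippet": 'def hello():\n    print("universe")\n    // ... existing code ...',
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # merged file comes back in the response body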

Pricing: $0.50 per 1M tokens (4x cheaper than rewriting full files with GPT-4)

Get an API key, read the full docs, or check the benchmarks. Want to help us hit 15k tok/sec? Join us: info@morphllm.com