We Hit 10,500 Tokens/Sec Applying Code Edits on a Single H100

Technical deep-dive: custom CUDA kernels + speculative execution for 2.3x speedup

Tejas Bhakta
September 15, 2025 · 4 min read


[Figure: Performance graph showing morph-v3-fast breaking through 10,500 tokens per second]

TL;DR: Our code editing model now processes 10,500 tokens/sec per request on a single H100 (up from 4,500). That puts most file edits under 300ms, fast enough that the bottleneck shifts from inference to network latency.

Here's how we got there and why it matters for AI coding tools.

Reproducible benchmarks: All timing data from single-request, cold-start measurements. No batch tricks or cherry-picked examples.


Why This Speed Matters (It's Not Just Marketing)

The problem: current AI coding tools either rewrite entire files (slow, wasteful) or use brittle search-and-replace (fails on formatting/whitespace). We take a different approach: intelligently merging partial code snippets into the original file.

Real performance comparison (averaged across 1,000 production edits):

  • Full file rewrite (GPT-4/Claude): ~15s for 5k tokens, 40% failure rate on large files
  • Search-and-replace (regex-based): ~2s, 15% failure rate on whitespace/formatting
  • morph-v3-fast: ~0.5s, 0.8% failure rate

Actual customer data: a Fortune 100 bank replaced their search-and-replace pipeline. Previous: 20s per edit, 15% silent failures. Now: 400ms per edit, <1% failures.

Model          | Tokens/sec | Memory (GB) | Use Case
morph-v3-fast  | 10,500+    | 40          | Speed-critical edits
morph-v3-large | 5,000      | 80          | Complex architectural changes

The Technical Problem

Traditional approaches suck:

  1. Full file rewrite: LLMs output entire files even for single-line changes. Wastes 10-100x tokens, scales poorly.
  2. Search-and-replace: Brittle matching fails on whitespace, similar variable names, refactored code.
  3. Git patches: Models are terrible at generating valid diffs. High failure rates.

We built a model that takes original code + partial snippets (with // ... existing code ... markers) and intelligently merges them. Think git merge but for AI-generated code.
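
To make that input format concrete, here's a toy example of the contract described above (our illustration, not from the Morph docs): the model receives the original file plus a partial snippet whose unchanged regions are elided with // ... existing code ... markers, and returns the merged file.

original_code = """\
def add(a, b):
    return a + b

def greet(name):
    print(f"Hello, {name}")
"""

edit_snippet = """\
// ... existing code ...
def greet(name):
    print(f"Hi there, {name}!")
"""

# Expected merged output: `add` is left untouched, `greet` is replaced.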


How We Hit 10,500 Tok/Sec

1. Custom CUDA Kernels

  • Fused attention + feedforward operations eliminate 3 memory roundtrips
  • Custom FlashAttention variant optimized for code's hierarchical structure
  • Memory bandwidth: 2.1TB/s utilization on H100 (vs 1.6TB/s with standard kernels)
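
A quick back-of-the-envelope check on the bandwidth figure above (a sketch with assumed byte counts and timings, not profiler output):

def effective_bandwidth_tb_s(bytes_moved: float, kernel_time_s: float) -> float:
    """Achieved HBM bandwidth in TB/s: bytes streamed divided by kernel time."""
    return bytes_moved / kernel_time_s / 1e12

# Example: a fused attention + feedforward kernel that streams 4.2 GB of
# weights and activations in 2.0 ms sustains ~2.1 TB/s, matching the figure above.
print(effective_bandwidth_tb_s(4.2e9, 2.0e-3))  # ≈ 2.1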

2. Speculative Execution

  • While processing tokens 1-N, speculatively compute likely continuations for tokens N+1 to N+8
  • Code has predictable patterns (indentation, brackets, etc.) → 70% speculation hit rate
  • When speculation hits: effective 3x speedup. When it misses: 5% penalty.
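
Plugging those numbers into a simple expected-value calculation (the hit rate, per-hit speedup, and miss penalty come from the list above; the blend is our arithmetic):

hit_rate = 0.70       # fraction of steps where speculation is accepted
hit_speedup = 3.0     # effective speedup when speculation hits
miss_penalty = 0.05   # extra cost when speculation misses

# Time per token, normalized to decoding with no speculation.
time_per_token = hit_rate * (1.0 / hit_speedup) + (1.0 - hit_rate) * (1.0 + miss_penalty)
print(f"average effective speedup ≈ {1.0 / time_per_token:.2f}x")  # ≈ 1.8x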

3. Architecture Changes

  • Smaller model (7B vs 34B) trained specifically on the code-merging task
  • Removed unused vocab (no need for Chinese characters in code editing); see the pruning sketch below
  • Custom positional encoding for hierarchical code structure
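
Mechanically, vocab pruning looks something like this (a minimal PyTorch sketch, not Morph's code): keep only the token ids the code-editing corpus actually uses, shrinking both the embedding table and the output projection.

import torch
import torch.nn as nn

vocab_size, hidden = 128_000, 4096
embed = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size, bias=False)

# Token ids observed in the training corpus (toy set for illustration).
kept_ids = torch.tensor([0, 1, 2, 57, 198, 4011])

pruned_embed = nn.Embedding(len(kept_ids), hidden)
pruned_embed.weight.data = embed.weight.data[kept_ids].clone()

pruned_head = nn.Linear(hidden, len(kept_ids), bias=False)
pruned_head.weight.data = lm_head.weight.data[kept_ids].clone()

# Old id -> new id mapping so the tokenizer can be remapped consistently.
remap = {int(old): new for new, old in enumerate(kept_ids.tolist())}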

Trade-offs:

  • Lower general reasoning ability vs GPT-4/Claude
  • Optimized specifically for code editing (won't write your emails)
  • Requires structured input format (// ... existing code ...)

Benchmarked against:

  • vLLM + Llama 3.1 8B: 3,200 tok/sec
  • TensorRT-LLM + CodeLlama 7B: 4,800 tok/sec
  • Our previous version: 4,500 tok/sec

Real-World Performance Data

Measured latencies (p50/p95, including network):

File Size    | Traditional   | morph-v3-fast | Improvement
1-3k tokens  | 2.5s / 7.5s   | 0.5s / 0.8s   | 5-9x faster
5-10k tokens | 8s / 15s      | 0.9s / 1.2s   | 8-12x faster
15k+ tokens  | 25s+ / 160s+  | 1.4s / 2.1s   | 17-40x faster
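
To reproduce numbers like these from the client side, a single-request timing loop is enough (a sketch; the endpoint and payload mirror the API example further down, and the percentile math is ours):

import time
import requests  # assumed available

API_URL = "https://api.morphllm.com/v1/apply"
HEADERS = {"Authorization": "Bearer YOUR_KEY"}

def measure_latency(payload: dict, n: int = 100) -> tuple[float, float]:
    """Return (p50, p95) wall-clock latency in seconds, including network."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        requests.post(API_URL, json=payload, headers=HEADERS, timeout=60)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    p50 = latencies[int(0.50 * (n - 1))]
    p95 = latencies[int(0.95 * (n - 1))]
    return p50, p95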

Example use cases now feasible:

  • Live refactoring: Make edits across 50+ files in real-time as you type
  • Agent swarms: 20+ agents making coordinated edits without conflicts
  • Speculative editing: Apply suggestions before the model finishes generating them
  • Interactive architecture changes: Restructure entire codebases with sub-second feedback

Limitations & Future Work

Current limitations:

  • Works best on structured languages (Python, JS, Go). Struggles with heavily macro-based C++
  • Requires specific input format (// ... existing code ... markers)
  • Single-request only (no conversation context)

Next targets:

  • 15k+ tok/sec: New Blackwell B200 kernels in development
  • Multi-file edits: Atomic operations across entire repositories
  • Sub-100ms p50: Moving inference closer to edge locations

Try It Yourself

API endpoint: POST https://api.morphllm.com/v1/apply

curl -X POST https://api.morphllm.com/v1/apply \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "morph-v3-fast",
    "original_code": "def hello():\n    print(\"world\")",
    "edit_snippet": "def hello():\n    print(\"universe\")\n    // ... existing code ..."
  }'
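
The same request from Python (a sketch using the requests library; the response field names aren't documented above, so we just print the JSON body):

import requests

resp = requests.post(
    "https://api.morphllm.com/v1/apply",
    headers={"Authorization": "Bearer YOUR_KEY"},
    json={
        "model": "morph-v3-fast",
        "original_code": 'def hello():\n    print("world")',
        "edit_snippet": 'def hello():\n    print("universe")\n    // ... existing code ...',
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # merged file comes back in the response body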

Pricing: $0.50 per 1M tokens (4x cheaper than rewriting full files with GPT-4)

Get an API key, read the full docs, or check the benchmarks. Want to help us hit 15k tok/sec? Join us: info@morphllm.com