We Hit 10,500 Tokens/Sec on B200

Technical deep-dive: custom CUDA kernels + speculative execution for 2.3x speedup

Tejas Bhakta
September 15, 2025 · 4 min read

[Figure: performance graph showing Morph v3-fast breaking through 10,500 tokens/sec]

TL;DR: Our code editing model now runs at 10,500 tokens/sec per request on a single B200 (up from 4,500). Accounting for network latency and real-world conditions, most file edits complete in 1-3 seconds.

Here's how we got there and why it matters for AI coding tools.

Reproducible benchmarks: All timing data from single-request, cold-start measurements. No batch tricks or cherry-picked examples.


Why This Speed Matters (It's Not Just Marketing)

Search and replace requires a separate tool call for each chunk being edited. Multiple edits = multiple round trips.

FastApply handles all edits to a file in a single call—the model describes the changes, and we merge everything at once.

Based on our benchmarks, Morph delivers ~35% faster end-to-end task completion compared to search & replace.

Model            Tokens/sec   Memory (GB)   Use Case
morph-v3-fast    10,500+      40            Speed-critical edits
morph-v3-large   5,000        80            Complex architectural changes

The Technical Problem

The standard approach is search and replace. It works, but requires a separate tool call for each chunk being edited. When you need multiple edits, the round trips add up.

We built a model that takes original code + partial snippets (with // ... existing code ... markers) and merges all edits in one call. Think git merge but for AI-generated code.
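Concretely, a single update snippet can carry several edits at once. Here is a hypothetical example (a JavaScript file shown as Python strings purely for illustration; the marker syntax is the one described above):

```python
# Hypothetical illustration of the single-call merge format.
# The file being edited is JavaScript; unchanged regions in the update
# snippet are elided with the `// ... existing code ...` marker.
original = """\
function getUser(id) {
  return db.users.find(id);
}

function deleteUser(id) {
  return db.users.remove(id);
}
"""

update = """\
function getUser(id) {
  if (!id) throw new Error("id required");
  return db.users.find(id);
}
// ... existing code ...
function deleteUser(id) {
  audit.log("delete", id);
  return db.users.remove(id);
}
"""

# One apply call merges both edits into `original`; the rest of the
# file is left untouched.
```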


How We Hit 10,500 Tok/Sec

1. Custom CUDA Kernels

  • Fused attention + feedforward operations eliminate 3 memory roundtrips (principle sketched after this list)
  • Custom FlashAttention variant optimized for code's hierarchical structure
  • Memory bandwidth: 2.1TB/s utilization on H100 (vs 1.6TB/s with standard kernels)
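To see why fusion matters, compare a naive attention implementation that materializes every intermediate in GPU memory with a fused kernel. A minimal PyTorch sketch of the principle (this is not our kernel, just the idea it exploits):

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Each intermediate is written to and re-read from global memory:
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # roundtrip 1
    probs = scores.softmax(dim=-1)                           # roundtrip 2
    return probs @ v                                         # roundtrip 3

def fused_attention(q, k, v):
    # A fused kernel keeps scores/probs in on-chip memory, so only
    # q, k, v are read and only the output is written back.
    return F.scaled_dot_product_attention(q, k, v)

q = k = v = torch.randn(1, 8, 1024, 64)  # (batch, heads, seq, head_dim)
assert torch.allclose(naive_attention(q, k, v), fused_attention(q, k, v), atol=1e-5)
```

Our kernels apply the same idea one level up, fusing the attention and feedforward stages so their intermediates never leave the chip.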

2. Speculative Execution

  • While processing tokens 1-N, speculatively compute likely continuations for tokens N+1 to N+8 (accept/reject loop sketched after this list)
  • Code has predictable patterns (indentation, brackets, etc.) → 70% speculation hit rate
  • When speculation hits: effective 3x speedup. When it misses: 5% penalty.
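The accept/reject logic looks roughly like this. A minimal sketch with greedy acceptance: `draft_model` and `target_model` are hypothetical stand-ins, and in the real pipeline the verification of all k guesses happens in one batched forward pass rather than the loop shown here.

```python
from typing import Callable, List

def speculative_step(tokens: List[int],
                     draft_model: Callable[[List[int]], int],
                     target_model: Callable[[List[int]], int],
                     k: int = 8) -> List[int]:
    # 1. A cheap draft pass guesses the next k tokens.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        nxt = draft_model(ctx)
        draft.append(nxt)
        ctx.append(nxt)

    # 2. Verify against the full model: keep the longest agreeing prefix,
    #    plus the full model's corrected token at the first mismatch.
    accepted = list(tokens)
    for guess in draft:
        actual = target_model(accepted)
        accepted.append(actual)
        if actual != guess:
            break  # miss: small penalty, resume drafting from here
    return accepted
```

With a ~70% hit rate on code, most steps accept several tokens for the price of one full-model step.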

3. Architecture Changes

  • Smaller model (7B vs 34B) trained specifically on the code-merging task
  • Removed unused vocab (no need for Chinese characters in code editing); rough savings sketched after this list
  • Custom positional encoding for hierarchical code structure
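The vocab pruning alone is worth a noticeable slice of memory. A back-of-the-envelope sketch with hypothetical sizes (not our actual configuration):

```python
# Hypothetical figures, just to show the arithmetic of pruning unused vocab.
hidden_dim = 4096
full_vocab = 128_000   # general-purpose tokenizer
code_vocab = 32_000    # tokens that actually occur in code-editing data

def embedding_params(vocab: int, dim: int, tied: bool = False) -> int:
    # input embedding + (untied) output projection
    return vocab * dim * (1 if tied else 2)

saved = embedding_params(full_vocab, hidden_dim) - embedding_params(code_vocab, hidden_dim)
print(f"{saved / 1e6:.0f}M parameters saved")  # ~786M with these numbers
```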

Trade-offs:

  • Lower general reasoning ability vs GPT-4/Claude
  • Optimized specifically for code editing (won't write your emails)
  • Requires structured input format (// ... existing code ...)

Benchmarked against:

  • vLLM + Llama 3.1 8B: 3,200 tok/sec
  • TensorRT-LLM + CodeLlama 7B: 4,800 tok/sec
  • Our previous version: 4,500 tok/sec

Real-World Performance Data

In our benchmarks, FastApply completes end-to-end tasks ~35% faster than search & replace. Typical apply latency is 1-3 seconds, depending on file size and network conditions.

Example use cases:

  • Live refactoring: Make edits across multiple files with real-time feedback
  • Agent workflows: Coordinated edits with high accuracy
  • Speculative editing: Apply suggestions as the model generates them
  • Interactive changes: Quick feedback loop for iterative development

Limitations & Future Work

Current limitations:

  • Works best on structured languages (Python, JS, Go). Struggles with heavily macro-based C++
  • Requires specific input format (// ... existing code ... markers)
  • Single-request only (no conversation context)

Next targets:

  • 15k+ tok/sec: New Blackwell B200 kernels in development
  • Multi-file edits: Atomic operations across entire repositories
  • Sub-100ms p50: Moving inference closer to edge locations

Try It Yourself

API endpoint: POST https://api.morphllm.com/v1/apply

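A minimal Python sketch of calling the endpoint. The request fields (`model`, `code`, `update`) and the response shape are assumptions for illustration; check the docs for the exact schema.

```python
import os
import requests

resp = requests.post(
    "https://api.morphllm.com/v1/apply",
    headers={"Authorization": f"Bearer {os.environ['MORPH_API_KEY']}"},
    json={
        "model": "morph-v3-fast",                         # or morph-v3-large
        "code": open("src/users.js").read(),              # original file
        "update": open("edits/users.snippet.js").read(),  # snippet with markers
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # merged file, per the (assumed) response shape
```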

Pricing: $0.80/1M input, $1.20/1M output
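For a rough sense of per-edit cost at those rates (token counts below are hypothetical):

```python
input_tokens = 4_000    # original file + update snippet
output_tokens = 2_000   # merged file
cost = input_tokens / 1e6 * 0.80 + output_tokens / 1e6 * 1.20
print(f"${cost:.4f} per apply")  # $0.0056
```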

Get API key · Full docs · Benchmarks

Want to help us hit 15k tok/sec? Join us: info@morphllm.com