---
title: "We Hit 10,500 Tokens/Sec on B200"
url: "https://www.morphllm.com/blog/morph-breaks-10k-barrier"
description: "Technical deep-dive: custom CUDA kernels + speculative execution for 2.3x speedup"
date: "2025-09-15"
author: "Tejas Bhakta"
---
# We Hit 10,500 Tokens/Sec on B200

![Performance graph showing Morph v3-fast breaking through 10,500 tokens per second](/images/breakthrough.png)

**TL;DR**: Our code editing model now processes 10,500 tokens/sec per request on a single B200 (up from 4,500). Even accounting for network latency and real-world conditions, most file edits complete in 1-3 seconds.

Here's how we got there and why it matters for AI coding tools.

**Reproducible benchmarks**: All timing data from single-request, cold-start measurements. No batch tricks or cherry-picked examples.

---

## Why This Speed Matters (It's Not Just Marketing)

Search and replace requires a separate tool call for each chunk being edited. Multiple edits = multiple round trips.

FastApply handles all edits to a file in a single call—the model describes the changes, and we merge everything at once.

Based on [our benchmarks](https://morphllm.com/benchmarks), Morph delivers ~35% faster end-to-end task completion compared to search & replace.
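
The round-trip arithmetic behind that gap is easy to sketch. The latencies and edit counts below are hypothetical, not our benchmark data:

```python
def total_latency_ms(num_edits, round_trip_ms, apply_ms):
    """Search & replace pays one tool-call round trip per edit chunk;
    a single apply call pays the network once plus one merge."""
    search_replace = num_edits * round_trip_ms
    fast_apply = round_trip_ms + apply_ms
    return search_replace, fast_apply

# Hypothetical: 5 edit chunks, 400 ms per tool-call round trip,
# 800 ms for one merge of the whole file.
sr, fa = total_latency_ms(5, 400, 800)  # sr grows with edit count; fa doesn't
```

The fixed per-call network cost, not model speed alone, is what the single-call design removes.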

| Model | Tokens/sec | Memory (GB) | Use Case |
|-------|------------|-------------|----------|
| **morph-v3-fast** | 10,500+ | 40 | Speed-critical edits |
| **morph-v3-large** | 5,000 | 80 | Complex architectural changes |

---

## The Technical Problem

The standard approach is search and replace. It works, but each edited chunk is its own tool call, and when you need multiple edits the round trips add up.

We built a model that takes original code + partial snippets (with `// ... existing code ...` markers) and merges all edits in one call. Think `git merge` but for AI-generated code.
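
As a hypothetical illustration of that input contract: one snippet carrying two edits to the same file, with markers standing in for everything unchanged:

```python
# One edit_snippet, two edits. The "// ... existing code ..." marker
# (the format this model expects, even for Python files) stands in for
# every unchanged region, so a single call can carry the whole diff.
edit_snippet = """\
// ... existing code ...
def area(r):
    return 3.14159 * r * r  # edit 1
// ... existing code ...
def perimeter(r):
    return 2 * 3.14159 * r  # edit 2
// ... existing code ...
"""
```

Search & replace would have needed one tool call per edit; here both ride in a single request.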

---

## How We Hit 10,500 Tok/Sec

**1. Custom CUDA Kernels**
- Fused attention + feedforward operations eliminate 3 memory roundtrips
- Custom FlashAttention variant optimized for code's hierarchical structure  
- Memory bandwidth: 2.1TB/s utilization on H100 (vs 1.6TB/s with standard kernels)
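
Why the roundtrips matter: decode is bandwidth-bound, so every extra pass of the activations through HBM is wall-clock time. A back-of-envelope model (the dimensions, trip counts, and bandwidth below are illustrative assumptions, not our kernels' exact numbers):

```python
def roundtrip_time_us(seq_len, hidden, trips, bw_tbs, bytes_per_el=2):
    """Microseconds spent moving fp16 activations of shape
    (seq_len, hidden) through HBM `trips` times on a
    bandwidth-bound step."""
    bytes_moved = seq_len * hidden * bytes_per_el * trips
    return bytes_moved / (bw_tbs * 1e12) * 1e6

# Illustrative: 8k-token context, hidden size 4096, 2.0 TB/s HBM.
# Fusing attention + feedforward removes 3 of 6 activation trips.
unfused = roundtrip_time_us(8192, 4096, trips=6, bw_tbs=2.0)
fused = roundtrip_time_us(8192, 4096, trips=3, bw_tbs=2.0)
```

Halving the trips halves the memory time, which is exactly where fused kernels buy their speedup on memory-bound steps.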

**2. Speculative Execution**
- While processing tokens 1-N, speculatively compute likely continuations for tokens N+1 to N+8
- Code has predictable patterns (indentation, brackets, etc.) → 70% speculation hit rate
- When speculation hits: effective 3x speedup. When it misses: 5% penalty.
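
The accept/verify loop can be sketched in a few lines. This is a toy character-level version with stand-in draft and target models, not our production speculator:

```python
def speculative_decode(target_next, draft_next, prompt, n_tokens, k=8):
    """Toy speculative decoding: a cheap draft proposes k tokens; the
    target verifies all k positions in what would be ONE parallel
    forward pass, keeping the longest matching prefix plus one
    corrected token on a miss."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_tokens:
        # 1) Draft k tokens autoregressively (cheap).
        ctx = list(out)
        spec = []
        for _ in range(k):
            t = draft_next(ctx)
            spec.append(t)
            ctx.append(t)
        # 2) One target pass scores every speculated position. We loop
        #    here, but the positions are independent given the draft,
        #    which is what makes the real version parallel.
        target_passes += 1
        ctx = list(out)
        for t in spec:
            expected = target_next(ctx)
            if expected != t:
                out.append(expected)  # miss: take the target's token
                break
            out.append(t)
            ctx.append(t)
    out = out[:len(prompt) + n_tokens]  # trim any overshoot
    return "".join(out[len(prompt):]), target_passes
```

With a perfect draft, the toy needs one verification pass per k tokens; with a bad draft it degrades toward a pass every couple of tokens, which is why the 70% hit rate on code matters so much.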

**3. Architecture Changes** 
- Smaller model (7B vs 34B) trained specifically on the code-merging task
- Removed unused vocab (no need for Chinese characters in code editing)
- Custom positional encoding for hierarchical code structure
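
The vocabulary trim is the most mechanical of the three: drop every token the training corpus never uses and remap the surviving embedding rows. A toy sketch with a made-up five-token vocab:

```python
def prune_vocab(vocab, corpus_token_ids):
    """Keep only tokens observed in the corpus; return the pruned
    vocab and an old-id -> new-id map for the embedding rows."""
    used = set(corpus_token_ids)
    remap = {}
    pruned = {}
    for tok, old_id in sorted(vocab.items(), key=lambda kv: kv[1]):
        if old_id in used:
            remap[old_id] = len(pruned)  # next dense id
            pruned[tok] = len(pruned)
    return pruned, remap

# Made-up vocab: the Chinese character never appears in code edits.
vocab = {"def": 0, "return": 1, "你": 2, "(": 3, ")": 4}
pruned, remap = prune_vocab(vocab, corpus_token_ids=[0, 1, 3, 4, 0])
```

Every dropped row shrinks both the embedding matrix and the output projection, which is cheap accuracy-free speedup for a single-domain model.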

**Trade-offs**: 
- Lower general reasoning ability vs GPT-4/Claude
- Optimized specifically for code editing (won't write your emails)
- Requires structured input format (`// ... existing code ...`)

**Benchmarked against**:
- vLLM + Llama 3.1 8B: 3,200 tok/sec  
- TensorRT-LLM + CodeLlama 7B: 4,800 tok/sec  
- Our previous version: 4,500 tok/sec
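
For numbers like these, methodology matters as much as the result. A minimal single-request, cold-start harness in the same spirit (`fake_generate` is a stand-in for a real API call, and the rate it produces is meaningless):

```python
import time

def measure_tok_per_sec(generate, prompt):
    """Single-request throughput: wall-clock one whole call, cold
    (no warm-up pass), and divide by tokens produced."""
    start = time.perf_counter()
    tokens = generate(prompt)  # stand-in for one API request
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

def fake_generate(prompt):
    """Stub so the sketch runs end to end: 100 tokens in ~10 ms."""
    time.sleep(0.01)
    return ["tok"] * 100

rate = measure_tok_per_sec(fake_generate, "prompt")
```

Batch-8 or warm-cache numbers would look better; single-request cold-start is the honest one for interactive editing.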

---

## Real-World Performance Data

Per [our benchmarks](https://morphllm.com/benchmarks), FastApply completes end-to-end tasks ~35% faster than search & replace methods. Typical apply latency is 1-3 seconds, depending on file size and network conditions.

**Example use cases**:
- **Live refactoring**: Make edits across multiple files with real-time feedback
- **Agent workflows**: Coordinated edits with high accuracy  
- **Speculative editing**: Apply suggestions as the model generates them
- **Interactive changes**: Quick feedback loop for iterative development

---

## Limitations & Future Work

**Current limitations**:
- Works best on structured languages (Python, JS, Go). Struggles with heavily macro-based C++
- Requires specific input format (`// ... existing code ...` markers)
- Single-request only (no conversation context)

**Next targets**:
- **15k+ tok/sec**: New Blackwell B200 kernels in development
- **Multi-file edits**: Atomic operations across entire repositories
- **Sub-100ms p50**: Moving inference closer to edge locations

---

## Try It Yourself

**API endpoint**: `POST https://api.morphllm.com/v1/apply`

```bash
curl -X POST https://api.morphllm.com/v1/apply \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "morph-v3-fast",
    "original_code": "def hello():\n    print(\"world\")",
    "edit_snippet": "def hello():\n    print(\"universe\")\n    // ... existing code ..."
  }'
```
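
The same request from Python, using only the standard library. The payload fields mirror the curl example; the response is returned as parsed JSON, whatever shape the API gives back:

```python
import json
import urllib.request

API_URL = "https://api.morphllm.com/v1/apply"

def build_payload(original_code, edit_snippet, model="morph-v3-fast"):
    """Assemble the request body used by the apply endpoint."""
    return {
        "model": model,
        "original_code": original_code,
        "edit_snippet": edit_snippet,
    }

def apply_edit(api_key, original_code, edit_snippet):
    """POST one edit and return the parsed JSON response."""
    data = json.dumps(build_payload(original_code, edit_snippet)).encode()
    req = urllib.request.Request(
        API_URL,
        data=data,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:  # network call
        return json.load(resp)
```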

**Pricing**: $0.80/1M input, $1.20/1M output

[Get API key](/dashboard/api-keys) • [Full docs](https://docs.morphllm.com) • [Benchmarks](https://morphllm.com/benchmarks)

Want to help us hit 15k tok/sec? Join us: info@morphllm.com
