---
title: "Codegen Is an Inference Research Problem"
url: "https://www.morphllm.com/blog/codegen-inference-research"
description: "General-purpose inference optimizes for the average token. Codegen isn't average. Morph serves 100B+ tokens a day as a top-25 OpenRouter provider, and attacks the full stack: custom speculators, machine-written kernels verified against production traces, RL-trained subagents, and GPU environments other labs can't use."
date: "2026-06-10"
author: "Tejas Bhakta"
---
# Codegen Is an Inference Research Problem

Every computing substrate in history starts general and ends specialized. The CPU gave way to the GPU. The GPU is giving way to the ASIC. The same force is now working on intelligence itself.

The first era of AI inference was general-purpose: one serving stack for every token, tuned for the average prompt, the average output, the average hardware. Codegen is not average. The gap between average-token inference and codegen-shaped inference is not 20%. It is an order of magnitude.

Closing it is a research problem. Morph is the lab working on it.

## Codegen has structure. General inference throws it away.

Watch what a coding agent does between thoughts. It applies edits, where the output is 95% identical to the input. It searches, where the work is wide, parallel, and short. It compacts context, where most tokens are redundant by construction. Cognition measured agents spending [over 60% of their time on retrieval alone](/blog/code-search-bottleneck).

A general serving stack sees none of this. It decodes merged code one token at a time, as if it were novel prose, while 95% of the answer sits in the prompt.

This is free energy. It will be collected by whoever owns enough of the stack to collect it: the model, the speculator, the kernel, the hardware. So we own all four.

## The research program

**Custom speculators.** A generic draft model captures a fraction of what task structure allows, so we build speculators per task. Our production apply speculator drafts 64 tokens per step by exploiting input-output similarity. Fast Apply merges edits at 10,500+ tokens per second, about 4x the throughput of the fastest general model on Cerebras. People say variable-length inference is hopelessly memory-bandwidth bound. That is an admission you don't know what to do with your spare compute: a bandwidth-bound decoder leaves its arithmetic units idle, and verifying a long draft in a single pass is how you spend them.
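To make the idea of drafting from input-output similarity concrete, here is a minimal sketch of copy-based speculative decoding for an edit task. It is not Morph's speculator. The function names, the 3-token match window, and the `next_token` oracle standing in for the target model are all assumptions for illustration, and the per-token verification loop simulates what a real system would score in one batched forward pass.

```python
from typing import Callable, List, Sequence, Tuple


def draft_from_source(source: Sequence[int], generated: Sequence[int],
                      k: int = 64, ngram: int = 3) -> List[int]:
    """Propose up to k tokens by copying from the original file.

    Hypothetical helper: finds the first place in `source` that matches the
    last `ngram` generated tokens and copies what follows it.
    """
    if not generated:
        return list(source[:k])
    n = min(ngram, len(generated))
    tail = list(generated[-n:])
    for i in range(len(source) - n + 1):
        if list(source[i:i + n]) == tail:
            return list(source[i + n:i + n + k])
    return []  # no match: fall back to plain one-token decoding


def speculative_decode(source: Sequence[int],
                       next_token: Callable[[List[int]], int],
                       eos: int, k: int = 64) -> Tuple[List[int], int]:
    """Greedy decode with copy drafts. Returns (output, verification passes)."""
    out: List[int] = []
    steps = 0
    while True:
        draft = draft_from_source(source, out, k)
        steps += 1  # one simulated batched verification pass
        for tok in draft:
            # Accept draft tokens only while they match the target model.
            if tok == eos or next_token(out) != tok:
                break
            out.append(tok)
        # The target model always contributes one token of its own.
        tok = next_token(out)
        if tok == eos:
            return out, steps
        out.append(tok)
```

Under these assumptions, a 100-token file with a single changed token at position 50 is reproduced exactly in 5 verification passes instead of one pass per token. The output is always what the target model would have produced on its own, because a draft token is kept only when the target agrees with it.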

**Kernels written and verified by machines.** We tried every autoresearch kernel product on the market. They are all terrible in the same way: none of them run production inference workloads, so the kernel that passes their harness still throws an illegal memory access at batch size 21 with a 21,246-token input. We know because we hit that bug.

Morph is a top-25 provider on OpenRouter serving 100B+ tokens a day, so our harness replays the traces those products have never seen. An autoresearch loop proposes kernels and searches the design space. A verification loop proves correctness and benchmarks against production traffic before anything ships. Kernel engineering is the scarcest skill in AI infrastructure. We turned it into a search problem.
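The gap between a synthetic harness and replayed traffic can be sketched with a small differential test. Everything here is hypothetical: the "kernel" is a pure-Python tiled row sum with a deliberately planted bug, the shapes are stand-ins for recorded (batch size, sequence length) pairs, and comparing against a reference implementation is a much weaker check than a correctness proof.

```python
import random
from typing import Callable, List, Sequence, Tuple

Shape = Tuple[int, int]  # (batch_size, sequence_length)
Batch = List[List[int]]


def reference_row_sums(batch: Batch) -> List[int]:
    """Slow but obviously correct reference."""
    return [sum(row) for row in batch]


def candidate_row_sums(batch: Batch, tile: int = 128) -> List[int]:
    """Tiled candidate with a planted bug: it assumes the row length is a
    multiple of `tile`, so a ragged final tile reads out of bounds."""
    out = []
    for row in batch:
        total = 0
        for start in range(0, len(row), tile):
            for j in range(tile):
                total += row[start + j]  # IndexError on a ragged last tile
        out.append(total)
    return out


def verify(candidate: Callable[[Batch], List[int]],
           reference: Callable[[Batch], List[int]],
           shapes: Sequence[Shape], seed: int = 0) -> List[Shape]:
    """Replay each shape and return the ones where the candidate crashes
    or disagrees with the reference."""
    rng = random.Random(seed)
    failures: List[Shape] = []
    for batch_size, seq_len in shapes:
        batch = [[rng.randint(-9, 9) for _ in range(seq_len)]
                 for _ in range(batch_size)]
        try:
            ok = candidate(batch) == reference(batch)
        except Exception:
            ok = False  # a crash counts as a failure, not a skipped case
        if not ok:
            failures.append((batch_size, seq_len))
    return failures
```

With tidy shapes such as `(8, 1024)` and `(16, 2048)` the candidate passes, because both lengths are multiples of the tile size. Adding a shape like `(21, 21246)` exposes the out-of-bounds read, which is the kind of case only odd-shaped real traffic tends to supply.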

**Hardware nobody else can use.** Frontier labs bid against each other for NVLink-connected clusters because their workloads collapse without premium interconnect. Ours don't. Our kernels and workload shapes are co-designed for NVLink-denied environments, compute the rest of the market has written off. Structurally cheaper inputs, at any scale.

**RL-trained subagents.** [WarpGrep](/blog/warpgrep-v2) is a search model trained with RL to run 8 parallel tool calls per turn across a 4-turn workflow. Paired with WarpGrep, every major coding model reaches #1 on SWE-Bench Pro, at 15.6% lower cost and 28% less time. It was trained on the thing nobody else has: our traffic.
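The shape of that workflow, a bounded number of turns with a fixed fan-out of concurrent tool calls in each, can be sketched with `asyncio`. This shows only the control flow, not WarpGrep itself: `plan_turn` is a hypothetical stand-in for the trained policy, `run_tool` for whatever executes a search, and the two limits are taken from the figures above.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

MAX_TURNS = 4        # turns per search, from the description above
CALLS_PER_TURN = 8   # parallel tool calls per turn

Turn = List[Tuple[str, str]]  # (tool call, result) pairs for one turn


async def run_search(
    query: str,
    plan_turn: Callable[[str, List[Turn]], List[str]],
    run_tool: Callable[[str], Awaitable[str]],
) -> List[Turn]:
    """Run up to MAX_TURNS turns, issuing each turn's calls concurrently."""
    history: List[Turn] = []
    for _ in range(MAX_TURNS):
        # The policy sees the query plus everything found so far.
        calls = plan_turn(query, history)[:CALLS_PER_TURN]
        if not calls:
            break  # the policy decided it has enough context
        # All calls in a turn run concurrently; gather preserves call order.
        results = await asyncio.gather(*(run_tool(c) for c in calls))
        history.append(list(zip(calls, results)))
    return history
```

Because the calls within a turn run concurrently, each turn costs roughly as long as its slowest call rather than the sum of all eight, which is what keeps a wide search short.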

## The flywheel

Morph processes millions of edit, search, and compaction calls against production codebases from teams like Framer, Webflow, Block, and Vercel. That traffic trains the next generation of speculators, kernels, and subagents. The techniques are publishable, and we publish them. The data is not replicable from a paper.

Frontier models keep moving upmarket, and every task they leave behind falls to a specialized system that does it 100x cheaper. [This layer is permanent.](/blog/everything-is-models) It runs at a volume the frontier never will. Whoever owns its full stack owns the margin of the agent economy.

## The first 10-person billion-dollar company

That is what we intend to be. The previous world selected for talented engineers who could manage people. This world selects for talented engineers who can manage the most agents. Some people swear that managing multiple agents is impossible. We know it isn't.

We live in the first era where reviewing code takes longer than writing it. The old world tolerated engineers of every skill level because review was cheap relative to writing. That subsidy is gone. Our team is extremely lean, and we only hire the best.

## Where this goes

$1.1M in revenue in our first five months. Four layers under one roof: model, speculator, kernel, hardware. The roadmap is to collapse them into a single co-designed system, then point the autoresearch loop at more of its own stack.

The frontier labs are racing to build a mind that can do everything. We are building the layer that mind will delegate to.

If you're shipping a coding agent, [the stack is one import away](https://docs.morphllm.com).
