---
title: "Compute Scarcity Is Morphing the Market for Kernels"
url: "https://www.morphllm.com/blog/compute-scarcity-kernels"
description: "When compute is abundant, inefficiency hides in the noise and nobody gets paid to find it. When it gets scarce, every crack in the stack gets a price. That price signal is fragmenting inference into a market of workload-specialized kernels."
date: "2026-06-18"
author: "Tejas Bhakta"
---
# Compute Scarcity Is Morphing the Market for Kernels

Abundant compute makes everyone lazy in the same direction. When H100s are everywhere, the fix for a slow workload is more H100s. The inefficiency is still there. It just hides in the noise, and nobody gets paid to find it.

Scarcity changes the math. The inefficiency gets a price, and the price keeps going up.

The way I think about it is fracking. For decades there were pockets of oil everyone knew about and nobody touched, because they were too small, too tight, too annoying to reach. Then the price moved and the economics flipped. The annoying pockets turned out to be the whole game.

Kernels are the same. In a world of infinite compute, a custom kernel for some niche architecture, a speculator tuned to one workload, a 20% win on a path most people never run, none of it is worth the engineering time. In the world we actually have, every one of those is a margin:

- kernels for the architectures the big libraries never bothered to support
- inference paths shaped around one workload instead of the average token
- hardware-aware tricks that only pay off at scale
- the 20% speedups that decide whether the unit economics close
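
To make the last bullet concrete, here is a back-of-the-envelope sketch of how a 20% throughput gain moves cost per million tokens. Every number in it (GPU price, baseline throughput, selling price) is a hypothetical picked for illustration, not a figure from any real deployment.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Serving cost in dollars per one million generated tokens on one GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical inputs, chosen only to show the shape of the argument.
GPU_COST_PER_HOUR = 3.00     # dollars per GPU-hour
BASELINE_TOK_PER_S = 1000.0  # throughput on the general stack
PRICE_PER_MILLION = 0.80     # dollars charged per million tokens

baseline_cost = cost_per_million_tokens(GPU_COST_PER_HOUR, BASELINE_TOK_PER_S)
# A 20% speedup means 1.2x the tokens for the same GPU-hour.
tuned_cost = cost_per_million_tokens(GPU_COST_PER_HOUR, BASELINE_TOK_PER_S * 1.2)

for label, cost in (("baseline", baseline_cost), ("tuned", tuned_cost)):
    margin = PRICE_PER_MILLION - cost
    print(f"{label}: cost ${cost:.3f}/M tokens, margin ${margin:+.3f}/M tokens")
```

With these made-up inputs the baseline loses money on every million tokens and the tuned path makes money. The speedup is the same 20% either way; it only matters because the price sits between the two costs.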

People get the next part backwards. Scarcity doesn't slow this work down. It's the reason the work happens at all. The price signal pulls engineering into the cracks the market spent years stepping over, and what comes out is not one clean general stack. It's a fragmented, faster software layer: a market of specialized kernels, compiler passes, and model-specific hacks, each one pulling a pocket of value the general stack left in the ground.

The general stack can't follow you down there. That's the structural part. A horizontal runtime optimizes for the average token because it has to serve every model at once. The moment it tunes itself to your workload, it stops being horizontal, so it can never quote the numbers a specialist quotes. That isn't a skill gap. It's the position it chose.
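
A toy sketch of that structural point, assuming a runtime must ship one configuration for every workload while a specialist picks per workload. The workload names, config names, and latencies below are all invented for illustration.

```python
# Hypothetical latency in ms per token for three workloads under three configs.
LATENCY_MS = {
    "chat":       {"config_a": 10.0, "config_b": 12.0, "config_c": 15.0},
    "code_apply": {"config_a": 14.0, "config_b": 11.0, "config_c": 7.0},
    "long_doc":   {"config_a": 18.0, "config_b": 13.0, "config_c": 16.0},
}
CONFIGS = ["config_a", "config_b", "config_c"]


def mean_latency(config: str) -> float:
    """Average latency of one config across every workload it must serve."""
    return sum(row[config] for row in LATENCY_MS.values()) / len(LATENCY_MS)


# The horizontal runtime picks the single config with the best average.
horizontal_choice = min(CONFIGS, key=mean_latency)

# A specialist picks the best config for each workload independently.
specialist_choice = {
    workload: min(row, key=row.get) for workload, row in LATENCY_MS.items()
}

for workload, row in LATENCY_MS.items():
    best = specialist_choice[workload]
    print(f"{workload}: horizontal {row[horizontal_choice]} ms, "
          f"specialist {row[best]} ms ({best})")
```

The single best-on-average config is never better than the per-workload pick, and on the workload it fits worst the gap is large. That holds for any table like this one, not just these numbers.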

We took the other side. One workload, the coding agent loop, and we own all four layers under it: the model, the speculator, the kernel, the hardware. For the open models we serve, we train our own speculative decoders on the distribution of coding outputs instead of average web text, because a speculator that has read a million diffs predicts the next token far better than one that has read the internet. Apply runs at 10,500 tok/s. Compaction runs at 33,000 tok/s. The kernels are written and checked against production traces, not benchmarks. The long version of that argument is [here](/blog/codegen-inference-research).
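
Why the speculator's training distribution matters can be sketched with the standard simplified model from the speculative decoding literature: if each drafted token is accepted independently with probability `acceptance_rate`, and the speculator drafts `draft_length` tokens per step, the expected number of tokens emitted per target-model pass is `(1 - a^(k+1)) / (1 - a)`. The two acceptance rates below are hypothetical stand-ins for a web-text speculator and a code-tuned one, not measured values.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_length: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Simplifying assumption: every drafted token is accepted independently
    with the same probability. Real acceptance varies token by token.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_length < 0:
        raise ValueError("draft_length must be non-negative")
    if acceptance_rate == 1.0:
        # Every draft token accepted, plus the one token the target adds.
        return float(draft_length + 1)
    return (1 - acceptance_rate ** (draft_length + 1)) / (1 - acceptance_rate)


# Hypothetical acceptance rates, for illustration only.
generic = expected_tokens_per_step(acceptance_rate=0.60, draft_length=5)
code_tuned = expected_tokens_per_step(acceptance_rate=0.85, draft_length=5)

print(f"generic speculator:    {generic:.2f} tokens per target pass")
print(f"code-tuned speculator: {code_tuned:.2f} tokens per target pass")
```

Under these assumed rates the code-tuned speculator yields roughly 4.2 tokens per target pass against roughly 2.4 for the generic one. The target model and the hardware are identical in both cases; the only thing that changed is how well the draft matches the workload.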

Scarcity doesn't constrain innovation. It's the thing that finally pays for it. The boring infrastructure was always where the value sat. Now there's a price signal pointing straight at it.