We profiled Qwen3-Next-80B running on NVIDIA RTX PRO 6000 Blackwell GPUs with TensorRT-LLM. The MoE layers consume 73.6% of all GPU time during token generation. Within each MoE layer, 29% of that time is matrix multiplication. The remaining 71% is spent copying data between formats, quantizing activations, scattering inputs to experts, running separate activation kernels, and combining results.
Five of the eight pipeline stages in CUTLASS's MoE path do zero useful compute at batch size 1. They reshape data for the expert-centric GEMM libraries.
We wrote CUDA kernels that skip all of it by organizing the computation around outputs instead of experts. Our SM120 implementation reaches 162 tok/s, a 1.67x improvement over TRT-LLM's CUTLASS baseline, with no accuracy loss.
We optimized this specifically for Blackwell, and more specifically for the RTX PRO 6000.
Why RTX PRO 6000?
- Very poor existing kernel support, which means headroom for custom kernels
- Very high availability
- Very low cost - theoretical FLOPs/$ can't be beat
Cost per token
The RTX PRO 6000 costs roughly $7,000. An H100 costs $25,000+. A B200 costs $35,000+.
| GPU | Price | MoE 80B tok/s | Cost per M tokens* |
|---|---|---|---|
| RTX PRO 6000 (warp-decode) | ~$7K | 162 | ~$0.012 |
| RTX PRO 6000 (baseline) | ~$7K | 97 | ~$0.020 |
| H100 SXM (TRT-LLM) | ~$25K | ~120 | ~$0.035 |
| B200 (est.) | ~$35K | ~180 | ~$0.028 |
*$0.50/GPU-hour amortized, output tokens only, batch=1. B200 and H100 numbers from public benchmarks.
At 162 tok/s, warp decode delivers 1.67x the throughput of the CUTLASS baseline on the same $7K card. Against the H100, per-token cost is roughly 3x lower on hardware that costs under a third as much.
Where the time goes
Profiled with nsys (--cuda-graph-trace=node) across 64 decode tokens:
Inside each MoE layer (145 us × 48 layers):
The marked items total 48.1 us per layer. The GEMMs total 41.4 us. More time reshaping data than multiplying matrices.
Warp decode
At batch size 1, one token routes to 10 of 512 experts. Each expert processes one input vector. The expert-centric GEMM pipeline scatters that single vector 10 times, quantizes it to FP4, runs 10 small GEMMs with tile-based CUTLASS kernels, runs a separate SiLU kernel, then combines the outputs.
Warp decode assigns each GPU warp (32 threads) to one output element. The warp streams the weight rows it needs directly from memory, dequantizes FP4 values in registers, and writes one scalar. No scatter, no intermediate buffers, no separate activation kernel.
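A CPU sketch of this output-centric layout, in Python with illustrative shapes (not the model's real dimensions, and bf16/FP4 handling omitted), makes the contrast concrete: each "warp" owns one output element and touches only the weight row it needs.

```python
import numpy as np

def warp_decode_matvec(weight, x):
    """One 'warp' per output element: stream the needed weight row,
    dot it with the shared activation, write one scalar. No scatter,
    no intermediate buffers, no separate activation kernel."""
    out = np.empty(weight.shape[0], dtype=np.float32)
    for row in range(weight.shape[0]):   # one warp per output element
        out[row] = weight[row] @ x       # row streamed once from memory
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 256)).astype(np.float32)  # illustrative shapes
x = rng.standard_normal(256).astype(np.float32)
y = warp_decode_matvec(w, x)
```

The expert-centric pipeline instead copies `x` once per expert before any math happens; here the activation is simply shared by every row's dot product.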
Gate+Up kernel
5,120 warps (10 experts × 512 neurons). Each warp:
Gate and up share a single activation load. SiLU folds into the epilogue. One kernel replaces three CUTLASS launches.
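A minimal Python model of that fusion (hypothetical shapes; the real kernel dequantizes FP4 weights in registers): gate and up read the same activation, and SiLU runs in the epilogue rather than as a separate launch.

```python
import numpy as np

def silu(v):
    return v / (1.0 + np.exp(-v))

def gate_up_fused(w_gate, w_up, x):
    """One 'warp' per (expert, neuron) pair: x is loaded once and
    shared by the gate and up dot products; SiLU is applied in the
    epilogue instead of a separate kernel."""
    g = w_gate @ x            # gate projection
    u = w_up @ x              # up projection, sharing the x load
    return silu(g) * u        # fused activation epilogue

rng = np.random.default_rng(1)
wg = rng.standard_normal((10, 512, 256)).astype(np.float32)  # illustrative
wu = rng.standard_normal((10, 512, 256)).astype(np.float32)
x = rng.standard_normal(256).astype(np.float32)
h = gate_up_fused(wg, wu, x)  # (10, 512): one value per warp
```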
Down kernel
2,048 warps (one per hidden dimension). Each warp loops over 10 experts:
The routing weight combination happens inside the accumulator. The per-expert outputs never materialize in memory.
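Sketched in Python with illustrative shapes, the down projection's cross-expert accumulation looks like this; because the routing weight folds into the accumulator, no per-expert output tensor is ever written:

```python
import numpy as np

def down_combine(w_down, h, route_w):
    """One 'warp' per hidden dimension, looping over the active
    experts. The routing weight is folded into the accumulator, so
    per-expert outputs never hit memory."""
    experts, hidden, _ = w_down.shape
    out = np.zeros(hidden, dtype=np.float32)
    for d in range(hidden):               # one warp per output element
        acc = np.float32(0.0)
        for e in range(experts):          # loop over routed experts
            acc += route_w[e] * (w_down[e, d] @ h[e])
        out[d] = acc
    return out

rng = np.random.default_rng(2)
wd = rng.standard_normal((10, 64, 32)).astype(np.float32)  # illustrative
inter = rng.standard_normal((10, 32)).astype(np.float32)   # expert intermediates
rw = rng.standard_normal(10).astype(np.float32)            # routing weights
y = down_combine(wd, inter, rw)
```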
What changes
| CUTLASS stage | us/layer | After warp decode |
|---|---|---|
| Expert scatter (copy input 10×) | 18.9 | Eliminated |
| FP4 activation quantize | 5.7 | Bypassed (bf16 input) |
| SiLU activation kernel | 13.7 | Fused in gate_up |
| TMA stride computation | 6.7 | Eliminated |
| Scale format conversion | 3.1 | Computed in registers |
| Gate+Up + Down GEMMs | 41.4 | 18.5 (warp-decode) |
| Total | 145 | ~38 |
FP4 dequantization without a lookup table
NVFP4 packs two 4-bit weights per byte. The standard approach uses a 16-entry lookup table in CUDA constant memory. When 5,120 warps access the LUT simultaneously with divergent nibble values, the constant cache serializes. This added 41 us to a 6 us kernel.
We construct the IEEE 754 float directly from the 4-bit encoding:
Three shifts, two masks, one conditional move. The compiler emits a predicated FSEL. Gate+Up kernel latency dropped from 47 us to 10.3 us.
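A Python sketch of the bit construction, assuming the standard E2M1 nibble layout (1 sign bit, 2 exponent bits, 1 mantissa bit); this mirrors the idea, not the kernel's exact instruction sequence:

```python
import struct

def fp4_e2m1_to_float(nibble: int) -> float:
    """Decode one E2M1 nibble by building the IEEE 754 float32 bit
    pattern directly -- no lookup table."""
    s = (nibble >> 3) & 1
    e = (nibble >> 1) & 3
    m = nibble & 1
    if e != 0:
        # Normal value: E2M1 bias is 1, float32 bias is 127, so the
        # float32 exponent field is 126 + e; the single mantissa bit
        # lands at float32 bit 22.
        bits = (s << 31) | ((126 + e) << 23) | (m << 22)
    else:
        # Subnormal: 0.0 or 0.5 -- the conditional-move case.
        bits = (s << 31) | ((126 << 23) if m else 0)
    return struct.unpack('<f', struct.pack('<I', bits))[0]

# Reproduces the 16-entry value table: +/-{0, 0.5, 1, 1.5, 2, 3, 4, 6}
values = [fp4_e2m1_to_float(n) for n in range(16)]
```

Because every warp computes its own value from its own nibble, there is no shared table to contend on.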
Strategies evaluated (not used)
- Async bulk memory ops: Extra setup costs outweighed any acceleration for our typical size.
- Specialized tensor core instructions: Hardware tile size restrictions make direct use unwieldy at small dimensions.
- Shared memory buffering: Experiments with explicit staging added unnecessary synchronization and no observable gain.
- Built-in FP4 decode primitives: Hardware implementations required extra conversions and ended up slower than custom math.
- Batching beyond a single token: for larger batches (16+), established grouped-GEMM libraries increasingly outperform; our approach is optimized for minimal-batch decode.
TRT-LLM CUDA graph integration
TRT-LLM captures the forward pass as a CUDA graph during warmup and replays it during inference, eliminating Python dispatch overhead. Any custom kernel must be captured correctly during graph construction and replay with identical tensor addresses.
We inject at ConfigurableMoE._forward_chunk_impl inside TRT-LLM's moe_custom_op boundary. At this level, CUDA graphs capture our kernel launches during the batch-1 decode warmup pass. For batch=1 bf16, the patch bypasses Steps 4-5-6 (quantization + CUTLASS) and runs two warp-decode kernel calls. For other batch sizes, it falls through to CUTLASS.
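The dispatch rule reduces to a sketch like the following (names are illustrative, not TRT-LLM's actual API):

```python
# Hypothetical sketch of the dispatch described above. Batch-1 bf16
# decode takes the warp-decode path; everything else falls through
# to the CUTLASS grouped-GEMM pipeline.
def select_moe_path(batch_size: int, dtype: str) -> str:
    if batch_size == 1 and dtype == "bfloat16":
        return "warp_decode"           # two fused kernel launches
    return "cutlass_grouped_gemm"      # quantize + grouped-GEMM pipeline
```

Because warmup runs a batch-1 decode pass, CUDA graph capture records the warp-decode branch, and replay hits those kernels with the captured tensor addresses.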
The extension registers via TORCH_LIBRARY with Meta dispatch stubs for torch.compile compatibility.
Getting this right took 22 integration iterations (v5 through v22). The first 21 either produced correct output without CUDA graphs or fast output with garbled text. v22 was the first to achieve both.
Results
| Configuration | tok/s | Speedup | Output |
|---|---|---|---|
| TRT-LLM baseline (CUTLASS) | 97 | 1.0x | Correct |
| Warp-decode v22 | 162 | 1.67x | Correct |
Bypassing FP4 activation quantization (bf16 activations, FP32 accumulators throughout) means our output is closer to FP32 ground truth than the CUTLASS FP4×FP4 path. Cursor observed the same effect: 1.4x closer to 32-bit reference.
The speedup comes from the warp-per-element decode kernel eliminating the scatter/gather/padding overhead of grouped GEMM at small batch sizes. Grouped GEMM must reshape inputs into expert-shaped tiles regardless of how many tokens route to each expert. At batch=1, that reshaping dominates. At batch=8 the advantage narrows as expected: grouped GEMM amortizes the reshape cost across more rows, and CUTLASS tile utilization improves with larger M dimensions.
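As a rough illustration, treating the ~48.1 us of per-layer reshape overhead measured above as approximately batch-independent, its per-token share falls as 1/M:

```python
# Rough model: the reshape overhead is paid once per layer per step,
# so its per-token share shrinks as batch size M grows. 48.1 us is
# the per-layer figure measured above; assuming it stays flat with
# batch size is an approximation.
reshape_us = 48.1
per_token = {m: reshape_us / m for m in (1, 2, 4, 8, 16)}
for m, us in per_token.items():
    print(f"batch={m:2d}: reshape overhead per token ~ {us:.1f} us")
```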
Next
The router GEMV is 34.1 us/layer (26% of remaining decode time). We built a warp-decode bf16 GEMV at 4.1 us but have not integrated it into production. Adding it projects to ~199 tok/s, a further 23% improvement.
Here at Morph, we're building specialized models and specialized inference engines for each one. If you're
All measurements on NVIDIA RTX PRO 6000 Blackwell Server Edition (SM120, 96 GB GDDR7, ~1.5 TB/s measured bandwidth). Model: Qwen3-Next-80B-A3B-Instruct-NVFP4 on TensorRT-LLM 1.3.0rc10.
