---
title: "One Backbone, Many Reflexes"
url: "https://www.morphllm.com/blog/reflex-inference-engine"
description: "A reflex is a classifier that has to read every agent turn, up to 64k tokens, and land a label in under 90ms. Put a frontier model on that job and you double your inference bill. So we took the open engine we run for codegen, SGLang, and wrote a backend that reads each turn once and labels it with many tiny heads. The tenth reflex costs almost nothing."
date: "2026-06-23"
author: "Tejas Bhakta"
---
# One Backbone, Many Reflexes

A [reflex](/products/reflex) is a fast LLM-based classifier that watches an agent. Is the user frustrated. Is this a jailbreak. Is the agent stuck in a loop. The hard part was never deciding what to detect. It was the serving cost.

To be worth anything, a reflex has to run on *every* turn, read up to 64k tokens of context, and return a label in under 90ms. Put a frontier model on that job and you roughly double your inference bill, because the model re-reads the full conversation before it answers. The watcher ends up costing as much as the agent. So the detector has to be tiny, and you have to serve a lot of them cheaply on the same traffic. That is an inference problem, and it's the one this post is about. (If you want the case for *why* you need per-turn detection at all, that's [here](/blog/agent-failures-dont-throw).)

## The encode is the expensive part

Asking a model "is this a jailbreak, yes or no" reads the whole conversation first, then writes one token. The read is up to 64k tokens. The answer is the cheap part. So a prompt-per-model design pays for the conversation once per question. Ten reflexes, ten reads of the same turn.

<PromptPerModelDiagram />

The work that scales is the conversation, not the label. Once you see that, the design writes itself. Read the turn once. Ask every question off that single read.
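A back-of-envelope sketch of that difference. It assumes encode cost scales with tokens read and ignores the one-token answer. The function name and the ten-reflex count are illustrative, not part of any real API.

```python
def tokens_read(context_tokens: int, num_reflexes: int, shared_read: bool) -> int:
    """Tokens the serving stack has to encode to label one turn."""
    # Prompt-per-model: every reflex re-reads the conversation.
    # Shared read: the conversation is encoded once for all reflexes.
    reads = 1 if shared_read else num_reflexes
    return context_tokens * reads


CONTEXT_TOKENS = 64_000  # the upper bound on context from this post
NUM_REFLEXES = 10        # illustrative count

per_prompt = tokens_read(CONTEXT_TOKENS, NUM_REFLEXES, shared_read=False)
shared = tokens_read(CONTEXT_TOKENS, NUM_REFLEXES, shared_read=True)

print(f"prompt-per-model: {per_prompt:,} tokens")
print(f"shared read:      {shared:,} tokens")
print(f"ratio:            {per_prompt // shared}x")
```

The ratio is just the number of reflexes, which is the point: under the first design every new question is another full read.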

## Shared backbone, many heads

A transformer's forward pass has two parts: a backbone that turns the conversation into activations, and a head that turns activations into an answer. The backbone is almost all of the compute. The head is a thin layer on top.

This is an old idea pointed at a modern architecture. [MT-DNN](https://aclanthology.org/P19-1441.pdf) shared one BERT encoder across many language tasks back in 2019 and hung a small task-specific head on each. [MTFormer](https://link.springer.com/chapter/10.1007/978-3-031-19812-0_18) showed the same for vision transformers: lightweight per-task branches off one shared trunk, which cuts the time and space cost instead of adding to it. One encode, many heads.

So we run one backbone and bolt many heads onto it. The conversation gets encoded once. Each reflex is a small head reading the same activations and emitting its own label.

<ReflexBackboneDiagram />

Adding a reflex means adding a head, not another pass over the conversation. That's why a tenth reflex costs almost nothing, and why the whole stack lands a label in under 90ms end to end.
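Here is a toy version of the shape, not our production code. The `backbone` below is a stand-in that fakes an expensive encode, the head weights are made up, and the reflex names are only examples. What it shows is the call pattern: one encode per turn, then one thin layer per reflex.

```python
import math
from typing import Callable

calls = {"backbone": 0}  # counts how often the expensive part runs


def backbone(tokens: list[int], dim: int = 4) -> list[float]:
    """Toy stand-in for the expensive encode: one pooled activation vector."""
    calls["backbone"] += 1
    return [
        sum(math.sin(t * (i + 1)) for t in tokens) / len(tokens)
        for i in range(dim)
    ]


def make_head(weights: list[float], bias: float) -> Callable[[list[float]], float]:
    """A head is a thin layer: dot product, bias, sigmoid."""
    def head(hidden: list[float]) -> float:
        z = sum(w * x for w, x in zip(weights, hidden)) + bias
        return 1.0 / (1.0 + math.exp(-z))
    return head


# Hypothetical reflexes with made-up weights.
heads = {
    "frustrated": make_head([0.5, -1.0, 0.2, 0.0], 0.1),
    "jailbreak": make_head([-0.3, 0.8, 0.0, 1.1], -0.2),
    "stuck_in_loop": make_head([1.2, 0.1, -0.7, 0.4], 0.0),
}


def label_turn(tokens: list[int]) -> dict[str, float]:
    hidden = backbone(tokens)  # the conversation is encoded once
    return {name: head(hidden) for name, head in heads.items()}


scores = label_turn([101, 7592, 2088, 102])
print(scores)
print("backbone calls:", calls["backbone"])
```

Adding a fourth entry to `heads` adds one more dot product. The backbone call count stays at one.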

Standard serving won't do this. It assumes one model, one forward pass, one output. So we took the open inference engine we already run for codegen, [SGLang](https://github.com/sgl-project/sglang), and wrote a custom backend that fans a single backbone pass out to many heads in the same step. The activations are computed once and shared across every head before the GPU moves on.
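One way to picture "the same step", sketched under assumptions: if every head is a linear layer over the same activations, the heads can be stacked into one matrix and applied in a single pass. This is plain Python for illustration. It is not SGLang's API and not a description of our backend's internals; `fan_out` and the numbers are hypothetical.

```python
def fan_out(
    hidden: list[float],
    head_matrix: list[list[float]],
    biases: list[float],
) -> list[float]:
    """Apply all stacked heads to one activation vector: one logit per reflex."""
    return [
        sum(w * x for w, x in zip(row, hidden)) + b
        for row, b in zip(head_matrix, biases)
    ]


hidden = [1.0, 2.0, -1.0]  # shared activations from one backbone pass
head_matrix = [            # one row per reflex head (made-up weights)
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.5, 0.5, 0.5],
]
biases = [0.0, -1.0, 0.25]

logits = fan_out(hidden, head_matrix, biases)
print(logits)  # [1.0, 0.0, 1.25]
```

On a GPU the same idea is a single matrix multiply, so a new head is one more row rather than one more forward pass.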

## Why the models have to be tiny

You could skip all of this and point a big model at every turn. You can't afford it, which is the whole reason we're here. The detector has to be tiny.

Tiny models are the hard ones to serve well, because the backbone you're amortizing is small to begin with. A separate copy per reflex wastes most of the GPU on redundant reads. Sharing the backbone is what makes a tiny model cheap enough to leave running on everything.

Small-model training is the part we're good at. [Fast Apply](/products/fastapply) and [Compact](/products/compact) are both small models pushed well past where the scaling laws say they should plateau. Those laws assume training is the cost; for a model you train once and serve billions of times, the economics push hard toward small and overtrained. [SmolLM2](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B) is a 1.7B model trained to 11T tokens. [Sardana et al.](https://arxiv.org/abs/2401.00448) trained 47 models to 10,000 tokens per parameter with quality still climbing. A reflex head lives exactly there: small, overtrained, shaped to one signal.
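To put "overtrained" in numbers, here is the tokens-per-parameter arithmetic for the SmolLM2 figures above. The 20 tokens per parameter baseline is the commonly quoted compute-optimal rule of thumb, which I'm assuming here as a reference point. It does not come from either linked source.

```python
COMPUTE_OPTIMAL_TOKENS_PER_PARAM = 20  # assumed rule-of-thumb baseline


def tokens_per_param(train_tokens: float, params: float) -> float:
    """How many training tokens the model saw per parameter."""
    return train_tokens / params


smollm2 = tokens_per_param(train_tokens=11e12, params=1.7e9)
overtrain_factor = smollm2 / COMPUTE_OPTIMAL_TOKENS_PER_PARAM

print(f"SmolLM2-1.7B: about {smollm2:,.0f} tokens per parameter")
print(f"about {overtrain_factor:,.0f}x the assumed baseline")
```

That lands in the thousands of tokens per parameter, the same regime as the 10,000 tokens per parameter ceiling in the Sardana et al. runs.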

## Cheap compute is what buys coverage

It's tempting to read all of this as a cost-saving footnote. Cheaper inference, smaller bill, nice. That undersells it. The cost of the filter is not a line item. It is the thing that decides how much of production you are allowed to look at.

I learned this at Tesla, going through driving data at scale. The fleet produced petabytes. No model worth running could touch a petabyte of video, so the real question was never "what's the best detector," it was "what can I afford to run on all of it." Is the camera obscured. Is this a rare cut-in. Is this the long-tail case the model keeps getting wrong. The cheap classifier that ran on 100% of frames beat the accurate one that could only afford a 1% sample, because the rare event you care about is, by definition, not in the sample. Halve the cost of the filter and you don't save money. You double how much of reality you can see.

Agents are the same shape. Every turn in production is a frame. You will never read them all, and you can't afford a frontier model on each one, so the cost of the reflex is the cost of seeing your own product.
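The coverage argument in numbers. Everything here is hypothetical: the budget, the per-item costs, the recall figures, and the count of rare events are chosen only to show the shape of the trade. It assumes rare events are spread uniformly, so a sample catches them in proportion to its size.

```python
def coverage(budget: float, cost_per_item: float, items: int) -> float:
    """Fraction of items a fixed budget can afford to classify, capped at 1."""
    return min(1.0, budget / (cost_per_item * items))


def expected_catches(rare_events: int, cov: float, recall: float) -> float:
    """Expected rare events flagged: they must be both seen and recognized."""
    return rare_events * cov * recall


ITEMS = 1_000_000   # frames or turns (hypothetical)
RARE_EVENTS = 50    # rare cases hidden in the data (hypothetical)
BUDGET = 100.0      # arbitrary cost units

# Cheap filter: lower recall, but affordable on everything.
cheap_cov = coverage(BUDGET, cost_per_item=0.0001, items=ITEMS)
cheap = expected_catches(RARE_EVENTS, cheap_cov, recall=0.70)

# Accurate filter: higher recall, but 100x the cost, so it only sees a sample.
accurate_cov = coverage(BUDGET, cost_per_item=0.01, items=ITEMS)
accurate = expected_catches(RARE_EVENTS, accurate_cov, recall=0.99)

print(f"cheap:    coverage {cheap_cov:.0%}, expected catches {cheap:.1f}")
print(f"accurate: coverage {accurate_cov:.0%}, expected catches {accurate:.1f}")

# Halving the cost of the sampled filter doubles what it can see.
halved_cov = coverage(BUDGET, cost_per_item=0.005, items=ITEMS)
print(f"half the cost: coverage {halved_cov:.0%}")
```

With these made-up numbers the cheap filter expects to catch about 35 of the 50 events, and the accurate one about half of one. The better model loses because it never gets shown the frame.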

## Think of it as a funnel

Production data is a funnel, and the filter sits at the mouth.

<FunnelDiagram />

At the top is everything: 100% of turns, far too much for a human or a heavy model to read. Reflexes run there, on all of it, because they're tiny enough to. They flag a few percent. That flagged slice is now small enough to hand to the expensive stuff you always wanted to point at production but could never afford to run there: heavier agents that read the flagged turns in depth, and your own team working the handful that actually matter. Volume drops a thousandfold down the funnel; attention per item climbs the same way.

Minimizing compute at the top is what makes every stage below it possible. A cheaper reflex widens the mouth: run it on more traffic, or run more reflexes on the same traffic, and fewer rare events slip through unflagged. An expensive filter narrows the mouth back to a sample, and the sample is where the failure you needed to catch wasn't.
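A sketch of the funnel's arithmetic. The stage ratios are invented to match the rough shape described above (a few percent flagged, about a thousandfold drop overall). They are not measured figures.

```python
from fractions import Fraction


def funnel(total: int, keep_ratios: list[Fraction]) -> list[int]:
    """Volume remaining after each stage, starting from all turns."""
    volumes = [total]
    for ratio in keep_ratios:
        volumes.append(int(volumes[-1] * ratio))
    return volumes


TURNS = 1_000_000              # hypothetical production volume
REFLEX_FLAG = Fraction(3, 100)  # reflexes flag "a few percent" (assumed 3%)
ESCALATE = Fraction(1, 30)      # heavier agents pass on 1 in 30 (assumed)

volumes = funnel(TURNS, [REFLEX_FLAG, ESCALATE])
reduction = volumes[0] // volumes[-1]

for stage, volume in zip(["all turns", "flagged", "to the team"], volumes):
    print(f"{stage:12s} {volume:>9,}")
print(f"reduction: {reduction}x")
```

With a fixed amount of human or heavy-model attention, the per-item share at the bottom grows by the same factor the volume shrinks.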

<DataEngineDiagram />

And the funnel feeds itself. The turns reflexes surface today become the dataset that trains the next model, the eval that gates the next release, the reward term in the next RL run. A turn flagged today is the behavior the main agent learns tomorrow. That loop only closes if the detector is cheap enough to run on everything, which takes you right back to one backbone and many tiny heads.

The label was never the hard part. Running it on all of production was.

[Train a reflex on your own signal](https://www.morphllm.com/dashboard/reflex), or [read the docs](https://docs.morphllm.com/sdk/components/reflexes).
