A web server fails by throwing. You get a 500, a stack trace, a line number. An agent fails by succeeding at the wrong thing. It loops for twenty turns, refuses a reasonable request, gets talked out of its system prompt, closes the ticket by deleting the test. Every call returns 200 OK.
That's the gap. Your traces, your logs, your latency graphs all watch the plumbing. None of them watch the conversation.
Every call returned 200. The failures that matter are in the text, and nothing in the trace is looking at the text.
The stakes keep climbing while the visibility stays flat. Agents pull from more tools every month, call their own subagents, run for hours with no human in the loop, and now sit in healthcare, finance, and support queues where a quiet failure costs real money. The harder the agent works, the less a 200 tells you.
The failures live in the text
Error rate and latency already throw, so you already watch them. The signals that decide whether the agent is any good never reach your logs, because they're semantic. They live in what the agent and the user actually said:
- A user typing "this is the third time I've asked" and giving up.
- A prompt that talks the agent out of its own rules.
- An agent calling the same tool eight times, getting the same error, and trying again.
- A reasonable request the agent refused for no reason.
- The agent leaking its chain of thought into the reply.
Each one is invisible until something reads the turn and labels it. That something is a reflex: a small classifier that runs on a single turn and returns one answer. Not a score from one to ten that drifts when you reword the prompt. A label you can count, trend across a release, and page someone on when it spikes.
Eight reflexes, one API call
You don't train anything to start. Point a turn at the API and get a label back in under 90ms. These ship out of the box:
| Reflex | Fires when |
|---|---|
| User Frustration | the user is angry, repeating themselves, or giving up |
| Jailbreak | the input is trying to break the agent out of its instructions |
| Stuck in a loop | the agent is repeating an action with no progress |
| Guardrail | a turn crosses a policy line you defined |
| Incomplete thought | the agent stopped mid-task or left the job unfinished |
| Leaked thinking | internal reasoning leaked into a user-facing reply |
| Difficulty | the turn is hard or the agent needs more information |
| Ambiguity | the request is underspecified |
One call, one classifier, one label:
Send every turn through it. frustrated at 0.97 is a row in your dashboard, not a number buried in a transcript no one will read.
A rate you can alert on
One label is a data point. The same reflex on 100% of traffic is a rate, and a rate is the thing you actually run a product on. Frustration sits at a baseline, then a deploy lands and it spikes:
Frustration held near 12% until v2.4 shipped, then doubled within hours. No one read a thousand transcripts to find it. The reflex ran on every turn and the rate moved.
You shipped a new system prompt at v2.4, and the frustration rate doubled within hours. No one had to read a thousand transcripts to find out. The reflex did, on every turn, and the chart told you.
This turns shipping into an experiment. Roll a prompt or model change out to 5% of traffic, hold the rest as control, and watch the signal rates. If refusals or frustration climb after the change, you shipped a regression and you knew the same day instead of after the churn. It's A/B testing on what the agent actually did, not on what your eval set predicted it would do.
When we don't ship the signal, train it
Eight reflexes cover the common failures. Your product has failures that are yours alone: a medical agent that should never give dosing advice, a coding agent that quietly stubs out a failing test, a support agent drifting off-policy. Describe the behavior, bring a few labeled examples or let us generate a synthetic set, and you have a custom reflex in under an hour. It serves on the same API, at the same latency, beside the built-ins. The public surface is OpenAI-fine-tuning-compatible, so if you've trained a model through the OpenAI SDK, you already know the shape.
The labels feed back into the agent
A reflex is a detector, but the output is training signal. Every label is a row that can become an eval case, a fine-tune example, or a reward term in RL. The turn that flagged as frustrated today is the behavior your main agent learns to avoid tomorrow.
A big model on every turn is too expensive to run on all of production. The detector has to be tiny, or the loop never closes.
I learned this at Tesla, building the stack that went through driving data at scale to surface the frames engineers wanted to train on: the obscured camera, the rare cut-in, the long-tail case the model kept missing. The fleet produced petabytes. Nothing accurate enough to be useful was cheap enough to run on all of it, so the whole game was finding a filter you could afford to run on everything, because the rare event you need is never in the sample.
Treat your production traffic as a funnel. At the top, reflexes run on 100% of turns and flag a few percent. That flagged slice is small enough to hand to the expensive review you could never point at all of production: your team reads the turns that matter, and your own agents dig into the rest. Reflexes don't replace that judgment. They decide what's worth it, so nobody drowns in transcripts that returned 200 and meant nothing.
The cost of the top filter sets how much of production you can afford to look at. Cheap enough to run on 100%, and the expensive review below it only ever touches the slice worth the spend.
That only works because the detector is cheap enough to leave running on everything. One backbone pass labels every turn with many heads, so the tenth reflex costs almost nothing. Read how we built it.
The label was never the hard part. Running it on all of production was, and that's the part we solved.

