Agent Failures Don't Throw

An agent loops, a user gives up, a jailbreak slips past, and the trace still says 200 OK. The failures that matter to agents are semantic, and your logs can't see them. Reflexes are small classifiers that read every turn and label the ones that broke: frustration, jailbreaks, loops, policy violations. Eight ship out of the box, one API call, under 90ms. Train your own in an hour.

Tejas Bhakta
June 23, 2026 · 6 min read

A web server fails by throwing. You get a 500, a stack trace, a line number. An agent fails by succeeding at the wrong thing. It loops for twenty turns, refuses a reasonable request, gets talked out of its system prompt, closes the ticket by deleting the test. Every call returns 200 OK.

That's the gap. Your traces, your logs, your latency graphs all watch the plumbing. None of them watch the conversation.

Figure: Same eight turns, two views. The plumbing is green while the conversation breaks.

Turn   What your traces see   What actually happened
1      200                    ok
2      200                    ok
3      200                    frustrated
4      200                    ok
5      200                    loop
6      200                    loop
7      200                    jailbreak
8      200                    ok

Every call returned 200. The failures that matter are in the text, and nothing in the trace is looking at the text.

The stakes keep climbing while the visibility stays flat. Agents pull from more tools every month, call their own subagents, run for hours with no human in the loop, and now sit in healthcare, finance, and support queues where a quiet failure costs real money. The harder the agent works, the less a 200 tells you.

The failures live in the text

Error rate and latency already throw, so you already watch them. The signals that decide whether the agent is any good never reach your logs, because they're semantic. They live in what the agent and the user actually said:

  • A user typing "this is the third time I've asked" and giving up.
  • A prompt that talks the agent out of its own rules.
  • An agent calling the same tool eight times, getting the same error, and trying again.
  • A reasonable request the agent refused for no reason.
  • The agent leaking its chain of thought into the reply.

Each one is invisible until something reads the turn and labels it. That something is a reflex: a small classifier that runs on a single turn and returns one answer. Not a score from one to ten that drifts when you reword the prompt. A label you can count, trend across a release, and page someone on when it spikes.

Eight reflexes, one API call

You don't train anything to start. Point a turn at the API and get a label back in under 90ms. These ship out of the box:

Reflex               Fires when
User Frustration     the user is angry, repeating themselves, or giving up
Jailbreak            the input is trying to break the agent out of its instructions
Stuck in a loop      the agent is repeating an action with no progress
Guardrail            a turn crosses a policy line you defined
Incomplete thought   the agent stopped mid-task or left the job unfinished
Leaked thinking      internal reasoning leaked into a user-facing reply
Difficulty           the turn is hard or the agent needs more information
Ambiguity            the request is underspecified

One call, one classifier, one label:

[Code example: a bash request and its json response]

Send every turn through it. A "frustrated" label at 0.97 is a row in your dashboard, not a number buried in a transcript no one will read.
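As a rough illustration of the "one call, one label" shape, here is a minimal Python sketch. The endpoint URL, the request fields (`reflex`, `input`), and the response fields (`label`, `score`) are all assumptions made for this example, not the documented API. The transport is passed in as a function so the sketch runs offline against a stub.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical endpoint: the real URL, auth scheme, and field names are not
# given in this article, so treat every identifier here as a placeholder.
REFLEX_URL = "https://api.example.com/v1/reflexes/classify"


@dataclass(frozen=True)
class ReflexResult:
    reflex: str   # which reflex ran, e.g. "user_frustration"
    label: str    # the single label it returned, e.g. "frustrated"
    score: float  # confidence for that label, between 0 and 1


def build_request(reflex: str, turn_text: str) -> bytes:
    """Serialize one turn into an (assumed) JSON request body."""
    return json.dumps({"reflex": reflex, "input": turn_text}).encode("utf-8")


def parse_response(reflex: str, body: bytes) -> ReflexResult:
    """Parse an (assumed) JSON response of the form {"label": ..., "score": ...}."""
    data = json.loads(body)
    return ReflexResult(reflex=reflex, label=data["label"], score=float(data["score"]))


def classify_turn(reflex: str, turn_text: str,
                  send: Callable[[str, bytes], bytes]) -> ReflexResult:
    """Label one turn. `send` does the HTTP POST and returns the raw body,
    so the transport (urllib, requests, a test stub) stays swappable."""
    return parse_response(reflex, send(REFLEX_URL, build_request(reflex, turn_text)))


def fake_send(url: str, payload: bytes) -> bytes:
    """Stand-in transport so the sketch runs offline; replace with a real POST."""
    return json.dumps({"label": "frustrated", "score": 0.97}).encode("utf-8")


result = classify_turn("user_frustration", "this is the third time I've asked", fake_send)
print(result.label, result.score)
```

Keeping the result to a flat `(reflex, label, score)` record is what makes it easy to store one row per turn and count rows later.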

A rate you can alert on

One label is a data point. The same reflex on 100% of traffic is a rate, and a rate is the thing you actually run a product on. Frustration sits at a baseline, then a deploy lands and it spikes:

Figure: User-frustration rate, last 48 hours (one reflex on 100% of traffic). The time axis runs from 2 days ago to now, with the deploy of system prompt v2.4 marked.

Frustration held near 12% until v2.4 shipped, then doubled within hours. No one read a thousand transcripts to find it. The reflex ran on every turn and the rate moved.

You shipped a new system prompt at v2.4, and the frustration rate doubled within hours. No one had to read a thousand transcripts to find out. The reflex did, on every turn, and the chart told you.
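Turning per-turn labels into an alertable rate can be sketched in a few lines of Python. The hourly bucketing, the 1.5x threshold, and the synthetic traffic below are illustrative assumptions, not how the product computes its charts. Only the 12% baseline and the doubling echo the example above.

```python
from collections import defaultdict
from typing import Iterable


def hourly_rate(turns: Iterable[tuple[int, bool]]) -> dict[int, float]:
    """Map (hour_index, flagged) pairs to the flagged fraction per hour."""
    total: dict[int, int] = defaultdict(int)
    flagged: dict[int, int] = defaultdict(int)
    for hour, is_flagged in turns:
        total[hour] += 1
        flagged[hour] += int(is_flagged)
    return {hour: flagged[hour] / total[hour] for hour in sorted(total)}


def spiking_hours(rates: dict[int, float], baseline_hours: int,
                  ratio: float = 1.5) -> list[int]:
    """Return hours after the baseline window whose rate exceeds
    `ratio` times the mean rate of the first `baseline_hours` hours."""
    hours = sorted(rates)
    base = [rates[h] for h in hours[:baseline_hours]]
    baseline = sum(base) / len(base)
    return [h for h in hours[baseline_hours:] if rates[h] > ratio * baseline]


def synthetic_hour(hour: int, n: int, n_flagged: int) -> list[tuple[int, bool]]:
    """Fabricated traffic for the demo: n turns, the first n_flagged flagged."""
    return [(hour, i < n_flagged) for i in range(n)]


# Four quiet hours at 12% frustrated, then two hours at 24% after a deploy.
turns: list[tuple[int, bool]] = []
for hour in range(4):
    turns += synthetic_hour(hour, 100, 12)
for hour in range(4, 6):
    turns += synthetic_hour(hour, 100, 24)

rates = hourly_rate(turns)
alerts = spiking_hours(rates, baseline_hours=4)
print(alerts)
```

A fixed ratio against a trailing baseline is the simplest possible rule. A production alert would likely also want a minimum sample size per window so that a quiet hour with three turns cannot page anyone.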

This turns shipping into an experiment. Roll a prompt or model change out to 5% of traffic, hold the rest as control, and watch the signal rates. If refusals or frustration climb after the change, you shipped a regression and you knew the same day instead of after the churn. It's A/B testing on what the agent actually did, not on what your eval set predicted it would do.
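One way to decide whether a climb in the treatment group is more than noise is a pooled two-proportion z-test on the flagged rates. That choice of test is an assumption of this sketch, not something the article prescribes, and the counts are invented to mirror a 5% rollout.

```python
import math


def two_proportion_z(flagged_control: int, n_control: int,
                     flagged_treatment: int, n_treatment: int) -> float:
    """z statistic for (treatment rate - control rate) using a pooled standard error."""
    p_control = flagged_control / n_control
    p_treatment = flagged_treatment / n_treatment
    pooled = (flagged_control + flagged_treatment) / (n_control + n_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    return (p_treatment - p_control) / std_err


def one_sided_p(z: float) -> float:
    """P(Z >= z) for a standard normal: the chance of a rise this large by luck."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Invented counts: 95% of traffic as control at 12% frustrated,
# 5% of traffic on the new prompt at 18% frustrated.
z = two_proportion_z(flagged_control=2280, n_control=19000,
                     flagged_treatment=180, n_treatment=1000)
p = one_sided_p(z)
print(f"z = {z:.2f}, one-sided p = {p:.2g}")

# A simple gate: call it a regression when the rise is unlikely to be chance.
is_regression = p < 0.01
```

Because the reflex labels every turn in both arms, the sample sizes are whatever traffic you received, which is usually far more than a hand-built eval set.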

When we don't ship the signal, train it

Eight reflexes cover the common failures. Your product has failures that are yours alone: a medical agent that should never give dosing advice, a coding agent that quietly stubs out a failing test, a support agent drifting off-policy. Describe the behavior, bring a few labeled examples or let us generate a synthetic set, and you have a custom reflex in under an hour. It serves on the same API, at the same latency, beside the built-ins. The public surface is OpenAI-fine-tuning-compatible, so if you've trained a model through the OpenAI SDK, you already know the shape.

The labels feed back into the agent

A reflex is a detector, but the output is training signal. Every label is a row that can become an eval case, a fine-tune example, or a reward term in RL. The turn that flagged as frustrated today is the behavior your main agent learns to avoid tomorrow.
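A small Python sketch of the first hop in that loop, from labeled turns to eval cases written as JSON Lines. The `LabeledTurn` fields, the output keys, and the 0.9 score cutoff are hypothetical names and values picked for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class LabeledTurn:
    turn_id: str
    user_text: str
    agent_text: str
    label: str    # reflex label for this turn, e.g. "frustrated" or "ok"
    score: float  # confidence reported for that label


def to_eval_cases(turns: list[LabeledTurn], keep_labels: set[str],
                  min_score: float = 0.9) -> list[dict]:
    """Keep confidently flagged turns and reshape each into an eval case:
    the input that triggered the failure plus the reply to avoid."""
    cases = []
    for turn in turns:
        if turn.label in keep_labels and turn.score >= min_score:
            cases.append({
                "id": turn.turn_id,
                "input": turn.user_text,
                "bad_output": turn.agent_text,
                "failure": turn.label,
            })
    return cases


def to_jsonl(cases: list[dict]) -> str:
    """Serialize cases one JSON object per line."""
    return "\n".join(json.dumps(case) for case in cases)


turns = [
    LabeledTurn("t1", "reset my password", "Done, check your email.", "ok", 0.99),
    LabeledTurn("t2", "this is the third time I've asked", "Could you clarify?", "frustrated", 0.97),
    LabeledTurn("t3", "run the tests", "Retrying the same command.", "loop", 0.60),
]

cases = to_eval_cases(turns, keep_labels={"frustrated", "loop"})
print(to_jsonl(cases))
```

The low-confidence loop turn is dropped here on purpose: a dataset built from labels is only as clean as the threshold that admits rows into it.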

Figure: The data engine. Cheap enough to run on everything, every day.

  1. Run on 100%. Label every turn in production, not a 1% sample.
  2. Surface. Flag the rare turn: a jailbreak, a loop, a frustrated user.
  3. Sample. Pull those turns into a dataset worth training on.
  4. Improve. Feed evals, fine-tunes, RL rewards. Repeat.

A big model on every turn is too expensive to run on all of production. The detector has to be tiny, or the loop never closes.

I learned this at Tesla, building the stack that went through driving data at scale to surface the frames engineers wanted to train on: the obscured camera, the rare cut-in, the long-tail case the model kept missing. The fleet produced petabytes. Nothing accurate enough to be useful was cheap enough to run on all of it, so the whole game was finding a filter you could afford to run on everything, because the rare event you need is never in the sample.

Treat your production traffic as a funnel. At the top, reflexes run on 100% of turns and flag a few percent. That flagged slice is small enough to hand to the expensive review you could never point at all of production: your team reads the turns that matter, and your own agents dig into the rest. Reflexes don't replace that judgment. They decide what's worth it, so nobody drowns in transcripts that returned 200 and meant nothing.

Figure: The production funnel. Cheap on everything at the top, expensive on a slice at the bottom.

Stage            Share of turns   Who looks
All production   100%             Reflexes: tiny, automated, every turn
Reflex-flagged   ~5%              Heavier agents read the flagged turns in depth
Worth a human    ~0.1%            Your team works the slice that matters

Down the funnel, volume falls ~1000× while compute per item rises.

The cost of the top filter sets how much of production you can afford to look at. Cheap enough to run on 100%, and the expensive review below it only ever touches the slice worth the spend.
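The funnel arithmetic is easy to work through in Python. The stage shares (100%, about 5%, about 0.1%) come from the funnel above. The per-item costs are made-up placeholders, chosen only to make the shape of the trade-off visible.

```python
# Stage shares come from the funnel described above; the per-item costs are
# made-up placeholders chosen only to make the arithmetic visible.
STAGES = [
    # (stage name, share of all turns, assumed cost per item in dollars)
    ("reflexes on every turn", 1.0, 0.00001),
    ("heavier agents on flagged turns", 0.05, 0.01),
    ("human review", 0.001, 2.00),
]


def funnel_costs(total_turns: int, stages):
    """Return (name, item_count, stage_cost) for each stage of the funnel."""
    rows = []
    for name, share, unit_cost in stages:
        items = total_turns * share
        rows.append((name, items, items * unit_cost))
    return rows


total_turns = 1_000_000
rows = funnel_costs(total_turns, STAGES)
funnel_total = sum(cost for _, _, cost in rows)

# Baseline for comparison: pointing the heavier agent at every turn.
heavy_on_everything = total_turns * STAGES[1][2]

for name, items, cost in rows:
    print(f"{name:34s} {items:>12,.0f} items  ${cost:>10,.2f}")
print(f"funnel total: ${funnel_total:,.2f} vs heavy-on-all: ${heavy_on_everything:,.2f}")
```

Swap in your own unit costs. The shape to notice is that the top stage touches every turn yet, under these assumptions, is the smallest line item.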

That only works because the detector is cheap enough to leave running on everything. One backbone pass labels every turn with many heads, so the tenth reflex costs almost nothing. Read how we built it.
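To make the "one backbone pass, many heads" idea concrete, here is a toy Python sketch. The article does not describe the real architecture beyond that sentence, so the hand-written feature function, the logistic heads, and every name below are stand-ins for illustration.

```python
import math
from typing import Callable

Vector = list[float]


class MultiHeadLabeler:
    """One shared (expensive) backbone pass per turn, then one cheap head per reflex."""

    def __init__(self, backbone: Callable[[str], Vector]):
        self.backbone = backbone
        self.heads: dict[str, tuple[Vector, float]] = {}
        self.backbone_calls = 0  # counted to show the cost does not grow with heads

    def add_head(self, name: str, weights: Vector, bias: float = 0.0) -> None:
        self.heads[name] = (weights, bias)

    def label(self, text: str) -> dict[str, float]:
        self.backbone_calls += 1
        features = self.backbone(text)  # the only expensive step, run once
        scores = {}
        for name, (weights, bias) in self.heads.items():
            logit = sum(w * x for w, x in zip(weights, features)) + bias
            scores[name] = 1.0 / (1.0 + math.exp(-logit))  # sigmoid per head
        return scores


def toy_backbone(text: str) -> Vector:
    """Stand-in for a learned encoder: three hand-written features."""
    lowered = text.lower()
    return [
        float(lowered.count("!")),
        1.0 if "again" in lowered or "third time" in lowered else 0.0,
        1.0 if "ignore your instructions" in lowered else 0.0,
    ]


labeler = MultiHeadLabeler(toy_backbone)
labeler.add_head("user_frustration", [0.8, 2.5, 0.0], bias=-1.0)
labeler.add_head("jailbreak", [0.0, 0.0, 4.0], bias=-2.0)
for i in range(8):  # eight more placeholder heads, ten in total
    labeler.add_head(f"custom_{i}", [0.1, 0.1, 0.1], bias=-1.0)

scores = labeler.label("This is the third time I've asked!")
print(len(scores), "labels from", labeler.backbone_calls, "backbone pass")
```

Each added head is a single dot product over features that were already computed, which is the sense in which the tenth reflex is nearly free in this sketch.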

The label was never the hard part. Running it on all of production was, and that's the part we solved.

Try a reflex on your own traffic, or read the docs.