Agent Observability (2026): Why Your Traces Stay Green While the Agent Fails

An agent that loops, calls the wrong tool, or drifts off-task still returns a 200 with normal latency and token counts. Structural tracing cannot see semantic failure. This page maps agent-specific observability tools (Langfuse, LangSmith, Arize Phoenix, Braintrust, AgentOps, Helicone), the metrics that actually matter for multi-step agents, and the per-turn classifier layer that catches what the trace cannot.

June 20, 2026

An agent runs for 18 steps, calls the same search tool six times, never finds the file, and reports success. The trace is green: every span returned 200, latency was normal, token usage was unremarkable. The dashboard shows a healthy agent and the product is broken. That gap is the whole problem with applying single-call observability to multi-step agents. Tool support and free-tier numbers below verified June 2026.

200 OK: status a looping agent returns on every span
18 steps: a failed run that traces as healthy work
<90ms: per-turn semantic label, every turn

Agent Observability vs LLM Observability

A single LLM call is a function: prompt in, completion out. Observability for it means capturing the prompt, the response, the latency, and the token count. If the call errors or returns garbage, the trace shows it. That problem is largely solved, and the tools that solve it are covered in LLM observability tools.

An agent is not a function. It is a loop. It plans, calls a tool, reads the result, decides whether to continue, and eventually decides it is done. One user request fans out into many model calls, many tool calls, retries, and sometimes subagents. The unit of analysis is not a call, it is a trajectory: the full sequence of steps from the request to the agent declaring it finished.

That changes what observability has to capture and what counts as a failure.

| Dimension | LLM observability | Agent observability |
| --- | --- | --- |
| Unit of analysis | One call (prompt to completion) | One trajectory (many steps to done) |
| Trace shape | A single span | A nested tree of plans, tool calls, subagents |
| Cost attribution | Tokens per call | Tokens per step, per tool, per subagent |
| What a failure looks like | Error code or bad completion | Loop, wrong tool, off-task drift, step-limit hit |
| Whether the failure shows in the trace | Usually yes | Usually no, the trace stays green |

The last row is the one that matters. An LLM failure tends to be visible in the trace. An agent failure tends not to be, because the agent ran every step correctly at the mechanical level and still failed at the task level.

The Failure Modes Traces Miss

Each of these produces a 200 on every span, normal latency, and a normal token count. None of them is visible from the structure of the trace alone.

  • Looping. The agent calls the same tool repeatedly, often with slightly varied arguments so it does not even register as an exact repeat. From the trace it looks like an agent doing work. The token total grows and the dashboard reads as healthy throughput.
  • Tool calls that fail by returning 200. A search returns an empty result, a file read returns the wrong file, an API returns a stale value. The HTTP status is 200, so the span is green, but the agent now reasons over wrong inputs for the rest of the run.
  • Off-task drift. The agent starts on the user's goal and gradually wanders. By step 12 it is solving a different problem than the one in the first message. No single step looks wrong; the trajectory as a whole left the rails.
  • User frustration. The user rephrases the same request three times, gets terser, and abandons the session. Every one of those turns is a structurally normal request and response. The frustration is in the meaning, not the metrics.
  • Jailbreaks and policy violations. A turn that smuggles in an instruction to ignore the system prompt traces identically to a benign turn. The model either complied or refused, but the trace records only that a completion was returned.

The common thread

Every failure above is semantic, not structural. The mechanics of each call are fine, which is exactly why spans, latency, and token counts cannot see them. Structural observability is necessary but not sufficient.
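The looping case can be sketched in a few lines of Python. The two helpers below are hypothetical heuristics written for this example, and the call list is invented. An exact-repeat counter sees nothing when the arguments vary slightly, and a cruder same-tool streak counter fires but cannot tell a loop from legitimate repeated use.

```python
import json


def exact_repeat_count(calls):
    """Largest number of times an identical (tool, args) pair appears."""
    counts = {}
    for tool, args in calls:
        key = (tool, json.dumps(args, sort_keys=True))
        counts[key] = counts.get(key, 0) + 1
    return max(counts.values(), default=0)


def same_tool_streak(calls):
    """Longest run of consecutive calls to the same tool, ignoring arguments."""
    best = streak = 0
    prev = None
    for tool, _ in calls:
        streak = streak + 1 if tool == prev else 1
        prev = tool
        best = max(best, streak)
    return best


# Six searches with slightly varied queries: a loop that never repeats exactly.
calls = [
    ("search", {"q": "config.yaml"}),
    ("search", {"q": "config.yml"}),
    ("search", {"q": "config yaml file"}),
    ("search", {"q": "where is config"}),
    ("search", {"q": "find config"}),
    ("search", {"q": "config file path"}),
]

print(exact_repeat_count(calls))  # no exact repeats detected
print(same_tool_streak(calls))    # six consecutive calls to one tool
```

The streak counter would fire just as loudly on an agent reading six different files on purpose. Separating those two cases requires looking at what the calls mean, not how many there were.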

Agent Observability Tools Compared

The general LLM observability platforms have all added agent tracing, and a few tools were built agent-first. What separates them for agents specifically is whether tool calls render as a proper nested tree, whether cost attributes per step, and how the billing model interacts with agents that emit many spans per run. Columns below are verified against each vendor's docs and pricing page as of June 2026.

| Tool | Agent support | Free tier | License / notes |
| --- | --- | --- | --- |
| Langfuse | Nested agent traces; LangGraph and LangChain via the callback handler, framework-agnostic SDK otherwise | 50k units/mo, no credit card | MIT core (~28.8k stars). Unit-based billing: a many-span agent run consumes many units |
| LangSmith | First-party LangGraph tracing; a trace is one full agent run including its tool calls and retries | 5k base traces/mo, 1 seat, 14-day retention | Closed source. Deepest fit if you already build on LangChain/LangGraph |
| Arize Phoenix | OpenTelemetry + OpenInference; parent-child spans across agent, LLM, and tool steps with full inputs/outputs | Self-host free, no event caps | Elastic License 2.0 (source-available), OTel-native |
| Braintrust | Nested agent spans; SDK wrappers for OpenAI Agents SDK, LangGraph, CrewAI, Pydantic AI, Vercel AI SDK | 1M trace spans + 10k scores/mo | Closed source. Eval-first: tracing feeds scoring |
| AgentOps | Agent-native: step-by-step execution graphs, session replays, multi-agent interaction views | 5k events/mo (an event is each LLM or tool call) | MIT SDK (~5.6k stars). Integrates CrewAI, OpenAI Agents SDK, AG2/Autogen, LangChain |
| Helicone | Sessions group an agent's requests into one flow view with per-step latency and cost | 10k requests/mo, 7-day retention | Apache 2.0. Moved to maintenance mode after the March 2026 Mintlify acquisition |

Two practical notes. First, the billing model interacts with agent shape: a 20-step agent turn is roughly 20 Langfuse units but only one LangSmith trace with 20 spans, so which tool is cheaper flips depending on how many steps your runs take and how much traffic you serve. Second, the maintenance-mode status on Helicone matters if you are picking a platform to standardize on in 2026; the Sessions feature still works, but active development has stopped.
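The billing flip is easy to check with arithmetic. The sketch below uses only the free-tier quotas from the table and assumes, as the note above does, roughly one Langfuse unit per step and one LangSmith trace per run. Paid pricing is not modeled, and the helper name is made up for the example.

```python
LANGFUSE_FREE_UNITS = 50_000   # units per month, from the table above
LANGSMITH_FREE_TRACES = 5_000  # base traces per month, from the table above


def runs_per_month(free_quota: int, billed_items_per_run: int) -> int:
    """How many agent runs fit inside a monthly free quota."""
    return free_quota // billed_items_per_run


for steps in (4, 10, 20):
    # Assumption: one Langfuse unit per step, one LangSmith trace per run.
    langfuse_runs = runs_per_month(LANGFUSE_FREE_UNITS, steps)
    langsmith_runs = runs_per_month(LANGSMITH_FREE_TRACES, 1)
    print(steps, langfuse_runs, langsmith_runs)
```

Under these assumptions the two free tiers cover the same number of runs at 10 steps per run. Shorter runs favor unit-based billing and longer runs favor per-trace billing.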

All of these answer the same question well: what happened, step by step, in this run. None of them answers the question the green trace hides: was any of it correct.

The Metrics That Actually Matter

Latency, token cost, and error rate carry over from single-call monitoring and you still want them. They are necessary, but for agents they do not tell you whether the agent worked. Four agent-specific signals do that.

  • Task completion. Did the run actually finish the user's goal? Not "did it return a final message," which any run does, but did the final state satisfy the request. This is the headline number for an agent and the one a trace cannot compute on its own.
  • Tool-call accuracy. Of the tool calls the agent made, how many picked the right tool with the right arguments? A tool returning 200 with empty or wrong data passes every structural check, so accuracy has to be measured against intent, not status code.
  • Step and loop count. How many turns did the run take, and did it repeat actions? A run that solved the task in 4 steps and one that flailed for 18 can show identical token totals. Step count and a loop flag separate them.
  • Trajectory. Did the sequence of steps stay on the original task, or drift? This is a property of the whole path, not any single span, which is why per-step status codes never reveal it.

Task completion, tool-call accuracy, loop detection, and trajectory are all judgments about meaning. The tracing tools approximate them with LLM-as-judge evaluations that run offline on sampled traces. That works for measuring quality trends. It does not catch the looping agent in the run that is happening right now.
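As a rough illustration of why these signals need labels, here is a Python sketch that summarizes two invented runs. The Step and Run classes are hypothetical, and the `correct` and `goal_met` fields stand in for judgments that would have to come from a human reviewer, an offline judge, or a classifier. They cannot be derived from status codes.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    status: int     # structural: what the trace records
    tokens: int     # structural: what the trace records
    correct: bool   # semantic label (assumed to come from a reviewer or judge)


@dataclass
class Run:
    steps: list[Step]
    goal_met: bool  # semantic label: did the final state satisfy the request


def summarize(run: Run) -> dict:
    """Structural and semantic metrics side by side for one run."""
    n = max(len(run.steps), 1)  # avoid dividing by zero on an empty run
    return {
        "steps": len(run.steps),
        "tokens": sum(s.tokens for s in run.steps),
        "error_rate": sum(s.status != 200 for s in run.steps) / n,
        "tool_call_accuracy": sum(s.correct for s in run.steps) / n,
        "task_completed": run.goal_met,
    }


# Two made-up runs with identical token totals and zero errors.
solved = Run([Step("search", 200, 900, True)] * 4, goal_met=True)
flailed = Run([Step("search", 200, 200, False)] * 18, goal_met=False)

print(summarize(solved))
print(summarize(flailed))
```

Both runs report the same tokens and the same error rate. Only the step count and the two label-derived fields tell them apart.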

The Semantic Layer: Per-Turn Classifiers

The fix for a semantic failure is a label on the meaning of each turn: is_agent_looping, is_off_task, is_user_frustrated, tool_call_wrong, jailbreak_attempt, or a signal specific to your product. Offline LLM-as-judge evals approximate this on samples. To run it on every turn of every live run, you need a classifier that is fast and cheap enough to sit inline.

A Morph Reflex is that classifier. It returns a label in one forward pass, under 90 milliseconds end to end, over up to 64k tokens of context, which is enough to see the full agent turn including the tool calls. It is cheap enough per call to run on every turn rather than on a 1% sample, so you catch the loop while it is happening, not in tomorrow's eval report. You can train a custom Reflex on your own labeled data, or on generated synthetic data, in under an hour.

Private beta

Reflexes is currently in private beta. The API below is live in the docs and may change before general availability. See the Reflexes docs and custom Reflexes.

Score an agent turn for looping, inline

curl -X POST "https://api.morphllm.com/v1/reflex/predict" \
  -H "Authorization: Bearer $MORPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "agent-loop", "text": "<the agent turn, including its tool calls>"}'

# {
#   "model": "agent-loop",
#   "mode": "single_label",
#   "classes": [
#     { "class_id": 0, "label": "looping", "score": 0.96, "selected": true },
#     { "class_id": 1, "label": "progressing", "score": 0.04, "selected": false }
#   ],
#   "inference_time_ms": 11
# }

Because the label comes back as an API response rather than a panel inside one vendor's dashboard, it composes with every tool in the table above. Write is_agent_looping onto the Langfuse or LangSmith span as an attribute and the loop shows up in the trace UI you already use. Alert on it in Slack the moment a live run drifts. Or route on it inline: stop the agent, escalate to a human, or switch strategies the moment the classifier fires. It complements your tracer; it does not replace one.
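Routing on the label inline might look like the Python sketch below. The endpoint, headers, request body, and response shape are taken from the curl example above. The function names, the 0.9 threshold, and the action strings are assumptions made for illustration, and since the API is in private beta the field names may change.

```python
import json
import os
import urllib.request

REFLEX_URL = "https://api.morphllm.com/v1/reflex/predict"


def build_request(turn_text: str, api_key: str, model: str = "agent-loop"):
    """Build the same POST request as the curl example."""
    body = json.dumps({"model": model, "text": turn_text}).encode("utf-8")
    return urllib.request.Request(
        REFLEX_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def selected_class(response: dict) -> tuple[str, float]:
    """Return (label, score) for the class the classifier selected."""
    for cls in response["classes"]:
        if cls["selected"]:
            return cls["label"], cls["score"]
    raise ValueError("response has no selected class")


def next_action(response: dict, threshold: float = 0.9) -> str:
    """Hypothetical routing rule: stop only on a confident 'looping' label."""
    label, score = selected_class(response)
    if label == "looping" and score >= threshold:
        return "stop_and_escalate"
    return "continue"


def score_turn(turn_text: str) -> dict:
    """Call the API for one agent turn (needs MORPH_API_KEY and network)."""
    request = build_request(turn_text, os.environ["MORPH_API_KEY"])
    with urllib.request.urlopen(request, timeout=5) as resp:
        return json.load(resp)
```

In an agent loop you would call `next_action(score_turn(turn_text))` after each turn and break out when it stops returning "continue". The same label can be attached to the current span through whatever attribute or metadata mechanism your tracer provides.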

The base model is morph-reflex-v1, exposed at /v1/reflex/predict for inference and /v1/fine_tuning/* for training custom signals. The full surface is on the Reflex product page and in the Reflexes docs.

Frequently Asked Questions

What is agent observability?

The ability to explain what a multi-step, tool-calling agent did across an entire run, not just one model call. It captures the nested tree of plans, tool calls, retries, and subagents in a request, attributes cost per step, and surfaces where the agent looped, called the wrong tool, or drifted. See how it differs from LLM observability.

How is agent observability different from LLM observability?

LLM observability traces one call; agent observability traces a trajectory of many steps. The failures differ too: an LLM call fails by erroring, an agent fails by looping, repeating a 200-returning tool call with wrong data, or drifting off task, all of which trace as green. Full breakdown in the first section.

What are the best AI agent observability tools in 2026?

Langfuse, LangSmith, Arize Phoenix, Braintrust, AgentOps, and Helicone all support agent tracing; the comparison table lists each one's agent support, free tier, and license. AgentOps was built agent-first; Phoenix is the lightest OTel-native self-host; LangSmith fits deepest if you already use LangGraph.

What metrics matter for AI agents specifically?

Task completion, tool-call accuracy, step and loop count, and trajectory. Latency, tokens, and error rate still matter but do not tell you whether the agent did the job. See the metrics section.

Why do agent traces stay green when the agent is failing?

Because the failures are semantic, not structural. A wrong tool result returns 200, a loop looks like work, a drifting run produces normal spans. The mechanics are fine, so spans and metrics cannot see the failure. See the failure modes.

How do I detect agent looping and off-task drift?

Step caps and exact-repeat flags catch the obvious cases and miss the varied-argument loops and gradual drift. For those, label the content of each turn with a classifier. A Reflex returns that label in under 90 milliseconds, cheap enough to run every turn, and you write it onto your existing span. See the semantic layer.

Does a per-turn classifier replace my tracing tool?

No. Tracing gives you the structure of the run; the classifier adds the meaning of each turn. The label returns as an API response, so you write it onto a Langfuse or LangSmith span, alert on it, or route on it inline. It complements your tracer.
