Agent tracing records what an AI agent did across a run as a tree of spans: the agent invocation, every model call, every tool call, every sub-agent, with inputs, outputs, tokens, and timing on each node. (This is AI-agent tracing, not the skip-tracing sense of the term.) The data model is right. The problem starts when an agent makes 200 decisions and the span tree becomes something you scroll, not something you read. Spec details and tool support below verified June 2026.
What Agent Tracing Is: Spans and the Run Tree
A span is one timed unit of work with structured data attached: a start and end time, a name, a status, and a bag of attributes. For an agent, the useful spans are the model calls, the tool calls, and the agent or sub-agent invocations. Each span carries the inputs it received, the output it produced, the tokens it consumed, and how long it took.
Spans nest into a run tree (a trace). One user request becomes a top-level agent span. Under it sit the model calls the agent made, and under those, the tool calls each model call triggered. If the agent spawns a sub-agent, that sub-agent is its own span with its own children. The tree is the full causal record of a single run: who called what, in what order, and what came back.
This is borrowed directly from distributed tracing, where a span tree follows a request across microservices. The borrow is exact and it is also the source of the friction further down: a web request touches a handful of services and produces a handful of spans; an agent loop touches a model and its tools dozens of times and produces hundreds. Same data structure, very different scale.
- Agent span (invoke_agent): the whole run, or a sub-agent within it. Carries the agent name, id, and conversation id.
- Model span (chat): one LLM call. Carries model, token usage, and finish reason.
- Tool span (execute_tool): one tool call. Carries the tool name, call id, arguments, and result.
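The run tree described above can be sketched as a small recursive data structure. This is a minimal illustration, not any tracer's actual schema: the `Span` class, its field names, and the helper functions are hypothetical, and the timings and token counts are made-up sample values.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One timed unit of work in a run tree (hypothetical minimal model)."""
    name: str                      # e.g. "invoke_agent", "chat", "execute_tool"
    start_ms: int
    end_ms: int
    input_tokens: int = 0
    output_tokens: int = 0
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def total_tokens(span: Span) -> int:
    """Sum tokens over a span and all of its descendants."""
    own = span.input_tokens + span.output_tokens
    return own + sum(total_tokens(child) for child in span.children)


def count_spans(span: Span) -> int:
    """Count a span plus every span nested under it."""
    return 1 + sum(count_spans(child) for child in span.children)


# One run: an agent span, two model calls, and a tool call under the first one.
run = Span("invoke_agent", 0, 4200, children=[
    Span("chat", 10, 3000, input_tokens=900, output_tokens=120, children=[
        Span("execute_tool", 1500, 2990),
    ]),
    Span("chat", 3010, 4190, input_tokens=1400, output_tokens=300),
])

print(count_spans(run), total_tokens(run), run.duration_ms)
```

Rolling tokens and time up the tree like this is what lets a backend show per-step and per-run cost from the same data.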
OpenTelemetry GenAI Semantic Conventions
Early LLM tracers each invented their own field names. OpenTelemetry's GenAI semantic conventions fix that: a shared vocabulary of span names and attributes so a trace produced by one library reads correctly in any backend. Every major tracer here, Langfuse, Arize Phoenix, OpenLLMetry, Laminar, has converged on it, which is why instrumenting to the standard is the move that does not lock you in.
The operations are named: chat for a model call, invoke_agent for an agent run, execute_tool for a tool call, embeddings for an embedding call, and create_agent when an agent is constructed. Attributes live under the gen_ai.* namespace and attach to the span where they make sense.
| Span | Operation | Key attributes | What they tell you |
|---|---|---|---|
| Model call | chat | gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons, gen_ai.provider.name | Which model, how many tokens, why it stopped (stop vs tool_calls) |
| Agent / sub-agent | invoke_agent | gen_ai.agent.name, gen_ai.agent.id, gen_ai.conversation.id | Which agent ran, and which conversation the run belongs to |
| Tool call | execute_tool | gen_ai.tool.name, gen_ai.tool.call.id, plus arguments and result | Which tool fired, tied back to the model call that requested it |
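One way to use the table is as a lint check on your own instrumentation. The sketch below takes a span's attributes as a plain dict and reports which of the keys listed above are absent. The `EXPECTED_ATTRIBUTES` mapping and `missing_attributes` helper are hypothetical, and treating every listed key as expected is an assumption of this sketch, not a statement about which attributes the specification marks as required.

```python
# Attribute keys per operation, copied from the table above.
# Assumption: every listed key is "expected"; the spec's own requirement
# levels are not modelled here.
EXPECTED_ATTRIBUTES = {
    "chat": {
        "gen_ai.request.model",
        "gen_ai.usage.input_tokens",
        "gen_ai.usage.output_tokens",
        "gen_ai.response.finish_reasons",
        "gen_ai.provider.name",
    },
    "invoke_agent": {
        "gen_ai.agent.name",
        "gen_ai.agent.id",
        "gen_ai.conversation.id",
    },
    "execute_tool": {
        "gen_ai.tool.name",
        "gen_ai.tool.call.id",
    },
}


def missing_attributes(attributes: dict) -> list[str]:
    """Return the expected gen_ai.* keys that a span's attributes lack."""
    operation = attributes.get("gen_ai.operation.name")
    expected = EXPECTED_ATTRIBUTES.get(operation, set())
    return sorted(expected - set(attributes))


tool_span = {
    "gen_ai.operation.name": "execute_tool",
    "gen_ai.tool.name": "web_search",
}
print(missing_attributes(tool_span))
```

A tool span without its call id is the case worth catching early, since the call id is what ties the tool call back to the model call that requested it.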
The reason OTel matters beyond tidiness: the convention is vendor-neutral. Instrument your agent once against gen_ai.* and the spans flow to Langfuse, to Arize Phoenix, to a self-hosted ClickHouse, or to several at once, with no re-instrumentation. It also means a per-turn signal you compute yourself (see below) is just another attribute you write onto the same span, readable by whatever backend you point at.
The GenAI conventions moved to a dedicated semantic-conventions-genai repository and are still stabilizing. The operation names and core gen_ai.* attributes above are stable enough that the major tracers ship them; some content attributes (full message capture) are opt-in and evolving. OpenTelemetry's own walkthrough, Inside the LLM Call, shows the invoke_agent → chat → execute_tool tree end to end.
Tracing Multi-Step and Sub-Agent Runs
A single LLM call is one span and there is nothing to trace beyond it. An agent is a loop: plan, call a tool, read the result, decide whether to continue, repeat, sometimes delegate to a sub-agent. Tracing that means capturing the whole nested sequence as one tree, with cost and tokens attributed per step, so you can see where a run spent its time and where it went off the rails.
Two things get hard at agent scale. The first is sub-agent visibility. When an agent spawns a sub-agent, the sub-agent often runs in a separate process or behind an SDK boundary, and naive instrumentation loses it: the parent trace shows a tool call that took 90 seconds and returns a blob, with no window into the dozens of model and tool calls the sub-agent actually made.
Capturing the sub-agent as a proper child span under the call that spawned it is a named gap. Laminar, for instance, advertises that it is the only platform that traces Claude Agent SDK sub-agents, which it does with a small Rust proxy that captures every call across the process boundary and nests it under the parent trace.
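To see why the process boundary matters, it helps to look at what has to cross it: the trace id and the id of the span that spawned the sub-agent. The sketch below hand-rolls a W3C Trace Context `traceparent` value for that purpose. It is not how Laminar's proxy works and not a replacement for an OpenTelemetry propagator; the helper names, the `TRACEPARENT` environment key, and the dict-shaped span are all assumptions for illustration.

```python
import re
import secrets

# traceparent layout: version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Serialize the current span's identity for a child process."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> tuple[str, str]:
    """Return (trace_id, parent_span_id), or raise ValueError if malformed."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    return match.group(1), match.group(2)


# Parent side: the execute_tool span that launches the sub-agent.
trace_id = secrets.token_hex(16)
tool_span_id = secrets.token_hex(8)
child_env = {"TRACEPARENT": make_traceparent(trace_id, tool_span_id)}  # hypothetical carrier

# Child side: the sub-agent process reads the value and parents its root span on it.
parent_trace_id, parent_span_id = parse_traceparent(child_env["TRACEPARENT"])
sub_agent_span = {
    "name": "invoke_agent sub-agent",
    "trace_id": parent_trace_id,        # same trace as the parent run
    "parent_span_id": parent_span_id,   # nests under the spawning tool call
    "span_id": secrets.token_hex(8),
}
```

If nothing carries these two ids across, the sub-agent starts a fresh trace and the parent tree shows only the opaque 90-second tool call.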
The second is the UI at scale. Span trees were designed for request-response systems where a trace is a few spans deep. Laminar's launch post put the complaint plainly: for an agent that ran for 30 minutes and made 200 decisions, clicking through each span individually "simply doesn't work."
The fix is not more tree, it is a different view: lay the run out as a readable transcript of the agent's reasoning and actions, and make it queryable in natural language instead of expanding nodes by hand. The observability space was built for LLM calls that return in seconds, and agents that run for tens of minutes broke that assumption.
A trace tree is exhaustive and complete, which is exactly why it is unreadable at 200 spans. Exhaustiveness is the right property for debugging a specific known failure and the wrong property for answering "did this run go fine?" at a glance. The trees are not wrong; they are the wrong granularity for the question most teams ask most often.
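The transcript idea is easy to sketch: walk the tree once, drop the structural agent spans, and emit one line per model or tool step in time order. This is a toy rendering under stated assumptions, not any vendor's reader view; the `Span` class, its `summary` field, and the sample run are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    kind: str            # "invoke_agent", "chat", or "execute_tool"
    name: str
    start_ms: int
    summary: str = ""    # hypothetical one-line description of the step
    children: list["Span"] = field(default_factory=list)


def transcript(root: Span) -> list[str]:
    """Flatten a run tree into one line per model or tool step, in time order."""
    steps = []
    stack = [root]
    while stack:
        span = stack.pop()
        if span.kind != "invoke_agent":   # agent spans are containers, not steps
            steps.append(span)
        stack.extend(span.children)
    steps.sort(key=lambda s: s.start_ms)
    return [f"{s.start_ms:>6}ms  {s.kind} {s.name}: {s.summary}" for s in steps]


run = Span("invoke_agent", "research-agent", 0, children=[
    Span("chat", "gpt-5", 10, "decides to search", children=[
        Span("execute_tool", "web_search", 1500, "3 results"),
    ]),
    Span("chat", "gpt-5", 3010, "writes answer"),
])

for line in transcript(run):
    print(line)
```

The tree is still the source of truth; the transcript is just a projection of it that reads top to bottom.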
Instrument a Trace
With OpenTelemetry, instrumenting an agent turn is creating nested spans and setting gen_ai.* attributes on them. The shape below is the convention by hand; in practice you get most of it for free from an auto-instrumentation library (OpenLLMetry instruments OpenAI, Anthropic, LangChain, and vector DBs out of the box), and the OpenAI Agents SDK emits the same tree with tracing on by default.
An agent turn as gen_ai.* spans (OpenTelemetry, Python)

    from opentelemetry import trace

    # client, conversation_id, call_id, query, and web_search are assumed to be
    # defined by the surrounding application.
    tracer = trace.get_tracer("my-agent")

    # Top-level agent span: the whole run / a sub-agent
    with tracer.start_as_current_span("invoke_agent research-agent") as agent:
        agent.set_attribute("gen_ai.operation.name", "invoke_agent")
        agent.set_attribute("gen_ai.agent.name", "research-agent")
        agent.set_attribute("gen_ai.conversation.id", conversation_id)

        # Child: the model call that decides what to do
        with tracer.start_as_current_span("chat gpt-5") as llm:
            llm.set_attribute("gen_ai.operation.name", "chat")
            llm.set_attribute("gen_ai.provider.name", "openai")
            llm.set_attribute("gen_ai.request.model", "gpt-5")
            resp = client.chat.completions.create(...)
            llm.set_attribute("gen_ai.usage.input_tokens", resp.usage.prompt_tokens)
            llm.set_attribute("gen_ai.usage.output_tokens", resp.usage.completion_tokens)
            llm.set_attribute("gen_ai.response.finish_reasons", [resp.choices[0].finish_reason])

            # Child: the tool the model asked for
            with tracer.start_as_current_span("execute_tool web_search") as tool:
                tool.set_attribute("gen_ai.operation.name", "execute_tool")
                tool.set_attribute("gen_ai.tool.name", "web_search")
                tool.set_attribute("gen_ai.tool.call.id", call_id)
                result = web_search(query)  # returns 200 with... the right data? a span won't say

Point the exporter at any OTel-compatible backend and this run shows up as the invoke_agent → chat → execute_tool tree. Every number on it is real: the model, the token counts, the tool name, the latency. The last line of the code is the catch, and What a Span Can't Tell You, below, is about it.
Agent Tracing Tools Compared
All of these capture the run tree well. They differ on the standard they build on, whether sub-agents render as proper nested spans, how the long-run UI holds up, and the hosting and license model. Columns verified against each vendor's docs as of June 2026.
| Tool | Built on | Agent + sub-agent tracing | Hosting / license |
|---|---|---|---|
| OpenLLMetry (Traceloop) | OpenTelemetry (gen_ai.* extensions) | Auto-instruments OpenAI, Anthropic, LangChain, vector DBs; exports to any OTel backend | Apache 2.0, vendor-neutral SDK |
| Langfuse | OpenTelemetry-based | Nested agent traces via LangChain/LangGraph handler or framework-agnostic SDK | MIT core, self-host or cloud; unit-based billing |
| Arize Phoenix | OpenTelemetry + OpenInference | Parent-child spans across agent, LLM, and tool steps with full I/O | Elastic License 2.0, self-host free, no event caps |
| Braintrust | Own SDK + OTel-compatible | Nested agent spans; wrappers for OpenAI Agents SDK, LangGraph, CrewAI | Closed source, cloud; eval-first |
| LangSmith | Own SDK (OTel ingest supported) | First-party LangGraph; a trace is one full agent run | Closed source, cloud; deepest LangChain fit |
| Laminar | OpenTelemetry-native | Transcript/reader view + chat-with-trace; traces Claude Agent SDK sub-agents via a Rust proxy | Open source, self-host or cloud |
| OpenAI Agents SDK | Built-in tracer (pluggable processors) | Generations, tool calls, handoffs, guardrails; on by default to the Traces dashboard | Tied to the SDK; fan out to others via processors |
Two practical notes. First, billing interacts with agent shape: a 20-step turn is roughly 20 spans, so a per-unit tracer (Langfuse) and a per-trace tracer price the same agent very differently at volume. Second, "OpenTelemetry-based" is the column that ages best; building on the standard means you can change the row above without re-instrumenting the agent below. For a fuller buyer's view including pricing and the build-vs-buy math, see best AI agent monitoring tools.
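The billing point is simple arithmetic, sketched here with placeholder numbers. The volume and both prices are hypothetical values chosen only to show the scaling, not any vendor's actual rates, and "one billable unit per span" is an assumption of the sketch.

```python
def per_unit_cost(turns: int, spans_per_turn: int, usd_per_unit: float) -> float:
    """Cost when every span is a billable unit (assumed: one unit per span)."""
    return turns * spans_per_turn * usd_per_unit


def per_trace_cost(turns: int, usd_per_trace: float) -> float:
    """Cost when each turn is billed once, however many spans it contains."""
    return turns * usd_per_trace


TURNS_PER_MONTH = 100_000   # hypothetical volume
USD_PER_UNIT = 0.0001       # hypothetical price, not a vendor quote
USD_PER_TRACE = 0.001       # hypothetical price, not a vendor quote

for steps in (5, 20, 40):
    unit = per_unit_cost(TURNS_PER_MONTH, steps, USD_PER_UNIT)
    flat = per_trace_cost(TURNS_PER_MONTH, USD_PER_TRACE)
    print(f"{steps:>2} steps/turn: per-unit ${unit:,.2f} vs per-trace ${flat:,.2f}")
```

Whatever the real prices are, the shape holds: per-unit cost grows linearly with steps per turn, while per-trace cost does not move.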
What a Span Can't Tell You
Here is the line every tool on this page runs into. A span records that a tool was called. It does not record whether the turn was good. Those are different questions, and tracing only answers the first.
- A tool span returns 200 whether it fetched the right file or the wrong one. The status is structural; correctness is semantic.
- A looping agent produces a long, clean sequence of chat and execute_tool spans. From the tree it looks like an agent doing work.
- A jailbroken turn traces identically to a benign one: a model call, a finish reason, a completion. The trace records that something came back, not what it agreed to.
- A frustrated user who rephrases three times and abandons the session produces three structurally normal request-response spans.
All of these are failures of meaning, and meaning is not a field on a span. To know a turn looped, drifted off task, got jailbroken, or lost the user, you have to classify the content of the turn. The tracing tools approximate this with LLM-as-judge evaluations that run offline on sampled traces, which is fine for measuring quality trends and far too slow and expensive to run on every turn of every live run.
A Morph Reflex is a per-turn classifier built for exactly this. It returns a label in one forward pass, under 90ms end to end, over up to 64k tokens of context, enough to see the full agent turn including its tool calls.
Because it bills per event (one event is 2048 tokens) at $0.001 per event for realtime, around $0.49 per million tokens classified, it is up to 10x cheaper than running a frontier model as a judge, and cheap enough to run on every turn rather than a 1% sample. You train a custom Reflex on your own labeled failures in under an hour, and the API is OpenAI-fine-tuning-compatible (/v1/fine_tuning/* for training, /v1/reflex/predict for inference, base model morph-reflex-v1).
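The per-million-token figure follows directly from the event size and the per-event price quoted above. A quick check, using only those two numbers; rounding a partial event up to a whole one is an assumption of this sketch, not something the pricing statement above specifies.

```python
import math

EVENT_TOKENS = 2048          # one billable event, per the text above
USD_PER_EVENT = 0.001        # realtime rate, per the text above


def usd_per_million_tokens() -> float:
    """Events needed for 1M tokens, times the per-event price."""
    return 1_000_000 / EVENT_TOKENS * USD_PER_EVENT


def events_for_turn(tokens: int) -> int:
    """Billable events for one turn. Assumption: partial events round up."""
    return math.ceil(tokens / EVENT_TOKENS)


def turn_cost_usd(tokens: int) -> float:
    return events_for_turn(tokens) * USD_PER_EVENT


print(round(usd_per_million_tokens(), 2))   # about 0.49
print(events_for_turn(64_000))              # events for a full-context turn
```

Under that rounding assumption, even a turn that fills the whole context window costs a few cents, which is what makes every-turn classification plausible.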
Label the same turn for looping, inline

    curl -X POST "https://api.morphllm.com/v1/reflex/predict" \
      -H "Authorization: Bearer $MORPH_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"model": "agent-loop", "text": "<the agent turn, including its tool calls>"}'

    # {
    #   "model": "agent-loop",
    #   "mode": "single_label",
    #   "classes": [
    #     { "class_id": 0, "label": "looping", "score": 0.96, "selected": true },
    #     { "class_id": 1, "label": "progressing", "score": 0.04, "selected": false }
    #   ],
    #   "inference_time_ms": 11
    # }

The label comes back as an API response, so it composes with the tracing you already run. Write is_agent_looping or jailbreak_attempt onto the Langfuse, Arize, or OpenTelemetry span as a gen_ai.*-adjacent attribute and the signal shows up in the trace UI you already use.
Alert on it the moment a live run drifts. Or act on it inline: stop the agent, escalate to a human, switch strategies, before the bad turn ships. The trace tells you what happened; the classifier tells you whether it was okay. See agent observability for how the two layers fit together.
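Turning that response into a span attribute and an inline decision takes a few lines. The sketch below parses a response of the shape shown above; the helper names, the `agent_loop.score` attribute key, and the 0.9 threshold are hypothetical choices for illustration, not part of the Reflex API.

```python
import json

LOOPING_THRESHOLD = 0.9   # hypothetical cut-off; tune on your own data


def selected_class(response: dict) -> dict:
    """Return the class entry the classifier marked as selected."""
    return next(c for c in response["classes"] if c["selected"])


def span_attributes(response: dict) -> dict:
    """Attributes to write onto the existing span (key names are assumptions)."""
    chosen = selected_class(response)
    return {
        "is_agent_looping": chosen["label"] == "looping",
        "agent_loop.score": chosen["score"],
    }


def next_action(response: dict) -> str:
    """Decide inline whether the run should continue."""
    chosen = selected_class(response)
    if chosen["label"] == "looping" and chosen["score"] >= LOOPING_THRESHOLD:
        return "stop"
    return "continue"


# Sample response in the shape shown above.
raw = """{
  "model": "agent-loop",
  "mode": "single_label",
  "classes": [
    {"class_id": 0, "label": "looping", "score": 0.96, "selected": true},
    {"class_id": 1, "label": "progressing", "score": 0.04, "selected": false}
  ],
  "inference_time_ms": 11
}"""
response = json.loads(raw)

print(span_attributes(response), next_action(response))
```

With OpenTelemetry, each key in the returned dict would go onto the current span through the same `set_attribute` call used in the instrumentation example earlier.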
Frequently Asked Questions
What is agent tracing?
Recording what an AI agent did across a run as a tree of spans, where each span is a timed unit of work (a model call, a tool call, a sub-agent) with inputs, outputs, tokens, and timing attached. The spans nest into a run tree you can replay step by step. This is the AI-agent meaning, not skip-tracing. See spans and the run tree.
What are the OpenTelemetry GenAI semantic conventions?
A vendor-neutral standard for AI span names and attributes. Operations include chat, invoke_agent, execute_tool, and embeddings; attributes live under gen_ai.* (model, token usage, agent name, tool name, conversation id). Instrument once and any OTel backend reads it. See the OTel section.
How do you trace OpenAI agents?
The OpenAI Agents SDK traces by default: each run is collected as LLM generations, tool calls, handoffs, and guardrails and exported to the Traces dashboard. Disable with OPENAI_AGENTS_DISABLE_TRACING=1, or add processors to fan traces out to Langfuse, Braintrust, Arize Phoenix, or Laminar. See the tools table.
How does Langfuse agent tracing work?
Langfuse is an open-source, OpenTelemetry-based tracer. Instrument via the LangChain/LangGraph callback handler, the framework-agnostic SDK, or by piping in OpenLLMetry gen_ai.* spans. Each run renders as a nested trace with per-step cost; note its unit-based billing scales with span count. See the comparison.
Why is a span tree hard to use for long agent runs?
Span trees were built for request-response services with a few spans per trace. A 30-minute agent run with 200 decisions produces hundreds of spans, and reading it node by node does not scale. Newer tools lay the run out as a readable transcript and trace sub-agents as nested spans instead. See multi-step runs.
What can a span not tell you?
Whether the turn was good. A span records that a tool was called and returned 200, not that it returned the wrong file, that the agent is looping, or that the turn was a jailbreak. Those are semantic, and require classifying the turn's content. A Reflex returns that label in under 90ms. See what a span can't tell you.
Do I need a tracer and a classifier?
Usually both. Tracing answers what happened and is necessary for debugging, cost, and replay. A per-turn classifier answers whether each turn was okay, which the trace cannot. The classifier returns a label you write onto the span, so they compose; it does not replace the tracer.
Go deeper
- Agent observability: why your traces stay green while the agent fails, and the metrics that matter for multi-step runs
- Best AI agent monitoring tools: the same tools with pricing, pros and cons, and build-vs-buy
- Reflex: the per-turn classifier, under 90ms, custom-trained in under an hour
- Pricing: per-event Reflex rates for realtime and batch
Sources
- OpenTelemetry GenAI semantic conventions (operation names + gen_ai.* attributes)
- OpenTelemetry: Inside the LLM Call (the invoke_agent → chat → execute_tool tree)
- Laminar launch: "for an agent that ran for 30 minutes and made 200 decisions, this simply doesn't work"
- Laminar: instrumenting Claude Agent SDK sub-agents (Rust proxy across the process boundary)
- OpenLLMetry (Traceloop): OTel extensions for LLMs, Apache 2.0
- OpenAI Agents SDK: Tracing (on by default, pluggable processors)
- Morph Reflex capabilities and pricing
A span says the tool was called. A Reflex says whether the turn was okay.
Per-turn classification in under 90ms, over an API that composes with whichever tracer you run: looping, off-task drift, jailbreaks, frustration, or a signal you train in under an hour. Write the label onto your existing gen_ai.* span.