AI Agent Observability Tools (2026): Open Source & OTel Compared

89% of agent teams have observability. One in three still call quality their biggest blocker to shipping. The dashboards are everywhere; the hard problem did not move. The reason is that observability tools record what the agent did, not whether each turn was okay. Tool support, licenses, and pricing below are verified against vendor docs and GitHub as of June 2026.

89% of agent teams already run observability (LangChain, 2025)
1 in 3 still name quality their #1 blocker to production
6 of 8 leading tools here are open source or source-available
<90ms per-turn verdict, the layer none of them produces

The category sounds like one product but it is two layers. The first is tracing: capture the nested tree of plans, tool calls, retries, and subagents inside one request, attribute cost per step, and store it so you can read the run later. Almost every tool below does this well, and the best ones are open source and built on OpenTelemetry, so you self-host them and own the data.
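The tracing layer is easier to picture as a data structure. A minimal sketch, assuming a simplified span model: the `Span` class, its `kind` labels, and the dollar figures are illustrative, not any tool's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One node in a trace tree: a plan step, LLM call, tool call, or subagent."""
    name: str
    kind: str                 # illustrative labels: "agent", "llm", "tool"
    cost_usd: float = 0.0     # cost attributed to this step alone
    children: list["Span"] = field(default_factory=list)

    def total_cost(self) -> float:
        """Cost of this step plus everything nested under it."""
        return self.cost_usd + sum(c.total_cost() for c in self.children)

    def walk(self, depth: int = 0):
        """Yield (depth, span) pairs in the order a trace viewer would list them."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# One request: the agent plans, calls a tool that is retried once, then answers.
request = Span("handle_request", "agent", children=[
    Span("plan", "llm", cost_usd=0.004),
    Span("search_docs", "tool", children=[
        Span("attempt_1", "tool", cost_usd=0.001),
        Span("attempt_2", "tool", cost_usd=0.001),
    ]),
    Span("answer", "llm", cost_usd=0.006),
])

for depth, span in request.walk():
    print(f"{'  ' * depth}{span.name} [{span.kind}] ${span.total_cost():.3f}")
```

The tree records structure and cost per step. Nothing in it says whether the retry was reasonable or the answer was right, which is the second layer.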

The second layer is a verdict on each turn: is this a loop, an off-task drift, a jailbreak, a policy violation, or a user giving up? That one is mostly missing, and it is the one teams keep saying they had to build themselves.

This page is the open-source and OpenTelemetry roundup. For the monitoring, alerting, and build-your-own angle, see best AI agent monitoring tools. For why a healthy-looking trace hides a broken agent, see agent observability.

Open-Source & OpenTelemetry Tools, Compared (2026)

One row per tool. "Open source" is the license on the self-hostable core. "OTel-native" means it emits standard OpenTelemetry spans rather than a proprietary format, so you are not locked to one backend. "What it traces" is the unit it records. Every tool here except Reflex is a tracer; none of the tracers returns a per-turn verdict, which is the gap covered under "The Layer Every Tracer Is Missing."

AI Agent Observability Tools at a Glance

Tool	Open source	OTel-native	Self-host	Pricing	What it traces
Langfuse	Yes (MIT)	Yes (v3 SDK)	Yes	Free OSS; cloud 50k units/mo free, then $29+/mo	Traces, observations, scores
Arize Phoenix	Source-available (ELv2)	Yes (+OpenInference)	Yes	Phoenix OSS free; Arize AX enterprise	Agent spans, evals, trajectories
OpenLLMetry / Traceloop	Yes (Apache 2.0)	Yes (is OTel)	Yes (SDK to any backend)	SDK free; cloud 50k spans/mo free	OTel spans for LLM/vector/framework calls
Helicone	Yes (Apache 2.0)	Ingests OTel	Yes	Maintenance mode since Mar 2026	Proxied requests, sessions, cost
Datadog LLM Obs	No (proprietary)	Ingests OTel	No	Per LLM span (~$8 / 10k spans)	LLM spans, tokens, cost, latency
Braintrust	No (proprietary core)	No (eval-first)	Enterprise only	Free 1GB + 10k scores/mo; Pro $249/mo	Spans tied to evals and scores
Langtrace	Yes (AGPL-3.0 app)	Yes	Yes	Free 5k spans/mo; $31/user/mo	OTel spans, metrics, annotations
SigNoz	Yes (MIT)	Yes	Yes	Community free; cloud from $49/mo	OTLP traces, logs, metrics (LLM via OTel)
Reflex (the missing layer)	API	Writes onto your spans	Cloud API	$0.001/event (~$0.49 / 1M tok)	A per-turn verdict, not a trace

Licenses, OTel support, GitHub stars, and free-tier numbers verified against each vendor's docs, pricing page, and repo in June 2026. Helicone's license is unchanged but active development stopped after the Mintlify acquisition. Reflex is not a tracer; it is the per-turn classifier in the last row. See pricing.

Why OpenTelemetry Is the Through-Line

The reason "open source" and "OpenTelemetry" keep appearing together in this category is that OTel is what makes the open-source tools interchangeable. OpenTelemetry is a vendor-neutral standard for emitting spans. A tool that is OTel-native produces those standard spans, so the same instrumentation can feed Langfuse today, Phoenix tomorrow, and your own ClickHouse the day after, without rewriting a line of agent code. Arize maintains OpenInference, a set of semantic conventions on top of OTel specific to LLM and agent spans, which is why Phoenix and Langfuse can read each other's traces.
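One way to see the portability claim: with OTel-style instrumentation, switching backends is a configuration change. A minimal sketch, assuming the standard OTel SDK environment variables (`OTEL_SERVICE_NAME`, `OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_EXPORTER_OTLP_HEADERS`); the hostnames and the `otel_env` helper are hypothetical placeholders, not real vendor endpoints.

```python
from typing import Optional

# Hypothetical collector endpoints; substitute the OTLP URL each backend documents.
BACKENDS = {
    "langfuse": "https://langfuse.example.internal/otlp",
    "phoenix": "https://phoenix.example.internal/otlp",
    "clickhouse": "https://otel-collector.example.internal:4318",
}


def otel_env(backend: str, service_name: str,
             headers: Optional[dict] = None) -> dict:
    """Build the standard OTel SDK environment variables for one backend."""
    env = {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_ENDPOINT": BACKENDS[backend],
    }
    if headers:
        # The OTel spec encodes headers as comma-separated key=value pairs.
        env["OTEL_EXPORTER_OTLP_HEADERS"] = ",".join(
            f"{k}={v}" for k, v in headers.items()
        )
    return env


today = otel_env("langfuse", "support-agent")
tomorrow = otel_env("phoenix", "support-agent")

# Only the endpoint differs; the agent code that emits spans is untouched.
changed = {key for key in today if today[key] != tomorrow[key]}
print(changed)
```

In practice each backend also needs its own auth headers, which is why the helper accepts them. The instrumentation itself stays the same.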

That portability is the whole appeal of the OSS stack. You instrument once with an OTel SDK or OpenLLMetry, point it at a self-hosted collector, and keep every byte of prompt and completion data inside your own infrastructure. No proxy in the request path, no per-seat lock-in, no vendor that can stall or sunset the product out from under you, which is close to what happened to Helicone customers in March 2026 when it moved to maintenance mode.

The proxy tradeoff

One architectural fork splits this list. Helicone is a proxy: it sits between your app and the model provider and logs requests passing through. That is the fastest way to get visibility, but it puts a service in the hot path. Langfuse, which is async by design, argues the case against it plainly in its own engineering writeup: a proxy "can introduce additional latency" and "can be a single point of failure." The common version of this objection on developer forums is blunter, that you do not want another service between you and your LLM provider. OpenTelemetry-native tracing sidesteps it because spans are emitted out of band, not inline.
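The out-of-band pattern can be sketched with the standard library. This is a simplified illustration of the idea, not any tool's exporter: `OutOfBandExporter`, `flaky_backend`, and `handle_turn` are hypothetical names, and real OTel SDKs add batching, retries, and bounded queues on top.

```python
import queue
import threading


class OutOfBandExporter:
    """Buffers spans in memory and ships them from a background thread."""

    def __init__(self, send):
        self._send = send                  # callable that delivers one span to a backend
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def emit(self, span: dict) -> None:
        """Called on the request path: enqueue and return immediately."""
        self._queue.put(span)

    def _drain(self) -> None:
        while True:
            span = self._queue.get()
            try:
                self._send(span)
            except Exception:
                pass                       # a backend outage drops telemetry, not the request
            finally:
                self._queue.task_done()

    def flush(self) -> None:
        """Block until every queued span has been handled (for example at shutdown)."""
        self._queue.join()


delivered = []


def flaky_backend(span: dict) -> None:
    # Stand-in for a tracing backend; fails on one span to mimic an outage.
    if span["name"] == "tool_call":
        raise ConnectionError("backend unavailable")
    delivered.append(span["name"])


exporter = OutOfBandExporter(flaky_backend)


def handle_turn(prompt: str) -> str:
    exporter.emit({"name": "llm_call", "prompt": prompt})
    exporter.emit({"name": "tool_call"})
    return f"answer to: {prompt}"          # the model call itself is stubbed here


reply = handle_turn("reset my password")
exporter.flush()
print(reply, delivered)
```

The user still gets a reply when the backend fails; only the telemetry for that span is lost. With a proxy, the same failure sits between the app and the model.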

Open-source LLM observability tools by GitHub stars (June 2026)

Community size is a proxy for integration breadth and how battle-tested the tool is. Every tool here is a tracer.

Tool	License / note	GitHub stars
Langfuse	MIT, OTel-native	29,800
SigNoz	MIT, general OTel	27,500
Arize Phoenix	ELv2 + OpenInference	10,300
OpenLLMetry	Apache 2.0	7,200
Helicone	maintenance mode	5,900
Langtrace	AGPL-3.0	1,200

GitHub star counts as of June 2026 (rounded). Stars signal adoption, not capability; SigNoz is a general OpenTelemetry platform with LLM as one use case, the rest are LLM/agent-specific. All six are tracers: they record runs. None classifies whether a given turn was a loop, jailbreak, or frustrated user in real time.

1. Langfuse

Langfuse is the default open-source tracer and the name that comes up most in community threads. It rebuilt its SDK on OpenTelemetry in v3, so it ingests standard OTel spans and interops with Grafana, Jaeger, and Datadog. The entire product is MIT-licensed except a thin enterprise compliance folder, which means a self-hosted deploy has no seat caps, retention limits, or usage caps. The hosted free tier is 50,000 units a month (a unit is any trace, observation, or score). At ~29.8k GitHub stars it has the largest community in the category.

Strengths

MIT core, self-host with no caps, full data ownership
OpenTelemetry-native, broad framework support
Prompt management + dataset + eval tooling
Largest community, most integrations

Limitations

Self-host needs ClickHouse + Redis + S3, not one container
Unit billing: a many-span agent run burns many units
Tracing-first: shows what happened, not whether it worked
Post-hoc; cannot act on a turn while it runs

Best for: teams that want the most adopted open-source tracer with full data control. See Langfuse alternatives.
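The unit-billing limitation is easy to estimate before you commit. A rough sketch using the 50,000-unit free tier quoted above, and assuming every span is recorded as exactly one observation; the run shapes are made up for illustration.

```python
FREE_TIER_UNITS_PER_MONTH = 50_000  # hosted free tier quoted above


def units_per_run(turns: int, spans_per_turn: int, scores_per_run: int = 0) -> int:
    """Units for one agent run: 1 trace + one observation per span + scores.

    Assumption: every span is recorded as exactly one observation.
    """
    observations = turns * spans_per_turn
    return 1 + observations + scores_per_run


def runs_in_free_tier(units: int) -> int:
    """Whole agent runs that fit in the monthly free allowance."""
    return FREE_TIER_UNITS_PER_MONTH // units


# Illustrative agent run: 12 turns, each with 1 LLM call and 2 tool calls, plus 2 scores.
chatty = units_per_run(turns=12, spans_per_turn=3, scores_per_run=2)
# A single-call request for comparison: 1 turn, 1 span, no scores.
simple = units_per_run(turns=1, spans_per_turn=1)

print(chatty, runs_in_free_tier(chatty))
print(simple, runs_in_free_tier(simple))
```

Under these assumptions the free tier covers 25,000 single-call requests but only about 1,300 of the multi-turn agent runs.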

2. Arize Phoenix

Phoenix is Arize's open-source agent-evaluation and tracing layer; Arize AX is the enterprise platform above it. It is genuinely agent-aware: parent-child spans across agent, LLM, and tool steps, trajectory evaluation, and a strong eval library. It is OpenTelemetry-based and built on OpenInference, and a common community note is that it self-hosts in a single Docker container, lighter than Langfuse's multi-service deploy.

The license is Elastic License 2.0, which is source-available rather than OSI open source: free to self-host, with a restriction on reselling it as a managed service.

Strengths

Real agent-trajectory and span-level evaluation
Single-container self-host, lighter setup
OTel + OpenInference, strong eval library
Clear path from free Phoenix to enterprise AX

Limitations

ELv2 is source-available, not OSI open source
Oriented to ML and data-science teams
Full platform (AX) trends enterprise and per-volume
Eval is offline, not live per-turn blocking

Best for: ML teams that want rigorous, agent-aware evaluation with a light open-source on-ramp. See Arize Phoenix vs Langfuse.

3. Helicone

Status: maintenance mode since March 2026

Mintlify acquired Helicone on March 3, 2026, and both founders joined Mintlify. Security patches, bug fixes, and new model support continue, but there is no active roadmap. Helicone served roughly 16,000 organizations before the acquisition. If you are choosing a platform to standardize on, weigh the stalled development; if you already run it, the proxy and Sessions view still work. The fallout is live on r/LLMDevs.

Helicone is an Apache-2.0 AI gateway: a proxy that fronts 100+ models behind one OpenAI-compatible key, with routing, caching, fallbacks, and request logging. Sessions group an agent's requests into one flow view with per-step latency and cost. It was the fastest path from zero to visibility with a one-line base-URL change, and it is self-hostable via Docker and Helm. The proxy architecture is also the main objection to it: it sits in the request path, which adds latency and a failure point.

Strengths

One-line setup via OpenAI-compatible base URL
Gateway features: routing, caching, fallbacks
Apache 2.0, self-hostable
Cost and session views out of the box

Limitations

Maintenance mode: no new features after Mar 2026
Proxy sits in the hot path (latency, single point of failure)
Logging-first, thin evaluation
Post-hoc; no per-turn verdict

Best for: existing users who value the gateway, with a migration plan given the maintenance-mode status.

4. Datadog LLM Observability

If you already run Datadog, LLM Observability puts agent traces next to your APM, infra, and logs on one bill and one dashboard. It captures prompts, completions, token usage, cost, latency, and errors, and its AI Agents console visualizes multi-step runs. It can ingest OpenTelemetry and OpenInference data, but it is a closed commercial backend, not an open or OTel-first tool. The pricing model is the thing to understand before you adopt it for agents.

Per-span billing, in Datadog's own words

Datadog's docs state that each LLM span in a trace is priced independently (about $8 per 10,000 LLM spans on annual billing). Tool, embedding, and agent spans are not billed, but a single agent run fans out into many LLM calls across many turns, so a chatty or looping agent multiplies billable spans fast. Cost scales with how talkative the agent is, not with how much value the run produced. The forums that track observability spend put Datadog at the top of the "bill exploded" complaints.
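A back-of-envelope estimate makes the fan-out concrete. This sketch uses the approximate $8 per 10,000 LLM spans quoted above and treats tool, embedding, and agent spans as free; the traffic numbers are illustrative, not measurements.

```python
PRICE_PER_10K_LLM_SPANS = 8.00  # approximate annual-billing rate quoted above


def llm_span_cost(runs: int, turns_per_run: int, llm_calls_per_turn: int) -> float:
    """Estimated cost in USD; only LLM spans are counted as billable."""
    billable_spans = runs * turns_per_run * llm_calls_per_turn
    return billable_spans * PRICE_PER_10K_LLM_SPANS / 10_000


# Illustrative day of traffic: 10,000 runs, 3 LLM calls per turn.
healthy = llm_span_cost(runs=10_000, turns_per_run=8, llm_calls_per_turn=3)
# Same traffic, but the agent loops and averages 40 turns per run.
looping = llm_span_cost(runs=10_000, turns_per_run=40, llm_calls_per_turn=3)

print(f"healthy: ${healthy:,.2f}/day, looping: ${looping:,.2f}/day")
```

Same number of user requests, five times the bill: the extra spend comes entirely from the agent taking more turns.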

Strengths

Agents in the same pane as infra and APM
Mature, reliable, enterprise-grade
Deep cost, latency, and error tracing
Ingests OTel/OpenInference data

Limitations

Closed source, no self-host
Per-LLM-span billing balloons with agent volume
Generic ops signals, no failure taxonomy
Only worth it if you already run Datadog

Best for: enterprises already standardized on Datadog who want agents on the same dashboard.

5. Braintrust

Braintrust is the tool to beat if your problem is evaluation. It treats evals as a first-class workflow: build datasets from production traces, write scorers including LLM-as-judge, run experiments, and gate changes on the results. Observability is wired to the evals, so a regression shows up as a failing scorer. The free tier is 1 GB of processed data plus 10,000 scores a month at 14-day retention; Pro is $249 a month. It is not open source and not OpenTelemetry-first, which is why it sits apart from the OSS stack above.
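The dataset, scorer, experiment, and gate loop can be sketched generically. This is not Braintrust's SDK: `exact_match`, `run_experiment`, `gate`, the stub agents, and the 0.02 tolerance are all hypothetical stand-ins for the workflow described above.

```python
from statistics import mean


def exact_match(output: str, expected: str) -> float:
    """Simplest possible scorer; an LLM-as-judge scorer would slot in here instead."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_experiment(agent, dataset, scorer) -> float:
    """Average score of one agent version over the whole dataset."""
    return mean(scorer(agent(row["input"]), row["expected"]) for row in dataset)


def gate(candidate: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Allow the change only if the candidate does not regress past the tolerance."""
    return candidate >= baseline - tolerance


# Tiny dataset, as if sampled from production traces (contents are made up).
dataset = [
    {"input": "capital of France", "expected": "Paris"},
    {"input": "2 + 2", "expected": "4"},
    {"input": "opposite of hot", "expected": "cold"},
    {"input": "first month", "expected": "January"},
]

# Stub agents standing in for two versions of a real agent.
baseline_answers = {"capital of France": "Paris", "2 + 2": "4", "opposite of hot": "cold"}
candidate_answers = {"capital of France": "paris", "2 + 2": "four"}


def baseline_agent(prompt: str) -> str:
    return baseline_answers.get(prompt, "")


def candidate_agent(prompt: str) -> str:
    return candidate_answers.get(prompt, "")


baseline_score = run_experiment(baseline_agent, dataset, exact_match)
candidate_score = run_experiment(candidate_agent, dataset, exact_match)
ship = gate(candidate_score, baseline_score)
print(baseline_score, candidate_score, ship)
```

The regression surfaces as a failing gate instead of a production incident, which is the point of tying observability to evals.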

Strengths

Deepest eval workflow in the category
Datasets + experiments + scorers in one place
Observability tied to eval results
SDK wrappers for OpenAI Agents SDK, LangGraph, CrewAI

Limitations

Proprietary core, on-prem only on enterprise
Not OpenTelemetry-native
Evals go stale; Braintrust itself notes teams "spend more time maintaining eval infrastructure than improving agents"
Offline/async, not a live guardrail

Best for: teams whose core need is measuring quality with rigorous, versioned evals. See Braintrust vs Langfuse.

6. OpenLLMetry, Langtrace & SigNoz

Three more OpenTelemetry-native, self-hostable options that lean further toward "own your pipeline" than "buy a dashboard."

Tool	License
OpenLLMetry / Traceloop	Apache 2.0
Langtrace app (Apache SDK)	AGPL-3.0
SigNoz	MIT

OpenLLMetry (Traceloop). Not a dashboard but a set of OpenTelemetry instrumentations that auto-trace LLM providers, vector DBs, and frameworks, then emit standard OTel spans to any backend you run (Datadog, Grafana, Honeycomb, your own collector). Apache 2.0, ~7.2k stars. Traceloop sells a hosted backend with a 50k-span free tier, but the SDK's point is no lock-in. The best pick when you already have an OTel stack and just want LLM spans flowing into it.
Langtrace. An OTel-native tracer with a self-hostable app (AGPL-3.0) and Apache-2.0 SDKs, running Next.js + Postgres + ClickHouse via Docker Compose. Free for 5,000 spans a month, then roughly $31 per user a month. Smaller community (~1.2k stars) but a clean standards-based option if you want a turnkey self-host without Langfuse's service sprawl.
SigNoz. A general-purpose, MIT-licensed OpenTelemetry observability platform (~27.5k stars) positioned as the open-source Datadog alternative. LLM and agent monitoring is one use case via OTel instrumentation, not a purpose-built agent product. The right call if you want one self-hosted backend for traces, logs, metrics, and LLM spans rather than a dedicated LLM tool.

The shared trait

All three emit standard OpenTelemetry spans, so they compose with each other and with the per-turn classifier in the next section. You can instrument with OpenLLMetry, store in SigNoz, and write a semantic verdict onto each span, all over open standards. What none of them does, same as the rest of the list, is tell you whether a given turn was actually okay.

The Layer Every Tracer Is Missing

Strip away the dashboards and every tool above does one job well: it records the structure of a run so you can read it later. A span tells you a tool was called. A token counter tells you the run was expensive. Neither tells you the agent misread the user, went in circles with slightly varied arguments, got talked into ignoring its instructions, or watched a user rephrase the same request three times and quit.

Those are semantic signals. A trace cannot encode them, and producing them on every turn, fast and cheaply, is the hard part. It is why 89% of teams have observability and a third still cannot tell whether their agent is doing a good job.

That verdict needs a classifier that runs inline, not an eval that runs offline on a 1% sample tomorrow. A Morph Reflex is that classifier. It returns a label (jailbreak, loop, frustrated user, policy violation, or a failure you define) in one forward pass, under 90ms end-to-end, over up to 64k tokens of context, which is enough to see a full agent turn including its tool calls.

It bills per event (1 event = 2048 tokens) at $0.001 for realtime, roughly $0.49 per million tokens classified, up to 10x cheaper than running a frontier model as a judge on every turn, and you can train a custom reflex on your own labeled failures in under an hour.
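The per-million-token figure follows directly from the event size and price. A quick check of the arithmetic, with one labeled assumption for per-turn cost: the text does not say how partial events are billed, so `turn_cost` rounds up.

```python
import math

TOKENS_PER_EVENT = 2048        # from the pricing above
PRICE_PER_EVENT_USD = 0.001    # realtime rate from the pricing above


def cost_per_million_tokens() -> float:
    """USD to classify one million tokens at the per-event rate."""
    return 1_000_000 / TOKENS_PER_EVENT * PRICE_PER_EVENT_USD


def turn_cost(tokens: int) -> float:
    """USD for one turn. Assumption: a partial event is billed as a whole event."""
    events = math.ceil(tokens / TOKENS_PER_EVENT)
    return events * PRICE_PER_EVENT_USD


print(f"${cost_per_million_tokens():.4f} per 1M tokens")
print(f"${turn_cost(6_000):.3f} for a 6,000-token turn")
```

One million tokens is about 488 events, which is where the roughly $0.49 figure comes from.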

It does not replace your tracer; it composes with it. Because the label comes back as an API response rather than a panel inside one vendor's dashboard, you write is_agent_looping or jailbreak_attempt onto the Langfuse, Phoenix, or OpenTelemetry span you already emit, and the verdict shows up in the trace UI you already use. Alert on it the moment a live run drifts, or route on it inline: stop the agent, escalate to a human, or switch strategies before the bad turn ships. The OSS tracer keeps the history; the classifier adds the meaning.
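The compose-and-route step can be sketched end to end. Everything here is a stand-in: `Span` is a minimal substitute for a tracer's span object (OpenTelemetry spans expose a similar `set_attribute` method), `classify_turn` is a keyword stub in place of a real classifier API call, and the `verdict.action` attribute name is made up. Only the label names `is_agent_looping` and `jailbreak_attempt` come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Minimal stand-in for a tracer span with a set_attribute method."""
    name: str
    attributes: dict = field(default_factory=dict)

    def set_attribute(self, key: str, value) -> None:
        self.attributes[key] = value


def classify_turn(turn_text: str) -> dict:
    """Hypothetical stub; a real deployment would call a classifier API here."""
    lowered = turn_text.lower()
    return {
        "is_agent_looping": lowered.count("retrying search") >= 3,
        "jailbreak_attempt": "ignore your instructions" in lowered,
    }


def route(verdict: dict) -> str:
    """Turn a verdict into an action; the policy is illustrative."""
    if verdict["jailbreak_attempt"]:
        return "stop"
    if verdict["is_agent_looping"]:
        return "escalate"
    return "continue"


def observe_turn(span: Span, turn_text: str) -> str:
    """Write the verdict onto the span, then decide what the agent does next."""
    verdict = classify_turn(turn_text)
    for label, value in verdict.items():
        span.set_attribute(label, value)
    action = route(verdict)
    span.set_attribute("verdict.action", action)
    return action


span = Span("turn-7")
action = observe_turn(span, "Retrying search. Retrying search. Retrying search.")
print(action, span.attributes)
```

Because the verdict lands as span attributes, it is queryable in whatever trace UI already stores the span, and the returned action lets the agent loop react before the next turn.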

Two layers, one OTel span

Tracing layer (Langfuse, Phoenix, OTel)

Records every span, tool call, and token
Open source, self-host, own your data
Answers: what did the agent do
Post-hoc, you read it after the run

Verdict layer (Reflex)

Labels each turn: loop, jailbreak, frustration, policy
Under 90ms, up to 64k context, $0.001/event
Answers: was this turn okay
Real-time, writes onto your existing span

How to Choose

Pick Based on Your Priority

Your priority	Best choice	Runner-up
Most adopted open-source tracer	Langfuse	Arize Phoenix
Lightest self-host (one container)	Arize Phoenix	Langtrace
Agent-trajectory evaluation	Arize Phoenix	Braintrust
Emit LLM spans into an existing OTel stack	OpenLLMetry	Langtrace
One backend for traces, logs, metrics + LLM	SigNoz	self-host Langfuse
Already on Datadog	Datadog LLM Obs	OpenLLMetry to Datadog
Rigorous versioned evals	Braintrust	Arize Phoenix
A verdict on each turn, in real time	Reflex (<90ms)	build on Langfuse + LLM-judge

Most teams end up with two layers: an open-source, OpenTelemetry-native tracer for history (Langfuse, Phoenix, or whatever already feeds your OTel collector) and a real-time classifier for the failures a trace cannot catch. The tracer answers what happened. The classifier answers whether this turn is okay, right now, and writes the answer back onto the same span.

The verdict every tracer is missing, as an API

Reflex returns a semantic label on every agent turn in under 90ms: jailbreaks, looping, frustration, policy violations, or a failure you train in under an hour. It writes onto the Langfuse, Phoenix, or OpenTelemetry span you already emit.

Frequently Asked Questions

What are the best open-source AI agent observability tools?

Langfuse (MIT, OpenTelemetry-native, ~29.8k stars) is the most adopted open-source tracer. Arize Phoenix (Elastic License 2.0, OTel + OpenInference) is the agent-native eval and observability option. OpenLLMetry from Traceloop (Apache 2.0) is a set of OTel instrumentations that emit to any backend. Langtrace (AGPL-3.0 app) and SigNoz (MIT) are also OTel-native and self-hostable. Helicone is Apache 2.0 but in maintenance mode since the Mintlify acquisition in March 2026.

Which AI observability tools are OpenTelemetry-native?

Langfuse, Arize Phoenix, OpenLLMetry/Traceloop, Langtrace, and SigNoz all emit standard OpenTelemetry spans, so you can route the same instrumentation to any compatible backend. Datadog LLM Observability and Helicone can ingest OTel but are not OTel-first: Datadog is a closed commercial backend and Helicone is a proxy. Braintrust is eval-first and not built around OpenTelemetry.

What is the difference between agent observability and agent monitoring?

Observability tools focus on tracing what happened so you can explain a run after the fact; monitoring tools add alerting, triage, and incident workflows on top. Most platforms do both. The shared limit is the same: they tell you what the agent did, not whether each turn was a loop, jailbreak, or frustrated user. For the monitoring and build-your-own angle, see best AI agent monitoring tools.

Is Helicone still maintained in 2026?

Mintlify acquired Helicone on March 3, 2026, and the founders joined Mintlify. It is in maintenance mode: security updates, bug fixes, and new model support continue, but there is no active roadmap. The proxy and Sessions view still work; factor the stalled development in if you are standardizing for the next few years.

Why does Datadog LLM Observability get expensive for agents?

Datadog meters per LLM span, and its docs state each LLM span is priced independently. One agent run fans out into many LLM calls across many turns, so a looping or retrying agent multiplies billable spans. Cost scales with how chatty the agent is, not with how much value the run produced.

Can observability tools catch agent failures in real time?

Tracing tools are post-hoc: the data lands in a dashboard after the turn. Catching a failure in time to act needs a classifier fast enough to run inline. Morph Reflex runs per-turn classifiers in under 90ms over up to 64k context, billed per event at $0.001 for realtime, so a loop, jailbreak, or policy violation can be flagged or blocked mid-run and written onto your existing span. See agent observability.

Should I run a per-turn classifier instead of a tracer?

No, run both. The tracer gives you the structure and history of every run. A per-turn classifier adds the one thing the trace cannot encode: a verdict on the meaning of each turn. The classifier returns a label over an API, so you write it onto a Langfuse, Phoenix, or OpenTelemetry span, alert on it, or route on it inline. See Reflex.

Go deeper

Best AI agent monitoring tools: the monitoring, alerting, and build-vs-buy angle
Agent observability: why a green trace hides a broken agent
Reflex: the per-turn classifier, under 90ms, custom-trained in under an hour
Pricing: per-event Reflex rates and the rest of the API

Sources

LangChain: State of Agent Engineering 2025 (1,340 practitioners; 89% run observability, quality is the top blocker)
Langfuse pricing and GitHub (MIT, OTel-native, ~29.8k stars)
Arize Phoenix and OpenInference (ELv2, OTel)
Traceloop pricing and OpenLLMetry (Apache 2.0)
Mintlify acquires Helicone (March 3, 2026) and r/LLMDevs: alternatives now?
Datadog LLM Observability cost docs ("each LLM span is priced independently")
Braintrust pricing and its agent observability guide (eval-upkeep admission)
Langtrace (AGPL-3.0 app) and SigNoz pricing (MIT)
Langfuse: should you use an LLM proxy? (latency, single point of failure)
r/LangChain: what's everyone actually using for LLM observability ("custom instrumentation on top of whichever backend frustrates you least")
Morph Reflex capabilities and pricing
