89% of agent teams have observability. One in three still call quality their biggest blocker to shipping. The dashboards are everywhere; the hard problem did not move. The reason is that observability tools record what the agent did, not whether each turn was okay. Tool support, licenses, and pricing below are verified against vendor docs and GitHub as of June 2026.
The category sounds like one product, but it is two layers. The first is tracing: capture the nested tree of plans, tool calls, retries, and subagents inside one request, attribute cost per step, and store it so you can read the run later. Almost every tool below does this well, and the best ones are open source and built on OpenTelemetry, so you self-host them and own the data.
The second layer is a verdict on each turn: is this a loop, an off-task drift, a jailbreak, a policy violation, a user giving up. That one is mostly missing, and it is the one teams keep saying they had to build themselves.
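To make the tracing layer concrete, here is a minimal sketch of a trace tree with per-step cost attribution. The `Span` class, the `kind` labels, and the dollar figures are illustrative assumptions, not any vendor's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One node in a trace tree: a plan, LLM call, tool call, retry, or subagent."""
    name: str
    kind: str                 # illustrative labels: "agent", "llm", "tool"
    cost_usd: float = 0.0     # cost attributed to this step alone
    children: list = field(default_factory=list)

    def total_cost(self) -> float:
        """Cost of this step plus everything nested under it."""
        return self.cost_usd + sum(child.total_cost() for child in self.children)


def cost_by_kind(span, totals=None):
    """Walk the tree and sum each step's own cost under its kind."""
    totals = {} if totals is None else totals
    totals[span.kind] = totals.get(span.kind, 0.0) + span.cost_usd
    for child in span.children:
        cost_by_kind(child, totals)
    return totals


# A made-up request: one plan, one tool call with a retry, one subagent.
run = Span("handle_request", "agent", children=[
    Span("plan", "llm", cost_usd=0.004),
    Span("search_docs", "tool", children=[
        Span("retry_after_timeout", "tool"),
    ]),
    Span("research_subagent", "agent", children=[
        Span("summarize", "llm", cost_usd=0.006),
    ]),
])

total = run.total_cost()
by_kind = cost_by_kind(run)
print(f"total ${total:.3f}", by_kind)
```

The tree answers "what ran and what did it cost." Nothing in it says whether any of those steps was the right thing to do.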
This page is the open-source and OpenTelemetry roundup. For the monitoring, alerting, and build-your-own angle, see best AI agent monitoring tools. For why a healthy-looking trace hides a broken agent, see agent observability.
Open-Source & OpenTelemetry Tools, Compared (2026)
One row per tool. "Open source" is the license on the self-hostable core. "OTel-native" means it emits standard OpenTelemetry spans rather than a proprietary format, so you are not locked to one backend. "What it traces" is the unit it records. Every tool except the one in the last row is a tracer; none of the tracers returns a per-turn verdict, which is the gap covered in "The Layer Every Tracer Is Missing."
| Tool | Open source | OTel-native | Self-host | Pricing | What it traces |
|---|---|---|---|---|---|
| Langfuse | Yes (MIT) | Yes (v3 SDK) | Yes | Free OSS; cloud 50k units/mo free, then $29+/mo | Traces, observations, scores |
| Arize Phoenix | Source-available (ELv2) | Yes (+OpenInference) | Yes | Phoenix OSS free; Arize AX enterprise | Agent spans, evals, trajectories |
| OpenLLMetry / Traceloop | Yes (Apache 2.0) | Yes (is OTel) | Yes (SDK to any backend) | SDK free; cloud 50k spans/mo free | OTel spans for LLM/vector/framework calls |
| Helicone | Yes (Apache 2.0) | Ingests OTel | Yes | Maintenance mode since Mar 2026 | Proxied requests, sessions, cost |
| Datadog LLM Obs | No (proprietary) | Ingests OTel | No | Per LLM span (~$8 / 10k spans) | LLM spans, tokens, cost, latency |
| Braintrust | No (proprietary core) | No (eval-first) | Enterprise only | Free 1GB + 10k scores/mo; Pro $249/mo | Spans tied to evals and scores |
| Langtrace | Yes (AGPL-3.0 app) | Yes | Yes | Free 5k spans/mo; $31/user/mo | OTel spans, metrics, annotations |
| SigNoz | Yes (MIT) | Yes | Yes | Community free; cloud from $49/mo | OTLP traces, logs, metrics (LLM via OTel) |
| Reflex (the missing layer) | API | Writes onto your spans | Cloud API | $0.001/event (~$0.49 / 1M tok) | A per-turn verdict, not a trace |
Licenses, OTel support, GitHub stars, and free-tier numbers verified against each vendor's docs, pricing page, and repo in June 2026. Helicone's license is unchanged but active development stopped after the Mintlify acquisition. Reflex is not a tracer; it is the per-turn classifier in the table's last row. See pricing.
Why OpenTelemetry Is the Through-Line
The reason "open source" and "OpenTelemetry" keep appearing together in this category is that OTel is what makes the open-source tools interchangeable. OpenTelemetry is a vendor-neutral standard for emitting spans. A tool that is OTel-native produces those standard spans, so the same instrumentation can feed Langfuse today, Phoenix tomorrow, and your own ClickHouse the day after, without rewriting a line of agent code. Arize maintains OpenInference, a set of semantic conventions on top of OTel specific to LLM and agent spans, which is why Phoenix and Langfuse can read each other's traces.
That portability is the whole appeal of the OSS stack. You instrument once with an OTel SDK or OpenLLMetry, point it at a self-hosted collector, and keep every byte of prompt and completion data inside your own infrastructure. No proxy in the request path, no per-seat lock-in, no vendor that can sunset the product out from under you, which is exactly what happened to Helicone customers in March 2026.
One architectural fork splits this list. Helicone is a proxy: it sits between your app and the model provider and logs requests passing through. That is the fastest way to get visibility, but it puts a service in the hot path. Langfuse, which is async by design, argues the case against it plainly in its own engineering writeup: a proxy "can introduce additional latency" and "can be a single point of failure." The common version of this objection on developer forums is blunter: you do not want another service between you and your LLM provider. OpenTelemetry-native tracing sidesteps it because spans are emitted out of band, not inline.
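As a rough illustration of what "out of band" means, the sketch below buffers finished spans in memory and ships them from a background thread, so a slow or failing backend never blocks the agent's request path. `BackgroundSpanExporter` and `slow_backend` are hypothetical names for this example; real OpenTelemetry SDKs provide their own batching processors, which this does not reproduce.

```python
import queue
import threading
import time


class BackgroundSpanExporter:
    """Buffers finished spans and delivers them from a worker thread."""

    def __init__(self, send, max_buffer=1000):
        self._send = send                          # callable that delivers one span
        self._queue = queue.Queue(maxsize=max_buffer)
        self.dropped = 0
        threading.Thread(target=self._drain, daemon=True).start()

    def record(self, span):
        """Called on the request path: never blocks, drops if the buffer is full."""
        try:
            self._queue.put_nowait(span)
        except queue.Full:
            self.dropped += 1

    def _drain(self):
        while True:
            span = self._queue.get()
            try:
                self._send(span)
            except Exception:
                pass                               # a backend outage must not reach the agent
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until every buffered span has been handed to the backend."""
        self._queue.join()


received = []


def slow_backend(span):
    time.sleep(0.01)                               # simulate network latency to the backend
    received.append(span)


exporter = BackgroundSpanExporter(slow_backend)
for step in range(20):
    exporter.record({"name": "tool_call", "step": step})   # returns immediately
exporter.flush()
print(len(received), "spans exported,", exporter.dropped, "dropped")
```

A proxy makes the opposite trade: the logging service sits inline, so its latency and its outages become the agent's latency and outages.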
Figure: Open-source LLM observability tools by GitHub stars (June 2026, rounded). Community size is a proxy for integration breadth and how battle-tested the tool is.
Stars signal adoption, not capability; SigNoz is a general OpenTelemetry platform with LLM as one use case, the rest are LLM/agent-specific. All six are tracers: they record runs. None classifies whether a given turn was a loop, jailbreak, or frustrated user in real time.
1. Langfuse
Langfuse is the default open-source tracer and the name that comes up most in community threads. It rebuilt its SDK on OpenTelemetry in v3, so it ingests standard OTel spans and interops with Grafana, Jaeger, and Datadog. The entire product is MIT-licensed except a thin enterprise compliance folder, which means a self-hosted deploy has no seat caps, retention limits, or usage caps. The hosted free tier is 50,000 units a month (a unit is any trace, observation, or score). At ~29.8k GitHub stars it has the largest community in the category.
Pros
- MIT core, self-host with no caps, full data ownership
- OpenTelemetry-native, broad framework support
- Prompt management + dataset + eval tooling
- Largest community, most integrations
Cons
- Self-host needs ClickHouse + Redis + S3, not one container
- Unit billing: a many-span agent run burns many units
- Tracing-first: shows what happened, not whether it worked
- Post-hoc; cannot act on a turn while it runs
Best for: teams that want the most adopted open-source tracer with full data control. See Langfuse alternatives.
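The unit-billing point is easy to check with arithmetic. This sketch assumes each run is one trace plus one unit per observation and per score, as the unit definition above implies; the observation and score counts are made-up examples.

```python
FREE_UNITS_PER_MONTH = 50_000   # hosted free tier quoted above


def units_per_run(observations: int, scores: int) -> int:
    """One unit for the trace, plus one per observation and per score."""
    return 1 + observations + scores


def runs_within_free_tier(observations: int, scores: int) -> int:
    """Whole runs that fit in the monthly free allowance."""
    return FREE_UNITS_PER_MONTH // units_per_run(observations, scores)


simple_chat = runs_within_free_tier(observations=2, scores=0)    # 3 units per run
agent_run = runs_within_free_tier(observations=40, scores=2)     # 43 units per run
print(simple_chat, agent_run)
```

Under these assumptions a two-step chat fits over sixteen thousand runs in the free tier, while a 40-observation agent run fits a little over a thousand.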
2. Arize Phoenix
Phoenix is Arize's open-source agent-evaluation and tracing layer; Arize AX is the enterprise platform above it. It is genuinely agent-aware: parent-child spans across agent, LLM, and tool steps, trajectory evaluation, and a strong eval library. It is OpenTelemetry-based and built on OpenInference, and a common community note is that it self-hosts in a single Docker container, lighter than Langfuse's multi-service deploy.
The license is Elastic License 2.0, which is source-available rather than OSI open source: free to self-host, with a restriction on reselling it as a managed service.
Pros
- Real agent-trajectory and span-level evaluation
- Single-container self-host, lighter setup
- OTel + OpenInference, strong eval library
- Clear path from free Phoenix to enterprise AX
Cons
- ELv2 is source-available, not OSI open source
- Oriented to ML and data-science teams
- Full platform (AX) trends enterprise and per-volume
- Eval is offline, not live per-turn blocking
Best for: ML teams that want rigorous, agent-aware evaluation with a light open-source on-ramp. See Arize Phoenix vs Langfuse.
3. Helicone
Mintlify acquired Helicone on March 3, 2026, and both founders joined Mintlify. Security patches, bug fixes, and new model support continue, but there is no active roadmap. Helicone served roughly 16,000 organizations before the acquisition. If you are choosing a platform to standardize on, weigh the stalled development; if you already run it, the proxy and Sessions view still work. The fallout is live on r/LLMDevs.
Helicone is an Apache-2.0 AI gateway: a proxy that fronts 100+ models behind one OpenAI-compatible key, with routing, caching, fallbacks, and request logging. Sessions group an agent's requests into one flow view with per-step latency and cost. It was the fastest path from zero to visibility with a one-line base-URL change, and it is self-hostable via Docker and Helm. The proxy architecture is also the main objection to it: it sits in the request path, which adds latency and a failure point.
Pros
- One-line setup via OpenAI-compatible base URL
- Gateway features: routing, caching, fallbacks
- Apache 2.0, self-hostable
- Cost and session views out of the box
Cons
- Maintenance mode: no new features after Mar 2026
- Proxy sits in the hot path (latency, single point of failure)
- Logging-first, thin evaluation
- Post-hoc; no per-turn verdict
Best for: existing users who value the gateway, with a migration plan given the maintenance-mode status.
4. Datadog LLM Observability
If you already run Datadog, LLM Observability puts agent traces next to your APM, infra, and logs on one bill and one dashboard. It captures prompts, completions, token usage, cost, latency, and errors, and its AI Agents console visualizes multi-step runs. It can ingest OpenTelemetry and OpenInference data, but it is a closed commercial backend, not an open or OTel-first tool. The pricing model is the thing to understand before you adopt it for agents.
Datadog's docs state that each LLM span in a trace is priced independently (about $8 per 10,000 LLM spans on annual billing). Tool, embedding, and agent spans are not billed, but a single agent run fans out into many LLM calls across many turns, so a chatty or looping agent multiplies billable spans fast. Cost scales with how talkative the agent is, not with how much value the run produced. The forums that track observability spend put Datadog at the top of the "bill exploded" complaints.
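A back-of-envelope model shows how quickly this scales. It uses the approximate $8 per 10,000 LLM spans figure quoted above and assumes only LLM spans are billed; the run counts, turn counts, and calls per turn are invented for illustration.

```python
USD_PER_LLM_SPAN = 8 / 10_000   # ~$8 per 10,000 LLM spans, annual billing (figure quoted above)


def billable_llm_spans(turns: int, llm_calls_per_turn: int) -> int:
    """Only LLM spans count; tool, embedding, and agent spans are excluded."""
    return turns * llm_calls_per_turn


def monthly_cost_usd(runs_per_month: int, turns: int, llm_calls_per_turn: int) -> float:
    return runs_per_month * billable_llm_spans(turns, llm_calls_per_turn) * USD_PER_LLM_SPAN


# Hypothetical workloads: same traffic, but one agent loops five times longer.
healthy = monthly_cost_usd(100_000, turns=6, llm_calls_per_turn=2)
looping = monthly_cost_usd(100_000, turns=30, llm_calls_per_turn=2)
print(f"healthy ${healthy:,.0f}/mo, looping ${looping:,.0f}/mo")
```

In this toy scenario the looping agent costs five times as much to observe while producing no extra value, which is the scaling behavior described above.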
Pros
- Agents in the same pane as infra and APM
- Mature, reliable, enterprise-grade
- Deep cost, latency, and error tracing
- Ingests OTel/OpenInference data
Cons
- Closed source, no self-host
- Per-LLM-span billing balloons with agent volume
- Generic ops signals, no failure taxonomy
- Only worth it if you already run Datadog
Best for: enterprises already standardized on Datadog who want agents on the same dashboard.
5. Braintrust
Braintrust is the tool to beat if your problem is evaluation. It treats evals as a first-class workflow: build datasets from production traces, write scorers including LLM-as-judge, run experiments, and gate changes on the results. Observability is wired to the evals, so a regression shows up as a failing scorer. The free tier is 1 GB of processed data plus 10,000 scores a month at 14-day retention; Pro is $249 a month. It is not open source and not OpenTelemetry-first, which is why it sits apart from the OSS stack above.
Pros
- Deepest eval workflow in the category
- Datasets + experiments + scorers in one place
- Observability tied to eval results
- SDK wrappers for OpenAI Agents SDK, LangGraph, CrewAI
Cons
- Proprietary core, on-prem only on enterprise
- Not OpenTelemetry-native
- Evals go stale; Braintrust itself notes teams "spend more time maintaining eval infrastructure than improving agents"
- Offline/async, not a live guardrail
Best for: teams whose core need is measuring quality with rigorous, versioned evals. See Braintrust vs Langfuse.
6. OpenLLMetry, Langtrace & SigNoz
Three more OpenTelemetry-native, self-hostable options that lean further toward "own your pipeline" than "buy a dashboard."
- OpenLLMetry (Traceloop). Not a dashboard but a set of OpenTelemetry instrumentations that auto-trace LLM providers, vector DBs, and frameworks, then emit standard OTel spans to any backend you run (Datadog, Grafana, Honeycomb, your own collector). Apache 2.0, ~7.2k stars. Traceloop sells a hosted backend with a 50k-span free tier, but the SDK's point is no lock-in. The best pick when you already have an OTel stack and just want LLM spans flowing into it.
- Langtrace. An OTel-native tracer with a self-hostable app (AGPL-3.0) and Apache-2.0 SDKs, running Next.js + Postgres + ClickHouse via Docker Compose. Free for 5,000 spans a month, then roughly $31 per user a month. Smaller community (~1.2k stars) but a clean standards-based option if you want a turnkey self-host without Langfuse's service sprawl.
- SigNoz. A general-purpose, MIT-licensed OpenTelemetry observability platform (~27.5k stars) positioned as the open-source Datadog alternative. LLM and agent monitoring is one use case via OTel instrumentation, not a purpose-built agent product. The right call if you want one self-hosted backend for traces, logs, metrics, and LLM spans rather than a dedicated LLM tool.
All three emit standard OpenTelemetry spans, so they compose with each other and with the per-turn classifier in the next section. You can instrument with OpenLLMetry, store in SigNoz, and write a semantic verdict onto each span, all over open standards. What none of them does, same as the rest of the list, is tell you whether a given turn was actually okay.
The Layer Every Tracer Is Missing
Strip away the dashboards and every tool above does one job well: it records the structure of a run so you can read it later. A span tells you a tool was called. A token counter tells you the run was expensive. Neither tells you the agent misread the user, went in circles with slightly varied arguments, got talked into ignoring its instructions, or watched a user rephrase the same request three times and quit.
Those are semantic signals, and producing them on every turn, fast and cheaply, is the part the trace cannot encode. It is why 89% of teams have observability and a third still cannot tell whether their agent is doing a good job.
That verdict needs a classifier that runs inline, not an eval that runs offline on a 1% sample tomorrow. A Morph Reflex is that classifier. It answers one question per turn (is this a jailbreak, a loop, a frustrated user, a policy violation, or a failure you define?) and returns the label in one forward pass, under 90ms end-to-end, over up to 64k tokens of context, which is enough to see a full agent turn including its tool calls.
It bills per event (1 event = 2048 tokens) at $0.001 for realtime, or roughly $0.49 per million tokens classified. That is up to 10x cheaper than running a frontier model as a judge on every turn. You can also train a custom reflex on your own labeled failures in under an hour.
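The per-million figure follows directly from the event size and the event price. The sketch below reproduces it; the rounding rule in `events_for` (a partial event bills as a full one) is an assumption, since the text above does not say how partial events are counted.

```python
import math

TOKENS_PER_EVENT = 2048   # 1 event = 2048 tokens
USD_PER_EVENT = 0.001     # realtime rate quoted above


def events_for(tokens: int) -> int:
    """Assumption: a partial event rounds up to a whole event."""
    return math.ceil(tokens / TOKENS_PER_EVENT)


def cost_usd(tokens: int) -> float:
    return events_for(tokens) * USD_PER_EVENT


# Price per million tokens, ignoring rounding: about $0.49.
per_million = 1_000_000 / TOKENS_PER_EVENT * USD_PER_EVENT

# A turn that fills a 64k-token context window.
full_context_events = events_for(64 * 1024)
full_context_cost = cost_usd(64 * 1024)
print(f"${per_million:.4f} per 1M tokens; {full_context_events} events per 64k turn")
```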
It does not replace your tracer; it composes with it. Because the label comes back as an API response rather than a panel inside one vendor's dashboard, you write is_agent_looping or jailbreak_attempt onto the Langfuse, Phoenix, or OpenTelemetry span you already emit, and the verdict shows up in the trace UI you already use. Alert on it the moment a live run drifts, or route on it inline: stop the agent, escalate to a human, or switch strategies before the bad turn ships. The OSS tracer keeps the history; the classifier adds the meaning.
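A minimal sketch of that composition, under stated assumptions: the `Span` class stands in for whatever span object your tracer gives you, `fake_classifier` is a hard-coded stand-in so the example runs offline (it is not the Reflex API), and the `verdict.` attribute prefix, the `user_frustrated` label, and the action names are hypothetical. Only `is_agent_looping` and `jailbreak_attempt` come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Stand-in for the span your tracer already emits."""
    name: str
    attributes: dict = field(default_factory=dict)


def fake_classifier(turn_text: str) -> dict:
    """Offline stub. A real per-turn classifier is a model call, not a string match."""
    return {
        "is_agent_looping": turn_text.count("search(") >= 3,
        "jailbreak_attempt": False,
    }


def annotate_and_route(span: Span, verdict: dict) -> str:
    """Write each label onto the span, then pick an inline action."""
    for label, flagged in verdict.items():
        span.attributes[f"verdict.{label}"] = flagged
    if verdict.get("jailbreak_attempt"):
        return "stop"
    if verdict.get("is_agent_looping"):
        return "switch_strategy"
    if verdict.get("user_frustrated"):
        return "escalate_to_human"
    return "continue"


turn = 'search("refund policy") search("refund policies") search("policy refund")'
span = Span("agent_turn")
action = annotate_and_route(span, fake_classifier(turn))
print(action, span.attributes)
```

Because the verdict is stored as ordinary span attributes, it travels with the trace to whichever backend already receives your spans, and the returned action lets the calling code decide what happens before the next turn.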
Your tracer
- Records every span, tool call, and token
- Open source, self-host, own your data
- Answers: what did the agent do
- Post-hoc, you read it after the run
Per-turn classifier
- Labels each turn: loop, jailbreak, frustration, policy
- Under 90ms, up to 64k context, $0.001/event
- Answers: was this turn okay
- Real-time, writes onto your existing span
How to Choose
| Your priority | Best choice | Runner-up |
|---|---|---|
| Most adopted open-source tracer | Langfuse | Arize Phoenix |
| Lightest self-host (one container) | Arize Phoenix | Langtrace |
| Agent-trajectory evaluation | Arize Phoenix | Braintrust |
| Emit LLM spans into an existing OTel stack | OpenLLMetry | Langtrace |
| One backend for traces, logs, metrics + LLM | SigNoz | self-host Langfuse |
| Already on Datadog | Datadog LLM Obs | OpenLLMetry to Datadog |
| Rigorous versioned evals | Braintrust | Arize Phoenix |
| A verdict on each turn, in real time | Reflex (<90ms) | build on Langfuse + LLM-judge |
Most teams end up with two layers: an open-source, OpenTelemetry-native tracer for history (Langfuse, Phoenix, or whatever already feeds your OTel collector) and a real-time classifier for the failures a trace cannot catch. The tracer answers what happened. The classifier answers whether this turn is okay, right now, and writes the answer back onto the same span.
The verdict every tracer is missing, as an API
Reflex returns a semantic label on every agent turn in under 90ms: jailbreaks, looping, frustration, policy violations, or a failure you train in under an hour. It writes onto the Langfuse, Phoenix, or OpenTelemetry span you already emit.
Frequently Asked Questions
What are the best open-source AI agent observability tools?
Langfuse (MIT, OpenTelemetry-native, ~29.8k stars) is the most adopted open-source tracer. Arize Phoenix (Elastic License 2.0, OTel + OpenInference) is the agent-native eval and observability option. OpenLLMetry from Traceloop (Apache 2.0) is a set of OTel instrumentations that emit to any backend. Langtrace (AGPL-3.0 app) and SigNoz (MIT) are also OTel-native and self-hostable. Helicone is Apache 2.0 but in maintenance mode since the Mintlify acquisition in March 2026.
Which AI observability tools are OpenTelemetry-native?
Langfuse, Arize Phoenix, OpenLLMetry/Traceloop, Langtrace, and SigNoz all emit standard OpenTelemetry spans, so you can route the same instrumentation to any compatible backend. Datadog LLM Observability and Helicone can ingest OTel but are not open and OTel-first. Braintrust is eval-first and not built around OpenTelemetry.
What is the difference between agent observability and agent monitoring?
Observability tools focus on tracing what happened so you can explain a run after the fact; monitoring tools add alerting, triage, and incident workflows on top. Most platforms do both. They share the same limit: they tell you what the agent did, not whether each turn was a loop, jailbreak, or frustrated user. For the monitoring and build-your-own angle, see best AI agent monitoring tools.
Is Helicone still maintained in 2026?
Mintlify acquired Helicone on March 3, 2026, and the founders joined Mintlify. It is in maintenance mode: security updates, bug fixes, and new model support continue, but there is no active roadmap. The proxy and Sessions view still work; factor the stalled development in if you are standardizing for the next few years.
Why does Datadog LLM Observability get expensive for agents?
Datadog meters per LLM span, and its docs state each LLM span is priced independently. One agent run fans out into many LLM calls across many turns, so a looping or retrying agent multiplies billable spans. Cost scales with how chatty the agent is, not with how much value the run produced.
Can observability tools catch agent failures in real time?
Tracing tools are post-hoc: the data lands in a dashboard after the turn. Catching a failure in time to act needs a classifier fast enough to run inline. Morph Reflex runs per-turn classifiers in under 90ms over up to 64k context, billed per event at $0.001 for realtime, so a loop, jailbreak, or policy violation can be flagged or blocked mid-run and written onto your existing span. See agent observability.
Should I run a per-turn classifier instead of a tracer?
No, run both. The tracer gives you the structure and history of every run. A per-turn classifier adds the one thing the trace cannot encode: a verdict on the meaning of each turn. The classifier returns a label over an API, so you write it onto a Langfuse, Phoenix, or OpenTelemetry span, alert on it, or route on it inline. See Reflex.
Go deeper
- Best AI agent monitoring tools: the monitoring, alerting, and build-vs-buy angle
- Agent observability: why a green trace hides a broken agent
- Reflex: the per-turn classifier, under 90ms, custom-trained in under an hour
- Pricing: per-event Reflex rates and the rest of the API
Sources
- LangChain: State of Agent Engineering 2025 (1,340 practitioners; 89% run observability, quality is the top blocker)
- Langfuse pricing and GitHub (MIT, OTel-native, ~29.8k stars)
- Arize Phoenix and OpenInference (ELv2, OTel)
- Traceloop pricing and OpenLLMetry (Apache 2.0)
- Mintlify acquires Helicone (March 3, 2026) and r/LLMDevs: alternatives now?
- Datadog LLM Observability cost docs ("each LLM span is priced independently")
- Braintrust pricing and its agent observability guide (eval-upkeep admission)
- Langtrace (AGPL-3.0 app) and SigNoz pricing (MIT)
- Langfuse: should you use an LLM proxy? (latency, single point of failure)
- r/LangChain: what's everyone actually using for LLM observability ("custom instrumentation on top of whichever backend frustrates you least")
- Morph Reflex capabilities and pricing