AI Observability in 2026: Tools, Standards, and What Actually Matters

AI agents fail silently. Traditional APM tracks latency and errors but can't tell you why an agent picked the wrong tool or looped for 40 steps. AI observability closes that gap. Tool comparison, pricing, and OpenTelemetry GenAI conventions explained.

March 18, 2026 · 2 min read

Why Traditional Monitoring Fails for AI Agents

Traditional APM was built for deterministic software. A function either returns the right value or throws an exception. HTTP status codes, latency percentiles, and error rates tell you everything you need to know. Datadog, New Relic, and Grafana have spent two decades refining this model.

AI agents break every assumption. The same prompt produces different outputs on consecutive runs. A 200 response in 50ms can contain a hallucinated API endpoint. An agent can select the wrong tool, execute it without errors, and produce a confidently wrong result. Your APM dashboard stays green while users file tickets.

Silent Failures

An agent returns a 200 with a plausible but wrong answer. No error, no latency spike. Traditional monitoring sees a healthy system. The user sees a broken product.

Non-Determinism

The same input produces different outputs across runs. Temperature, context window contents, and tool availability all affect results. You can't write assertions for 'correct' the way you can for deterministic code.

Cascading Context Errors

Agent A retrieves bad context. Agent B reasons correctly over it. Agent C produces a wrong final answer. The error originated two agents ago. Single-service APM can't trace this chain.

The gap is specific: APM confirms a request succeeded. AI observability evaluates whether the answer was correct. These are different questions requiring different instrumentation. A model can hallucinate at 50ms latency with zero errors.

What AI Observability Actually Tracks

AI observability adds three layers on top of traditional monitoring: LLM-specific tracing, output evaluation, and agent instrumentation.

LLM Tracing

Prompt/completion pairs, token counts (input and output), model ID and version, latency per LLM call, cost per request. This is the foundation: you can't debug what you can't see.
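As a sketch of what this foundation layer captures, the fields above can be modeled as a small record wrapped around any provider call. Everything here is illustrative (the names `LLMSpan` and `record_llm_call` are not any vendor's schema, and `call_fn` stands in for your provider client):

```python
from dataclasses import dataclass, field
import time
import uuid

@dataclass
class LLMSpan:
    """One LLM call: the foundation-layer fields listed above."""
    model: str                      # model ID and version
    prompt: str
    completion: str = ""
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)

def record_llm_call(model, prompt, call_fn, usd_per_input_tok, usd_per_output_tok):
    """Wrap an LLM call and capture the trace fields.

    call_fn(prompt) is assumed to return (completion, input_tokens, output_tokens).
    """
    start = time.perf_counter()
    completion, in_toks, out_toks = call_fn(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    return LLMSpan(
        model=model, prompt=prompt, completion=completion,
        input_tokens=in_toks, output_tokens=out_toks,
        latency_ms=latency_ms,
        cost_usd=in_toks * usd_per_input_tok + out_toks * usd_per_output_tok,
    )
```

Real platforms add batching, sampling, and redaction on top, but the captured shape is essentially this.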

Output Evaluation

Faithfulness (is the output grounded in context?), relevance (does it answer the question?), safety (toxicity, PII leakage, jailbreak detection). These metrics catch failures that return 200 OK.
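To make the faithfulness idea concrete, here is a deliberately naive grounding check: the fraction of the output's content words that also appear in the retrieved context. Production evaluators (such as the metrics Phoenix ships) use LLM judges or learned models, not token overlap; this sketch only shows the shape of the check.

```python
import re

def token_overlap_faithfulness(output: str, context: str) -> float:
    """Naive grounding score in [0, 1]: share of output words
    that are present in the context. Illustrative only -- real
    faithfulness metrics are far more sophisticated."""
    tokenize = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    out_toks = tokenize(output)
    if not out_toks:
        return 1.0  # empty output is trivially grounded
    return len(out_toks & tokenize(context)) / len(out_toks)
```

A low score flags an answer that introduces words the context never supplied, which is one cheap signal for the 200-OK hallucination described above.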

Agent Instrumentation

Tool call sequences, reasoning step traces, decision points, retry loops, and planning stages. When an agent enters a 30-step loop burning $4 in tokens, this layer shows why.
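One thing this layer enables is mechanical loop detection. As a minimal sketch (the function name and threshold are illustrative), a stuck agent often shows up as the same (tool, input) pair repeating in the trace:

```python
from collections import Counter

def detect_tool_loops(tool_calls, max_repeats=5):
    """Flag (tool, args) pairs that repeat more than max_repeats
    times in one agent run -- a common signature of a retry loop.

    tool_calls is a list of (tool_name, args) tuples with hashable args.
    """
    counts = Counter((name, args) for name, args in tool_calls)
    return [pair for pair, n in counts.items() if n > max_repeats]
```

An alert on a non-empty result catches the 30-step, $4 loop before the invoice does.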

Cost and Usage Analytics

Per-request cost by model and provider, daily/weekly spend trends, cost attribution by feature or team, budget alerts. Most teams discover runaway costs from invoices, not dashboards.
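A cost-attribution dashboard reduces to an aggregation like the following sketch. The price table is hypothetical (real dashboards pull prices from provider pricing pages), and the feature tags are whatever your traces carry:

```python
from collections import defaultdict

# Hypothetical per-1K-token prices (input, output) -- not real rates.
PRICES = {"model-a": (0.003, 0.015), "model-b": (0.0005, 0.0015)}

def request_cost(model, input_tokens, output_tokens):
    """Cost of one request given per-1K-token prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1000

def spend_by_feature(requests, budget_usd):
    """Aggregate cost per feature tag and flag budget breaches.

    Each request is a dict: {"feature", "model", "in", "out"}.
    """
    totals = defaultdict(float)
    for r in requests:
        totals[r["feature"]] += request_cost(r["model"], r["in"], r["out"])
    alerts = [f for f, usd in totals.items() if usd > budget_usd]
    return dict(totals), alerts
```

Running this continuously against trace data is what turns the invoice surprise into a same-day alert.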

The Evaluation Gap

Tracing alone is not observability. Seeing that an agent called GPT-5.4 with a 3,000-token prompt and got a 500-token response in 1.2s tells you what happened. Evaluation tells you whether the response was good. Arize Phoenix ships 50+ built-in evaluation metrics. LangSmith integrates evaluations directly into traces. Without this layer, you have logging, not observability.

AI Observability Tools Compared (2026)

The market split into three categories: open-source platforms (Langfuse, Arize Phoenix, Helicone), LLM-native vendors (LangSmith, Portkey, Pydantic Logfire), and incumbent APM extensions (Datadog LLM Observability). Each has different trade-offs in flexibility, cost, and integration depth.

| Tool | Type | Open Source | Key Strength | Best For |
|---|---|---|---|---|
| Langfuse | Platform | Yes (MIT) | Self-hostable, ClickHouse-backed, prompt management | Teams wanting full control and self-hosting |
| LangSmith | Platform | No | Deep LangChain/LangGraph integration, evaluations | LangChain ecosystem users |
| Arize Phoenix | Platform | Yes | 50+ eval metrics, local-first, embedding analysis | Teams needing built-in evaluation |
| Helicone | Proxy | Yes | One-line proxy setup, 100+ model routing | Fast integration with any provider |
| Portkey | Gateway | Yes (gateway) | AI gateway + observability, MCP tool logging | Teams needing gateway + observability combined |
| Pydantic Logfire | Platform | Yes | Full-stack OTEL, 10M free spans/mo | Python teams using Pydantic AI |
| Datadog LLM Obs | Extension | No | Unified with existing APM, sensitive data scanning | Teams already on Datadog |
| AgentOps | Platform | Yes | Session replay, agent lifecycle tracking | Agent-first workflows with CrewAI/AutoGen |

Langfuse

Langfuse was the most widely adopted open-source LLM observability platform before ClickHouse acquired it in January 2026 as part of a $400M Series D at $15B valuation. It has 2,000+ paying customers, 26M+ SDK installs per month, and is used by 19 of the Fortune 50 and 63 of the Fortune 500. The architecture runs entirely on ClickHouse, both in the cloud offering and for self-hosted deployments.

Core capabilities: trace visualization, prompt versioning and management, cost tracking, A/B experiments, and native SDKs for Python, JavaScript, Java, and Go. The MIT license and self-hosting option make it the default choice for teams with data residency requirements.

LangSmith

LangSmith is LangChain's observability platform, tightly integrated with LangGraph for agent tracing. It provides end-to-end traces across agent steps with inline evaluations, dataset management, and experiment tracking. The free tier includes 5,000 traces/month. The Plus plan at $39/seat/month includes 100K traces with 400-day retention.

The trade-off: LangSmith works with any LLM application, but the deepest integration is with LangChain and LangGraph. If you use a different framework, you get tracing but not the full workflow instrumentation.

Arize Phoenix

Phoenix is an open-source platform built by the Arize AI team, focused on evaluation and experimentation. It ships 50+ research-backed evaluation metrics covering faithfulness, relevance, safety, and more. It runs locally, in Docker, or via a free cloud instance. The cloud free tier includes 25K trace spans/month.

Phoenix embeds evaluation into the observability loop: you can create datasets from production traces, run experiments against them, and compare results across prompt or model changes. This tight coupling between tracing and evaluation is its primary differentiator.

Helicone

Helicone (YC W23) takes a proxy-based approach. One line of code routes your LLM calls through Helicone's proxy, capturing traces, costs, and latency without SDK integration. It supports 100+ AI models with intelligent routing and automatic fallbacks. The free tier includes 10K requests/month. Paid plans start at $20/seat/month. SOC 2 and GDPR compliant.

Portkey

Portkey combines an AI gateway with observability. Every LLM call and MCP tool invocation that flows through the gateway is automatically logged, including tool name, inputs, outputs, latency, and authorization status. Most observability platforms don't cover MCP at all. The pricing is usage-based, centered on recorded logs and retention duration.

Pydantic Logfire

Logfire is built on OpenTelemetry and monitors the full application stack, not just LLM calls. It integrates natively with Pydantic AI for agent tracing and Pydantic Evals for continuous evaluation. SDKs for Python, JavaScript/TypeScript, and Rust. The free tier is generous: 10M spans/month with no credit card. The trade-off: strongest in Python-first stacks.

Datadog LLM Observability

Datadog added LLM Observability as an extension to its existing APM platform. It provides end-to-end tracing across agents with visibility into inputs, outputs, latency, token usage, and errors at each step. It automatically calculates estimated cost per request using providers' public pricing. Sensitive Data Scanner detects and redacts PII. The advantage: unified with existing Datadog infrastructure monitoring. The cost: Datadog pricing, which scales by LLM spans and compounds with other Datadog products.

Pricing Comparison

| Tool | Free Tier | Paid Starting At | Pricing Model |
|---|---|---|---|
| Langfuse | 50K observations/mo | $29/mo (Core) | Observations + retention |
| LangSmith | 5K traces/mo, 1 seat | $39/seat/mo (Plus) | Per-seat + trace overage |
| Arize Phoenix | 25K spans/mo (cloud) | $50/mo | Spans + storage |
| Helicone | 10K requests/mo | $20/seat/mo | Per-seat |
| Portkey | Dev tier free | Usage-based | Recorded logs + retention |
| Pydantic Logfire | 10M spans/mo | Usage-based | Spans |
| Datadog LLM Obs | None | Per LLM span | Per-span (adds to DD bill) |
| AgentOps | Free tier | Not published | Usage-based |

Self-Hosting Economics

Langfuse, Arize Phoenix, and Helicone can all be self-hosted. Langfuse self-hosting requires PostgreSQL, ClickHouse, Redis, and S3-compatible storage, with typical infrastructure costs of $500-$1,000/month. Phoenix can run locally with zero infrastructure cost for development. For teams processing millions of traces, self-hosting often becomes cheaper than cloud pricing within 3-6 months.

Key figures:

- $15B: ClickHouse valuation (Langfuse acquisition, Jan 2026)
- 26M+: Langfuse SDK installs per month
- 10M: free spans/month on Pydantic Logfire

OpenTelemetry GenAI Semantic Conventions

OpenTelemetry published GenAI Semantic Conventions that define standardized attribute names, span structures, and metrics for tracing generative AI operations. This is the first vendor-neutral standard for AI telemetry. The conventions are currently experimental, with technology-specific definitions for Anthropic, OpenAI, Azure AI Inference, and AWS Bedrock.

The conventions cover three areas:

Client Spans

Standardized attributes for LLM calls: gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reason. Every LLM call gets a consistent shape regardless of provider.
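Using only the attribute names listed above, the attributes of one client span can be sketched as a plain dict (standing in for an OTEL span; with the OpenTelemetry SDK you would pass each key to `span.set_attribute`):

```python
def genai_client_span_attributes(model, input_tokens, output_tokens, finish_reason):
    """Build the standardized gen_ai.* attributes for one LLM call.
    With a real OTEL span, each key/value pair below would go
    through span.set_attribute()."""
    return {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reason": finish_reason,
    }
```

Because the keys are standardized, any compliant backend can build token and cost dashboards from these attributes without a provider-specific adapter.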

Agent Spans

Attributes for tracing agent workflows: tasks, actions, tool invocations, and reasoning steps. Currently proposed via RFC, these conventions aim to standardize how multi-step agent behavior is captured.

Metrics

Token usage histograms, request duration, and operation counts by model and provider. These enable cost and performance dashboards without vendor-specific adapters.

Frameworks adopting OTEL GenAI conventions include Pydantic AI, smolagents, Strands Agents, and the OpenLLMetry project (which provides auto-instrumentation for LangChain, LlamaIndex, and Haystack). The practical benefit: instrument once with OTEL, send traces to any compatible backend (Langfuse, Logfire, Jaeger, Grafana Tempo, Datadog).

Why This Matters

Before these conventions, every observability tool defined its own schema for LLM traces. Switching from LangSmith to Langfuse meant re-instrumenting your application. OTEL GenAI conventions decouple instrumentation from the backend. The same trace data can flow to multiple destinations simultaneously, and you can switch backends without changing application code.

Multi-Agent Observability

Single-agent tracing is mostly solved. Multi-agent observability is not. When multiple agents collaborate on a task, failures become emergent: they arise from interactions between agents, not from any single agent's behavior.

A data retrieval agent returns malformed JSON. A reasoning agent parses it without error (the JSON is valid, just wrong). A synthesis agent produces a confident, incorrect final answer. The trace for each individual agent looks clean. The system trace shows a cascade of correct operations producing a wrong result.

Distributed Trace Context

Each agent needs its own spans linked by a shared trace ID. When Agent A delegates to Agent B, the parent-child span relationship must propagate across process boundaries, not just within a single service.
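The propagation requirement can be sketched in a few lines. This is a simplified stand-in for W3C Trace Context-style propagation (real systems carry the same pair of IDs in a `traceparent` header):

```python
import uuid
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None

def start_trace() -> SpanContext:
    """Root span: a fresh trace ID with no parent."""
    return SpanContext(trace_id=uuid.uuid4().hex, span_id=uuid.uuid4().hex)

def child_span(parent: SpanContext) -> SpanContext:
    """When Agent A delegates to Agent B, B keeps A's trace_id and
    records A's span_id as parent. This (trace_id, span_id) pair is
    what must cross the process boundary, e.g. in request headers."""
    return SpanContext(trace_id=parent.trace_id,
                       span_id=uuid.uuid4().hex,
                       parent_span_id=parent.span_id)
```

If Agent B instead starts its own root trace, the cascade described above becomes invisible: each agent's trace looks clean in isolation.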

Decision Point Tracing

When an orchestrator selects which sub-agent to invoke, that decision is a first-class event. Tracing tool selection, input routing, and output aggregation reveals where orchestration logic fails.

Anthropic reported 90% performance improvements when using multi-agent architectures on complex tasks. Cognition measured that coding agents spend 60% of their time searching for context. Both findings point to the same problem: multi-agent systems need observability that crosses agent boundaries, not just monitors individual agents in isolation.

The OpenTelemetry GenAI agent span conventions (currently proposed via RFC) aim to standardize this. They define attributes for tracing tasks, actions, agents, teams, artifacts, and memory across agentic systems. Until these stabilize, most teams build custom instrumentation on top of the existing OTEL span model.

What to Look For in an AI Observability Tool

Trace Depth

Can you see the full prompt, completion, token counts, and model ID for every LLM call? Can you trace through tool calls, retries, and multi-step reasoning? Surface-level request logging is not enough.

Evaluation Integration

Does the tool evaluate output quality, or just log it? Built-in metrics for faithfulness, relevance, and safety catch the failures that return 200 OK. Without evaluation, you're building a logging system, not an observability platform.

Cost Attribution

Can you attribute costs to specific features, users, or teams? Per-request cost tracking by model and provider is table stakes. Budget alerts and anomaly detection prevent $500 surprises on monthly invoices.

OpenTelemetry Support

Does it accept OTEL traces natively? Vendor-specific SDKs create lock-in. OTEL support means you can switch backends, send to multiple destinations, and benefit from the growing ecosystem of auto-instrumentors.

For teams building with multi-agent architectures, the ability to trace across agent boundaries matters more than any individual feature. An observability tool that only traces within a single agent misses the class of failures that multi-agent systems are most prone to.

Morph's subagent architecture produces structured traces by design. When a coding agent delegates search to WarpGrep and code editing to Fast Apply, each subagent operation is a discrete, traceable unit with clear inputs, outputs, and cost: a typical run is 8 parallel tool calls per turn across 4 turns, completing in under 6 seconds. This granularity makes agent behavior observable without custom instrumentation.

Frequently Asked Questions

What is AI observability?

The practice of monitoring, tracing, and evaluating AI systems beyond traditional APM. It tracks LLM-specific signals (prompt/completion pairs, token usage, model IDs, tool calls) and evaluates output quality (faithfulness, relevance, safety). The goal: when an agent fails, trace exactly where the reasoning went wrong.

How is AI observability different from traditional APM?

Traditional APM monitors deterministic software: latency, error rates, uptime. AI observability handles non-deterministic systems. An LLM can return 200 OK in 50ms and still hallucinate. APM confirms the request succeeded. AI observability evaluates whether the answer was correct, grounded, and safe. They complement each other.

What are the best open-source AI observability tools?

Langfuse (MIT license, acquired by ClickHouse January 2026, self-hostable, 26M+ monthly SDK installs), Arize Phoenix (10,000+ GitHub stars, 50+ evaluation metrics, runs locally), and Helicone (YC W23, proxy-based one-line setup, 10K free requests/month). All three support OpenTelemetry.

What are OpenTelemetry GenAI Semantic Conventions?

Standardized attribute names and span structures for tracing generative AI operations. They define how to capture model IDs, token usage, tool calls, and agent behavior in a vendor-neutral format. Currently experimental, supported by Pydantic AI, smolagents, Strands Agents, and the OpenLLMetry project. They decouple instrumentation from the observability backend.

How do you observe multi-agent AI systems?

Distributed tracing across agent boundaries with shared trace IDs. Each agent's tool calls and LLM invocations get their own spans linked to a parent trace. The challenge: multi-agent failures are emergent. Agent A retrieves bad context, Agent B reasons correctly over it, and the output is wrong. You need system-level traces, not per-agent logs. The OTEL GenAI agent span conventions are still in RFC status.

Related Reading

Observable by Design

Morph's subagent architecture produces structured, traceable operations. WarpGrep searches codebases in 8 parallel tool calls. Fast Apply merges edits at 10,500+ tok/s. Each operation is a discrete span with clear inputs, outputs, and cost.