Why LLMs Need Different Observability
A REST API either returns 200 or it doesn't. An LLM can return 200 with a confidently wrong answer. Your infrastructure metrics will show a healthy service while your users get hallucinated outputs. This is the core problem: traditional observability instruments the transport layer, but LLMs fail at the semantic layer.
Three properties of LLMs make standard APM insufficient:
Non-Deterministic Outputs
The same prompt can produce different outputs across runs. Temperature, sampling, and model updates all introduce variance. You can't write assertions against exact responses.
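Since exact-match assertions break under sampling variance, tests assert semantic properties instead. A toy sketch, using word overlap as a stand-in for a real embedding-based similarity metric (the function names and threshold here are illustrative, not any framework's API):

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets -- a crude stand-in
    for an embedding-based similarity score."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def assert_semantically_close(output: str, reference: str, threshold: float = 0.5):
    """Pass if the output is 'close enough' to the reference,
    tolerating wording differences between runs."""
    score = token_overlap(output, reference)
    assert score >= threshold, f"similarity {score:.2f} below {threshold}"

# Two differently worded but equivalent answers both pass:
assert_semantically_close(
    "Paris is the capital of France",
    "The capital of France is Paris",
)
```

In production you would swap the overlap function for an embedding model or an LLM-as-judge call, but the testing pattern is the same: threshold a score, never compare strings.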
Semantic Failures
Hallucinations, prompt drift, context window overflow, and reasoning errors are invisible to HTTP status codes. A 200 response with fabricated data is worse than a 500 error.
Compound Cost
Each token costs money. A runaway agent loop or verbose system prompt can burn through API budgets without triggering any traditional alarm. Cost is a first-class metric, not an afterthought.
LangChain's State of AI Agents survey puts the gap in numbers. Nearly every team has some observability. Barely half evaluate whether the outputs are correct. Teams can see their agents running but not measure whether the agents are right.
The Four Pillars of LLM Observability
Traditional observability rests on MELT: metrics, events, logs, traces. LLM observability extends this with behavioral signals that capture dimensions infrastructure metrics miss: output quality, factual grounding, and cost attribution.
1. Tracing
Captures the full execution path through multi-step workflows: prompt construction, model calls, tool invocations, retrieval steps, guardrail checks. Each step is a span in a trace. For agent systems, tracing must propagate context across agent boundaries so you can follow a request from the orchestrator through every subagent.
2. Evaluation
Measures output quality against ground truth or reference data. Includes automated metrics (BERTScore for semantic similarity, faithfulness scoring for RAG systems, LLM-as-judge for subjective quality) and human annotation workflows. The 89% vs 52% gap in the LangChain survey shows most teams skip this pillar.
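Automated evaluation reduces to scoring each output and thresholding. A toy faithfulness check for RAG outputs, where word overlap with the retrieved context stands in for a real faithfulness model or LLM-as-judge (everything here is a simplified illustration):

```python
def faithfulness(answer: str, context: str) -> float:
    """Fraction of answer words that appear in the retrieved context.
    A real system would use an NLI model or an LLM judge instead."""
    ctx = set(context.lower().split())
    words = answer.lower().split()
    return sum(w in ctx for w in words) / len(words) if words else 0.0

def eval_pass_rate(samples: list[dict], threshold: float = 0.9) -> float:
    """Percentage of samples whose faithfulness clears the threshold."""
    passed = sum(faithfulness(s["answer"], s["context"]) >= threshold
                 for s in samples)
    return 100 * passed / len(samples)

samples = [
    {"answer": "the invoice total is 40 dollars",
     "context": "invoice #12: the total due is 40 dollars"},
    {"answer": "the invoice total is 90 dollars",   # fabricated figure
     "context": "invoice #12: the total due is 40 dollars"},
]
print(eval_pass_rate(samples))  # one of two passes -> 50.0
```

The second sample returns HTTP 200 everywhere; only the evaluation step catches the fabricated number.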
3. Monitoring
Real-time dashboards and alerts for latency (time to first token, total generation time), error rates (API failures, rate limits, timeouts), throughput (requests per second), and safety signals (jailbreak attempts, toxicity, PII leakage). This is closest to traditional APM, extended with LLM-specific dimensions.
4. Cost Tracking
Attributes token spend to specific features, users, models, and workflows. Tracks prompt vs completion tokens, cache hit ratios, and cost per task (aggregated across multi-step agent sessions). Critical for teams running multiple models where a routing mistake can 10x costs overnight.
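Cost per task means rolling token counts up to the session, not the individual call. A sketch with hypothetical per-million-token prices (these numbers are placeholders, not real quotes -- check your provider's current price sheet):

```python
# Hypothetical prices, USD per 1M tokens -- illustrative only.
PRICES = {
    "big-model":   {"prompt": 3.00, "completion": 15.00},
    "small-model": {"prompt": 0.25, "completion": 1.25},
}

def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p = PRICES[model]
    return (prompt_tokens * p["prompt"]
            + completion_tokens * p["completion"]) / 1_000_000

def session_cost(calls: list[dict]) -> float:
    """Aggregate every model call in a multi-step session into one
    cost-per-task figure -- the unit-economics number that matters."""
    return sum(call_cost(c["model"], c["prompt_tokens"], c["completion_tokens"])
               for c in calls)

# One agent task = several calls across two models.
session = [
    {"model": "big-model",   "prompt_tokens": 4_000, "completion_tokens": 800},
    {"model": "small-model", "prompt_tokens": 1_200, "completion_tokens": 300},
    {"model": "small-model", "prompt_tokens": 900,   "completion_tokens": 150},
]
print(f"${session_cost(session):.4f}")  # -> $0.0251
```

Per-call numbers look harmless; the session-level sum is what reveals a routing mistake sending cheap-model traffic to the expensive model.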
The Evaluation Gap
Among teams with agents in production, 94% have observability but only 77.2% run evaluations. The 17-point gap between "can see what's happening" and "can measure whether it's correct" is where most production failures hide. Tracing without evaluation is like flying with engine gauges but no altimeter: you can tell the system is running, not whether it's on course.
LLM Observability Tools Compared
The landscape splits roughly into open-source platforms, commercial AI-native tools, and traditional APM vendors adding LLM support. Each has a different integration model and cost structure.
| Tool | Type | Integration | Open Source | Key Strength |
|---|---|---|---|---|
| Langfuse | Platform | SDK + OTel | MIT (self-host) | Most complete OSS: tracing, evals, prompt mgmt, datasets |
| LangSmith | Platform | SDK (LangChain) | No | Deep LangChain integration, agent debugging, annotation queues |
| Helicone | Gateway | Proxy (URL swap) | Apache 2.0 | Fastest setup, 2B+ requests processed, built-in caching |
| Arize Phoenix | Platform | SDK + OTel | Yes (OTel-native) | Evaluation-focused, built entirely on OpenTelemetry |
| Portkey | Gateway | Proxy + SDK | No | 1,600+ model routing, guardrails, 40+ tracked dimensions |
| Datadog LLM Obs | APM Extension | SDK + OTel | No | Native OTel GenAI support, unified with existing APM |
| Splunk AI Monitor | APM Extension | OTel | No | Enterprise APM integration, AGNTCY quality metrics |
| Pydantic Logfire | Platform | SDK + OTel | Yes | Full-stack OTel tracing, SQL query interface, Python-native |
| HoneyHive | Platform | SDK | No | Evaluation pipelines, dataset management, CI/CD integration |
| W&B Weave | Platform | SDK | No | ML experiment tracking heritage, model comparison workflows |
Choosing by Use Case
Fastest Setup
Helicone or Portkey. Proxy-based: swap your API base URL, add a header. Observability in under 5 minutes. Tradeoff: proxies can't see internal application state (prompt templating, local RAG retrieval, control flow logic).
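The proxy pattern is a configuration change, not a code change. A sketch of the swap (the gateway URL and header names below are placeholders, not any vendor's actual values -- consult your gateway's docs):

```python
# Before: the application calls the provider directly.
DIRECT = {
    "base_url": "https://api.openai.com/v1",
    "headers": {"Authorization": "Bearer $OPENAI_API_KEY"},
}

# After: same provider key, but traffic routes through an
# observability gateway that logs every request/response pair.
PROXIED = {
    "base_url": "https://gateway.example.com/v1",       # placeholder URL
    "headers": {
        "Authorization": "Bearer $OPENAI_API_KEY",
        "X-Gateway-Api-Key": "$GATEWAY_KEY",            # auth for the proxy itself
    },
}
```

Everything the gateway sees is captured automatically; everything that happens before the request leaves your process (templating, retrieval, routing logic) remains invisible.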
Open-Source, Self-Hosted
Langfuse. MIT license, ClickHouse-backed since the acquisition, 1,000+ self-hosted deployments in production. Full tracing, evals, prompt management, and datasets. Self-hosting avoids per-trace SaaS costs at scale.
Already on Datadog/Splunk
Use their native LLM observability features. Datadog supports OTel GenAI Semantic Conventions (v1.37+). Splunk's AI Agent Monitoring is GA. No new vendor. Tradeoff: pricing runs high ($0.10 per 1K tokens at Datadog).
LangChain-Heavy Stack
LangSmith. Automatic instrumentation for LangChain, LangGraph, and LangServe. Annotation queues for human review. $39/user/month. Tradeoff: deep vendor coupling to the LangChain ecosystem.
The ClickHouse + Langfuse Shift
ClickHouse acquired Langfuse in 2025, keeping it MIT-licensed and self-hostable. The migration from PostgreSQL to ClickHouse cut query latency from minutes to near real-time, reduced memory usage 3x, and made analytical queries 20x faster. For teams processing millions of traces, the ClickHouse backend is a meaningful architectural advantage over tools that store traces in general-purpose databases.
Implementation Patterns
Three architectural approaches to instrumenting LLM applications, each with distinct tradeoffs:
| Pattern | How It Works | Setup Time | Visibility | Latency Overhead |
|---|---|---|---|---|
| Proxy / Gateway | Route API traffic through a middleware (Helicone, Portkey). Swap your base URL. | Minutes | API request/response only | 50-80ms per request |
| SDK / Manual | Embed instrumentation library in your code (Langfuse, LangSmith). Decorators and wrappers. | Hours to days | Full: internal state, control flow, local processing | Negligible (async export) |
| Auto-Instrumentation (OTel) | Install OpenTelemetry SDK + GenAI instrumentation packages. Configure collector. | 30 min - 2 hours | Model calls, token usage, tool invocations | Negligible |
When to Use Each
Proxy if you need observability today with zero code changes. Works well for simple chatbots and single-model applications. Falls short when debugging requires visibility into prompt construction, RAG retrieval logic, or multi-step agent reasoning that happens before the API call.
SDK if you're building complex agent systems where the interesting failures happen between model calls, not during them. A retrieval step returns irrelevant documents, the prompt template injects stale context, or a tool call fails silently. SDKs capture these intermediate states.
OpenTelemetry if you want vendor portability and already have OTel infrastructure. The GenAI Semantic Conventions (v1.37+) standardize the schema. Instrument once, export to Datadog, Langfuse, Splunk, or any OTel-compatible backend. The OpenLLMetry project by Traceloop provides auto-instrumentation for OpenAI, Anthropic, Cohere, and 15+ other providers.
OpenTelemetry: The Convergence Point
The same convergence that happened with traditional observability is happening with LLM observability. In 2018, every APM vendor had a proprietary agent. By 2022, OpenTelemetry became the de facto standard, and observability backends competed on analysis rather than data collection. LLM observability is following the same arc.
OpenTelemetry GenAI Semantic Conventions define a standard vocabulary for AI telemetry:
- Spans for LLM calls, agent steps, tool invocations, and retrieval operations
- Events for prompt/completion content capture with configurable redaction
- Metrics for token usage (input/output), latency distributions, and cost aggregation
- Attributes for model ID, provider, temperature, max tokens, and response metadata
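In practice these conventions are just well-known attribute keys on spans. A sketch of the attributes one LLM-call span might carry (the `gen_ai.*` keys follow the GenAI semantic conventions, but the conventions are still stabilizing -- verify names against the spec version you target; the values are made up):

```python
# Span attributes for a single LLM call, keyed per the OTel GenAI
# semantic conventions. Values are illustrative.
llm_span_attributes = {
    "gen_ai.operation.name":      "chat",
    "gen_ai.system":              "openai",             # provider
    "gen_ai.request.model":       "gpt-4o",
    "gen_ai.request.temperature": 0.2,
    "gen_ai.request.max_tokens":  1024,
    "gen_ai.response.model":      "gpt-4o-2024-08-06",  # what actually served it
    "gen_ai.usage.input_tokens":  412,
    "gen_ai.usage.output_tokens": 187,
}
```

Because every backend reads the same keys, a dashboard built on `gen_ai.usage.input_tokens` works identically whether the spans land in Datadog, Langfuse, or Phoenix.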
Datadog was the first major APM vendor to announce native support for OTel GenAI conventions, allowing teams to send LLM traces via their existing OTel Collector pipeline with no SDK changes. Splunk followed with AI Agent Monitoring built on OTel. Langfuse and Arize Phoenix accept OTel traces natively.
The practical benefit: teams that instrument with OTel today can switch between backends without re-instrumenting their applications. The observability vendor lock-in that plagued traditional monitoring is avoidable from the start with LLM observability.
Agent Span Conventions
The OTel GenAI working group is actively developing semantic conventions for agentic systems, covering agent-to-agent context propagation, session-level tracing, and framework-specific instrumentation for LangGraph, CrewAI, and AutoGen. These conventions, once stable, will enable consistent agent tracing across frameworks, solving the fragmentation that makes multi-agent debugging difficult today.
Multi-Agent Observability
Single-model applications are straightforward to observe: one prompt goes in, one completion comes out, you log both. Multi-agent systems break this model. A user request might traverse an orchestrator, a planning agent, three specialist agents, and a verification agent, each making multiple model calls with tool invocations in between. The trace is no longer a single span; it's a directed graph.
The specific challenges:
Cascading Semantic Errors
Agent A hallucinates a fact. Agent B uses that fact as context. Agent C acts on Agent B's output. The error propagates through three steps before surfacing. Without tracing the full chain, you see a wrong result but can't identify which step introduced the error.
Context Propagation
When Agent A hands off to Agent B, the trace context must follow. Traditional distributed tracing (W3C Trace Context) handles this for HTTP services. Agent frameworks need equivalent standards for passing trace IDs across agent boundaries.
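What W3C Trace Context does for HTTP headers can be mimicked inside an agent handoff payload. A sketch that threads a `traceparent`-style value through a handoff (the header format follows W3C Trace Context; the handoff envelope itself is illustrative):

```python
import uuid

def make_traceparent(trace_id: str, span_id: str) -> str:
    # W3C Trace Context layout: version-traceid-parentid-flags
    return f"00-{trace_id}-{span_id}-01"

def parse_traceparent(header: str) -> tuple[str, str]:
    _, trace_id, parent_id, _ = header.split("-")
    return trace_id, parent_id

# Agent A finishes a step and hands off to Agent B.
trace_id = uuid.uuid4().hex            # 32 hex chars, spans the whole request
agent_a_span = uuid.uuid4().hex[:16]   # 16 hex chars, A's step
handoff = {
    "task": "verify the extracted figures",
    "traceparent": make_traceparent(trace_id, agent_a_span),
}

# Agent B opens its own span as a child of A's, in the same trace.
tid, parent = parse_traceparent(handoff["traceparent"])
agent_b_span = uuid.uuid4().hex[:16]
# ...record span(agent_b_span, trace_id=tid, parent_id=parent)
```

As long as the `traceparent` value rides along with every handoff, the backend can stitch all agents' spans into one request-level trace.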
Non-Deterministic Branching
Agents make dynamic routing decisions. The same input may trigger 3 subagents in one run and 7 in another. Trace schemas must handle variable-depth graphs, not fixed-depth hierarchies.
Cost Attribution
A single user request might consume tokens across 5 different models via 12 agent steps. Attributing cost to the user request, not just individual API calls, requires aggregating spans into sessions.
Galileo AI documented 9 key challenges in monitoring multi-agent systems at scale, including tool call failures across agent boundaries, emergent interaction patterns that only appear under load, and the difficulty of setting meaningful SLOs when agent behavior is inherently non-deterministic.
The architectural solution most platforms converge on: hierarchical tracing with session-level grouping. Each user request creates a session. Each agent step is a span within the session. Tool calls and model invocations are child spans. Context propagation passes trace IDs across agent handoffs, creating a stitched view of the entire execution graph.
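The hierarchical model above can be sketched as flat span records stitched back into a session tree, with cost rolled up per subtree (field names are illustrative):

```python
from collections import defaultdict

spans = [  # flat export of one session's spans
    {"id": "s1", "parent": None, "name": "session",         "cost": 0.0},
    {"id": "s2", "parent": "s1", "name": "planner_agent",   "cost": 0.012},
    {"id": "s3", "parent": "s1", "name": "search_agent",    "cost": 0.004},
    {"id": "s4", "parent": "s3", "name": "tool:web_search", "cost": 0.0},
    {"id": "s5", "parent": "s3", "name": "llm_call",        "cost": 0.003},
]

children = defaultdict(list)
for s in spans:
    children[s["parent"]].append(s)

def rollup(span_id: str) -> float:
    """Total cost of a span plus everything beneath it."""
    own = sum(s["cost"] for s in spans if s["id"] == span_id)
    return own + sum(rollup(c["id"]) for c in children[span_id])

print(round(rollup("s1"), 4))  # session-level cost -> 0.019
```

The same parent-pointer structure answers "which agent introduced the error" (walk down) and "what did this request cost" (sum up).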
Subagent Architecture and Clean Trace Boundaries
Multi-agent architectures that use dedicated subagents for discrete tasks (search, code editing, evaluation) naturally produce cleaner traces than monolithic agents. Each subagent is a bounded context with a clear input/output contract. WarpGrep runs 8 parallel tool calls per turn for up to 4 turns, each call a separate span. Morph Fast Apply processes edit operations at 10,500+ tok/s with deterministic input-output mapping. When your subagents have clean boundaries, your traces have clean boundaries.
Key Metrics to Track
| Category | Metric | What It Tells You |
|---|---|---|
| Latency | Time to first token (TTFT) | User-perceived responsiveness. High TTFT = users staring at a spinner. |
| Latency | Total generation time (E2E) | Full request duration including all model calls and tool invocations. |
| Cost | Tokens per request (prompt + completion) | Direct input to cost calculation. Track ratio of prompt to completion tokens. |
| Cost | Cost per task (session-level) | Aggregate cost across all steps in a multi-agent workflow. The real unit economics metric. |
| Cost | Cache hit ratio | Percentage of requests served from cache (prompt caching, semantic caching). Higher = lower cost. |
| Quality | Hallucination rate | Percentage of outputs containing fabricated facts. Measured via faithfulness scoring or LLM-as-judge. |
| Quality | Eval pass rate | Percentage of outputs passing automated evaluation criteria (relevance, groundedness, safety). |
| Reliability | Error rate by type | API failures, rate limits, timeouts, content filter blocks. Track separately for each provider. |
| Reliability | Steps per task | For agents: how many steps to complete a task. Increasing steps = degrading efficiency. |
| Safety | Guardrail trigger rate | How often safety filters (jailbreak detection, PII redaction, toxicity) activate. Baseline for policy tuning. |
The metric that matters most depends on your deployment stage. Pre-production: evaluation pass rate (is the model producing correct outputs?). Early production: error rate and latency (is the system reliable?). At scale: cost per task and cache hit ratio (is it sustainable?).
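Most of these metrics fall out of simple aggregations over a request log. A sketch computing cache hit ratio and error rate by type (field names are illustrative):

```python
requests = [  # one record per LLM request
    {"cache_hit": True,  "error": None},
    {"cache_hit": False, "error": None},
    {"cache_hit": False, "error": "rate_limit"},
    {"cache_hit": True,  "error": None},
    {"cache_hit": False, "error": "timeout"},
]

cache_hit_ratio = 100 * sum(r["cache_hit"] for r in requests) / len(requests)

errors_by_type: dict[str, int] = {}
for r in requests:
    if r["error"]:
        errors_by_type[r["error"]] = errors_by_type.get(r["error"], 0) + 1

print(cache_hit_ratio)   # -> 40.0
print(errors_by_type)    # -> {'rate_limit': 1, 'timeout': 1}
```

Tracking error types separately matters because the remediation differs: rate limits call for backoff or routing, timeouts for smaller prompts or streaming.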
Frequently Asked Questions
What is LLM observability?
The practice of collecting and correlating signals across every model interaction, including prompts, completions, tool calls, retrieval steps, guardrail checks, and costs, to understand the performance, reliability, and risk of AI applications in production. It extends traditional observability with behavioral signals specific to language models: hallucination detection, prompt drift monitoring, evaluation scoring, and multi-step agent tracing.
How is LLM observability different from traditional APM?
Traditional APM tracks infrastructure metrics: CPU, memory, latency, error rates. LLM observability adds a semantic layer. An LLM can return HTTP 200 with a confidently wrong answer. APM sees a successful request. LLM observability detects the hallucination. The key additions are prompt/completion logging, token cost tracking, output quality evaluation, and multi-step agent tracing that follows reasoning chains across model calls.
What are the best open-source LLM observability tools?
Langfuse (MIT, acquired by ClickHouse, 1,000+ self-hosted deployments) is the most complete: tracing, evals, prompt management, datasets. Arize Phoenix is built entirely on OpenTelemetry, making it the most portable. Helicone (Apache 2.0) provides the fastest setup via proxy integration with built-in caching and cost tracking. All three have generous free tiers.
Does OpenTelemetry support LLM observability?
Yes. OpenTelemetry GenAI Semantic Conventions (v1.37+) define standard schemas for LLM traces. Datadog, Splunk, Arize, and Langfuse all accept OTel-native ingestion. The OpenLLMetry project by Traceloop provides auto-instrumentation libraries for OpenAI, Anthropic, Cohere, and 15+ other providers. Instrument once, export to any backend.
What metrics should I track first?
Start with three: latency (time to first token + total generation time), cost per request (prompt + completion tokens), and error rate by type (API failures, rate limits, content filter blocks). Add evaluation scoring once you have ground truth datasets. Add session-level cost tracking once you're running multi-step agent workflows.
Related
Subagents with Clean Trace Boundaries
Morph powers the subagent layer of coding agents. WarpGrep searches codebases with 8 parallel tool calls per turn. Fast Apply merges edits at 10,500+ tok/s. Both produce structured, traceable operations by design.