Why LLMs Need Different Observability
A REST API either returns 200 or it doesn't. An LLM can return 200 with a confidently wrong answer. Your infrastructure metrics will show a healthy service while your users get hallucinated outputs. This is the core problem: traditional observability instruments the transport layer, but LLMs fail at the semantic layer.
Three properties of LLMs make standard APM insufficient:
Non-Deterministic Outputs
The same prompt can produce different outputs across runs. Temperature, sampling, and model updates all introduce variance. You can't write assertions against exact responses.
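Since exact-match assertions break under sampling variance, tests assert semantic properties instead. A toy sketch, using word overlap as a stand-in for a real embedding-based similarity metric (the function names and threshold here are illustrative, not any framework's API):

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets -- a crude stand-in
    for an embedding-based similarity score."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def assert_semantically_close(output: str, reference: str, threshold: float = 0.5):
    """Pass if the output is 'close enough' to the reference,
    tolerating wording differences between runs."""
    score = token_overlap(output, reference)
    assert score >= threshold, f"similarity {score:.2f} below {threshold}"

# Two differently worded but equivalent answers both pass:
assert_semantically_close(
    "Paris is the capital of France",
    "The capital of France is Paris",
)
```

In production you would swap the overlap function for an embedding model or an LLM-as-judge call, but the testing pattern is the same: threshold a score, never compare strings.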
Semantic Failures
Hallucinations, prompt drift, context window overflow, and reasoning errors are invisible to HTTP status codes. A 200 response with fabricated data is worse than a 500 error.
Compound Cost
Each token costs money. A runaway agent loop or verbose system prompt can burn through API budgets without triggering any traditional alarm. Cost is a first-class metric, not an afterthought.
LangChain's State of AI Agents survey puts the gap in numbers. Nearly every team has some observability. Barely half evaluate whether the outputs are correct. Teams can see their agents running but not measure whether the agents are right.
The Four Pillars of LLM Observability
Traditional observability rests on MELT: metrics, events, logs, traces. LLM observability extends this with behavioral signals that capture dimensions infrastructure metrics miss: output quality, factual grounding, and cost attribution.
1. Tracing
Captures the full execution path through multi-step workflows: prompt construction, model calls, tool invocations, retrieval steps, guardrail checks. Each step is a span in a trace. For agent systems, tracing must propagate context across agent boundaries so you can follow a request from the orchestrator through every subagent.
2. Evaluation
Measures output quality against ground truth or reference data. Includes automated metrics (BERTScore for semantic similarity, faithfulness scoring for RAG systems, LLM-as-judge for subjective quality) and human annotation workflows. The 89% vs 52% gap in the LangChain survey shows most teams skip this pillar.
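Automated evaluation reduces to scoring each output and thresholding. A toy faithfulness check for RAG outputs, where word overlap with the retrieved context stands in for a real faithfulness model or LLM-as-judge (everything here is a simplified illustration):

```python
def faithfulness(answer: str, context: str) -> float:
    """Fraction of answer words that appear in the retrieved context.
    A real system would use an NLI model or an LLM judge instead."""
    ctx = set(context.lower().split())
    words = answer.lower().split()
    return sum(w in ctx for w in words) / len(words) if words else 0.0

def eval_pass_rate(samples: list[dict], threshold: float = 0.9) -> float:
    """Percentage of samples whose faithfulness clears the threshold."""
    passed = sum(faithfulness(s["answer"], s["context"]) >= threshold
                 for s in samples)
    return 100 * passed / len(samples)

samples = [
    {"answer": "the invoice total is 40 dollars",
     "context": "invoice #12: the total due is 40 dollars"},
    {"answer": "the invoice total is 90 dollars",   # fabricated figure
     "context": "invoice #12: the total due is 40 dollars"},
]
print(eval_pass_rate(samples))  # one of two passes -> 50.0
```

The second sample returns HTTP 200 everywhere; only the evaluation step catches the fabricated number.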
3. Monitoring
Real-time dashboards and alerts for latency (time to first token, total generation time), error rates (API failures, rate limits, timeouts), throughput (requests per second), and safety signals (jailbreak attempts, toxicity, PII leakage). This is closest to traditional APM, extended with LLM-specific dimensions.
4. Cost Tracking
Attributes token spend to specific features, users, models, and workflows. Tracks prompt vs completion tokens, cache hit ratios, and cost per task (aggregated across multi-step agent sessions). Critical for teams running multiple models where a routing mistake can 10x costs overnight.
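Cost per task means rolling token counts up to the session, not the individual call. A sketch with hypothetical per-million-token prices (these numbers are placeholders, not real quotes -- check your provider's current price sheet):

```python
# Hypothetical prices, USD per 1M tokens -- illustrative only.
PRICES = {
    "big-model":   {"prompt": 3.00, "completion": 15.00},
    "small-model": {"prompt": 0.25, "completion": 1.25},
}

def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p = PRICES[model]
    return (prompt_tokens * p["prompt"]
            + completion_tokens * p["completion"]) / 1_000_000

def session_cost(calls: list[dict]) -> float:
    """Aggregate every model call in a multi-step session into one
    cost-per-task figure -- the unit-economics number that matters."""
    return sum(call_cost(c["model"], c["prompt_tokens"], c["completion_tokens"])
               for c in calls)

# One agent task = several calls across two models.
session = [
    {"model": "big-model",   "prompt_tokens": 4_000, "completion_tokens": 800},
    {"model": "small-model", "prompt_tokens": 1_200, "completion_tokens": 300},
    {"model": "small-model", "prompt_tokens": 900,   "completion_tokens": 150},
]
print(f"${session_cost(session):.4f}")  # -> $0.0251
```

Per-call numbers look harmless; the session-level sum is what reveals a routing mistake sending cheap-model traffic to the expensive model.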
The Evaluation Gap
Among teams with agents in production, 94% have observability but only 77.2% run evaluations. The 17-point gap between "can see what's happening" and "can measure whether it's correct" is where most production failures hide. Tracing without evaluation is like flying with engine gauges but no altimeter: you can tell the system is running, not whether it's on course.
LLM Observability Tools Compared
The landscape splits roughly into open-source platforms, commercial AI-native tools, and traditional APM vendors adding LLM support. Each has a different integration model and cost structure.
| Tool | Type | Integration | Open Source | Key Strength |
|---|---|---|---|---|
| Langfuse | Platform | SDK + OTel | MIT (self-host) | Most complete OSS: tracing, evals, prompt mgmt, datasets |
| LangSmith | Platform | SDK (LangChain) | No | Deep LangChain integration, agent debugging, annotation queues |
| Helicone | Gateway | Proxy (URL swap) | Apache 2.0 | Fastest setup, 2B+ requests processed, built-in caching |
| Arize Phoenix | Platform | SDK + OTel | Yes (OTel-native) | Evaluation-focused, built entirely on OpenTelemetry |
| Portkey | Gateway | Proxy + SDK | No | 1,600+ model routing, guardrails, 40+ tracked dimensions |
| Datadog LLM Obs | APM Extension | SDK + OTel | No | Native OTel GenAI support, unified with existing APM |
| Splunk AI Monitor | APM Extension | OTel | No | Enterprise APM integration, AGNTCY quality metrics |
| Pydantic Logfire | Platform | SDK + OTel | Yes | Full-stack OTel tracing, SQL query interface, Python-native |
| HoneyHive | Platform | SDK | No | Evaluation pipelines, dataset management, CI/CD integration |
| W&B Weave | Platform | SDK | No | ML experiment tracking heritage, model comparison workflows |
Choosing by Use Case
Fastest Setup
Helicone or Portkey. Proxy-based: swap your API base URL, add a header. Observability in under 5 minutes. Tradeoff: proxies can't see internal application state (prompt templating, local RAG retrieval, control flow logic).
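The proxy pattern is a configuration change, not a code change. A sketch of the swap (the gateway URL and header names below are placeholders, not any vendor's actual values -- consult your gateway's docs):

```python
# Before: the application calls the provider directly.
DIRECT = {
    "base_url": "https://api.openai.com/v1",
    "headers": {"Authorization": "Bearer $OPENAI_API_KEY"},
}

# After: same provider key, but traffic routes through an
# observability gateway that logs every request/response pair.
PROXIED = {
    "base_url": "https://gateway.example.com/v1",       # placeholder URL
    "headers": {
        "Authorization": "Bearer $OPENAI_API_KEY",
        "X-Gateway-Api-Key": "$GATEWAY_KEY",            # auth for the proxy itself
    },
}
```

Everything the gateway sees is captured automatically; everything that happens before the request leaves your process (templating, retrieval, routing logic) remains invisible.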
Open-Source, Self-Hosted
Langfuse. MIT license, ClickHouse-backed since the acquisition, 1,000+ self-hosted deployments in production. Full tracing, evals, prompt management, and datasets. Self-hosting avoids per-trace SaaS costs at scale.
Already on Datadog/Splunk
Use their native LLM observability features. Datadog supports OTel GenAI Semantic Conventions (v1.37+). Splunk's AI Agent Monitoring is GA. No new vendor. Tradeoff: pricing runs high ($0.10 per 1K tokens at Datadog).
LangChain-Heavy Stack
LangSmith. Automatic instrumentation for LangChain, LangGraph, and LangServe. Annotation queues for human review. $39/user/month. Tradeoff: deep vendor coupling to the LangChain ecosystem.
The ClickHouse + Langfuse Shift
ClickHouse acquired Langfuse in 2025, keeping it MIT-licensed and self-hostable. The migration from PostgreSQL to ClickHouse cut query latency from minutes to near real-time, reduced memory usage 3x, and made analytical queries 20x faster. For teams processing millions of traces, the ClickHouse backend is a meaningful architectural advantage over tools that store traces in general-purpose databases.
Implementation Patterns
Three architectural approaches to instrumenting LLM applications, each with distinct tradeoffs:
| Pattern | How It Works | Setup Time | Visibility | Latency Overhead |
|---|---|---|---|---|
| Proxy / Gateway | Route API traffic through a middleware (Helicone, Portkey). Swap your base URL. | Minutes | API request/response only | 50-80ms per request |
| SDK / Manual | Embed instrumentation library in your code (Langfuse, LangSmith). Decorators and wrappers. | Hours to days | Full: internal state, control flow, local processing | Negligible (async export) |
| Auto-Instrumentation (OTel) | Install OpenTelemetry SDK + GenAI instrumentation packages. Configure collector. | 30 min - 2 hours | Model calls, token usage, tool invocations | Negligible |
When to Use Each
Proxy if you need observability today with zero code changes. Works well for simple chatbots and single-model applications. Falls short when debugging requires visibility into prompt construction, RAG retrieval logic, or multi-step agent reasoning that happens before the API call.
SDK if you're building complex agent systems where the interesting failures happen between model calls, not during them. A retrieval step returns irrelevant documents, the prompt template injects stale context, or a tool call fails silently. SDKs capture these intermediate states.
OpenTelemetry if you want vendor portability and already have OTel infrastructure. The GenAI Semantic Conventions (v1.37+) standardize the schema. Instrument once, export to Datadog, Langfuse, Splunk, or any OTel-compatible backend. The OpenLLMetry project by Traceloop provides auto-instrumentation for OpenAI, Anthropic, Cohere, and 15+ other providers.
OpenTelemetry: The Convergence Point
The same convergence that happened with traditional observability is happening with LLM observability. In 2018, every APM vendor had a proprietary agent. By 2022, OpenTelemetry became the de facto standard, and observability backends competed on analysis rather than data collection. LLM observability is following the same arc.
OpenTelemetry GenAI Semantic Conventions define a standard vocabulary for AI telemetry:
- Spans for LLM calls, agent steps, tool invocations, and retrieval operations
- Events for prompt/completion content capture with configurable redaction
- Metrics for token usage (input/output), latency distributions, and cost aggregation
- Attributes for model ID, provider, temperature, max tokens, and response metadata
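In practice these conventions are just well-known attribute keys on spans. A sketch of the attributes one LLM-call span might carry (the `gen_ai.*` keys follow the GenAI semantic conventions, but the conventions are still stabilizing -- verify names against the spec version you target; the values are made up):

```python
# Span attributes for a single LLM call, keyed per the OTel GenAI
# semantic conventions. Values are illustrative.
llm_span_attributes = {
    "gen_ai.operation.name":      "chat",
    "gen_ai.system":              "openai",             # provider
    "gen_ai.request.model":       "gpt-4o",
    "gen_ai.request.temperature": 0.2,
    "gen_ai.request.max_tokens":  1024,
    "gen_ai.response.model":      "gpt-4o-2024-08-06",  # what actually served it
    "gen_ai.usage.input_tokens":  412,
    "gen_ai.usage.output_tokens": 187,
}
```

Because every backend reads the same keys, a dashboard built on `gen_ai.usage.input_tokens` works identically whether the spans land in Datadog, Langfuse, or Phoenix.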
Datadog was the first major APM vendor to announce native support for OTel GenAI conventions, allowing teams to send LLM traces via their existing OTel Collector pipeline with no SDK changes. Splunk followed with AI Agent Monitoring built on OTel. Langfuse and Arize Phoenix accept OTel traces natively.
The practical benefit: teams that instrument with OTel today can switch between backends without re-instrumenting their applications. The observability vendor lock-in that plagued traditional monitoring is avoidable from the start with LLM observability.
Agent Span Conventions
The OTel GenAI working group is actively developing semantic conventions for agentic systems, covering agent-to-agent context propagation, session-level tracing, and framework-specific instrumentation for LangGraph, CrewAI, and AutoGen. These conventions, once stable, will enable consistent agent tracing across frameworks, solving the fragmentation that makes multi-agent debugging difficult today.
Multi-Agent Observability
Single-model applications are straightforward to observe: one prompt goes in, one completion comes out, you log both. Multi-agent systems break this model. A user request might traverse an orchestrator, a planning agent, three specialist agents, and a verification agent, each making multiple model calls with tool invocations in between. The trace is no longer a single span; it's a directed graph.
The specific challenges:
Cascading Semantic Errors
Agent A hallucinates a fact. Agent B uses that fact as context. Agent C acts on Agent B's output. The error propagates through three steps before surfacing. Without tracing the full chain, you see a wrong result but can't identify which step introduced the error.
Context Propagation
When Agent A hands off to Agent B, the trace context must follow. Traditional distributed tracing (W3C Trace Context) handles this for HTTP services. Agent frameworks need equivalent standards for passing trace IDs across agent boundaries.
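What W3C Trace Context does for HTTP headers can be mimicked inside an agent handoff payload. A sketch that threads a `traceparent`-style value through a handoff (the header format follows W3C Trace Context; the handoff envelope itself is illustrative):

```python
import uuid

def make_traceparent(trace_id: str, span_id: str) -> str:
    # W3C Trace Context layout: version-traceid-parentid-flags
    return f"00-{trace_id}-{span_id}-01"

def parse_traceparent(header: str) -> tuple[str, str]:
    _, trace_id, parent_id, _ = header.split("-")
    return trace_id, parent_id

# Agent A finishes a step and hands off to Agent B.
trace_id = uuid.uuid4().hex            # 32 hex chars, spans the whole request
agent_a_span = uuid.uuid4().hex[:16]   # 16 hex chars, A's step
handoff = {
    "task": "verify the extracted figures",
    "traceparent": make_traceparent(trace_id, agent_a_span),
}

# Agent B opens its own span as a child of A's, in the same trace.
tid, parent = parse_traceparent(handoff["traceparent"])
agent_b_span = uuid.uuid4().hex[:16]
# ...record span(agent_b_span, trace_id=tid, parent_id=parent)
```

As long as the `traceparent` value rides along with every handoff, the backend can stitch all agents' spans into one request-level trace.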
Non-Deterministic Branching
Agents make dynamic routing decisions. The same input may trigger 3 subagents in one run and 7 in another. Trace schemas must handle variable-depth graphs, not fixed-depth hierarchies.
Cost Attribution
A single user request might consume tokens across 5 different models via 12 agent steps. Attributing cost to the user request, not just individual API calls, requires aggregating spans into sessions.
Galileo AI documented 9 key challenges in monitoring multi-agent systems at scale, including tool call failures across agent boundaries, emergent interaction patterns that only appear under load, and the difficulty of setting meaningful SLOs when agent behavior is inherently non-deterministic.
The architectural solution most platforms converge on: hierarchical tracing with session-level grouping. Each user request creates a session. Each agent step is a span within the session. Tool calls and model invocations are child spans. Context propagation passes trace IDs across agent handoffs, creating a stitched view of the entire execution graph.
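The hierarchical model above can be sketched as flat span records stitched back into a session tree, with cost rolled up per subtree (field names are illustrative):

```python
from collections import defaultdict

spans = [  # flat export of one session's spans
    {"id": "s1", "parent": None, "name": "session",         "cost": 0.0},
    {"id": "s2", "parent": "s1", "name": "planner_agent",   "cost": 0.012},
    {"id": "s3", "parent": "s1", "name": "search_agent",    "cost": 0.004},
    {"id": "s4", "parent": "s3", "name": "tool:web_search", "cost": 0.0},
    {"id": "s5", "parent": "s3", "name": "llm_call",        "cost": 0.003},
]

children = defaultdict(list)
for s in spans:
    children[s["parent"]].append(s)

def rollup(span_id: str) -> float:
    """Total cost of a span plus everything beneath it."""
    own = sum(s["cost"] for s in spans if s["id"] == span_id)
    return own + sum(rollup(c["id"]) for c in children[span_id])

print(round(rollup("s1"), 4))  # session-level cost -> 0.019
```

The same parent-pointer structure answers "which agent introduced the error" (walk down) and "what did this request cost" (sum up).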
Subagent Architecture and Clean Trace Boundaries
Multi-agent architectures that use dedicated subagents for discrete tasks (search, code editing, evaluation) naturally produce cleaner traces than monolithic agents. Each subagent is a bounded context with a clear input/output contract. WarpGrep runs 8 parallel tool calls per turn for up to 4 turns, each call a separate span. Morph Fast Apply processes edit operations at 10,500+ tok/s with deterministic input-output mapping. When your subagents have clean boundaries, your traces have clean boundaries.
Key Metrics to Track
| Category | Metric | What It Tells You |
|---|---|---|
| Latency | Time to first token (TTFT) | User-perceived responsiveness. High TTFT = users staring at a spinner. |
| Latency | Total generation time (E2E) | Full request duration including all model calls and tool invocations. |
| Cost | Tokens per request (prompt + completion) | Direct input to cost calculation. Track ratio of prompt to completion tokens. |
| Cost | Cost per task (session-level) | Aggregate cost across all steps in a multi-agent workflow. The real unit economics metric. |
| Cost | Cache hit ratio | Percentage of requests served from cache (prompt caching, semantic caching). Higher = lower cost. |
| Quality | Hallucination rate | Percentage of outputs containing fabricated facts. Measured via faithfulness scoring or LLM-as-judge. |
| Quality | Eval pass rate | Percentage of outputs passing automated evaluation criteria (relevance, groundedness, safety). |
| Reliability | Error rate by type | API failures, rate limits, timeouts, content filter blocks. Track separately for each provider. |
| Reliability | Steps per task | For agents: how many steps to complete a task. Increasing steps = degrading efficiency. |
| Safety | Guardrail trigger rate | How often safety filters (jailbreak detection, PII redaction, toxicity) activate. Baseline for policy tuning. |
The metric that matters most depends on your deployment stage. Pre-production: evaluation pass rate (is the model producing correct outputs?). Early production: error rate and latency (is the system reliable?). At scale: cost per task and cache hit ratio (is it sustainable?).
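Most of these metrics fall out of simple aggregations over a request log. A sketch computing cache hit ratio and error rate by type (field names are illustrative):

```python
requests = [  # one record per LLM request
    {"cache_hit": True,  "error": None},
    {"cache_hit": False, "error": None},
    {"cache_hit": False, "error": "rate_limit"},
    {"cache_hit": True,  "error": None},
    {"cache_hit": False, "error": "timeout"},
]

cache_hit_ratio = 100 * sum(r["cache_hit"] for r in requests) / len(requests)

errors_by_type: dict[str, int] = {}
for r in requests:
    if r["error"]:
        errors_by_type[r["error"]] = errors_by_type.get(r["error"], 0) + 1

print(cache_hit_ratio)   # -> 40.0
print(errors_by_type)    # -> {'rate_limit': 1, 'timeout': 1}
```

Tracking error types separately matters because the remediation differs: rate limits call for backoff or routing, timeouts for smaller prompts or streaming.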
Frequently Asked Questions
What is LLM observability?
The practice of collecting and correlating signals across every model interaction, including prompts, completions, tool calls, retrieval steps, guardrail checks, and costs, to understand the performance, reliability, and risk of AI applications in production. It extends traditional observability with behavioral signals specific to language models: hallucination detection, prompt drift monitoring, evaluation scoring, and multi-step agent tracing.
How is LLM observability different from traditional APM?
Traditional APM tracks infrastructure metrics: CPU, memory, latency, error rates. LLM observability adds a semantic layer. An LLM can return HTTP 200 with a confidently wrong answer. APM sees a successful request. LLM observability detects the hallucination. The key additions are prompt/completion logging, token cost tracking, output quality evaluation, and multi-step agent tracing that follows reasoning chains across model calls.
What are the best open-source LLM observability tools?
Langfuse (MIT, acquired by ClickHouse, 1,000+ self-hosted deployments) is the most complete: tracing, evals, prompt management, datasets. Arize Phoenix is built entirely on OpenTelemetry, making it the most portable. Helicone (Apache 2.0) provides the fastest setup via proxy integration with built-in caching and cost tracking. All three have generous free tiers.
Does OpenTelemetry support LLM observability?
Yes. OpenTelemetry GenAI Semantic Conventions (v1.37+) define standard schemas for LLM traces. Datadog, Splunk, Arize, and Langfuse all accept OTel-native ingestion. The OpenLLMetry project by Traceloop provides auto-instrumentation libraries for OpenAI, Anthropic, Cohere, and 15+ other providers. Instrument once, export to any backend.
What metrics should I track first?
Start with three: latency (time to first token + total generation time), cost per request (prompt + completion tokens), and error rate by type (API failures, rate limits, content filter blocks). Add evaluation scoring once you have ground truth datasets. Add session-level cost tracking once you're running multi-step agent workflows.
Related
Subagents with Clean Trace Boundaries
Morph powers the subagent layer of coding agents. WarpGrep searches codebases with 8 parallel tool calls per turn. Fast Apply merges edits at 10,500+ tok/s. Both produce structured, traceable operations by design.