Build Your Own LLM Observability (2026): OpenTelemetry + ClickHouse + Reflexes, the Stack the Vendors Run

June 15, 2026
Langfuse, Helicone, and SigNoz all store their traces in ClickHouse and ingest OpenTelemetry. The build-your-own path is that same stack without the per-trace meter: instrument with OpenLLMetry, export spans to your own ClickHouse from around $66 a month, dashboard in Grafana, and run Reflexes over the content for the signals a span never carries. Pricing verified against each vendor's published page as of June 2026.

~$66/mo: ClickHouse Cloud Basic, the only real line item
$0: OpenLLMetry + Grafana OSS
<90ms: Reflex label per turn, inline

TL;DR

Build the stack the vendors run: OpenLLMetry (Apache-2.0) for instrumentation, ClickHouse for the trace store, Grafana for dashboards, and Morph Reflexes for the semantic signals a trace cannot carry. The open components are free; ClickHouse Cloud Basic starts around $66 a month and there is no per-trace meter. You own the data, you avoid lock-in, and you keep the option to add what no managed tool ships: a label on the meaning of every turn.

Why Own the Stack

Three things push teams off managed observability, and all three are structural rather than cosmetic.

The meter compounds on agents. A 20-step agent turn is one user request that fans out into 20 or more spans. On a per-trace plan that is one billable trace; on a per-event plan it is 20 or more billable units, so the bill scales with agent depth rather than with users.
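To make the fan-out concrete, here is a minimal sketch of the arithmetic. The function name, the plan labels, and the traffic figures are hypothetical, and it assumes exactly one span per agent step, which real agents usually exceed.

```python
def billable_units(requests: int, steps_per_request: int, plan: str) -> int:
    """Count billable units for a period of agent traffic.

    plan="per_trace": one unit per user request, however deep the agent goes.
    plan="per_event": one unit per span, assuming one span per agent step.
    """
    if plan == "per_trace":
        return requests
    if plan == "per_event":
        return requests * steps_per_request
    raise ValueError(f"unknown plan: {plan!r}")


# Hypothetical month: 10,000 user requests, 20 agent steps each.
print(billable_units(10_000, 20, "per_trace"))  # 10000
print(billable_units(10_000, 20, "per_event"))  # 200000
```

The user count is identical in both lines; only the agent depth moves the per-event figure.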

Lock-in is an instrumentation question. If your tracing lives in a vendor SDK, ripping the tool out means re-instrumenting your code. OpenTelemetry inverts that: instrument once, point the exporter anywhere. In an r/LangChain thread from this month, the top reply was "instrument everything via native OpenTelemetry so you can swap backends when you inevitably get frustrated."

The vendors already run this stack. ClickHouse acquired Langfuse in January 2026; Helicone migrated to ClickHouse and cut query times from over 100 seconds to 0.5 seconds; SigNoz is built on it. Building your own means assembling the open parts they package and meter.

The Architecture

Five layers, each swappable: four standard components fed by the OpenTelemetry protocol, plus a labeling layer on top.

| Layer | Component | License / Cost | Job |
| --- | --- | --- | --- |
| Instrument | OpenLLMetry (Traceloop) | Apache-2.0, free | Auto-emit OTel spans for LLM, vector DB, and framework calls |
| Collect | OpenTelemetry Collector | Apache-2.0, free | Receive spans, batch, route to the store |
| Store | ClickHouse | Apache-2.0 self-host, or Cloud ~$66/mo | Columnar store for high-ingest trace analytics |
| Dashboard | Grafana OSS | AGPLv3, free | Query and visualize spans, latency, token cost |
| Label | Morph Reflexes | Per-event API | Classify the meaning of each turn, write back as span attributes |

The first four layers are the standard OpenTelemetry-to-ClickHouse observability pattern, the same one used for application traces. The fifth layer is the one specific to LLMs, because LLM failures are semantic and a generic span does not capture them.

Instrument with OpenLLMetry

OpenLLMetry is an Apache-2.0 set of OpenTelemetry instrumentations maintained by Traceloop, with about 7.2k GitHub stars. It auto-instruments LLM providers, vector databases, and frameworks (LangChain, LlamaIndex, CrewAI), emitting standard OTel spans. One initialization call sends those spans to any OTel endpoint, including a Collector in front of ClickHouse.

Initialize tracing and point it at your collector

from traceloop.sdk import Traceloop

# Spans go to your OTel Collector, which writes to ClickHouse.
Traceloop.init(
    app_name="my-agent",
    api_endpoint="http://otel-collector:4318",
)

# From here, every LLM and framework call is auto-instrumented as an
# OpenTelemetry span: prompt, response, model, tokens, latency, tool calls.

Because the output is plain OpenTelemetry, nothing about this step binds you to a backend. The same spans can fan out to ClickHouse, Grafana Tempo, or a managed tool in parallel while you evaluate.

Store Spans in ClickHouse

ClickHouse is a columnar database built for high-ingest analytical queries, which is the trace workload exactly. The OpenTelemetry Collector has a ClickHouse exporter, so the path is Collector to ClickHouse with no custom glue. Once spans land, you query them with SQL: slowest agent turns, token cost per tool call, error rate by model.

Query agent traces with SQL

-- Most expensive agent turns in the last day, by total tokens
SELECT
    SpanAttributes['traceloop.entity.name'] AS agent_step,
    count()                                  AS calls,
    -- toInt64OrZero: spans without the attribute read as '' and would make toInt64 fail
    sum(toInt64OrZero(SpanAttributes['llm.usage.total_tokens'])) AS tokens
FROM otel_traces
WHERE Timestamp > now() - INTERVAL 1 DAY
GROUP BY agent_step
ORDER BY tokens DESC
LIMIT 20;
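For readers who want to check the aggregation by hand, the following is a plain-Python rendering of the same GROUP BY over a few made-up rows. The rows and the helper name are hypothetical, and treating a missing token attribute as zero is an assumption of this sketch, not something the exporter guarantees.

```python
from collections import defaultdict

# Hand-made rows shaped like the SpanAttributes map (string keys and values).
rows = [
    {"traceloop.entity.name": "plan", "llm.usage.total_tokens": "1200"},
    {"traceloop.entity.name": "plan", "llm.usage.total_tokens": "800"},
    {"traceloop.entity.name": "search", "llm.usage.total_tokens": "300"},
    {"traceloop.entity.name": "search"},  # non-LLM span: no token attribute
]


def tokens_by_step(span_rows, limit=20):
    """Group spans by step name, count calls, and sum tokens, largest first."""
    totals = defaultdict(lambda: {"calls": 0, "tokens": 0})
    for attrs in span_rows:
        step = attrs.get("traceloop.entity.name", "")
        raw = attrs.get("llm.usage.total_tokens", "")
        totals[step]["calls"] += 1
        # Missing or non-numeric values count as zero tokens (sketch assumption).
        totals[step]["tokens"] += int(raw) if raw.isdigit() else 0
    ranked = sorted(totals.items(), key=lambda kv: kv[1]["tokens"], reverse=True)
    return ranked[:limit]


for step, agg in tokens_by_step(rows):
    print(step, agg["calls"], agg["tokens"])
# plan 2 2000
# search 2 300
```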

Self-hosted ClickHouse is Apache-2.0 and free; you pay only for the machine. ClickHouse Cloud Basic runs around $66 a month in a low-usage worked example, billed on metered compute (about $0.22 per unit-hour) plus compressed storage. Either way, there is no per-trace charge, so agent depth does not inflate the bill.
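A rough way to sanity-check your own number is to split the bill into its two metered parts. The sketch below uses the $0.22 per unit-hour figure quoted above; the storage price, the usage figures, and the function name are assumptions for illustration, so substitute the rates on the current ClickHouse Cloud pricing page.

```python
def monthly_clickhouse_cost(
    compute_units: float,
    active_hours_per_day: float,
    storage_gb: float,
    storage_price_per_gb: float,   # assumption: look up the current rate
    unit_hour_price: float = 0.22,  # per unit-hour, as quoted in the text
    days: int = 30,
) -> float:
    """Estimate a monthly bill as metered compute plus compressed storage."""
    compute = compute_units * active_hours_per_day * days * unit_hour_price
    storage = storage_gb * storage_price_per_gb
    return round(compute + storage, 2)


# Hypothetical workload: 1 compute unit active 8 hours a day,
# 100 GB of compressed traces at an assumed $0.03 per GB-month.
print(monthly_clickhouse_cost(1, 8, 100, 0.03))  # 55.8
```

Note that neither term contains a span or trace count: deeper agents raise the bill only insofar as they add compute time or stored bytes.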

Cost: DIY vs Managed

The open components are free, so the DIY line item is the trace store. Set that against what managed tools charge at the same volume.

| Option | Monthly cost | What you run / give up |
| --- | --- | --- |
| DIY: OpenLLMetry + ClickHouse + Grafana | ~$50-100 | You operate ClickHouse; you own the data |
| Langfuse Core | ~$101 | Managed; MIT core if you self-host instead |
| LangSmith Plus | ~$2,514 | Managed, closed; one seat at 14-day retention |

The DIY number and Langfuse Core are close, which is the honest read: if you want managed and open, Langfuse Core is a fine deal and self-hosting Langfuse is free. DIY wins decisively over the closed per-trace plans, and it wins on control: your traces never leave your infrastructure, and you can run any classifier you want over them. For the full vendor-by-vendor pricing, see Langfuse vs LangSmith and LangSmith alternatives.

The Layer Traces Miss: Reflexes

A span records structure: prompt, response, latency, tokens, the call tree. It does not record meaning. A response that quotes the wrong refund policy returns a 200 with normal latency and a normal token count. A user who is quietly getting angry produces the same span as a delighted one. An agent stuck in a three-step loop looks like an agent doing work. The trace is green and the product is broken. These are the signals that are not tracebacks, and a DIY stack stores them in ClickHouse without ever labeling them.

A Morph Reflex is the inference layer that adds the label. It is a small, fast text classifier that scores the content of a turn and returns a label in under 90 milliseconds, cheap enough to run on every span rather than an offline sample. Built-in signals include stuck-in-a-loop, leaked-thinking, jailbreak, guardrail, incomplete-thought, ambiguity, and difficulty, and you can train a custom signal for your product in under an hour from a prompt, a labeled set, or an unlabeled one.

Label a span's content, then write it back to ClickHouse

curl -X POST "https://api.morphllm.com/v1/reflex/predict" \
  -H "Authorization: Bearer $MORPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "stuck-in-a-loop", "text": "<the agent turn from the span>"}'

# {
#   "model": "stuck-in-a-loop",
#   "mode": "single_label",
#   "classes": [
#     { "class_id": 0, "label": "progressing", "score": 0.04, "selected": false },
#     { "class_id": 1, "label": "looping",     "score": 0.96, "selected": true }
#   ],
#   "inference_time_ms": 88
# }

The predicted label is the class with selected: true. Write it onto the span as an attribute, and now your ClickHouse query can do what no raw trace can: count looping agents, find frustrated sessions, and alert on policy violations that never threw an error. The same outputs feed evals, fine-tunes, and RL reward terms downstream. At the realtime rate of $0.0005 per event (one event is 2048 tokens), labeling every turn stays cheap enough to leave on in production.
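As a minimal sketch of that write-back step, the snippet below pulls the selected class out of a response shaped like the example and turns it into span attributes. The `reflex.<model>.label` and `reflex.<model>.score` attribute names are hypothetical, not a Morph or OpenTelemetry convention, and the code only handles the `single_label` shape shown here.

```python
import json

# Response body copied from the example above.
RESPONSE = json.loads("""
{
  "model": "stuck-in-a-loop",
  "mode": "single_label",
  "classes": [
    {"class_id": 0, "label": "progressing", "score": 0.04, "selected": false},
    {"class_id": 1, "label": "looping", "score": 0.96, "selected": true}
  ],
  "inference_time_ms": 88
}
""")


def selected_label(response: dict) -> tuple[str, float]:
    """Return (label, score) of the first class marked selected."""
    for cls in response["classes"]:
        if cls.get("selected"):
            return cls["label"], cls["score"]
    raise ValueError("response has no class marked selected")


def to_span_attributes(response: dict) -> dict[str, object]:
    """Build span attributes under a hypothetical reflex.<model>.* namespace."""
    label, score = selected_label(response)
    prefix = f"reflex.{response['model']}"
    return {f"{prefix}.label": label, f"{prefix}.score": score}


print(to_span_attributes(RESPONSE))
# {'reflex.stuck-in-a-loop.label': 'looping', 'reflex.stuck-in-a-loop.score': 0.96}
```

Keeping the score next to the label lets a later query apply its own threshold instead of relying only on the selected class.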

Why a classifier, not an LLM-as-judge

The managed tools approximate semantic checks with LLM-as-judge evals that run offline on a sample. That misses the looping agent in real time and costs a full model call per check. A Reflex runs inline on every turn at classifier speed and price, so the signal lands before the agent takes its next step, not in tomorrow's eval run.

When to Buy Instead

DIY is not free; it moves the cost from a vendor bill to your team's time. Buy a managed tool when you do not want to operate ClickHouse, when you need prompt management and annotation queues out of the box, or when you are committed to LangChain and want first-party LangSmith tracing. Self-host Langfuse when you want a managed-grade product on your own infrastructure for the price of the servers. The decision is whether owning the data and the bill is worth running one more database. Either way, the semantic layer is additive: Reflexes returns a label over an API, and that label writes onto a managed span or a DIY one the same way.

Frequently Asked Questions

Can I build my own LLM observability instead of buying a tool?

Yes, and it is the stack the vendors run: OpenLLMetry to instrument, ClickHouse to store, Grafana to dashboard. Langfuse, Helicone, and SigNoz all use ClickHouse. See the architecture.

What is OpenLLMetry and is it free?

OpenLLMetry is an Apache-2.0 set of OpenTelemetry instrumentations for LLM apps from Traceloop (~7.2k stars). It is free and emits standard OTel spans you can send anywhere, covered in instrument with OpenLLMetry.

How much does a DIY stack cost?

The instrumentation and dashboard are free; the trace store is the line item. ClickHouse Cloud Basic is around $66 a month, or self-host for the cost of a VM. The cost section compares it to managed plans.

What does a trace still miss in a DIY stack?

Meaning. Wrong answers, frustration, and looping all produce structurally normal spans. Labeling them needs a per-turn classifier, which is what Reflexes adds.

Own the stack, add the layer it can't see

Run OpenTelemetry and ClickHouse for the traces; run Reflexes over the content for the semantic signals that never throw. One API, under 90 milliseconds a turn.