Two things a demo of agent monitoring software will not tell you. Whether it catches the specific way your agents fail, because the demo runs on the vendor's example traces. And what you will actually pay, because most of these tools bill per span, event, or request, and one agent turn is many of those. The sticker price and the real bill are different numbers. This page is the buyer's guide: the criteria that matter, the pricing side by side, and what each path costs. Pricing verified June 2026.
The reason a clean comparison is hard: every tool meters on a different unit, and most agent runs emit many units per turn. A single turn with three model calls and three tool calls is roughly seven spans on Datadog, Arize, or Sentry, seven events on AgentOps or Raindrop, but one request on Helicone. Same workload, very different bill. Every vendor's enterprise tier is sales-led with no public number, so the published prices below are the self-serve plans, which is what most teams actually start on.
Start with what you are buying for. Two pages sit next to this one: what agent monitoring is defines the category and its failure modes, and the best agent monitoring tools ranks each one on what it catches. This page answers the two questions those leave open: how to choose, and what it costs.
The Buying-Criteria Checklist
Six criteria separate these tools more than any feature list. Score each candidate on all six before you look at a dashboard. The first three decide whether the tool fits your failures; the last three decide what it costs you over a year.
Real-time or post-hoc?
Almost every tool here is post-hoc: it records what happened so you read it later, which cannot stop a jailbreak, a loop, or a policy violation while the agent is still running. Ask whether it can block a turn, or only show it to you tomorrow.
Can you train a custom check?
The most common complaint in this category is needing a check the vendor never built: numeric precision for finance agents, render quality for UI agents, tone and policy for support agents. Ask whether you can define and train your own signal, or are limited to the built-in checks.
Does it cover your framework?
First-class support for LangGraph, CrewAI, the OpenAI Agents SDK, AutoGen, Pydantic AI, or the Vercel AI SDK saves weeks of instrumentation. OpenTelemetry-native tools (Phoenix, Langfuse) are framework-agnostic but need more wiring. Ask whether it drops into your stack, or you instrument every span by hand.
Per-span or per-event pricing?
This quietly sets your bill. Per-span (Datadog, Arize, Sentry), per-event (AgentOps, Raindrop), and per-unit (Langfuse) tools all charge for every sub-step, so a 7-step turn costs 7x a single call. Ask what one billable unit is, and how many your agent turns produce.
What does eval upkeep cost?
Eval-first tools (Braintrust, Phoenix) are powerful, but scorers go stale, datasets drift, and LLM-as-judge scorers cost a model call each. Budget the engineering time, not just the license. Ask who keeps the evals current, and what running them at production volume costs.
Can you keep the data?
If traces carry customer data, or you cannot put a third party between your app and your model, self-host matters. Langfuse (MIT) and Arize Phoenix (ELv2) run on your own infrastructure with no usage cap, while Datadog and Raindrop are hosted-only. Ask whether you can keep the data in-house, and at what license.
Pricing, Side by Side
Self-serve plans only, verified against each vendor's pricing page in June 2026. Every tool also has a sales-led enterprise tier with no public price, so these are the numbers you actually start on. "Billing unit" is the thing you pay per, which matters more than the plan price for an agent workload (see the next section).
| Tool | Free tier | Entry paid plan | Billing unit | Self-host / OSS |
|---|---|---|---|---|
| Langfuse | 50k units/mo, 2 users | Core $29/mo (100k units) | Per unit: each trace + span + score | Yes, MIT, free |
| Braintrust | 1GB + 10k scores/mo, unlimited seats | Pro $249/mo | GB processed + scores | Enterprise only |
| Arize | Phoenix: free self-host; AX: 25k spans/mo | AX Pro $50/mo (50k spans) | Per span ($0.0008 over) | Phoenix ELv2, free |
| Datadog LLM Obs | 40k LLM spans/mo | Pro $160/mo (100k spans) | Per LLM span (~$8/10k) | No |
| AgentOps | 5k events/mo | Pro from $40/mo | Per event (call or tool) | Yes, MIT, full app |
| Helicone | 10k requests/mo, 7-day | Pro $79/mo | Per request | Apache 2.0 (maint. mode) |
| Sentry | 5k errors/mo, 5M spans | Team $26/mo; Seer +$40/contributor | Per span (in tracing quota) | FSL, source-available |
| Raindrop | 14-day trial, no free tier | Startup $59/mo + $0.001/event | Base fee + per event | No |
| Reflex (build-your-own) | Self-serve, usage-based | No monthly base; pay per event | Per event = 2048 tokens, $0.001 | API, OpenAI-FT-compatible |
Sources: each vendor's pricing page, June 2026. Langfuse Pro is $199/mo above Core; Braintrust Starter is free with $10/mo credits; Datadog bills ~$8/10k LLM spans annually (~$12 on-demand) above a 100k-span minimum; Sentry's AI agent monitoring is billed as spans in the tracing quota, with the Seer AI add-on at $40 per active contributor per month; Raindrop Pro is $399/mo + $0.0007/event. Reflex realtime is $0.001/event for the first 1M events then $0.0005; batch is half that. Enterprise tiers for all eight are sales-led. See Morph pricing.
Two kinds of free are mixed in that table. Open-source self-host (Langfuse, Phoenix, AgentOps, Sentry) means the software is free and you pay for the compute and storage you run it on, with no usage cap. A hosted free tier (Datadog 40k spans, Helicone 10k requests, AgentOps 5k events, Sentry 5k errors) is a trial-sized quota that a production agent burns through in days because of fan-out. Helicone is Apache 2.0 and self-hostable, but in maintenance mode since its March 2026 acquisition by Mintlify, so the price is low and the roadmap is frozen.
The Billing Unit Decides Your Bill
The plan name tells you the floor. The billing unit tells you the slope. Six of the eight tools meter on a unit that an agent emits many of per turn, so your cost scales with how many steps your agents take, not how many users you have or how much traffic you serve.
Sentry's own documentation gives the clean example: an agent step with three model calls and three tool calls traces as roughly seven spans (one for the agent invocation, one per call, one per tool). On a per-span or per-event tool, that one turn is seven billable units. Run a million of those turns and you are billing for roughly seven million units, even if traffic and headcount never changed. This is why teams report bills jumping 3x to 5x when they move from a single-call app to an agent, with no change in user count.
Three buyer implications fall out of this. A simpler agent (few tool calls) is cheap on every model; a deep agent (many tool calls, sub-agents, retries) gets expensive fast on per-span and per-event tools. A per-request tool looks cheap because it bills once, but it also sees only the request, not the reasoning and tool calls inside it. And a token-bucket unit (Reflex's 2,048-token event) is the one model where your cost is predictable from the size of the turn you classify, independent of how many steps the agent ran to produce it.
One billing behavior is worth singling out. Datadog auto-activates LLM Observability when your telemetry carries the standard OpenTelemetry GenAI tags (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens). The premium product starts billing on the presence of those tags, with no opt-in toggle; to stop it you strip the attributes in your collector config. If you already run Datadog, confirm this before you instrument agents, or the bill activates itself.
Enterprise vs Startup vs Build-Your-Own
Three paths, three cost structures. Pick the one that matches how you buy, not just what you need.
Enterprise: consolidate on one vendor
If you already pay Datadog or Sentry, putting agents on the same bill and the same dashboard is the path of least resistance. Datadog LLM Observability, Sentry Business plus the Seer add-on, Braintrust Enterprise, Arize AX Enterprise, and Raindrop Enterprise are all sales-led, with negotiated volume discounts that the public per-span and per-event rates do not reflect. The trade is span metering at agent scale and a generic failure taxonomy. Buy here when single-vendor consolidation and procurement simplicity outweigh best-of-breed depth.
Startup: free tier now, watch the cliff
For a small team, the cheap entry points are real: Sentry Team at $26/mo, Langfuse Core at $29/mo, AgentOps Pro from $40/mo, Arize AX Pro at $50/mo. Open-source self-host (Langfuse, Phoenix) costs nothing for the software. The cliff is fan-out: a hosted free tier sized in the thousands of spans or events is gone in days once a production agent is running, and you land on usage-based overage. Start free, but model your real per-turn unit count before you commit, because the overage rate, not the base plan, is what you will pay.
Build-your-own: an OSS tracer plus a classifier API
The build path is now two pieces, not a from-scratch project. A free open-source tracer (Langfuse or Phoenix) stores the structure of every run. A classifier API supplies the one thing that used to make building impractical: a fast, accurate judgment of what each turn meant. You own the dashboard, the alerting, and the definition of a bad turn, and you add any custom check the vendors do not sell. The economics work because the semantic signal is now an API call instead of a frontier-model judge.
Build Your Own on a Classifier API
Every tool above is a product built around one capability: turning agent activity into a signal you can act on. Reflex sells that capability directly, as an API, with the one thing the others hide: a per-event price you can read off the page. A reflex is a per-turn classifier that runs in under 90ms over up to 64k tokens of context and returns a verdict: jailbreak, loop, frustrated user, policy violation, or a custom failure you define. Because it is an API and not a dashboard, you build the monitoring you actually want on top of it.
The pricing is the differentiator on a buyer's-guide page. Reflex bills per event where one event is 2,048 tokens. Realtime is $0.001 per event (roughly $0.49 per million tokens classified) for the first 1M events, then $0.0005; batch is half that again. There is no monthly base fee and no per-seat charge: you pay for what you classify.
Compared with running a frontier model as a judge on every turn (a full model call, 1 to 3 seconds, $3 to $25 per million tokens), a specialized classifier is up to 10x cheaper, and fast enough to block a bad turn before it ships rather than flag it after.
The token-bucket unit is also what makes the bill predictable. On a per-span tool, a deeper agent costs more even at constant traffic, because depth means more spans. A Reflex event tracks the size of the turn you hand it, not the number of steps the agent took to produce it, so your monitoring cost stays legible as your agents get more complex.
The API is OpenAI-fine-tuning-compatible (/v1/fine_tuning/* to train, /v1/reflex/predict to score, base model morph-reflex-v1), so it drops into an existing stack, and you train a custom check on your own labeled failures in under an hour.
- Transparent per-event price, no base or seat fee
- Predictable cost: per token, not per step
- Real-time: blocks a turn mid-run, not after
- Train any custom check in under an hour
- Up to 10x cheaper than an LLM-as-judge
- It is a primitive; you build the dashboard
- Best paired with a tracer for full history
- Newest entrant in this list
Best for: teams that want a predictable per-event bill, real-time blocking, and a custom check no vendor sells. See Reflex and the full tool comparison.
Which One to Buy
| Your situation | Best fit | Why |
|---|---|---|
| Tightest budget, want OSS + own your data | Langfuse (MIT) or Phoenix (ELv2) | Free software, no per-unit charge |
| Already pay Datadog or Sentry | Datadog LLM Obs / Sentry | One bill, one dashboard; watch span metering |
| Evaluation is the core need | Braintrust | Deepest scorer and dataset workflow |
| Agent-native SaaS, fast | Raindrop / AgentOps | Built for agents; per-event billing |
| Predictable per-event cost, no base fee | Reflex ($0.001/event) | Pay per token classified, not per seat |
| Catch failures in real time (block the turn) | Reflex (<90ms) | Classifier in the request path, not a trace |
| A custom check no vendor sells | Build on Reflex | Train your own signal in under an hour |
| Avoid a frozen or acquired product | Skip Helicone | Maintenance mode since March 2026 |
Most teams end up with two layers: a tracer for history (an open-source one if budget is tight, or whatever is already in the stack) and a real-time classifier for the failures a trace cannot catch. The tracer answers "what happened." The classifier answers "is this turn okay, right now," and it is the layer with the clearest price.
The one number the other tools won't print
Reflex bills per event at $0.001, one event = 2048 tokens, no base or seat fee. A per-turn classifier in under 90ms: jailbreaks, looping, frustration, policy, or a custom check trained in under an hour. Build the monitoring you actually want on top of it.
Frequently Asked Questions
How much does agent monitoring software cost?
Entry self-serve plans run from free to about $249/month before usage (June 2026): Sentry Team $26, Langfuse Core $29, AgentOps Pro from $40, Arize AX Pro $50, Raindrop Startup $59 plus $0.001/event, Helicone Pro $79, Datadog LLM Observability Pro $160, Braintrust Pro $249. Open-source self-host (Langfuse, Phoenix) is free for the software. The sticker price is not the real cost: most tools meter per span, event, or request, so the bill scales with how many steps your agents take. See the pricing table.
Why is agent monitoring pricing so hard to compare?
Because the billing unit differs across tools and most agent runs emit many units per turn. Langfuse bills per unit (trace plus observation plus score), Arize, Datadog, and Sentry bill per span, AgentOps and Raindrop bill per event, Helicone bills per request. One agent turn is roughly seven spans or events but one request, so the same workload costs very differently. Enterprise tiers are all sales-led with no public price. See the billing unit section.
What is the cheapest agent monitoring option?
For the software, open-source self-host is cheapest: Langfuse (MIT) and Arize Phoenix (ELv2) are free to run with no per-unit charge. For a hosted free tier, Sentry Developer and AgentOps Basic cost nothing to start, though agent fan-out burns those quotas fast. For the cheapest predictable real-time per-turn check, Morph Reflex bills per event at $0.001 (one event = 2,048 tokens, about $0.49 per million tokens classified) with no monthly base fee.
What is the difference between per-span and per-event pricing?
Both bill for the sub-steps of an agent run, so both scale with agent complexity. A span is one node of a trace; an event in AgentOps or Raindrop is the same idea. Per-request pricing (Helicone) bills once per logged call, which is flatter but captures less. Reflex uses a token-bucket event (2,048 tokens), so its cost tracks the size of the turn, not the number of steps. See the billing unit section.
Should I buy agent monitoring software or build my own?
Buy a tracer when a generic feature set is fine; the open-source ones cost nothing for the software. Build when you keep hitting a custom check the vendor lacks, need to join monitoring with your own data, or need to act on a failure in real time. Building is now practical because the hard primitive, a fast semantic judge, is an API: Reflex returns a per-turn label in under 90ms at $0.001 per event. See build your own.
Which agent monitoring tools can I self-host?
Langfuse (MIT) and Arize Phoenix (Elastic License 2.0) are fully open-source with no usage caps. AgentOps open-sources its full app under MIT. Helicone is Apache 2.0 and self-hostable but in maintenance mode since its March 2026 Mintlify acquisition. Sentry is source-available (Functional Source License). Braintrust, Datadog, and Raindrop are closed; Braintrust self-host is Enterprise-only.
Does Datadog LLM Observability cost balloon with agents?
It can. It bills per LLM span at roughly $8 per 10,000 spans annually (about $12 on-demand) above a 100,000-span minimum on the $160/month plan, and agents emit many spans each. The sharper gotcha is automatic activation: if your telemetry carries the standard OpenTelemetry GenAI tags, Datadog auto-routes that data into the premium product and starts billing, with no opt-in toggle. See the activation gotcha.
Sources
- Langfuse pricing and self-host (MIT)
- Braintrust pricing
- Arize AX pricing and Phoenix (Elastic License 2.0)
- Datadog LLM Observability and cost docs
- AgentOps pricing and repo (MIT)
- Helicone pricing and the Mintlify acquisition (maintenance mode)
- Sentry pricing, Seer add-on, and the agent monitoring guide
- Raindrop pricing and its YC W24 profile
- Morph Reflex capabilities and pricing
- Pricing verified June 2026 against each vendor's published pricing page; enterprise tiers are sales-led with no public rate.