Braintrust vs Langfuse vs Raindrop

Three different answers to monitoring AI agents: Braintrust is the closed eval-first loop metered by scores ($0 to $249/mo, no mid-tier). Langfuse is the open-source OTel tracer you self-host free or run usage-priced. Raindrop is the agent-native startup ($15M seed) that gates Custom Signals, Issue Detection, and Triage to its $399/mo Pro tier and runs them on a general model. Full pricing, open vs closed, what each is built for, and the inline per-turn signal all three miss.

June 27, 2026 · 2 min read

A team standing up monitoring for an AI agent picks between three tools with three different shapes, and the friction shows up before features do. Braintrust publishes its full price and runs self-serve, but it is closed and eval-first. Langfuse is the one you can self-host and own the data on, open source under an MIT core. Raindrop is the finished agent-native product, but it has no free tier and gates the parts most teams come for (Custom Signals, Issue Detection, and Triage) behind its $399/mo Pro tier. Pricing verified against each vendor's published page as of June 2026.

Scores: what Braintrust meters
Open source: Langfuse, self-host free
$399/mo: Raindrop tier that unlocks Signals

What buyers actually run into

On Braintrust, the cliff: the free Starter plan's 10k scores run out and the next step is the $249/mo Pro tier, with nothing in between. On Langfuse, the tradeoff is operational: self-hosting is free but you run and scale the Postgres, ClickHouse, and workers yourself. On Raindrop, the gate: there is no free tier, only a 14-day trial, and Custom Signals, Issue Detection, and Triage are reserved for the $399/mo Pro plan, on top of per-event fees, with Signals running on a general-purpose model whose accuracy on custom failures you tune by hand.

TL;DR

Three tools, three shapes, one shared gap. Braintrust meters scores and publishes every number self-serve, from $0 to $249/mo flat, closed and eval-first. Langfuse is open source under an MIT core, free to self-host and usage-priced in the cloud, the default tracer. Raindrop is the agent-native startup, closed and hosted-only, with no free tier and its real features gated to $399/mo Pro plus per-event fees.

Pick Braintrust when your bottleneck is offline evaluation and you want to own the eval logic: datasets, scorers, run-to-run diffs, and CI regression gates, framework-agnostic, metered by scores (10k free, 50k on the $249/mo Pro plan). Pick Langfuse when you want open-source tracing you self-host and own the data on, OpenTelemetry-native, free on your own infrastructure or usage-priced in the cloud. Pick Raindrop when you want a finished agent-native product that traces runs, auto-detects silent failures, and triages incidents, and you are comfortable on a closed, hosted plan that starts at $59/mo and reaches its real capability at $399/mo Pro. All three watch the agent after the fact; none of them scores the turn in front of the user in real time.

Quick Comparison

Braintrust is the closed eval loop billed on scores, $0 to $249/mo flat. Langfuse is the open-source tracer you self-host free or run usage-priced. Raindrop is the closed, hosted agent-native product, $59/mo entry with the real features at $399/mo Pro plus per-event fees. The table lines up every dimension.

Braintrust vs Langfuse vs Raindrop (June 2026)

| Dimension | Braintrust | Langfuse | Raindrop |
| --- | --- | --- | --- |
| Built for | Offline eval loop, scorers + datasets | Open-source tracing + observability | Agent-native monitoring + auto-triage |
| License | Closed source | Open source (MIT core) | Closed source |
| Self-host | Enterprise only (quote) | Yes, free | No |
| Metered unit | Scores (+ data, tokens) | Units (traces / observations / scores) | Per event + flat tier |
| Free tier | $10 credits, 10k scores, 14-day | Self-host free; cloud Hobby free | None (14-day trial) |
| First paid tier | Pro $249/mo: 50k scores, RBAC | Usage-priced cloud (~$101/mo at 1M events) | Startup $59/mo + $0.001/event |
| Real features tier | RBAC + 30-day on Pro $249 | All core features in the OSS build | Custom Signals, Triage gated to $399 Pro |
| Real-time per-turn | No (offline LLM-as-judge) | No (post-hoc traces) | Alerts + triage, not inline blocking |

Pricing and Transparency

The three bills follow three different meters. Braintrust charges flat tiers sized in scores. Langfuse charges nothing if you self-host and usage if you run the cloud. Raindrop charges a flat tier plus a per-event fee, and reserves the features most teams want for the higher tier. Every number below is verified against each vendor's published page as of June 2026.

Braintrust is sized in scores, plus processed data and tokens, and the price is flat. The free Starter plan ships $10 of credits, 1 GB of processed data, 10k scores, 14-day retention, and unlimited users. Pro is $249/mo for 5 GB, 50k scores, 30-day retention, and RBAC, with no tier in between, so a team that outgrows the free 10k scores goes straight to $249/mo. Overage on Pro is $3/GB plus $1.50 per 1k scores. On-prem and hybrid deployments are quote-based Enterprise.
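The cliff is easier to see as arithmetic. The sketch below uses only the figures quoted above; the function name is hypothetical, and it assumes overage is billed linearly with no rounding and ignores the $10 of Starter credits, neither of which the published numbers here confirm.

```python
# Figures quoted in the paragraph above (June 2026).
STARTER_SCORES = 10_000
STARTER_GB = 1.0
PRO_BASE_USD = 249.0
PRO_SCORES = 50_000
PRO_GB = 5.0
OVERAGE_USD_PER_GB = 3.0
OVERAGE_USD_PER_1K_SCORES = 1.50


def braintrust_monthly_usd(scores: int, processed_gb: float) -> float:
    """Estimate a monthly bill. ASSUMPTION: overage is linear, no rounding."""
    # Anything inside the Starter allowance is free.
    if scores <= STARTER_SCORES and processed_gb <= STARTER_GB:
        return 0.0
    # Otherwise the next available plan is Pro, plus overage past its allowance.
    extra_scores = max(0, scores - PRO_SCORES)
    extra_gb = max(0.0, processed_gb - PRO_GB)
    return (
        PRO_BASE_USD
        + extra_gb * OVERAGE_USD_PER_GB
        + extra_scores / 1000 * OVERAGE_USD_PER_1K_SCORES
    )


print(braintrust_monthly_usd(10_000, 1.0))  # 0.0, inside the free allowance
print(braintrust_monthly_usd(10_001, 1.0))  # 249.0, one score over the cliff
print(braintrust_monthly_usd(80_000, 7.0))  # 300.0, Pro plus overage
```

Under these assumptions the first score past 10k costs $249, while the next 40k scores cost nothing extra.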

Langfuse is the only one with a free production path: the core is MIT-licensed and self-hostable, so you run the full product on your own infrastructure and pay only for the Postgres, ClickHouse, and workers behind it. The managed cloud has a free Hobby tier and then usage-based pricing metered in units (traces, observations, and scores), which works out to roughly $101/mo at a million events on the published tiers. The cost is operational, not licensing: you own the database, the scaling, and the upgrades.

Raindrop has no free tier, only a 14-day trial, and a two-part bill. Startup is $59/mo plus $0.001 per event and gives you basic tracing, search, and signals. Pro is $399/mo plus $0.0007 per event and is where the product actually lives: Custom Signals, Issue Detection, the Triage agent, Experiments, and Semantic Search are all Pro-tier features. So the $59 entry is a teaser: the real price for a team that wants the capabilities Raindrop is known for is $399/mo plus per-event fees, and there is no way to self-host any of it.
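A quick way to read the two-part bill is to compute it at your own event volume. This is a minimal sketch using the flat and per-event figures quoted above; the function names are hypothetical, and it assumes the per-event fee applies to every event with no included allowance.

```python
# (flat USD per month, USD per event), as quoted above.
STARTUP = (59.0, 0.001)
PRO = (399.0, 0.0007)


def raindrop_monthly_usd(plan: tuple[float, float], events: int) -> float:
    """Flat tier plus per-event fee. ASSUMPTION: no free event allowance."""
    flat, per_event = plan
    return flat + per_event * events


def break_even_events() -> float:
    """Volume where Pro's lower event fee offsets its higher flat fee.

    Solves 59 + 0.001 * e = 399 + 0.0007 * e for e.
    """
    return (PRO[0] - STARTUP[0]) / (STARTUP[1] - PRO[1])


print(f"{raindrop_monthly_usd(STARTUP, 1_000_000):,.2f}")  # 1,059.00
print(f"{raindrop_monthly_usd(PRO, 1_000_000):,.2f}")      # 1,099.00
print(f"{break_even_events():,.0f}")                       # 1,133,333
```

On these numbers, Pro only becomes the cheaper plan on raw cost above roughly 1.13 million events a month; below that, the extra spend is what the gated features cost you.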

The shape of each bill

Braintrust's Pro is $249/mo no matter how many people log in; the cost moves with how many scores and how much data your evals produce. Langfuse self-hosted has no per-unit meter at all, only your infrastructure bill, which is why a high-volume team often lands there. Raindrop bills a flat tier plus an event fee, so the cost tracks both your plan and your traffic, and the capabilities you most want sit on the higher of the two tiers.

Open Source vs Closed

This is the cleanest line through the three. Langfuse is open source under an MIT core, so you can read it, self-host it, fork it, and keep every trace on infrastructure you control. For a team with data-residency rules, a security review that balks at sending production traffic to a vendor, or a simple preference to own the stack, that is decisive, and it is the reason Langfuse is the default first move for so many teams.

Braintrust and Raindrop are both closed. Braintrust self-hosting exists only as a quote-based Enterprise arrangement; Raindrop has no self-host option at any price and runs hosted-only. With both, your traces, your prompts, and your agent transcripts live on the vendor's infrastructure, and you trade ownership for a managed product you do not operate. That is a reasonable trade for many teams, but it is a trade Langfuse does not ask you to make, and it is worth naming before the feature comparison rather than after.

What Each Is Built For

The three were built to answer different questions, and the mechanism gives each away. Braintrust scores outputs. Langfuse captures traces. Raindrop watches agents and triages incidents. None of the three is wrong; they sit at different points in the loop, and the one you want depends on where your pain is.

Braintrust owns the eval loop. Datasets, scorers, and the diff between runs are the primary objects. You bring your own scoring functions, read run-to-run diffs as first-class objects, and drop regression gates into CI. The eval logic is yours, so when a score looks wrong you open the scorer and fix it. It is the strongest pick when the question is "is this prompt or model change better," answered offline on fixed datasets.
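To make "run-to-run diff" and "regression gate" concrete, here is a tool-agnostic sketch in plain Python. It does not use the Braintrust SDK; the function names, the case ids, and the 0.02 tolerance are all hypothetical choices for illustration.

```python
from statistics import mean


def diff_runs(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Per-case score change for cases scored in both runs."""
    return {
        case: candidate[case] - baseline[case]
        for case in baseline
        if case in candidate
    }


def passes_gate(baseline: dict[str, float], candidate: dict[str, float],
                max_mean_drop: float = 0.02) -> bool:
    """True if the mean score did not drop by more than max_mean_drop."""
    deltas = list(diff_runs(baseline, candidate).values())
    return mean(deltas) >= -max_mean_drop


# Scores from the same fixed dataset, before and after a prompt change.
baseline_run = {"refund-policy": 0.90, "order-status": 0.85, "escalation": 0.80, "greeting": 0.95}
candidate_run = {"refund-policy": 0.88, "order-status": 0.70, "escalation": 0.78, "greeting": 0.94}

for case, delta in diff_runs(baseline_run, candidate_run).items():
    print(f"{case}: {delta:+.2f}")
print("gate passed:", passes_gate(baseline_run, candidate_run))  # gate passed: False
```

In a CI job you would turn the boolean into an exit code, for example `raise SystemExit(0 if passes_gate(...) else 1)`, so a regressing change fails the build.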

Langfuse captures what happened. It is OpenTelemetry-native tracing first: every LLM call, tool call, and nested step renders as a span, with prompt management and a place to run LLM-as-judge evals on top. It tells you, after the run, exactly what the agent did. What a trace cannot tell you on its own is whether the run was any good, which is why eval sits as a configurable layer rather than the core, and why the signal arrives after the fact rather than in the loop.

Raindrop watches the agent. It is purpose-built for agents rather than retrofitted: trace every run, auto-detect silent failures, group them into incidents, run Custom Signals over millions of events, and triage what broke. For a team that wants a finished agent-native product and a polished incident workflow, it is the most complete of the three out of the box. The two caveats are that the capabilities live on the $399/mo Pro tier, and that Signals run on a general-purpose model, so accuracy on custom failure modes is limited and the workflow is alert-and-triage after the fact, not block-the-turn in real time.

When Braintrust Wins

Braintrust wins when your bottleneck is offline evaluation and you want to own the eval logic on a published, self-serve price. You bring your own scorers and datasets, read run-to-run diffs as first-class objects, and drop regression gates into CI, reading your cost off the page in scores. It is the better pick for a framework-agnostic team that wants a neutral eval scaffold over a prebuilt metric set.

Strengths
  • You own the eval logic: your own scorers, datasets, run-to-run diffs
  • Self-serve, published pricing tied to scores
  • Framework-agnostic, strong CI regression gating
Limitations
  • Free tier is 10k scores; production jumps straight to $249/mo Pro, no tier in between
  • Closed source; self-hosting is Enterprise-only
  • Offline LLM-as-judge on samples, not a live per-turn check

When Langfuse Wins

Langfuse wins when you want open-source tracing you self-host and own the data on. It is OpenTelemetry-native, free to run on your own infrastructure, backed by the largest community of the three, and the safe default when data residency or a security review rules out sending production traffic to a vendor. The tradeoff is operational: you run and scale the database and workers yourself, and turning traces into a verdict on quality is a layer you configure.

Strengths
  • Open source (MIT core): self-host free, own your trace data
  • OpenTelemetry-native, broad framework support, largest community
  • No per-unit meter when self-hosted, only your infrastructure cost
Limitations
  • Tracing-first: shows what happened, not whether the run was good
  • Eval is a configurable layer you instrument, and it is post-hoc
  • Self-hosting means you operate the Postgres, ClickHouse, and workers

When Raindrop Wins

Raindrop wins when you want a finished agent-native product rather than a kit, and you are comfortable on a closed, hosted plan. It traces runs, auto-detects silent failures, groups them into incidents, and triages them, with a polish that comes from being built for agents from day one. It fits a team that values the managed workflow over ownership and is ready for the $399/mo Pro tier where the real capabilities live.

Strengths
  • Agent-native and polished, not an APM bolt-on
  • Auto-detects silent failures and triages incidents out of the box
  • Custom Signals over millions of events, the closest analog to a classifier
Limitations
  • The features you need (Custom Signals, Issue Detection, Triage) are gated to the $399/mo Pro tier, plus per-event fees
  • No free tier, only a 14-day trial; closed and hosted-only, no self-host
  • Signals run on a general model, so accuracy on custom failures is limited (false positives you tune by hand)
  • Alerts and triages after the fact; does not block the bad turn live

The Fourth Option: Own the Stack

There is a path none of the three sells as the headline, and a growing number of teams take it: skip the per-score, per-event, and managed-cloud meters and instrument yourself. With Langfuse already open source this is less of a leap than it sounds. You wire OpenTelemetry-native tracing into a datastore and dashboards you run, give up the managed UI and the proprietary signals, and in return pay no per-unit meter and have no closed vendor holding your transcripts.

For the always-on signals, you instrument with OpenLLMetry (Apache-2.0, free) or self-hosted Langfuse (MIT core, free), store the spans in your own ClickHouse (self-host the Apache-2.0 build for free), and dashboard it with Grafana OSS. You take on the operational weight in exchange for no per-unit meter and no sales motion. We walk through the whole build in build your own LLM observability.

What They All Miss: Semantic Signals in Production

All three watch the agent after the run, not the turn in front of the user. Braintrust runs LLM-as-judge offline on samples. Langfuse captures the trace and you score it later. Raindrop alerts and triages once a failure has already happened. A wrong refund-policy answer still returns a 200 with normal latency and a normal token count, a frustrated user produces the same span as a happy one, and the batch, the trace review, or the incident that would catch it arrives after the user already saw the bad turn.

A Reflex closes that specific gap: an inline per-turn classifier you call directly, self-serve, regardless of which of the three you picked. Where Braintrust's LLM-as-judge runs offline on samples, Langfuse's trace is post-hoc, and Raindrop's Signals run on a general model and alert after the fact, a Reflex returns the label as an API response in under 90 milliseconds, on a specialized backbone that keeps it accurate and cheap enough to score every turn rather than a sample. It has no model to train and no GPU to host, and the classification rides your existing trace export, so it adds nothing to your agent's latency.

For the agents all three target, two built-ins matter most: stuck-in-a-loop, which fires the instant an agent starts repeating itself, and user-frustrated, which catches the turn where the person on the other end gives up. The built-in signals cover jailbreak, guardrail, leaked-thinking, stuck-in-a-loop, incomplete-thought, user-frustrated, ambiguity, difficulty, and domain, and you can train a custom one in a single API call. Realtime scoring is priced at $0.001 per event, where one event is 2,048 tokens, which is what makes scoring every turn affordable instead of sampling.
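The per-event price translates to a monthly figure once you know your turn sizes. The sketch below uses the quoted $0.001 per event and 2,048 tokens per event; it assumes a turn longer than 2,048 tokens bills as multiple events rounded up, which the pricing statement above does not spell out, and the function names are hypothetical.

```python
import math

USD_PER_EVENT = 0.001
TOKENS_PER_EVENT = 2_048


def events_for_turn(tokens: int) -> int:
    """Events billed for one turn. ASSUMPTION: partial events round up."""
    return max(1, math.ceil(tokens / TOKENS_PER_EVENT))


def monthly_usd(turn_token_counts: list[int]) -> float:
    """Cost of scoring every turn in the list once."""
    return sum(events_for_turn(t) for t in turn_token_counts) * USD_PER_EVENT


print(events_for_turn(800))    # 1
print(events_for_turn(3_000))  # 2
print(f"{monthly_usd([800] * 1_000_000):,.2f}")  # 1,000.00
```

Under that rounding assumption, a million short turns a month costs about $1,000 to score in full.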

Score a turn inline, then attach it to your trace

curl -X POST "https://api.morphllm.com/v1/reflex/predict" \
  -H "Authorization: Bearer $MORPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "stuck-in-a-loop", "text": "<the agent turn>"}'

# {
#   "model": "stuck-in-a-loop",
#   "classes": [
#     { "label": "progressing", "score": 0.04, "selected": false },
#     { "label": "looping",     "score": 0.96, "selected": true }
#   ],
#   "inference_time_ms": 88
# }

Because the label comes back as an API response and not a dashboard panel, it composes with whichever tool you picked: write it onto a Braintrust span as a score, attach it to a Langfuse trace, attach it to a Raindrop incident, alert on it in Slack, or block on it inline. It complements an eval, a tracer, or an agent platform; it does not replace one. Full docs are at docs.morphllm.com, and the full field is mapped in the best AI agent monitoring tools.
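For teams calling from application code rather than a shell, the same request can be sketched in Python with only the standard library. The endpoint, request body, and response shape are taken from the curl example above; the helper names, the timeout, and the 0.9 blocking threshold are hypothetical, and the response is assumed to always carry a `classes` list with one entry marked `selected`.

```python
import json
import os
import urllib.request

REFLEX_URL = "https://api.morphllm.com/v1/reflex/predict"  # from the curl example


def predict(model: str, text: str, timeout_s: float = 2.0) -> dict:
    """POST one turn to the classifier and return the parsed JSON response."""
    body = json.dumps({"model": model, "text": text}).encode("utf-8")
    request = urllib.request.Request(
        REFLEX_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ['MORPH_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.load(resp)


def selected_label(response: dict) -> tuple[str, float]:
    """Return (label, score) for the class the response marks as selected."""
    for cls in response["classes"]:
        if cls["selected"]:
            return cls["label"], cls["score"]
    raise ValueError("response has no selected class")


def should_block(response: dict, threshold: float = 0.9) -> bool:
    """Hypothetical policy: block only a confident 'looping' verdict."""
    label, score = selected_label(response)
    return label == "looping" and score >= threshold


# The sample response shown above, used here so the sketch runs offline.
SAMPLE = {
    "model": "stuck-in-a-loop",
    "classes": [
        {"label": "progressing", "score": 0.04, "selected": False},
        {"label": "looping", "score": 0.96, "selected": True},
    ],
    "inference_time_ms": 88,
}

print(selected_label(SAMPLE))  # ('looping', 0.96)
print(should_block(SAMPLE))    # True
```

The label and score returned by `selected_label` are what you would write onto a span or trace as a score in whichever tool you picked; `should_block` is the inline variant.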

Frequently Asked Questions

Braintrust vs Langfuse vs Raindrop: what is the core difference?

Three shapes. Braintrust is the closed eval-first loop metered by scores. Langfuse is the open-source, OpenTelemetry-native tracer you self-host free or run usage-priced. Raindrop is the closed, hosted agent-native startup that gates Custom Signals and Triage to its $399/mo Pro tier. See what each is built for to pick.

How much do they cost?

Braintrust Starter is free (10k scores); Pro is $249/mo for 50k scores, no tier in between. Langfuse is free to self-host (MIT core) and usage-priced in the cloud (~$101/mo at 1M events). Raindrop has no free tier: Startup $59/mo + $0.001/event, Pro $399/mo + $0.0007/event, with the real features on Pro. The pricing section has the full math.

Which one is open source?

Only Langfuse, under an MIT core, so you can self-host it free and own your data. Braintrust self-hosting is Enterprise-only; Raindrop is hosted-only with no self-host. See open source vs closed, or the Braintrust vs Langfuse two-way breakdown.

Is Raindrop's Signals feature accurate enough for custom failures?

Signals are custom classifiers Raindrop runs over millions of events, the closest analog in this trio to a per-turn classifier, but they live on the $399/mo Pro tier and run on a general-purpose model, so accuracy on niche failures is limited and you tune out false positives by hand. A purpose-built classifier on a specialized backbone closes that gap. See semantic signals in production.

How do they compare to the rest of the field?

For the wider map, the best AI agent monitoring tools ranks the full set with pros and cons, Braintrust vs Langfuse goes two-way, and Langfuse alternatives covers the open-source neighbors.

Score the turn inline, self-serve, on every request

Braintrust scores offline, Langfuse traces post-hoc, and Raindrop alerts after the fact; a Reflex returns a semantic label over an API in under 90 milliseconds, self-serve, composing with all three.