Most teams hit this comparison the hard way: a Braintrust bill climbs faster than the data they retain would justify, or a Langfuse free tier runs out and the unit meter starts counting every trace, observation, and score. Braintrust is a closed eval platform that bills by score; Langfuse is an open trace platform you can self-host for free. The real question is where the eval and trace data lives and who holds the meter. Pricing verified against each vendor's published page as of June 2026.
TL;DR
The search usually starts with a bill or a free tier running out, and the answer splits cleanly. Pick Braintrust when your bottleneck is "is this prompt change better," you want a managed eval product, and running on someone else's cloud is fine. Scoring is its product, so it meters scores (10k free, 50k on the $249/mo Pro plan) plus processed data and tokens, and the source is closed. Pick Langfuse when you want the trace data in your own infrastructure with no vendor controlling the meter: the core is MIT, self-hosting is free, and Langfuse Cloud meters ingested units (50k/mo free, 100k on the $29/mo Core plan). Braintrust if evaluation is the job and the cloud is acceptable; Langfuse if open source or owning your data is non-negotiable.
What teams actually run into
The two products fail you in different places. On Braintrust the jump is steep: one pricing teardown notes "there's no mid-tier: teams that outgrow Starter go straight to $249/month with nothing in between." On Langfuse the cost moves to ops: a maintainer-tracked self-hosting thread has users saying it "feels weird to deploy using docker-compose ... not really for production."
Quick Comparison
Braintrust and Langfuse split on two axes: license and metered unit. Braintrust is closed source and bills scores plus processed data and tokens; Langfuse is MIT-core, self-hosts for free, and bills ingested units (any trace, observation, or score). The table below lays out free tiers, first paid tiers, overage, and ownership side by side as of June 2026.
| Dimension | Braintrust | Langfuse |
|---|---|---|
| Built for | Eval-first: scoring is the product | Observability-first: tracing is the product |
| License | Closed source | MIT core (~28.8k stars) |
| Metered unit | Scores (+ processed data, tokens) | Ingested units (trace, observation, or score) |
| Free tier | Starter: $10 credits, 1 GB, 10k scores, 14-day retention | Cloud: 50k units/mo, 2 users, 30-day access |
| First paid tier | Pro $249/mo: 5 GB, 50k scores, 30-day retention, RBAC | Core $29/mo: 100k units, unlimited users |
| Overage | Starter $4/GB + $2.50/1k scores; Pro $3/GB + $1.50/1k scores | $8/100k units (drops to $6 at 50M+) |
| Token cost | $0.06/M input, $0.40/M output | Included in unit meter |
| Self-hosting | Hybrid / on-prem on Enterprise | Free: Postgres + ClickHouse + Redis/Valkey + S3 |
| Ownership | Braintrust (closed) | Acquired by ClickHouse (Jan 2026) |
Pricing: Scores vs Ingested Units
Braintrust and Langfuse price on opposite axes. Braintrust meters scores, the output of evaluating, plus processed data and tokens; Langfuse Cloud meters ingested units, meaning any trace, observation, or score sent to the platform. So a Braintrust bill grows with how much you evaluate, and a Langfuse bill grows with how much you send. Compare them against your own traffic, because the meters charge for different things.
The full grid is on Braintrust's pricing page. The free Starter plan includes $10 of credits, 1 GB of processed data, 10k scores, 14-day retention, and unlimited users. Pro is $249/mo for 5 GB, 50k scores, 30-day retention, and RBAC. Overage on Starter is $4/GB plus $2.50 per 1k scores; on Pro it drops to $3/GB plus $1.50 per 1k scores. Tokens are billed separately at $0.06/M input and $0.40/M output. One pricing teardown flags the missing ceiling: "no hard spending cap means you'll exceed that amount without realizing it." Keeping Braintrust in your own environment means an Enterprise hybrid or on-prem contract at custom pricing.
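To make the Braintrust grid concrete, here is a minimal sketch of a monthly estimator built only from the figures quoted above. It rests on several assumptions: overage is billed pro rata rather than in whole blocks, every token is billed at the listed rates, and the Starter plan's $10 of credits are not modeled because how they apply is not specified here. The names `BraintrustPlan` and `braintrust_monthly_usd` are hypothetical, not part of any Braintrust SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BraintrustPlan:
    """Plan figures as quoted in this article (June 2026)."""
    base_usd: float
    included_gb: float
    included_scores: int
    overage_usd_per_gb: float
    overage_usd_per_1k_scores: float


STARTER = BraintrustPlan(0.0, 1.0, 10_000, 4.00, 2.50)
PRO = BraintrustPlan(249.0, 5.0, 50_000, 3.00, 1.50)

# Token rates quoted in the article, in USD per million tokens.
INPUT_USD_PER_M_TOKENS = 0.06
OUTPUT_USD_PER_M_TOKENS = 0.40


def braintrust_monthly_usd(plan, gb, scores, input_tokens=0, output_tokens=0):
    """Rough monthly estimate: base + data overage + score overage + tokens.

    Assumptions (not confirmed by the vendor): pro-rata overage,
    all tokens billed, Starter credits ignored.
    """
    extra_gb = max(0.0, gb - plan.included_gb)
    extra_scores = max(0, scores - plan.included_scores)
    data_cost = extra_gb * plan.overage_usd_per_gb
    score_cost = extra_scores / 1_000 * plan.overage_usd_per_1k_scores
    token_cost = (
        input_tokens / 1_000_000 * INPUT_USD_PER_M_TOKENS
        + output_tokens / 1_000_000 * OUTPUT_USD_PER_M_TOKENS
    )
    return round(plan.base_usd + data_cost + score_cost + token_cost, 2)


if __name__ == "__main__":
    # Pro at 8 GB and 120k scores: 249 + 3 GB * $3 + 70 * $1.50
    print(braintrust_monthly_usd(PRO, gb=8, scores=120_000))
```

Under those assumptions, a Pro workspace at 8 GB and 120k scores comes to $363 before tokens: the $249 base, $9 of data overage, and $105 of score overage. Treat it as a planning number, not an invoice.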
Langfuse Cloud meters ingested units, where a unit is any event sent to the platform: a trace, an observation, or a score, as its pricing page spells out. The free tier covers 50k units a month, 2 users, and 30-day access. Core is $29/mo for 100k units with unlimited users; overage is $8 per 100k units, dropping to $6 at 50M+. Pro is $199/mo. The catch is that one multi-step agent request can ingest dozens of units before any eval runs, so the meter tracks activity, not evaluations. Self-hosting Langfuse costs no license fee at all; you run Postgres, ClickHouse, Redis or Valkey, and S3-compatible storage and pay only for the infrastructure.
The denominators are not the same, so do not compare a score to a unit one-for-one. A Braintrust score is the output of evaluating: a team that scores ten variants against a 500-row dataset generates 5,000 scores from a handful of traces. A Langfuse unit is the act of ingesting: a single agent trace with several observations burns several units before anyone runs an eval. To put a real number on the Langfuse side, 1M units a month on Core lands around $101/mo: $29 base plus nine $8 increments over the included 100k. Run the same math on your own traffic shape, because the two meters charge for opposite axes.
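The two denominators can be sketched side by side. The snippet below is an illustration under stated assumptions, not Langfuse's billing logic: it assumes overage is charged in whole 100k-unit increments (which is how the $101 figure above is built), it only covers volumes below the 50M threshold where the rate changes, and the `scorers` parameter is a hypothetical extension for datasets scored by more than one function. All function names are made up for this example.

```python
CORE_BASE_USD = 29
CORE_INCLUDED_UNITS = 100_000
OVERAGE_USD_PER_100K = 8
CHEAPER_RATE_THRESHOLD = 50_000_000  # rate drops to $6 here; not modeled


def eval_scores(variants, dataset_rows, scorers=1):
    """Braintrust-side count: one score per variant, per row, per scorer."""
    return variants * dataset_rows * scorers


def units_per_month(requests, observations_per_trace, scores_per_trace=0):
    """Langfuse-side count: the trace itself, each observation, and each
    score are all ingested units."""
    return requests * (1 + observations_per_trace + scores_per_trace)


def langfuse_core_monthly_usd(units):
    """Core plan estimate, assuming overage in whole 100k-unit increments."""
    if units >= CHEAPER_RATE_THRESHOLD:
        raise ValueError("volume tier above 50M units is not modeled here")
    extra = max(0, units - CORE_INCLUDED_UNITS)
    increments = -(-extra // 100_000)  # integer ceiling division
    return CORE_BASE_USD + increments * OVERAGE_USD_PER_100K


if __name__ == "__main__":
    print(eval_scores(variants=10, dataset_rows=500))
    units = units_per_month(requests=100_000, observations_per_trace=8,
                            scores_per_trace=1)
    print(units, langfuse_core_monthly_usd(units))
```

With 100k requests a month, each producing one trace, eight observations, and one score, the workload ingests 1M units and lands on the $101 figure; the eval run of ten variants over 500 rows yields 5,000 scores regardless of how many traces fed it.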
Before you commit a pipeline to one meter
The recurring r/LangChain thread on the best open-source observability tools is worth reading before you route your whole pipeline through one vendor's meter, especially when owning the data is part of the decision. It is where teams compare what they actually pay once volume scales.
Closed Eval Platform vs Open Trace Platform
Braintrust is the closed, polished eval platform; Langfuse is the open trace platform you can own. If your question is "is this prompt change better," you live in the offline-eval loop of fixed datasets, scoring functions, and CI regression gates, and that is Braintrust's center of gravity. If your question is "where does this data live and who can change the price," that is Langfuse's, because the MIT core lets you self-host the whole platform for free.
The cost of that polish is that Braintrust is closed. Langfuse's own teardown of Braintrust alternatives describes a proprietary closed-source core with a proprietary Brainstore engine and a BTQL query language in place of standard REST or OpenTelemetry, and self-hosting only on Enterprise. You run on Braintrust Cloud, and keeping the data in-house is an Enterprise line item. That is a competitor's framing, but the closed-source and Enterprise-only self-host facts hold up against Braintrust's own pages.
Choosing Langfuse is choosing for control. The MIT core means you can read it, fork it, and self-host the whole platform for free, so the trace data sits in your own Postgres, ClickHouse, and object storage rather than a vendor's account. Langfuse still ships a managed cloud if you want it, but the open core is the reason a team picks it: no single vendor holds the meter or the data hostage.
This is also why the comparison is not really Braintrust-versus-Langfuse on evals alone. If you want the LangChain-native eval-and-trace product instead, that is the Braintrust vs LangSmith decision, and if you are weighing the two open-leaning trace tools, that is Langfuse vs LangSmith. Here the axis is closed eval platform versus open trace platform, and it turns on data ownership more than on dashboards.
Self-Hosting and Owning Your Data
Langfuse self-hosts for free; Braintrust does not. Langfuse v3 is a multi-service platform (web and worker containers, Postgres for transactional data, ClickHouse for analytics, Redis or Valkey for queues and cache, and S3-compatible storage for event payloads), which is why it scales to high ingest volume and also why it takes more than a single binary to stand up. ClickHouse acquired Langfuse in January 2026, so the OLAP engine under the product and the company shipping it are now the same.
That operational weight is the recurring complaint. In Langfuse's own self-hosting discussion, users say things like "I really do not want to move off serverless infra to a dedicated VM." If you cannot run ClickHouse, the self-host path is harder, though Langfuse Cloud is still there as a fallback for teams that want the open data model without the ops.
Braintrust has no free self-host path. The product is closed, you run on Braintrust Cloud, and the only way to keep it in your own VPC or data center is a hybrid or on-prem Enterprise contract at custom pricing. For a team whose procurement draws a hard line at data leaving its boundary, that single fact decides it: Langfuse self-hosts for free, Braintrust does not. If Braintrust's meter or its closed posture is the blocker, the broader set of options is in Braintrust alternatives.
When Braintrust Wins
Braintrust wins when evaluation is the actual job and running on its cloud is acceptable. Its scoring tooling and experiment UI are more polished than Langfuse's, and run-to-run diffs and CI regression gates are first-class objects rather than things you assemble. One engineer who evaluated Langfuse wrote on Hacker News that they "found the data model and UI to be both cumbersome and unintuitive" and picked a competitor; if the UX of the eval loop matters, that gap favors Braintrust.
- Your bottleneck is offline evaluation: "is this prompt or model change better" on fixed datasets.
- You want a polished managed eval product with run-to-run diffs and CI regression gates as first-class objects.
- Running on the vendor's cloud is acceptable, and per-score billing matches the work you actually generate.
- You are framework-agnostic and do not want a tool that assumes one library.
When Langfuse Wins
Langfuse wins when ownership and open source are the requirement. The MIT core means you can read it, fork it, and self-host the full platform for free, so the trace data sits in your own Postgres, ClickHouse, and object storage instead of a vendor's account. If your bottleneck is production observability (span trees, latency, token counts, replaying a bad session), tracing is what Langfuse is built around.
- Open source is a requirement: MIT core, source you can read, fork, and self-host for free.
- You want the trace data in your own infrastructure, with no vendor controlling the meter.
- Your bottleneck is production observability: span trees, latency, token counts, and replaying a bad session.
- You can run ClickHouse, or you are fine starting on Langfuse Cloud's free 50k-unit tier.
The Third Option: Own the Stack
A third path skips both the closed eval meter and the managed trace platform: instrument yourself and own the whole stack. You give up the polished experiment UI and managed retention, and you take on the operational weight, in return for no per-unit meter and full control of the data. It fits teams that already run their own observability and want the eval and trace data to never leave their boundary.
The build is concrete: instrument with OpenLLMetry (Apache-2.0, ~7.2k stars, free), store the spans in your own ClickHouse (Cloud Basic around $66/mo, or self-host the Apache-2.0 build for free), and dashboard it with Grafana OSS. We walk through the whole build in build your own LLM observability.
What Both Miss: Semantic Signals in Production
Neither Braintrust nor Langfuse catches the meaning of a turn while it is happening. A response that quotes the wrong refund policy returns a 200 with normal latency and a normal token count, a quietly angry user produces the same span as a delighted one, and an agent stuck in a three-step loop looks like one doing work. The eval that would have flagged it runs tonight, on a sample, after the user already left.
That timing gap is structural. Both Braintrust and Langfuse approximate semantic checks with LLM-as-judge evals, and those run offline on samples: a fraction of traffic, scored after the fact. A Reflex runs inline on every turn instead. It is a classifier that returns the label as an API response, in under 90 milliseconds, cheap enough to score every turn rather than a sample.
The labels are the ones an eval would otherwise check tonight: is_user_frustrated, stuck-in-a-loop, leaked-thinking, jailbreak, or a signal specific to your product. The built-in signals cover jailbreak, guardrail, leaked-thinking, stuck-in-a-loop, incomplete-thought, ambiguity, difficulty, and domain, and you can train a custom signal in under an hour in the Reflex dashboard. Pricing is realtime at $0.001 per event, where one event is 2,048 tokens, which is what makes scoring every turn affordable instead of sampling. It is live and self-serve today.
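The per-event price can be turned into a monthly figure with a few lines. This is a rough sketch, and it leans on an assumption the pricing line above does not spell out: that a turn longer than 2,048 tokens is counted as multiple events, rounded up, and that any shorter turn counts as one. It also uses an average turn length, which understates cost if a few turns are much longer than the rest. The function names are hypothetical.

```python
TOKENS_PER_EVENT = 2_048
EVENTS_PER_USD = 1_000  # $0.001 per event is $1 per 1,000 events


def reflex_events(tokens_in_turn):
    """Events billed for one turn, assuming round-up per 2,048 tokens
    and a minimum of one event (an assumption, not a documented rule)."""
    return max(1, -(-tokens_in_turn // TOKENS_PER_EVENT))


def reflex_monthly_usd(turns_per_month, avg_tokens_per_turn):
    """Approximate monthly cost of scoring every turn with one signal."""
    events = turns_per_month * reflex_events(avg_tokens_per_turn)
    return events / EVENTS_PER_USD


if __name__ == "__main__":
    # One signal on 1M turns a month, each comfortably under 2,048 tokens.
    print(reflex_monthly_usd(1_000_000, 500))
```

On that model, scoring one signal across a million short turns a month is about $1,000; running several signals per turn multiplies the event count accordingly.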
Score a turn inline, then attach it to your trace
curl -X POST "https://api.morphllm.com/v1/reflex/predict" \
  -H "Authorization: Bearer $MORPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "stuck-in-a-loop", "text": "<the agent turn>"}'
# {
#   "model": "stuck-in-a-loop",
#   "mode": "single_label",
#   "classes": [
#     { "class_id": 0, "label": "progressing", "score": 0.04, "selected": false },
#     { "class_id": 1, "label": "looping", "score": 0.96, "selected": true }
#   ],
#   "inference_time_ms": 88
# }
The label with "selected": true is the answer; there is no separate top-level field to read. Because it comes back as an API response and not a dashboard panel, it composes with whichever tool you picked: write it onto the Braintrust span as a score, attach it to a Langfuse trace as an observation, alert on it in Slack, or route on it inline. It complements an eval or tracing platform; it does not replace one. Full docs are at docs.morphllm.com.
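For teams calling the endpoint from application code rather than curl, the same request and the "selected": true rule might look like the sketch below. It uses only the Python standard library and mirrors the URL, headers, payload, and response shape shown in the curl example; the helper names `build_predict_request` and `selected_label` are hypothetical, and the request is built but not sent, so nothing here depends on a live key.

```python
import json
import urllib.request

REFLEX_URL = "https://api.morphllm.com/v1/reflex/predict"

# Response body copied from the curl example above.
SAMPLE_RESPONSE = """
{
  "model": "stuck-in-a-loop",
  "mode": "single_label",
  "classes": [
    {"class_id": 0, "label": "progressing", "score": 0.04, "selected": false},
    {"class_id": 1, "label": "looping", "score": 0.96, "selected": true}
  ],
  "inference_time_ms": 88
}
"""


def build_predict_request(api_key, model, text):
    """Build (but do not send) the POST request from the curl example."""
    body = json.dumps({"model": model, "text": text}).encode("utf-8")
    return urllib.request.Request(
        REFLEX_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def selected_label(response):
    """Return (label, score) for the class marked selected.

    There is no top-level answer field, so the classes list is scanned.
    """
    for cls in response["classes"]:
        if cls.get("selected"):
            return cls["label"], cls["score"]
    raise ValueError("no class was marked selected")


if __name__ == "__main__":
    label, score = selected_label(json.loads(SAMPLE_RESPONSE))
    print(label, score)
```

To send the request, pass the result of `build_predict_request` to `urllib.request.urlopen` and parse the body with `json.load`; the returned label and score are what you would then write onto a span or trace with the SDK of whichever platform you chose.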
Frequently Asked Questions
Braintrust vs Langfuse: what is the core difference?
Braintrust is eval-first and closed source, metering scores; Langfuse is observability-first and MIT core, free to self-host, metering ingested units. See closed eval platform vs open trace platform for how to choose on data ownership.
How much do they cost?
Braintrust Starter is free ($10 credits, 10k scores); Pro is $249/mo for 50k scores. Langfuse Cloud is free for 50k units/mo; Core is $29/mo for 100k units, then $8 per 100k. Self-hosting Langfuse is free. The pricing section has the full overage math.
Is either one open source?
Langfuse is: its core repo is MIT and self-hosts for free. Braintrust is closed source, with self-hosting only on an Enterprise hybrid or on-prem contract. If open source is required, look at self-hosting or the DIY stack.
Is a Braintrust score the same as a Langfuse unit?
No. A score is the output of one evaluation; a unit is any ingested event (a trace, an observation, or a score). The quantities are not 1:1, so compare each meter against your own workload, covered in pricing.
Do they catch wrong answers and frustrated users in production?
Not in real time. Both rely on offline, sampled LLM-as-judge evals, so a regression shows up in the next batch, not on the turn it happened. Catching it inline needs a per-turn classifier, covered in semantic signals.
Add the signal the eval batch catches tonight, on every turn now
Braintrust scores offline and Langfuse traces requests; a Reflex returns a semantic label inline in under 90 milliseconds, over an API that composes with both.
