Braintrust vs Arize Phoenix vs Opik

You want Braintrust's eval loop, but open source and free to self-host. Arize Phoenix and Comet Opik are the two real self-host options. Braintrust is closed with a $249/mo Pro tier; Phoenix is source-available under the Elastic License 2.0 and OpenTelemetry-native; Opik is Apache-2.0 and runs on ClickHouse. This comparison covers full pricing, the license split, the operational rough edges of running Phoenix or Opik yourself, and the DIY option.

June 25, 2026 · 1 min read
You want Braintrust's eval loop, but open source and free to self-host. The two real self-host options are Arize Phoenix and Comet Opik. Braintrust is closed, with self-hosting locked to Enterprise and a $249/mo first paid tier. Phoenix runs as one OpenTelemetry-native process under the Elastic License 2.0; Opik is Apache-2.0 and runs on ClickHouse. Both self-host free, and both have operational rough edges at volume that you own. Pricing verified against each vendor's published page as of June 2026.

Braintrust: closed source, no free self-host
Arize Phoenix: ELv2, source-available, free self-host
Comet Opik: Apache-2.0, OSI open source, free self-host

TL;DR

Pick Braintrust for a managed eval loop when sending data to a vendor cloud is fine and the $249/mo Pro tier fits your score volume. Pick Arize Phoenix or Comet Opik to self-host the same eval-plus-tracing surface for free. Phoenix is source-available under the Elastic License 2.0, OpenTelemetry-native, and runs as one process. Opik is OSI-approved Apache-2.0 and runs on ClickHouse. Both carry operational weight you own.

The eval engine is not the tiebreaker, because all three run datasets, scorers, and LLM-as-judge evals. The decision is which trade you are buying. Braintrust takes the operations off your plate and locks self-hosting to Enterprise, with a price cliff at $249/mo. Phoenix is the lightest self-host, one OTel-native process, but a team on Arize's own forum hit a wall near 200M spans. Opik gives you a true Apache-2.0 license on your own ClickHouse, with default docker-compose bugs that are real and documented. If OSI open source is a hard requirement, only Opik qualifies; if simple footprint and span portability matter more, Phoenix wins.

What self-host teams actually hit

Both self-host routes have a documented ceiling. On Phoenix, a user running at volume wrote on Arize's own community forum that after peaking at about 200M spans their deployment was "more or less non functional", with timeouts in the UI and API. On Opik, the default docker-compose setup produced a ClickHouse disk-bloat outage where "system.trace_log alone grew to 66 GiB (3 billion rows)" and span ingestion returned HTTP 500 (Opik issue #6224), plus a separate bug where the container becomes unhealthy and fails to start. Braintrust takes the operations off your plate but has a price cliff: teams that outgrow Starter "go straight to $249/month with nothing in between", with no hard spending cap.

Quick Comparison

Braintrust is closed and managed; Phoenix and Opik both self-host for free. Phoenix is source-available under the Elastic License 2.0 and runs as one OpenTelemetry-native process. Opik is OSI Apache-2.0 and runs on ClickHouse via Docker Compose or Helm. Braintrust's first paid tier is $249/mo; the two open-source tools have separate managed clouds at $50/mo (Arize AX) and $19/mo (Opik Cloud).

| Dimension | Braintrust | Arize Phoenix | Opik |
|---|---|---|---|
| Built for | Eval-first: managed experiment loop | OTel-native tracing + evals, self-host | Eval + observability, open source |
| License | Closed source | Elastic License 2.0 (source-available) | Apache-2.0 (OSI open source) |
| Self-host footprint | Enterprise only (hybrid / on-prem) | One process: pip / Docker / Helm, no caps | Docker Compose or K8s/Helm, on ClickHouse |
| GitHub stars | Closed source | ~10.1k | ~19.8k |
| Free tier | Starter: $10 credits, 1 GB, 10k scores, 14-day retention | Self-host free, no caps; AX Free 25k spans | Self-host free; Cloud 25k spans/mo, 60-day retention |
| First paid tier | Pro $249/mo: 50k scores, 30-day retention, RBAC | Free self-host; AX Pro $50/mo (separate cloud) | Free self-host; Opik Cloud $19/mo: 100k spans |
| Metered unit | Scores (+ processed data, tokens) | Spans (on Arize AX cloud) | Spans (on Opik Cloud) |
| OpenTelemetry | Framework-agnostic SDK | OTel-native via OpenInference | Opik SDK (ClickHouse-backed) |
| Eval depth | Datasets, scorers, run-to-run diffs, CI gates | LLM-as-judge evals, prompt playground, datasets | LLM-as-judge + 30+ heuristic metrics, prompt playground, experiments |

Pricing: Scores vs Free Self-Host

Braintrust meters scores: 10k free on Starter, 50k on the $249/mo Pro. Self-hosting Phoenix or Opik costs nothing per event, so you pay only for the box. Both open-source tools have separate managed clouds metered in spans: Arize AX Free covers 25k spans and AX Pro is $50/mo, while Opik Cloud Free covers 25k spans and Pro is $19/mo for 100k. Scores and spans count different work.

Braintrust sizes its plans in scores, plus processed data and tokens. The free Starter plan ships $10 of credits, 1 GB of processed data, 10k scores, 14-day retention, and unlimited users. Pro is $249/mo for 5 GB, 50k scores, 30-day retention, and RBAC. Overage on Starter is $4/GB plus $2.50 per 1k scores; on Pro it drops to $3/GB plus $1.50 per 1k scores. Tokens are billed separately at $0.06/M input and $0.40/M output. Self-hosting is Enterprise: there is no path to run Braintrust in your own infrastructure on the self-serve plans.
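To make the overage figures concrete, here is a rough sketch of the monthly bill on each plan. It uses only the numbers quoted above and rests on assumptions that are not confirmed by Braintrust: overage is treated as linear and pro rata, token charges are left out, and the $10 Starter credit is ignored. The `Plan` class and `monthly_cost` function are illustrative names, not part of any Braintrust SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    base_usd: float                 # flat monthly fee
    included_gb: float              # processed data included in the plan
    included_scores: int            # scores included in the plan
    usd_per_extra_gb: float         # overage rate for processed data
    usd_per_extra_1k_scores: float  # overage rate per 1,000 scores


# Figures as quoted in the pricing paragraph above.
STARTER = Plan(0.0, 1, 10_000, 4.00, 2.50)
PRO = Plan(249.0, 5, 50_000, 3.00, 1.50)


def monthly_cost(plan: Plan, gb: float, scores: int) -> float:
    """Flat fee plus linear overage.

    Assumptions (hypothetical): pro-rata overage billing, no token
    charges, and no credit applied.
    """
    extra_gb = max(0.0, gb - plan.included_gb)
    extra_scores = max(0, scores - plan.included_scores)
    return (
        plan.base_usd
        + extra_gb * plan.usd_per_extra_gb
        + extra_scores / 1000 * plan.usd_per_extra_1k_scores
    )


if __name__ == "__main__":
    # Compare the two plans at a few score volumes with 1 GB of data.
    for scores in (10_000, 50_000, 199_000):
        print(scores, monthly_cost(STARTER, 1, scores), monthly_cost(PRO, 1, scores))
```

Under these assumptions, and with data kept inside the included allowance, the two plans cost the same at roughly 199k scores a month; below that, the Pro flat fee dominates the bill.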

Arize Phoenix self-hosts with no price at all: pip install arize-phoenix, one Docker container, or a Helm chart, no event caps, and you pay only for the infrastructure underneath it. The metered numbers live in the separate Arize AX cloud: a free dev tier (25k spans, 1 GB, 15-day retention), AX Pro at $50/mo (50k spans, 10 GB, 30-day retention), and quote-based AX Enterprise. Moving from Phoenix to AX is a new commercial contract, not a tier upgrade, and AX meters spans, which adds up fast on agent workloads that emit many spans per turn.

Comet Opik has two front doors. Self-hosted is the Apache-2.0 build: free, unlimited spans, unlimited retention, unlimited members, run on your own Docker Compose or a Kubernetes Helm chart, no per-unit meter. Opik Cloud is the hosted option, sized in spans: the free plan covers 25k spans/mo, 60-day retention, and up to 10 members; Pro is $19/mo for 100k spans, 60-day retention, and up to 50 members, with overage at $5 per 100k spans. The current Cloud numbers are on Comet's pricing page.

Do not line up Braintrust's $249 against AX's $50 or Opik Cloud's $19 and conclude one is five or thirteen times cheaper. They count different denominators. A score is the output of one evaluation; a span is one step inside a trace, and a single agent request emits many spans before anything is evaluated. The clean comparison is Braintrust's per-score Pro plan against free self-hosted Phoenix or Opik, where no per-unit meter exists, weighing a vendor cloud against the box you run. For the eval-versus-trace metering framing in more depth, see Braintrust vs LangSmith.
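The denominator point is easier to see with numbers. The sketch below counts spans and scores for the same traffic, then prices the spans on Opik Cloud Pro using the figures quoted earlier. The workload (200k requests, 12 spans per request, a 5% eval sample, 2 scorers) is invented for illustration, and billing overage in whole 100k-span blocks is an assumption rather than a documented rule.

```python
import math


def monthly_spans(requests: int, spans_per_request: int) -> int:
    """Every step of every request is a span, so volume multiplies."""
    return requests * spans_per_request


def monthly_scores(requests: int, sample_rate: float, scorers: int) -> int:
    """Only sampled requests are evaluated; each scorer yields one score."""
    return round(requests * sample_rate) * scorers


def opik_cloud_pro_usd(spans: int) -> int:
    """$19 base with 100k spans included, then $5 per extra 100k spans.

    Assumption (hypothetical): overage is billed in whole 100k blocks,
    rounded up.
    """
    extra_blocks = math.ceil(max(0, spans - 100_000) / 100_000)
    return 19 + 5 * extra_blocks


if __name__ == "__main__":
    requests = 200_000  # hypothetical monthly agent requests
    spans = monthly_spans(requests, spans_per_request=12)
    scores = monthly_scores(requests, sample_rate=0.05, scorers=2)
    print(spans, scores, opik_cloud_pro_usd(spans))
```

In this made-up workload the same traffic produces 2.4M spans but only 20k scores, which is why a per-span price and a per-score price cannot be compared by sticker alone.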

Closed vs Two Open-Source Self-Host Routes

The split is license and where data lands. Braintrust is closed source, with self-hosting only on Enterprise. Phoenix is source-available under the Elastic License 2.0: you can read, fork, and self-host it free, but ELv2 forbids offering it as a managed service to third parties, so it is not OSI open source. Opik is OSI-approved Apache-2.0, with no such restriction. If OSI open source is a hard requirement, only Opik qualifies.

Braintrust's closed posture goes deep. Langfuse's teardown flags its proprietary Brainstore engine and BTQL query language, with self-hosting reserved for Enterprise, and contrasts them with the standard REST and OpenTelemetry interfaces used elsewhere. In return you carry zero operational weight, which is the right trade when the experiment loop is your bottleneck and procurement is fine with a vendor cloud. The cost is portability: the platform is the destination, not a backend you swap.

Phoenix and Opik take the opposite posture, and they differ from each other on license precision. Phoenix instruments through OpenInference conventions, so the spans you emit are standard OpenTelemetry data that stays in your infrastructure and points wherever you want. Its Elastic License 2.0 is source-available, not OSI-approved: free to self-host, restricted from being resold as a managed service. Opik ships the same eval-plus-observability surface as Apache-2.0 source with no resale restriction at all. Call Phoenix source-available and Opik open source to stay precise.

Phoenix vs Opik: The Operational Reality

Both self-host free, and both strain in ways you own. Phoenix runs as a single process, the simplest to stand up, but a team on Arize's own forum hit a ceiling near 200M spans. Opik runs on ClickHouse with more moving parts, and the default docker-compose has shipped disk-bloat and container-health bugs that teams hit in production. Phoenix is one OTel-native process; Opik is a ClickHouse-backed stack you operate.

Phoenix's strength is its footprint. One process, pip install arize-phoenix or a single container, OpenTelemetry-native via OpenInference, and no event caps because it is an app you run rather than a metered SaaS. The weakness shows at high span volume: the same forum user reports that after peaking near 200M spans their deployment became "more or less non functional" with timeouts in the UI and API. That ceiling is often what pushes a team toward Arize AX, which is a new contract, not a tier upgrade.

Opik runs on ClickHouse, which gives it the headroom Phoenix lacks and the operational surface Phoenix avoids. The data never leaves your network, there is no span meter, and there is no seat cap. The honest caveat is that the self-host is younger, and the default docker-compose has open bugs: one outage where ClickHouse's system.trace_log grew to 66 GiB (3 billion rows) before span ingestion started returning HTTP 500, and a separate report where the container becomes unhealthy and fails to start. Running Opik means owning ClickHouse and the Opik backend; running Phoenix means owning one process that can stall at volume. Neither is free of operations, and that is the trade against Braintrust's managed cloud.

When Braintrust Wins

Braintrust wins when the offline experiment loop is your bottleneck and you want zero infrastructure to run. The managed platform gives you datasets, run-to-run diffs, and CI gates as first-class objects, unlimited users, and a mature UI, as long as the $249/mo Pro tier fits your score volume and sending data to a vendor cloud is acceptable.

  • Your bottleneck is offline evaluation: "is this prompt or model change better" on fixed datasets, with run-to-run diffs and CI gates as first-class objects.
  • You want a managed platform with zero infrastructure to operate, and sending data to a vendor cloud is fine.
  • You do not need free self-hosting, and the $249/mo Pro plan fits your eval volume without hitting the price cliff.
  • You want unlimited users with no per-seat charge and a mature experiment UI out of the box.

When Arize Phoenix Wins

Phoenix wins when you want the lightest free self-host and span portability over OpenTelemetry. It runs as a single process you stand up in minutes, emits OpenInference-native spans that stay in your infrastructure, and has no event caps. The accepted trade is that it can stall at very high span volume and that graduating to Arize AX is a separate commercial contract.

  • You want the simplest self-host footprint: one process via pip, Docker, or Helm, no event caps.
  • You are committed to OpenTelemetry and want OpenInference-native spans that stay portable across backends.
  • Procurement or data residency rules out sending spans to a closed vendor cloud on a self-serve plan.
  • Your span volume is moderate, or you accept the ~200M-span ceiling and have a path to Arize AX if you outgrow it.

When Opik Wins

Opik wins when you need an OSI-approved license and a unified eval-plus-observability platform on your own boxes. Apache-2.0 lets you read, fork, and even resell it, the ClickHouse backend gives it headroom Phoenix lacks, and self-host has no span meter or seat cap. The caveat is operating ClickHouse and the open docker-compose bugs that come with a younger self-host.

  • OSI-approved open source is a hard requirement: Apache-2.0, no resale restriction, Docker or Kubernetes, no span meter.
  • You want eval plus observability in one tool: tracing, LLM-as-judge and 30+ heuristic metrics, prompt playground, experiments, and datasets.
  • Eval and trace data must stay on your infrastructure, and you can run ClickHouse and absorb the open self-host bugs.
  • You want the option to start on Opik Cloud's free 25k spans and move to self-host later without changing platforms.

The Fourth Option: Own the Stack

A fourth path skips the eval platform entirely: instrument yourself with open OpenTelemetry libraries, store the spans in a database you run, and dashboard them in Grafana, with no per-score or per-span bill. A growing number of teams take this route once eval volume makes a managed meter the most expensive line item. The trade is operational weight in exchange for full ownership and no meter.

Concretely: instrument with OpenLLMetry (Apache-2.0, free), store the spans in your own ClickHouse (Cloud Basic around $66/mo, or self-host the Apache-2.0 build for free), and dashboard with Grafana OSS. Opik already runs on ClickHouse, so this is the same engine without the application layer on top, and because Phoenix is OTel-native the path composes with it: instrument once, point spans at Phoenix today and your own store later. All three tools approximate always-on quality signals with LLM-as-judge evals that run offline on samples; for signals you want computed on every request, this stack carries no per-unit meter. We walk through the whole build in build your own LLM observability.

What They All Miss: Semantic Signals in Production

None of the three catches the meaning of a turn while it is happening. All three approximate semantic quality with LLM-as-judge evals that run offline on a sample, scored after the fact. A response that quotes the wrong refund policy returns a 200 with normal latency and a normal token count. A user who is quietly getting angry produces the same span as a delighted one. An agent stuck in a three-step loop looks like an agent doing work. The eval suite that would have flagged it runs tonight, on a sample, after the user already left.

That timing gap is structural: the LLM-as-judge evals in Braintrust, Phoenix, and Opik score a fraction of traffic, after the fact. Opik can run online evaluation rules against production traces, but those are LLM-as-judge calls applied to logged traces after they land, not an inline classifier that returns before the turn completes. A Reflex runs inline on every turn instead. It is a classifier that returns the label as an API response, in under 90 milliseconds, cheap enough to score every turn rather than a sample.

The labels are the ones an eval would otherwise check tonight: is_user_frustrated, stuck-in-a-loop, leaked-thinking, jailbreak, or a signal specific to your product. The built-in signals cover jailbreak, guardrail, leaked-thinking, stuck-in-a-loop, incomplete-thought, ambiguity, difficulty, and domain, and you can train a custom signal in under an hour in the Reflex dashboard. Pricing is realtime at $0.001 per event, where one event is 2,048 tokens, which is what makes scoring every turn affordable instead of sampling. It is live and self-serve today.
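As a back-of-envelope check on that price, the sketch below estimates a monthly bill for scoring every turn. It assumes, without confirmation from the pricing page, that a turn longer than 2,048 tokens is billed as multiple events rounded up; the function names and the traffic figures are hypothetical.

```python
import math

TOKENS_PER_EVENT = 2_048   # one event, as quoted above
EVENTS_PER_DOLLAR = 1_000  # $0.001 per event


def events_for_turn(tokens: int) -> int:
    """Assumption (hypothetical): long turns bill as several events, rounded up."""
    return max(1, math.ceil(tokens / TOKENS_PER_EVENT))


def monthly_usd(turns: int, tokens_per_turn: int) -> float:
    """Cost of scoring every turn with one signal."""
    return turns * events_for_turn(tokens_per_turn) / EVENTS_PER_DOLLAR


if __name__ == "__main__":
    print(monthly_usd(1_000_000, 800))    # short turns: one event each
    print(monthly_usd(1_000_000, 3_000))  # longer turns: two events each
```

Each additional signal scored per turn would multiply the figure, so the estimate is per signal, not per turn overall.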

Score a turn inline, then attach it to your trace

curl -X POST "https://api.morphllm.com/v1/reflex/predict" \
  -H "Authorization: Bearer $MORPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "stuck-in-a-loop", "text": "<the agent turn>"}'

# {
#   "model": "stuck-in-a-loop",
#   "mode": "single_label",
#   "classes": [
#     { "class_id": 0, "label": "progressing", "score": 0.04, "selected": false },
#     { "class_id": 1, "label": "looping",     "score": 0.96, "selected": true }
#   ],
#   "inference_time_ms": 88
# }

The label with "selected": true is the answer; there is no separate top-level field to read. Because it comes back as an API response and not a dashboard panel, it composes with whichever tool you picked: write it onto the Braintrust span as a score, attach it to a Phoenix span as an OpenInference attribute, attach it to an Opik trace, alert on it in Slack, or route on it inline. It complements an eval or observability platform; it does not replace one. Full docs are at docs.morphllm.com.
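A minimal Python sketch of that flow, using only the standard library: call the endpoint from the curl example, pick the class marked selected, and flatten it into key/value pairs you could attach to a span or trace. The response shape is taken from the example above. The `reflex.*` attribute names, the helper names, and the 2-second timeout are illustrative choices, not a documented convention of Reflex, Braintrust, Phoenix, or Opik.

```python
from __future__ import annotations

import json
import os
import urllib.request

REFLEX_URL = "https://api.morphllm.com/v1/reflex/predict"


def predict(model: str, text: str, timeout: float = 2.0) -> dict:
    """POST one turn to the endpoint from the curl example; return parsed JSON."""
    req = urllib.request.Request(
        REFLEX_URL,
        data=json.dumps({"model": model, "text": text}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['MORPH_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def selected_label(response: dict) -> tuple[str, float]:
    """Return (label, score) of the class flagged "selected"."""
    for cls in response["classes"]:
        if cls["selected"]:
            return cls["label"], cls["score"]
    raise ValueError("no class marked selected")


def as_span_attributes(response: dict) -> dict:
    """Flatten the result into key/value pairs.

    The "reflex.<model>.*" keys are hypothetical names chosen for this sketch.
    """
    label, score = selected_label(response)
    prefix = f"reflex.{response['model']}"
    return {f"{prefix}.label": label, f"{prefix}.score": score}
```

From there, something like `as_span_attributes(predict("stuck-in-a-loop", turn_text))` yields a small dict that can be written with whatever attribute or score call your chosen platform's SDK provides.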

Frequently Asked Questions

Braintrust vs Arize Phoenix vs Opik: which should I self-host?

Phoenix or Opik, because Braintrust has no free self-host. Self-host Phoenix for the lightest footprint and OTel portability; self-host Opik for an OSI Apache-2.0 license and a unified platform on ClickHouse. See the operational reality of each route before you commit.

Which of the three is actually open source?

Only Opik is OSI-approved (Apache-2.0, ~19.8k stars). Phoenix is source-available under the Elastic License 2.0, which restricts reselling it as a service. Braintrust is closed source. The license section spells out the distinction.

How much do they cost?

Braintrust Starter is free ($10 credits, 10k scores); Pro is $249/mo for 50k scores. Phoenix and Opik self-host free with no meter; their clouds are Arize AX ($50/mo Pro) and Opik Cloud ($19/mo Pro). The pricing section has the full overage math.

Is Phoenix or Opik easier to run at scale, and what breaks?

Both have a documented ceiling. Phoenix is one process but went "more or less non functional" near 200M spans; Opik runs on ClickHouse and the default docker-compose shipped a 66 GiB disk-bloat outage and a container-health bug. See the operational reality. For more options, see Braintrust alternatives.

Do any of them catch wrong answers and frustrated users in real time?

Not on the turn it happens. All three rely on offline, sampled LLM-as-judge evals, and Opik's online rules score traces after they land. Catching it inline needs a per-turn classifier, covered in semantic signals.

Add the signal the eval batch catches tonight, on every turn now

Braintrust scores offline; Phoenix and Opik trace and score after the fact. A Reflex returns a semantic label inline in under 90 milliseconds, over an API that composes with all three.