Braintrust vs LangSmith (2026): Scores vs Traces, Eval-First vs Trace-First

Two tools that look like one category, metering two different things. Braintrust bills scores because evaluation is its product; LangSmith bills traces because monitoring is its product. The purchase decision is which activity dominates your month, not which dashboard looks nicer. Pricing verified against each vendor's published page as of June 2026.

Scores: what Braintrust meters

Traces: what LangSmith meters

Both closed: neither is open source

TL;DR

Pick Braintrust when your bottleneck is "is this prompt change better," the offline-eval question. Scoring is its product, so it meters scores (10k free, 50k on the $249/mo Pro plan) and tracing exists to feed evaluation. Pick LangSmith when your bottleneck is "what happened in this request," the production-tracing question, especially inside LangChain or LangGraph. Tracing is its product, so it meters traces (5k free, 10k on the $39/seat/mo Plus plan). Both are closed source, and both offer self-hosting only on Enterprise. Match the meter to your dominant activity and the cost takes care of itself.


Quick Comparison

Braintrust vs LangSmith (June 2026)

Dimension	Braintrust	LangSmith
Built for	Eval-first: scoring is the product	Trace-first: monitoring is the product
License	Closed source	Closed source
Metered unit	Scores (+ processed data, tokens)	Base traces
Free tier	Starter: $10 credits, 1 GB, 10k scores, 14-day retention, unlimited users	Developer: 5k base traces/mo, 1 seat
First paid tier	Pro $249/mo: 5 GB, 50k scores, 30-day retention, RBAC	Plus $39/seat/mo: 10k base traces included
Overage	Starter $4/GB + $2.50/1k scores; Pro $3/GB + $1.50/1k scores	$2.50/1k base traces (14-day retention); $5/1k extended traces (400-day retention)
Token cost	$0.06/M input, $0.40/M output	Included in trace meter
Self-hosting	Hybrid / on-prem on Enterprise	Enterprise plan only, custom pricing
Ecosystem fit	Framework-agnostic, experiment-centric	First-party LangChain / LangGraph

Pricing: Scores vs Traces

The pricing pages read similarly until you notice they charge for different things. Braintrust's plans are sized in scores, plus processed data and tokens. The free Starter plan ships $10 of credits, 1 GB of processed data, 10k scores, 14-day retention, and unlimited users. Pro is $249/mo for 5 GB, 50k scores, 30-day retention, and RBAC. Overage on Starter is $4/GB plus $2.50 per 1k scores; on Pro it drops to $3/GB plus $1.50 per 1k scores. Tokens are billed separately at $0.06/M input and $0.40/M output. On-prem and hybrid deployments require the Enterprise plan.
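The Braintrust figures above can be turned into a rough monthly estimate. This is a minimal sketch, not the vendor's billing logic: the function and class names are made up for illustration, and it assumes overage is charged linearly rather than in blocks and ignores the $10 of Starter credits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BraintrustPlan:
    name: str
    base_usd: float                 # flat monthly fee
    included_gb: float              # processed data included
    included_scores: int            # scores included
    usd_per_extra_gb: float         # overage per GB of processed data
    usd_per_extra_1k_scores: float  # overage per 1,000 scores


# Figures as quoted in this article (June 2026).
STARTER = BraintrustPlan("Starter", 0.0, 1.0, 10_000, 4.00, 2.50)
PRO = BraintrustPlan("Pro", 249.0, 5.0, 50_000, 3.00, 1.50)

USD_PER_M_INPUT_TOKENS = 0.06
USD_PER_M_OUTPUT_TOKENS = 0.40


def braintrust_monthly_usd(plan, gb, scores, input_tokens=0, output_tokens=0):
    """Estimate one month's bill.

    Assumptions (not from the vendor): overage is linear, not billed in
    blocks, and the $10 of Starter credits is ignored.
    """
    extra_gb = max(0.0, gb - plan.included_gb)
    extra_scores = max(0, scores - plan.included_scores)
    tokens_usd = ((input_tokens / 1e6) * USD_PER_M_INPUT_TOKENS
                  + (output_tokens / 1e6) * USD_PER_M_OUTPUT_TOKENS)
    return (plan.base_usd
            + extra_gb * plan.usd_per_extra_gb
            + (extra_scores / 1000) * plan.usd_per_extra_1k_scores
            + tokens_usd)


if __name__ == "__main__":
    # 8 GB and 80k scores on Pro: 249 + 3 GB * $3 + 30k scores * $1.50/1k
    print(round(braintrust_monthly_usd(PRO, gb=8, scores=80_000), 2))  # 303.0
```

Plugging in your own monthly score count and data volume shows quickly whether the score overage or the flat fee dominates the bill.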

LangSmith's plans are sized in base traces. The free Developer plan covers 5k base traces a month and 1 seat. Plus is $39 per seat per month and includes 10k base traces, with overage at $2.50 per 1k base traces on 14-day retention or $5 per 1k on extended 400-day retention. Self-hosting requires the Enterprise plan.
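The same kind of estimate works for LangSmith Plus. Again this is a sketch under stated assumptions rather than the vendor's billing logic: the function name is hypothetical, the 10k included traces are treated as a single pool regardless of seat count, overage is linear, and the extended rate is applied only to traces beyond the included pool.

```python
PLUS_USD_PER_SEAT = 39.0
PLUS_INCLUDED_BASE_TRACES = 10_000
USD_PER_1K_BASE = 2.50      # 14-day retention
USD_PER_1K_EXTENDED = 5.00  # 400-day retention


def langsmith_plus_monthly_usd(seats, traces, extended=False):
    """Estimate one month's Plus bill.

    Assumptions (not from the vendor): included traces are pooled per
    workspace, overage is linear, and `extended` prices every overage
    trace at the 400-day rate.
    """
    rate = USD_PER_1K_EXTENDED if extended else USD_PER_1K_BASE
    extra_traces = max(0, traces - PLUS_INCLUDED_BASE_TRACES)
    return seats * PLUS_USD_PER_SEAT + (extra_traces / 1000) * rate


if __name__ == "__main__":
    # 3 seats, 60k traces: 3 * $39 + 50k overage traces * $2.50/1k
    print(langsmith_plus_monthly_usd(seats=3, traces=60_000))  # 242.0
    print(langsmith_plus_monthly_usd(seats=3, traces=60_000, extended=True))  # 367.0
```

Note that seats and traces scale independently here: a small team with heavy traffic pays mostly overage, while a large team with light traffic pays mostly seats.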

The two $2.50-per-1k overage numbers look identical and measure opposite work. Braintrust's $2.50 buys a thousand scores, the output of evaluating; LangSmith's $2.50 buys a thousand base traces, the record of a request. A team that scores ten variants against a 500-row dataset generates 5,000 scores from a handful of traces. A team serving 5,000 production requests generates 5,000 traces and zero scores until someone runs an eval. Same dollar, different denominator.
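The two workloads in that paragraph can be written out as arithmetic. This is an illustrative sketch: the function names are invented here, and it assumes one scorer per dataset row and no evals running on production traffic unless a sample rate is given.

```python
def offline_eval_scores(variants, dataset_rows, scorers_per_row=1):
    """Scores produced by scoring every variant against every dataset row."""
    return variants * dataset_rows * scorers_per_row


def production_units(requests, eval_sample_rate=0.0, scorers_per_trace=1):
    """Return (traces, scores) for live traffic.

    Every request yields one trace; scores appear only for the sampled
    fraction that someone actually evaluates.
    """
    traces = requests
    scores = int(requests * eval_sample_rate) * scorers_per_trace
    return traces, scores


def overage_usd(units, usd_per_1k):
    """Price a unit count at a per-1,000 overage rate."""
    return units / 1000 * usd_per_1k


if __name__ == "__main__":
    scores = offline_eval_scores(variants=10, dataset_rows=500)
    traces, prod_scores = production_units(requests=5_000)
    print(scores)               # 5000
    print(traces, prod_scores)  # 5000 0
    # Same $2.50 rate, applied to different units.
    print(overage_usd(scores, 2.50))  # 12.5 (Braintrust Starter, scores)
    print(overage_usd(traces, 2.50))  # 12.5 (LangSmith Plus, base traces)
```

Both bills come to $12.50 at overage rates, yet one paid for evaluation output and the other for request records.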

The cost cliff that starts these searches

Both tools land on a buyer's radar the same way: the free tier runs out and the meter starts. The r/LangChain thread that kicked off when LangSmith stopped being free is a useful tell for how teams react, and the broader r/MachineLearning discussion on whether these tools get used in production is worth reading before you commit your whole pipeline to one meter.

Eval-First vs Trace-First

The cleanest way to choose is to name your bottleneck out loud. If the question keeping you up is "is this prompt change actually better," you are in the offline-eval loop: fixed datasets, scoring functions, A/B between prompt or model variants, regression gates in CI. That is Braintrust's center of gravity. Tracing is present, but it exists to feed evaluation, and the plans are priced in the unit that loop produces.

If the question is "what actually happened in this request," you are in the production-monitoring loop: span trees, latency, token counts, error rates, replaying a single bad session. That is LangSmith's center of gravity, and it is sharpest when the request ran through LangChain or LangGraph, since the tracing is first-party rather than instrumented by hand.

This is why per-score billing beats per-trace billing for an experiment-heavy team and loses for a traffic-heavy one. Offline experiments generate many scores from few traces, so a scoring meter bills the work you do and the trace count stays small. Production traffic generates a trace on every request whether or not anyone is evaluating, so a trace meter bills the work you do and the score count stays small. The wrong meter for your workload is the one that charges for the axis you barely move on.

Ecosystem Fit

LangSmith is first-party to LangChain and LangGraph. If your app already runs on that stack, tracing, Prompt Hub, annotation queues, and monitoring light up with no glue code. The framework binding that makes integration effortless is also the lock-in. Braintrust is framework-agnostic and organized around the experiment: datasets, scorers, and the diff between runs are the primary objects, and it does not assume you write in any one library. If you run more than one framework, or none, Braintrust's neutrality is the easier fit; if you never leave LangChain, LangSmith's depth is hard to beat.

When Braintrust Wins

Your bottleneck is offline evaluation: "is this prompt or model change better" on fixed datasets.
You run large eval suites and want billing tied to scores, the unit you actually generate.
You want regression gates and run-to-run diffs as first-class objects, not bolted on.
You are framework-agnostic and do not want a tool that assumes one library.

When LangSmith Wins

Your bottleneck is production tracing: "what happened in this request" on live traffic.
You are all-in on LangChain or LangGraph and want first-party tracing with no glue code.
You want Prompt Hub, annotation queues, and monitoring as part of one product.
Your trace volume is steady and predictable, so the per-trace meter stays bounded.

The Third Option: Own the Stack

There is a path neither vendor markets, and a growing number of teams take it: skip the per-score and per-trace meters and instrument yourself. Both Braintrust and LangSmith lean on LLM-as-judge evals that run offline on samples. For the always-on signals, you instrument with OpenLLMetry (Apache-2.0, ~7.2k stars, free), store the spans in your own ClickHouse (Cloud Basic around $66/mo, or self-host the Apache-2.0 build for free), and build dashboards in Grafana OSS. You give up the polished experiment UI and the managed retention and take on the operational weight; in return there is no per-unit meter at all. We walk through the whole build in "Build your own LLM observability."

What Both Miss: Semantic Signals in Production

Everything above is about the offline experiment and the production trace. Neither catches the meaning of a turn while it is happening. A response that quotes the wrong refund policy returns a 200 with normal latency and a normal token count. A user who is quietly getting angry produces the same span as a delighted one. An agent stuck in a three-step loop looks like an agent doing work. The eval suite that would have flagged it runs tonight, on a sample, after the user already left.

That timing gap is structural. Both Braintrust and LangSmith approximate semantic checks with LLM-as-judge evals, and those run offline on samples: a fraction of traffic, scored after the fact. A Reflex runs inline on every turn instead. It is a classifier that returns the label as an API response, in under 90 milliseconds, cheap enough to score every turn rather than a sample.

The labels are the ones an eval would otherwise check tonight: is_user_frustrated, stuck-in-a-loop, leaked-thinking, jailbreak, or a signal specific to your product. The built-in signals cover jailbreak, guardrail, leaked-thinking, stuck-in-a-loop, incomplete-thought, ambiguity, difficulty, and domain, and you can train a custom signal in under an hour in the Reflex dashboard. Realtime pricing is $0.001 per event, where one event is 2,048 tokens; that price is what makes scoring every turn affordable instead of sampling. It is live and self-serve today.
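To see what per-event pricing means for a month of traffic, the quoted numbers can be multiplied out. This is a rough sketch: the function names are hypothetical, and it assumes a turn longer than 2,048 tokens is billed as multiple events, rounded up, which the pricing statement above does not spell out.

```python
import math

USD_PER_EVENT = 0.001
TOKENS_PER_EVENT = 2_048


def reflex_events(tokens_in_turn):
    """Events billed for one turn.

    Assumption: partial events round up and every scored turn costs at
    least one event.
    """
    return max(1, math.ceil(tokens_in_turn / TOKENS_PER_EVENT))


def reflex_monthly_usd(turns_per_month, avg_tokens_per_turn):
    """Cost of scoring every turn with one signal for a month."""
    return turns_per_month * reflex_events(avg_tokens_per_turn) * USD_PER_EVENT


if __name__ == "__main__":
    # One million turns of about 500 tokens each, one signal per turn.
    print(round(reflex_monthly_usd(1_000_000, 500), 2))  # 1000.0
```

Running several signals on the same turn would multiply the event count accordingly, so budget per signal rather than per turn.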

Score a turn inline, then attach it to your trace

curl -X POST "https://api.morphllm.com/v1/reflex/predict" \
  -H "Authorization: Bearer $MORPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "stuck-in-a-loop", "text": "<the agent turn>"}'

# {
#   "model": "stuck-in-a-loop",
#   "mode": "single_label",
#   "classes": [
#     { "class_id": 0, "label": "progressing", "score": 0.04, "selected": false },
#     { "class_id": 1, "label": "looping",     "score": 0.96, "selected": true }
#   ],
#   "inference_time_ms": 88
# }

The label with "selected": true is the answer; there is no separate top-level field to read. Because it comes back as an API response and not a dashboard panel, it composes with whichever tool you picked: write it onto the Braintrust span as a score, attach it to a LangSmith trace, alert on it in Slack, or route on it inline. It complements an eval or tracing platform; it does not replace one. Full docs are at docs.morphllm.com.
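Reading the selected label out of that response takes a few lines. The sketch below parses the sample payload shown above using only the standard library; the helper names and the shape of the dictionary it builds for forwarding are illustrative assumptions, not part of any vendor SDK.

```python
import json

# The sample response from the curl call above.
SAMPLE_RESPONSE = """
{
  "model": "stuck-in-a-loop",
  "mode": "single_label",
  "classes": [
    {"class_id": 0, "label": "progressing", "score": 0.04, "selected": false},
    {"class_id": 1, "label": "looping", "score": 0.96, "selected": true}
  ],
  "inference_time_ms": 88
}
"""


def selected_label(response):
    """Return (label, score) of the class marked "selected": true."""
    for cls in response["classes"]:
        if cls.get("selected"):
            return cls["label"], cls["score"]
    raise ValueError("response has no selected class")


def as_trace_annotation(response):
    """Flatten the response into a small dict to attach to a span or trace.

    The key names here are hypothetical; map them to whatever score or
    feedback fields your tracing tool expects.
    """
    label, score = selected_label(response)
    return {"signal": response["model"], "label": label, "score": score}


if __name__ == "__main__":
    parsed = json.loads(SAMPLE_RESPONSE)
    print(selected_label(parsed))       # ('looping', 0.96)
    print(as_trace_annotation(parsed))
```

Keeping the parsing separate from the forwarding step means the same result can go to a span, an alert, or a routing decision without calling the classifier again.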

Frequently Asked Questions

Braintrust vs LangSmith: what is the core difference?

They meter different units because they do different jobs. Braintrust is eval-first and bills scores; LangSmith is trace-first and bills traces. See the Eval-First vs Trace-First section for how to match the meter to your bottleneck.

How much do they cost?

Braintrust Starter is free ($10 credits, 10k scores); Pro is $249/mo for 50k scores. LangSmith Developer is free (5k base traces); Plus is $39/seat for 10k base traces, then $2.50 per 1k. The pricing section has the full overage math.

Is either one open source?

No. Both Braintrust and LangSmith are closed source, with self-hosting on Enterprise only. If open source is a requirement, look at Langfuse or the DIY stack.

Do they catch wrong answers and frustrated users in production?

Not in real time. Both rely on offline, sampled LLM-as-judge evals, so a regression shows up in the next batch, not on the turn it happened. Catching it inline needs a per-turn classifier, covered in the section What Both Miss: Semantic Signals in Production.

Add the signal the eval batch catches tonight, on every turn now

Braintrust scores offline and LangSmith traces requests; a Reflex returns a semantic label inline in under 90 milliseconds, over an API that composes with both.
