AI Agent Evaluation Frameworks (2026): DeepEval, Braintrust, Phoenix, RAGAS, LangSmith, OpenAI Evals, and Galileo Compared

Seven agent evaluation frameworks compared on open source, offline vs online, trajectory eval, LLM-as-judge, and pricing. Plus the gap they share: a super-majority of YC agent builders say evals under-deliver because keeping offline suites up to date is an impossible task, and none of the seven frameworks runs per-turn in production.

June 26, 2026 · 1 min read

An agent evaluation framework scores whether your agent did its job: did it complete the task, call the right tools, take a clean path, and stay grounded. Seven come up when teams choose: DeepEval, Braintrust, Arize Phoenix, OpenAI Evals, RAGAS, LangSmith, and Galileo. They split on license, on trajectory versus final-output scoring, on offline versus online, and on price. This page compares them on exactly those axes.

  • 7 frameworks compared on the axes that decide the pick
  • Super-majority of YC agent builders: evals under-deliver as upkeep gets impossible
  • Offline-first: every framework here scores in CI or batch, not per-turn
  • <90ms: the production per-turn layer the offline suite can't reach

The frameworks also share one weakness, and it is the reason this page exists. In Voker's State of YC AI Agents 2026 survey of YC agent builders, a super-majority of respondents said evals often under-deliver because keeping them up to date becomes an impossible task. About 38% explicitly raised evaluation challenges: building suites, running A/B tests, improving behavior over time.

The mechanism is simple. An offline suite is a snapshot of the traffic you had when you wrote it. Production is a live distribution that shifts every day. None of the seven frameworks below classify the meaning of each turn while the agent is still running, which is where the failures a snapshot misses actually show up.

So the right setup is two layers, not one: an offline framework from the list below for reproducible regression coverage in CI, and a per-turn classifier that runs in production on every turn. We compare the seven first, then cover the production layer and where it fits. For the metrics themselves (task completion, tool-call accuracy, trajectory match, groundedness) and the offline-versus-online split in depth, see AI agent evaluation.

What Agent Evaluation Is

Before the table, three distinctions decide which framework fits and how far it gets you.

  • Offline vs online. Offline evaluation runs the agent against a fixed dataset of known tasks, in CI or before a release. It is reproducible and catches regressions on cases you already collected. Online evaluation scores live production traffic: the long-tail inputs no dataset contains and the conversations that drift over many turns. Most frameworks here are offline-first; a few add online scoring on sampled traces.
  • Trajectory vs final-output. Final-output evaluation scores only the last message. Trajectory evaluation scores the full sequence of model calls, tool calls, and arguments that produced it. A correct answer reached in 20 steps with two policy-violating intermediate calls is a failing trajectory, yet a final-output score still marks it green. This is the line that separates a real agent evaluator from a prompt-output checker.
  • LLM-as-judge. Almost every framework scores with a model: give it the output and a rubric, get back a score. It is flexible and fine for offline scoring on a sample. It also carries length, position, and self-preference biases, is non-deterministic, and costs a full model call per judgment, which makes it the wrong tool for scoring every turn in production.

Hold those three in mind reading the table: a framework can be excellent at offline trajectory scoring and still leave the production per-turn layer empty.
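The trajectory versus final-output distinction can be sketched in a few lines of plain Python. This is a minimal illustration, not any framework's API: the `Step` and `Run` types, the `FORBIDDEN_TOOLS` policy, and the step budget are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str
    args: dict = field(default_factory=dict)


@dataclass
class Run:
    steps: list
    final_answer: str


# Hypothetical policy: tools the agent must never call.
FORBIDDEN_TOOLS = {"delete_account", "export_all_users"}


def final_output_pass(run: Run, expected: str) -> bool:
    """Final-output check: only the last message is inspected."""
    return run.final_answer.strip() == expected.strip()


def trajectory_pass(run: Run, expected: str, max_steps: int = 10) -> bool:
    """Trajectory check: the answer, the step budget, and every tool call must be clean."""
    no_violations = all(step.tool not in FORBIDDEN_TOOLS for step in run.steps)
    within_budget = len(run.steps) <= max_steps
    return final_output_pass(run, expected) and no_violations and within_budget


# A run that reaches the right answer through a policy-violating call.
run = Run(
    steps=[Step("lookup_order"), Step("export_all_users"), Step("refund_order")],
    final_answer="Refund issued.",
)
```

Here `final_output_pass(run, "Refund issued.")` is true while `trajectory_pass(run, "Refund issued.")` is false, which is the case a final-output scorer misses.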

The 7 Agent Evaluation Frameworks, Compared (2026)

One row per framework. "Offline / Online" is whether scoring runs against a fixed dataset, live production traffic, or both. "Trajectory eval" is whether it scores the step sequence, not just the final answer. The last row is the production per-turn layer that complements all seven.

Agent Evaluation Frameworks at a Glance
Framework | Open source? | Offline / Online | Trajectory eval? | LLM-as-judge? | Pricing / free tier
DeepEval | Yes, Apache 2.0 (~16.5k stars) | Both (Confident AI for online) | Yes, agentic + span-level | Yes (G-Eval, DAG) | OSS free; Confident AI from $9.99/user/mo
Braintrust | No (autoevals MIT) | Both | Yes | Yes (autoevals) | Free: 1 GB + 10k scores/mo; Pro $249/mo
Arize Phoenix | Source-available, Elastic 2.0 (~10k stars) | Both | Yes, trajectory + path convergence | Yes (phoenix.evals) | Phoenix OSS free; AX from $0, Pro $50/mo
OpenAI Evals | Yes, MIT (~18.5k stars) | Offline | Limited (completion protocol) | Yes (model-graded YAML) | OSS free; hosted product retiring late 2026
RAGAS | Yes, Apache 2.0 (~14.5k stars) | Offline (library) | Partial (tool + goal metrics) | Yes (most metrics) | Fully free OSS
LangSmith | No (AgentEvals helpers MIT) | Both | Yes (trajectory match modes) | Yes (OpenEvals) | Free: 5k traces/mo, 1 seat; Plus $39/seat/mo
Galileo | No (Agent Control OSS) | Both, agent-first | Yes (per-step actions/tools) | Yes, distilled (Luna-2) | Free: 5k traces/mo; Pro $100/mo
Reflex (production per-turn) | API, morph-reflex-v1 | Online, in the request path | Per-turn semantic label | No, trainable classifier | $0.001/event (~$0.49 / 1M tok)

Stars, licenses, and pricing reflect each project's published pages and repositories as of June 2026. Braintrust's free tier is 1 GB processed data and 10,000 eval scores per month with 14-day retention; LangSmith's free Developer plan is 5,000 traces per month, 1 seat, 14-day retention. Reflex bills per event where one event is 2048 tokens; realtime is $0.001/event for the first 1M events, then $0.0005, and batch is half that. See pricing.

1. DeepEval

  • Apache 2.0: fully open source, ~16.5k stars
  • 50+ metrics: G-Eval, task completion, tool correctness
  • pytest-style: evals as unit tests, local + CI

DeepEval, from Confident AI, is the open-source default for teams that want evals to feel like tests. You write assertions in a pytest-style harness and run them locally or in CI. The metric set is broad (50+), including agentic ones: task completion, tool correctness, argument correctness, and step-level scoring, plus multi-turn conversational metrics. Most metrics are LLM-as-judge via G-Eval. The hosted Confident AI platform adds tracing and online evals on live traffic.
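To make "evals as unit tests" concrete, here is a minimal sketch of the pattern in plain Python. It deliberately does not use DeepEval's API: `run_agent` is a hypothetical stand-in for your agent, and `tool_correctness` is a simplified, deterministic metric written for this example rather than the library's implementation.

```python
# Hypothetical stand-in for a real agent; a real suite would invoke your agent here.
def run_agent(task: str) -> dict:
    canned = {
        "What is the weather in Paris?": {
            "tools_called": ["get_weather"],
            "answer": "It is 18C and cloudy in Paris.",
        },
    }
    return canned[task]


def tool_correctness(called: list, expected: list) -> float:
    """Fraction of expected tools that the agent actually called."""
    if not expected:
        return 1.0
    return sum(tool in called for tool in expected) / len(expected)


# A pytest-style test: pytest collects any function named test_*.
def test_weather_task_uses_weather_tool():
    result = run_agent("What is the weather in Paris?")
    score = tool_correctness(result["tools_called"], ["get_weather"])
    assert score >= 1.0
    assert "Paris" in result["answer"]
```

Saved as a `test_*.py` file, this runs under `pytest` locally or in CI, and a regression in tool use shows up as a failing test.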

Strengths
  • Fully open source, Apache 2.0, runs locally
  • Largest ready-made metric set in the list
  • Span-level agentic and tool-call metrics
  • Familiar pytest workflow, easy in CI
Limitations
  • Online/production evals need the hosted platform
  • LLM-as-judge metrics cost a model call each
  • Offline-first; not a per-turn live guardrail
  • Suite still needs upkeep as traffic drifts

Best for: teams that want open-source, code-first evals in CI with deep agent metric coverage.

2. Braintrust

  • Eval-first: datasets, experiments, scorers in one place
  • Free tier: 1 GB data + 10k scores/mo
  • autoevals: MIT scorer library, custom code scorers

Braintrust is the commercial tool to beat if your problem is evaluation as a workflow. Build datasets from production traces, write scorers (LLM-as-judge via the MIT-licensed autoevals library, or custom code), run experiments, and gate releases on the results in CI. Observability is wired to the evals, so a regression surfaces as a failing scorer. The free tier is 1 GB of processed data and 10,000 eval scores per month with 14-day retention and unlimited users; Pro is $249/mo.

Strengths
  • Deepest eval workflow: datasets + experiments + scorers
  • Observability tied to eval results
  • Generous free tier, unlimited seats
  • Strong CI gating story
Limitations
  • Core platform is closed-source SaaS
  • Scorers are work to write and keep current
  • Online scoring is async, not in the request path
  • Cost climbs with data and score volume

Best for: teams whose core need is measuring quality with rigorous, versioned evals. See Braintrust vs Langfuse.

3. Arize Phoenix

  • Elastic 2.0: source-available, ~10k stars
  • OTel-native: OpenTelemetry / OpenInference tracing
  • Trajectory: tool-call, trajectory, path-convergence evals

Phoenix is Arize's source-available evaluation and tracing layer; AX is the hosted enterprise platform. It is genuinely agent-aware: tool-calling evals (selection plus invocation), an ordered-trajectory eval that sends the tool-call sequence to an LLM judge, and a path-convergence eval that compares steps taken against the minimum needed. The license is Elastic License 2.0, so you can self-host internally for free but cannot resell it as a managed service. phoenix.evals ships pre-tested LLM-as-judge templates across 20+ providers.
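One plausible way to express "steps taken against the minimum needed" as a number is the ratio below. This is a sketch under assumptions, not Phoenix's implementation: the function name and the choice to fall back to the shortest observed run when no known minimum is supplied are both introduced here for illustration.

```python
from typing import Optional


def path_convergence(step_counts: list, min_needed: Optional[int] = None) -> float:
    """Average of (minimum steps / steps taken) across runs of the same task.

    1.0 means every run took the shortest path; lower values mean detours.
    If min_needed is not given, the shortest observed run is used as the estimate.
    """
    if not step_counts or any(n <= 0 for n in step_counts):
        raise ValueError("step_counts must be a non-empty list of positive integers")
    optimal = min_needed if min_needed is not None else min(step_counts)
    return sum(optimal / n for n in step_counts) / len(step_counts)


# Three runs of the same task: two took 4 steps, one wandered for 8.
score = path_convergence([4, 4, 8])
```

With these inputs the score is (1 + 1 + 0.5) / 3, so a single wandering run pulls the average well below 1.0.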

Strengths
  • Real agent-trajectory and path-convergence evals
  • OpenTelemetry-native, self-host with no event caps
  • Tested LLM-judge templates across many providers
  • Clear OSS-to-enterprise (AX) path
Limitations
  • Elastic 2.0 is source-available, not OSI-permissive
  • Setup and concepts are heavy
  • Oriented to ML/data-science teams
  • Offline eval, not live per-turn blocking

Best for: ML teams that want rigorous, agent-aware evaluation with an OpenTelemetry-native, self-hostable base. See Arize Phoenix vs Langfuse.

4. OpenAI Evals

  • MIT: open source, ~18.5k stars
  • Registry: JSONL data + YAML eval params
  • Offline: benchmark-style grading, reproducible

OpenAI Evals is the MIT-licensed framework for registry-based, reproducible, benchmark-style grading: define a dataset as JSONL and an eval as YAML, then grade model outputs, exact-match or model-graded. Agent support is limited to the Completion Function Protocol for prompt chains and tool-using agents; it is not a trajectory evaluator like Phoenix or LangSmith.
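The JSONL-plus-exact-match idea can be illustrated without the framework itself. The sketch below is an assumption-laden stand-in, not OpenAI Evals code: the `input` and `ideal` field names are used illustratively, and `fake_model` is a hypothetical completion function that replaces a real model call.

```python
import json

# A tiny dataset in JSONL form: one JSON object per line.
SAMPLES_JSONL = """\
{"input": "2 + 2 =", "ideal": "4"}
{"input": "Capital of France?", "ideal": "Paris"}
{"input": "Opposite of hot?", "ideal": "cold"}
"""


def fake_model(prompt: str) -> str:
    """Hypothetical completion function; the last answer is deliberately wrong."""
    answers = {"2 + 2 =": "4", "Capital of France?": "Paris", "Opposite of hot?": "warm"}
    return answers[prompt]


def exact_match_accuracy(jsonl_text: str, complete) -> float:
    """Grade each sample by exact string match against its ideal answer."""
    samples = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    correct = sum(complete(s["input"]).strip() == s["ideal"].strip() for s in samples)
    return correct / len(samples)


accuracy = exact_match_accuracy(SAMPLES_JSONL, fake_model)
```

Two of the three samples match, so `accuracy` is 2/3. Because the dataset and the grader are both fixed, the same inputs give the same score on every run, which is the reproducibility this style of grading is chosen for.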

One caveat to plan around: OpenAI announced on June 3, 2026 that the hosted Evals product becomes read-only on October 31, 2026 and the dashboard and API shut down November 30, 2026, with migration recommended to Promptfoo. The open-source repo stays usable.

Strengths
  • MIT, simple registry format, reproducible
  • Good for benchmark-style offline grading
  • Model-graded evals via YAML
  • Large, well-known repo
Limitations
  • Hosted product retiring in late 2026
  • Weak trajectory / tool-call evaluation
  • Offline only, no production scoring
  • Low maintenance activity on the repo

Best for: reproducible, benchmark-style offline grading where you control the dataset and do not need trajectory scoring.

5. RAGAS

  • Apache 2.0: free OSS, ~14.5k stars
  • Reference-free: faithfulness, context precision/recall
  • RAG-first: now with agent goal + tool metrics

RAGAS is the open-source metrics library for retrieval-augmented generation: reference-free scores for faithfulness, context precision, context recall, and answer relevancy, most computed with an LLM judge. It has since added agentic metrics (agent goal accuracy, tool-call accuracy and F1, topic adherence), so it reaches into agent evaluation, but its center of gravity is grounding and RAG-component quality. It is a library you call from your own harness, not a platform.
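A tool-call F1 score can be sketched as a multiset comparison between the tools the agent called and the tools a reference run calls. This is a simplified illustration on tool names only, not the RAGAS implementation; the function name and the treatment of two empty lists as a perfect score are assumptions made for the example.

```python
from collections import Counter


def tool_call_f1(predicted: list, reference: list) -> float:
    """F1 over tool names, counting repeated calls (multiset overlap)."""
    pred, ref = Counter(predicted), Counter(reference)
    if not pred and not ref:
        return 1.0  # nothing expected, nothing called
    overlap = sum((pred & ref).values())  # per-tool minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# The agent searched twice and booked, but never paid.
f1 = tool_call_f1(["search", "search", "book"], ["search", "book", "pay"])
```

Precision penalizes the redundant second `search`, recall penalizes the missing `pay`, and F1 combines the two into one number.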

Strengths
  • Fully free, Apache 2.0, no paid tier
  • Best-in-class reference-free RAG metrics
  • Added agent goal and tool-call metrics
  • Drops into any eval harness
Limitations
  • RAG-first; lighter on full trajectory scoring
  • No native dashboard, tracing, or CI gating
  • LLM-judge metrics cost a model call each
  • Offline library, not a production scorer

Best for: RAG-heavy agents that need rigorous groundedness and retrieval metrics from a free library.

6. LangSmith

  • LangChain-native: first-party LangChain / LangGraph
  • Trajectory: create_trajectory_match_evaluator modes
  • Free dev: 5k traces/mo, 1 seat

LangSmith is the eval-plus-tracing platform from the LangChain team, and the natural pick if your agent is already on LangChain or LangGraph. Its trajectory evaluation is the strongest framework-native story here: create_trajectory_match_evaluator with strict, unordered, subset, and superset modes, plus a reference-free create_trajectory_llm_as_judge.
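The four match modes are easier to reason about with a toy version. The sketch below is one reading of their semantics applied to bare tool names, not the AgentEvals implementation (which compares full message trajectories); `trajectory_match` is a hypothetical helper written for this example.

```python
from collections import Counter


def trajectory_match(actual: list, reference: list, mode: str = "strict") -> bool:
    """Compare two tool-call sequences under one of four match modes."""
    a, r = Counter(actual), Counter(reference)
    if mode == "strict":
        return actual == reference  # same calls, same order
    if mode == "unordered":
        return a == r  # same calls, any order
    if mode == "subset":
        return all(a[tool] <= r[tool] for tool in a)  # no calls beyond the reference
    if mode == "superset":
        return all(r[tool] <= a[tool] for tool in r)  # every reference call was made
    raise ValueError(f"unknown mode: {mode}")


reference = ["search", "fetch", "summarize"]
reordered = ["fetch", "search", "summarize"]
```

`reordered` fails `strict` but passes `unordered`. `subset` suits "never call anything outside this allow-list", while `superset` suits "these calls are required, extras are fine".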

Online evaluators can run on live traces from the UI. The platform is proprietary (self-host is enterprise-only), though the AgentEvals and OpenEvals helper libraries are MIT. The free Developer plan is 5,000 traces per month, 1 seat, 14-day retention; Plus is $39/seat/mo.

Strengths
  • Best framework-native trajectory-match evaluators
  • First-party LangChain / LangGraph integration
  • Online evaluators on live traces from the UI
  • MIT helper libraries (AgentEvals, OpenEvals)
Limitations
  • Platform is closed; self-host is enterprise-only
  • Most natural inside the LangChain stack
  • Online eval is sampled scoring, not per-turn blocking
  • Trace + eval cost grows with volume

Best for: teams already on LangChain/LangGraph that want trajectory evals wired into their tracing. See LangSmith and Langfuse alternatives.

7. Galileo

  • Agent-first: action completion, tool selection quality
  • Luna-2: distilled eval models, not frontier judges
  • Free tier: 5k traces/mo; Pro $100/mo

Galileo is the enterprise-leaning, agent-first platform. Its Agentic Evaluations score the logged trajectory per step: tool error rate, tool selection quality, action advancement, and action completion. The notable design choice is the scorer: instead of a frontier LLM-as-judge, Galileo uses its distilled Luna-2 small evaluation models (fine-tuned Llama, 3B and 8B), which it positions as cheaper and faster than frontier judging while still scoring without ground truth in production.

The core platform is closed; a separate Agent Control runtime governance project is Apache 2.0. The free tier is 5,000 traces per month; Pro is $100/mo; Luna-2 access is enterprise-only.

Strengths
  • Agent-first metrics: action completion, tool quality
  • Distilled Luna-2 scorers, cheaper than frontier judges
  • Production observability over full trajectories
  • No-ground-truth scoring in production
Limitations
  • Core platform closed, enterprise-leaning
  • Luna-2 scorers gated to enterprise
  • Scoring on its platform, not a primitive you compose
  • Pricing detail (retention/seats) thin publicly

Best for: enterprises that want an agent-first platform with distilled, production-grade scorers built in.

Also worth a look

Promptfoo (the migration target OpenAI now recommends for hosted Evals) and Langfuse (open-source tracing with an eval layer bolted on) are worth evaluating alongside the seven. On the benchmark side there are tau-bench and tau2-bench (customer-service tool use, verified against database state), SWE-Bench (2,294 GitHub issues, execution-based), and AgentBench. A benchmark compares models on a fixed task set; a framework tests your own agent. See AI agent evaluation for the benchmark breakdown.

Why Offline Evals Go Stale

Every framework above is excellent at the same thing and blind to the same thing. They score a dataset you assembled. The moment production traffic moves past that dataset, the suite measures a world that no longer exists. This is not a tooling flaw you can patch; it is structural. An offline suite is reactive by construction: it measures the system after it has already changed.

That is the finding teams keep reporting. In Voker's State of YC AI Agents 2026 survey, a super-majority said evals under-deliver because keeping them current becomes an impossible task, and about 38% of respondents raised evaluation as a live challenge. The complaint is not that the frameworks are bad. It is that an eval suite is a standing chore that competes with shipping, and the chore never ends because the distribution never stops moving.

What teams actually say
  • Voker, State of YC AI Agents 2026: "a super-majority of respondents said evals often under-deliver because keeping them up to date becomes an impossible task."
  • Fintool (Nicolas Bustamante): generic NLP metrics like BLEU and ROUGE "don't work for finance." They built numeric-precision evals (a response saying revenue was "4.2" with no unit fails even when 4.2B is right) and adversarial-grounding tests that inject fake numbers to check the model cites the real source. A PR is blocked if the eval score drops more than 5%.
  • Practitioners (Hamel Husain, and the "evals will break" thread): you must constantly update tests as you observe new data; the evaluation infrastructure is "structurally reactive," measuring the system only after it has changed.
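The two Fintool-style checks above, a numeric-precision test and a PR gate on score drop, can be sketched as follows. This is an illustration under assumptions, not Fintool's code: the regex, the function names, and the reading of "drops more than 5%" as a relative drop against the baseline are all choices made for this example.

```python
import re

# A number counts as "precise" only if a unit or magnitude follows it.
UNIT_PATTERN = re.compile(
    r"\$?\d+(?:\.\d+)?\s*(?:[KMBT]\b|thousand|million|billion|trillion|%)"
)


def has_number_with_unit(text: str) -> bool:
    """True if the text contains at least one number followed by a unit."""
    return UNIT_PATTERN.search(text) is not None


def should_block_pr(baseline: float, candidate: float, max_drop: float = 0.05) -> bool:
    """Block when the eval score falls by more than max_drop relative to baseline."""
    if baseline <= 0:
        return False  # no meaningful baseline to compare against
    return (baseline - candidate) / baseline > max_drop


bare = has_number_with_unit("Revenue was 4.2")
with_unit = has_number_with_unit("Revenue was 4.2B")
blocked = should_block_pr(baseline=0.90, candidate=0.84)
```

"Revenue was 4.2" fails the unit check while "Revenue was 4.2B" passes, and a fall from 0.90 to 0.84 (about 6.7% relative) trips the gate. If your team treats the 5% as absolute score points, swap the division for a plain difference.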

The pattern across all of these: the useful signal is in production, on the turns you have not seen yet, and an offline framework reaches that signal only after you have already labeled it and folded it back into the dataset. Closing that loop fast is the unsolved part, and it is a different shape of problem from offline scoring. It is a classification problem you run on every turn.

Production Per-Turn Evaluation

  • <90ms: per-turn label, in the request path
  • Up to 10x cheaper than an LLM-as-judge per turn
  • <1 hr to train a custom evaluator on your labels

Per-turn evaluation is a classification problem: take one turn, return a label. The labels are the semantic events an offline suite reaches too late and a trace never reaches at all: jailbreak_attempt, is_agent_looping, policy_violation, is_user_frustrated, or a signal specific to your product. A classifier scores these in one forward pass, which is what makes running it on every turn affordable instead of on a sample.

A Morph Reflex is that classifier, exposed as an API rather than a platform. It returns a label in under 90 milliseconds end-to-end, one forward pass, up to 64k context. The economics are why it runs per-turn where an LLM-as-judge cannot: a frontier judge is a full model call per turn (1 to 3 seconds, $3 to $25 per million tokens), while Reflex bills per event (1 event = 2048 tokens) at $0.001 for realtime, roughly $0.49 per million tokens classified, up to 10x cheaper.
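The per-million-token figure follows directly from the per-event price, and the tiering quoted earlier on this page can be worked the same way. The sketch below uses only the numbers stated here (2048 tokens per event, $0.001 per realtime event for the first 1M events, $0.0005 after); the function names are hypothetical, and how a turn longer than 2048 tokens is split into events is not specified here, so the code counts events directly.

```python
TOKENS_PER_EVENT = 2048
REALTIME_PRICE = 0.001            # USD per event, first 1M events
REALTIME_PRICE_AFTER_1M = 0.0005  # USD per event beyond the first 1M
TIER_BOUNDARY = 1_000_000


def cost_per_million_tokens(price_per_event: float) -> float:
    """Convert a per-event price into a per-million-token price."""
    events_per_million_tokens = 1_000_000 / TOKENS_PER_EVENT
    return events_per_million_tokens * price_per_event


def realtime_cost(events: int) -> float:
    """Total realtime cost in USD for a given number of events."""
    first_tier = min(events, TIER_BOUNDARY)
    second_tier = max(events - TIER_BOUNDARY, 0)
    return first_tier * REALTIME_PRICE + second_tier * REALTIME_PRICE_AFTER_1M


per_million = cost_per_million_tokens(REALTIME_PRICE)
monthly = realtime_cost(1_500_000)
```

One million tokens is about 488 events, so `per_million` comes to roughly $0.49, and 1.5M events cost $1,000 for the first tier plus $250 for the remainder.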

You train a custom evaluator from a prompt, a labeled dataset, or synthetic data in under an hour, over an API (/v1/fine_tuning/* and /v1/reflex/predict, base model morph-reflex-v1) that is OpenAI-fine-tuning-compatible.

This complements the frameworks above; it does not replace them. Keep an offline framework from the list for reproducible regression coverage in CI, where LLM-as-judge on a sample is the right cost. Add a per-turn classifier in production for the turns the snapshot has not seen yet.

Every label it returns is also data you feed back into the offline set, the fine-tune, and the RL reward, so the turn that flagged today becomes behavior the agent learns tomorrow. That loop, production label to training signal, is the part the seven frameworks leave to you, and it is the part that decides whether an agent improves or plateaus.

Where Reflex fits next to a framework
Offline framework (CI/batch)
  • Reproducible regression coverage on known tasks
  • Trajectory + final-output scoring before release
  • LLM-as-judge on a sample is fine here
  • DeepEval, Braintrust, Phoenix, LangSmith, etc.
Reflex (production per-turn)
  • Labels every live turn in under 90ms
  • Catches drift the snapshot has not seen
  • Trainable on your own labeled failures in under an hour
  • Labels feed back into evals, fine-tunes, and RL reward

How to Choose

Pick Based on Your Priority
Your priority | Best choice | Runner-up
Open-source, code-first evals in CI | DeepEval | RAGAS
Eval-first workflow with a free tier | Braintrust | LangSmith
Self-hosted, OTel-native, trajectory eval | Arize Phoenix | DeepEval
Already on LangChain / LangGraph | LangSmith | Braintrust
RAG groundedness metrics | RAGAS | DeepEval
Reproducible benchmark-style grading | OpenAI Evals | Promptfoo
Agent-first platform, distilled scorers | Galileo | Arize AX
Score every production turn in real time | Reflex (<90ms) | self-host classifier + rules

Most teams that run agents in production end up with two layers: an offline framework for reproducible coverage of what they already know, and a real-time classifier for the failures a snapshot cannot reach. The framework answers "did the agent pass the tests we wrote." The classifier answers "is this turn okay, right now, on traffic we have never seen."

The evaluation layer your offline suite can't reach

Offline frameworks score the dataset you wrote. Reflex scores the turns you haven't seen yet: a per-turn classifier in under 90ms, trained on your own failures in under an hour, with every label feeding back into your evals, fine-tunes, and RL reward.

Frequently Asked Questions

What is an AI agent evaluation framework?

Tooling you point at your own agent to score whether it did its job: task completion, tool-call accuracy, trajectory quality, groundedness, and cost. A framework (DeepEval, Braintrust, Arize Phoenix, OpenAI Evals, RAGAS, LangSmith, Galileo) tests your agent; a benchmark (tau-bench, SWE-Bench) compares models on a fixed task set. Most frameworks score offline against a held-out dataset; some also score live production traces.

What are the best AI agent evaluation frameworks in 2026?

DeepEval (Apache 2.0, pytest-style, 50+ metrics) for code-first CI evals; Braintrust (free tier of 1 GB and 10k scores/mo) for eval-first scoring; Arize Phoenix (Elastic 2.0, OTel-native, trajectory evals) for self-hosted observability plus eval; OpenAI Evals (MIT) for offline grading, though the hosted product retires late 2026; RAGAS (Apache 2.0) for RAG metrics; LangSmith for trajectory evals native to LangChain; Galileo for agent-first evals with distilled Luna-2 scorers.

What is the difference between offline and online agent evaluation?

Offline runs against a fixed dataset of known tasks in CI and catches regressions reproducibly. Online scores real production traffic and catches drift, novel failures, frustrated users, and jailbreaks. Most frameworks are offline-first. A super-majority of YC agent builders report offline suites under-deliver as keeping them current becomes impossible, which is why production needs a per-turn layer too. More in AI agent evaluation.

What is trajectory evaluation versus final-output evaluation?

Final-output evaluation scores only the last message. Trajectory evaluation scores the full sequence of model calls, tool calls, and arguments. A correct answer reached in 20 steps with two policy-violating intermediate calls is a failing trajectory, yet a final-output score still marks it green. LangSmith, Arize Phoenix, DeepEval, and Galileo support trajectory evaluation; OpenAI Evals and RAGAS are weighted toward output and RAG-component scoring.

Do agent evaluation frameworks use LLM-as-a-judge?

Almost all do: DeepEval (G-Eval), Braintrust (autoevals), Phoenix (phoenix.evals), OpenAI Evals, RAGAS, and LangSmith all score with a model. Galileo uses distilled Luna-2 small evaluation models instead of frontier judges. LLM-as-judge has length, position, and self-preference biases, is non-deterministic, and costs a full model call per judgment, which makes it too slow and expensive for every production turn.

Can an evaluation framework run on every production turn in real time?

Offline frameworks are built for CI and batch, not the request path, and an online LLM-as-judge is too slow and expensive to run on every turn. A specialized per-turn classifier is the production-shaped tool. Morph Reflex returns a label inline in under 90 milliseconds at $0.001 per event (about $0.49 per million tokens classified), up to 10x cheaper than an LLM-as-judge, and complements an offline framework rather than replacing it. See agent monitoring tools.

Sources