AI Agent Evaluation Frameworks (2026): 7 Compared

An agent evaluation framework scores whether your agent did its job: did it complete the task, call the right tools, take a clean path, and stay grounded. Seven come up when teams choose: DeepEval, Braintrust, Arize Phoenix, OpenAI Evals, RAGAS, LangSmith, and Galileo. They split on license, on trajectory versus final-output scoring, on offline versus online, and on price. This page compares them on exactly those axes.

frameworks compared on the axes that decide the pick

super-majority

of YC agent builders: evals under-deliver as upkeep gets impossible

offline-first

every framework here scores in CI or batch, not per-turn

<90ms

the production per-turn layer the offline suite can't reach

The frameworks also share one weakness, and it is the reason this page exists. In Voker's State of YC AI Agents 2026 survey of YC agent builders, a super-majority of respondents said evals often under-deliver because keeping them up to date becomes an impossible task. About 38% explicitly raised evaluation challenges: building suites, running A/B tests, improving behavior over time.

The mechanism is simple. An offline suite is a snapshot of the traffic you had when you wrote it. Production is a live distribution that shifts every day. None of the seven frameworks below classify the meaning of each turn while the agent is still running, which is where the failures a snapshot misses actually show up.

So the right setup is two layers, not one: an offline framework from the list below for reproducible regression coverage in CI, and a per-turn classifier that runs in production on every turn. We compare the seven first, then cover the production layer and where it fits. For the metrics themselves (task completion, tool-call accuracy, trajectory match, groundedness) and the offline-versus-online split in depth, see AI agent evaluation.

What Agent Evaluation Is

Before the table, three distinctions decide which framework fits and how far it gets you.

Offline vs online. Offline evaluation runs the agent against a fixed dataset of known tasks, in CI or before a release. It is reproducible and catches regressions on cases you already collected. Online evaluation scores live production traffic: the long-tail inputs no dataset contains and the conversations that drift over many turns. Most frameworks here are offline-first; a few add online scoring on sampled traces.
Trajectory vs final-output. Final-output evaluation scores only the last message. Trajectory evaluation scores the full sequence of model calls, tool calls, and arguments that produced it. A correct answer reached in 20 steps with two policy-violating intermediate calls is a failing trajectory a final-output score marks green. This is the line that separates a real agent evaluator from a prompt-output checker.
LLM-as-judge. Almost every framework scores with a model: give it the output and a rubric, get back a score. It is flexible and fine for offline scoring on a sample. It also carries length, position, and self-preference biases, is non-deterministic, and costs a full model call per judgment, which makes it the wrong tool for scoring every turn in production.

Hold those three in mind reading the table: a framework can be excellent at offline trajectory scoring and still leave the production per-turn layer empty.

The 7 Agent Evaluation Frameworks, Compared (2026)

One row per framework. "Offline / Online" is whether scoring runs against a fixed dataset, live production traffic, or both. "Trajectory eval" is whether it scores the step sequence, not just the final answer. The last row is the production per-turn layer that complements all seven.

Agent Evaluation Frameworks at a Glance

Framework	Open source?	Offline / Online	Trajectory eval?	LLM-as-judge?	Pricing / free tier
DeepEval	Yes, Apache 2.0 (~16.5k stars)	Both (Confident AI for online)	Yes, agentic + span-level	Yes (G-Eval, DAG)	OSS free; Confident AI from $9.99/user/mo
Braintrust	No (autoevals MIT)	Both	Yes	Yes (autoevals)	Free: 1 GB + 10k scores/mo; Pro $249/mo
Arize Phoenix	Source-available, Elastic 2.0 (~10k stars)	Both	Yes, trajectory + path convergence	Yes (phoenix.evals)	Phoenix OSS free; AX from $0, Pro $50/mo
OpenAI Evals	Yes, MIT (~18.5k stars)	Offline	Limited (completion protocol)	Yes (model-graded YAML)	OSS free; hosted product retiring late 2026
RAGAS	Yes, Apache 2.0 (~14.5k stars)	Offline (library)	Partial (tool + goal metrics)	Yes (most metrics)	Fully free OSS
LangSmith	No (AgentEvals helpers MIT)	Both	Yes (trajectory match modes)	Yes (OpenEvals)	Free: 5k traces/mo, 1 seat; Plus $39/seat/mo
Galileo	No (Agent Control OSS)	Both, agent-first	Yes (per-step actions/tools)	Yes, distilled (Luna-2)	Free: 5k traces/mo; Pro $100/mo
Reflex (production per-turn)	API, morph-reflex-v1	Online, in the request path	Per-turn semantic label	No, trainable classifier	$0.001/event (~$0.49 / 1M tok)

Stars, licenses, and pricing reflect each project's published pages and repositories as of June 2026. Braintrust's free tier is 1 GB processed data and 10,000 eval scores per month with 14-day retention; LangSmith's free Developer plan is 5,000 traces per month, 1 seat, 14-day retention. Reflex bills per event where one event is 2048 tokens; realtime is $0.001/event for the first 1M events, then $0.0005, and batch is half that. See pricing.

1. DeepEval

Apache 2.0

Fully open source, ~16.5k stars

50+ metrics

G-Eval, task completion, tool correctness

pytest-style

Evals as unit tests, local + CI

DeepEval, from Confident AI, is the open-source default for teams that want evals to feel like tests. You write assertions in a pytest-style harness and run them locally or in CI. The metric set is broad (50+), including agentic ones: task completion, tool correctness, argument correctness, and step-level scoring, plus multi-turn conversational metrics. Most metrics are LLM-as-judge via G-Eval. The hosted Confident AI platform adds tracing and online evals on live traffic.

Strengths

Fully open source, Apache 2.0, runs locally
Largest ready-made metric set in the list
Span-level agentic and tool-call metrics
Familiar pytest workflow, easy in CI

Limitations

Online/production evals need the hosted platform
LLM-as-judge metrics cost a model call each
Offline-first; not a per-turn live guardrail
Suite still needs upkeep as traffic drifts

Best for: teams that want open-source, code-first evals in CI with deep agent metric coverage.

2. Braintrust

Eval-first

Datasets, experiments, scorers in one place

Free tier

1 GB data + 10k scores/mo

autoevals

MIT scorer library, custom code scorers

Braintrust is the commercial tool to beat if your problem is evaluation as a workflow. Build datasets from production traces, write scorers (LLM-as-judge via the MIT-licensed autoevals library, or custom code), run experiments, and gate releases on the results in CI. Observability is wired to the evals, so a regression surfaces as a failing scorer. The free tier is 1 GB of processed data and 10,000 eval scores per month with 14-day retention and unlimited users; Pro is $249/mo.

Strengths

Deepest eval workflow: datasets + experiments + scorers
Observability tied to eval results
Generous free tier, unlimited seats
Strong CI gating story

Limitations

Core platform is closed-source SaaS
Scorers are work to write and keep current
Online scoring is async, not in the request path
Cost climbs with data and score volume

Best for: teams whose core need is measuring quality with rigorous, versioned evals. See Braintrust vs Langfuse.

3. Arize Phoenix

Elastic 2.0

Source-available, ~10k stars

OTel-native

OpenTelemetry / OpenInference tracing

Trajectory

Tool-call, trajectory, path-convergence evals

Phoenix is Arize's source-available evaluation and tracing layer; AX is the hosted enterprise platform. It is genuinely agent-aware: tool-calling evals (selection plus invocation), an ordered-trajectory eval that sends the tool-call sequence to an LLM judge, and a path-convergence eval that compares steps taken against the minimum needed. The license is Elastic License 2.0, so you can self-host internally for free but cannot resell it as a managed service. phoenix.evals ships pre-tested LLM-as-judge templates across 20+ providers.

Strengths

Real agent-trajectory and path-convergence evals
OpenTelemetry-native, self-host with no event caps
Tested LLM-judge templates across many providers
Clear OSS-to-enterprise (AX) path

Limitations

Elastic 2.0 is source-available, not OSI-permissive
Setup and concepts are heavy
Oriented to ML/data-science teams
Offline eval, not live per-turn blocking

Best for: ML teams that want rigorous, agent-aware evaluation with an OpenTelemetry-native, self-hostable base. See Arize Phoenix vs Langfuse.

4. OpenAI Evals

MIT

Open source, ~18.5k stars

Registry

JSONL data + YAML eval params

Offline

Benchmark-style grading, reproducible

OpenAI Evals is the MIT-licensed framework for registry-based, reproducible, benchmark-style grading: define a dataset as JSONL and an eval as YAML, then grade model outputs, exact-match or model-graded. Agent support is limited to the Completion Function Protocol for prompt chains and tool-using agents; it is not a trajectory evaluator like Phoenix or LangSmith.

One caveat to plan around: OpenAI announced on June 3, 2026 that the hosted Evals product becomes read-only on October 31, 2026 and the dashboard and API shut down November 30, 2026, with migration recommended to Promptfoo. The open-source repo stays usable.

Strengths

MIT, simple registry format, reproducible
Good for benchmark-style offline grading
Model-graded evals via YAML
Large, well-known repo

Limitations

Hosted product retiring in late 2026
Weak trajectory / tool-call evaluation
Offline only, no production scoring
Low maintenance activity on the repo

Best for: reproducible, benchmark-style offline grading where you control the dataset and do not need trajectory scoring.

5. RAGAS

Apache 2.0

Free OSS, ~14.5k stars

Reference-free

Faithfulness, context precision/recall

RAG-first

Now with agent goal + tool metrics

RAGAS is the open-source metrics library for retrieval-augmented generation: reference-free scores for faithfulness, context precision, context recall, and answer relevancy, most computed with an LLM judge. It has since added agentic metrics (agent goal accuracy, tool-call accuracy and F1, topic adherence), so it reaches into agent evaluation, but its center of gravity is grounding and RAG-component quality. It is a library you call from your own harness, not a platform.

Strengths

Fully free, Apache 2.0, no paid tier
Best-in-class reference-free RAG metrics
Added agent goal and tool-call metrics
Drops into any eval harness

Limitations

RAG-first; lighter on full trajectory scoring
No native dashboard, tracing, or CI gating
LLM-judge metrics cost a model call each
Offline library, not a production scorer

Best for: RAG-heavy agents that need rigorous groundedness and retrieval metrics from a free library.

6. LangSmith

LangChain-native

First-party LangChain / LangGraph

Trajectory

create_trajectory_match_evaluator modes

Free dev

5k traces/mo, 1 seat

LangSmith is the eval-plus-tracing platform from the LangChain team, and the natural pick if your agent is already on LangChain or LangGraph. Its trajectory evaluation is the strongest framework-native story here: create_trajectory_match_evaluator with strict, unordered, subset, and superset modes, plus a reference-free create_trajectory_llm_as_judge.

Online evaluators can run on live traces from the UI. The platform is proprietary (self-host is enterprise-only), though the AgentEvals and OpenEvals helper libraries are MIT. The free Developer plan is 5,000 traces per month, 1 seat, 14-day retention; Plus is $39/seat/mo.

Strengths

Best framework-native trajectory-match evaluators
First-party LangChain / LangGraph integration
Online evaluators on live traces from the UI
MIT helper libraries (AgentEvals, OpenEvals)

Limitations

Platform is closed; self-host is enterprise-only
Most natural inside the LangChain stack
Online eval is sampled scoring, not per-turn blocking
Trace + eval cost grows with volume

Best for: teams already on LangChain/LangGraph that want trajectory evals wired into their tracing. See LangSmith and Langfuse alternatives.

7. Galileo

Agent-first

Action completion, tool selection quality

Luna-2

Distilled eval models, not frontier judges

Free tier

5k traces/mo; Pro $100/mo

Galileo is the enterprise-leaning, agent-first platform. Its Agentic Evaluations score the logged trajectory per step: tool error rate, tool selection quality, action advancement, and action completion. The notable design choice is the scorer: instead of a frontier LLM-as-judge, Galileo uses its distilled Luna-2 small evaluation models (fine-tuned Llama, 3B and 8B), which it positions as cheaper and faster than frontier judging while still scoring without ground truth in production.

The core platform is closed; a separate Agent Control runtime governance project is Apache 2.0. Free tier is 5,000 traces per month; Pro is $100/mo; Luna-2 access is enterprise.

Strengths

Agent-first metrics: action completion, tool quality
Distilled Luna-2 scorers, cheaper than frontier judges
Production observability over full trajectories
No-ground-truth scoring in production

Limitations

Core platform closed, enterprise-leaning
Luna-2 scorers gated to enterprise
Scoring on its platform, not a primitive you compose
Pricing detail (retention/seats) thin publicly

Best for: enterprises that want an agent-first platform with distilled, production-grade scorers built in.

Also worth a look

Promptfoo (the migration target OpenAI now recommends for hosted Evals), Langfuse (open-source tracing with an eval layer bolted on), and the benchmark side: tau-bench and tau2-bench (customer-service tool use, verified against database state), SWE-Bench (2,294 GitHub issues, execution-based), and AgentBench. A benchmark compares models on a fixed task set; a framework tests your own agent. See AI agent evaluation for the benchmark breakdown.

Why Offline Evals Go Stale

Every framework above is excellent at the same thing and blind to the same thing. They score a dataset you assembled. The moment production traffic moves past that dataset, the suite measures a world that no longer exists. This is not a tooling flaw you can patch; it is structural. An offline suite is reactive by construction: it measures the system after it has already changed.

That is the finding teams keep reporting. In Voker's State of YC AI Agents 2026 survey, a super-majority said evals under-deliver because keeping them current becomes an impossible task, and about 38% of respondents raised evaluation as a live challenge. The complaint is not that the frameworks are bad. It is that an eval suite is a standing chore that competes with shipping, and the chore never ends because the distribution never stops moving.

What teams actually say

Voker, State of YC AI Agents 2026: "a super-majority of respondents said evals often under-deliver because keeping them up to date becomes an impossible task."
Fintool (Nicolas Bustamante): generic NLP metrics like BLEU and ROUGE "don't work for finance." They built numeric-precision evals (a response saying revenue was "4.2" with no unit fails even when 4.2B is right) and adversarial-grounding tests that inject fake numbers to check the model cites the real source. A PR is blocked if the eval score drops more than 5%.
Practitioners (Hamel Husain, and the "evals will break" thread): you must constantly update tests as you observe new data; the evaluation infrastructure is "structurally reactive," measuring the system only after it has changed.

The pattern across all of these: the useful signal is in production, on the turns you have not seen yet, and an offline framework reaches that signal only after you have already labeled it and folded it back into the dataset. Closing that loop fast is the unsolved part, and it is a different shape of problem from offline scoring. It is a classification problem you run on every turn.

Production Per-Turn Evaluation

<90ms

Per-turn label, in the request path

up to 10x

cheaper than an LLM-as-judge per turn

<1 hr

train a custom evaluator on your labels

Per-turn evaluation is a classification problem: take one turn, return a label. The labels are the semantic events an offline suite reaches too late and a trace never reaches at all: jailbreak_attempt, is_agent_looping, policy_violation, is_user_frustrated, or a signal specific to your product. A classifier scores these in one forward pass, which is what makes running it on every turn affordable instead of on a sample.

A Morph Reflex is that classifier, exposed as an API rather than a platform. It returns a label in under 90 milliseconds end-to-end, one forward pass, up to 64k context. The economics are why it runs per-turn where an LLM-as-judge cannot: a frontier judge is a full model call per turn (1 to 3 seconds, $3 to $25 per million tokens), while Reflex bills per event (1 event = 2048 tokens) at $0.001 for realtime, roughly $0.49 per million tokens classified, up to 10x cheaper.

You train a custom evaluator from a prompt, a labeled dataset, or synthetic data in under an hour, over an API (/v1/fine_tuning/* and /v1/reflex/predict, base model morph-reflex-v1) that is OpenAI-fine-tuning-compatible.

This complements the frameworks above; it does not replace them. Keep an offline framework from the list for reproducible regression coverage in CI, where LLM-as-judge on a sample is the right cost. Add a per-turn classifier in production for the turns the snapshot has not seen yet.

Every label it returns is also data you feed back into the offline set, the fine-tune, and the RL reward, so the turn that flagged today becomes behavior the agent learns tomorrow. That loop, production label to training signal, is the part the seven frameworks leave to you, and it is the part that decides whether an agent improves or plateaus.

Where Reflex fits next to a framework

Offline framework (CI/batch)

Reproducible regression coverage on known tasks
Trajectory + final-output scoring before release
LLM-as-judge on a sample is fine here
DeepEval, Braintrust, Phoenix, LangSmith, etc.

Reflex (production per-turn)

Labels every live turn in under 90ms
Catches drift the snapshot has not seen
Trainable on your own labeled failures in under an hour
Labels feed back into evals, fine-tunes, and RL reward

How to Choose

Pick Based on Your Priority

Your priority	Best choice	Runner-up
Open-source, code-first evals in CI	DeepEval	RAGAS
Eval-first workflow with a free tier	Braintrust	LangSmith
Self-hosted, OTel-native, trajectory eval	Arize Phoenix	DeepEval
Already on LangChain / LangGraph	LangSmith	Braintrust
RAG groundedness metrics	RAGAS	DeepEval
Reproducible benchmark-style grading	OpenAI Evals	Promptfoo
Agent-first platform, distilled scorers	Galileo	Arize AX
Score every production turn in real time	Reflex (<90ms)	self-host classifier + rules

Most teams that run agents in production end up with two layers: an offline framework for reproducible coverage of what they already know, and a real-time classifier for the failures a snapshot cannot reach. The framework answers "did the agent pass the tests we wrote." The classifier answers "is this turn okay, right now, on traffic we have never seen."

The evaluation layer your offline suite can't reach

Offline frameworks score the dataset you wrote. Reflex scores the turns you haven't seen yet: a per-turn classifier in under 90ms, trained on your own failures in under an hour, with every label feeding back into your evals, fine-tunes, and RL reward.

Explore Reflex

See Pricing

Frequently Asked Questions

What is an AI agent evaluation framework?

Tooling you point at your own agent to score whether it did its job: task completion, tool-call accuracy, trajectory quality, groundedness, and cost. A framework (DeepEval, Braintrust, Arize Phoenix, OpenAI Evals, RAGAS, LangSmith, Galileo) tests your agent; a benchmark (tau-bench, SWE-Bench) compares models on a fixed task set. Most frameworks score offline against a held-out dataset; some also score live production traces.

What are the best AI agent evaluation frameworks in 2026?

DeepEval (Apache 2.0, pytest-style, 50+ metrics) for code-first CI evals; Braintrust (free tier of 1 GB and 10k scores/mo) for eval-first scoring; Arize Phoenix (Elastic 2.0, OTel-native, trajectory evals) for self-hosted observability plus eval; OpenAI Evals (MIT) for offline grading, though the hosted product retires late 2026; RAGAS (Apache 2.0) for RAG metrics; LangSmith for trajectory evals native to LangChain; Galileo for agent-first evals with distilled Luna-2 scorers.

What is the difference between offline and online agent evaluation?

Offline runs against a fixed dataset of known tasks in CI and catches regressions reproducibly. Online scores real production traffic and catches drift, novel failures, frustrated users, and jailbreaks. Most frameworks are offline-first. A super-majority of YC agent builders report offline suites under-deliver as keeping them current becomes impossible, which is why production needs a per-turn layer too. More in AI agent evaluation.

What is trajectory evaluation versus final-output evaluation?

Final-output evaluation scores only the last message. Trajectory evaluation scores the full sequence of model calls, tool calls, and arguments. A correct answer reached in 20 steps with two policy-violating intermediate calls is a failing trajectory a final-output score marks green. LangSmith, Arize Phoenix, DeepEval, and Galileo support trajectory evaluation; OpenAI Evals and RAGAS are weighted toward output and RAG-component scoring.

Do agent evaluation frameworks use LLM-as-a-judge?

Almost all do: DeepEval (G-Eval), Braintrust (autoevals), Phoenix (phoenix.evals), OpenAI Evals, RAGAS, and LangSmith all score with a model. Galileo uses distilled Luna-2 small evaluation models instead of frontier judges. LLM-as-judge has length, position, and self-preference biases, is non-deterministic, and costs a full model call per judgment, which makes it too slow and expensive for every production turn.

Can an evaluation framework run on every production turn in real time?

Offline frameworks are built for CI and batch, not the request path, and an online LLM-as-judge is too slow and expensive to run on every turn. A specialized per-turn classifier is the production-shaped tool. Morph Reflex returns a label inline in under 90 milliseconds at $0.001 per event (about $0.49 per million tokens classified), up to 10x cheaper than an LLM-as-judge, and complements an offline framework rather than replacing it. See agent monitoring tools.

Sources

DeepEval (Apache 2.0) and Confident AI pricing
Braintrust pricing and the autoevals library (MIT)
Arize Phoenix (Elastic License 2.0) and Arize AX pricing
OpenAI Evals (MIT) and the OpenAI deprecations page (hosted Evals retiring late 2026)
RAGAS (Apache 2.0) and its agent metrics docs
LangSmith trajectory evals and LangSmith pricing
Galileo Luna-2 and Galileo pricing
Voker: The State of YC AI Agents 2026 (evals "impossible to keep up to date")
Fintool: lessons from building AI agents for financial services (BLEU/ROUGE don't work for finance)
Hamel Husain: your AI product needs evals
Morph Reflex capabilities and pricing

Fast Apply

WarpGrep

Compact

Reflex

Model Router

DeepSeek

MiniMax

Qwen

GLM

Blog

Startup Credits

Contact Us

About

Careers

AI Agent Evaluation Frameworks (2026): DeepEval, Braintrust, Phoenix, RAGAS, LangSmith, OpenAI Evals, and Galileo Compared

What Agent Evaluation Is

The 7 Agent Evaluation Frameworks, Compared (2026)

1. DeepEval

2. Braintrust

3. Arize Phoenix

4. OpenAI Evals

5. RAGAS

6. LangSmith

7. Galileo

Why Offline Evals Go Stale

Production Per-Turn Evaluation

How to Choose

The evaluation layer your offline suite can't reach

Frequently Asked Questions

What is an AI agent evaluation framework?

What are the best AI agent evaluation frameworks in 2026?

What is the difference between offline and online agent evaluation?

What is trajectory evaluation versus final-output evaluation?

Do agent evaluation frameworks use LLM-as-a-judge?

Can an evaluation framework run on every production turn in real time?

Sources