AI Agent Evaluation (2026): Metrics, Frameworks, and Production Failures

Most teams evaluate an agent by running it against a held-out set of tasks and checking whether the final answer is correct. That number tells you how the agent did on tasks you already knew about. It tells you almost nothing about how the agent behaves on the traffic you have not seen: whether it looped before answering, called the wrong tool and recovered, leaked its reasoning, or left the user frustrated three turns in. This page covers the metrics, the frameworks, and the per-turn labels that catch what offline scoring misses. Framework numbers verified against published pages as of June 2026.

3 layers: final-answer, trajectory, per-turn

< 90 ms: per-turn classifier latency, one forward pass

2,294 issues: SWE-Bench agent task set

What Agent Evaluation Covers

An agent turns one user request into a sequence: model calls, tool calls, retries, sometimes subagents. Evaluation has to cover three layers, and they answer different questions.

Final-answer evaluation. Score the last message against an expected result. This is the layer every benchmark and most eval tutorials cover. It is necessary but not sufficient: the answer can be right while the path to it was wrong, expensive, or unsafe.
Trajectory evaluation. Score the sequence of steps that produced the answer. Did the agent call the right tools with the right arguments? How many steps did a 3-step task take? Did it loop? Did it recover after a wrong tool call? A correct final answer reached in 20 steps with two policy-violating intermediate calls is a failing trajectory.
Per-turn evaluation. Score the meaning of each individual turn as it runs in production. A turn can be a jailbreak attempt, a leaked system prompt, a policy violation, or a moment a user gets frustrated. None of these change the status code, the latency, or the token count. They are invisible to logs and traces.

The trap is treating layer one as the whole job. A held-out pass/fail can be green while the trajectory was a mess and three turns drifted off policy. The signal that improves an agent lives in layers two and three, on real traffic, not on the held-out set.
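As a rough illustration of the trajectory layer, the sketch below scores one recorded trajectory against a reference. The `ToolCall` record, the `score_trajectory` helper, and the `max_repeats` threshold are assumptions made for this example, not part of any framework named on this page; real tooling would read the same information from trace spans.

```python
import json
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One tool invocation in a trajectory (hypothetical record shape)."""
    name: str
    args: dict = field(default_factory=dict)


def call_key(call: ToolCall) -> tuple:
    # Canonicalize arguments so key order does not affect equality.
    return (call.name, json.dumps(call.args, sort_keys=True))


def score_trajectory(actual, reference, max_repeats=2):
    """Score a list of ToolCall objects against a reference list."""
    actual_keys = [call_key(c) for c in actual]
    ref_keys = [call_key(c) for c in reference]
    ref_set = set(ref_keys)
    counts = Counter(actual_keys)

    # Calls the agent made that also appear (name and arguments) in the reference.
    correct = sum(1 for k in actual_keys if k in ref_set)
    # Reference calls the agent never made; empty means the reference is covered.
    missing = Counter(ref_keys) - counts

    return {
        "tool_call_accuracy": correct / len(actual_keys) if actual_keys else 0.0,
        "step_overhead": len(actual_keys) / len(ref_keys) if ref_keys else float("inf"),
        "looping": any(n > max_repeats for n in counts.values()),
        "strict_match": actual_keys == ref_keys,
        "superset_match": not missing,
    }
```

With a two-call reference and an agent that repeats the first call three times before finishing, this reports a step overhead of 2.0 and flags looping, even though every individual call was correct and the reference is fully covered.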

The Metrics That Matter

Six metrics carry most of the weight. The first four measure behavior, the last two measure cost and faithfulness.

Metric	What it measures	How it is computed
Task completion / success rate	Did the agent reach the correct end state	Verify the resulting database/system state, not just the final text. tau-bench checks both the final answer and the resulting DB state.
Tool-call accuracy	Right tool, right arguments	Per-call check of tool selection and argument correctness against a reference or schema (Phoenix's agent function-calling eval, DeepEval span-level).
Step / loop count	Efficiency and looping	Steps taken versus minimum needed; flag repeated identical actions. Loops hide inside a correct final answer.
Trajectory match	Did the step sequence match a reference	Compare the agent's trajectory against a reference trajectory (AgentEvals create_trajectory_match_evaluator, strict or superset modes).
Cost and latency	Tokens, dollars, wall time per task	Per-step attribution so a 20-step loop is visible, not buried in a trace total.
Groundedness	Are claims supported by retrieved context	Reference-free RAG metrics: faithfulness, context precision, context recall, answer relevancy (Ragas).

Two of these are easy to get wrong. Task completion measured on the final text alone passes an agent that says it booked the flight without booking it; tau-bench exists precisely because it verifies the database state instead. Step and loop count is invisible in a trace total: a 20-step agent turn and a 3-step one both return a single answer, so you only see the loop if you attribute cost per step.
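Both pitfalls can be made concrete. The sketch below uses a hypothetical in-memory `db` dict in place of the real booking system and made-up step records. It grades task completion on the end state rather than the final text, and it attributes tokens per step so a loop stands out.

```python
def grade_task(final_text: str, db: dict, expected_state: dict) -> dict:
    """Pass only if the system state matches, whatever the agent claims."""
    claims_success = "booked" in final_text.lower()
    state_ok = all(db.get(key) == value for key, value in expected_state.items())
    return {
        "claims_success": claims_success,
        "state_ok": state_ok,
        "passed": state_ok,
        # The dangerous case: the text says done, the system says otherwise.
        "false_claim": claims_success and not state_ok,
    }


def tokens_by_step(steps: list) -> dict:
    """Sum tokens per step name so repeated steps are visible."""
    totals = {}
    for step in steps:
        totals[step["name"]] = totals.get(step["name"], 0) + step["tokens"]
    return totals
```

A text-only grader would pass `grade_task("Your flight is booked!", {}, {"booking:42": "confirmed"})`; here it fails and is marked as a false claim.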

Offline vs Online Evaluation

Offline and online evaluation are not competing methods. They answer different questions, and shipping a reliable agent needs both.

Dimension	Offline (held-out set)	Online (production traffic)
Input	Fixed dataset of tasks with known answers	Live requests, the long tail no dataset contains
When it runs	CI, pre-release, on every prompt change	Continuously, on real turns as they happen
What it catches	Regressions on known cases	Drift, novel failures, frustrated users, jailbreaks
Reproducible	Yes, same inputs every run	No, traffic changes; you label and aggregate
Scorer	LLM-as-judge, code checks, human review on a sample	Per-turn classifier cheap enough for every turn

Offline evaluation tells you the agent did not get worse on yesterday's tasks. It is reproducible and it belongs in CI. What it cannot do is anticipate the inputs you have not seen, the multi-turn conversations that drift, or the tool failures that only happen against live systems. Those show up in production, once, and a held-out set written last month does not contain them.

Online evaluation closes that gap by scoring real traffic as it arrives. The catch is the scorer: LLM-as-judge is too slow and too expensive to run on every turn (covered below), so online evaluation at scale needs a classifier that returns a label in milliseconds. The payoff is that every labeled production turn becomes data you can feed back into the offline set, the fine-tune, and the reward function. That loop, not a bigger held-out set, is what moves an agent forward. More on the production-monitoring side in agent observability.
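One way to picture that feedback step, as a sketch only: flagged production turns are deduplicated and appended to the offline regression set. The turn record shape (`text`, `label`), the `promote_flagged_turns` name, and the use of SHA-256 for deduplication are assumptions for illustration.

```python
import hashlib


def turn_id(text: str) -> str:
    # Stable ID so the same turn is not added to the offline set twice.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def promote_flagged_turns(turns, offline_set, labels_of_interest):
    """Return a new offline set with flagged, previously unseen turns appended."""
    seen = {case["id"] for case in offline_set}
    merged = list(offline_set)
    for turn in turns:
        if turn["label"] not in labels_of_interest:
            continue
        tid = turn_id(turn["text"])
        if tid in seen:
            continue
        seen.add(tid)
        merged.append(
            {"id": tid, "input": turn["text"], "expected_label": turn["label"]}
        )
    return merged
```

The original list is left untouched, so the promotion can be reviewed before it replaces the set used in CI.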

Frameworks and Benchmarks Compared

Two different things get called "agent evaluation." Frameworks are the tooling you point at your own agent to score it. Benchmarks are fixed task sets with leaderboards that measure how a model or agent does on a standard problem. You use a framework to test your agent; you cite a benchmark to compare models.

Frameworks (point at your own agent)

Framework	What it scores	License	Free tier
LangSmith	Trajectory evals, LLM-as-judge, heuristic checks, annotation queues; first-party LangChain/LangGraph	Closed source	Developer: 5k traces/mo, 1 seat, 14-day retention
Braintrust	Eval-first scoring: LLM-as-judge, autoevals, custom code scorers, human review	Closed source	1 GB data, 10k scores/mo, 14-day retention, unlimited seats
OpenAI Evals	Registry-based, reproducible benchmark-style evals; Completion Function Protocol for tool-using agents	MIT (~18.5k stars)	Open source; hosted platform read-only after Oct 31 2026
Arize Phoenix	OpenTelemetry-native tracing plus evals: hallucination, agent function-calling eval, toxicity, RAG	Elastic License 2.0 (source-available, ~10k+ stars)	Self-host, no event caps
DeepEval	Pytest-style local evals; 50+ metrics (G-Eval, task completion, faithfulness); span-level agent scoring	Apache 2.0 (~16.3k stars)	Fully open source; runs locally
Ragas	Reference-free RAG metrics: faithfulness, context precision, context recall, answer relevancy	Apache 2.0	Fully open source

Benchmarks (standard task sets)

Benchmark	Domain	What it verifies
tau-bench	Customer service (retail, airline)	Multi-turn tool use against API tools and policy guidelines; checks the final answer AND the resulting database state, not just tool-call syntax
tau2-bench	Customer service + telecom troubleshooting	Extends tau-bench with a dual-control setup in which the user can also act on the shared environment; reports pass^k (introduced in tau-bench) to measure reliability across repeated attempts (Sierra Research / Princeton)
SWE-Bench	Software engineering	2,294 real GitHub issues; execution-based, runs the patched repo's tests
AgentBench	Multi-environment (OS, DB, web, games)	Agent reasoning and decision-making across distinct interactive environments

The pattern across the serious benchmarks is execution-based verification. tau-bench checks the database state, SWE-Bench runs the test suite, and the pass^k metric used by tau-bench and tau2-bench measures whether the agent succeeds on every one of k attempts rather than just once. A benchmark that only checks tool-call syntax or final text would pass agents that look right and do the wrong thing. The same standard applies to your own framework setup: verify the end state, not the last message.
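For readers who want to compute a pass^k-style number on their own runs, here is a minimal sketch. It assumes the estimator as we understand it from the tau-bench paper: with n trials per task and c successes, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The function name and input shape are our own.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials: int, k: int) -> float:
    """Average over tasks of C(c, k) / C(n_trials, k)."""
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = 0.0
    for c in successes_per_task:
        # math.comb returns 0 when c < k, so a task with too few
        # successes contributes nothing.
        total += comb(c, k) / comb(n_trials, k)
    return total / len(successes_per_task)
```

For three tasks with 4, 2, and 0 successes out of 4 trials, `pass_hat_k([4, 2, 0], 4, 1)` is 0.5, while `pass_hat_k([4, 2, 0], 4, 4)` drops to one third: only the first task is reliable across all four attempts.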

LLM-as-Judge and Its Failure Modes

LLM-as-judge is the default scorer in LangSmith, Braintrust, Phoenix, and DeepEval. You give a model the agent's output and a rubric, and it returns a score. It is flexible and it works for offline scoring on a sample. It has four failure modes that matter when you lean on it harder than that.

Length and verbosity bias. Judges systematically rate longer, more detailed answers higher, independent of whether the extra length is correct or relevant.
Position bias. In pairwise comparisons, the answer shown first or last gets an edge from its position, not its quality. Swapping the order can flip the verdict.
Self-preference. A judge tends to rate answers written in its own style, or produced by its own model family, higher regardless of correctness.
Cost and non-determinism. Each judgment is a full model call, so it is too expensive to run on every production turn, and the same trajectory can score differently across runs.

The practical conclusion: LLM-as-judge is fine for offline scoring on a held-out sample, where cost and variance are bounded and you can ensemble or human-review the edge cases. It is the wrong tool for the per-turn job, where you need a deterministic, cheap label on every turn in production. That is a classification problem, and a classifier is the right shape for it.
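For offline pairwise scoring, one common mitigation for position bias is to ask the judge twice with the order swapped and accept a verdict only when both orderings agree. The sketch below treats the judge as an abstract callable; the `Judge` signature and the `order_swapped_verdict` name are assumptions for this example, and in practice the callable would wrap an LLM call.

```python
from typing import Callable

# A judge takes (prompt, first_answer, second_answer) and returns
# "first" or "second". Hypothetical interface for illustration.
Judge = Callable[[str, str, str], str]


def order_swapped_verdict(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Return "a", "b", or "tie" when the two orderings disagree."""
    forward = judge(prompt, answer_a, answer_b)   # A shown first
    backward = judge(prompt, answer_b, answer_a)  # B shown first

    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"

    return winner_forward if winner_forward == winner_backward else "tie"
```

A judge that always prefers whatever it sees first yields "tie" here, which exposes the bias instead of hiding it. Note that swapping does nothing for verbosity bias: a judge that always prefers the longer answer agrees with itself in both orderings. It also doubles the number of judge calls, which is one more reason to keep it offline.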

Private beta

Reflexes is currently in private beta. The API below is live in the docs and may change before general availability. See the Reflexes docs and the Reflex product page.

The Per-Turn Classifier Loop

Per-turn evaluation is a classification problem: take one turn, return a label. The labels are the semantic events a trace and a final-answer score both miss: jailbreak_attempt, is_agent_looping, policy_violation, is_user_frustrated, or a signal specific to your product. A classifier scores these in one forward pass, which is what makes running it on every turn affordable.

A Morph Reflex is that classifier. It returns a label as an API response in under 90 milliseconds end-to-end, in one forward pass, with up to 64k context. You can train a custom evaluator in under an hour from a prompt, a labeled dataset, or synthetically generated data. Training runs on the base model morph-reflex-v1 through an API that is compatible with OpenAI fine-tuning. The resulting label can be logged to your eval set, fire an alert, or route the request inline.

Score a single agent turn for looping

curl -X POST "https://api.morphllm.com/v1/reflex/predict" \
  -H "Authorization: Bearer $MORPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "agent-looping", "text": "<the agent turn to score>"}'

# {
#   "model": "agent-looping",
#   "mode": "single_label",
#   "classes": [
#     { "class_id": 0, "label": "looping", "score": 0.94, "selected": true },
#     { "class_id": 1, "label": "progressing", "score": 0.06, "selected": false }
#   ],
#   "inference_time_ms": 11
# }
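For callers who prefer Python over curl, the same request can be sketched with the standard library. The endpoint, headers, and request and response fields are copied from the example above; the helper names `build_predict_request` and `selected_label` are made up here, and nothing beyond the fields shown is assumed about the API.

```python
import json
import urllib.request

REFLEX_URL = "https://api.morphllm.com/v1/reflex/predict"


def build_predict_request(api_key: str, model: str, text: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"model": model, "text": text}).encode("utf-8")
    return urllib.request.Request(
        REFLEX_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def selected_label(response: dict):
    """Return (label, score) for the class marked selected, or None."""
    for cls in response.get("classes", []):
        if cls.get("selected"):
            return cls["label"], cls["score"]
    return None


# Sending the request (requires network access and a valid key):
# with urllib.request.urlopen(build_predict_request(key, "agent-looping", turn)) as resp:
#     print(selected_label(json.load(resp)))
```

Applied to the sample response above, `selected_label` returns `("looping", 0.94)`.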

The reason this matters for evaluation, not just monitoring: every Reflex output feeds three things at once. It feeds your evals, so the labeled turn joins the offline set and the next regression run covers it. It feeds your fine-tunes, so the agent learns from the corrected behavior. And it feeds your RL reward terms, so "do not loop" or "do not leak the system prompt" becomes a gradient, not a hope. The turn that flagged yesterday becomes behavior the main agent learns tomorrow. That closed loop, production label to training signal, is what separates an agent that improves from one that plateaus.
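How a label becomes a reward term depends entirely on your training setup, so the following is only a toy sketch. The penalty weights and the `shaped_reward` function are hypothetical; the label names are the ones used earlier in this section.

```python
# Hypothetical per-turn penalties; real weights would be tuned per product.
PENALTIES = {
    "is_agent_looping": 0.5,
    "policy_violation": 1.0,
}


def shaped_reward(task_reward: float, turn_labels: list) -> float:
    """Subtract a penalty for every flagged turn in the episode."""
    penalty = sum(PENALTIES.get(label, 0.0) for label in turn_labels)
    return task_reward - penalty
```

Under these weights, an episode that completes the task (reward 1.0) but loops on two turns nets 0.0, so a correct final answer no longer masks the looping.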

Because the signal comes back as an API response rather than a dashboard panel, it composes with whatever eval stack you already run. Write the label onto a LangSmith or Braintrust span as an attribute, gate a release on it in CI, send an alert to Slack when it fires, or route on it inline. It complements an eval framework; it does not replace one. To train a custom signal, see the custom Reflexes docs or use the dashboard.
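Gating a release on the label could look like the sketch below: compute the share of evaluated turns carrying a given label and fail the job above a threshold. The 2% threshold, the record shape, and the function names are illustrative assumptions.

```python
def label_rate(turns: list, label: str) -> float:
    """Fraction of turns whose selected label equals `label`."""
    if not turns:
        return 0.0
    return sum(1 for turn in turns if turn["label"] == label) / len(turns)


def gate(turns: list, label: str = "looping", max_rate: float = 0.02) -> int:
    """Return a process exit code: 0 to pass the release, 1 to block it."""
    rate = label_rate(turns, label)
    print(f"{label} rate: {rate:.2%} (limit {max_rate:.2%})")
    return 0 if rate <= max_rate else 1
```

A CI script would pass the return value to `sys.exit`, so a regression in looping behavior fails the pipeline the same way a failing unit test would.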

Frequently Asked Questions

What is AI agent evaluation?

Measuring whether a tool-using LLM agent does its job correctly, across three layers: final-answer (score the last message), trajectory (score the sequence of steps and tool calls), and per-turn (score the meaning of each turn in production). A held-out benchmark covers the first layer; production signal comes from the third. See what agent evaluation covers.

What are the main agent evaluation metrics?

Task completion (success rate, verified on the end state), tool-call accuracy, step and loop count, trajectory match, cost and latency, and groundedness. The metrics section has how each is computed.

What is the difference between offline and online agent evaluation?

Offline runs against a fixed dataset of known tasks in CI and catches regressions reproducibly. Online scores real production traffic and catches drift, novel failures, frustrated users, and jailbreaks. You need both. Full breakdown in offline vs online.

What are the best agent evaluation frameworks in 2026?

LangSmith (trajectory evals, LangChain-native), Braintrust (eval-first scoring), OpenAI Evals (MIT, registry-based), Arize Phoenix (OTel-native, agent function-calling eval), DeepEval (pytest-style, Apache 2.0), and Ragas (reference-free RAG). For benchmarks: tau-bench, tau2-bench, SWE-Bench, AgentBench. The frameworks section has the comparison table.

How do you evaluate a trajectory and not just the final answer?

Capture the full nested span tree of model calls, tool calls, and arguments, then score it: trajectory-match evaluators against a reference, plus span-level scoring for tool selection and step-level faithfulness. Watch tool-call accuracy, step count versus the minimum, loops, and recovery. See the metrics.

Why does LLM-as-judge fail for agent evaluation?

It has length, position, and self-preference biases, it is non-deterministic, and a full model call per judgment is too expensive to run on every production turn. Fine offline on a sample; wrong for per-turn labeling. See LLM-as-judge failure modes.

How do you turn production failures into training signal?

Label every production turn with a per-turn classifier, then route those labels into your eval set, fine-tune data, and RL reward terms. A Morph Reflex returns the label inline in under 90 milliseconds, cheap enough to run on every turn. See the per-turn classifier loop.

Go deeper

Agent observability: tracing the trajectory in production, and where evaluation plugs in
LLM guardrails: turning the same per-turn labels into inline blocks and routes
LLM observability tools: 8 platforms compared, every free tier limit
Build your own observability: OpenTelemetry, ClickHouse, and per-turn labels
Reflex: the per-turn classifier that feeds evals, fine-tunes, and RL reward

Evaluate the turns that pass/fail scoring misses

A final-answer score can be green while the trajectory looped and three turns drifted off policy. Reflexes returns a label on every turn in under 90 milliseconds, and the label feeds your evals, fine-tunes, and RL reward.
