AI Agent Evaluation (2026): The Metrics, the Frameworks, and Why Offline Evals Miss Production Failures

Most agent evaluation stops at a held-out benchmark and a final-answer pass/fail. That misses trajectory quality, tool-call correctness, looping, and recovery. This page covers the metrics that matter, a comparison of frameworks (LangSmith, Braintrust, OpenAI Evals, Phoenix, DeepEval, Ragas) and benchmarks (tau-bench, SWE-Bench), LLM-as-judge failure modes, and the per-turn labels that feed evals, fine-tunes, and RL reward.

June 20, 2026

Most teams evaluate an agent by running it against a held-out set of tasks and checking whether the final answer is correct. That number tells you how the agent did on tasks you already knew about. It tells you almost nothing about how the agent behaves on the traffic you have not seen: whether it looped before answering, called the wrong tool and recovered, leaked its reasoning, or left the user frustrated three turns in. This page covers the metrics, the frameworks, and the per-turn labels that catch what offline scoring misses. Framework numbers verified against published pages as of June 2026.

  • 3 layers: final-answer, trajectory, per-turn
  • < 90 ms: per-turn classifier latency, one forward pass
  • 2,294 issues: SWE-Bench agent task set

What Agent Evaluation Covers

An agent turns one user request into a sequence: model calls, tool calls, retries, sometimes subagents. Evaluation has to cover three layers, and they answer different questions.

  • Final-answer evaluation. Score the last message against an expected result. This is the layer every benchmark and most eval tutorials cover. It is necessary and insufficient: the answer can be right while the path to it was wrong, expensive, or unsafe.
  • Trajectory evaluation. Score the sequence of steps that produced the answer. Did the agent call the right tools with the right arguments? How many steps did a 3-step task take? Did it loop? Did it recover after a wrong tool call? A correct final answer reached in 20 steps with two policy-violating intermediate calls is a failing trajectory.
  • Per-turn evaluation. Score the meaning of each individual turn as it runs in production. A turn can be a jailbreak attempt, a leaked system prompt, a policy violation, or a moment a user gets frustrated. None of these change the status code, the latency, or the token count. They are invisible to logs and traces.

The trap is treating layer one as the whole job. A held-out pass/fail can be green while the trajectory was a mess and three turns drifted off policy. The signal that improves an agent lives in layers two and three, on real traffic, not on the held-out set.
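
To make the gap between the layers concrete, here is a minimal sketch in Python. The `Step` and `Trajectory` types, the `policy_violation` flag, and the step budget are all hypothetical names chosen for illustration; they are not part of any framework discussed on this page.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    policy_violation: bool = False  # hypothetical per-turn label


@dataclass
class Trajectory:
    steps: list[Step]
    final_answer: str


def final_answer_pass(traj: Trajectory, expected: str) -> bool:
    """Layer one: compare only the last message to the expected result."""
    return traj.final_answer.strip() == expected.strip()


def trajectory_pass(traj: Trajectory, max_steps: int) -> bool:
    """Layers two and three: enforce a step budget and per-turn policy labels."""
    within_budget = len(traj.steps) <= max_steps
    no_violations = not any(step.policy_violation for step in traj.steps)
    return within_budget and no_violations


# A short task that took 20 steps, two of them off policy.
steps = [Step("search") for _ in range(18)]
steps += [Step("refund", policy_violation=True), Step("refund", policy_violation=True)]
run = Trajectory(steps=steps, final_answer="Refund issued")

print(final_answer_pass(run, "Refund issued"))  # True
print(trajectory_pass(run, max_steps=6))        # False
```

The same run is green on layer one and red on the other two, which is the situation a held-out pass/fail cannot show.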

The Metrics That Matter

Six metrics carry most of the weight. The first four measure behavior; the last two measure cost and faithfulness.

| Metric | What it measures | How it is computed |
|---|---|---|
| Task completion / success rate | Did the agent reach the correct end state | Verify the resulting database/system state, not just the final text. tau-bench checks both the final answer and the resulting DB state. |
| Tool-call accuracy | Right tool, right arguments | Per-call check of tool selection and argument correctness against a reference or schema (Phoenix's agent function-calling eval, DeepEval span-level). |
| Step / loop count | Efficiency and looping | Steps taken versus minimum needed; flag repeated identical actions. Loops hide inside a correct final answer. |
| Trajectory match | Did the step sequence match a reference | Compare the agent's trajectory against a reference trajectory (AgentEvals create_trajectory_match_evaluator, strict or superset modes). |
| Cost and latency | Tokens, dollars, wall time per task | Per-step attribution so a 20-step loop is visible, not buried in a trace total. |
| Groundedness | Are claims supported by retrieved context | Reference-free RAG metrics: faithfulness, context precision, context recall, answer relevancy (Ragas). |
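
Tool-call accuracy and trajectory match can be sketched in a few lines of Python. This is a simplified illustration that compares tool names and arguments directly. It is not the AgentEvals, Phoenix, or DeepEval implementation, and the function names, the call format (`{"name": ..., "args": ...}`), and the positional matching rule are assumptions made for the example.

```python
from collections import Counter
from typing import Any

ToolCall = dict[str, Any]  # assumed shape: {"name": str, "args": dict}


def tool_call_accuracy(actual: list[ToolCall], reference: list[ToolCall]) -> float:
    """Fraction of reference calls matched position by position on name and args."""
    if not reference:
        return 1.0
    hits = sum(
        1
        for got, want in zip(actual, reference)
        if got["name"] == want["name"] and got["args"] == want["args"]
    )
    return hits / len(reference)


def trajectory_match(actual: list[ToolCall], reference: list[ToolCall], mode: str = "strict") -> bool:
    """Compare tool-name sequences: exact order ("strict") or containment ("superset")."""
    got = [call["name"] for call in actual]
    want = [call["name"] for call in reference]
    if mode == "strict":
        return got == want
    if mode == "superset":
        have, need = Counter(got), Counter(want)
        return all(have[name] >= count for name, count in need.items())
    raise ValueError(f"unknown mode: {mode}")


reference = [
    {"name": "lookup_order", "args": {"order_id": "A1"}},
    {"name": "issue_refund", "args": {"order_id": "A1", "amount": 20}},
]
actual = [
    {"name": "lookup_order", "args": {"order_id": "A1"}},
    {"name": "get_policy", "args": {}},  # extra step not in the reference
    {"name": "issue_refund", "args": {"order_id": "A1", "amount": 20}},
]

print(tool_call_accuracy(actual, reference))            # 0.5
print(trajectory_match(actual, reference, "strict"))    # False
print(trajectory_match(actual, reference, "superset"))  # True
```

Positional matching is deliberately strict: one extra call shifts everything after it. Whether that is the right penalty depends on whether extra steps are acceptable for your agent, which is the same choice the strict and superset modes express.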

Two of these are easy to get wrong. Task completion measured on the final text alone passes an agent that says it booked the flight without booking it; tau-bench exists precisely because it verifies the database state instead. Step and loop count is invisible in a trace total: a 20-step agent turn and a 3-step one both return a single answer, so you only see the loop if you attribute cost per step.
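
Both mistakes are easy to see in code. The sketch below assumes a toy in-memory `db` dictionary and a step format with `tool`, `args`, and `tokens` keys; those shapes and the function names are hypothetical, and a real check would query your actual system of record.

```python
def booking_completed(db: dict, user_id: str, flight: str) -> bool:
    """Check the resulting state, not the agent's claim about it."""
    return flight in db.get("bookings", {}).get(user_id, [])


def longest_repeat_run(steps: list[dict]) -> int:
    """Length of the longest run of consecutive identical (tool, args) actions."""
    longest = run = 0
    previous = None
    for step in steps:
        key = (step["tool"], repr(sorted(step["args"].items())))
        run = run + 1 if key == previous else 1
        longest = max(longest, run)
        previous = key
    return longest


def tokens_by_tool(steps: list[dict]) -> dict[str, int]:
    """Attribute token cost per tool so a loop is visible in the breakdown."""
    totals: dict[str, int] = {}
    for step in steps:
        totals[step["tool"]] = totals.get(step["tool"], 0) + step["tokens"]
    return totals


# The agent says it booked the flight, but the database has no booking.
db = {"bookings": {"u1": []}}
final_text = "Your flight UA100 is booked."
print(booking_completed(db, "u1", "UA100"))  # False

# Four identical searches before one booking call.
steps = [{"tool": "search_flights", "args": {"dest": "SFO"}, "tokens": 400} for _ in range(4)]
steps.append({"tool": "book_flight", "args": {"flight": "UA100"}, "tokens": 300})
print(longest_repeat_run(steps))  # 4
print(tokens_by_tool(steps))      # {'search_flights': 1600, 'book_flight': 300}
```

A trace total would report 1,900 tokens and one answer. The per-tool breakdown shows that most of the spend went to a repeated search.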

Offline vs Online Evaluation

Offline and online evaluation are not competing methods. They answer different questions, and shipping a reliable agent needs both.

| Dimension | Offline (held-out set) | Online (production traffic) |
|---|---|---|
| Input | Fixed dataset of tasks with known answers | Live requests, the long tail no dataset contains |
| When it runs | CI, pre-release, on every prompt change | Continuously, on real turns as they happen |
| What it catches | Regressions on known cases | Drift, novel failures, frustrated users, jailbreaks |
| Reproducible | Yes, same inputs every run | No, traffic changes; you label and aggregate |
| Scorer | LLM-as-judge, code checks, human review on a sample | Per-turn classifier cheap enough for every turn |

Offline evaluation tells you the agent did not get worse on yesterday's tasks. It is reproducible and it belongs in CI. What it cannot do is anticipate the inputs you have not seen, the multi-turn conversations that drift, or the tool failures that only happen against live systems. Those show up in production, once, and a held-out set written last month does not contain them.

Online evaluation closes that gap by scoring real traffic as it arrives. The catch is the scorer: LLM-as-judge is too slow and too expensive to run on every turn (covered below), so online evaluation at scale needs a classifier that returns a label in milliseconds. The payoff is that every labeled production turn becomes data you can feed back into the offline set, the fine-tune, and the reward function. That loop, not a bigger held-out set, is what moves an agent forward. More on the production-monitoring side in agent observability.
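
One way to wire that feedback loop, sketched under assumptions: labeled production turns arrive as dictionaries with `text`, `label`, and `score` keys, the label `"ok"` means nothing was flagged, and the offline set is a list of cases keyed by a content hash. All of these names and the 0.9 threshold are hypothetical.

```python
import hashlib


def promote_flagged_turns(labeled_turns: list[dict], offline_set: list[dict], min_score: float = 0.9) -> int:
    """Append confidently flagged production turns to the offline set, skipping duplicates."""
    seen = {case["id"] for case in offline_set}
    added = 0
    for turn in labeled_turns:
        if turn["label"] == "ok" or turn["score"] < min_score:
            continue  # unflagged or low-confidence turns stay out of the regression set
        case_id = hashlib.sha256(turn["text"].encode("utf-8")).hexdigest()[:16]
        if case_id in seen:
            continue
        offline_set.append({"id": case_id, "input": turn["text"], "expected_label": turn["label"]})
        seen.add(case_id)
        added += 1
    return added


offline_set: list[dict] = []
production = [
    {"text": "search again... search again...", "label": "looping", "score": 0.95},
    {"text": "search again... search again...", "label": "looping", "score": 0.97},  # duplicate text
    {"text": "Here is your order status.", "label": "ok", "score": 0.99},
    {"text": "retrying lookup", "label": "looping", "score": 0.50},  # below threshold
]
print(promote_flagged_turns(production, offline_set))  # 1
```

In practice you would likely add a human review step before a production turn becomes a regression case. The sketch only shows the mechanics of deduplicating and appending.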

Frameworks and Benchmarks Compared

Two different things get called "agent evaluation." Frameworks are the tooling you point at your own agent to score it. Benchmarks are fixed task sets with leaderboards that measure how a model or agent does on a standard problem. You use a framework to test your agent; you cite a benchmark to compare models.

Frameworks (point at your own agent)

| Framework | What it scores | License | Free tier |
|---|---|---|---|
| LangSmith | Trajectory evals, LLM-as-judge, heuristic checks, annotation queues; first-party LangChain/LangGraph | Closed source | Developer: 5k traces/mo, 1 seat, 14-day retention |
| Braintrust | Eval-first scoring: LLM-as-judge, autoevals, custom code scorers, human review | Closed source | 1 GB data, 10k scores/mo, 14-day retention, unlimited seats |
| OpenAI Evals | Registry-based, reproducible benchmark-style evals; Completion Function Protocol for tool-using agents | MIT (~18.5k stars) | Open source; hosted platform read-only after Oct 31 2026 |
| Arize Phoenix | OpenTelemetry-native tracing plus evals: hallucination, agent function-calling eval, toxicity, RAG | Elastic License 2.0 (source-available, ~10k+ stars) | Self-host, no event caps |
| DeepEval | Pytest-style local evals; 50+ metrics (G-Eval, task completion, faithfulness); span-level agent scoring | Apache 2.0 (~16.3k stars) | Fully open source; runs locally |
| Ragas | Reference-free RAG metrics: faithfulness, context precision, context recall, answer relevancy | Apache 2.0 | Fully open source |

Benchmarks (standard task sets)

| Benchmark | Domain | What it verifies |
|---|---|---|
| tau-bench | Customer service (retail, airline) | Multi-turn tool use against API tools and policy guidelines; checks the final answer AND the resulting database state, not just tool-call syntax |
| tau2-bench | Customer service + telecom troubleshooting | Extends tau-bench with a dual-control setup; reports the pass^k metric, which measures reliability across repeated attempts (Sierra Research / Princeton) |
| SWE-Bench | Software engineering | 2,294 real GitHub issues; execution-based, runs the patched repo's tests |
| AgentBench | Multi-environment (OS, DB, web, games) | Agent reasoning and decision-making across distinct interactive environments |

The pattern across the serious benchmarks is execution-based verification. tau-bench checks the database state, SWE-Bench runs the test suite, tau2-bench's pass^k measures whether the agent succeeds reliably across attempts rather than once. A benchmark that only checks tool-call syntax or final text would pass agents that look right and do the wrong thing. The same standard applies to your own framework setup: verify the end state, not the last message.
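
The pass^k idea is easy to compute yourself. The sketch below uses the estimator commonly cited for it: run a task n times, count c successes, and estimate the chance that k independent attempts all succeed as C(c, k) / C(n, k). The familiar pass@k (at least one of k succeeds) is included for contrast. The function names are made up for this example, and you should check the benchmark's own code before comparing numbers against a leaderboard.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts succeed) from c successes in n trials."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    return comb(c, k) / comb(n, k)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts succeeds) from c successes in n trials."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# An agent that solves a task in 6 of 8 trials.
print(pass_hat_k(8, 6, 1))            # 0.75
print(round(pass_hat_k(8, 6, 4), 3))  # 0.214
print(pass_at_k(8, 6, 4))             # 1.0
```

The same agent looks perfect under pass@4 and succeeds on all four attempts only about a fifth of the time, which is why a reliability metric is the stricter standard for agents that act on live systems.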

LLM-as-Judge and Its Failure Modes

LLM-as-judge is the default scorer in LangSmith, Braintrust, Phoenix, and DeepEval. You give a model the agent's output and a rubric, and it returns a score. It is flexible and it works for offline scoring on a sample. It has four failure modes that matter when you lean on it harder than that.

  • Length and verbosity bias. Judges systematically rate longer, more detailed answers higher, independent of whether the extra length is correct or relevant.
  • Position bias. In pairwise comparisons, the answer shown first or last gets an edge from its position, not its quality. Swapping the order can flip the verdict.
  • Self-preference. A judge rates answers higher when they match its own writing style or come from its own model family, regardless of correctness.
  • Cost and non-determinism. Each judgment is a full model call, so it is too expensive to run on every production turn, and the same trajectory can score differently across runs.

The practical conclusion: LLM-as-judge is fine for offline scoring on a held-out sample, where cost and variance are bounded and you can ensemble or human-review the edge cases. It is the wrong tool for the per-turn job, where you need a deterministic, cheap label on every turn in production. That is a classification problem, and a classifier is the right shape for it.
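
For offline pairwise judging, a common mitigation for position bias is to ask the judge twice with the order swapped and accept a verdict only when both orders agree. A minimal sketch follows. The `Judge` callable signature and the two stub judges are hypothetical stand-ins for a real model call.

```python
from typing import Callable

# Assumed interface: (question, first_answer, second_answer) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, question: str, a: str, b: str) -> str:
    """Ask in both orders; return "a", "b", or "inconsistent" if the verdict flips."""
    verdict_ab = judge(question, a, b)
    verdict_ba = judge(question, b, a)
    winner_ab = "a" if verdict_ab == "first" else "b"
    winner_ba = "b" if verdict_ba == "first" else "a"
    return winner_ab if winner_ab == winner_ba else "inconsistent"


def always_first(question: str, first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Stub judge with pure length bias."""
    return "first" if len(first) >= len(second) else "second"


q = "Which answer is correct?"
print(debiased_preference(always_first, q, "short", "a much longer answer"))    # inconsistent
print(debiased_preference(prefers_longer, q, "short", "a much longer answer"))  # b
```

The swap exposes the position-biased judge but not the length-biased one, which picks the longer answer in both orders. Order swapping handles one failure mode; the others need their own controls, such as length-normalized rubrics or human review of a sample.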

Private beta

Reflexes is currently in private beta. The API below is live in the docs and may change before general availability. See the Reflexes docs and the Reflex product page.

The Per-Turn Classifier Loop

Per-turn evaluation is a classification problem: take one turn, return a label. The labels are the semantic events a trace and a final-answer score both miss: jailbreak_attempt, is_agent_looping, policy_violation, is_user_frustrated, or a signal specific to your product. A classifier scores these in one forward pass, which is what makes running it on every turn affordable.

A Morph Reflex is that classifier. It returns a label as an API response in under 90 milliseconds end-to-end, in one forward pass, with up to 64k context. You can train a custom evaluator in under an hour from a prompt, a labeled dataset, or synthetic data generation. Training runs on the base model morph-reflex-v1 over an API that is OpenAI-fine-tuning-compatible. The label can be logged to your eval set, fire an alert, or route a request inline.

Score a single agent turn for looping

curl -X POST "https://api.morphllm.com/v1/reflex/predict" \
  -H "Authorization: Bearer $MORPH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "agent-looping", "text": "<the agent turn to score>"}'

# {
#   "model": "agent-looping",
#   "mode": "single_label",
#   "classes": [
#     { "class_id": 0, "label": "looping", "score": 0.94, "selected": true },
#     { "class_id": 1, "label": "progressing", "score": 0.06, "selected": false }
#   ],
#   "inference_time_ms": 11
# }
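
The same request can be built from Python with the standard library. This sketch uses only the endpoint, headers, request fields, and response shape shown in the curl example above. The helper names are made up for illustration, and because the API is in private beta the response shape may change.

```python
import json
import urllib.request

REFLEX_URL = "https://api.morphllm.com/v1/reflex/predict"


def build_predict_request(model: str, text: str, api_key: str) -> urllib.request.Request:
    """Build the POST request from the curl example without sending it."""
    body = json.dumps({"model": model, "text": text}).encode("utf-8")
    return urllib.request.Request(
        REFLEX_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def selected_label(response: dict) -> tuple[str, float]:
    """Return the (label, score) of the class marked as selected."""
    for cls in response["classes"]:
        if cls["selected"]:
            return cls["label"], cls["score"]
    raise ValueError("no class marked selected")


# Parse the sample response shown above.
sample = {
    "model": "agent-looping",
    "mode": "single_label",
    "classes": [
        {"class_id": 0, "label": "looping", "score": 0.94, "selected": True},
        {"class_id": 1, "label": "progressing", "score": 0.06, "selected": False},
    ],
    "inference_time_ms": 11,
}
print(selected_label(sample))  # ('looping', 0.94)

# To call the live API (needs MORPH_API_KEY and network access):
# import os
# req = build_predict_request("agent-looping", "<the agent turn to score>", os.environ["MORPH_API_KEY"])
# with urllib.request.urlopen(req, timeout=5) as resp:
#     print(selected_label(json.load(resp)))
```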

The reason this matters for evaluation, not just monitoring: every Reflex output feeds three things at once. It feeds your evals, so the labeled turn joins the offline set and the next regression run covers it. It feeds your fine-tunes, so the agent learns from the corrected behavior. And it feeds your RL reward terms, so "do not loop" or "do not leak the system prompt" becomes a gradient, not a hope. The turn that was flagged yesterday becomes behavior the main agent learns tomorrow. That closed loop, production label to training signal, is what separates an agent that improves from one that plateaus.
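
As an illustration of the reward-term idea, per-turn labels can be folded into a scalar episode reward. The label names beyond those shown in the sample response, the penalty weights, and the function name below are all hypothetical. Real reward shaping needs tuning against your own training setup.

```python
# Hypothetical penalty per flagged turn; labels not listed here cost nothing.
PENALTIES = {
    "looping": 0.5,
    "policy_violation": 1.0,
    "system_prompt_leak": 1.0,
}


def episode_reward(task_success: bool, turn_labels: list[str]) -> float:
    """Task reward minus a penalty for each flagged turn in the episode."""
    reward = 1.0 if task_success else 0.0
    for label in turn_labels:
        reward -= PENALTIES.get(label, 0.0)
    return reward


print(episode_reward(True, ["progressing", "progressing"]))         # 1.0
print(episode_reward(True, ["progressing", "looping", "looping"]))  # 0.0
```

Two episodes that both complete the task now get different rewards, so the training signal distinguishes the clean path from the one that looped.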

Because the signal comes back as an API response rather than a dashboard panel, it composes with whatever eval stack you already run. Write the label onto a LangSmith or Braintrust span as an attribute, gate a release on it in CI, send an alert on it to Slack, or route on it inline. It complements an eval framework; it does not replace one. To train a custom signal, see the custom Reflexes docs or use the dashboard.

Frequently Asked Questions

What is AI agent evaluation?

Measuring whether a tool-using LLM agent does its job correctly, across three layers: final-answer (score the last message), trajectory (score the sequence of steps and tool calls), and per-turn (score the meaning of each turn in production). A held-out benchmark covers the first layer; production signal comes from the third. See what agent evaluation covers.

What are the main agent evaluation metrics?

Task completion (success rate, verified on the end state), tool-call accuracy, step and loop count, trajectory match, cost and latency, and groundedness. The metrics section explains how each is computed.

What is the difference between offline and online agent evaluation?

Offline runs against a fixed dataset of known tasks in CI and catches regressions reproducibly. Online scores real production traffic and catches drift, novel failures, frustrated users, and jailbreaks. You need both. Full breakdown in offline vs online.

What are the best agent evaluation frameworks in 2026?

LangSmith (trajectory evals, LangChain-native), Braintrust (eval-first scoring), OpenAI Evals (MIT, registry-based), Arize Phoenix (OTel-native, agent function-calling eval), DeepEval (pytest-style, Apache 2.0), and Ragas (reference-free RAG). For benchmarks: tau-bench, tau2-bench, SWE-Bench, AgentBench. The frameworks section has the comparison table.

How do you evaluate a trajectory and not just the final answer?

Capture the full nested span tree of model calls, tool calls, and arguments, then score it: trajectory-match evaluators against a reference, plus span-level scoring for tool selection and step-level faithfulness. Watch tool-call accuracy, step count versus the minimum, loops, and recovery. See the metrics.

Why does LLM-as-judge fail for agent evaluation?

It has length, position, and self-preference biases, it is non-deterministic, and a full model call per judgment is too expensive to run on every production turn. Fine offline on a sample; wrong for per-turn labeling. See LLM-as-judge failure modes.

How do you turn production failures into training signal?

Label every production turn with a per-turn classifier, then route those labels into your eval set, fine-tune data, and RL reward terms. A Morph Reflex returns the label inline in under 90 milliseconds, cheap enough to run on every turn. See the per-turn classifier loop.
