Most teams evaluate an agent by running it against a held-out set of tasks and checking whether the final answer is correct. That number tells you how the agent did on tasks you already knew about. It tells you almost nothing about how the agent behaves on the traffic you have not seen: whether it looped before answering, called the wrong tool and recovered, leaked its reasoning, or left the user frustrated three turns in. This page covers the metrics, the frameworks, and the per-turn labels that catch what offline scoring misses. Framework numbers verified against published pages as of June 2026.
What Agent Evaluation Covers
An agent turns one user request into a sequence: model calls, tool calls, retries, sometimes subagents. Evaluation has to cover three layers, and they answer different questions.
- Final-answer evaluation. Score the last message against an expected result. This is the layer every benchmark and most eval tutorials cover. It is necessary and insufficient: the answer can be right while the path to it was wrong, expensive, or unsafe.
- Trajectory evaluation. Score the sequence of steps that produced the answer. Did the agent call the right tools with the right arguments? How many steps did a 3-step task take? Did it loop? Did it recover after a wrong tool call? A correct final answer reached in 20 steps with two policy-violating intermediate calls is a failing trajectory.
- Per-turn evaluation. Score the meaning of each individual turn as it runs in production. A turn can be a jailbreak attempt, a leaked system prompt, a policy violation, or a moment a user gets frustrated. None of these change the status code, the latency, or the token count. They are invisible to logs and traces.
The trap is treating layer one as the whole job. A held-out pass/fail can be green while the trajectory was a mess and three turns drifted off policy. The signal that improves an agent lives in layers two and three, on real traffic, not on the held-out set.
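The trajectory questions above (right tool, right arguments, did the sequence match a reference) can be sketched in a few lines. This is a minimal illustration and not any framework's API: `ToolCall`, `tool_call_accuracy`, and `trajectory_matches` are hypothetical names, and the `strict` and `superset` modes here are simplified stand-ins for reference-trajectory matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One trajectory step (hypothetical shape, for illustration only)."""
    name: str
    args: tuple = ()  # sorted (key, value) pairs, so calls are hashable

    @staticmethod
    def of(name: str, **kwargs) -> "ToolCall":
        return ToolCall(name, tuple(sorted(kwargs.items())))


def tool_call_accuracy(actual, reference):
    """Positional matches (same tool, same arguments) divided by the longer
    trajectory's length, so extra or missing steps lower the score."""
    if not actual and not reference:
        return 1.0
    hits = sum(1 for a, r in zip(actual, reference) if a == r)
    return hits / max(len(actual), len(reference))


def trajectory_matches(actual, reference, mode="strict"):
    """strict: identical sequence. superset: every reference call appears
    somewhere in the actual trajectory, order ignored (a simplification)."""
    if mode == "strict":
        return list(actual) == list(reference)
    if mode == "superset":
        return set(reference) <= set(actual)
    raise ValueError(f"unknown mode: {mode}")
```

A trajectory that repeats a search before booking fails `strict`, passes `superset`, and scores below 1.0 on accuracy, which is the kind of distinction a final-answer score cannot express.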
The Metrics That Matter
Six metrics carry most of the weight. The first four measure behavior; the last two measure cost and faithfulness.
| Metric | What it measures | How it is computed |
|---|---|---|
| Task completion / success rate | Did the agent reach the correct end state | Verify the resulting database/system state, not just the final text. tau-bench checks both the final answer and the resulting DB state. |
| Tool-call accuracy | Right tool, right arguments | Per-call check of tool selection and argument correctness against a reference or schema (Phoenix's agent function-calling eval, DeepEval span-level). |
| Step / loop count | Efficiency and looping | Steps taken versus minimum needed; flag repeated identical actions. Loops hide inside a correct final answer. |
| Trajectory match | Did the step sequence match a reference | Compare the agent's trajectory against a reference trajectory (AgentEvals create_trajectory_match_evaluator, strict or superset modes). |
| Cost and latency | Tokens, dollars, wall time per task | Per-step attribution so a 20-step loop is visible, not buried in a trace total. |
| Groundedness | Are claims supported by retrieved context | Reference-free RAG metrics: faithfulness, context precision, context recall, answer relevancy (Ragas). |
Two of these are easy to get wrong. Task completion measured on the final text alone passes an agent that says it booked the flight without booking it; tau-bench exists precisely because it verifies the database state instead. Step and loop count is invisible in a trace total: a 20-step agent turn and a 3-step one both return a single answer, so you only see the loop if you attribute cost per step.
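Both pitfalls can be illustrated with a short sketch. The function names, the state dictionaries, and the `(tool_name, args_json, tokens)` step tuple are assumptions made for this example, not the format of any particular tracing tool.

```python
from collections import Counter


def verify_end_state(db_state: dict, expected_state: dict) -> bool:
    """Pass only if every expected key has the expected value in the system
    state. The agent's final text is deliberately not consulted."""
    return all(db_state.get(key) == value for key, value in expected_state.items())


def loop_report(steps, min_steps):
    """Summarize a trajectory given as (tool_name, args_json, tokens) tuples.

    Repeated identical (tool, args) pairs are flagged as likely loops, and
    token counts are kept per step instead of being summed into one total.
    """
    counts = Counter((name, args) for name, args, _ in steps)
    return {
        "steps": len(steps),
        "excess_steps": max(0, len(steps) - min_steps),
        "repeated_actions": {key: n for key, n in counts.items() if n > 1},
        "tokens_by_step": [tokens for _, _, tokens in steps],
    }
```

An agent that replies "your flight is booked" while the bookings table is still empty fails `verify_end_state`, regardless of how convincing the message reads.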
Offline vs Online Evaluation
Offline and online evaluation are not competing methods. They answer different questions, and shipping a reliable agent needs both.
| Dimension | Offline (held-out set) | Online (production traffic) |
|---|---|---|
| Input | Fixed dataset of tasks with known answers | Live requests, the long tail no dataset contains |
| When it runs | CI, pre-release, on every prompt change | Continuously, on real turns as they happen |
| What it catches | Regressions on known cases | Drift, novel failures, frustrated users, jailbreaks |
| Reproducible | Yes, same inputs every run | No, traffic changes; you label and aggregate |
| Scorer | LLM-as-judge, code checks, human review on a sample | Per-turn classifier cheap enough for every turn |
Offline evaluation tells you the agent did not get worse on yesterday's tasks. It is reproducible and it belongs in CI. What it cannot do is anticipate the inputs you have not seen, the multi-turn conversations that drift, or the tool failures that only happen against live systems. Those show up in production, once, and a held-out set written last month does not contain them.
Online evaluation closes that gap by scoring real traffic as it arrives. The catch is the scorer: LLM-as-judge is too slow and too expensive to run on every turn (covered below), so online evaluation at scale needs a classifier that returns a label in milliseconds. The payoff is that every labeled production turn becomes data you can feed back into the offline set, the fine-tune, and the reward function. That loop, not a bigger held-out set, is what moves an agent forward. More on the production-monitoring side in agent observability.
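The feedback step, where labeled production turns join the offline set, might look like the following. This is a sketch under stated assumptions: the turn and case dictionaries, the `min_score` cutoff, and the hash-based ID are all hypothetical choices, not part of any product's schema.

```python
import hashlib


def promote_to_offline_set(offline_cases, labeled_turns, flag_labels, min_score=0.9):
    """Return the offline set extended with flagged production turns.

    A turn is promoted when its label is in flag_labels and its score is at
    least min_score. Turns whose text is already present are skipped.
    """
    seen = {case["id"] for case in offline_cases}
    added = []
    for turn in labeled_turns:
        if turn["label"] not in flag_labels or turn["score"] < min_score:
            continue
        # Content hash as a stable ID, so the same turn is never added twice.
        case_id = hashlib.sha256(turn["text"].encode("utf-8")).hexdigest()[:16]
        if case_id in seen:
            continue
        seen.add(case_id)
        added.append(
            {"id": case_id, "input": turn["text"], "expected_label": turn["label"]}
        )
    return offline_cases + added
```

Deduplicating on content keeps one noisy production incident from flooding the regression set with copies of the same case.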
Frameworks and Benchmarks Compared
Two different things get called "agent evaluation." Frameworks are the tooling you point at your own agent to score it. Benchmarks are fixed task sets with leaderboards that measure how a model or agent does on a standard problem. You use a framework to test your agent; you cite a benchmark to compare models.
Frameworks (point at your own agent)
| Framework | What it scores | License | Free tier |
|---|---|---|---|
| LangSmith | Trajectory evals, LLM-as-judge, heuristic checks, annotation queues; first-party LangChain/LangGraph | Closed source | Developer: 5k traces/mo, 1 seat, 14-day retention |
| Braintrust | Eval-first scoring: LLM-as-judge, autoevals, custom code scorers, human review | Closed source | 1 GB data, 10k scores/mo, 14-day retention, unlimited seats |
| OpenAI Evals | Registry-based, reproducible benchmark-style evals; Completion Function Protocol for tool-using agents | MIT (~18.5k stars) | Open source; hosted platform read-only after Oct 31 2026 |
| Arize Phoenix | OpenTelemetry-native tracing plus evals: hallucination, agent function-calling eval, toxicity, RAG | Elastic License 2.0 (source-available, ~10k+ stars) | Self-host, no event caps |
| DeepEval | Pytest-style local evals; 50+ metrics (G-Eval, task completion, faithfulness); span-level agent scoring | Apache 2.0 (~16.3k stars) | Fully open source; runs locally |
| Ragas | Reference-free RAG metrics: faithfulness, context precision, context recall, answer relevancy | Apache 2.0 | Fully open source |
Benchmarks (standard task sets)
| Benchmark | Domain | What it verifies |
|---|---|---|
| tau-bench | Customer service (retail, airline) | Multi-turn tool use against API tools and policy guidelines; checks the final answer AND the resulting database state, not just tool-call syntax |
| tau2-bench | Customer service + telecom troubleshooting | Extends tau-bench with a dual-control setup and reports pass^k, the metric introduced with tau-bench for reliability across repeated attempts (Sierra Research / Princeton) |
| SWE-Bench | Software engineering | 2,294 real GitHub issues; execution-based, runs the patched repo's tests |
| AgentBench | Multi-environment (OS, DB, web, games) | Agent reasoning and decision-making across distinct interactive environments |
The pattern across the serious benchmarks is execution-based verification. tau-bench checks the database state, SWE-Bench runs the test suite, tau2-bench's pass^k measures whether the agent succeeds reliably across attempts rather than once. A benchmark that only checks tool-call syntax or final text would pass agents that look right and do the wrong thing. The same standard applies to your own framework setup: verify the end state, not the last message.
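To make pass^k concrete: it estimates the probability that an agent succeeds on all of k independent attempts at the same task. A minimal sketch follows, assuming the combinatorial estimator C(c, k) / C(n, k) averaged over tasks, where n is the number of trials per task and c the number of successes. The function name and input layout are hypothetical.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k: the chance that k attempts at a task all succeed.

    successes_per_task: successful trial count for each task (each <= n_trials).
    For a task with c successes, the chance that k trials drawn without
    replacement from its n_trials are all successes is C(c, k) / C(n_trials, k).
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)
```

With two tasks run four times each, one succeeding 4/4 and the other 2/4, the estimate is 0.75 at k=1 and 0.5 at k=4. The gap between those numbers is the reliability signal that a single-attempt success rate hides.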
LLM-as-Judge and Its Failure Modes
LLM-as-judge is the default scorer in LangSmith, Braintrust, Phoenix, and DeepEval. You give a model the agent's output and a rubric, and it returns a score. It is flexible and it works for offline scoring on a sample. It has four failure modes that matter when you lean on it harder than that.
- Length and verbosity bias. Judges systematically rate longer, more detailed answers higher, independent of whether the extra length is correct or relevant.
- Position bias. In pairwise comparisons, the answer shown first or last gets an edge from its position, not its quality. Swapping the order can flip the verdict.
- Self-preference. A judge rates answers that match its own writing style and the model family it comes from higher, regardless of correctness.
- Cost and non-determinism. Each judgment is a full model call, so it is too expensive to run on every production turn, and the same trajectory can score differently across runs.
The practical conclusion: LLM-as-judge is fine for offline scoring on a held-out sample, where cost and variance are bounded and you can ensemble or human-review the edge cases. It is the wrong tool for the per-turn job, where you need a deterministic, cheap label on every turn in production. That is a classification problem, and a classifier is the right shape for it.
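For the offline case, one common mitigation for position bias is to run each pairwise comparison in both orders and keep only verdicts that survive the swap. The sketch below assumes a hypothetical `judge` callable that returns `"first"` or `"second"`; wiring it to a real model is left out.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def debiased_pairwise(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Return "a", "b", or "tie" after judging in both presentation orders.

    A verdict counts only if the same answer wins regardless of position.
    A flip under swapping is treated as a tie and can go to human review.
    """
    forward = judge(prompt, answer_a, answer_b)   # a is shown first
    backward = judge(prompt, answer_b, answer_a)  # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    return winner_forward if winner_forward == winner_backward else "tie"
```

This doubles the judge cost per comparison, which is acceptable on a held-out sample and is one more reason the approach does not extend to every production turn.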
Private beta
Reflexes is currently in private beta. The API below is live in the docs and may change before general availability. See the Reflexes docs and the Reflex product page.
The Per-Turn Classifier Loop
Per-turn evaluation is a classification problem: take one turn, return a label. The labels are the semantic events a trace and a final-answer score both miss: jailbreak_attempt, is_agent_looping, policy_violation, is_user_frustrated, or a signal specific to your product. A classifier scores these in one forward pass, which is what makes running it on every turn affordable.
A Morph Reflex is that classifier. It returns a label as an API response in under 90 milliseconds end-to-end, in one forward pass, with up to 64k context. You can train a custom evaluator in under an hour from a prompt, a labeled dataset, or synthetic data generation; training runs on the base model morph-reflex-v1 over an OpenAI-fine-tuning-compatible API. The label can then log to your eval set, fire an alert, or route inline.
Score a single agent turn for looping
curl -X POST "https://api.morphllm.com/v1/reflex/predict" \
-H "Authorization: Bearer $MORPH_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "agent-looping", "text": "<the agent turn to score>"}'
# {
# "model": "agent-looping",
# "mode": "single_label",
# "classes": [
# { "class_id": 0, "label": "looping", "score": 0.94, "selected": true },
# { "class_id": 1, "label": "progressing", "score": 0.06, "selected": false }
# ],
# "inference_time_ms": 11
# }
The reason this matters for evaluation, not just monitoring: every Reflex output feeds three things at once. It feeds your evals, so the labeled turn joins the offline set and the next regression run covers it. It feeds your fine-tunes, so the agent learns from the corrected behavior. And it feeds your RL reward terms, so "do not loop" or "do not leak the system prompt" becomes a gradient, not a hope. The turn that flagged yesterday becomes behavior the main agent learns tomorrow. That closed loop, production label to training signal, is what separates an agent that improves from one that plateaus.
Because the signal comes back as an API response rather than a dashboard panel, it composes with whatever eval stack you already run. Write the label onto a LangSmith or Braintrust span as an attribute, gate a release on it in CI, send an alert to Slack, or route on it inline. It complements an eval framework; it does not replace one. To train a custom signal, see the custom Reflexes docs or use the dashboard.
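Consuming the label in code could look like this. It is a sketch that assumes the response always has the shape of the example response shown earlier (a `classes` list with `label`, `score`, and `selected` fields); the `decide_action` helper, its action names, and the 0.9 threshold are hypothetical.

```python
def selected_label(response: dict) -> tuple[str, float]:
    """Return (label, score) for the class the classifier selected."""
    for cls in response["classes"]:
        if cls["selected"]:
            return cls["label"], cls["score"]
    raise ValueError("response has no selected class")


def decide_action(response: dict, alert_labels: set, threshold: float = 0.9) -> str:
    """Map one labeled turn to an action: "alert" for a confident flagged
    label, otherwise "log" so the turn is still recorded for the eval set."""
    label, score = selected_label(response)
    if label in alert_labels and score >= threshold:
        return "alert"
    return "log"


# Response copied from the example above.
example = {
    "model": "agent-looping",
    "mode": "single_label",
    "classes": [
        {"class_id": 0, "label": "looping", "score": 0.94, "selected": True},
        {"class_id": 1, "label": "progressing", "score": 0.06, "selected": False},
    ],
    "inference_time_ms": 11,
}
```

Here `decide_action(example, {"looping"})` returns `"alert"`. The same function can back a CI gate by failing the build when any turn in a test run maps to `"alert"`.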
Frequently Asked Questions
What is AI agent evaluation?
Measuring whether a tool-using LLM agent does its job correctly, across three layers: final-answer (score the last message), trajectory (score the sequence of steps and tool calls), and per-turn (score the meaning of each turn in production). A held-out benchmark covers the first layer; production signal comes from the third. See what agent evaluation covers.
What are the main agent evaluation metrics?
Task completion (success rate, verified on the end state), tool-call accuracy, step and loop count, trajectory match, cost and latency, and groundedness. The metrics section has how each is computed.
What is the difference between offline and online agent evaluation?
Offline runs against a fixed dataset of known tasks in CI and catches regressions reproducibly. Online scores real production traffic and catches drift, novel failures, frustrated users, and jailbreaks. You need both. Full breakdown in offline vs online.
What are the best agent evaluation frameworks in 2026?
LangSmith (trajectory evals, LangChain-native), Braintrust (eval-first scoring), OpenAI Evals (MIT, registry-based), Arize Phoenix (OTel-native, agent function-calling eval), DeepEval (pytest-style, Apache 2.0), and Ragas (reference-free RAG). For benchmarks: tau-bench, tau2-bench, SWE-Bench, AgentBench. The frameworks section has the comparison table.
How do you evaluate a trajectory and not just the final answer?
Capture the full nested span tree of model calls, tool calls, and arguments, then score it: trajectory-match evaluators against a reference, plus span-level scoring for tool selection and step-level faithfulness. Watch tool-call accuracy, step count versus the minimum, loops, and recovery. See the metrics.
Why does LLM-as-judge fail for agent evaluation?
It has length, position, and self-preference biases, it is non-deterministic, and a full model call per judgment is too expensive to run on every production turn. Fine offline on a sample; wrong for per-turn labeling. See LLM-as-judge failure modes.
How do you turn production failures into training signal?
Label every production turn with a per-turn classifier, then route those labels into your eval set, fine-tune data, and RL reward terms. A Morph Reflex returns the label inline in under 90 milliseconds, cheap enough to run on every turn. See the per-turn classifier loop.
Go deeper
- Agent observability: tracing the trajectory in production, and where evaluation plugs in
- LLM guardrails: turning the same per-turn labels into inline blocks and routes
- LLM observability tools: 8 platforms compared, every free tier limit
- Build your own observability: OpenTelemetry, ClickHouse, and per-turn labels
- Reflex: the per-turn classifier that feeds evals, fine-tunes, and RL reward
