Terminal-Bench 2.0 Leaderboard (2026): 89 Tasks No Model Can Finish

Live Terminal-Bench 2.0 rankings: Gemini 3.1 Pro leads at 78.4%, GPT-5.3-Codex at 77.3%, Claude Opus 4.6 at 74.7%. 89 tasks across software engineering, security, biology, and gaming. Full leaderboard and analysis.

March 9, 2026 · 2 min read

Gemini 3.1 Pro scores 78.4% on Terminal-Bench 2.0. GPT-5.3-Codex scores 77.3%. Claude Opus 4.6 scores 74.7%. The benchmark measures something SWE-Bench does not: whether an AI agent can operate inside a real terminal, not just generate patches.

Terminal-Bench 2.0 contains 89 tasks across software engineering, biology, security, and gaming. Each task runs in a Docker container with a unique environment, human-written solution, and automated verification. Below are the latest rankings from tbench.ai, followed by analysis of what separates the top agents from the rest.

Leaderboard data verified March 9, 2026

Terminal-Bench 2.0: Top 10 Submissions

Agent + model combinations, 89 tasks, Docker-containerized

1. Forge Code + Gemini 3.1 Pro: 78.4%
2. Droid + GPT-5.3-Codex: 77.3%
3. Simple Codex + GPT-5.3-Codex: 75.1%
4. Terminus-KIRA + Gemini 3.1 Pro: 74.8%
5. Terminus-KIRA + Claude Opus 4.6: 74.7%
6. Mux + GPT-5.3-Codex: 74.6%
7. OB-1 (multi-model): 72.4%
8. TongAgents + Claude Opus 4.6: 71.9%
9. Junie CLI (multi-model): 71.0%
10. CodeBrain-1 + GPT-5.3-Codex: 70.3%

Source: tbench.ai leaderboard. Scores are pass@1 averaged over multiple runs.

89 tasks, manually verified · 93 open-source contributors · 78.4% top score (March 2026) · 3.1% lowest score (GPT-OSS-20B)

Terminal-Bench 2.0 Leaderboard (Top 10 and Bottom 3)

The leaderboard at tbench.ai has 115 submissions. Each entry pairs an agent framework with a model. The same model can appear multiple times with different agents, making scaffolding quality visible in the data.

| Rank | Agent + Model | Score | Margin |
|---|---|---|---|
| 1 | Forge Code + Gemini 3.1 Pro | 78.4% | ±1.8 |
| 2 | Droid + GPT-5.3-Codex | 77.3% | ±2.2 |
| 3 | Simple Codex + GPT-5.3-Codex | 75.1% | ±2.4 |
| 4 | Terminus-KIRA + Gemini 3.1 Pro | 74.8% | ±2.6 |
| 5 | Terminus-KIRA + Claude Opus 4.6 | 74.7% | ±2.6 |
| 6 | Mux + GPT-5.3-Codex | 74.6% | ±2.5 |
| 7 | OB-1 (multi-model) | 72.4% | ±2.3 |
| 8 | TongAgents + Claude Opus 4.6 | 71.9% | ±2.7 |
| 9 | Junie CLI (multi-model) | 71.0% | ±2.9 |
| 10 | CodeBrain-1 + GPT-5.3-Codex | 70.3% | ±2.6 |
| ... | ... | ... | ... |
| 113 | Mini-SWE-Agent + GPT-5-Nano | 7.0% | ±1.9 |
| 114 | Mini-SWE-Agent + GPT-OSS-20B | 3.4% | ±1.4 |
| 115 | Terminus 2 + GPT-OSS-20B | 3.1% | ±1.5 |

Source: tbench.ai. Scores are percentage of 89 tasks passed. Margin is 95% confidence interval. Full leaderboard has 115 entries.

Agent Framework Impact

The same model scores differently depending on the agent wrapping it. This is the most important pattern in the data: scaffolding quality is worth as much as 7 points on top of raw model capability.

Same Model, Different Agent

GPT-5.3-Codex across 4 agent frameworks

1. Droid (Factory): 77.3%
2. Simple Codex (OpenAI): 75.1%
3. Mux (Coder): 74.6%
4. CodeBrain-1 (Feeling AI): 70.3%

7-point spread from the same model. Scaffolding is not a rounding error.

Factory's Droid agent beats OpenAI's own Simple Codex agent by 2.2 points using the same GPT-5.3-Codex model. CodeBrain-1 trails by 7 points. The pattern holds across models: KRAFTON AI's Terminus-KIRA agent scores 74.7% with Opus 4.6, while Bigai's TongAgents scores 71.9% with the same model.

Warp demonstrated this most dramatically on Terminal-Bench v0.1.1. Their agent scored 52%, over 20 points ahead of the next competitor, using Claude Sonnet 4 for execution and Claude Opus 4 for planning. The key features: long-running command control (agents could read output from interactive sessions like REPLs and vim), updatable todo lists for plan tracking, and a fallback mechanism using Gemini 2.5 Pro for failed requests.

What makes a good agent scaffold?

The top-scoring agents share three patterns: long-running command management (not just one-shot bash calls), explicit planning steps before execution, and fallback mechanisms for flaky tool calls. Terminal tasks require sustained interaction with the environment, not just code generation. An agent that can manage an interactive vim session or monitor a long-running compilation outperforms one that treats every command as fire-and-forget.
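
The long-running command pattern is the easiest to get wrong. As an illustration only, not any leaderboard agent's actual implementation, here is a minimal persistent-shell sketch: the agent keeps one `/bin/sh` process alive across turns and reads output up to a sentinel, so session state (working directory, environment variables, background jobs) survives between calls instead of being discarded after every one-shot `bash -c` invocation.

```python
import subprocess

class ShellSession:
    """Persistent shell for an agent: state carries over between
    commands, unlike fire-and-forget one-shot invocations."""

    SENTINEL = "__CMD_DONE__"

    def __init__(self):
        self.proc = subprocess.Popen(
            ["/bin/sh"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )

    def run(self, command: str) -> str:
        # Echo a sentinel after the command so we know where its
        # output ends. A production scaffold would also add timeouts
        # and a way to stream partial output back to the model.
        self.proc.stdin.write(f"{command}\necho {self.SENTINEL}\n")
        self.proc.stdin.flush()
        lines = []
        while True:
            line = self.proc.stdout.readline()
            if not line or line.strip() == self.SENTINEL:
                break
            lines.append(line)
        return "".join(lines)

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()
```

Because the process persists, a `cd` in one turn affects the next turn's `pwd`, which is exactly the kind of environment continuity one-shot command execution loses.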

Terminal-Bench vs SWE-Bench

SWE-Bench and Terminal-Bench measure different skills. SWE-Bench asks a model to generate a patch file that resolves a GitHub issue. Terminal-Bench asks a model to operate a computer through its terminal. The tasks are complementary, not competing.

| Dimension | SWE-Bench (Verified) | Terminal-Bench 2.0 |
|---|---|---|
| What it measures | Patch generation for GitHub issues | Shell fluency in real environments |
| Task format | Generate a diff/patch file | Interactive terminal session in Docker |
| Tasks | 500 (Verified) / 1,865 (Pro) | 89 |
| Languages | Python-only (Verified) | Language-agnostic |
| Domains | Software engineering | SWE, security, biology, gaming |
| Top score | 80.9% (Verified) | 78.4% |
| Contributors | Princeton NLP | 93 open-source contributors |
| Contamination risk | Confirmed (Verified) | Low (canary strings, manual review) |

A model that generates clean patches may fail at Terminal-Bench tasks because it cannot navigate an unfamiliar filesystem, debug a compilation error, or manage a long-running process. Terminal-Bench tasks require the agent to learn about its environment before solving the problem, not just reason about code.

Task Categories

Terminal-Bench 2.0's 89 tasks span domains that go well beyond writing code. The benchmark was designed to test whether agents can do the kind of technical work that requires both domain knowledge and shell fluency.

Software Engineering

Build systems, compilation, dependency resolution, git operations with merge conflicts, COBOL-to-Python rewrites, code coverage analysis with gcov.

Security

Differential cryptanalysis on cipher systems, password recovery, vulnerability identification, API key removal from codebases.

Machine Learning

Training a fastText model on Yelp data with accuracy and size constraints, neural network framework integration, model optimization.

Biology

Domain-specific computation tasks requiring biological knowledge alongside terminal operation skills.

Gaming

Chess engine move optimization, physics-based rendering implementations, game-related computational tasks.

System Administration

Server configuration, email systems, network services, debugging memory issues, async concurrency, race conditions.

The COBOL modernization task takes close to 10 minutes and more than 100 tool calls; the simplest tasks complete in under a minute. Author-estimated completion times for a human expert range from under one hour to over one week. 93.3% of tasks rated "hard" by human reviewers also proved empirically difficult for models (correlation r = 0.436).

How Terminal-Bench 2.0 Works

A Terminal-Bench task is a folder containing an instruction, a Docker environment, and a test script. The Harbor harness connects a language model to a terminal sandbox, runs the task, and assigns a binary score: 0 for incorrect, 1 for correct.

Execution Flow

  1. Container launch: Harbor spins up a Docker container with the task's specific environment (packages, files, services).
  2. Instruction delivery: The agent receives the task description and begins exploring the environment.
  3. Interactive execution: The agent calls tools (bash commands, file edits) to complete the task. It can run multiple commands, inspect output, and adjust its approach.
  4. Verification: Harbor runs the task's test script. The test checks whether the agent's solution meets the task requirements.
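
The verification step reduces to running a script and mapping its exit status to a binary score. A minimal sketch, assuming a hypothetical task folder layout with a `test.sh` at its root (the real harness executes the check inside the task's Docker container, not on the host):

```python
import pathlib
import subprocess

def verify(task_dir: str) -> int:
    """Step 4 in miniature: run the task's test script and map its
    exit status to the benchmark's binary score (1 pass, 0 fail)."""
    test_script = pathlib.Path(task_dir) / "test.sh"
    result = subprocess.run(
        ["/bin/sh", str(test_script)],
        capture_output=True,  # the agent never sees test output
    )
    return 1 if result.returncode == 0 else 0
```

The binary scoring is what makes pass@1 well defined: there is no partial credit, so a leaderboard score is simply the fraction of the 89 tasks whose test script exited cleanly.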

Harbor supports popular agents including Claude Code, Codex CLI, OpenHands, Mini-SWE-Agent, and Terminus 2. It scales horizontally to 32-100 containers in parallel using cloud infrastructure, with sandbox providers like Daytona handling container lifecycle.
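
Horizontal scaling is straightforward because tasks are independent: each run is a function from a task to a 0/1 score. A hedged sketch of the fan-out, with a thread pool standing in for remote sandboxes and `run_task` as a hypothetical callable:

```python
from concurrent.futures import ThreadPoolExecutor

def run_suite(task_ids, run_task, max_workers=32):
    """Fan independent task runs out across worker slots, the way a
    harness drives dozens of sandboxed containers in parallel.
    Returns pass@1 for the suite: the fraction of tasks scored 1."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(run_task, task_ids))
    return sum(scores) / len(scores)
```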

Supported Agents

| Agent | Organization | Best Score |
|---|---|---|
| Forge Code | Forge Code | 78.4% |
| Droid | Factory | 77.3% |
| Simple Codex | OpenAI | 75.1% |
| Terminus-KIRA | KRAFTON AI | 74.8% |
| Claude Code | Anthropic | ~65% |
| Codex CLI | OpenAI | ~63% |
| OpenHands | All Hands AI | ~55% |
| Mini-SWE-Agent | Princeton | ~7% |

What Changed in 2.0

Terminal-Bench 1.0 launched in May 2025 and quickly became the standard benchmark for agent evaluation across frontier labs. Version 2.0, released November 2025, addressed quality and reproducibility issues that surfaced over 6 months of heavy use.

| Dimension | Version 1.0 | Version 2.0 |
|---|---|---|
| Tasks | 229 submitted, ~150 active | 89 (curated from 229) |
| Verification | Community review | Manual + LM-assisted, 3 reviewers per task |
| Environment stability | Some tasks broke (YouTube anti-bot) | All environments verified stable |
| Execution | EC2/Docker, turn-based limits | Daytona remote, time-based limits |
| Environment fixes | Ad hoc | 89 tasks with config fixes |
| Instruction fixes | Ad hoc | 11 tasks with instruction rewrites |

The shift from turn-based to time-based limits is significant. Some tasks require waiting for a compilation or a training run. Turn-based limits penalize agents that monitor long-running processes, which is exactly the skill Terminal-Bench is designed to measure.
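
The difference shows up in the outer agent loop. An illustrative sketch (not Harbor's actual API): under a turn budget, an agent that polls a long compilation spends its budget on cheap status checks; under a wall-clock budget, waiting costs almost nothing.

```python
import time

def run_agent(step, *, max_seconds=None, max_turns=None):
    """Drive an agent loop under either a wall-clock budget or a
    turn budget. `step()` performs one agent action and returns
    True when the task is finished."""
    deadline = time.monotonic() + max_seconds if max_seconds else None
    turns = 0
    while True:
        if deadline is not None and time.monotonic() >= deadline:
            return "timeout", turns
        if max_turns is not None and turns >= max_turns:
            return "turn_limit", turns
        turns += 1
        if step():
            return "done", turns
```

An agent whose `step` is "check whether the build finished yet" exhausts a turn limit quickly while consuming trivial wall-clock time, which is why the 2.0 harness switched budgets.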

Limitations

Terminal-Bench 2.0 is more rigorous than its predecessor, but it is not without issues. Three limitations are worth noting.

Internet Access

Agents can access the internet during evaluation. They could theoretically find Terminal-Bench solutions on GitHub or the paper itself. The authors have not observed this behavior in tens of thousands of trajectories, but the risk is not zero. The repository includes Big-Bench canary strings to aid training data decontamination, but preventing intentional lookup during evaluation requires a private test set, which the project has not built.
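
Canary-based decontamination acts on the training side, not the evaluation side: a lab scanning its crawl drops every document containing the marker. A sketch with a placeholder GUID (the real benchmark embeds its own canary string):

```python
# Hypothetical canary marker; the actual benchmark repository embeds
# its own canary GUID for this purpose.
CANARY = "canary GUID 00000000-0000-0000-0000-000000000000"

def decontaminate(corpus):
    """Drop any training document containing the canary, so benchmark
    tasks that leaked into a web crawl are filtered out before
    training. This cannot stop an agent from looking solutions up at
    evaluation time, which is the residual risk described above."""
    return [doc for doc in corpus if CANARY not in doc]
```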

Reproducibility

Some non-determinism remains. Task outcomes can vary with CPU architecture differences, external API stability, and infrastructure reliability. Long-running tasks are especially sensitive. The confidence intervals on the leaderboard (typically ±1.5-2.6) reflect this variance.

Scale

89 tasks is small compared to SWE-Bench Pro's 1,865. The confidence intervals are wider as a result. Two models separated by 1-2 points may not be meaningfully different. The top 5 entries span only 3.7 points.
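
The scale effect follows directly from a binomial model. Under the simplifying assumption of one independent pass/fail trial per task (the leaderboard's tighter published margins come from averaging multiple runs), a 78.4% pass rate carries a 95% interval of roughly ±8.5 points on 89 tasks versus roughly ±1.9 on 1,865:

```python
import math

def ci_halfwidth(p, n, z=1.96):
    """Normal-approximation 95% confidence half-width for a pass
    rate p estimated from n independent binary task outcomes."""
    return z * math.sqrt(p * (1 - p) / n)

small = ci_halfwidth(0.784, 89)    # Terminal-Bench 2.0 task count
large = ci_halfwidth(0.784, 1865)  # SWE-Bench Pro task count
```

The half-width shrinks with the square root of the task count, so closing the gap between an 89-task and an 1,865-task benchmark would require roughly 20x more tasks, not a handful.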

Frequently Asked Questions

What is Terminal-Bench 2.0?

Terminal-Bench 2.0 is a benchmark of 89 tasks that evaluates AI agents in real terminal environments. Created by Mike Merrill at Stanford and the Laude Institute, it tests whether agents can navigate filesystems, compile code, install packages, and complete multi-step workflows inside Docker containers. Tasks span software engineering, security, biology, and gaming.

How does Terminal-Bench differ from SWE-Bench?

SWE-Bench measures patch generation for GitHub issues. Terminal-Bench measures shell fluency: whether an agent can operate inside a real terminal environment across multi-step workflows. SWE-Bench tasks are primarily single-file Python patches. Terminal-Bench tasks require exploring unknown environments, running commands, and validating results autonomously.

What is GPT-5.3-Codex's Terminal-Bench 2.0 score?

GPT-5.3-Codex scores 77.3% when paired with Factory's Droid agent, and 75.1% with OpenAI's Simple Codex agent. The 2.2-point gap between agents using the same model shows how much scaffolding matters.

What is Claude Opus 4.6's Terminal-Bench 2.0 score?

Claude Opus 4.6 scores 74.7% when paired with KRAFTON AI's Terminus-KIRA agent, and 71.9% with Bigai's TongAgents framework.

Which model leads Terminal-Bench 2.0?

As of March 2026, Forge Code with Gemini 3.1 Pro leads at 78.4%, followed by Factory's Droid with GPT-5.3-Codex at 77.3%. The top 5 entries are within 3.7 percentage points.

Faster Code Editing for Terminal Agents

Morph Fast Apply merges AI-generated edits at 10,500 tok/s with 98% accuracy. Terminal-Bench agents that rewrite entire files for every edit waste tokens and time. Fast Apply cuts token usage by 50-60% and latency by 90%+.