Gemini 3.1 Pro scores 78.4% on Terminal-Bench 2.0. GPT-5.3-Codex scores 77.3%. Claude Opus 4.6 scores 74.7%. The benchmark measures something SWE-Bench does not: whether an AI agent can operate inside a real terminal, not just generate patches.
Terminal-Bench 2.0 contains 89 tasks across software engineering, biology, security, and gaming. Each task runs in a Docker container with a unique environment, human-written solution, and automated verification. Below are the latest rankings from tbench.ai, followed by analysis of what separates the top agents from the rest.
Terminal-Bench 2.0: Top 10 Submissions
Agent + model combinations, 89 tasks, Docker-containerized
Source: tbench.ai leaderboard. Scores are pass@1 averaged over multiple runs.
Terminal-Bench 2.0 Leaderboard (Top 15)
The leaderboard at tbench.ai has 115 submissions. Each entry pairs an agent framework with a model. The same model can appear multiple times with different agents, making scaffolding quality visible in the data.
| Rank | Agent + Model | Score | Margin |
|---|---|---|---|
| 1 | Forge Code + Gemini 3.1 Pro | 78.4% | ±1.8 |
| 2 | Droid + GPT-5.3-Codex | 77.3% | ±2.2 |
| 3 | Simple Codex + GPT-5.3-Codex | 75.1% | ±2.4 |
| 4 | Terminus-KIRA + Gemini 3.1 Pro | 74.8% | ±2.6 |
| 5 | Terminus-KIRA + Claude Opus 4.6 | 74.7% | ±2.6 |
| 6 | Mux + GPT-5.3-Codex | 74.6% | ±2.5 |
| 7 | OB-1 (multi-model) | 72.4% | ±2.3 |
| 8 | TongAgents + Claude Opus 4.6 | 71.9% | ±2.7 |
| 9 | Junie CLI (multi-model) | 71.0% | ±2.9 |
| 10 | CodeBrain-1 + GPT-5.3-Codex | 70.3% | ±2.6 |
| ... | |||
| 113 | Mini-SWE-Agent + GPT-5-Nano | 7.0% | ±1.9 |
| 114 | Mini-SWE-Agent + GPT-OSS-20B | 3.4% | ±1.4 |
| 115 | Terminus 2 + GPT-OSS-20B | 3.1% | ±1.5 |
Source: tbench.ai. Scores are percentage of 89 tasks passed. Margin is 95% confidence interval. Full leaderboard has 115 entries.
Agent Framework Impact
The same model scores differently depending on the agent wrapping it. This is the most important pattern in the data: scaffolding quality contributes 2-6 points on top of raw model capability.
Same Model, Different Agent
GPT-5.3-Codex across 4 agent frameworks
7-point spread from the same model. Scaffolding is not a rounding error.
Factory's Droid agent beats OpenAI's own Simple Codex agent by 2.2 points using the same GPT-5.3-Codex model. CodeBrain-1 trails by 7 points. The pattern holds across models: KRAFTON AI's Terminus-KIRA agent scores 74.7% with Opus 4.6, while Bigai's TongAgents scores 71.9% with the same model.
Warp demonstrated this most dramatically on Terminal-Bench v0.1.1. Their agent scored 52%, over 20 points ahead of the next competitor, using Claude Sonnet 4 for execution and Claude Opus 4 for planning. The key features: long-running command control (agents could read output from interactive sessions like REPLs and vim), updatable todo lists for plan tracking, and a fallback mechanism using Gemini 2.5 Pro for failed requests.
What makes a good agent scaffold?
The top-scoring agents share three patterns: long-running command management (not just one-shot bash calls), explicit planning steps before execution, and fallback mechanisms for flaky tool calls. Terminal tasks require sustained interaction with the environment, not just code generation. An agent that can manage an interactive vim session or monitor a long-running compilation outperforms one that treats every command as fire-and-forget.
Terminal-Bench vs SWE-Bench
SWE-Bench and Terminal-Bench measure different skills. SWE-Bench asks a model to generate a patch file that resolves a GitHub issue. Terminal-Bench asks a model to operate a computer through its terminal. The tasks are complementary, not competing.
| Dimension | SWE-Bench (Verified) | Terminal-Bench 2.0 |
|---|---|---|
| What it measures | Patch generation for GitHub issues | Shell fluency in real environments |
| Task format | Generate a diff/patch file | Interactive terminal session in Docker |
| Tasks | 500 (Verified) / 1,865 (Pro) | 89 |
| Languages | Python-only (Verified) | Language-agnostic |
| Domains | Software engineering | SWE, security, biology, gaming |
| Top score | 80.9% (Verified) | 78.4% |
| Contributors | Princeton NLP | 93 open-source contributors |
| Contamination risk | Confirmed (Verified) | Low (canary strings, manual review) |
A model that generates clean patches may fail at Terminal-Bench tasks because it cannot navigate an unfamiliar filesystem, debug a compilation error, or manage a long-running process. Terminal-Bench tasks require the agent to learn about its environment before solving the problem, not just reason about code.
Task Categories
Terminal-Bench 2.0's 89 tasks span domains that go well beyond writing code. The benchmark was designed to test whether agents can do the kind of technical work that requires both domain knowledge and shell fluency.
Software Engineering
Build systems, compilation, dependency resolution, git operations with merge conflicts, COBOL-to-Python rewrites, code coverage analysis with gcov.
Security
Differential cryptanalysis on cipher systems, password recovery, vulnerability identification, API key removal from codebases.
Machine Learning
Training a fastText model on Yelp data with accuracy and size constraints, neural network framework integration, model optimization.
Biology
Domain-specific computation tasks requiring biological knowledge alongside terminal operation skills.
Gaming
Chess engine move optimization, physics-based rendering implementations, game-related computational tasks.
System Administration
Server configuration, email systems, network services, debugging memory issues, async concurrency, race conditions.
The COBOL modernization task requires close to 10 minutes and 100+ tool calls. Simpler tasks complete in under a minute. Author-estimated completion times for a human expert range from under one hour to over one week. 93.3% of tasks rated "hard" by human reviewers also proved empirically difficult for models (r=0.436 correlation).
How Terminal-Bench 2.0 Works
A Terminal-Bench task is a folder containing an instruction, a Docker environment, and a test script. The Harbor harness connects a language model to a terminal sandbox, runs the task, and assigns a binary score: 0 for incorrect, 1 for correct.
Execution Flow
- Container launch: Harbor spins up a Docker container with the task's specific environment (packages, files, services).
- Instruction delivery: The agent receives the task description and begins exploring the environment.
- Interactive execution: The agent calls tools (bash commands, file edits) to complete the task. It can run multiple commands, inspect output, and adjust its approach.
- Verification: Harbor runs the task's test script. The test checks whether the agent's solution meets the task requirements.
Harbor supports popular agents including Claude Code, Codex CLI, OpenHands, Mini-SWE-Agent, and Terminus 2. It scales horizontally to 32-100 containers in parallel using cloud infrastructure, with sandbox providers like Daytona handling container lifecycle.
Supported Agents
| Agent | Organization | Best Score |
|---|---|---|
| Forge Code | Forge Code | 78.4% |
| Droid | Factory | 77.3% |
| Simple Codex | OpenAI | 75.1% |
| Terminus-KIRA | KRAFTON AI | 74.8% |
| Claude Code | Anthropic | ~65% |
| Codex CLI | OpenAI | ~63% |
| OpenHands | All Hands AI | ~55% |
| Mini-SWE-Agent | Princeton | ~7% |
What Changed in 2.0
Terminal-Bench 1.0 launched in May 2025 and quickly became the standard benchmark for agent evaluation across frontier labs. Version 2.0, released November 2025, addressed quality and reproducibility issues that surfaced over 6 months of heavy use.
| Dimension | Version 1.0 | Version 2.0 |
|---|---|---|
| Tasks | 229 submitted, ~150 active | 89 (curated from 229) |
| Verification | Community review | Manual + LM-assisted, 3 reviewers per task |
| Environment stability | Some tasks broke (YouTube anti-bot) | All environments verified stable |
| Execution | EC2/Docker, turn-based limits | Daytona remote, time-based limits |
| Environment fixes | Ad hoc | 89 tasks with config fixes |
| Instruction fixes | Ad hoc | 11 tasks with instruction rewrites |
The shift from turn-based to time-based limits is significant. Some tasks require waiting for a compilation or a training run. Turn-based limits penalize agents that monitor long-running processes, which is exactly the skill Terminal-Bench is designed to measure.
Limitations
Terminal-Bench 2.0 is more rigorous than its predecessor, but it is not without issues. Three limitations are worth noting.
Internet Access
Agents can access the internet during evaluation. They could theoretically find Terminal-Bench solutions on GitHub or the paper itself. The authors have not observed this behavior in tens of thousands of trajectories, but the risk is not zero. The repository includes Big-Bench canary strings to aid training data decontamination, but preventing intentional lookup during evaluation requires a private test set, which the project has not built.
Reproducibility
Some non-determinism remains. Task outcomes can vary with CPU architecture differences, external API stability, and infrastructure reliability. Long-running tasks are especially sensitive. The confidence intervals on the leaderboard (typically \u00b11.5-2.6) reflect this variance.
Scale
89 tasks is small compared to SWE-Bench Pro's 1,865. The confidence intervals are wider as a result. Two models separated by 1-2 points may not be meaningfully different. The top 5 entries span only 3.7 points.
Frequently Asked Questions
What is Terminal-Bench 2.0?
Terminal-Bench 2.0 is a benchmark of 89 tasks that evaluates AI agents in real terminal environments. Created by Mike Merrill at Stanford and the Laude Institute, it tests whether agents can navigate filesystems, compile code, install packages, and complete multi-step workflows inside Docker containers. Tasks span software engineering, security, biology, and gaming.
How does Terminal-Bench differ from SWE-Bench?
SWE-Bench measures patch generation for GitHub issues. Terminal-Bench measures shell fluency: whether an agent can operate inside a real terminal environment across multi-step workflows. SWE-Bench tasks are primarily single-file Python patches. Terminal-Bench tasks require exploring unknown environments, running commands, and validating results autonomously.
What is GPT-5.3-Codex's Terminal-Bench 2.0 score?
GPT-5.3-Codex scores 77.3% when paired with Factory's Droid agent, and 75.1% with OpenAI's Simple Codex agent. The 2.2-point gap between agents using the same model shows how much scaffolding matters.
What is Claude Opus 4.6's Terminal-Bench 2.0 score?
Claude Opus 4.6 scores 74.7% when paired with KRAFTON AI's Terminus-KIRA agent, and 71.9% with Bigai's TongAgents framework.
Which model leads Terminal-Bench 2.0?
As of March 2026, Forge Code with Gemini 3.1 Pro leads at 78.4%, followed by Factory's Droid with GPT-5.3-Codex at 77.3%. The top 5 entries are within 3.7 percentage points.
Faster Code Editing for Terminal Agents
Morph Fast Apply merges AI-generated edits at 10,500 tok/s with 98% accuracy. Terminal-Bench agents that rewrite entire files for every edit waste tokens and time. Fast Apply cuts token usage by 50-60% and latency by 90%+.