Codex 5.3: Specs, Benchmarks, and What It Actually Does (2026)

GPT-5.3-Codex launched February 5, 2026. 77.3% Terminal-Bench, 56.8% SWE-bench Pro, 400K context, 25% faster than 5.2. Full specs, Codex-Spark on Cerebras at 1,000 tok/sec, and how it compares to Opus 4.6.

March 4, 2026 · 1 min read

TL;DR

GPT-5.3-Codex is OpenAI's top coding model. It merges GPT-5.2-Codex's code generation with GPT-5.2's reasoning, runs 25% faster, and uses 2-4x fewer output tokens than Opus 4.6 on equivalent tasks. A week after launch, OpenAI shipped Codex-Spark on Cerebras WSE-3 at 1,000+ tok/sec, their first production workload off Nvidia hardware.

  • 77.3% — Terminal-Bench 2.0 (new high)
  • 56.8% — SWE-bench Pro Public
  • 400K — Context window (tokens)
  • 1,000+ — Codex-Spark tok/sec on Cerebras

Two models, not one

"Codex 5.3" refers to two distinct models. GPT-5.3-Codex is the full reasoning model with a 400K context window. GPT-5.3-Codex-Spark is a smaller, distilled variant running on Cerebras hardware at 15x the speed but with a 128K context window and less reasoning depth. Spark is in research preview for ChatGPT Pro subscribers only.

Key Specs

  • 400K — Context window (tokens)
  • 25% — Faster than GPT-5.2-Codex
  • 2-4x — Fewer tokens than Opus on same tasks
Spec | GPT-5.3-Codex | GPT-5.3-Codex-Spark
Release date | February 5, 2026 | February 12, 2026
Context window | 400,000 tokens | 128,000 tokens
Inference speed | ~65 tok/sec (standard) | 1,000+ tok/sec
Hardware | Nvidia GPUs | Cerebras WSE-3
Architecture | Full GPT-5.3 reasoning | Distilled, speed-optimized
Multimodal | Text + code | Text only
API input price | $1.75 / 1M tokens | Research preview only
API output price | $14.00 / 1M tokens | Research preview only
Availability | ChatGPT Plus/Pro, API, CLI | ChatGPT Pro only

OpenAI describes Codex 5.3 as moving from "an agent that can write and review code" to "an agent that can do nearly anything developers and professionals can do on a computer." The model handles long-running tasks involving research, tool use, and complex execution. You can steer and interact with it while it works without losing context.

Benchmark Results

Codex 5.3 sets new highs on Terminal-Bench 2.0 and SWE-bench Pro. It also shows strong results on OSWorld-Verified and GDPval, two benchmarks that test real-world computer use and professional knowledge work.

  • 77.3% — Terminal-Bench 2.0
  • 56.8% — SWE-bench Pro Public
  • 64.7% — OSWorld-Verified
  • 70.9% — GDPval (wins or ties)
Benchmark | GPT-5.3-Codex | Claude Opus 4.6 | Notes
Terminal-Bench 2.0 | 77.3% | 65.4% | Codex leads by 11.9 points
SWE-bench Pro Public | 56.8% | 55.4% | Codex leads by 1.4 points
SWE-bench Verified | N/R | 80.8% | OpenAI reports Pro, not Verified
OSWorld-Verified | 64.7% | 72.7% | Opus leads by 8 points
GDPval | 70.9% | N/R | 44-occupation professional tasks

Benchmark context

OpenAI reports SWE-bench Pro Public (56.8%); Anthropic reports SWE-bench Verified (80.8%). These are different problem sets with different difficulty levels, so comparing a Pro score against a Verified score is not valid. On SWE-bench Pro, the one benchmark both vendors report, Codex 5.3 edges Opus 4.6 by 1.4 points (56.8% vs 55.4%). On Terminal-Bench 2.0, the only other apples-to-apples comparison, Codex leads by 11.9 points.

Token Efficiency

Codex 5.3 achieves its SWE-bench Pro scores with fewer output tokens than any prior model. On equivalent tasks, it uses 2-4x fewer tokens than Opus 4.6. This matters for cost: fewer tokens at $14/M output is often cheaper than more tokens at lower per-token rates.
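
The cost claim is easy to sanity-check with back-of-the-envelope arithmetic. In the sketch below, the Codex output rate ($14/M) is the published figure from this article; the Opus output price is an illustrative assumption (Anthropic's rate is not given here), and 3x is the midpoint of the stated 2-4x token ratio.

```python
def cost_per_task(output_tokens: int, price_per_million: float) -> float:
    """Output-token cost of a single task in dollars."""
    return output_tokens * price_per_million / 1_000_000

# Assumptions for illustration: Codex emits 10K output tokens at the
# published $14/M rate; Opus emits 3x as many tokens (midpoint of the
# 2-4x range) at an assumed $25/M output rate.
codex = cost_per_task(10_000, 14.00)
opus = cost_per_task(30_000, 25.00)

print(f"Codex 5.3 task: ${codex:.2f}")   # $0.14
print(f"Opus (assumed): ${opus:.2f}")    # $0.75
```

Under these assumptions the higher per-token price still loses to token efficiency: the model emitting 3x fewer tokens is cheaper per task even at a lower nominal rate on the other side.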

What Terminal-Bench 2.0 Measures

Terminal-Bench tests real-world terminal tasks: system administration, deployment scripts, file manipulation, debugging shell pipelines. Codex 5.3's 77.3% represents a 13.3-point jump from GPT-5.2-Codex's 64%. This is the largest single-generation improvement on this benchmark.

What OSWorld-Verified Measures

OSWorld tests the ability to use a full computer environment: browsers, file managers, terminal, and desktop apps. Humans score ~72% on this benchmark. Codex 5.3 hits 64.7%, a 26.5-point jump from GPT-5.2-Codex. Opus 4.6 scores 72.7%, matching human performance.

Codex-Spark on Cerebras

On February 12, one week after Codex 5.3 launched, OpenAI released GPT-5.3-Codex-Spark. It is the first OpenAI model running on non-Nvidia hardware in production, deployed on Cerebras Wafer-Scale Engine 3 chips with 4 trillion transistors per wafer.

  • 1,000+ — Tokens per second
  • 15x — Faster than standard Codex
  • 128K — Context window (tokens)

What Spark is

A distilled, smaller version of Codex 5.3 purpose-built for low-latency code generation. It runs at 1,000+ tok/sec on Cerebras WSE-3, producing more capable responses than GPT-5.1-Codex-mini while completing tasks in a fraction of the time. Text-only, 128K context.

What Spark is not

Not a replacement for full Codex 5.3. Spark trades reasoning depth for throughput. It is designed for real-time coding feedback, inline completions, and rapid iteration. For complex multi-file refactoring or long-horizon agent tasks, full Codex 5.3 or Opus 4.6 is the better choice.

Why Cerebras?

The WSE-3 is a wafer-scale chip designed for inference with minimal memory bottlenecks. Cerebras can run the entire Spark model on-chip without the memory-transfer overhead that limits GPU-based inference speed. This is a hardware architecture advantage, not just a clock speed difference. OpenAI choosing Cerebras for a production model signals a strategic move toward hardware diversification.

Availability

Codex-Spark is in research preview for ChatGPT Pro ($200/mo) subscribers only. It is not available via the API at launch. Cerebras expects to bring this inference capability to larger frontier models later in 2026, including longer context lengths and multimodal inputs.

Architecture and Capabilities

Codex 5.3 is built on the GPT-5 architecture. OpenAI has not published parameter counts or detailed layer configurations. What they have disclosed: the model packs more reasoning capability per byte than predecessors, focusing on cognitive density over raw parameter count.

Key Capabilities

Long-Horizon Agent Tasks

Codex 5.3 handles multi-step tasks involving research, tool use, and complex execution. You can interact with the model mid-task without losing context. The Codex macOS app runs each task in an isolated cloud sandbox with its own container.

Cloud Sandbox Isolation

Each Codex task runs in its own cloud container. Internet access is disabled by default for security. The model can read and write files, run tests, and execute code in isolation. This makes it safe for autonomous, unattended execution on production codebases.

Self-Bootstrapping

OpenAI describes Codex 5.3 as the first of its models to be instrumental in its own creation. The Codex team used early versions to debug training runs, manage deployment, and diagnose evaluation results. This recursive capability signals a qualitative shift in model development.

Professional Knowledge

70.9% on GDPval across 44 occupations. Codex 5.3 goes beyond code: creating presentations, writing reports, managing spreadsheets, and handling system administration. It combines coding and professional knowledge in one model.

Cybersecurity Rating

Codex 5.3 is the first OpenAI model rated "high" for cybersecurity under OpenAI's Preparedness Framework. This activated additional safeguards. Fortune reported that the model "raises unprecedented cybersecurity risks" due to its ability to autonomously research, plan, and execute complex multi-step operations in sandboxed environments.

Pricing and Access

Codex 5.3 is available through ChatGPT subscriptions, the Codex CLI, the macOS app, the VS Code extension, and the OpenAI API.

Access Method | Price | Limits
ChatGPT Plus | $20/month | 30-150 messages per 5-hour window
ChatGPT Pro | $200/month | 300-1,500 messages per 5-hour window
API (input) | $1.75 / 1M tokens | Standard rate limits
API (output) | $14.00 / 1M tokens | Standard rate limits
Codex CLI | Included with ChatGPT plan | Uses plan message allocation
Codex-Spark | ChatGPT Pro only ($200/mo) | Research preview, no API

For a detailed breakdown of every plan, limit, and hidden cost, see our Codex pricing guide.

API vs subscription

The API makes sense if you need programmatic access, custom system prompts, or integration into CI/CD pipelines. The ChatGPT subscription makes sense for interactive use. At $14/M output tokens, a typical 10K-token response costs $0.14. Heavy users generating 100+ responses per day will find the $200/mo Pro plan cheaper.
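The breakeven point between the API and the flat-rate Pro plan follows directly from the published output price. This sketch counts only output tokens and assumes a 10K-token average response, so treat it as a rough bound rather than a billing model.

```python
# Breakeven between the $200/mo Pro plan and pay-per-use API,
# counting only output tokens at the published $14/M rate.
# The 10K tokens-per-response figure is an illustrative assumption.
PRO_MONTHLY = 200.00
OUTPUT_PRICE_PER_M = 14.00
TOKENS_PER_RESPONSE = 10_000

cost_per_response = TOKENS_PER_RESPONSE * OUTPUT_PRICE_PER_M / 1_000_000
breakeven = PRO_MONTHLY / cost_per_response

print(f"${cost_per_response:.2f} per response")       # $0.14
print(f"Breakeven: {breakeven:.0f} responses/month")  # 1429
```

At 100 responses per day (roughly 3,000 per month), API spend under these assumptions would be about $420, so the flat $200 Pro plan comes out ahead, which matches the heavy-user guidance above.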

Codex 5.3 vs Opus 4.6

Both models launched within hours of each other on February 5, 2026. They represent fundamentally different design philosophies: Codex optimizes for speed, token efficiency, and autonomous execution. Opus optimizes for reasoning depth, multi-file understanding, and deterministic outputs.

Dimension | GPT-5.3-Codex | Claude Opus 4.6
Terminal-Bench 2.0 | 77.3% | 65.4%
SWE-bench Pro | 56.8% | 55.4%
OSWorld-Verified | 64.7% | 72.7%
Context window | 400K tokens | 1M tokens (beta)
Token efficiency | 2-4x fewer tokens | Baseline
Speed | 25% faster than 5.2 | Slower, more deliberate
Subagent model | Cloud sandbox per task | Agent Teams with shared tasks
Hardware | Nvidia + Cerebras (Spark) | Nvidia / AWS

The pattern: Codex wins on execution dimensions (terminal tasks, speed, token efficiency). Opus wins on understanding dimensions (reasoning, multi-file refactoring, context capacity). Neither dominates across all benchmarks. Your workflow determines which matters more.

For the full deep-dive comparison with subagent architecture analysis, usage limits, and a decision framework, see our Codex vs Claude Code comparison.

Best Use Cases for Codex 5.3

Autonomous task execution

Codex 5.3's cloud sandbox model is built for fire-and-forget. Specify the outcome, set up tests with clear pass/fail criteria, press go. Come back later. The sandbox isolation means it cannot break your local environment.

Terminal and DevOps work

77.3% on Terminal-Bench 2.0, the highest of any model. System administration, deployment scripts, CI/CD pipelines, debugging shell scripts. This is where Codex 5.3 has the widest lead over competitors.

Rapid prototyping

25% faster than 5.2, 2-4x fewer tokens than Opus. When you need to iterate quickly on ideas, Codex 5.3's speed advantage compounds across dozens of prompts per session.

Real-time feedback (Spark)

Codex-Spark at 1,000+ tok/sec enables near-instant inline completions and real-time code review. For pairing-style workflows where latency matters more than reasoning depth, Spark is the right choice.

Where Codex 5.3 Is Not the Best Choice

  • Complex multi-file refactoring: Opus 4.6's 1M context window and higher SWE-bench Verified score (80.8%) make it better for tasks that require understanding relationships across many files.
  • Tasks requiring deterministic output: Community reports note Codex can produce different results for the same prompt. Opus is more consistent.
  • Niche or domain-specific languages: Opus has historically performed better on less common programming languages and frameworks.

Limitations

Known limitations

  • Cybersecurity risk: First model rated "high" under OpenAI's Preparedness Framework. The autonomous execution capability introduces risks that triggered additional safeguards.
  • Consistency variance: Multiple developers report that the same prompt can produce different quality results across runs. Less deterministic than Opus 4.6.
  • Internet disabled in sandbox: Cloud sandbox tasks run without internet access for security. This limits use cases that require fetching external resources during execution.
  • Spark is research preview: Codex-Spark is only available to ChatGPT Pro ($200/mo) subscribers. No API access. The 128K context window is limiting for large codebases.
  • Still needs human oversight: Despite strong autonomous capability, Codex 5.3 still requires human review for architecture decisions, security boundaries, and dependency updates.
  • Smaller context than Opus: 400K tokens vs Opus 4.6's 1M. For very large codebases, this is a real constraint.

Frequently Asked Questions

What is Codex 5.3?

GPT-5.3-Codex is OpenAI's most capable coding model, released February 5, 2026. It combines coding and general reasoning in one model, runs 25% faster than its predecessor, and scores 77.3% on Terminal-Bench 2.0. It powers the Codex CLI, the Codex macOS app, and is available via the OpenAI API.

What is Codex-Spark?

A distilled version of Codex 5.3 running on Cerebras WSE-3 hardware at 1,000+ tokens per second. It is 15x faster than standard Codex but has a smaller 128K context window and less reasoning depth. Available in research preview for ChatGPT Pro subscribers.

How much does Codex 5.3 cost?

Through ChatGPT: $20/mo (Plus) or $200/mo (Pro) with message-based limits. Via API: $1.75/M input tokens and $14/M output tokens. Codex-Spark requires the $200/mo Pro plan. See our full pricing breakdown.

How does Codex 5.3 compare to Opus 4.6?

Codex leads Terminal-Bench 2.0 by 11.9 points, SWE-bench Pro by 1.4 points (56.8% vs 55.4%), and uses 2-4x fewer tokens. Opus leads SWE-bench Verified (80.8%), OSWorld-Verified by 8 points, and has a 1M token context window (2.5x Codex's 400K). Codex is faster and cheaper per task. Opus is more thorough and consistent. See our full comparison.

Is Codex 5.3 available via API?

Yes. The model ID is gpt-5.3-codex. Pricing is $1.75/M input and $14/M output. Codex-Spark is not available via API, only through ChatGPT Pro.
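A request body for the model can be sketched as follows. The model ID comes from this article; the surrounding request shape follows OpenAI's long-standing Chat Completions convention and is assumed here to carry over unchanged, so treat the field layout as illustrative rather than confirmed.

```python
import json

# Sketch of a Chat Completions-style request body for gpt-5.3-codex.
# Only the model ID is from the article; the rest of the structure is
# OpenAI's existing API convention, assumed to apply to this model.
payload = {
    "model": "gpt-5.3-codex",
    "messages": [
        {
            "role": "user",
            "content": "Refactor this function to remove the global state.",
        }
    ],
}

print(json.dumps(payload, indent=2))
```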

What context window does Codex 5.3 have?

400,000 tokens for GPT-5.3-Codex. 128,000 tokens for GPT-5.3-Codex-Spark. The Codex CLI effective context is slightly below the raw 400K due to system prompt overhead.

Can Codex 5.3 run autonomously?

Yes. The Codex macOS app and CLI support autonomous execution in cloud sandboxes with internet disabled for security. You specify the task, set pass/fail criteria, and let it run. Human review is still recommended for production-critical changes.

What hardware does Codex-Spark run on?

Cerebras Wafer-Scale Engine 3 (WSE-3), a purpose-built wafer-scale chip with 4 trillion transistors. Codex-Spark is OpenAI's first production model deployed on non-Nvidia hardware.

Related Articles

Use WarpGrep to Get Better Context into Codex 5.3

WarpGrep is an agentic code search tool that runs as an MCP server. Connect it to any Codex-powered workflow for high-precision codebase context, so Codex 5.3's 400K token window gets filled with the right code, not noise.

Sources