OpenAI

GPT-5.3-Codex

7.2
out of 10

GPT-5.3-Codex is the best autonomous coding agent available right now — faster than its predecessor, more token-efficient than Claude Code, and now baked into GitHub Copilot. The catch: no API yet (delayed by an unprecedented 'High cybersecurity capability' classification), and it's an async delegation tool, not a pair programmer. If you want to fire off tasks and check back later, this is it.

  • Context window: 400K tokens
  • API (blended): $4.81/1M
  • Consumer access: $20/mo
  • Multimodal: Text only

Score Breakdown

Total: 71.7/100 → 7.2/10

Intelligence, Reliability, Speed, and Context are field-relative — scores shift as models are added. Accessibility and Trust are absolute checklists. Full methodology →

Strengths

  • +τ²-bench: 90.9% (AA-measured) — second only to Gemini 3.1 Pro; best OpenAI model for agentic tool use
  • +GPQA Diamond: 91.5% and HLE: 39.9% (AA-measured) — strong science and reasoning for a coding-focused model
  • +~3× more token-efficient than Claude Code (72K vs 235K tokens on equivalent TypeScript tasks)
  • +25% faster than GPT-5.2-Codex — meaningful for long-horizon agentic runs
  • +First model 'instrumental in its own creation' — used to debug, deploy, and evaluate itself
  • +GitHub Copilot integration (Feb 9, 2026) — reaches millions of developers in existing workflows
  • +Supports parallel task delegation — run 7+ simultaneous Codex instances without context loss
  • +Low, medium, high, xhigh reasoning effort settings — tune cost vs depth per task
  • +40–60% chance of merge-ready code on minor tasks with no intervention

Weaknesses

  • -API not yet available — delayed by 'High capability' cybersecurity classification under Preparedness Framework
  • -First OpenAI model to acknowledge possible cyberattack enablement capability — Apollo Research sabotage score: 0.88/1.00
  • -Claude Code leads on complex refactors: ~23% fewer runtime errors in large TypeScript codebases
  • -Codex UI designed for task delegation, not pair programming — 'cumbersome' for interactive back-and-forth (Proser review)
  • -Desktop app is Mac-only at launch
  • -GPT-5.3-Codex-Spark (fastest variant, Cerebras hardware) is research preview only — Pro users only, 128K context, text-only
  • -Context window estimated at 400K — smaller than Gemini 3.1 Pro (1M) for very large repository analysis
  • -California SB 53 alleged violation filed — regulatory status unresolved as of Feb 2026

Best for

  • autonomous multi-step development workflows (research → implement → debug → test → PR)
  • parallelizing multiple independent coding tasks simultaneously
  • developers already in the GitHub / VS Code / ChatGPT ecosystem
  • high-volume API coding tasks when API access opens (token efficiency advantage)
  • teams delegating maintenance-level tickets and minor features autonomously

Not ideal for

  • interactive pair programming (use Claude Code instead)
  • large complex refactors requiring tight reasoning control
  • API-based integration right now (no API yet)
  • organizations with strict cybersecurity procurement requirements
  • very large repository analysis requiring 1M+ context

This is not a chatbot — it's an autonomous coding agent

GPT-5.3-Codex powers the Codex product (chatgpt.com/codex). You delegate a task, it runs autonomously in a cloud sandbox pre-loaded with your repo — writing code, running tests, fixing bugs, opening PRs — for hours or even days. One developer ran it for 25 hours uninterrupted, generating ~30,000 lines of code across ~13 million tokens. You steer it mid-task but you're managing it, not typing alongside it.

Benchmark performance (AA-measured)

| Benchmark | GPT-5.3-Codex | Claude Opus 4.6 | Gemini 3.1 Pro | GPT-5.2 |
| --- | --- | --- | --- | --- |
| GPQA Diamond (PhD science) | 91.5% | 84.0% | 94.1% | 90.3% |
| HLE — standard mode | 39.9% | 18.6% | 44.7% | 35.4% |
| τ²-bench (tool use & agents) | 90.9% | 84.8% | 95.6% | 84.8% |

All scores independently measured by Artificial Analysis in standard mode — consistent methodology across all models.

τ²-bench: 90.9% — second only to Gemini 3.1 Pro

On Artificial Analysis's τ²-bench (multi-turn agentic tool use), GPT-5.3-Codex scores 90.9% — ahead of Claude Opus 4.6 (84.8%) and GPT-5.2 (84.8%), trailing only Gemini 3.1 Pro (95.6%). For real-world autonomous coding pipelines that rely on tool calling and multi-step execution, this is the meaningful number.

Token efficiency vs Claude Code

| Task | GPT-5.3-Codex tokens | Claude Code tokens | Codex advantage |
| --- | --- | --- | --- |
| TypeScript feature implementation | 72,579 | 234,772 | ~3.2× fewer |
| Figma-to-code conversion | ~1.5M | ~6.2M | ~4.1× fewer |

Token usage translates directly to cost. On API-equivalent tasks, Codex's efficiency advantage is substantial — particularly on longer jobs.
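To make the cost effect concrete, here is a back-of-the-envelope sketch using the TypeScript task's token counts from the table above and the $4.81/1M blended rate quoted in this review. Pricing Claude Code at the same rate is a simplifying assumption purely to isolate the token-count effect — actual Claude API rates differ.

```python
# Cost of a run at a flat blended rate (USD per 1M tokens).
# $4.81/1M is this review's blended figure; applying it to both
# tools is an assumption that isolates the token-usage difference.
BLENDED_RATE = 4.81

def run_cost(tokens: int, rate_per_million: float = BLENDED_RATE) -> float:
    """Cost in USD for a run consuming `tokens` tokens."""
    return tokens / 1_000_000 * rate_per_million

codex_cost = run_cost(72_579)    # TypeScript feature, GPT-5.3-Codex
claude_cost = run_cost(234_772)  # same task, Claude Code token count

print(f"Codex:  ${codex_cost:.2f}")   # $0.35
print(f"Claude: ${claude_cost:.2f}")  # $1.13
print(f"Ratio:  {234_772 / 72_579:.1f}x")
```

At identical per-token pricing, the ~3.2× token gap is a ~3.2× cost gap — and the spread widens on longer jobs like the Figma conversion, where the ratio is ~4.1×.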

Codex vs Claude Code: which one to use

| Dimension | GPT-5.3-Codex | Claude Code (Opus 4.6) |
| --- | --- | --- |
| Interaction model | Async delegation — fire & check back | Sync pair programming — stay in loop |
| Token efficiency | ✓ ~3× fewer tokens | Higher token usage |
| Speed advantage | ✓ 40% faster on simple tasks | Slower |
| Complex refactors | More errors reported | ✓ ~23% fewer runtime errors |
| τ²-bench (AA) | ✓ 90.9% | 84.8% |
| GPQA Diamond (AA) | ✓ 91.5% | 84.0% |
| Context window | ~400K (est) | ✓ 200K standard / 1M beta |
| API availability | Not yet (Q1 2026 est) | ✓ Available now |
| Ecosystem | ✓ GitHub Copilot, VS Code, ChatGPT | Claude.ai, Cursor, IDE extensions |

These are complementary tools, not direct substitutes. Expert consensus is to use both.

The Codex model family (modern product — not the 2021 model)

| Model | Release | Key milestone |
| --- | --- | --- |
| codex-1 (based on o3) | May 16, 2025 | First agentic Codex; research preview |
| GPT-5-Codex | ~Sep 2025 | First GPT-5 variant for agentic coding |
| GPT-5.1-Codex | ~Nov 2025 | Incremental improvement |
| GPT-5.1-Codex-Max | ~Dec 2025 | Long-horizon variant |
| GPT-5.2-Codex | Jan 14, 2026 | Context compaction, Windows support, cybersecurity features |
| GPT-5.3-Codex | Feb 5, 2026 | Current flagship — combines Codex + GPT-5.2 training stacks; 25% faster |
| GPT-5.3-Codex-Spark | Feb 12, 2026 | 1,000+ t/s on Cerebras; 128K context; text-only; Pro preview only |

The modern Codex product is unrelated to OpenAI's deprecated GPT-3-based Codex of 2021–2023. They share a name only.

⚠️ First OpenAI model classified 'High capability' in cybersecurity

OpenAI's System Card states this is the first model treated as High capability under their Preparedness Framework for Cybersecurity. Apollo Research found a mean best-of-10 sabotage score of 0.88/1.00 (vs 0.75 for GPT-5.2). OpenAI doesn't claim it can fully automate cyberattacks but "cannot rule out the possibility." This is why the API is delayed. A California SB 53 violation was alleged by a watchdog organization — OpenAI disputes the interpretation.

Access options

| Access path | Entry price | What you get |
| --- | --- | --- |
| ChatGPT Plus | $20/mo | Codex Web + CLI + VS Code; standard usage limits |
| ChatGPT Pro | $200/mo | Higher limits + Codex-Spark research preview (Cerebras) |
| GitHub Copilot | ~$10/mo | Codex in github.com, Mobile, Visual Studio, VS Code (from Feb 9, 2026) |
| ChatGPT Team | $30/user/mo | Shared workspace, admin controls |
| Enterprise / Edu | Custom | SOC 2, HIPAA, zero data retention |
| API | Not yet available | GPT-5.2-Codex API ($1.75/$14 per 1M) is current alternative |

Key capabilities

  • Autonomous long-horizon execution: runs for hours or days without intervention. Writes features, fixes bugs, proposes PRs, executes tests — all in isolated cloud sandboxes pre-loaded with your repository.
  • Parallel task delegation: run 7+ simultaneous Codex instances on independent tasks. Each runs in its own sandbox with full repo context. Used at OpenAI DevDay 2025 to ship game implementations in parallel.
  • Reasoning effort control: low / medium / high / xhigh settings. Dial cost vs depth per task — use low for routine tickets, xhigh for architectural decisions.
  • Mid-task steering: reprioritize, redirect, or ask questions while Codex is working. You're managing it asynchronously, not blocked waiting for a response.
  • Computer use / OSWorld: can operate computers end-to-end — navigate UIs, run terminal commands, interact with applications — not just write code.
  • GitHub Copilot integration: available natively in GitHub.com, GitHub Mobile, Visual Studio, and VS Code from February 9, 2026. Reaches the millions of developers already on Copilot.
  • Codex-Spark (Cerebras, Pro only): ultra-fast 1,000+ tokens/sec variant on Cerebras hardware. 128K context, text-only, research preview for Pro users. Separate rate limits; does not affect standard Codex usage.
  • Self-referential development: first model 'instrumental in creating itself' — the Codex team used early versions to debug training, manage deployment, and diagnose evaluations.
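The reasoning-effort capability above amounts to a per-task dial. As an illustration of the kind of policy a team might wrap around it, here is a small sketch — the four levels come from this review, but the function name, task categories, and mapping are hypothetical, not an official Codex API.

```python
# Hypothetical helper: pick a Codex reasoning-effort level per task category.
# The levels (low/medium/high/xhigh) are from the review; the mapping below
# is an illustrative policy of our own, not OpenAI's.
EFFORT_LEVELS = ("low", "medium", "high", "xhigh")

def pick_effort(task_kind: str) -> str:
    """Map a task category to a reasoning-effort level."""
    policy = {
        "routine_ticket": "low",      # lint fixes, dependency bumps
        "feature": "medium",          # typical feature work
        "complex_refactor": "high",   # cross-file, behavior-preserving changes
        "architecture": "xhigh",      # design decisions, deep analysis
    }
    effort = policy.get(task_kind, "medium")  # default: medium
    assert effort in EFFORT_LEVELS
    return effort

print(pick_effort("routine_ticket"))  # low
print(pick_effort("architecture"))    # xhigh
```

The design intent matches the review's advice: spend xhigh-effort tokens only where depth pays off, and keep routine tickets cheap.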

Enterprise adoption (named customers)

| Company | Use case | Reported result |
| --- | --- | --- |
| Cisco | Accelerating engineering teams | Named in OpenAI enterprise report |
| Virgin Atlantic | Development productivity | "Markedly increased productivity" |
| Temporal | Feature dev, debugging, refactoring | Named customer |
| Kodiak | Debugging tools for autonomous driving | Named customer |
| GitHub Copilot users (all) | IDE coding assistance | 1M+ developers on Codex product |

Bottom line

GPT-5.3-Codex is the best tool for autonomous, parallelizable coding tasks — τ²-bench at 90.9% (AA-measured, second only to Gemini 3.1 Pro), ~3× token efficiency over Claude Code, and native GitHub integration put it ahead for async workflows. But it doesn't replace Claude Code for complex interactive refactoring, and the API delay blocks it from production pipelines today. If you're on ChatGPT Plus or GitHub Copilot, it's worth adding to your workflow now for ticket-level delegation. If you need API access or tight reasoning control, wait or use GPT-5.2-Codex.

Pricing details

Subscription plans

| Plan | Price | What you get |
| --- | --- | --- |
| ChatGPT Plus | $20/mo | GPT-5.3-Codex via Codex Web, Codex CLI, VS Code extension (usage limits apply; Pro at $200/mo for higher limits) |
| ChatGPT Pro | $200/mo | Full Codex access + GPT-5.3-Codex-Spark research preview, 1,000+ t/s on Cerebras (Spark is research preview only; separate rate limits) |
| ChatGPT Team / Business | $30/user/mo | Codex access for teams, shared workspace, admin controls |
| GitHub Copilot | $10/mo | GPT-5.3-Codex in Copilot from Feb 9, 2026 — github.com, GitHub Mobile, Visual Studio, VS Code (Copilot usage limits apply) |
| ChatGPT Enterprise / Edu | Custom | Full Codex access, SOC 2, HIPAA, zero data retention, admin dashboard (contact OpenAI sales) |

API pricing

| Provider | Price (per 1M in/out) | Notes |
| --- | --- | --- |
| OpenAI (estimated floor — API not yet live) | $1.75 / $14 | GPT-5.3-Codex API access delayed by the 'High capability' cybersecurity classification. Price shown is the current GPT-5.2-Codex rate — likely the floor for 5.3-Codex. 90% cached-input discount ($0.175/M) expected to carry over. Verify at platform.openai.com before budgeting. |
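The $4.81/1M "blended" figure quoted at the top of this review can be reproduced from these per-token rates. The sketch below assumes a 3:1 input:output token mix — the ratio under which the $1.75/$14 rates average to ~$4.81; that mix is our inference, not a documented OpenAI figure.

```python
# Reproduce the review's blended rate from per-direction prices.
# Assumption: 75% of billed tokens are input, 25% are output (3:1 mix).
INPUT_RATE = 1.75   # USD per 1M input tokens (GPT-5.2-Codex rate)
OUTPUT_RATE = 14.0  # USD per 1M output tokens

def blended_rate(input_frac: float) -> float:
    """Weighted-average price per 1M tokens for a given input fraction."""
    return INPUT_RATE * input_frac + OUTPUT_RATE * (1 - input_frac)

print(f"${blended_rate(0.75):.2f}/1M")  # $4.81/1M at a 3:1 mix
```

Note the blended rate is sensitive to the mix: output-heavy jobs (large generated diffs) cost closer to $14/1M, while cache-heavy, input-dominated jobs can land well below $4.81.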

Prices verified February 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.