OpenAI
GPT-5.3-Codex
GPT-5.3-Codex is the best autonomous coding agent available right now — faster than its predecessor, more token-efficient than Claude Code, and now baked into GitHub Copilot. The catch: no API yet (delayed by an unprecedented 'High cybersecurity capability' classification), and it's an async delegation tool, not a pair programmer. If you want to fire off tasks and check back later, this is it.
Context window
400K tokens
API (blended)
$4.81/1M
Consumer access
$20/mo
Multimodal
Text only
Score Breakdown
71.7/100 → 7.2/10. Intelligence, Reliability, Speed, and Context are field-relative — scores shift as models are added. Accessibility and Trust are absolute checklists. Full methodology →
Strengths
- τ²-bench: 90.9% (AA-measured) — second only to Gemini 3.1 Pro; best OpenAI model for agentic tool use
- GPQA Diamond: 91.5% and HLE: 39.9% (AA-measured) — strong science and reasoning for a coding-focused model
- ~3× more token-efficient than Claude Code (72K vs 235K tokens on equivalent TypeScript tasks)
- 25% faster than GPT-5.2-Codex — meaningful for long-horizon agentic runs
- First model 'instrumental in its own creation' — used to debug, deploy, and evaluate itself
- GitHub Copilot integration (Feb 9, 2026) — reaches millions of developers in existing workflows
- Supports parallel task delegation — run 7+ simultaneous Codex instances without context loss
- Low, medium, high, and xhigh reasoning-effort settings — tune cost vs. depth per task
- 40–60% chance of merge-ready code on minor tasks with no intervention
Weaknesses
- API not yet available — delayed by 'High capability' cybersecurity classification under the Preparedness Framework
- First OpenAI model to acknowledge possible cyberattack-enablement capability — Apollo Research sabotage score: 0.88/1.00
- Claude Code leads on complex refactors: ~23% fewer runtime errors in large TypeScript codebases
- Codex UI designed for task delegation, not pair programming — 'cumbersome' for interactive back-and-forth (Proser review)
- Desktop app is Mac-only at launch
- GPT-5.3-Codex-Spark (fastest variant, on Cerebras hardware) is a research preview only — Pro users only, 128K context, text-only
- Context window estimated at 400K — smaller than Gemini 3.1 Pro's 1M for very large repository analysis
- California SB 53 alleged violation filed — regulatory status unresolved as of Feb 2026
This is not a chatbot — it's an autonomous coding agent
GPT-5.3-Codex powers the Codex product (chatgpt.com/codex). You delegate a task, it runs autonomously in a cloud sandbox pre-loaded with your repo — writing code, running tests, fixing bugs, opening PRs — for hours or even days. One developer ran it for 25 hours uninterrupted, generating ~30,000 lines of code across ~13 million tokens. You steer it mid-task but you're managing it, not typing alongside it.
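The scale of that 25-hour run is easier to grasp as a rate. A quick back-of-the-envelope using the approximate figures above:

```python
# Rough throughput from the reported 25-hour uninterrupted run.
# All three inputs are the approximate figures cited above.
hours = 25
tokens = 13_000_000   # ~13M tokens processed
lines = 30_000        # ~30,000 lines of code generated

print(f"~{tokens // hours:,} tokens/hour")  # ~520,000 tokens/hour
print(f"~{lines // hours:,} lines/hour")    # ~1,200 lines/hour
```

Roughly half a million tokens and over a thousand lines of code per hour, sustained — a pace no interactive pair-programming session approaches.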
Benchmark performance (AA-measured)
| Benchmark | GPT-5.3-Codex | Claude Opus 4.6 | Gemini 3.1 Pro | GPT-5.2 |
|---|---|---|---|---|
| GPQA Diamond (PhD science) | 91.5% | 84.0% | 94.1% | 90.3% |
| HLE — standard mode | 39.9% | 18.6% | 44.7% | 35.4% |
| τ²-bench (tool use & agents) | 90.9% | 84.8% | 95.6% | 84.8% |
All scores independently measured by Artificial Analysis in standard mode — consistent methodology across all models.
τ²-bench: 90.9% — second only to Gemini 3.1 Pro
On Artificial Analysis's τ²-bench (multi-turn agentic tool use), GPT-5.3-Codex scores 90.9% — ahead of Claude Opus 4.6 (84.8%) and GPT-5.2 (84.8%), trailing only Gemini 3.1 Pro (95.6%). For real-world autonomous coding pipelines that rely on tool calling and multi-step execution, this is the meaningful number.
Token efficiency vs Claude Code
| Task | GPT-5.3-Codex tokens | Claude Code tokens | Codex advantage |
|---|---|---|---|
| TypeScript feature implementation | 72,579 | 234,772 | ~3.2× fewer |
| Figma-to-code conversion | ~1.5M | ~6.2M | ~4.1× fewer |
Token usage translates directly to cost. On API-equivalent tasks, Codex's efficiency advantage is substantial — particularly on longer jobs.
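To see what the table means in dollars, here is a minimal sketch applying this page's blended rate of $4.81/1M tokens to both token counts. This is a simplification — Claude Code bills at Anthropic's own rates, so only the token ratio carries over directly:

```python
BLENDED_RATE = 4.81  # $/1M tokens, blended rate listed on this page

def task_cost(tokens: int, rate_per_million: float = BLENDED_RATE) -> float:
    """Cost of a task in dollars at a flat per-million-token rate."""
    return tokens / 1_000_000 * rate_per_million

# TypeScript feature-implementation task from the table above.
codex_tokens, claude_tokens = 72_579, 234_772
print(f"Codex: ${task_cost(codex_tokens):.2f}")
print(f"Same task at Claude Code's token count: ${task_cost(claude_tokens):.2f}")
print(f"Token ratio: {claude_tokens / codex_tokens:.1f}x")  # ~3.2x
```

At any fixed per-token rate, a ~3.2× token gap is a ~3.2× cost gap, which compounds over long agentic runs.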
Codex vs Claude Code: which one to use
| Dimension | GPT-5.3-Codex | Claude Code (Opus 4.6) |
|---|---|---|
| Interaction model | Async delegation — fire & check back | Sync pair programming — stay in loop |
| Token efficiency | ✓ ~3× fewer tokens | Higher token usage |
| Speed advantage | ✓ 40% faster on simple tasks | Slower |
| Complex refactors | More errors reported | ✓ ~23% fewer runtime errors |
| τ²-bench (AA) | ✓ 90.9% | 84.8% |
| GPQA Diamond (AA) | 91.5% | 84.0% |
| Context window | ~400K (est) | ✓ 200K standard / 1M beta |
| API availability | Not yet (Q1 2026 est) | ✓ Available now |
| Ecosystem | ✓ GitHub Copilot, VS Code, ChatGPT | Claude.ai, Cursor, IDE extensions |
These are complementary tools, not direct substitutes. Expert consensus is to use both.
The Codex model family (modern product — not the 2021 model)
| Model | Release | Key milestone |
|---|---|---|
| codex-1 (based on o3) | May 16, 2025 | First agentic Codex; research preview |
| GPT-5-Codex | ~Sep 2025 | First GPT-5 variant for agentic coding |
| GPT-5.1-Codex | ~Nov 2025 | Incremental improvement |
| GPT-5.1-Codex-Max | ~Dec 2025 | Long-horizon variant |
| GPT-5.2-Codex | Jan 14, 2026 | Context compaction, Windows support, cybersecurity features |
| GPT-5.3-Codex | Feb 5, 2026 | Current flagship — combines Codex + GPT-5.2 training stacks; 25% faster |
| GPT-5.3-Codex-Spark | Feb 12, 2026 | 1,000+ t/s on Cerebras; 128K context; text-only; Pro preview only |
The modern Codex product is unrelated to OpenAI's deprecated GPT-3-based Codex of 2021–2023. They share a name only.
⚠️ First OpenAI model classified 'High capability' in cybersecurity
OpenAI's System Card states this is the first model treated as High capability under their Preparedness Framework for Cybersecurity. Apollo Research found a mean best-of-10 sabotage score of 0.88/1.00 (vs 0.75 for GPT-5.2). OpenAI doesn't claim it can fully automate cyberattacks but "cannot rule out the possibility." This is why the API is delayed. A California SB 53 violation was alleged by a watchdog organization — OpenAI disputes the interpretation.
Access options
| Access path | Entry price | What you get |
|---|---|---|
| ChatGPT Plus | $20/mo | Codex Web + CLI + VS Code; standard usage limits |
| ChatGPT Pro | $200/mo | Higher limits + Codex-Spark research preview (Cerebras) |
| GitHub Copilot | ~$10/mo | Codex in github.com, Mobile, VS, VS Code (from Feb 9, 2026) |
| ChatGPT Team | $30/user/mo | Shared workspace, admin controls |
| Enterprise / Edu | Custom | SOC 2, HIPAA, zero data retention |
| API | Not yet available | GPT-5.2-Codex API ($1.75/$14 per 1M) is current alternative |
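Since the GPT-5.2-Codex API is the current stand-in, its split rates from the table ($1.75 input / $14.00 output per 1M tokens) are what a production budget would use today. A hedged sketch of the per-call arithmetic — the example token counts are hypothetical, and rates should be re-verified before budgeting:

```python
# Per-call cost at the GPT-5.2-Codex API rates listed above:
# $1.75 per 1M input tokens, $14.00 per 1M output tokens.
IN_RATE, OUT_RATE = 1.75, 14.00  # $/1M tokens

def api_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call given separate input/output token counts."""
    return input_tokens / 1_000_000 * IN_RATE + output_tokens / 1_000_000 * OUT_RATE

# Hypothetical agentic run: 200K tokens of repo context in, 50K tokens out.
print(f"${api_cost(200_000, 50_000):.2f}")  # $1.05
```

Note how output tokens dominate: at an 8× output premium, a verbose model costs far more than its input size suggests.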
Enterprise adoption (named customers)
| Company | Use case | Reported result |
|---|---|---|
| Cisco | Accelerating engineering teams | Named in OpenAI enterprise report |
| Virgin Atlantic | Development productivity | "Markedly increased productivity" |
| Temporal | Feature dev, debugging, refactoring | Named customer |
| Kodiak | Debugging tools for autonomous driving | Named customer |
| GitHub Copilot users (all) | IDE coding assistance | 1M+ developers on Codex product |
Bottom line
GPT-5.3-Codex is the best tool for autonomous, parallelizable coding tasks — τ²-bench at 90.9% (AA-measured, second only to Gemini 3.1 Pro), ~3× token efficiency over Claude Code, and native GitHub integration put it ahead for async workflows. But it doesn't replace Claude Code for complex interactive refactoring, and the API delay blocks it from production pipelines today. If you're on ChatGPT Plus or GitHub Copilot, it's worth adding to your workflow now for ticket-level delegation. If you need API access or tight reasoning control, wait or use GPT-5.2-Codex.
Pricing details
Prices verified February 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.
Last updated: February 26, 2026