OpenAI
GPT-5.4
GPT-5.4 is what happens when OpenAI stops releasing specialized models and puts everything into one. Released March 5, 2026, it combines the coding capabilities of GPT-5.3-Codex, the reasoning depth of GPT-5.2, and brand-new native computer use into a single frontier model. It beats human experts on 83% of GDPval knowledge-work tasks, surpasses human performance on OSWorld desktop navigation, and introduces tool search that cuts MCP token usage by 47%. The price tag went up ($2.50/M input vs $1.75 for GPT-5.2), but the model uses fewer tokens per task. For enterprise and agentic workloads, it's the obvious new default.
Context window
1.0M tokens
API (blended)
$5.63/1M
Consumer access
Free (limited) / $20/mo
Multimodal
Yes
Score Breakdown
75.1/100 → 7.5/10
Intelligence, Reliability, Speed, and Context are field-relative — scores shift as models are added. Accessibility and Trust are absolute checklists. Full methodology →
Strengths
- GDPval: 83% win/tie vs human experts across 44 occupations — 12 points above GPT-5.2
- OSWorld-Verified: 75% — first model to surpass human performance (72.4%) on desktop computer use
- Native computer use via screenshots + keyboard/mouse — no separate CUA model needed
- Tool search: 47% fewer tokens on MCP-heavy workflows — massive cost reduction for agentic pipelines
- 33% fewer false individual claims and 18% fewer error-containing responses vs GPT-5.2
- 1M context window (experimental in Codex) — enough for full production codebases
- SWE-Bench Pro: 57.7% — matches GPT-5.3-Codex while adding knowledge-work and computer-use capabilities
- Steerability: outlines its plan for complex queries so you can redirect mid-response
- BrowseComp: 82.7% — 17 points above GPT-5.2 for persistent web research
- GPQA Diamond: 92.8% — near state of the art on PhD-level science
Weaknesses
- API pricing 43% higher per input token than GPT-5.2 ($2.50 vs $1.75/M input)
- Pro tier ($200/mo) required for GPT-5.4 Pro maximum performance mode
- No Artificial Analysis independent measurements yet — benchmarks are provider-reported only
- Context window in ChatGPT unchanged from GPT-5.2 — 1M only available in Codex/API experimentally
- High cybersecurity capability classification carries over — same monitoring and blocking concerns as GPT-5.3-Codex
- GPT-5.2 retirement on June 5, 2026 — forced migration timeline for enterprise integrations
- Copyright litigation exposure (NYT + authors) still unresolved — data sovereignty risk persists
The first OpenAI model that can operate your computer
GPT-5.4 has native computer-use built in. It reads screenshots, clicks UI elements via coordinates, types into fields, and navigates desktop applications without a separate CUA model. On OSWorld-Verified it hits 75.0%, above human performance at 72.4%. This is not a gimmick. Mainstay reported 95% first-attempt success across 30K property tax portals, completing sessions 3x faster with 70% fewer tokens.
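OpenAI has not published GPT-5.4's computer-use action schema here, but the loop described above (read a screenshot, emit a click/type action, repeat) implies validating model output before dispatching it to the OS. A minimal sketch, assuming a hypothetical JSON action format (the real schema may differ):

```python
from dataclasses import dataclass

# Hypothetical action types a screenshot-driven agent loop might accept.
VALID_TYPES = {"click", "type", "keypress"}

@dataclass
class Action:
    kind: str
    x: int = 0
    y: int = 0
    text: str = ""

def parse_action(raw: dict, screen_w: int, screen_h: int) -> Action:
    """Validate a model-emitted action dict before dispatching it to the OS."""
    kind = raw.get("type")
    if kind not in VALID_TYPES:
        raise ValueError(f"unknown action type: {kind!r}")
    if kind == "click":
        x, y = int(raw["x"]), int(raw["y"])
        # Reject coordinates that fall outside the screenshot the model saw.
        if not (0 <= x < screen_w and 0 <= y < screen_h):
            raise ValueError("click coordinates outside screenshot bounds")
        return Action("click", x=x, y=y)
    if kind == "type":
        return Action("type", text=str(raw["text"]))
    return Action("keypress", text=str(raw["keys"]))
```

The validation step matters in practice: a hallucinated coordinate should fail loudly rather than click somewhere random on a production portal.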
Benchmark Performance
All benchmarks below are OpenAI-reported (xhigh reasoning effort). Artificial Analysis has not independently measured GPT-5.4 yet.
Professional & Knowledge Work
| Benchmark | GPT-5.4 | GPT-5.2 | GPT-5.3-Codex |
|---|---|---|---|
| GDPval (win/tie vs human experts) | 83.0% | 70.9% | 70.9% |
| Investment Banking Modeling (internal) | 87.3% | 68.4% | 79.3% |
| OfficeQA | 68.1% | 63.1% | 65.1% |
| FinanceAgent v1.1 | 56.0% | 59.5% | 54.0% |
GDPval is OpenAI's own benchmark spanning 44 occupations. The 12-point jump over GPT-5.2 is the largest single-generation improvement OpenAI has shown on this eval. FinanceAgent is the one benchmark where GPT-5.2 still leads.
Coding
| Benchmark | GPT-5.4 | GPT-5.3-Codex | GPT-5.2 |
|---|---|---|---|
| SWE-Bench Pro (Public) | 57.7% | 56.8% | 55.6% |
| Terminal-Bench 2.0 | 75.1% | 77.3% | 62.2% |
GPT-5.4 matches GPT-5.3-Codex on SWE-Bench while being a general-purpose model. Terminal-Bench is slightly lower than 5.3-Codex, but 13 points above GPT-5.2.
Computer Use & Vision
| Benchmark | GPT-5.4 | GPT-5.2 | Human |
|---|---|---|---|
| OSWorld-Verified (desktop navigation) | 75.0% | 47.3% | 72.4% |
| WebArena-Verified (browser use) | 67.3% | 65.4% | -- |
| MMMU Pro (no tools) | 81.2% | 79.5% | -- |
| OmniDocBench (doc parsing, lower = better) | 0.109 | 0.140 | -- |
The OSWorld jump from 47.3% to 75.0% is the headline number. GPT-5.2 could barely use a computer; GPT-5.4 is better at it than humans.
Science, Math & Reasoning
| Benchmark | GPT-5.4 | GPT-5.4 Pro | GPT-5.2 |
|---|---|---|---|
| GPQA Diamond (PhD science) | 92.8% | 94.4% | 92.4% |
| HLE (no tools) | 39.8% | 42.7% | 34.5% |
| HLE (with tools) | 52.1% | 58.7% | 45.5% |
| FrontierMath Tier 1-3 | 47.6% | 50.0% | 40.7% |
| ARC-AGI-1 (Verified) | 93.7% | 94.5% | 86.2% |
| ARC-AGI-2 (Verified) | 73.3% | 83.3% | 52.9% |
| Frontier Science Research | 33.0% | 36.7% | 25.2% |
ARC-AGI-2 is the standout: 73.3% vs 52.9% for GPT-5.2 is a 20-point jump on what's considered the hardest abstract reasoning benchmark. GPT-5.4 Pro pushes that to 83.3%.
Tool Search: The Efficiency Feature That Matters Most
If you're building agents with many tools, this is the single most important new feature.
47% fewer tokens on MCP-heavy workloads
Previously, all tool definitions were stuffed into the prompt upfront. Tool search gives the model a lightweight list and lets it look up full definitions only when needed. On 250 MCP Atlas tasks with all 36 MCP servers enabled, this cut total token usage by 47% with identical accuracy. For production agentic pipelines, this directly reduces cost and latency.
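The pattern is easy to sketch: put only a lightweight index (name plus one-line description) in the prompt, and expand a tool's full JSON schema on demand. The tool names and schemas below are invented for illustration, and the token estimate is a crude characters-per-token heuristic:

```python
import json

# Invented example tools; real MCP servers expose their own definitions.
FULL_DEFINITIONS = {
    "search_tickets": {
        "name": "search_tickets",
        "description": "Search support tickets by keyword and status.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "status": {"type": "string", "enum": ["open", "closed"]},
            },
            "required": ["query"],
        },
    },
    "create_invoice": {
        "name": "create_invoice",
        "description": "Create a draft invoice for a customer.",
        "parameters": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "string"},
                "amount_cents": {"type": "integer"},
            },
            "required": ["customer_id", "amount_cents"],
        },
    },
}

def lightweight_index(defs):
    """What goes into the prompt upfront: names and one-line descriptions only."""
    return [{"name": d["name"], "description": d["description"]} for d in defs.values()]

def lookup_tool(name, defs):
    """Expanded on demand, only when the model decides it needs this tool."""
    return defs[name]

def approx_tokens(obj):
    """Crude estimate: roughly 4 characters of serialized JSON per token."""
    return len(json.dumps(obj)) // 4
```

With two tools the savings are modest; across 36 MCP servers' worth of schemas, the gap between the index and the full definitions is where the 47% reduction comes from.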
Tool Use Benchmarks
| Benchmark | GPT-5.4 | GPT-5.2 |
|---|---|---|
| BrowseComp (persistent web search) | 82.7% | 65.8% |
| MCP Atlas (36 MCP servers) | 67.2% | 60.6% |
| Toolathlon (multi-step tool use) | 54.6% | 45.7% |
| τ²-bench Telecom (no reasoning) | 64.3% | 57.2% |
BrowseComp improved 17 points, meaning GPT-5.4 is substantially better at finding obscure information across the web. GPT-5.4 Pro pushes BrowseComp to 89.3%.
Pricing: Higher Per Token, Potentially Cheaper Per Task
The sticker price went up. Whether you pay more depends on how many tokens you burn.
API Pricing Comparison
| Model | Input/1M | Cached Input/1M | Output/1M |
|---|---|---|---|
| GPT-5.4 | $2.50 | $0.25 | $15.00 |
| GPT-5.2 | $1.75 | $0.175 | $14.00 |
| GPT-5.4 Pro | $30.00 | -- | $180.00 |
| GPT-5.2 Pro | $21.00 | -- | $168.00 |
GPT-5.4 is 43% more expensive on input and 7% more on output. But OpenAI claims it's 'significantly' more token-efficient than GPT-5.2 on reasoning tasks. Tool search alone saves 47% of tokens on MCP workloads. Whether the net cost goes up or down depends on your use case.
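The break-even point is easy to compute from the table above. Assuming an illustrative task shape of 20K input and 4K output tokens (your ratio will differ), GPT-5.4 needs to use roughly 17% fewer tokens than GPT-5.2 to match it on cost:

```python
def cost_per_task(in_tok, out_tok, in_price, out_price):
    """Dollar cost for one task; prices are per 1M tokens."""
    return (in_tok * in_price + out_tok * out_price) / 1_000_000

# Illustrative task shape: 20K input tokens, 4K output tokens.
old = cost_per_task(20_000, 4_000, 1.75, 14.00)  # GPT-5.2 rates
new = cost_per_task(20_000, 4_000, 2.50, 15.00)  # GPT-5.4 rates

# Token-efficiency factor GPT-5.4 needs to hit to match GPT-5.2's cost:
breakeven = old / new  # ~0.83, i.e. ~17% fewer tokens per task
```

If tool search alone cuts MCP workloads by 47% (a factor of 0.53), such tasks come out well under break-even; pure chat workloads with no efficiency gain simply cost more.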
Batch and Flex pricing: half price
Both Batch and Flex processing are available at 50% of the standard API rate. Priority processing (the /fast mode equivalent in the API) runs at 2x. For non-latency-sensitive workloads, the effective cost drops to $1.25/M input and $7.50/M output.
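The tier multipliers reduce to a small lookup table. A sketch, using the GPT-5.4 standard rates from the comparison table above:

```python
# GPT-5.4 standard API rates, dollars per 1M tokens.
STANDARD = {"input": 2.50, "cached_input": 0.25, "output": 15.00}

# Multipliers per processing tier: Batch/Flex at half price, priority at 2x.
TIERS = {"standard": 1.0, "batch": 0.5, "flex": 0.5, "priority": 2.0}

def effective_rates(tier):
    """Per-1M-token rates for a given processing tier."""
    mult = TIERS[tier]
    return {kind: round(price * mult, 4) for kind, price in STANDARD.items()}
```

`effective_rates("batch")` reproduces the $1.25/M input and $7.50/M output figures quoted above.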
New: steer the model mid-response
GPT-5.4 Thinking in ChatGPT now shows an upfront plan before generating long responses. You can redirect it while it's working, which means fewer wasted responses and less back-and-forth. Available now on chatgpt.com and Android; iOS coming soon.
33% fewer false claims than GPT-5.2
On a dataset of prompts where users had flagged factual errors, GPT-5.4's individual claims were 33% less likely to be false and its full responses were 18% less likely to contain any errors. Harvey's BigLaw Bench scored it at 91% for legal document work. This is the most factual OpenAI model to date.
GPT-5.2 retirement: June 5, 2026
GPT-5.2 Thinking stays available for three months in the Legacy Models section, then gets retired. Enterprise and Edu users need to enable GPT-5.4 early access via admin settings. Plan your migration now, not in May.
Prices verified March 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.
Last updated: March 5, 2026