
OpenAI

GPT-5.4

7.5 out of 10

GPT-5.4 is what happens when OpenAI stops releasing specialized models and puts everything into one. Released March 5, 2026, it combines the coding capabilities of GPT-5.3-Codex, the reasoning depth of GPT-5.2, and brand-new native computer-use into a single frontier model. It beats human experts on 83% of GDPval knowledge-work tasks, surpasses human performance on OSWorld desktop navigation, and introduces tool search that cuts MCP token usage by 47%. The price tag went up ($2.50/M input vs $1.75 for GPT-5.2), but the model uses fewer tokens per task. For enterprise and agentic workloads, it's the obvious new default.

Context window: 1.0M tokens
API (blended): $5.63/1M
Consumer access: Free (limited) / $20/mo
Multimodal: Yes

Score Breakdown

Total: 75.1/100 → 7.5/10

Intelligence, Reliability, Speed, and Context are field-relative — scores shift as models are added. Accessibility and Trust are absolute checklists. Full methodology →

Strengths

  • +GDPval: 83% win/tie vs human experts across 44 occupations — 12 points above GPT-5.2
  • +OSWorld-Verified: 75% — first model to surpass human performance (72.4%) on desktop computer use
  • +Native computer-use via screenshots + keyboard/mouse — no separate CUA model needed
  • +Tool search: 47% fewer tokens on MCP-heavy workflows — massive cost reduction for agentic pipelines
  • +33% fewer false individual claims and 18% fewer error-containing responses vs GPT-5.2
  • +1M context window (experimental in Codex) — enough for full production codebases
  • +SWE-Bench Pro: 57.7% — matches GPT-5.3-Codex while adding knowledge-work and computer-use capabilities
  • +Steerability: outlines its plan for complex queries so you can redirect mid-response
  • +BrowseComp: 82.7% — 17 points above GPT-5.2 for persistent web research
  • +GPQA Diamond: 92.8% — near state of the art on PhD-level science

Weaknesses

  • -API pricing 43% higher per input token than GPT-5.2 ($2.50 vs $1.75/M input)
  • -Pro tier ($200/mo) required for GPT-5.4 Pro maximum performance mode
  • -No Artificial Analysis independent measurements yet — benchmarks are provider-reported only
  • -Context window in ChatGPT unchanged from GPT-5.2 — 1M only available in Codex/API experimentally
  • -High cybersecurity capability classification carries over — same monitoring and blocking concerns as GPT-5.3-Codex
  • -GPT-5.2 retirement on June 5, 2026 — forced migration timeline for enterprise integrations
  • -Copyright litigation exposure (NYT + authors) still unresolved — data sovereignty risk persists

Best for

  • professional knowledge work (spreadsheets, presentations, documents)
  • computer-use agents that operate across desktop applications
  • agentic pipelines with large MCP tool ecosystems
  • autonomous coding workflows (via Codex)
  • persistent web research requiring multi-source synthesis
  • enterprise automation with high factual accuracy requirements

Not ideal for

  • budget-sensitive API workloads (GPT-5.2 is 43% cheaper per input token)
  • tasks that only need chat speed (GPT-5.3 Instant is faster and cheaper)
  • workflows requiring independently verified benchmark data right now
  • organizations with strict cybersecurity procurement requirements

The first OpenAI model that can operate your computer

GPT-5.4 has native computer-use built in. It reads screenshots, clicks UI elements via coordinates, types into fields, and navigates desktop applications without a separate CUA model. On OSWorld-Verified it hits 75.0%, above human performance at 72.4%. This is not a gimmick. Mainstay reported 95% first-attempt success across 30K property tax portals, completing sessions 3x faster with 70% fewer tokens.
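The screenshot-in, action-out loop described above can be sketched roughly as follows. This is a hypothetical illustration only: the actual API shapes, action schema, and function names are not documented in this review, so the model call is replaced with a stub.

```python
# Minimal sketch of a screenshot -> action agent loop of the kind native
# computer-use implies. fake_model and the action dict schema are invented
# stand-ins, not OpenAI's real interface.

def fake_model(screenshot, goal):
    """Stand-in for the model: looks at a screenshot, returns one UI action."""
    if screenshot == 0:
        return {"type": "click", "x": 640, "y": 360}  # e.g. click a button
    return {"type": "done"}

def run_agent(goal, max_turns=5):
    actions = []
    for turn in range(max_turns):
        screenshot = turn              # placeholder for a real screen capture
        action = fake_model(screenshot, goal)
        if action["type"] == "done":
            break
        actions.append(action)         # a real agent would click/type here
    return actions

print(run_agent("open the tax portal"))
```

The key design point is that the same model both perceives (screenshots) and acts (coordinates/keystrokes), so no separate CUA model sits in the loop.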

Benchmark Performance

All benchmarks below are OpenAI-reported (xhigh reasoning effort). Artificial Analysis has not independently measured GPT-5.4 yet.

Professional & Knowledge Work

Benchmark                                GPT-5.4   GPT-5.2   GPT-5.3-Codex
GDPval (win/tie vs human experts)        83.0%     70.9%     70.9%
Investment Banking Modeling (internal)   87.3%     68.4%     79.3%
OfficeQA                                 68.1%     63.1%     65.1%
FinanceAgent v1.1                        56.0%     59.5%     54.0%

GDPval is OpenAI's own benchmark spanning 44 occupations. The 12-point jump over GPT-5.2 is the largest single-generation improvement OpenAI has shown on this eval. FinanceAgent is the one benchmark where GPT-5.2 still leads.

Coding

Benchmark                GPT-5.4   GPT-5.3-Codex   GPT-5.2
SWE-Bench Pro (Public)   57.7%     56.8%           55.6%
Terminal-Bench 2.0       75.1%     77.3%           62.2%

GPT-5.4 matches GPT-5.3-Codex on SWE-Bench while being a general-purpose model. Terminal-Bench is slightly lower than 5.3-Codex, but 13 points above GPT-5.2.

Computer Use & Vision

Benchmark                                   GPT-5.4   GPT-5.2   Human
OSWorld-Verified (desktop navigation)       75.0%     47.3%     72.4%
WebArena-Verified (browser use)             67.3%     65.4%     --
MMMU Pro (no tools)                         81.2%     79.5%     --
OmniDocBench (doc parsing, lower = better)  0.109     0.140     --

The OSWorld jump from 47.3% to 75.0% is the headline number. GPT-5.2 could barely use a computer; GPT-5.4 is better at it than humans.

Science, Math & Reasoning

Benchmark                    GPT-5.4   GPT-5.4 Pro   GPT-5.2
GPQA Diamond (PhD science)   92.8%     94.4%         92.4%
HLE (no tools)               39.8%     42.7%         34.5%
HLE (with tools)             52.1%     58.7%         45.5%
FrontierMath Tier 1-3        47.6%     50.0%         40.7%
ARC-AGI-1 (Verified)         93.7%     94.5%         86.2%
ARC-AGI-2 (Verified)         73.3%     83.3%         52.9%
Frontier Science Research    33.0%     36.7%         25.2%

ARC-AGI-2 is the standout: 73.3% vs 52.9% for GPT-5.2 is a 20-point jump on what's considered the hardest abstract reasoning benchmark. GPT-5.4 Pro pushes that to 83.3%.

Tool Search: The Efficiency Feature That Matters Most

If you're building agents with many tools, this is the single most important new feature.

47% fewer tokens on MCP-heavy workloads

Previously, all tool definitions were stuffed into the prompt upfront. Tool search gives the model a lightweight list and lets it look up full definitions only when needed. On 250 MCP Atlas tasks with all 36 MCP servers enabled, this cut total token usage by 47% with identical accuracy. For production agentic pipelines, this directly reduces cost and latency.
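The mechanism described above can be sketched as a lazy tool registry: the prompt carries only a terse index of tool names, and full schemas are resolved on demand. The tool names and schemas below are invented for illustration; they are not real MCP server definitions.

```python
# Hypothetical sketch of the "tool search" pattern: expose a lightweight
# index upfront, resolve full definitions only when the model asks.

FULL_DEFINITIONS = {
    "jira.create_issue": {
        "description": "Create a Jira issue in a given project.",
        "parameters": {"project": "string", "summary": "string"},
    },
    "gdrive.search_files": {
        "description": "Search Google Drive files by query string.",
        "parameters": {"query": "string", "max_results": "integer"},
    },
}

def tool_index():
    """Lightweight list sent in the prompt: names only, no schemas."""
    return sorted(FULL_DEFINITIONS)

def tool_search(keyword):
    """Resolve full definitions only for tools matching the model's query."""
    return {name: spec for name, spec in FULL_DEFINITIONS.items()
            if keyword.lower() in name.lower()}

# The prompt carries two short names instead of two full schemas; across
# 36 MCP servers that upfront saving compounds into the reported 47% cut.
print(tool_index())
print(tool_search("jira"))
```

The trade-off is one extra lookup round-trip per unfamiliar tool, which the benchmark numbers suggest costs no accuracy.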

Tool Use Benchmarks

Benchmark                            GPT-5.4   GPT-5.2
BrowseComp (persistent web search)   82.7%     65.8%
MCP Atlas (36 MCP servers)           67.2%     60.6%
Toolathlon (multi-step tool use)     54.6%     45.7%
τ²-bench Telecom (no reasoning)      64.3%     57.2%

BrowseComp improved 17 points, meaning GPT-5.4 is substantially better at finding obscure information across the web. GPT-5.4 Pro pushes BrowseComp to 89.3%.

Pricing: Higher Per Token, Potentially Cheaper Per Task

The sticker price went up. Whether you pay more depends on how many tokens you burn.

API Pricing Comparison

Model         Input/1M   Cached Input/1M   Output/1M
GPT-5.4       $2.50      $0.25             $15.00
GPT-5.2       $1.75      $0.175            $14.00
GPT-5.4 Pro   $30.00     --                $180.00
GPT-5.2 Pro   $21.00     --                $168.00

GPT-5.4 is 43% more expensive on input and 7% more on output. But OpenAI claims it's 'significantly' more token-efficient than GPT-5.2 on reasoning tasks. Tool search alone saves 47% of tokens on MCP workloads. Whether the net cost goes up or down depends on your use case.

Batch and Flex pricing: half price

Both Batch and Flex processing are available at 50% of the standard API rate. Priority processing (the /fast mode equivalent in the API) runs at 2x. For non-latency-sensitive workloads, the effective cost drops to $1.25/M input and $7.50/M output.
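The per-task arithmetic above can be made concrete with a small cost helper. The rates come from the pricing table; the per-task token counts are invented placeholders (substitute your own measurements), and the 47% input reduction is applied as a what-if scenario.

```python
# Back-of-envelope cost comparison using the pricing table above.
# Token counts are hypothetical; rates are USD per 1M tokens.

RATES = {  # (input, output) per 1M tokens
    "gpt-5.2": (1.75, 14.00),
    "gpt-5.4": (2.50, 15.00),
}

def task_cost(model, input_tokens, output_tokens, batch=False):
    rate_in, rate_out = RATES[model]
    cost = (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000
    return cost * 0.5 if batch else cost  # Batch/Flex: 50% of standard rate

# Suppose tool search cuts an MCP-heavy task's input from 200K to 106K
# tokens (the reported 47% reduction), with output unchanged at 8K.
old = task_cost("gpt-5.2", 200_000, 8_000)
new = task_cost("gpt-5.4", 106_000, 8_000)
print(f"GPT-5.2: ${old:.3f}  GPT-5.4: ${new:.3f}")
# In this scenario GPT-5.4 comes out cheaper per task despite higher rates.
```

Running the same numbers with `batch=True` halves both figures, which is where the $1.25/M effective input rate comes from.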

New: steer the model mid-response

GPT-5.4 Thinking in ChatGPT now shows an upfront plan before generating long responses. You can redirect it while it's working, which means fewer wasted responses and less back-and-forth. Available now on chatgpt.com and Android; iOS coming soon.

33% fewer false claims than GPT-5.2

On a dataset of prompts where users had flagged factual errors, GPT-5.4's individual claims were 33% less likely to be false and its full responses were 18% less likely to contain any errors. Harvey's BigLaw Bench scored it at 91% for legal document work. This is the most factual OpenAI model to date.

GPT-5.2 retirement: June 5, 2026

GPT-5.2 Thinking stays available for three months in the Legacy Models section, then gets retired. Enterprise and Edu users need to enable GPT-5.4 early access via admin settings. Plan your migration now, not in May.

Pricing details

Subscription plans

  • Plus — $20/mo: GPT-5.4 Thinking, DALL-E, browsing, Advanced Data Analysis (message limits apply; xhigh reasoning not available)
  • Pro — $200/mo: GPT-5.4 Pro mode, unlimited GPT-5.4 Thinking, extended reasoning, priority access
  • Team — $25/mo (annual): all Plus features, admin console, shared workspace, higher rate limits
  • Enterprise / Edu — custom pricing via sales: full GPT-5.4 access, SOC 2, zero data retention, admin dashboard (early access toggle in admin settings)

API pricing

  • OpenAI — $2.50/$15 per 1M: standard mode. Cached input: 90% discount ($0.25/M). Batch/Flex: 50% off. Priority processing: 2x rate. GPT-5.4 Pro: $30/$180 per 1M tokens.
  • OpenRouter — $2.60/$15.40 per 1M: slight markup over direct OpenAI pricing. Verify at openrouter.ai.

Prices verified March 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.

Last updated: March 5, 2026

Benchmark sources: OpenAI, "Introducing GPT-5.4"