OpenAI
GPT-5.4
GPT-5.4 is what happens when OpenAI stops releasing specialized models and puts everything into one. Released March 5, 2026, it combines the coding capabilities of GPT-5.3-Codex, the reasoning depth of GPT-5.2, and brand-new native computer use into a single frontier model. It beats human experts on 83% of GDPval knowledge-work tasks, surpasses human performance on OSWorld desktop navigation, and introduces tool search that cuts MCP token usage by 47%. The price tag went up ($2.50/M input vs $1.75 for GPT-5.2), but the model uses fewer tokens per task. For enterprise and agentic workloads, it's the obvious new default.
Context window
1.0M tokens
API (blended)
$5.63/1M
Consumer access
Free (limited) / $20/mo
Multimodal
Yes
Score Breakdown
75.1/100 → 7.5/10
Intelligence, Reliability, Speed, and Context are field-relative — scores shift as models are added. Accessibility and Trust are absolute checklists. Full methodology →
Strengths
- GDPval: 83% win/tie vs human experts across 44 occupations — 12 points above GPT-5.2
- OSWorld-Verified: 75% — first model to surpass human performance (72.4%) on desktop computer use
- Native computer use via screenshots + keyboard/mouse — no separate CUA model needed
- Tool search: 47% fewer tokens on MCP-heavy workflows — massive cost reduction for agentic pipelines
- 33% fewer false individual claims and 18% fewer error-containing responses vs GPT-5.2
- 1M context window (experimental in Codex) — enough for full production codebases
- SWE-Bench Pro: 57.7% — matches GPT-5.3-Codex while adding knowledge-work and computer-use capabilities
- Steerability: outlines its plan for complex queries so you can redirect mid-response
- BrowseComp: 82.7% — 17 points above GPT-5.2 for persistent web research
- GPQA Diamond: 92.8% — near state of the art on PhD-level science
Weaknesses
- API pricing 43% higher per input token than GPT-5.2 ($2.50 vs $1.75/M input)
- Pro tier ($200/mo) required for GPT-5.4 Pro maximum performance mode
- No Artificial Analysis independent measurements yet — benchmarks are provider-reported only
- Context window in ChatGPT unchanged from GPT-5.2 — 1M only available in Codex/API experimentally
- High cybersecurity capability classification carries over — same monitoring and blocking concerns as GPT-5.3-Codex
- GPT-5.2 retirement on June 5, 2026 — forced migration timeline for enterprise integrations
- Copyright litigation exposure (NYT + authors) still unresolved — data sovereignty risk persists
The first OpenAI model that can operate your computer
GPT-5.4 has native computer-use built in. It reads screenshots, clicks UI elements via coordinates, types into fields, and navigates desktop applications without a separate CUA model. On OSWorld-Verified it hits 75.0%, above human performance at 72.4%. This is not a gimmick. Mainstay reported 95% first-attempt success across 30K property tax portals, completing sessions 3x faster with 70% fewer tokens.
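OpenAI has not published GPT-5.4's computer-use action schema here, but the loop described above (read a screenshot, emit a click/type action, repeat) implies validating model output before dispatching it to the OS. A minimal sketch, assuming a hypothetical JSON action format (the real schema may differ):

```python
from dataclasses import dataclass

# Hypothetical action types a screenshot-driven agent loop might accept.
VALID_TYPES = {"click", "type", "keypress"}

@dataclass
class Action:
    kind: str
    x: int = 0
    y: int = 0
    text: str = ""

def parse_action(raw: dict, screen_w: int, screen_h: int) -> Action:
    """Validate a model-emitted action dict before dispatching it to the OS."""
    kind = raw.get("type")
    if kind not in VALID_TYPES:
        raise ValueError(f"unknown action type: {kind!r}")
    if kind == "click":
        x, y = int(raw["x"]), int(raw["y"])
        # Reject coordinates that fall outside the screenshot the model saw.
        if not (0 <= x < screen_w and 0 <= y < screen_h):
            raise ValueError("click coordinates outside screenshot bounds")
        return Action("click", x=x, y=y)
    if kind == "type":
        return Action("type", text=str(raw["text"]))
    return Action("keypress", text=str(raw["keys"]))
```

The validation step matters in practice: a hallucinated coordinate should fail loudly rather than click somewhere random on a production portal.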
Benchmark Performance
All benchmarks below are OpenAI-reported (xhigh reasoning effort). Artificial Analysis has not independently measured GPT-5.4 yet.
Professional & Knowledge Work
| Benchmark | GPT-5.4 | GPT-5.2 | GPT-5.3-Codex |
|---|---|---|---|
| GDPval (win/tie vs human experts) | 83.0% | 70.9% | 70.9% |
| Investment Banking Modeling (internal) | 87.3% | 68.4% | 79.3% |
| OfficeQA | 68.1% | 63.1% | 65.1% |
| FinanceAgent v1.1 | 56.0% | 59.5% | 54.0% |
GDPval is OpenAI's own benchmark spanning 44 occupations. The 12-point jump over GPT-5.2 is the largest single-generation improvement OpenAI has shown on this eval. FinanceAgent is the one benchmark where GPT-5.2 still leads.
Coding
| Benchmark | GPT-5.4 | GPT-5.3-Codex | GPT-5.2 |
|---|---|---|---|
| SWE-Bench Pro (Public) | 57.7% | 56.8% | 55.6% |
| Terminal-Bench 2.0 | 75.1% | 77.3% | 62.2% |
GPT-5.4 matches GPT-5.3-Codex on SWE-Bench while being a general-purpose model. Terminal-Bench is slightly lower than 5.3-Codex, but 13 points above GPT-5.2.
Computer Use & Vision
| Benchmark | GPT-5.4 | GPT-5.2 | Human |
|---|---|---|---|
| OSWorld-Verified (desktop navigation) | 75.0% | 47.3% | 72.4% |
| WebArena-Verified (browser use) | 67.3% | 65.4% | -- |
| MMMU Pro (no tools) | 81.2% | 79.5% | -- |
| OmniDocBench (doc parsing, lower = better) | 0.109 | 0.140 | -- |
The OSWorld jump from 47.3% to 75.0% is the headline number. GPT-5.2 could barely use a computer; GPT-5.4 is better at it than humans.
Science, Math & Reasoning
| Benchmark | GPT-5.4 | GPT-5.4 Pro | GPT-5.2 |
|---|---|---|---|
| GPQA Diamond (PhD science) | 92.8% | 94.4% | 92.4% |
| HLE (no tools) | 39.8% | 42.7% | 34.5% |
| HLE (with tools) | 52.1% | 58.7% | 45.5% |
| FrontierMath Tier 1-3 | 47.6% | 50.0% | 40.7% |
| ARC-AGI-1 (Verified) | 93.7% | 94.5% | 86.2% |
| ARC-AGI-2 (Verified) | 73.3% | 83.3% | 52.9% |
| Frontier Science Research | 33.0% | 36.7% | 25.2% |
ARC-AGI-2 is the standout: 73.3% vs 52.9% for GPT-5.2 is a 20-point jump on what's considered the hardest abstract reasoning benchmark. GPT-5.4 Pro pushes that to 83.3%.
Tool Search: The Efficiency Feature That Matters Most
If you're building agents with many tools, this is the single most important new feature.
47% fewer tokens on MCP-heavy workloads
Previously, all tool definitions were stuffed into the prompt upfront. Tool search gives the model a lightweight list and lets it look up full definitions only when needed. On 250 MCP Atlas tasks with all 36 MCP servers enabled, this cut total token usage by 47% with identical accuracy. For production agentic pipelines, this directly reduces cost and latency.
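The pattern is easy to sketch: put only a lightweight index (name plus one-line description) in the prompt, and expand a tool's full JSON schema on demand. The tool names and schemas below are invented for illustration, and the token estimate is a crude characters-per-token heuristic:

```python
import json

# Invented example tools; real MCP servers expose their own definitions.
FULL_DEFINITIONS = {
    "search_tickets": {
        "name": "search_tickets",
        "description": "Search support tickets by keyword and status.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "status": {"type": "string", "enum": ["open", "closed"]},
            },
            "required": ["query"],
        },
    },
    "create_invoice": {
        "name": "create_invoice",
        "description": "Create a draft invoice for a customer.",
        "parameters": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "string"},
                "amount_cents": {"type": "integer"},
            },
            "required": ["customer_id", "amount_cents"],
        },
    },
}

def lightweight_index(defs):
    """What goes into the prompt upfront: names and one-line descriptions only."""
    return [{"name": d["name"], "description": d["description"]} for d in defs.values()]

def lookup_tool(name, defs):
    """Expanded on demand, only when the model decides it needs this tool."""
    return defs[name]

def approx_tokens(obj):
    """Crude estimate: roughly 4 characters of serialized JSON per token."""
    return len(json.dumps(obj)) // 4
```

With two tools the savings are modest; across 36 MCP servers' worth of schemas, the gap between the index and the full definitions is where the 47% reduction comes from.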
Tool Use Benchmarks
| Benchmark | GPT-5.4 | GPT-5.2 |
|---|---|---|
| BrowseComp (persistent web search) | 82.7% | 65.8% |
| MCP Atlas (36 MCP servers) | 67.2% | 60.6% |
| Toolathlon (multi-step tool use) | 54.6% | 45.7% |
| τ²-bench Telecom (no reasoning) | 64.3% | 57.2% |
BrowseComp improved 17 points, meaning GPT-5.4 is substantially better at finding obscure information across the web. GPT-5.4 Pro pushes BrowseComp to 89.3%.
Pricing: Higher Per Token, Potentially Cheaper Per Task
The sticker price went up. Whether you pay more depends on how many tokens you burn.
API Pricing Comparison
| Model | Input/1M | Cached Input/1M | Output/1M |
|---|---|---|---|
| GPT-5.4 | $2.50 | $0.25 | $15.00 |
| GPT-5.2 | $1.75 | $0.175 | $14.00 |
| GPT-5.4 Pro | $30.00 | -- | $180.00 |
| GPT-5.2 Pro | $21.00 | -- | $168.00 |
GPT-5.4 is 43% more expensive on input and 7% more on output. But OpenAI claims it's 'significantly' more token-efficient than GPT-5.2 on reasoning tasks. Tool search alone saves 47% of tokens on MCP workloads. Whether the net cost goes up or down depends on your use case.
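The break-even point is easy to compute from the table above. Assuming an illustrative task shape of 20K input and 4K output tokens (your ratio will differ), GPT-5.4 needs to use roughly 17% fewer tokens than GPT-5.2 to match it on cost:

```python
def cost_per_task(in_tok, out_tok, in_price, out_price):
    """Dollar cost for one task; prices are per 1M tokens."""
    return (in_tok * in_price + out_tok * out_price) / 1_000_000

# Illustrative task shape: 20K input tokens, 4K output tokens.
old = cost_per_task(20_000, 4_000, 1.75, 14.00)  # GPT-5.2 rates
new = cost_per_task(20_000, 4_000, 2.50, 15.00)  # GPT-5.4 rates

# Token-efficiency factor GPT-5.4 needs to hit to match GPT-5.2's cost:
breakeven = old / new  # ~0.83, i.e. ~17% fewer tokens per task
```

If tool search alone cuts MCP workloads by 47% (a factor of 0.53), such tasks come out well under break-even; pure chat workloads with no efficiency gain simply cost more.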
Batch and Flex pricing: half price
Both Batch and Flex processing are available at 50% of the standard API rate. Priority processing (the /fast mode equivalent in the API) runs at 2x. For non-latency-sensitive workloads, the effective cost drops to $1.25/M input and $7.50/M output.
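The tier multipliers reduce to a small lookup table. A sketch, using the GPT-5.4 standard rates from the comparison table above:

```python
# GPT-5.4 standard API rates, dollars per 1M tokens.
STANDARD = {"input": 2.50, "cached_input": 0.25, "output": 15.00}

# Multipliers per processing tier: Batch/Flex at half price, priority at 2x.
TIERS = {"standard": 1.0, "batch": 0.5, "flex": 0.5, "priority": 2.0}

def effective_rates(tier):
    """Per-1M-token rates for a given processing tier."""
    mult = TIERS[tier]
    return {kind: round(price * mult, 4) for kind, price in STANDARD.items()}
```

`effective_rates("batch")` reproduces the $1.25/M input and $7.50/M output figures quoted above.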
New: steer the model mid-response
GPT-5.4 Thinking in ChatGPT now shows an upfront plan before generating long responses. You can redirect it while it's working, which means fewer wasted responses and less back-and-forth. Available now on chatgpt.com and Android; iOS coming soon.
33% fewer false claims than GPT-5.2
On a dataset of prompts where users had flagged factual errors, GPT-5.4's individual claims were 33% less likely to be false and its full responses were 18% less likely to contain any errors. Harvey's BigLaw Bench scored it at 91% for legal document work. This is the most factual OpenAI model to date.
GPT-5.2 retirement: June 5, 2026
GPT-5.2 Thinking stays available for three months in the Legacy Models section, then gets retired. Enterprise and Edu users need to enable GPT-5.4 early access via admin settings. Plan your migration now, not in May.
Prices verified March 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.
Last updated: March 5, 2026