openai · gpt-5.4 · model release · review

GPT-5.4 Is Here: What Actually Changed (and What Didn't)

March 6, 2026 · 6 min read

OpenAI released GPT-5.4 today. It's the first time they've combined their coding model (GPT-5.3-Codex), their reasoning model (GPT-5.2), and native computer-use capabilities into a single release. The pitch: one model that handles professional knowledge work, autonomous coding, and operating your actual computer.

Here's what matters and what doesn't.

83% · GDPval win rate
75% · OSWorld (beats humans)
47% · token savings (tool search)

The numbers that matter

GDPval jumped from 70.9% to 83.0%. That's the benchmark where AI attempts real knowledge work across 44 professional occupations, and human experts judge whether it passes. A 12-point improvement in one generation is large. OpenAI claims it matches or exceeds professionals in 83% of comparisons now.

GPT-5.4       83.0%
GPT-5.4 Pro   82.0%
GPT-5.2 Pro   74.1%
GPT-5.2       70.9%

GDPval: win-or-tie rate vs human industry experts across 44 occupations. Source: OpenAI.

BrowseComp hit 82.7%, up from 65.8%. If you use ChatGPT for research that requires pulling information from multiple obscure sources, this is a meaningful upgrade. The Pro variant pushes it to 89.3%.

Computer use is the real story

Previous OpenAI models needed a separate "Computer-Using Agent" model for anything involving screen interaction. GPT-5.4 does it natively. It reads screenshots, identifies UI elements, and clicks/types via coordinate-based actions.

GPT-5.4   75.0%
Human     72.4%
GPT-5.2   47.3%

OSWorld-Verified: desktop computer navigation via screenshots plus mouse/keyboard actions. GPT-5.4 surpasses human performance. Source: OpenAI.

Mainstay (a property management company) reported 95% success rate on first attempt across 30,000 HOA and property tax portals, with 100% within three attempts. Sessions ran 3x faster and used 70% fewer tokens than their previous CUA setup.

For developers building browser automation or desktop agents, this changes the architecture. You don't need a specialized vision model in your pipeline anymore. One API call handles reasoning, tool use, and screen navigation.
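The single-model loop is straightforward to sketch. Below is a minimal, hypothetical agent loop with the model call and the OS layer stubbed out; the exact action schema (click/type with pixel coordinates) is an assumption for illustration, not OpenAI's actual API.

```python
# Hypothetical computer-use agent loop. `Desktop` stubs the real OS layer
# (screenshot capture + input injection) and `fake_model` stands in for the
# model call; both are illustrative, not real APIs.
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str              # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

@dataclass
class Desktop:
    """Stub for the OS layer; records actions instead of performing them."""
    log: list = field(default_factory=list)

    def screenshot(self) -> bytes:
        return b"<png bytes>"

    def click(self, x: int, y: int) -> None:
        self.log.append(("click", x, y))

    def type_text(self, text: str) -> None:
        self.log.append(("type", text))

def fake_model(screenshot: bytes, step: int) -> Action:
    """Stand-in for the model; returns a scripted action per step."""
    script = [Action("click", 640, 360),
              Action("type", text="annual report"),
              Action("done")]
    return script[step]

def run_agent(desktop: Desktop, max_steps: int = 10) -> None:
    # One loop: screenshot -> model -> dispatch coordinate-based action.
    for step in range(max_steps):
        action = fake_model(desktop.screenshot(), step)
        if action.kind == "done":
            break
        elif action.kind == "click":
            desktop.click(action.x, action.y)
        elif action.kind == "type":
            desktop.type_text(action.text)

desktop = Desktop()
run_agent(desktop)
print(desktop.log)
```

The structural point is the one the article makes: with native computer use, the same loop that does reasoning and tool calls also emits screen actions, so there is no handoff to a second vision model.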

Tool search saves money on MCP workloads

This is the feature enterprise teams should pay attention to. Previously, every tool definition got stuffed into the context window on every request. If you had 36 MCP servers with hundreds of functions, that could be tens of thousands of tokens before the model even started thinking.

Tool search reduced total token usage by 47% on 250 MCP Atlas benchmark tasks with all 36 MCP servers enabled, with identical accuracy. GPT-5.4 gets a lightweight index of available tools and looks up full definitions only when needed. For anyone running agentic pipelines with many tools, this is a direct cost reduction.
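The pattern behind tool search can be sketched in a few lines: send the model a compact index of tool names and one-line descriptions, and resolve a tool's full JSON schema only when the model actually selects it. Everything below is illustrative (toy tool definitions, a crude ~4-characters-per-token estimate), not OpenAI's API.

```python
# Hypothetical sketch of the tool-search pattern: a lightweight index up
# front, full definitions looked up on demand. Tool names and schemas here
# are made up for illustration.

def rough_tokens(text: str) -> int:
    """Crude token estimate (~4 characters per token)."""
    return max(1, len(text) // 4)

# Full definitions, as they would normally be sent on every request.
FULL_DEFINITIONS = {
    "create_invoice": {
        "description": "Create an invoice for a customer account.",
        "parameters": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "string"},
                "line_items": {"type": "array", "items": {"type": "object"}},
                "due_date": {"type": "string", "format": "date"},
            },
            "required": ["customer_id", "line_items"],
        },
    },
    "lookup_parcel": {
        "description": "Look up a property tax parcel by number.",
        "parameters": {
            "type": "object",
            "properties": {"parcel_number": {"type": "string"}},
            "required": ["parcel_number"],
        },
    },
}

def build_index(defs: dict) -> str:
    """Lightweight index: names and descriptions only."""
    return "\n".join(f"{name}: {d['description']}" for name, d in defs.items())

def resolve(name: str, defs: dict) -> dict:
    """Fetch the full schema only once a tool is chosen."""
    return defs[name]

full_cost = rough_tokens(str(FULL_DEFINITIONS))
index_cost = rough_tokens(build_index(FULL_DEFINITIONS))
print(f"full: ~{full_cost} tokens, index: ~{index_cost} tokens")
```

With two toy tools the gap is already visible; multiply by 36 MCP servers with hundreds of functions and the savings scale the way OpenAI's 47% figure suggests.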

Coding: matches 5.3-Codex, adds everything else

SWE-Bench Pro: 57.7% (vs 56.8% for GPT-5.3-Codex). Terminal-Bench 2.0: 75.1% (vs 77.3% for 5.3-Codex). The coding performance is essentially the same, which makes sense since OpenAI says the coding capabilities are directly incorporated from 5.3-Codex.

GPT-5.4      57.7%
5.3-Codex    56.8%
GPT-5.2      55.6%

SWE-Bench Pro (Public): real-world software engineering tasks. GPT-5.4 matches the coding specialist. Source: OpenAI.

The difference: GPT-5.4 also does professional knowledge work, computer use, and multimodal reasoning. GPT-5.3-Codex was a specialist. GPT-5.4 is a generalist that codes at the same level.

Cursor's VP of Developer Education called it "the leader on our internal benchmarks" and noted it's "more natural and assertive than previous models."

Science and reasoning

The abstract reasoning numbers are the most impressive part of the release that nobody will talk about. ARC-AGI-2 went from 52.9% to 73.3%. That's a 20-point jump on what's considered the hardest abstract reasoning benchmark in the field. GPT-5.4 Pro pushes it to 83.3%.

ARC-AGI-2:      GPT-5.4 73.3% vs GPT-5.2 52.9%
GPQA Diamond:   GPT-5.4 92.8% vs GPT-5.2 92.4%
HLE (tools):    GPT-5.4 52.1% vs GPT-5.2 45.5%

GPT-5.4 vs GPT-5.2 on science and reasoning benchmarks, all at xhigh reasoning effort. Source: OpenAI.

GPQA Diamond barely moved (92.8% vs 92.4%). That benchmark may be hitting a ceiling for current architectures. HLE with tools went from 45.5% to 52.1%, a solid but not dramatic improvement.

Hallucinations are down

On a dataset of prompts where users had previously flagged factual errors: 33% fewer false claims per response, 18% fewer responses containing any errors. Harvey's BigLaw Bench scored it at 91% for legal document accuracy.

33% · fewer false claims
18% · fewer error responses
91% · BigLaw Bench score

That tracks with GPT-5.2's hallucination rate already being under 1% with browsing active. The trend line is good.

What about the price?

API input went from $1.75/M tokens (GPT-5.2) to $2.50/M (GPT-5.4). That's a 43% increase. Output went from $14 to $15 per million, a 7% bump. The Pro variant is $30/$180, up from $21/$168.

Model           Input /1M   Cached /1M   Output /1M
GPT-5.4         $2.50       $0.25        $15.00
GPT-5.2         $1.75       $0.175      $14.00
GPT-5.4 Pro     $30.00      n/a          $180.00
GPT-5.2 Pro     $21.00      n/a          $168.00

OpenAI's counter-argument: GPT-5.4 uses fewer reasoning tokens than GPT-5.2 to solve the same problems, so total cost per task may drop even though per-token price went up. Tool search saving 47% of tokens on MCP workloads helps too. Whether you end up paying more or less depends on what you're doing.

Batch and Flex pricing cuts everything in half: $1.25/M input, $7.50/M output. If latency doesn't matter, that's cheaper than GPT-5.2 at standard rates.
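A quick back-of-envelope comparison makes the trade-off concrete. The prices are the list prices above; the token counts (and the assumption that GPT-5.4 needs roughly 30% fewer output/reasoning tokens for the same task) are illustrative figures, not from OpenAI.

```python
# Cost-per-task comparison at $/1M-token list prices. Token counts are
# made-up assumptions; the point is that a lower reasoning-token count can
# offset a higher per-token price.

def task_cost(input_tokens: int, output_tokens: int,
              in_price: float, out_price: float) -> float:
    """Dollar cost for one task, given $/1M-token rates."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Same hypothetical task: 20K input tokens; GPT-5.4 assumed to emit ~30%
# fewer output/reasoning tokens (7K vs 10K).
gpt_5_2       = task_cost(20_000, 10_000, 1.75, 14.00)
gpt_5_4       = task_cost(20_000,  7_000, 2.50, 15.00)
gpt_5_4_batch = task_cost(20_000,  7_000, 1.25,  7.50)

print(f"GPT-5.2 standard: ${gpt_5_2:.4f}")
print(f"GPT-5.4 standard: ${gpt_5_4:.4f}")
print(f"GPT-5.4 batch:    ${gpt_5_4_batch:.4f}")
```

Under these assumed token counts the per-task cost comes out slightly lower on GPT-5.4 despite the higher rates, and Batch/Flex undercuts both; if your workload doesn't see the reasoning-token reduction, the comparison flips.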

Steerability: a small thing that'll save time

GPT-5.4 Thinking now shows you its plan before generating a long response. You can jump in and redirect it while it's working. No more waiting 3 minutes for a detailed answer that went in the wrong direction and then starting over.

Available on chatgpt.com and Android now. iOS coming soon.

The fine print

The 1M context window is experimental. It's only available in Codex, and requests exceeding the standard 272K window cost 2x. In ChatGPT, context stays the same as GPT-5.2. Don't plan around 1M context for production workloads yet.

GPT-5.2 Thinking gets retired June 5, 2026. It'll stay in a Legacy Models section for three months, then it's gone. If you're on Enterprise or Edu, you need to toggle early access in admin settings.

The "High cybersecurity capability" classification from GPT-5.3-Codex carries forward. Same expanded monitoring, trusted access controls, and async blocking for higher-risk requests on zero-data-retention surfaces. Some false positives in blocking are expected as classifiers improve.

Independent benchmarks: not yet

Every number in this article comes from OpenAI. Artificial Analysis hasn't measured GPT-5.4 yet. We'll update our GPT-5.4 review and rankings once independent data arrives. Until then, take the benchmarks directionally. OpenAI has a decent track record of honest reporting, but provider-reported numbers consistently run higher than independent measurements.

Who should switch

If you're on GPT-5.2 and use ChatGPT for professional work (documents, spreadsheets, research), GPT-5.4 is a clear upgrade. The GDPval jump alone is worth it.

If you're building agents that need computer use, GPT-5.4 eliminates the need for a separate CUA model. Simpler architecture, fewer failure modes.

If you're running agentic pipelines with many MCP tools, tool search will cut your token bill.

If you're happy with GPT-5.3-Codex for pure coding and don't need knowledge work or computer use, there's no compelling reason to switch. The coding performance is nearly identical.

If you're on Claude Opus 4.6 or Gemini 3.1 Pro, wait for independent benchmarks before making any decisions. Provider-reported numbers are not directly comparable across companies.

Comparing AI models? See our LLM comparisons →