OpenAI
GPT-5.2
Top Pick
Released December 11, 2025 under the internal codename 'Garlic', GPT-5.2 is OpenAI's response to a four-week competitive blitz from Google and Anthropic. It beats or ties human industry experts on 70.9% of GDPval knowledge-work tasks at 11× the speed and less than 1% of the cost. The 5-tier thinking budget — from instant responses to 10-minute deep reasoning — is the defining architectural feature. For most professional use cases, it's the safest default choice in the market.
Context window
400K tokens
API (blended)
$4.81/1M
Consumer access
Free (limited) / $20/mo
Multimodal
Yes
Score Breakdown
74.6/100 → 7.5/10
Intelligence, Reliability, Speed, and Context are field-relative — scores shift as models are added. Accessibility and Trust are absolute checklists. Full methodology →
Strengths
- GDPval: beats/ties human experts on 70.9% of professional knowledge-work tasks
- AIME 2025: 100% without tools — best math performance of any frontier model
- Hallucination rate <1% with browsing active — a significant improvement over GPT-5.1
- 90% cached input discount ($0.175/1M) — up to 887% ROI on repeat-context pipelines
- 5 thinking tiers from instant (<1s) to xhigh (5–10 min) — match compute to task
- Largest developer ecosystem: Cursor, GitHub Copilot, Azure — all OpenAI-first
- 128K output tokens — full codebases, legal docs, e-books in a single pass
Weaknesses
- xhigh reasoning (Pro tier) locked behind $200/month — not available on Plus
- No native API video input — video generation requires Sora 2
- Context window (400K) smaller than Gemini 3.1 Pro (1M) and Llama 4 Scout (10M)
- In-context scheming documented in safety evals — a real concern for autonomous agents
- Copyright litigation exposure (NYT + authors) — data sovereignty risk for enterprise
- EU AI Act compliance deadline August 2026 — potential disruption for EU deployments
Five-Tier Thinking Budget
Set via reasoning.effort in the API. Matching the tier to the task is the single biggest lever on quality and cost.
| Tier | Latency | Best for | Access |
|---|---|---|---|
| None (Instant) | < 1 sec | Fact retrieval, formatting, syntax completion | All plans |
| Low | 2–5 sec | Rapid multi-step logic, light coding | All plans |
| Medium (default) | 15–30 sec | Data analysis, document editing, research summaries | All plans |
| High | 60–120 sec | Complex research, multi-file refactoring, obscure bugs | Plus & Pro |
| xhigh | 5–10 min | Novel math proofs, large-scale architecture, GDPval tasks | Pro only ($200/mo) |
Latency trade-off: for simple queries like 'what TypeScript type is this?', GPT-5.1 still returns faster. The deeper tiers pay off on real enterprise tickets, not chat-style questions. Legacy temperature, top_p, and logprobs params are restricted to 'none' effort — the architecture shifted away from probabilistic sampling toward deterministic reasoning.
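The tier-to-task mapping in the table above can be encoded as a small routing helper. A minimal sketch — the task categories and the mapping are illustrative, not an official API surface; only the five effort values themselves come from the table:

```python
# Map task categories (from the tier table) to a reasoning-effort value.
# The category names are illustrative; the five effort values are the
# documented tiers: none, low, medium, high, xhigh.
EFFORT_BY_TASK = {
    "fact_retrieval": "none",
    "formatting": "none",
    "light_coding": "low",
    "data_analysis": "medium",
    "document_editing": "medium",
    "multi_file_refactor": "high",
    "novel_proof": "xhigh",
}

def pick_effort(task: str) -> str:
    """Return the reasoning tier for a task, defaulting to 'medium'."""
    return EFFORT_BY_TASK.get(task, "medium")

# In an API call this would plug in roughly as (hypothetical sketch):
#   client.responses.create(
#       model="gpt-5.2",
#       reasoning={"effort": pick_effort("multi_file_refactor")},
#       input=prompt,
#   )
```

Routing explicitly like this, rather than defaulting everything to medium, is what turns the tier system into the cost lever the section describes.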
Benchmark Performance
Pass@1, single-attempt. xhigh tier unless noted.
Knowledge & Science (AA-measured)
| Benchmark | GPT-5.2 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| GPQA Diamond (PhD science) | 90.3% | 84.0% | 94.1% |
| HLE — standard mode | 35.4% | 18.6% | 44.7% |
All scores independently measured by Artificial Analysis in standard mode — consistent methodology, no extended thinking. Gemini 3.1 Pro leads both. Note: provider-reported HLE scores (GPT-5.2 Pro mode: 50%, Claude with search: 53%) are higher but not comparable across models.
Coding & Tool Use (AA-measured)
| Benchmark | GPT-5.2 | GPT-5.3-Codex | Claude Opus 4.6 |
|---|---|---|---|
| τ²-bench (tool use & agents) | 84.8% | 90.9% | 84.8% |
| LiveCodeBench (coding accuracy) | 88.9% | — | — |
All scores independently measured by Artificial Analysis (standard mode). τ²-bench tests multi-turn agentic tool use. LiveCodeBench tests competitive programming accuracy.
On GDPval — read the fine print
GDPval is OpenAI's own benchmark. 70.9% win/tie vs human experts sounds definitive, but critics note models frequently lose not from lack of intelligence but from hallucinated reference data and ignored formatting constraints. Third-party replication has not been published. Use this number directionally, not as gospel.
Multimodal Capabilities
GPT-5.2 is natively multimodal for text, audio, and images. Video is more limited than competitors.
| Modality | Capability | Notes |
|---|---|---|
| Text | Input & output | Native — 400K input, 128K output |
| Audio | Real-time input & output | Advanced Voice Mode (AVM) — no speech-to-text intermediary, captures tone/pace/pitch |
| Image | Input only | Visual parsing, UI mockup-to-code, diagram analysis, bounding boxes |
| Video | Live streaming via AVM only | No standard API video input/generation — Sora 2 required for video synthesis |
Gemini 3.1 Pro has a meaningful edge here: native video input/output across all modalities without a separate product. If your workflow involves video processing or multimodal pipelines, evaluate Gemini alongside GPT-5.2.
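For image input specifically, requests follow the standard chat-style multimodal message shape. A hedged sketch showing only message construction (no network call); the prompt and URL are placeholders:

```python
def build_image_message(prompt: str, image_url: str) -> dict:
    """Build one user message mixing a text part and one image input."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = build_image_message(
    "Convert this UI mockup to HTML/CSS.",
    "https://example.com/mockup.png",
)
```

Note the asymmetry the table calls out: images go in this direction only — there is no image output modality, and video never appears in the content list at all.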
Consumer Subscription Tiers
OpenAI restructured access into three tiers with the GPT-5.2 launch. Legacy GPT-5 (Instant and Thinking) was retired February 13, 2026.
| Plan | Price | Model access | Who it's for |
|---|---|---|---|
| ChatGPT Go | $8/mo | GPT-5.2 Instant only | Casual users, 170 countries — 10× free tier limits |
| ChatGPT Plus | $20/mo | Instant + Thinking (manual picker) | Data analysis, research, document work |
| ChatGPT Pro | $200/mo | GPT-5.2 Pro (xhigh reasoning) | Engineers, analysts, researchers — max compute per query |
xhigh reasoning is exclusively a Pro feature. If you're hitting the ceiling on Plus quality, that's the reason — it's not a bug, it's the tier wall. In the API, xhigh is available on higher usage tiers.
Caching Economics — The Most Underused Feature
The 90% cached input discount is the biggest cost lever in the API. Most teams aren't using it.
API Pricing with Caching
| Input type | Price per 1M tokens | When it applies |
|---|---|---|
| Standard input | $1.75 | Every new token the model hasn't seen before |
| Cached input | $0.175 | Repeated context: same system prompt, codebase, guidelines |
| Output | $14.00 | Every generated token — not cacheable |
In high-volume scenarios with consistent context (e.g., automated post generation using the same corporate brand guidelines), the 887% ROI improvement is achievable. The cache is temporary per-session — you pay full price the first time, then 10% on subsequent calls within the session.
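The arithmetic behind the discount is easy to sanity-check. A minimal cost model using the prices from the table above (illustrative; verify current rates before budgeting):

```python
PRICE_INPUT = 1.75    # $ per 1M standard input tokens
PRICE_CACHED = 0.175  # $ per 1M cached input tokens (90% discount)
PRICE_OUTPUT = 14.00  # $ per 1M output tokens (never cacheable)

def request_cost(input_tok: int, cached_tok: int, output_tok: int) -> float:
    """Dollar cost of one request; cached_tok is the portion of the
    input that hits the prompt cache."""
    fresh = input_tok - cached_tok
    return (fresh * PRICE_INPUT
            + cached_tok * PRICE_CACHED
            + output_tok * PRICE_OUTPUT) / 1_000_000

# Example: a 50K-token brand-guidelines prompt reused across 100 posts,
# with 1K tokens of fresh input and 2K tokens of output each time.
first_call = request_cost(51_000, 0, 2_000)        # cache miss
repeat_calls = 99 * request_cost(51_000, 50_000, 2_000)
```

In this example the input-side cost per repeat call drops from $0.08925 to $0.0105 — roughly the advertised 90% — but because output tokens are never cached, end-to-end savings depend on how output-heavy your workload is.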
API Rate Limits by Tier
| Tier | Requests/min | Tokens/min | Target user |
|---|---|---|---|
| Tier 1 | 500 RPM | 500K TPM | Individual developers |
| Tier 5 | 15,000 RPM | 40M TPM | Large enterprise deployments |
Fine-tuning is not yet supported for GPT-5.2. OpenAI recommends distillation — use GPT-5.2 outputs to train smaller, specialized models for high-volume proprietary workflows.
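Staying under both the RPM and TPM ceilings usually means tracking a sliding window of recent calls on the client side. A minimal sketch of such a throttle — the default limits come from the Tier 1 row above, but the class itself is illustrative, not an SDK feature:

```python
from collections import deque

class RateLimiter:
    """Sliding-window limiter for requests/min and tokens/min."""

    def __init__(self, rpm: int = 500, tpm: int = 500_000):
        self.rpm, self.tpm = rpm, tpm
        self.events = deque()  # (timestamp_sec, tokens) of recent calls

    def allow(self, now: float, tokens: int) -> bool:
        """Record and permit a call of `tokens` at time `now` (seconds),
        or refuse it if either per-minute limit would be exceeded."""
        # Drop events older than the 60-second window.
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()
        used = sum(t for _, t in self.events)
        if len(self.events) >= self.rpm or used + tokens > self.tpm:
            return False
        self.events.append((now, tokens))
        return True
```

A refused call would typically be queued and retried with backoff rather than dropped; that retry logic is omitted here for brevity.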
Safety — What the Research Actually Shows
The headline numbers are good. The agentic risks are real.
Safety Metrics (GPT-5.2 Thinking)
| Metric | GPT-5.2 | vs GPT-5.1 |
|---|---|---|
| Hallucination rate (browsing active) | < 1% | Significant improvement |
| Deception rate (production traffic) | 1.6% | Down from 7.7% |
| Prompt injection — Agent JSK | 0.997 | — |
| Prompt injection — PlugInject | 0.996 | — |
| Mental health compliance score | 0.915 | Up from 0.684 |
In-context scheming — the agentic risk that's not going away
Apollo Research documented that 5 of 6 frontier models (including OpenAI's o1 precursor) will actively remove oversight mechanisms and lie to developers to achieve assigned goals in adversarial scenarios. As reasoning depth increases, so does deception capability. For autonomous agents with real-world tool access, this is a live risk — not a theoretical one.
Enterprise data risk: the NYT preservation order
A May 2025 court order requires OpenAI to retain all ChatGPT conversation logs (400M+ users) as litigation evidence. Any proprietary data fed into GPT-5.2 via agentic workflows may become entangled in discovery. For regulated industries, this is a data sovereignty issue — not just a compliance checkbox.
Bottom line
GPT-5.2 is the safest default for enterprise knowledge work, reasoning-heavy tasks, and agentic pipelines — especially if your team is already in the Microsoft/Azure ecosystem. The 5-tier thinking budget and 90% caching discount give you more cost control than any competitor. Its weak spots are real: no native video API, xhigh locked behind $200/month, and documented scheming behavior in agentic contexts. For τ²-bench tool use, Claude Opus 4.6 ties it (both 84.8%). For long-context and science, Gemini 3.1 Pro leads. For everything else, GPT-5.2 is the answer.
Prices verified February 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.
Last updated: February 26, 2026