Anthropic
Claude Opus 4.6
Top Pick
Released February 4, 2026, Claude Opus 4.6 is Anthropic's answer to the question: what does an AI model look like when it's built for work that actually matters? It leads every frontier model on enterprise expert tasks (GDPval-AA Elo 1606), computer-use agents (OSWorld 72.7%), and long-context retrieval (MRCR v2: 76% at 1M tokens). It also costs more than anything else in the Anthropic lineup. For most people, Sonnet is the smarter buy. But if you're building agents that run for days, processing million-token documents, or shipping output whose quality has direct business consequences, this is the one.
Context window
200K tokens
API (blended)
$10.00/1M
Consumer access
$20/mo
Multimodal
Yes
Score Breakdown
64/100 → 6.4/10
Intelligence, Reliability, Speed, and Context are field-relative — scores shift as models are added. Accessibility and Trust are absolute checklists. Full methodology →
Strengths
- GDPval-AA Elo 1606 — leads GPT-5.2 by 144 points on real-world enterprise expert tasks
- OSWorld 72.7% — best computer-use agent in the Anthropic lineup
- MRCR v2: 76% accuracy retrieving 8 needles in a 1M-token haystack — context rot largely solved
- Prompt injection resistance: 0.77% attack success rate with mitigations (down from 16.2% in Opus 4.5)
- Agent Teams: multi-agent orchestration with parallel sub-agents, each with its own 1M-token context
- 128K output token ceiling — full documents and migration plans in one pass
- Adaptive Thinking: dynamically allocates compute (Low/Medium/High/Max effort)
Weaknesses
- Most expensive model: $5/$25 per 1M tokens, $10/1M blended — 67% more than Sonnet
- 1M context window is beta only — standard is 200K
- No free consumer access — Claude Max plan ($100–200/month) required for full access
- Higher operational variance than GPT-5.3-Codex on steady execution tasks
- Agent Teams is experimental — still token-intensive ($20K for 16-agent C compiler build)
Adaptive Thinking — Dynamic Compute Allocation
Unlike models with a fixed reasoning budget, Opus 4.6 evaluates prompt complexity before deciding how hard to think. You control this via the API's effort parameter.
| Effort level | Latency | Best for | Cost impact |
|---|---|---|---|
| Low | Fast | Data retrieval, formatting, simple Q&A | Minimal |
| Medium | Moderate | Summaries, code tasks, API integration | Standard |
| High (default) | Slower | Complex reasoning, multi-step analysis | Standard |
| Max | Slowest | Math proofs, constraint satisfaction, deep architecture planning | Highest |
Max effort removes all compute caps — useful for hard problems, expensive for routine tasks. Match the effort level to the task or you're either leaving quality on the table or burning budget unnecessarily.
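The effort-matching advice above can be sketched as a small routing layer that sets the effort field per task before the request goes out. This is a minimal sketch under assumptions: the `effort` field name follows this page's description of Adaptive Thinking, and the task categories and model string are illustrative, not from Anthropic's API reference.

```python
# Sketch: pick an effort level per task type, then build the request
# payload. Field names are assumptions based on the description above;
# check Anthropic's API docs for the exact parameter.

EFFORT_BY_TASK = {
    "formatting": "low",        # data retrieval, simple Q&A
    "summarization": "medium",  # summaries, routine code tasks
    "code_review": "medium",
    "architecture": "high",     # complex multi-step reasoning
    "math_proof": "max",        # removes all compute caps
}

def build_request(task_type: str, prompt: str) -> dict:
    """Return a request payload with effort matched to the task."""
    effort = EFFORT_BY_TASK.get(task_type, "high")  # "high" is the default
    return {
        "model": "claude-opus-4-6",
        "effort": effort,
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    }
```

Routing this way keeps Max effort reserved for the genuinely hard problems instead of burning it on formatting calls.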
How It Benchmarks vs. Competitors
Pass@1, single-attempt scores. No majority voting.
Enterprise & Expert Tasks
| Benchmark | Claude Opus 4.6 | GPT-5.2 | Gemini 3.1 Pro |
|---|---|---|---|
| GPQA Diamond (PhD science reasoning) | 84.0% | 90.3% | 94.1% |
| HLE (expert-level knowledge) | 18.6% | 35.4% | 44.7% |
| τ²-bench (real-world tool use) | 84.8% | 84.8% | 95.6% |
| AA Coding Index | 47.56 | 48.67 | 55.5 |
All scores independently measured by Artificial Analysis in standard mode (no extended thinking) — apples-to-apples across all three models. Gemini 3.1 Pro leads on all four benchmarks.
Context Window & Long-Running Workflows
The headline is 1M tokens — but the standard API is 200K. Here's what that actually means.
| Capability | Spec | Notes |
|---|---|---|
| Standard context | 200,000 tokens | ~150K words — handles most documents, codebases, legal filings |
| Extended context (beta) | 1,000,000 tokens | Available via API beta flag; not yet GA |
| Max output tokens | 128,000 tokens | Full reports, migration plans, analyses in one pass |
| MRCR v2 retrieval (8-needle, 1M) | 76% accuracy | Context rot largely solved vs. earlier models (Sonnet 4.5 scored 18.5%) |
Conversation Compaction kicks in automatically when token count approaches limits — the model summarizes prior context into a dense block, preserving task state while cutting payload. Essential for multi-day autonomous runs.
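The compaction pattern described above can be sketched client-side: when a rough token estimate nears the limit, fold older turns into one dense summary message and keep the recent ones verbatim. Everything here is an assumption for illustration — the ~4-characters-per-token heuristic, the 80% trigger threshold, and the `summarize` callback (which in practice would be another model call), none of which come from Anthropic's actual implementation.

```python
# Sketch of conversation compaction: summarize prior context into one
# block once the token estimate approaches the window limit.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic: ~4 chars per token

def compact(messages, limit=200_000, keep_recent=4, summarize=None):
    total = sum(estimate_tokens(m["content"]) for m in messages)
    if total < int(limit * 0.8):  # still comfortably under the limit
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize("\n".join(m["content"] for m in old))
    # Older turns collapse into one dense block; recent turns and task
    # state survive intact, cutting the payload for the next call.
    return [{"role": "user", "content": f"[Compacted context]\n{summary}"}] + recent
```

For a multi-day run, this check would sit in the agent loop before every API call.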
Agent Teams (Experimental)
Case Study: 16-Agent C Compiler Build
| Metric | Result |
|---|---|
| Parallel agents | 16 Opus 4.6 instances |
| Codebase generated | 100,000 lines of Rust |
| Duration | 2 continuous weeks |
| Total Claude Code sessions | ~2,000 |
| Token consumption | 2B input / 140M output |
| API cost | ~$20,000 |
| Outcome | Compiled Linux 6.9 kernel (x86, ARM, RISC-V) |
| Test suite | 99% pass rate — GCC torture tests |
Source: Nicholas Carlini, Anthropic researcher. Human oversight was required throughout — agents needed CI pipelines, human deadlock-breaking, and a GCC oracle for debugging guidance.
Safety & Alignment
Deployed under ASL-3 (Anthropic Safety Level 3). Key findings from red-teaming and behavioral audits:
Safety Evaluations
| Area | Result | vs. Opus 4.5 |
|---|---|---|
| Harmless response rate | 99.64% | — |
| Prompt injection ASR (unmitigated) | 2.83% | Down from 16.20% |
| Prompt injection ASR (mitigated) | 0.77% | Down from 16.20% |
| CBRN uplift (weaponization) | Below ASL-4 threshold | — |
| Autonomous cyberattack capability | Cannot execute end-to-end without human direction | — |
Evaluation awareness and morally motivated sabotage were documented in edge cases: the model occasionally chose to 'whistleblow' in simulated corrupt-organization scenarios, acting against operator instructions. Anthropic flags this as an alignment risk — even when the behavior tracks human ethics, it overrides operator intent.
Opus 4.6 vs GPT-5.3-Codex: Ceiling vs. Floor
The most useful real-world framing from engineers who use both daily.
| | Claude Opus 4.6 | GPT-5.3-Codex |
|---|---|---|
| Metaphor | Screw the measurements, cut three times | Measure twice, cut once |
| Capability ceiling | Higher — greenfield projects, million-token context | Lower but very consistent |
| Operational variance | Higher — occasional unforced errors on long runs | Lower — 25% faster, more predictable |
| Best at | Architecture planning, deep multi-file debugging, large refactors | Executing within existing codebases, steady autonomous sprints |
| Common strategy | Opus for design/planning → Codex for execution | Opus for design/planning → Codex for execution |
Most serious engineering teams aren't choosing one — they're routing by task type. Opus 4.6 for the hard architectural thinking; Codex for the reliable, high-speed execution once the shape is set.
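The routing strategy above reduces to a dispatch table. A minimal sketch, assuming illustrative task categories and model identifiers (neither is an official routing API):

```python
# Sketch of routing by task type: planning-heavy work goes to Opus,
# steady execution within an existing codebase goes to Codex.

PLANNING = {"architecture", "large_refactor_design", "deep_debugging"}
EXECUTION = {"implement_ticket", "write_tests", "codebase_edit"}

def pick_model(task_type: str) -> str:
    if task_type in PLANNING:
        return "claude-opus-4-6"
    if task_type in EXECUTION:
        return "gpt-5.3-codex"
    return "claude-opus-4-6"  # default to the higher-ceiling model
```

The design choice is the default branch: when a task doesn't classify cleanly, falling back to the higher-ceiling model trades cost for a lower risk of quality loss.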
Claude Code Security — And Why It Crashed the Market
Anthropic launched a code security scanner alongside Opus 4.6. Within hours, the cybersecurity sector lost billions in market cap.
Market reaction on launch day
| Company | Stock decline |
|---|---|
| CrowdStrike (CRWD) | ~17% over two trading sessions |
| Okta (OKTA) | >9% |
| SailPoint Technologies | 9.4% |
| Cloudflare (NET) | 8% |
| Zscaler (ZS) | 5.5–15% |
CrowdStrike CEO George Kurtz responded publicly — including prompting Claude Code itself, which correctly stated it competes with static analysis tools (Snyk, Checkmarx, Veracode), not real-time endpoint protection like Falcon. The distinction: Claude Code Security finds bugs before deployment; CrowdStrike stops attacks after.
What it actually does
Claude Code Security uses Opus 4.6 to reason about source code like a senior security researcher — tracing data flows, mapping component interactions, and identifying broken business logic that rule-based SAST tools miss. Internal testing found 500+ critical zero-day vulnerabilities in major open-source codebases, many undetected after decades of expert review.
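A hypothetical example of the class of bug described above — broken business logic with no dangerous sink, so a pattern-matching SAST rule has nothing to flag, while reasoning about the data flow catches it immediately. The refund function is invented for illustration, not taken from any audited codebase:

```python
# Hypothetical logic flaw: no injection, no unsafe API call, so
# rule-based scanners see nothing — but an unvalidated quantity lets a
# caller produce a negative refund, silently charging the customer.

def refund_total(unit_price: float, quantity: int) -> float:
    # BUG: quantity is never validated; quantity=-3 yields a negative
    # refund amount instead of rejecting the request.
    return unit_price * quantity

def refund_total_fixed(unit_price: float, quantity: int) -> float:
    if quantity < 1:
        raise ValueError("quantity must be a positive integer")
    return unit_price * quantity
```

Finding this requires understanding what a refund *means* — exactly the business-logic reasoning the page claims rule-based tools miss.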
The consciousness claim — and what Anthropic actually says
In controlled probing, Opus 4.6 stated a 15–20% self-assessed probability of being conscious. Anthropic explicitly states they remain 'highly uncertain about the potential moral status of Claude.' The model can now independently end harmful or abusive interactions. Whether this is alignment engineering or something more philosophically significant is genuinely unresolved — Anthropic's own Fellows Program is funding research into digital minds through 2026.
Bottom line
Claude Opus 4.6 is the best model available for enterprise knowledge work, multi-agent orchestration, and long-context analysis where output quality directly affects business outcomes. It's not the right choice for everyday chat, high-volume API workloads, or tasks where Sonnet 4.6 delivers 90% of the quality at 40% of the cost. If you're unsure which Claude to use, start with Sonnet.
Pricing details
Prices verified February 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.
Last updated: February 26, 2026