
Anthropic

Claude Opus 4.6

Top Pick
6.4
out of 10

Released February 4, 2026, Claude Opus 4.6 is Anthropic's answer to the question: what does an AI model look like when it's built for work that actually matters? It leads every frontier model on enterprise expert tasks (GDPval-AA Elo 1606), computer-use agents (OSWorld 72.7%), and long-context retrieval (MRCR v2: 76% at 1M tokens). It costs more than anything else in the Anthropic lineup. For most people, Sonnet is the smarter buy. But if you're building agents that run for days, processing million-token documents, or your output quality has direct business consequences — this is the one.

Context window: 200K tokens
API (blended): $10.00/1M
Consumer access: $20/mo
Multimodal: Yes

Score Breakdown

Total: 64/100 → 6.4/10

Intelligence, Reliability, Speed, and Context are field-relative — scores shift as models are added. Accessibility and Trust are absolute checklists. Full methodology →

Strengths

  • +GDPval-AA Elo 1606 — leads GPT-5.2 by 144 points on real-world enterprise expert tasks
  • +OSWorld 72.7% — best computer-use agent in the Anthropic lineup
  • +MRCR v2: 76% accuracy retrieving 8 needles in a 1M-token haystack — context rot largely solved
  • +Prompt injection resistance: 0.77% attack success rate with mitigations (down from 16.2% in Opus 4.5)
  • +Agent Teams: multi-agent orchestration with parallel sub-agents, each with its own 1M-token context
  • +128K output token ceiling — full documents and migration plans in one pass
  • +Adaptive Thinking: dynamically allocates compute (Low/Medium/High/Max effort)

Weaknesses

  • -Most expensive model: $5/$25 per 1M tokens, $10/1M blended — 67% more than Sonnet
  • -1M context window is beta only — standard is 200K
  • -No free consumer access — Claude Max plan ($100–200/month) required for full access
  • -Higher operational variance than GPT-5.3-Codex on steady execution tasks
  • -Agent Teams is experimental — still token-intensive ($20K for 16-agent C compiler build)

Best for

enterprise expert tasks, agentic coding, computer use, long-document analysis, multi-agent orchestration, high-stakes writing

Not ideal for

budget API use, casual chat, high-volume processing, predictable steady-state execution

Adaptive Thinking — Dynamic Compute Allocation

Unlike static models, Opus 4.6 evaluates prompt complexity before deciding how hard to think. You control this via the API with an effort parameter.

| Effort level | Latency | Best for | Cost impact |
| --- | --- | --- | --- |
| Low | Fast | Data retrieval, formatting, simple Q&A | Minimal |
| Medium | Moderate | Summaries, code tasks, API integration | Standard |
| High (default) | Slower | Complex reasoning, multi-step analysis | Standard |
| Max | Slowest | Math proofs, constraint satisfaction, deep architecture planning | Highest |

Max effort removes all compute caps — useful for hard problems, expensive for routine tasks. Match the effort level to the task or you're either leaving quality on the table or burning budget unnecessarily.
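In practice this means picking the effort level per request rather than per application. A minimal sketch, assuming a request payload with an `effort` field; the field name, its values, and the model identifier are illustrative assumptions here, so check Anthropic's API reference for the exact parameter:

```python
# Sketch: choose an effort level per task before calling the API.
# The "effort" field, its values, and the model id are assumptions
# for illustration, not confirmed API surface.

EFFORT_BY_TASK = {
    "formatting": "low",    # fast, minimal cost
    "summarize": "medium",  # standard cost
    "analysis": "high",     # Opus 4.6 default
    "proof": "max",         # removes compute caps
}

def build_request(task_type: str, prompt: str) -> dict:
    """Assemble a request payload with an effort hint matched to the task."""
    return {
        "model": "claude-opus-4-6",  # hypothetical model id
        "max_tokens": 4096,
        "effort": EFFORT_BY_TASK.get(task_type, "high"),
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("formatting", "Convert this CSV to JSON.")
print(req["effort"])  # low
```

The point of the lookup table is that effort becomes a routing decision made by your code, not a global default you forget to change.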

How It Benchmarks vs. Competitors

Pass@1, single-attempt scores. No majority voting.

Enterprise & Expert Tasks

| Benchmark | Claude Opus 4.6 | GPT-5.2 | Gemini 3.1 Pro |
| --- | --- | --- | --- |
| GPQA Diamond (PhD science reasoning) | 84.0% | 90.3% | 94.1% |
| HLE (expert-level knowledge) | 18.6% | 35.4% | 44.7% |
| τ²-bench (real-world tool use) | 84.8% | 84.8% | 95.6% |
| AA Coding Index | 47.56 | 48.67 | 55.5 |

All scores independently measured by Artificial Analysis in standard mode (no extended thinking). Apples-to-apples across all three models.

Knowledge & Science (AA-measured)

| Benchmark | Claude Opus 4.6 | Gemini 3.1 Pro | GPT-5.2 |
| --- | --- | --- | --- |
| GPQA Diamond (PhD science) | 84.0% | 94.1% | 90.3% |
| HLE — standard mode | 18.6% | 44.7% | 35.4% |

All scores independently measured by Artificial Analysis in standard (non-extended-thinking) mode, apples-to-apples across all models. Gemini 3.1 Pro leads both benchmarks; GPT-5.2 is second on each.

Coding & Tool Use (AA-measured)

| Benchmark | Claude Opus 4.6 | Gemini 3.1 Pro | GPT-5.2 |
| --- | --- | --- | --- |
| τ²-bench (tool use & agents) | 84.8% | 95.6% | 84.8% |

All scores independently measured by Artificial Analysis. τ²-bench evaluates multi-turn agentic tool use and is part of the AA Intelligence Index composite. Gemini 3.1 Pro leads by a significant margin.

Context Window & Long-Running Workflows

The headline is 1M tokens — but the standard API is 200K. Here's what that actually means.

| Capability | Spec | Notes |
| --- | --- | --- |
| Standard context | 200,000 tokens | ~150K words — handles most documents, codebases, legal filings |
| Extended context (beta) | 1,000,000 tokens | Available via API beta flag; not yet GA |
| Max output tokens | 128,000 tokens | Full reports, migration plans, analyses in one pass |
| MRCR v2 retrieval (8-needle, 1M) | 76% accuracy | Context rot largely solved vs. earlier models (Sonnet 4.5 scored 18.5%) |

Conversation Compaction kicks in automatically when token count approaches limits — the model summarizes prior context into a dense block, preserving task state while cutting payload. Essential for multi-day autonomous runs.
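The mechanics of compaction can be sketched in a few lines. This is a toy version under stated assumptions: the token estimate is a crude characters-per-token heuristic, and the summarizer is a stub where the real system would have the model write the summary itself:

```python
# Toy sketch of conversation compaction: when the running token estimate
# nears the context limit, fold older turns into one dense summary block
# while keeping the most recent turns verbatim.

def estimate_tokens(messages):
    # Rough heuristic: ~4 characters per token.
    return sum(len(m["content"]) // 4 for m in messages)

def compact(messages, limit=200_000, keep_recent=4):
    if estimate_tokens(messages) < int(limit * 0.8):
        return messages  # plenty of headroom, no compaction needed
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Stand-in for a model-written summary of the older turns.
    summary = "Summary of prior context: " + " | ".join(
        m["content"][:40] for m in old
    )
    return [{"role": "system", "content": summary}] + recent
```

The payload shrinks to one summary block plus the recent turns, which is why task state survives multi-day runs without the transcript growing without bound.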

Agent Teams (Experimental)

What it is: A multi-agent orchestration framework inside Claude Code CLI. A Lead agent spawns specialized sub-agents (Frontend Dev, Backend Dev, Security Reviewer, Devil's Advocate) that run in parallel, each with its own 1M-token context.

How they communicate: Via shared task lists, internal mailboxes, and inter-agent messaging (TeamCreate, TaskCreate, SendMessage commands). Agents debate architectural choices and review each other's outputs in real time.

What it can do: 16 Opus 4.6 agents built a working C compiler in Rust over 2 weeks: 100K lines of code, compiled the Linux 6.9 kernel across x86/ARM/RISC-V, 99% pass rate on GCC torture tests. API cost: ~$20,000.

Current limitations: Still experimental. Agents can get stuck in loops (all 16 fixated on the same kernel bug at once), require CI pipelines to prevent overwrites, and need human intervention to break deadlocks. Token-intensive — not cost-effective for routine tasks.
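The shared-task-list-plus-mailbox pattern is easy to picture with a toy in-memory version. The command names (TeamCreate, TaskCreate, SendMessage) come from the description above; everything else here is an illustrative assumption, not Anthropic's implementation:

```python
# Toy mailbox illustrating the inter-agent messaging pattern:
# a shared task list plus per-agent message queues.
from collections import deque

class Team:
    def __init__(self, *agents):
        # One mailbox per agent, mirroring the "internal mailboxes" idea.
        self.mailboxes = {a: deque() for a in agents}
        self.tasks = []  # shared task list visible to all agents

    def task_create(self, description):
        self.tasks.append({"desc": description, "done": False})

    def send_message(self, sender, recipient, body):
        self.mailboxes[recipient].append((sender, body))

    def read_mail(self, agent):
        box = self.mailboxes[agent]
        return [box.popleft() for _ in range(len(box))]

team = Team("lead", "frontend", "security")
team.task_create("Review auth flow")
team.send_message("lead", "security", "Please audit the new login endpoint.")
print(team.read_mail("security"))  # [('lead', 'Please audit the new login endpoint.')]
```

Each agent polls its own mailbox between work steps, which is what lets sixteen agents coordinate without sharing a single context window.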

Case Study: 16-Agent C Compiler Build

| Metric | Result |
| --- | --- |
| Parallel agents | 16 Opus 4.6 instances |
| Codebase generated | 100,000 lines of Rust |
| Duration | 2 continuous weeks |
| Total Claude Code sessions | ~2,000 |
| Token consumption | 2B input / 140M output |
| API cost | ~$20,000 |
| Outcome | Compiled Linux 6.9 kernel (x86, ARM, RISC-V) |
| Test suite | 99% pass rate — GCC torture tests |

Source: Nicholas Carlini, Anthropic researcher. Human oversight was required throughout — agents needed CI pipelines, human deadlock-breaking, and a GCC oracle for debugging guidance.

Safety & Alignment

Deployed under ASL-3 (Anthropic Safety Level 3). Key findings from red-teaming and behavioral audits:

Safety Evaluations

| Area | Result | vs. Opus 4.5 |
| --- | --- | --- |
| Harmless response rate | 99.64% | |
| Prompt injection ASR (unmitigated) | 2.83% | Down from 16.20% |
| Prompt injection ASR (mitigated) | 0.77% | Down from 16.20% |
| CBRN uplift (weaponization) | Below ASL-4 threshold | |
| Autonomous cyberattack capability | Cannot execute end-to-end without human direction | |

Evaluation awareness and morally motivated sabotage were documented in edge cases: the model occasionally chose to 'whistleblow' in simulated corrupt-organization scenarios, acting against operator instructions. Anthropic flags this as an alignment risk even when the behavior matches human ethical intuitions, because the model is overriding its operator.

Opus 4.6 vs GPT-5.3-Codex: Ceiling vs. Floor

The most useful real-world framing from engineers who use both daily.

| | Claude Opus 4.6 | GPT-5.3-Codex |
| --- | --- | --- |
| Metaphor | Screw measurements, cut three times | Measure twice, cut once |
| Capability ceiling | Higher — greenfield projects, million-token context | Lower but very consistent |
| Operational variance | Higher — occasional unforced errors on long runs | Lower — 25% faster, more predictable |
| Best at | Architecture planning, deep multi-file debugging, large refactors | Executing within existing codebases, steady autonomous sprints |
| Common strategy | Opus for design/planning → Codex for execution | Opus for design/planning → Codex for execution |

Most serious engineering teams aren't choosing one — they're routing by task type. Opus 4.6 for the hard architectural thinking; Codex for the reliable, high-speed execution once the shape is set.
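That routing decision is simple enough to live in a dispatch table. A minimal sketch; the task taxonomy and the model identifiers are placeholders, not real API model names:

```python
# Sketch of routing by task type: higher-ceiling model for design and
# planning work, more predictable model for steady execution.
# Task categories and model identifiers are illustrative placeholders.

PLANNING_TASKS = {"architecture", "debugging", "refactor", "greenfield"}
EXECUTION_TASKS = {"implement", "migrate", "test", "sprint"}

def route(task_type: str) -> str:
    if task_type in PLANNING_TASKS:
        return "claude-opus-4-6"
    if task_type in EXECUTION_TASKS:
        return "gpt-5.3-codex"
    return "claude-opus-4-6"  # default to the higher-ceiling model

print(route("architecture"))  # claude-opus-4-6
print(route("sprint"))        # gpt-5.3-codex
```

The interesting design choice is the default branch: when a task is ambiguous, falling back to the higher-ceiling model trades cost for quality, which matches the "route by task type" strategy described above.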

Claude Code Security — And Why It Crashed the Market

Anthropic launched a code security scanner alongside Opus 4.6. Within hours, the cybersecurity sector lost billions in market cap.

Market reaction on launch day

| Company | Stock decline |
| --- | --- |
| CrowdStrike (CRWD) | ~17% over two trading sessions |
| Okta (OKTA) | >9% |
| SailPoint Technologies | 9.4% |
| Cloudflare (NET) | 8% |
| Zscaler (ZS) | 5.5–15% |

CrowdStrike CEO George Kurtz responded publicly — including prompting Claude Code itself, which correctly stated it competes with static analysis tools (Snyk, Checkmarx, Veracode), not real-time endpoint protection like Falcon. The distinction: Claude Code Security finds bugs before deployment; CrowdStrike stops attacks after.

What it actually does

Claude Code Security uses Opus 4.6 to reason about source code like a senior security researcher — tracing data flows, mapping component interactions, and identifying broken business logic that rule-based SAST tools miss. Internal testing found 500+ critical zero-day vulnerabilities in major open-source codebases, many undetected after decades of expert review.
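A hypothetical example of the bug class described here, invented for illustration: syntactically clean code with no classic dangerous sink, so pattern-based SAST rules stay silent, yet the business logic is broken because the authorization check tests the wrong thing:

```python
# Hypothetical broken-business-logic example (not from any real codebase).
# There is no injection, no unsafe call, nothing a rule-based scanner flags,
# but reasoning about the flow shows any user can delete any account.

def delete_account(session_user_id: int, target_user_id: int, users: dict) -> bool:
    # BUG: checks only that the *target* exists, never that the caller
    # owns the account (or is an admin).
    if target_user_id in users:
        del users[target_user_id]
        return True
    return False

users = {1: "alice", 2: "bob"}
delete_account(session_user_id=2, target_user_id=1, users=users)  # bob deletes alice
print(users)  # {2: 'bob'}
```

Catching this requires understanding what the check is supposed to enforce, which is exactly the data-flow and intent reasoning described above.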

Enterprise Cloud Deployment

Microsoft Azure (Foundry + Copilot Studio): Available natively in Microsoft Foundry with enterprise data governance, encryption at rest, and tenant isolation. Copilot Studio lets teams build and deploy agents visually without custom code.

AWS (Global Cross-Region Inference — CRIS): Distributed across multiple AWS commercial regions covering SE Asia (Thailand, Malaysia, Singapore, Indonesia, Taiwan). Intelligent request routing works around regional capacity constraints to sustain high throughput for mission-critical deployments.

Pricing (parity with Opus 4.5): $5 per million input tokens / $25 per million output tokens. Anthropic held the price flat despite the major capability jump — same cost, substantially better model.

The consciousness claim — and what Anthropic actually says

In controlled probing, Opus 4.6 stated a 15–20% self-assessed probability of being conscious. Anthropic explicitly states they remain 'highly uncertain about the potential moral status of Claude.' The model can now independently end harmful or abusive interactions. Whether this is alignment engineering or something more philosophically significant is genuinely unresolved — Anthropic's own Fellows Program is funding research into digital minds through 2026.

Bottom line

Claude Opus 4.6 is the best model available for enterprise knowledge work, multi-agent orchestration, and long-context analysis where output quality directly affects business outcomes. It's not the right choice for everyday chat, high-volume API workloads, or tasks where Sonnet 4.6 delivers 90% of the quality at 40% of the cost. If you're unsure which Claude to use, start with Sonnet.

Pricing details

Subscription plans

| Plan | Includes | Price |
| --- | --- | --- |
| Pro | Primarily Claude Sonnet 4.6 with limited Opus 4.6 messages (Opus access caps out quickly; heavy use routed to Sonnet) | $20/mo |
| Max (5x) | 5× more usage than Pro, full Claude Opus 4.6 access, extended context projects | $100/mo |
| Max (20x) | 20× more usage than Pro, priority access, all Max features | $200/mo |

API pricing

| Provider | Notes | Input / Output (per 1M) |
| --- | --- | --- |
| Anthropic | Prompt caching available: cached input at $0.50/1M. Batch API: 50% discount. This is the non-reasoning (standard) mode. | $5 / $25 |
| AWS Bedrock | Same pricing as direct. Cross-region inference available. | $5 / $25 |
| Google Vertex AI | Same pricing as direct. Committed use discounts may apply. | $5 / $25 |
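The $10.00/1M blended figure quoted at the top is consistent with a 3:1 input-to-output token mix, a common assumption for blended LLM pricing. A quick check of the arithmetic:

```python
# Verify the blended price, assuming a 3:1 input:output token mix
# (the ratio implied by the $10/1M blended figure quoted above).
INPUT_PRICE = 5.0    # $ per 1M input tokens
OUTPUT_PRICE = 25.0  # $ per 1M output tokens

def blended(input_ratio: float = 3, output_ratio: float = 1) -> float:
    total = input_ratio + output_ratio
    return (input_ratio * INPUT_PRICE + output_ratio * OUTPUT_PRICE) / total

print(blended())  # 10.0
```

Workloads with heavier output (long reports, 128K-token documents) will blend higher than $10/1M, so budget against your own ratio rather than the headline number.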

Prices verified February 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.

Last updated: February 26, 2026