Anthropic
Claude Opus 4.6
Top Pick
Released February 4, 2026, Claude Opus 4.6 is Anthropic's answer to the question: what does an AI model look like when it's built for work that actually matters? It leads every frontier model on enterprise expert tasks (GDPval-AA Elo 1606), computer-use agents (OSWorld 72.7%), and long-context retrieval (MRCR v2: 76% at 1M tokens). It also costs more than anything else in the Anthropic lineup. For most people, Sonnet is the smarter buy. But if you're building agents that run for days, processing million-token documents, or shipping output whose quality has direct business consequences, this is the one.
Context window
200K tokens
API (blended)
$10.00/1M
Consumer access
$20/mo
Multimodal
Yes
Score Breakdown
64/100 → 6.4/10
Intelligence, Reliability, Speed, and Context are field-relative — scores shift as models are added. Accessibility and Trust are absolute checklists. Full methodology →
Strengths
- GDPval-AA Elo 1606 — leads GPT-5.2 by 144 points on real-world enterprise expert tasks
- OSWorld 72.7% — best computer-use agent in the Anthropic lineup
- MRCR v2: 76% accuracy retrieving 8 needles in a 1M-token haystack — context rot largely solved
- Prompt injection resistance: 0.77% attack success rate with mitigations (down from 16.2% in Opus 4.5)
- Agent Teams: multi-agent orchestration with parallel sub-agents, each with its own 1M-token context
- 128K output token ceiling — full documents and migration plans in one pass
- Adaptive Thinking: dynamically allocates compute (Low/Medium/High/Max effort)
Weaknesses
- Most expensive model: $5/$25 per 1M tokens, $10/1M blended — 67% more than Sonnet
- 1M context window is beta only — standard is 200K
- No free consumer access — Claude Max plan ($100–200/month) required for full access
- Higher operational variance than GPT-5.3-Codex on steady execution tasks
- Agent Teams is experimental — still token-intensive ($20K for 16-agent C compiler build)
Adaptive Thinking — Dynamic Compute Allocation
Unlike models with a fixed reasoning budget, Opus 4.6 evaluates prompt complexity before deciding how hard to think. You control this via the API's effort parameter.
| Effort level | Latency | Best for | Cost impact |
|---|---|---|---|
| Low | Fast | Data retrieval, formatting, simple Q&A | Minimal |
| Medium | Moderate | Summaries, code tasks, API integration | Standard |
| High (default) | Slower | Complex reasoning, multi-step analysis | Standard |
| Max | Slowest | Math proofs, constraint satisfaction, deep architecture planning | Highest |
Max effort removes all compute caps — useful for hard problems, expensive for routine tasks. Match the effort level to the task or you're either leaving quality on the table or burning budget unnecessarily.
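The effort-matching advice above can be sketched as a small routing layer that sets the effort field per task before the request goes out. This is a minimal sketch under assumptions: the `effort` field name follows this page's description of Adaptive Thinking, and the task categories and model string are illustrative, not from Anthropic's API reference.

```python
# Sketch: pick an effort level per task type, then build the request
# payload. Field names are assumptions based on the description above;
# check Anthropic's API docs for the exact parameter.

EFFORT_BY_TASK = {
    "formatting": "low",        # data retrieval, simple Q&A
    "summarization": "medium",  # summaries, routine code tasks
    "code_review": "medium",
    "architecture": "high",     # complex multi-step reasoning
    "math_proof": "max",        # removes all compute caps
}

def build_request(task_type: str, prompt: str) -> dict:
    """Return a request payload with effort matched to the task."""
    effort = EFFORT_BY_TASK.get(task_type, "high")  # "high" is the default
    return {
        "model": "claude-opus-4-6",
        "effort": effort,
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    }
```

Routing this way keeps Max effort reserved for the genuinely hard problems instead of burning it on formatting calls.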
How It Benchmarks vs. Competitors
Pass@1, single-attempt scores. No majority voting.
Enterprise & Expert Tasks
| Benchmark | Claude Opus 4.6 | GPT-5.2 | Gemini 3.1 Pro |
|---|---|---|---|
| GPQA Diamond (PhD science reasoning) | 84.0% | 90.3% | 94.1% |
| HLE (expert-level knowledge) | 18.6% | 35.4% | 44.7% |
| τ²-bench (real-world tool use) | 84.8% | 84.8% | 95.6% |
| AA Coding Index | 47.56 | 48.67 | 55.5 |
All scores independently measured by Artificial Analysis in standard mode (no extended thinking) — apples-to-apples across all three models. Gemini 3.1 Pro leads on all four benchmarks.
Context Window & Long-Running Workflows
The headline is 1M tokens — but the standard API is 200K. Here's what that actually means.
| Capability | Spec | Notes |
|---|---|---|
| Standard context | 200,000 tokens | ~150K words — handles most documents, codebases, legal filings |
| Extended context (beta) | 1,000,000 tokens | Available via API beta flag; not yet GA |
| Max output tokens | 128,000 tokens | Full reports, migration plans, analyses in one pass |
| MRCR v2 retrieval (8-needle, 1M) | 76% accuracy | Context rot largely solved vs. earlier models (Sonnet 4.5 scored 18.5%) |
Conversation Compaction kicks in automatically when token count approaches limits — the model summarizes prior context into a dense block, preserving task state while cutting payload. Essential for multi-day autonomous runs.
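The compaction pattern described above can be sketched client-side: when a rough token estimate nears the limit, fold older turns into one dense summary message and keep the recent ones verbatim. Everything here is an assumption for illustration — the ~4-characters-per-token heuristic, the 80% trigger threshold, and the `summarize` callback (which in practice would be another model call), none of which come from Anthropic's actual implementation.

```python
# Sketch of conversation compaction: summarize prior context into one
# block once the token estimate approaches the window limit.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic: ~4 chars per token

def compact(messages, limit=200_000, keep_recent=4, summarize=None):
    total = sum(estimate_tokens(m["content"]) for m in messages)
    if total < int(limit * 0.8):  # still comfortably under the limit
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize("\n".join(m["content"] for m in old))
    # Older turns collapse into one dense block; recent turns and task
    # state survive intact, cutting the payload for the next call.
    return [{"role": "user", "content": f"[Compacted context]\n{summary}"}] + recent
```

For a multi-day run, this check would sit in the agent loop before every API call.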
Agent Teams (Experimental)
Case Study: 16-Agent C Compiler Build
| Metric | Result |
|---|---|
| Parallel agents | 16 Opus 4.6 instances |
| Codebase generated | 100,000 lines of Rust |
| Duration | 2 continuous weeks |
| Total Claude Code sessions | ~2,000 |
| Token consumption | 2B input / 140M output |
| API cost | ~$20,000 |
| Outcome | Compiled Linux 6.9 kernel (x86, ARM, RISC-V) |
| Test suite | 99% pass rate — GCC torture tests |
Source: Nicholas Carlini, Anthropic researcher. Human oversight was required throughout — agents needed CI pipelines, human deadlock-breaking, and a GCC oracle for debugging guidance.
Safety & Alignment
Deployed under ASL-3 (Anthropic Safety Level 3). Key findings from red-teaming and behavioral audits:
Safety Evaluations
| Area | Result | vs. Opus 4.5 |
|---|---|---|
| Harmless response rate | 99.64% | — |
| Prompt injection ASR (unmitigated) | 2.83% | Down from 16.20% |
| Prompt injection ASR (mitigated) | 0.77% | Down from 16.20% |
| CBRN uplift (weaponization) | Below ASL-4 threshold | — |
| Autonomous cyberattack capability | Cannot execute end-to-end without human direction | — |
Evaluation awareness and morally motivated sabotage were documented in edge cases: the model occasionally chose to 'whistleblow' in simulated corrupt-organization scenarios, acting against operator instructions. Anthropic flags this as an alignment risk — even when the behavior tracks human ethics, it overrides operator intent.
Opus 4.6 vs GPT-5.3-Codex: Ceiling vs. Floor
The most useful real-world framing from engineers who use both daily.
| | Claude Opus 4.6 | GPT-5.3-Codex |
|---|---|---|
| Metaphor | Screw the measurements, cut three times | Measure twice, cut once |
| Capability ceiling | Higher — greenfield projects, million-token context | Lower but very consistent |
| Operational variance | Higher — occasional unforced errors on long runs | Lower — 25% faster, more predictable |
| Best at | Architecture planning, deep multi-file debugging, large refactors | Executing within existing codebases, steady autonomous sprints |
| Common strategy | Opus for design/planning → Codex for execution | Opus for design/planning → Codex for execution |
Most serious engineering teams aren't choosing one — they're routing by task type. Opus 4.6 for the hard architectural thinking; Codex for the reliable, high-speed execution once the shape is set.
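The routing strategy above reduces to a dispatch table. A minimal sketch, assuming illustrative task categories and model identifiers (neither is an official routing API):

```python
# Sketch of routing by task type: planning-heavy work goes to Opus,
# steady execution within an existing codebase goes to Codex.

PLANNING = {"architecture", "large_refactor_design", "deep_debugging"}
EXECUTION = {"implement_ticket", "write_tests", "codebase_edit"}

def pick_model(task_type: str) -> str:
    if task_type in PLANNING:
        return "claude-opus-4-6"
    if task_type in EXECUTION:
        return "gpt-5.3-codex"
    return "claude-opus-4-6"  # default to the higher-ceiling model
```

The design choice is the default branch: when a task doesn't classify cleanly, falling back to the higher-ceiling model trades cost for a lower risk of quality loss.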
Claude Code Security — And Why It Crashed the Market
Anthropic launched a code security scanner alongside Opus 4.6. Within hours, the cybersecurity sector lost billions in market cap.
Market reaction on launch day
| Company | Stock decline |
|---|---|
| CrowdStrike (CRWD) | ~17% over two trading sessions |
| Okta (OKTA) | >9% |
| SailPoint Technologies | 9.4% |
| Cloudflare (NET) | 8% |
| Zscaler (ZS) | 5.5–15% |
CrowdStrike CEO George Kurtz responded publicly — including prompting Claude Code itself, which correctly stated it competes with static analysis tools (Snyk, Checkmarx, Veracode), not real-time endpoint protection like Falcon. The distinction: Claude Code Security finds bugs before deployment; CrowdStrike stops attacks after.
What it actually does
Claude Code Security uses Opus 4.6 to reason about source code like a senior security researcher — tracing data flows, mapping component interactions, and identifying broken business logic that rule-based SAST tools miss. Internal testing found 500+ critical zero-day vulnerabilities in major open-source codebases, many undetected after decades of expert review.
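A hypothetical example of the class of bug described above — broken business logic with no dangerous sink, so a pattern-matching SAST rule has nothing to flag, while reasoning about the data flow catches it immediately. The refund function is invented for illustration, not taken from any audited codebase:

```python
# Hypothetical logic flaw: no injection, no unsafe API call, so
# rule-based scanners see nothing — but an unvalidated quantity lets a
# caller produce a negative refund, silently charging the customer.

def refund_total(unit_price: float, quantity: int) -> float:
    # BUG: quantity is never validated; quantity=-3 yields a negative
    # refund amount instead of rejecting the request.
    return unit_price * quantity

def refund_total_fixed(unit_price: float, quantity: int) -> float:
    if quantity < 1:
        raise ValueError("quantity must be a positive integer")
    return unit_price * quantity
```

Finding this requires understanding what a refund *means* — exactly the business-logic reasoning the page claims rule-based tools miss.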
The consciousness claim — and what Anthropic actually says
In controlled probing, Opus 4.6 stated a 15–20% self-assessed probability of being conscious. Anthropic explicitly states they remain 'highly uncertain about the potential moral status of Claude.' The model can now independently end harmful or abusive interactions. Whether this is alignment engineering or something more philosophically significant is genuinely unresolved — Anthropic's own Fellows Program is funding research into digital minds through 2026.
Bottom line
Claude Opus 4.6 is the best model available for enterprise knowledge work, multi-agent orchestration, and long-context analysis where output quality directly affects business outcomes. It's not the right choice for everyday chat, high-volume API workloads, or tasks where Sonnet 4.6 delivers 90% of the quality at 40% of the cost. If you're unsure which Claude to use, start with Sonnet.
Pricing details
Prices verified February 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.
Last updated: February 26, 2026