Anthropic
Claude Sonnet 4.5
Released September 29, 2025, Claude Sonnet 4.5 was the most consequential Anthropic release of 2025 — a mid-tier model that outperformed its own flagship Opus 4.1 on most tasks at one-fifth the price, set the high-water mark on SWE-bench Verified (77.2%), and proved that AI could sustain autonomous coding sessions for 30+ continuous hours. It has since been succeeded by Sonnet 4.6 (February 2026). If you're starting fresh, use Sonnet 4.6. But Sonnet 4.5 remains a proven, heavily safety-tested model at the same price — and for teams already running on it, there's no urgent reason to upgrade.
Context window
200K tokens
API (blended)
$6.00/1M
Consumer access
Free (limited) / $20/mo
Multimodal
Yes
Score Breakdown
58.1/100 → 5.8/10. Intelligence, Reliability, Speed, and Context scores are field-relative — they shift as models are added. Accessibility and Trust are absolute checklists. Full methodology →
Strengths
- 77.2% SWE-bench Verified — best coding model in the world at launch (Sept 2025)
- 61.4% OSWorld — best computer-use score at release, a 45% jump over Sonnet 4
- 30+ hour autonomous coding sessions — 4× longer than the prior-generation Claude Opus 4
- 98% τ²-bench telecom (tool-use orchestration) — near-perfect agentic performance
- Heavily safety-tested: 148-page system card, ASL-3, 99.29% harmless response rate
- Same $3/$15 API pricing as Sonnet 4.6 — proven model, no upgrade cost
Weaknesses
- Succeeded by Sonnet 4.6 (Feb 2026) — most use cases should prefer 4.6
- Knowledge cutoff of July 2025 — six months behind Sonnet 4.6's January 2026 training data
- Hedges in 34% of code-review comments — more cautious than necessary on actionable feedback
- Visual reasoning (MMMU 77.8%) trails GPT-5.2 significantly
- UI/frontend work is still a known weak spot vs. Gemini models
Where Sonnet 4.5 Stood at Launch
At release in September 2025, Sonnet 4.5 was the best coding model in the world by most measures — and it beat its own company's flagship at a fraction of the cost.
Launch benchmark comparison (provider-reported, extended thinking)
| Benchmark | Sonnet 4.5 | GPT-5 (Oct 2025) | GPT-5 Codex | What it measures |
|---|---|---|---|---|
| SWE-bench Verified | 77.2% (82% parallel) | 72.8% | 74.5% | Real GitHub bug-fixing |
| Terminal-Bench (thinking) | 61.3% | — | 58.8% | Autonomous CLI coding |
| OSWorld (computer use) | 61.4% | ~38% | Not reported | Desktop GUI navigation |
| Finance Agent v1.1 | 55.3% | 46.9% | — | Financial analysis tasks |
| GPQA Diamond (science) | 83.4% | 85.7% | — | Expert scientific reasoning |
These are provider-reported scores using extended thinking — higher than AA standard-mode measurements. They're useful for understanding Sonnet 4.5's position at launch relative to competitors at the time. For current apples-to-apples AA-measured data, see the Score Breakdown panel above.
It beat its own flagship
Sonnet 4.5 surpassed Claude Opus 4.1 — Anthropic's own flagship at the time — on nearly every metric while costing $3/$15 per 1M tokens vs. Opus 4.1's $15/$75. Anthropic CPO Mike Krieger described Sonnet 4.5 as smaller than Opus 4.1 but smarter 'in almost every single way.' This tier compression — where mid-range models outperform prior-generation flagships — has become the defining pattern of the Claude 4.x family.
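The pricing gap is easy to quantify. A quick sketch using the list prices above — the monthly token volumes are illustrative assumptions, not figures from this article:

```python
# Cost comparison: Sonnet 4.5 vs. Opus 4.1 at their per-1M-token list prices.
PRICES = {  # (input $/1M tokens, output $/1M tokens)
    "claude-sonnet-4.5": (3.00, 15.00),
    "claude-opus-4.1": (15.00, 75.00),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost for a given token volume at the model's list price."""
    in_price, out_price = PRICES[model]
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

# Illustrative workload: 100M input tokens, 20M output tokens per month.
sonnet = monthly_cost("claude-sonnet-4.5", 100_000_000, 20_000_000)
opus = monthly_cost("claude-opus-4.1", 100_000_000, 20_000_000)
print(f"Sonnet 4.5: ${sonnet:,.2f}  Opus 4.1: ${opus:,.2f}  ratio: {opus / sonnet:.0f}x")
# → Sonnet 4.5: $600.00  Opus 4.1: $3,000.00  ratio: 5x
```

Because both input and output prices scale by exactly 5×, the ratio holds regardless of the input/output mix.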
Coding & Agentic Performance
The two capabilities that defined Sonnet 4.5's reputation: SWE-bench leadership and 30-hour autonomous runs.
Coding benchmark performance
| Benchmark | Score | Context |
|---|---|---|
| SWE-bench Verified (standard) | 77.2% | Best in class at launch — #1 of any model |
| SWE-bench Verified (parallel compute) | 82.0% | With multi-sample test-time compute |
| Terminal-Bench (extended thinking) | 61.3% | First model to crack 60% |
| τ²-bench (retail) | 86.2% | Tool orchestration, retail domain |
| τ²-bench (telecom) | 98.0% | Near-perfect on multi-tool agentic tasks |
| Finance Agent v1.1 | 55.3% | #1 at launch, 8pp above GPT-5 |
SWE-bench Verified measures whether a model can resolve real, unsimplified GitHub issues. 77.2% means it resolved roughly 77 of every 100 real issues on the first attempt. It held the top spot for five months, until Sonnet 4.6 pushed higher.
30-hour autonomous agent — what that means in practice
| Metric | Sonnet 4.5 | Claude Opus 4 (previous) |
|---|---|---|
| Sustained autonomous task horizon | 30+ hours | ~7 hours |
| Improvement | 4× longer | — |
| Real-world demo | 11,000-line chat app (Slack-like) | — |
| Tasks completed autonomously | Write code, stand up DB, buy domain, SOC 2 audit | — |
Anthropic researcher David Hershey documented an enterprise trial where Sonnet 4.5 autonomously built a Slack-like chat application — 11,000 lines of code — including database setup, domain purchase, and compliance auditing. Human checkpoints were in place but not required for most steps.
Computer Use — OSWorld 61.4%
Sonnet 4.5 improved its predecessor's computer-use score by roughly 45% in just four months.
OSWorld progression — Anthropic Claude family
| Model | OSWorld score | Release |
|---|---|---|
| Claude Sonnet 4 (Sonnet 4.0) | 42.2% | May 2025 |
| Claude Sonnet 4.5 | 61.4% | Sept 2025 (+45% improvement) |
| Claude Sonnet 4.6 | 72.5% | Feb 2026 |
| Claude Opus 4.6 | 72.7% | Feb 2026 |
OSWorld measures a model's ability to operate desktop and browser UIs autonomously — clicking, typing, navigating applications, running terminal commands. The 45% jump from Sonnet 4 to Sonnet 4.5 in just four months reflects Anthropic's investment in computer-use training data during this period.
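The "45% jump" is a relative improvement, not percentage points — worth disambiguating, since the tables mix both conventions. The arithmetic, using the scores from the table above:

```python
# OSWorld scores from the progression table above.
sonnet_4, sonnet_45 = 42.2, 61.4

absolute_gain = sonnet_45 - sonnet_4               # in percentage points
relative_gain = (sonnet_45 - sonnet_4) / sonnet_4  # fractional improvement

print(f"{absolute_gain:.1f} points absolute, {relative_gain:.0%} relative")
# → 19.2 points absolute, 45% relative
```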
Where Sonnet 4.5 Falls Short
The model has consistent weaknesses that developers hit in production, and they were never fixed in 4.5 itself — Sonnet 4.6 addresses most of them.
Known limitations
| Limitation | Evidence | Fixed in 4.6? |
|---|---|---|
| Hedging in code review | 34% of actionable comments use 'might,' 'could,' 'possibly' — worse than Opus 4.1 (28%) | Improved |
| Visual reasoning | MMMU 77.8% — well behind GPT-5.2 (84.2%) | Partial |
| UI / frontend work | Multiple developers report 15-20% slower than Gemini on short UI fixes | Partial |
| Instruction literalness | Treats MUST/ALWAYS as contextual, not absolute — broke production prompts for some teams | Improved |
| Knowledge cutoff | July 2025 — events after are unreliable | Yes (4.6 has Jan 2026) |
CodeRabbit's 25-PR benchmark found Sonnet 4.5 catching 41% of important bugs vs. Opus 4.1's 50% — closer than expected, but the hedging behavior made its reviews less actionable. CodeRabbit's verdict: 'a thoughtful colleague where Opus is surgical.'
Safety — Most Thoroughly Tested Claude at Release
Sonnet 4.5 shipped with a 148-page system card — the most comprehensive of any Claude release at the time.
Safety evaluation results
| Metric | Sonnet 4.5 | vs. Sonnet 4 |
|---|---|---|
| Harmless response rate | 99.29% | Up from 98.22% |
| Over-refusal rate | 0.02% | Down from 0.15% |
| Shortcut behaviors | 65% reduction | vs. Sonnet 3.7 |
| Malicious agentic requests rejected | 98.7% (148/150) | Up from 89.3% |
| ASL classification | ASL-3 | ASL-3 |
| Third-party auditors | UK AI Safety Institute, US AISI, Apollo Research | — |
First Claude model evaluated using mechanistic interpretability — probing internal neural representations for alignment rather than relying solely on behavioral tests. Political bias dropped to 3.3% (1.3% with extended thinking).
Sonnet 4.5 vs Sonnet 4.6 — Should You Upgrade?
Both cost $3/$15 per 1M tokens. Here's what changed.
| Dimension | Sonnet 4.5 | Sonnet 4.6 | Upgrade worth it? |
|---|---|---|---|
| AA Intelligence Index | 37.14 | 44.38 | Yes — measurable gap |
| GDPval-AA (office tasks) | Not reported | 1,633 Elo | Yes |
| OSWorld (computer use) | 61.4% | 72.5% | Yes — significant jump |
| Knowledge cutoff | July 2025 | Aug 2025 (training data: Jan 2026) | Yes — training data 6 months newer |
| Adaptive thinking tiers | 4 tiers | 4 tiers | No change |
| Context window | 200K (1M beta) | 200K (1M beta) | No change |
| Max output tokens | 64K | 64K | No change |
| API price | $3/$15 | $3/$15 | Free upgrade |
If you're calling via API, switching from claude-sonnet-4-5-20250929 to claude-sonnet-4-6 is a zero-cost upgrade with meaningful capability improvements. The main reason to stay on 4.5 is if you've tuned prompts specifically for its behavior and don't want to re-evaluate outputs.
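In practice the swap is a one-line change to the model string. A minimal sketch assuming the official `anthropic` Python SDK — the helper function and the pinning flag are illustrative, not part of the SDK:

```python
# Swapping Sonnet 4.5 for 4.6 in a Messages API call: only the model
# string changes — pricing, context window, and max output are unchanged.
PINNED_45 = "claude-sonnet-4-5-20250929"
LATEST_46 = "claude-sonnet-4-6"

def request_kwargs(prompt: str, pin_to_45: bool = False) -> dict:
    """Illustrative helper: identical request body, swappable model ID."""
    return {
        "model": PINNED_45 if pin_to_45 else LATEST_46,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }

# With the SDK installed and ANTHROPIC_API_KEY set, the call would be:
#   import anthropic
#   client = anthropic.Anthropic()
#   response = client.messages.create(**request_kwargs("Summarize this diff: ..."))
```

Keeping the dated 4.5 snapshot behind a flag like this makes it easy to A/B the two models on your own prompts before committing to the upgrade.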
What Sonnet 4.5 meant for the industry
Sonnet 4.5 arrived when Anthropic was valued at $183 billion (Series F). Claude Code reached $500M+ run-rate revenue in its first six months — largely on the back of Sonnet 4.5's coding capability. By the time Sonnet 4.6 shipped in February 2026, Anthropic had raised another $30B at a $380B valuation. The coding arms race that Sonnet 4.5 helped define — 30-hour autonomous sessions, SWE-bench leadership, agentic tool use — is now the primary competitive battleground for every major AI lab.
Bottom line
Claude Sonnet 4.5 set the standard for what mid-tier AI could do in 2025 — and it holds up well. For new projects, use Sonnet 4.6: it's meaningfully better on intelligence, computer use, and knowledge recency at the same price. For existing systems built on Sonnet 4.5, the prompts and behavior you've tuned are stable — upgrade on your own timeline. The model's core strengths (coding, agentic persistence, τ²-bench tool use) remain competitive even against 2026 alternatives.
Pricing details
Prices verified February 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.
Last updated: February 27, 2026