
Anthropic

Claude Sonnet 4.5

5.8
out of 10

Released September 29, 2025, Claude Sonnet 4.5 was the most consequential Anthropic release of 2025 — a mid-tier model that outperformed its own flagship Opus 4.1 on most tasks at one-fifth the price, set the high-water mark on SWE-bench Verified (77.2%), and proved that AI could sustain autonomous coding sessions for 30+ continuous hours. It has since been succeeded by Sonnet 4.6 (February 2026). If you're starting fresh, use Sonnet 4.6. But Sonnet 4.5 remains a proven, heavily safety-tested model at the same price — and for teams already running on it, there's no urgent reason to upgrade.

Context window: 200K tokens
API (blended): $6.00/1M (derivation sketched below)
Consumer access: Free (limited) / $20/mo
Multimodal: Yes
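
The blended figure follows from the $3/$15 per-million input/output prices. A quick sanity check, assuming the common 3:1 input-to-output token weighting (an assumption; the page doesn't state its ratio):

```python
# Blended per-token price from the published $3/$15 rates,
# assuming a 3:1 input:output token mix (not stated on this page).
input_price = 3.00    # USD per 1M input tokens
output_price = 15.00  # USD per 1M output tokens

blended = (3 * input_price + 1 * output_price) / 4
print(f"${blended:.2f}/1M")  # -> $6.00/1M
```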

Score Breakdown

Total: 58.1/100 → 5.8/10

Intelligence, Reliability, Speed, and Context are field-relative — scores shift as models are added. Accessibility and Trust are absolute checklists. Full methodology →

Strengths

  • 77.2% SWE-bench Verified — best coding model in the world at launch (Sept 2025)
  • 61.4% OSWorld — best computer-use score at release, a 45% jump over Sonnet 4
  • 30+ hour autonomous coding sessions — 4× longer than the prior-generation Claude Opus 4
  • 98% τ²-bench telecom (tool-use orchestration) — near-perfect agentic performance
  • Heavily safety-tested: 148-page system card, ASL-3, 99.29% harmless response rate
  • Same $3/$15 API pricing as Sonnet 4.6 — a proven model with no upgrade cost

Weaknesses

  • Succeeded by Sonnet 4.6 (Feb 2026) — most use cases should prefer 4.6
  • Knowledge cutoff July 2025 — six months behind Sonnet 4.6 (Jan 2026 training data)
  • Hedges in 34% of code-review comments — more cautious than necessary on actionable feedback
  • Visual reasoning (MMMU 77.8%) trails GPT-5.2 significantly
  • UI/frontend work is still a known weak spot vs. Gemini models

Best for

coding · agentic tasks · computer use · long autonomous sessions · teams already on Sonnet 4.5

Not ideal for

latest events/knowledge · visual reasoning · UI-heavy frontend tasks · new projects (use Sonnet 4.6)

Where Sonnet 4.5 Stood at Launch

At release in September 2025, Sonnet 4.5 was the best coding model in the world by most measures — and it beat its own company's flagship at a fraction of the cost.

Launch benchmark comparison (provider-reported, extended thinking)

| Benchmark | Sonnet 4.5 | GPT-5 (Oct 2025) | GPT-5 Codex | What it measures |
|---|---|---|---|---|
| SWE-bench Verified | 77.2% (82% parallel) | 72.8% | 74.5% | Real GitHub bug-fixing |
| Terminal-Bench (thinking) | 61.3% | 58.8% | — | Autonomous CLI coding |
| OSWorld (computer use) | 61.4% | ~38% | Not reported | Desktop GUI navigation |
| Finance Agent v1.1 | 55.3% | 46.9% | — | Financial analysis tasks |
| GPQA Diamond (science) | 83.4% | 85.7% | — | Expert scientific reasoning |

These are provider-reported scores using extended thinking — higher than AA standard-mode measurements. They're useful for understanding Sonnet 4.5's position at launch relative to its contemporaries. For current apples-to-apples AA-measured data, see the Score Breakdown panel above.
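
For context, the extended-thinking numbers above come from runs where the model's reasoning budget is switched on. A minimal sketch of enabling that mode through the Anthropic Python SDK (the budget value and prompt are illustrative):

```python
# Minimal sketch: enabling extended thinking in the Anthropic Messages API.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=16000,  # must exceed the thinking budget
    # Extended thinking: the model reasons in a scratchpad before answering.
    # budget_tokens caps those reasoning tokens (billed as output).
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "Fix the failing test in ..."}],
)

# Thinking and the final answer arrive as separate content blocks.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200])
    elif block.type == "text":
        print(block.text)
```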

It beat its own flagship

Sonnet 4.5 surpassed Claude Opus 4.1 — Anthropic's own flagship at the time — on nearly every metric while costing $3/$15 per 1M tokens vs. Opus 4.1's $15/$75. Anthropic CPO Mike Krieger described Sonnet 4.5 as smaller than Opus 4.1 but smarter 'in almost every single way.' This tier compression — where mid-range models outperform prior-generation flagships — has become the defining pattern of the Claude 4.x family.

Coding & Agentic Performance

The two capabilities that defined Sonnet 4.5's reputation: SWE-bench leadership and 30-hour autonomous runs.

Coding benchmark performance

| Benchmark | Score | Context |
|---|---|---|
| SWE-bench Verified (standard) | 77.2% | Best in class at launch — #1 of any model |
| SWE-bench Verified (parallel compute) | 82.0% | With multi-sample test-time compute |
| Terminal-Bench (extended thinking) | 61.3% | First model to crack 60% |
| τ²-bench (retail) | 86.2% | Tool orchestration, retail domain |
| τ²-bench (telecom) | 98.0% | Near-perfect on multi-tool agentic tasks |
| Finance Agent v1.1 | 55.3% | #1 at launch, 8pp above GPT-5 |

SWE-bench Verified measures whether a model can resolve real, unsimplified GitHub issues. 77.2% means it successfully fixed 77 out of 100 real bugs on the first attempt. This held the top spot for five months before Sonnet 4.6 pushed higher.
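
The τ²-bench rows above measure tool orchestration: whether the model picks the right tool with the right arguments, turn after turn. In API terms, that means emitting well-formed tool_use blocks like the sketch below (the get_weather tool is hypothetical, purely for illustration):

```python
# Minimal sketch of tool use with the Anthropic Messages API.
# The get_weather tool is a hypothetical example, not from any benchmark.
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
)

# On agentic benchmarks, what gets scored is whether blocks like this
# carry the right tool name and arguments across many turns.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)  # e.g. get_weather {'city': 'Oslo'}
```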

30-hour autonomous agent — what that means in practice

| Metric | Sonnet 4.5 | Claude Opus 4 (previous) |
|---|---|---|
| Sustained autonomous task horizon | 30+ hours | ~7 hours |
| Improvement | 4× longer | — |
| Real-world demo | 11,000-line chat app (Slack-like) | — |
| Tasks completed autonomously | Write code, stand up DB, buy domain, SOC 2 audit | — |

Anthropic researcher David Hershey documented an enterprise trial where Sonnet 4.5 autonomously built a Slack-like chat application — 11,000 lines of code — including database setup, domain purchase, and compliance auditing. Human checkpoints were in place but not required for most steps.

Computer Use — OSWorld 61.4%

Sonnet 4.5 improved its predecessor's computer-use score by 45% in four months.

OSWorld progression — Anthropic Claude family

| Model | OSWorld score | Release |
|---|---|---|
| Claude Sonnet 4 | 42.2% | May 2025 |
| Claude Sonnet 4.5 | 61.4% | Sept 2025 (+45% improvement) |
| Claude Sonnet 4.6 | 72.5% | Feb 2026 |
| Claude Opus 4.6 | 72.7% | Feb 2026 |

OSWorld measures a model's ability to operate desktop and browser UIs autonomously — clicking, typing, navigating applications, running terminal commands. The 45% jump from Sonnet 4 to Sonnet 4.5 in just four months reflects Anthropic's investment in computer-use training data during this period.
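
Under the hood, computer use runs as an agent loop over Anthropic's beta computer tool. A minimal sketch follows; the tool version and beta flag strings are taken from earlier Claude 4 documentation and may differ for Sonnet 4.5, so treat them as assumptions and check the current docs:

```python
# Minimal sketch of the computer-use agent loop (beta API).
# Tool version and beta flag strings follow earlier Claude 4 docs and
# may differ for Sonnet 4.5 — assumptions, not confirmed values.
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    betas=["computer-use-2025-01-24"],
    tools=[{
        "type": "computer_20250124",   # provider-defined GUI tool
        "name": "computer",
        "display_width_px": 1280,
        "display_height_px": 800,
    }],
    messages=[{"role": "user", "content": "Open the browser and search for OSWorld."}],
)

# The model replies with tool_use blocks such as
#   {"action": "screenshot"} or {"action": "left_click", "coordinate": [640, 400]}.
# Your harness executes each action, returns a screenshot as the tool result,
# and loops until the model stops requesting actions — that end-to-end loop
# is what OSWorld scores.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```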

Where Sonnet 4.5 Falls Short

The model has consistent weaknesses that developers hit in production. They were never patched in 4.5 itself; most are addressed in Sonnet 4.6.

Known limitations

| Limitation | Evidence | Fixed in 4.6? |
|---|---|---|
| Hedging in code review | 34% of actionable comments use "might," "could," "possibly" — worse than Opus 4.1 (28%) | Improved |
| Visual reasoning | MMMU 77.8% — well behind GPT-5.2 (84.2%) | Partial |
| UI / frontend work | Multiple developers report 15-20% slower than Gemini on short UI fixes | Partial |
| Instruction literalness | Treats MUST/ALWAYS as contextual, not absolute — broke production prompts for some teams | Improved |
| Knowledge cutoff | July 2025 — events after are unreliable | Yes (4.6 trains to Jan 2026) |

CodeRabbit's 25-PR benchmark found Sonnet 4.5 catching 41% of important bugs vs. Opus 4.1's 50% — closer than expected, though the hedging behavior made its reviews less actionable. CodeRabbit's verdict: "a thoughtful colleague where Opus is surgical."

Safety — Most Thoroughly Tested Claude at Release

Sonnet 4.5 shipped with a 148-page system card — the most comprehensive of any Claude release at the time.

Safety evaluation results

| Metric | Sonnet 4.5 | Comparison |
|---|---|---|
| Harmless response rate | 99.29% | Up from 98.22% (Sonnet 4) |
| Over-refusal rate | 0.02% | Down from 0.15% (Sonnet 4) |
| Shortcut behaviors | 65% reduction | vs. Sonnet 3.7 |
| Malicious agentic requests rejected | 98.7% (148/150) | Up from 89.3% (Sonnet 4) |
| ASL classification | ASL-3 | Unchanged from Sonnet 4 |
| Third-party auditors | UK AI Safety Institute, US AISI, Apollo Research | — |

Sonnet 4.5 was the first Claude model evaluated using mechanistic interpretability — probing internal neural representations for alignment rather than relying solely on behavioral tests. Political bias dropped to 3.3% (1.3% with extended thinking).

Sonnet 4.5 vs Sonnet 4.6 — Should You Upgrade?

Both cost $3/$15 per 1M tokens. Here's what changed.

| Dimension | Sonnet 4.5 | Sonnet 4.6 | Upgrade worth it? |
|---|---|---|---|
| AA Intelligence Index | 37.14 | 44.38 | Yes — measurable gap |
| GDPval-AA (office tasks) | Not reported | 1,633 Elo | Yes |
| OSWorld (computer use) | 61.4% | 72.5% | Yes — significant jump |
| Knowledge cutoff | July 2025 | Aug 2025 (training data to Jan 2026) | Yes — training data ~6 months newer |
| Adaptive thinking tiers | 4 tiers | 4 tiers | No change |
| Context window | 200K (1M beta) | 200K (1M beta) | No change |
| Max output tokens | 64K | 64K | No change |
| API price | $3/$15 | $3/$15 | Free upgrade |

If you're calling via API, switching from claude-sonnet-4-5-20250929 to claude-sonnet-4-6 is a zero-cost upgrade with meaningful capability improvements. The main reason to stay on 4.5 is if you've tuned prompts specifically for its behavior and don't want to re-evaluate outputs.
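
In practice the swap is a one-line change. A minimal sketch with the Anthropic Python SDK, using the model IDs quoted above (verify current aliases in Anthropic's model docs):

```python
# Sketch: the upgrade is a one-line model ID change.
# IDs are the ones quoted on this page; verify against current docs.
import anthropic

client = anthropic.Anthropic()

MODEL = "claude-sonnet-4-6"  # was: "claude-sonnet-4-5-20250929"

response = client.messages.create(
    model=MODEL,
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this diff: ..."}],
)
print(response.content[0].text)
```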

What Sonnet 4.5 meant for the industry

Sonnet 4.5 arrived when Anthropic was valued at $183 billion (Series F). Claude Code reached $500M+ run-rate revenue in its first six months — largely on the back of Sonnet 4.5's coding capability. By the time Sonnet 4.6 shipped in February 2026, Anthropic had raised another $30B at a $380B valuation. The coding arms race that Sonnet 4.5 helped define — 30-hour autonomous sessions, SWE-bench leadership, agentic tool use — is now the primary competitive battleground for every major AI lab.

Bottom line

Claude Sonnet 4.5 set the standard for what mid-tier AI could do in 2025 — and it holds up well. For new projects, use Sonnet 4.6: it's meaningfully better on intelligence, computer use, and knowledge recency at the same price. For existing systems built on Sonnet 4.5, the prompts and behavior you've tuned are stable — upgrade on your own timeline. The model's core strengths (coding, agentic persistence, τ²-bench tool use) remain competitive even against 2026 alternatives.

Pricing details

Subscription plans

| Plan | What you get | Price |
|---|---|---|
| Free | Claude Sonnet access with daily limits (4.6 is the current default); message cap, no file uploads, no Projects | Free |
| Pro | Access to all Claude models, including Sonnet 4.5 via the model picker | $20/mo |
| Team | All Pro features, admin console, centralized billing, higher rate limits | $25/mo (annual) |

API pricing

| Provider | Notes | Input/output per 1M |
|---|---|---|
| Anthropic | API ID: claude-sonnet-4-5-20250929. Prompt caching: cached input at $0.30/1M (sketched below). Batch API: 50% discount. | $3/$15 |
| AWS Bedrock | ID: anthropic.claude-sonnet-4-5. Same pricing as direct. | $3/$15 |
| Google Vertex AI | Same pricing as direct. | $3/$15 |
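
The $0.30/1M cached-input rate applies to prompt caching: mark a large, stable prefix with cache_control, and subsequent calls read it back at the discounted rate. A minimal sketch (file name and prompts are illustrative):

```python
# Sketch: prompt caching on the Anthropic API. cache_control marks a
# reusable prefix; cached reads bill at the discounted rate quoted above.
import anthropic

client = anthropic.Anthropic()

BIG_CONTEXT = open("style_guide.md").read()  # hypothetical large, stable prefix

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": BIG_CONTEXT,
        "cache_control": {"type": "ephemeral"},  # cache this prefix
    }],
    messages=[{"role": "user", "content": "Review this function: ..."}],
)

# usage reports how much of the prompt was written to vs. served from cache
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```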

Prices verified February 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.