
Google

Gemini 3.1 Pro

Top Pick
8.7
out of 10

Released February 19, 2026, Gemini 3.1 Pro is Google's industrial-grade reasoning engine. It's not the fastest model and it won't write your quarterly business plan better than Claude — but nothing else alive scores 95.6% on τ²-bench or 94.1% on GPQA Diamond (both AA-measured). If your work involves multi-step autonomous agents, massive codebases, or hard science, this is the one.

Context window

1.0M tokens

API (blended)

$4.50/1M

Consumer access

Free (limited) / $20/mo

Multimodal

Yes

Score Breakdown

Total: 86.6/100 → 8.7/10

Intelligence, Reliability, Speed, and Context are field-relative — scores shift as models are added. Accessibility and Trust are absolute checklists. Full methodology →

Strengths

  • #1 Artificial Analysis Intelligence Index score (57) as of February 2026 — leads 115+ models
  • τ²-bench: 95.6% (AA-measured) — highest of any model; best-in-field for agentic tool use
  • GPQA Diamond: 94.1% and HLE: 44.7% — both AA-measured, both #1 across all models in the dataset
  • 65,536 output tokens — largest published output context of any frontier model
  • Three-tier thinking API (Low/Medium/High) — precisely balance speed vs. reasoning depth per request
  • Same $2/$12 pricing as Gemini 3 Pro — major capability upgrade at no extra cost
  • Dedicated custom-tools API endpoint (gemini-3.1-pro-preview-customtools) for agentic workflows

Weaknesses

  • Preview only as of February 2026 — not yet generally available
  • Time to first token: 29.96s — high latency makes it unsuitable for interactive or streaming use
  • GDPval-AA Elo only 1317 — trails Claude Sonnet 4.6 (1633) by 316 points on enterprise expert tasks
  • Very verbose — generates far more tokens per task (cost impact at scale)
  • Prompts over 200K tokens billed at 2× — full 1M context at scale gets expensive quickly

Best for

reasoning, agentic coding, scientific research, long documents, multimodal analysis, agentic pipelines, competitive programming

Not ideal for

enterprise expert tasks (Claude leads GDPval-AA by 316 points); real-time interactive use (29.96s time to first token); cost-sensitive very-long-context workloads (2× billing over 200K tokens)

Three-Tier Thinking System

Unlike older models with a binary fast/slow mode, Gemini 3.1 Pro gives you three thinking levels via the API. Choose wrong and you'll burn budget or get shallow answers.

Level | Latency (TTFT) | Best for | Cost impact
Low | 1–3 sec | Chat, translation, formatting, data extraction | Standard
Medium | 5–15 sec | Coding, summaries, API integration, daily tasks | Standard
High (default) | 30–90 sec+ | Scientific research, complex algorithms, agentic planning | 10–40× per query

The high tier generates 15,000–20,000 hidden thinking tokens billed at the standard output rate ($12/1M). Those tokens also count toward the 65,536-token output limit — so a long reasoning chain can eat your output budget before the answer is fully written.
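As a back-of-envelope sketch of that arithmetic (not billing code from any SDK; the $12/1M output rate and 65,536-token cap are taken from the text above):

```python
# Rates quoted in this review; assumed flat for the sketch.
OUTPUT_RATE_PER_TOKEN = 12.00 / 1_000_000   # $12 per 1M output tokens
OUTPUT_LIMIT = 65_536                       # published output cap

def high_tier_output_cost(thinking_tokens: int, answer_tokens: int) -> float:
    """Thinking tokens are billed as output and count against the output cap."""
    total = thinking_tokens + answer_tokens
    if total > OUTPUT_LIMIT:
        raise ValueError(f"{total} tokens exceeds the {OUTPUT_LIMIT}-token output cap")
    return total * OUTPUT_RATE_PER_TOKEN

# 20,000 hidden thinking tokens plus a 4,000-token answer:
cost = high_tier_output_cost(20_000, 4_000)   # 24,000 billed tokens ≈ $0.29
```

Note that 20,000 thinking tokens alone leave only about 45,500 tokens of the cap for the visible answer.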

Old API parameter deprecated

The legacy thinking_budget parameter is gone. Version 3.1 uses thinking_level: 'low' | 'medium' | 'high'. Passing both parameters in the same request throws a 400 error.
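If you wrap the API in your own client code, you can catch the deprecated parameter before the request leaves your machine. A minimal guard, assuming only what the paragraph above states (the parameter names and the High default; the helper itself is hypothetical, not part of any SDK):

```python
VALID_LEVELS = {"low", "medium", "high"}

def build_thinking_config(thinking_level=None, thinking_budget=None):
    """Reject the deprecated 3.0 parameter client-side instead of
    waiting for a 400 from the server."""
    if thinking_budget is not None:
        raise ValueError("thinking_budget is deprecated in 3.1; use thinking_level")
    level = thinking_level or "high"    # High is the documented default
    if level not in VALID_LEVELS:
        raise ValueError(f"thinking_level must be one of {sorted(VALID_LEVELS)}")
    return {"thinking_level": level}

build_thinking_config("low")   # {'thinking_level': 'low'}
```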

How It Benchmarks vs. Competitors

Numbers from single-attempt (pass@1) testing — no majority voting, no cherry-picking.

Knowledge & Science (AA-measured)

Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.2
GPQA Diamond (PhD science) | 94.1% | 84.0% | 90.3%
HLE — standard mode | 44.7% | 18.6% | 35.4%

All scores independently measured by Artificial Analysis in standard mode — no extended thinking, consistent methodology across all models. Gemini 3.1 Pro leads GPQA and HLE.

Coding & Tool Use (AA-measured)

Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.2
τ²-bench (tool use & agents) | 95.6% | 84.8% | 84.8%
LiveCodeBench (coding accuracy) | 88.9% | — | —

All scores independently measured by Artificial Analysis. τ²-bench tests multi-turn agentic tool use. Gemini 3.1 Pro leads τ²-bench by a significant margin.

Agentic & Tool Orchestration

Benchmark | Gemini 3.1 Pro | Claude Opus 4.6
τ²-bench (tool use & agents) | 95.6% | 84.8%

τ²-bench independently measured by Artificial Analysis — consistent methodology, standard mode. Gemini 3.1 Pro leads by 10.8 points.

Where it falls short: enterprise expert tasks

On GDPval-AA — financial modeling, legal analysis, strategic planning — Gemini 3.1 Pro scores 1317 Elo vs. Claude Sonnet 4.6's 1633. That's a 316-point gap. It's great at abstract logic but lacks the nuanced judgment of a business professional that Claude models exhibit.

Multimodal Input Capabilities

Gemini 3.1 Pro is natively multimodal — these modes are baked into the base architecture, not bolt-on OCR or transcription layers.

Input type | Capacity | Key notes
Text / Code | 1,048,576 tokens | Full codebases, logs, legal docs
Images | Up to 3,000 per prompt | PNG, JPEG, WEBP, HEIC, HEIF; 1,120 tokens/image
Video | 1 hr (no audio) / 45 min (with audio) | Up to 10 files; native YouTube URL parsing
Audio | Up to 8.4 hours | Speech and acoustic signals processed natively
PDFs | Up to 900 pages | Extracts text, diagrams, and formatting
File uploads | Up to 100MB per file | Up from 20MB in Gemini 3 Pro
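One practical wrinkle in the numbers above: at 1,120 tokens per image, the 1,048,576-token context window fills up well before the 3,000-image cap. A quick check (the helper is illustrative, not part of any SDK, and assumes every image costs the flat per-image rate):

```python
CONTEXT_WINDOW = 1_048_576    # total token budget
TOKENS_PER_IMAGE = 1_120      # flat cost per image, per the table above
IMAGE_CAP = 3_000             # per-prompt image limit

def max_images(prompt_text_tokens: int) -> int:
    """Images that fit alongside the text, respecting both limits."""
    budget = CONTEXT_WINDOW - prompt_text_tokens
    return min(IMAGE_CAP, max(0, budget) // TOKENS_PER_IMAGE)

max_images(0)        # 936 — the context window binds before the 3,000-image cap
max_images(500_000)  # 489 with half the window spent on text
```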

Output is text-only. The model can generate SVG code and 3D visualizations, but it does not produce rasterized images (JPEG/PNG). For image generation, you need Imagen 3 or Nano Banana Pro — separate models.

Pricing Deep Dive

The base price matches Gemini 3 Pro. But context caching and batch mode can dramatically cut costs for the right workloads.

API Pricing

Mode | Input | Output
Standard (≤200K-token context) | $2.00/1M | $12.00/1M
Standard (>200K-token context) | $4.00/1M | $18.00/1M
Batch API | $1.00/1M (≤200K) / $2.00/1M (>200K) | $6.00/1M
Context cache read | $0.20/1M (≤200K) / $0.40/1M (>200K) | —
Context cache storage | $4.50/1M tokens/hr | —
Search grounding | 5,000 queries/mo free, then $14.00/1K queries | —

Context caching cuts input costs by 90% when re-reading the same large document or codebase. Critical for long-session legal/code analysis where competitors without caching become cost-prohibitive. Data processed via the Paid API tier is not used to train Google's models.
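To see roughly when caching pays off, here is a simple savings sketch using the rates above. It assumes the document is ingested once at the standard rate, re-read from cache on every later request, and stored for the whole session; those billing details are assumptions for illustration, not verified billing behavior:

```python
# $/token rates from the pricing table (≤200K-token context).
STANDARD_IN = 2.00 / 1_000_000
CACHE_READ  = 0.20 / 1_000_000
CACHE_STORE = 4.50 / 1_000_000     # per token per hour

def session_input_cost(doc_tokens, requests, hours, cached):
    """Input-side cost of re-reading one large document across a session."""
    if not cached:
        return doc_tokens * STANDARD_IN * requests
    first_read = doc_tokens * STANDARD_IN                 # initial ingest
    rereads = doc_tokens * CACHE_READ * (requests - 1)    # cached re-reads
    storage = doc_tokens * CACHE_STORE * hours            # cache storage fee
    return first_read + rereads + storage

# 150K-token codebase, 50 requests over a 2-hour session:
session_input_cost(150_000, 50, 2, cached=False)  # $15.00
session_input_cost(150_000, 50, 2, cached=True)   # ≈ $3.12
```

The storage fee means caching only wins when re-reads are frequent relative to how long the cache sits idle.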

Agentic Features Worth Knowing

Thought Signatures: Encrypted snapshots of the model's internal reasoning state that must be passed back on each API turn. Prevents hallucination drift during long autonomous loops — a critical fix for agents that forget their objective mid-task.
Custom-Tools Endpoint: gemini-3.1-pro-preview-customtools — same weights, same pricing, but tuned to call your registered functions instead of defaulting to raw bash commands. Essential for multi-tool agentic pipelines. Trade-off: may regress on pure reasoning tasks without active tool use.
Antigravity IDE: Google's VS Code fork built for async multi-agent coding. Agents write code, spin up servers, and test their own UI via headless Chrome. Still experimental — documented cases of agents deleting files, recursive loops, and unauthorized git branches.
Native Search Grounding: Live Google Search integration that cuts hallucinations on current events. 5,000 free queries/month via API, then $14/1K queries.
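The thought-signature contract reduces to a simple loop: every turn returns an opaque signature, and the client must hand it back on the next call. This toy simulation shows only the shape of that contract; call_model is a stand-in, not a real SDK function, and real signatures are encrypted blobs, not readable strings:

```python
def call_model(message, thought_signature=None):
    """Stand-in for one API turn: returns (reply, new_signature).
    The real API returns an encrypted, opaque signature."""
    state = (thought_signature or "") + "|" + message
    return f"ack:{message}", state

def run_agent_loop(steps):
    """Thread the signature through every turn so reasoning state persists."""
    signature = None
    replies = []
    for step in steps:
        reply, signature = call_model(step, thought_signature=signature)
        replies.append(reply)
    return replies, signature

replies, sig = run_agent_loop(["plan", "search", "write"])
```

Dropping the signature between turns is exactly the failure mode the feature exists to prevent: the model loses its accumulated reasoning and drifts from the objective.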

Frontier Safety Framework Evaluation

Evaluated against FSF v3.0 (published September 2025). Critical Capability Levels (CCLs) define minimum capabilities for severe real-world harm.

Domain | Status | Key findings
Cybersecurity | Alert threshold met (Uplift Level 1) | Solved 11/12 v1 skills challenges; failed all 13 v2 end-to-end attack chains. CCL not reached.
CBRN | Threshold not met | Provides accurate textbook data but cannot generate actionable weaponization instructions.
ML R&D | Threshold not met | Substantially below Acceleration Level 1 and Automation Level 1 thresholds.
Misalignment | Exploratory — not reached | 3/11 situational awareness challenges. Deception found only in rare edge cases.

Chain-of-Thought Legibility mitigation: 99.98% of thinking tokens are comprehensible and 99.85% informative to safety auditors — preventing hidden malicious reasoning in the high thinking tier.

What the cybersecurity finding actually means

Solving 11/12 of the v1 skills challenges (reconnaissance, tool use, operational security) triggered an early warning alert. But the model scored 0/13 on v2's full end-to-end attack chain. It's capable at individual skills — it cannot string them into a complete autonomous attack. Google has deployed active mitigations regardless.

Bottom line

Gemini 3.1 Pro is the right tool if you're building autonomous agents, processing massive documents, or doing serious scientific or mathematical work. It's not the right tool if you need fast responses, nuanced business writing, or cost-efficient long-context queries at very high volume. For everyday chat and professional writing, Claude Sonnet 4.6 is still the better pick.

Pricing details

Subscription plans

Google AI Studio (Developer Free Tier)
Rate-limited API access (60 req/min) for development and evaluation; not a consumer product
Free

Google One AI Premium
Gemini 3.1 Pro access via the Gemini app, 2TB Google Drive storage, Gemini in Gmail/Docs/Sheets
$20/mo

API pricing

Google AI Studio
≤200K context: $2/$12 per 1M tokens; >200K: $4/$18 per 1M. Output: up to 65,536 tokens. Free tier: rate-limited (60 req/min). Three thinking tiers (Low/Medium/High) via API. Batch API, context caching, function calling, and search grounding supported. Custom-tools endpoint: gemini-3.1-pro-preview-customtools.
$2/$12

Google Vertex AI
Enterprise tier with SLAs. Same base pricing as AI Studio. Committed use discounts available. Preferred for production workloads.
$2/$12

Prices verified February 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.

Last updated: February 27, 2026