Gemini 3.1 Pro
Top Pick
Released February 19, 2026, Gemini 3.1 Pro is Google's industrial-grade reasoning engine. It's not the fastest model, and it won't write your quarterly business plan better than Claude, but nothing else alive scores 95.6% on τ²-bench or 94.1% on GPQA Diamond (both AA-measured). If your work involves multi-step autonomous agents, massive codebases, or hard science, this is the one.
Context window
1.0M tokens
API (blended)
$4.50/1M
Consumer access
Free (limited) / $20/mo
Multimodal
Yes
Score Breakdown
86.6/100 → 8.7/10
Intelligence, Reliability, Speed, and Context are field-relative: scores shift as models are added. Accessibility and Trust are absolute checklists. Full methodology →
Strengths
- #1 Artificial Analysis Intelligence Index score (57) as of February 2026 — leads 115+ models
- τ²-bench: 95.6% (AA-measured) — highest of any model; best-in-field for agentic tool use
- GPQA Diamond: 94.1% and HLE: 44.7% — both AA-measured, both #1 across all models in the dataset
- 65,536 output tokens — largest published output context of any frontier model
- Three-tier thinking API (Low/Medium/High) — precisely balance speed vs. reasoning depth per request
- Same $2/$12 pricing as Gemini 3 Pro — major capability upgrade at no extra cost
- Dedicated custom-tools API endpoint (gemini-3.1-pro-preview-customtools) for agentic workflows
Weaknesses
- Preview only as of February 2026 — not yet generally available
- Time to first token: 29.96s — high latency makes it unsuitable for interactive or streaming use
- GDPval-AA Elo only 1317 — trails Claude Sonnet 4.6 (1633) by 316 points on enterprise expert tasks
- Very verbose — generates far more tokens per task (cost impact at scale)
- Prompts over 200K tokens billed at 2× — full 1M context at scale gets expensive quickly
Three-Tier Thinking System
Unlike older models with a binary fast/slow mode, Gemini 3.1 Pro gives you three thinking levels via the API. Choose wrong and you'll burn budget or get shallow answers.
| Level | Latency (TTFT) | Best for | Cost impact |
|---|---|---|---|
| Low | 1–3 sec | Chat, translation, formatting, data extraction | Standard |
| Medium | 5–15 sec | Coding, summaries, API integration, daily tasks | Standard |
| High (default) | 30–90 sec+ | Scientific research, complex algorithms, agentic planning | 10–40× per query |
The high tier generates 15,000–20,000 hidden thinking tokens billed at the standard output rate ($12/1M). Those tokens also count toward the 65,536-token output limit — so a long reasoning chain can eat your output budget before the answer is fully written.
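As a sanity check on those numbers, here is a minimal cost sketch using the $12/1M output rate and the 65,536-token cap quoted above (`high_tier_overhead` is an illustrative helper, not an SDK function):

```python
OUTPUT_RATE = 12.00 / 1_000_000  # $ per output token, standard tier
OUTPUT_CAP = 65_536              # total output budget, thinking tokens included

def high_tier_overhead(thinking_tokens: int, answer_tokens: int):
    """Cost and remaining output headroom when hidden thinking tokens
    are billed as output and count against the 65,536-token cap."""
    cost = (thinking_tokens + answer_tokens) * OUTPUT_RATE
    remaining = OUTPUT_CAP - thinking_tokens - answer_tokens
    return round(cost, 4), remaining

# 20,000 hidden thinking tokens plus a 2,000-token visible answer:
cost, remaining = high_tier_overhead(20_000, 2_000)
# cost ≈ $0.26 per query; 43,536 tokens of output budget left
```

At the top of the high tier's range, a single query's output-side cost is measurable before the visible answer even starts, which is where the 10–40× multiplier in the table comes from.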
Old API parameter deprecated
The legacy `thinking_budget` parameter is gone. Gemini 3.1 uses `thinking_level: 'low' | 'medium' | 'high'`. Passing both parameters returns a 400 error.
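A minimal sketch of the migration logic, assuming a hypothetical request-building helper (`build_thinking_config` is not a real SDK function; it simply mirrors the documented server-side rule):

```python
def build_thinking_config(thinking_level=None, thinking_budget=None):
    """Hypothetical helper mirroring Gemini 3.1's validation: accept
    thinking_level only, and reject requests that still pass the
    legacy thinking_budget parameter alongside it."""
    if thinking_budget is not None and thinking_level is not None:
        raise ValueError("400: thinking_budget is deprecated; use thinking_level only")
    level = thinking_level or "high"  # high is the documented default
    if level not in ("low", "medium", "high"):
        raise ValueError(f"invalid thinking_level: {level!r}")
    return {"thinking_level": level}

config = build_thinking_config(thinking_level="medium")
# → {"thinking_level": "medium"}
```

Audit any wrapper code for the old parameter before upgrading; a request that silently carried `thinking_budget` through will start failing rather than being ignored.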
How It Benchmarks vs. Competitors
Numbers from single-attempt (pass@1) testing — no majority voting, no cherry-picking.
Knowledge & Science (AA-measured)
| Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| GPQA Diamond (PhD science) | 94.1% | 84.0% | 90.3% |
| HLE — standard mode | 44.7% | 18.6% | 35.4% |
All scores independently measured by Artificial Analysis in standard mode — no extended thinking, consistent methodology across all models. Gemini 3.1 Pro leads GPQA and HLE.
Coding & Tool Use (AA-measured)
| Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| τ²-bench (tool use & agents) | 95.6% | 84.8% | 84.8% |
| LiveCodeBench (coding accuracy) | — | — | 88.9% |
All scores independently measured by Artificial Analysis. τ²-bench tests multi-turn agentic tool use. Gemini 3.1 Pro leads τ²-bench by a significant margin.
Where it falls short: enterprise expert tasks
On GDPval-AA (financial modeling, legal analysis, strategic planning), Gemini 3.1 Pro scores 1317 Elo vs. Claude Sonnet 4.6's 1633, a 316-point gap. It excels at abstract logic but lacks the nuanced professional judgment that Claude models bring to business tasks.
Multimodal Input Capabilities
Gemini 3.1 Pro is natively multimodal — these modes are baked into the base architecture, not bolt-on OCR or transcription layers.
| Input type | Capacity | Key notes |
|---|---|---|
| Text / Code | 1,048,576 tokens | Full codebases, logs, legal docs |
| Images | Up to 3,000 per prompt | PNG, JPEG, WEBP, HEIC, HEIF — 1,120 tokens/image |
| Video | 1 hr (no audio) / 45 min (with audio) | Up to 10 files; native YouTube URL parsing |
| Audio | Up to 8.4 hours | Speech and acoustic signals processed natively |
| PDFs | Up to 900 pages | Extracts text, diagrams, and formatting |
| File uploads | Up to 100MB per file | Up from 20MB in Gemini 3 Pro |
Output is text-only. The model can generate SVG code and 3D visualizations, but it does not produce rasterized images (JPEG/PNG). For image generation, you need Imagen 3 or Nano Banana Pro — separate models.
Pricing Deep Dive
The base price matches Gemini 3 Pro. But context caching and batch mode can dramatically cut costs for the right workloads.
API Pricing
| Mode | Input (≤200K tokens) | Input (>200K tokens) | Output |
|---|---|---|---|
| Standard | $2.00/1M | $4.00/1M | $12.00/1M (≤200K) / $18.00/1M (>200K) |
| Batch API | $1.00/1M | $2.00/1M | $6.00/1M |
| Context Cache read | $0.20/1M | $0.40/1M | — |
| Context Cache storage | $4.50/1M tokens/hr | — | — |
| Search Grounding | 5,000 queries/mo free, then $14.00/1K queries | — | — |
Context caching cuts input costs by 90% when re-reading the same large document or codebase. Critical for long-session legal/code analysis where competitors without caching become cost-prohibitive. Data processed via the Paid API tier is not used to train Google's models.
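To make the caching claim concrete, here is a back-of-the-envelope input-cost comparison using the rates in the table above (ignoring the one-time cost of writing the cache; `session_input_cost` is an illustrative helper, not an SDK function):

```python
# Rates from the pricing table, in $ per token
STANDARD_IN = {"small": 2.00e-6, "large": 4.00e-6}   # ≤200K vs >200K prompts
CACHE_READ = {"small": 0.20e-6, "large": 0.40e-6}
CACHE_STORAGE_PER_TOKEN_HR = 4.50e-6

def session_input_cost(doc_tokens: int, reads: int, hours: float, cached: bool):
    """Input-side cost of re-reading one large document `reads` times
    over a session of `hours` hours, with or without context caching."""
    tier = "large" if doc_tokens > 200_000 else "small"
    if not cached:
        return doc_tokens * reads * STANDARD_IN[tier]
    return (doc_tokens * reads * CACHE_READ[tier]
            + doc_tokens * hours * CACHE_STORAGE_PER_TOKEN_HR)

# A 500K-token codebase re-read 20 times over a 2-hour session:
uncached = session_input_cost(500_000, 20, 2, cached=False)  # ≈ $40
cached = session_input_cost(500_000, 20, 2, cached=True)     # ≈ $8.50
```

The read cost itself drops 90% ($40 → $4 in this example); storage fees then claw some of that back, so caching pays off when re-reads are frequent relative to how long the cache sits idle.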
Frontier Safety Framework Evaluation
Evaluated against FSF v3.0 (published September 2025). Critical Capability Levels (CCLs) define minimum capabilities for severe real-world harm.
| Domain | Status | Key findings |
|---|---|---|
| Cybersecurity | Alert threshold met (Uplift Level 1) | Solved 11/12 v1 skills challenges. Failed all 13 v2 end-to-end attack chains. CCL not reached. |
| CBRN | Threshold not met | Provides accurate textbook data but cannot generate actionable weaponization instructions. |
| ML R&D | Threshold not met | Substantially below Acceleration Level 1 and Automation Level 1 thresholds. |
| Misalignment | Exploratory — not reached | 3/11 situational awareness challenges. Deception found only in rare edge cases. |
Chain-of-Thought Legibility mitigation: 99.98% of thinking tokens are comprehensible and 99.85% are informative to safety auditors, making it hard for malicious reasoning to hide in the high thinking tier.
What the cybersecurity finding actually means
Solving 11/12 of the v1 skills challenges (reconnaissance, tool use, operational security) triggered an early warning alert. But the model scored 0/13 on v2's full end-to-end attack chain. It's capable at individual skills — it cannot string them into a complete autonomous attack. Google has deployed active mitigations regardless.
Bottom line
Gemini 3.1 Pro is the right tool if you're building autonomous agents, processing massive documents, or doing serious scientific or mathematical work. It's not the right tool if you need fast responses, nuanced business writing, or cost-efficient long-context queries at very high volume. For everyday chat and professional writing, Claude Sonnet 4.6 is still the better pick.
Prices verified February 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.
Last updated: February 27, 2026