Gemini 3.1 Pro
Top Pick
Released February 19, 2026, Gemini 3.1 Pro is Google's industrial-grade reasoning engine. It's not the fastest model, and it won't write your quarterly business plan better than Claude, but nothing else alive scores 95.6% on τ²-bench or 94.1% on GPQA Diamond (both AA-measured). If your work involves multi-step autonomous agents, massive codebases, or hard science, this is the one.
Context window
1.0M tokens
API (blended)
$4.50/1M
Consumer access
Free (limited) / $20/mo
Multimodal
Yes
Score Breakdown
86.6/100 → 8.7/10
Intelligence, Reliability, Speed, and Context are field-relative: scores shift as models are added. Accessibility and Trust are absolute checklists. Full methodology →
Strengths
- #1 Artificial Analysis Intelligence Index score (57) as of February 2026 — leads 115+ models
- τ²-bench: 95.6% (AA-measured) — highest of any model; best-in-field for agentic tool use
- GPQA Diamond: 94.1% and HLE: 44.7% — both AA-measured, both #1 across all models in the dataset
- 65,536 output tokens — largest published output context of any frontier model
- Three-tier thinking API (Low/Medium/High) — precisely balance speed vs. reasoning depth per request
- Same $2/$12 pricing as Gemini 3 Pro — major capability upgrade at no extra cost
- Dedicated custom-tools API endpoint (gemini-3.1-pro-preview-customtools) for agentic workflows
Weaknesses
- Preview only as of February 2026 — not yet generally available
- Time to first token: 29.96s — high latency makes it unsuitable for interactive or streaming use
- GDPval-AA Elo only 1317 — trails Claude Sonnet 4.6 (1633) by 316 points on enterprise expert tasks
- Very verbose — generates far more tokens per task (cost impact at scale)
- Prompts over 200K tokens billed at 2× — full 1M context at scale gets expensive quickly
Three-Tier Thinking System
Unlike older models with a binary fast/slow mode, Gemini 3.1 Pro gives you three thinking levels via the API. Choose wrong and you'll burn budget or get shallow answers.
| Level | Latency (TTFT) | Best for | Cost impact |
|---|---|---|---|
| Low | 1–3 sec | Chat, translation, formatting, data extraction | Standard |
| Medium | 5–15 sec | Coding, summaries, API integration, daily tasks | Standard |
| High (default) | 30–90 sec+ | Scientific research, complex algorithms, agentic planning | 10–40× per query |
The high tier generates 15,000–20,000 hidden thinking tokens billed at the standard output rate ($12/1M). Those tokens also count toward the 65,536-token output limit — so a long reasoning chain can eat your output budget before the answer is fully written.
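As a sanity check on those numbers, here is a minimal cost sketch using the $12/1M output rate and the 65,536-token cap quoted above (`high_tier_overhead` is an illustrative helper, not an SDK function):

```python
OUTPUT_RATE = 12.00 / 1_000_000  # $ per output token, standard tier
OUTPUT_CAP = 65_536              # total output budget, thinking tokens included

def high_tier_overhead(thinking_tokens: int, answer_tokens: int):
    """Cost and remaining output headroom when hidden thinking tokens
    are billed as output and count against the 65,536-token cap."""
    cost = (thinking_tokens + answer_tokens) * OUTPUT_RATE
    remaining = OUTPUT_CAP - thinking_tokens - answer_tokens
    return round(cost, 4), remaining

# 20,000 hidden thinking tokens plus a 2,000-token visible answer:
cost, remaining = high_tier_overhead(20_000, 2_000)
# cost ≈ $0.26 per query; 43,536 tokens of output budget left
```

At the top of the high tier's range, a single query's output-side cost is measurable before the visible answer even starts, which is where the 10–40× multiplier in the table comes from.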
Old API parameter deprecated
The legacy `thinking_budget` parameter is gone. Gemini 3.1 uses `thinking_level: 'low' | 'medium' | 'high'`. Passing both parameters returns a 400 error.
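A minimal sketch of the migration logic, assuming a hypothetical request-building helper (`build_thinking_config` is not a real SDK function; it simply mirrors the documented server-side rule):

```python
def build_thinking_config(thinking_level=None, thinking_budget=None):
    """Hypothetical helper mirroring Gemini 3.1's validation: accept
    thinking_level only, and reject requests that still pass the
    legacy thinking_budget parameter alongside it."""
    if thinking_budget is not None and thinking_level is not None:
        raise ValueError("400: thinking_budget is deprecated; use thinking_level only")
    level = thinking_level or "high"  # high is the documented default
    if level not in ("low", "medium", "high"):
        raise ValueError(f"invalid thinking_level: {level!r}")
    return {"thinking_level": level}

config = build_thinking_config(thinking_level="medium")
# → {"thinking_level": "medium"}
```

Audit any wrapper code for the old parameter before upgrading; a request that silently carried `thinking_budget` through will start failing rather than being ignored.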
How It Benchmarks vs. Competitors
Numbers from single-attempt (pass@1) testing — no majority voting, no cherry-picking.
Knowledge & Science (AA-measured)
| Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| GPQA Diamond (PhD science) | 94.1% | 84.0% | 90.3% |
| HLE — standard mode | 44.7% | 18.6% | 35.4% |
All scores independently measured by Artificial Analysis in standard mode — no extended thinking, consistent methodology across all models. Gemini 3.1 Pro leads GPQA and HLE.
Coding & Tool Use (AA-measured)
| Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| τ²-bench (tool use & agents) | 95.6% | 84.8% | 84.8% |
| LiveCodeBench (coding accuracy) | — | — | 88.9% |
All scores independently measured by Artificial Analysis. τ²-bench tests multi-turn agentic tool use. Gemini 3.1 Pro leads τ²-bench by a significant margin.
Where it falls short: enterprise expert tasks
On GDPval-AA (financial modeling, legal analysis, strategic planning), Gemini 3.1 Pro scores 1317 Elo vs. Claude Sonnet 4.6's 1633, a 316-point gap. It excels at abstract logic but lacks the nuanced professional judgment that Claude models bring to business tasks.
Multimodal Input Capabilities
Gemini 3.1 Pro is natively multimodal — these modes are baked into the base architecture, not bolt-on OCR or transcription layers.
| Input type | Capacity | Key notes |
|---|---|---|
| Text / Code | 1,048,576 tokens | Full codebases, logs, legal docs |
| Images | Up to 3,000 per prompt | PNG, JPEG, WEBP, HEIC, HEIF — 1,120 tokens/image |
| Video | 1 hr (no audio) / 45 min (with audio) | Up to 10 files; native YouTube URL parsing |
| Audio | Up to 8.4 hours | Speech and acoustic signals processed natively |
| PDFs | Up to 900 pages | Extracts text, diagrams, and formatting |
| File uploads | Up to 100MB per file | Up from 20MB in Gemini 3 Pro |
Output is text-only. The model can generate SVG code and 3D visualizations, but it does not produce rasterized images (JPEG/PNG). For image generation, you need Imagen 3 or Nano Banana Pro — separate models.
Pricing Deep Dive
The base price matches Gemini 3 Pro. But context caching and batch mode can dramatically cut costs for the right workloads.
API Pricing
| Mode | Input (≤200K tokens) | Input (>200K tokens) | Output |
|---|---|---|---|
| Standard | $2.00/1M | $4.00/1M | $12.00/1M (≤200K) / $18.00/1M (>200K) |
| Batch API | $1.00/1M | $2.00/1M | $6.00/1M |
| Context Cache read | $0.20/1M | $0.40/1M | — |
| Context Cache storage | $4.50/1M tokens/hr | — | — |
| Search Grounding | 5,000 queries/mo free, then $14.00/1K queries | — | — |
Context caching cuts input costs by 90% when re-reading the same large document or codebase. Critical for long-session legal/code analysis where competitors without caching become cost-prohibitive. Data processed via the Paid API tier is not used to train Google's models.
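To make the caching claim concrete, here is a back-of-the-envelope input-cost comparison using the rates in the table above (ignoring the one-time cost of writing the cache; `session_input_cost` is an illustrative helper, not an SDK function):

```python
# Rates from the pricing table, in $ per token
STANDARD_IN = {"small": 2.00e-6, "large": 4.00e-6}   # ≤200K vs >200K prompts
CACHE_READ = {"small": 0.20e-6, "large": 0.40e-6}
CACHE_STORAGE_PER_TOKEN_HR = 4.50e-6

def session_input_cost(doc_tokens: int, reads: int, hours: float, cached: bool):
    """Input-side cost of re-reading one large document `reads` times
    over a session of `hours` hours, with or without context caching."""
    tier = "large" if doc_tokens > 200_000 else "small"
    if not cached:
        return doc_tokens * reads * STANDARD_IN[tier]
    return (doc_tokens * reads * CACHE_READ[tier]
            + doc_tokens * hours * CACHE_STORAGE_PER_TOKEN_HR)

# A 500K-token codebase re-read 20 times over a 2-hour session:
uncached = session_input_cost(500_000, 20, 2, cached=False)  # ≈ $40
cached = session_input_cost(500_000, 20, 2, cached=True)     # ≈ $8.50
```

The read cost itself drops 90% ($40 → $4 in this example); storage fees then claw some of that back, so caching pays off when re-reads are frequent relative to how long the cache sits idle.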
Frontier Safety Framework Evaluation
Evaluated against FSF v3.0 (published September 2025). Critical Capability Levels (CCLs) define minimum capabilities for severe real-world harm.
| Domain | Status | Key findings |
|---|---|---|
| Cybersecurity | Alert threshold met (Uplift Level 1) | Solved 11/12 v1 skills challenges. Failed all 13 v2 end-to-end attack chains. CCL not reached. |
| CBRN | Threshold not met | Provides accurate textbook data but cannot generate actionable weaponization instructions. |
| ML R&D | Threshold not met | Substantially below Acceleration Level 1 and Automation Level 1 thresholds. |
| Misalignment | Exploratory — not reached | 3/11 situational awareness challenges. Deception found only in rare edge cases. |
Chain-of-Thought Legibility mitigation: 99.98% of thinking tokens are comprehensible and 99.85% are informative to safety auditors, making it hard for malicious reasoning to hide in the high thinking tier.
What the cybersecurity finding actually means
Solving 11/12 of the v1 skills challenges (reconnaissance, tool use, operational security) triggered an early warning alert. But the model scored 0/13 on v2's full end-to-end attack chain. It's capable at individual skills — it cannot string them into a complete autonomous attack. Google has deployed active mitigations regardless.
Bottom line
Gemini 3.1 Pro is the right tool if you're building autonomous agents, processing massive documents, or doing serious scientific or mathematical work. It's not the right tool if you need fast responses, nuanced business writing, or cost-efficient long-context queries at very high volume. For everyday chat and professional writing, Claude Sonnet 4.6 is still the better pick.
Prices verified February 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.
Last updated: February 27, 2026