Gemini 3 Pro
Released November 18, 2025, Gemini 3 Pro was the first model to break 1,500 Elo on LMArena and led 13 of 16 major benchmarks at launch. Three months later, Google shipped Gemini 3.1 Pro at the same price — better reasoning across the board — and scheduled Gemini 3 Pro for deprecation on March 9, 2026. If you're starting fresh, use 3.1 Pro. For existing deployments, the migration is a model string swap. The model is still capable: 138 t/s output, a real 1M-token context window, and native multimodal inputs including up to 9.5 hours of audio and an hour of video per call. Just know what you're working with.
Context window
1.0M tokens
API (blended)
$4.50/1M
Consumer access
Free (limited) / $20/mo
Multimodal
Yes
Score Breakdown
77.6/100 → 7.8/10
Intelligence, Reliability, Speed, and Context are field-relative — scores shift as models are added. Accessibility and Trust are absolute checklists. Full methodology →
Strengths
- +1M token context — handles book-length documents, entire codebases, and multi-hour transcripts natively
- +GPQA Diamond 90.8% and HLE 37.2% (AA-measured) — top-tier scientific reasoning
- +τ²-bench 87.1% and LiveCodeBench 91.7% (AA-measured) — excellent for coding and agentic tool use
- +AA Intelligence Index 48 — second only to Gemini 3.1 Pro in the Gemini family
- +Natively multimodal: text, image, audio, video input in a single API call
- +138 t/s output speed (AA-measured) — significantly faster than Claude and GPT-5.2
- +Free tier via AI Studio — good for development and prototyping
Weaknesses
- -Deprecated March 9, 2026 — migrate to gemini-3.1-pro-preview (same price, better reasoning)
- -88% hallucination rate (AA-Omniscience) — fabricates confident answers when it doesn't know something
- -Extreme verbosity inflates real API costs: Artificial Analysis spent $892 evaluating it vs ~$100 for other models
- -GDPval-AA: 1,317 Elo — trails Claude Sonnet 4.6 by 316 points on office and professional judgment tasks
- -No free API tier — free API access is limited to Gemini 3 Flash
- -Output truncation bug cuts code generation at ~21K tokens despite 65K stated cap (fixed in 3.1 Pro)
Deprecated March 9, 2026 — plan your migration
Gemini 3 Pro (model string: gemini-3-pro-preview) will be decommissioned on March 9, 2026. API calls to this endpoint fail after that date. The replacement is gemini-3.1-pro-preview — same pricing, no migration cost, and significantly better on ARC-AGI-2 (31.1% → 77.1%), GPQA Diamond (91.9% → 94.3%), and SWE-Bench Verified (76.2% → 80.6%). Update the model string, test your prompts. That's the whole migration.
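The swap really is that small. A minimal sketch using the model strings from this page (the helper function is ours, for illustration):

```python
# Deprecated and replacement model strings from the migration note above.
OLD_MODEL = "gemini-3-pro-preview"    # decommissioned March 9, 2026
NEW_MODEL = "gemini-3.1-pro-preview"  # same $2/$12 pricing

def migrate_model_string(model: str) -> str:
    """Return the replacement for the deprecated model; pass other strings through."""
    return NEW_MODEL if model == OLD_MODEL else model
```

The call site itself shouldn't need to change, only the `model=` argument you pass, which is why prompt regression testing is the bulk of the migration work.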
Benchmark Performance
All scores independently measured by Artificial Analysis in standard mode — no extended thinking, same methodology across all models.
Knowledge & Science (AA-measured)
| Benchmark | Gemini 3 Pro | GPT-5.2 | Claude Opus 4.6 |
|---|---|---|---|
| GPQA Diamond (PhD science) | 90.8% | 90.3% | 84.0% |
| HLE (expert-level knowledge) | 37.2% | 35.4% | 18.6% |
Gemini 3 Pro edges GPT-5.2 on both. Gemini 3.1 Pro extended those leads further: GPQA Diamond 94.3%, HLE 44.7%.
Coding & Tool Use (AA-measured)
| Benchmark | Gemini 3 Pro | GPT-5.2 | Claude Opus 4.6 |
|---|---|---|---|
| τ²-bench (multi-turn tool use) | 87.1% | 84.8% | 84.8% |
| LiveCodeBench (coding accuracy) | 91.7% | 88.9% | — |
Gemini 3 Pro leads on both. Tool use is a consistent strength across the Gemini 3 family; Gemini 3.1 Pro pushed τ²-bench further to 95.6%.
Where competitors still lead
GDPval-AA — everyday office tasks, strategic planning, financial analysis — is where Claude models win. Gemini 3 Pro scores 1,317 Elo against Claude Sonnet 4.6's 1,633. That gap is real. For professional judgment work requiring contextual reasoning over structured data, the Claude models are the better choice.
What 1M Tokens Actually Means
The 1,048,576-token input limit is real and tested. Here's what fits.
| Input type | Capacity | Notes |
|---|---|---|
| Text / code | ~750,000 words | Full codebases, legal filings, research archives |
| Images | Up to 900 per prompt | PNG, JPEG, WEBP, HEIC, HEIF — ~1,120 tokens each |
| Audio | Up to 9.5 hours | 32 tokens per second — speech and acoustic signals natively |
| Video | Up to 1 hour | 45 min with audio; YouTube URLs supported; up to 10 files per prompt |
| PDFs | Up to ~1,000 pages | Text, diagrams, and layout extracted natively |
| Max output | 65,536 tokens (~49K words) | Note: Gemini 3 Pro had a bug cutting code at ~21K tokens. Gemini 3.1 Pro fixed it. |
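The per-type token rates in the table can be sanity-checked with quick arithmetic (rates are from the table above; the script itself is just illustration):

```python
# Token arithmetic using the rates quoted in the capacity table.
AUDIO_TOKENS_PER_SECOND = 32
TOKENS_PER_IMAGE = 1_120        # approximate
CONTEXT_LIMIT = 1_048_576

one_hour_audio = AUDIO_TOKENS_PER_SECOND * 3600  # 115,200 tokens per hour of audio
max_images = 900 * TOKENS_PER_IMAGE              # 1,008,000 tokens at the 900-image cap

# A maxed-out image prompt sits just under the 1M-token window.
headroom = CONTEXT_LIMIT - max_images            # ~40K tokens left over for text
```

In other words, the 900-image cap is roughly where image tokens alone would fill the context window.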
Prompts over 200K tokens trigger the long-context pricing tier: $4/$18 per 1M instead of $2/$12. Build that into your cost estimates for large-document jobs.
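The tier logic is easy to bake into a cost estimate. A sketch using the rates on this page, assuming the whole request bills at the tier its prompt size triggers:

```python
# Two-tier pricing from the tables on this page.
STANDARD = (2.00, 12.00)      # ($/1M input, $/1M output), prompts <= 200K tokens
LONG_CONTEXT = (4.00, 18.00)  # prompts > 200K tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost; the long-context tier kicks in past 200K input tokens."""
    in_rate, out_rate = LONG_CONTEXT if input_tokens > 200_000 else STANDARD
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate
```

A 500K-token prompt with a 20K-token reply comes to about $2.36 rather than the $1.24 the standard tier would suggest.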
Pricing — Cheap Listed Rate, Higher Real Cost
The headline numbers are competitive. Verbosity is where the budget goes.
API pricing
| Context tier | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| ≤200K tokens | $2.00 | $12.00 |
| >200K tokens | $4.00 | $18.00 |
| Batch API (≤200K) | $1.00 | $6.00 |

Beyond token rates: context cache reads cost $0.20 per 1M tokens, cache storage runs $4.50 per 1M tokens per hour, and Search grounding includes 5,000 free queries per month, then $14.00 per 1,000 queries.
No free API tier for Gemini 3 Pro. Free API access is limited to Gemini 3 Flash. Consumer access starts at Google AI Pro ($19.99/month). Context caching cuts repeated input costs by 90% — essential for long-session document analysis.
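The 90% figure follows directly from the rate card: $0.20 cache reads versus $2.00 fresh input at the standard tier. A minimal sketch:

```python
# Cached vs. uncached input cost at the standard (<=200K) tier.
INPUT_RATE = 2.00        # $/1M fresh input tokens
CACHE_READ_RATE = 0.20   # $/1M cached input tokens

def input_cost(tokens: int, cached: bool = False) -> float:
    """Input-side cost only; cache storage ($4.50/1M tokens/hr) is billed separately."""
    return tokens / 1e6 * (CACHE_READ_RATE if cached else INPUT_RATE)
```

Rereading a 150K-token document ten times costs $3.00 uncached; with caching the reads come to $0.30, plus storage for however long the cache lives.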
The verbosity problem — real costs are much higher than listed
Artificial Analysis spent $892 evaluating Gemini 3 Pro on their Intelligence Index. Other frontier models averaged roughly $100 for the same evaluation. Gemini 3 Pro generated 57 million output tokens where competitors generated around 12 million. At $12 per million output tokens, that's a 4–5× cost multiplier at equivalent task quality. Cap output length in production (`max_output_tokens` in the Gemini API) or the bill will surprise you.
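The 4–5× figure checks out from the token counts alone. Reproducing the arithmetic with the numbers from the AA evaluation described above:

```python
# Output-token arithmetic behind the verbosity cost multiplier.
OUTPUT_RATE = 12.00  # $/1M output tokens, standard tier

gemini_cost = 57_000_000 / 1e6 * OUTPUT_RATE   # $684 in output tokens alone
typical_cost = 12_000_000 / 1e6 * OUTPUT_RATE  # $144 for a typical frontier model
multiplier = gemini_cost / typical_cost        # 4.75x
```

Output tokens alone account for most of the $892 total, which is why an output cap is the single most effective cost control for this model.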
Built-in Tools
88% hallucination rate — read this before deploying
Artificial Analysis measured an 88% hallucination rate on their Omniscience evaluation. When the model can't reliably answer something, it produces a confident wrong answer 88% of the time rather than acknowledging uncertainty. For comparison: Claude 4.5 Haiku was 26%, Claude 4.5 Sonnet 48%. Gemini 3.1 Pro reportedly improved to around 50% — still high. For anything where factual accuracy matters, pair this model with Search grounding and build output verification into your pipeline.
Gemini 3 Pro vs Gemini 3.1 Pro — Is the Upgrade Worth It?
Same price. Significantly better reasoning. Short answer: yes.
| Dimension | Gemini 3 Pro | Gemini 3.1 Pro | Change |
|---|---|---|---|
| ARC-AGI-2 | 31.1% | 77.1% | +46.0pp — more than doubled |
| GPQA Diamond (provider-reported) | 91.9% | 94.3% | +2.4pp |
| SWE-Bench Verified | 76.2% | 80.6% | +4.4pp |
| Humanity's Last Exam (no tools) | 37.5% | 44.4% | +6.9pp |
| Thinking levels | LOW + HIGH | LOW + MEDIUM + HIGH | MEDIUM added |
| Output truncation bug | Cuts at ~21K tokens | Fixed | — |
| Token efficiency | Verbose | Improved | Lower real cost per task |
| Deprecation date | March 9, 2026 | None set | — |
| API pricing | $2/$12 per 1M | $2/$12 per 1M | Identical |
The output truncation bug in Gemini 3 Pro silently cut code generation at ~21,000 tokens despite the stated 65,536-token cap. If you've hit this in production, that alone is reason to switch — not just a benchmark improvement.
Bottom line
Gemini 3 Pro is a capable model at a competitive price — real 1M-token context, strong science and coding benchmarks, and built-in Search grounding. But it has a March 9, 2026 deprecation date, an 88% hallucination rate, and verbosity that pushes real API costs well above the listed rate. Migrate to Gemini 3.1 Pro: same price, meaningfully better reasoning, fixed truncation bug. If you're deciding which Gemini model to use from scratch, the 3.1 Pro review is where to start.
Prices verified February 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.
Last updated: February 27, 2026