Gemini 3 Flash
Fastest
Released December 17, 2025, Gemini 3 Flash was distilled from Gemini 3 Pro, then outperformed it on SWE-bench Verified (78% vs 76.2%). Google made it the default model for the Gemini app and AI Mode in Google Search within days of launch. At $0.50/$3.00 per 1M tokens with a 1M context window, 214 t/s output speed, and 90.4% on GPQA Diamond, it's the new baseline against which everything else gets measured. The main catches: a 91% hallucination rate that needs to be mitigated, and text-only output with no image or audio generation.
Context window
1.0M tokens
API (blended)
$1.13/1M
Consumer access
Free
Multimodal
Yes
Score Breakdown
77.5/100 → 7.8/10
Intelligence, Reliability, Speed, and Context are field-relative: scores shift as models are added. Accessibility and Trust are absolute checklists. Full methodology →
Strengths
- +SWE-bench Verified: 78% — beats Gemini 3 Pro (76.2%) on real-world coding despite being the smaller model
- +GPQA Diamond: 90.4% — within 1.5 points of Gemini 3 Pro at one-quarter the API cost
- +214 t/s output speed (AA-measured) — significantly faster than Claude Sonnet 4.6 (56 t/s) or GPT-5.2
- +AIME 2025: 95.2% without tools, 99.7% with code execution — top-tier math reasoning
- +Four thinking levels (minimal/low/medium/high) — more granular cost-quality control than any competitor
- +1M token context window at $0.50/$3.00 per 1M — 4× cheaper than Gemini 3 Pro
- +Default model in the Gemini app and Google Search — available free at gemini.google.com
Weaknesses
- -91% hallucination rate (AA-Omniscience) — fabricates confident answers even more often than Gemini 3 Pro (88%)
- -Text-only output — no native image or audio generation
- -Image segmentation removed — use Gemini 2.5 Flash (thinking off) for pixel-level masks
- -ARC-AGI-2 abstract reasoning: 33.6% vs GPT-5.2's 53% — 19-point gap on the hardest reasoning tasks
- -Free API tier quotas cut ~90% in December 2025 — from ~250 to ~20 requests/day
- -Preview status — no production SLA; model may change before GA
A distilled model that beat its teacher on coding
Gemini 3 Flash was built using knowledge distillation from Gemini 3 Pro — reasoning pathways from the larger model compressed into a faster, cheaper architecture. The distillation sharpened specific paths rather than just compressing them: Flash outperforms Pro on SWE-bench Verified (78% vs 76.2%), MMMU Pro (81.2% vs 81.0%), and ARC-AGI-2 (33.6% vs 31.1%). SWE-bench tests real GitHub bug-fixing on unmodified issues. Beating the teacher model on it is not a statistical artifact.
Benchmark Performance
Numbers from Google's published model card. Rows where Flash beats Gemini 3 Pro are marked ✓.
Knowledge, Science & Math
| Benchmark | Gemini 3 Flash | Gemini 3 Pro | Gemini 2.5 Pro |
|---|---|---|---|
| GPQA Diamond (PhD science) | 90.4% | 91.9% | 86.4% |
| Humanity's Last Exam (no tools) | 33.7% | 37.5% | 21.6% |
| AIME 2025 (no tools) | 95.2% | 95.0% | 88.0% |
| AIME 2025 (with code execution) | 99.7% | — | — |
| MMMU Pro (multimodal reasoning) | 81.2% ✓ | 81.0% | 68.0% |
✓ marks where Flash beats Gemini 3 Pro. AIME 2025 (no tools): Flash edges Pro by 0.2 points. GPQA and HLE: Pro holds a modest lead. These are provider-reported scores from Google's model card, not AA-measured in standard mode.
Coding & Tool Use
| Benchmark | Gemini 3 Flash | Gemini 3 Pro | Gemini 2.5 Pro |
|---|---|---|---|
| SWE-bench Verified (real-world coding) | 78.0% ✓ | 76.2% | 59.6% |
| ARC-AGI-2 (abstract reasoning) | 33.6% ✓ | 31.1% | 4.9% |
| MCP Atlas (tool orchestration) | 57.4% | — | 8.8% |
| ScreenSpot-Pro (UI navigation) | 69.1% | — | 11.4% |
| Toolathlon (multi-tool use) | 49.4% | — | 10.5% |
| Video-MMMU | 86.9% | 87.6% | 83.6% |
✓ marks where Flash beats Gemini 3 Pro. The tool-use improvements over Gemini 2.5 Pro are dramatic — MCP Atlas jumped from 8.8% to 57.4%, a 6.5× improvement. GPT-5.2 leads ARC-AGI-2 at 53%, a 19-point gap over Flash.
Where GPT-5.2 leads
ARC-AGI-2 abstract reasoning: GPT-5.2 scores 53% to Flash's 33.6% — a 19-point gap. For tasks requiring the hardest abstract problem-solving that doesn't involve coding, Flash isn't the ceiling. Gemini 3.1 Pro (77.1%) and GPT-5.2 are the right alternatives there.
The Value Proposition: Frontier Intelligence at Sub-Frontier Prices
At $0.50 input / $3.00 output per 1M tokens, Flash undercuts every frontier competitor by 2× or more on input; only budget-tier models like GPT-4o mini come in cheaper.
API price comparison
| Model | Input (per 1M) | Output (per 1M) | vs Flash |
|---|---|---|---|
| Gemini 3 Flash | $0.50 | $3.00 | — |
| Gemini 3 Pro / 3.1 Pro | $2.00 | $12.00 | 4× more on input |
| Claude Haiku 4.5 | $1.00 | $5.00 | 2× more on input |
| GPT-5.2 | ~$1.25 | ~$10.00 | 2.5× more on input |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 6× more on input |
| GPT-4o mini | $0.15 | $0.60 | 3× cheaper but far less capable |
Batch API: 50% discount across all tiers. Context caching: $0.05/1M cached tokens (90% off) — makes repeated long-context queries cheap. Prompts over 200K tokens are billed at 2×. Audio input billed separately at $1.00/1M tokens. Free API tier available — but quotas were cut ~90% in December 2025 (from ~250 to ~20 requests/day for Flash models).
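The pricing rules above combine in non-obvious ways, so here is a minimal cost estimator that encodes them. Two assumptions not confirmed by the text: the 2× long-context multiplier applies to the whole request once the prompt exceeds 200K tokens, and the batch discount stacks multiplicatively on top of the other rules.

```python
# Back-of-envelope cost estimator for the Gemini 3 Flash pricing rules above.
# Assumptions (flagged in the lead-in): the 2x long-context multiplier applies
# to the entire request, and the batch discount stacks multiplicatively.

INPUT_PER_M = 0.50            # USD per 1M fresh input tokens
OUTPUT_PER_M = 3.00           # USD per 1M output tokens
CACHED_PER_M = 0.05           # USD per 1M cached input tokens (90% off)
LONG_CONTEXT_THRESHOLD = 200_000
LONG_CONTEXT_MULT = 2.0
BATCH_DISCOUNT = 0.5

def estimate_cost(input_tokens, output_tokens, cached_tokens=0, batch=False):
    """Estimate USD cost of one Gemini 3 Flash request."""
    fresh = input_tokens - cached_tokens
    cost = (fresh * INPUT_PER_M
            + cached_tokens * CACHED_PER_M
            + output_tokens * OUTPUT_PER_M) / 1_000_000
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        cost *= LONG_CONTEXT_MULT
    if batch:
        cost *= BATCH_DISCOUNT
    return round(cost, 6)

# 100K-token prompt, 2K-token reply: $0.056
# Same request with 90K of the prompt cached: $0.0155 (~3.6x cheaper)
```

The caching case is where the pricing gets interesting: a 100K-token prompt with 90K cached drops from $0.056 to $0.0155 per call, which is what makes repeated long-context queries viable.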
214 tokens per second
Artificial Analysis measured Gemini 3 Flash at 214 t/s output speed — significantly faster than Claude Sonnet 4.6 (56 t/s) or GPT-5.2. In reasoning mode, time-to-first-token is ~12 seconds while the model thinks. Set thinking_level to 'minimal' when latency matters more than reasoning depth, and the response feels near-instant.
Four Thinking Levels — More Control Than Any Competitor
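A minimal sketch of selecting one of the four levels per request. The dict shape mirrors the `thinking_config` / `thinking_level` fields seen in Google's published examples, but treat the exact field names and the `gemini-3-flash-preview` model string as assumptions, not verified API surface.

```python
# Sketch: choosing a thinking_level per request. Field names follow the
# google-genai SDK's config shape as an assumption; verify against the
# current Gemini API reference before relying on them.

VALID_LEVELS = ("minimal", "low", "medium", "high")

def flash_config(thinking_level="medium", **extra):
    """Build a generate_content config dict with the requested thinking level."""
    if thinking_level not in VALID_LEVELS:
        raise ValueError(f"thinking_level must be one of {VALID_LEVELS}")
    return {"thinking_config": {"thinking_level": thinking_level}, **extra}

# Latency-sensitive chat turn: skip deep reasoning, avoid the ~12s TTFT.
fast = flash_config("minimal")

# Hard bug-fix task: let the model think.
deep = flash_config("high", temperature=0.2)
```

The resulting dict would be passed as the `config` argument to a `generate_content` call; the point of the four levels is that this one field is your cost-quality dial per request, rather than per deployment.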
91% hallucination rate — worse than Gemini 3 Pro
Artificial Analysis measured a 91% hallucination rate on their Omniscience evaluation, slightly worse than Gemini 3 Pro's 88%. When the model can't reliably answer something, it fabricates a confident wrong answer 91% of the time rather than admitting uncertainty. For comparison, Claude Haiku 4.5 sits at 26% and Claude Sonnet 4.5 at 48%. For any task where factual accuracy matters, Search grounding is the only reliable mitigation: build it into your default configuration, not as a per-request decision.
What Gemini 3 Flash Cannot Do
Six hard limits worth knowing before you build on it.
| Limitation | Detail | Workaround |
|---|---|---|
| Text-only output | No native image or audio generation | Nano Banana 2 (Gemini 3.1 Flash Image) for images; separate TTS model for audio |
| Image segmentation removed | Pixel-level object masks not supported — a regression from Gemini 2.5 Flash | Use Gemini 2.5 Flash with thinking_budget set to 0 |
| Built-in tools + function calling | Cannot combine Search grounding and custom functions in one request | Use separate requests or pick one tool type per call |
| No fine-tuning during preview | Model customization unavailable until GA | Prompt engineering only |
| Knowledge cutoff: January 2025 | Events after this date are unreliable | Search grounding for anything time-sensitive |
| Preview status | No SLA; model may change before GA | Pin to explicit model version string in production |
The image segmentation removal is a real regression. Google explicitly recommends Gemini 2.5 Flash with thinking disabled for workloads that depend on pixel-level masks.
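The version-pinning workaround from the table is worth spelling out, since a preview-to-GA swap can silently change behavior. A sketch; both model strings below are illustrative assumptions, not confirmed identifiers.

```python
# Preview-safety sketch: pin production traffic to an explicit model version
# string rather than a floating alias. Both strings are hypothetical examples,
# not confirmed Gemini model IDs.

FLOATING_ALIAS = "gemini-3-flash"              # may be redirected at GA
PINNED_MODEL = "gemini-3-flash-preview-12-17"  # hypothetical dated snapshot

def model_for(env):
    """Production uses the pinned snapshot; dev tracks the floating alias."""
    return PINNED_MODEL if env == "prod" else FLOATING_ALIAS
```

Dev environments tracking the alias surface breaking changes early, while production stays on a known snapshot until you choose to migrate.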
Flash vs Pro — Which One to Use
Same context window, same ecosystem, 4× price difference. Most workloads belong on Flash.
| Dimension | Gemini 3 Flash | Gemini 3 Pro / 3.1 Pro |
|---|---|---|
| API cost | $0.50/$3.00 per 1M | $2.00/$12.00 per 1M — 4× more |
| Output speed | 214 t/s | 138 t/s |
| SWE-bench Verified (coding) | 78% — Flash wins | 76.2% |
| GPQA Diamond (science) | 90.4% | 91.9% (3.1 Pro: 94.3%) |
| Humanity's Last Exam | 33.7% | 37.5% (3.1 Pro: 44.4%) |
| Hallucination rate | 91% — slightly worse | 88% (3.1 Pro: ~50%) |
| Abstract reasoning (ARC-AGI-2) | 33.6% | 31.1% (3.1 Pro: 77.1%) |
| Thinking levels | 4 (minimal/low/medium/high) | 3 (low/medium/high) |
| Context window | 1M tokens | 1M tokens |
| Free consumer access | Yes — default in Gemini app | Requires AI Pro ($19.99/mo) |
| Deprecation | No date set | 3 Pro: March 9, 2026 / 3.1 Pro: no date |
The coding result is the most counterintuitive finding: Flash beats Pro on SWE-bench. For pure coding workloads at scale, Flash is the smarter API choice.
Bottom line
Gemini 3 Flash is the best price-to-capability model available for most production workloads. It beats Gemini 3 Pro on coding, matches it within 1.5 points on GPQA Diamond, runs at 214 t/s, and costs a quarter of the price. The 91% hallucination rate is the one real production risk; Search grounding mitigates it. If you need frontier abstract reasoning (ARC-AGI-2), use Gemini 3.1 Pro or GPT-5.2. If you need image or audio output, pair Flash with a separate model. For everything else, Flash is the right default.
Pricing details
Prices verified February 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.
Last updated: February 27, 2026