
Google

Gemini 3 Flash

Fastest
7.8 out of 10

Released December 17, 2025, Gemini 3 Flash was distilled from Gemini 3 Pro — and then outperformed it on SWE-bench Verified (78% vs 76.2%). Google made it the default model for the Gemini app and AI Mode in Google Search within days of launch. At $0.50/$3.00 per 1M tokens with a 1M context window, 214 t/s output speed, and GPQA Diamond at 90.4%, it's the new baseline against which everything else gets measured. The main catches: a 91% hallucination rate that must be mitigated, and text-only output with no image or audio generation.

Context window: 1.0M tokens
API (blended): $1.13/1M
Consumer access: Free
Multimodal: Yes

Score Breakdown

Total: 77.5/100 → 7.8/10

Intelligence, Reliability, Speed, and Context are field-relative — scores shift as models are added. Accessibility and Trust are absolute checklists. Full methodology →

Strengths

  • +SWE-bench Verified: 78% — beats Gemini 3 Pro (76.2%) on real-world coding despite being the smaller model
  • +GPQA Diamond: 90.4% — within 1.5 points of Gemini 3 Pro at one-quarter the API cost
  • +214 t/s output speed (AA-measured) — significantly faster than Claude Sonnet 4.6 (56 t/s) or GPT-5.2
  • +AIME 2025: 95.2% without tools, 99.7% with code execution — top-tier math reasoning
  • +Four thinking levels (minimal/low/medium/high) — more granular cost-quality control than any competitor
  • +1M token context window at $0.50/$3.00 per 1M — 4× cheaper than Gemini 3 Pro
  • +Default model in the Gemini app and Google Search — available free at gemini.google.com

Weaknesses

  • -91% hallucination rate (AA-Omniscience) — fabricates confident answers even more often than Gemini 3 Pro (88%)
  • -Text-only output — no native image or audio generation
  • -Image segmentation removed — use Gemini 2.5 Flash (thinking off) for pixel-level masks
  • -ARC-AGI-2 abstract reasoning: 33.6% vs GPT-5.2's 53% — 19-point gap on the hardest reasoning tasks
  • -Free API tier quotas cut ~90% in December 2025 — from ~250 to ~20 requests/day
  • -Preview status — no production SLA; model may change before GA

Best for

  • high-volume API workloads
  • coding and software engineering
  • multimodal pipelines
  • real-time streaming
  • math reasoning
  • agentic tool use
  • budget-conscious production API use

Not ideal for

  • factual tasks without Search grounding (91% hallucination rate)
  • image or audio generation
  • image segmentation
  • frontier abstract reasoning (use Gemini 3.1 Pro or GPT-5.2)

A distilled model that beat its teacher on coding

Gemini 3 Flash was built using knowledge distillation from Gemini 3 Pro — reasoning pathways from the larger model compressed into a faster, cheaper architecture. The distillation sharpened specific paths rather than just compressing them: Flash outperforms Pro on SWE-bench Verified (78% vs 76.2%), MMMU Pro (81.2% vs 81.0%), and ARC-AGI-2 (33.6% vs 31.1%). SWE-bench tests real GitHub bug-fixing on unmodified issues. Beating the teacher model on it is not a statistical artifact.

Benchmark Performance

Numbers from Google's published model card. Where Flash beats Gemini 3 Pro is marked.

Knowledge, Science & Math

| Benchmark | Gemini 3 Flash | Gemini 3 Pro | Gemini 2.5 Pro |
|---|---|---|---|
| GPQA Diamond (PhD science) | 90.4% | 91.9% | 86.4% |
| Humanity's Last Exam (no tools) | 33.7% | 37.5% | 21.6% |
| AIME 2025 (no tools) | 95.2% | 95.0% | 88.0% |
| AIME 2025 (with code execution) | 99.7% | — | — |
| MMMU Pro (multimodal reasoning) | 81.2% ✓ | 81.0% | 68.0% |

✓ marks where Flash beats Gemini 3 Pro. AIME 2025 (no tools): Flash edges Pro by 0.2 points. GPQA and HLE: Pro holds a modest lead. These are provider-reported scores from Google's model card, not AA-measured in standard mode.

Coding & Tool Use

| Benchmark | Gemini 3 Flash | Gemini 3 Pro | Gemini 2.5 Pro |
|---|---|---|---|
| SWE-bench Verified (real-world coding) | 78.0% ✓ | 76.2% | 59.6% |
| ARC-AGI-2 (abstract reasoning) | 33.6% ✓ | 31.1% | 4.9% |
| MCP Atlas (tool orchestration) | 57.4% | — | 8.8% |
| ScreenSpot-Pro (UI navigation) | 69.1% | — | 11.4% |
| Toolathlon (multi-tool use) | 49.4% | — | 10.5% |
| Video-MMMU | 86.9% | 87.6% | 83.6% |

✓ marks where Flash beats Gemini 3 Pro. The tool-use improvements over Gemini 2.5 Pro are dramatic — MCP Atlas jumped from 8.8% to 57.4%, a 6.5× improvement. GPT-5.2 leads ARC-AGI-2 at 53%, a 19-point gap over Flash.

Where GPT-5.2 leads

ARC-AGI-2 abstract reasoning: GPT-5.2 scores 53% to Flash's 33.6% — a 19-point gap. For tasks requiring the hardest abstract problem-solving that doesn't involve coding, Flash isn't the ceiling. Gemini 3.1 Pro (77.1%) and GPT-5.2 are the right alternatives there.

The Value Proposition: Frontier Intelligence at Sub-Frontier Prices

At $0.50 input / $3.00 output per 1M tokens, the gap vs competitors is significant.

API price comparison

| Model | Input (per 1M) | Output (per 1M) | vs Flash |
|---|---|---|---|
| Gemini 3 Flash | $0.50 | $3.00 | — |
| Gemini 3 Pro / 3.1 Pro | $2.00 | $12.00 | 4× more on input |
| Claude Haiku 4.5 | $1.00 | $5.00 | 2× more on input |
| GPT-5.2 | ~$1.25 | ~$10.00 | 2.5× more on input |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 6× more on input |
| GPT-4o mini | $0.15 | $0.60 | 3× cheaper but far less capable |

Batch API: 50% discount across all tiers. Context caching: $0.05/1M cached tokens (90% off) — makes repeated long-context queries cheap. Prompts over 200K tokens are billed at 2×. Audio input billed separately at $1.00/1M tokens. Free API tier available — but quotas were cut ~90% in December 2025 (from ~250 to ~20 requests/day for Flash models).
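Those billing rules compose into a quick per-request estimate. The sketch below is an illustrative calculator built only from the rates quoted here, not an official Google tool; in particular, applying the 2× long-context surcharge to uncached input only is an assumption worth verifying against the pricing page.

```python
# Illustrative cost estimator from the Gemini 3 Flash rates quoted above:
# $0.50/$3.00 per 1M tokens, $0.05/1M cached tokens, 50% batch discount,
# and 2x billing for prompts over 200K tokens. Not an official calculator.

INPUT_RATE = 0.50 / 1_000_000    # $/token, uncached input
OUTPUT_RATE = 3.00 / 1_000_000   # $/token, includes hidden thinking tokens
CACHED_RATE = 0.05 / 1_000_000   # $/token, context-cache hits (90% off)
LONG_PROMPT = 200_000            # tokens; above this, input bills at 2x


def estimate_cost_usd(uncached_input: int, output: int,
                      cached: int = 0, batch: bool = False) -> float:
    """Estimate one request's cost in USD under the quoted pricing."""
    input_rate = INPUT_RATE
    # Assumption: the 2x long-context surcharge applies to uncached input
    # tokens only; verify how cached and output tokens are treated.
    if uncached_input + cached > LONG_PROMPT:
        input_rate *= 2
    cost = (uncached_input * input_rate
            + cached * CACHED_RATE
            + output * OUTPUT_RATE)
    return cost * 0.5 if batch else cost
```

For example, a 100K-token prompt with a 50K-token response comes to about $0.20, or $0.10 via the Batch API.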

214 tokens per second

Artificial Analysis measured Gemini 3 Flash at 214 t/s output speed — significantly faster than Claude Sonnet 4.6 (56 t/s) or GPT-5.2. In reasoning mode, time-to-first-token is ~12 seconds while the model thinks. Set thinking_level to 'minimal' when latency matters more than reasoning depth, and the response feels near-instant.
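Those two measurements (214 t/s throughput, ~12 s thinking time-to-first-token) give a rough wall-clock model. The minimal-mode TTFT below is a placeholder for "near-instant", not a measured figure.

```python
# Back-of-envelope latency model from the AA-measured numbers above:
# 214 t/s output speed and ~12 s time-to-first-token in reasoning mode.
# The 0.5 s minimal-mode TTFT is an assumed placeholder, not measured.

def estimated_latency_s(output_tokens: int, thinking: bool = True,
                        tps: float = 214.0,
                        reasoning_ttft_s: float = 12.0,
                        minimal_ttft_s: float = 0.5) -> float:
    """Rough end-to-end latency: time to first token plus streaming time."""
    ttft = reasoning_ttft_s if thinking else minimal_ttft_s
    return ttft + output_tokens / tps
```

Under these assumptions, a 1,000-token answer takes roughly 17 s with full thinking but around 5 s at minimal; throughput is identical, so the thinking time dominates.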

Four Thinking Levels — More Control Than Any Competitor

Minimal: Near-zero reasoning overhead. Fastest response, lowest cost. For data retrieval, translation, classification, and formatting tasks.
Low: Light chain-of-thought. Quick coding tasks, summaries, simple Q&A.
Medium: Balanced. Most analysis tasks, API integration, document editing, structured extraction.
High (default): Full reasoning depth; generates 15K–20K hidden thinking tokens billed at output rates. Use for complex algorithms, scientific research, and multi-step agentic planning.
Google Search grounding: The primary mitigation for the 91% hallucination rate. 5,000 free queries/month via API, then $14/1,000. Required for any task where current or specific factual information matters.
Thought signatures: Encrypted reasoning state preserved across multi-turn tool calls. Prevents drift in long autonomous agent runs; must be passed back exactly as received.
Parallel function calling: Multiple tool calls in one inference step. Supports multimodal function responses (images and PDFs in tool results), added in Gemini 3.

91% hallucination rate — worse than Gemini 3 Pro

Artificial Analysis measured a 91% hallucination rate on their Omniscience evaluation — slightly worse than Gemini 3 Pro's 88%. When the model can't reliably answer something, it fabricates a confident wrong answer 91% of the time rather than admitting uncertainty. Claude 4.5 Haiku: 26%. Claude 4.5 Sonnet: 48%. For any task where factual accuracy matters, Search grounding is the only reliable mitigation. Build it into your default configuration, not as a per-request decision.

What Gemini 3 Flash Cannot Do

Six hard limits worth knowing before you build on it.

| Limitation | Detail | Workaround |
|---|---|---|
| Text-only output | No native image or audio generation | Nano Banana 2 (Gemini 3.1 Flash Image) for images; separate TTS model for audio |
| Image segmentation removed | Pixel-level object masks not supported (a regression from Gemini 2.5 Flash) | Use Gemini 2.5 Flash with thinking_budget set to 0 |
| Built-in tools + function calling | Cannot combine Search grounding and custom functions in one request | Use separate requests or pick one tool type per call |
| No fine-tuning during preview | Model customization unavailable until GA | Prompt engineering only |
| Knowledge cutoff: January 2025 | Events after this date are unreliable | Search grounding for anything time-sensitive |
| Preview status | No SLA; model may change before GA | Pin to explicit model version string in production |

The image segmentation removal is a real regression. Google explicitly recommends Gemini 2.5 Flash with thinking disabled for workloads that depend on pixel-level masks.

Flash vs Pro — Which One to Use

Same context window, same ecosystem, 4× price difference. Most workloads belong on Flash.

| Dimension | Gemini 3 Flash | Gemini 3 Pro / 3.1 Pro |
|---|---|---|
| API cost | $0.50/$3.00 per 1M | $2.00/$12.00 per 1M (4× more) |
| Output speed | 214 t/s | 138 t/s |
| SWE-bench Verified (coding) | 78% (Flash wins) | 76.2% |
| GPQA Diamond (science) | 90.4% | 91.9% (3.1 Pro: 94.3%) |
| Humanity's Last Exam | 33.7% | 37.5% (3.1 Pro: 44.4%) |
| Hallucination rate | 91% (slightly worse) | 88% (3.1 Pro: ~50%) |
| Abstract reasoning (ARC-AGI-2) | 33.6% | 31.1% (3.1 Pro: 77.1%) |
| Thinking levels | 4 (minimal/low/medium/high) | 3 (low/medium/high) |
| Context window | 1M tokens | 1M tokens |
| Free consumer access | Yes (default in Gemini app) | Requires AI Pro ($19.99/mo) |
| Deprecation | No date set | 3 Pro: March 9, 2026; 3.1 Pro: no date |

The coding result is the most counterintuitive finding: Flash beats Pro on SWE-bench. For pure coding workloads at scale, Flash is the smarter API choice.

Bottom line

Gemini 3 Flash is the best price-to-capability model available for most production workloads. It beats Gemini 3 Pro on coding, matches it within 1.5 points on GPQA Diamond, runs at 214 t/s, and costs a quarter of the price. The 91% hallucination rate is the one real production risk — Search grounding solves it. If you need frontier abstract reasoning (ARC-AGI-2), use Gemini 3.1 Pro or GPT-5.2. If you need image or audio output, use a separate model. For everything else, Flash is the right default.

Pricing details

Subscription plans

| Plan | Includes | Price |
|---|---|---|
| Free (gemini.google.com) | Gemini Flash access via web and mobile; no hard usage cap for normal use. May be slower during peak; some advanced features locked to premium. | Free |
| Google One AI Premium | Gemini Advanced (higher capability tier), 2TB Google Drive, Gemini in Gmail/Docs/Sheets | $20/mo |

API pricing

| Provider | Details | Price (input/output per 1M) |
|---|---|---|
| Google AI Studio | Free tier: rate-limited (60 req/min). Paid: $0.50/$3.00 per 1M tokens. Prompts >200K tokens billed at 2×. All four input modalities (text/image/audio/video) included. | $0.50/$3.00 |
| Google Vertex AI | Enterprise tier. Same base pricing, SLA available. | $0.50/$3.00 |
| OpenRouter | Small markup over direct Google pricing. | $0.52/$3.10 |

Prices verified February 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.

Last updated: February 27, 2026