Best Multimodal AI Models
Models that accept both text and images as input, ranked by overall quality score. Useful for document analysis, screenshot debugging, visual Q&A, and mixed-media workflows. Text-only models are excluded. Rankings use the same composite score as overall rankings — price not included.
Top multimodal model
Gemini 3.1 Pro
Google · Quality 8.7/10 · AA Index 57.18
Google's reasoning-optimized flagship, released February 19, 2026, and currently the #1 ranked model on the Artificial Analysis Intelligence Index, scoring 57 in a field of 115+ models. Gemini 3.1 Pro is a direct upgrade to Gemini 3 Pro — same 1M token context window and same $2/$12 pricing — but with dramatically improved reasoning. AA independently measures it at 94.1% GPQA Diamond, 44.7% HLE, and 95.6% τ²-bench — top of the field on all three. The API exposes three thinking tiers (Low / Medium / High) and a 65,536-token output window, the largest published output context of any frontier model. A dedicated custom-tools API endpoint is available for agentic pipeline use. Currently in preview; general availability is expected soon.
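For API users, the thinking tiers and the large output window are the levers that matter. A minimal sketch in Python, assuming the google-genai SDK shape; the preview model string and the thinking-tier parameter name are assumptions, so verify both against Google's current API reference:

```python
# Sketch: multimodal input plus a thinking tier on Gemini 3.1 Pro.
# Assumptions: the preview model string and the thinking_level parameter.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

with open("invoice.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed model string
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Extract every line item from this invoice as JSON.",
    ],
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="high"),  # Low / Medium / High
        max_output_tokens=65536,  # the full published output window
    ),
)
print(response.text)
```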
Gemini 3 Pro
Google's November 2025 flagship — deprecated March 9, 2026, replaced by Gemini 3.1 Pro at the same $2/$12 per 1M token price. It led 13 of 16 major benchmarks at launch: 90.8% GPQA Diamond, 87.1% τ²-bench, 138 t/s output speed, and a real 1M-token context window. Two things to know before deploying: an 88% hallucination rate (AA-Omniscience) that requires Search grounding to mitigate, and verbosity that inflates real API costs 4–5× above the listed rate. If you're starting fresh, use 3.1 Pro. Already on 3 Pro? The migration is a model string change.
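The Search-grounding mitigation is a one-line config change, and the migration really is just the model string. A sketch, again assuming the google-genai SDK; the model string is an assumption:

```python
# Sketch: Google Search grounding as a hallucination mitigation.
from google import genai
from google.genai import types

client = genai.Client()
response = client.models.generate_content(
    model="gemini-3-pro-preview",  # migrating to 3.1 means changing this string
    contents="Summarize this week's changes to the EU AI Act guidance.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```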
Gemini 3 Flash
Google's December 2025 Flash model — distilled from Gemini 3 Pro, and in a result that embarrassed the larger model, it beats Pro on SWE-bench Verified (78% vs 76.2%). At $0.50/$3.00 per 1M tokens with a 1M context window and 214 t/s output speed, it's now the default model powering the Gemini app and AI Mode in Google Search for hundreds of millions of users. The intelligence-to-cost ratio is unusual: GPQA Diamond 90.4%, near-Pro-level science reasoning, at one-quarter the API price. Two things to know before production use: a 91% hallucination rate that needs Search grounding to control, and text-only output — no image or audio generation.
GPT-5.2
Released December 11, 2025 under the internal codename 'Garlic', GPT-5.2 is OpenAI's flagship reasoning model. It beats or ties human industry experts on 70.9% of GDPval knowledge work tasks, scores 100% on AIME 2025 without tools, and runs at a hallucination rate under 1% with browsing active. The 400K context window, 5-tier thinking budget, and 90% cached-input discount make it the default choice for enterprise automation and agentic pipelines.
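The thinking budget is set per request. A minimal sketch against the OpenAI Responses API; the "gpt-5.2" model string and the exact effort labels are assumptions taken from the description above:

```python
# Sketch: per-request thinking budget on GPT-5.2 via the Responses API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.responses.create(
    model="gpt-5.2",               # assumed model string
    reasoning={"effort": "high"},  # one of the tiered thinking budgets
    input="Reconcile these two ledgers and list every discrepancy: ...",
)
print(response.output_text)
# Repeated prompt prefixes are cached server-side; cached input tokens
# bill at the discounted rate automatically.
```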
Claude Sonnet 4.6
Anthropic's mid-tier model and the practical daily-driver recommendation. Sonnet 4.6 sits just below Opus in raw intelligence but costs 80% less. It's the best model for writing, analysis, and long-document work for anyone who isn't running enterprise-scale inference.
Claude Opus 4.6
Anthropic's most powerful model, released February 4, 2026. Opus 4.6 leads the industry on enterprise expert tasks (GDPval-AA Elo 1606 — 144 points above GPT-5.2), agentic computer use (OSWorld 72.7%), and long-context retrieval (MRCR v2: 76% accuracy at 1M tokens). Its 1M-token context window is in beta; standard is 200K. The price — $5/$25 per 1M tokens — reflects the positioning: reach for it when output quality has direct business consequences.
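The beta context window has to be requested explicitly. A sketch using the Anthropic Python SDK; the model string and the beta flag are assumptions, the flag modeled on Anthropic's earlier 1M-context betas:

```python
# Sketch: opting into the 1M-token context beta on Opus 4.6.
# Default context is 200K; the beta flag below is an assumption.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
long_document = open("contract.txt").read()

message = client.beta.messages.create(
    model="claude-opus-4-6",          # assumed model string
    max_tokens=4096,
    betas=["context-1m-2025-08-07"],  # assumed beta flag; check the docs
    messages=[{
        "role": "user",
        "content": long_document + "\n\nSummarize the obligations by party.",
    }],
)
print(message.content[0].text)
```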
GPT-5 Mini
OpenAI's small-but-smart model and the best value in the GPT-5 family. At $0.25/$2.00 per 1M tokens it costs one-seventh as much as GPT-5.2 while delivering an AA Intelligence Index of 41 — higher than Claude Haiku and Gemini Flash. The 400K context window and multimodal input make it a strong default for cost-sensitive production pipelines.
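That pricing is easy to turn into a budget. A back-of-envelope sketch with a hypothetical traffic profile:

```python
# Back-of-envelope cost at the quoted $0.25 / $2.00 per 1M tokens.
# The traffic profile below is hypothetical.
INPUT_PER_M, OUTPUT_PER_M = 0.25, 2.00

requests_per_day = 50_000
tokens_in, tokens_out = 3_000, 500  # per request, assumed

per_request = tokens_in / 1e6 * INPUT_PER_M + tokens_out / 1e6 * OUTPUT_PER_M
daily = requests_per_day * per_request
print(f"${daily:,.2f}/day, ${daily * 30:,.2f}/month")
# -> $87.50/day, $2,625.00/month
```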
Claude Sonnet 4.5
Anthropic's mid-tier model from September 2025. Sonnet 4.5 was the best coding model in the world at launch — it outperformed its own flagship Opus 4.1 on most tasks at one-fifth the price, scored 77.2% on SWE-bench Verified, and demonstrated 30+ hour autonomous coding sessions. It has since been succeeded by Sonnet 4.6 (February 2026), but remains a production-ready model for teams already built on it. Same $3/$15 pricing as its successor.
Claude Haiku 4.5
Anthropic's fastest and most affordable model in the Claude 4 generation, released October 2025. Claude Haiku 4.5 runs at 108.8 tokens/second — fast enough for real-time streaming — at $1/$5 per 1M tokens. Despite the low price, it scores an AA Intelligence Index of 31, placing it #13 of 60 proprietary models. It outperforms Claude Sonnet 4 on computer-use benchmarks (50.7% vs 42.2%) at one-third the cost. Supports extended thinking mode (billed at $5/1M for thinking tokens), image input, and the full 200K context window shared across the Claude 4 generation.
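Extended thinking is opt-in per request, with a separate token budget. A sketch via the Anthropic Python SDK; the model string is an assumption:

```python
# Sketch: extended thinking on Haiku 4.5, with a capped thinking budget.
# Thinking tokens bill separately (at $5/1M per the note above).
import anthropic

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-haiku-4-5",  # assumed model string
    max_tokens=2048,           # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=[{"role": "user", "content": "Plan the clicks to export this report as CSV."}],
)
# With thinking on, the reply interleaves thinking blocks and text blocks.
for block in message.content:
    if block.type == "text":
        print(block.text)
```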
Grok 4.2
Released February 17, 2026 as a public beta, Grok 4.2 (also called Grok 4.20) is xAI's most architecturally novel model: four specialized AI agents — Grok, Harper, Benjamin, and Lucas — debate and synthesize answers in real time on every complex query. The public beta runs on xAI's 500B-parameter 'small' foundation model; the full-size variant hasn't finished training. There are no official benchmarks yet. It arrives amid regulatory investigations across seven countries, mass founder departures, and the SpaceX acquisition — making it one of the most ambitious and controversial AI launches in 2026.
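Since there are no official benchmarks or API docs yet, the following is only an illustrative toy of the debate-and-synthesize pattern the description claims, not xAI's implementation; ask() is a hypothetical stand-in for a model call, and only the four agent names come from the text above:

```python
# Illustrative toy of a debate-and-synthesize loop. NOT xAI's code.
def ask(agent: str, prompt: str) -> str:
    # Hypothetical: replace with a real model client, one persona per agent.
    return f"[{agent}'s draft answer]"

def debate(question: str, rounds: int = 2) -> str:
    agents = ["Grok", "Harper", "Benjamin", "Lucas"]
    answers = {a: ask(a, question) for a in agents}
    for _ in range(rounds):
        transcript = "\n\n".join(f"{a}: {t}" for a, t in answers.items())
        answers = {
            a: ask(a, f"{question}\n\nDebate so far:\n{transcript}\n\n"
                      "Critique the other drafts and revise your answer.")
            for a in agents
        }
    transcript = "\n\n".join(f"{a}: {t}" for a, t in answers.items())
    return ask("Grok", f"Synthesize one final answer from:\n{transcript}")

print(debate("Is a 500B 'small' foundation model big enough for this?"))
```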
Grok 4.1
Released November 17, 2025, Grok 4.1 is xAI's most refined model — a post-training upgrade to Grok 4 that briefly claimed the #1 spot on LMArena (30-position jump) before Gemini 3 Pro and Claude Opus 4.6 overtook it. It leads every frontier model on emotional intelligence (EQ-Bench3: 1586 Elo) and creative writing. It's not trying to win on coding or reasoning — it's trying to be the most compelling AI personality, with the cheapest entry point and real-time X data.
Llama 4 Maverick
Meta's mid-sized open-weights model and the most capable Llama 4 variant for general use. Maverick runs as a mixture-of-experts architecture with 400B total parameters but only 17B active — giving it good speed at 115 t/s while maintaining an AA Intelligence Index of 18. It's multimodal, handles 1M tokens of context, and can be self-hosted. The trade-off: it trails frontier closed models significantly on all AA-measured benchmarks.
Llama 4 Scout
Meta's ultra-long-context open-weights model with a 10M token window — the largest of any publicly available model. Scout is a smaller MoE variant (109B total, ~17B active) optimized for speed and context length over raw intelligence. At 135 t/s and AA Intelligence Index 14, it's the right call when you need to process enormous documents or codebases that would overflow any other model.
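The MoE sizing for both Llama 4 variants is worth sanity-checking before committing to self-hosting. A back-of-envelope sketch using the parameter counts quoted above; the 8-bit weight storage is a deployment assumption, not a Meta spec:

```python
# Back-of-envelope MoE sizing for the two Llama 4 variants.
# 8-bit weights (1 byte per parameter) is a deployment assumption.
models = {
    "Llama 4 Maverick": (400e9, 17e9),  # (total, active) parameters
    "Llama 4 Scout": (109e9, 17e9),
}
for name, (total, active) in models.items():
    print(
        f"{name}: {active / total:.1%} of weights active per token, "
        f"~{total / 1e9:.0f} GB just to hold weights at 8-bit"
    )
# Maverick: 4.2% active, ~400 GB; Scout: 15.6% active, ~109 GB.
```

The point of the math: active parameters set per-token compute (hence the speed), but total parameters set the memory bill, so self-hosting either model still means holding all 400 GB or 109 GB of weights.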
What counts as multimodal?
These models accept at least image + text input. Several also support audio, video frames, or document uploads. “Multimodal” does not mean they generate images — for image generation see Image Generators. Want to browse without rank order? Browse all multimodal models →
Last updated February 2026. Intelligence scores from Artificial Analysis. See how we rate for full methodology.