Best Multimodal AI Models
Models that accept both text and images as input, ranked by overall quality score. Useful for document analysis, screenshot debugging, visual Q&A, and mixed-media workflows. Text-only models are excluded. Rankings use the same composite score as overall rankings — price not included.
Top multimodal model
Gemini 3.1 Pro
Google · Quality 8.7/10 · AA Index 57.18
Google's reasoning-optimized flagship, released February 19, 2026, and currently the #1 ranked model on the Artificial Analysis Intelligence Index, scoring 57 in a field of 115+ models. Gemini 3.1 Pro is a direct upgrade to Gemini 3 Pro — same 1M token context window and same $2/$12 pricing — but with dramatically improved reasoning. AA independently measures it at 94.1% GPQA Diamond, 44.7% HLE, and 95.6% τ²-bench — top of the field on all three. The API exposes three thinking tiers (Low / Medium / High) and a 65,536-token output window, the largest published output context of any frontier model. A dedicated custom-tools API endpoint is available for agentic pipeline use. Currently in preview; general availability is expected soon.
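For API users, the thinking tiers and the large output window are the levers that matter. A minimal sketch in Python, assuming the google-genai SDK shape; the preview model string and the thinking-tier parameter name are assumptions, so verify both against Google's current API reference:

```python
# Sketch: multimodal input plus a thinking tier on Gemini 3.1 Pro.
# Assumptions: the preview model string and the thinking_level parameter.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

with open("invoice.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed model string
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Extract every line item from this invoice as JSON.",
    ],
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="high"),  # Low / Medium / High
        max_output_tokens=65536,  # the full published output window
    ),
)
print(response.text)
```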
Gemini 3 Pro
Google's November 2025 flagship — deprecated March 9, 2026, replaced by Gemini 3.1 Pro at the same $2/$12 per 1M token price. It led 13 of 16 major benchmarks at launch: 90.8% GPQA Diamond, 87.1% τ²-bench, 138 t/s output speed, and a real 1M-token context window. Two things to know before deploying: an 88% hallucination rate (AA-Omniscience) that requires Search grounding to mitigate, and verbosity that inflates real API costs 4–5× above the listed rate. If you're starting fresh, use 3.1 Pro. Already on 3 Pro? The migration is a model string change.
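The Search-grounding mitigation is a one-line config change, and the migration really is just the model string. A sketch, again assuming the google-genai SDK; the model string is an assumption:

```python
# Sketch: Google Search grounding as a hallucination mitigation.
from google import genai
from google.genai import types

client = genai.Client()
response = client.models.generate_content(
    model="gemini-3-pro-preview",  # migrating to 3.1 means changing this string
    contents="Summarize this week's changes to the EU AI Act guidance.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```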
Gemini 3 Flash
Google's December 2025 Flash model — distilled from Gemini 3 Pro, and in a result that embarrassed the larger model, it beats Pro on SWE-bench Verified (78% vs 76.2%). At $0.50/$3.00 per 1M tokens with a 1M context window and 214 t/s output speed, it's now the default model powering the Gemini app and AI Mode in Google Search for hundreds of millions of users. The intelligence-to-cost ratio is unusual: GPQA Diamond 90.4%, near-Pro-level science reasoning, at one-quarter the API price. Two things to know before production use: a 91% hallucination rate that needs Search grounding to control, and text-only output — no image or audio generation.
GPT-5.2
Released December 11, 2025 under the internal codename 'Garlic', GPT-5.2 is OpenAI's flagship reasoning model. It beats or ties human industry experts on 70.9% of GDPval knowledge work tasks, scores 100% on AIME 2025 without tools, and runs at a hallucination rate under 1% with browsing active. The 400K context window, 5-tier thinking budget, and 90% cached-input discount make it the default choice for enterprise automation and agentic pipelines.
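The thinking budget is set per request. A minimal sketch against the OpenAI Responses API; the "gpt-5.2" model string and the exact effort labels are assumptions taken from the description above:

```python
# Sketch: per-request thinking budget on GPT-5.2 via the Responses API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.responses.create(
    model="gpt-5.2",               # assumed model string
    reasoning={"effort": "high"},  # one of the tiered thinking budgets
    input="Reconcile these two ledgers and list every discrepancy: ...",
)
print(response.output_text)
# Repeated prompt prefixes are cached server-side; cached input tokens
# bill at the discounted rate automatically.
```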
Claude Sonnet 4.6
Anthropic's mid-tier model and the practical daily-driver recommendation. Sonnet 4.6 sits just below Opus in raw intelligence but costs 80% less. It's the best model for writing, analysis, and long-document work for anyone who isn't running enterprise-scale inference.
Claude Opus 4.6
Anthropic's most powerful model, released February 4, 2026. Opus 4.6 leads the industry on enterprise expert tasks (GDPval-AA Elo 1606 — 144 points above GPT-5.2), agentic computer use (OSWorld 72.7%), and long-context retrieval (MRCR v2: 76% accuracy at 1M tokens). Its 1M-token context window is in beta; standard is 200K. The price — $5/$25 per 1M tokens — reflects the positioning: reach for it when output quality has direct business consequences.
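The beta context window has to be requested explicitly. A sketch using the Anthropic Python SDK; the model string and the beta flag are assumptions, the flag modeled on Anthropic's earlier 1M-context betas:

```python
# Sketch: opting into the 1M-token context beta on Opus 4.6.
# Default context is 200K; the beta flag below is an assumption.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
long_document = open("contract.txt").read()

message = client.beta.messages.create(
    model="claude-opus-4-6",          # assumed model string
    max_tokens=4096,
    betas=["context-1m-2025-08-07"],  # assumed beta flag; check the docs
    messages=[{
        "role": "user",
        "content": long_document + "\n\nSummarize the obligations by party.",
    }],
)
print(message.content[0].text)
```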
GPT-5 Mini
OpenAI's small-but-smart model and the best value in the GPT-5 family. At $0.25/$2.00 per 1M tokens it costs one-seventh as much as GPT-5.2 while delivering an AA Intelligence Index of 41 — higher than Claude Haiku and Gemini Flash. The 400K context window and multimodal input make it a strong default for cost-sensitive production pipelines.
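That pricing is easy to turn into a budget. A back-of-envelope sketch with a hypothetical traffic profile:

```python
# Back-of-envelope cost at the quoted $0.25 / $2.00 per 1M tokens.
# The traffic profile below is hypothetical.
INPUT_PER_M, OUTPUT_PER_M = 0.25, 2.00

requests_per_day = 50_000
tokens_in, tokens_out = 3_000, 500  # per request, assumed

per_request = tokens_in / 1e6 * INPUT_PER_M + tokens_out / 1e6 * OUTPUT_PER_M
daily = requests_per_day * per_request
print(f"${daily:,.2f}/day, ${daily * 30:,.2f}/month")
# -> $87.50/day, $2,625.00/month
```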
Claude Sonnet 4.5
Anthropic's mid-tier model from September 2025. Sonnet 4.5 was the best coding model in the world at launch — it outperformed its own flagship Opus 4.1 on most tasks at one-fifth the price, scored 77.2% on SWE-bench Verified, and demonstrated 30+ hour autonomous coding sessions. It has since been succeeded by Sonnet 4.6 (February 2026), but remains a production-ready model for teams already built on it. Same $3/$15 pricing as its successor.
Claude Haiku 4.5
Anthropic's fastest and most affordable model in the Claude 4 generation, released October 2025. Claude Haiku 4.5 runs at 108.8 tokens/second — fast enough for real-time streaming — at $1/$5 per 1M tokens. Despite the low price, it scores an AA Intelligence Index of 31, placing it #13 of 60 proprietary models. It outperforms Claude Sonnet 4 on computer-use benchmarks (50.7% vs 42.2%) at one-third the cost. Supports extended thinking mode (billed at $5/1M for thinking tokens), image input, and the full 200K context window shared across the Claude 4 generation.
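Extended thinking is opt-in per request, with a separate token budget. A sketch via the Anthropic Python SDK; the model string is an assumption:

```python
# Sketch: extended thinking on Haiku 4.5, with a capped thinking budget.
# Thinking tokens bill separately (at $5/1M per the note above).
import anthropic

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-haiku-4-5",  # assumed model string
    max_tokens=2048,           # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=[{"role": "user", "content": "Plan the clicks to export this report as CSV."}],
)
# With thinking on, the reply interleaves thinking blocks and text blocks.
for block in message.content:
    if block.type == "text":
        print(block.text)
```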
Grok 4.2
Released February 17, 2026 as a public beta, Grok 4.2 (also called Grok 4.20) is xAI's most architecturally novel model: four specialized AI agents — Grok, Harper, Benjamin, and Lucas — debate and synthesize answers in real time on every complex query. The public beta runs on xAI's 500B-parameter 'small' foundation model; the full-size variant hasn't finished training. There are no official benchmarks yet. It arrives amid regulatory investigations across seven countries, mass founder departures, and the SpaceX acquisition — making it one of the most ambitious and controversial AI launches in 2026.
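Since there are no official benchmarks or API docs yet, the following is only an illustrative toy of the debate-and-synthesize pattern the description claims, not xAI's implementation; ask() is a hypothetical stand-in for a model call, and only the four agent names come from the text above:

```python
# Illustrative toy of a debate-and-synthesize loop. NOT xAI's code.
def ask(agent: str, prompt: str) -> str:
    # Hypothetical: replace with a real model client, one persona per agent.
    return f"[{agent}'s draft answer]"

def debate(question: str, rounds: int = 2) -> str:
    agents = ["Grok", "Harper", "Benjamin", "Lucas"]
    answers = {a: ask(a, question) for a in agents}
    for _ in range(rounds):
        transcript = "\n\n".join(f"{a}: {t}" for a, t in answers.items())
        answers = {
            a: ask(a, f"{question}\n\nDebate so far:\n{transcript}\n\n"
                      "Critique the other drafts and revise your answer.")
            for a in agents
        }
    transcript = "\n\n".join(f"{a}: {t}" for a, t in answers.items())
    return ask("Grok", f"Synthesize one final answer from:\n{transcript}")

print(debate("Is a 500B 'small' foundation model big enough for this?"))
```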
Grok 4.1
Released November 17, 2025, Grok 4.1 is xAI's most refined model — a post-training upgrade to Grok 4 that briefly claimed the #1 spot on LMArena (30-position jump) before Gemini 3 Pro and Claude Opus 4.6 overtook it. It leads every frontier model on emotional intelligence (EQ-Bench3: 1586 Elo) and creative writing. It's not trying to win on coding or reasoning — it's trying to be the most compelling AI personality, with the cheapest entry point and real-time X data.
Llama 4 Maverick
Meta's mid-sized open-weights model and the most capable Llama 4 variant for general use. Maverick runs as a mixture-of-experts architecture with 400B total parameters but only 17B active — giving it good speed at 115 t/s while maintaining an AA Intelligence Index of 18. It's multimodal, handles 1M tokens of context, and can be self-hosted. The trade-off: it trails frontier closed models significantly on all AA-measured benchmarks.
Llama 4 Scout
Meta's ultra-long-context open-weights model with a 10M token window — the largest of any publicly available model. Scout is a smaller MoE variant (109B total, ~17B active) optimized for speed and context length over raw intelligence. At 135 t/s and AA Intelligence Index 14, it's the right call when you need to process enormous documents or codebases that would overflow any other model.
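The MoE sizing for both Llama 4 variants is worth sanity-checking before committing to self-hosting. A back-of-envelope sketch using the parameter counts quoted above; the 8-bit weight storage is a deployment assumption, not a Meta spec:

```python
# Back-of-envelope MoE sizing for the two Llama 4 variants.
# 8-bit weights (1 byte per parameter) is a deployment assumption.
models = {
    "Llama 4 Maverick": (400e9, 17e9),  # (total, active) parameters
    "Llama 4 Scout": (109e9, 17e9),
}
for name, (total, active) in models.items():
    print(
        f"{name}: {active / total:.1%} of weights active per token, "
        f"~{total / 1e9:.0f} GB just to hold weights at 8-bit"
    )
# Maverick: 4.2% active, ~400 GB; Scout: 15.6% active, ~109 GB.
```

The point of the math: active parameters set per-token compute (hence the speed), but total parameters set the memory bill, so self-hosting either model still means holding all 400 GB or 109 GB of weights.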
What counts as multimodal?
These models accept at least image + text input. Several also support audio, video frames, or document uploads. “Multimodal” does not mean they generate images — for image generation see Image Generators. Want to browse without rank order? Browse all multimodal models →
Last updated February 2026. Intelligence scores from Artificial Analysis. See how we rate for full methodology.