Multimodal AI Models
13 models that accept text + images as input — useful for document analysis, screenshot debugging, visual Q&A, and mixed-media workflows. Text-only models are excluded here. Want a ranked leaderboard? See overall rankings →
Anthropic
Anthropic's fastest and most affordable model in the Claude 4 generation, released October 2025. Claude Haiku 4.5 runs at 108.8 tokens/second (fast enough for real-time streaming) at $1/$5 per 1M tokens. Despite the low price, it scores an AA Intelligence Index of 31, placing it #13 of 60 proprietary models. It outperforms Claude Sonnet 4 on computer-use benchmarks (50.7% vs 42.2%) at roughly one-third the price. Supports extended thinking mode (thinking tokens billed at $5/1M), image input, and the full 200K context window shared across the Claude 4 generation.
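A minimal sketch of turning on extended thinking with the Anthropic Python SDK. The model ID is an assumption; check Anthropic's model list for the current identifier.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-haiku-4-5",  # assumed model ID
    max_tokens=2048,           # must exceed the thinking budget
    # Extended thinking is opt-in per request; thinking tokens are billed
    # separately from regular output tokens.
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
# With thinking enabled, the response contains a thinking block followed
# by the final text block.
print(response.content[-1].text)
```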
Anthropic
Anthropic's most powerful model, released February 4, 2026. Opus 4.6 leads the industry on enterprise expert tasks (GDPval-AA Elo 1606 — 144 points above GPT-5.2), agentic computer use (OSWorld 72.7%), and long-context retrieval (MRCR v2: 76% accuracy at 1M tokens). Its 1M-token context window is in beta; standard is 200K. The price — $5/$25 per 1M tokens — reflects the positioning: reach for it when output quality has direct business consequences.
Anthropic
Anthropic's mid-tier model from September 2025. Sonnet 4.5 was the best coding model in the world at launch — it outperformed its own flagship Opus 4.1 on most tasks at one-fifth the price, scored 77.2% on SWE-bench Verified, and demonstrated 30+ hour autonomous coding sessions. It has since been succeeded by Sonnet 4.6 (February 2026), but remains a production-ready model for teams already built on it. Same $3/$15 pricing as its successor.
Anthropic
Anthropic's mid-tier model and the practical daily-driver recommendation. Sonnet 4.6 sits just below Opus in raw intelligence but costs 40% less ($3/$15 vs Opus 4.6's $5/$25 per 1M tokens). It's the best model for writing, analysis, and long-document work for anyone who isn't running enterprise-scale inference.
Google
Google's December 2025 Flash model, distilled from Gemini 3 Pro. In a result that embarrassed the larger model, it beats Pro on SWE-bench Verified (78% vs 76.2%). At $0.50/$3.00 per 1M tokens with a 1M context window and 214 t/s output speed, it's now the default model powering the Gemini app and AI Mode in Google Search for hundreds of millions of users. The intelligence-to-cost ratio is unusual: GPQA Diamond 90.4%, near-Pro-level science reasoning, at one-quarter the API price. Two things to know before production use: a 91% hallucination rate that needs Search grounding to control, and text-only output, meaning no image or audio generation.
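Since production use depends on Search grounding, here is a minimal sketch using the google-genai Python SDK's Google Search tool. The model ID is an assumption; substitute whatever identifier Google publishes for the Flash model.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3-flash-preview",  # assumed model ID
    contents="What changed in the latest stable release of Python?",
    config=types.GenerateContentConfig(
        # Grounding with Google Search lets the model cite live results
        # instead of answering from parametric memory alone.
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```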
Google
Google's November 2025 flagship, deprecated March 9, 2026 and replaced by Gemini 3.1 Pro at the same $2/$12 per 1M token price. It led 13 of 16 major benchmarks at launch (90.8% GPQA Diamond, 87.1% τ²-bench) and pairs 138 t/s output speed with a real 1M-token context window. Two things to know before deploying: an 88% hallucination rate (AA-Omniscience) that requires Search grounding to mitigate, and verbosity that inflates real API costs 4–5× above the listed rate. If you're starting fresh, use 3.1 Pro. Already on 3 Pro? The migration is a model string change.
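A back-of-envelope sketch of why verbosity inflates effective spend even at a fixed listed rate; the token counts are illustrative assumptions, not measurements.

```python
# $12 per 1M output tokens, from this card's listed pricing.
OUTPUT_PRICE_PER_TOKEN = 12 / 1_000_000

terse_tokens = 800      # assumed output length for a terse model on one task
verbose_tokens = 3_500  # assumed 4-5x more tokens for the same task

print(f"terse:   ${terse_tokens * OUTPUT_PRICE_PER_TOKEN:.4f} per task")
print(f"verbose: ${verbose_tokens * OUTPUT_PRICE_PER_TOKEN:.4f} per task")
# Same listed rate, roughly 4.4x the per-task spend once verbosity is counted.
```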
Google
Google's reasoning-optimized flagship, released February 19, 2026, and currently the #1 ranked model on the Artificial Analysis Intelligence Index (score 57, ahead of 115+ ranked models). Gemini 3.1 Pro is a direct upgrade to Gemini 3 Pro, with the same 1M token context window and the same $2/$12 pricing, but with dramatically improved reasoning. AA independently measures it at 94.1% GPQA Diamond, 44.7% HLE, and 95.6% τ²-bench, top of field on all three. The API exposes three thinking tiers (Low / Medium / High) and a 65,536-token output window, the largest published output context of any frontier model. A dedicated custom-tools API endpoint is available for agentic pipeline use. Currently in preview; general availability is expected soon.
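A sketch of selecting a thinking tier per request, assuming the google-genai SDK exposes the tier through a thinking_level field as it does for earlier Gemini 3 models. The model ID and exact tier strings are assumptions based on this card's description.

```python
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed model ID
    contents="Prove that the square root of 2 is irrational.",
    config=types.GenerateContentConfig(
        # One of the three assumed tiers: "low", "medium", or "high".
        thinking_config=types.ThinkingConfig(thinking_level="high"),
    ),
)
print(response.text)
```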
Meta
Meta's mid-sized open-weights model and the most capable Llama 4 variant for general use. Maverick runs as a mixture-of-experts architecture with 400B total parameters but only 17B active — giving it good speed at 115 t/s while maintaining an AA Intelligence Index of 18. It's multimodal, handles 1M tokens of context, and can be self-hosted. The trade-off: it trails frontier closed models significantly on all AA-measured benchmarks.
Meta
Meta's ultra-long-context open-weights model with a 10M token window — the largest of any publicly available model. Scout is a smaller MoE variant (109B total, ~17B active) optimized for speed and context length over raw intelligence. At 135 t/s and AA Intelligence Index 14, it's the right call when you need to process enormous documents or codebases that would overflow any other model.
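Because Scout is open-weights, one common pattern is to serve it yourself behind an OpenAI-compatible endpoint (servers like vLLM expose one) and point a standard client at it. The base URL, port, and model ID below are assumptions for illustration.

```python
from openai import OpenAI

# Target a self-hosted deployment instead of a hosted API.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local serving endpoint
    api_key="not-needed-locally",
)

# Load a document far beyond what 128K-class models accept.
with open("huge_codebase_dump.txt") as f:
    giant_context = f.read()

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed model ID
    messages=[
        {"role": "user", "content": giant_context + "\n\nSummarize the build system."},
    ],
)
print(response.choices[0].message.content)
```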
OpenAI
OpenAI's small-but-smart model and the best value in the GPT-5 family. At $0.25/$2.00 per 1M tokens it runs about one-seventh the price of GPT-5.2 while delivering an AA Intelligence Index of 41, higher than Claude Haiku and Gemini Flash. The 400K context window and multimodal input make it a strong default for cost-sensitive production pipelines.
OpenAI
Released December 11, 2025 under the internal codename 'Garlic', GPT-5.2 is OpenAI's flagship reasoning model. It beats or ties human industry experts on 70.9% of GDPval knowledge work tasks, scores 100% on AIME 2025 without tools, and runs at a hallucination rate under 1% with browsing active. The 400K context window, 5-tier thinking budget, and 90% cached-input discount make it the default choice for enterprise automation and agentic pipelines.
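A minimal sketch of dialing the thinking budget per request through the OpenAI Responses API's reasoning effort setting. The model ID and the five tier names are assumptions based on this card's "5-tier thinking budget"; check OpenAI's docs for the real values.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.2",                 # assumed model ID
    reasoning={"effort": "medium"},  # one of the assumed five tiers
    input="Draft a rollout plan for migrating 40 services to a new queue.",
)
print(response.output_text)
```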
xAI
Released November 17, 2025, Grok 4.1 is xAI's most refined model — a post-training upgrade to Grok 4 that briefly claimed the #1 spot on LMArena (30-position jump) before Gemini 3 Pro and Claude Opus 4.6 overtook it. It leads every frontier model on emotional intelligence (EQ-Bench3: 1586 Elo) and creative writing. It's not trying to win on coding or reasoning — it's trying to be the most compelling AI personality, with the cheapest entry point and real-time X data.
xAI
Released February 17, 2026 as a public beta, Grok 4.2 (also called Grok 4.20) is xAI's most architecturally novel model: four specialized AI agents — Grok, Harper, Benjamin, and Lucas — debate and synthesize answers in real time on every complex query. The public beta runs on xAI's 500B-parameter 'small' foundation model; the full-size variant hasn't finished training. There are no official benchmarks yet. It arrives amid regulatory investigations across seven countries, mass founder departures, and the SpaceX acquisition — making it one of the most ambitious and controversial AI launches in 2026.
What counts as multimodal?
These models accept at least image + text input. Several also support audio, video frames, or document uploads. “Multimodal” does not mean they generate images — for image generation see Image Generators.
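A minimal image + text request, shown here with the Anthropic Python SDK; the model ID is an assumption, and every multimodal model listed above accepts an equivalent payload in its own API.

```python
import base64
import pathlib

import anthropic

client = anthropic.Anthropic()

# Encode a local screenshot as base64 for the image content block.
image_b64 = base64.standard_b64encode(
    pathlib.Path("screenshot.png").read_bytes()
).decode()

response = client.messages.create(
    model="claude-sonnet-4-6",  # assumed model ID
    max_tokens=1024,
    messages=[{
        "role": "user",
        # One message, two content blocks: the image, then the question.
        "content": [
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": image_b64,
            }},
            {"type": "text", "text": "What error is shown in this screenshot?"},
        ],
    }],
)
print(response.content[0].text)
```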