Multimodal AI Models
13 models that accept text + images as input — useful for document analysis, screenshot debugging, visual Q&A, and mixed-media workflows. Text-only models are excluded here. Want a ranked leaderboard? See overall rankings →
Anthropic
Anthropic's fastest and most affordable model in the Claude 4 generation, released October 2025. Claude Haiku 4.5 runs at 108.8 tokens/second (fast enough for real-time streaming) at $1/$5 per 1M tokens. Despite the low price, it scores an AA Intelligence Index of 31, placing it #13 of 60 proprietary models. It outperforms Claude Sonnet 4 on computer-use benchmarks (50.7% vs 42.2%) at roughly one-third the price. Supports extended thinking mode (thinking tokens billed at $5/1M), image input, and the full 200K context window shared across the Claude 4 generation.
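A minimal sketch of turning on extended thinking with the Anthropic Python SDK. The model ID is an assumption; check Anthropic's model list for the current identifier.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-haiku-4-5",  # assumed model ID
    max_tokens=2048,           # must exceed the thinking budget
    # Extended thinking is opt-in per request; thinking tokens are billed
    # separately from regular output tokens.
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
# With thinking enabled, the response contains a thinking block followed
# by the final text block.
print(response.content[-1].text)
```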
Anthropic
Anthropic's most powerful model, released February 4, 2026. Opus 4.6 leads the industry on enterprise expert tasks (GDPval-AA Elo 1606 — 144 points above GPT-5.2), agentic computer use (OSWorld 72.7%), and long-context retrieval (MRCR v2: 76% accuracy at 1M tokens). Its 1M-token context window is in beta; standard is 200K. The price — $5/$25 per 1M tokens — reflects the positioning: reach for it when output quality has direct business consequences.
Anthropic
Anthropic's mid-tier model from September 2025. Sonnet 4.5 was the best coding model in the world at launch — it outperformed its own flagship Opus 4.1 on most tasks at one-fifth the price, scored 77.2% on SWE-bench Verified, and demonstrated 30+ hour autonomous coding sessions. It has since been succeeded by Sonnet 4.6 (February 2026), but remains a production-ready model for teams already built on it. Same $3/$15 pricing as its successor.
Anthropic
Anthropic's mid-tier model and the practical daily-driver recommendation. Sonnet 4.6 sits just below Opus in raw intelligence but costs 40% less ($3/$15 vs Opus 4.6's $5/$25 per 1M tokens). It's the best model for writing, analysis, and long-document work for anyone who isn't running enterprise-scale inference.
Google
Google's December 2025 Flash model, distilled from Gemini 3 Pro. In a result that embarrassed the larger model, it beats Pro on SWE-bench Verified (78% vs 76.2%). At $0.50/$3.00 per 1M tokens with a 1M context window and 214 t/s output speed, it's now the default model powering the Gemini app and AI Mode in Google Search for hundreds of millions of users. The intelligence-to-cost ratio is unusual: GPQA Diamond 90.4%, near-Pro-level science reasoning, at one-quarter the API price. Two things to know before production use: a 91% hallucination rate that needs Search grounding to control, and text-only output, meaning no image or audio generation.
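Since production use depends on Search grounding, here is a minimal sketch using the google-genai Python SDK's Google Search tool. The model ID is an assumption; substitute whatever identifier Google publishes for the Flash model.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3-flash-preview",  # assumed model ID
    contents="What changed in the latest stable release of Python?",
    config=types.GenerateContentConfig(
        # Grounding with Google Search lets the model cite live results
        # instead of answering from parametric memory alone.
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```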
Google
Google's November 2025 flagship, deprecated March 9, 2026 and replaced by Gemini 3.1 Pro at the same $2/$12 per 1M token price. It led 13 of 16 major benchmarks at launch (90.8% GPQA Diamond, 87.1% τ²-bench) and pairs 138 t/s output speed with a real 1M-token context window. Two things to know before deploying: an 88% hallucination rate (AA-Omniscience) that requires Search grounding to mitigate, and verbosity that inflates real API costs 4–5× above the listed rate. If you're starting fresh, use 3.1 Pro. Already on 3 Pro? The migration is a model string change.
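A back-of-envelope sketch of why verbosity inflates effective spend even at a fixed listed rate; the token counts are illustrative assumptions, not measurements.

```python
# $12 per 1M output tokens, from this card's listed pricing.
OUTPUT_PRICE_PER_TOKEN = 12 / 1_000_000

terse_tokens = 800      # assumed output length for a terse model on one task
verbose_tokens = 3_500  # assumed 4-5x more tokens for the same task

print(f"terse:   ${terse_tokens * OUTPUT_PRICE_PER_TOKEN:.4f} per task")
print(f"verbose: ${verbose_tokens * OUTPUT_PRICE_PER_TOKEN:.4f} per task")
# Same listed rate, roughly 4.4x the per-task spend once verbosity is counted.
```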
Google
Google's reasoning-optimized flagship, released February 19, 2026, and currently the #1 ranked model on the Artificial Analysis Intelligence Index (score 57, ahead of 115+ ranked models). Gemini 3.1 Pro is a direct upgrade to Gemini 3 Pro, with the same 1M token context window and the same $2/$12 pricing, but with dramatically improved reasoning. AA independently measures it at 94.1% GPQA Diamond, 44.7% HLE, and 95.6% τ²-bench, top of field on all three. The API exposes three thinking tiers (Low / Medium / High) and a 65,536-token output window, the largest published output context of any frontier model. A dedicated custom-tools API endpoint is available for agentic pipeline use. Currently in preview; general availability is expected soon.
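A sketch of selecting a thinking tier per request, assuming the google-genai SDK exposes the tier through a thinking_level field as it does for earlier Gemini 3 models. The model ID and exact tier strings are assumptions based on this card's description.

```python
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed model ID
    contents="Prove that the square root of 2 is irrational.",
    config=types.GenerateContentConfig(
        # One of the three assumed tiers: "low", "medium", or "high".
        thinking_config=types.ThinkingConfig(thinking_level="high"),
    ),
)
print(response.text)
```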
Meta
Meta's mid-sized open-weights model and the most capable Llama 4 variant for general use. Maverick runs as a mixture-of-experts architecture with 400B total parameters but only 17B active — giving it good speed at 115 t/s while maintaining an AA Intelligence Index of 18. It's multimodal, handles 1M tokens of context, and can be self-hosted. The trade-off: it trails frontier closed models significantly on all AA-measured benchmarks.
Meta
Meta's ultra-long-context open-weights model with a 10M token window — the largest of any publicly available model. Scout is a smaller MoE variant (109B total, ~17B active) optimized for speed and context length over raw intelligence. At 135 t/s and AA Intelligence Index 14, it's the right call when you need to process enormous documents or codebases that would overflow any other model.
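Because Scout is open-weights, one common pattern is to serve it yourself behind an OpenAI-compatible endpoint (servers like vLLM expose one) and point a standard client at it. The base URL, port, and model ID below are assumptions for illustration.

```python
from openai import OpenAI

# Target a self-hosted deployment instead of a hosted API.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local serving endpoint
    api_key="not-needed-locally",
)

# Load a document far beyond what 128K-class models accept.
with open("huge_codebase_dump.txt") as f:
    giant_context = f.read()

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed model ID
    messages=[
        {"role": "user", "content": giant_context + "\n\nSummarize the build system."},
    ],
)
print(response.choices[0].message.content)
```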
OpenAI
OpenAI's small-but-smart model and the best value in the GPT-5 family. At $0.25/$2.00 per 1M tokens it runs about one-seventh the price of GPT-5.2 while delivering an AA Intelligence Index of 41, higher than Claude Haiku and Gemini Flash. The 400K context window and multimodal input make it a strong default for cost-sensitive production pipelines.
OpenAI
Released December 11, 2025 under the internal codename 'Garlic', GPT-5.2 is OpenAI's flagship reasoning model. It beats or ties human industry experts on 70.9% of GDPval knowledge work tasks, scores 100% on AIME 2025 without tools, and runs at a hallucination rate under 1% with browsing active. The 400K context window, 5-tier thinking budget, and 90% cached-input discount make it the default choice for enterprise automation and agentic pipelines.
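A minimal sketch of dialing the thinking budget per request through the OpenAI Responses API's reasoning effort setting. The model ID and the five tier names are assumptions based on this card's "5-tier thinking budget"; check OpenAI's docs for the real values.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.2",                 # assumed model ID
    reasoning={"effort": "medium"},  # one of the assumed five tiers
    input="Draft a rollout plan for migrating 40 services to a new queue.",
)
print(response.output_text)
```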
xAI
Released November 17, 2025, Grok 4.1 is xAI's most refined model — a post-training upgrade to Grok 4 that briefly claimed the #1 spot on LMArena (30-position jump) before Gemini 3 Pro and Claude Opus 4.6 overtook it. It leads every frontier model on emotional intelligence (EQ-Bench3: 1586 Elo) and creative writing. It's not trying to win on coding or reasoning — it's trying to be the most compelling AI personality, with the cheapest entry point and real-time X data.
xAI
Released February 17, 2026 as a public beta, Grok 4.2 (also called Grok 4.20) is xAI's most architecturally novel model: four specialized AI agents — Grok, Harper, Benjamin, and Lucas — debate and synthesize answers in real time on every complex query. The public beta runs on xAI's 500B-parameter 'small' foundation model; the full-size variant hasn't finished training. There are no official benchmarks yet. It arrives amid regulatory investigations across seven countries, mass founder departures, and the SpaceX acquisition — making it one of the most ambitious and controversial AI launches in 2026.
What counts as multimodal?
These models accept at least image + text input. Several also support audio, video frames, or document uploads. “Multimodal” does not mean they generate images — for image generation see Image Generators.
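A minimal image + text request, shown here with the Anthropic Python SDK; the model ID is an assumption, and every multimodal model listed above accepts an equivalent payload in its own API.

```python
import base64
import pathlib

import anthropic

client = anthropic.Anthropic()

# Encode a local screenshot as base64 for the image content block.
image_b64 = base64.standard_b64encode(
    pathlib.Path("screenshot.png").read_bytes()
).decode()

response = client.messages.create(
    model="claude-sonnet-4-6",  # assumed model ID
    max_tokens=1024,
    messages=[{
        "role": "user",
        # One message, two content blocks: the image, then the question.
        "content": [
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": image_b64,
            }},
            {"type": "text", "text": "What error is shown in this screenshot?"},
        ],
    }],
)
print(response.content[0].text)
```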