All LLM Reviews

20 large language models reviewed. Click any model to read the full breakdown. Want a ranked list? See overall rankings →

Provider

Features

20 models

Qwen 3 235BOpen Source

Alibaba

Qwen 3 235B is Alibaba's largest open-source language model, released April 2025 under the MIT license. It uses a Mixture-of-Experts architecture with 235 billion total parameters and 22 billion active per forward pass. With a 262K context window and pricing as low as $0.20/$0.88 per 1M tokens on Alibaba Cloud, it's one of the most capable open models available at scale. Qwen 3 235B supports both a standard instruct mode and a Thinking mode for step-by-step reasoning. AA Intelligence Index of 17 reflects its April 2025 release date — newer open models have since surpassed it. A newer Qwen3 235B 2507 version has been released for those wanting the latest.

262K ctxFree tierOpen weights

Read full review →

Claude Haiku 4.5Best Value

Anthropic

Anthropic's fastest and most affordable model in the Claude 4 generation, released October 2025. Claude Haiku 4.5 runs at 108.8 tokens/second — fast enough for real-time streaming — at $1/$5 per 1M tokens. Despite the low price, it scores an AA Intelligence Index of 31, placing it #13 of 60 proprietary models. It outperforms Claude Sonnet 4 on computer-use benchmarks (50.7% vs 42.2%) while costing three times less. Supports extended thinking mode (billed at $5/1M for thinking tokens), image input, and the full 200K context window shared across the Claude 4 generation.

200K ctxFree tierMultimodal

Read full review →

Claude Opus 4.6Top Pick

Anthropic

Anthropic's most powerful model, released February 4, 2026. Opus 4.6 leads the industry on enterprise expert tasks (GDPval-AA Elo 1606 — 144 points above GPT-5.2), agentic computer use (OSWorld 72.7%), and long-context retrieval (MRCR v2: 76% accuracy at 1M tokens). Its 1M-token context window is in beta; standard is 200K. The price — $5/$25 per 1M tokens — reflects the positioning: reach for it when output quality has direct business consequences.

200K ctxMultimodal

Read full review →

Claude Sonnet 4.5

Anthropic

Anthropic's mid-tier model from September 2025. Sonnet 4.5 was the best coding model in the world at launch — it outperformed its own flagship Opus 4.1 on most tasks at one-fifth the price, scored 77.2% on SWE-bench Verified, and demonstrated 30+ hour autonomous coding sessions. It has since been succeeded by Sonnet 4.6 (February 2026), but remains a production-ready model for teams already built on it. Same $3/$15 pricing as its successor.

200K ctxFree tierMultimodal

Read full review →

Claude Sonnet 4.6

Anthropic

Anthropic's mid-tier model and the practical daily-driver recommendation. Sonnet 4.6 sits just below Opus in raw intelligence but costs 80% less. It's the best model for writing, analysis, and long-document work for anyone who isn't running enterprise-scale inference.

200K ctxFree tierMultimodal

Read full review →

DeepSeek V3.2Open Source

DeepSeek

DeepSeek's open-weights frontier model and one of the most cost-effective APIs available. V3.2 punches far above its price — at $0.28/$1.10 per 1M tokens it costs roughly 20× less than Claude Sonnet while delivering an AA Intelligence Index of 32. Strong on coding and reasoning tasks, but hosted in China with the privacy implications that brings.

128K ctxFree tierOpen weights

Read full review →

Gemini 3 FlashFastest

Google

Google's December 2025 Flash model — distilled from Gemini 3 Pro, and in a result that embarrassed the larger model, it beats Pro on SWE-bench Verified (78% vs 76.2%). At $0.50/$3.00 per 1M tokens with a 1M context window and 214 t/s output speed, it's now the default model powering the Gemini app and AI Mode in Google Search for hundreds of millions of users. The intelligence-to-cost ratio is unusual: GPQA Diamond 90.4%, near-Pro level science reasoning, at one-quarter the API price. One thing to know before production use: a 91% hallucination rate that needs Search grounding to control, and text-only output — no image or audio generation.

1M ctxFree tierMultimodal

Read full review →

Google

Google's November 2025 flagship — deprecated March 9, 2026, replaced by Gemini 3.1 Pro at the same $2/$12 per 1M token price. It led 13 of 16 major benchmarks at launch: 90.8% GPQA Diamond, 87.1% τ²-bench, 138 t/s output speed, and a real 1M-token context window. Two things to know before deploying: an 88% hallucination rate (AA-Omniscience) that requires Search grounding to mitigate, and verbosity that inflates real API costs 4–5× above the listed rate. If you're starting fresh, use 3.1 Pro. Already on 3 Pro? The migration is a model string change.

1M ctxFree tierMultimodal

Read full review →

Gemini 3.1 ProTop Pick

Google

Google's reasoning-optimized flagship, released February 19, 2026, and currently the #1 ranked model on the Artificial Analysis Intelligence Index (score: 57 out of 115+ models). Gemini 3.1 Pro is a direct upgrade to Gemini 3 Pro — same 1M token context window and same $2/$12 pricing — but with dramatically improved reasoning. AA independently measures it at 94.1% GPQA Diamond, 44.7% HLE, and 95.6% τ²-bench — top of field on all three. The API exposes three thinking tiers (Low / Medium / High) and a 65,536-token output window — the largest published output context of any frontier model. A dedicated custom-tools API endpoint is available for agentic pipeline use. Currently in preview — generally available soon.

1M ctxFree tierMultimodal

Read full review →

Llama 4 MaverickOpen Source

Meta

Meta's mid-sized open-weights model and the most capable Llama 4 variant for general use. Maverick runs as a mixture-of-experts architecture with 400B total parameters but only 17B active — giving it good speed at 115 t/s while maintaining an AA Intelligence Index of 18. It's multimodal, handles 1M tokens of context, and can be self-hosted. The trade-off: it trails frontier closed models significantly on all AA-measured benchmarks.

1M ctxFree tierMultimodalOpen weights

Read full review →

Llama 4 ScoutOpen Source

Meta

Meta's ultra-long-context open-weights model with a 10M token window — the largest of any publicly available model. Scout is a smaller MoE variant (109B total, ~17B active) optimized for speed and context length over raw intelligence. At 135 t/s and AA Intelligence Index 14, it's the right call when you need to process enormous documents or codebases that would overflow any other model.

10M ctxFree tierMultimodalOpen weights

Read full review →

Mistral Large 3

Mistral AI

Mistral AI's flagship closed-weights model from the leading European AI lab. Mistral Large 3 offers a 256K context window at $0.50/$1.50 per 1M — roughly 4× cheaper than Claude Sonnet with a similar positioning. AA Intelligence Index of 23 puts it in the mid-tier, but it's the strongest option for teams that require EU data residency or want to avoid US and Chinese providers.

Read full review →

Kimi K2Open Source

Moonshot AI

Kimi K2 (0905) is the flagship model from Moonshot AI, a Beijing-based startup — and the current default model on T3.chat. It's a Mixture-of-Experts model with 1 trillion total parameters and 32 billion active per forward pass, released September 2025. Kimi K2 scores an AA Intelligence Index of 31 (#6 of 36 in open-weights large models) and is available as open weights under a permissive license. At $0.39/$1.90 per 1M tokens via the Moonshot API with a 262K context window, it offers strong capability per dollar. Also available in a dedicated Thinking mode for complex reasoning. A newer version (Kimi K2.5) has since launched.

262K ctxFree tierOpen weights

Read full review →

GPT OSS 120BOpen Source

OpenAI

GPT OSS 120B is OpenAI's first large open-weight language model, released August 2025. It uses a Mixture-of-Experts architecture with 117 billion total parameters and 5.1 billion active per forward pass — designed so it can run on a single H100 GPU. With an AA Intelligence Index of 33 (#1 of 50 in reasoning open-weight models), it's the most capable officially released open-weight model from a frontier lab. At $0.15/$0.60 per 1M tokens and 336 tokens/second, it's both cheap and fast. The open weights are available on Hugging Face and can be self-hosted. A smaller companion model, GPT OSS 20B, runs on consumer 16GB GPUs at $0.05/$0.20 per 1M.

131K ctxFree tierOpen weights

Read full review →

GPT-5 MiniBest Value

OpenAI

OpenAI's small-but-smart model and the best value in the GPT-5 family. At $0.25/$2.00 per 1M tokens it costs 7× less than GPT-5.2 while delivering an AA Intelligence Index of 41 — higher than Claude Haiku and Gemini Flash. The 400K context window and multimodal input make it a strong default for cost-sensitive production pipelines.

400K ctxFree tierMultimodal

Read full review →

GPT-5.2Top Pick

OpenAI

Released December 11, 2025 under the internal codename 'Garlic', GPT-5.2 is OpenAI's flagship reasoning model. It beats or ties human industry experts on 70.9% of GDPval knowledge work tasks, scores 100% on AIME 2025 without tools, and runs at a hallucination rate under 1% with browsing active. The 400K context window, 5-tier thinking budget, and 90% cached-input discount make it the default choice for enterprise automation and agentic pipelines.

400K ctxFree tierMultimodal

Read full review →

OpenAI

Released February 5, 2026, GPT-5.3-Codex is OpenAI's most capable agentic coding model — combining the coding depth of GPT-5.2-Codex with the reasoning and professional knowledge of GPT-5.2, at 25% faster speed. It powers the Codex product (chatgpt.com/codex) and runs autonomously for hours: writing features, fixing bugs, proposing PRs, and operating computers end-to-end. It's the first OpenAI model self-classified as 'High capability' in cybersecurity, delaying API access. Token efficiency is the clearest competitive edge — it uses roughly 3× fewer tokens than Claude Code on equivalent tasks.

Read full review →

OpenAI

Released March 5, 2026, GPT-5.4 is OpenAI's most capable frontier model for professional work. It merges the coding depth of GPT-5.3-Codex with leading knowledge-work, computer-use, and agentic tool capabilities into a single model. On GDPval it beats or ties human experts on 83% of tasks (up from 70.9% on GPT-5.2). It's the first general-purpose OpenAI model with native computer-use, hitting 75% on OSWorld-Verified and surpassing human performance. The 1M token context window (experimental in Codex), tool search for efficient MCP integration, and 33% fewer hallucinated claims make it the new default for enterprise automation.

1M ctxFree tierMultimodal

Read full review →

xAI

Released November 17, 2025, Grok 4.1 is xAI's most refined model — a post-training upgrade to Grok 4 that briefly claimed the #1 spot on LMArena (30-position jump) before Gemini 3 Pro and Claude Opus 4.6 overtook it. It leads every frontier model on emotional intelligence (EQ-Bench3: 1586 Elo) and creative writing. It's not trying to win on coding or reasoning — it's trying to be the most compelling AI personality, with the cheapest entry point and real-time X data.

2M ctxFree tierMultimodal

Read full review →

xAI

Released February 17, 2026 as a public beta and updated to Beta 2 on March 3, Grok 4.2 (also called Grok 4.20) is xAI's most architecturally novel model: four specialized AI agents — Grok, Harper, Benjamin, and Lucas — debate and synthesize answers in real time on every complex query. Beta 2 tightened instruction following, reduced capability hallucinations, and fixed LaTeX rendering. The public beta still runs on xAI's 500B-parameter 'small' foundation model; the full-size variant hasn't finished training. There are no official benchmarks yet. It arrives amid regulatory investigations across seven countries, mass founder departures, and the SpaceX acquisition.

256K ctxMultimodal

Read full review →

Ratings computed from intelligence, tool use, context window, trust, and speed (100 pts total). How we rate models →