LLM Comparison Table
All 19 models, ranked by composite score; benchmark results are in the right-hand columns.
| # | Model | Score | AA Index | Context | Speed (t/s) | API $/1M | Free tier | Open source? | Multimodal | Trust | GPQA (AA) | HLE (AA) | τ²-bench (AA) | LCB (AA) | Released |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro (Google, top pick) | 8.7/10 | 57.18 | 1M | 91 | $4.50 | ± ltd | ✗ | ✓ | US/EU | 94.1% | 44.7% | 95.6% | — | 2026-02-19 |
| 2 | Gemini 3 Pro (Google) | 7.8/10 | 48.39 | 1M | 138 | $4.50 | ± ltd | ✗ | ✓ | US/EU | 90.8% | 37.2% | 87.1% | 91.7% | 2025-12-10 |
| 3 | Gemini 3 Flash (Google, fastest) | 7.8/10 | 46.43 | 1M | 214 | $1.13 | ✓ full | ✗ | ✓ | US/EU | 89.8% | 34.7% | 80.4% | 90.8% | 2025-12-17 |
| 4 | GPT-5.2 (OpenAI, top pick) | 7.5/10 | 51.28 | 400K | 91 | $4.81 | ± ltd | ✗ | ✓ | US/EU | 90.3% | 35.4% | 84.8% | 88.9% | 2025-12-11 |
| 5 | GPT-5.3-Codex (OpenAI) | 7.2/10 | 53.97 | 400K | 99 | $4.81 | — | ✗ | ✗ | US/EU | 91.5% | 39.9% | 90.9% | — | 2026-02-05 |
| 6 | Claude Sonnet 4.6 (Anthropic) | 6.6/10 | 44.38 | 200K | 55 | $6.00 | ± ltd | ✗ | ✓ | US/EU | 79.9% | 13.2% | 79.5% | — | 2026-02-17 |
| 7 | Claude Opus 4.6 (Anthropic, top pick) | 6.4/10 | 46.46 | 200K | 69 | $10.00 | — | ✗ | ✓ | US/EU | 84.0% | 18.6% | 84.8% | — | 2026-02-04 |
| 8 | GPT-5 Mini (OpenAI, best value) | 6.3/10 | 41.17 | 400K | 76 | $0.69 | ± ltd | ✗ | ✓ | US/EU | 82.8% | 19.7% | 68.4% | 83.8% | 2026-01-31 |
| 9 | Claude Sonnet 4.5 (Anthropic) | 5.8/10 | 37.14 | 200K | 37.2 | $6.00 | ± ltd | ✗ | ✓ | US/EU | 72.7% | 7.1% | 70.5% | — | 2025-09-29 |
| 10 | Claude Haiku 4.5 (Anthropic, best value) | 5.5/10 | 31.05 | 200K | 111 | $2.00 | ± ltd | ✗ | ✓ | US/EU | 64.6% | 4.3% | 32.5% | 51.1% | 2025-10-15 |
| 11 | GPT OSS 120B (OpenAI, open source) | 5.4/10 | 33.27 | 131K | 304 | $0.26 | ± ltd | ✓ | ✗ | US/EU | 78.2% | 18.5% | 65.8% | 87.8% | 2025-08-05 |
| 12 | Grok 4.2 (xAI) | 5.2/10 | 43* | 256K | 85* | $9.00 | — | ✗ | ✓ | US/EU | — | — | — | — | 2026-02-17 |
| 13 | Grok 4.1 (xAI) | 4.7/10 | 23.56 | 2M | 133 | $0.25 | ± ltd | ✗ | ✓ | US/EU | 63.7% | 5.0% | 63.7% | 39.9% | 2025-11-17 |
| 14 | DeepSeek V3.2 (DeepSeek, open source) | 4.0/10 | 32.09 | 128K | 43 | $0.49 | ✓ full | ✓ | ✗ | CN ⚠ | 75.1% | 10.5% | 78.9% | 59.3% | 2026-01-20 |
| 15 | Llama 4 Maverick (Meta, open source) | 4.0/10 | 18.36 | 1M | 115 | $0.31 | ± ltd | ✓ | ✓ | US/EU | 67.1% | 4.8% | 17.8% | 39.7% | 2025-11-15 |
| 16 | Kimi K2 (Moonshot AI, open source) | 4.0/10 | 26.32 | 262K | 42 | $0.77 | ± ltd | ✓ | ✗ | CN ⚠ | 76.6% | 7.0% | 61.1% | 55.6% | 2025-09-05 |
| 17 | Llama 4 Scout (Meta, open source) | 3.9/10 | 13.52 | 10M | 135 | $0.17 | ± ltd | ✓ | ✓ | US/EU | 58.7% | 4.3% | 15.5% | 29.9% | 2025-11-15 |
| 18 | Mistral Large 3 (Mistral AI) | 3.2/10 | 22.8 | 256K | 50 | $0.75 | — | ✗ | ✗ | US/EU | 68.0% | 4.1% | 24.6% | 46.5% | 2025-11-01 |
| 19 | Qwen 3 235B (Alibaba, open source) | 2.7/10 | 16.96 | 262K | 35 | $0.37 | ± ltd | ✓ | ✗ | CN ⚠ | 61.3% | 4.7% | 27.2% | 34.3% | 2025-04-29 |
All benchmark scores independently measured by Artificial Analysis (AA) in standard mode. GPQA: GPQA Diamond, 198 PhD-level science questions (random guessing scores 25%). HLE: Humanity's Last Exam, the hardest public benchmark, run without tools. τ²-bench: tool use and multi-turn agent tasks. LCB: LiveCodeBench, competitive programming accuracy.
Score = composite quality rating on a 0–10 scale. AA Index = Artificial Analysis Intelligence Index v4.0. API $/1M = blended price per 1M tokens at a 3:1 input:output ratio.
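As a sanity check on the pricing column, here is a minimal sketch of the 3:1 blend, assuming it is a token-weighted average of the per-million input and output prices. The `blended_price` helper and the dollar figures in the example are illustrative, not taken from the table or from any provider's price list.

```python
def blended_price(input_per_1m: float, output_per_1m: float, ratio: float = 3.0) -> float:
    """Blend per-1M-token API prices at a given input:output token ratio (default 3:1)."""
    return (ratio * input_per_1m + output_per_1m) / (ratio + 1)

# Hypothetical example: $2.00/1M input and $12.00/1M output blend to $4.50/1M at 3:1.
print(blended_price(2.00, 12.00))  # 4.5
```

Under this weighting, a model with cheap input but expensive output tokens ends up noticeably pricier in the blended column than its input rate alone would suggest.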
Last verified February 2026. All benchmark data sourced from Artificial Analysis.