
LLM Comparison Table

All 19 models, ranked by composite score, with benchmark results in the right-hand columns.

CN ⚠ = Chinese jurisdiction. API $/1M = price per 1M tokens, blended at a 3:1 input:output ratio. How we score →
| # | Model | Score | AA Index | Context | Speed (t/s) | API $/1M | Free? | Trust | GPQA (AA) | HLE (AA) | τ²-bench (AA) | LCB (AA) | Released |
|---|-------|-------|----------|---------|-------------|----------|-------|-------|-----------|----------|---------------|----------|----------|
| 1 | Gemini 3.1 Pro (Google) · top-pick | 8.7/10 | 57.18 | 1M | 91 | $4.50 | ± ltd | US/EU | 94.1% | 44.7% | 95.6% | | 2026-02-19 |
| 2 | Gemini 3 Pro (Google) | 7.8/10 | 48.39 | 1M | 138 | $4.50 | ± ltd | US/EU | 90.8% | 37.2% | 87.1% | 91.7% | 2025-12-10 |
| 3 | Gemini 3 Flash (Google) · fastest | 7.8/10 | 46.43 | 1M | 214 | $1.13 | ✓ full | US/EU | 89.8% | 34.7% | 80.4% | 90.8% | 2025-12-17 |
| 4 | GPT-5.2 (OpenAI) · top-pick | 7.5/10 | 51.28 | 400K | 91 | $4.81 | ± ltd | US/EU | 90.3% | 35.4% | 84.8% | 88.9% | 2025-12-11 |
| 5 | GPT-5.3-Codex (OpenAI) | 7.2/10 | 53.97 | 400K | 99 | $4.81 | | US/EU | 91.5% | 39.9% | 90.9% | | 2026-02-05 |
| 6 | Claude Sonnet 4.6 (Anthropic) | 6.6/10 | 44.38 | 200K | 55 | $6.00 | ± ltd | US/EU | 79.9% | 13.2% | 79.5% | | 2026-02-17 |
| 7 | Claude Opus 4.6 (Anthropic) · top-pick | 6.4/10 | 46.46 | 200K | 69 | $10.00 | | US/EU | 84.0% | 18.6% | 84.8% | | 2026-02-04 |
| 8 | GPT-5 Mini (OpenAI) · best-value | 6.3/10 | 41.17 | 400K | 76 | $0.69 | ± ltd | US/EU | 82.8% | 19.7% | 68.4% | 83.8% | 2026-01-31 |
| 9 | Claude Sonnet 4.5 (Anthropic) | 5.8/10 | 37.14 | 200K | 37.2 | $6.00 | ± ltd | US/EU | 72.7% | 7.1% | 70.5% | | 2025-09-29 |
| 10 | Claude Haiku 4.5 (Anthropic) · best-value | 5.5/10 | 31.05 | 200K | 111 | $2.00 | ± ltd | US/EU | 64.6% | 4.3% | 32.5% | 51.1% | 2025-10-15 |
| 11 | GPT OSS 120B (OpenAI) · open-source | 5.4/10 | 33.27 | 131K | 304 | $0.26 | ± ltd | US/EU | 78.2% | 18.5% | 65.8% | 87.8% | 2025-08-05 |
| 12 | Grok 4.2 (xAI) | 5.2/10 | 43* | 256K | 85* | $9.00 | | US/EU | | | | | 2026-02-17 |
| 13 | Grok 4.1 (xAI) | 4.7/10 | 23.56 | 2M | 133 | $0.25 | ± ltd | US/EU | 63.7% | 5.0% | 63.7% | 39.9% | 2025-11-17 |
| 14 | DeepSeek V3.2 (DeepSeek) · open-source | 4.0/10 | 32.09 | 128K | 43 | $0.49 | ✓ full | CN ⚠ | 75.1% | 10.5% | 78.9% | 59.3% | 2026-01-20 |
| 15 | Llama 4 Maverick (Meta) · open-source | 4.0/10 | 18.36 | 1M | 115 | $0.31 | ± ltd | US/EU | 67.1% | 4.8% | 17.8% | 39.7% | 2025-11-15 |
| 16 | Kimi K2 (Moonshot AI) · open-source | 4.0/10 | 26.32 | 262K | 42 | $0.77 | ± ltd | CN ⚠ | 76.6% | 7.0% | 61.1% | 55.6% | 2025-09-05 |
| 17 | Llama 4 Scout (Meta) · open-source | 3.9/10 | 13.52 | 10M | 135 | $0.17 | ± ltd | US/EU | 58.7% | 4.3% | 15.5% | 29.9% | 2025-11-15 |
| 18 | Mistral Large 3 (Mistral AI) | 3.2/10 | 22.8 | 256K | 50 | $0.75 | | US/EU | 68.0% | 4.1% | 24.6% | 46.5% | 2025-11-01 |
| 19 | Qwen 3 235B (Alibaba) · open-source | 2.7/10 | 16.96 | 262K | 35 | $0.37 | ± ltd | CN ⚠ | 61.3% | 4.7% | 27.2% | 34.3% | 2025-04-29 |

All benchmark scores independently measured by Artificial Analysis (AA) in standard mode. GPQA: GPQA Diamond, 198 PhD-level science questions (random guessing = 25%). HLE: Humanity's Last Exam, among the hardest public benchmarks, run without tools. τ²-bench: tool use and multi-turn agent tasks. LCB: LiveCodeBench, competitive programming accuracy.

Score = composite quality rating on a 0–10 scale. AA Index = Artificial Analysis Intelligence Index v4.0. API $/1M = price per 1M tokens, blended at a 3:1 input:output ratio. Full methodology →
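To make the blended figure concrete: it is a weighted average of a provider's input and output token prices, weighted 3:1 toward input. A minimal sketch in Python, assuming the blend is a plain 3:1 weighted average and using placeholder rates that are not taken from any row above:

```python
def blended_price_per_1m(input_per_1m: float, output_per_1m: float) -> float:
    """Blend input and output prices at a 3:1 input:output token ratio."""
    return (3 * input_per_1m + 1 * output_per_1m) / 4

# Placeholder rates (hypothetical, not any listed model's actual pricing):
# $1.25 per 1M input tokens and $10.00 per 1M output tokens.
print(blended_price_per_1m(1.25, 10.00))  # 3.4375 -> about $3.44 per 1M blended tokens
```

Substituting a model's published input and output rates should reproduce its API $/1M entry above.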

Last verified February 2026. All benchmark data sourced from Artificial Analysis.