
LLM Comparison Table

All 19 models, ranked by composite score, with benchmark results in the right-hand columns.

CN ⚠ = Chinese jurisdiction. API $/1M = price per 1M tokens, blended at a 3:1 input:output ratio. How we score →
| # | Model | Score | AA Index | Context | Speed (t/s) | API $/1M | Free? | Trust | GPQA (AA) | HLE (AA) | τ²-bench (AA) | LCB (AA) | Released |
|---|-------|-------|----------|---------|-------------|----------|-------|-------|-----------|----------|---------------|----------|----------|
| 1 | Gemini 3.1 Pro (Google) · top-pick | 8.7/10 | 57.18 | 1M | 91 | $4.50 | ± ltd | US/EU | 94.1% | 44.7% | 95.6% | | 2026-02-19 |
| 2 | Gemini 3 Pro (Google) | 7.8/10 | 48.39 | 1M | 138 | $4.50 | ± ltd | US/EU | 90.8% | 37.2% | 87.1% | 91.7% | 2025-12-10 |
| 3 | Gemini 3 Flash (Google) · fastest | 7.8/10 | 46.43 | 1M | 214 | $1.13 | ✓ full | US/EU | 89.8% | 34.7% | 80.4% | 90.8% | 2025-12-17 |
| 4 | GPT-5.2 (OpenAI) · top-pick | 7.5/10 | 51.28 | 400K | 91 | $4.81 | ± ltd | US/EU | 90.3% | 35.4% | 84.8% | 88.9% | 2025-12-11 |
| 5 | GPT-5.3-Codex (OpenAI) | 7.2/10 | 53.97 | 400K | 99 | $4.81 | | US/EU | 91.5% | 39.9% | 90.9% | | 2026-02-05 |
| 6 | Claude Sonnet 4.6 (Anthropic) | 6.6/10 | 44.38 | 200K | 55 | $6.00 | ± ltd | US/EU | 79.9% | 13.2% | 79.5% | | 2026-02-17 |
| 7 | Claude Opus 4.6 (Anthropic) · top-pick | 6.4/10 | 46.46 | 200K | 69 | $10.00 | | US/EU | 84.0% | 18.6% | 84.8% | | 2026-02-04 |
| 8 | GPT-5 Mini (OpenAI) · best-value | 6.3/10 | 41.17 | 400K | 76 | $0.69 | ± ltd | US/EU | 82.8% | 19.7% | 68.4% | 83.8% | 2026-01-31 |
| 9 | Claude Sonnet 4.5 (Anthropic) | 5.8/10 | 37.14 | 200K | 37.2 | $6.00 | ± ltd | US/EU | 72.7% | 7.1% | 70.5% | | 2025-09-29 |
| 10 | Claude Haiku 4.5 (Anthropic) · best-value | 5.5/10 | 31.05 | 200K | 111 | $2.00 | ± ltd | US/EU | 64.6% | 4.3% | 32.5% | 51.1% | 2025-10-15 |
| 11 | GPT OSS 120B (OpenAI) · open-source | 5.4/10 | 33.27 | 131K | 304 | $0.26 | ± ltd | US/EU | 78.2% | 18.5% | 65.8% | 87.8% | 2025-08-05 |
| 12 | Grok 4.2 (xAI) | 5.2/10 | 43* | 256K | 85* | $9.00 | | US/EU | | | | | 2026-02-17 |
| 13 | Grok 4.1 (xAI) | 4.7/10 | 23.56 | 2M | 133 | $0.25 | ± ltd | US/EU | 63.7% | 5.0% | 63.7% | 39.9% | 2025-11-17 |
| 14 | DeepSeek V3.2 (DeepSeek) · open-source | 4.0/10 | 32.09 | 128K | 43 | $0.49 | ✓ full | CN ⚠ | 75.1% | 10.5% | 78.9% | 59.3% | 2026-01-20 |
| 15 | Llama 4 Maverick (Meta) · open-source | 4.0/10 | 18.36 | 1M | 115 | $0.31 | ± ltd | US/EU | 67.1% | 4.8% | 17.8% | 39.7% | 2025-11-15 |
| 16 | Kimi K2 (Moonshot AI) · open-source | 4.0/10 | 26.32 | 262K | 42 | $0.77 | ± ltd | CN ⚠ | 76.6% | 7.0% | 61.1% | 55.6% | 2025-09-05 |
| 17 | Llama 4 Scout (Meta) · open-source | 3.9/10 | 13.52 | 10M | 135 | $0.17 | ± ltd | US/EU | 58.7% | 4.3% | 15.5% | 29.9% | 2025-11-15 |
| 18 | Mistral Large 3 (Mistral AI) | 3.2/10 | 22.8 | 256K | 50 | $0.75 | | US/EU | 68.0% | 4.1% | 24.6% | 46.5% | 2025-11-01 |
| 19 | Qwen 3 235B (Alibaba) · open-source | 2.7/10 | 16.96 | 262K | 35 | $0.37 | ± ltd | CN ⚠ | 61.3% | 4.7% | 27.2% | 34.3% | 2025-04-29 |

All benchmark scores independently measured by Artificial Analysis (AA) in standard mode. GPQA: GPQA Diamond, 198 PhD-level science questions (random guessing = 25%). HLE: Humanity's Last Exam, among the hardest public benchmarks, run without tools. τ²-bench: tool use and multi-turn agent tasks. LCB: LiveCodeBench, competitive programming accuracy.

Score = composite quality rating on a 0–10 scale. AA Index = Artificial Analysis Intelligence Index v4.0. API $/1M = price per 1M tokens, blended at a 3:1 input:output ratio. Full methodology →
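To make the blended figure concrete: it is a weighted average of a provider's input and output token prices, weighted 3:1 toward input. A minimal sketch in Python, assuming the blend is a plain 3:1 weighted average and using placeholder rates that are not taken from any row above:

```python
def blended_price_per_1m(input_per_1m: float, output_per_1m: float) -> float:
    """Blend input and output prices at a 3:1 input:output token ratio."""
    return (3 * input_per_1m + 1 * output_per_1m) / 4

# Placeholder rates (hypothetical, not any listed model's actual pricing):
# $1.25 per 1M input tokens and $10.00 per 1M output tokens.
print(blended_price_per_1m(1.25, 10.00))  # 3.4375 -> about $3.44 per 1M blended tokens
```

Substituting a model's published input and output rates should reproduce its API $/1M entry above.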

Last verified February 2026. All benchmark data sourced from Artificial Analysis.