
Smartest LLMs — Ranked by Intelligence

Ranked by the Artificial Analysis Intelligence Index v4.0, an independently measured composite of 10 standard benchmarks. All models are run under the same conditions (standard/medium mode, no extended thinking), which makes these among the most trustworthy intelligence rankings available: AA measures every model the same way rather than relying on self-reported results.

What the AA Index measures

GPQA Diamond · Humanity's Last Exam · τ²-Bench · Terminal-Bench Hard · CritPt · GDPval-AA · SciCode · AA-LCR · AA-Omniscience · IFBench

Source: artificialanalysis.ai · Standard/medium inference mode · Not extended thinking

Most intelligent (measured): Gemini 3.1 Pro (Google · AA Index 57.18)

| # | Model | AA Index | Quality | Context | Speed |
|---|-------|----------|---------|---------|-------|
| 1 | Gemini 3.1 Pro (Google) | 57.2 | 8.7 | 1M tokens | 91 t/s |
| 2 | — | 54.0 | 7.2 | 400K tokens | 99 t/s |
| 3 | GPT-5.2 (OpenAI) | 51.3 | 7.5 | 400K tokens | 91 t/s |
| 4 | — | 48.4 | 7.8 | 1M tokens | 138 t/s |
| 5 | — | 46.5 | 6.4 | 200K tokens | 69 t/s |
| 6 | — | 46.4 | 7.8 | 1M tokens | 214 t/s |
| 7 | — | 44.4 | 6.6 | 200K tokens | 55 t/s |
| 8 | Grok 4.2 (xAI) est. | 43.0 | 5.2 | 256K tokens | 85 t/s est. |
| 9 | — | 41.2 | 6.3 | 400K tokens | 76 t/s |
| 10 | — | 37.1 | 5.8 | 200K tokens | 37.2 t/s |
| 11 | — | 33.3 | 5.4 | 131K tokens | 304 t/s |
| 12 | — | 32.1 | 4.0 | 128K tokens | 43 t/s |
| 13 | — | 31.1 | 5.5 | 200K tokens | 111 t/s |
| 14 | Kimi K2 (Moonshot AI) | 26.3 | 4.0 | 262K tokens | 42 t/s |
| 15 | — | 23.6 | 4.7 | 2M tokens | 133 t/s |
| 16 | Mistral Large 3 (Mistral AI) | 22.8 | 3.2 | 256K tokens | 50 t/s |
| 17 | — | — | 4.0 | 1M tokens | 115 t/s |
| 18 | — | 17.0 | 2.7 | 262K tokens | 35 t/s |
| 19 | — | — | 3.9 | 10M tokens | 135 t/s |

Why AA Index instead of individual benchmarks?

Single benchmarks saturate (top models score 90%+), can be gamed by training on leaked test sets, and each measures only a narrow capability. The AA Index combines 10 complementary benchmarks, spanning hard reasoning (GPQA Diamond, HLE), agentic tool use (τ²-Bench, Terminal-Bench Hard), scientific coding (SciCode), and real-world professional tasks (GDPval-AA), all run under controlled conditions. No self-reported scores; every model is run the same way.
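To make the idea of a composite index concrete, here is a minimal sketch in Python. It computes an equal-weight mean over per-benchmark scores, each normalized to a 0–100 scale; the equal weighting, the normalization, and the scores shown are illustrative assumptions, not Artificial Analysis's published methodology.

```python
# Illustrative composite index: equal-weight mean of normalized benchmark
# scores. The weighting scheme is an assumption, not AA's actual formula.

def composite_index(scores: dict[str, float], max_scores: dict[str, float]) -> float:
    """Average of benchmark scores, each rescaled to a 0-100 range."""
    normalized = [100.0 * scores[b] / max_scores[b] for b in scores]
    return sum(normalized) / len(normalized)

# Hypothetical scores for three of the ten benchmarks:
scores = {"GPQA Diamond": 72.0, "SciCode": 41.0, "IFBench": 58.0}
max_scores = {"GPQA Diamond": 100.0, "SciCode": 100.0, "IFBench": 100.0}
print(round(composite_index(scores, max_scores), 1))  # 57.0
```

Because each benchmark is rescaled before averaging, no single benchmark with a large raw range can dominate the composite, which is one reason composite indices resist the saturation and gaming problems described above.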

Values marked “est.” are extrapolated from the nearest available AA measurement for that model version. Updated when AA completes direct indexing.
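The "est." convention above can be sketched as a nearest-measurement lookup. All version numbers and scores here are hypothetical, and this is only one plausible reading of "extrapolated from the nearest available AA measurement", not AA's documented procedure.

```python
# Hypothetical "est." lookup: if a model version has no direct measurement,
# borrow the score of the numerically nearest measured version and flag it.

measured = {"4.0": 39.5, "4.1": 41.8}  # version -> measured AA Index (hypothetical)

def estimate(version: str) -> tuple[float, bool]:
    """Return (score, is_estimate); fall back to the nearest measured version."""
    if version in measured:
        return measured[version], False
    nearest = min(measured, key=lambda v: abs(float(v) - float(version)))
    return measured[nearest], True

print(estimate("4.2"))  # (41.8, True): borrowed from version 4.1
```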