
Smartest LLMs — Ranked by Intelligence

Ranked by the Artificial Analysis Intelligence Index v4.0, an independently measured composite of 10 standard benchmarks. All models are run under the same conditions (standard/medium mode, no extended thinking), which makes these among the most trustworthy intelligence rankings available: AA measures every model the same way rather than relying on self-reported results.

What the AA Index measures

GPQA Diamond · Humanity's Last Exam · τ²-Bench · Terminal-Bench Hard · CritPt · GDPval-AA · SciCode · AA-LCR · AA-Omniscience · IFBench

Source: artificialanalysis.ai · Standard/medium inference mode · Not extended thinking

Most intelligent (measured): Gemini 3.1 Pro (Google · AA Index 57.18)

| # | Model | AA Index | Quality | Context | Speed |
|---|-------|----------|---------|---------|-------|
| 1 | Gemini 3.1 Pro (Google) | 57.2 | 8.7 | 1M tokens | 91 t/s |
| 2 | — | 54.0 | 7.2 | 400K tokens | 99 t/s |
| 3 | GPT-5.2 (OpenAI) | 51.3 | 7.5 | 400K tokens | 91 t/s |
| 4 | — | 48.4 | 7.8 | 1M tokens | 138 t/s |
| 5 | — | 46.5 | 6.4 | 200K tokens | 69 t/s |
| 6 | — | 46.4 | 7.8 | 1M tokens | 214 t/s |
| 7 | — | 44.4 | 6.6 | 200K tokens | 55 t/s |
| 8 | Grok 4.2 (xAI) est. | 43.0 | 5.2 | 256K tokens | 85 t/s est. |
| 9 | — | 41.2 | 6.3 | 400K tokens | 76 t/s |
| 10 | — | 37.1 | 5.8 | 200K tokens | 37.2 t/s |
| 11 | — | 33.3 | 5.4 | 131K tokens | 304 t/s |
| 12 | — | 32.1 | 4.0 | 128K tokens | 43 t/s |
| 13 | — | 31.1 | 5.5 | 200K tokens | 111 t/s |
| 14 | Kimi K2 (Moonshot AI) | 26.3 | 4.0 | 262K tokens | 42 t/s |
| 15 | — | 23.6 | 4.7 | 2M tokens | 133 t/s |
| 16 | Mistral Large 3 (Mistral AI) | 22.8 | 3.2 | 256K tokens | 50 t/s |
| 17 | — | — | 4.0 | 1M tokens | 115 t/s |
| 18 | — | 17.0 | 2.7 | 262K tokens | 35 t/s |
| 19 | — | — | 3.9 | 10M tokens | 135 t/s |

Why AA Index instead of individual benchmarks?

Single benchmarks saturate (top models score 90%+), can be gamed by training on leaked test sets, and each measures only a narrow capability. The AA Index combines 10 complementary benchmarks, spanning hard reasoning (GPQA Diamond, HLE), agentic tool use (τ²-Bench, Terminal-Bench Hard), scientific coding (SciCode), and real-world professional tasks (GDPval-AA), all run under controlled conditions. No self-reported scores; every model is run the same way.
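To make the idea of a composite index concrete, here is a minimal sketch in Python. It computes an equal-weight mean over per-benchmark scores, each normalized to a 0–100 scale; the equal weighting, the normalization, and the scores shown are illustrative assumptions, not Artificial Analysis's published methodology.

```python
# Illustrative composite index: equal-weight mean of normalized benchmark
# scores. The weighting scheme is an assumption, not AA's actual formula.

def composite_index(scores: dict[str, float], max_scores: dict[str, float]) -> float:
    """Average of benchmark scores, each rescaled to a 0-100 range."""
    normalized = [100.0 * scores[b] / max_scores[b] for b in scores]
    return sum(normalized) / len(normalized)

# Hypothetical scores for three of the ten benchmarks:
scores = {"GPQA Diamond": 72.0, "SciCode": 41.0, "IFBench": 58.0}
max_scores = {"GPQA Diamond": 100.0, "SciCode": 100.0, "IFBench": 100.0}
print(round(composite_index(scores, max_scores), 1))  # 57.0
```

Because each benchmark is rescaled before averaging, no single benchmark with a large raw range can dominate the composite, which is one reason composite indices resist the saturation and gaming problems described above.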

Values marked “est.” are extrapolated from the nearest available AA measurement for that model version. Updated when AA completes direct indexing.
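The "est." convention above can be sketched as a nearest-measurement lookup. All version numbers and scores here are hypothetical, and this is only one plausible reading of "extrapolated from the nearest available AA measurement", not AA's documented procedure.

```python
# Hypothetical "est." lookup: if a model version has no direct measurement,
# borrow the score of the numerically nearest measured version and flag it.

measured = {"4.0": 39.5, "4.1": 41.8}  # version -> measured AA Index (hypothetical)

def estimate(version: str) -> tuple[float, bool]:
    """Return (score, is_estimate); fall back to the nearest measured version."""
    if version in measured:
        return measured[version], False
    nearest = min(measured, key=lambda v: abs(float(v) - float(version)))
    return measured[nearest], True

print(estimate("4.2"))  # (41.8, True): borrowed from version 4.1
```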