Smartest LLMs — Ranked by Intelligence

Models are ranked by the Artificial Analysis (AA) Intelligence Index v4.0, an independently measured composite of 10 standard benchmarks. All models are run under the same conditions (standard/medium mode, no extended thinking). This makes the rankings more trustworthy than self-reported results, because AA measures every model the same way.

What the AA Index measures

GPQA Diamond · Humanity's Last Exam · SWE-bench Verified · Terminal-Bench Hard · τ²-Bench Telecom · GDPval-AA · SciCode · AA-LCR · AA-Omniscience · IFBench

Source: artificialanalysis.ai · Standard/medium inference mode · Not extended thinking

Most intelligent (measured)

Gemini 3 Pro

Google · AA Index 48.44

Gemini 3 Pro (Google) · AA Index 48.4 · Quality 8.8 · Context: 1M tokens · Speed: 55 t/s est.

GPT-5.2 (OpenAI) · AA Index 46.6 · Quality 8.3 · Context: 400K tokens · Speed: 65 t/s est.

Claude Opus 4.6 (Anthropic) · AA Index 46.0 · Quality 7.5 · Context: 200K tokens · Speed: 67 t/s

Claude Sonnet 4.6 (Anthropic) · AA Index 44.3 · Quality 8.0 · Context: 200K tokens · Speed: 85 t/s est.

DeepSeek V3.2 (DeepSeek) · AA Index 41.6 · Quality 6.3 · Context: 128K tokens · Speed: 45 t/s est.

Grok 4.1 (xAI) · AA Index 41.4 · Quality 8.0 · Context: 2M tokens · Speed: 90 t/s est.

GPT-5 mini (OpenAI) · AA Index 39.0 · Quality 7.3 · Context: 400K tokens · Speed: 73 t/s

Llama 4 Scout (Meta, est.) · AA Index 38.5 · Quality 7.5 · Context: 10M tokens · Speed: 180 t/s

Gemini 3 Flash (Google) · AA Index 35.0 · Quality 7.3 · Context: 1M tokens · Speed: 170 t/s

Mistral Large 3 (Mistral) · AA Index 23.0 · Quality 4.6 · Context: 256K tokens · Speed: 56 t/s

Llama 4 Maverick (Meta) · AA Index 18.0 · Quality 4.4 · Context: 1M tokens · Speed: 125 t/s

Why AA Index instead of individual benchmarks?

Individual benchmarks saturate (top models score 90%+), can be gamed by training on leaked test sets, and each measure only a narrow capability. The AA Index combines 10 complementary benchmarks under controlled conditions, covering hard reasoning (GPQA Diamond, HLE), real-world coding (SWE-bench, Terminal-Bench), scientific reasoning (SciCode), and knowledge work (GDPval). No scores are self-reported, and every model is run the same way.
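To illustrate why a composite is harder to inflate than a single benchmark, here is a minimal Python sketch. It assumes an equal-weighted mean of per-benchmark scores on a 0-100 scale; the page does not state how AA actually weights or normalizes its benchmarks, so the weighting, the function name `composite_index`, and the example scores are all hypothetical.

```python
from statistics import fmean
from typing import Mapping

# Benchmark names as listed on this page.
BENCHMARKS = [
    "GPQA Diamond", "Humanity's Last Exam", "SWE-bench Verified",
    "Terminal-Bench Hard", "τ²-Bench Telecom", "GDPval-AA",
    "SciCode", "AA-LCR", "AA-Omniscience", "IFBench",
]

def composite_index(scores: Mapping[str, float]) -> float:
    """Equal-weighted mean over all benchmarks (an assumed scheme, not AA's)."""
    missing = [name for name in BENCHMARKS if name not in scores]
    if missing:
        raise ValueError(f"missing benchmark scores: {missing}")
    return fmean(scores[name] for name in BENCHMARKS)

# Hypothetical model: saturated on one benchmark, middling on the other nine.
example_scores = {name: 40.0 for name in BENCHMARKS}
example_scores["GPQA Diamond"] = 90.0

index = composite_index(example_scores)
print(index)  # (90 + 9 * 40) / 10 = 45.0
```

Under this assumed scheme, a 50-point jump on one benchmark moves the composite by only 5 points, which is the intuition behind preferring a composite over any single score.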

Values marked “est.” are extrapolated from the nearest available AA measurement for that model version. They will be updated once AA completes direct indexing.
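When working with these figures programmatically, it may help to keep the “est.” flag separate from the number so that extrapolated values are not mixed with measured ones. The sketch below is one way to do that in Python; the `Speed` type and `parse_speed` helper are hypothetical names, and the sample strings are speed values copied from the listing on this page.

```python
import re
from typing import NamedTuple

class Speed(NamedTuple):
    tokens_per_second: float
    estimated: bool  # True when the value carries the "est." marker

# Matches strings such as "67 t/s" or "55 t/s est."
_SPEED_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*t/s(\s+est\.)?\s*$")

def parse_speed(text: str) -> Speed:
    """Split a speed string into its numeric value and its 'est.' flag."""
    match = _SPEED_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized speed value: {text!r}")
    return Speed(float(match.group(1)), match.group(2) is not None)

# A few speed values as they appear in the listing.
speeds = {
    "Gemini 3 Pro": "55 t/s est.",
    "GPT-5.2": "65 t/s est.",
    "Claude Opus 4.6": "67 t/s",
    "GPT-5 mini": "73 t/s",
}

parsed = {name: parse_speed(value) for name, value in speeds.items()}

# Keep only directly measured values for like-for-like comparison.
measured_only = sorted(name for name, s in parsed.items() if not s.estimated)
print(measured_only)  # ['Claude Opus 4.6', 'GPT-5 mini']
```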