How We Rate Models
Every rating on this site is computed from a defined formula — not assigned by hand. Each model category (LLMs, image generators, video generators) has its own scoring system built from independently verifiable data. Price is always kept separate from quality so cost never distorts capability rankings.
Why price is not in the quality score
“Which model is smarter?” and “Which is cheapest for my workload?” are different questions. Mixing them into a single score produces misleading results — a cheap model with mediocre capability can outscore an excellent model just because it costs less. Quality and price are both shown on every model page; they are just scored separately.
Large Language Models (LLMs)
LLMs are scored across six dimensions totaling 100 points. The final rating is the total divided by 10. All benchmark data comes from Artificial Analysis — independently measured, not self-reported by providers.
Intelligence
40 pts max
Dynamic: Artificial Analysis Intelligence Index v4.0 — an independently measured composite of 10 standard benchmarks: GPQA Diamond, Humanity's Last Exam, Terminal-Bench Hard, τ²-Bench Telecom, GDPval-AA, SciCode, AA-LCR, AA-Omniscience, IFBench, and CritPt. Standard/medium mode only — extended thinking scores are excluded. Field-relative: worst model in the database → 2 pts, best → 40 pts. Recalibrates automatically as new models are added.
Current field range: 13.5 – 57.2
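The field-relative mapping above can be sketched as a simple linear interpolation between the field's worst and best values. This is an illustrative sketch, not the site's published implementation; the function name and the assumption of linear interpolation are ours.

```python
def field_relative(value, worst, best, min_pts=2.0, max_pts=40.0):
    """Map a benchmark value onto a points range, field-relative.

    The worst model in the database earns min_pts, the best earns max_pts;
    everything between is linearly interpolated (our assumption — the site
    only states the endpoints). As new models widen the field, `worst` and
    `best` change and every score recalibrates.
    """
    frac = (value - worst) / (best - worst)
    return min_pts + frac * (max_pts - min_pts)

# Using the current Intelligence Index field range of 13.5 – 57.2:
# a model at the exact midpoint (35.35) earns half the spread.
print(field_relative(35.35, 13.5, 57.2))  # → 21.0
```

The same helper covers any of the dynamic dimensions by swapping in that dimension's point range, e.g. `min_pts=2.0, max_pts=10.0` for Speed.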
Reliability
15 pts max
Dynamic: AA-Omniscience Index (Artificial Analysis) — measures knowledge accuracy and hallucination rate. Range: −100 (hallucinates on most questions) to +100 (highly accurate, refuses when uncertain). 0 = as many correct answers as wrong ones. Field-relative: best in the database → 15 pts, worst → 2 pts. Models without AA-Omniscience data receive a neutral 5 pts.
Current field range: −53.0 (worst) – +32.9 (best)
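A sketch of the reliability scoring, including the neutral default for models with no AA-Omniscience data. The function name and the linear interpolation between the stated endpoints are our assumptions.

```python
def reliability_points(omniscience, worst=-53.0, best=32.9):
    """AA-Omniscience Index → 2–15 pts, field-relative.

    `worst`/`best` default to the current field range quoted above.
    Models without AA-Omniscience data get a neutral 5 pts rather than
    being penalised for missing measurements.
    """
    if omniscience is None:
        return 5.0  # neutral default for unmeasured models
    frac = (omniscience - worst) / (best - worst)
    return 2.0 + frac * 13.0  # 13 = spread between 2 and 15 pts

print(reliability_points(None))   # → 5.0
print(reliability_points(32.9))   # → 15.0 (current best in the field)
```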
Accessibility
10 pts max
Can everyday users access this model without paying? Absolute checklist: free consumer product (web or mobile) → 5 pts; dedicated chat UI → 3 pts; official mobile app → 2 pts. API-only models with no free consumer product score 0/10 here — capability doesn't help if you can't try it.
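Unlike the dynamic dimensions, accessibility is an absolute checklist, so it can be written as straight addition. A minimal sketch; the function and parameter names are illustrative.

```python
def accessibility_points(free_consumer_product, chat_ui, mobile_app):
    """Absolute checklist, max 10 pts — no field-relative scaling.

    API-only models with no free consumer product fail all three
    checks and score 0/10.
    """
    pts = 0
    if free_consumer_product:
        pts += 5  # free web or mobile consumer product
    if chat_ui:
        pts += 3  # dedicated chat UI
    if mobile_app:
        pts += 2  # official mobile app
    return pts

print(accessibility_points(True, True, True))    # → 10
print(accessibility_points(False, False, False)) # → 0 (API-only)
```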
Context Window
10 pts max
Dynamic: Field-relative on a log scale — each order of magnitude of context counts equally. Best context in the database → 10 pts, worst → 2 pts. Log scale prevents a single outlier (e.g. 10M context) from crushing all 200K–1M models to near-zero. Recalibrates as new models are added.
Current field range: 128,000 – 10,000,000 tokens
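The log-scale mapping above can be sketched by interpolating on log10 of the context size instead of the raw token count. Illustrative only; the function name and the choice of base-10 logarithm are our assumptions (any base gives the same result, since the base cancels in the ratio).

```python
import math

def context_points(tokens, worst=128_000, best=10_000_000):
    """Context window → 2–10 pts, field-relative on a log scale.

    Interpolating on log10(tokens) makes each order of magnitude count
    equally, so a 10M-token outlier does not crush all 200K–1M models
    to near-zero the way a linear scale would.
    """
    lo, hi = math.log10(worst), math.log10(best)
    frac = (math.log10(tokens) - lo) / (hi - lo)
    return 2.0 + frac * 8.0

# A 1M-token model lands near the middle of the log range (~5.8 pts),
# whereas on a linear scale it would earn well under 3 pts.
print(round(context_points(1_000_000), 1))
```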
Trust & Privacy
15 pts max
Company jurisdiction: US/EU = 7 pts, other = 4 pts, Chinese company = 2 pts. Strong published privacy policy: +5 pts. Open source / open weights: +3 pts — auditable weights, self-hostable, no vendor black box. Tool calling / MCP / parallel tool call support are shown as capability facts on each model page but do not contribute points.
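Trust & privacy is also an absolute checklist: a jurisdiction base plus two bonuses. A minimal sketch with illustrative names; the jurisdiction keys are our own labels for the three tiers named above.

```python
def trust_points(jurisdiction, strong_privacy_policy, open_weights):
    """Trust & privacy checklist, max 15 pts (7 + 5 + 3).

    `jurisdiction` is one of "us_eu", "other", or "china" — our labels
    for the three tiers described in the text.
    """
    base = {"us_eu": 7, "other": 4, "china": 2}[jurisdiction]
    if strong_privacy_policy:
        base += 5  # strong published privacy policy
    if open_weights:
        base += 3  # open source / open weights: auditable, self-hostable
    return base

print(trust_points("us_eu", True, True))      # → 15 (maximum)
print(trust_points("china", False, False))    # → 2
```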
Speed
10 pts max
Dynamic: Output tokens per second from the Artificial Analysis speed leaderboard (P50 median over 72 hours). Field-relative: fastest model in the database → 10 pts, slowest → 2 pts. Recalibrates automatically as faster inference hardware arrives.
Current field range: 35 t/s – 304 t/s
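Speed uses the same field-relative linear mapping, just on a 2–10 point range. A sketch under the same assumptions (linear interpolation, illustrative names), with defaults taken from the current field range quoted above.

```python
def speed_points(tokens_per_sec, slowest=35.0, fastest=304.0):
    """Median output speed → 2–10 pts, field-relative.

    Defaults are the current field range; when a faster model enters
    the database, `fastest` rises and every speed score recalibrates.
    """
    frac = (tokens_per_sec - slowest) / (fastest - slowest)
    return 2.0 + frac * 8.0

print(speed_points(169.5))  # → 6.0 (the exact midpoint of the field)
```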
Final Rating
total ÷ 10
Total score (max 100) divided by 10, rounded to one decimal place. This is the quality rating shown on all model pages and comparisons. No pricing influence.
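Putting the six dimensions together, the final rating is just their sum scaled to a 10-point scale. A sketch; the function name is illustrative, and we assume standard round-half rounding to one decimal place.

```python
def final_rating(intelligence, reliability, accessibility,
                 context, trust, speed):
    """Sum the six dimension scores (max 40+15+10+10+15+10 = 100)
    and divide by 10, rounded to one decimal place. Pricing never
    enters this number."""
    total = intelligence + reliability + accessibility + context + trust + speed
    return round(total / 10, 1)

# A hypothetical model that maxes every dimension:
print(final_rating(40, 15, 10, 10, 15, 10))  # → 10.0
```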
LLM Data Sources
- Intelligence Index: Artificial Analysis Intelligence Index v4.0. Composite of 10 evals: GPQA Diamond, HLE, Terminal-Bench Hard, τ²-Bench Telecom, GDPval-AA, SciCode, AA-LCR, AA-Omniscience, IFBench, CritPt. Standard/medium inference mode only.
- Reliability: Artificial Analysis AA-Omniscience Index — independently measures knowledge accuracy and hallucination. Range −100 to +100.
- Speed: Artificial Analysis speed leaderboard — P50 median output tokens per second over 72 hours.
- Pricing, accessibility, context window, trust: Manually verified from official provider pages and privacy policies. Updated monthly.
Image Generator Scoring
Scoring criteria for image models — output quality, resolution, speed, text rendering, commercial safety, and API availability — are in development.
Browse image models →
Video Generator Scoring
Scoring criteria for video models — output quality, max duration, resolution, audio support, and image-to-video capability — are in development.
Browse video models →