How We Rate Models
Every rating on this site is computed from a defined formula — not assigned by hand. Each model category (LLMs, image generators, video generators) has its own scoring system built from independently verifiable data. Price is always kept separate from quality so cost never distorts capability rankings.
Why price is not in the quality score
“Which model is smarter?” and “Which is cheapest for my workload?” are different questions. Mixing them into a single score produces misleading results — a cheap model with mediocre capability can outscore an excellent model just because it costs less. Quality and price are both shown on every model page; they are just scored separately.
Large Language Models (LLMs)
LLMs are scored across six dimensions totaling 100 points. The final rating is the total divided by 10. All benchmark data comes from Artificial Analysis — independently measured, not self-reported by providers.
Intelligence
40 pts max
Dynamic: Artificial Analysis Intelligence Index v4.0 — an independently measured composite of 10 standard benchmarks: GPQA Diamond, Humanity's Last Exam, Terminal-Bench Hard, τ²-Bench Telecom, GDPval-AA, SciCode, AA-LCR, AA-Omniscience, IFBench, and CritPt. Standard/medium mode only — extended thinking scores are excluded. Field-relative: worst model in the database → 2 pts, best → 40 pts. Recalibrates automatically as new models are added.
Current field range: 13.5 – 57.2
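The field-relative mapping above can be sketched as a simple linear interpolation between the field's worst and best values. This is an illustrative sketch, not the site's published implementation; the function name and the assumption of linear interpolation are ours.

```python
def field_relative(value, worst, best, min_pts=2.0, max_pts=40.0):
    """Map a benchmark value onto a points range, field-relative.

    The worst model in the database earns min_pts, the best earns max_pts;
    everything between is linearly interpolated (our assumption — the site
    only states the endpoints). As new models widen the field, `worst` and
    `best` change and every score recalibrates.
    """
    frac = (value - worst) / (best - worst)
    return min_pts + frac * (max_pts - min_pts)

# Using the current Intelligence Index field range of 13.5 – 57.2:
# a model at the exact midpoint (35.35) earns half the spread.
print(field_relative(35.35, 13.5, 57.2))  # → 21.0
```

The same helper covers any of the dynamic dimensions by swapping in that dimension's point range, e.g. `min_pts=2.0, max_pts=10.0` for Speed.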
Reliability
15 pts max
Dynamic: AA-Omniscience Index (Artificial Analysis) — measures knowledge accuracy and hallucination rate. Range: −100 (hallucinates on most questions) to +100 (highly accurate, refuses when uncertain). 0 = as many correct answers as wrong ones. Field-relative: best in the database → 15 pts, worst → 2 pts. Models without AA-Omniscience data receive a neutral 5 pts.
Current field range: −53.0 (worst) – +32.9 (best)
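A sketch of the reliability scoring, including the neutral default for models with no AA-Omniscience data. The function name and the linear interpolation between the stated endpoints are our assumptions.

```python
def reliability_points(omniscience, worst=-53.0, best=32.9):
    """AA-Omniscience Index → 2–15 pts, field-relative.

    `worst`/`best` default to the current field range quoted above.
    Models without AA-Omniscience data get a neutral 5 pts rather than
    being penalised for missing measurements.
    """
    if omniscience is None:
        return 5.0  # neutral default for unmeasured models
    frac = (omniscience - worst) / (best - worst)
    return 2.0 + frac * 13.0  # 13 = spread between 2 and 15 pts

print(reliability_points(None))   # → 5.0
print(reliability_points(32.9))   # → 15.0 (current best in the field)
```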
Accessibility
10 pts max
Can everyday users access this model without paying? Absolute checklist: free consumer product (web or mobile) → 5 pts; dedicated chat UI → 3 pts; official mobile app → 2 pts. API-only models with no free consumer product score 0/10 here — capability doesn't help if you can't try it.
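Unlike the dynamic dimensions, accessibility is an absolute checklist, so it can be written as straight addition. A minimal sketch; the function and parameter names are illustrative.

```python
def accessibility_points(free_consumer_product, chat_ui, mobile_app):
    """Absolute checklist, max 10 pts — no field-relative scaling.

    API-only models with no free consumer product fail all three
    checks and score 0/10.
    """
    pts = 0
    if free_consumer_product:
        pts += 5  # free web or mobile consumer product
    if chat_ui:
        pts += 3  # dedicated chat UI
    if mobile_app:
        pts += 2  # official mobile app
    return pts

print(accessibility_points(True, True, True))    # → 10
print(accessibility_points(False, False, False)) # → 0 (API-only)
```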
Context Window
10 pts max
Dynamic: Field-relative on a log scale — each order of magnitude of context counts equally. Best context in the database → 10 pts, worst → 2 pts. Log scale prevents a single outlier (e.g. 10M context) from crushing all 200K–1M models to near-zero. Recalibrates as new models are added.
Current field range: 128,000 – 10,000,000 tokens
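The log-scale mapping above can be sketched by interpolating on log10 of the context size instead of the raw token count. Illustrative only; the function name and the choice of base-10 logarithm are our assumptions (any base gives the same result, since the base cancels in the ratio).

```python
import math

def context_points(tokens, worst=128_000, best=10_000_000):
    """Context window → 2–10 pts, field-relative on a log scale.

    Interpolating on log10(tokens) makes each order of magnitude count
    equally, so a 10M-token outlier does not crush all 200K–1M models
    to near-zero the way a linear scale would.
    """
    lo, hi = math.log10(worst), math.log10(best)
    frac = (math.log10(tokens) - lo) / (hi - lo)
    return 2.0 + frac * 8.0

# A 1M-token model lands near the middle of the log range (~5.8 pts),
# whereas on a linear scale it would earn well under 3 pts.
print(round(context_points(1_000_000), 1))
```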
Trust & Privacy
15 pts max
Company jurisdiction: US/EU = 7 pts, other = 4 pts, Chinese company = 2 pts. Strong published privacy policy: +5 pts. Open source / open weights: +3 pts — auditable weights, self-hostable, no vendor black box. Tool calling / MCP / parallel tool call support are shown as capability facts on each model page but do not contribute points.
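Trust & privacy is also an absolute checklist: a jurisdiction base plus two bonuses. A minimal sketch with illustrative names; the jurisdiction keys are our own labels for the three tiers named above.

```python
def trust_points(jurisdiction, strong_privacy_policy, open_weights):
    """Trust & privacy checklist, max 15 pts (7 + 5 + 3).

    `jurisdiction` is one of "us_eu", "other", or "china" — our labels
    for the three tiers described in the text.
    """
    base = {"us_eu": 7, "other": 4, "china": 2}[jurisdiction]
    if strong_privacy_policy:
        base += 5  # strong published privacy policy
    if open_weights:
        base += 3  # open source / open weights: auditable, self-hostable
    return base

print(trust_points("us_eu", True, True))      # → 15 (maximum)
print(trust_points("china", False, False))    # → 2
```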
Speed
10 pts max
Dynamic: Output tokens per second from the Artificial Analysis speed leaderboard (P50 median over 72 hours). Field-relative: fastest model in the database → 10 pts, slowest → 2 pts. Recalibrates automatically as faster inference hardware arrives.
Current field range: 35 t/s – 304 t/s
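Speed uses the same field-relative linear mapping, just on a 2–10 point range. A sketch under the same assumptions (linear interpolation, illustrative names), with defaults taken from the current field range quoted above.

```python
def speed_points(tokens_per_sec, slowest=35.0, fastest=304.0):
    """Median output speed → 2–10 pts, field-relative.

    Defaults are the current field range; when a faster model enters
    the database, `fastest` rises and every speed score recalibrates.
    """
    frac = (tokens_per_sec - slowest) / (fastest - slowest)
    return 2.0 + frac * 8.0

print(speed_points(169.5))  # → 6.0 (the exact midpoint of the field)
```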
Final Rating
total ÷ 10
Total score (max 100) divided by 10, rounded to one decimal place. This is the quality rating shown on all model pages and comparisons. No pricing influence.
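Putting the six dimensions together, the final rating is just their sum scaled to a 10-point scale. A sketch; the function name is illustrative, and we assume standard round-half rounding to one decimal place.

```python
def final_rating(intelligence, reliability, accessibility,
                 context, trust, speed):
    """Sum the six dimension scores (max 40+15+10+10+15+10 = 100)
    and divide by 10, rounded to one decimal place. Pricing never
    enters this number."""
    total = intelligence + reliability + accessibility + context + trust + speed
    return round(total / 10, 1)

# A hypothetical model that maxes every dimension:
print(final_rating(40, 15, 10, 10, 15, 10))  # → 10.0
```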
LLM Data Sources
- Intelligence Index: Artificial Analysis Intelligence Index v4.0. Composite of 10 evals: GPQA Diamond, HLE, Terminal-Bench Hard, τ²-Bench Telecom, GDPval-AA, SciCode, AA-LCR, AA-Omniscience, IFBench, CritPt. Standard/medium inference mode only.
- Reliability: Artificial Analysis AA-Omniscience Index — independently measures knowledge accuracy and hallucination. Range −100 to +100.
- Speed: Artificial Analysis speed leaderboard — P50 median output tokens per second over 72 hours.
- Pricing, accessibility, context window, trust: Manually verified from official provider pages and privacy policies. Updated monthly.
Image Generator Scoring
Scoring criteria for image models — output quality, resolution, speed, text rendering, commercial safety, and API availability — are in development.
Browse image models →
Video Generator Scoring
Scoring criteria for video models — output quality, max duration, resolution, audio support, and image-to-video capability — are in development.
Browse video models →