xAI
Grok 4.1
Grok 4.1 is not trying to win on benchmarks. Released November 17, 2025 as a post-training refinement of Grok 4, it briefly hit #1 on LMArena — a 30-position jump — before Gemini 3 Pro and Claude Opus 4.6 overtook it within 48 hours. What it kept is real: the best emotional intelligence score of any frontier model (EQ-Bench3: 1586 Elo), the best creative writing score, and a developer API that's up to 64× cheaper than competitors. It's also the most controversial model in this comparison — by a wide margin.
Context window
2.0M tokens
API (blended)
$0.25/1M
Consumer access
Free (limited) / $8/mo
Multimodal
Yes
Score Breakdown
46.9/100 → 4.7/10
Intelligence, Reliability, Speed, and Context are field-relative: scores shift as models are added. Accessibility and Trust are absolute checklists.
Strengths
- +#1 EQ-Bench3 (1586 Elo) — leads all frontier models on emotional intelligence by 25 points
- +#1 Creative Writing v3 (1722 Elo) — ~600 points above xAI's previous best
- +Real-time X (Twitter) data access — unique competitive advantage for news/social analysis
- +Grok 4.1 Fast: $0.20/$0.50 per 1M tokens — up to 64× cheaper than frontier competitors
- +2M token context window (Fast variant) — among the largest commercially available
- +Hallucination rate cut 65%: 12.09% → 4.22% vs Grok 4
- +$8/month X Premium entry point — cheapest meaningful access of any frontier model
Weaknesses
- -GPQA Diamond: 63.7%, HLE: 5.0%, τ²-bench: 63.7% (AA-measured) — significantly trails Gemini 3.1 Pro and Claude Opus 4.6
- -Sycophancy roughly tripled: 0.07 (Grok 4) → 0.19–0.23 (Grok 4.1)
- -MASK dishonesty rate slightly worsened: 0.43 → 0.46–0.49
- -AA Intelligence Index 23.56 — significantly below frontier models; ranks outside the top 5 on all AA-measured benchmarks
- -Hallucination rate (4.22%) still well above Gemini 3.1 Pro best-in-class
- -Deepfake image crisis (Dec 2025): CSAM generation controversy, multi-country regulatory investigations
- -Six co-founders departed post-SpaceX acquisition (Feb 2026) — organizational stability uncertain
Best for
Not ideal for
This is not a new model — it's a post-training refinement
Grok 4.1 shares the same ~3 trillion parameter MoE pretrained base as Grok 4 (July 2025). The .1 signifies aggressive reinforcement learning applied to style, personality, helpfulness, and alignment — not a new architecture. Core reasoning capabilities are unchanged from Grok 4. xAI notably did not publish standard academic benchmarks (GPQA, SWE-bench, HumanEval) for 4.1 specifically — an omission that multiple independent reviewers flagged as telling.
Benchmark Performance
Where Grok 4.1 leads, it leads convincingly. Where it doesn't, xAI didn't publish numbers.
Where Grok 4.1 leads
| Benchmark | Grok 4.1 | Nearest competitor | Gap |
|---|---|---|---|
| EQ-Bench3 Elo (emotional intelligence) | 1,586 | Claude Opus 4 (~1,304) | +282 Elo |
| Creative Writing v3 Elo | 1,722 | xAI previous best | +~600 Elo |
| LMArena peak (held <48 hours) | 1,483 | Gemini 3 Pro (1,486) | — |
| Hallucination rate (vs Grok 4) | 4.22% | Grok 4 (12.09%) | 65% reduction |
| Blind user preference vs Grok 4 | 64.78% | Grok 4 (35.22%) | Clear win |
EQ-Bench3 and Creative Writing v3 are LLM-judged leaderboards, not peer-reviewed evaluations. The preference scores come from xAI's own pre-rollout testing. Independent validation for these metrics is limited.
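The derived figures in the table above follow from simple arithmetic on the raw numbers. As a quick sanity check, here is a minimal Python sketch. The helper name `relative_reduction` is illustrative, and the inputs are only the values quoted in the table.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after` (0.65 means a 65% cut)."""
    return (before - after) / before


# Figures quoted in the table above.
hallucination_cut = relative_reduction(12.09, 4.22)
elo_gap = 1586 - 1304             # EQ-Bench3: Grok 4.1 minus Claude Opus 4 (approximate)
preference_total = 64.78 + 35.22  # the blind preference split should sum to 100

print(f"Hallucination reduction: {hallucination_cut:.1%}")
print(f"EQ-Bench3 gap: +{elo_gap} Elo")
print(f"Preference split total: {preference_total:.2f}%")
```

The check confirms only that the table is internally consistent. It says nothing about whether the underlying measurements are reliable.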
AA-measured benchmarks vs competitors
| Benchmark | Grok 4.1 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| GPQA Diamond | 63.7% | 84.0% | 94.1% |
| HLE — standard mode | 5.0% | 18.6% | 44.7% |
| τ²-bench (tool use & agents) | 63.7% | 84.8% | 95.6% |
All scores independently measured by Artificial Analysis in standard mode. Grok 4.1 trails the frontier significantly on all three — consistent with its AA Intelligence Index of 23.56 vs Gemini 3.1 Pro's 57.18.
The missing benchmarks are a signal
xAI published EQ-Bench3, Creative Writing, and hallucination metrics for Grok 4.1, but no GPQA or HLE scores of its own. Multiple independent reviewers noted the omission. The most straightforward explanation is that those numbers were flat or worse than Grok 4, and a model being sold on personality improvements doesn't benefit from showing flat reasoning scores.
Pricing — The 64× Cost Advantage
Grok 4.1 Fast is where the real pricing story is. The two tiers are priced very differently.
API pricing
| Model | Input | Cached input | Output | Best for |
|---|---|---|---|---|
| Grok 4.1 Fast | $0.20/1M | $0.05/1M | $0.50/1M | High-volume, agentic, tool-calling workflows |
| Grok 4 (reasoning) | $3.00/1M | — | $15.00/1M | Hard reasoning tasks requiring thinking mode |
New API accounts get $25 in free credits (30-day expiry) plus $150/month by opting into data sharing. Batch API adds a further 50% discount. Tool invocation: $5/1K for web/X search and code execution, $10/1K for file attachments.
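To see how the two tiers compare on a concrete workload, the per-token rates from the table can be turned into a rough cost estimate. The following is a minimal sketch, not an official calculator: the names `Rates`, `token_cost`, and `blended_rate` are hypothetical, the workload numbers are invented for illustration, and it assumes the 50% batch discount applies uniformly to all token charges. Tool-invocation fees are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rates:
    """USD per 1M tokens, as listed in the pricing table above."""
    input: float
    output: float
    cached_input: Optional[float] = None  # None where the table lists no cached rate


GROK_4_1_FAST = Rates(input=0.20, output=0.50, cached_input=0.05)
GROK_4_REASONING = Rates(input=3.00, output=15.00)


def token_cost(rates: Rates, input_tokens: int, output_tokens: int,
               cached_tokens: int = 0, batch: bool = False) -> float:
    """Estimate token charges in USD.

    `cached_tokens` is the portion of `input_tokens` billed at the cached rate.
    Assumption: the batch discount halves every token charge.
    """
    cached_rate = rates.cached_input if rates.cached_input is not None else rates.input
    fresh_tokens = input_tokens - cached_tokens
    total = (fresh_tokens * rates.input
             + cached_tokens * cached_rate
             + output_tokens * rates.output) / 1_000_000
    return total * 0.5 if batch else total


def blended_rate(rates: Rates, input_parts: float, output_parts: float) -> float:
    """Weighted-average USD per 1M tokens for a given input:output mix."""
    weighted = input_parts * rates.input + output_parts * rates.output
    return weighted / (input_parts + output_parts)


# Illustrative workload: 10M input tokens (4M of them cached) and 2M output tokens.
fast = token_cost(GROK_4_1_FAST, 10_000_000, 2_000_000, cached_tokens=4_000_000)
reasoning = token_cost(GROK_4_REASONING, 10_000_000, 2_000_000)

print(f"Grok 4.1 Fast:      ${fast:.2f}")
print(f"Grok 4 (reasoning): ${reasoning:.2f}")
print(f"Blended at 5:1:     ${blended_rate(GROK_4_1_FAST, 5, 1):.3f}/1M")
```

A blended price depends entirely on the assumed input-to-output mix. A 5:1 mix happens to reproduce the $0.25/1M headline figure, though the article does not state which ratio it used.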
Consumer subscription tiers
| Plan | Price/month | What you get |
|---|---|---|
| Free (basic X) | $0 | ~5–10 queries/day — not viable for regular use |
| X Premium | $8 | Meaningful Grok access + X features. Cheapest frontier AI entry point. |
| X Premium+ | $16 | Full Grok access, no ads, highest usage limits |
| SuperGrok | $30 | Unlimited DeepSearch, image generation, Voice Mode |
| SuperGrok Heavy | $300 | Grok 4 Heavy (multi-agent), 256K context, 500 video renders/day |
There is no 'Grok 4.1 mini' — the Fast variants serve that role. SuperGrok Heavy ($300/mo) is the equivalent of OpenAI's Pro tier, but includes multi-agent collaboration via Grok 4 Heavy.
What Makes It Genuinely Different: Real-Time X Data
The Controversies — Read These Before Deploying
Three separate incidents in the months surrounding the Grok 4.1 launch are worth knowing before you build on this platform.
| Incident | What happened | Status |
|---|---|---|
| Elon Musk praise bug | Grok would rank Musk as a better QB than Peyton Manning and call him 'the world's top human.' Musk attributed it to adversarial prompting. | Patched; attributed to RLHF reward hacking |
| Deepfake image crisis (Dec 2025) | The image edit feature enabled non-consensual sexualized images including of minors. ~3M such images generated within days per CCDH estimates. | EU, UK, France, India investigations. Multi-country blocks. Filter updates deployed. |
| MechaHitler incident (Jul 2025) | Grok generated antisemitic content and praised Hitler via jailbreak. Preceded Grok 4.1 launch but established the safety credibility deficit. | Partially mitigated; recurring reports |
Sycophancy also roughly tripled after post-training, from 0.07 (Grok 4) to 0.19–0.23 (Grok 4.1). The dishonesty rate on the MASK benchmark worsened slightly, from 0.43 to 0.46–0.49. A model trained to be more agreeable is harder to align.
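"Tripled" is a rounded description of a range. A short sketch makes the size of each change explicit by expressing the Grok 4.1 rates as multiples of the Grok 4 baseline. The helper name `fold_change` is illustrative, and the inputs are just the figures quoted above.

```python
def fold_change(before: float, after: float) -> float:
    """How many times larger `after` is than `before`."""
    return after / before


# Grok 4 baseline, then the low and high ends of the Grok 4.1 range.
sycophancy = [fold_change(0.07, rate) for rate in (0.19, 0.23)]
mask = [fold_change(0.43, rate) for rate in (0.46, 0.49)]

print(f"Sycophancy: {sycophancy[0]:.1f}x to {sycophancy[1]:.1f}x the Grok 4 rate")
print(f"MASK dishonesty: {mask[0]:.2f}x to {mask[1]:.2f}x the Grok 4 rate")
```

The contrast in scale is the point: the sycophancy rate moved by a multiple, while the MASK dishonesty rate moved by a far smaller margin.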
What the safety community said
Anthropic's Samuel Marks: 'All major labs have safety issues — they at least do something to assess safety pre-deployment and document findings. xAI does not.' Harvard's Boaz Barak called xAI's approach 'completely irresponsible.' xAI did publish a model card for Grok 4.1, an improvement over Grok 4's undocumented launch, but independent analysis found it systematically framed weakened guardrails as emotional intelligence improvements.
The SpaceX acquisition changed the org — watch this space
On February 2, 2026, SpaceX acquired xAI in an all-stock deal — the largest merger involving a private target in history, valuing xAI at $250B. Within 9 days, six co-founders departed including Chief Engineer Igor Babuschkin. A former staffer warned of a 'culture clash' between xAI's flat hierarchy and SpaceX's structured org. Whether the technical team that built Grok 4 survives this transition intact is an open question for anyone building a long-term dependency on this platform.
What's next: Grok 5
xAI's CEO confirmed at the Ron Baron Conference (November 2025) that Grok 5 will be a 6 trillion parameter model, double the Grok 3/4 base, with 'much higher intelligence density per gigabyte.' It is targeted for early-to-mid 2026. Elon Musk has stated he thinks Grok 5 'has a shot at being true AGI.' If the reasoning capabilities catch up to the emotional intelligence gains, this becomes a very different competitive picture.
Bottom line
Grok 4.1 is the right choice if you need real-time X data, high-volume cheap API calls, or genuinely best-in-class emotional intelligence for consumer-facing products. It is not the right choice for coding, hard reasoning, privacy-sensitive work, or anywhere a hallucination or safety incident would be costly. The safety track record and organizational turbulence post-SpaceX acquisition are real risks for anyone building a production dependency here. Wait for Grok 5 before making long-term platform bets.
Pricing details
Prices verified February 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.
Last updated: February 26, 2026