xAI
Grok 4.2
Grok 4.2 ships a genuinely novel architecture, with four AI agents debating every hard query in real time, and it is in public beta with weekly updates (Beta 2 dropped March 3). But xAI has still published zero official benchmarks, half the founding team has left, and the model has been blocked or placed under regulatory investigation in seven countries. The rapid iteration is real; the lack of verifiable data is also real.
| Spec | Value |
|---|---|
| Context window | 256K tokens |
| API (blended) | $9.00 / 1M tokens |
| Consumer access | $30/mo |
| Multimodal | Yes |
Score Breakdown
51.5/100 → 5.2/10
Intelligence, Reliability, Speed, and Context are field-relative: scores shift as models are added. Accessibility and Trust are absolute checklists.
Strengths
- +First commercial model with multi-agent inference baked in — 4 agents debate every complex query in real time
- +Beta 2 (March 3): improved instruction following, reduced capability hallucinations, better LaTeX, more reliable multi-image rendering
- +Rapid-learning architecture: weekly improvement cycles with published release notes — Beta 2 shipped within days of Beta 1
- +Scales to 16 agents (SuperGrok Heavy) for the most demanding workloads
- +Medical document analysis via photo upload (lab reports, prescriptions, imaging)
- +Real-time X data access — unique for news/social/trend-aware workflows
- +Deepfake crisis prompted genuine (if belated) safety improvements under regulatory pressure
- +50% of training compute spent on RL — unusually high ratio vs other labs
Weaknesses
- -No official benchmarks published — xAI has released no model card, blog post, or technical paper
- -Public beta runs on 500B 'small' model; full Grok 4.2 still training as of March 2026
- -API not yet available — consumer-only access via SuperGrok ($30/mo) and above
- -Hallucination rate still above the frontier best (<1% for GPT-5.2): it fell from ~12% (Grok 4) to ~4.2% (Grok 4.1), and Beta 2 claims a further reduction
- -Deepfake crisis: 6,700+ NSFW images/hour generated in Jan 2026; multi-country regulatory investigations
- -Six of twelve co-founders departed post-SpaceX acquisition — including research and reasoning leads
- -SpaceX acquisition creates unprecedented single-person control of AI, space, and social media
- -Sycophancy tripled from Grok 4 → 4.1; trend unclear for 4.2
- -Not yet on Artificial Analysis — no independently verified benchmark data available as of March 2026
Best for
Not ideal for
⚠️ Beta 2 — still no official benchmarks published
As of March 5, 2026, xAI has published no model card, blog post, or technical paper for Grok 4.2. Beta 2 shipped March 3 with five targeted fixes (instruction following, capability hallucination reduction, LaTeX quality, image search precision, multi-image reliability). All benchmark numbers in this review remain provisional — sourced from third-party reviews and community reports. The public beta still runs on xAI's 500B 'small' foundation model; the full-size Grok 4.2 is still training.
Beta 2 Updates (March 3, 2026)
Beta 2 shipped five targeted fixes based on user feedback from the first week of public testing.
| Fix | What changed |
|---|---|
| Instruction following | Better adherence to multi-part, structured requests. Tasks requiring strict formatting rules complete correctly on first attempt more reliably. |
| Capability hallucination | Reduced instances where the model claims it can do something it can't. Critical for agentic workflows where false capability claims cause cascading failures. |
| LaTeX / scientific text | Cleaner typesetting for math, chemistry notation, and physics formulas. Less manual correction needed before use in academic documents. |
| Image search triggers | Recalibrated the decision boundary for when to activate image search vs plain text response. |
| Multi-image rendering | Fixed inconsistent rendering when users requested multiple images in a single response. |
These fixes apply across all four agents and their coordination layer, not just a single inference pass. The rapid iteration cycle (weekly updates with release notes) is a genuine differentiator from competitors' quarterly cadence.
The 4-agent system: what it actually is
Cost efficiency claim
xAI claims the 4-agent system costs only 1.5–2.5× a single inference pass (not 4×) thanks to shared model weights, prefix/KV cache reuse, and RL-optimized debate rounds. This claim is not independently verified. SuperGrok Heavy ($300/month) scales to 16 agents for maximum reasoning depth.
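As a rough illustration of what the claim would mean if it holds, the sketch below compares the claimed 1.5–2.5× multiplier with the naive 4× cost of four fully independent passes. The function names and the normalized single-pass cost of 1.0 are illustrative assumptions, not xAI figures.

```python
def multi_agent_cost(single_pass_cost: float, multiplier: float) -> float:
    """Cost of one multi-agent query, given a multiplier over a single pass."""
    return single_pass_cost * multiplier


# Assumption: a normalized single-pass cost of 1.0 unit (not a real price).
BASE = 1.0
NAIVE = multi_agent_cost(BASE, 4.0)         # four independent passes
CLAIMED_LOW = multi_agent_cost(BASE, 1.5)   # low end of the unverified claim
CLAIMED_HIGH = multi_agent_cost(BASE, 2.5)  # high end of the unverified claim


def saving_vs_naive(cost: float) -> float:
    """Fraction saved relative to running four independent passes."""
    return 1.0 - cost / NAIVE


print(f"{saving_vs_naive(CLAIMED_LOW):.1%}")   # 62.5%
print(f"{saving_vs_naive(CLAIMED_HIGH):.1%}")  # 37.5%
```

In other words, the claim amounts to a 37.5–62.5% saving over the naive approach, which is the number an independent measurement would need to confirm.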
⚠️ Grok 4.2 is not yet on Artificial Analysis
Grok 4.2 launched as a public beta in February 2026 — Artificial Analysis has not yet independently measured it. xAI has published no official benchmark results. All third-party claims are unverified. For verified Grok benchmarks, see the Grok 4.1 review. For AA-measured comparisons, see the competitors below.
Competitor benchmarks (AA-measured)
| Benchmark | Claude Opus 4.6 | Gemini 3.1 Pro | GPT-5.2 |
|---|---|---|---|
| GPQA Diamond | 84.0% | 94.1% | 90.3% |
| HLE — standard mode | 18.6% | 44.7% | 35.4% |
| τ²-bench (tool use & agents) | 84.8% | 95.6% | 84.8% |
These are the models Grok 4.2 competes against, all independently measured by Artificial Analysis. Grok 4.2 will be added once AA evaluates it.
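For readers who want to work with these numbers directly, here is a minimal sketch that loads the table above into a dictionary and reports the leader and the top-to-bottom spread for each benchmark. The variable names are illustrative; the values are copied from the table.

```python
# Scores (percent) copied from the competitor table above.
scores = {
    "GPQA Diamond": {"Claude Opus 4.6": 84.0, "Gemini 3.1 Pro": 94.1, "GPT-5.2": 90.3},
    "HLE (standard mode)": {"Claude Opus 4.6": 18.6, "Gemini 3.1 Pro": 44.7, "GPT-5.2": 35.4},
    "tau2-bench": {"Claude Opus 4.6": 84.8, "Gemini 3.1 Pro": 95.6, "GPT-5.2": 84.8},
}

# Highest-scoring model per benchmark.
leaders = {bench: max(row, key=row.get) for bench, row in scores.items()}

# Gap in percentage points between the best and worst score per benchmark.
spreads = {
    bench: round(max(row.values()) - min(row.values()), 1)
    for bench, row in scores.items()
}

for bench in scores:
    print(f"{bench}: {leaders[bench]} leads, spread {spreads[bench]} pts")
```

Once Artificial Analysis publishes Grok 4.2 results, adding a fourth key to each row would slot it into the same comparison.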
Grok 4 verified scores (4.2's floor)
| Benchmark | Grok 4 (verified) | Notes |
|---|---|---|
| AIME 2025 | 100% | Perfect — same as frontier leaders |
| HMMT 2025 math | 96.7% | High-school team math tournament |
| GPQA Diamond | 87–88% | Graduate-level science reasoning |
| AA Intelligence Index | 73 | Was #1 at July 2025 launch |
| LMArena (Grok 4.1 Thinking) | 1483 Elo peak | Slipped to ~1475 (#4) by Feb 2026 |
Grok 4.2 is built on the Grok 4 foundation. Treat these scores (Grok 4, plus one Grok 4.1 Thinking entry for LMArena) as the expected floor: 4.2's multi-agent system may improve on them, but the benchmarks have not been rerun.
Alpha Arena stock trading: 12.11% in 14 days
In a live AI stock-trading competition, Grok 4.2 returned 12.11% over 14 days while GPT-5.1 and Gemini 3 Pro posted losses. The dollar figure reported alongside it ($10,000 → ~$12,193) implies a larger gain than 12.11%, so one of the two numbers is off. Four Grok variants placed in the top six. Dramatic, but this reflects a single two-week window under specific market conditions. It is not a reproducible benchmark and should not be extrapolated to general financial reasoning ability.
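The percentage and dollar figures quoted above can be checked against each other with simple return arithmetic. The helper names below are illustrative; the inputs are the figures from the paragraph.

```python
def final_value(start: float, return_pct: float) -> float:
    """Ending balance after a simple percentage return."""
    return start * (1 + return_pct / 100)


def implied_return_pct(start: float, end: float) -> float:
    """Percentage return implied by a start and end balance."""
    return (end / start - 1) * 100


START = 10_000.0

# What a 12.11% return on $10,000 would actually produce.
print(round(final_value(START, 12.11), 2))          # 11211.0

# What return a $12,193 ending balance would imply.
print(round(implied_return_pct(START, 12_193), 2))  # 21.93
```

A 12.11% return corresponds to roughly $11,211, while $12,193 would correspond to roughly 21.9%, so the two reported figures cannot both describe the same account.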
Access tiers
| Plan | Monthly price | Grok 4.2 access | Agents |
|---|---|---|---|
| Free (basic X) | $0 | No — older Grok only | — |
| SuperGrok | $30 / mo ($300/yr) | ✓ Beta access | 4 |
| X Premium+ | $40 / mo | ✓ Beta access + X features | 4 |
| SuperGrok Heavy | $300 / mo | ✓ Heavy variant (still training) | 16 |
| Grok Business | $30 / user / mo | ✓ Enterprise + compliance | 4 |
| Grok Enterprise | Custom | ✓ SOC 2, HIPAA, zero data retention | 4–16 |
| API | Coming soon | Not yet available | — |
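A quick way to compare the plans is to work out the annual discount on SuperGrok and the price-per-agent step up to SuperGrok Heavy. This is a small sketch using only the prices and agent counts in the table; the variable names are illustrative.

```python
# Prices (USD) and agent counts taken from the access-tier table.
SUPERGROK_MONTHLY = 30
SUPERGROK_ANNUAL = 300
HEAVY_MONTHLY = 300
SUPERGROK_AGENTS = 4
HEAVY_AGENTS = 16

# Cost of a year on the monthly plan versus the annual plan.
year_at_monthly_rate = SUPERGROK_MONTHLY * 12
annual_saving = year_at_monthly_rate - SUPERGROK_ANNUAL
annual_saving_pct = annual_saving / year_at_monthly_rate * 100

# How much more Heavy costs relative to how many more agents it runs.
price_ratio = HEAVY_MONTHLY / SUPERGROK_MONTHLY
agent_ratio = HEAVY_AGENTS / SUPERGROK_AGENTS

print(f"Annual plan saves ${annual_saving} ({annual_saving_pct:.1f}%)")
print(f"Heavy: {price_ratio:.0f}x the price for {agent_ratio:.0f}x the agents")
```

On these numbers the annual plan saves $60 a year, and Heavy costs ten times as much for four times the agents, so its per-agent price is 2.5× that of SuperGrok.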
Key capabilities
Active controversies (March 2026)
| Issue | Status | Exposure |
|---|---|---|
| Deepfake / CSAM generation | 6,700+ NSFW images/hr in Jan 2026 analysis; 10% depicted minors | Indonesia, Malaysia, Philippines blocked Grok; UK, Ireland, Australia, France investigating |
| Antisemitism / MechaHitler | July 2025 — ADL condemned; Turkey restricted access; bipartisan congressional letter | Poland referred to EC; system prompt updated post-incident |
| SpaceX acquisition (Feb 2, 2026) | xAI is now a wholly-owned SpaceX subsidiary — single-person control of AI, space, and social media | Tesla $2B investment lawsuits (self-dealing); SpaceX IPO planned mid-2026 |
| Founder departures | 6 of 12 co-founders left, including research and reasoning leads (Jimmy Ba, Tony Wu) | Safety team already small; C-suite turnover: GC, CFO, product engineering lead |
| Colossus emissions | Gas turbines in a predominantly Black Memphis neighborhood; capacity approaching ~2 GW | EPA revised permit rules Jan 2026 after xAI used 'portable generator' workaround |
These are not background risks — they are active regulatory and legal situations that any enterprise evaluation should account for.
Safety team context
CNN reported that Musk pushed back against guardrails internally and that xAI's safety team 'already small compared to competitors, lost several staffers' before the deepfake crisis. Common Sense Media rated Grok as 'among the most unsafe' chatbots for children and teens. For enterprise and government buyers, VentureBeat's assessment holds: 'The issue isn't infrastructure — it's optics.'
The infrastructure and funding picture
xAI raised $20B in a Series E (Jan 2026, investors: Nvidia, Cisco, Fidelity, Qatar Investment Authority, Abu Dhabi's MGX) at a $230B valuation — then was acquired by SpaceX on February 2. The company burns roughly $1B/month and reported a $1.46B net loss in Q3 2025. Revenue is estimated at ~$100M for 2024, targeting $500M ARR in 2025. Training runs on Colossus in Memphis (~555,000 GPUs now, target 1M by late 2026). Grok 4.2 reportedly devoted ~50% of training compute to RL, compared to the 20–30% typical at other labs.
What's coming: Grok 5
Reporting on Grok 5 (xAI's next major model) points to approximately 6 trillion parameters, roughly 12× Grok 4's estimated scale, with a target of early-to-mid 2026. Colossus 2 (550,000 Blackwell-generation chips) and a planned third facility are being built to support it. If the parameter count is accurate and training goes well, Grok 5 would represent a significant capability jump over anything currently available.
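The "12×" ratio can be turned back into an implied Grok 4 parameter estimate with one division. This is only a consistency check on the figures quoted above, both of which are unconfirmed estimates; the variable names are illustrative.

```python
# Figures quoted in the paragraph above (both are unconfirmed estimates).
GROK5_PARAMS = 6e12   # ~6 trillion parameters
SCALE_RATIO = 12      # "roughly 12x Grok 4's estimated scale"

# Grok 4 parameter count implied by those two numbers.
implied_grok4_params = GROK5_PARAMS / SCALE_RATIO

print(f"Implied Grok 4 scale: {implied_grok4_params / 1e9:.0f}B parameters")
```

The result, about 500B parameters, lines up with the 500B figure this review cites for the model behind the public beta.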
Bottom line
Grok 4.2 is the most architecturally interesting release of early 2026 — baked-in multi-agent inference, a rapid-learning deployment model, and real-time X data access are genuinely differentiated. Beta 2 shows the weekly iteration cycle is real, with concrete fixes landing fast. But it's still a public beta with no official benchmarks, running on a 500B 'small' variant of the full model, without an API, at a $30/month minimum. If you're an early adopter willing to use unverified tooling, it's worth exploring. If you need verified performance data or API access, wait for the full release and AA evaluation.
Pricing details
Subscription plans
API pricing
Prices verified February 2026. LLM pricing changes frequently — verify at the provider's site before budgeting.
Last updated: March 5, 2026