Grok 4

Closed Source

Closed Weights

Parameters

1.7T

Context Length

256K

Modality

Multimodal

Architecture

Dense

License

Proprietary

Release Date

9 Jul 2025

Knowledge Cutoff

Dec 2024

Evaluation Benchmarks

Rank

#24

Category	Benchmark	Score	Rank
QA Assistant	ProLLM QA Assistant	0.985	🥇 1
Summarization	ProLLM Summarization	0.976	🥉 3
StackUnseen	ProLLM Stack Unseen	0.886	5
Coding	Aider Coding	0.80	6
Graduate-Level QA	GPQA	0.875	6
Reasoning	LiveBench Reasoning	0.79	16
Data Analysis	LiveBench Data Analysis	0.63	17
Mathematics	LiveBench Mathematics	0.83	18
Professional Knowledge	MMLU Pro	0.85	19
Coding	LiveBench Coding	0.73	28
Agentic Coding	LiveBench Agentic	0.30	45

Rankings

Overall Rank

#24

Coding Rank

#12

About Grok 4

Grok 4 is xAI's most intelligent model, trained with reinforcement learning at unprecedented scale on the 200,000-GPU Colossus cluster. It features native tool use (code interpreter and web browsing), real-time search across X and the web, and state-of-the-art performance on Humanity's Last Exam (50.7% on the text-only subset, with tools). It scores 100% on AIME 2025, 99.4% on HMMT 2025, and 15.9% on ARC-AGI-2, and leads Vending-Bench with a net worth of $4,694.15. The model supports a 256K-token context window with multimodal understanding and is available through the API with advanced reasoning, coding, and visual processing capabilities.

Technical Specifications

Attention

Attention Structure

Multi-Head Attention

Attention Heads

Key-Value Heads

Attention Head Dimension

Position Embedding

Absolute Position Embedding

RoPE Theta

Sliding Window Attention

Sliding Window Size

Sliding Window Ratio

Linear Attention

Linear Attention Ratio

Normalization

RMS Normalization

Activation Function

Dimensions

Hidden Dimension Size

Number of Layers

FFN Intermediate Size (Dense)

Multi-Token Prediction Heads

Tokenizer

Vocabulary Size

Model Integrity

Total Score

39 / 100

Upstream

9.0 / 30

Model

19.0 / 40

Downstream

11.0 / 30

Grok 4 Model Integrity Report

Audit Note

Grok 4 exhibits a 'black box' transparency profile, characterized by high-performance claims backed by minimal technical documentation. While the scale of its training infrastructure is publicly acknowledged, critical details regarding architecture, dataset composition, and evaluation methodology remain undisclosed. This lack of verifiable evidence forces users to rely on marketing assertions rather than reproducible scientific data.

Upstream

9.0 / 30

Architectural Provenance

3.0 / 10

While xAI identifies Grok 4 as a 'hybrid' architecture combining dense transformer layers with modular attention mechanisms, there is no technical paper or detailed documentation describing the specific pretraining methodology or architectural modifications. Official sources mention it is a 'large-scale generative model' but fail to provide the depth required for a high score, such as specific layer configurations or the exact nature of the 'modular attention' mentioned in marketing materials.

Dataset Composition

2.0 / 10

Dataset disclosure is extremely limited. Official documentation vaguely refers to 'publicly available Internet data,' 'data from users or contractors,' and 'internally generated data.' There is no public breakdown of dataset proportions (e.g., % code, % web), no detailed filtering/cleaning methodology, and no sample data available for inspection. The claim of 'verifiable training data' is an unverifiable assertion without public access to the verification logs or sources.

Tokenizer Integrity

4.0 / 10

The tokenizer is not publicly available for independent inspection or download. While vocabulary size and specific tokenization logic for Grok 4 remain undocumented, API documentation provides basic token accounting (e.g., 1 token ≈ 0.75 words) and confirms support for multilingual text and image embeddings. However, without access to the tokenizer files or a technical specification of the vocabulary, it remains a 'black box' for developers.
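Because the tokenizer itself is unavailable, the only way to budget tokens ahead of a request is the rule of thumb quoted above. The following is a minimal sketch of that heuristic, not xAI's tokenizer: the function names are hypothetical, the word count is a plain whitespace split, and the 256,000-token default is the context length listed on this page. Real counts will differ, especially for code and non-English text.

```python
import math

# Rule of thumb quoted in the API documentation: 1 token is about 0.75 words.
# This is an approximation, not the actual (unpublished) tokenizer.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (hypothetical helper)."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def fits_context(text: str, context_tokens: int = 256_000) -> bool:
    """Check the estimate against the context length listed on this page."""
    return estimate_tokens(text) <= context_tokens


if __name__ == "__main__":
    sample = "Grok 4 supports a long context window"
    print(estimate_tokens(sample), fits_context(sample))
```

An estimate like this is only suitable for coarse budgeting; the token usage reported by the API after a call is the figure to rely on.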

Model

19.0 / 40

Parameter Density

2.0 / 10

Third-party reports and some xAI-affiliated materials claim a parameter count of approximately 1.7 trillion. However, official xAI documentation for Grok 4 conspicuously avoids stating a definitive parameter count or the ratio of active-to-total parameters. The description of a 'hybrid' and 'modular' design suggests a Mixture-of-Experts (MoE) or sparse architecture, but the lack of official disclosure on active parameters makes the 1.7T figure unverifiable and potentially misleading.

Training Compute

6.0 / 10

xAI provides more detail here than in other categories, explicitly naming the 'Colossus' supercomputer and its cluster of 200,000 NVIDIA GPUs as the training hardware. Independent analysis by Epoch AI provides detailed estimates of 246 million H100-hours and 310 GWh of electricity. While xAI itself does not publish a formal carbon footprint or exact cost, the public disclosure of the hardware scale allows third parties to estimate training compute with reasonable confidence.
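The figures quoted above can be cross-checked with simple arithmetic. The sketch below derives two implied quantities: the average power draw per GPU-hour, and the wall-clock duration if all 200,000 GPUs ran for the whole job. The second quantity rests on an assumption the source does not state (full-cluster use at constant utilization), so treat it as a lower bound on training time rather than a reported figure. Function names are illustrative.

```python
# Figures quoted in the text above (third-party estimates, not xAI disclosures).
H100_HOURS = 246e6       # estimated GPU-hours
ENERGY_GWH = 310         # estimated electricity use
CLUSTER_GPUS = 200_000   # cluster size named by xAI


def avg_watts_per_gpu(energy_gwh: float, gpu_hours: float) -> float:
    """Implied average power per GPU-hour, including any facility overhead."""
    return energy_gwh * 1e9 / gpu_hours  # GWh -> Wh, then Wh / h = W


def wall_clock_days(gpu_hours: float, gpus: int) -> float:
    """Duration in days, ASSUMING every GPU is busy for the entire run."""
    return gpu_hours / gpus / 24


if __name__ == "__main__":
    print(f"{avg_watts_per_gpu(ENERGY_GWH, H100_HOURS):.0f} W per GPU")
    print(f"{wall_clock_days(H100_HOURS, CLUSTER_GPUS):.2f} days")
```

Under these assumptions the numbers work out to roughly 1.26 kW per GPU and about 51 days of wall-clock time, which suggests the energy estimate includes cooling and other overhead beyond the accelerators alone.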

Benchmark Reproducibility

3.0 / 10

xAI reports impressive scores on benchmarks like ARC-AGI-2 (15.9%) and GPQA (87.5%), but does not release the evaluation code, exact prompts, or few-shot examples used to achieve these results. While some third-party verification exists (e.g., Artificial Analysis), the lack of a public reproduction repository or detailed methodology for 'Expert' vs 'Auto' modes prevents independent scientific validation.

Identity Consistency

8.0 / 10

Grok 4 demonstrates strong identity consistency, correctly identifying itself and its version (grok-4-0709) across API and web interfaces. It is transparent about its nature as an AI developed by xAI. There are no widespread reports of the model claiming to be a competitor's product (e.g., GPT-4), though it does occasionally reference the views of xAI's leadership when prompted on controversial topics.

Downstream

11.0 / 30

License Clarity

2.0 / 10

The model is released under a 'Proprietary' license with no open-weights or open-source version available. While the Terms of Service state that users 'own the output,' there are conflicting interpretations regarding commercial use and data ownership in the broader xAI/X ecosystem. The lack of a standard open license (like Apache 2.0 used for Grok-1) and the presence of restrictive usage terms for 'SuperGrok' tiers result in a low transparency score.

Hardware Footprint

5.0 / 10

Basic hardware requirements are indirectly available through API specifications, documenting a 256,000-token context window and tiered pricing for 'extended context' usage. However, because the model weights are not public, there is no official guidance on VRAM requirements for local deployment or quantization accuracy tradeoffs. Information is limited to API-based consumption metrics.

Versioning Drift

4.0 / 10

xAI maintains a basic release log (e.g., Grok 4.1, Grok 4 Fast), but lacks a detailed changelog or technical documentation of model drift. Updates are often 'silent' or announced via social media with minimal technical detail on how weights or safety filters have changed. There is no public mechanism to access or pin specific sub-versions of the model to prevent behavior drift in production.
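The section and total scores in this report are consistent with a plain sum of the ten sub-scores. The sketch below checks that arithmetic; the summation rule is an assumption inferred from the numbers, since the page does not state how scores are aggregated, and the dictionary layout is purely illustrative.

```python
# Sub-scores as listed in the report above (each out of 10).
SCORES = {
    "Upstream": {
        "Architectural Provenance": 3.0,
        "Dataset Composition": 2.0,
        "Tokenizer Integrity": 4.0,
    },
    "Model": {
        "Parameter Density": 2.0,
        "Training Compute": 6.0,
        "Benchmark Reproducibility": 3.0,
        "Identity Consistency": 8.0,
    },
    "Downstream": {
        "License Clarity": 2.0,
        "Hardware Footprint": 5.0,
        "Versioning Drift": 4.0,
    },
}


def section_totals(scores: dict) -> dict:
    """Sum sub-scores per section (assumed aggregation rule)."""
    return {name: sum(items.values()) for name, items in scores.items()}


def total_score(scores: dict) -> float:
    """Sum of all section totals, out of 100."""
    return sum(section_totals(scores).values())


if __name__ == "__main__":
    print(section_totals(SCORES))
    print(total_score(SCORES))
```

The result matches the reported 9.0 / 30, 19.0 / 40, and 11.0 / 30, for a total of 39 / 100.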

Resources

Official Documentation

About Grok 4

The Grok 4 series comprises xAI's frontier models, trained with reinforcement learning at unprecedented scale on the 200,000-GPU Colossus cluster. The series demonstrates state-of-the-art performance in reasoning, coding, and multimodal understanding, with native tool use. It features real-time search integration across X and the web, advanced reasoning through scaled RL training, and industry-leading performance on academic benchmarks. The models are designed for both immediate responses and extended thinking, and include vision capabilities.
