Parameters
1.7T
Context Length
256K
Modality
Multimodal
Architecture
Dense
License
Proprietary
Release Date
9 Jul 2025
Knowledge Cutoff
Dec 2024
Attention
Attention Structure
Multi-Head Attention
Attention Heads
-
Key-Value Heads
-
Attention Head Dimension
-
Position Embedding
Absolute Position Embedding
RoPE Theta
-
Sliding Window Attention
-
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
-
Dimensions
Hidden Dimension Size
-
Number of Layers
-
FFN Intermediate Size (Dense)
-
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
-
Grok 4 is xAI's most intelligent model, trained with reinforcement learning at unprecedented scale on the 200,000-GPU Colossus cluster. It features native tool use with a code interpreter and web browsing, real-time search across X and the web, and state-of-the-art performance on Humanity's Last Exam (50.7% on the text-only subset with tools). It achieves 100% on AIME 2025, 99.4% on HMMT 2025, and 15.9% on ARC-AGI-2, and leads Vending-Bench with a net worth of $4694.15. It supports a 256K context window with multimodal understanding and is available through the API with advanced reasoning, coding, and visual processing capabilities.
The Grok 4 series comprises xAI's frontier models, trained with reinforcement learning at unprecedented scale on the 200,000-GPU Colossus cluster. The series demonstrates state-of-the-art performance in reasoning, coding, and multimodal understanding, with native tool use. It offers real-time search across X and the web, advanced reasoning through scaled RL training, and industry-leading performance on academic benchmarks. The models are designed for both immediate responses and extended thinking modes, with vision capabilities.
Rank
#24
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| QA Assistant | ProLLM QA Assistant | 0.985 | 🥇 1 |
| Summarization | ProLLM Summarization | 0.976 | 🥉 3 |
| StackUnseen | ProLLM Stack Unseen | 0.886 | 5 |
| Coding | Aider Coding | 0.80 | 6 |
| Graduate-Level QA | GPQA | 0.875 | 6 |
| Reasoning | LiveBench Reasoning | 0.79 | 16 |
| Data Analysis | LiveBench Data Analysis | 0.63 | 17 |
| Mathematics | LiveBench Mathematics | 0.83 | 18 |
| Professional Knowledge | MMLU Pro | 0.85 | 19 |
| Coding | LiveBench Coding | 0.73 | 28 |
| Agentic Coding | LiveBench Agentic | 0.30 | 45 |
Overall Rank
#24
Coding Rank
#12
Total Score
39 / 100
Grok 4 exhibits a 'black box' transparency profile, characterized by high-performance claims backed by minimal technical documentation. While the scale of its training infrastructure is publicly acknowledged, critical details regarding architecture, dataset composition, and evaluation methodology remain undisclosed. This lack of verifiable evidence forces users to rely on marketing assertions rather than reproducible scientific data.
Architectural Provenance
While xAI identifies Grok 4 as a 'hybrid' architecture combining dense transformer layers with modular attention mechanisms, there is no technical paper or detailed documentation describing the specific pretraining methodology or architectural modifications. Official sources describe it as a 'large-scale generative model' but omit the detail a high score requires, such as specific layer configurations or the exact nature of the 'modular attention' mentioned in marketing materials.
Dataset Composition
Dataset disclosure is extremely limited. Official documentation vaguely refers to 'publicly available Internet data,' 'data from users or contractors,' and 'internally generated data.' There is no public breakdown of dataset proportions (e.g., % code, % web), no detailed filtering/cleaning methodology, and no sample data available for inspection. The claim of 'verifiable training data' is an unverifiable assertion without public access to the verification logs or sources.
Tokenizer Integrity
The tokenizer is not publicly available for independent inspection or download. While vocabulary size and specific tokenization logic for Grok 4 remain undocumented, API documentation provides basic token accounting (e.g., 1 token ≈ 0.75 words) and confirms support for multilingual text and image embeddings. However, without access to the tokenizer files or a technical specification of the vocabulary, it remains a 'black box' for developers.
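Because the tokenizer itself cannot be inspected, the only client-side option is a rough estimate based on the quoted rule of thumb. The following is a minimal sketch under that assumption: the helper names `estimate_tokens` and `fits_in_context` are hypothetical, the word split is plain whitespace, and real token counts (especially for code, non-English text, or images) may differ substantially.

```python
import math

# Rule of thumb quoted in the API documentation: 1 token ≈ 0.75 words.
WORDS_PER_TOKEN = 0.75
# Context window listed for Grok 4 (256,000 tokens).
CONTEXT_WINDOW_TOKENS = 256_000


def estimate_tokens(text: str) -> int:
    """Estimate a token count from whitespace-separated words (hypothetical helper)."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def fits_in_context(text: str, reserved_for_output: int = 0) -> bool:
    """Check whether the estimated prompt size leaves room for the reply."""
    return estimate_tokens(text) + reserved_for_output <= CONTEXT_WINDOW_TOKENS
```

An estimate like this is only useful for coarse budgeting; billing and hard limits depend on the token counts the API itself reports.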
Parameter Density
Third-party reports and some xAI-affiliated materials claim a parameter count of approximately 1.7 trillion. However, official xAI documentation for Grok 4 conspicuously avoids stating a definitive parameter count or the ratio of active-to-total parameters. The description of a 'hybrid' and 'modular' design suggests a Mixture-of-Experts (MoE) or sparse architecture, but the lack of official disclosure on active parameters makes the 1.7T figure unverifiable and potentially misleading.
Training Compute
xAI provides more detail here than in other categories, explicitly naming the 'Colossus' supercomputer and its 200,000 NVIDIA GPU cluster as the training hardware. Independent audits by Epoch AI provide detailed estimates of 246 million H100-hours and 310 GWh of electricity. While xAI itself does not publish a formal carbon footprint or exact cost, the public disclosure of the hardware scale and duration allows for high-confidence third-party verification.
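The quoted estimates can be sanity-checked with simple arithmetic. The sketch below uses only the three figures cited above and one loud assumption: that all 200,000 GPUs ran the job concurrently for the whole run, which xAI has not confirmed. The variable names are illustrative.

```python
# Figures quoted above (third-party estimates and stated cluster size).
GPU_HOURS = 246e6        # estimated H100-hours
ENERGY_GWH = 310         # estimated electricity use, in GWh
CLUSTER_GPUS = 200_000   # stated size of the Colossus cluster

# Assumption: every GPU in the cluster ran concurrently for the whole run.
wall_clock_hours = GPU_HOURS / CLUSTER_GPUS
wall_clock_days = wall_clock_hours / 24

# Average power per GPU implied by the two estimates (1 GWh = 1,000,000 kWh).
kwh_total = ENERGY_GWH * 1e6
kw_per_gpu = kwh_total / GPU_HOURS

print(f"Implied wall-clock time: {wall_clock_hours:.0f} h ({wall_clock_days:.2f} days)")
print(f"Implied average power: {kw_per_gpu:.2f} kW per GPU")
```

Under these assumptions the figures imply about 1,230 hours (roughly 51 days) of wall-clock time and about 1.26 kW per GPU, which is at least internally consistent for an accelerator plus its share of system overhead.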
Benchmark Reproducibility
xAI reports impressive scores on benchmarks like ARC-AGI-2 (15.9%) and GPQA (87.5%), but does not release the evaluation code, exact prompts, or few-shot examples used to achieve these results. While some third-party verification exists (e.g., Artificial Analysis), the lack of a public reproduction repository or detailed methodology for 'Expert' vs 'Auto' modes prevents independent scientific validation.
Identity Consistency
Grok 4 demonstrates strong identity consistency, correctly identifying itself and its version (grok-4-0709) across API and web interfaces. It is transparent about its nature as an AI developed by xAI. There are no widespread reports of the model claiming to be a competitor's product (e.g., GPT-4), though it does occasionally reference the views of xAI's leadership when prompted on controversial topics.
License Clarity
The model is released under a 'Proprietary' license with no open-weight or open-source version available. While the Terms of Service state that users 'own the output,' there are conflicting interpretations regarding commercial use and data ownership in the broader xAI/X ecosystem. The lack of a standard open license (like the Apache 2.0 license used for Grok-1) and the presence of restrictive usage terms for 'SuperGrok' tiers result in a low transparency score.
Hardware Footprint
Basic hardware requirements are indirectly available through API specifications, documenting a 256,000-token context window and tiered pricing for 'extended context' usage. However, because the model weights are not public, there is no official guidance on VRAM requirements for local deployment or quantization accuracy tradeoffs. Information is limited to API-based consumption metrics.
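In the absence of official guidance, a back-of-the-envelope bound on weight memory can be derived from the parameter count alone. This is a sketch under stated assumptions: it takes the unverified 1.7 trillion figure at face value, counts weights only (no KV cache, activations, or runtime overhead), and uses an 80 GB accelerator purely as an illustrative unit. The function name `weight_memory_gb` is hypothetical.

```python
import math

# Assumption: the unverified third-party figure of ~1.7 trillion parameters.
PARAMS = 1.7e12

# Bytes per parameter for common storage formats.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

# Illustrative accelerator size (80 GB); not a deployment recommendation.
GPU_MEMORY_GB = 80


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory for weights alone, in GB (10^9 bytes); excludes KV cache and activations."""
    return params * bytes_per_param / 1e9


for fmt, size in BYTES_PER_PARAM.items():
    gb = weight_memory_gb(PARAMS, size)
    gpus = math.ceil(gb / GPU_MEMORY_GB)
    print(f"{fmt}: {gb:,.0f} GB of weights, at least {gpus} x {GPU_MEMORY_GB} GB devices")
```

These are lower bounds on storage only; they say nothing about the quantization accuracy tradeoffs noted above, which cannot be assessed without access to the weights.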
Versioning Drift
xAI maintains a basic release log (e.g., Grok 4.1, Grok 4 Fast), but lacks a detailed changelog or technical documentation of model drift. Updates are often 'silent' or announced via social media with minimal technical detail on how weights or safety filters have changed. There is no public mechanism to access or pin specific sub-versions of the model to prevent behavior drift in production.