Total Parameters
400B
Context Length
1,000K
Modality
Multimodal
Architecture
Mixture of Experts (MoE)
License
Llama 4 Community License Agreement
Release Date
5 Apr 2025
Knowledge Cutoff
Aug 2024
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
96
Key-Value Heads
8
Attention Head Dimension
128
Position Embedding
iRoPE
RoPE Theta
500,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
Swish
Dimensions
Hidden Dimension Size
12,288
Number of Layers
120
FFN Intermediate Size (Dense)
8,192
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
202,048
Mixture of Experts
Active Parameters
17.0B
Number of Experts
128
Active Experts
2
Shared Experts
-
FFN Intermediate Size (per Expert)
-
Dense Layers Before MoE
-
The Llama 4 Maverick model is a natively multimodal large language model developed by Meta, released as part of the Llama 4 model family. Its primary purpose is to deliver advanced capabilities in text and image understanding, supporting a wide range of applications including assistant-like conversational AI, creative content generation, complex reasoning, and code generation. Designed for both commercial and research deployment, Llama 4 Maverick aims to provide high-quality performance with improved cost efficiency.
Architecturally, Llama 4 Maverick uses a Mixture-of-Experts (MoE) design, a significant departure from the dense transformer models of previous Llama generations. It comprises 400 billion total parameters, of which only 17 billion are active per token during inference. This sparsity comes from routing each token through a small subset of the model's 128 experts, with dense and MoE layers alternating through the network. The model integrates text and images through an early fusion mechanism, so both modalities are processed together from the initial stages. The architecture also incorporates iRoPE for managing and scaling context length.
Llama 4 Maverick demonstrates robust performance across diverse benchmarks, including coding, reasoning, and multilingual tasks, as well as long-context processing and image understanding. It is engineered for high model throughput and is suitable for production environments that demand low latency and precision. The model's design facilitates its deployment in scenarios requiring sophisticated multimodal interaction and efficient resource utilization, addressing modern AI application requirements.
Meta's Llama 4 model family implements a Mixture-of-Experts (MoE) architecture for efficient scaling. It features native multimodality through early fusion of text, images, and video. This iteration also supports significantly extended context lengths, with models capable of processing up to 10 million tokens.
Rank
#102
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| QA Assistant | ProLLM QA Assistant | 0.949 | 10 |
| General Knowledge | MMLU | 0.855 | 12 |
| Summarization | ProLLM Summarization | 0.72 | 21 |
| StackUnseen | ProLLM Stack Unseen | 0.319 | 30 |
| Coding | Aider Coding | 0.16 | 31 |
| Professional Knowledge | MMLU Pro | 0.79 | 39 |
Overall Rank
#102
Coding Rank
#125
Total Score
72 / 100
Llama 4 Maverick sets a high bar for architectural and compute transparency, providing rare granular details on Mixture-of-Experts routing and environmental impact. However, its transparency profile is weakened by restrictive licensing that geofences entire regions and significant discrepancies between official benchmark claims and independent third-party reproductions. While technically well-documented, the model's 'openness' is heavily qualified by corporate and geographic constraints.
Architectural Provenance
Meta provides comprehensive documentation for the Llama 4 Maverick architecture, explicitly identifying it as a Mixture-of-Experts (MoE) model with 128 experts and alternating dense/MoE layers. Technical details include the use of 'early fusion' for native multimodality and 'iRoPE' for context scaling. The model is documented as pretrained from scratch on a 22-trillion-token multimodal dataset, and its sparse design is a significant departure from the dense architectures of previous Llama generations.
Dataset Composition
Meta discloses the scale of the pretraining data (22 trillion tokens for Maverick) and its general categories (publicly available web data, licensed data, and Meta product data such as Instagram and Facebook posts), but the documentation gives no granular percentage breakdown of sources. The filtering and cleaning methodology is described only in high-level terms, without the detail found in fully open-source datasets. The inclusion of user interaction data from Meta AI is noted but not quantified.
Tokenizer Integrity
The tokenizer is publicly accessible via the official Llama GitHub and Hugging Face repositories. It has a stated vocabulary of 202,048 tokens and supports 12 primary languages (including English, Hindi, and Thai). Documentation confirms the tokenizer's alignment with the multimodal training data, and its behavior can be verified through standard library integrations such as 'transformers'.
Parameter Density
Meta is highly transparent regarding the MoE structure of Maverick, clearly distinguishing between the 400 billion total parameters and the 17 billion active parameters engaged per token. The documentation specifies the expert count (128) and the routing mechanism (one shared expert plus one routed expert per layer). This level of detail exceeds industry standards for sparse models.
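The routing pattern described above (one shared expert plus one routed expert per MoE layer) can be illustrated with a toy example. This is a minimal sketch, not Meta's implementation: the class `ToyMoELayer`, the helper `make_linear`, the random linear "experts", and the plain argmax router are all hypothetical stand-ins, and any gate weighting applied to the routed output is omitted.

```python
import random

def make_linear(dim, rng):
    """Return a toy expert: a random dim x dim linear map (stand-in for an FFN)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    def apply(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return apply

class ToyMoELayer:
    """Hypothetical MoE layer: every token uses the shared expert plus one routed expert."""

    def __init__(self, dim, num_experts=128, seed=0):
        rng = random.Random(seed)
        self.shared = make_linear(dim, rng)
        self.experts = [make_linear(dim, rng) for _ in range(num_experts)]
        # Router: one scoring vector per expert (assumed form, for illustration only).
        self.router = [[rng.uniform(-1, 1) for _ in range(dim)]
                       for _ in range(num_experts)]

    def route(self, x):
        """Pick the single expert with the highest router score (top-1 routing)."""
        scores = [sum(w * xi for w, xi in zip(row, x)) for row in self.router]
        return max(range(len(scores)), key=scores.__getitem__)

    def forward(self, x):
        idx = self.route(x)
        shared_out = self.shared(x)
        routed_out = self.experts[idx](x)
        # Only 2 of the 129 expert networks in this layer run for this token.
        return [s + r for s, r in zip(shared_out, routed_out)], idx

layer = ToyMoELayer(dim=8)
token = [0.5, -1.0, 0.25, 0.0, 1.5, -0.5, 0.75, 1.0]
output, chosen = layer.forward(token)

# Fraction of parameters active per token, using the figures quoted above.
active_fraction = 17e9 / 400e9
print(f"routed to expert {chosen}; active fraction = {active_fraction:.2%}")
```

Because each token touches only the shared expert and one routed expert, compute per token scales with the 17 billion active parameters rather than the full 400 billion, which is roughly 4.25% of the model.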
Training Compute
Meta provides exemplary transparency regarding training compute. Official model cards disclose 2.38 million H100 GPU hours for Maverick, hardware specifications (H100-80GB), and detailed environmental impact metrics, including location-based greenhouse gas emissions (645 tons CO2eq) and a market-based estimate of 0 tons due to renewable energy matching.
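The disclosed figures allow some rough derived metrics. The sketch below is back-of-the-envelope arithmetic only; it assumes that the 2.38 million GPU hours cover the same run that consumed the 22 trillion pretraining tokens and that "tons" means metric tons, neither of which is confirmed here.

```python
# Figures quoted in this document.
GPU_HOURS = 2.38e6          # H100 GPU hours for Maverick
PRETRAIN_TOKENS = 22e12     # pretraining tokens
EMISSIONS_TONS = 645        # location-based CO2eq (assumed metric tons)

# Average training throughput per GPU (assumes all hours went to pretraining).
tokens_per_gpu_hour = PRETRAIN_TOKENS / GPU_HOURS
tokens_per_gpu_second = tokens_per_gpu_hour / 3600

# Location-based emissions intensity per GPU hour.
kg_co2_per_gpu_hour = EMISSIONS_TONS * 1000 / GPU_HOURS

print(f"{tokens_per_gpu_hour:,.0f} tokens per GPU hour")
print(f"{tokens_per_gpu_second:,.0f} tokens per GPU second")
print(f"{kg_co2_per_gpu_hour:.3f} kg CO2eq per GPU hour")
```

Under these assumptions the run averages about 9.2 million tokens per GPU hour and about 0.27 kg CO2eq per GPU hour on a location basis.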
Benchmark Reproducibility
Meta reports strong results on standard benchmarks (MMLU-Pro, GPQA, etc.), but independent reproducibility has been inconsistent. While some evaluation metrics are provided, third-party audits have noted significant performance gaps between official claims and public checkpoints. The lack of public evaluation code for the exact 'experimental' variants used in some leaderboard submissions limits full verification.
Identity Consistency
The model consistently identifies itself as part of the Llama 4 family in standard deployments. However, there have been documented instances of the model exhibiting 'identity confusion' or hallucinating capabilities (e.g., character counting errors) common in large-scale LLMs. Version tracking is clear, with distinct labels for 'Instruct' and 'Experimental' builds.
License Clarity
The Llama 4 Community License Agreement is a custom 'open-weights' license rather than a standard OSI open-source license. It contains significant restrictions, most notably a total ban on use by individuals or companies domiciled in the European Union for multimodal models. It also includes a 700-million monthly active user (MAU) threshold for commercial use, which creates legal ambiguity for large-scale deployments.
Hardware Footprint
Hardware requirements are well-documented for various precisions (FP16, FP8, INT4). Meta provides guidance on VRAM needs for different context lengths, noting that Maverick requires a multi-GPU setup (e.g., an 8xH100 node) for efficient inference. Quantization trade-offs are mentioned, though specific accuracy-loss curves for the 1M token context window are less detailed.
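A rough memory estimate shows why a multi-GPU node is needed. The sketch below is not Meta's guidance: it takes the parameter count and attention dimensions from this page's specification, assumes every layer keeps a full key-value cache in 16-bit precision (ignoring any savings from iRoPE or chunked attention), and ignores activations and framework overhead. The function names are illustrative.

```python
# Values taken from the specification on this page.
TOTAL_PARAMS = 400e9   # all experts must be resident, not only the 17B active
NUM_LAYERS = 120
KV_HEADS = 8           # grouped-query attention: 96 query heads share 8 KV heads
HEAD_DIM = 128

BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}
NODE_GB = 8 * 80       # one 8xH100-80GB node

def weight_gb(precision):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[precision] / 1e9

def kv_cache_gb(context_tokens, bytes_per_value=2):
    """Upper-bound KV cache size for one sequence: keys plus values in every layer."""
    per_token = 2 * NUM_LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * context_tokens / 1e9

for precision in BYTES_PER_PARAM:
    total = weight_gb(precision) + kv_cache_gb(1024)
    fits = "fits" if total < NODE_GB else "does not fit"
    print(f"{precision}: {total:.1f} GB at 1,024 tokens, {fits} in {NODE_GB} GB")

print(f"KV cache at 1M tokens: {kv_cache_gb(1_000_000):.2f} GB")
```

With these assumptions, FP16 weights alone (800 GB) exceed a single 8xH100 node, while FP8 (400 GB) and INT4 (200 GB) leave room for the cache. The 8 key-value heads keep the cache 12 times smaller than it would be if all 96 heads stored their own keys and values.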
Versioning Drift
Meta uses versioned releases on Hugging Face and maintains a basic changelog. However, the community has reported 'silent' differences between experimental builds used for benchmarks and the final weights released to the public. While semantic versioning is present, the documentation for behavioral drift between minor updates is not comprehensive.