Llama 4 Maverick

Total Parameters

400B

Context Length

1,000K

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

Llama 4 Community License Agreement

Release Date

5 Apr 2025

Knowledge Cutoff

Aug 2024

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

96

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

iRoPE

RoPE Theta

500,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

Swish

Dimensions

Hidden Dimension Size

12,288

Number of Layers

120

FFN Intermediate Size (Dense)

8,192

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

202,048

Mixture of Experts

Active Parameters per Token

17.0B

Number of Experts

128

Active Experts

2

Shared Experts

1

FFN Intermediate Size (per Expert)

-

Dense Layers Before MoE

-

Architecture Diagram

Input Tokens → Token Embedding (Hidden: 12.3k · Context: 1,000k · Vocab: 202k) → x 120 layers [ RMSNorm (Pre-Attention) → Grouped-Query Attention (96 Q / 8 KV heads · Head dim: 128) → + → RMSNorm (Pre-FFN) → Sparse MoE FFN (2/128 experts · Swish) → + ] → Final RMSNorm → Output Logits
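The figures in the diagram can be cross-checked with a little arithmetic. The sketch below takes the values listed on this page as given (they are not verified against the released checkpoint). It assumes a 2-byte (FP16/BF16) KV cache and assumes every layer caches the full context, so the cache figure is best read as an upper bound. All variable names are illustrative.

```python
# Values copied from the specification list on this page (taken as given).
num_q_heads = 96
num_kv_heads = 8
head_dim = 128
hidden_size = 12_288
num_layers = 120

# In grouped-query attention, several query heads share one key/value head.
q_heads_per_kv_head = num_q_heads // num_kv_heads

# The combined width of all query heads should match the hidden size.
q_proj_width = num_q_heads * head_dim

# KV-cache bytes per token: keys and values (factor 2) for every layer.
# Assumption: 2 bytes per element (FP16/BF16), no cache quantization.
bytes_per_element = 2
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

print("query heads per KV head:", q_heads_per_kv_head)
print("query width matches hidden size:", q_proj_width == hidden_size)
print("KV-cache bytes per token:", kv_bytes_per_token)
```

With these inputs, 12 query heads share each KV head, the query width equals the listed hidden size, and each cached token costs 491,520 bytes under the stated assumptions.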

Llama 4 Maverick

Llama 4 Maverick is a natively multimodal large language model developed by Meta and released as part of the Llama 4 model family. It handles text and image understanding and targets applications such as assistant-style conversation, creative content generation, complex reasoning, and code generation. It is intended for both commercial and research deployment, with an emphasis on high-quality output at lower inference cost.

Architecturally, Llama 4 Maverick uses a Mixture-of-Experts (MoE) design, a significant departure from the dense transformers of earlier Llama generations. It has 400 billion total parameters, of which only 17 billion are active per token during inference. The sparsity comes from routing each token through a small subset of the 128 experts, with dense and MoE layers alternating through the network. Text and image inputs are combined through early fusion, so the model processes both modalities jointly from the first layers. The architecture also uses iRoPE to manage and scale context length.
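To make the sparsity concrete, the short sketch below computes the share of parameters engaged per token from the two counts stated above. The FLOPs estimate uses the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass. That rule is an assumption introduced here, not a figure from Meta.

```python
# Parameter counts as stated in the text.
total_params = 400e9
active_params = 17e9

# Share of the model's weights that participate in any single token.
active_fraction = active_params / total_params

# Rule of thumb (an assumption, not from the text):
# about 2 FLOPs per active parameter per token in a forward pass.
approx_flops_per_token = 2 * active_params

print(f"active fraction: {active_fraction:.2%}")
print(f"approx. forward FLOPs per token: {approx_flops_per_token:.1e}")
```

Roughly 4.25% of the weights are used for each token, which is why per-token compute resembles that of a much smaller dense model even though all 400 billion parameters must be stored.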

Llama 4 Maverick reports strong results across coding, reasoning, multilingual, long-context, and image-understanding benchmarks. It is engineered for high throughput and is aimed at production environments that demand low latency and precision, particularly applications that combine multimodal interaction with efficient resource use.

About Llama 4

Meta's Llama 4 model family implements a Mixture-of-Experts (MoE) architecture for efficient scaling. It features native multimodality through early fusion of text, images, and video. This generation also supports much longer context lengths, up to 10 million tokens for some models in the family.


Evaluation Benchmarks

Rank

#102

Benchmark · Score · Rank

(unnamed) · 0.949 · 10

MMLU (General Knowledge) · 0.855 · 12

(unnamed) · 0.72 · 21

(unnamed) · 0.319 · 30

(unnamed) · 0.16 · 31

MMLU Pro (Professional Knowledge) · 0.79 · 39

Rankings

Overall Rank

#102

Coding Rank

#125

Model Integrity

Total Score

B+

72 / 100

Llama 4 Maverick Model Integrity Report

Audit Note

Llama 4 Maverick sets a high bar for architectural and compute transparency, providing rare granular details on Mixture-of-Experts routing and environmental impact. However, its transparency profile is weakened by restrictive licensing that geofences entire regions and significant discrepancies between official benchmark claims and independent third-party reproductions. While technically well-documented, the model's 'openness' is heavily qualified by corporate and geographic constraints.

Upstream

21.0 / 30

Architectural Provenance

8.0 / 10

Meta provides comprehensive documentation for the Llama 4 Maverick architecture, explicitly identifying it as a Mixture-of-Experts (MoE) model with 128 experts and alternating dense/MoE layers, a significant departure from the dense architectures of previous Llama generations. Technical details include the use of 'early fusion' for native multimodality and 'iRoPE' for context scaling. The model is documented as being pretrained from scratch on a 22-trillion-token multimodal dataset.

Dataset Composition

4.5 / 10

Meta discloses the scale of the pretraining data (22 trillion tokens for Maverick) and its general categories (publicly available web data, licensed data, and Meta product data such as Instagram and Facebook posts), but it does not give a granular percentage breakdown of sources. The filtering and cleaning methodology is described only in high-level terms, without the detailed documentation found in fully open-source datasets. The inclusion of user interaction data from Meta AI is noted but not quantified.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly accessible via the official Llama GitHub and Hugging Face repositories. It has a vocabulary of 202,048 tokens and supports 12 primary languages (including English, Hindi, and Thai). Documentation confirms the tokenizer's alignment with the multimodal training data, and its behavior can be verified through standard library integrations such as 'transformers'.

Model

31.5 / 40

Parameter Density

9.0 / 10

Meta is highly transparent regarding the MoE structure of Maverick, clearly distinguishing between the 400 billion total parameters and the 17 billion active parameters engaged per token. The documentation specifies the expert count (128) and the routing mechanism (one shared expert plus one routed expert per layer). This level of detail exceeds industry standards for sparse models.
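The routing pattern described above (one shared expert that always runs, plus one routed expert chosen per token) can be illustrated with a toy example. This is a minimal sketch, not Meta's implementation: the softmax gate, the scalar "experts", and every function name are assumptions made for illustration, since the text does not describe the gating function.

```python
import math
import random

NUM_ROUTED_EXPERTS = 128  # expert count from the text
TOP_K = 1                 # one routed expert per token, per the text


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k=TOP_K):
    """Return (expert_index, gate_weight) pairs for the top_k routed experts."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:top_k]]


def moe_layer(x, router_logits, shared_expert, routed_experts):
    """Toy scalar MoE layer: shared expert output plus gated routed outputs."""
    out = shared_expert(x)
    for idx, gate in route(router_logits):
        out += gate * routed_experts[idx](x)
    return out


# Hypothetical toy experts: each one simply scales its input.
random.seed(0)
routed_experts = [lambda x, s=i: x * (s + 1) for i in range(NUM_ROUTED_EXPERTS)]
shared_expert = lambda x: x
router_logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]

chosen = route(router_logits)
print("routed experts evaluated:", len(chosen), "of", NUM_ROUTED_EXPERTS)
print("layer output for x=2.0:", moe_layer(2.0, router_logits, shared_expert, routed_experts))
```

Only the selected routed expert and the shared expert are evaluated for a given token. The remaining 127 routed experts contribute no compute, which is what separates active parameters from total parameters.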

Training Compute

9.5 / 10

Meta provides exemplary transparency regarding training compute. Official model cards disclose 2.38 million H100 GPU hours for Maverick, hardware specifications (H100-80GB), and detailed environmental impact metrics, including location-based greenhouse gas emissions (645 tons CO2eq) and a market-based estimate of 0 tons due to renewable energy matching.
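The disclosed figures can be turned into per-unit rates for easier comparison with other models. The sketch below uses only the two numbers quoted above and assumes "tons" means metric tons (1,000 kg). The variable names are illustrative.

```python
# Figures quoted in the paragraph above.
gpu_hours = 2.38e6            # H100 GPU hours for Maverick
location_based_tons = 645.0   # location-based emissions, tons CO2eq

# Assumption: metric tons, so 1 ton = 1,000 kg.
kg_co2eq_per_gpu_hour = location_based_tons * 1000 / gpu_hours

# The same compute expressed as years of a single GPU running nonstop.
gpu_years = gpu_hours / (24 * 365)

print(f"{kg_co2eq_per_gpu_hour:.3f} kg CO2eq per GPU hour")
print(f"{gpu_years:.0f} single-GPU years")
```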

Benchmark Reproducibility

5.0 / 10

Meta reports strong results on standard benchmarks (MMLU-Pro, GPQA, etc.), but independent reproducibility has been inconsistent. While some evaluation metrics are provided, third-party audits have noted significant performance gaps between official claims and public checkpoints. The lack of public evaluation code for the exact 'experimental' variants used in some leaderboard submissions limits full verification.

Identity Consistency

8.0 / 10

The model consistently identifies itself as part of the Llama 4 family in standard deployments. However, there have been documented instances of the model exhibiting 'identity confusion' or hallucinating capabilities (e.g., character counting errors) common in large-scale LLMs. Version tracking is clear, with distinct labels for 'Instruct' and 'Experimental' builds.

Downstream

19.5 / 30

License Clarity

6.0 / 10

The Llama 4 Community License Agreement is a custom 'open-weights' license rather than a standard OSI open-source license. It contains significant restrictions, most notably withholding the license rights for multimodal models from individuals domiciled in, and companies with a principal place of business in, the European Union. It also sets a threshold of 700 million monthly active users (MAU), above which a company must request a separate license from Meta, which creates legal ambiguity for large-scale deployments.

Hardware Footprint

7.5 / 10

Hardware requirements are well-documented for various precisions (FP16, FP8, INT4). Meta provides guidance on VRAM needs for different context lengths, noting that Maverick requires a multi-GPU setup (e.g., an 8xH100 node) for efficient inference. Quantization trade-offs are mentioned, though specific accuracy-loss curves for the 1M token context window are less detailed.
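A back-of-the-envelope estimate shows why a multi-GPU node is needed. The sketch below counts weight storage only, using the 400 billion total parameter figure and nominal bytes per parameter for each precision. It ignores KV cache, activations, and runtime overhead, and it assumes 80 GB per H100. These are simplifications made here, not Meta's published guidance.

```python
TOTAL_PARAMS = 400e9  # total parameters; every expert must be resident in memory

# Nominal storage cost per parameter for each precision.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

# Assumption: an 8 x H100 node with 80 GB per GPU.
NODE_VRAM_GB = 8 * 80


def weight_memory_gb(total_params, bytes_per_param):
    """Weights-only memory in GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


for precision, nbytes in BYTES_PER_PARAM.items():
    gb = weight_memory_gb(TOTAL_PARAMS, nbytes)
    fits = gb <= NODE_VRAM_GB
    print(f"{precision}: {gb:.0f} GB of weights, fits in {NODE_VRAM_GB} GB node: {fits}")
```

Under these assumptions the FP16 weights alone (800 GB) exceed a single 640 GB node, while FP8 (400 GB) and INT4 (200 GB) leave headroom for the KV cache. Although only 17 billion parameters are active per token, all experts must be loaded, so memory scales with the total parameter count.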

Versioning Drift

6.0 / 10

Meta uses versioned releases on Hugging Face and maintains a basic changelog. However, the community has reported 'silent' differences between experimental builds used for benchmarks and the final weights released to the public. While semantic versioning is present, the documentation for behavioral drift between minor updates is not comprehensive.
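The section subtotals and the overall score can be verified by summing the individual criteria reported above. The dictionary layout below is just one convenient way to hold the numbers and is not part of the report itself.

```python
# Criterion scores as listed in the report (each out of 10).
scores = {
    "Upstream": {
        "Architectural Provenance": 8.0,
        "Dataset Composition": 4.5,
        "Tokenizer Integrity": 8.5,
    },
    "Model": {
        "Parameter Density": 9.0,
        "Training Compute": 9.5,
        "Benchmark Reproducibility": 5.0,
        "Identity Consistency": 8.0,
    },
    "Downstream": {
        "License Clarity": 6.0,
        "Hardware Footprint": 7.5,
        "Versioning Drift": 6.0,
    },
}

# Sum each section, then sum the sections for the overall score.
section_totals = {name: sum(items.values()) for name, items in scores.items()}
total_score = sum(section_totals.values())

for name, subtotal in section_totals.items():
    print(f"{name}: {subtotal} / {10 * len(scores[name])}")
print(f"Total: {total_score} / 100")
```

The sums reproduce the reported subtotals (21.0, 31.5, and 19.5) and the overall score of 72 out of 100.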
