Active Parameters
288.0B
Context Length
-
Modality
Multimodal
Architecture
Mixture of Experts (MoE)
License
Llama 4 Community License Agreement
Release Date
-
Knowledge Cutoff
-
Total Parameters
2T
Number of Experts
16
Active Experts
2
Attention Structure
Grouped-Query Attention
Hidden Dimension Size
16384
Number of Layers
160
Attention Heads
128
Key-Value Heads
8
Activation Function
-
Normalization
-
Position Embedding
Rotary Position Embedding (iRoPE)
Llama 4 Behemoth is an unreleased, large-scale multimodal model developed by Meta. Within the Llama 4 model family it serves as a teacher model: its knowledge is distilled into smaller, more deployable models such as Llama 4 Scout and Llama 4 Maverick, strengthening those student models across a range of tasks. Although Behemoth is Meta's largest and most powerful model, it is still in training and has not been released for public use, with reports indicating possible delays to its debut. As a foundational teacher model, it is used in internal research and development to push the boundaries of AI performance.
Llama 4 Behemoth uses a Mixture-of-Experts (MoE) architecture with approximately 2 trillion total parameters, of which 288 billion are active during inference. The model contains 16 distinct expert networks, 2 of which are activated per token. It is natively multimodal, processing and understanding text, images, and video through an early fusion mechanism. Training reportedly used 32,000 GPUs in FP8 precision over more than 30 trillion tokens of diverse data. This architecture enables efficient scaling, and knowledge transfer to student models relies on a novel distillation loss function that dynamically balances soft and hard targets.
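Meta has not published the details of that distillation loss; as a rough sketch, the standard formulation it presumably builds on blends a temperature-softened KL term against the teacher's distribution with cross-entropy against the ground-truth label. The fixed `alpha` and `temperature` values below are illustrative assumptions, whereas Behemoth reportedly adjusts the balance dynamically.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend soft-target KL divergence (teacher -> student) with
    hard-target cross-entropy. The fixed `alpha` blend is an
    illustrative default, not Meta's published scheme."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on softened distributions, scaled by T^2
    # so soft- and hard-target gradients have comparable magnitude.
    soft = temperature ** 2 * sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    # Standard cross-entropy against the ground-truth label.
    hard = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft + (1.0 - alpha) * hard
```

When student and teacher logits agree, the soft term vanishes and only the hard-label cross-entropy remains; a dynamic scheme would shift weight between the two terms during training.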
Although Llama 4 Behemoth is not yet publicly available, Meta's internal evaluations report strong results: it is said to outperform comparable models on STEM-focused benchmarks covering mathematical problem-solving, multilingual understanding, and image reasoning. Within Meta, its primary uses are advanced AI research and generating high-quality synthetic data for training smaller, deployable models such as Llama 4 Maverick. The MoE architecture keeps inference economical by activating only a subset of parameters for each token, reducing compute costs while maintaining performance.
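That per-token saving comes from the router: for each token only the top-k experts (top-2 of 16, per the table above) are executed at all. Meta's exact gating is not public; a minimal sketch, assuming softmax renormalization over the selected logits and toy scalar experts in place of the real feed-forward networks:

```python
import math

def top_k_route(router_logits, k=2):
    """Select the top-k experts for one token and renormalize their
    gate weights with a softmax over only the selected logits."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return list(zip(top, [e / total for e in exps]))

def moe_layer(x, experts, router_logits, k=2):
    """Run only the routed experts and mix their outputs by gate
    weight; the remaining experts are skipped entirely, which is
    where the inference-compute saving comes from."""
    return sum(weight * experts[i](x)
               for i, weight in top_k_route(router_logits, k))

# Toy setup: 16 scalar "experts" (real experts are feed-forward nets).
experts = [lambda x, scale=s: scale * x for s in range(16)]
router_logits = [0.0] * 16
router_logits[3], router_logits[7] = 2.0, 1.0
y = moe_layer(5.0, experts, router_logits, k=2)  # only experts 3 and 7 run
```

With 16 experts and k=2, each token touches only the shared layers plus two expert networks, which is how a 2T-parameter model can run with 288B active parameters.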
Meta's Llama 4 model family implements a Mixture-of-Experts (MoE) architecture for efficient scaling and is natively multimodal through early fusion of text, images, and video. The family also supports significantly extended context lengths, with some models capable of processing up to 10 million tokens.
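"Early fusion" here means that image and video tokens enter the same input sequence as text tokens before the first transformer layer, rather than being attached later via cross-attention. A minimal sketch, with a hypothetical embedding table and patch projection standing in for the learned components:

```python
# Toy embedding table; a real model learns these weights.
text_embed = {0: [1.0, 0.0], 1: [0.0, 1.0]}   # vocab of 2, dim 2

def image_proj(patch):
    """Hypothetical projection of an image patch into the text
    embedding dimension (a learned linear layer in practice)."""
    return [sum(patch) / len(patch), max(patch)]

def early_fusion_sequence(text_ids, image_patches):
    """Early fusion: concatenate text-token embeddings and projected
    image-patch embeddings into one input sequence, so every
    transformer layer attends jointly over both modalities."""
    seq = [text_embed[t] for t in text_ids]
    seq.extend(image_proj(p) for p in image_patches)
    return seq

seq = early_fusion_sequence([0, 1], [[0.2, 0.4]])  # 2 text + 1 image token
```

The payoff is that visual and textual information interact at every layer, instead of only at a handful of fusion points.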
No evaluation benchmarks for Llama 4 Behemoth available.
Overall Rank
-
Coding Rank
-