Total Parameters: 2T
Context Length: -
Modality: Multimodal
Architecture: Mixture of Experts (MoE)
License: Llama 4 Community License Agreement
Release Date: -
Knowledge Cutoff: -
Active Parameters: 288.0B
Number of Experts: 16
Active Experts: 2
Attention Structure: Grouped-Query Attention
Hidden Dimension Size: 16384
Number of Layers: 160
Attention Heads: 128
Key-Value Heads: 8
Activation Function: -
Normalization: -
Position Embedding: Absolute Position Embedding
VRAM Requirements by Quantization Method and Context Size
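No official VRAM figures have been published for this model, but a rough lower bound for the weights alone follows directly from the parameter count and the bits per parameter. The sketch below assumes all ~2T parameters must be resident (MoE expert weights are not offloaded) and ignores KV cache, activations, and framework overhead, so real requirements would be higher.

```python
# Back-of-envelope VRAM estimate for model weights only.
# Assumptions: all ~2T parameters resident in memory; no KV cache,
# activation, or framework overhead included.
def weight_vram_gib(total_params: float, bits_per_param: float) -> float:
    """Approximate GiB needed just to hold the weights."""
    return total_params * bits_per_param / 8 / 1024**3

TOTAL_PARAMS = 2e12  # ~2 trillion total parameters

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_vram_gib(TOTAL_PARAMS, bits):,.0f} GiB")
```

Even at 4-bit quantization the weights alone approach a terabyte of memory, which is why a model of this scale is impractical for local deployment.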
Llama 4 Behemoth is an unreleased, large-scale multimodal model developed by Meta. Within the Llama 4 model family it serves as a teacher model, distilling its knowledge into smaller, more deployable models such as Llama 4 Scout and Llama 4 Maverick, with the aim of strengthening those student models across a range of tasks. Although Llama 4 Behemoth is Meta's largest and most powerful model, it is still in training and has not been released for public use, and reports indicate its public debut may be delayed. As a foundational teacher model, it is used in internal research and development to push the boundaries of AI performance.
The architectural design of Llama 4 Behemoth is based on a Mixture-of-Experts (MoE) configuration with approximately 2 trillion total parameters, of which 288 billion are active during inference, distributed across 16 distinct expert networks. The model is natively multimodal, processing and understanding text, images, and video through an early-fusion mechanism. Training required significant computational resources: 32,000 GPUs running at FP8 precision over more than 30 trillion tokens of diverse data. This architecture enables efficient scaling, and knowledge transfer to student models relies on a novel distillation loss function that dynamically balances soft and hard targets.
While Llama 4 Behemoth is not yet publicly available, internal evaluations point to strong performance: it reportedly outperforms comparable models on STEM-focused benchmarks covering mathematical problem-solving, multilingual understanding, and image reasoning. Within Meta, its primary use cases are advanced AI research and generating high-quality synthetic data for training smaller, deployable models like Llama 4 Maverick. The MoE architecture used across Llama 4 models improves computational efficiency by activating only a subset of parameters for each token during inference, reducing compute costs while maintaining performance.
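The per-token savings from MoE routing can be illustrated with a minimal top-k router sketch. This is a generic illustration, not Meta's implementation: the spec above lists 16 experts with 2 active, so compute per token scales with `k = 2` rather than with the full expert count.

```python
import math

def top_k_route(router_logits, k=2):
    """Select the k highest-scoring experts for one token and return
    (expert_index, weight) pairs; weights are softmax scores
    renormalized over only the selected experts."""
    idx = sorted(range(len(router_logits)), key=lambda i: -router_logits[i])[:k]
    exps = [math.exp(router_logits[i]) for i in idx]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(idx, exps)]

def moe_layer(x, experts, router_logits, k=2):
    """Run only the routed experts and combine their outputs; the
    unselected experts contribute no compute for this token."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))
```

For example, with 16 experts and `k = 2`, each token touches only 2/16 of the expert weights, which is how the model keeps 288B of its ~2T parameters active per token.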
Meta's Llama 4 model family implements a Mixture-of-Experts (MoE) architecture for efficient scaling. It features native multimodality through early fusion of text, images, and video. This iteration also supports significantly extended context lengths, with models capable of processing up to 10 million tokens.
Rankings apply to local LLMs.
No evaluation benchmarks are available for Llama 4 Behemoth.