Total Parameters: 2T
Context Length: 10M tokens
Modality: Multimodal
Architecture: Mixture of Experts (MoE)
License: Llama 4 Community License Agreement
Release Date: -
Training Data Cutoff: Aug 2024
Active Parameters: 288.0B
Number of Experts: 16
Active Experts: 2
Attention Structure: Grouped-Query Attention
Hidden Dimension: 16384
Layers: 160
Attention Heads: 128
Key-Value Heads: 8
Activation Function: SwiGLU
Normalization: RMS Normalization
Position Embedding: Absolute Position Embedding
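The ratio of 128 attention heads to 8 key-value heads in the table above is the signature of Grouped-Query Attention: each KV head is shared by 16 query heads, shrinking the KV cache by the same factor. The sketch below illustrates the mechanism with deliberately toy dimensions (an 8:2 head ratio and a head dimension of 16); it is an illustration of the general technique, not Meta's implementation.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_q_heads, n_kv_heads):
    """Minimal GQA: q carries n_q_heads heads, but k/v carry only n_kv_heads.

    Each group of (n_q_heads // n_kv_heads) query heads attends against the
    same shared key/value head, which is what reduces KV-cache memory.
    """
    group = n_q_heads // n_kv_heads              # query heads per KV head
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                          # which shared KV head to use
        scores = q[h] @ k[kv].T / np.sqrt(d)     # (seq, seq) attention logits
        scores -= scores.max(-1, keepdims=True)  # stabilize softmax
        w = np.exp(scores)
        w /= w.sum(-1, keepdims=True)
        out[h] = w @ v[kv]
    return out

# Toy shapes: 8 query heads sharing 2 KV heads (a 4:1 ratio; Behemoth's is 16:1).
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 16))   # (q_heads, seq, head_dim)
k = rng.normal(size=(2, 5, 16))   # (kv_heads, seq, head_dim)
v = rng.normal(size=(2, 5, 16))
out = grouped_query_attention(q, k, v, n_q_heads=8, n_kv_heads=2)
print(out.shape)  # (8, 5, 16)
```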
Llama 4 Behemoth is a large-scale multimodal foundation model developed by Meta, designed to serve as the primary teacher model within the Llama 4 family. As a non-deployed frontier model, its principal function is to generate high-quality synthetic data and provide the knowledge base for distilling smaller, production-ready variants such as Llama 4 Maverick and Scout. It integrates a native multimodal architecture capable of processing interleaved sequences of text, images, and video through an early fusion mechanism, which unifies visual and linguistic tokens within a single transformer backbone rather than utilizing separate modality-specific encoders.
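The early fusion described above can be sketched as projecting both modalities into one embedding space and concatenating them into a single token sequence for the shared transformer backbone. All names, shapes, and the fixed image-before-text ordering below are illustrative assumptions, not details of Meta's actual pipeline.

```python
import numpy as np

D_MODEL = 64  # toy embedding width; the spec lists a real hidden size of 16384

rng = np.random.default_rng(1)
text_embed = rng.normal(size=(1000, D_MODEL))  # toy text-token embedding table
patch_proj = rng.normal(size=(48, D_MODEL))    # linear projection for 48-dim patches

def fuse(text_ids, image_patches):
    """Early fusion: map text tokens and image patches to the same d_model,
    then concatenate them into one sequence consumed by a single transformer
    (no separate modality-specific encoders)."""
    txt = text_embed[text_ids]                  # (n_text, d_model)
    img = image_patches @ patch_proj            # (n_patches, d_model)
    return np.concatenate([img, txt], axis=0)   # unified token sequence

seq = fuse(np.array([5, 17, 3]), rng.normal(size=(4, 48)))
print(seq.shape)  # (7, 64)
```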
The model utilizes a sparse Mixture-of-Experts (MoE) architecture to achieve a total parameter count of approximately 2 trillion. During inference, the routing mechanism activates a subset of approximately 288 billion parameters across 16 experts. Technical innovations include the use of Grouped-Query Attention (GQA) to manage memory bandwidth and a training regime optimized with FP8 precision on large-scale GPU clusters. The model's architecture incorporates interleaved attention layers and a novel distillation loss function designed to balance soft and hard targets during the knowledge transfer process to student models.
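The sparse routing described above can be illustrated with a minimal top-k MoE layer: a gating network scores all experts, and only the top k run for each token, so the active parameter count is a small fraction of the total. This is a generic sketch of the technique under toy dimensions, not the Llama 4 router.

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Sparse MoE layer for one token: route to the top-k of n experts.

    x: (d,) token activation; router_w: (d, n_experts) gating weights;
    experts: list of (d, d) matrices standing in for full expert FFNs.
    Only k experts execute, mirroring 2 active out of 16 in the spec.
    """
    logits = x @ router_w                        # (n_experts,) gate scores
    top = np.argsort(logits)[-k:]                # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                         # softmax over selected experts
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

rng = np.random.default_rng(2)
d, n_experts = 32, 16                            # 16 experts, 2 active, as in the spec
x = rng.normal(size=(d,))
router_w = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
y = moe_forward(x, router_w, experts, k=2)
print(y.shape)  # (32,)
```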
Developed as a research-centric artifact, Llama 4 Behemoth is optimized for complex reasoning tasks, mathematical problem-solving, and cross-modal understanding. By processing over 30 trillion tokens of diverse data, it establishes a high-capacity latent space that supports the training of highly efficient downstream models. While the model remains in a research preview status, its architectural design provides the technical foundation for the broader Llama 4 ecosystem, emphasizing scalability through sparsity and native cross-modal integration.
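The teacher role described above rests on a distillation objective that balances soft and hard targets. A common form of such a loss is sketched below; the blending weight `alpha`, temperature `T`, and the exact formulation are illustrative assumptions, since Meta has not published Behemoth's objective.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label, alpha=0.5, T=2.0):
    """Weighted blend of a soft-target term (teacher's temperature-smoothed
    distribution vs. the student's) and hard-label cross-entropy.

    alpha and T are hypothetical hyperparameters for illustration only.
    """
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    soft = -(p_teacher * log_p_student).sum() * T * T   # scaled by T^2, as is conventional
    hard = -np.log(softmax(student_logits)[hard_label])  # standard cross-entropy
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(np.array([1.0, 0.2, -0.5]),
                         np.array([2.0, 0.1, -1.0]),
                         hard_label=0)
print(loss > 0)  # True
```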
Meta's Llama 4 model family implements a Mixture-of-Experts (MoE) architecture for efficient scaling. It features native multimodality through early fusion of text, images, and video. This iteration also supports significantly extended context lengths, with models capable of processing up to 10 million tokens.
No evaluation benchmarks are available for Llama 4 Behemoth.