
Llama 4 Behemoth

Total Parameters: 2T
Context Length: 10,000K tokens (10M)
Modality: Multimodal
Architecture: Mixture of Experts (MoE)
License: Llama 4 Community License Agreement
Release Date: -
Training Data Cutoff: Aug 2024

Technical Specifications

Active Parameters: 288.0B
Number of Experts: 16
Active Experts: 2
Attention Structure: Grouped-Query Attention
Hidden Dimension Size: 16384
Number of Layers: 160
Attention Heads: 128
Key-Value Heads: 8
Activation Function: SwiGLU
Normalization: RMS Normalization

Position Embedding: Rotary Position Embedding (iRoPE, interleaved layers)
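The configuration above pairs 128 attention heads with 8 key-value heads, i.e. Grouped-Query Attention with a group size of 16. A minimal NumPy sketch of the mechanism, using toy tensor sizes rather than the model's actual dimensions:

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=2):
    """Grouped-Query Attention: many query heads share fewer K/V heads.

    Behemoth's listed config is 128 query heads over 8 KV heads
    (group size 16); small sizes are used here for illustration.
    q: (n_heads, T, d); k, v: (n_kv_heads, T, d).
    """
    group = n_heads // n_kv_heads
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group                               # query head h reads shared KV head kv
        scores = q[h] @ k[kv].T / np.sqrt(d)          # scaled dot-product scores
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)            # softmax over key positions
        out[h] = w @ v[kv]
    return out

rng = np.random.default_rng(0)
T, d = 4, 8
q = rng.normal(size=(8, T, d))
k = rng.normal(size=(2, T, d))
v = rng.normal(size=(2, T, d))
out = gqa_attention(q, k, v)
```

Because K/V tensors are stored once per group rather than once per query head, the KV cache shrinks by the group factor, which is the memory-bandwidth benefit noted in the description below.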

Llama 4 Behemoth

Llama 4 Behemoth is a large-scale multimodal foundation model developed by Meta, designed to serve as the primary teacher model within the Llama 4 family. As a non-deployed frontier model, its principal function is to generate high-quality synthetic data and provide the knowledge base for distilling smaller, production-ready variants such as Llama 4 Maverick and Scout. It integrates a native multimodal architecture capable of processing interleaved sequences of text, images, and video through an early fusion mechanism, which unifies visual and linguistic tokens within a single transformer backbone rather than utilizing separate modality-specific encoders.
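A minimal sketch of what early fusion means at the token level: both modalities are mapped into one shared id space (image patch tokens offset past the text vocabulary) so a single transformer attends over the mixed sequence. The vocabulary sizes and splice position below are placeholders, not Llama 4's actual tokenizer values:

```python
# Hypothetical sizes for illustration; the real Llama 4 tokenizer differs.
TEXT_VOCAB = 128_000          # assumed text vocabulary size
IMAGE_PATCH_VOCAB = 8_192     # assumed codebook size for image patch tokens

def fuse_early(text_ids, image_patch_ids):
    """Interleave text and image tokens into one sequence for a shared backbone.

    Image tokens are shifted past the text vocabulary so both modalities
    live in a single id space, instead of routing images through a
    separate modality-specific encoder.
    """
    image_ids = [TEXT_VOCAB + p for p in image_patch_ids]
    # Splice the image tokens after a two-token text prefix (arbitrary example).
    return text_ids[:2] + image_ids + text_ids[2:]

fused = fuse_early([1, 2, 3, 4], [10, 11])
# fused -> [1, 2, 128010, 128011, 3, 4]
```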

The model utilizes a sparse Mixture-of-Experts (MoE) architecture to achieve a total parameter count of approximately 2 trillion. During inference, the routing mechanism activates a subset of approximately 288 billion parameters across 16 experts. Technical innovations include the use of Grouped-Query Attention (GQA) to manage memory bandwidth and a training regime optimized with FP8 precision on large-scale GPU clusters. The model's architecture incorporates interleaved attention layers and a novel distillation loss function designed to balance soft and hard targets during the knowledge transfer process to student models.
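The sparse routing described above can be sketched as a generic softmax-gated MoE layer with 16 experts and 2 active per token; this is a standard top-k router for illustration, not Meta's exact routing implementation:

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Sparse MoE layer: route a token to its top-k experts.

    Matches the spec sheet's shape (16 experts, 2 active), but the
    gating details are a generic sketch. x: (d,) token activation;
    gate_w: (n_experts, d) router weights; experts: list of callables.
    """
    logits = gate_w @ x                        # one router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the top-k experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over the selected experts only
    # Only the chosen experts run, which is why active params << total params.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
gate_w = rng.normal(size=(n_experts, d))
experts = [(lambda W: (lambda v: W @ v))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
y = moe_forward(rng.normal(size=d), gate_w, experts)
```

This is how a ~2T-parameter model keeps per-token compute near 288B: the router selects 2 of 16 experts, so most expert weights are untouched for any given token.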

Developed as a research-centric artifact, Llama 4 Behemoth is optimized for complex reasoning tasks, mathematical problem-solving, and cross-modal understanding. By processing over 30 trillion tokens of diverse data, it establishes a high-capacity latent space that supports the training of highly efficient downstream models. While the model remains in a research preview status, its architectural design provides the technical foundation for the broader Llama 4 ecosystem, emphasizing scalability through sparsity and native cross-modal integration.
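The distillation loss balancing soft and hard targets is described only at a high level; a standard Hinton-style knowledge-distillation objective has this shape. The weighting `alpha` and temperature `T` below are illustrative, as Meta's exact recipe is not public:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, hard_label, alpha=0.5, T=2.0):
    """Blend a soft-target term (teacher distribution) with hard-label
    cross-entropy. Generic KD formulation, not Meta's exact loss.
    """
    p_t = softmax(teacher_logits, T)               # teacher soft targets at temperature T
    log_p_s = np.log(softmax(student_logits, T))   # student log-probs at the same T
    soft = -(p_t * log_p_s).sum() * T * T          # soft term, scaled by T^2 (gradient convention)
    hard = -np.log(softmax(student_logits)[hard_label])  # standard cross-entropy on the true label
    return alpha * soft + (1 - alpha) * hard

loss = distill_loss(np.array([2.0, 0.5, -1.0]),
                    np.array([1.5, 0.8, -0.5]), hard_label=0)
```

Student models such as Maverick and Scout would minimize a loss of this general form against Behemoth's output distributions during transfer.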

About Llama 4

Meta's Llama 4 model family implements a Mixture-of-Experts (MoE) architecture for efficient scaling. It features native multimodality through early fusion of text, images, and video. This iteration also supports significantly extended context lengths, with models capable of processing up to 10 million tokens.



Evaluation Benchmarks

No evaluation benchmarks are available for Llama 4 Behemoth.


Model Transparency

Overall Score: C+ (56 / 100)
