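Total parameters: 21B
Context length: 131,072 tokens
Modality: Text
Architecture: Mixture of Experts (MoE)
License: Apache 2.0
Release date: 30 Jun 2025
Training data cutoff: Dec 2024
Active parameters per token: 3.0B
Number of experts: 64
Active experts: 6
Attention structure: Grouped-Query Attention
Hidden dimension size: 2560
Number of layers: 28
Attention heads: 20
Key-value heads: 4
Activation function: SwiGLU
Normalization: RMS Normalization
Position embedding: Rotary Position Embedding (RoPE)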
ERNIE-4.5-21B-A3B-Base is a text-only Mixture-of-Experts (MoE) transformer in Baidu's ERNIE 4.5 model family. This variant is obtained by modality-specific extraction: the text-related parameters are isolated from a larger multimodal model that was pre-trained on trillions of tokens. The underlying architecture is a heterogeneous MoE that shares some parameters across modalities during training while keeping dedicated experts for each data type. This design is intended to keep textual representations from being degraded by joint multimodal training, so the extracted model retains strong natural language understanding and generation in both Chinese and English.
The model uses a sparse architecture with 64 experts per layer and a router that activates 6 of them for each token, so roughly 3 billion of the model's 21 billion parameters are used in each forward pass. This sparsity reduces per-token compute while retaining the representational capacity of the full 21-billion-parameter model. Attention uses Grouped-Query Attention (GQA), in which 20 query heads share 4 key-value heads; this shrinks the key-value cache and lowers memory-bandwidth pressure during inference. Rotary Position Embeddings (RoPE) and a 131,072-token context window make the model suitable for long-form documents and multi-step reasoning tasks.
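The routing and GQA figures quoted above can be made concrete with a short sketch. It is illustrative only: the softmax-over-selected-logits gating is a common top-k MoE formulation and is assumed here rather than taken from ERNIE's router, the function names `route_token` and `kv_cache_bytes` are hypothetical, and the head dimension of 128 is an assumption derived from hidden size 2560 divided by 20 attention heads.

```python
import math
import random

NUM_EXPERTS = 64  # experts per layer, as quoted above
TOP_K = 6         # experts activated per token, as quoted above


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts for one token.

    Returns a list of (expert_index, weight) pairs. Weights are a softmax
    over the selected logits only, so they sum to 1. This gating rule is an
    assumption, not ERNIE's documented router.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract for stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def kv_cache_bytes(seq_len, num_layers=28, kv_heads=4,
                   head_dim=128, bytes_per_value=2):
    """Key-value cache size for one sequence.

    The leading factor 2 counts keys and values. head_dim=128 is assumed
    (2560 / 20); bytes_per_value=2 assumes a 16-bit cache.
    """
    return 2 * num_layers * kv_heads * head_dim * bytes_per_value * seq_len


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    for expert, weight in route_token(logits):
        print(f"expert {expert:2d} weight {weight:.3f}")

    gib = 2 ** 30
    gqa = kv_cache_bytes(131072)               # 4 key-value heads
    mha = kv_cache_bytes(131072, kv_heads=20)  # one KV head per query head
    print(f"GQA cache: {gqa / gib:.1f} GiB, MHA cache: {mha / gib:.1f} GiB")
```

Under these assumptions, a full 131,072-token sequence needs a 7 GiB key-value cache with 4 key-value heads, against 35 GiB if each of the 20 query heads had its own key-value head.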
For deployment, the ERNIE 4.5 family is built on the PaddlePaddle framework and incorporates several efficiency optimizations, including FP8 mixed-precision training and multi-expert parallel collaboration. The model supports 4-bit and 2-bit lossless quantization, which reduces memory requirements and lets it run on a wider range of hardware. Modality-isolated routing and specialized router losses contribute to its parameter efficiency, making it suitable for industrial applications ranging from summarization to cross-modal reasoning in production environments.
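To give a rough sense of what lower bit widths mean for memory, the sketch below estimates weight storage at several precisions. It assumes a nominal 21 billion parameters, that every weight is stored at the stated width, and that quantization scales, zero points, activations, and the key-value cache are ignored; real quantized checkpoints will be somewhat larger. The function names are hypothetical. Note that in an MoE model all experts normally have to be resident in memory, so the estimate uses the total parameter count rather than the roughly 3 billion active per token.

```python
TOTAL_PARAMS = 21_000_000_000  # nominal 21B; the exact count is an assumption


def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight // 8


def weight_footprints(num_params=TOTAL_PARAMS, bit_widths=(16, 8, 4, 2)):
    """Map each bit width to an approximate weight footprint in GB (1e9 bytes)."""
    return {bits: weight_bytes(num_params, bits) / 1e9 for bits in bit_widths}


if __name__ == "__main__":
    for bits, gigabytes in weight_footprints().items():
        print(f"{bits:2d}-bit weights: about {gigabytes:.2f} GB")
```

Under these simplifying assumptions the weights take about 42 GB at 16 bits, 10.5 GB at 4 bits, and 5.25 GB at 2 bits.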
The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
No evaluation benchmarks are available for ERNIE-4.5-21B-A3B-Base.