Total Parameters: 424B
Active Parameters per Token: 47B
Context Length: 131,072 tokens
Modality: Multimodal
Architecture: Mixture of Experts (MoE)
License: Apache 2.0
Release Date: 30 Jun 2025
Training Data Cutoff: Jun 2025
Total Experts: 128
Active Experts per Token: 16
Attention Structure: Grouped-Query Attention
Hidden Dimension Size: -
Layers: 54
Attention Heads: 64
Key-Value Heads: 8
Activation Function: -
Normalization: RMS Normalization
Position Embedding: Absolute Position Embedding
ERNIE-4.5-VL-424B-A47B is a multimodal foundation model developed by Baidu, representing the flagship variant of the ERNIE 4.5 family. It is engineered to process and generate content across textual and visual modalities using a large-scale Mixture of Experts (MoE) framework. By integrating 424 billion total parameters with a sparse activation of 47 billion parameters per token, the model maintains high-capacity representation while optimizing computational throughput. Its design facilitates applications requiring advanced logic, comprehensive document analysis, and sophisticated multimodal conversational interactions.
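The total-versus-active split can be made concrete with a little arithmetic. The figures below come from this card; the sketch itself is purely illustrative and is not drawn from Baidu's codebase:

```python
# Illustrative arithmetic using the figures on this card (a sketch, not
# anything from Baidu's implementation).
total_params = 424e9    # total parameters (424B)
active_params = 47e9    # parameters activated per token (47B)

# Fraction of the model that participates in each forward pass.
active_fraction = active_params / total_params
print(f"Active fraction per token: {active_fraction:.1%}")  # 11.1%

# Expert routing: 16 of 128 experts fire per token.
experts_total, experts_active = 128, 16
print(f"Experts active per token: {experts_active}/{experts_total} "
      f"({experts_active / experts_total:.1%})")  # 16/128 (12.5%)
```

In other words, each token touches roughly a ninth of the stored weights, which is what keeps per-token compute far below that of a dense 424B model.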
The model employs a heterogeneous MoE architecture that differentiates between text and vision processing while maintaining a unified hidden state. It incorporates 128 experts in total, including 64 specialized experts for text and 64 for vision, with a routing mechanism that selects 8 active experts per modality for each token. To ensure effective cross-modal integration without performance degradation in specific domains, the system utilizes shared self-attention layers and shared experts alongside modality-isolated routing. The attention mechanism is based on Grouped Query Attention (GQA) with 64 heads and 8 key-value heads, optimized for a context window of 131,072 tokens.
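The per-token expert selection described above can be sketched as a generic top-k softmax router. This is a minimal illustration under common MoE conventions, not ERNIE's actual gating code; the 64-expert pool size matches the card, but the logits here are hypothetical:

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(token_logits, k=8):
    """Pick the top-k experts for one token and renormalize their gates.

    token_logits: one router score per expert in this modality's pool.
    Returns a list of (expert_index, gate_weight) pairs.
    """
    probs = softmax(token_logits)
    topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in topk)
    return [(i, probs[i] / norm) for i in topk]

# Hypothetical text token scored against a 64-expert text pool.
random.seed(0)
logits = [random.gauss(0, 1) for _ in range(64)]
selected = route(logits, k=8)
assert len(selected) == 8
assert abs(sum(w for _, w in selected) - 1.0) < 1e-9
```

Under the modality-isolated scheme described above, text and vision tokens would each run a router of this shape against their own 64-expert pool, so 8 experts fire per modality and 16 in total, matching the card's figures.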
Training and inference are facilitated by the PaddlePaddle deep learning framework, supporting industrial-grade deployment through 4-bit and 2-bit lossless quantization. The architecture supports two distinct operational modes: a standard inference mode for rapid perception tasks and a reasoning-heavy mode for complex logical problems. Primary use cases involve visual question answering, complex chart and document interpretation, and automated multimodal content generation. The inclusion of 2D rotary position embeddings (RoPE) in the vision encoder and absolute position embeddings in the transformer backbone ensures precise spatial and sequential modeling across diverse input types.
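Low-bit deployment of this kind typically builds on grouped weight quantization. The following is a minimal round-to-nearest int4 sketch under that assumption; the "lossless" 4-bit and 2-bit schemes mentioned above are considerably more elaborate, and nothing here reflects PaddlePaddle's actual implementation:

```python
def quantize_4bit(weights, group_size=4):
    """Per-group symmetric round-to-nearest int4 quantization (sketch only).

    Each group of weights shares one float scale; values are mapped to the
    signed 4-bit range and reconstructed as quantized_value * scale.
    """
    quantized, scales = [], []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        scale = max(abs(x) for x in group) / 7 or 1.0  # map into [-7, 7]
        scales.append(scale)
        quantized.append([max(-8, min(7, round(x / scale))) for x in group])
    return quantized, scales

def dequantize(quantized, scales):
    # Flatten groups back into a single list of reconstructed floats.
    return [q * s for group, s in zip(quantized, scales) for q in group]

# Toy weight vector; real schemes operate on full weight matrices.
w = [0.12, -0.5, 0.33, 0.07, 1.2, -0.9, 0.0, 0.45]
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
```

The per-group scale bounds the reconstruction error at half a quantization step, which is why small group sizes (at the cost of more stored scales) keep low-bit inference accurate.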
The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
No evaluation benchmarks are available for ERNIE-4.5-VL-424B-A47B.