Total Parameters
28B
Context Length
131,072
Modality
Multimodal
Architecture
Mixture of Experts (MoE)
License
Apache 2.0
Release Date
30 Jun 2025
Training Data Cutoff
Nov 2024
Active Parameters
3.0B
Number of Experts
130
Active Experts
14
Attention Structure
Grouped-Query Attention
Hidden Dimension Size
-
Number of Layers
28
Attention Heads
20
Key-Value Heads
4
Activation Function
SwiGLU
Normalization
RMS Normalization
Position Embedding
Rotary Position Embedding (RoPE)
ERNIE-4.5-VL-28B-A3B-Base is a multimodal Mixture-of-Experts (MoE) foundation model developed by Baidu as part of the ERNIE 4.5 model family. Specifically engineered for sophisticated vision-language tasks, the model integrates 28 billion total parameters while activating only 3 billion parameters per token during inference. This sparse activation strategy allows the model to maintain the extensive knowledge capacity of a larger system while significantly reducing the computational overhead and latency typically associated with high-parameter models. It is designed to process and synthesize information across multiple modalities, including text, images, and video, supporting a substantial context length of up to 131,072 tokens.
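The sparse-activation idea above can be sketched in a few lines: a router scores every expert for each token, but only the top-k experts actually run. This is a minimal NumPy illustration of top-k expert routing in general; the expert count, top-k value, and dimensions here are toy values, not the model's actual configuration.

```python
import numpy as np

def moe_forward(x, experts, router_w, top_k=2):
    """Route one token through a sparse Mixture-of-Experts layer.

    x:        (d,) token hidden state
    experts:  list of (d, d) expert weight matrices
    router_w: (n_experts, d) router projection
    top_k:    number of experts activated for this token
    """
    logits = router_w @ x                      # score every expert
    top = np.argsort(logits)[-top_k:]          # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                       # softmax over the selected experts only
    # Only top_k expert matrices are ever multiplied: compute scales with k,
    # while total parameter count scales with n_experts.
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router_w = rng.standard_normal((n_experts, d))
y = moe_forward(rng.standard_normal(d), experts, router_w, top_k=2)
print(y.shape)  # (8,)
```

With 16 experts and top-2 routing, the layer stores 16 expert matrices but performs only 2 expert multiplications per token; this is the same mechanism that lets the 28B-parameter model activate roughly 3B parameters per token.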
The technical architecture of the ERNIE-4.5-VL series introduces a heterogeneous MoE structure that facilitates both parameter sharing across modalities and the use of dedicated parameters for individual modalities. Key innovations include modality-isolated routing, which prevents interference between textual and visual learning, as well as router orthogonal loss and multimodal token-balanced loss mechanisms to ensure stable expert utilization. The model employs Grouped-Query Attention (GQA) for efficient memory management and utilizes Rotary Position Embeddings (RoPE) to handle extended context windows. Training is conducted within the PaddlePaddle deep learning framework using advanced parallelization strategies, including intra-node expert parallelism and FP8 mixed-precision training.
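Grouped-Query Attention, mentioned above, reduces KV-cache memory by letting several query heads share one key/value head. The sketch below uses the head counts from the spec table (20 query heads, 4 KV heads, so 5 queries per KV group); the sequence length and head dimension are illustrative, and RoPE is omitted for brevity.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Grouped-Query Attention: groups of query heads share one KV head.

    q:    (n_q_heads, seq, hd)  per-head queries
    k, v: (n_kv_heads, seq, hd) shared keys/values (fewer heads than q)
    """
    n_q_heads, n_kv_heads = q.shape[0], k.shape[0]
    group = n_q_heads // n_kv_heads            # query heads per KV head
    hd = q.shape[-1]
    outs = []
    for h in range(n_q_heads):
        kv = h // group                        # map query head -> its KV head
        scores = q[h] @ k[kv].T / np.sqrt(hd)  # (seq, seq) attention scores
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)     # softmax over key positions
        outs.append(w @ v[kv])
    return np.stack(outs)                      # (n_q_heads, seq, hd)

rng = np.random.default_rng(1)
seq, hd = 4, 16
q = rng.standard_normal((20, seq, hd))
k = rng.standard_normal((4, seq, hd))
v = rng.standard_normal((4, seq, hd))
out = grouped_query_attention(q, k, v)
print(out.shape)  # (20, 4, 16)
```

Because only 4 KV heads are cached instead of 20, the KV cache shrinks by 5x relative to standard multi-head attention, which matters for the 131,072-token context window.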
In operation, the ERNIE-4.5-VL-28B-A3B-Base serves as a versatile backbone for applications requiring high-fidelity cross-modal reasoning. It supports distinct functional modes, including a "thinking" mode for enhanced logical reasoning and a "non-thinking" mode optimized for perceptual tasks such as document analysis, optical character recognition (OCR), and visual knowledge retrieval. Its capabilities extend to agentic interactions, where it can utilize external tools for fine-grained image zooming or search. The model is released with open weights under the Apache 2.0 license, providing a flexible resource for developers and researchers to deploy multimodal solutions across various hardware platforms.
The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
No evaluation benchmarks are available for ERNIE-4.5-VL-28B-A3B-Base.