Total Parameters: 28B
Context Length: 131,072 tokens (128K)
Modality: Multimodal
Architecture: Mixture of Experts (MoE)
License: Apache 2.0
Release Date: 30 Jun 2025
Knowledge Cutoff: -
Active Parameters: 3.0B
Number of Experts: 130
Active Experts: 14
Attention Structure: Grouped-Query Attention
Hidden Dimension Size: -
Number of Layers: 28
Attention Heads: 20
Key-Value Heads: 4
Activation Function: -
Normalization: -
Position Embedding: Absolute Position Embedding
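The attention rows above (20 query heads sharing 4 key-value heads) describe grouped-query attention with a 5:1 grouping. Below is a minimal PyTorch sketch of that sharing pattern; the head dimension and hidden size are assumed placeholders, since the card lists the hidden dimension as "-":

```python
import torch
import torch.nn.functional as F

# Illustrative grouped-query attention matching the table's head counts.
# head_dim (and thus hidden_dim) is an assumption for this sketch only.
num_heads, num_kv_heads, head_dim = 20, 4, 64
hidden_dim = num_heads * head_dim

q_proj = torch.nn.Linear(hidden_dim, num_heads * head_dim)
k_proj = torch.nn.Linear(hidden_dim, num_kv_heads * head_dim)
v_proj = torch.nn.Linear(hidden_dim, num_kv_heads * head_dim)

x = torch.randn(1, 16, hidden_dim)  # (batch, seq, hidden)
B, T, _ = x.shape
q = q_proj(x).view(B, T, num_heads, head_dim).transpose(1, 2)
k = k_proj(x).view(B, T, num_kv_heads, head_dim).transpose(1, 2)
v = v_proj(x).view(B, T, num_kv_heads, head_dim).transpose(1, 2)

# Each group of 5 query heads attends against the same shared K/V head,
# shrinking the KV cache 5x versus standard multi-head attention.
k = k.repeat_interleave(num_heads // num_kv_heads, dim=1)
v = v.repeat_interleave(num_heads // num_kv_heads, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
out = out.transpose(1, 2).reshape(B, T, num_heads * head_dim)
```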
VRAM requirements for different quantization methods and context sizes
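As a rough back-of-the-envelope sketch, weight memory scales linearly with quantization bit width. The estimates below cover only the 28B parameters; because this is an MoE model, all experts must typically be resident even though only ~3B are active per token, and KV cache plus activations add further memory that grows with context length:

```python
# Approximate weight memory for a 28B-parameter checkpoint at common
# bit widths. Excludes KV cache, activations, and runtime overhead.
params = 28e9
for name, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4), ("2-bit", 2)]:
    gib = params * bits / 8 / 2**30
    print(f"{name:>5}: ~{gib:.0f} GiB")
# bf16: ~52 GiB, int8: ~26 GiB, 4-bit: ~13 GiB, 2-bit: ~7 GiB
```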
ERNIE-4.5-VL-28B-A3B is a member of the Baidu ERNIE 4.5 model family, a recent collection of large-scale multimodal foundation models. This variant is a lightweight vision-language model engineered to process both textual and visual inputs, targeting advanced multimodal understanding: image comprehension, text generation informed by visual context, and cross-modal reasoning. The model aims to balance capability with computational efficiency, making it suitable for enterprise applications and real-world deployment scenarios that require robust multimodal capabilities.
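As an illustrative sketch of such a deployment, the snippet below loads the model with Hugging Face transformers. The repository id (baidu/ERNIE-4.5-VL-28B-A3B-PT) and the message schema are assumptions based on common conventions for vision-language checkpoints; the official model card should be treated as authoritative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed repo id -- verify against Baidu's official Hugging Face page.
repo = "baidu/ERNIE-4.5-VL-28B-A3B-PT"

processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,  # bf16 keeps the 28B weights at ~52 GiB
    device_map="auto",
    trust_remote_code=True,      # the checkpoint ships custom modeling code
)

# One image plus a text question, in the OpenAI-style chat format many VLM
# processors accept; the exact schema for this model may differ.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        {"type": "text", "text": "Summarize the trend shown in this chart."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:]))
```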
Architecturally, ERNIE-4.5-VL-28B-A3B is constructed upon a fine-grained Mixture-of-Experts (MoE) backbone, a key innovation across the ERNIE 4.5 series. This heterogeneous MoE structure facilitates joint training on textual and visual modalities. It incorporates modality-isolated routing and employs techniques such as router orthogonal loss and multimodal token-balanced loss to prevent interference between modalities and ensure effective representation and mutual reinforcement during training. The model further benefits from modality-specific post-training optimizations, including supervised fine-tuning, direct preference optimization, and Reinforcement Learning with Verifiable Rewards (RLVR), to enhance its performance in vision-language tasks. Visual inputs are processed by a variable-resolution Vision Transformer (ViT) encoder, with representations then projected into a shared embedding space via an adapter.
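The following toy sketch illustrates the modality-isolated routing idea: text tokens are routed only over text experts, vision tokens only over vision experts, and a small set of shared experts sees every token. Expert counts, dimensions, and the dense dispatch are illustrative simplifications, not ERNIE 4.5's actual implementation, and the auxiliary router-orthogonality and token-balancing losses are omitted:

```python
import torch
import torch.nn as nn

class ModalityIsolatedMoE(nn.Module):
    """Toy modality-isolated MoE layer. Sizes are illustrative only."""

    def __init__(self, dim=64, n_text=4, n_vision=4, n_shared=1, top_k=2):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                    nn.Linear(4 * dim, dim))
        self.text_experts = nn.ModuleList(ffn() for _ in range(n_text))
        self.vision_experts = nn.ModuleList(ffn() for _ in range(n_vision))
        self.shared_experts = nn.ModuleList(ffn() for _ in range(n_shared))
        self.text_router = nn.Linear(dim, n_text)
        self.vision_router = nn.Linear(dim, n_vision)
        self.top_k = top_k

    def _route(self, x, router, experts):
        # Dense toy dispatch: keep each token's top-k gates, zero the rest.
        logits = router(x)                                   # (tokens, experts)
        topk = logits.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(logits).scatter(
            -1, topk.indices, topk.values.softmax(-1))
        return sum(gates[:, i:i + 1] * e(x) for i, e in enumerate(experts))

    def forward(self, x, is_vision):
        # is_vision: boolean mask over tokens, shape (tokens,).
        out = torch.zeros_like(x)
        if (~is_vision).any():  # text tokens never see vision experts
            out[~is_vision] = self._route(x[~is_vision],
                                          self.text_router, self.text_experts)
        if is_vision.any():     # and vice versa
            out[is_vision] = self._route(x[is_vision],
                                         self.vision_router, self.vision_experts)
        # Shared experts process every token, enabling cross-modal transfer.
        return out + sum(e(x) for e in self.shared_experts)

x = torch.randn(10, 64)
mask = torch.tensor([False] * 6 + [True] * 4)  # 6 text tokens, 4 image tokens
y = ModalityIsolatedMoE()(x, mask)
```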
In terms of performance, ERNIE-4.5-VL-28B-A3B supports both "thinking" and "non-thinking" modes, offering a choice between deliberate step-by-step reasoning and faster direct responses. The model demonstrates strong visual perception, document and chart understanding, and visual knowledge. Efficient inference is achieved through multi-expert parallel collaboration and convolutional code quantization, enabling 4-bit/2-bit lossless quantization for deployment across a range of hardware platforms. With its 131,072-token context window, the model handles long-form inputs, supporting extended conversations and complex reasoning that combines textual knowledge with visual perception.
The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
No evaluation benchmarks are available for ERNIE-4.5-VL-28B-A3B.