
ERNIE-4.5-VL-28B-A3B-Base

Total Parameters

28B

Context Length

131,072 tokens (128K)

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

30 Jun 2025

Training Data Cutoff

Nov 2024

Technical Specifications

Active Parameters (per token)

3.0B

Number of Experts

130

Active Experts

14

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

-

Number of Layers

28

Attention Heads

20

Key-Value Heads

4

Activation Function

SwiGLU

Normalization

RMS Normalization

Position Embedding

Rotary Position Embedding (RoPE)

ERNIE-4.5-VL-28B-A3B-Base

ERNIE-4.5-VL-28B-A3B-Base is a multimodal Mixture-of-Experts (MoE) foundation model developed by Baidu as part of the ERNIE 4.5 model family. Specifically engineered for sophisticated vision-language tasks, the model integrates 28 billion total parameters while activating only 3 billion parameters per token during inference. This sparse activation strategy allows the model to maintain the extensive knowledge capacity of a larger system while significantly reducing the computational overhead and latency typically associated with high-parameter models. It is designed to process and synthesize information across multiple modalities, including text, images, and video, supporting a substantial context length of up to 131,072 tokens.
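The compute saving from sparse activation can be made concrete with back-of-the-envelope arithmetic. The sketch below uses only the figures stated above (28B total, ~3B active per token) and the common rule of thumb that per-token forward FLOPs scale as roughly 2 × (active parameters); the constant factor is an approximation, not a measured number for this model.

```python
# Rough per-token compute comparison: dense vs. sparse (MoE) activation.
# Figures from the model card: 28B total parameters, ~3B active per token.
total_params = 28e9
active_params = 3e9

# Rule of thumb: per-token forward FLOPs ~ 2 * params touched.
dense_flops_per_token = 2 * total_params   # hypothetical dense 28B model
moe_flops_per_token = 2 * active_params    # MoE activating ~3B per token

savings = 1 - moe_flops_per_token / dense_flops_per_token
print(f"Active fraction: {active_params / total_params:.1%}")
print(f"Approx. per-token FLOP reduction vs. dense 28B: {savings:.1%}")
```

Only about 11% of the parameters participate in any single token's forward pass, which is where the latency advantage over a dense 28B model comes from.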

The technical architecture of the ERNIE-4.5-VL series introduces a heterogeneous MoE structure that facilitates both parameter sharing across modalities and the use of dedicated parameters for individual modalities. Key innovations include modality-isolated routing, which prevents interference between textual and visual learning, as well as router orthogonal loss and multimodal token-balanced loss mechanisms to ensure stable expert utilization. The model employs Grouped-Query Attention (GQA) for efficient memory management and utilizes Rotary Position Embeddings (RoPE) to handle extended context windows. Training is conducted within the PaddlePaddle deep learning framework using advanced parallelization strategies, including intra-node expert parallelism and FP8 mixed-precision training.
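To illustrate the modality-isolated routing idea, the toy sketch below keeps separate router weights and expert pools for text and vision tokens, so a token is only ever scored against its own modality's experts. All sizes, names, and the top-k choice are illustrative assumptions, not the model's real configuration or implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy modality-isolated MoE router: text tokens are routed only over the
# text expert pool, vision tokens only over the vision pool, so updates
# driven by one modality cannot disturb the other's experts.
# Sizes are illustrative, not ERNIE-4.5's actual dimensions.
d_model = 16
n_text_experts, n_vision_experts, top_k = 4, 4, 2

w_text = rng.normal(size=(d_model, n_text_experts))      # text router weights
w_vision = rng.normal(size=(d_model, n_vision_experts))  # vision router weights

def route(token, modality):
    """Return (top-k expert indices, normalized gate weights) for one token."""
    w = w_text if modality == "text" else w_vision
    logits = token @ w
    topk = np.argsort(logits)[-top_k:][::-1]   # highest-scoring experts first
    gates = np.exp(logits[topk] - logits[topk].max())
    return topk, gates / gates.sum()           # softmax over the selected k

token = rng.normal(size=d_model)
experts, gates = route(token, "text")
print("text token -> experts", experts, "gates", np.round(gates, 3))
```

In the real model the router is trained with the auxiliary losses mentioned above (router orthogonal loss, multimodal token-balanced loss) to keep expert utilization stable; those are omitted here for brevity.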

In operation, ERNIE-4.5-VL-28B-A3B-Base serves as a versatile backbone for applications requiring high-fidelity cross-modal reasoning. It supports distinct functional modes, including a "thinking" mode for enhanced logical reasoning and a "non-thinking" mode optimized for perceptual tasks such as document analysis, optical character recognition (OCR), and visual knowledge retrieval. Its capabilities extend to agentic interactions, where it can invoke external tools for fine-grained image zooming or search. The model is released with open weights under the Apache 2.0 license, providing a flexible resource for developers and researchers to deploy multimodal solutions across various hardware platforms.

About ERNIE 4.5

The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.


Other ERNIE 4.5 Models

Evaluation Benchmarks

No evaluation benchmarks are available for ERNIE-4.5-VL-28B-A3B-Base.

Rankings

Overall Rank

-

Coding Rank

-

Model Transparency

Overall Score

B

67 / 100
