Total Parameters: 28B
Context Length: 131,072 tokens (128K)
Modality: Multimodal
Architecture: Mixture of Experts (MoE)
License: Apache 2.0
Release Date: 30 Jun 2025
Knowledge Cutoff: -
Active Parameters: 3.0B
Number of Experts: 130
Active Experts: 14
Attention Structure: Grouped-Query Attention
Hidden Dimension Size: -
Number of Layers: 28
Attention Heads: 20
Key-Value Heads: 4
Activation Function: -
Normalization: -
Position Embedding: Absolute Position Embedding
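The attention rows above (20 query heads sharing 4 key-value heads) describe grouped-query attention with a 5:1 grouping. Below is a minimal PyTorch sketch of that sharing pattern; the head dimension and hidden size are assumed placeholders, since the card lists the hidden dimension as "-":

```python
import torch
import torch.nn.functional as F

# Illustrative grouped-query attention matching the table's head counts.
# head_dim (and thus hidden_dim) is an assumption for this sketch only.
num_heads, num_kv_heads, head_dim = 20, 4, 64
hidden_dim = num_heads * head_dim

q_proj = torch.nn.Linear(hidden_dim, num_heads * head_dim)
k_proj = torch.nn.Linear(hidden_dim, num_kv_heads * head_dim)
v_proj = torch.nn.Linear(hidden_dim, num_kv_heads * head_dim)

x = torch.randn(1, 16, hidden_dim)  # (batch, seq, hidden)
B, T, _ = x.shape
q = q_proj(x).view(B, T, num_heads, head_dim).transpose(1, 2)
k = k_proj(x).view(B, T, num_kv_heads, head_dim).transpose(1, 2)
v = v_proj(x).view(B, T, num_kv_heads, head_dim).transpose(1, 2)

# Each group of 5 query heads attends against the same shared K/V head,
# shrinking the KV cache 5x versus standard multi-head attention.
k = k.repeat_interleave(num_heads // num_kv_heads, dim=1)
v = v.repeat_interleave(num_heads // num_kv_heads, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
out = out.transpose(1, 2).reshape(B, T, num_heads * head_dim)
```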
VRAM requirements for different quantization methods and context sizes
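As a rough back-of-the-envelope sketch, weight memory scales linearly with quantization bit width. The estimates below cover only the 28B parameters; because this is an MoE model, all experts must typically be resident even though only ~3B are active per token, and KV cache plus activations add further memory that grows with context length:

```python
# Approximate weight memory for a 28B-parameter checkpoint at common
# bit widths. Excludes KV cache, activations, and runtime overhead.
params = 28e9
for name, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4), ("2-bit", 2)]:
    gib = params * bits / 8 / 2**30
    print(f"{name:>5}: ~{gib:.0f} GiB")
# bf16: ~52 GiB, int8: ~26 GiB, 4-bit: ~13 GiB, 2-bit: ~7 GiB
```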
ERNIE-4.5-VL-28B-A3B is a member of the Baidu ERNIE 4.5 model family, a recent collection of large-scale multimodal foundation models. This variant is a lightweight vision-language model engineered to process both textual and visual inputs, targeting advanced multimodal understanding: image comprehension, text generation informed by visual context, and cross-modal reasoning. The model aims to balance capability with computational efficiency, making it suitable for enterprise applications and real-world deployment scenarios that require robust multimodal capabilities.
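As an illustrative sketch of such a deployment, the snippet below loads the model with Hugging Face transformers. The repository id (baidu/ERNIE-4.5-VL-28B-A3B-PT) and the message schema are assumptions based on common conventions for vision-language checkpoints; the official model card should be treated as authoritative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed repo id -- verify against Baidu's official Hugging Face page.
repo = "baidu/ERNIE-4.5-VL-28B-A3B-PT"

processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,  # bf16 keeps the 28B weights at ~52 GiB
    device_map="auto",
    trust_remote_code=True,      # the checkpoint ships custom modeling code
)

# One image plus a text question, in the OpenAI-style chat format many VLM
# processors accept; the exact schema for this model may differ.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        {"type": "text", "text": "Summarize the trend shown in this chart."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:]))
```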
Architecturally, ERNIE-4.5-VL-28B-A3B is constructed upon a fine-grained Mixture-of-Experts (MoE) backbone, a key innovation across the ERNIE 4.5 series. This heterogeneous MoE structure facilitates joint training on textual and visual modalities. It incorporates modality-isolated routing and employs techniques such as router orthogonal loss and multimodal token-balanced loss to prevent interference between modalities and ensure effective representation and mutual reinforcement during training. The model further benefits from modality-specific post-training optimizations, including supervised fine-tuning, direct preference optimization, and Reinforcement Learning with Verifiable Rewards (RLVR), to enhance its performance in vision-language tasks. Visual inputs are processed by a variable-resolution Vision Transformer (ViT) encoder, with representations then projected into a shared embedding space via an adapter.
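The following toy sketch illustrates the modality-isolated routing idea: text tokens are routed only over text experts, vision tokens only over vision experts, and a small set of shared experts sees every token. Expert counts, dimensions, and the dense dispatch are illustrative simplifications, not ERNIE 4.5's actual implementation, and the auxiliary router-orthogonality and token-balancing losses are omitted:

```python
import torch
import torch.nn as nn

class ModalityIsolatedMoE(nn.Module):
    """Toy modality-isolated MoE layer. Sizes are illustrative only."""

    def __init__(self, dim=64, n_text=4, n_vision=4, n_shared=1, top_k=2):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                    nn.Linear(4 * dim, dim))
        self.text_experts = nn.ModuleList(ffn() for _ in range(n_text))
        self.vision_experts = nn.ModuleList(ffn() for _ in range(n_vision))
        self.shared_experts = nn.ModuleList(ffn() for _ in range(n_shared))
        self.text_router = nn.Linear(dim, n_text)
        self.vision_router = nn.Linear(dim, n_vision)
        self.top_k = top_k

    def _route(self, x, router, experts):
        # Dense toy dispatch: keep each token's top-k gates, zero the rest.
        logits = router(x)                                   # (tokens, experts)
        topk = logits.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(logits).scatter(
            -1, topk.indices, topk.values.softmax(-1))
        return sum(gates[:, i:i + 1] * e(x) for i, e in enumerate(experts))

    def forward(self, x, is_vision):
        # is_vision: boolean mask over tokens, shape (tokens,).
        out = torch.zeros_like(x)
        if (~is_vision).any():  # text tokens never see vision experts
            out[~is_vision] = self._route(x[~is_vision],
                                          self.text_router, self.text_experts)
        if is_vision.any():     # and vice versa
            out[is_vision] = self._route(x[is_vision],
                                         self.vision_router, self.vision_experts)
        # Shared experts process every token, enabling cross-modal transfer.
        return out + sum(e(x) for e in self.shared_experts)

x = torch.randn(10, 64)
mask = torch.tensor([False] * 6 + [True] * 4)  # 6 text tokens, 4 image tokens
y = ModalityIsolatedMoE()(x, mask)
```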
In terms of performance, ERNIE-4.5-VL-28B-A3B supports both "thinking" and "non-thinking" modes, offering a choice between deliberate step-by-step reasoning and faster direct responses. The model demonstrates strong visual perception, document and chart understanding, and visual knowledge. Efficient inference is achieved through multi-expert parallel collaboration and convolutional code quantization, enabling 4-bit/2-bit lossless quantization for deployment across a range of hardware platforms. With its 131,072-token context window, the model handles long-form inputs, supporting extended conversations and complex reasoning that combines textual knowledge with visual perception.
The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
No evaluation benchmarks are available for ERNIE-4.5-VL-28B-A3B.