Total Parameters: 424B
Active Parameters per Token: 47B
Context Length: 131,072 tokens
Modality: Multimodal
Architecture: Mixture of Experts (MoE)
License: Apache 2.0
Release Date: 30 Jun 2025
Training Data Cutoff: Jun 2025
Total Experts: 128
Active Experts per Token: 16
Attention Structure: Grouped-Query Attention
Hidden Dimension Size: -
Layers: 54
Attention Heads: 64
Key-Value Heads: 8
Activation Function: -
Normalization: RMS Normalization
Position Embedding: Absolute Position Embedding
ERNIE-4.5-VL-424B-A47B is a multimodal foundation model developed by Baidu, representing the flagship variant of the ERNIE 4.5 family. It is engineered to process and generate content across textual and visual modalities using a large-scale Mixture of Experts (MoE) framework. By integrating 424 billion total parameters with a sparse activation of 47 billion parameters per token, the model maintains high-capacity representation while optimizing computational throughput. Its design facilitates applications requiring advanced logic, comprehensive document analysis, and sophisticated multimodal conversational interactions.
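The total-versus-active split can be made concrete with a little arithmetic. The figures below come from this card; the sketch itself is purely illustrative and is not drawn from Baidu's codebase:

```python
# Illustrative arithmetic using the figures on this card (a sketch, not
# anything from Baidu's implementation).
total_params = 424e9    # total parameters (424B)
active_params = 47e9    # parameters activated per token (47B)

# Fraction of the model that participates in each forward pass.
active_fraction = active_params / total_params
print(f"Active fraction per token: {active_fraction:.1%}")  # 11.1%

# Expert routing: 16 of 128 experts fire per token.
experts_total, experts_active = 128, 16
print(f"Experts active per token: {experts_active}/{experts_total} "
      f"({experts_active / experts_total:.1%})")  # 16/128 (12.5%)
```

In other words, each token touches roughly a ninth of the stored weights, which is what keeps per-token compute far below that of a dense 424B model.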
The model employs a heterogeneous MoE architecture that differentiates between text and vision processing while maintaining a unified hidden state. It incorporates 128 experts in total, including 64 specialized experts for text and 64 for vision, with a routing mechanism that selects 8 active experts per modality for each token. To ensure effective cross-modal integration without performance degradation in specific domains, the system utilizes shared self-attention layers and shared experts alongside modality-isolated routing. The attention mechanism is based on Grouped Query Attention (GQA) with 64 heads and 8 key-value heads, optimized for a context window of 131,072 tokens.
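The per-token expert selection described above can be sketched as a generic top-k softmax router. This is a minimal illustration under common MoE conventions, not ERNIE's actual gating code; the 64-expert pool size matches the card, but the logits here are hypothetical:

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(token_logits, k=8):
    """Pick the top-k experts for one token and renormalize their gates.

    token_logits: one router score per expert in this modality's pool.
    Returns a list of (expert_index, gate_weight) pairs.
    """
    probs = softmax(token_logits)
    topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in topk)
    return [(i, probs[i] / norm) for i in topk]

# Hypothetical text token scored against a 64-expert text pool.
random.seed(0)
logits = [random.gauss(0, 1) for _ in range(64)]
selected = route(logits, k=8)
assert len(selected) == 8
assert abs(sum(w for _, w in selected) - 1.0) < 1e-9
```

Under the modality-isolated scheme described above, text and vision tokens would each run a router of this shape against their own 64-expert pool, so 8 experts fire per modality and 16 in total, matching the card's figures.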
Training and inference are facilitated by the PaddlePaddle deep learning framework, supporting industrial-grade deployment through 4-bit and 2-bit lossless quantization. The architecture supports two distinct operational modes: a standard inference mode for rapid perception tasks and a reasoning-heavy mode for complex logical problems. Primary use cases involve visual question answering, complex chart and document interpretation, and automated multimodal content generation. The inclusion of 2D rotary position embeddings (RoPE) in the vision encoder and absolute position embeddings in the transformer backbone ensures precise spatial and sequential modeling across diverse input types.
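Low-bit deployment of this kind typically builds on grouped weight quantization. The following is a minimal round-to-nearest int4 sketch under that assumption; the "lossless" 4-bit and 2-bit schemes mentioned above are considerably more elaborate, and nothing here reflects PaddlePaddle's actual implementation:

```python
def quantize_4bit(weights, group_size=4):
    """Per-group symmetric round-to-nearest int4 quantization (sketch only).

    Each group of weights shares one float scale; values are mapped to the
    signed 4-bit range and reconstructed as quantized_value * scale.
    """
    quantized, scales = [], []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        scale = max(abs(x) for x in group) / 7 or 1.0  # map into [-7, 7]
        scales.append(scale)
        quantized.append([max(-8, min(7, round(x / scale))) for x in group])
    return quantized, scales

def dequantize(quantized, scales):
    # Flatten groups back into a single list of reconstructed floats.
    return [q * s for group, s in zip(quantized, scales) for q in group]

# Toy weight vector; real schemes operate on full weight matrices.
w = [0.12, -0.5, 0.33, 0.07, 1.2, -0.9, 0.0, 0.45]
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
```

The per-group scale bounds the reconstruction error at half a quantization step, which is why small group sizes (at the cost of more stored scales) keep low-bit inference accurate.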
The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
No evaluation benchmarks are available for ERNIE-4.5-VL-424B-A47B.