
ERNIE-4.5-VL-424B-A47B

Total Parameters

424B

Context Length

131,072 tokens

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

30 Jun 2025

Knowledge Cutoff

Jun 2025

Technical Specifications

Active Parameters

47.0B

Number of Experts

128

Active Experts

16

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

-

Number of Layers

54

Attention Heads

64

Key-Value Heads

8

Activation Function

-

Normalization

RMS Normalization

Position Embedding

Absolute Position Embedding
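The attention figures above (64 attention heads, 8 key-value heads) mean each KV head is shared by a group of 8 query heads, shrinking the KV cache eightfold. A minimal NumPy sketch of that grouping, with toy head and sequence dimensions chosen purely for illustration:

```python
import numpy as np

# Head counts from the spec table; head_dim and seq are toy values for illustration.
num_q_heads, num_kv_heads, head_dim, seq = 64, 8, 128, 4
group = num_q_heads // num_kv_heads  # 8 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((num_q_heads, seq, head_dim))
k = rng.standard_normal((num_kv_heads, seq, head_dim))
v = rng.standard_normal((num_kv_heads, seq, head_dim))

# Grouped-Query Attention: replicate each KV head across its query group,
# so the KV cache stores 8 heads instead of 64.
k_exp = np.repeat(k, group, axis=0)              # (64, seq, head_dim)
v_exp = np.repeat(v, group, axis=0)

scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)        # softmax over keys
out = weights @ v_exp                            # (64, seq, head_dim)
```

This is a shape-level sketch only; the production kernels fuse these steps and never materialize the replicated KV tensors.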


ERNIE-4.5-VL-424B-A47B

ERNIE 4.5 is a large-scale multimodal foundation model family developed by Baidu, designed to integrate and process information across both textual and visual modalities. The ERNIE-4.5-VL-424B-A47B variant is specifically engineered for advanced comprehension and generation capabilities, supporting applications that demand complex understanding and creative output from diverse data types. These applications include sophisticated conversational AI, multimodal content creation, and intelligent analysis systems, all aiming to provide high performance across a wide spectrum of tasks.

This model variant employs a heterogeneous Mixture of Experts (MoE) architecture, comprising 424 billion total parameters with 47 billion activated per token. Its key architectural innovation is a design that shares parameters across modalities while also reserving dedicated expert parameters for each individual modality, enhancing multimodal understanding without compromising performance on text-only tasks. The model features 54 layers and uses Grouped Query Attention (GQA) with 64 attention heads and 8 key-value heads. Its positional encoding integrates multimodal positional embeddings for unified hidden states and incorporates 2D rotary position embedding (RoPE) within the vision encoder. The system routes text and vision features to distinct sets of experts while using shared experts and self-attention layers for all tokens, facilitating cross-modal knowledge integration. The architecture includes 64 distinct text experts and 64 distinct vision experts, with 8 active experts selected for each modality per token. It further incorporates modality-isolated routing, router orthogonal loss, and multimodal token-balanced loss to optimize training and prevent interference between modalities.
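The modality-isolated routing described above can be sketched as follows. The expert counts (64 text, 64 vision, 8 active per token) come from the text; the gating mechanics are simplified assumptions, not Baidu's implementation:

```python
import numpy as np

def route_token(hidden, modality, text_gates, vision_gates, top_k=8):
    """Modality-isolated top-k routing (simplified sketch).

    Text tokens are scored only against the 64 text experts, vision
    tokens only against the 64 vision experts. The shared experts and
    self-attention layers (not shown) still process every token, which
    is where cross-modal integration happens.
    """
    gates = text_gates if modality == "text" else vision_gates
    logits = gates @ hidden                      # one gating score per expert
    chosen = np.argsort(logits)[-top_k:]         # 8 active experts per token
    probs = np.exp(logits[chosen] - logits[chosen].max())
    probs /= probs.sum()                         # normalized mixture weights
    return chosen, probs

# Toy hidden size; the real hidden dimension is not published in the spec table.
hidden_dim = 16
rng = np.random.default_rng(0)
text_gates = rng.standard_normal((64, hidden_dim))
vision_gates = rng.standard_normal((64, hidden_dim))

experts, weights = route_token(rng.standard_normal(hidden_dim),
                               "text", text_gates, vision_gates)
```

Because the two gate matrices are disjoint, a text token can never dilute a vision expert's training signal (and vice versa), which is the interference the router orthogonal and token-balanced losses further guard against.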

ERNIE-4.5-VL-424B-A47B is engineered for tasks requiring cross-modal comprehension and generation, supporting a context length of 131,072 tokens. Its substantial parameter base and efficient MoE design enable the model to process extensive and complex inputs, fostering deep semantic understanding and coherent long-form generation across both text and images. The model offers distinct "thinking" and "non-thinking" modes to accommodate varied reasoning approaches. Potential use cases encompass multimodal content generation, advanced dialogue systems, comprehensive visual question answering, document and chart understanding, and general multimodal analysis where the synthesis of different data types is critical. For enhanced inference efficiency, the model supports deployment with quantization, including 4-bit and 2-bit lossless quantization. The entire ERNIE 4.5 family, including this variant, is built on the PaddlePaddle deep learning framework, which contributes to its high-performance inference and streamlined deployment capabilities.
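As a rough illustration of why the 2-bit and 4-bit quantization support matters at this scale: all 424B parameters must be resident in memory even though only ~47B are active per token, and weight storage scales linearly with bit width. A back-of-the-envelope estimate (weights only; real deployments also need KV cache and activation memory):

```python
def weight_gib(num_params: float, bits: int) -> float:
    """Approximate weight-storage footprint in GiB at a given bit width."""
    return num_params * bits / 8 / 2**30

# The full 424B-parameter expert pool must be loaded, not just the
# ~47B activated per token, so low-bit quantization drives feasibility.
total_params = 424e9
for bits in (16, 4, 2):
    print(f"{bits:>2}-bit weights: ~{weight_gib(total_params, bits):.0f} GiB")
```

Roughly 790 GiB of weights at 16-bit falls to about 197 GiB at 4-bit and about 99 GiB at 2-bit, which is what brings deployment within reach of a single multi-GPU node.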

About ERNIE 4.5

The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.



Benchmarks

Rankings apply to local LLMs.

No evaluation benchmarks are available for ERNIE-4.5-VL-424B-A47B.

Overall Ranking

-

Coding Ranking

-
