ERNIE-4.5-21B-A3B-Base

Total Parameters

21B (approximately 3B active per token)

Context Length

131,072 tokens

Modality

Text

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

30 Jun 2025

Knowledge Cutoff

Dec 2024

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

20

Key-Value Heads

4

Attention Head Dimension

128

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

500,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

2,560

Number of Layers

28

FFN Intermediate Size (Dense)

1,536

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

103,424

Mixture of Experts

Total Expert Parameters

3.0B

Number of Experts

64

Active Experts

6

Shared Experts

2

FFN Intermediate Size (per Expert)

1,536

Dense Layers Before MoE

1

Architecture Diagram

Input Tokens → Token Embedding (hidden: 2,560; context: 131,072; vocab: 103,424; position: RoPE) → 28 layers of [RMSNorm (pre-attention) → Grouped-Query Attention (20 Q / 4 KV heads, head dim 128) → residual add → RMSNorm (pre-FFN) → Sparse MoE FFN (6 of 64 experts, SwiGLU, intermediate 1,536) → residual add] → Final RMSNorm → Output Logits

ERNIE-4.5-21B-A3B-Base

ERNIE-4.5-21B-A3B-Base is a text-focused Mixture-of-Experts (MoE) transformer and a core member of Baidu's ERNIE 4.5 model family. This variant is produced by modality-specific extraction: the text-related parameters are isolated from a larger multimodal pre-training run over trillions of tokens. The family uses a heterogeneous MoE structure that shares parameters across modalities during training while keeping dedicated experts for each data type. This design is intended to keep textual representations from being degraded by joint multimodal training, supporting natural language understanding and generation in both Chinese and English.

The model uses a sparse architecture with 64 routed experts per layer and a router that activates 6 of them per token, giving approximately 3 billion active parameters per forward pass out of 21 billion in total. This sparsity reduces per-token compute while retaining the representational capacity of the full 21-billion-parameter model. Attention is Grouped-Query Attention (GQA) with 20 query heads and 4 key-value heads, which shrinks the key-value cache by a factor of five relative to multi-head attention with the same head count and so eases memory bandwidth pressure during inference. Together, 2D Rotary Position Embeddings (RoPE) and a 131,072-token context window make the model well suited to long-form documents and complex reasoning tasks.
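The effect of GQA on memory can be estimated with simple arithmetic. The sketch below is an illustration only: it assumes 2-byte (BF16/FP16) cache entries, a single sequence at the full context length, and no cache compression or paging; the function name `kv_cache_bytes` is hypothetical and the layer, head, and dimension values are taken from the specification table on this page.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value

GIB = 1024 ** 3

# GQA as specified: 4 key-value heads shared by 20 query heads.
gqa = kv_cache_bytes(layers=28, kv_heads=4, head_dim=128, context_tokens=131_072)
# Hypothetical multi-head baseline: one key-value head per query head.
mha = kv_cache_bytes(layers=28, kv_heads=20, head_dim=128, context_tokens=131_072)

print(f"GQA (4 KV heads):  {gqa / GIB:.1f} GiB")   # 7.0 GiB
print(f"MHA (20 KV heads): {mha / GIB:.1f} GiB")   # 35.0 GiB
```

Under these assumptions the cache for one full-length sequence is 7 GiB with GQA versus 35 GiB for the multi-head baseline, which is the fivefold reduction implied by the 20:4 head ratio.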

For deployment, the ERNIE 4.5 family is built on the PaddlePaddle framework and incorporates several hardware-level optimizations, including FP8 mixed-precision training and multi-expert parallel collaboration. The model supports 4-bit and 2-bit quantization, which Baidu describes as lossless, allowing it to run on a range of hardware platforms with reduced memory requirements. Through modality-isolated routing and specialized router losses, the family achieves high parameter efficiency, which makes it suitable for production applications ranging from summarization to cross-modal reasoning.

About ERNIE 4.5

The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.


Evaluation Benchmarks

No evaluation benchmarks for ERNIE-4.5-21B-A3B-Base available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B+

73 / 100

ERNIE-4.5-21B-A3B-Base Model Integrity Report

Audit Note

ERNIE-4.5-21B-A3B-Base demonstrates a high level of architectural and licensing transparency, providing precise details on its Mixture-of-Experts structure and a permissive Apache 2.0 license. While it offers clear hardware requirements and tokenizer documentation, it remains less transparent regarding the specific sources of its training data and the total compute resources consumed. The model's identity and versioning are well-maintained, though more granular benchmark reproduction tools would further strengthen its profile.

Upstream

21.0 / 30

Architectural Provenance

8.0 / 10

The model's architectural details are extensively documented in the ERNIE 4.5 Technical Report and official Hugging Face model cards. It is explicitly identified as a text-focused Mixture-of-Experts (MoE) variant derived from a larger multimodal pre-training phase. Key architectural components such as the 28-layer structure, Grouped-Query Attention (GQA) with 20 query and 4 KV heads, and the use of 2D Rotary Position Embeddings (RoPE) with progressive frequency scaling (10K to 500K) are clearly stated. The 'modality-specific extraction' process from the multimodal base is well-described, providing high clarity on its origin.
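To make the RoPE base frequency concrete, here is a minimal sketch of the standard one-dimensional RoPE inverse-frequency schedule evaluated at the final base of 500,000 with a head dimension of 128. It is an assumption that this textbook formula applies unchanged; the 2D variant and the progressive scaling described above are not modeled, and `rope_inv_freq` is a hypothetical helper name.

```python
import math

ROPE_THETA = 500_000  # "RoPE Theta" from the specification table
HEAD_DIM = 128        # attention head dimension from the specification table

def rope_inv_freq(head_dim=HEAD_DIM, theta=ROPE_THETA):
    """Inverse frequencies for standard 1D RoPE: theta ** (-2i / d)."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]

inv_freq = rope_inv_freq()

# Wavelength, in tokens, of the slowest-rotating pair of dimensions.
longest_wavelength = 2 * math.pi / inv_freq[-1]

print(len(inv_freq))                        # one frequency per dimension pair
print(longest_wavelength > 131_072)         # slowest rotation outlasts the context window
```

A larger base stretches the slowest rotations, so the lowest-frequency dimensions do not wrap around within the 131,072-token window.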

Dataset Composition

4.5 / 10

While the technical report mentions training on 'trillions of tokens' and 'diverse multimodal data sources,' it lacks a granular breakdown of the dataset composition (e.g., specific percentages of web, code, or academic data). It describes a 'REEAO' data processing system that ensures bitwise-deterministic token sequences and prevents duplication, which provides some methodological transparency. However, the specific sources and their proportions remain largely undisclosed, falling into the category of general data categories without specific verification of the underlying corpus.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly accessible via the 'tokenization_ernie4_5.py' script on Hugging Face and is based on SentencePiece. The vocabulary size and special tokens (e.g., <mask:1>, <mask:7>) are explicitly defined in the code. Documentation confirms support for a 131,072-token context window, and the tokenizer is compatible with standard libraries like Hugging Face Transformers (v4.54.0+), allowing for direct public verification of its behavior and alignment with claimed language support.

Model

29.0 / 40

Parameter Density

9.0 / 10

Baidu provides precise figures for parameter density: 21 billion total parameters with approximately 3 billion active parameters per forward pass. The MoE structure is detailed as having 64 experts per layer, with 6 experts activated per token plus 2 shared experts. This level of detail, which distinguishes between total, active, shared, and specialized experts, is exemplary for an MoE model and avoids the common pitfall of misleading total-parameter marketing.
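The routing described above can be sketched as generic top-k gating. This is not Baidu's router: the softmax over the selected logits, the random logits, and the helper name `route_token` are assumptions made for illustration. Only the counts (64 routed experts, 6 active, 2 shared) come from this page.

```python
import math
import random

NUM_EXPERTS = 64  # routed experts per layer
TOP_K = 6         # routed experts activated per token
NUM_SHARED = 2    # shared experts, applied to every token

def route_token(router_logits, top_k=TOP_K):
    """Select the top_k experts for one token and normalize their gate weights."""
    # Indices of the top_k largest router logits.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax over the selected logits only (an assumption; schemes vary).
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: w / total for i, w in exps.items()}

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
gates = route_token(logits)

# Fraction of expert FFNs that run for a single token, shared experts included.
active_fraction = (TOP_K + NUM_SHARED) / (NUM_EXPERTS + NUM_SHARED)
print(f"{len(gates)} routed experts selected, {active_fraction:.1%} of expert FFNs active")
```

Each token therefore passes through 8 of the 66 expert FFNs in a layer, which is why the active parameter count is a small fraction of the total.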

Training Compute

5.0 / 10

The documentation mentions the use of Baidu's PaddlePaddle framework and hardware-level optimizations like FP8 mixed-precision training. While it references a large-scale cluster (30,000 Kunlun chips) and mentions achieving 47% Model FLOPs Utilization (MFU), it does not provide the specific total GPU/TPU hours or the exact hardware count used for this specific 21B variant's training run. Carbon footprint and specific cost estimates are also absent.

Benchmark Reproducibility

6.0 / 10

The technical report lists performance across various benchmarks (BBH, CMATH, GSM8K, HumanEval+) and compares them to competitors like Qwen3. While the evaluation results are detailed, the specific evaluation code and exact prompts used for every benchmark are not fully centralized in a single reproducible repository, though some integration with standard evaluation frameworks is implied. Third-party verification is starting to appear on leaderboards like SnakeBench, but a comprehensive reproduction guide is missing.

Identity Consistency

9.0 / 10

The model consistently identifies as part of the ERNIE 4.5 family across all official documentation, GitHub repositories, and API providers. It maintains clear versioning (e.g., the 'Thinking' vs 'Base' variants) and does not exhibit identity confusion with other models like GPT-4. The capabilities and limitations regarding its text-only nature (extracted from a multimodal base) are transparently communicated.

Downstream

23.0 / 30

License Clarity

9.5 / 10

The model is explicitly released under the Apache License 2.0, which is a standard, highly permissive open-source license. This is clearly stated on Hugging Face, GitHub, and in the technical report. The license permits commercial use and derivative works without conflicting proprietary terms, providing a high degree of legal transparency for developers.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented, with specific VRAM estimates provided for different precisions (e.g., ~48GB for FP16/BF16). The documentation explicitly mentions support for 4-bit and 2-bit 'lossless' quantization using convolutional code quantization, and provides guidance on the necessary GPU resources (e.g., 2x RTX 4090 or 1x A6000). Context length memory scaling is also addressed through the mention of FlashMask and optimized memory scheduling.
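As a rough cross-check of these figures, weight memory can be estimated from the total parameter count and the bits stored per parameter. The sketch below is an approximation under stated assumptions: it counts all 21 billion parameters (every expert must be resident even though only about 3 billion are active per token), ignores quantization metadata such as scales, and excludes the KV cache and activations. The helper name `weight_gb` is hypothetical.

```python
TOTAL_PARAMS = 21e9  # total parameter count reported on this page

def weight_gb(bits_per_param, params=TOTAL_PARAMS):
    """Approximate weight storage in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9

estimates = {name: weight_gb(bits)
             for name, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4), ("2-bit", 2)]}

for name, gb in estimates.items():
    print(f"{name:>5}: {gb:.2f} GB")
```

The BF16 estimate of about 42 GB for weights alone sits below the quoted ~48 GB; the difference plausibly covers the cache and runtime overhead, though that breakdown is an assumption rather than something the documentation states.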

Versioning Drift

5.5 / 10

Baidu uses a clear naming convention (ERNIE-4.5-21B-A3B-Base) and maintains a changelog for its ERNIEKit and FastDeploy tools on GitHub. However, the frequency of model weight updates and a formal semantic versioning system for the weights themselves (to track potential silent drift) are less clearly defined compared to the software tools. There is a history of 'Thinking' upgrades, but a centralized, detailed weight-level changelog is not fully evident.
