Total Parameters: 21B
Context Length: 131,072 tokens
Modality: Text
Architecture: Mixture of Experts (MoE)
License: Apache 2.0
Release Date: 30 Jun 2025
Knowledge Cutoff: Dec 2024

Attention
Attention Structure: Grouped-Query Attention
Attention Heads: 20
Key-Value Heads: 4
Attention Head Dimension: 128
Position Embedding: Rotary Position Embedding (RoPE)
RoPE Theta: 500,000
Sliding Window Attention: No
Sliding Window Size: -
Normalization: RMS Normalization
Activation Function: SwiGLU

Dimensions
Hidden Dimension Size: 2,560
Number of Layers: 28
FFN Intermediate Size (Dense): 1,536
Multi-Token Prediction Heads: -

Tokenizer
Vocabulary Size: 103,424

Mixture of Experts
Active Parameters: 3.0B
Number of Experts: 64
Active Experts: 6
Shared Experts: 2
FFN Intermediate Size (per Expert): 1,536
Dense Layers Before MoE: 1
ERNIE-4.5-21B-A3B-Base is a text-focused Mixture-of-Experts (MoE) transformer in Baidu's ERNIE 4.5 model family. This variant is derived by modality-specific extraction: the text-related parameters are isolated from a larger multimodal pre-training run that covers trillions of tokens. The family uses a heterogeneous MoE structure that shares parameters across modalities during training while keeping dedicated experts for specific data types. This design is intended to keep textual representations from being degraded by joint multimodal training, supporting natural language understanding and generation in both Chinese and English.
The model uses a sparse architecture with 64 experts per layer and a routing mechanism that activates 6 experts per token, resulting in approximately 3 billion active parameters per forward pass. This sparsity substantially reduces compute per token while retaining the representational capacity of a 21-billion-parameter model. Attention uses Grouped-Query Attention (GQA) with 20 query heads and 4 key-value heads, which shrinks the key-value cache and lowers memory-bandwidth demands during inference. 2D Rotary Position Embeddings (RoPE) and support for a 131,072-token context window make it suited to long-form documents and complex reasoning tasks.
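The routing step described above can be illustrated with a minimal sketch. It assumes a softmax-then-top-k gate with renormalized weights, which is a common MoE formulation and not confirmed here as ERNIE 4.5's exact router; the function names and the random logits are hypothetical, and shared experts (which run for every token) are left out.

```python
import math
import random

NUM_EXPERTS = 64  # experts per MoE layer (from the spec)
TOP_K = 6         # experts activated per token (from the spec)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k experts.

    Assumption: gate weights are renormalized over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# Hypothetical router logits for a single token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)
for expert_index, weight in selected:
    print(f"expert {expert_index:2d}  weight {weight:.3f}")
```

Only the selected experts' feed-forward blocks would run for this token, which is where the reduction in per-token compute comes from.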
For deployment, the ERNIE 4.5 family is built on the PaddlePaddle framework and incorporates several hardware-level optimizations, including FP8 mixed-precision training and multi-expert parallel collaboration. The model supports 4-bit and 2-bit quantization that Baidu describes as lossless, allowing it to run on a range of hardware platforms with reduced memory requirements. With modality-isolated routing and specialized router losses, the model aims for high parameter efficiency, making it suitable for industrial-grade applications ranging from summarization to cross-modal reasoning in production environments.
The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
No evaluation benchmarks are available for ERNIE-4.5-21B-A3B-Base.
Overall Rank
-
Coding Rank
-
Total Score
73 / 100
ERNIE-4.5-21B-A3B-Base demonstrates a high level of architectural and licensing transparency, providing precise details on its Mixture-of-Experts structure and a permissive Apache 2.0 license. While it offers clear hardware requirements and tokenizer documentation, it remains less transparent regarding the specific sources of its training data and the total compute resources consumed. The model's identity and versioning are well-maintained, though more granular benchmark reproduction tools would further strengthen its profile.
Architectural Provenance
The model's architectural details are extensively documented in the ERNIE 4.5 Technical Report and official Hugging Face model cards. It is explicitly identified as a text-focused Mixture-of-Experts (MoE) variant derived from a larger multimodal pre-training phase. Key architectural components such as the 28-layer structure, Grouped-Query Attention (GQA) with 20 query and 4 KV heads, and the use of 2D Rotary Position Embeddings (RoPE) with progressive frequency scaling (10K to 500K) are clearly stated. The 'modality-specific extraction' process from the multimodal base is well-described, providing high clarity on its origin.
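To make the effect of the RoPE base concrete, here is a minimal sketch of the standard one-dimensional RoPE frequency schedule, evaluated at the two base values quoted above (10K and 500K) with the 128-dimensional heads from the spec. It is an illustration of the textbook formula only; the helper names are hypothetical and Baidu's exact implementation, including the 2D variant, may differ.

```python
import math

HEAD_DIM = 128  # attention head dimension (from the spec)


def rope_inv_freq(theta, head_dim=HEAD_DIM):
    """Inverse frequencies theta^(-2i/d) for i = 0 .. d/2 - 1."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def rotate_pair(x0, x1, position, inv_freq):
    """Rotate one (x0, x1) feature pair by the angle position * inv_freq."""
    angle = position * inv_freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c


longest_wavelength = {}
for theta in (10_000.0, 500_000.0):
    freqs = rope_inv_freq(theta)
    # Wavelength (in tokens) of the slowest-rotating feature pair.
    longest_wavelength[theta] = 2.0 * math.pi / freqs[-1]
    print(f"theta={theta:>9,.0f}  longest wavelength ~ {longest_wavelength[theta]:,.0f} tokens")
```

With the larger base, the slowest-rotating pair completes less than one full turn across the 131,072-token window, which is the usual motivation for raising theta when extending context length.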
Dataset Composition
While the technical report mentions training on 'trillions of tokens' and 'diverse multimodal data sources,' it lacks a granular breakdown of the dataset composition (e.g., specific percentages of web, code, or academic data). It describes a 'REEAO' data processing system that ensures bitwise-deterministic token sequences and prevents duplication, which provides some methodological transparency. However, the specific sources and their proportions remain largely undisclosed: only general data categories are named, and the underlying corpus cannot be independently verified.
Tokenizer Integrity
The tokenizer is publicly accessible via the 'tokenization_ernie4_5.py' script on Hugging Face and is based on SentencePiece. The vocabulary size and special tokens (e.g., <mask:1>, <mask:7>) are explicitly defined in the code. Documentation confirms support for a 131,072-token context window, and the tokenizer is compatible with standard libraries like Hugging Face Transformers (v4.54.0+), allowing for direct public verification of its behavior and alignment with claimed language support.
Parameter Density
Baidu provides precise figures for parameter density: 21 billion total parameters with approximately 3 billion active parameters per forward pass. The MoE structure is detailed as having 64 experts per layer, with 6 experts activated per token plus 2 shared experts. This level of detail, which distinguishes between total, active, shared, and specialized experts, is exemplary for an MoE model and avoids the common pitfall of misleading total-parameter marketing.
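As a rough cross-check of these figures, the sketch below estimates total and active parameter counts from the dimensions listed in the spec. It rests on several assumptions that the source does not confirm: each expert is a bias-free SwiGLU block with three weight matrices, input and output embeddings are tied, and router weights and normalization parameters are small enough to ignore.

```python
# Dimensions taken from the spec listing for ERNIE-4.5-21B-A3B-Base.
HIDDEN = 2560
N_LAYERS = 28
DENSE_LAYERS = 1
DENSE_FFN = 1536
EXPERT_FFN = 1536
N_EXPERTS = 64
TOP_K = 6
N_SHARED = 2
N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 20, 4, 128
VOCAB = 103_424


def swiglu_params(hidden, intermediate):
    """Gate, up and down projections, no biases (assumption)."""
    return 3 * hidden * intermediate


def attention_params():
    """Q, K, V and output projections for one GQA layer, no biases (assumption)."""
    q = HIDDEN * N_Q_HEADS * HEAD_DIM
    kv = 2 * HIDDEN * N_KV_HEADS * HEAD_DIM
    o = N_Q_HEADS * HEAD_DIM * HIDDEN
    return q + kv + o


moe_layers = N_LAYERS - DENSE_LAYERS
expert = swiglu_params(HIDDEN, EXPERT_FFN)

# Parameters that every token passes through.
always_on = (
    N_LAYERS * attention_params()
    + DENSE_LAYERS * swiglu_params(HIDDEN, DENSE_FFN)
    + moe_layers * N_SHARED * expert
    + VOCAB * HIDDEN  # single embedding matrix (assumes tied embeddings)
)

total = always_on + moe_layers * N_EXPERTS * expert
active = always_on + moe_layers * TOP_K * expert

print(f"estimated total parameters:  {total / 1e9:.2f}B")
print(f"estimated active parameters: {active / 1e9:.2f}B")
print(f"active fraction:             {active / total:.1%}")
```

Under these assumptions the estimate lands close to the stated 21B total and roughly 3B active, which suggests the published figures are consistent with the listed dimensions.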
Training Compute
The documentation mentions the use of Baidu's PaddlePaddle framework and hardware-level optimizations like FP8 mixed-precision training. While it references a large-scale cluster (30,000 Kunlun chips) and mentions achieving 47% Model FLOPs Utilization (MFU), it does not provide the specific total GPU/TPU hours or the exact hardware count used for this specific 21B variant's training run. Carbon footprint and specific cost estimates are also absent.
Benchmark Reproducibility
The technical report lists performance across various benchmarks (BBH, CMATH, GSM8K, HumanEval+) and compares them to competitors like Qwen3. While the evaluation results are detailed, the specific evaluation code and exact prompts used for every benchmark are not fully centralized in a single reproducible repository, though some integration with standard evaluation frameworks is implied. Third-party verification is starting to appear on leaderboards like SnakeBench, but a comprehensive reproduction guide is missing.
Identity Consistency
The model consistently identifies as part of the ERNIE 4.5 family across all official documentation, GitHub repositories, and API providers. It maintains clear versioning (e.g., the 'Thinking' vs 'Base' variants) and does not exhibit identity confusion with other models like GPT-4. The capabilities and limitations regarding its text-only nature (extracted from a multimodal base) are transparently communicated.
License Clarity
The model is explicitly released under the Apache License 2.0, which is a standard, highly permissive open-source license. This is clearly stated on Hugging Face, GitHub, and in the technical report. The license permits commercial use and derivative works without conflicting proprietary terms, providing a high degree of legal transparency for developers.
Hardware Footprint
Hardware requirements are well-documented, with specific VRAM estimates provided for different precisions (e.g., ~48GB for FP16/BF16). The documentation explicitly mentions support for 4-bit and 2-bit 'lossless' quantization using convolutional code quantization, and provides guidance on the necessary GPU resources (e.g., 2x RTX 4090 or 1x A6000). Context length memory scaling is also addressed through the mention of FlashMask and optimized memory scheduling.
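A back-of-the-envelope estimate shows how these figures scale with precision and context length. The sketch below is an approximation under stated assumptions: weight memory is the 21B parameter count times bits per parameter (ignoring quantization metadata, activations, and runtime overhead), and the key-value cache is stored at 2 bytes per value. The helper names are hypothetical.

```python
TOTAL_PARAMS = 21e9  # total parameters (from the spec)
N_LAYERS, N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 28, 20, 4, 128
GIB = 2**30


def weight_bytes(bits_per_param, n_params=TOTAL_PARAMS):
    """Memory for the weights alone at a given precision."""
    return n_params * bits_per_param / 8


def kv_cache_bytes(n_tokens, kv_heads=N_KV_HEADS, bytes_per_value=2):
    """Keys and values for every layer, for one sequence of n_tokens."""
    return 2 * N_LAYERS * kv_heads * HEAD_DIM * bytes_per_value * n_tokens


for name, bits in (("BF16", 16), ("4-bit", 4), ("2-bit", 2)):
    print(f"{name:>5} weights: {weight_bytes(bits) / GIB:6.1f} GiB")

for n_tokens in (1_024, 131_072):
    gqa = kv_cache_bytes(n_tokens) / GIB
    mha = kv_cache_bytes(n_tokens, kv_heads=N_Q_HEADS) / GIB
    print(f"{n_tokens:>7,} tokens: KV cache {gqa:6.3f} GiB with 4 KV heads, "
          f"{mha:6.3f} GiB if all 20 heads stored keys and values")
```

The BF16 weights alone come to 42 GB, so the roughly 48GB estimate quoted above leaves room for the cache and runtime overhead. Using 4 key-value heads instead of 20 cuts the cache size by a factor of five at any context length.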
Versioning Drift
Baidu uses a clear naming convention (ERNIE-4.5-21B-A3B-Base) and maintains a changelog for its ERNIEKit and FastDeploy tools on GitHub. However, the frequency of model weight updates and a formal semantic versioning system for the weights themselves (to track potential silent drift) are less clearly defined compared to the software tools. There is a history of 'Thinking' upgrades, but a centralized, detailed weight-level changelog is not fully evident.