Total Parameters
300B
Context Length
131,072 tokens
Modality
Text
Architecture
Mixture of Experts (MoE)
License
Apache 2.0
Release Date
30 Jun 2025
Knowledge Cutoff
Jun 2025
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
64
Key-Value Heads
8
Attention Head Dimension
-
Position Embedding
Rotary Position Embedding (RoPE)
RoPE Theta
500,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization (RMSNorm)
Activation Function
SwiGLU (SiLU-gated)
Dimensions
Hidden Dimension Size
12,288
Number of Layers
54
FFN Intermediate Size (Dense)
3,584
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
103,424
Mixture of Experts
Active Parameters (per Token)
47.0B
Number of Experts
64
Active Experts
8
Shared Experts
0
FFN Intermediate Size (per Expert)
3,584
Dense Layers Before MoE
3
The ERNIE-4.5-300B-A47B-Base model, developed by Baidu, is a large-scale language model built on a Mixture-of-Experts (MoE) architecture. Part of the ERNIE 4.5 family, it contains 300 billion total parameters and activates 47 billion of them per token through sparse gating. This design lets the model scale its knowledge capacity without a proportional increase in per-token inference cost. The model targets text-based reasoning, code generation, and complex instruction following in both English and Chinese.
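To make the sparse gating idea concrete, the following is a minimal sketch of top-k expert selection for a single token, using the 64 experts and 8 active experts listed in the spec table. The function name `route_token`, the example logits, and the choice to apply softmax over only the selected experts are assumptions made for illustration; the source does not describe ERNIE 4.5's exact router normalization.

```python
import math

NUM_EXPERTS = 64  # number of experts, from the spec table
TOP_K = 8         # active experts per token, from the spec table


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k experts.

    Assumption: weights are a softmax over the selected logits only.
    The real router may normalize differently.
    """
    if len(router_logits) < top_k:
        raise ValueError("need at least top_k router logits")
    # Indices of the top_k largest logits (ties broken by lower index).
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: (-router_logits[i], i))[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Hypothetical router scores for one token: expert i gets score i / 10.
example_logits = [i / 10 for i in range(NUM_EXPERTS)]
selection = route_token(example_logits)
```

Only the 8 selected experts run their feed-forward computation for this token, which is why per-token cost tracks the active parameter count rather than the total.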
Technically, the ERNIE 4.5 family introduces a multimodal heterogeneous MoE structure, pre-trained on trillions of tokens across textual and visual modalities. A key architectural innovation is modality-isolated routing, which keeps expert specialization for one modality from degrading performance on another. This variant, the A47B-Base, consists of the text-related parameters extracted after that multimodal pre-training phase. It employs Grouped-Query Attention (GQA) with 64 query heads and 8 key-value heads, balancing attention quality against memory use during long-context processing.
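The memory effect of GQA can be illustrated with a rough KV-cache estimate. The sketch below uses the head counts, layer count, and context length from the spec table. The head dimension is not listed there, so `head_dim=128` is a purely hypothetical value, as are the helper names and the 2-byte (16-bit) cache precision.

```python
QUERY_HEADS = 64          # from the spec table
KV_HEADS = 8              # from the spec table
NUM_LAYERS = 54           # from the spec table
CONTEXT_TOKENS = 131_072  # from the spec table


def kv_head_for_query_head(q_head, query_heads=QUERY_HEADS, kv_heads=KV_HEADS):
    """Map a query head index to the key-value head it shares."""
    group_size = query_heads // kv_heads  # 8 query heads per KV head
    return q_head // group_size


def kv_cache_bytes(tokens, head_dim, bytes_per_value=2,
                   layers=NUM_LAYERS, kv_heads=KV_HEADS):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 covers the separate key and value tensors.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# head_dim=128 is an assumed value, not taken from the source.
gqa_bytes = kv_cache_bytes(CONTEXT_TOKENS, head_dim=128)
mha_bytes = kv_cache_bytes(CONTEXT_TOKENS, head_dim=128, kv_heads=QUERY_HEADS)
```

Whatever the true head dimension is, caching 8 key-value heads instead of 64 shrinks the cache by a factor of 8. Under the assumed values above, a full 131,072-token sequence needs 27 GiB with GQA.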
The architecture is built upon the PaddlePaddle deep learning framework and supports an expansive context window of 131,072 tokens. To manage the computational demands of a 300B parameter system, Baidu implemented scaling-efficient infrastructure features such as intra-node expert parallelism, memory-efficient pipeline scheduling, and FP8 mixed-precision training. The model is designed for high-throughput deployment environments and supports advanced inference optimizations, including Prefill-Decode (PD) disaggregation with dynamic role switching to maximize hardware utilization.
The Baidu ERNIE 4.5 family consists of ten large-scale multimodal models. They utilize a heterogeneous Mixture-of-Experts (MoE) architecture, which enables parameter sharing across modalities while also employing dedicated parameters for specific modalities, supporting efficient language and multimodal processing.
No evaluation benchmarks are available for ERNIE-4.5-300B-A47B-Base.
Overall Rank
-
Coding Rank
-
Total Score
67 / 100
ERNIE 4.5 demonstrates a significant shift toward transparency through its permissive licensing and detailed architectural disclosures regarding its MoE routing and parameter density. However, the profile is undermined by a lack of granular data provenance and a total absence of compute resource disclosure. While the technical framework is open, the 'upstream' training pipeline remains largely a black box.
Architectural Provenance
The model's Mixture-of-Experts (MoE) architecture is well-documented in the ERNIE 4.5 Technical Report, detailing a 'multimodal heterogeneous' structure. It specifies the use of modality-isolated routing, router orthogonal loss, and multimodal token-balanced loss to manage expert specialization. The base model is explicitly identified as an extraction of text-related parameters from a larger multimodal pre-training phase, and the use of Grouped Query Attention (GQA) with specific head counts (64 query, 8 KV) is verified in technical documentation.
Dataset Composition
While Baidu claims the model was trained on 'trillions of tokens' across textual and visual modalities, there is no public breakdown of the dataset composition (e.g., specific percentages of web, code, or academic data). The documentation mentions a 'human-model-in-the-loop' cleaning process but does not disclose the specific sources or the ratio of synthetic to organic data, so these claims amount to vague marketing assertions.
Tokenizer Integrity
The tokenizer is publicly accessible via the PaddlePaddle and Hugging Face repositories. It uses a SentencePiece-based approach with a confirmed vocabulary size of 103,424 tokens. Documentation explicitly covers special tokens (e.g., mask, bos, eos) and supports both Chinese and English, aligning with the model's stated multilingual capabilities.
Parameter Density
Baidu provides exemplary transparency regarding parameter density, clearly distinguishing between the 300 billion total parameters and the 47 billion active parameters per token. The architectural breakdown includes specific details on the FFN expert categorization (text, vision, and shared experts) and the scaling-efficient infrastructure used to manage this density.
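The density figures reduce to simple arithmetic, sketched below. The parameter counts come from this page. The rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token is a common approximation, not a figure from Baidu, and the function names are illustrative.

```python
TOTAL_PARAMS = 300e9   # total parameters, from this page
ACTIVE_PARAMS = 47e9   # active parameters per token, from this page


def active_fraction(active=ACTIVE_PARAMS, total=TOTAL_PARAMS):
    """Share of the model's parameters used for any single token."""
    return active / total


def forward_flops_per_token(params):
    """Approximate forward-pass FLOPs per token (about 2 per parameter)."""
    return 2 * params


fraction = active_fraction()
sparse_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)
```

Under this approximation, each token touches about 15.7% of the parameters, so per-token compute is roughly that fraction of what a dense 300B model would need.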
Training Compute
Transparency regarding compute is poor. While the technical report mentions a Model FLOPs Utilization (MFU) of 47% and the use of FP8 mixed-precision training on the PaddlePaddle framework, it fails to disclose the total GPU/TPU hours, the specific hardware cluster size, the training duration, or the environmental carbon footprint.
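For context on why the MFU figure alone is not enough, here is a minimal sketch of how MFU relates to the undisclosed quantities. It assumes the widely used estimate of about 6 FLOPs per active parameter per trained token. The throughput and peak-FLOPs numbers are hypothetical placeholders chosen only to reproduce a 47% result, not values from the technical report.

```python
ACTIVE_PARAMS = 47e9  # active parameters per token, from this page


def mfu(tokens_per_second, active_params, peak_flops_per_second):
    """Model FLOPs Utilization: achieved model FLOPs/s over hardware peak.

    Assumption: training costs about 6 FLOPs per active parameter per token
    (forward plus backward pass).
    """
    model_flops_per_second = 6 * active_params * tokens_per_second
    return model_flops_per_second / peak_flops_per_second


# Hypothetical inputs: 1,000 tokens/s on hardware with a 6e14 FLOPs/s peak.
example_mfu = mfu(tokens_per_second=1_000,
                  active_params=ACTIVE_PARAMS,
                  peak_flops_per_second=6e14)
```

Because MFU is a ratio, the same 47% is consistent with many combinations of cluster size and throughput, which is why total GPU hours and training duration cannot be recovered from it.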
Benchmark Reproducibility
Evaluation code is provided within the ERNIEKit repository, and scores are reported for standard benchmarks such as IFEval and MMLU-Pro. However, the score is limited because the primary results come from internal evaluations. The research community remains skeptical given the lack of independent third-party verification and the absence of direct comparisons with contemporary industry leaders such as Gemini 2.0 or GPT-4.5.
Identity Consistency
The model consistently identifies itself as part of the ERNIE 4.5 family across API responses and documentation. There are no documented instances of the model claiming a competitor's identity or misrepresenting its versioning, and it maintains clear distinction between its 'Base' and 'Thinking' (VL) variants.
License Clarity
The model and its associated development toolkits (ERNIEKit, FastDeploy) are released under the Apache License 2.0. This is a highly transparent, permissive open-source license that explicitly allows for commercial use and derivative works without conflicting proprietary terms.
Hardware Footprint
Baidu provides specific hardware requirements for various deployment scenarios, including VRAM needs for 4-bit (4x 80GB GPUs) and 2-bit (1x 141GB GPU) quantization. The documentation also details the memory scaling for its 131,072-token context window and provides guidance on using the FastDeploy toolkit for optimization.
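The quoted GPU configurations can be sanity-checked with a back-of-the-envelope estimate of weight memory. This sketch counts only the quantized weights (300B parameters at a given bit width) and ignores quantization scales, KV cache, and activations, so real requirements are higher. The function names are illustrative.

```python
TOTAL_PARAMS = 300e9  # total parameters, from this page


def weight_gb(params, bits_per_weight):
    """Decimal gigabytes needed to hold the weights alone."""
    return params * bits_per_weight / 8 / 1e9


def headroom_gb(params, bits_per_weight, gpus, gb_per_gpu):
    """Memory left for KV cache and activations after loading weights."""
    return gpus * gb_per_gpu - weight_gb(params, bits_per_weight)


four_bit = weight_gb(TOTAL_PARAMS, 4)  # weights at 4-bit
two_bit = weight_gb(TOTAL_PARAMS, 2)   # weights at 2-bit
four_bit_headroom = headroom_gb(TOTAL_PARAMS, 4, gpus=4, gb_per_gpu=80)
two_bit_headroom = headroom_gb(TOTAL_PARAMS, 2, gpus=1, gb_per_gpu=141)
```

By this estimate the 4-bit weights take about 150 GB of the 320 GB available on four 80GB GPUs, and the 2-bit weights take about 75 GB of a single 141GB GPU. The remainder is what is left for the KV cache and runtime overhead.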
Versioning Drift
The model uses semantic versioning (4.5) and maintains a changelog on GitHub. However, because the model was released in mid-2025, there is insufficient public data to track long-term performance drift or the impact of silent 'alignment taxes' over time. Documentation on the deprecation of previous versions is currently minimal.