Active Parameters: 389B
Context Length: 28K
Modality: Text
Architecture: Mixture of Experts (MoE)
License: Tencent Hunyuan Community License
Release Date: 5 Nov 2024
Knowledge Cutoff: Sep 2024

Attention
Attention Structure: Multi-Head Attention
Attention Heads: 64
Key-Value Heads: 64
Attention Head Dimension: -
Position Embedding: Absolute Position Embedding
RoPE Theta: -
Sliding Window Attention: -
Sliding Window Size: -
Normalization: Layer Normalization
Activation Function: GELU

Dimensions
Hidden Dimension Size: 4,096
Number of Layers: 60
FFN Intermediate Size (Dense): -
Multi-Token Prediction Heads: -

Tokenizer
Vocabulary Size: -

Mixture of Experts
Total Expert Parameters: 52.0B
Number of Experts: 32
Active Experts: 2
Shared Experts: -
FFN Intermediate Size (per Expert): -
Dense Layers Before MoE: -

Hunyuan-DiT is a large-scale Mixture-of-Experts (MoE) diffusion transformer from Tencent, designed for high-fidelity image generation. It applies a transformer architecture directly in the latent space of the diffusion process and synthesizes diverse, high-quality images from text prompts for content creation and visual design. Its modular architecture is intended to allow efficient scaling and inference.
Hunyuan-DiT uses a diffusion transformer architecture with a Mixture-of-Experts (MoE) design. The model's parameters are partitioned into multiple "experts," and only a subset of them is activated for each input token during inference. This lets the model reach a total parameter count of approximately 389 billion while keeping the number of active parameters at approximately 52 billion, which reduces the compute needed per token. The model has 60 transformer layers with 64 attention heads and uses GELU activation and Layer Normalization. It supports flexible image resolutions and uses absolute positional embeddings together with Rotary Positional Encoding (RoPE). For text understanding, it combines a bilingual CLIP encoder with a multilingual T5 encoder.
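The routing idea described above can be illustrated with a minimal sketch. It assumes a plain softmax router that keeps the top 2 of 32 experts (the counts from the specification sheet) and renormalizes their weights; the actual gating function, the toy experts, and the names `route_token` and `moe_forward` are hypothetical and not taken from Tencent's code.

```python
import math
import random

NUM_EXPERTS = 32  # "Number of Experts" from the specification sheet
TOP_K = 2         # "Active Experts" from the specification sheet


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k experts of one token.

    Assumption: weights of the selected experts are renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, router_logits, experts):
    """Weighted sum of the outputs of the selected experts only."""
    return sum(w * experts[i](x) for i, w in route_token(router_logits))


# Toy experts (hypothetical): expert i multiplies its scalar input by i + 1.
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routing = route_token(logits)
output = moe_forward(1.0, logits, experts)
print(routing, output)
```

Only the two selected experts are evaluated for the token, which is why the per-token cost tracks the active parameter count rather than the total.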
Hunyuan-DiT is built to generate high-resolution, visually consistent images at resolutions up to 4096x4096. Its MoE architecture contributes to efficient scaling, which suits deployments that need both high quality and computational efficiency. Primary use cases include creative content generation, visual asset production, and text-to-image synthesis for advertising, digital art, and virtual environment design. It also supports multi-turn multimodal dialogue, enabling iterative image refinement based on user interactions.
Tencent Hunyuan large language models with various capabilities.
Rank
#100
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| Web Development | WebDev Arena | 1326 | 70 |
| General Text | Text Arena | 1326 | 78 |
Overall Rank
#100
Coding Rank
#78
Total Score
62 / 100
Hunyuan-DiT exhibits strong transparency regarding its Mixture-of-Experts architecture and hardware requirements, providing clear distinctions between total and active parameters. However, it is significantly hampered by a highly restrictive and geographically limited license and a near-total lack of disclosure regarding training compute and specific data sources. While technically well-documented, its legal and environmental transparency profiles remain weak.
Architectural Provenance
The model's architecture is extensively documented in the official technical report and GitHub repository. It is a Mixture-of-Experts (MoE) Diffusion Transformer (DiT) utilizing 60 transformer layers and 64 attention heads. The report details the integration of a bilingual CLIP and a multilingual T5 encoder for text understanding. While the base components are well-described, the specific pre-training methodology for the DiT backbone itself is less detailed than the post-training and dialogue fine-tuning procedures.
Dataset Composition
Tencent provides a general overview of the data pipeline, including the use of a Multimodal Large Language Model (MLLM) for caption refinement and a category-balanced dataset (subject, style, scene). However, specific dataset sources, exact proportions of data types, and the total volume of the image-text pairs used for the DiT training are not disclosed. The documentation focuses more on the 'how' of the pipeline rather than the 'what' of the data itself.
Tokenizer Integrity
The model combines two standard tokenizers: a bilingual CLIP tokenizer and an mT5 tokenizer (specifically t5-v1_1-xxl). Both are publicly available, and their vocabulary sizes are verifiable through the model files provided on Hugging Face. Note that the 128K vocabulary belongs to the related Hunyuan-Large LLM; DiT uses the standard CLIP and T5 vocabularies. The approach to handling bilingual prompts is clearly documented.
Parameter Density
The model is transparent about its MoE structure, explicitly stating a total parameter count of 389 billion with approximately 52 billion active parameters. The architectural breakdown (60 layers, 64 heads) and the expert configuration (shared vs. specialized experts) are clearly provided in the technical documentation, allowing for a precise understanding of the model's density and inference efficiency.
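As a quick worked example of what these figures imply, the sketch below computes the share of parameters active per token from the two stated counts. The helper name `active_fraction` is hypothetical, and the inputs are the approximate values quoted above, so the result is a rough estimate only.

```python
TOTAL_PARAMS_B = 389.0   # total parameters in billions, as stated above
ACTIVE_PARAMS_B = 52.0   # active parameters in billions, as stated above
NUM_EXPERTS = 32         # from the specification sheet
ACTIVE_EXPERTS = 2       # from the specification sheet


def active_fraction(active_b: float, total_b: float) -> float:
    """Fraction of parameters used per token (both inputs in billions)."""
    if total_b <= 0 or not 0 <= active_b <= total_b:
        raise ValueError("expected 0 <= active_b <= total_b and total_b > 0")
    return active_b / total_b


param_fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS

print(f"active parameters: {param_fraction:.1%}")
print(f"active experts:    {expert_fraction:.2%}")
```

The parameter share (about 13.4%) comes out higher than the expert share (6.25%). One plausible reason, stated here as an assumption, is that non-expert components such as attention layers and embeddings are used for every token.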
Training Compute
There is almost no specific information regarding the total compute resources used for training. While the technical report mentions testing on NVIDIA V100 and A100 GPUs, it fails to disclose the total GPU hours, the specific hardware cluster size used for the full training run, the training duration, or the environmental impact/carbon footprint.
Benchmark Reproducibility
Tencent provides a custom evaluation protocol involving 50+ human evaluators and some automated metrics. While they name the benchmarks and provide some comparison results against DALL-E 3 and Midjourney, the exact prompts used for all evaluations and the full evaluation codebase are not as comprehensive as those found in top-tier open-source projects. Third-party verification is limited primarily to community-run leaderboards.
Identity Consistency
The model and its documentation maintain a consistent identity as 'Hunyuan-DiT' or 'Hunyuan-Large' (depending on the variant). It does not attempt to masquerade as a competitor's model and is transparent about its nature as a Tencent-developed AI. Versioning (v1.1, v1.2) is clearly communicated in the repository and model cards.
License Clarity
The model uses the 'Tencent Hunyuan Community License'. While the terms for commercial use (requiring a separate license if exceeding 100M MAU) are stated, the license contains highly restrictive and unusual geographic clauses. Specifically, it explicitly excludes the European Union, UK, and South Korea from the 'Territory' where the model can be used or its outputs displayed, creating significant legal ambiguity for global users.
Hardware Footprint
Hardware requirements are well-documented for various use cases. The repository provides specific VRAM estimates for standard inference (32GB recommended, 11GB minimum) and offers a 'low-VRAM' version capable of running on 6GB. Quantization options (8-bit T5) and acceleration libraries (TensorRT) are also documented with their respective performance impacts.
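The VRAM tiers quoted above can be turned into a simple pre-flight check. This is a sketch only: the thresholds are the figures from this section, while the function name `pick_inference_mode` and the mode labels are hypothetical and do not correspond to flags in the repository.

```python
# Thresholds in GB, taken from the figures quoted in this section.
RECOMMENDED_VRAM_GB = 32
MINIMUM_VRAM_GB = 11
LOW_VRAM_GB = 6


def pick_inference_mode(vram_gb: float) -> str:
    """Map available GPU memory to a (hypothetical) inference mode label."""
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    if vram_gb >= RECOMMENDED_VRAM_GB:
        return "standard-recommended"
    if vram_gb >= MINIMUM_VRAM_GB:
        return "standard-minimum"
    if vram_gb >= LOW_VRAM_GB:
        return "low-vram"
    return "unsupported"


for gb in (40, 16, 8, 4):
    print(gb, "GB ->", pick_inference_mode(gb))
```

A check like this only compares against the documented estimates; actual memory use will also depend on resolution, batch size, and whether the 8-bit T5 option is enabled.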
Versioning Drift
The project maintains a clear version history (v1.0, v1.1, v1.2) with a changelog on GitHub. Updates include specific bug fixes (e.g., mitigating oversaturation) and feature additions (ControlNet, IP-Adapter). However, the documentation of performance drift between versions is qualitative rather than quantitative, and older versions are not always easily accessible for comparative testing.