Parameters
72B
Context Length
131K
Modality
Text
Architecture
Dense
License
Qwen License
Release Date
19 Sept 2024
Knowledge Cutoff
Jan 2025
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
64
Key-Value Heads
8
Attention Head Dimension
128
Position Embedding
RoPE
RoPE Theta
1,000,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
8,192
Number of Layers
80
FFN Intermediate Size (Dense)
29,568
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
152,064
Qwen2.5-72B is a core component of the Qwen2.5 series of large language models developed by Alibaba. This model is built upon a Transformer architecture and operates as a causal language model. Its design incorporates Rotary Position Embeddings (RoPE), SwiGLU as the activation function, and RMSNorm for normalization, complemented by an attention mechanism that includes QKV bias. These architectural choices provide a robust foundation for general-purpose language processing tasks.
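As an illustration of the rotary position embedding mentioned above, the following minimal sketch computes the per-pair rotation frequencies for a RoPE base of 1,000,000. The head dimension of 128 is an assumption taken from the publicly released model configuration, and the helper names are hypothetical. The sketch rotates a single pair of values and does not reproduce any particular library's memory layout.

```python
import math

ROPE_THETA = 1_000_000.0  # RoPE base listed in the spec sheet
HEAD_DIM = 128            # assumed per-head dimension (not confirmed by this page)


def rope_inv_freq(head_dim=HEAD_DIM, theta=ROPE_THETA):
    """Return one inverse frequency per rotated pair (head_dim / 2 values)."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def rotate_pair(x0, x1, position, inv_freq):
    """Rotate the pair (x0, x1) by the angle position * inv_freq."""
    angle = position * inv_freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c


freqs = rope_inv_freq()
# The first pair rotates fastest; later pairs rotate progressively slower.
print(len(freqs), freqs[0], freqs[-1])
```

A larger base stretches the slowest wavelengths, which is one reason long-context models tend to use large theta values.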
Qwen2.5-72B improves on its predecessor, Qwen2, in several areas. It carries more knowledge and is stronger at coding and mathematics. It also follows instructions more reliably, which makes it more adaptable to varied user prompts and conditional scenarios. The design targets practical applications that require high-fidelity output.
This model is engineered for extensive text processing, supporting context lengths up to 131,072 tokens and generating outputs up to 8,192 tokens. It is proficient in generating long-form content, understanding structured data formats like tables, and producing structured outputs such as JSON. Additionally, Qwen2.5-72B provides multilingual support across more than 29 languages, making it suitable for a wide array of content generation, coding assistance, and advanced artificial intelligence applications like chatbots and virtual assistants.
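The two limits above can be combined into a simple request check. This is a rough sketch with a hypothetical helper name, and it assumes that generated tokens count against the same 131,072-token window as the prompt, which is the usual convention but is not stated on this page.

```python
MAX_CONTEXT = 131_072  # context length stated above
MAX_OUTPUT = 8_192     # generation limit stated above


def max_new_tokens(prompt_tokens, requested):
    """Clamp a generation request to the stated context and output limits."""
    if prompt_tokens >= MAX_CONTEXT:
        raise ValueError("prompt alone fills or exceeds the context window")
    room = MAX_CONTEXT - prompt_tokens  # assumption: output shares the window
    return min(requested, MAX_OUTPUT, room)


print(max_new_tokens(1_000, 10_000))
print(max_new_tokens(130_000, 8_192))
```

For short prompts the output cap is the binding limit; only prompts within 8,192 tokens of the window see the context length take over.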
Qwen2.5 by Alibaba is a family of decoder-only language models available in various sizes. Most are dense, and some variants use Mixture-of-Experts. These models are pretrained on large-scale datasets and support extended context lengths and multilingual communication. The family includes specialized models for coding, mathematics, and multimodal tasks such as vision and audio processing.
Rank
#119
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| QA Assistant | ProLLM QA Assistant | 0.935 | 12 |
| Summarization | ProLLM Summarization | 0.742 | 19 |
| Professional Knowledge | MMLU Pro | 0.71 | 62 |
Overall Rank
#119
Coding Rank
-
Total Score
65 / 100
Qwen2.5-72B exhibits strong transparency in its architectural specifications and tokenizer implementation, providing clear technical details for its dense Transformer structure. However, it remains opaque regarding the specific composition of its 18-trillion-token training set and the total compute resources consumed during training. While the model is highly accessible through open weights, its custom license and lack of detailed environmental reporting are notable weaknesses in its overall transparency profile.
Architectural Provenance
The model architecture is explicitly documented as a dense decoder-only Transformer. Key components such as Grouped Query Attention (GQA), SwiGLU activation, RMSNorm, and Rotary Position Embeddings (RoPE) are detailed in the official technical report and blog posts. While the pre-training methodology is described as a staged process (initially 4k context, then 32k), the specific transition points and full hyperparameter sets for each stage are only partially disclosed. The model is a successor to Qwen2, and the evolution of architectural choices is well-documented.
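To make the Grouped Query Attention layout concrete, the sketch below maps each query head to the key-value head it shares. It assumes 64 query heads and 8 key-value heads, the counts given under Parameter Density, and it assumes the common convention that consecutive query heads form a group. The function name is hypothetical.

```python
NUM_Q_HEADS = 64   # query heads, per the Parameter Density section
NUM_KV_HEADS = 8   # key-value heads, per the spec sheet


def kv_head_for(q_head, num_q=NUM_Q_HEADS, num_kv=NUM_KV_HEADS):
    """Return the key-value head shared by a given query head.

    Assumes contiguous grouping: query heads 0..7 use KV head 0, and so on.
    """
    if not 0 <= q_head < num_q:
        raise IndexError("query head out of range")
    group_size = num_q // num_kv  # query heads per key-value head
    return q_head // group_size


print([kv_head_for(h) for h in (0, 7, 8, 63)])
```

With these counts each key-value head serves eight query heads, so the KV cache is one eighth the size it would be under standard multi-head attention.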
Dataset Composition
Alibaba discloses that the model was trained on 18 trillion tokens, a significant increase from its predecessor. General categories are provided (web, books, code, math, and multilingual data), and the use of synthetic data generated by previous Qwen models is acknowledged. However, there is no specific percentage breakdown of the dataset composition (e.g., web vs. code), and the exact sources of the 'high-quality' data remain proprietary. Filtering and cleaning methodologies are mentioned but lack the granular detail required for a high score.
Tokenizer Integrity
The tokenizer is publicly available via the Hugging Face repository and is well-documented. It uses byte-level Byte Pair Encoding (BPE) with a vocabulary of 151,643 regular tokens plus a small set of control tokens; the embedding matrix is padded to 152,064 entries, which is the figure reported as the vocabulary size in model configurations. Byte-level encoding lets the tokenizer cover over 29 languages without 'unknown' tokens. The vocabulary and tokenization approach are consistent across official documentation and third-party implementations like vLLM and Ollama.
Parameter Density
The model's parameter count is clearly stated as 72.7 billion total, with 70.0 billion non-embedding parameters. As a dense model, all parameters are active during inference, which is explicitly confirmed in the technical report. Detailed architectural specifications, including the number of layers (80), hidden dimension (8,192), and attention heads (64 Q, 8 KV), are publicly available.
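The stated total can be sanity-checked from the architectural figures. The sketch below is an approximate count under several assumptions taken from the publicly released configuration rather than from this page: a hidden size of 8,192, a head dimension of 128, bias terms on the Q, K, and V projections only, and separate (untied) input and output embedding matrices.

```python
HIDDEN = 8192      # assumed hidden size
LAYERS = 80
Q_HEADS = 64
KV_HEADS = 8
HEAD_DIM = 128     # assumed per-head dimension
FFN = 29568        # FFN intermediate size from the spec sheet
VOCAB = 152064     # embedding rows from the spec sheet


def count_params():
    """Return (non_embedding, embedding) parameter counts for the sketch."""
    q_dim = Q_HEADS * HEAD_DIM
    kv_dim = KV_HEADS * HEAD_DIM
    attn = HIDDEN * q_dim + q_dim           # Q projection with bias
    attn += 2 * (HIDDEN * kv_dim + kv_dim)  # K and V projections with bias
    attn += q_dim * HIDDEN                  # output projection, no bias
    mlp = 3 * HIDDEN * FFN                  # gate, up, and down projections
    norms = 2 * HIDDEN                      # two RMSNorm weights per layer
    non_embedding = LAYERS * (attn + mlp + norms) + HIDDEN  # plus final norm
    embedding = 2 * VOCAB * HIDDEN          # assumption: untied in/out embeddings
    return non_embedding, embedding


non_emb, emb = count_params()
print(f"total ~ {(non_emb + emb) / 1e9:.1f}B, non-embedding ~ {non_emb / 1e9:.1f}B")
```

Under these assumptions the total lands at roughly 72.7 billion, in line with the stated figure. The non-embedding count comes out slightly above the stated 70.0 billion, which suggests the official number uses a somewhat different accounting.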
Training Compute
Information regarding training compute is extremely limited. While the technical report mentions the use of 'large-scale distributed infrastructure' and provides some optimizer settings (AdamW, learning rate schedules), it fails to disclose the total GPU/TPU hours, the specific hardware cluster size, or the training duration. No carbon footprint or environmental impact data is provided, which is a significant transparency gap for a model of this scale.
Benchmark Reproducibility
Alibaba provides comprehensive benchmark results across standard sets (MMLU, GSM8K, HumanEval, etc.) and specifies the versions used (e.g., MMLU-Pro, LiveCodeBench). However, while the evaluation results are detailed, the exact evaluation code and full prompt templates used to generate these specific scores are not fully centralized or provided in a 'one-click' reproducible format. Third-party verification is available through leaderboards like LMSYS Chatbot Arena, which adds credibility but does not replace the need for official reproduction scripts.
Identity Consistency
The model consistently identifies itself as Qwen, developed by Alibaba Cloud, in its system prompts and official documentation. It maintains a clear versioning identity (Qwen2.5) and is transparent about its nature as an AI. There are no documented cases of the model claiming to be a competitor's product in its official weights.
License Clarity
The model is released under the 'Qwen License Agreement,' which is a custom license rather than a standard open-source license like Apache 2.0. While the terms are publicly accessible and clearly state that commercial use is permitted for entities with fewer than 100 million monthly active users, the requirement for a separate agreement beyond that threshold introduces a proprietary restriction. This 'open-weights but not open-source' distinction is clearly communicated but limits the score compared to truly permissive licenses.
Hardware Footprint
Official documentation and the model card provide clear guidance on context length (128K, i.e. 131,072 tokens) and output limits. While Alibaba's own documentation on specific VRAM requirements for various quantization levels is somewhat sparse, the community and third-party providers (e.g., Hugging Face, Ollama) have extensively documented the VRAM needs for FP16 (~144GB), 8-bit (~77GB), and 4-bit (~47GB) versions. The model's memory scaling with context length is also well-understood by the community.
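A back-of-the-envelope estimate shows where figures of this kind come from. The sketch below computes raw weight storage and KV-cache size only. It assumes 72.7 billion parameters, 80 layers, 8 key-value heads, a head dimension of 128 (taken from the public configuration), and a 16-bit KV cache. It ignores activations, quantization metadata, and runtime overhead, so real deployments need more than these numbers.

```python
PARAMS = 72.7e9   # total parameters, as stated under Parameter Density
LAYERS = 80
KV_HEADS = 8
HEAD_DIM = 128    # assumed per-head dimension


def weight_gb(bits_per_param, params=PARAMS):
    """Raw weight storage in GB (10^9 bytes) at a given precision."""
    return params * bits_per_param / 8 / 1e9


def kv_cache_gb(tokens, bytes_per_value=2):
    """KV-cache size in GB for one sequence, assuming 16-bit values by default."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value  # K and V
    return tokens * per_token / 1e9


for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_gb(bits):.1f} GB")
print(f"KV cache at 131,072 tokens: ~{kv_cache_gb(131_072):.1f} GB")
```

The 16-bit estimate is close to the community figure for FP16, while the 8-bit and 4-bit community figures sit above the raw estimates, consistent with the overhead this sketch leaves out. The KV cache grows linearly with context, so a full-length sequence adds tens of gigabytes on top of the weights.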
Versioning Drift
The model follows a clear naming convention (Qwen2 -> Qwen2.5), but detailed changelogs for minor weight updates or 'silent' refinements are not consistently maintained in a public-facing ledger. While the transition from Qwen2 to 2.5 was a major, well-documented event, there is limited transparency regarding the ongoing maintenance or potential behavior drift of the weights within the 2.5-72B-Instruct repository since its initial release.