
Qwen2.5-72B

Parameters

72B

Context Length

131K

Modality

Text

Architecture

Dense

License

Qwen License

Release Date

19 Sept 2024

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

64

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

RoPE

RoPE Theta

1,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

8,192

Number of Layers

80

FFN Intermediate Size (Dense)

29,568

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

152,064

Architecture Diagram

Input Tokens → Token Embedding (Position: RoPE · Hidden: 8.2k · Context: 131K · Vocab: 152.1k) → 80 layers × [RMSNorm (Pre-Attention) → Grouped-Query Attention (64 Q / 8 KV heads, head dim: 128) → residual add → RMSNorm (Pre-FFN) → Feed-Forward Network (SwiGLU, intermediate: 29.6k) → residual add] → Final RMSNorm → Output Logits

Qwen2.5-72B

Qwen2.5-72B is the 72-billion-parameter model in the Qwen2.5 series of large language models developed by Alibaba. It is a decoder-only Transformer trained as a causal language model. Its design uses Rotary Position Embeddings (RoPE), SwiGLU as the activation function, RMSNorm for normalization, and an attention mechanism with QKV bias. Together these choices give a general-purpose base for language processing tasks.
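As a rough illustration of the rotary position embedding mentioned above, the sketch below computes RoPE rotation frequencies and rotates a single two-dimensional feature pair. It uses the RoPE theta of 1,000,000 from the specifications and assumes a head dimension of 128. The function names are hypothetical, and this is not the model's actual implementation.

```python
import math

ROPE_THETA = 1_000_000.0  # RoPE base listed in the specifications
HEAD_DIM = 128            # assumption: per-head dimension

def rope_inv_freq(head_dim=HEAD_DIM, theta=ROPE_THETA):
    """Inverse frequency for each of the head_dim // 2 feature pairs."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def rotate_pair(x0, x1, position, inv_freq):
    """Rotate one (x0, x1) feature pair by the angle position * inv_freq."""
    angle = position * inv_freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c

freqs = rope_inv_freq()
# The first pair rotates fastest; the last pair rotates very slowly.
print(len(freqs), freqs[0], freqs[-1])
print(rotate_pair(1.0, 0.0, position=5, inv_freq=freqs[0]))
```

A large theta makes the slowest pairs rotate very little per token, which is one reason a high base is commonly paired with long context windows.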

Qwen2.5-72B improves on its predecessor, Qwen2. It has broader knowledge coverage and performs better in coding and mathematics. It also follows instructions more reliably, which makes it more adaptable to diverse user prompts and conditional scenarios. Its design targets practical applications that require accurate output.

This model is engineered for extensive text processing, supporting context lengths up to 131,072 tokens and generating outputs up to 8,192 tokens. It is proficient in generating long-form content, understanding structured data formats like tables, and producing structured outputs such as JSON. Additionally, Qwen2.5-72B provides multilingual support across more than 29 languages, making it suitable for a wide array of content generation, coding assistance, and advanced artificial intelligence applications like chatbots and virtual assistants.

About Qwen2.5

Qwen2.5 by Alibaba is a family of decoder-only language models available in various sizes. The open-weight models are dense, while some hosted variants use Mixture-of-Experts. These models are pretrained on large-scale datasets and support extended context lengths and multilingual communication. The family includes specialized models for coding, mathematics, and multimodal tasks, such as vision and audio processing.


Evaluation Benchmarks

Rank

#119

Category | Benchmark | Score | Rank
- | - | 0.935 | 12
- | - | 0.742 | 19
Professional Knowledge | MMLU Pro | 0.71 | 62

Rankings

Overall Rank

#119

Coding Rank

-

Model Integrity

Total Score

B

65 / 100

Qwen2.5-72B Model Integrity Report

Audit Note

Qwen2.5-72B exhibits strong transparency in its architectural specifications and tokenizer implementation, providing clear technical details for its dense Transformer structure. However, it remains opaque regarding the specific composition of its 18-trillion-token training set and the total compute resources consumed during training. While the model is highly accessible through open weights, its custom license and lack of detailed environmental reporting are notable weaknesses in its overall transparency profile.

Upstream

20.5 / 30

Architectural Provenance

7.5 / 10

The model architecture is explicitly documented as a dense decoder-only Transformer. Key components such as Grouped Query Attention (GQA), SwiGLU activation, RMSNorm, and Rotary Position Embeddings (RoPE) are detailed in the official technical report and blog posts. While the pre-training methodology is described as a staged process (initially 4k context, then 32k), the specific transition points and full hyperparameter sets for each stage are only partially disclosed. The model is a successor to Qwen2, and the evolution of architectural choices is well-documented.

Dataset Composition

4.5 / 10

Alibaba discloses that the model was trained on 18 trillion tokens, a significant increase from its predecessor. General categories are provided (web, books, code, math, and multilingual data), and the use of synthetic data generated by previous Qwen models is acknowledged. However, there is no specific percentage breakdown of the dataset composition (e.g., web vs. code), and the exact sources of the 'high-quality' data remain proprietary. Filtering and cleaning methodologies are mentioned but lack the granular detail required for a high score.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly available via the Hugging Face repository and is well-documented. It uses byte-level Byte Pair Encoding (BPE) with 151,643 regular tokens plus 22 control tokens, and the embedding matrix is padded to 152,064 entries, which is the vocabulary size listed in the specifications. Because it operates on bytes, it covers over 29 languages without 'unknown' tokens. The vocabulary size and tokenization approach are consistent across official documentation and third-party implementations like vLLM and Ollama.

Model

26.0 / 40

Parameter Density

8.0 / 10

The model's parameter count is clearly stated as 72.7 billion total, with 70.0 billion non-embedding parameters. As a dense model, all parameters are active during inference, which is explicitly confirmed in the technical report. Detailed architectural specifications, including the number of layers (80), hidden dimension (8,192), and attention heads (64 Q, 8 KV), are publicly available.
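The stated total can be sanity-checked from the architectural figures. The sketch below is an estimate, not an official accounting. It assumes a hidden size of 8,192 (the value under which the estimate reproduces the stated 72.7 billion total), untied input and output embeddings, bias terms on the Q, K, and V projections only, and the FFN intermediate size and vocabulary size from the specifications. The function name is hypothetical.

```python
def count_parameters(hidden=8192, layers=80, q_heads=64, kv_heads=8,
                     ffn=29568, vocab=152064):
    """Estimate (non_embedding, embedding) parameter counts for a dense GQA model."""
    head_dim = hidden // q_heads
    kv_dim = kv_heads * head_dim
    # Q and output projections are hidden x hidden; K and V are hidden x kv_dim.
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim
    attn += hidden + 2 * kv_dim          # assumption: bias on Q, K, V only
    mlp = 3 * hidden * ffn               # gate, up, and down projections (SwiGLU)
    norms = 2 * hidden                   # two RMSNorm weight vectors per layer
    non_embedding = layers * (attn + mlp + norms) + hidden  # plus final RMSNorm
    embedding = 2 * vocab * hidden       # assumption: untied input/output embeddings
    return non_embedding, embedding

non_emb, emb = count_parameters()
total = non_emb + emb
print(f"non-embedding: {non_emb / 1e9:.1f}B, embedding: {emb / 1e9:.1f}B, total: {total / 1e9:.1f}B")
```

Under these assumptions the total comes to about 72.7 billion. The non-embedding estimate lands near 70.2 billion, close to but not identical to the stated 70.0 billion, so the official figure may count parameters slightly differently.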

Training Compute

3.0 / 10

Information regarding training compute is extremely limited. While the technical report mentions the use of 'large-scale distributed infrastructure' and provides some optimizer settings (AdamW, learning rate schedules), it fails to disclose the total GPU/TPU hours, the specific hardware cluster size, or the training duration. No carbon footprint or environmental impact data is provided, which is a significant transparency gap for a model of this scale.

Benchmark Reproducibility

6.0 / 10

Alibaba provides comprehensive benchmark results across standard sets (MMLU, GSM8K, HumanEval, etc.) and specifies the versions used (e.g., MMLU-Pro, LiveCodeBench). However, while the evaluation results are detailed, the exact evaluation code and full prompt templates used to generate these specific scores are not fully centralized or provided in a 'one-click' reproducible format. Third-party verification is available through leaderboards like LMSYS Chatbot Arena, which adds credibility but does not replace the need for official reproduction scripts.

Identity Consistency

9.0 / 10

The model consistently identifies itself as Qwen, developed by Alibaba Cloud, in its system prompts and official documentation. It maintains a clear versioning identity (Qwen2.5) and is transparent about its nature as an AI. There are no documented cases of the model claiming to be a competitor's product in its official weights.

Downstream

18.5 / 30

License Clarity

6.5 / 10

The model is released under the 'Qwen License Agreement,' which is a custom license rather than a standard open-source license like Apache 2.0. While the terms are publicly accessible and clearly state that commercial use is permitted for entities with fewer than 100 million monthly active users, the requirement for a separate agreement beyond that threshold introduces a proprietary restriction. This 'open-weights but not open-source' distinction is clearly communicated but limits the score compared to truly permissive licenses.

Hardware Footprint

7.0 / 10

Official documentation and the model card provide clear guidance on context length (131,072 tokens, often written as 128K) and output limits. While Alibaba's own documentation on specific VRAM requirements for various quantization levels is somewhat sparse, the community and third-party providers (e.g., Hugging Face, Ollama) have extensively documented the VRAM needs for FP16 (~144GB), 8-bit (~77GB), and 4-bit (~47GB) versions. The model's memory scaling with context length is also well-understood by the community.
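The way memory scales with context length can be sketched with a back-of-the-envelope calculation. The example below assumes 80 layers, 8 key-value heads, a head dimension of 128, and 16-bit cache values. It ignores activations, framework overhead, and any cache quantization, so real deployments will differ. Function names are hypothetical.

```python
def weight_memory_gb(params=72.7e9, bytes_per_param=2.0):
    """Approximate weight memory in GB (10^9 bytes) at a given precision."""
    return params * bytes_per_param / 1e9

def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV cache size: one key and one value vector per layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(131_072)
print(f"FP16 weights: ~{weight_memory_gb():.0f} GB")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache at 131,072 tokens: {full_context / 2**30:.0f} GiB")
```

With grouped-query attention only the 8 key-value heads are cached rather than all query heads, which keeps the per-token cache at roughly 320 KiB under these assumptions.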

Versioning Drift

5.0 / 10

The model follows a clear naming convention (Qwen2 -> Qwen2.5), but detailed changelogs for minor weight updates or 'silent' refinements are not consistently maintained in a public-facing ledger. While the transition from Qwen2 to 2.5 was a major, well-documented event, there is limited transparency regarding the ongoing maintenance or potential behavior drift of the weights within the 2.5-72B-Instruct repository since its initial release.
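As a quick arithmetic check, the sub-scores in the report can be summed to confirm the section subtotals and the overall score. The snippet only restates numbers from the report above. The dictionary layout is an illustrative choice, not part of any scoring tool.

```python
# Sub-scores as listed in the integrity report (each out of 10).
scores = {
    "Upstream": {
        "Architectural Provenance": 7.5,
        "Dataset Composition": 4.5,
        "Tokenizer Integrity": 8.5,
    },
    "Model": {
        "Parameter Density": 8.0,
        "Training Compute": 3.0,
        "Benchmark Reproducibility": 6.0,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 6.5,
        "Hardware Footprint": 7.0,
        "Versioning Drift": 5.0,
    },
}

subtotals = {section: sum(items.values()) for section, items in scores.items()}
total = sum(subtotals.values())
print(subtotals, total)
```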
