Parameters
72B
Context Length
32,768 tokens
Modality
Text
Architecture
Dense
License
Tongyi Qianwen LICENSE AGREEMENT
Release Date
7 Jun 2024
Knowledge Cutoff
-
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
64
Key-Value Heads
8
Attention Head Dimension
-
Position Embedding
RoPE
RoPE Theta
1,000,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
8,192
Number of Layers
80
FFN Intermediate Size (Dense)
29,568
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
152,064
Qwen2-72B is the 72-billion-parameter model in the Qwen2 large language model series developed by Alibaba. It targets natural language understanding and generation, along with coding and mathematical problem-solving. It is a base (pre-trained) model, intended for further fine-tuning toward particular application domains.
Qwen2-72B is built on the Transformer architecture, with several changes aimed at computational efficiency and model quality. The main ones are the SwiGLU activation function and Grouped Query Attention (GQA), in which several query heads share each key-value head, shrinking the key-value cache and speeding up inference. The model also uses an improved tokenizer designed to handle a wide range of natural languages as well as programming code. Qwen2-72B is a dense model, unlike the Mixture-of-Experts (MoE) variants elsewhere in the Qwen2 family.
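The memory effect of GQA can be estimated from the figures on this page. The following is a rough sketch, not an official sizing tool: it takes 80 layers and 8 key-value heads from the spec sheet and 64 query heads from the Parameter Density section, and it assumes a head dimension of 128 (hidden size 8,192 divided by 64 query heads), which the spec sheet does not list. The multi-head baseline is hypothetical and included only for comparison.

```python
# Rough KV-cache sizing for grouped-query attention (GQA).
# HEAD_DIM is an assumption: hidden size 8,192 / 64 query heads.
NUM_LAYERS = 80
NUM_QUERY_HEADS = 64
NUM_KV_HEADS = 8
HEAD_DIM = 8192 // NUM_QUERY_HEADS  # 128 (assumed, not listed in the spec)
BYTES_PER_VALUE = 2                 # BF16 cache entries


def kv_cache_bytes(num_tokens: int, kv_heads: int) -> int:
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Factor 2 covers one key vector and one value vector per head.
    per_token = 2 * NUM_LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


context = 32_768
gqa = kv_cache_bytes(context, NUM_KV_HEADS)
mha = kv_cache_bytes(context, NUM_QUERY_HEADS)  # hypothetical MHA baseline

print(f"GQA cache at {context} tokens: {gqa / 2**30:.1f} GiB")
print(f"MHA cache at {context} tokens: {mha / 2**30:.1f} GiB")
print(f"Reduction: {mha // gqa}x")
```

The reduction factor is simply the ratio of query heads to key-value heads, so it holds at any context length and for a single sequence or a batch.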
Functionally, Qwen2-72B is aimed at natural language understanding, language generation, coding, and mathematical reasoning. As a base model, it provides a pre-trained foundation for post-training methods such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). This makes it suited to applications that call for broad multilingual understanding, complex code manipulation, or advanced mathematical computation.
The Alibaba Qwen2 model family comprises large language models built on the Transformer architecture. It includes both dense and Mixture-of-Experts (MoE) variants designed for diverse language tasks. Technical features include Grouped Query Attention, which reduces the memory footprint at inference time, and support for extended context lengths of up to 131,072 tokens.
Rank
#98
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| General Knowledge | MMLU | 0.823 | 19 |
| Web Development | WebDev Arena | 1261 | 69 |
Overall Rank
#98
Coding Rank
#86
Total Score
63 / 100
Qwen2-72B exhibits strong transparency in its architectural specifications and tokenizer implementation, providing clear technical details for its dense Transformer structure. However, it remains opaque regarding training compute resources and the specific percentage-based composition of its 7-trillion-token dataset. While the model is highly accessible through open weights, its custom license and lack of detailed environmental impact data represent significant transparency gaps.
Architectural Provenance
The Qwen2-72B model is well-documented as a dense, decoder-only Transformer. The technical report and official blog posts explicitly detail the use of Grouped Query Attention (GQA), SwiGLU activation, and Rotary Positional Embeddings (RoPE) with a modified base frequency for long-context support. While the architecture is clearly defined as a successor to Qwen1.5, the specific pre-training methodology (e.g., exact curriculum or stage-by-stage hyperparameter shifts) is described in general terms rather than exhaustive detail.
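To illustrate what a larger RoPE base frequency does, here is a minimal sketch of the standard rotary-embedding frequency schedule. It uses the RoPE theta of 1,000,000 from the spec sheet and assumes a head dimension of 128, which is not listed there. The function names are illustrative and do not come from the Qwen2 codebase.

```python
import math

ROPE_THETA = 1_000_000  # base frequency from the spec sheet
HEAD_DIM = 128          # assumption: 8,192 hidden size / 64 query heads


def rope_inverse_frequencies(head_dim: int, theta: float) -> list[float]:
    """One rotation frequency per pair of channels: theta^(-2i/d)."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def rotate_pair(x0: float, x1: float, position: int, inv_freq: float):
    """Rotate one channel pair by the angle position * inv_freq."""
    angle = position * inv_freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c


inv_freq = rope_inverse_frequencies(HEAD_DIM, ROPE_THETA)

# The slowest-rotating pair sets the longest positional wavelength.
longest_wavelength = 2 * math.pi / inv_freq[-1]
baseline_wavelength = 2 * math.pi / rope_inverse_frequencies(HEAD_DIM, 10_000)[-1]

print(f"Longest wavelength, theta=1e6: {longest_wavelength:,.0f} positions")
print(f"Longest wavelength, theta=1e4: {baseline_wavelength:,.0f} positions")
```

Raising the base stretches the slowest rotations over many more positions, so distant tokens still receive distinguishable angles. That is the usual motivation for a larger base in long-context models.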
Dataset Composition
Alibaba discloses that the model was trained on 7 trillion tokens of multilingual data across 27+ languages. However, the breakdown of this data (e.g., specific percentages of web, code, books, or academic papers) is not publicly quantified. The documentation mentions 'high-quality' and 'meticulously curated' sources including web crawls and code (referencing CodeQwen1.5), but lacks a verifiable, detailed composition report or access to sample data for independent verification.
Tokenizer Integrity
The tokenizer is publicly available via the official GitHub repository and Hugging Face. It uses byte-level Byte Pair Encoding (BPE) with a vocabulary of 151,646 tokens, which is documented to improve compression across multiple languages. The 152,064 figure listed under Tokenizer above is the size of the model's embedding table, which is padded beyond the tokenizer vocabulary. The tokenization approach is consistently reported across official sources and is verifiable through the provided code and model files.
Parameter Density
The model is explicitly stated to be a dense architecture with 72.7 billion total parameters (70.0 billion non-embedding). The architectural configuration is fully disclosed, including 80 layers, 64 query heads, and 8 KV heads for GQA. This level of detail is high, though it lacks a deeper breakdown of parameter distribution across specific components like attention vs. FFN beyond what can be inferred from the layer specs.
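The attention-versus-FFN split can be inferred from the layer specs with some arithmetic. The sketch below assumes a Qwen2-style layer layout that this page does not spell out: bias terms on the query, key, and value projections only, a gated FFN with three weight matrices, two RMSNorm weight vectors per layer, and untied input and output embeddings. It also assumes a head dimension of 128.

```python
# Parameter estimate from the spec-sheet values on this page.
HIDDEN = 8192
LAYERS = 80
FFN_INTERMEDIATE = 29_568
VOCAB = 152_064
QUERY_HEADS = 64
KV_HEADS = 8
HEAD_DIM = HIDDEN // QUERY_HEADS  # 128 (assumed)

q_dim = QUERY_HEADS * HEAD_DIM    # width of the query projection
kv_dim = KV_HEADS * HEAD_DIM      # width of each key/value projection

# Assumed layout: biases on Q, K, V; no bias on the output projection.
attention = (
    HIDDEN * q_dim + q_dim               # query projection + bias
    + 2 * (HIDDEN * kv_dim + kv_dim)     # key and value projections + biases
    + q_dim * HIDDEN                     # output projection
)
ffn = 3 * HIDDEN * FFN_INTERMEDIATE      # gate, up, and down projections
norms = 2 * HIDDEN                       # two RMSNorm weight vectors

non_embedding = LAYERS * (attention + ffn + norms) + HIDDEN  # + final norm
embedding = 2 * VOCAB * HIDDEN           # assumed untied input/output tables
total = non_embedding + embedding

print(f"Attention per layer: {attention / 1e6:.1f}M")
print(f"FFN per layer:       {ffn / 1e6:.1f}M")
print(f"Non-embedding:       {non_embedding / 1e9:.2f}B")
print(f"Total:               {total / 1e9:.2f}B")
```

Under these assumptions the total comes to about 72.7 billion parameters, and the FFN accounts for roughly 83% of each layer.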
Training Compute
There is virtually no public disclosure regarding the specific compute resources used for Qwen2-72B. The technical report mentions that models were trained on 'powerful GPUs' but fails to provide GPU/TPU hours, hardware counts, training duration, or the associated carbon footprint. This information is conspicuously absent, likely for competitive reasons, resulting in a low score.
Benchmark Reproducibility
Alibaba provides extensive benchmark results across 16+ datasets (MMLU, GSM8K, HumanEval, etc.) and specifies the few-shot settings used (e.g., 5-shot for MMLU). However, the exact evaluation code and the specific prompts/few-shot examples used to achieve these scores are not fully public in a single reproducible repository. Third-party evaluations (like Open LLM Leaderboard) exist but sometimes show variance from official claims.
Identity Consistency
The model consistently identifies itself as Qwen and is transparent about its versioning (Qwen2 vs Qwen1.5). It does not exhibit the identity confusion seen in some other models (e.g., claiming to be GPT-4). It is clear about its nature as an AI developed by Alibaba and provides consistent version tracking across its weights and documentation.
License Clarity
The model uses the 'Tongyi Qianwen LICENSE AGREEMENT'. While the terms are publicly accessible and allow for commercial use, there is a significant restriction: entities with more than 100 million monthly active users must request a separate license. This 'open-weights' but not 'open-source' (per OSI standards) nature creates some ambiguity for large-scale commercial adoption compared to Apache 2.0 models.
Hardware Footprint
VRAM requirements are well-documented by both the provider and the community. Official documentation notes the reduced KV cache size due to GQA, and community resources (like Hugging Face and vLLM) provide clear guidance on the ~144GB VRAM needed for BF16 and the benefits of 4-bit/8-bit quantization. However, official documentation on the specific accuracy-tradeoffs for various quantization levels is less comprehensive than community-driven data.
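The weight-memory figures follow from the parameter count and the bits stored per weight. This is a back-of-the-envelope sketch using the 72.7 billion total quoted above. It covers weights only, ignoring the KV cache, activations, and any per-group scale or zero-point overhead that real quantization formats add.

```python
TOTAL_PARAMS = 72.7e9  # total parameter count quoted on this page

# Nominal bits per weight; real quantized formats carry extra metadata.
BITS_PER_WEIGHT = {"BF16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(params: float, bits: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
```

The BF16 estimate lands within about 1% of the ~144GB figure above, which corresponds to rounding the model to 72 billion parameters.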
Versioning Drift
Alibaba uses a clear naming convention (Qwen1.5 -> Qwen2 -> Qwen2.5) and maintains a GitHub repository with some changelog information. However, there are reports of performance drift and 'alignment tax' in instruction-tuned versions that are not formally documented in a centralized changelog. Silent updates to weights are less common here than in API-only models, but the transition between minor versions lacks granular diffs.