Parameters
500M
Context Length
32,768 tokens
Modality
Text
Architecture
Dense
License
Apache 2.0
Release Date
7 Jun 2024
Knowledge Cutoff
-
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
14
Key-Value Heads
2
Attention Head Dimension
-
Position Embedding
RoPE
RoPE Theta
1,000,000
Sliding Window Attention
No
Sliding Window Size
131,072
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
896
Number of Layers
24
FFN Intermediate Size (Dense)
4,864
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
151,936
Qwen2-0.5B is a compact entry in the Qwen2 series of large language models developed by the Qwen team at Alibaba. It provides foundational language processing capabilities and is suited to deployment in environments with constrained computational resources. As a base language model, it is intended primarily as a starting point for post-training, such as supervised fine-tuning or reinforcement learning from human feedback.
The Alibaba Qwen2 model family comprises large language models built on the Transformer architecture. It includes both dense and Mixture-of-Experts (MoE) variants, designed for diverse language tasks. Technical features include Grouped Query Attention, which reduces the key-value cache memory footprint during inference, and support for extended context lengths of up to 131,072 tokens in some variants.
No evaluation benchmarks are available for Qwen2-0.5B.
Overall Rank
-
Coding Rank
-
Total Score
64 / 100
Qwen2-0.5B demonstrates strong transparency regarding its architecture and licensing, providing clear technical specifications and a permissive Apache 2.0 license. However, it suffers from significant opacity in its training data composition and compute resources, relying on vague descriptions of 'high-quality' data without specific source disclosure. While highly accessible for deployment, the lack of verifiable environmental impact data and granular versioning for weight updates limits its overall transparency profile.
Architectural Provenance
The model is explicitly identified as a dense, decoder-only Transformer. The technical report and official documentation detail the use of SwiGLU activation, Rotary Position Embeddings (RoPE), and RMSNorm. They specify Grouped Query Attention (GQA) with 14 query heads and 2 key-value heads for this variant. The pre-training methodology is described as next-token prediction followed by post-training (SFT and DPO), and the architectural differences between the 0.5B scale and the larger variants are documented in the technical report's hyper-parameter tables.
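The head layout described above can be illustrated with a short sketch. It takes the hidden size (896) from the specification list and the 14 query heads and 2 key-value heads from this section. Two things are assumptions rather than statements from the source: that the per-head dimension equals the hidden size divided by the number of query heads, and that query heads are assigned to key-value heads in contiguous groups, as is common in open-source GQA implementations. The function name `gqa_layout` is hypothetical.

```python
def gqa_layout(hidden_size: int, num_query_heads: int, num_kv_heads: int):
    """Return (head_dim, group_size, mapping) for a grouped-query attention layout.

    mapping[q] is the index of the key-value head that query head q reads from.
    Assumes head_dim = hidden_size / num_query_heads and contiguous grouping.
    """
    if hidden_size % num_query_heads != 0:
        raise ValueError("hidden size must divide evenly across query heads")
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly across key-value heads")
    head_dim = hidden_size // num_query_heads
    group_size = num_query_heads // num_kv_heads
    # Contiguous grouping: the first group_size query heads share KV head 0, and so on.
    mapping = [q // group_size for q in range(num_query_heads)]
    return head_dim, group_size, mapping


head_dim, group_size, mapping = gqa_layout(896, 14, 2)
print(head_dim)    # 64
print(group_size)  # 7
print(mapping)     # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```

Under these assumptions each key-value head is shared by seven query heads, so the key-value cache holds 2 heads per layer instead of 14.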
Dataset Composition
Alibaba discloses that Qwen2-0.5B was pre-trained on a 12 trillion token dataset, which is larger than the 7 trillion token set used for the larger models but was assembled with a relaxed quality threshold. However, the actual composition breakdown (e.g., percentage of web, code, books) is not provided. The sources are described vaguely as 'large-scale high-quality multilingual' data without naming specific datasets or providing a verifiable distribution. Filtering and cleaning methodologies are mentioned as 'meticulous' but lack public technical specifics for reproduction.
Tokenizer Integrity
The tokenizer is publicly available via the Hugging Face 'transformers' library and GitHub. It uses byte-level Byte-Pair Encoding (BBPE) with a large vocabulary of 151,646 tokens, which is consistent across the Qwen2 family. The 151,936 figure listed under Vocabulary Size is the larger, padded embedding-table size from the model configuration. Documentation confirms the tokenizer is designed for multilingual support (29+ languages) and code, with specific control tokens for chat and tool use. The vocabulary size and pre-tokenization rules are explicitly stated in the technical report and model configuration files.
Parameter Density
The model's parameter count is precisely disclosed as 0.49 billion total parameters, with 0.36 billion non-embedding parameters. As a dense model, all parameters are active during inference, which is clearly stated to distinguish it from the MoE variants in the same family. The architectural breakdown, including the number of layers (24) and hidden dimension size (896), is fully transparent in the technical report.
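As a rough cross-check of these figures, the parameter count can be reconstructed from the listed dimensions. The sketch below is not taken from the technical report: it assumes a standard decoder layer with bias terms on the query, key, and value projections only, a three-matrix SwiGLU feed-forward block, two RMSNorm weight vectors per layer plus a final one, and input and output embeddings that share one matrix. It uses the 14 query heads and 2 key-value heads from the Architectural Provenance section and the 151,936 vocabulary size from the specification list. The function name is hypothetical.

```python
def dense_decoder_param_count(vocab, hidden, layers, q_heads, kv_heads, ffn,
                              tied_embeddings=True, qkv_bias=True):
    """Return (non_embedding_params, embedding_params) for a dense decoder.

    Assumed layout: GQA attention, SwiGLU MLP (gate, up, down), RMSNorm.
    """
    head_dim = hidden // q_heads
    kv_dim = kv_heads * head_dim
    # Attention: Q and O are hidden x hidden; K and V are hidden x kv_dim.
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim
    if qkv_bias:
        attn += hidden + 2 * kv_dim
    mlp = 3 * hidden * ffn          # gate, up, and down projections
    norms = 2 * hidden              # pre-attention and pre-MLP RMSNorm weights
    non_embedding = layers * (attn + mlp + norms) + hidden  # + final norm
    embedding = vocab * hidden * (1 if tied_embeddings else 2)
    return non_embedding, embedding


non_emb, emb = dense_decoder_param_count(
    vocab=151_936, hidden=896, layers=24, q_heads=14, kv_heads=2, ffn=4_864
)
print(non_emb)        # 357898112
print(non_emb + emb)  # 494032768
```

Under these assumptions the totals round to 0.36 billion non-embedding and 0.49 billion total parameters, in line with the disclosed figures.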
Training Compute
There is no public disclosure of the specific GPU/TPU hours, hardware cluster specifications, or total energy consumption used to train the 0.5B variant. While the technical report mentions general training stability techniques and batch sizes, it lacks the verifiable compute metrics required for a high score. No carbon footprint calculations or estimated training costs are provided by the developer.
Benchmark Reproducibility
Alibaba provides results for standard benchmarks (MMLU, HumanEval, GSM8K) with specified shot counts (e.g., 5-shot for MMLU). However, the exact evaluation prompts and full reproduction code for the base model's specific results are not as detailed as those for the instruction-tuned variants. While some evaluation code is on GitHub, the lack of a comprehensive, one-click reproduction suite for the 0.5B base model results limits its score.
Identity Consistency
The model consistently identifies as part of the Qwen2 family and is transparent about its status as a base model not intended for direct chat without fine-tuning. It does not exhibit the identity confusion seen in some other open-weights models that claim to be GPT-4. Versioning is clear, distinguishing it from the later Qwen2.5-0.5B release.
License Clarity
The model is released under the Apache 2.0 license, which is a standard, permissive open-source license. This is explicitly stated on the Hugging Face repository, the official blog, and the GitHub repository. The terms for commercial use and derivative works are clear and follow standard Apache 2.0 protocols without the restrictive 'Qwen License' applied to the 72B variant.
Hardware Footprint
VRAM requirements are well-documented by both the official team and third-party communities. The model requires approximately 1 GB for weights (FP16) and roughly 2 GB total for inference. Documentation on the impact of context length on memory is available (this model supports up to 32K tokens, while some larger Qwen2 variants support up to 128K), though official quantization-accuracy tradeoff curves for the 0.5B variant specifically are less detailed than for the 7B+ models.
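The weight and context-length figures above can be approximated with a back-of-the-envelope estimate. This is a sketch, not an official sizing tool: it assumes roughly 0.49 billion parameters stored at 2 bytes each (FP16), a key-value cache that stores one key and one value vector per layer, per key-value head, and per token, and a head dimension of 64 (896 divided by 14 query heads). It ignores activations and framework overhead, which is why the result falls below the roughly 2 GB quoted for total inference. Function names are hypothetical.

```python
def weight_bytes(num_params: int, bytes_per_param: float) -> float:
    """Memory for model weights at a given precision."""
    return num_params * bytes_per_param


def kv_cache_bytes(context_tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Memory for the key-value cache of a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens


PARAMS = 490_000_000  # approximate count, from the disclosed 0.49 billion

fp16_weights = weight_bytes(PARAMS, 2)
kv_1k = kv_cache_bytes(1_024, layers=24, kv_heads=2, head_dim=64)
kv_32k = kv_cache_bytes(32_768, layers=24, kv_heads=2, head_dim=64)

print(fp16_weights / 1e9)  # 0.98 (GB of FP16 weights)
print(kv_1k / 2**20)       # 12.0 (MiB of KV cache at 1,024 tokens)
print(kv_32k / 2**20)      # 384.0 (MiB of KV cache at 32,768 tokens)
```

With only 2 key-value heads, the cache grows by about 12 KiB per token under these assumptions, so even the full 32,768-token context adds less memory than the FP16 weights themselves.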
Versioning Drift
While the model uses clear naming (Qwen2-0.5B), there is no detailed public changelog for weight updates or a formal system for tracking silent drift. The transition to Qwen2.5 is documented, but intermediate updates to the Qwen2 weights lack granular versioning. Users have reported non-deterministic behavior and performance changes in related variants without clear documentation from the provider.