Parameters
500M
Context Length
32,768 tokens
Modality
Text
Architecture
Dense
License
Apache 2.0
Release Date
19 Sept 2024
Knowledge Cutoff
-
Attention Structure
Grouped-Query Attention
Hidden Dimension Size
896
Number of Layers
24
Attention Heads
14
Key-Value Heads
2
Activation Function
SwiGLU
Normalization
RMS Normalization
Position Embedding
RoPE
Qwen2.5-0.5B is a foundational large language model developed by the Qwen team at Alibaba Cloud. It is part of the Qwen2.5 series, which represents an advancement in language model capabilities, featuring improvements in knowledge acquisition, coding proficiency, and mathematical reasoning. This variant, with approximately 0.49 billion parameters, serves as a robust base model, primarily designed for pretraining and subsequent fine-tuning for specialized applications. Its architecture is engineered to handle complex language tasks efficiently across multiple languages.
Architecturally, Qwen2.5-0.5B is a dense, decoder-only Transformer model. It incorporates Rotary Position Embedding (RoPE) for effective positional encoding, SwiGLU as its activation function, and RMSNorm for normalization. The attention mechanism utilizes Grouped Query Attention (GQA), specifically configured with 14 query heads and 2 key-value heads for this model size. The model is structured with 24 layers, contributing to its depth and capacity for learning intricate patterns in language data.
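The GQA configuration above (14 query heads, 2 key-value heads, 24 layers) implies a much smaller KV cache than standard multi-head attention would require. A minimal sketch, assuming BF16 caching and a head dimension of 64 (hidden dimension divided by query heads):

```python
# Hedged sketch: KV-cache sizing for Qwen2.5-0.5B's GQA setup.
# Layer, head, and KV-head counts come from the model card; the
# head dimension of 64 is an assumption (hidden size / query heads).

LAYERS = 24
QUERY_HEADS = 14
KV_HEADS = 2
HEAD_DIM = 64          # assumed: hidden dimension / query heads
BYTES_BF16 = 2         # bytes per element in BF16

def kv_cache_bytes(seq_len: int, kv_heads: int) -> int:
    """Bytes needed to cache keys and values for seq_len tokens."""
    # 2 tensors (K and V) per layer, each [kv_heads, seq_len, head_dim]
    return 2 * LAYERS * kv_heads * seq_len * HEAD_DIM * BYTES_BF16

full_context = 32_768
gqa = kv_cache_bytes(full_context, KV_HEADS)
mha = kv_cache_bytes(full_context, QUERY_HEADS)  # hypothetical MHA baseline

print(f"GQA KV cache at 32k tokens: {gqa / 2**20:.0f} MiB")
print(f"MHA would need:             {mha / 2**20:.0f} MiB ({mha // gqa}x more)")
```

With only 2 of 14 heads carrying distinct key-value states, the cache shrinks sevenfold, which is what makes the full 32,768-token context practical on small-footprint hardware.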
As a causal language model, Qwen2.5-0.5B is suitable for a range of downstream applications following post-training processes such as supervised fine-tuning or reinforcement learning from human feedback. Its capabilities include instruction following, generating extended text sequences, and processing structured data formats like JSON. The model supports a full context length of 32,768 tokens, with the broader Qwen2.5 series capable of handling contexts up to 128,000 tokens and generating outputs up to 8,000 tokens. It offers multilingual support, encompassing over 29 languages.
Qwen2.5 by Alibaba is a family of decoder-only language models available in various sizes, mostly dense, with some variants utilizing Mixture-of-Experts. These models are pretrained on large-scale datasets, supporting extended context lengths and multilingual communication. The family includes specialized models for coding, mathematics, and multimodal tasks, such as vision and audio processing.
No evaluation benchmarks are available for Qwen2.5-0.5B.
Overall Rank
-
Coding Rank
-
Total Score
67 / 100
Qwen2.5-0.5B demonstrates strong transparency in its architectural specifications, licensing, and tokenizer implementation, providing clear technical details for developers. However, it significantly lacks disclosure regarding training compute resources and granular dataset composition. While benchmark results are provided, concerns regarding their reproducibility and the lack of environmental impact data represent notable gaps in its transparency profile.
Architectural Provenance
The model's architecture is extensively documented in the Qwen2.5 technical report and official GitHub repository. It is a dense, decoder-only Transformer utilizing Rotary Position Embedding (RoPE), SwiGLU activation, and RMSNorm. Specifically for the 0.5B variant, the Grouped Query Attention (GQA) configuration is detailed with 14 query heads and 2 key-value heads across 24 layers. The transition from the Qwen2 base is clearly explained, and the model weights are publicly accessible on Hugging Face with clear configuration files.
Dataset Composition
While the total token count is disclosed (expanded from 7 trillion in Qwen2 to 18 trillion in Qwen2.5), the specific composition breakdown (e.g., percentages of web, code, and math data) is not provided for the general 0.5B base model. Documentation mentions 'massive high-quality domain-balanced training sets' and 'expertly curated' data but lacks a granular public breakdown of sources or specific filtering thresholds, relying on high-level descriptions of data types.
Tokenizer Integrity
The tokenizer is publicly available and fully documented. It uses Byte-level Byte Pair Encoding (BBPE) with a vocabulary size of 151,643 regular tokens and 3 control tokens, ensuring no 'unknown' words. The vocabulary is shared across all Qwen2.5 model sizes, and the compression rates and multilingual efficiency are verified in the technical report. Tokenizer configuration files (tokenizer.json, vocab.json) are accessible in the official repositories.
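The "no unknown words" property follows from the byte-level fallback: any string decomposes into UTF-8 bytes, and every possible byte has a base-vocabulary entry. A toy illustration of that fallback (not the actual Qwen tokenizer, which merges bytes into roughly 151k subword tokens):

```python
# Minimal sketch of the byte-level fallback behind BBPE: any string,
# in any language, reduces to a sequence of bytes (0-255), so a
# byte-level vocabulary can never produce an "unknown" token. This is
# an illustration only, not the Qwen2.5 tokenizer itself.

def byte_fallback_ids(text: str) -> list[int]:
    """Lowest-level encoding: one token id per UTF-8 byte."""
    return list(text.encode("utf-8"))

# Works for any input, including non-Latin scripts and accents:
for sample in ["hello", "你好", "café"]:
    ids = byte_fallback_ids(sample)
    assert all(0 <= i < 256 for i in ids)         # fits a 256-entry base vocab
    assert bytes(ids).decode("utf-8") == sample   # losslessly reversible
print("no unknown tokens needed")
```

In practice BPE merges learned over these bytes yield the 151,643-entry vocabulary, with the raw bytes remaining as a guaranteed fallback for anything unseen.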
Parameter Density
The parameter count is precisely stated as 0.49 billion total, with a further breakdown of 0.36 billion non-embedding parameters. As a dense model, all parameters are active during inference, which is explicitly confirmed in the technical documentation. The architectural specifications (layers, hidden dimensions, and attention heads) are clearly mapped to the parameter count.
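The stated split can be sanity-checked: the gap between total (0.49B) and non-embedding (0.36B) parameters should roughly equal the embedding matrix. A rough sketch, assuming the published vocabulary size and a hidden dimension of 896 for the 0.5B variant:

```python
# Rough cross-check of the stated parameter split. The vocabulary size
# comes from the model card; the hidden dimension of 896 is an assumption
# based on the published 0.5B configuration.

VOCAB = 151_646        # 151,643 regular + 3 control tokens
HIDDEN = 896           # assumed hidden dimension of the 0.5B variant

embedding_params = VOCAB * HIDDEN   # input embedding matrix
total = 0.49e9
non_embedding = 0.36e9

print(f"embedding matrix:       {embedding_params / 1e9:.3f}B parameters")
print(f"total - non-embedding:  {(total - non_embedding) / 1e9:.2f}B")
# The two figures land close together, consistent with tied input/output
# embeddings (the output head reusing the embedding matrix) in this size.
```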
Training Compute
Information regarding training compute is extremely limited. While the hardware type (NVIDIA A100/H100) is implied by the scale of the project and mentioned in inference benchmarks, the specific GPU hours, total compute budget, and carbon footprint for the 0.5B variant are not disclosed. The technical report focuses on performance metrics rather than resource expenditure.
Benchmark Reproducibility
The model provides scores for standard benchmarks (MMLU, MATH, HumanEval) in its technical report. However, while evaluation code is available on GitHub, the exact few-shot prompts and specific versions for all benchmarks are not consistently detailed for the 0.5B variant. Independent researchers have noted significant performance drops on 'clean' versions of benchmarks released after the model's training cutoff, suggesting potential issues with the reported scores' generalizability.
Identity Consistency
The model consistently identifies itself as part of the Qwen series developed by Alibaba Cloud. It maintains clear versioning (Qwen2.5-0.5B) and distinguishes between its base and instruction-tuned variants. There are no reported instances of the model claiming to be a competitor's product or misrepresenting its foundational architecture.
License Clarity
The model is released under the Apache 2.0 license, which is a standard, permissive open-source license allowing for commercial use, modification, and distribution. The license is clearly stated on the official GitHub, Hugging Face repository, and in the technical report, with no conflicting proprietary terms found for this specific variant.
Hardware Footprint
VRAM requirements are well-documented for various precisions (BF16, INT8, INT4). Official documentation and third-party tools provide specific memory footprints (e.g., ~0.97GB for BF16 at 1k context) and scaling data for context lengths up to 32,768 tokens. Quantization tradeoffs are also addressed with speed and memory benchmarks provided for GPTQ and AWQ formats.
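The documented footprints line up with simple arithmetic on the parameter count. A back-of-envelope sketch for weight memory alone (activations and KV cache come on top, which is why the documented ~0.97GB BF16 figure at 1k context runs slightly under the raw weight size here):

```python
# Back-of-envelope weight memory per precision for a 0.49B-parameter
# model. Weights only; reported in decimal gigabytes (assumption).

PARAMS = 0.49e9  # total parameter count from the model card

def weight_memory_gb(bytes_per_param: float) -> float:
    """Weight storage in decimal gigabytes at the given precision."""
    return PARAMS * bytes_per_param / 1e9

for precision, bytes_per_param in {"BF16": 2.0, "INT8": 1.0, "INT4": 0.5}.items():
    print(f"{precision}: {weight_memory_gb(bytes_per_param):.2f} GB of weights")
```

INT8 halves and INT4 quarters the BF16 weight footprint, which matches the direction of the GPTQ/AWQ tradeoff figures the documentation provides.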
Versioning Drift
The model uses a clear versioning scheme (2.5), and a changelog is maintained on the official GitHub. However, there is limited documentation regarding long-term drift or specific weight updates within the 2.5 release cycle. While the transition from 2.0 to 2.5 is documented, the granularity of updates for the 0.5B variant specifically is moderate.