Parameters: 1.7B
Context Length: 32,768 tokens
Modality: Text
Architecture: Dense
License: Apache 2.0
Release Date: 29 Apr 2025
Knowledge Cutoff: Dec 2024

Attention
Attention Structure: Grouped-Query Attention
Attention Heads: 16
Key-Value Heads: 8
Attention Head Dimension: 128
Position Embedding: RoPE
RoPE Theta: 1,000,000
Sliding Window Attention: No
Sliding Window Size: -
Normalization: RMS Normalization
Activation Function: SwiGLU

Dimensions
Hidden Dimension Size: 2,048
Number of Layers: 28
FFN Intermediate Size (Dense): 6,144
Multi-Token Prediction Heads: -

Tokenizer
Vocabulary Size: 151,936
Qwen3-1.7B is a dense causal language model engineered by the Alibaba Qwen team as a high-efficiency solution for general-purpose language processing and reasoning. Introduced as part of the Qwen3 series on April 29, 2025, the model is designed to operate effectively across diverse hardware environments, including mobile devices and edge computing platforms. It supports a native context length of 32,768 tokens, which can be further extended using YaRN-based rotary embedding scaling techniques, enabling the processing of extensive documents and prolonged multi-turn interactions.
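As a rough illustration of the YaRN extension mentioned above, the sketch below derives the scaling factor needed to reach a longer target context from the native 32,768 tokens. The helper `yarn_rope_scaling` is hypothetical, the 131,072-token target is an illustrative assumption rather than a figure from this page, and the key names follow the `rope_scaling` convention used in Hugging Face Transformers configs, which should be checked against the framework version in use.

```python
NATIVE_CONTEXT = 32_768  # native context length stated in the description


def yarn_rope_scaling(target_context: int, native_context: int = NATIVE_CONTEXT) -> dict:
    """Build a YaRN-style rope_scaling fragment (hypothetical helper).

    The factor is simply the ratio of the desired context length to the
    native one; the key names are an assumption about the config schema.
    """
    if target_context <= native_context:
        raise ValueError("YaRN scaling is only needed beyond the native context length")
    return {
        "rope_type": "yarn",
        "factor": target_context / native_context,
        "original_max_position_embeddings": native_context,
    }


# Illustrative target: four times the native window.
config_fragment = yarn_rope_scaling(131_072)
print(config_fragment)
```

Static scaling of this kind applies the same factor to every request, so it is usually worth enabling only when inputs actually exceed the native window.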
The model is a 28-layer transformer with a hidden dimension of 2,048. It uses Grouped-Query Attention (GQA) with 16 query heads and 8 key-value heads, which shrinks the key-value cache during inference relative to standard multi-head attention. The architecture also uses RMSNorm with pre-normalization, SwiGLU activation functions, and QK-Norm, which was introduced to improve attention-layer stability in long-context scenarios. Positional information is encoded with Rotary Positional Embeddings (RoPE) using an Adjusted Base Frequency (ABF) approach, here a RoPE base of 1,000,000, to maintain accuracy over the model's large context window.
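To make the role of the RoPE base concrete, the following minimal sketch computes the standard RoPE inverse frequencies, theta^(-2i/d), for a head dimension of 128 and a base of 1,000,000 (both taken from the specification above). The function name is hypothetical, and this is the textbook RoPE formula, not code from the Qwen implementation.

```python
import math


def rope_inverse_frequencies(head_dim: int = 128, theta: float = 1_000_000.0) -> list[float]:
    """Return the head_dim / 2 rotation frequencies used by standard RoPE."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


inv_freq = rope_inverse_frequencies()

# Wavelength (in tokens) of the slowest-rotating dimension pair.
longest_wavelength = 2 * math.pi / inv_freq[-1]

# Same quantity with the classic base of 10,000, for comparison.
baseline_wavelength = 2 * math.pi / rope_inverse_frequencies(theta=10_000.0)[-1]

print(f"pairs: {len(inv_freq)}")
print(f"longest wavelength (theta=1e6): {longest_wavelength:,.0f} tokens")
print(f"longest wavelength (theta=1e4): {baseline_wavelength:,.0f} tokens")
```

A larger base stretches the slowest rotations across many more positions, which is the intuition behind raising the base frequency for long-context models.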
A primary innovation of the Qwen3-1.7B model is its native dual-mode operational capability, which allows it to function in both Thinking and Non-Thinking modes within a single weight set. Thinking mode activates a step-by-step reasoning process, making the model suitable for complex logical deduction, mathematical problem-solving, and code generation. Non-Thinking mode provides direct, high-speed responses for standard conversational applications. This hybrid system supports dynamic switching via user directives or API parameters, allowing developers to allocate a computational thinking budget that balances output quality with inference latency.
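In Thinking mode the reasoning is emitted inside `<think>...</think>` tags ahead of the final answer, so client code typically needs to separate the two. The sketch below is one way to do that with plain string handling; `split_thinking` is a hypothetical helper, and it assumes at most one reasoning block at the start of the output.

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, answer).

    Assumes the reasoning, if any, ends at the first </think> tag.
    The opening <think> tag is treated as optional, since some chat
    templates place it in the prompt rather than in the output.
    """
    open_tag, close_tag = "<think>", "</think>"
    end = text.find(close_tag)
    if end == -1:
        # Non-Thinking output: no reasoning block at all.
        return "", text.strip()
    reasoning = text[:end]
    start = reasoning.find(open_tag)
    if start != -1:
        reasoning = reasoning[start + len(open_tag):]
    answer = text[end + len(close_tag):]
    return reasoning.strip(), answer.strip()


reasoning, answer = split_thinking("<think>2 + 2 = 4</think>\nThe answer is 4.")
print(reasoning)
print(answer)
```

Keeping the reasoning separate makes it easy to log or discard it and to pass only the final answer back into multi-turn history.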
The Qwen3 model family comprises dense and Mixture-of-Experts (MoE) architectures, with parameter counts from 0.6B to 235B. Its main additions are a hybrid reasoning system with 'thinking' and 'non-thinking' modes and support for long context windows.
No evaluation benchmarks are available for Qwen3-1.7B.
Overall Rank: -
Coding Rank: -
Total Score: 72 / 100
Qwen3-1.7B exhibits strong transparency in its architectural documentation and licensing, providing precise parameter counts and a clear open-source framework. However, it remains opaque regarding the specific composition of its 36-trillion-token training set and the total compute resources expended during its creation. The model's innovative dual-mode reasoning is well-documented, though the exact data used for its reasoning-stage fine-tuning is not disclosed.
Architectural Provenance
The model is part of the Qwen3 series, and its architecture is documented in the Qwen3 Technical Report (arXiv:2505.09388). It is a dense causal transformer with 28 layers, a hidden dimension of 2,048, and GQA with 16 query heads and 8 key-value heads. Specific technical refinements such as SwiGLU, RMSNorm with pre-normalization, and QK-Norm for long-context stability are explicitly detailed. The training methodology is described as a three-stage process (General, Reasoning, and Long Context), though specific hyperparameters for each stage are not fully disclosed.
Dataset Composition
Alibaba discloses that the model was trained on approximately 36 trillion tokens covering 119 languages. The documentation mentions broad categories (web data, books, PDFs) and the use of synthetic data generated by Qwen2.5-Math and Qwen2.5-Coder. While the three-stage curriculum is explained (e.g., Stage 2 increasing STEM/coding proportions), the exact percentage breakdown of the 36T tokens across these categories is not provided, and the specific data sources remain proprietary.
Tokenizer Integrity
The tokenizer is publicly available via the Qwen GitHub and Hugging Face repositories. It uses Byte-Level Byte-Pair Encoding (BBPE) with a vocabulary size of 151,669; the 151,936 figure listed as the vocabulary size is the padded size of the embedding table. It supports 119 languages and includes specific tokens for its 'thinking' mode (<think>...</think>). The alignment between the tokenizer and the claimed multilingual support is verifiable through the provided technical report and model files.
Parameter Density
The model is explicitly defined as a dense architecture with 1.70 billion total parameters. Documentation further clarifies that it contains 1.4 billion non-embedding parameters. This level of precision, distinguishing between total and non-embedding counts, is exemplary for transparency in parameter density.
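The total and non-embedding counts can be sanity-checked from the published dimensions. The sketch below is a back-of-the-envelope tally, not an official accounting. It assumes 28 layers, 16 query heads, 8 key-value heads, a head dimension of 128, no bias terms in the linear layers, weight-only RMSNorm, QK-Norm applied over the head dimension, and input and output embeddings that share one matrix. All names are hypothetical.

```python
HIDDEN = 2_048
LAYERS = 28
Q_HEADS = 16
KV_HEADS = 8
HEAD_DIM = 128
FFN = 6_144
VOCAB = 151_936  # embedding table size from the specification


def layer_params() -> int:
    """Parameters in one decoder layer under the stated assumptions."""
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    k_proj = HIDDEN * KV_HEADS * HEAD_DIM
    v_proj = HIDDEN * KV_HEADS * HEAD_DIM
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    qk_norm = 2 * HEAD_DIM        # one weight vector each for Q and K
    mlp = 3 * HIDDEN * FFN        # gate, up, and down projections (SwiGLU)
    layer_norms = 2 * HIDDEN      # pre-attention and pre-MLP RMSNorm
    return q_proj + k_proj + v_proj + o_proj + qk_norm + mlp + layer_norms


non_embedding = LAYERS * layer_params() + HIDDEN  # plus the final RMSNorm
embedding = VOCAB * HIDDEN                        # counted once (tied, by assumption)
total = non_embedding + embedding

print(f"non-embedding: {non_embedding:,}")
print(f"embedding:     {embedding:,}")
print(f"total:         {total:,}")
```

Under these assumptions the tally comes to roughly 1.41 billion non-embedding parameters and 1.72 billion in total, in line with the figures quoted above.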
Training Compute
While the technical report mentions that the training was 'large-scale' and utilized 'significant resources,' it fails to provide specific GPU/TPU hours, hardware cluster specifications, or the total energy consumption/carbon footprint. There are no public estimates for the training cost or duration, which is a significant gap in transparency regarding the model's environmental and resource impact.
Benchmark Reproducibility
The technical report provides extensive benchmark results across standard sets like MMLU, GSM8K, and HumanEval, often comparing performance in 'thinking' vs 'non-thinking' modes. However, while some evaluation settings (like temperature and top-p) are recommended in the model card, the full evaluation code and exact few-shot prompts used to generate the official report scores are not fully public, limiting independent verification.
Identity Consistency
The model consistently identifies as part of the Qwen series and is transparent about its dual-mode (thinking/non-thinking) capabilities. It provides clear versioning within the Qwen3 family. There are no documented instances of the model claiming to be a competitor's product or misrepresenting its 1.7B scale.
License Clarity
The model and its weights are released under the Apache 2.0 license, which is a standard, permissive open-source license. This is explicitly stated in the official blog, the technical report, and the Hugging Face repository, with no conflicting proprietary terms found in the documentation.
Hardware Footprint
Hardware requirements are well-documented, with specific VRAM estimates provided for different quantization levels (e.g., ~4.74 GB for FP16/Base). The impact of context length on memory (KV cache scaling) is addressed, and third-party verification on consumer hardware (RTX 3060/3090) is widely available and consistent with official claims.
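The KV cache scaling mentioned above can be estimated directly from the architecture. This minimal sketch uses the usual formula (keys plus values, per layer, per key-value head, per head dimension, per token) and assumes 28 layers, 8 key-value heads, a head dimension of 128, and 2 bytes per value (FP16 or BF16). The function name is hypothetical, and real runtimes add overhead for weights, activations, and allocator padding.

```python
def kv_cache_bytes(
    seq_len: int,
    num_layers: int = 28,
    num_kv_heads: int = 8,
    head_dim: int = 128,
    bytes_per_value: int = 2,
) -> int:
    """Estimate KV cache size for one sequence of seq_len tokens."""
    keys_and_values = 2
    return keys_and_values * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len


GIB = 1024 ** 3

per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(32_768)
# Hypothetical comparison: caching all 16 heads, as multi-head attention would.
full_context_mha = kv_cache_bytes(32_768, num_kv_heads=16)

print(f"per token:           {per_token / 1024:.0f} KiB")
print(f"32,768 tokens (GQA): {full_context / GIB:.1f} GiB")
print(f"32,768 tokens (MHA): {full_context_mha / GIB:.1f} GiB")
```

With these numbers the cache grows by 112 KiB per token and reaches 3.5 GiB at the full 32,768-token window, half of what caching all 16 heads would require.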
Versioning Drift
The model follows the Qwen team's standard release cycle, but there is no formal public changelog or semantic versioning for minor weight updates. While major versions are clear (Qwen2.5 to Qwen3), tracking subtle behavior drift or silent updates to the 'thinking' mode logic is difficult for end-users.