Parameters: 2B
Context Length: 262K
Modality: Multimodal
Architecture: Dense
License: Apache 2.0
Release Date: 24 Feb 2026
Knowledge Cutoff: -
Attention
Attention Structure: Grouped-Query Attention
Attention Heads: 8
Key-Value Heads: 2
Attention Head Dimension: 256
Position Embedding: RoPE
RoPE Theta: 10,000,000
Sliding Window Attention: No
Sliding Window Size: -
Normalization: RMS Normalization
Activation Function: SwiGLU
Dimensions
Hidden Dimension Size: 2,048
Number of Layers: 24
FFN Intermediate Size (Dense): 6,144
Multi-Token Prediction Heads: 1
Tokenizer
Vocabulary Size: 248,320
Qwen3.5-2B is Alibaba Cloud's small-scale multimodal foundation model with 2B parameters, released in February 2026. It uses a hybrid architecture that combines Gated Delta Networks and Gated Attention in a 6×(3×DeltaNet→FFN→1×Attention→FFN) pattern. In thinking mode, it scores 74.0% on MMLU-Pro, 65.8% on GPQA Diamond, and 51.6% on GPQA. It offers unified vision-language capabilities, a 262K native context window, multi-token prediction training, and both thinking and non-thinking modes. It is aimed at prototyping, fine-tuning, and research across 201 languages.
Qwen 3.5 is Alibaba Cloud's latest-generation foundation model family, released in February 2026. The family combines a unified vision-language foundation, an efficient hybrid architecture (Gated Delta Networks with sparse Mixture-of-Experts), reinforcement learning scaled across million-agent environments, and linguistic coverage spanning 201 languages. It is available under the Apache 2.0 license with open weights.
No evaluation benchmarks for Qwen3.5-2B available.
Overall Rank: -
Coding Rank: -
Total Score: 69 / 100
Qwen3.5-2B exhibits strong transparency in its architectural design and licensing, providing detailed structural specifications and a permissive Apache 2.0 license. However, it falls short in disclosing its specific training data composition and the environmental/compute costs associated with its development. While hardware requirements and tokenizer details are exemplary, the lack of a detailed data provenance report remains a significant gap in its transparency profile.
Architectural Provenance
The model's architecture is extensively documented in official Hugging Face model cards and technical blog posts. It utilizes a specific hybrid design consisting of 24 layers with a 6×(3×Gated DeltaNet → FFN → 1×Gated Attention → FFN) pattern. Technical specifications including hidden dimensions (2048), head dimensions for both linear and gated attention, and the use of Rotary Position Embeddings (RoPE) are clearly stated. While the 'Gated DeltaNet' is a specialized linear attention variant, the integration of these components is well-described, though a full peer-reviewed paper for the 3.5 series specifically was not found at the time of audit.
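As a rough illustration of the layer pattern described above, the following Python sketch expands it into a flat list of 24 layer types. It assumes the pattern reads as six repeats of three (Gated DeltaNet → FFN) layers followed by one (Gated Attention → FFN) layer. The names `build_layer_types`, `BLOCK_REPEATS`, and the string labels are illustrative and are not taken from the official configuration files.

```python
# Hypothetical names; only the counts (6 repeats, 3 + 1 layers) come from the text.
BLOCK_REPEATS = 6
DELTANET_PER_BLOCK = 3
ATTENTION_PER_BLOCK = 1


def build_layer_types() -> list[str]:
    """Expand 6 x (3 x DeltaNet->FFN, 1 x Attention->FFN) into one label per layer."""
    block = (
        ["gated_deltanet"] * DELTANET_PER_BLOCK
        + ["gated_attention"] * ATTENTION_PER_BLOCK
    )
    return block * BLOCK_REPEATS


layer_types = build_layer_types()
print(len(layer_types))                      # total number of layers
print(layer_types.count("gated_attention"))  # layers that use full attention
```

Under this reading, every fourth layer uses Gated Attention and the remaining layers use the linear-attention Gated DeltaNet mixer.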
Dataset Composition
Information regarding the training data is highly generalized. Official sources mention a scale of approximately 36 trillion tokens (inherited from the Qwen3 lineage) and the inclusion of 201 languages. However, there is no specific percentage breakdown of data sources (e.g., web vs. books vs. code) or detailed disclosure of the specific datasets used. The documentation mentions the use of 'PDF-like documents' and synthetic data generated by previous Qwen models, but lacks the granularity required for a high transparency score.
Tokenizer Integrity
The tokenizer is publicly accessible via the Hugging Face repository and is integrated into major frameworks such as Transformers and Keras. It uses Byte Pair Encoding (BPE) with a stated vocabulary size of 248,320 tokens. The documentation explicitly covers the handling of special control tokens (e.g., <|im_start|>, <|im_end|>), and the tokenizer supports the claimed 201 languages. The alignment between the tokenizer and the model's multilingual capabilities is verifiable through public code and third-party implementations.
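To make the role of the control tokens concrete, here is a minimal sketch of how a conversation might be wrapped with `<|im_start|>` and `<|im_end|>`. It assumes the ChatML-style layout used by earlier Qwen releases. The helper `format_chat` is hypothetical, and in practice the chat template shipped with the tokenizer should be treated as the source of truth.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def format_chat(messages: list[dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Wrap each message as <|im_start|>role, newline, content, <|im_end|> (assumed layout)."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from this point.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = format_chat([{"role": "user", "content": "Hello"}])
print(prompt)
```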
Parameter Density
The parameter count is explicitly stated as 2.0 billion. Unlike the larger MoE variants in the Qwen 3.5 family, the 2B variant is a dense model, meaning all parameters are active during inference. The architectural breakdown (layers, attention heads, and intermediate dimensions) is fully provided in the configuration files and model cards, leaving no ambiguity regarding the model's density or active parameter count.
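The listed dimensions allow a partial back-of-the-envelope parameter count. The sketch below covers only the token embedding matrix and the SwiGLU feed-forward blocks (three weight matrices per layer), using the values from the specification table. It deliberately leaves out the Gated DeltaNet and Gated Attention mixers, normalization weights, vision components, and any separate output head, because their exact shapes are not given here. Variable names are illustrative.

```python
# Values from the specification table.
VOCAB_SIZE = 248_320
HIDDEN = 2_048
FFN_INTERMEDIATE = 6_144
NUM_LAYERS = 24

# Token embedding matrix: one HIDDEN-sized vector per vocabulary entry.
embedding_params = VOCAB_SIZE * HIDDEN

# SwiGLU FFN: gate, up, and down projections, each HIDDEN x FFN_INTERMEDIATE.
ffn_params_per_layer = 3 * HIDDEN * FFN_INTERMEDIATE
ffn_params = ffn_params_per_layer * NUM_LAYERS

# Partial total only: attention/DeltaNet mixers and norms are not included.
partial_total = embedding_params + ffn_params

print(f"embeddings: {embedding_params / 1e9:.2f}B")
print(f"FFN blocks: {ffn_params / 1e9:.2f}B")
print(f"partial total: {partial_total / 1e9:.2f}B")
```

These two components alone come to roughly 1.4B parameters, which is plausible against the stated 2B once the omitted components are added.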
Training Compute
There is almost no verifiable information regarding the specific compute resources used to train the Qwen3.5-2B variant. While the general 'Next-Generation Training Infrastructure' is mentioned in marketing materials, specific details such as GPU/TPU hours, hardware types used for this specific 2B training run, and the resulting carbon footprint or environmental impact are absent from public documentation.
Benchmark Reproducibility
The model card provides a wide array of benchmark results (MMLU-Pro, GPQA, etc.) for both 'thinking' and 'non-thinking' modes. Some technical details on evaluation settings are provided, such as temperature (0.6 for thinking) and specific prompts for MathVision. However, the full evaluation code and the exact datasets and seeds required for 1:1 reproduction are not hosted in a single reproducible repository, and some results rely on 'internal' versions of benchmarks such as MMLU-Redux.
Identity Consistency
The model demonstrates high identity consistency, correctly identifying itself as a Qwen model and distinguishing between its thinking and non-thinking modes. It does not exhibit the common 'identity crisis' seen in models that claim to be GPT-4 or other competitors. Versioning is clear within the Qwen 3.5 family hierarchy, and its capabilities/limitations regarding multimodal vs. text-only tasks are well-defined in the documentation.
License Clarity
The model is released under the Apache 2.0 license, which is a standard, permissive open-source license. The license file is explicitly included in the Hugging Face repository and GitHub, clearly allowing for commercial use, modification, and distribution. There are no conflicting proprietary 'Acceptable Use Policies' that override the open-source terms for this specific variant.
Hardware Footprint
Hardware requirements are well-documented by both the provider and third-party deployment frameworks. VRAM requirements for various context lengths (up to 262K) and quantization levels (FP16, INT8, INT4) are available. For example, the model is documented as requiring about 4.25 GB of disk space and about 6.7 GiB of VRAM as a practical target on consumer hardware. The impact of the hybrid linear attention on KV cache scaling is also technically explained.
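The effect of the hybrid design on KV cache size can be approximated with simple arithmetic. The sketch below assumes that only the 6 Gated Attention layers keep a per-token key/value cache (2 KV heads, head dimension 256, 2 bytes per value for FP16), that the Gated DeltaNet layers keep a fixed-size state that is not counted, and that "262K" means 262,144 tokens. The function name and defaults are illustrative, and real runtimes add their own overhead.

```python
GIB = 2**30


def kv_cache_bytes(
    context_len: int,
    attn_layers: int = 6,      # assumed: only Gated Attention layers cache K/V
    kv_heads: int = 2,
    head_dim: int = 256,
    bytes_per_value: int = 2,  # FP16
) -> int:
    """Approximate KV cache size in bytes for a single sequence."""
    # Factor 2: one key tensor and one value tensor per attention layer.
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_value * context_len


hybrid = kv_cache_bytes(262_144)
all_attention = kv_cache_bytes(262_144, attn_layers=24)  # hypothetical comparison
print(f"hybrid: {hybrid / GIB:.1f} GiB, all-attention: {all_attention / GIB:.1f} GiB")
```

Under these assumptions the cache at full context is about 3 GiB, compared with about 12 GiB for a hypothetical 24-layer stack in which every layer used attention with the same head configuration.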
Versioning Drift
The model uses a clear naming convention (Qwen3.5-2B), but a detailed, granular changelog for weight updates or minor revisions is not consistently maintained in a centralized location. While major releases are announced via blog posts and GitHub news, tracking subtle 'silent' updates or behavior drift over time remains difficult for end-users without manual checksum verification.
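The manual checksum verification mentioned above could look roughly like the following sketch, which hashes a downloaded weight file with SHA-256 and compares the result against a value recorded earlier. The helper names `sha256_of_file` and `has_drifted` are hypothetical, and the sketch assumes the user stored a reference hash at download time.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_drifted(path: str, expected_hex: str) -> bool:
    """True if the file's current hash differs from the previously recorded one."""
    return sha256_of_file(path) != expected_hex.strip().lower()
```

Recording the digest of each weight file when it is first downloaded gives a baseline for detecting later silent updates to a file of the same name.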