
Qwen3.5-2B

Parameters

2B

Context Length

262K

Modality

Multimodal

Architecture

Dense

License

Apache 2.0

Release Date

24 Feb 2026

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

8

Key-Value Heads

2

Attention Head Dimension

256

Position Embedding

RoPE

RoPE Theta

10,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

2,048

Number of Layers

24

FFN Intermediate Size (Dense)

6,144

Multi-Token Prediction Heads

1

Tokenizer

Vocabulary Size

248,320

Architecture Diagram

Diagram summary: Input Tokens → Token Embedding (position: RoPE; hidden: 2k; context: 262K; vocab: 248.3k) → 24 layers of [RMSNorm (pre-attention) → Grouped-Query Attention (8 Q / 2 KV heads, head dim 256) → residual add → RMSNorm (pre-FFN) → Feed-Forward Network (SwiGLU, intermediate 6.1k) → residual add] → Final RMSNorm → Output Logits
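To make the grouped-query layout concrete, the short sketch below maps the 8 query heads onto the 2 key-value heads listed in the specifications. It assumes contiguous grouping (query heads 0-3 share KV head 0, and so on), which is a common convention but is not confirmed by this page; the function name is illustrative.

```python
# Values taken from the specification list above.
NUM_QUERY_HEADS = 8
NUM_KV_HEADS = 2
HEAD_DIM = 256


def kv_head_for_query_head(q_head: int) -> int:
    """Return the KV head shared by a query head (assumes contiguous groups)."""
    group_size = NUM_QUERY_HEADS // NUM_KV_HEADS  # 4 query heads per KV head
    return q_head // group_size


mapping = [kv_head_for_query_head(q) for q in range(NUM_QUERY_HEADS)]
q_width = NUM_QUERY_HEADS * HEAD_DIM   # width of the query projection output
kv_width = NUM_KV_HEADS * HEAD_DIM     # width of each of the K and V projections

print(mapping)            # [0, 0, 0, 0, 1, 1, 1, 1]
print(q_width, kv_width)  # 2048 512
```

With these numbers, the key and value projections are each a quarter of the query width, which is where the KV-cache saving of grouped-query attention comes from.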

Qwen3.5-2B

Qwen3.5-2B is Alibaba Cloud's small-scale multimodal foundation model with 2B parameters, released in February 2026. It uses a hybrid architecture that combines Gated Delta Networks and Gated Attention in a 6×(3×DeltaNet→FFN→1×Attention→FFN) pattern. In thinking mode, it scores 74.0% on MMLU-Pro, 65.8% on GPQA Diamond, and 51.6% on GPQA. It features unified vision-language capabilities, a 262K native context, and multi-token prediction training, and it supports both thinking and non-thinking modes across 201 languages, which makes it suitable for prototyping, fine-tuning, and research.
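The layer pattern can be expanded into an explicit per-layer list. The sketch below assumes that every token mixer (DeltaNet or attention) is followed by its own FFN, so that each of the 6 blocks contributes 4 layers and the total matches the 24 layers listed in the specifications. The function and label names are illustrative, not taken from any official configuration.

```python
def build_layer_pattern(num_blocks: int = 6,
                        deltanet_per_block: int = 3,
                        attention_per_block: int = 1) -> list[str]:
    """Expand 6 x (3 x DeltaNet -> FFN, 1 x Attention -> FFN) into a flat list.

    Each entry names the token mixer of one layer; the FFN that follows
    every mixer is implied (an assumption of this sketch).
    """
    layers: list[str] = []
    for _ in range(num_blocks):
        layers += ["gated_deltanet"] * deltanet_per_block
        layers += ["gated_attention"] * attention_per_block
    return layers


layers = build_layer_pattern()
print(len(layers))                      # 24
print(layers.count("gated_attention"))  # 6
print(layers[:4])
```

Under this reading, only every fourth layer uses full attention, so only 6 of the 24 layers keep a key-value cache that grows with context length.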

About Qwen 3.5

Qwen 3.5 is Alibaba Cloud's latest-generation foundation model family, released in February 2026. It integrates advances in multimodal learning (a unified vision-language foundation), efficient hybrid architecture (Gated Delta Networks with sparse Mixture-of-Experts), scalable reinforcement learning across million-agent environments, and linguistic coverage spanning 201 languages. The models are available under the Apache 2.0 license with open weights.



Evaluation Benchmarks

No evaluation benchmarks are available for Qwen3.5-2B.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B

69 / 100

Qwen3.5-2B Model Integrity Report


Audit Note

Qwen3.5-2B exhibits strong transparency in its architectural design and licensing, providing detailed structural specifications and a permissive Apache 2.0 license. However, it falls short in disclosing its specific training data composition and the environmental/compute costs associated with its development. While hardware requirements and tokenizer details are exemplary, the lack of a detailed data provenance report remains a significant gap in its transparency profile.

Upstream

20.0 / 30

Architectural Provenance

8.0 / 10

The model's architecture is extensively documented in official Hugging Face model cards and technical blog posts. It utilizes a specific hybrid design consisting of 24 layers with a 6×(3×Gated DeltaNet → FFN → 1×Gated Attention → FFN) pattern. Technical specifications including hidden dimensions (2048), head dimensions for both linear and gated attention, and the use of Rotary Position Embeddings (RoPE) are clearly stated. While the 'Gated DeltaNet' is a specialized linear attention variant, the integration of these components is well-described, though a full peer-reviewed paper for the 3.5 series specifically was not found at the time of audit.

Dataset Composition

3.0 / 10

Information regarding the training data is highly generalized. Official sources mention a scale of approximately 36 trillion tokens (inherited from the Qwen3 lineage) and the inclusion of 201 languages. However, there is no specific percentage breakdown of data sources (e.g., web vs. books vs. code) or detailed disclosure of the specific datasets used. The documentation mentions the use of 'PDF-like documents' and synthetic data generated by previous Qwen models, but lacks the granularity required for a high transparency score.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly accessible via the Hugging Face repository and integrated into major frameworks like Transformers and Keras. It uses a Byte Pair Encoding (BPE) approach with a clearly stated vocabulary size of 248,320 tokens, matching the specification listed above. Documentation explicitly details the handling of special control tokens (e.g., <|im_start|>, <|im_end|>) and supports the claimed 201 languages. The alignment between the tokenizer and the model's multilingual capabilities is verifiable through public code and third-party implementations.
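As an illustration of how the <|im_start|> and <|im_end|> control tokens delimit conversation turns, here is a minimal hand-rolled formatter. It assumes the plain ChatML layout used by earlier Qwen releases; the chat template that ships with the tokenizer is authoritative and may add further markup (for example, for thinking mode), so treat `format_chatml` as a hypothetical helper rather than the official template.

```python
def format_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render messages as ChatML-style text (simplified, assumed layout)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in the two control tokens named in the audit note.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

In practice, calling the tokenizer's own chat-template method is preferable to building the string by hand, because it tracks any template changes between releases.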

Model

26.0 / 40

Parameter Density

9.0 / 10

The parameter count is explicitly stated as 2.0 billion. Unlike the larger MoE variants in the Qwen 3.5 family, the 2B variant is a dense model, meaning all parameters are active during inference. The architectural breakdown (layers, attention heads, and intermediate dimensions) is fully provided in the configuration files and model cards, leaving no ambiguity regarding the model's density or active parameter count.

Training Compute

2.0 / 10

There is almost no verifiable information regarding the specific compute resources used to train the Qwen3.5-2B variant. While the general 'Next-Generation Training Infrastructure' is mentioned in marketing materials, specific details such as GPU/TPU hours, hardware types used for this specific 2B training run, and the resulting carbon footprint or environmental impact are absent from public documentation.

Benchmark Reproducibility

6.0 / 10

The model provides a wide array of benchmark results (MMLU-Pro, GPQA, etc.) for both 'thinking' and 'non-thinking' modes. Some technical details on evaluation settings are provided, such as temperature (0.6 for thinking) and specific prompts for MathVision. However, the full evaluation code and the exact datasets/seeds required for 1:1 reproduction are not centrally hosted in a single reproducible repository, and some results rely on 'internal' versions of benchmarks like MMLU-Redux.
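The role of temperature and seeds in reproduction can be shown with a toy sampler. This is a generic sketch of temperature-scaled sampling, not the evaluation harness used for the published results; the logits are made up, and only the temperature value 0.6 comes from the text above.

```python
import math
import random


def sample_with_temperature(logits: list[float], temperature: float,
                            rng: random.Random) -> int:
    """Sample one index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # illustrative values only

# Two runs with the same seed produce identical samples.
rng_a, rng_b = random.Random(0), random.Random(0)
run_a = [sample_with_temperature(toy_logits, 0.6, rng_a) for _ in range(20)]
run_b = [sample_with_temperature(toy_logits, 0.6, rng_b) for _ in range(20)]
print(run_a == run_b)  # True
```

Because sampling at a non-zero temperature is stochastic, a published score cannot be reproduced exactly without the seed (or multiple runs with reported variance), which is the gap the audit note describes.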

Identity Consistency

9.0 / 10

The model demonstrates high identity consistency, correctly identifying itself as a Qwen model and distinguishing between its thinking and non-thinking modes. It does not exhibit the common 'identity crisis' seen in models that claim to be GPT-4 or other competitors. Versioning is clear within the Qwen 3.5 family hierarchy, and its capabilities/limitations regarding multimodal vs. text-only tasks are well-defined in the documentation.

Downstream

23.0 / 30

License Clarity

10.0 / 10

The model is released under the Apache 2.0 license, which is a standard, permissive open-source license. The license file is explicitly included in the Hugging Face repository and GitHub, clearly allowing for commercial use, modification, and distribution. There are no conflicting proprietary 'Acceptable Use Policies' that override the open-source terms for this specific variant.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented by both the provider and third-party deployment frameworks. VRAM requirements for various context lengths (up to 262K) and quantization levels (FP16, INT8, INT4) are available. For example, the model is documented to need about 4.25 GB of disk space and about 6.7 GiB of VRAM as a practical target on consumer hardware. The impact of the hybrid linear attention on KV cache scaling is also technically explained.
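A back-of-the-envelope estimate shows why the hybrid design keeps the KV cache small. The sketch below rests on several assumptions: a round 2.0 billion parameters, 262K meaning 262,144 tokens, a 16-bit cache, and only the 6 full-attention layers (out of 24) storing keys and values. It ignores the DeltaNet recurrent state, activations, and any vision encoder, so it is a lower bound rather than a replacement for the documented figures.

```python
GIB = 1024 ** 3

# Bytes per weight for each quantization level (ignores quantization overhead).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_bytes(num_params: float, quant: str) -> float:
    """Approximate memory for the weights alone."""
    return num_params * BYTES_PER_PARAM[quant]


def kv_cache_bytes(context_tokens: int, attention_layers: int = 6,
                   kv_heads: int = 2, head_dim: int = 256,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: keys and values for every full-attention layer."""
    per_token = 2 * attention_layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


for quant in BYTES_PER_PARAM:
    print(quant, round(weight_bytes(2.0e9, quant) / GIB, 2), "GiB of weights")

print(kv_cache_bytes(262_144) / GIB, "GiB of KV cache at full context")  # 3.0
```

If all 24 layers were full attention with the same head configuration, the same calculation would give 12 GiB at full context, four times the hybrid figure.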

Versioning Drift

5.0 / 10

The model uses a clear naming convention (Qwen3.5-2B), but a detailed, granular changelog for weight updates or minor revisions is not consistently maintained in a centralized location. While major releases are announced via blog posts and GitHub news, tracking subtle 'silent' updates or behavior drift over time remains difficult for end-users without manual checksum verification.
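The manual checksum verification mentioned above can be done with the standard library alone. The sketch below hashes a local weight file and compares it with a digest recorded at download time; the file path and the recorded digest are whatever the user supplies, and `has_drifted` is a hypothetical helper name.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so multi-gigabyte weight files fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_drifted(path: str, recorded_digest: str) -> bool:
    """Return True if the file no longer matches the digest recorded earlier."""
    return sha256_of_file(path) != recorded_digest.lower()
```

Recording the digest of each weight shard alongside the download date gives a simple local audit trail when no upstream changelog is available.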
