Parameters: 4B
Context Length: 262K
Modality: Multimodal
Architecture: Dense
License: Apache 2.0
Release Date: 24 Feb 2026
Knowledge Cutoff: -
Attention
Attention Structure: Grouped-Query Attention
Attention Heads: 16
Key-Value Heads: 4
Attention Head Dimension: 256
Position Embedding: RoPE
RoPE Theta: 10,000,000
Sliding Window Attention: No
Sliding Window Size: -
Normalization: RMS Normalization
Activation Function: SwiGLU
Dimensions
Hidden Dimension Size: 2,560
Number of Layers: 32
FFN Intermediate Size (Dense): 9,216
Multi-Token Prediction Heads: 1
Tokenizer
Vocabulary Size: 248,320
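The grouped-query attention figures in the spec list imply some simple projection widths. The sketch below only does that bookkeeping from the listed values (16 heads, 4 key-value heads, head dimension 256). The class name `AttentionSpec` is hypothetical, and any extra projections used by the gating in the actual Gated Attention layers are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionSpec:
    """Attention values copied from the spec list (hypothetical container)."""
    hidden_size: int = 2560
    num_heads: int = 16
    num_kv_heads: int = 4
    head_dim: int = 256

    @property
    def q_width(self) -> int:
        # Total width of all query heads concatenated.
        return self.num_heads * self.head_dim

    @property
    def kv_width(self) -> int:
        # Total width of the key (or value) heads concatenated.
        return self.num_kv_heads * self.head_dim

    @property
    def queries_per_kv_head(self) -> int:
        # In grouped-query attention, several query heads share one KV head.
        return self.num_heads // self.num_kv_heads


spec = AttentionSpec()
print(spec.q_width, spec.kv_width, spec.queries_per_kv_head)
```

Under these numbers the concatenated query width is larger than the hidden size, so the attention output projection would map back down to 2,560; that reading is an inference from the listed dimensions, not something the page states.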
Qwen3.5-4B is Alibaba Cloud's compact multimodal foundation model with 4B parameters, released February 2026. It uses a hybrid architecture combining Gated Delta Networks and Gated Attention in an 8×(3×DeltaNet→FFN→1×Attention→FFN) pattern. Its reported scores are 79.1% on MMLU-Pro, 76.2% on GPQA Diamond, and 74%/77% on the HMMT benchmarks, alongside strong vision-language results. It features unified vision-language capabilities, a 262K native context (extensible to 1M), and multi-token prediction training, and it targets reasoning, coding, multimodal understanding, and multilingual tasks covering 201 languages.
Qwen 3.5 is Alibaba Cloud's latest-generation foundation model family, released February 2026. It combines a unified vision-language foundation, an efficient hybrid architecture (Gated Delta Networks with sparse Mixture-of-Experts), reinforcement learning scaled across million-agent environments, and linguistic coverage spanning 201 languages. It is available under the Apache 2.0 license with open weights.
No evaluation benchmarks for Qwen3.5-4B available.
Overall Rank: -
Coding Rank: -
Total Score: 65 / 100
Qwen3.5-4B exhibits strong transparency in its architectural specifications and licensing, providing clear technical details on its hybrid attention mechanism and permissive open-source terms. However, it suffers from significant opacity regarding its training data composition and compute resources, which remain largely proprietary. While benchmark performance is high, the lack of reproducible evaluation artifacts and known data contamination issues necessitate a skeptical approach to its reported scores.
Architectural Provenance
The model architecture is extensively documented on its official Hugging Face page and GitHub repository. The documentation specifies a hybrid layout of 8 blocks, each containing 3 Gated DeltaNet layers followed by 1 Gated Attention layer, with detailed dimensions for the hidden size (2,560), attention heads, and FFN intermediate size (9,216). The training methodology (multi-token prediction and early fusion) is described, but no formal peer-reviewed paper for the 3.5 series is linked yet; the documentation instead references the Qwen3 technical report (arXiv:2505.09388) for foundational methods.
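The hybrid layout described here (8 blocks, each with 3 Gated DeltaNet layers followed by 1 Gated Attention layer) can be enumerated in a few lines. This is a minimal sketch of the layer ordering only; the function name and the string labels are hypothetical and are not taken from the model's configuration files.

```python
def build_layer_pattern(num_blocks=8, deltanet_per_block=3, attention_per_block=1):
    """Return the layer types in order, following the described block layout."""
    pattern = []
    for _ in range(num_blocks):
        # Each block starts with the linear-attention (Gated DeltaNet) layers...
        pattern.extend(["gated_deltanet"] * deltanet_per_block)
        # ...and ends with the full Gated Attention layer(s).
        pattern.extend(["gated_attention"] * attention_per_block)
    return pattern


layers = build_layer_pattern()
attention_indices = [i for i, kind in enumerate(layers) if kind == "gated_attention"]

print(len(layers))         # 32, matching the listed number of layers
print(attention_indices)   # [3, 7, 11, 15, 19, 23, 27, 31]
```

With these defaults, every fourth layer is a Gated Attention layer, so only 8 of the 32 layers use full attention.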
Dataset Composition
Transparency regarding the training data is low. While the provider mentions a 'trillions of tokens' multimodal corpus including web, code, and books, and specifies support for 201 languages, there is no public breakdown of dataset proportions, specific sources, or detailed filtering/cleaning methodologies. The documentation vaguely refers to 'high-quality data' and 'curated' sets without providing verifiable composition metrics.
Tokenizer Integrity
The tokenizer is publicly available via the Hugging Face 'transformers' library and is well-documented. It uses a Byte Pair Encoding (BPE) approach with a large, padded vocabulary size of 248,320 tokens. The documentation explicitly lists control tokens for chat, vision, and tool use, and the vocabulary's efficiency across 201 languages is verifiable through the provided configuration files.
Parameter Density
The model clearly states its total parameter count as 4.0 billion. As a dense variant within the Qwen 3.5 family, it avoids the ambiguity of active vs. total parameters found in its MoE counterparts. However, it lacks a detailed breakdown of parameter allocation between the vision encoder and the language backbone in the primary model card, though some layer-wise dimensions are provided.
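Some of the missing allocation can be approximated from the published dimensions. The sketch below counts only the token-embedding table and the feed-forward blocks, assuming a standard SwiGLU FFN with three bias-free weight matrices (gate, up, down) in each of the 32 layers. Those assumptions, and the variable names, are illustrative; the vision encoder, attention and DeltaNet weights, and any untied output head are not counted.

```python
VOCAB_SIZE = 248_320
HIDDEN_SIZE = 2_560
FFN_INTERMEDIATE = 9_216
NUM_LAYERS = 32

# One embedding vector of size HIDDEN_SIZE per vocabulary entry.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE

# Assumed SwiGLU FFN: gate and up projections (hidden -> intermediate)
# plus a down projection (intermediate -> hidden), no biases.
ffn_params_per_layer = 3 * HIDDEN_SIZE * FFN_INTERMEDIATE
ffn_params_total = NUM_LAYERS * ffn_params_per_layer

print(f"embedding:   {embedding_params:,}")
print(f"FFN / layer: {ffn_params_per_layer:,}")
print(f"FFN total:   {ffn_params_total:,}")
```

Under these assumptions the embedding table and FFN blocks together account for roughly 2.9 billion parameters, leaving the remainder for the mixing layers, normalization, and vision components.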
Training Compute
There is virtually no public information regarding the compute resources used to train the 4B variant. No GPU/TPU hours, hardware cluster specifications, or carbon footprint data are disclosed. The documentation only mentions a 'Next-Generation Training Infrastructure' in marketing terms without providing verifiable technical metrics.
Benchmark Reproducibility
While the model provides a comprehensive list of scores across standard benchmarks (MMLU-Pro: 79.1%, GPQA Diamond: 76.2%), it lacks public evaluation code or the exact prompts/few-shot examples used to achieve these results. The reliance on 'Thinking mode' for certain benchmarks is mentioned but not fully documented for independent reproduction. Automatic penalties were applied due to documented concerns regarding benchmark contamination in the Qwen series (e.g., RandomCalculation and MATH-500 studies).
Identity Consistency
The model demonstrates high identity consistency, correctly identifying its version (Qwen 3.5) and its multimodal capabilities in official documentation and API responses. It clearly distinguishes itself from previous generations (Qwen 3) and other family variants (MoE vs. Dense).
License Clarity
The model is released under the Apache 2.0 license, which is a standard, highly permissive open-source license. The terms are clearly stated on Hugging Face and GitHub, explicitly allowing for commercial use, modification, and distribution without conflicting proprietary restrictions.
Hardware Footprint
Hardware requirements are well-documented for various deployment scenarios. Official and third-party documentation provide VRAM estimates for FP16 (~10.6GB) and quantized versions (e.g., 4-bit requiring ~2-4GB). The documentation also gives guidance on how memory scales with context length, noting native support for 262K tokens and the impact of RoPE scaling.
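A rough back-of-the-envelope estimate shows where figures like these come from. The sketch below covers weight storage and the key-value cache only, under several assumptions: a parameter count of exactly 4 billion, a KV cache held only by the 8 Gated Attention layers, 16-bit cache values, and 262K read as 262,144 tokens. Recurrent state for the DeltaNet layers, activations, and runtime overhead are ignored, which is presumably part of why the quoted ~10.6GB FP16 figure is higher than the weights alone. The function names are hypothetical.

```python
def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Bytes needed to store the weights at a given precision."""
    return num_params * bits_per_param // 8


def kv_cache_bytes(tokens: int, attention_layers: int = 8, kv_heads: int = 4,
                   head_dim: int = 256, bytes_per_value: int = 2) -> int:
    """Bytes for keys and values across all full-attention layers."""
    # Factor 2 accounts for storing both a key and a value per head per token.
    return 2 * attention_layers * kv_heads * head_dim * bytes_per_value * tokens


NUM_PARAMS = 4_000_000_000  # assumed exact value for "4B"

fp16_weights = weight_bytes(NUM_PARAMS, 16)
int4_weights = weight_bytes(NUM_PARAMS, 4)
cache_per_token = kv_cache_bytes(1)
cache_full_context = kv_cache_bytes(262_144)

print(f"FP16 weights:  {fp16_weights / 1e9:.1f} GB")
print(f"4-bit weights: {int4_weights / 1e9:.1f} GB")
print(f"KV cache per token: {cache_per_token / 1024:.0f} KiB")
print(f"KV cache at 262,144 tokens: {cache_full_context / 2**30:.0f} GiB")
```

Under these assumptions the cache grows linearly with context length, so at the full native context it is comparable in size to the FP16 weights themselves.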
Versioning Drift
The model follows a clear semantic versioning path (Qwen3.5-4B) and maintains a basic changelog on GitHub. However, the documentation of 'silent' updates or behavioral drift is limited, and while previous versions are accessible on Hugging Face, the detailed delta between minor iterations is not always transparently documented.