
Qwen3.5-4B

Parameters

4B

Context Length

262K

Modality

Multimodal

Architecture

Dense

License

Apache 2.0

Release Date

24 Feb 2026

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

16

Key-Value Heads

4

Attention Head Dimension

256

Position Embedding

RoPE

RoPE Theta

10,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

2,560

Number of Layers

32

FFN Intermediate Size (Dense)

9,216

Multi-Token Prediction Heads

1

Tokenizer

Vocabulary Size

248,320

Architecture Diagram

Diagram: Input Tokens → Token Embedding (Position: RoPE · Hidden: 2.6k · Context: 262K · Vocab: 248.3k) → 32 layers × [RMSNorm (Pre-Attention) → Grouped-Query Attention (16 Q / 4 KV heads, head dim: 256) → + → RMSNorm (Pre-FFN) → Feed-Forward Network (SwiGLU, intermediate: 9.2k) → +] → Final RMSNorm → Output Logits
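The RoPE settings listed above (theta of 10,000,000 and a head dimension of 256) can be turned into per-pair rotation frequencies. The following is a minimal sketch of the textbook RoPE formulation, not the model's actual implementation: it assumes the rotation covers the full head dimension and pairs adjacent elements, whereas real implementations may rotate only part of each head or pair elements differently. The function names are hypothetical.

```python
import math

HEAD_DIM = 256            # "Attention Head Dimension" from the spec list
ROPE_THETA = 10_000_000.0  # "RoPE Theta" from the spec list


def rope_inv_freq(head_dim=HEAD_DIM, theta=ROPE_THETA):
    """Inverse frequencies theta^(-2i/d), one per pair of dimensions.

    Assumption: the whole head dimension is rotated.
    """
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def rope_rotate(vec, position, inv_freq):
    """Rotate adjacent pairs (x0, x1), (x2, x3), ... by position * frequency.

    Assumption: adjacent-pair layout, chosen here only for readability.
    """
    out = []
    for i, freq in enumerate(inv_freq):
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend((x * c - y * s, x * s + y * c))
    return out


inv_freq = rope_inv_freq()
query = [1.0] * HEAD_DIM
rotated = rope_rotate(query, position=1000, inv_freq=inv_freq)
print(len(inv_freq), min(inv_freq), max(inv_freq))
```

A larger theta lowers the slowest frequencies, so distant positions remain distinguishable over long contexts. The rotation itself leaves the length of each vector unchanged.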

Qwen3.5-4B

Qwen3.5-4B is Alibaba Cloud's compact multimodal foundation model with 4B parameters, released in February 2026. It uses a hybrid architecture that combines Gated Delta Networks and Gated Attention in an 8×(3×DeltaNet→FFN→1×Attention→FFN) pattern. Reported scores include MMLU-Pro (79.1%), GPQA Diamond (76.2%), and HMMT (74%/77%), alongside vision-language benchmark results. The model offers unified vision-language capabilities, a 262K native context (extensible to 1M), and multi-token prediction training, and it targets reasoning, coding, multimodal understanding, and multilingual tasks across 201 languages.
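The 8×(3×DeltaNet→FFN→1×Attention→FFN) notation can be expanded into an explicit per-layer list. This is a small illustrative sketch: the layer labels and the function name are hypothetical, and it assumes each token mixer (DeltaNet or Attention) together with its FFN counts as one layer, which matches the 32 layers listed in the specifications.

```python
def build_layer_pattern(num_blocks=8, deltanet_per_block=3, attention_per_block=1):
    """Expand the block pattern into one token-mixer label per layer.

    Assumption: every mixer is followed by its own FFN, so mixer + FFN
    is treated as a single layer.
    """
    layers = []
    for _ in range(num_blocks):
        layers += ["gated_deltanet"] * deltanet_per_block
        layers += ["gated_attention"] * attention_per_block
    return layers


pattern = build_layer_pattern()
attention_layers = [i for i, kind in enumerate(pattern) if kind == "gated_attention"]

print(len(pattern))        # total layers
print(attention_layers)    # zero-based indices of the full-attention layers
```

Under this reading, only every fourth layer uses full attention; the other three in each block use the linear-time DeltaNet mixer.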

About Qwen 3.5

Qwen 3.5 is Alibaba Cloud's latest-generation foundation model family, released in February 2026. It combines a unified vision-language foundation, an efficient hybrid architecture (Gated Delta Networks with sparse Mixture-of-Experts), reinforcement learning scaled across million-agent environments, and coverage of 201 languages. The models are available with open weights under the Apache 2.0 license.


Evaluation Benchmarks

No evaluation benchmarks are available for Qwen3.5-4B.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B

65 / 100

Qwen3.5-4B Model Integrity Report

Audit Note

Qwen3.5-4B exhibits strong transparency in its architectural specifications and licensing, providing clear technical details on its hybrid attention mechanism and permissive open-source terms. However, it suffers from significant opacity regarding its training data composition and compute resources, which remain largely proprietary. While benchmark performance is high, the lack of reproducible evaluation artifacts and known data contamination issues necessitate a skeptical approach to its reported scores.

Upstream

20.0 / 30

Architectural Provenance

8.0 / 10

The model architecture is extensively documented on its official Hugging Face page and GitHub repository. It specifies a hybrid layout of 8 blocks, each containing 3 Gated DeltaNet layers followed by 1 Gated Attention layer, with detailed dimensions for hidden layers (2560), heads, and intermediate FFN (9216). While the training methodology (multi-token prediction and early fusion) is described, a formal peer-reviewed paper for the 3.5 series is not yet linked, though it references the Qwen3 technical report (arXiv:2505.09388) for foundational methods.

Dataset Composition

3.0 / 10

Transparency regarding the training data is low. While the provider mentions a 'trillions of tokens' multimodal corpus including web, code, and books, and specifies support for 201 languages, there is no public breakdown of dataset proportions, specific sources, or detailed filtering/cleaning methodologies. The documentation vaguely refers to 'high-quality data' and 'curated' sets without providing verifiable composition metrics.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly available via the Hugging Face 'transformers' library and is well-documented. It uses a Byte Pair Encoding (BPE) approach with a large, padded vocabulary size of 248,320 tokens. The documentation explicitly lists control tokens for chat, vision, and tool use, and the vocabulary's efficiency across 201 languages is verifiable through the provided configuration files.

Model

21.0 / 40

Parameter Density

7.0 / 10

The model clearly states its total parameter count as 4.0 billion. As a dense variant within the Qwen 3.5 family, it avoids the ambiguity of active vs. total parameters found in its MoE counterparts. However, it lacks a detailed breakdown of parameter allocation between the vision encoder and the language backbone in the primary model card, though some layer-wise dimensions are provided.
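As a rough illustration of how the published dimensions relate to the 4B total, the sketch below tallies only the two components that can be computed directly from the listed figures: the SwiGLU feed-forward blocks and the token embedding matrix. It assumes a standard SwiGLU FFN with three bias-free projections (gate, up, down) in each of the 32 layers, and it counts the embedding matrix once; whether the output head shares those weights is not stated in the source. Everything else (token mixers, norms, vision encoder, multi-token prediction head) is left out.

```python
HIDDEN = 2_560        # Hidden Dimension Size
INTERMEDIATE = 9_216  # FFN Intermediate Size (Dense)
NUM_LAYERS = 32       # Number of Layers
VOCAB = 248_320       # Vocabulary Size

# Assumption: SwiGLU uses gate, up and down projections without biases.
ffn_params_per_layer = 3 * HIDDEN * INTERMEDIATE
ffn_params_total = ffn_params_per_layer * NUM_LAYERS

# Assumption: the embedding matrix is counted once (tying is unknown).
embedding_params = VOCAB * HIDDEN

accounted = ffn_params_total + embedding_params

print(f"FFN per layer : {ffn_params_per_layer:,}")
print(f"FFN total     : {ffn_params_total:,}")
print(f"Embeddings    : {embedding_params:,}")
print(f"Accounted for : {accounted:,}")
```

Under these assumptions, roughly 2.9 billion parameters are attributable to the FFNs and embeddings, which leaves the remainder for the components the model card does not break down.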

Training Compute

1.0 / 10

There is virtually no public information regarding the compute resources used to train the 4B variant. No GPU/TPU hours, hardware cluster specifications, or carbon footprint data are disclosed. The documentation only mentions a 'Next-Generation Training Infrastructure' in marketing terms without providing verifiable technical metrics.

Benchmark Reproducibility

4.0 / 10

While the model provides a comprehensive list of scores across standard benchmarks (MMLU-Pro: 79.1%, GPQA Diamond: 76.2%), it lacks public evaluation code or the exact prompts/few-shot examples used to achieve these results. The reliance on 'Thinking mode' for certain benchmarks is mentioned but not fully documented for independent reproduction. Automatic penalties were applied due to documented concerns regarding benchmark contamination in the Qwen series (e.g., RandomCalculation and MATH-500 studies).

Identity Consistency

9.0 / 10

The model demonstrates high identity consistency, correctly identifying its version (Qwen 3.5) and its multimodal capabilities in official documentation and API responses. It clearly distinguishes itself from previous generations (Qwen 3) and other family variants (MoE vs. Dense).

Downstream

24.0 / 30

License Clarity

10.0 / 10

The model is released under the Apache 2.0 license, which is a standard, highly permissive open-source license. The terms are clearly stated on Hugging Face and GitHub, explicitly allowing for commercial use, modification, and distribution without conflicting proprietary restrictions.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented for various deployment scenarios. Official and third-party documentation provide VRAM estimates for FP16 (~10.6GB) and quantized versions (e.g., 4-bit requiring ~2-4GB). It also provides guidance on context length memory scaling, noting native support for 262K tokens and the impact of RoPE scaling.
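To make the context-length memory scaling concrete, the sketch below estimates key-value cache size from the listed attention figures (4 key-value heads, head dimension 256). It rests on several assumptions that the source does not confirm: only the 8 Gated Attention layers in the hybrid pattern keep a per-token cache, cache entries are stored in FP16 (2 bytes), the fixed-size state of the DeltaNet layers is ignored, and "262K" means 262,144 tokens. The figure is for the cache alone and comes on top of the model weights.

```python
KV_HEADS = 4          # Key-Value Heads
HEAD_DIM = 256        # Attention Head Dimension
ATTENTION_LAYERS = 8  # assumption: one full-attention layer per block, 8 blocks
BYTES_PER_VALUE = 2   # assumption: FP16 cache


def kv_cache_bytes(num_tokens,
                   attention_layers=ATTENTION_LAYERS,
                   kv_heads=KV_HEADS,
                   head_dim=HEAD_DIM,
                   bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    per_token = 2 * kv_heads * head_dim * attention_layers * bytes_per_value
    return per_token * num_tokens


for tokens in (1_024, 131_072, 262_144):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>7,} tokens: {gib:.3f} GiB")
```

The cache grows linearly with context length, so a lower-precision cache or a shorter context reduces this term proportionally.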

Versioning Drift

6.0 / 10

The model follows a clear semantic versioning path (Qwen3.5-4B) and maintains a basic changelog on GitHub. However, the documentation of 'silent' updates or behavioral drift is limited, and while previous versions are accessible on Hugging Face, the detailed delta between minor iterations is not always transparently documented.
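The category and total scores in this report are plain sums of the item scores. The short sketch below reproduces that arithmetic from the values listed above; the dictionary layout and variable names are illustrative only, and the mapping from a numeric total to a letter grade is not modeled because the report does not state it.

```python
SCORES = {
    "Upstream": {
        "Architectural Provenance": 8.0,
        "Dataset Composition": 3.0,
        "Tokenizer Integrity": 9.0,
    },
    "Model": {
        "Parameter Density": 7.0,
        "Training Compute": 1.0,
        "Benchmark Reproducibility": 4.0,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 10.0,
        "Hardware Footprint": 8.0,
        "Versioning Drift": 6.0,
    },
}

# Each category total is the sum of its item scores.
category_totals = {name: sum(items.values()) for name, items in SCORES.items()}
total_score = sum(category_totals.values())

for name, value in category_totals.items():
    print(f"{name}: {value}")
print(f"Total: {total_score} / 100")
```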

Qwen3.5-4B: Specifications and GPU VRAM Requirements