Qwen3.5-9B

Parameters

9B

Context Length

262K

Modality

Multimodal

Architecture

Dense

License

Apache 2.0

Release Date

24 Feb 2026

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

16

Key-Value Heads

4

Attention Head Dimension

256

Position Embedding

RoPE

RoPE Theta

10,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

4,096

Number of Layers

32

FFN Intermediate Size (Dense)

12,288

Multi-Token Prediction Heads

1

Tokenizer

Vocabulary Size

248,320

Architecture Diagram

Diagram: Input Tokens → Token Embedding (Position: RoPE; Hidden: 4.1k · Context: 262K · Vocab: 248.3k) → ×32 layers [RMSNorm (Pre-Attention) → Grouped-Query Attention (16 Q / 4 KV heads, head dim: 256) → residual add → RMSNorm (Pre-FFN) → Feed-Forward Network (SwiGLU, intermediate: 12.3k) → residual add] → Final RMSNorm → Output Logits

Qwen3.5-9B

Qwen3.5-9B is Alibaba Cloud's efficient multimodal foundation model with 9B parameters, released in February 2026. It uses a hybrid architecture that combines Gated Delta Networks and Gated Attention in an 8×(3×DeltaNet→FFN→1×Attention→FFN) pattern. Reported scores include MMLU-Pro (82.5%), GPQA Diamond (81.7%), HMMT benchmarks (90%/90%), and LiveCodeBench v6 (82.7%). The model offers unified vision-language capabilities, a 262K native context (extensible to 1M), and multi-token prediction training, and it targets multimodal reasoning, coding, agents, and multilingual tasks across 201 languages.

About Qwen 3.5

Qwen 3.5 is Alibaba Cloud's latest-generation foundation model family, released in February 2026. The family combines a unified vision-language foundation, an efficient hybrid architecture (Gated Delta Networks with sparse Mixture-of-Experts), reinforcement learning scaled across million-agent environments, and linguistic coverage spanning 201 languages. The weights are open and available under the Apache 2.0 license.


Evaluation Benchmarks

No evaluation benchmarks for Qwen3.5-9B available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B+

71 / 100

Qwen3.5-9B Model Integrity Report

Audit Note

Qwen3.5-9B exhibits strong transparency in its architectural specifications and licensing, providing clear technical details on its hybrid Gated DeltaNet structure and permissive Apache 2.0 terms. However, it remains opaque regarding its specific training data proportions and total compute resources consumed. While hardware requirements are well-documented for deployment, the lack of detailed data provenance and training logs limits a full independent audit of its upstream development.

Upstream

21.0 / 30

Architectural Provenance

8.0 / 10

The model architecture is extensively documented in the official Hugging Face model card and release blog. It utilizes a sophisticated hybrid structure consisting of 32 layers in an 8×(3×Gated DeltaNet → FFN → 1×Gated Attention → FFN) pattern. Technical specifications for the Gated DeltaNet (32 V heads, 16 QK heads, 128 head dim) and Gated Attention (16 Q heads, 4 KV heads, 256 head dim) are explicitly provided. The model is a native multimodal foundation model trained with multi-token prediction (MTP) and strong-to-weak distillation, though the specific 'strong' teacher models are not fully detailed.
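The layer pattern described above can be sketched as a simple list, which makes the layer counts easy to check. This is only an illustration: the names `build_layer_pattern`, `gated_deltanet`, and `gated_attention` are hypothetical labels, not identifiers from the official model code, and each entry stands for one token mixer followed by its FFN.

```python
from collections import Counter


def build_layer_pattern(num_blocks: int = 8, deltanet_per_block: int = 3) -> list[str]:
    """Return the token-mixer type of every layer for the
    num_blocks x (deltanet_per_block x DeltaNet, 1 x Attention) layout.
    Each entry is assumed to be followed by its own FFN."""
    block = ["gated_deltanet"] * deltanet_per_block + ["gated_attention"]
    return block * num_blocks


layers = build_layer_pattern()
counts = Counter(layers)

# Zero-based indices of the layers that use Gated Attention.
attention_layer_ids = [i for i, kind in enumerate(layers) if kind == "gated_attention"]

# Grouped-query attention: 16 query heads share 4 key-value heads.
gqa_group_size = 16 // 4

print(len(layers), dict(counts), attention_layer_ids, gqa_group_size)
```

Under this reading, 24 of the 32 layers use Gated DeltaNet and only 8 use Gated Attention, with every fourth layer being an attention layer.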

Dataset Composition

4.0 / 10

While the total token count for the Qwen3.5 series is stated to be in the trillions (building on the 36 trillion tokens of Qwen3), the specific breakdown for the 9B variant is vague. Documentation mentions broad categories like web content, PDF-like documents (processed via Qwen2.5-VL), and synthetic data for math and coding. However, exact percentage distributions (e.g., code vs. web vs. books) and detailed filtering/cleaning methodologies for the Qwen3.5-specific training run are not publicly disclosed.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly available via the Hugging Face repository and is fully compatible with the Transformers library. It uses a Byte-level Byte Pair Encoding (BBPE) approach with a large, well-documented vocabulary of 248,320 padded tokens. It supports 201 languages and dialects, and the vocabulary includes specific control tokens for chat, tool use, vision, and coding, all of which are explicitly listed in the technical documentation.
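To give a sense of what a 248,320-entry vocabulary costs, the following back-of-the-envelope sketch computes the size of one embedding table using the hidden size of 4,096 listed in the specifications. Whether the input and output embeddings share weights is not stated here, so the figure covers a single table only.

```python
VOCAB_SIZE = 248_320   # padded vocabulary size from the spec table
HIDDEN_SIZE = 4_096    # hidden dimension from the spec table

# One embedding table holds one vector of HIDDEN_SIZE values per token.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE

# Size in GiB if stored in BF16 (2 bytes per value).
embedding_gib_bf16 = embedding_params * 2 / 2**30

print(f"{embedding_params:,} parameters, about {embedding_gib_bf16:.2f} GiB in BF16")
```

A single table therefore accounts for roughly 1.02B of the 9B parameters.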

Model

27.0 / 40

Parameter Density

10.0 / 10

The model is explicitly identified as a 9B dense model. Unlike the larger MoE variants in the Qwen3.5 family (e.g., 397B-A17B), the 9B variant has 100% active parameters. The architectural breakdown, including hidden dimensions (4096), FFN intermediate dimensions (12288), and layer counts (32), is clearly stated, leaving no ambiguity regarding parameter density or active vs. total counts.
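The stated dimensions allow a rough check of where the parameters sit. The sketch below counts only the feed-forward weights, assuming a standard SwiGLU block with three bias-free projection matrices (gate, up, down) in each of the 32 layers; that layout is an assumption, and attention, DeltaNet, embedding, and vision parameters are left out.

```python
HIDDEN_SIZE = 4_096
FFN_INTERMEDIATE = 12_288
NUM_LAYERS = 32


def swiglu_ffn_params(hidden: int, intermediate: int) -> int:
    """Parameters of one SwiGLU FFN: gate and up projections
    (hidden -> intermediate) plus a down projection
    (intermediate -> hidden), assuming no bias terms."""
    return 3 * hidden * intermediate


ffn_params_per_layer = swiglu_ffn_params(HIDDEN_SIZE, FFN_INTERMEDIATE)
ffn_params_total = ffn_params_per_layer * NUM_LAYERS

print(f"per layer: {ffn_params_per_layer:,}")
print(f"all layers: {ffn_params_total:,}")
```

Under these assumptions the FFN blocks alone hold about 4.8B parameters, a bit more than half of the nominal 9B.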

Training Compute

2.0 / 10

Information regarding training compute is extremely limited. While the 'Next-Generation Training Infrastructure' is mentioned as having near-100% multimodal training efficiency, there are no public disclosures of total GPU/TPU hours, hardware cluster size, training duration, or carbon footprint specifically for the 9B model. Most compute-related claims are high-level marketing statements rather than verifiable technical data.

Benchmark Reproducibility

6.0 / 10

Qwen provides comprehensive benchmark results across standard sets (MMLU-Pro: 82.5%, GPQA Diamond: 81.7%, LiveCodeBench v6: 82.7%). While they specify versions and some evaluation strategies (e.g., context-folding for long context), the exact evaluation code and full prompt sets for all reported benchmarks are not consistently provided in a single reproducible repository, though some datasets like HLE-Verified are open-sourced.

Identity Consistency

9.0 / 10

The model demonstrates high identity consistency, correctly identifying itself as Qwen3.5-9B and maintaining awareness of its version and multimodal capabilities. It distinguishes between its 'thinking' (reasoning) and 'non-thinking' modes via a toggleable parameter (enable_thinking), and there are no reported instances of the model claiming to be a competitor's product.

Downstream

23.0 / 30

License Clarity

10.0 / 10

The model is released under the Apache 2.0 license, which is a standard, permissive open-source license. The license is clearly stated on the Hugging Face model card and in the official GitHub repository, explicitly allowing for commercial use, modification, and distribution without conflicting proprietary terms.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented by both the provider and third-party tools like Unsloth. VRAM requirements are specified for various precisions: ~18GB for BF16 and ~5GB for 4-bit quantization. The impact of the 262k context window on KV cache memory (approx. 8GB at full context) is also detailed, providing clear guidance for consumer and enterprise deployment.
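The quoted figures can be reproduced approximately with simple arithmetic. The sketch below is an estimate, not the provider's calculator: it assumes exactly 9e9 parameters, a context of 262,144 tokens, a 16-bit KV cache, and that only the 8 Gated Attention layers keep a per-token KV cache (the DeltaNet layers are assumed to hold a fixed-size state that is ignored here). The function names are hypothetical.

```python
def weight_bytes(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in bytes."""
    return num_params * bits_per_param / 8


def kv_cache_bytes(
    context_tokens: int,
    attention_layers: int = 8,   # assumed: only Gated Attention layers cache K/V
    kv_heads: int = 4,
    head_dim: int = 256,
    bytes_per_value: int = 2,    # assumed: 16-bit cache
) -> int:
    """KV cache size in bytes: one key and one value vector
    per KV head, per attention layer, per token."""
    per_token = 2 * attention_layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


bf16_gb = weight_bytes(9e9, 16) / 1e9
int4_gb = weight_bytes(9e9, 4) / 1e9
kv_gib = kv_cache_bytes(262_144) / 2**30

print(f"BF16 weights: {bf16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
print(f"KV cache at full context: {kv_gib:.1f} GiB")
```

These estimates exclude activations, the vision encoder's working memory, and quantization overhead such as scales and zero points, so real usage will be somewhat higher.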

Versioning Drift

5.0 / 10

The model follows a clear naming convention (Qwen3.5-9B) and is part of a structured release cycle. However, there is no detailed public changelog or version history that tracks subtle weight updates or 'silent' safety alignment changes after release. While major versions are clear, tracking drift within the Qwen3.5-9B lifecycle remains difficult for external auditors.


Qwen3.5-9B: Specifications and GPU VRAM Requirements