Parameters: 9B
Context Length: 262K
Modality: Multimodal
Architecture: Dense
License: Apache 2.0
Release Date: 24 Feb 2026
Knowledge Cutoff: -

Attention
Attention Structure: Grouped-Query Attention
Attention Heads: 16
Key-Value Heads: 4
Attention Head Dimension: 256
Position Embedding: RoPE
RoPE Theta: 10,000,000
Sliding Window Attention: No
Sliding Window Size: -
Normalization: RMS Normalization
Activation Function: SwiGLU

Dimensions
Hidden Dimension Size: 4,096
Number of Layers: 32
FFN Intermediate Size (Dense): 12,288
Multi-Token Prediction Heads: 1

Tokenizer
Vocabulary Size: 248,320

Qwen3.5-9B is Alibaba Cloud's efficient multimodal foundation model with 9B parameters, released in February 2026. It uses a hybrid architecture that combines Gated Delta Networks and Gated Attention in an 8×(3×DeltaNet→FFN→1×Attention→FFN) pattern. Reported scores include MMLU-Pro (82.5%), GPQA Diamond (81.7%), HMMT benchmarks (90%/90%), and LiveCodeBench v6 (82.7%). The model features unified vision-language capabilities, a 262K native context (extensible to 1M), and multi-token prediction training, and it targets multimodal reasoning, coding, agentic, and multilingual tasks across 201 languages.
Qwen 3.5 is Alibaba Cloud's latest-generation foundation model family, released in February 2026. It combines a unified vision-language foundation, an efficient hybrid architecture (Gated Delta Networks with sparse Mixture-of-Experts), scalable reinforcement learning across million-agent environments, and linguistic coverage spanning 201 languages. The models are available with open weights under the Apache 2.0 license.
No evaluation benchmarks for Qwen3.5-9B available.
Overall Rank
-
Coding Rank
-
Total Score
71 / 100
Qwen3.5-9B exhibits strong transparency in its architectural specifications and licensing, providing clear technical details on its hybrid Gated DeltaNet structure and permissive Apache 2.0 terms. However, it remains opaque regarding its specific training data proportions and total compute resources consumed. While hardware requirements are well-documented for deployment, the lack of detailed data provenance and training logs limits a full independent audit of its upstream development.
Architectural Provenance
The model architecture is extensively documented in the official Hugging Face model card and release blog. It uses a hybrid structure of 32 layers arranged in an 8×(3×Gated DeltaNet → FFN → 1×Gated Attention → FFN) pattern, so each of the 8 repeating blocks contributes 4 layers. Technical specifications are explicitly provided for the Gated DeltaNet layers (32 value heads, 16 query/key heads, head dimension 128) and the Gated Attention layers (16 query heads, 4 key-value heads, head dimension 256). The model is a native multimodal foundation model trained with multi-token prediction (MTP) and strong-to-weak distillation, though the specific "strong" teacher models are not fully detailed.
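The repeating layer pattern can be made concrete with a minimal sketch. It assumes the pattern reads as "each layer is one token mixer followed by an FFN", so only the mixer type varies per layer; the function name and the string labels are illustrative and are not taken from the model's configuration.

```python
from collections import Counter


def build_layer_pattern(num_blocks: int = 8,
                        deltanet_per_block: int = 3,
                        attention_per_block: int = 1) -> list[str]:
    """Expand the repeating block into a flat list of mixer types, one per layer.

    Assumption: every layer is a token mixer followed by an FFN, so the
    FFN is implied and only the mixer type is recorded.
    """
    block = (["gated_deltanet"] * deltanet_per_block
             + ["gated_attention"] * attention_per_block)
    return block * num_blocks


layers = build_layer_pattern()
counts = Counter(layers)

print(len(layers))               # total number of layers
print(counts["gated_deltanet"])  # layers using Gated DeltaNet
print(counts["gated_attention"]) # layers using Gated Attention
```

Under this reading, the 32 layers split into 24 Gated DeltaNet layers and 8 Gated Attention layers, with a Gated Attention layer at every fourth position.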
Dataset Composition
While the total token count for the Qwen3.5 series is stated to be in the trillions (building on the 36 trillion tokens of Qwen3), the specific breakdown for the 9B variant is vague. Documentation mentions broad categories like web content, PDF-like documents (processed via Qwen2.5-VL), and synthetic data for math and coding. However, exact percentage distributions (e.g., code vs. web vs. books) and detailed filtering/cleaning methodologies for the Qwen3.5-specific training run are not publicly disclosed.
Tokenizer Integrity
The tokenizer is publicly available via the Hugging Face repository and is fully compatible with the Transformers library. It uses byte-level Byte Pair Encoding (BBPE) with a large, well-documented vocabulary of 248,320 tokens (padded size). It supports 201 languages and dialects, and the vocabulary includes dedicated control tokens for chat, tool use, vision, and coding, all of which are explicitly listed in the technical documentation.
Parameter Density
The model is explicitly identified as a 9B dense model. Unlike the larger MoE variants in the Qwen3.5 family (e.g., 397B-A17B), the 9B variant has 100% active parameters. The architectural breakdown, including hidden dimensions (4096), FFN intermediate dimensions (12288), and layer counts (32), is clearly stated, leaving no ambiguity regarding parameter density or active vs. total counts.
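As a rough sanity check on the stated dimensions, the sketch below counts only the parts of the model that follow directly from them: the SwiGLU FFN weights and the token embedding table. It assumes a standard SwiGLU FFN with three bias-free projection matrices (gate, up, down) in every layer; the token-mixer layers, output head, and vision components are deliberately left out because their sizes cannot be derived from the figures given here.

```python
HIDDEN = 4096              # hidden dimension size
FFN_INTERMEDIATE = 12288   # FFN intermediate size
NUM_LAYERS = 32            # number of layers
VOCAB = 248_320            # padded vocabulary size


def swiglu_ffn_params(hidden: int, intermediate: int) -> int:
    """Weights in one SwiGLU FFN: gate, up, and down projections.

    Assumption: no bias terms, which is the common convention.
    """
    return 3 * hidden * intermediate


ffn_total = NUM_LAYERS * swiglu_ffn_params(HIDDEN, FFN_INTERMEDIATE)
embedding = VOCAB * HIDDEN

print(f"FFN weights:       {ffn_total:,}")
print(f"Embedding weights: {embedding:,}")
print(f"Subtotal:          {ffn_total + embedding:,}")
```

Under these assumptions the FFN and embedding weights come to roughly 5.8B parameters, which leaves the remainder of the 9B total to the components not counted here.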
Training Compute
Information regarding training compute is extremely limited. While the 'Next-Generation Training Infrastructure' is mentioned as having near-100% multimodal training efficiency, there are no public disclosures of total GPU/TPU hours, hardware cluster size, training duration, or carbon footprint specifically for the 9B model. Most compute-related claims are high-level marketing statements rather than verifiable technical data.
Benchmark Reproducibility
Qwen provides comprehensive benchmark results across standard benchmarks (MMLU-Pro: 82.5%, GPQA Diamond: 81.7%, LiveCodeBench v6: 82.7%). The documentation specifies benchmark versions and some evaluation strategies (e.g., context-folding for long context), but the exact evaluation code and full prompt sets for all reported benchmarks are not consistently provided in a single reproducible repository, though some datasets, such as HLE-Verified, are open-sourced.
Identity Consistency
The model demonstrates high identity consistency, correctly identifying itself as Qwen3.5-9B and maintaining awareness of its version and multimodal capabilities. It distinguishes between its 'thinking' (reasoning) and 'non-thinking' modes via a toggleable parameter (enable_thinking), and there are no reported instances of the model claiming to be a competitor's product.
License Clarity
The model is released under the Apache 2.0 license, which is a standard, permissive open-source license. The license is clearly stated on the Hugging Face model card and in the official GitHub repository, explicitly allowing for commercial use, modification, and distribution without conflicting proprietary terms.
Hardware Footprint
Hardware requirements are well-documented by both the provider and third-party tools like Unsloth. VRAM requirements for the weights are specified for various precisions: ~18 GB for BF16 and ~5 GB for 4-bit quantization. The impact of the 262K context window on KV cache memory (approx. 8 GB at full context) is also detailed, providing clear guidance for consumer and enterprise deployment.
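These figures can be approximately reproduced from the published dimensions. The sketch below is a back-of-the-envelope estimate, not the provider's calculator. It assumes exactly 9e9 parameters, a BF16 (2-byte) KV cache, a full context of 262,144 tokens, and that only the 8 Gated Attention layers keep a per-token KV cache (the Gated DeltaNet layers are assumed to hold a fixed-size state that is not counted). Quantization metadata, activations, and runtime overhead are ignored.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for model weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def kv_cache_bytes(context_tokens: int,
                   attention_layers: int,
                   kv_heads: int,
                   head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: keys and values for every cached layer and token."""
    per_token = 2 * attention_layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


NUM_PARAMS = 9e9          # assumption: "9B" taken as exactly 9e9
CONTEXT = 262_144         # assumption: "262K" taken as 2**18 tokens
ATTENTION_LAYERS = 8      # assumption: one Gated Attention layer per block

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
kv_bytes = kv_cache_bytes(CONTEXT, ATTENTION_LAYERS, kv_heads=4, head_dim=256)

print(f"BF16 weights:  {bf16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
print(f"KV cache:      {kv_bytes / 2**30:.1f} GiB at {CONTEXT:,} tokens")
```

Under these assumptions the estimate gives 18 GB for BF16 weights, 4.5 GB for 4-bit weights, and 8 GiB of KV cache at full context, in line with the documented values. If all 32 layers cached keys and values, the KV cache would be four times larger, which illustrates why the hybrid layout matters for long-context deployment.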
Versioning Drift
The model follows a clear naming convention (Qwen3.5-9B) and is part of a structured release cycle. However, there is no detailed public changelog or version history tracking subtle weight updates or "silent" safety alignment changes after release. While major versions are clear, tracking drift within the Qwen3.5-9B lifecycle remains difficult for external auditors.