Total Parameters
397B
Context Length
262,144 tokens
Modality
Multimodal
Architecture
Mixture of Experts (MoE)
License
Apache 2.0
Release Date
24 Feb 2026
Knowledge Cutoff
-
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
32
Key-Value Heads
2
Attention Head Dimension
256
Position Embedding
RoPE
RoPE Theta
10,000,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
4,096
Number of Layers
60
FFN Intermediate Size (Dense)
1,024
Multi-Token Prediction Heads
1
Tokenizer
Vocabulary Size
248,320
Mixture of Experts
Active Parameters
17.0B
Number of Experts
512
Active Experts
11 (10 routed + 1 shared)
Shared Experts
1
FFN Intermediate Size (per Expert)
1,024
Dense Layers Before MoE
-
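As a rough illustration of what the attention values listed above imply for memory, the following sketch estimates the key-value cache size under Grouped-Query Attention. The head counts, head dimension, and context length come from the spec list; the count of 15 full-attention layers comes from the layer layout described under Architectural Provenance. The 16-bit cache precision and the choice to ignore any Gated DeltaNet state are assumptions made for illustration only.

```python
# Values copied from the spec list above.
NUM_QUERY_HEADS = 32
NUM_KV_HEADS = 2
HEAD_DIM = 256
CONTEXT_TOKENS = 262_144

# Assumption: only the 15 Gated Attention layers (out of 60) keep a
# per-token KV cache; Gated DeltaNet layers are ignored in this estimate.
NUM_ATTENTION_LAYERS = 15
# Assumption: keys and values are cached as 16-bit values (2 bytes each).
BYTES_PER_VALUE = 2


def kv_cache_bytes_per_token(kv_heads, head_dim, layers, bytes_per_value):
    """Bytes of key cache plus value cache stored for a single token."""
    return 2 * kv_heads * head_dim * layers * bytes_per_value


# With GQA, several query heads share each key-value head.
queries_per_kv_head = NUM_QUERY_HEADS // NUM_KV_HEADS

per_token_bytes = kv_cache_bytes_per_token(
    NUM_KV_HEADS, HEAD_DIM, NUM_ATTENTION_LAYERS, BYTES_PER_VALUE
)
full_context_bytes = per_token_bytes * CONTEXT_TOKENS

print(f"Query heads per KV head: {queries_per_kv_head}")
print(f"KV cache per token: {per_token_bytes} bytes")
print(f"KV cache at full context: {full_context_bytes / 2**30:.2f} GiB")
```

Under these assumptions, the small number of key-value heads keeps the cache for a full-length sequence in the single-digit gigabyte range, far below the memory needed for the weights themselves.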
Qwen3.5-397B-A17B is Alibaba Cloud's largest and most capable multimodal foundation model, released in February 2026. With 397B total parameters, of which 17B are activated per token through a Mixture-of-Experts architecture (512 experts), it reports scores of 87.8% on MMLU-Pro, 88.4% on GPQA Diamond, 80.0% on SWE-bench Verified, and 54.0% on Terminal-Bench 2.0. It features unified vision-language capabilities and extended context up to 1M tokens, and it targets coding agents, general agents, multimodal reasoning, and multilingual understanding across 201 languages.
Qwen 3.5 is Alibaba Cloud's latest-generation foundation model family, released in February 2026. It combines a unified vision-language foundation for multimodal learning, an efficient hybrid architecture (Gated Delta Networks with sparse Mixture-of-Experts), scalable reinforcement learning across million-agent environments, and linguistic coverage spanning 201 languages. The models are available with open weights under the Apache 2.0 license.
Rank
#32
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| StackUnseen | ProLLM Stack Unseen | 0.763 | 14 |
| Web Development | WebDev Arena | 1389 | 24 |
Overall Rank
#32
Coding Rank
#31
Total Score
66 / 100
Qwen3.5-397B-A17B exhibits high transparency in its architectural specifications and licensing, providing clear distinctions between total and active parameters. However, the model is significantly opaque regarding its training data composition and total compute resources. While hardware requirements are well-documented for local deployment, the lack of verifiable training provenance and evaluation code limits its overall transparency profile.
Architectural Provenance
The model architecture is extensively documented as a hybrid Mixture-of-Experts (MoE) design with Gated DeltaNet layers. Technical specifications are highly detailed, including the number of layers (60), the hidden dimension (4,096), and the specific layer layout: 15 blocks, each consisting of 3 × (Gated DeltaNet → MoE) followed by 1 × (Gated Attention → MoE), which gives 15 × 4 = 60 layers. The use of 512 total experts, with 10 routed experts and 1 shared expert active per token, is explicitly stated. While the base model lineage is clear within the Qwen 3.5 family, the pre-training methodology is described only in high-level technical terms (early-fusion multimodal training) rather than in a step-by-step procedural paper.
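The layer layout described above can be sketched as a small list-building routine. This is only an illustration of the counting, not the model's actual implementation; the function name and the layer labels are hypothetical.

```python
def build_layer_layout(num_blocks=15, deltanet_per_block=3, attention_per_block=1):
    """Return an ordered list of layer labels for the hybrid stack.

    Each block holds `deltanet_per_block` Gated DeltaNet + MoE layers
    followed by `attention_per_block` Gated Attention + MoE layers.
    The labels are illustrative names, not identifiers from the model code.
    """
    layout = []
    for _ in range(num_blocks):
        layout.extend(["gated_deltanet+moe"] * deltanet_per_block)
        layout.extend(["gated_attention+moe"] * attention_per_block)
    return layout


layout = build_layer_layout()
num_layers = len(layout)
num_attention_layers = layout.count("gated_attention+moe")

print(f"Total layers: {num_layers}")
print(f"Gated Attention layers: {num_attention_layers}")
print(f"First block: {layout[:4]}")
```

With the documented values, the list has 60 entries, of which every fourth one is a Gated Attention layer.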
Dataset Composition
Data transparency is a significant weakness. While the model is described as being trained on 'trillions of multimodal tokens' across 201 languages, specific dataset names, sources, and exact composition percentages (e.g., % code vs % web) are not disclosed. Documentation vaguely refers to 'automated collection' and 'public benchmark datasets' for evaluation, but the pre-training corpus remains a 'black box' with no public sample data or detailed filtering methodology provided.
Tokenizer Integrity
The tokenizer is publicly available and well-documented with a vocabulary size of 248,320 (often rounded to 250k in documentation). It supports 201 languages and dialects, and its efficiency for non-Latin scripts is highlighted in technical reviews. The vocabulary size and tokenization approach (supporting text, image, and video tokens via early fusion) are verifiable through the official Hugging Face configuration files and community tools like vLLM and SGLang.
Parameter Density
Qwen provides exemplary transparency regarding parameter density. It explicitly distinguishes between total parameters (397B) and active parameters (17B). The MoE structure is further detailed with the exact number of experts (512) and the routing mechanism (10 routed + 1 shared). This level of detail prevents the common 'parameter inflation' seen in other MoE models and allows for accurate compute estimation.
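A back-of-the-envelope check shows how the published figures fit together. The sketch below assumes each expert is a bias-free SwiGLU feed-forward block with three weight matrices (gate, up, and down projections) and that all 60 layers carry a full set of 512 experts; both are assumptions for illustration, and attention, DeltaNet, embedding, and vision parameters are not counted.

```python
# Figures taken from the spec list and the text above.
TOTAL_PARAMS = 397_000_000_000
ACTIVE_PARAMS = 17_000_000_000
HIDDEN_SIZE = 4096
EXPERT_FFN_SIZE = 1024
NUM_EXPERTS = 512
ACTIVE_EXPERTS = 11  # 10 routed + 1 shared
NUM_LAYERS = 60


def swiglu_expert_params(hidden, intermediate):
    """Weights in one SwiGLU expert: gate, up, and down projections.

    Assumption: no bias terms.
    """
    return 3 * hidden * intermediate


per_expert = swiglu_expert_params(HIDDEN_SIZE, EXPERT_FFN_SIZE)
# Assumption: every layer holds all 512 experts.
all_expert_params = per_expert * NUM_EXPERTS * NUM_LAYERS
active_expert_params = per_expert * ACTIVE_EXPERTS * NUM_LAYERS
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"Parameters per expert: {per_expert:,}")
print(f"All expert parameters: {all_expert_params / 1e9:.1f}B")
print(f"Expert parameters used per token: {active_expert_params / 1e9:.1f}B")
print(f"Active share of total parameters: {active_fraction:.1%}")
```

Under these assumptions the experts account for roughly 386.5B of the 397B total, while only about 8.3B expert parameters are used per token, which is consistent with the stated 17B active figure once the non-expert components are added.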
Training Compute
There is almost no verifiable information regarding the total compute budget. While the hardware used for inference (H100, H200, B200) is mentioned, the actual training duration, total GPU/TPU hours, and carbon footprint are not disclosed. The documentation mentions 'Next-Generation Training Infrastructure' but lacks the concrete metrics required for a high transparency score.
Benchmark Reproducibility
While the model provides impressive scores on standard benchmarks (MMLU-Pro, GPQA, SWE-bench), the evaluation code is not fully public. Some specific setups are mentioned (e.g., using fixes from Claude 4.5 system cards for TAU2-Bench), but a comprehensive, one-click reproduction repository is missing. Furthermore, there are documented concerns regarding potential contamination in common benchmarks like MATH-500 for the Qwen series, which necessitates a skeptical view of the reported zero-shot gains.
Identity Consistency
The model demonstrates strong identity consistency, correctly identifying as a Qwen 3.5 series model in most deployments. It is transparent about its 'Thinking Mode' and the use of <think> tags. However, some user reports indicate minor alignment issues where the model may fail to generate answers or exhibit whitespace normalization errors, though it does not typically claim to be a competitor's model.
License Clarity
The model is released under a clear Apache 2.0 license, which is explicitly stated in the GitHub repository, Hugging Face model card, and official blog posts. This license allows for both commercial and non-commercial use, derivative works, and redistribution without the restrictive 'custom' terms often found in other 'open' weights releases.
Hardware Footprint
Hardware requirements are well-documented by both the provider and third-party community members (e.g., Unsloth, NVIDIA). VRAM requirements for various quantization levels (FP16, FP8, 4-bit GGUF) are available, with clear guidance that the full model requires ~807GB on disk. The distinction between total memory for loading (397B) and active compute for inference (17B) is clearly explained for local deployment.
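The relationship between parameter count and weight storage can be approximated as parameters multiplied by bits per weight. The sketch below is a simplified estimate only: it ignores quantization metadata (such as scales), activations, and the KV cache, and the function name is hypothetical.

```python
TOTAL_PARAMS = 397_000_000_000  # all weights must be loaded, not just the active 17B

# Nominal bits per weight for the formats mentioned above.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "4-bit": 4}


def weight_gigabytes(num_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


estimates = {
    name: weight_gigabytes(TOTAL_PARAMS, bits)
    for name, bits in BITS_PER_WEIGHT.items()
}

for name, gigabytes in estimates.items():
    print(f"{name}: ~{gigabytes:.1f} GB")
```

The 16-bit estimate of about 794 GB is in the same range as the ~807GB on-disk figure quoted above. Real quantized files, such as 4-bit GGUF, typically come out somewhat larger than the nominal figure because they store extra metadata alongside the weights.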
Versioning Drift
Versioning is handled through standard Hugging Face repository updates, but a formal, detailed changelog or semantic versioning system for the weights themselves is not prominently maintained. While the release date (Feb 2026) is clear, there is limited information on how future 'silent' updates or safety alignment shifts will be communicated to the community.