Qwen3.5-397B-A17B

Total Parameters

397B

Active Parameters

17B

Context Length

262,144 tokens

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

24 Feb 2026

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

32

Key-Value Heads

2

Attention Head Dimension

256

Position Embedding

RoPE

RoPE Theta

10,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

4,096

Number of Layers

60

FFN Intermediate Size (Dense)

1,024

Multi-Token Prediction Heads

1

Tokenizer

Vocabulary Size

248,320

Mixture of Experts

Total Expert Parameters

17.0B

Number of Experts

512

Active Experts

11 (10 routed + 1 shared)

Shared Experts

1

FFN Intermediate Size (per Expert)

1,024

Dense Layers Before MoE

-

Architecture Diagram

Input Tokens → Token Embedding (Position: RoPE; Hidden: 4.1k · Context: 262.1k · Vocab: 248.3k)
× 60 layers:
  RMSNorm (Pre-Attention) → Grouped-Query Attention (32 Q / 2 KV heads, head dim: 256) → + (residual add)
  RMSNorm (Pre-FFN) → Sparse MoE FFN (11/512 experts, SwiGLU, intermediate: 1k) → + (residual add)
Final RMSNorm → Output Logits

Qwen3.5-397B-A17B

Qwen3.5-397B-A17B is Alibaba Cloud's largest and most capable multimodal foundation model, released in February 2026. It has 397B total parameters, of which 17B are activated per token through a Mixture-of-Experts architecture (512 experts), and it reports state-of-the-art scores on MMLU-Pro (87.8%), GPQA Diamond (88.4%), SWE-bench Verified (80.0%), and Terminal-Bench 2.0 (54.0%). It features unified vision-language capabilities and extended context up to 1M tokens, and it targets coding agents, general agents, multimodal reasoning, and multilingual understanding across 201 languages.

About Qwen 3.5

Qwen 3.5 is Alibaba Cloud's latest-generation foundation model family, released in February 2026. It integrates advances in multimodal learning (a unified vision-language foundation), an efficient hybrid architecture (Gated Delta Networks with sparse Mixture-of-Experts), scalable reinforcement learning across million-agent environments, and linguistic coverage spanning 201 languages. The models are released with open weights under the Apache 2.0 license.


Evaluation Benchmarks

Rank

#32

Category | Benchmark | Score | Rank
- | - | 0.763 | 14
Web Development | WebDev Arena | 1389 | 24

Rankings

Overall Rank

#32

Coding Rank

#31

Model Integrity

Total Score

B

66 / 100

Qwen3.5-397B-A17B Model Integrity Report

Audit Note

Qwen3.5-397B-A17B exhibits high transparency in its architectural specifications and licensing, providing clear distinctions between total and active parameters. However, the model is significantly opaque regarding its training data composition and total compute resources. While hardware requirements are well-documented for local deployment, the lack of verifiable training provenance and evaluation code limits its overall transparency profile.

Upstream

20.0 / 30

Architectural Provenance

8.0 / 10

The model architecture is documented in detail as a hybrid Mixture-of-Experts (MoE) design with Gated DeltaNet layers. The specifications include the number of layers (60), the hidden dimension (4096), and the layer layout: 15 blocks, each consisting of three Gated DeltaNet → MoE layers followed by one Gated Attention → MoE layer. The use of 512 total experts, with 10 routed experts and 1 shared expert active per token, is explicitly stated. While the base model lineage is clear within the Qwen 3.5 family, the pre-training methodology is described only in high-level terms (early-fusion multimodal training) rather than in a step-by-step procedural paper.
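The layer layout described above can be written out and used for a rough KV-cache estimate. This is a minimal sketch, not the model's actual implementation: the layer names are illustrative labels, the 16-bit cache width is an assumption, and the Gated DeltaNet layers are assumed to keep a fixed-size state rather than a per-token cache (which is therefore not counted here).

```python
# Values taken from the spec sheet and the layout description above.
NUM_BLOCKS = 15
DELTANET_PER_BLOCK = 3
ATTENTION_PER_BLOCK = 1
KV_HEADS = 2
HEAD_DIM = 256
CONTEXT_TOKENS = 262_144
BYTES_PER_VALUE = 2  # assumption: 16-bit KV cache entries


def build_layout():
    """Return the per-layer token-mixer types (labels are illustrative)."""
    block = (["gated_deltanet"] * DELTANET_PER_BLOCK
             + ["gated_attention"] * ATTENTION_PER_BLOCK)
    return block * NUM_BLOCKS


def kv_cache_bytes(num_attention_layers, tokens):
    """KV-cache size: keys and values for every KV head, layer, and token."""
    per_token_per_layer = 2 * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token_per_layer * num_attention_layers * tokens


layout = build_layout()
attention_layers = layout.count("gated_attention")

print(len(layout), "layers,", attention_layers, "with attention")
print(kv_cache_bytes(attention_layers, CONTEXT_TOKENS) / 2**30, "GiB at full context")
```

Under these assumptions the cache comes to 7.5 GiB for one sequence at 262,144 tokens; if all 60 layers cached keys and values, the same formula would give four times as much.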

Dataset Composition

3.0 / 10

Data transparency is a significant weakness. While the model is described as being trained on 'trillions of multimodal tokens' across 201 languages, specific dataset names, sources, and exact composition percentages (e.g., % code vs % web) are not disclosed. Documentation vaguely refers to 'automated collection' and 'public benchmark datasets' for evaluation, but the pre-training corpus remains a 'black box' with no public sample data or detailed filtering methodology provided.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly available and well-documented with a vocabulary size of 248,320 (often rounded to 250k in documentation). It supports 201 languages and dialects, and its efficiency for non-Latin scripts is highlighted in technical reviews. The vocabulary size and tokenization approach (supporting text, image, and video tokens via early fusion) are verifiable through the official Hugging Face configuration files and community tools like vLLM and SGLang.
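As a rough illustration of what the vocabulary size implies, the sketch below computes the size of a single token-embedding table from the figures on this page. The 16-bit storage width is an assumption, and whether the input and output embeddings share weights is not stated here, so only one table is counted.

```python
VOCAB_SIZE = 248_320   # from the spec sheet
HIDDEN_SIZE = 4_096    # from the spec sheet
BYTES_PER_WEIGHT = 2   # assumption: 16-bit weights


def embedding_params(vocab_size, hidden_size):
    """Number of weights in one vocab-by-hidden embedding table."""
    return vocab_size * hidden_size


params = embedding_params(VOCAB_SIZE, HIDDEN_SIZE)
size_gb = params * BYTES_PER_WEIGHT / 1e9

print(f"{params:,} parameters, about {size_gb:.2f} GB")
```

This comes to roughly one billion parameters, a small share of the 397B total.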

Model

23.5 / 40

Parameter Density

9.5 / 10

Qwen provides exemplary transparency regarding parameter density. It explicitly distinguishes between total parameters (397B) and active parameters (17B). The MoE structure is further detailed with the exact number of experts (512) and the routing mechanism (10 routed + 1 shared). This level of detail prevents the common 'parameter inflation' seen in other MoE models and allows for accurate compute estimation.
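The figures above make a quick sparsity estimate possible. The sketch below computes the share of parameters and experts active per token, plus a forward-pass compute estimate using the common rule of thumb of about 2 FLOPs per active parameter per token. That rule is an approximation applied here for illustration, not a figure published for this model.

```python
TOTAL_PARAMS = 397e9    # from the spec sheet
ACTIVE_PARAMS = 17e9    # from the spec sheet
NUM_EXPERTS = 512
ACTIVE_EXPERTS = 11     # 10 routed + 1 shared


def share(active, total):
    """Fraction of the total that is active for a single token."""
    return active / total


param_share = share(ACTIVE_PARAMS, TOTAL_PARAMS)
expert_share = share(ACTIVE_EXPERTS, NUM_EXPERTS)

# Rule-of-thumb estimate: ~2 FLOPs per active parameter per token (forward pass).
forward_flops_per_token = 2 * ACTIVE_PARAMS

print(f"active parameters: {param_share:.1%}")
print(f"active experts:    {expert_share:.1%}")
print(f"forward FLOPs per token (approx.): {forward_flops_per_token:.1e}")
```

The two shares differ because the active parameter count also includes components every token passes through, not only the selected experts.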

Training Compute

2.0 / 10

There is almost no verifiable information regarding the total compute budget. While the hardware used for inference (H100, H200, B200) is mentioned, the actual training duration, total GPU/TPU hours, and carbon footprint are not disclosed. The documentation mentions 'Next-Generation Training Infrastructure' but lacks the concrete metrics required for a high transparency score.

Benchmark Reproducibility

4.0 / 10

While the model provides impressive scores on standard benchmarks (MMLU-Pro, GPQA, SWE-bench), the evaluation code is not fully public. Some specific setups are mentioned (e.g., using fixes from Claude 4.5 system cards for TAU2-Bench), but a comprehensive, one-click reproduction repository is missing. Furthermore, there are documented concerns regarding potential contamination in common benchmarks like MATH-500 for the Qwen series, which necessitates a skeptical view of the reported zero-shot gains.

Identity Consistency

8.0 / 10

The model demonstrates strong identity consistency, correctly identifying as a Qwen 3.5 series model in most deployments. It is transparent about its 'Thinking Mode' and the use of <think> tags. However, some user reports indicate minor alignment issues where the model may fail to generate answers or exhibit whitespace normalization errors, though it does not typically claim to be a competitor's model.
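When Thinking Mode output is consumed programmatically, the reasoning usually needs to be separated from the final answer. A minimal sketch follows, assuming the reasoning arrives wrapped in paired <think>...</think> tags within the generated text; the helper name `split_thinking` and the sample string are hypothetical, and real deployments may format or strip these tags differently.

```python
import re

# Non-greedy match so multiple <think> blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (list of reasoning segments, remaining answer text)."""
    thoughts = [segment.strip() for segment in THINK_RE.findall(text)]
    answer = THINK_RE.sub("", text).strip()
    return thoughts, answer


# Hypothetical sample output, for illustration only.
sample = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
thoughts, answer = split_thinking(sample)

print(thoughts)
print(answer)
```

Text without any tags passes through unchanged, with an empty list of reasoning segments.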

Downstream

22.5 / 30

License Clarity

10.0 / 10

The model is released under a clear Apache 2.0 license, which is explicitly stated in the GitHub repository, Hugging Face model card, and official blog posts. This license allows for both commercial and non-commercial use, derivative works, and redistribution without the restrictive 'custom' terms often found in other 'open' weights releases.

Hardware Footprint

7.5 / 10

Hardware requirements are well-documented by both the provider and third-party community members (e.g., Unsloth, NVIDIA). VRAM requirements for various quantization levels (FP16, FP8, 4-bit GGUF) are available, with clear guidance that the full model requires ~807GB on disk. The distinction between total memory for loading (397B) and active compute for inference (17B) is clearly explained for local deployment.
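The memory needed just to hold the weights can be approximated as parameter count times bits per weight. The sketch below applies that to the 397B total. The bit widths are nominal assumptions: real quantized files carry extra scale and metadata overhead, and 4-bit GGUF formats typically average somewhat more than 4 bits per weight. Activations and KV cache are not included.

```python
TOTAL_PARAMS = 397e9  # from the spec sheet

# Nominal bits per weight (assumption: no quantization overhead).
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "4-bit": 4}


def weight_gb(params, bits):
    """Approximate weight storage in decimal gigabytes."""
    return params * bits / 8 / 1e9


estimates = {name: weight_gb(TOTAL_PARAMS, bits)
             for name, bits in BITS_PER_WEIGHT.items()}

for name, gb in estimates.items():
    print(f"{name:>5}: ~{gb:.1f} GB")
```

The 16-bit estimate of about 794 GB is in the same range as the ~807GB on-disk figure quoted above.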

Versioning Drift

5.0 / 10

Versioning is handled through standard Hugging Face repository updates, but a formal, detailed changelog or semantic versioning system for the weights themselves is not prominently maintained. While the release date (Feb 2026) is clear, there is limited information on how future 'silent' updates or safety alignment shifts will be communicated to the community.
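The category scores in this report can be checked against the section totals and the overall score with a few lines of arithmetic. The dictionary below simply transcribes the scores listed above.

```python
# Category scores transcribed from the report above (each out of 10).
SCORES = {
    "Upstream": {
        "Architectural Provenance": 8.0,
        "Dataset Composition": 3.0,
        "Tokenizer Integrity": 9.0,
    },
    "Model": {
        "Parameter Density": 9.5,
        "Training Compute": 2.0,
        "Benchmark Reproducibility": 4.0,
        "Identity Consistency": 8.0,
    },
    "Downstream": {
        "License Clarity": 10.0,
        "Hardware Footprint": 7.5,
        "Versioning Drift": 5.0,
    },
}

section_totals = {name: sum(items.values()) for name, items in SCORES.items()}
total_score = sum(section_totals.values())

for name, subtotal in section_totals.items():
    print(f"{name}: {subtotal} / {10 * len(SCORES[name])}")
print(f"Total: {total_score} / 100")
```

The sums reproduce the reported 20.0 / 30, 23.5 / 40, and 22.5 / 30, for a total of 66 / 100.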
