Qwen3.5-35B-A3B

Total Parameters

35B

Context Length

262K

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

24 Feb 2026

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

16

Key-Value Heads

2

Attention Head Dimension

256

Position Embedding

RoPE

RoPE Theta

10,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

2,048

Number of Layers

40

FFN Intermediate Size (Dense)

512

Multi-Token Prediction Heads

1

Tokenizer

Vocabulary Size

248,320

Mixture of Experts

Active Parameters (per Token)

3.0B

Number of Experts

256

Active Experts

9

Shared Experts

1

FFN Intermediate Size (per Expert)

512

Dense Layers Before MoE

-

Architecture Diagram

Input Tokens → Token Embedding (Position: RoPE; Hidden: 2k · Context: 262K · Vocab: 248.3k) → 40 layers × [RMSNorm (Pre-Attention) → Grouped-Query Attention (16 Q / 2 KV heads, head dim: 256) → (+) → RMSNorm (Pre-FFN) → Sparse MoE FFN (9/256 experts, SwiGLU, intermediate: 512) → (+)] → Final RMSNorm → Output Logits

Qwen3.5-35B-A3B

Qwen3.5-35B-A3B is Alibaba Cloud's efficient multimodal foundation model, released in February 2026. It has 35B total parameters, of which 3B are activated per token through a Mixture-of-Experts architecture with 256 experts, which keeps inference compute low relative to its size. Reported benchmark scores are 85.3% on MMLU-Pro, 84.2% on GPQA Diamond, 69.2% on SWE-bench Verified, and 40.5% on Terminal-Bench 2.0. Qwen3.5-Flash is the hosted API version. The model features unified vision-language capabilities and a 262K native context (extensible to 1M), and is positioned for multimodal reasoning, coding, and multilingual tasks.

About Qwen 3.5

Qwen 3.5 is Alibaba Cloud's latest-generation foundation model family, released in February 2026. It integrates advances in multimodal learning (a unified vision-language foundation), an efficient hybrid architecture (Gated Delta Networks with sparse Mixture-of-Experts), scalable reinforcement learning across million-agent environments, and linguistic coverage spanning 201 languages. The models are available with open weights under the Apache 2.0 license.


Evaluation Benchmarks

Rank

#101

Category | Benchmark | Score | Rank
General Text | Text Arena | 1396 | 54
Web Development | WebDev Arena | 1249 | 89

Rankings

Overall Rank

#101

Coding Rank

#104

Model Integrity

Total Score

B+

72 / 100

Qwen3.5-35B-A3B Model Integrity Report

Audit Note

Qwen3.5-35B-A3B exhibits a strong transparency profile regarding its complex hybrid architecture and parameter density, providing clear distinctions between total and active weights. The model is highly accessible through its permissive Apache 2.0 license and detailed hardware requirements for local deployment. However, it remains opaque concerning its specific training data sources and the total compute resources utilized during its development.

Upstream

21.5 / 30

Architectural Provenance

8.0 / 10

The model's architecture is extensively documented in official Hugging Face model cards and technical blog posts. It is a hybrid Gated DeltaNet and sparse Mixture-of-Experts (MoE) transformer. Documentation specifies 40 layers with a 10x block layout (3x Gated DeltaNet -> MoE followed by 1x Gated Attention -> MoE). It details linear attention head counts (32 for V, 16 for QK) and head dimensions (128). The pre-training methodology involves a three-stage process (General, Reasoning, and Long Context) which is publicly described.
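The block layout described above can be expanded into a per-layer list. This is a minimal sketch, assuming the 3:1 pattern repeats uniformly across all 10 blocks. The names `layer_layout`, `gated_deltanet`, and `gated_attention` are illustrative labels, not identifiers from the official implementation.

```python
def layer_layout(num_blocks: int = 10, linear_per_block: int = 3) -> list[str]:
    """Expand the repeating block pattern into a flat list of token-mixer types.

    Each block is assumed to be `linear_per_block` Gated DeltaNet layers
    followed by one Gated Attention layer; every layer is then followed
    by a sparse MoE feed-forward (not listed here).
    """
    block = ["gated_deltanet"] * linear_per_block + ["gated_attention"]
    return block * num_blocks


layout = layer_layout()
print("total layers:", len(layout))
print("full-attention layers:", layout.count("gated_attention"))
print("first block:", layout[:4])
```

Under these assumptions, only one layer in four uses full attention, which matters for how memory grows with context length.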

Dataset Composition

4.5 / 10

Alibaba discloses that the model was trained on approximately 36 trillion tokens across 119 languages. While general categories like web data, PDF-like documents (extracted via Qwen2.5-VL), and synthetic data (generated by Qwen2.5-Math/Coder) are mentioned, there is no granular breakdown of specific data sources or exact percentage compositions. The filtering methodology is described at a high level (multilingual annotation system labeling for educational value and safety), but specific datasets remain proprietary.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly available via the Hugging Face repository and is compatible with standard libraries like Transformers and vLLM. It uses a Byte Pair Encoding (BPE) scheme with a large vocabulary of 151,646 tokens (padded to 248,320). Documentation explicitly states the inclusion of functional control tokens (<|im_start|>, <|im_end|>) and supports 201 languages/dialects, which is verifiable through the provided configuration files.

Model

26.5 / 40

Parameter Density

9.5 / 10

Transparency regarding parameter density is exemplary for an MoE model. The provider explicitly distinguishes between the 35B total parameters and the 3B active parameters per token. The MoE structure is detailed as having 256 total experts, with 8 routed experts and 1 shared expert activated per forward pass. This level of detail prevents the common 'parameter inflation' marketing trap and provides clear technical specs for compute estimation.
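As a rough cross-check of the total versus active split, the expert weights can be estimated from the dimensions listed in the specifications. This is a back-of-the-envelope sketch, not an official breakdown. It assumes that each expert is a bias-free SwiGLU feed-forward with three weight matrices (gate, up, down), that the shared expert has the same size as a routed expert, and that all 40 layers carry an MoE feed-forward.

```python
HIDDEN = 2048              # hidden dimension size (from the spec table)
EXPERT_INTERMEDIATE = 512  # FFN intermediate size per expert
NUM_EXPERTS = 256          # experts per MoE layer
ACTIVE_EXPERTS = 9         # 8 routed + 1 shared per token
NUM_LAYERS = 40


def swiglu_expert_params(hidden: int, intermediate: int) -> int:
    """Weights in one SwiGLU expert: gate and up (hidden x intermediate)
    plus down (intermediate x hidden). Biases are assumed absent."""
    return 3 * hidden * intermediate


per_expert = swiglu_expert_params(HIDDEN, EXPERT_INTERMEDIATE)
total_expert = per_expert * NUM_EXPERTS * NUM_LAYERS
active_expert = per_expert * ACTIVE_EXPERTS * NUM_LAYERS

print(f"per expert:          {per_expert:,}")
print(f"all experts:         {total_expert / 1e9:.2f}B")
print(f"active expert share: {active_expert / 1e9:.2f}B")
print(f"experts used/token:  {ACTIVE_EXPERTS / NUM_EXPERTS:.2%}")
```

On these assumptions the experts account for most of the 35B total, while the experts touched per token are a small part of the 3B active figure. The remainder would come from attention, embeddings, and other always-active weights.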

Training Compute

2.0 / 10

Information regarding the specific compute resources used for training is almost entirely absent. While the training stages and token counts are disclosed, there is no public data on GPU/TPU hours, hardware specifications used for the run, total energy consumption, or carbon footprint. This is a significant gap in an otherwise technical profile.

Benchmark Reproducibility

6.0 / 10

The model provides results for several standard benchmarks (MMLU-Pro: 85.3%, GPQA Diamond: 84.2%, SWE-bench: 69.2%). While evaluation results are listed on Hugging Face, the specific evaluation code and exact prompt templates used for these official scores are not fully centralized in a single reproducible repository. Third-party verification from platforms like Artificial Analysis and community tests on r/LocalLLaMA provide some external validation, but official reproduction instructions are limited.

Identity Consistency

9.0 / 10

The model demonstrates high identity consistency, correctly identifying its version and family in official documentation and API responses. It is transparent about its nature as a mixture-of-experts model and its multimodal capabilities. There are no reported instances of the model claiming to be a competitor's product or misrepresenting its 3B active parameter count as a 35B dense model.

Downstream

23.5 / 30

License Clarity

10.0 / 10

The model is released under the Apache 2.0 license, which is a standard, permissive open-source license. The license file is explicitly included in the Hugging Face repository and allows for commercial use, modification, and distribution without conflicting proprietary terms. This is the highest level of licensing transparency possible.

Hardware Footprint

8.5 / 10

Hardware requirements are well-documented by both the provider and the community. Official documentation provides guidance for 8-GPU tensor-parallel setups at 262K context. Community documentation (e.g., Unsloth, llama.cpp) provides approximate VRAM requirements for various quantization levels (e.g., Q4_K_M requiring ~20GB VRAM, Q8_0 requiring ~37GB). The impact of context length on memory scaling is also documented through community benchmarks.
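The figures above can be approximated with simple arithmetic. The sketch below is an estimate under stated assumptions, not a replacement for measured numbers. It assumes 35B stored weights, and effective sizes of about 4.5 and 8.5 bits per weight for 4-bit and 8-bit quantization (illustrative values; real quantized files mix tensor types). It also assumes an fp16 KV cache kept only by the 10 full-attention layers, with 2 KV heads of dimension 256. Gated DeltaNet state, activations, the vision encoder, and runtime overhead are ignored.

```python
def weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


def kv_cache_bytes(
    context_tokens: int,
    attention_layers: int = 10,  # assumed: one full-attention layer per block
    kv_heads: int = 2,
    head_dim: int = 256,
    bytes_per_value: int = 2,    # assumed fp16 cache
) -> int:
    """Bytes needed to cache keys and values for `context_tokens` tokens."""
    per_token = 2 * attention_layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


for bits in (4.5, 8.5):
    print(f"{bits} bits/weight: {weight_gb(35e9, bits):.1f} GB of weights")

for ctx in (1_024, 131_072, 262_144):
    print(f"{ctx:>7} tokens: {kv_cache_bytes(ctx) / 2**30:.2f} GiB of KV cache")
```

With only a quarter of the layers keeping a growing cache and just 2 KV heads, the cache stays small relative to the weights even at the full 262K context under these assumptions.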

Versioning Drift

5.0 / 10

The model uses a naming convention that includes the version (3.5), but a formal semantic versioning changelog is not prominently maintained. While updates are pushed to Hugging Face (e.g., the March 5 GGUF update), these often rely on commit history rather than a structured, public-facing versioning system with deprecation notices. Tracking silent behavior drift over time remains difficult for end-users.

Qwen3.5-35B-A3B: Specifications and GPU VRAM Requirements