Total Parameters
35B
Context Length
262K
Modality
Multimodal
Architecture
Mixture of Experts (MoE)
License
Apache 2.0
Release Date
24 Feb 2026
Knowledge Cutoff
-
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
16
Key-Value Heads
2
Attention Head Dimension
256
Position Embedding
RoPE
RoPE Theta
10,000,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
2,048
Number of Layers
40
FFN Intermediate Size (Dense)
512
Multi-Token Prediction Heads
1
Tokenizer
Vocabulary Size
248,320
Mixture of Experts
Active Parameters (per Token)
3.0B
Number of Experts
256
Active Experts
9
Shared Experts
-
FFN Intermediate Size (per Expert)
512
Dense Layers Before MoE
-
Qwen3.5-35B-A3B is Alibaba Cloud's efficient multimodal foundation model, released in February 2026. It has 35B total parameters, of which 3B are activated per token through a Mixture-of-Experts architecture with 256 experts, so it delivers strong results at low inference compute. Reported benchmark scores are 85.3% on MMLU-Pro, 84.2% on GPQA Diamond, 69.2% on SWE-bench Verified, and 40.5% on Terminal-Bench 2.0. Qwen3.5-Flash is the hosted API version. The model offers unified vision-language capabilities and a 262K native context (extensible to 1M), and it targets multimodal reasoning, coding, and multilingual tasks.
Qwen 3.5 is Alibaba Cloud's latest-generation foundation model family, released in February 2026. It integrates advances in multimodal learning (a unified vision-language foundation), efficient hybrid architecture (Gated Delta Networks with sparse Mixture-of-Experts), scalable reinforcement learning across million-agent environments, and linguistic coverage spanning 201 languages. It is available under the Apache 2.0 license with open weights.
Rank
#101
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| General Text | Text Arena | 1396 | 54 |
| Web Development | WebDev Arena | 1249 | 89 |
Overall Rank
#101
Coding Rank
#104
Total Score
72 / 100
Qwen3.5-35B-A3B exhibits a strong transparency profile regarding its complex hybrid architecture and parameter density, providing clear distinctions between total and active weights. The model is highly accessible through its permissive Apache 2.0 license and detailed hardware requirements for local deployment. However, it remains opaque concerning its specific training data sources and the total compute resources utilized during its development.
Architectural Provenance
The model's architecture is extensively documented in official Hugging Face model cards and technical blog posts. It is a hybrid Gated DeltaNet and sparse Mixture-of-Experts (MoE) transformer. The documentation specifies 40 layers arranged as 10 repeated blocks, each consisting of three Gated DeltaNet -> MoE layers followed by one Gated Attention -> MoE layer. It also gives the linear attention head counts (32 for V, 16 for QK) and head dimension (128). The pre-training methodology, a three-stage process (General, Reasoning, and Long Context), is publicly described.
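The layer layout described above can be sketched in a few lines. This is only an illustration of the documented 10 x (3 + 1) pattern; the layer-type labels and the helper `build_layer_types` are hypothetical names, not values from the official configuration.

```python
from collections import Counter

# Documented layout: 10 repeated blocks, each with three Gated DeltaNet
# layers followed by one Gated Attention layer (every layer feeds an MoE FFN).
# The string labels below are illustrative, not official config values.
NUM_BLOCKS = 10
BLOCK_PATTERN = ["gated_deltanet"] * 3 + ["gated_attention"]


def build_layer_types(num_blocks: int = NUM_BLOCKS) -> list:
    """Return the token-mixer type for each layer, in order."""
    return BLOCK_PATTERN * num_blocks


layer_types = build_layer_types()
counts = Counter(layer_types)

print(len(layer_types))           # 40
print(counts["gated_deltanet"])   # 30
print(counts["gated_attention"])  # 10
```

Under this reading, only every fourth layer uses full attention, which matters for long-context memory because the linear-attention layers do not grow a per-token key-value cache.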
Dataset Composition
Alibaba discloses that the model was trained on approximately 36 trillion tokens across 119 languages. While general categories such as web data, PDF-like documents (extracted via Qwen2.5-VL), and synthetic data (generated by Qwen2.5-Math/Coder) are mentioned, there is no granular breakdown of specific data sources or exact percentage compositions. The filtering methodology is described only at a high level (a multilingual annotation system that labels data for educational value and safety), and the specific datasets remain proprietary.
Tokenizer Integrity
The tokenizer is publicly available via the Hugging Face repository and is compatible with standard libraries such as Transformers and vLLM. It uses a Byte Pair Encoding (BPE) scheme with a large vocabulary of 151,646 tokens (padded to 248,320). The documentation explicitly lists functional control tokens (<|im_start|>, <|im_end|>) and states support for 201 languages and dialects, both of which can be checked against the provided configuration files.
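To show how the two control tokens mentioned above typically frame a conversation, here is a minimal sketch. It assumes the common ChatML-style layout (`<|im_start|>role`, newline, content, `<|im_end|>`); the helper `format_chatml` is hypothetical, and the chat template shipped with the tokenizer is authoritative and may add fields this sketch omits.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in a ChatML-style layout.

    Assumption: each turn is '<|im_start|>{role}\\n{content}<|im_end|>\\n'.
    The real template bundled with the tokenizer may differ in details.
    """
    parts = []
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
])
print(prompt)
```

In practice, rendering prompts through the tokenizer's own bundled chat template is preferable to hand-built strings, since it tracks any template changes made by the provider.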
Parameter Density
Transparency regarding parameter density is exemplary for an MoE model. The provider explicitly distinguishes between the 35B total parameters and the 3B active parameters per token. The MoE structure is detailed as having 256 total experts, with 8 routed experts and 1 shared expert activated per forward pass. This level of detail prevents the common 'parameter inflation' marketing trap and provides clear technical specs for compute estimation.
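As a rough cross-check of these figures, the expert parameter counts can be estimated from the listed dimensions (hidden size 2,048, per-expert FFN size 512, 40 layers, 256 experts, 9 active per token). The sketch below assumes each expert is a bias-free SwiGLU FFN with three weight matrices (gate, up, down) and that every layer carries a full set of experts; it ignores attention, router, embedding, and vision weights, so it is an estimate rather than an official breakdown.

```python
HIDDEN = 2048        # hidden dimension size
EXPERT_FFN = 512     # FFN intermediate size per expert
NUM_LAYERS = 40
NUM_EXPERTS = 256    # expert count as listed
ACTIVE_EXPERTS = 9   # 8 routed + 1 shared, per the description above


def swiglu_expert_params(hidden: int, intermediate: int) -> int:
    """Weights in one SwiGLU expert: gate, up, and down projections (no bias assumed)."""
    return 3 * hidden * intermediate


per_expert = swiglu_expert_params(HIDDEN, EXPERT_FFN)
total_expert_params = per_expert * NUM_EXPERTS * NUM_LAYERS
active_expert_params = per_expert * ACTIVE_EXPERTS * NUM_LAYERS

print(f"per expert:     {per_expert:,}")                    # 3,145,728
print(f"all experts:    {total_expert_params / 1e9:.1f}B")  # 32.2B
print(f"active experts: {active_expert_params / 1e9:.1f}B") # 1.1B
```

Under these assumptions the experts account for roughly 32B of the 35B total, while the experts touched per token contribute a little over 1B; the rest of the quoted 3B active figure would come from components the sketch leaves out.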
Training Compute
Information regarding the specific compute resources used for training is almost entirely absent. While the training stages and token counts are disclosed, there is no public data on GPU/TPU hours, hardware specifications used for the run, total energy consumption, or carbon footprint. This is a significant gap in an otherwise technical profile.
Benchmark Reproducibility
The provider reports results for several standard benchmarks (MMLU-Pro: 85.3%, GPQA Diamond: 84.2%, SWE-bench Verified: 69.2%). While evaluation results are listed on Hugging Face, the evaluation code and exact prompt templates used for these official scores are not collected in a single reproducible repository. Third-party measurements from platforms like Artificial Analysis and community tests on r/LocalLLaMA provide some external validation, but official reproduction instructions are limited.
Identity Consistency
The model demonstrates high identity consistency, correctly identifying its version and family in official documentation and API responses. It is transparent about its nature as a mixture-of-experts model and its multimodal capabilities. There are no reported instances of the model claiming to be a competitor's product or misrepresenting its 3B active parameter count as a 35B dense model.
License Clarity
The model is released under the Apache 2.0 license, which is a standard, permissive open-source license. The license file is explicitly included in the Hugging Face repository and allows for commercial use, modification, and distribution without conflicting proprietary terms. This is the highest level of licensing transparency possible.
Hardware Footprint
Hardware requirements are well documented by both the provider and the community. Official documentation provides guidance for 8-GPU tensor-parallel setups at 262K context. Community documentation (e.g., Unsloth, llama.cpp) provides VRAM requirements for various quantization levels (e.g., Q4_K_M requiring ~20GB VRAM, Q8_0 requiring ~37GB). The impact of context length on memory scaling is also documented through community benchmarks.
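The two main memory terms, quantized weights and the attention key-value cache, can be approximated from the specifications listed on this page. The sketch below rests on several assumptions: only the 10 Gated Attention layers keep a per-token KV cache (the Gated DeltaNet layers hold a fixed-size state that is not counted), the cache is stored in 16-bit precision, 262K means 262,144 tokens, and the bits-per-weight value is supplied by the caller. The 4.85 figure in the example is a rough assumed average for a 4-bit K-quant, not an official number.

```python
def kv_cache_bytes(context_tokens: int,
                   attn_layers: int = 10,     # assumed: only Gated Attention layers cache K/V
                   kv_heads: int = 2,
                   head_dim: int = 256,
                   bytes_per_value: int = 2   # assumed 16-bit cache
                   ) -> int:
    """Bytes needed to store keys and values for one sequence."""
    per_token = 2 * attn_layers * kv_heads * head_dim * bytes_per_value  # K and V
    return per_token * context_tokens


def weight_bytes(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights at a given average bits per weight."""
    return total_params * bits_per_weight / 8


GIB = 1024 ** 3
full_context = 262_144  # assumed token count behind "262K"

print(kv_cache_bytes(1))                   # 20480 bytes per token
print(kv_cache_bytes(full_context) / GIB)  # 5.0 GiB at full context

# Hypothetical example: 35B parameters at an assumed ~4.85 bits per weight.
print(f"{weight_bytes(35e9, 4.85) / 1e9:.1f} GB")
```

Real deployments also need room for activations, the linear-attention state, and framework overhead, so these numbers are lower bounds rather than sizing guidance.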
Versioning Drift
The model uses a naming convention that includes the version (3.5), but a formal semantic versioning changelog is not prominently maintained. While updates are pushed to Hugging Face (e.g., the March 5 GGUF update), these often rely on commit history rather than a structured, public-facing versioning system with deprecation notices. Tracking silent behavior drift over time remains difficult for end-users.