Parameters
800M
Context Length
262K
Modality
Multimodal
Architecture
Dense
License
Apache 2.0
Release Date
24 Feb 2026
Knowledge Cutoff
-
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
8
Key-Value Heads
2
Attention Head Dimension
256
Position Embedding
RoPE
RoPE Theta
10,000,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
1,024
Number of Layers
24
FFN Intermediate Size (Dense)
3,584
Multi-Token Prediction Heads
1
Tokenizer
Vocabulary Size
248,320
Qwen3.5-0.8B is Alibaba Cloud's ultra-compact multimodal foundation model with 0.8B parameters, released in February 2026. It uses a hybrid architecture that combines Gated Delta Networks and Gated Attention in a 6×(3×DeltaNet→FFN→1×Attention→FFN) pattern. In thinking mode, it reports scores of 66.5% on MMLU-Pro, 51.6% on GPQA Diamond, and 11.9% on GPQA. It features unified vision-language capabilities, a 262K native context window, and multi-token prediction training, and it supports both thinking and non-thinking modes. It is designed for prototyping, fine-tuning, and research across 201 languages.
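The block pattern above can be expanded into a per-layer list to check that it agrees with the 24 layers in the specification. This is a minimal sketch, assuming each "mixer→FFN" pair counts as one layer; the function name and the layer labels are illustrative and not taken from the model's code.

```python
from collections import Counter


def build_layer_pattern(num_blocks: int = 6, deltanet_per_block: int = 3) -> list[str]:
    """Expand 6 x (3 x DeltaNet -> FFN, 1 x Attention -> FFN) into one label per layer.

    Assumption: every token mixer is followed by its own FFN, so one
    mixer+FFN pair is counted as a single layer.
    """
    pattern: list[str] = []
    for _ in range(num_blocks):
        pattern.extend(["deltanet"] * deltanet_per_block)  # linear-attention layers
        pattern.append("attention")                        # one full-attention layer
    return pattern


layers = build_layer_pattern()
counts = Counter(layers)
print(len(layers), dict(counts))
```

Under this reading, a quarter of the layers use full attention and the rest use Gated DeltaNet, which is why the layer count and the block pattern are consistent with each other.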
Qwen 3.5 is Alibaba Cloud's latest-generation foundation model family, released in February 2026. The family combines a unified vision-language foundation, an efficient hybrid architecture (Gated Delta Networks with sparse Mixture-of-Experts), reinforcement learning scaled across million-agent environments, and linguistic coverage spanning 201 languages. It is available with open weights under the Apache 2.0 license.
No evaluation benchmarks are available for Qwen3.5-0.8B.
Overall Rank
-
Coding Rank
-
Total Score
69 / 100
Qwen3.5-0.8B demonstrates high transparency in its architectural design and licensing, providing deep technical insights into its hybrid attention mechanism and permissive usage terms. However, it remains opaque regarding its specific training data sources and the environmental impact of its compute resources. While benchmark results are plentiful, the lack of a centralized reproduction suite limits its score in the model evaluation pillar.
Architectural Provenance
The model's architecture is extensively documented in the official Hugging Face repository and technical blog. It utilizes a sophisticated hybrid design (Gated Delta Networks and Gated Attention) with a specific 6×(3×DeltaNet→FFN→1×Attention→FFN) pattern. Key hyperparameters such as hidden dimensions (1024), layer count (24), and head dimensions are explicitly stated. It also discloses the use of Multi-Token Prediction (MTP) during training, which is a significant technical detail often omitted by competitors.
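To make the stated head configuration concrete, the sketch below derives the projection widths of a full-attention layer from the listed values (8 query heads, 2 key-value heads, head dimension 256). The query-to-KV-head mapping shown is the common grouped-query convention and is an assumption here, not something confirmed by the model card.

```python
# Values taken from the specification listing.
NUM_Q_HEADS = 8
NUM_KV_HEADS = 2
HEAD_DIM = 256
HIDDEN = 1024

# Each key-value head is shared by a group of query heads.
group_size = NUM_Q_HEADS // NUM_KV_HEADS

# Output widths of the query and key/value projections.
q_width = NUM_Q_HEADS * HEAD_DIM    # wider than the hidden size
kv_width = NUM_KV_HEADS * HEAD_DIM  # width of K, and separately of V

# Assumed mapping: consecutive query heads share one KV head.
kv_head_for_query = [h // group_size for h in range(NUM_Q_HEADS)]

print(f"group size: {group_size}")
print(f"q width: {q_width}, k/v width: {kv_width}, hidden: {HIDDEN}")
print(f"query head -> kv head: {kv_head_for_query}")
```

Note that the query projection (2,048) is wider than the hidden dimension (1,024), which follows directly from the unusually large head dimension for a model of this size.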
Dataset Composition
While the provider mentions training on a 'significantly larger scale' of multimodal tokens with 'stricter filtering' and support for 201 languages, the specific dataset composition (e.g., exact percentages of web, code, or vision data) is not disclosed. There is no public list of data sources or a detailed breakdown of the training mixture, falling into the 'general categories mentioned' tier of the scoring rubric.
Tokenizer Integrity
The tokenizer is publicly available on Hugging Face (tokenizer.json) and is well-documented. It uses a Byte-level BPE approach with a specific vocabulary size of 151,669 tokens (padded to 248,320). The documentation clearly explains the handling of control tokens (like <|im_start|>) and its efficiency across the 201 supported languages. Third-party implementations (e.g., KerasHub, .NET) further verify its integrity.
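As an illustration of how control tokens such as <|im_start|> frame a conversation, here is a minimal ChatML-style formatter. It is a sketch only: the helper name is hypothetical, and the chat template shipped with the tokenizer may add further elements (for example thinking or vision markers), so that template should be preferred in practice.

```python
def format_chatml(messages: list[dict[str, str]]) -> str:
    """Wrap each message in <|im_start|>role ... <|im_end|> markers.

    Illustrative helper, not the official template: it ends with an open
    assistant turn so the model would continue from there.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```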
Parameter Density
The model clearly states its 0.8B parameter count. Unlike the larger MoE variants in the Qwen 3.5 family, this variant is dense, which is explicitly clarified in technical discussions. Detailed architectural breakdowns (KV heads, attention vs. linear layers) are provided, allowing for a clear understanding of parameter distribution, though a precise weight-by-weight breakdown is not in the primary model card.
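A rough tally from the listed hyperparameters shows where a large share of the parameters sits. This is a back-of-the-envelope sketch under several assumptions: the embedding matrix is counted once (tied input and output embeddings), SwiGLU uses three weight matrices per FFN, full attention appears in 6 of the 24 layers with plain query, key, value, and output projections, and biases, norms, gating weights, the DeltaNet layers, the vision components, and the multi-token prediction head are left out.

```python
# Values taken from the specification listing.
HIDDEN = 1_024
LAYERS = 24
FFN_INTERMEDIATE = 3_584
VOCAB = 248_320
Q_HEADS, KV_HEADS, HEAD_DIM = 8, 2, 256
FULL_ATTENTION_LAYERS = 6  # assumption: one per block in the 6 x (3 + 1) pattern

# Token embedding matrix, counted once (assumes tied embeddings).
embedding_params = VOCAB * HIDDEN

# SwiGLU FFN: gate, up, and down projections.
ffn_params_per_layer = 3 * HIDDEN * FFN_INTERMEDIATE
ffn_params = LAYERS * ffn_params_per_layer

# Full-attention layers: query, key, value, and output projections only.
attn_params_per_layer = (
    HIDDEN * Q_HEADS * HEAD_DIM         # query projection
    + 2 * HIDDEN * KV_HEADS * HEAD_DIM  # key and value projections
    + Q_HEADS * HEAD_DIM * HIDDEN       # output projection
)
attention_params = FULL_ATTENTION_LAYERS * attn_params_per_layer

accounted = embedding_params + ffn_params + attention_params

for name, value in [
    ("embeddings", embedding_params),
    ("FFN (all layers)", ffn_params),
    ("full attention", attention_params),
    ("accounted for", accounted),
]:
    print(f"{name:>18}: {value:>13,}")
```

The large vocabulary makes the embedding matrix comparable in size to all FFN layers combined; the remainder of the 0.8B total would come from the components this sketch leaves out.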
Training Compute
There is a near-total lack of transparency regarding the specific compute resources used for the 0.8B variant. While the 'Next-Generation Training Infrastructure' is mentioned as a marketing highlight, there are no disclosures regarding total GPU hours, hardware counts, energy consumption, or carbon footprint. This information is conspicuously absent from the official technical report and model cards.
Benchmark Reproducibility
The model provides detailed scores across a wide array of benchmarks (MMLU-Pro, GPQA, Video-MME) and specifies the 'Thinking' vs 'Non-thinking' modes for each. However, while some evaluation settings (top_p, temperature) are disclosed, the full evaluation code and exact prompt templates for all benchmarks are not centrally hosted in a reproducible repository, requiring users to rely on third-party frameworks like OpenCompass for verification.
Identity Consistency
The model exhibits high identity consistency, correctly identifying itself as part of the Qwen 3.5 family. It maintains clear versioning and distinguishes between its base and chat variants. Documentation and system prompts (where applicable) reinforce its identity as a multimodal model from Alibaba Cloud without attempting to mimic competitors.
License Clarity
The model is released under the Apache 2.0 license, which is explicitly stated and included in the Hugging Face repository. This is a standard, highly permissive open-source license with no hidden 'custom' restrictions or conflicting terms, providing maximum clarity for both commercial and research use.
Hardware Footprint
Hardware requirements are exceptionally well-documented by both the provider and the community (e.g., Unsloth, Ollama). VRAM requirements for various quantization levels (FP16, Q8, Q4) and context lengths (up to 262k) are publicly available. The documentation also addresses the memory scaling impact of its hybrid architecture, which is critical for a model of this size.
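The memory scaling can be approximated from the listed specification. The sketch below is an estimate, not a published figure: it assumes 0.8e9 weights, ignores quantization metadata and activation memory, assumes only the 6 full-attention layers keep a key-value cache that grows with context (the DeltaNet layers are assumed to hold a fixed-size state), stores cache values in 16 bits, and takes "262K" to mean 262,144 tokens.

```python
GIB = 2**30


def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage, ignoring quantization scales and metadata."""
    return num_params * bits_per_weight / 8


def kv_cache_bytes(
    context_tokens: int,
    attention_layers: int = 6,  # assumption: full-attention layers only
    kv_heads: int = 2,
    head_dim: int = 256,
    bytes_per_value: int = 2,   # assumption: FP16/BF16 cache
) -> int:
    """Key plus value cache size for the layers that keep per-token state."""
    return 2 * attention_layers * kv_heads * head_dim * bytes_per_value * context_tokens


for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{label:>4} weights: {weight_bytes(0.8e9, bits) / GIB:.2f} GiB")

for tokens in (1_024, 32_768, 262_144):
    print(f"KV cache at {tokens:>7,} tokens: {kv_cache_bytes(tokens) / GIB:.3f} GiB")
```

Under these assumptions the cache costs 12 KiB per token, so at long contexts the key-value cache, rather than the weights, dominates memory use.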
Versioning Drift
The model uses a clear version naming scheme (Qwen3.5-0.8B), and the Hugging Face commit history provides a basic changelog. However, there is no formal, centralized changelog detailing behavioral drift or specific performance changes between minor weight updates, which makes it difficult for downstream users to track subtle shifts in model behavior over time.