
Qwen3.5-0.8B

Parameters

800M

Context Length

262K

Modality

Multimodal

Architecture

Dense

License

Apache 2.0

Release Date

24 Feb 2026

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

8

Key-Value Heads

2

Attention Head Dimension

256

Position Embedding

RoPE

RoPE Theta

10,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

1,024

Number of Layers

24

FFN Intermediate Size (Dense)

3,584

Multi-Token Prediction Heads

1

Tokenizer

Vocabulary Size

248,320
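
As a rough cross-check of the figures above, the following sketch estimates how many of the listed 800M parameters sit in the token embedding table and the feed-forward blocks. It uses only the hidden size, layer count, FFN intermediate size, and vocabulary size from this page. The three-matrix, bias-free SwiGLU layout is an assumption, and attention, DeltaNet, normalization, and vision components are deliberately left out.

```python
# Back-of-the-envelope parameter arithmetic from the spec values listed above.
HIDDEN = 1_024
LAYERS = 24
FFN_INTERMEDIATE = 3_584
VOCAB = 248_320

# Token embedding table: one HIDDEN-sized vector per vocabulary entry.
embedding_params = VOCAB * HIDDEN

# Assumption: a standard SwiGLU FFN with three bias-free matrices
# (gate and up: HIDDEN x FFN_INTERMEDIATE, down: FFN_INTERMEDIATE x HIDDEN).
ffn_params_per_layer = 3 * HIDDEN * FFN_INTERMEDIATE
ffn_params_total = LAYERS * ffn_params_per_layer

print(f"embedding:      {embedding_params:,}")      # 254,279,680
print(f"FFN per layer:  {ffn_params_per_layer:,}")  # 11,010,048
print(f"FFN, 24 layers: {ffn_params_total:,}")      # 264,241,152
```

Under these assumptions the embedding table and the FFN blocks together come to roughly 519M parameters, which leaves the remainder of the stated 800M for the attention and DeltaNet mixers and any other components.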

Architecture Diagram

Input Tokens → Token Embedding (Position: RoPE; Hidden: 1k · Context: 262K · Vocab: 248.3k)
× 24 layers: RMSNorm (Pre-Attention) → Grouped-Query Attention (8 Q / 2 KV heads, head dim: 256) → + → RMSNorm (Pre-FFN) → Feed-Forward Network (SwiGLU, intermediate: 3.6k) → +
Final RMSNorm → Output Logits

Qwen3.5-0.8B

Qwen3.5-0.8B is Alibaba Cloud's ultra-compact multimodal foundation model with 0.8B parameters, released in February 2026. It uses a hybrid architecture that combines Gated Delta Networks and Gated Attention in a 6×(3×DeltaNet→FFN→1×Attention→FFN) pattern. In thinking mode, it reports 66.5% on MMLU-Pro, 51.6% on GPQA Diamond, and 11.9% on GPQA. It features unified vision-language capabilities, a 262k native context window, multi-token prediction training, and both thinking and non-thinking modes. It is designed for prototyping, fine-tuning, and research across 201 languages.
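
The 6×(3×DeltaNet→FFN→1×Attention→FFN) pattern can be expanded into a per-layer list to see where the full attention layers fall. This is a minimal sketch: the function name and the "deltanet" / "attention" labels are illustrative and are not identifiers from the model's code.

```python
def expand_layer_pattern(groups=6, deltanet_per_group=3, attention_per_group=1):
    """Expand the repeating group into a flat list of token-mixer types.

    Each entry stands for one layer (mixer followed by its FFN).
    """
    layers = []
    for _ in range(groups):
        layers += ["deltanet"] * deltanet_per_group
        layers += ["attention"] * attention_per_group
    return layers


layer_types = expand_layer_pattern()
attention_indices = [i for i, kind in enumerate(layer_types) if kind == "attention"]

print(len(layer_types))    # 24
print(attention_indices)   # [3, 7, 11, 15, 19, 23]
```

The expansion gives 24 layers, matching the layer count in the specification table, with every fourth layer using full attention.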

About Qwen 3.5

Qwen 3.5 is Alibaba Cloud's latest-generation foundation model family, released in February 2026. It integrates advances in multimodal learning (a unified vision-language foundation), efficient hybrid architecture (Gated Delta Networks with sparse Mixture-of-Experts), scalable reinforcement learning across million-agent environments, and linguistic coverage spanning 201 languages. The family is available under the Apache 2.0 license with open weights.


Evaluation Benchmarks

No evaluation benchmarks are available for Qwen3.5-0.8B.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B

69 / 100

Qwen3.5-0.8B Model Integrity Report

Audit Note

Qwen3.5-0.8B demonstrates high transparency in its architectural design and licensing, providing deep technical insights into its hybrid attention mechanism and permissive usage terms. However, it remains opaque regarding its specific training data sources and the environmental impact of its compute resources. While benchmark results are plentiful, the lack of a centralized reproduction suite limits its score in the model evaluation pillar.

Upstream

21.0 / 30

Architectural Provenance

8.5 / 10

The model's architecture is extensively documented in the official Hugging Face repository and technical blog. It utilizes a sophisticated hybrid design (Gated Delta Networks and Gated Attention) with a specific 6×(3×DeltaNet→FFN→1×Attention→FFN) pattern. Key hyperparameters such as hidden dimensions (1024), layer count (24), and head dimensions are explicitly stated. It also discloses the use of Multi-Token Prediction (MTP) during training, which is a significant technical detail often omitted by competitors.

Dataset Composition

3.5 / 10

While the provider mentions training on a 'significantly larger scale' of multimodal tokens with 'stricter filtering' and support for 201 languages, the specific dataset composition (e.g., exact percentages of web, code, or vision data) is not disclosed. There is no public list of data sources or a detailed breakdown of the training mixture, falling into the 'general categories mentioned' tier of the scoring rubric.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly available on Hugging Face (tokenizer.json) and is well-documented. It uses a Byte-level BPE approach with a specific vocabulary size of 151,669 tokens (padded to 248,320). The documentation clearly explains the handling of control tokens (like <|im_start|>) and its efficiency across the 201 supported languages. Third-party implementations (e.g., KerasHub, .NET) further verify its integrity.

Model

25.0 / 40

Parameter Density

7.5 / 10

The model clearly states its 0.8B parameter count. Unlike the larger MoE variants in the Qwen 3.5 family, this variant is dense, which is explicitly clarified in technical discussions. Detailed architectural breakdowns (KV heads, attention vs. linear layers) are provided, allowing for a clear understanding of parameter distribution, though a precise weight-by-weight breakdown is not in the primary model card.

Training Compute

2.0 / 10

There is a near-total lack of transparency regarding the specific compute resources used for the 0.8B variant. While the 'Next-Generation Training Infrastructure' is mentioned as a marketing highlight, there are no disclosures regarding total GPU hours, hardware counts, energy consumption, or carbon footprint. This information is conspicuously absent from the official technical report and model cards.

Benchmark Reproducibility

6.0 / 10

The model provides detailed scores across a wide array of benchmarks (MMLU-Pro, GPQA, Video-MME) and specifies the 'Thinking' vs 'Non-thinking' modes for each. However, while some evaluation settings (top_p, temperature) are disclosed, the full evaluation code and exact prompt templates for all benchmarks are not centrally hosted in a reproducible repository, requiring users to rely on third-party frameworks like OpenCompass for verification.

Identity Consistency

9.5 / 10

The model exhibits high identity consistency, correctly identifying itself as part of the Qwen 3.5 family. It maintains clear versioning and distinguishes between its base and chat variants. Documentation and system prompts (where applicable) reinforce its identity as a multimodal model from Alibaba Cloud without attempting to mimic competitors.

Downstream

23.0 / 30

License Clarity

10.0 / 10

The model is released under the Apache 2.0 license, which is explicitly stated and included in the Hugging Face repository. This is a standard, highly permissive open-source license with no hidden 'custom' restrictions or conflicting terms, providing maximum clarity for both commercial and research use.

Hardware Footprint

8.0 / 10

Hardware requirements are exceptionally well-documented by both the provider and the community (e.g., Unsloth, Ollama). VRAM requirements for various quantization levels (FP16, Q8, Q4) and context lengths (up to 262k) are publicly available. The documentation also addresses the memory scaling impact of its hybrid architecture, which is critical for a model of this size.
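
To make the memory scaling concrete, here is a hedged estimate of weight storage at the quantization levels named above and of the key-value cache at full context. It assumes that only the 6 full attention layers of the hybrid pattern keep a per-token KV cache, that "262k" means 262,144 tokens, and that quantized formats cost exactly 1 or 0.5 bytes per weight. Activations, DeltaNet recurrent state, the vision encoder, and quantization metadata are ignored, so real usage will be higher.

```python
PARAMS = 800_000_000      # "800M" from the spec table
KV_HEADS = 2              # Key-Value Heads
HEAD_DIM = 256            # Attention Head Dimension
ATTENTION_LAYERS = 6      # assumption: one attention layer per group in the 6x pattern
CONTEXT_TOKENS = 262_144  # assumption: "262k" means 2**18 tokens

# Assumption: idealized bytes per weight, with no scale/zero-point overhead.
BYTES_PER_WEIGHT = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}


def weight_bytes(fmt: str) -> float:
    """Storage for the weights alone in the given format."""
    return PARAMS * BYTES_PER_WEIGHT[fmt]


def kv_cache_bytes(tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size: keys and values for every KV head in every attention layer."""
    per_token = 2 * KV_HEADS * HEAD_DIM * ATTENTION_LAYERS * bytes_per_value
    return per_token * tokens


for fmt in BYTES_PER_WEIGHT:
    print(f"{fmt}: ~{weight_bytes(fmt) / 1e9:.1f} GB of weights")

print(f"KV cache at {CONTEXT_TOKENS:,} tokens: "
      f"{kv_cache_bytes(CONTEXT_TOKENS) / 2**30:.1f} GiB")
```

Under these assumptions the FP16 KV cache at full context (about 3 GiB) is larger than the FP16 weights themselves (about 1.6 GB), which is why context length matters so much for a model of this size.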

Versioning Drift

5.0 / 10

The model uses clear semantic versioning (Qwen3.5-0.8B), and the Hugging Face commit history provides a basic changelog. However, there is no formal, centralized changelog detailing behavioral drift or specific performance changes between minor weight updates, making it difficult for downstream users to track subtle shifts in model behavior over time.
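
The pillar and total scores in this report are plain sums of the criterion scores, which can be checked with a short sketch. The dictionary layout is illustrative; the numbers are copied from the report above.

```python
# Criterion scores as listed in the integrity report, grouped by pillar.
scores = {
    "Upstream": {
        "Architectural Provenance": 8.5,
        "Dataset Composition": 3.5,
        "Tokenizer Integrity": 9.0,
    },
    "Model": {
        "Parameter Density": 7.5,
        "Training Compute": 2.0,
        "Benchmark Reproducibility": 6.0,
        "Identity Consistency": 9.5,
    },
    "Downstream": {
        "License Clarity": 10.0,
        "Hardware Footprint": 8.0,
        "Versioning Drift": 5.0,
    },
}

# Each pillar total is the sum of its criteria; the overall score sums the pillars.
pillar_totals = {pillar: sum(items.values()) for pillar, items in scores.items()}
total = sum(pillar_totals.values())

print(pillar_totals)  # {'Upstream': 21.0, 'Model': 25.0, 'Downstream': 23.0}
print(total)          # 69.0
```

The sums reproduce the stated 21.0 / 30, 25.0 / 40, and 23.0 / 30 pillar scores and the 69 / 100 total.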
