Ministral 3 3B

Parameters

3B

Context Length

256K

Modality

Multimodal

Architecture

Dense

License

Apache 2.0

Release Date

2 Dec 2025

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Grouped Query Attention (GQA)

Attention Heads

32

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

Rotary Position Embedding (RoPE) with YaRN

RoPE Theta

1,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMSNorm

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

3,072

Number of Layers

26

FFN Intermediate Size (Dense)

9,216

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

131,072

Architecture Diagram

Diagram summary: Input Tokens → Token Embedding (position: RoPE; hidden: 3,072; context: 256K; vocab: 131,072) → 26 layers, each: RMSNorm (pre-attention) → Grouped Query Attention (32 query / 8 KV heads, head dim 128) → residual add → RMSNorm (pre-FFN) → Feed-Forward Network (SwiGLU, intermediate 9,216) → residual add → Final RMSNorm → Output Logits

Ministral 3 3B

Ministral 3 3B is a compact multimodal language model from Mistral AI, built to run efficiently on edge devices and in other resource-constrained settings. It pairs a 3.4 billion parameter language decoder with a 410 million parameter Vision Transformer (ViT) encoder, for a combined total of approximately 3.8 billion parameters. This design lets the model process text and images together, supporting tasks such as image captioning, visual question answering, and multimodal data extraction at low computational cost.

Ministral 3 3B is a dense, decoder-only Transformer. It uses Grouped Query Attention (GQA) with 32 query heads and 8 key-value heads, which shrinks the key-value cache and reduces memory-bandwidth pressure during inference. Positions are encoded with Rotary Positional Embeddings (RoPE), extended with YaRN (Yet another RoPE extensioN) and position-based softmax temperature scaling to support a context window of up to 256,000 tokens. The 3B variant also ties its input and output embeddings so that vocabulary parameters do not take up a disproportionate share of the model. The vision component is a frozen ViT encoder derived from the Mistral Small 3.1 architecture, coupled with a newly trained multimodal projection layer.
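The memory effect of GQA can be checked with a little arithmetic. The sketch below estimates key-value cache size from the figures in the specification table (26 layers, 8 key-value heads, head dimension 128). It assumes a 2-byte (FP16/BF16) cache, a single sequence, and no cache quantization or paging; these are illustrative assumptions, not documented runtime settings, and the function name is hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers=26, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Estimate KV-cache size for one sequence (assumed FP16/BF16 cache)."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens

GIB = 1024 ** 3

per_token = kv_cache_bytes(1)                           # bytes per cached token
gqa_full = kv_cache_bytes(256_000)                      # 8 KV heads (GQA)
mha_full = kv_cache_bytes(256_000, num_kv_heads=32)     # hypothetical 32 KV heads

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"GQA at 256,000 tokens: {gqa_full / GIB:.1f} GiB")
print(f"32 KV heads at 256,000 tokens: {mha_full / GIB:.1f} GiB")
```

Under these assumptions the cache costs 104 KiB per token, or roughly 25.4 GiB at the full 256,000-token window; sharing each key-value head across four query heads makes it four times smaller than it would be with 32 key-value heads.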

The model is optimized for high-performance on-device applications, offering native support for function calling and structured JSON output to enable complex agentic workflows. It incorporates architectural refinements such as SwiGLU activation and RMSNorm to ensure stability and efficiency during local inference. By supporting dozens of languages and featuring a high-context capacity, Ministral 3 3B is positioned as a versatile solution for real-time translation, local content generation, and privacy-focused intelligent assistants operating directly on user hardware.
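As an illustration of how structured JSON output feeds an agentic workflow, the following sketch parses a tool call and dispatches it to a local function. The JSON shape (`name` plus `arguments`), the `get_weather` tool, and the `dispatch_tool_call` helper are all hypothetical and chosen for illustration; they are not Mistral AI's actual function-calling format or API.

```python
import json

def get_weather(city: str) -> str:
    # Hypothetical local tool; a real one would query a weather service.
    return f"No live data available for {city} in this sketch."

# Registry mapping tool names to local Python callables.
TOOLS = {"get_weather": get_weather}

def dispatch_tool_call(raw_output: str) -> str:
    """Parse a JSON tool call (assumed shape) and run the matching tool."""
    call = json.loads(raw_output)          # raises ValueError on malformed JSON
    name = call["name"]
    arguments = call.get("arguments", {})
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**arguments)

# Example: a model response that has already been constrained to valid JSON.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
print(dispatch_tool_call(model_output))
```

Keeping the registry explicit means the model can only trigger functions the application has deliberately exposed.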

About Ministral 3

Ministral 3 is a family of efficient edge models with vision capabilities, available in 3B, 8B, and 14B parameter sizes. The models are designed for edge deployment with multimodal and multilingual support, and are positioned as offering best-in-class performance for resource-constrained environments.


Evaluation Benchmarks

Rank

#109

Benchmark · Score · Rank

General Knowledge

MMLU · 0.707 · 29

Rankings

Overall Rank

#109

Coding Rank

-

Model Integrity

Total Score

B+

73 / 100

Ministral 3 3B Model Integrity Report

Audit Note

Ministral 3 3B exhibits strong transparency in its architectural design and licensing, providing a detailed technical report that clarifies its lineage from larger models. While it offers excellent documentation for hardware requirements and parameter counts, it follows industry trends of opacity regarding specific training data sources and total compute expenditures. The model's commitment to a permissive Apache 2.0 license and clear identity consistency makes it a highly trustworthy option for edge deployment.

Upstream

21.5 / 30

Architectural Provenance

8.5 / 10

The model's architecture is extensively documented in the 'Ministral 3' technical report (arXiv:2601.03006). It is explicitly identified as a descendant of Mistral Small 3.1, derived through a 'Cascade Distillation' process involving iterative pruning and continued training. Technical specifications are precise, detailing a 26-layer dense Transformer-based decoder (3.4B parameters) coupled with a frozen 410M parameter Vision Transformer (ViT) encoder from Mistral Small 3.1. Key architectural components such as Grouped Query Attention (GQA), Rotary Positional Embeddings (RoPE) with YaRN, and tied input-output embeddings are clearly defined and justified for the 3B scale.
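For readers unfamiliar with RoPE, the sketch below shows plain rotary embeddings using the head dimension (128) and RoPE theta (1,000,000) from the specification table. It does not implement YaRN or the softmax temperature scaling, and the adjacent-pair rotation layout is an assumption; real implementations often pair dimensions differently. All function names are illustrative.

```python
import math

def rope_inverse_frequencies(head_dim=128, theta=1_000_000.0):
    # One frequency per pair of dimensions: theta^(-2i / head_dim).
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(vec, position, inv_freq):
    # Rotate each adjacent pair (x, y) by the angle position * inv_freq[i].
    out = []
    for i, freq in enumerate(inv_freq):
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

inv_freq = rope_inverse_frequencies()
q = [0.1 * (i % 7) for i in range(128)]   # arbitrary example query vector
k = [0.05 * (i % 5) for i in range(128)]  # arbitrary example key vector

# Same relative offset (3) at two different absolute positions.
near = dot(apply_rope(q, 10, inv_freq), apply_rope(k, 7, inv_freq))
far = dot(apply_rope(q, 1010, inv_freq), apply_rope(k, 1007, inv_freq))
print(abs(near - far))
```

The two attention scores agree up to floating-point error, which illustrates the property that makes RoPE attractive: the query-key dot product depends on the relative offset between positions rather than on their absolute values.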

Dataset Composition

4.0 / 10

While the training methodology (Cascade Distillation) is well-documented, the actual composition of the pretraining dataset remains vague. The technical report mentions training on 1 to 3 trillion tokens but does not provide a specific breakdown of data sources (e.g., percentages of web, code, or books). It refers to 'diverse open and proprietary sources' and 'high-quality multimodal data' without naming specific datasets. Filtering and cleaning procedures are mentioned as a 'rigorous multi-step pipeline' but lack granular detail beyond general categories of removed content.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly accessible via the 'mistral-common' GitHub repository and Hugging Face. It uses a vocabulary size of 131,072 tokens, which is explicitly stated in both the technical report and the model configuration files. The tokenizer is shared across the Ministral 3 family, ensuring consistency, and supports dozens of languages as claimed. Documentation for the 'MistralCommonBackend' provides clear implementation details for developers.

Model

28.5 / 40

Parameter Density

8.5 / 10

The model's parameter counts are clearly and accurately disclosed. It is a dense model (not MoE), with a total of approximately 3.8B parameters, broken down into a 3.4B parameter language decoder and a 410M parameter vision encoder. The use of tied embeddings in the 3B variant is specifically noted as a design choice to manage parameter density. This transparency prevents the common confusion between total and active parameters often found in sparse architectures.
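The 3.4B decoder figure can be cross-checked against the specification table. The sketch below counts parameters under several assumptions that the page does not state: bias-free linear layers, separate Q/K/V/O projections, three FFN matrices for SwiGLU (gate, up, down), and one weight vector per normalization layer. The function name and defaults are illustrative.

```python
def decoder_params(hidden=3072, layers=26, q_heads=32, kv_heads=8,
                   head_dim=128, ffn=9216, vocab=131_072, tied=True):
    """Rough decoder parameter count (assumes no biases)."""
    q_dim = q_heads * head_dim     # 4096
    kv_dim = kv_heads * head_dim   # 1024
    # Q and O projections use the query width; K and V use the smaller KV width.
    attention = hidden * q_dim + 2 * hidden * kv_dim + q_dim * hidden
    feed_forward = 3 * hidden * ffn   # gate, up, and down projections
    norms = 2 * hidden                # pre-attention and pre-FFN norm weights
    per_layer = attention + feed_forward + norms
    embedding = vocab * hidden
    # Tied embeddings reuse the input matrix as the output projection.
    output_head = 0 if tied else vocab * hidden
    return layers * per_layer + hidden + embedding + output_head

tied = decoder_params(tied=True)
untied = decoder_params(tied=False)
print(f"tied:   {tied / 1e9:.2f}B")
print(f"untied: {untied / 1e9:.2f}B")
print(f"saved by tying: {(untied - tied) / 1e6:.0f}M")
```

Under these assumptions the tied count comes to about 3.43B, in line with the stated 3.4B decoder, and tying the embeddings saves one vocabulary-by-hidden matrix of roughly 403M parameters.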

Training Compute

4.0 / 10

Mistral AI discloses that the models were trained on NVIDIA Hopper GPUs (specifically H200s) as part of a partnership with NVIDIA. However, the total compute budget in terms of GPU-hours is not provided. While the 'Cascade Distillation' method is highlighted as being more compute-efficient than training from scratch, specific metrics regarding energy consumption, carbon footprint, or the exact duration of the training runs are absent from the public documentation.

Benchmark Reproducibility

6.5 / 10

The technical report provides comprehensive benchmark results across standard datasets (MMLU, MATH, GPQA, etc.) and compares them against size-matched competitors like Qwen 3 and Gemma 3. While evaluation settings (e.g., 5-shot, CoT) are specified, the exact evaluation code and full prompt sets are not fully public, though some third-party verification (e.g., LMArena, independent reviewers) is available. The report acknowledges the use of an 'internal harness' for some comparisons, which limits full independent reproducibility.

Identity Consistency

9.5 / 10

The model demonstrates high identity consistency, correctly identifying itself as a Mistral AI model and distinguishing between its variants (Base, Instruct, Reasoning). Versioning is clear with the '2512' (December 2025) suffix used in official naming conventions. There are no reported instances of the model claiming to be a competitor's product or misrepresenting its 3B-scale capabilities.

Downstream

23.0 / 30

License Clarity

10.0 / 10

The model is released under the Apache 2.0 license, which is a standard, highly permissive open-source license. This allows for commercial use, modification, and distribution without the restrictive 'non-commercial' or 'research-only' clauses found in some other 'open' models. The license terms are clearly stated on the Hugging Face model card, the official blog, and within the repository.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented for various deployment scenarios. Official documentation and community guides (e.g., vLLM, Ollama) provide VRAM estimates for FP16 (approx. 8.5GB) and quantized versions (under 6GB for Q8/Q4). The impact of the 256k context window on memory scaling is also addressed, with recommendations for specific GPUs (e.g., RTX 3060, RTX 4000 Ada) provided in technical guides.
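The weight portion of these VRAM figures follows directly from parameter count and precision. The sketch below uses the approximately 3.8B total parameters stated above and assumes idealized bit widths (16, 8, and 4 bits per weight), ignoring quantization scale factors, activations, and the KV cache; the names are illustrative.

```python
TOTAL_PARAMS = 3_800_000_000  # approximate total from the text (decoder + ViT)

# Idealized bits per weight; real quantized formats add some overhead.
BITS_PER_WEIGHT = {"fp16": 16, "q8": 8, "q4": 4}

def weight_bytes(num_params, precision):
    """Memory needed for the weights alone at the given precision."""
    return num_params * BITS_PER_WEIGHT[precision] / 8

for precision in BITS_PER_WEIGHT:
    gb = weight_bytes(TOTAL_PARAMS, precision) / 1e9
    print(f"{precision}: {gb:.1f} GB for weights")
```

This gives 7.6 GB at FP16, 3.8 GB at 8 bits, and 1.9 GB at 4 bits for the weights alone. The difference between 7.6 GB and the quoted 8.5GB estimate is presumably runtime overhead, activations, and a short-context KV cache, though the page does not break it down.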

Versioning Drift

5.0 / 10

Mistral AI maintains a general changelog for its API and model releases, and the use of date-based versioning (e.g., '2512') provides some tracking. However, detailed changelogs for specific weight updates or fine-tuning iterations are less granular. While the transition from the previous 'Ministral 2410' to the '2512' version is documented, there is limited information on how minor updates or safety alignment changes might affect performance drift over time.


Ministral 3 3B: Specifications and GPU VRAM Requirements