Parameters
14B
Context Length
256K
Modality
Multimodal
Architecture
Dense
License
Apache 2.0
Release Date
2 Dec 2025
Knowledge Cutoff
Jun 2025
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
32
Key-Value Heads
8
Attention Head Dimension
128
Position Embedding
Rotary Position Embedding (RoPE)
RoPE Theta
1,000,000,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
5,120
Number of Layers
40
FFN Intermediate Size (Dense)
16,384
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
131,072
Ministral 3 14B is a dense, multimodal transformer model from Mistral AI that aims to combine edge-efficient deployment with frontier-class capability. As the largest member of the Ministral 3 family, it is trained with a Cascade Distillation strategy, in which knowledge is progressively transferred from larger parent models, such as Mistral Small 3.1, into a more compact 14-billion-parameter footprint. The architecture pairs a 13.5-billion-parameter decoder-only language core with a frozen 410-million-parameter Vision Transformer (ViT) encoder, enabling the model to process interleaved image and text inputs.
The model has 40 transformer layers and a hidden dimension of 5,120, and uses Grouped Query Attention (GQA) with 32 query heads and 8 key-value heads to reduce key-value cache size and memory traffic during inference. It also uses RMSNorm for normalization, SwiGLU activation functions in the feed-forward layers, and Rotary Positional Embeddings (RoPE) extended with YaRN scaling. Together these support a context window of 256,000 tokens, large enough for large document sets or long multi-turn agentic workflows.
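To make the memory effect of GQA concrete, the following is a minimal sketch that estimates key-value cache size from the figures listed above (40 layers, 8 key-value heads, head dimension 128). The function name and the 2-bytes-per-value (BF16/FP16 cache) default are assumptions for illustration; real inference engines may use paged or quantized caches and will differ.

```python
def kv_cache_bytes(num_tokens, num_layers=40, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Estimate KV cache size in bytes for one sequence.

    Assumes one key vector and one value vector per KV head, per layer,
    per token, stored at `bytes_per_value` bytes each (2 = BF16/FP16).
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


# Per-token cost with 8 KV heads (GQA) vs. a hypothetical 32 KV heads (MHA).
gqa_per_token = kv_cache_bytes(1)
mha_per_token = kv_cache_bytes(1, num_kv_heads=32)

# Full 256,000-token context, expressed in GiB.
full_context_gib = kv_cache_bytes(256_000) / 2**30

print(gqa_per_token, mha_per_token, full_context_gib)
```

Under these assumptions each token costs 163,840 bytes of cache, a quarter of what 32 key-value heads would need, and a full 256,000-token sequence needs about 39 GiB for the cache alone, on top of the model weights.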
Designed for sophisticated automation and private AI deployments, Ministral 3 14B excels in agentic tasks through native support for function calling and structured JSON outputs. Its training emphasizes efficiency and versatility, providing robust multilingual capabilities across more than 40 languages and high-tier performance in reasoning-heavy domains like mathematics and coding. By balancing a dense architectural structure with advanced quantization compatibility, the model is optimized for deployment on local workstations and enterprise edge hardware, offering a high-performance alternative to much larger cloud-based systems.
Ministral 3 is a family of efficient edge models with vision capabilities, available in 3B, 8B, and 14B parameter sizes. The models are designed for edge deployment with multimodal and multilingual support and are positioned as best-in-class for resource-constrained environments.
Rank
#80
| Benchmark | Score | Rank |
|---|---|---|
| MMLU (General Knowledge) | 0.794 | 24 |
Overall Rank
#80
Coding Rank
-
Total Score
73 / 100
Ministral 3 14B exhibits strong transparency in its architectural design and licensing, providing a clear lineage and a permissive open-source foundation. While it offers detailed hardware requirements and a well-integrated tokenizer, it remains opaque regarding the specific sources of its training data and the environmental cost of its compute resources. The model's identity and parameter density are clearly defined, though benchmark reproducibility is hampered by the lack of public evaluation code.
Architectural Provenance
The model's architecture is extensively documented in the 'Ministral 3' technical report (arXiv:2601.08584). It explicitly identifies the base model as a descendant of Mistral Small 3.1, derived through a 'Cascade Distillation' process. Technical specifications are precise: 40 transformer layers, a hidden dimension of 5120, and Grouped Query Attention (GQA) with 32 query heads and 8 key-value heads. It also details the integration of a frozen 410M parameter vision encoder from the Pixtral architecture and the use of YaRN for context extension.
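As an illustration of the position-embedding figures listed in the specification (head dimension 128, RoPE theta 1,000,000,000), here is a minimal sketch of the standard RoPE frequency schedule and the rotation it applies to one pair of features. It does not implement YaRN scaling, and the function names are hypothetical. Implementations also differ in which features they pair (adjacent elements versus the two halves of the head), so treat the pairing as an assumption.

```python
import math


def rope_inv_freq(head_dim=128, theta=1_000_000_000.0):
    """Standard RoPE inverse frequencies: theta ** (-2i / head_dim)."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def rotate_pair(x0, x1, position, inv_freq):
    """Rotate one (x0, x1) feature pair by the angle position * inv_freq."""
    angle = position * inv_freq
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return x0 * cos_a - x1 * sin_a, x0 * sin_a + x1 * cos_a


inv_freq = rope_inv_freq()

# The first pair rotates fastest; the last pair rotates slowest, which is
# what lets a large theta distinguish positions far apart in a long context.
fast = rotate_pair(1.0, 0.0, position=100, inv_freq=inv_freq[0])
slow = rotate_pair(1.0, 0.0, position=100, inv_freq=inv_freq[-1])
print(fast, slow)
```

Because each pair is only rotated, vector norms are preserved and the dot product between a rotated query and a rotated key depends on their relative position rather than on absolute positions.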
Dataset Composition
While the technical report describes the training methodology (Cascade Distillation) and the number of tokens (1-3 trillion), it lacks a detailed breakdown of the dataset composition. It mentions using 'open and proprietary sources' and 'question-answer pairs' for post-training but does not provide specific percentages or named sources for the pretraining data. This follows the industry trend of high-level descriptions without granular transparency.
Tokenizer Integrity
The tokenizer is publicly accessible via the 'mistral-common' library (version >= 1.8.6) and is integrated into the Hugging Face transformers ecosystem. The vocabulary size is explicitly stated as 131,072 tokens. Documentation confirms support for over 40 languages and provides clear instructions for implementation in various inference frameworks like vLLM and llama.cpp.
Parameter Density
The documentation gives a transparent breakdown of the model's parameters: roughly 14 billion in total, consisting of a 13.5B language core and a 0.4B vision encoder. As a dense model, all of these parameters are active during inference, which is clearly stated in official documentation and the technical report, avoiding the ambiguity often found in Mixture-of-Experts (MoE) models.
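The 13.5B figure for the language core can be sanity-checked against the dimensions listed in the specification. The sketch below is a back-of-the-envelope count, not the official accounting: it assumes a standard GQA plus SwiGLU decoder block with no bias terms, two RMSNorm weight vectors per layer, and an output projection that is not tied to the input embedding. None of those details are confirmed by the source.

```python
def language_core_params(hidden=5120, layers=40, q_heads=32, kv_heads=8,
                         head_dim=128, ffn=16384, vocab=131072,
                         tied_embeddings=False):
    """Rough parameter count for a dense GQA + SwiGLU decoder (no biases)."""
    q_proj = hidden * q_heads * head_dim
    kv_proj = 2 * hidden * kv_heads * head_dim   # key and value projections
    o_proj = q_heads * head_dim * hidden
    mlp = 3 * hidden * ffn                       # gate, up, and down matrices
    norms = 2 * hidden                           # two RMSNorm weights per layer
    per_layer = q_proj + kv_proj + o_proj + mlp + norms

    embed = vocab * hidden
    lm_head = 0 if tied_embeddings else vocab * hidden
    final_norm = hidden
    return layers * per_layer + embed + lm_head + final_norm


core = language_core_params()
print(f"{core:,}")  # 13,506,073,600 under the stated assumptions
```

Under these assumptions the count lands at about 13.5 billion, consistent with the documented language core; adding the 410M vision encoder gives roughly 13.9 billion, which rounds to the advertised 14B.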
Training Compute
Documentation states the model was trained on NVIDIA Hopper GPUs (H100/H200), but it fails to disclose specific compute metrics such as total GPU hours, energy consumption, or the resulting carbon footprint. While it claims the 'Cascade Distillation' method is compute-efficient compared to training from scratch, it provides no verifiable data to quantify this efficiency or the environmental impact.
Benchmark Reproducibility
Mistral AI provides comprehensive benchmark results (AIME25, GPQA, Arena Hard, etc.) in the technical report and model cards. However, while they specify the versions and some evaluation settings (e.g., pass@k, temperature), the exact prompts and full evaluation code are not consistently provided in a single, reproducible repository, making independent verification of the exact reported scores difficult for the community.
Identity Consistency
The model demonstrates high identity consistency, correctly identifying itself as part of the Ministral 3 family in official documentation and through its metadata. It uses a clear date-based naming convention (e.g., '2512' for the December 2025 release). There are no documented cases of the model claiming to be a competitor's product or misrepresenting its 14B parameter scale.
License Clarity
The model is released under the Apache 2.0 license, which is a standard, highly permissive open-source license. The terms are clear, allowing for both commercial and non-commercial use, modification, and distribution. There are no conflicting 'custom' terms or revenue-based restrictions for this specific model variant, providing maximum legal transparency.
Hardware Footprint
Hardware requirements are well documented across multiple sources, including the official model card and third-party platforms such as NVIDIA NIM and Ollama. The documentation gives VRAM requirements for different precisions: ~32 GB for BF16 and ~24 GB for FP8. It also notes how memory scales with the 256K context window and provides guidance on using quantization (Q4/Q8) to fit the model on consumer hardware.
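For a rough sense of where such VRAM figures come from, the sketch below computes weights-only memory for a 14-billion-parameter model at several precisions. The bytes-per-parameter table is an idealized assumption: real Q4/Q8 formats store extra scale metadata, and actual requirements also include activations, the KV cache, and runtime overhead, which is presumably why the documented figures are higher than these numbers.

```python
# Idealized storage cost per parameter (assumption: no quantization metadata).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "q8": 1.0, "q4": 0.5}


def weight_memory_gb(num_params, precision):
    """Weights-only memory in decimal gigabytes for a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


NUM_PARAMS = 14_000_000_000  # nominal 14B parameter count

for precision in ("bf16", "fp8", "q4"):
    print(precision, weight_memory_gb(NUM_PARAMS, precision))
```

This gives 28 GB for BF16, 14 GB for FP8, and 7 GB for an idealized 4-bit format, so the weights-only estimate should be read as a lower bound when sizing hardware.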
Versioning Drift
Mistral maintains a public changelog for its API and model releases, and the model uses a date-based versioning suffix ('2512'). However, there is limited documentation regarding performance drift or specific 'alignment tax' impacts over time. While the initial release is well-defined, the long-term tracking of behavioral changes for this specific variant is not yet established in a detailed, public-facing version history.