Active Parameters
41B
Context Length
256K
Modality
Multimodal
Architecture
Mixture of Experts (MoE)
License
Apache 2.0
Release Date
2 Dec 2025
Knowledge Cutoff
Oct 2024
Attention
Attention Structure
Multi-Head Attention
Attention Heads
96
Key-Value Heads
8
Attention Head Dimension
128
Position Embedding
Absolute Position Embedding
RoPE Theta
1,000,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwigLU
Dimensions
Hidden Dimension Size
12,288
Number of Layers
88
FFN Intermediate Size (Dense)
28,672
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
32,768
Mixture of Experts
Total Expert Parameters
675.0B
Number of Experts
16
Active Experts
2
Shared Experts
-
FFN Intermediate Size (per Expert)
-
Dense Layers Before MoE
-
Mistral Large 3 represents a significant evolution in the Mistral AI model lineage, specifically engineered as a high-capacity, general-purpose multimodal foundation model. Built to handle complex enterprise workflows and production-grade assistant tasks, the model integrates native vision capabilities within a unified architecture. It is designed to operate as a central engine for retrieval-augmented generation (RAG) and sophisticated agentic systems, offering native support for function calling and structured JSON output. This instruct-tuned variant has been refined through post-training to ensure high adherence to system prompts and reliable instruction-following across diverse conversational contexts.
The technical foundation of Mistral Large 3 is a granular sparse Mixture-of-Experts (MoE) architecture that decouples total parameter capacity from inference-time computational cost. By utilizing a gating network to route tokens to a specific subset of experts, the model maintains a total of 675 billion parameters for expansive knowledge storage while activating only approximately 41 billion parameters per token. This architectural approach, combined with a 2.5 billion parameter integrated vision encoder, allows the model to process visual and textual data simultaneously. The training process utilized a massive cluster of 3,000 NVIDIA H200 GPUs, resulting in a model that supports a 256,000-token context window and advanced optimizations for modern hardware targets like NVIDIA Blackwell and Hopper architectures.
From an operational perspective, Mistral Large 3 provides versatility for large-scale deployments through support for high-efficiency quantization formats such as FP8 and NVFP4. These optimizations enable the serving of a model of this magnitude on single-node GPU configurations, such as an 8xH200 or 8xH100 setup, which traditionally would require multi-node infrastructure. The model demonstrates extensive multilingual capabilities, supporting over 40 languages and excelling in non-English conversational performance. This makes it an effective solution for global enterprises requiring a single, high-intelligence model capable of managing document understanding, code generation, and complex logical reasoning within a unified, open-weight framework.
Mistral Large 3 is a state-of-the-art general-purpose multimodal model with a granular Mixture-of-Experts architecture. With 675B total parameters and 41B active parameters, it delivers frontier performance for production-grade assistants, retrieval-augmented systems, and complex enterprise workflows.
Rank
#84
| Benchmark | Score | Rank |
|---|---|---|
StackUnseen ProLLM Stack Unseen | 0.516 | 23 |
Professional Knowledge MMLU Pro | 0.80 | 37 |
Web Development WebDev Arena | 1222 | 77 |
Overall Rank
#84
Coding Rank
#107
Total Score
68
/ 100
Mistral Large 3 demonstrates strong transparency in its architectural specifications and licensing, providing clear distinctions between total and active parameters for its MoE structure. The model's permissive Apache 2.0 license and detailed hardware deployment guidelines for various quantization formats are major strengths. However, the total lack of training data disclosure and the absence of a reproducible evaluation framework represent significant transparency gaps common to frontier-class models.
Architectural Provenance
Mistral Large 3 is explicitly documented as a granular sparse Mixture-of-Experts (MoE) model. Official technical blog posts and model cards confirm it was trained from scratch using 3,000 NVIDIA H200 GPUs. The architecture includes a 2.5B parameter integrated vision encoder. While the high-level methodology is clear, a full peer-reviewed technical paper with exhaustive architectural hyperparameters (e.g., specific layer dimensions beyond total counts) is not publicly available, preventing a higher score.
Dataset Composition
Mistral AI provides almost no specific information regarding the training data. Official documentation only mentions it is a 'massive multilingual text corpora' and includes 'image-text pairs' for its multimodal capabilities. There is no disclosure of data sources, specific percentage breakdowns (e.g., web vs. code), or detailed filtering and cleaning methodologies. This follows a 'proprietary' approach common to frontier models but fails the transparency criteria.
Tokenizer Integrity
The tokenizer is publicly accessible via the 'mistral-common' GitHub repository and Hugging Face. It uses a vocabulary size of 131,072 (2^17), which is a significant expansion from earlier 32k versions. The 'v3' tokenizer supports function calling and special control tokens. Documentation exists within the Mistral AI Cookbook, though the specific alignment between the training data distribution and the tokenizer's vocabulary is not fully detailed.
Parameter Density
Mistral is highly transparent about its MoE parameters. It explicitly states a total of 675 billion parameters with approximately 41 billion active parameters per token. The breakdown between the language backbone (673B total / 39B active) and the vision encoder (2.5B) is clearly provided in official model cards. This level of detail for a sparse architecture is exemplary.
Training Compute
The model disclosure includes the hardware used (3,000 NVIDIA H200 GPUs) but lacks the total GPU-hours or TFLOPS required for the full training run. While Mistral has published a general environmental report for previous models (Large 2), a specific lifecycle analysis or carbon footprint calculation for Large 3 is not yet available. Cost estimates are only provided for API usage, not the training phase.
Benchmark Reproducibility
Mistral provides scores for standard benchmarks like MMLU (85.5%), MMLU-Pro (low 80s), and GPQA Diamond (43.9%). However, the evaluation code and exact prompts used to achieve these scores are not fully public. While third-party verification is available through the LMSYS Chatbot Arena (Elo ~1418), the lack of a reproducible evaluation suite or specific few-shot examples in official documentation limits the score.
Identity Consistency
The model consistently identifies as Mistral Large 3 across API calls and system prompts. It maintains a clear versioning identity (mistral-large-2512) and is transparent about its nature as an AI and its limitations, such as not being a 'dedicated reasoning model' compared to specialized variants. No significant identity confusion or misrepresentation was found in technical documentation.
License Clarity
Mistral Large 3 is released under the Apache 2.0 license, which is the gold standard for permissive open-source licensing. This allows for unrestricted commercial use, modification, and redistribution. The license is clearly stated on Hugging Face, the official blog, and in the model weights repository, with no conflicting terms found.
Hardware Footprint
Hardware requirements are well-documented for various deployment scenarios. Mistral provides specific guidance for FP8 (single node of B200/H200) and NVFP4 (single node of H100/A100) serving. Documentation notes memory scaling for the 256k context window and provides warnings about performance degradation in NVFP4 at context lengths exceeding 64k. Quantized checkpoints are officially provided in collaboration with vLLM.
Versioning Drift
Mistral uses date-based semantic versioning (e.g., 2512 for December 2025). A public changelog is maintained on the official documentation site, tracking model releases and API updates. However, detailed documentation of 'behavior drift' or specific weight-level changes between minor updates is less transparent, and previous versions are primarily accessible through specific dated tags rather than a comprehensive historical archive.
Full Calculator
Choose the quantization method for model weights
Context Size: 1,024 tokens
APX AI
Online