
Mistral Large 3

Active Parameters

41B

Context Length

256K

Modality

Multimodal

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

2 Dec 2025

Knowledge Cutoff

Oct 2024

Technical Specifications

Attention

Attention Structure

Multi-Head Attention

Attention Heads

96

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

Absolute Position Embedding

RoPE Theta

1,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

12,288

Number of Layers

88

FFN Intermediate Size (Dense)

28,672

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

32,768

Mixture of Experts

Total Expert Parameters

675.0B

Number of Experts

16

Active Experts

2

Shared Experts

-

FFN Intermediate Size (per Expert)

-

Dense Layers Before MoE

-

Architecture Diagram

Input Tokens → Token Embedding (Position: Absolute; Hidden: 12.3k · Context: 256k · Vocab: 32.8k) → [x 88 layers: RMSNorm (Pre-Attention) → Multi-Head Attention (96 Q / 8 KV heads; Head dim: 128) → (+) → RMSNorm (Pre-FFN) → Sparse MoE FFN (2/16 experts; SwiGLU) → (+)] → Final RMSNorm → Output Logits

Mistral Large 3

Mistral Large 3 is a high-capacity, general-purpose multimodal foundation model in the Mistral AI lineage. It targets complex enterprise workflows and production-grade assistant tasks, and it integrates native vision capabilities within a unified architecture. It is designed to serve as the core model for retrieval-augmented generation (RAG) and agentic systems, with native support for function calling and structured JSON output. This instruct-tuned variant has been post-trained for close adherence to system prompts and reliable instruction following across conversational contexts.

Mistral Large 3 is built on a granular sparse Mixture-of-Experts (MoE) architecture that decouples total parameter capacity from inference-time compute. A gating network routes each token to a subset of experts, so the model holds 675 billion parameters in total while activating only about 41 billion per token. Together with an integrated 2.5-billion-parameter vision encoder, this lets the model process visual and textual inputs together. The model was trained on a cluster of 3,000 NVIDIA H200 GPUs, supports a 256,000-token context window, and includes optimizations for NVIDIA Blackwell and Hopper hardware.
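The routing idea described above can be illustrated with a minimal top-k gating sketch. This is not Mistral's router: the function names, the toy experts, and the choice to apply softmax over only the selected logits are assumptions made for illustration. The 16-expert, 2-active configuration is taken from the specification table on this page.

```python
import math

def top_k_gate(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Pick the k experts with the largest router logits (ties keep index order).
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs for a single token."""
    out = [0.0] * len(x)
    for idx, weight in top_k_gate(router_logits, k):
        y = experts[idx](x)  # only the k chosen experts are evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Hypothetical toy setup: 16 experts, expert i scales its input by (i + 1).
NUM_EXPERTS = 16
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]
logits = [0.0] * NUM_EXPERTS
logits[3], logits[7] = 2.0, 2.0  # two experts share the top score

routes = top_k_gate(logits, k=2)
output = moe_forward([1.0, 2.0], experts, logits, k=2)
print(routes, output)
```

Only two of the sixteen expert functions run per token, which is how a sparse model keeps per-token compute well below what its total parameter count would suggest.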

Mistral Large 3 supports the FP8 and NVFP4 quantization formats for large-scale deployment. These formats allow the model to be served on a single 8-GPU node, such as 8xH200 or 8xH100, where a model of this size would otherwise require multi-node infrastructure. The model supports over 40 languages and performs well in non-English conversation. This suits global enterprises that need a single open-weight model for document understanding, code generation, and complex logical reasoning.
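A rough back-of-the-envelope check shows why the quantization format decides whether the weights fit on one node. The sketch below covers weights only. Its bytes-per-parameter values (which ignore NVFP4 block-scale overhead), the BF16 baseline, and the per-GPU memory figures of 80 GB for H100 and 141 GB for H200 are assumptions, and it leaves out KV cache, activations, and runtime overhead.

```python
TOTAL_PARAMS = 675e9  # total parameter count quoted in the text

# Assumed storage cost per weight; NVFP4 scale-factor overhead is ignored.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

# Assumed aggregate memory (GB) of an 8-GPU node, from public product specs.
NODE_MEMORY_GB = {"8xH100": 8 * 80, "8xH200": 8 * 141}

def weight_memory_gb(total_params, bytes_per_param):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return total_params * bytes_per_param / 1e9

def fits(fmt, node):
    """True if the weights alone fit in the node's aggregate GPU memory."""
    return weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM[fmt]) < NODE_MEMORY_GB[node]

for fmt in BYTES_PER_PARAM:
    gb = weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM[fmt])
    verdict = {node: fits(fmt, node) for node in NODE_MEMORY_GB}
    print(f"{fmt}: {gb:.1f} GB of weights, fits: {verdict}")
```

Under these assumptions, FP8 weights fit on an 8xH200 node but not on 8xH100, while NVFP4 weights fit on both. Real deployments need additional headroom beyond the weights.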

About Mistral Large 3

Mistral Large 3 is a state-of-the-art general-purpose multimodal model with a granular Mixture-of-Experts architecture. With 675B total parameters and 41B active parameters, it delivers frontier performance for production-grade assistants, retrieval-augmented systems, and complex enterprise workflows.



Evaluation Benchmarks

Rank

#84

Category                 Benchmark       Score    Rank
-                        -               0.516    23
Professional Knowledge   MMLU Pro        0.80     37
Web Development          WebDev Arena    1222     77

Rankings

Overall Rank

#84

Coding Rank

#107

Model Integrity

Total Score

B

68 / 100

Mistral Large 3 Model Integrity Report


Audit Note

Mistral Large 3 demonstrates strong transparency in its architectural specifications and licensing, providing clear distinctions between total and active parameters for its MoE structure. The model's permissive Apache 2.0 license and detailed hardware deployment guidelines for various quantization formats are major strengths. However, the total lack of training data disclosure and the absence of a reproducible evaluation framework represent significant transparency gaps common to frontier-class models.

Upstream

17.0 / 30

Architectural Provenance

7.0 / 10

Mistral Large 3 is explicitly documented as a granular sparse Mixture-of-Experts (MoE) model. Official technical blog posts and model cards confirm it was trained from scratch using 3,000 NVIDIA H200 GPUs. The architecture includes a 2.5B parameter integrated vision encoder. While the high-level methodology is clear, a full peer-reviewed technical paper with exhaustive architectural hyperparameters (e.g., specific layer dimensions beyond total counts) is not publicly available, preventing a higher score.

Dataset Composition

2.0 / 10

Mistral AI provides almost no specific information about the training data. Official documentation mentions only 'massive multilingual text corpora' and 'image-text pairs' for the multimodal capabilities. There is no disclosure of data sources, percentage breakdowns (e.g., web vs. code), or filtering and cleaning methodologies. This follows the 'proprietary' approach common to frontier models but fails the transparency criteria.

Tokenizer Integrity

8.0 / 10

The tokenizer is publicly accessible via the 'mistral-common' GitHub repository and Hugging Face. It uses a vocabulary size of 131,072 (2^17), which is a significant expansion from earlier 32k versions. The 'v3' tokenizer supports function calling and special control tokens. Documentation exists within the Mistral AI Cookbook, though the specific alignment between the training data distribution and the tokenizer's vocabulary is not fully detailed.

Model

27.0 / 40

Parameter Density

9.0 / 10

Mistral is highly transparent about its MoE parameters. It explicitly states a total of 675 billion parameters with approximately 41 billion active parameters per token. The breakdown between the language backbone (673B total / 39B active) and the vision encoder (2.5B) is clearly provided in official model cards. This level of detail for a sparse architecture is exemplary.

Training Compute

4.0 / 10

The disclosure names the training hardware (3,000 NVIDIA H200 GPUs) but gives neither the total GPU-hours nor the total FLOPs of the full training run. While Mistral has published a general environmental report for a previous model (Large 2), a specific lifecycle analysis or carbon footprint calculation for Large 3 is not yet available. Cost estimates are provided only for API usage, not for the training phase.

Benchmark Reproducibility

5.0 / 10

Mistral provides scores for standard benchmarks like MMLU (85.5%), MMLU-Pro (low 80s), and GPQA Diamond (43.9%). However, the evaluation code and exact prompts used to achieve these scores are not fully public. While third-party verification is available through the LMSYS Chatbot Arena (Elo ~1418), the lack of a reproducible evaluation suite or specific few-shot examples in official documentation limits the score.

Identity Consistency

9.0 / 10

The model consistently identifies as Mistral Large 3 across API calls and system prompts. It maintains a clear versioning identity (mistral-large-2512) and is transparent about its nature as an AI and its limitations, such as not being a 'dedicated reasoning model' compared to specialized variants. No significant identity confusion or misrepresentation was found in technical documentation.
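A dated identifier such as mistral-large-2512 can be split into a family name and a release month. The sketch below assumes the trailing four digits encode a two-digit year and month (YYMM), matching the report's reading of 2512 as December 2025. The function name and the regular expression are illustrative and are not part of any Mistral tooling.

```python
import re

def parse_version_tag(model_id):
    """Split an ID like 'mistral-large-2512' into (family, year, month).

    Assumes the ID ends in '-YYMM'; this convention is an assumption here.
    """
    match = re.fullmatch(r"(?P<family>.+)-(?P<yy>\d{2})(?P<mm>\d{2})", model_id)
    if match is None:
        raise ValueError(f"no YYMM suffix in {model_id!r}")
    month = int(match.group("mm"))
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in {model_id!r}")
    return match.group("family"), 2000 + int(match.group("yy")), month

parsed = parse_version_tag("mistral-large-2512")
print(parsed)
```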

Downstream

24.0 / 30

License Clarity

10.0 / 10

Mistral Large 3 is released under the Apache 2.0 license, a widely used permissive open-source license. It allows commercial use, modification, and redistribution, subject to the license's attribution and notice conditions. The license is clearly stated on Hugging Face, the official blog, and in the model weights repository, with no conflicting terms found.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented for various deployment scenarios. Mistral provides specific guidance for FP8 (single node of B200/H200) and NVFP4 (single node of H100/A100) serving. Documentation notes memory scaling for the 256k context window and provides warnings about performance degradation in NVFP4 at context lengths exceeding 64k. Quantized checkpoints are officially provided in collaboration with vLLM.
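The memory scaling with context length can be estimated from the attention figures in the specification table on this page (88 layers, 8 key-value heads, head dimension 128). The sketch assumes a conventional cache that stores one key and one value vector per KV head per layer, 16-bit cache entries, and 256K meaning 262,144 tokens. It covers a single sequence, and the model's actual cache layout may differ.

```python
def kv_cache_bytes(tokens, layers=88, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes of KV cache for one sequence of the given length.

    The factor 2 covers the separate key and value tensors in each layer.
    Default values come from the spec table; 2-byte entries are an assumption.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(256 * 1024)

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"256K tokens: {full_context / 2**30:.0f} GiB")
```

Under these assumptions the cache grows linearly, at roughly 352 KiB per token, so long contexts add a large memory cost on top of the weights.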

Versioning Drift

6.0 / 10

Mistral uses date-based version tags (e.g., 2512 for December 2025). A public changelog is maintained on the official documentation site, tracking model releases and API updates. However, 'behavior drift' and weight-level changes between minor updates are not documented in detail, and previous versions are accessible mainly through specific dated tags rather than a comprehensive historical archive.
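The section totals and the overall score in this report can be cross-checked by summing the per-criterion scores. The sketch below copies the scores from the report. The dictionary layout and function name are illustrative, and the scoring methodology is assumed to be a plain sum.

```python
# Per-criterion scores as listed in the report (each out of 10).
SCORES = {
    "Upstream": {
        "Architectural Provenance": 7.0,
        "Dataset Composition": 2.0,
        "Tokenizer Integrity": 8.0,
    },
    "Model": {
        "Parameter Density": 9.0,
        "Training Compute": 4.0,
        "Benchmark Reproducibility": 5.0,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 10.0,
        "Hardware Footprint": 8.0,
        "Versioning Drift": 6.0,
    },
}

def section_totals(scores):
    """Sum the criterion scores within each section."""
    return {section: sum(items.values()) for section, items in scores.items()}

totals = section_totals(SCORES)
overall = sum(totals.values())
print(totals, overall)
```

The sums reproduce the reported 17.0 / 30, 27.0 / 40, and 24.0 / 30 section scores and the overall 68 / 100.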


Mistral Large 3: Specifications and GPU VRAM Requirements