| Specification | Value |
|---|---|
| Active Parameters | 41B |
| Context Length | 256K |
| Modality | Multimodal |
| Architecture | Mixture of Experts (MoE) |
| License | Apache 2.0 |
| Release Date | 2 Dec 2025 |
| Knowledge Cutoff | Oct 2024 |
| Total Expert Parameters | 675B |
| Number of Experts | 16 |
| Active Experts | 2 |
| Attention Structure | Multi-Head Attention |
| Hidden Dimension Size | 12288 |
| Number of Layers | 88 |
| Attention Heads | 96 |
| Key-Value Heads | 8 |
| Activation Function | SwiGLU |
| Normalization | RMS Normalization |
| Position Embedding | Absolute Position Embedding |
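A couple of figures implied by the table can be derived directly, such as the per-head dimension and the KV-cache footprint at the full context length. The sketch below is a back-of-the-envelope estimate assuming a 16-bit KV cache, not a vendor figure.

```python
# Derived figures from the spec table above (assumption: 16-bit KV cache).
hidden_size, n_heads, n_kv_heads, n_layers = 12288, 96, 8, 88
context_len, bytes_per_elem = 256 * 1024, 2          # tokens, BF16 bytes per element

head_dim = hidden_size // n_heads                    # 12288 / 96 = 128
kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K + V across all layers

print(f"head dim: {head_dim}")                                          # 128
print(f"KV cache per token: {kv_per_token / 1024:.0f} KiB")             # ~352 KiB
print(f"KV cache at 256K context: {kv_per_token * context_len / 1e9:.1f} GB")  # ~94.5 GB
```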
Mistral Large 3 represents a significant evolution in the Mistral AI model lineage, specifically engineered as a high-capacity, general-purpose multimodal foundation model. Built to handle complex enterprise workflows and production-grade assistant tasks, the model integrates native vision capabilities within a unified architecture. It is designed to operate as a central engine for retrieval-augmented generation (RAG) and sophisticated agentic systems, offering native support for function calling and structured JSON output. This instruct-tuned variant has been refined through post-training to ensure high adherence to system prompts and reliable instruction-following across diverse conversational contexts.
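As a sketch of how the function-calling support described above is typically exercised, the snippet below sends a tool-augmented chat request through an OpenAI-compatible client. The endpoint URL, model identifier, and `get_order_status` tool are illustrative placeholders (e.g. a local vLLM server), not values taken from Mistral's documentation.

```python
from openai import OpenAI

# Placeholder endpoint and credentials: point these at wherever the model is served.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",          # hypothetical tool, for illustration only
        "description": "Look up the status of a customer order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="mistral-large-3",                 # placeholder model id
    messages=[
        {"role": "system", "content": "You are an order-tracking assistant."},
        {"role": "user", "content": "Where is order 48-271?"},
    ],
    tools=tools,
    tool_choice="auto",
)

print(resp.choices[0].message.tool_calls)    # the model emits a structured function call
```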
The technical foundation of Mistral Large 3 is a granular sparse Mixture-of-Experts (MoE) architecture that decouples total parameter capacity from inference-time computational cost. A gating network routes each token to a small subset of experts, so the model retains a total of 675 billion parameters for knowledge storage while activating only approximately 41 billion parameters per token. Combined with a 2.5-billion-parameter integrated vision encoder, this approach allows the model to process visual and textual data simultaneously. The model was trained on a cluster of 3,000 NVIDIA H200 GPUs, supports a 256,000-token context window, and includes optimizations for modern hardware targets such as the NVIDIA Blackwell and Hopper architectures.
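To make the routing step concrete, here is a minimal top-2-of-16 gating sketch in PyTorch. The dimensions mirror the spec table (hidden size 12288, 16 experts, 2 active), but this is an illustrative reconstruction of the general MoE gating pattern, not Mistral's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2Router(nn.Module):
    """Illustrative top-2-of-16 gating network (sizes from the spec table, not Mistral's code)."""
    def __init__(self, hidden_dim=12288, num_experts=16, top_k=2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                              # x: (num_tokens, hidden_dim)
        logits = self.gate(x)                          # (num_tokens, 16) routing scores
        weights, expert_ids = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # renormalise over the 2 selected experts
        return expert_ids, weights                     # which experts fire, and their mixing weights

router = Top2Router()
tokens = torch.randn(4, 12288)
ids, w = router(tokens)
print(ids.shape, w.shape)                              # torch.Size([4, 2]) torch.Size([4, 2])
```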
From an operational perspective, Mistral Large 3 is built for large-scale deployment through support for high-efficiency quantization formats such as FP8 and NVFP4. These optimizations allow a model of this magnitude to be served on a single-node GPU configuration, such as an 8xH200 or 8xH100 setup, where multi-node infrastructure would traditionally be required. The model demonstrates extensive multilingual capability, supporting over 40 languages and performing strongly in non-English conversation. This makes it an effective choice for global enterprises that need a single, high-intelligence model for document understanding, code generation, and complex logical reasoning within a unified, open-weight framework.
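A rough arithmetic check of the single-node claim: at one byte per parameter (FP8) the 675B weights occupy about 675 GB, which fits inside the roughly 1,128 GB of combined HBM on an 8xH200 node (141 GB per GPU), whereas BF16 weights would not. The sketch below encodes that estimate; it ignores the KV cache, activations, and framework overhead.

```python
# Back-of-the-envelope weight-memory estimate (assumptions: 1 byte/param for FP8,
# 0.5 bytes/param for NVFP4, 141 GB of HBM per H200; KV cache and activations excluded).
TOTAL_PARAMS = 675e9
GPU_MEM_GB = 141
NUM_GPUS = 8

for name, bytes_per_param in [("BF16", 2.0), ("FP8", 1.0), ("NVFP4", 0.5)]:
    weights_gb = TOTAL_PARAMS * bytes_per_param / 1e9
    fits = weights_gb < GPU_MEM_GB * NUM_GPUS
    print(f"{name}: ~{weights_gb:,.0f} GB of weights -> "
          f"{'fits on' if fits else 'exceeds'} an 8xH200 node ({GPU_MEM_GB * NUM_GPUS} GB)")
```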
Mistral Large 3 is a state-of-the-art general-purpose multimodal model with a granular Mixture-of-Experts architecture. With 675B total parameters and 41B active parameters, it delivers frontier performance for production-grade assistants, retrieval-augmented systems, and complex enterprise workflows.
| Benchmark | Score | Rank |
|---|---|---|
| WebDev Arena (Web Development) | 1414 | #15 |

Overall Rank: #17
Coding Rank: #23