Mistral Large 3: Specifications and GPU VRAM Requirements

Open Source

Open Weights

Active Parameters	41B
Context Length	256K
Modality	Multimodal
Architecture	Mixture of Experts (MoE)
License	Apache 2.0
Release Date	2 Dec 2025
Knowledge Cutoff	Oct 2024

Technical Specifications

Total Expert Parameters	675.0B
Number of Experts
Active Experts
Attention Structure	Multi-Head Attention
Hidden Dimension Size	12288
Number of Layers
Attention Heads
Key-Value Heads
Activation Function	SwiGLU
Normalization	RMS Normalization
Position Embedding	Absolute Position Embedding

Mistral Large 3

Mistral Large 3 is Mistral AI's high-capacity, general-purpose multimodal foundation model. It targets complex enterprise workflows and production-grade assistant tasks, with vision capabilities built into the same architecture that handles text. It is designed to serve as the core model in retrieval-augmented generation (RAG) pipelines and agentic systems, and it natively supports function calling and structured JSON output. The instruct-tuned variant has been post-trained to adhere closely to system prompts and to follow instructions reliably across conversational contexts.
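
To make the function-calling claim concrete, here is a minimal sketch of how an application might consume a structured JSON tool call. The JSON shape (`name` plus `arguments`), the `get_weather` tool, and the `dispatch` helper are all hypothetical and chosen for illustration; they are not taken from Mistral's API.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would query a weather service."""
    return f"Weather for {city}: (stub)"


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"get_weather": get_weather}


def dispatch(model_output: str) -> str:
    """Parse a JSON tool call of the assumed shape and run the matching tool."""
    call = json.loads(model_output)   # raises ValueError on malformed JSON
    name = call["name"]
    args = call["arguments"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**args)


# Example: a tool call as the model might emit it (assumed format).
reply = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
print(dispatch(reply))
```

Keeping the registry explicit means the application, not the model, decides which functions can actually be executed.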

Mistral Large 3 is built on a granular sparse Mixture-of-Experts (MoE) architecture, which decouples total parameter count from per-token compute cost. A gating network routes each token to a small subset of experts, so the model holds 675 billion parameters in total but activates only about 41 billion (roughly 6%) per token. A 2.5-billion-parameter vision encoder is integrated into the model, letting it process images and text together. Training ran on a cluster of 3,000 NVIDIA H200 GPUs. The model supports a 256,000-token context window and includes optimizations for NVIDIA Blackwell and Hopper hardware.
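
The routing idea can be sketched as generic top-k gating. This is a toy illustration, not Mistral's implementation: the expert count, the value of `k`, and the scalar "experts" are assumptions, since the page does not list the number of experts or active experts.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Renormalise the gate weights over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(token, router_logits, experts, k):
    """Weighted sum of the outputs of the k selected experts."""
    return sum(w * experts[i](token) for i, w in route_token(router_logits, k))


# Toy setup (assumed): four scalar "experts", two active per token.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]
print(route_token(logits, k=2))
print(moe_layer(1.0, logits, experts, k=2))

# Share of parameters active per token, from the figures on this page.
print(f"active fraction: {41 / 675:.1%}")
```

Only the selected experts run for a given token, which is why per-token compute tracks the 41B active parameters rather than the 675B total.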

Operationally, Mistral Large 3 supports the FP8 and NVFP4 quantization formats. These make it possible to serve the model on a single GPU node, such as an 8xH200 or 8xH100 system, where a model of this size would otherwise need multi-node infrastructure. The model is also multilingual: it supports over 40 languages and performs strongly in non-English conversation. This makes it suitable for global enterprises that want a single open-weight model for document understanding, code generation, and complex logical reasoning.
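
A rough back-of-the-envelope check shows why quantization matters for single-node serving. The sketch below counts weight memory only. It assumes every one of the 675B parameters is stored at the nominal width of the format, and it ignores KV cache, activations, and the scaling-factor overhead that block formats such as NVFP4 carry. The per-GPU capacities (80 GB for H100, 141 GB for H200) are the standard published figures.

```python
GB = 1e9
TOTAL_PARAMS = 675e9  # total parameter count from the spec card

# Nominal storage width per parameter (assumption: uniform quantization).
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

# Aggregate VRAM of one 8-GPU node, in GB.
NODES = {"8xH100": 8 * 80, "8xH200": 8 * 141}


def weight_gb(total_params, bytes_per_param):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return total_params * bytes_per_param / GB


def fits(fmt, node_gb):
    """True if the weights alone fit; real deployments need extra headroom."""
    return weight_gb(TOTAL_PARAMS, BYTES_PER_PARAM[fmt]) < node_gb


for fmt in BYTES_PER_PARAM:
    size = weight_gb(TOTAL_PARAMS, BYTES_PER_PARAM[fmt])
    verdict = ", ".join(f"{node}: {'fits' if fits(fmt, cap) else 'too large'}"
                        for node, cap in NODES.items())
    print(f"{fmt:>5}: {size:7.1f} GB  ({verdict})")
```

Under these assumptions, FP8 weights (about 675 GB) fit on 8xH200 but exceed the 640 GB of an 8xH100 node, while NVFP4 weights (about 338 GB) fit on either.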

About Mistral Large 3

Mistral Large 3 is a state-of-the-art general-purpose multimodal model with a granular Mixture-of-Experts architecture. With 675B total parameters and 41B active parameters, it delivers frontier performance for production-grade assistants, retrieval-augmented systems, and complex enterprise workflows.

Evaluation Benchmarks

Benchmark	Score	Rank
WebDev Arena (Web Development)	1414	15

Rankings

Overall Rank	#17
Coding Rank	#23

GPU Requirements
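
VRAM demand grows with context size because of the key-value cache, on top of the fixed cost of the weights. The sketch below uses the standard KV-cache formula for attention that caches full keys and values per KV head. The layer count, KV-head count, and head dimension are PLACEHOLDERS, because this page does not list them; substitute the real values from the model configuration. The 675 GB weight figure assumes FP8 at one byte per parameter.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_elem):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem


def total_vram_gb(weight_gb, layers, kv_heads, head_dim,
                  context_tokens, bytes_per_elem=2):
    """Weights plus KV cache for a single sequence, in decimal gigabytes."""
    cache = kv_cache_bytes(layers, kv_heads, head_dim,
                           context_tokens, bytes_per_elem)
    return weight_gb + cache / 1e9


# PLACEHOLDER architecture values, NOT the real Mistral Large 3 configuration.
PLACEHOLDER = dict(layers=60, kv_heads=8, head_dim=128)

for ctx in (1_024, 125_000, 250_000):
    gb = total_vram_gb(675.0, context_tokens=ctx, **PLACEHOLDER)
    print(f"{ctx:>7} tokens -> ~{gb:,.1f} GB (weights + KV cache, one sequence)")
```

The cache scales linearly with context length and with the number of concurrent sequences, so long-context or high-batch serving needs headroom beyond the weight footprint.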
