| Specification | Value |
|---|---|
| Parameters | 123B |
| Context Length | 128K |
| Modality | Text |
| Architecture | Dense |
| License | Mistral Research License |
| Release Date | 24 Jul 2024 |
| Knowledge Cutoff | Oct 2023 |

Attention

| Specification | Value |
|---|---|
| Attention Structure | Grouped-Query Attention |
| Attention Heads | 48 |
| Key-Value Heads | 8 |
| Attention Head Dimension | - |
| Position Embedding | RoPE |
| RoPE Theta | - |
| Sliding Window Attention | - |
| Sliding Window Size | - |
| Normalization | RMS Normalization |
| Activation Function | SwiGLU |

Dimensions

| Specification | Value |
|---|---|
| Hidden Dimension Size | 12,288 |
| Number of Layers | 64 |
| FFN Intermediate Size (Dense) | - |
| Multi-Token Prediction Heads | - |

Tokenizer

| Specification | Value |
|---|---|
| Vocabulary Size | - |
Mistral Large 2 (Mistral-Large-2407) is a dense, decoder-only transformer with 123 billion parameters and the flagship of its model family. The parameter count was chosen to suit single-node inference, so the model can reach high throughput on enterprise-grade hardware without the complexity of multi-node distribution. It is designed for long inputs and long-form content, and for demanding tasks such as code generation, mathematical theorem proving, and multi-step logical deduction.
The architecture combines several established transformer components to improve efficiency. Grouped-Query Attention (GQA) with 48 attention heads and 8 key-value heads shrinks the key-value cache during inference, which matters most across the 128,000-token context window. Positional information comes from Rotary Position Embeddings (RoPE), and layers are normalized with RMSNorm. The feed-forward network uses the SwiGLU activation, a gated unit that is more expressive than plain ReLU or GELU, and Flash Attention reduces the time and memory cost of the attention computation.
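To make the GQA memory saving concrete, the following is a rough sketch of the key-value cache size for one full-length sequence, using the layer and head counts listed on this page. The head dimension is not listed here, so the sketch assumes it equals the hidden size divided by the number of attention heads. The function name `kv_cache_bytes` and the 16-bit cache precision are also illustrative assumptions, not details from Mistral's documentation.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor of 2: every layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Figures as listed on this page.
NUM_LAYERS = 64
NUM_QUERY_HEADS = 48
NUM_KV_HEADS = 8
HIDDEN_SIZE = 12_288
CONTEXT_TOKENS = 128_000

# Assumption: head dimension = hidden size / attention heads (not listed on the page).
HEAD_DIM = HIDDEN_SIZE // NUM_QUERY_HEADS

# Grouped-query attention caches only the 8 key-value heads.
gqa_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, CONTEXT_TOKENS)
# Hypothetical baseline: one key-value head per attention head (multi-head attention).
mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, CONTEXT_TOKENS)

print(f"GQA cache: {gqa_bytes / 1e9:.1f} GB")
print(f"MHA cache: {mha_bytes / 1e9:.1f} GB")
print(f"Reduction: {mha_bytes / gqa_bytes:.0f}x")
```

Under these assumptions the cache shrinks by the ratio of attention heads to key-value heads (48 / 8 = 6), regardless of the head dimension or precision chosen.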
Mistral Large 2 targets automated workflows and agentic systems. It supports more than 80 programming languages and dozens of human languages. The model is tuned for instruction following and function calling, so it can call external tools and produce structured JSON output. It also focuses on minimizing hallucination and keeping responses concise, which suits enterprise applications that need both speed and strong reasoning.
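As an illustration of what consuming a structured JSON tool call can look like on the application side, here is a minimal sketch. The tool `get_weather`, its schema, the helper `parse_tool_call`, and the sample output string are all hypothetical and written by hand for this example. They do not reproduce Mistral's actual API or message format.

```python
import json

# Hypothetical tool description in a JSON-Schema-like style (an assumption for this sketch).
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def parse_tool_call(raw, tool):
    """Parse a JSON tool call and check its name and required arguments."""
    call = json.loads(raw)
    if call.get("name") != tool["name"]:
        raise ValueError(f"unexpected tool: {call.get('name')!r}")
    args = call.get("arguments", {})
    missing = [key for key in tool["parameters"]["required"] if key not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return args


# Stand-in for text a model might return; not real model output.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
print(parse_tool_call(model_output, WEATHER_TOOL))
```

Validating the call before executing anything keeps a malformed or unexpected response from reaching the external tool.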
Mistral Large 2 is a 123-billion-parameter dense transformer for language and code generation, supporting over 80 programming languages. Its 128,000-token context window supports complex reasoning and long-context applications on a single node, and it offers enhanced function calling.
Rank: #55
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| QA Assistant | ProLLM QA Assistant | 0.964 | 5 |
| General Knowledge | MMLU | 0.84 | 16 |
| Web Development | WebDev Arena | 1314 | 56 |
Overall Rank: #55
Coding Rank: #70
Total Score: 62 / 100
Mistral Large 2 demonstrates strong transparency in its architectural fundamentals and tokenizer implementation, supported by verifiable environmental impact data. However, the model suffers from a near-total lack of transparency regarding its training data composition and sources. While the licensing terms are clearly defined, they remain restrictive for commercial users, and the absence of a detailed technical paper limits deeper scientific verification.
Architectural Provenance
Mistral Large 2 is documented as a decoder-only transformer with 123 billion parameters. Official documentation specifies the use of Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), RMS Norm, and SwiGLU activation. While the architecture is described in technical blog posts and model cards, there is no formal peer-reviewed technical report or paper detailing the specific pre-training methodology, layer-by-layer configuration, or the exact architectural evolution from the previous version beyond high-level summaries.
Dataset Composition
Information regarding the training data is extremely limited. Mistral AI only provides vague marketing claims stating the model was trained on a 'vast amount of text and code' and 'diverse data containing different languages.' There are no public disclosures regarding specific data sources, percentage breakdowns of data types (e.g., web vs. books vs. code), filtering methodologies, or data cleaning procedures. This lack of transparency makes it impossible to verify the quality or provenance of the training corpus.
Tokenizer Integrity
The tokenizer is publicly accessible via the 'mistral-common' library and Hugging Face. It uses a vocabulary size of 131,072 (128k) tokens, which is a significant expansion from earlier 32k versions. The approach is documented as a byte-fallback BPE tokenizer, and its performance across 80+ programming languages and dozens of human languages is verifiable through the provided weights and open-source implementation code.
Parameter Density
The model is explicitly stated to have 123 billion parameters. Unlike Mistral's previous MoE models, this is a dense architecture, meaning all parameters are active during inference. The parameter count is consistently reported across official sources and third-party platforms like NVIDIA and AWS. However, a detailed breakdown of parameter allocation (e.g., attention vs. FFN) is not provided in official documentation.
Training Compute
Mistral AI has released a 'Life Cycle Assessment' (LCA) that provides high-level environmental metrics, including 20,400 metric tons of CO2e for training and 281,000 cubic meters of water. While this is a step forward for environmental transparency, the company does not disclose the specific hardware hours (GPU/TPU hours), the exact cluster configuration used for training, or the total financial cost of the compute resources.
Benchmark Reproducibility
Mistral provides benchmark results on standard sets like MMLU (84.0%), GSM8K, and HumanEval. They maintain a 'mistral-evals' GitHub repository containing some evaluation code and standardized prompts. However, the full evaluation suite for all claimed benchmarks is not entirely public, and third-party reports indicate that reproducing exact scores can be difficult without the specific internal few-shot templates used by the team.
Identity Consistency
The model consistently identifies itself as a Mistral AI model and is transparent about its versioning (2407). It does not exhibit the identity confusion seen in some other models that claim to be GPT-4. It is generally accurate about its capabilities, such as its 128k context window and multilingual support, which are verifiable through API testing and documentation.
License Clarity
The model is released under the 'Mistral Research License,' which allows for non-commercial research use but requires a separate 'Mistral Commercial License' for any business activity or self-deployment. While the terms are legally clear, the license is restrictive and does not meet the definition of 'Open Source' (OSI-compliant), despite being marketed as an 'open model.' This creates a distinction between 'open weights' and 'open source' that is often blurred in marketing.
Hardware Footprint
Hardware requirements are well-documented by both Mistral and third-party providers like NVIDIA. The model is specifically designed for single-node inference on an H100 (80GB) or similar hardware. VRAM requirements for various quantization levels (e.g., 4-bit requiring ~61GB) are widely known and verified by the community. Documentation on context window memory scaling is available through partner platforms like AWS Bedrock.
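The quoted 4-bit figure can be checked with simple arithmetic. The sketch below estimates memory for the weights alone at a few common precisions, using the 123 billion parameter count stated on this page. The helper name `weight_memory_gb` is illustrative, and the estimate deliberately ignores the KV cache, activations, and any per-block metadata that real quantization formats add.

```python
PARAMS = 123_000_000_000  # parameter count stated on this page


def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for name, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_memory_gb(PARAMS, bits):.1f} GB")
```

At 4 bits per weight this gives 61.5 GB, in line with the ~61GB figure above. At 16 bits the weights alone come to 246 GB, more than a single 80GB GPU holds, so unquantized single-node serving implies several GPUs in one machine.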
Versioning Drift
Mistral uses date-based versioning (e.g., 2407), which provides some clarity. However, the model has already seen deprecation notices on platforms like GitHub Models in favor of newer versions (24.11) without detailed changelogs explaining the specific behavioral changes or performance drift between these iterations. There is no public commitment to long-term support for specific weight versions.