| Attribute | Value |
|---|---|
| Parameters | 123B |
| Context Length | 128K |
| Modality | Text |
| Architecture | Dense |
| License | Mistral Research License |
| Release Date | 24 Jul 2024 |
| Knowledge Cutoff | Oct 2023 |
| Attention Structure | Grouped-Query Attention |
| Hidden Dimension Size | 12288 |
| Number of Layers | 64 |
| Attention Heads | 48 |
| Key-Value Heads | 8 |
| Activation Function | SwiGLU |
| Normalization | RMS Normalization |
| Position Embedding | RoPE |
Mistral Large 2 (Mistral-Large-2407) is a sophisticated dense transformer model engineered to deliver advanced linguistic and computational reasoning. As the flagship representative of its model family, it utilizes a decoder-only architecture with 123 billion parameters. This specific parameter count is intentionally selected to optimize single-node inference, allowing the model to achieve high throughput on enterprise-grade hardware without the complexities of multi-node distribution. It is designed to process extensive datasets and long-form content, maintaining high fidelity across complex tasks such as code generation, mathematical theorem proving, and multi-step logical deduction.
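The single-node claim can be illustrated with back-of-the-envelope arithmetic on weight memory alone (a rough sketch: 1 GB is taken as 1e9 bytes, and real deployments additionally need memory for the KV cache and activations):

```python
# Rough weight-memory estimate for a 123B-parameter dense model
# at common precisions (illustrative arithmetic only).

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

PARAMS = 123e9  # Mistral Large 2 parameter count

for label, bytes_per_param in [("FP16/BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{label}: ~{weight_memory_gb(PARAMS, bytes_per_param):.0f} GB")
# FP16/BF16: ~246 GB, INT8: ~123 GB, INT4: ~62 GB
```

At FP16 the weights alone come to roughly 246 GB, which fits on a single multi-GPU node (e.g. 8 accelerators with sufficient combined memory) without multi-node sharding.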
The model's architecture incorporates several modern advancements in transformer design to enhance computational efficiency and performance. It employs Grouped Query Attention (GQA) with 48 attention heads and 8 key-value heads to reduce memory overhead during inference, particularly when handling its substantial 128,000-token context window. Positional information is managed via Rotary Position Embeddings (RoPE), and the model utilizes RMS Norm for more stable layer normalization. The feed-forward network integrates the SwiGLU activation function, which provides more expressive gating compared to traditional ReLU or GELU alternatives, while Flash Attention is leveraged to optimize speed and resource utilization during processing.
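The memory saving from GQA comes from storing key/value projections for only 8 heads while computing queries for all 48; each KV head serves a group of 48 / 8 = 6 query heads. A minimal NumPy sketch of the head-sharing arithmetic, using the listed configuration (hidden size 12288, so each head has dimension 12288 / 48 = 256):

```python
import numpy as np

# Toy grouped-query attention shapes matching the listed configuration:
# hidden 12288, 48 query heads, 8 key-value heads.
HIDDEN, N_Q, N_KV = 12288, 48, 8
HEAD_DIM = HIDDEN // N_Q   # 256
GROUP = N_Q // N_KV        # 6 query heads share each KV head

def gqa_scores(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """q: (N_Q, seq, d), k: (N_KV, seq, d) -> scaled scores (N_Q, seq, seq)."""
    # Broadcast each KV head to its group of query heads; only N_KV
    # key/value tensors ever need to be cached.
    k_expanded = np.repeat(k, GROUP, axis=0)  # (N_Q, seq, d)
    return q @ k_expanded.transpose(0, 2, 1) / np.sqrt(q.shape[-1])

seq = 4
q = np.random.randn(N_Q, seq, HEAD_DIM)
k = np.random.randn(N_KV, seq, HEAD_DIM)
print(gqa_scores(q, k).shape)  # (48, 4, 4)
```

With a 128,000-token window, caching 8 KV heads instead of 48 cuts KV-cache memory by a factor of 6, which is what makes long-context inference tractable.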
Mistral Large 2 is optimized for versatile deployment in automated workflows and agentic systems. It features native support for over 80 programming languages and dozens of human languages, ensuring proficiency in global multilingual environments. The model is specifically tuned for improved instruction following and high-precision function calling, which enables it to interface effectively with external tools and generate structured JSON outputs. By focusing on minimizing hallucination and enhancing response conciseness, the architecture provides a reliable foundation for enterprise applications requiring both speed and sophisticated reasoning capabilities.
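As an illustration of the function-calling workflow, the sketch below uses the common OpenAI-style tool schema; the exact wire format depends on the serving API, and the tool name and fields here are hypothetical:

```python
import json

# Hypothetical tool schema in the widely used OpenAI-style
# function-calling format (illustrative, not Mistral's exact API).
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# A structured tool call the model might emit. The arguments field is a
# JSON-encoded string, so downstream code parses it before dispatching
# to the real function.
raw_call = '{"name": "get_weather", "arguments": "{\\"city\\": \\"Paris\\"}"}'
call = json.loads(raw_call)
args = json.loads(call["arguments"])
print(call["name"], args["city"])  # get_weather Paris
```

Because the model is tuned to emit well-formed JSON, the `json.loads` step rarely fails, but production callers should still validate the parsed arguments against the schema before dispatch.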
Mistral Large 2 is a 123-billion-parameter dense transformer model engineered for advanced language and code generation, supporting over 80 programming languages. Its 128,000-token context window facilitates complex reasoning and long-context applications on a single node, and enhanced function-calling capabilities are integrated.
Rank
#68
| Benchmark | Score | Rank |
|---|---|---|
| ProLLM QA Assistant (QA Assistant) | 0.96 | 🥉 3 |
| MMLU (General Knowledge) | 0.84 | 13 |
| WebDev Arena (Web Development) | 1314 | 40 |
Coding Rank
#54