| Specification | Value |
|---|---|
| Parameters | 123B |
| Context Length | 128K |
| Modality | Text |
| Architecture | Dense |
| License | Mistral Research License |
| Release Date | 24 Jul 2024 |
| Knowledge Cutoff | Oct 2023 |
| Attention Structure | Grouped-Query Attention |
| Hidden Dimension Size | - |
| Number of Layers | 64 |
| Attention Heads | 48 |
| Key-Value Heads | 8 |
| Activation Function | - |
| Normalization | RMS Normalization |
| Position Embedding | RoPE |
Mistral Large 2 (Mistral-Large-2407) is the latest generation of Mistral AI's flagship large language models, designed to advance capabilities in natural language understanding and generation. It is built on a decoder-only Transformer architecture, a widely adopted design for efficient and scalable language models. With 123 billion parameters, it can process and generate complex linguistic structures with a high degree of fidelity. A key architectural characteristic is its suitability for single-node inference, which supports high throughput in long-context applications.
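For illustration, the sketch below shows one common way to set up such single-node inference with vLLM's tensor parallelism; the repository name, GPU count, and context limit are assumptions for the example rather than values from this card.

```python
from vllm import LLM, SamplingParams

# Hedged sketch: serving Mistral Large 2 on a single multi-GPU node with vLLM.
# The model repository and tensor_parallel_size are assumptions; adjust to your
# hardware (123B weights need several 80 GB GPUs even when quantized).
llm = LLM(
    model="mistralai/Mistral-Large-Instruct-2407",
    tensor_parallel_size=8,   # split the 123B weights across 8 GPUs on one node
    max_model_len=32_768,     # trim the 128K window to fit available memory
)

outputs = llm.generate(
    ["Summarize the trade-offs of grouped-query attention in two sentences."],
    SamplingParams(max_tokens=128, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```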
This model is distinguished by its extensive 128,000-token context window, allowing it to maintain coherence over extended documents and interactions. It incorporates Grouped Query Attention (GQA) with 48 attention heads and 8 key-value heads, which contributes to its computational efficiency while managing long sequences. The model also leverages Rotary Position Embeddings (RoPE) for effective positional encoding and integrates Flash Attention for optimized processing speed. These architectural choices aim to balance performance with computational requirements.
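To make the grouped-query layout concrete, here is a minimal PyTorch sketch of how 48 query heads can share 8 key-value heads; the head dimension (128) and sequence length are assumed values, and the causal mask and RoPE rotation are omitted for brevity. This is an illustration of the technique, not Mistral's implementation.

```python
import torch

# Grouped-query attention: 48 query heads share 8 key/value heads,
# so each KV head serves 48 / 8 = 6 query heads.
n_heads, n_kv_heads, head_dim, seq = 48, 8, 128, 16   # head_dim is assumed
group = n_heads // n_kv_heads                          # 6 query heads per KV head

q = torch.randn(1, n_heads, seq, head_dim)
k = torch.randn(1, n_kv_heads, seq, head_dim)
v = torch.randn(1, n_kv_heads, seq, head_dim)

# Expand K/V so each query head attends over its group's shared KV head.
k = k.repeat_interleave(group, dim=1)                  # (1, 48, seq, head_dim)
v = v.repeat_interleave(group, dim=1)

# Scaled dot-product attention (no causal mask, for brevity).
attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
out = attn @ v                                         # (1, 48, seq, head_dim)
print(out.shape)
```

Because only 8 key-value heads are cached instead of 48, the KV cache shrinks by the same 6x factor, which is what keeps the 128K context manageable.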
Mistral Large 2 exhibits enhanced performance across a range of linguistic tasks, including advanced code generation, complex mathematical problem-solving, and sophisticated reasoning. It supports over 80 programming languages, such as Python, Java, C, C++, and JavaScript, and operates proficiently across dozens of human languages, including Russian, Chinese, Japanese, Korean, Spanish, Italian, Portuguese, Arabic, and Hindi, indicating broad multilingual capabilities. Furthermore, the model is equipped with robust function calling abilities and supports native JSON output, facilitating its integration into complex automated workflows and agentic systems. A significant focus during its development was placed on minimizing the generation of erroneous or irrelevant information, thereby enhancing the reliability of its outputs and improving instruction following.
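As a hedged illustration of the JSON-output mode, the snippet below posts a request to an OpenAI-style chat completions endpoint; the URL, model identifier, and `response_format` field are assumptions to be verified against the provider's current API documentation.

```python
import os
import requests

# Hedged sketch: requesting native JSON output from Mistral Large 2 through an
# OpenAI-style chat completions endpoint. Endpoint, model name, and the
# "response_format" field are assumptions; consult the provider's API docs.
resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "mistral-large-2407",
        "messages": [
            {
                "role": "user",
                "content": "Extract the city and date from: 'Meet me in Lyon on 3 May.' "
                           "Reply as JSON with keys 'city' and 'date'.",
            }
        ],
        "response_format": {"type": "json_object"},
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```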
Mistral Large 2 is a 123-billion-parameter dense transformer engineered for advanced language and code generation, supporting over 80 programming languages. Its 128,000-token context window enables complex reasoning and long-context applications on a single node, and it integrates enhanced function calling capabilities.
Rankings below are relative to other local LLMs.
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| QA Assistant | ProLLM QA Assistant | 0.96 | 🥈 2 |
| General Knowledge | MMLU | 0.84 | 🥈 2 |
| Refactoring | Aider Refactoring | 0.60 | 5 |
| Coding | Aider Coding | 0.65 | 7 |
| Coding | LiveBench Coding | 0.63 | 8 |
| StackEval | ProLLM Stack Eval | 0.88 | 8 |
| Summarization | ProLLM Summarization | 0.73 | 9 |
| Data Analysis | LiveBench Data Analysis | 0.54 | 16 |
| Agentic Coding | LiveBench Agentic | 0.02 | 19 |
| Reasoning | LiveBench Reasoning | 0.34 | 22 |
| Mathematics | LiveBench Mathematics | 0.42 | 22 |
Overall Rank: #19
Coding Rank: #8
VRAM requirements for Mistral Large 2 depend on the quantization method chosen for the model weights and on the context size: the quantized weights dominate, while the key-value cache grows linearly with the number of tokens kept in context (for example, 1,024 tokens).
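The sketch below gives a back-of-the-envelope estimate under stated assumptions: weight memory is the parameter count times the bits per weight, and the KV cache uses the layer and key-value head counts from the spec table with an assumed head dimension of 128. Real memory use depends on the runtime and adds overhead.

```python
# Hedged sketch: rough VRAM estimate for a 123B dense model under different
# weight quantizations, plus a grouped-query-attention KV cache.
# Assumptions (not from the model card): head_dim = 128, FP16 KV cache,
# and approximate GGUF bits-per-weight figures for the quantized formats.

def weights_gib(params: float, bits_per_weight: float) -> float:
    """Memory for the quantized weights, in GiB."""
    return params * bits_per_weight / 8 / 2**30

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context: int, bytes_per_elem: int = 2) -> float:
    """K and V tensors for every layer at the given context length, in GiB."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 2**30

PARAMS = 123e9            # 123B parameters (model card)
LAYERS, KV_HEADS = 64, 8  # from the spec table above
HEAD_DIM = 128            # assumed; not listed on the card
CONTEXT = 1_024           # example context size

for name, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    total = weights_gib(PARAMS, bits) + kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
    print(f"{name:7s} ~{total:6.1f} GiB (+ runtime overhead)")
```

At short contexts the weights dominate the total; the KV cache term only becomes significant as the context approaches the full 128K window.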