
Mistral-Large-2407

Parameters: 123B
Context Length: 128K
Modality: Text
Architecture: Dense
License: Mistral Research License
Release Date: 24 Jul 2024
Knowledge Cutoff: Oct 2023

Technical Specifications

Attention Structure: Grouped-Query Attention
Hidden Dimension Size: 12288
Number of Layers: 64
Attention Heads: 48
Key-Value Heads: 8
Activation Function: SwiGLU
Normalization: RMS Normalization
Position Embedding: RoPE

Mistral-Large-2407

Mistral Large 2 (Mistral-Large-2407) is a sophisticated dense transformer model engineered to deliver advanced linguistic and computational reasoning. As the flagship representative of its model family, it utilizes a decoder-only architecture with 123 billion parameters. This specific parameter count is intentionally selected to optimize single-node inference, allowing the model to achieve high throughput on enterprise-grade hardware without the complexities of multi-node distribution. It is designed to process extensive datasets and long-form content, maintaining high fidelity across complex tasks such as code generation, mathematical theorem proving, and multi-step logical deduction.
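As a rough illustration of the single-node claim, the back-of-the-envelope sketch below estimates the memory needed just to hold the weights in 16-bit precision and compares it against a hypothetical node with eight 80 GB accelerators; the node configuration is an assumption for illustration, not an official sizing guide.

```python
# Back-of-the-envelope check: do 123B parameters fit on one 8x80GB node?
# Assumes 2 bytes per parameter (FP16/BF16 weights); activations and the
# KV cache need additional headroom on top of this figure.

params = 123e9                 # parameter count from the spec table
bytes_per_param = 2            # FP16/BF16
weight_bytes = params * bytes_per_param

node_gpus = 8                  # hypothetical single node: 8 x 80 GB accelerators
gpu_mem_gib = 80
node_mem_bytes = node_gpus * gpu_mem_gib * 1024**3

print(f"Weights:  {weight_bytes / 1024**3:.0f} GiB")      # ~229 GiB
print(f"Node RAM: {node_mem_bytes / 1024**3:.0f} GiB")    # 640 GiB
print(f"Fits on one node: {weight_bytes < node_mem_bytes}")
```

At 16-bit precision the weights alone occupy roughly 230 GiB, leaving the rest of the node's memory for the KV cache and activations.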

The model's architecture incorporates several modern advancements in transformer design to enhance computational efficiency and performance. It employs Grouped Query Attention (GQA) with 48 attention heads and 8 key-value heads to reduce memory overhead during inference, particularly when handling its substantial 128,000-token context window. Positional information is managed via Rotary Position Embeddings (RoPE), and the model utilizes RMS Norm for more stable layer normalization. The feed-forward network integrates the SwiGLU activation function, which provides more expressive gating compared to traditional ReLU or GELU alternatives, while Flash Attention is leveraged to optimize speed and resource utilization during processing.
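To make the GQA saving concrete, the sketch below estimates the KV-cache footprint at the full 128K context using the figures from the specification table (64 layers, 48 query heads, 8 key-value heads, hidden size 12288); the comparison against caching all query heads, as standard multi-head attention would, is illustrative only.

```python
# Estimate KV-cache memory at full context with grouped-query attention (GQA)
# versus standard multi-head attention (MHA), using the spec-table figures.

layers = 64
q_heads = 48
kv_heads = 8
hidden = 12288
head_dim = hidden // q_heads          # 256
bytes_per_elem = 2                    # FP16 cache
context = 128_000                     # tokens

def kv_cache_bytes(num_kv_heads: int) -> int:
    # Factor of 2 accounts for storing both keys and values per layer.
    return 2 * layers * num_kv_heads * head_dim * bytes_per_elem * context

gqa = kv_cache_bytes(kv_heads)        # only 8 KV heads cached
mha = kv_cache_bytes(q_heads)         # all 48 heads cached

print(f"GQA cache: {gqa / 1024**3:.0f} GiB")   # ~62 GiB
print(f"MHA cache: {mha / 1024**3:.0f} GiB")   # ~375 GiB
print(f"Reduction: {mha / gqa:.1f}x")          # 6.0x
```

Caching 8 key-value heads instead of 48 cuts the per-token cache by a factor of six, which is what keeps long-context serving tractable on a single node.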

Mistral Large 2 is optimized for versatile deployment in automated workflows and agentic systems. It features native support for over 80 programming languages and dozens of human languages, ensuring proficiency in global multilingual environments. The model is specifically tuned for improved instruction following and high-precision function calling, which enables it to interface effectively with external tools and generate structured JSON outputs. By focusing on minimizing hallucination and enhancing response conciseness, the architecture provides a reliable foundation for enterprise applications requiring both speed and sophisticated reasoning capabilities.
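The snippet below sketches what a tool definition and the resulting structured call might look like; the weather-lookup function and its fields are hypothetical and follow the widely used JSON-schema tool format rather than quoting an official Mistral API payload.

```python
import json

# Hypothetical tool definition in the common JSON-schema "function" format.
# The tool name and parameters are illustrative, not taken from Mistral docs.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# A model tuned for function calling is expected to emit a structured call
# like this, which the application can parse and dispatch to the real tool.
example_model_output = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'
call = json.loads(example_model_output)
print(call["name"], call["arguments"])
```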

About Mistral Large 2

Mistral Large 2 is a 123-billion-parameter dense transformer model engineered for advanced language and code generation, supporting over 80 programming languages. Its 128,000-token context window facilitates complex reasoning and long-context applications on a single node, and it integrates enhanced function-calling capabilities.



Evaluation Benchmarks

Overall Rank: #68

Category | Benchmark | Score | Rank
(unspecified) | (unspecified) | 0.96 | 3
General Knowledge | MMLU | 0.84 | 13
Web Development | WebDev Arena | 1314 | 40

Rankings

Overall Rank: #68
Coding Rank: #54

GPU Requirements

The full page provides an interactive VRAM calculator covering weight quantization options and context sizes from 1k to 125k tokens.
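In place of the interactive calculator, the sketch below gives a rough estimate of the weight memory at common quantization levels; the bytes-per-parameter values are standard approximations and ignore per-layer quantization overhead, activations, and the KV cache.

```python
# Rough weight-only VRAM estimate for a 123B-parameter model at common
# quantization levels. Real deployments also need memory for the KV cache,
# activations, and framework overhead.

PARAMS = 123e9

BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}

for name, bpp in BYTES_PER_PARAM.items():
    gib = PARAMS * bpp / 1024**3
    print(f"{name:>10}: ~{gib:.0f} GiB for weights")
# fp16/bf16: ~229 GiB, int8: ~115 GiB, int4: ~57 GiB
```

Even at 4-bit quantization the weights alone need roughly 57 GiB, so an 80 GB-class accelerator is about the practical minimum for single-GPU use, before accounting for the KV cache.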
