Parameters
24B
Context Length
128K
Modality
Text
Architecture
Dense
License
Apache 2.0
Release Date
10 Jun 2025
Knowledge Cutoff
Oct 2023
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
32
Key-Value Heads
8
Attention Head Dimension
128
Position Embedding
Rotary Position Embedding (RoPE)
RoPE Theta
1,000,000,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
14,336
Number of Layers
32
FFN Intermediate Size (Dense)
32,768
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
131,072
Magistral Small is an open-source reasoning model from Mistral AI with 24 billion parameters. It is built on Mistral Small 3.1 and is designed for transparent, multi-step reasoning. The model exposes a traceable chain of thought in the user's language, which makes its outputs easier to interpret and audit on complex tasks. It supports multilingual reasoning in more than 24 languages, including English, French, German, Japanese, Korean, Chinese, Arabic, and Farsi.
Magistral Small is a decoder-only transformer with a hidden dimension of 14,336 and 32 layers. It uses Grouped Query Attention (GQA) with 32 attention heads and 8 key-value heads, which shrinks the key-value cache and speeds up inference compared to standard Multi-Head Attention. Positional information is encoded with Rotary Positional Embeddings (RoPE). The feedforward blocks use SwiGLU activations, and RMS Normalization is applied to stabilize training. FlashAttention is used to accelerate attention computation. Although the nominal context window is 128,000 tokens, performance is typically best with contexts of up to 40,000 tokens.
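To make the GQA memory claim concrete, the following is a rough sketch of the key-value cache arithmetic, using the head counts, head dimension, and layer count quoted on this page. The 2-byte cache precision (FP16/BF16) and the helper names are assumptions made for illustration; they do not come from any Mistral tooling.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of key/value cache stored for one token across all layers."""
    # Factor 2: one key vector and one value vector per KV head per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def cache_gib(per_token_bytes: int, context_tokens: int) -> float:
    """Total cache size in GiB for a given number of cached tokens."""
    return per_token_bytes * context_tokens / 2**30


# Figures as listed on this page; 2 bytes per value is an assumed FP16/BF16 cache.
LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM = 32, 32, 8, 128

gqa_per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)
# Hypothetical MHA layout for comparison: one KV head per query head.
mha_per_token = kv_cache_bytes_per_token(LAYERS, Q_HEADS, HEAD_DIM)

for ctx in (40_000, 128_000):
    print(f"{ctx:>7} tokens: GQA {cache_gib(gqa_per_token, ctx):.2f} GiB, "
          f"MHA {cache_gib(mha_per_token, ctx):.2f} GiB")
```

Under these assumptions the GQA cache is one quarter the size of the equivalent MHA cache (32 / 8 = 4), roughly 4.9 GiB versus 19.5 GiB at 40,000 tokens.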
Magistral Small is proficient in multimodal comprehension, enabling it to process and reason over both textual and visual inputs. It is particularly suited to applications that require structured calculations, programmatic logic, decision trees, and rule-based systems. Typical uses include fast-response conversational agents, long-document understanding, visual understanding, and domain-specific fine-tuning. It also supports agentic workflows through native function calling and structured output generation.
Magistral is Mistral AI's first reasoning model series, purpose-built for transparent, step-by-step reasoning with native multilingual capabilities. It produces chain-of-thought reasoning in the user's language with traceable thought processes. It excels at domain-specific problems that require multi-step logic, from legal research and financial forecasting to software development and creative storytelling. It supports reasoning in numerous languages, including English, French, Spanish, German, Italian, Arabic, Russian, and Chinese.
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| StackUnseen | ProLLM Stack Unseen | 0.346 | 29 |
| Professional Knowledge | MMLU Pro | 0.62 | 53 |
Overall Rank
#129
Coding Rank
#102
Total Score
75 / 100
Magistral Small 24B exhibits a strong transparency profile, particularly regarding its open-source licensing, architectural specifications, and the disclosure of its reasoning-specific training methodology. While it excels in providing the technical details necessary for local deployment and verification, it remains less transparent about the specific composition of its massive pre-training datasets and the total compute resources consumed during its development.
Architectural Provenance
The model's lineage is clearly documented as being built upon Mistral Small 3.1 (2503). The technical architecture is well-defined as a 24B dense decoder-only transformer with 32 layers, utilizing Grouped Query Attention (GQA), SwiGLU activations, and Rotary Positional Embeddings (RoPE). The training methodology is explicitly described in the accompanying paper (arXiv:2506.10910), detailing a 'cold-start' SFT process using reasoning traces from Magistral Medium followed by Reinforcement Learning from Verifiable Rewards (RLVR).
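As an illustration of how RoPE encodes position, here is a minimal pure-Python sketch that uses the RoPE theta (1,000,000,000) and head dimension (128) listed in the specification above. The interleaved pairing of dimensions follows the original RoPE formulation and is an assumption; production implementations often pair dimension i with i + d/2 instead, and the function names here are hypothetical.

```python
import math


def rope_inv_freq(head_dim: int, theta: float) -> list[float]:
    """Inverse frequencies theta^(-2i/d), one per pair of dimensions."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec: list[float], position: int, inv_freq: list[float]) -> list[float]:
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position * inv_freq[i]."""
    out = []
    for i, freq in enumerate(inv_freq):
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend((x * c - y * s, x * s + y * c))
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


HEAD_DIM, THETA = 128, 1_000_000_000.0  # values as listed on this page
inv_freq = rope_inv_freq(HEAD_DIM, THETA)

# Arbitrary example query/key vectors (not real model activations).
q = [math.sin(i) for i in range(HEAD_DIM)]
k = [math.cos(i) for i in range(HEAD_DIM)]

# Same relative offset (7) at two different absolute positions.
score_a = dot(apply_rope(q, 10, inv_freq), apply_rope(k, 3, inv_freq))
score_b = dot(apply_rope(q, 1007, inv_freq), apply_rope(k, 1000, inv_freq))
print(score_a, score_b)
```

The two attention scores agree up to floating-point error because both query/key pairs are 7 positions apart, which is the property that lets RoPE express relative position through rotations.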
Dataset Composition
While the training methodology (SFT traces + RLVR) is well-documented, the specific composition of the underlying pre-training data for the base model (Mistral Small 3.1) remains largely undisclosed beyond general categories. The reasoning-specific data is described as being derived from 'Magistral Medium traces' and 'verifiable reward' tasks (math, code), but precise dataset distributions, source names, and filtering/cleaning metrics for the vast majority of the training corpus are absent.
Tokenizer Integrity
The model uses the 'Tekken' tokenizer, which is part of the publicly available 'mistral-common' library. It features a vocabulary size of 131,072 tokens and is specifically optimized for over 24 languages and source code. The tokenizer's performance and alignment with the model's multilingual claims are verifiable through the public GitHub repository and Hugging Face model files.
Parameter Density
The model is explicitly stated to be a 24.0 billion parameter dense architecture. Unlike Mixture-of-Experts (MoE) models where active parameters can be ambiguous, this dense configuration ensures all 24B parameters are active during inference. Detailed architectural specifications, including the hidden dimension (14,336) and head counts (32 attention, 8 KV), are provided in the technical documentation.
Training Compute
Information regarding the specific compute resources used for training is minimal. While the paper discusses the 'asynchronous system' and infrastructure for online RL, it lacks concrete data on total GPU/TPU hours, hardware counts, carbon footprint, or energy consumption. The disclosure is limited to high-level descriptions of the training pipeline rather than quantifiable resource metrics.
Benchmark Reproducibility
Mistral provides specific results for standard benchmarks (AIME24, GPQA, LiveCodeBench) and includes the exact sampling parameters (top_p: 0.95, temp: 0.7) and system prompts required to replicate the reasoning behavior. Third-party reproduction guides (e.g., via promptfoo) are already available. However, the full evaluation code and the complete set of internal prompts used for all reported metrics are not entirely public.
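For readers unfamiliar with what the quoted sampling parameters control, the sketch below shows temperature scaling followed by nucleus (top-p) sampling in plain Python, with defaults set to the values above (temperature 0.7, top_p 0.95). It is a generic illustration of the technique, not Mistral's inference code, and the function names are hypothetical.

```python
import math
import random


def nucleus(logits, temperature=0.7, top_p=0.95):
    """Return (token_ids, probabilities) for the smallest set whose mass >= top_p."""
    # Temperature scaling, then a numerically stable softmax.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least likely until the cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break
    return kept, [probs[i] for i in kept]


def sample_top_p(logits, temperature=0.7, top_p=0.95, rng=None):
    """Draw one token id from the renormalized nucleus."""
    if rng is None:
        rng = random.Random()
    kept, weights = nucleus(logits, temperature, top_p)
    # random.choices renormalizes the weights internally.
    return rng.choices(kept, weights=weights, k=1)[0]


# Toy logits over a 4-token vocabulary (illustrative values only).
print(sample_top_p([5.0, 1.0, 0.0, -3.0], rng=random.Random(0)))
```

A lower temperature sharpens the distribution before truncation, so with peaked logits the nucleus can shrink to a single token; with flat logits nearly the whole vocabulary stays eligible.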
Identity Consistency
The model demonstrates high identity consistency, correctly identifying itself as a Mistral-developed reasoning model. It maintains clear versioning (2506) and is transparent about its specific 'Think' mode capabilities and the 40k-128k context window limitations. There are no documented instances of the model claiming to be a competitor or misrepresenting its 24B scale.
License Clarity
The model is released under the highly permissive Apache 2.0 license, which is explicitly stated on Hugging Face, the official blog, and in the technical paper. This license allows for unrestricted commercial use, modification, and distribution, providing maximum legal clarity for downstream users without conflicting proprietary terms.
Hardware Footprint
Hardware requirements are well-documented for various deployment scenarios. Official documentation specifies that the model fits on a single RTX 4090 (24GB VRAM) or a 32GB RAM Mac when quantized (4-bit). VRAM estimates for different quantization levels (Q4, Q8) are provided by the community and supported by official GGUF releases, with clear guidance on the 40k token context performance threshold.
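The footprint figures above can be sanity-checked with a back-of-the-envelope estimate of weight memory at different precisions. This sketch assumes exactly 24 billion parameters and counts weights only; it ignores quantization metadata (real 4-bit formats store scales and use somewhat more than 4 bits per weight), the KV cache, and activations. The helper names are hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


def fits(num_params: float, bits_per_weight: float, budget_gb: float) -> bool:
    """True if the weights alone fit within the given memory budget."""
    return weight_memory_gb(num_params, bits_per_weight) <= budget_gb


PARAMS = 24e9  # parameter count as stated on this page

for label, bits in (("16-bit", 16), ("8-bit (Q8)", 8), ("4-bit (Q4)", 4)):
    size = weight_memory_gb(PARAMS, bits)
    print(f"{label:>11}: ~{size:.0f} GB of weights, "
          f"fits in 24 GB: {fits(PARAMS, bits, 24.0)}")
```

At 4 bits the weights come to roughly 12 GB, which leaves headroom on a 24 GB card for the KV cache and runtime overhead; at 8 bits the weights alone already consume the full 24 GB.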
Versioning Drift
Mistral uses a date-based versioning scheme (2506 denotes the June 2025 release) and maintains a clear distinction between variants (Small vs. Medium). While a changelog for the 'Magistral' family is emerging, the model is still relatively new, and long-term tracking of behavioral drift or silent updates is not yet fully established. Previous versions remain accessible on Hugging Face.
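When tracking releases programmatically, a date-based tag such as 2506 can be turned into a comparable date. The sketch below assumes the tag is a four-digit YYMM string for a year in the 2000s, which matches the 10 Jun 2025 release date listed above; the `parse_yymm` helper is hypothetical.

```python
from datetime import date


def parse_yymm(tag: str) -> date:
    """Convert a 'YYMM' version tag into the first day of that month."""
    if len(tag) != 4 or not tag.isdigit():
        raise ValueError(f"expected a four-digit YYMM tag, got {tag!r}")
    year, month = 2000 + int(tag[:2]), int(tag[2:])
    return date(year, month, 1)  # raises ValueError if month is not 1..12


base = parse_yymm("2503")       # Mistral Small 3.1, as cited on this page
magistral = parse_yymm("2506")  # Magistral Small
print(base, magistral, magistral > base)
```

Parsing into `date` objects makes version ordering explicit rather than relying on string comparison.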