Magistral Small

Parameters

24B

Context Length

128K

Modality

Text

Architecture

Dense

License

Apache 2.0

Release Date

10 Jun 2025

Knowledge Cutoff

Oct 2023

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention (GQA)

Attention Heads

32

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

1,000,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

14,336

Number of Layers

32

FFN Intermediate Size (Dense)

32,768

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

131,072

Architecture Diagram

Diagram summary: Input Tokens → Token Embedding (position: RoPE; hidden: 14.3k; context: 128k; vocab: 131.1k) → 32 layers, each [RMSNorm (pre-attention) → Grouped-Query Attention (32 Q / 8 KV heads, head dim 128) → residual add → RMSNorm (pre-FFN) → Feed-Forward Network (SwiGLU, intermediate 32.8k) → residual add] → Final RMSNorm → Output Logits

Magistral Small

Magistral Small is an open-source reasoning model developed by Mistral AI, comprising 24 billion parameters. It is architecturally founded upon the Mistral Small 3.1 model and is specifically engineered to perform transparent, multi-step reasoning. This model provides traceable thought processes in the user's language, a feature designed to enhance interpretability and auditability for complex tasks. It supports multilingual reasoning across more than 24 languages, including widely used global languages such as English, French, German, Japanese, Korean, Chinese, Arabic, and Farsi.

Magistral Small is a decoder-only transformer with 32 layers and a hidden dimension of 14,336. It uses Grouped Query Attention (GQA) with 32 query heads sharing 8 key-value heads, which shrinks the key-value cache and speeds up inference relative to standard Multi-Head Attention. Positional information is encoded with Rotary Positional Embeddings (RoPE). The feed-forward blocks use SwiGLU activations, and RMS Normalization is applied before each attention and feed-forward block to stabilize training. Attention is computed with FlashAttention kernels for faster processing. The context window is nominally 128,000 tokens, but performance is typically best with contexts of up to 40,000 tokens.
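The memory benefit of GQA can be made concrete with a rough key-value cache estimate. The sketch below uses the layer count, head counts, and head dimension listed on this page, and assumes a 16-bit cache (2 bytes per value), a single sequence, and a 40,000-token context. The function name `kv_cache_bytes` is illustrative, not part of any library.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Values as listed on this page; the 16-bit cache is an assumption.
LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM = 32, 32, 8, 128
CONTEXT = 40_000

gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
# Hypothetical multi-head baseline: every query head keeps its own K/V.
mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, CONTEXT)

print(f"GQA cache: {gqa / 2**30:.2f} GiB")
print(f"MHA cache: {mha / 2**30:.2f} GiB")
print(f"Reduction: {mha // gqa}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to key-value heads (32 / 8 = 4), independent of context length.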

Magistral Small is particularly suited to applications requiring structured calculations, programmatic logic, decision trees, and rule-based systems. Its design supports fast-response conversational agents, long-document understanding, and domain-specific fine-tuning. It also supports agentic AI workflows through native function calling and structured output generation.

About Magistral

Magistral is Mistral AI's first reasoning model series, purpose-built for transparent, step-by-step reasoning with native multilingual capabilities. It performs chain-of-thought reasoning in the user's language and exposes a traceable thought process. It excels at domain-specific problems requiring multi-step logic, from legal research and financial forecasting to software development and creative storytelling. It supports reasoning in numerous languages, including English, French, Spanish, German, Italian, Arabic, Russian, and Chinese.



Evaluation Benchmarks

Rank: #129

Benchmark                            Score   Rank
[benchmark name missing in source]   0.346   29
MMLU Pro (Professional Knowledge)    0.62    53

Rankings

Overall Rank

#129

Coding Rank

#102

Model Integrity

Total Score

B+

75 / 100

Magistral Small Model Integrity Report

Audit Note

Magistral Small 24B exhibits a strong transparency profile, particularly regarding its open-source licensing, architectural specifications, and the disclosure of its reasoning-specific training methodology. While it excels in providing the technical details necessary for local deployment and verification, it remains less transparent about the specific composition of its massive pre-training datasets and the total compute resources consumed during its development.

Upstream

21.5 / 30

Architectural Provenance

8.0 / 10

The model's lineage is clearly documented as being built upon Mistral Small 3.1 (2503). The technical architecture is well-defined as a 24B dense decoder-only transformer with 32 layers, utilizing Grouped Query Attention (GQA), SwiGLU activations, and Rotary Positional Embeddings (RoPE). The training methodology is explicitly described in the accompanying paper (arXiv:2506.10910), detailing a 'cold-start' SFT process using reasoning traces from Magistral Medium followed by Reinforcement Learning from Verifiable Rewards (RLVR).

Dataset Composition

4.5 / 10

While the training methodology (SFT traces + RLVR) is well-documented, the specific composition of the underlying pre-training data for the base model (Mistral Small 3.1) remains largely undisclosed beyond general categories. The reasoning-specific data is described as being derived from 'Magistral Medium traces' and 'verifiable reward' tasks (math, code), but precise dataset distributions, source names, and filtering/cleaning metrics for the vast majority of the training corpus are absent.

Tokenizer Integrity

9.0 / 10

The model uses the 'Tekken' tokenizer, which is part of the publicly available 'mistral-common' library. It features a vocabulary size of 131,072 tokens and is specifically optimized for over 24 languages and source code. The tokenizer's performance and alignment with the model's multilingual claims are verifiable through the public GitHub repository and Hugging Face model files.

Model

28.5 / 40

Parameter Density

8.5 / 10

The model is explicitly stated to be a 24.0 billion parameter dense architecture. Unlike Mixture-of-Experts (MoE) models where active parameters can be ambiguous, this dense configuration ensures all 24B parameters are active during inference. Detailed architectural specifications, including the hidden dimension (14,336) and head counts (32 attention, 8 KV), are provided in the technical documentation.

Training Compute

3.5 / 10

Information regarding the specific compute resources used for training is minimal. While the paper discusses the 'asynchronous system' and infrastructure for online RL, it lacks concrete data on total GPU/TPU hours, hardware counts, carbon footprint, or energy consumption. The disclosure is limited to high-level descriptions of the training pipeline rather than quantifiable resource metrics.

Benchmark Reproducibility

7.5 / 10

Mistral provides specific results for standard benchmarks (AIME24, GPQA, LiveCodeBench) and includes the exact sampling parameters (top_p: 0.95, temp: 0.7) and system prompts required to replicate the reasoning behavior. Third-party reproduction guides (e.g., via promptfoo) are already available. However, the full evaluation code and the complete set of internal prompts used for all reported metrics are not entirely public.
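To illustrate what the two sampling parameters above control, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It is a generic illustration of the technique, not Mistral's evaluation code. The function names and the toy logits are made up for the example.

```python
import math
import random


def top_p_filter(logits, temperature=0.7, top_p=0.95):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    # Temperature < 1 sharpens the distribution; subtract the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = top_p_filter(logits, **kwargs)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -3.0]  # hypothetical logits for a 4-token vocabulary
print(top_p_filter(toy_logits))
print(sample(toy_logits, rng=random.Random(0)))
```

With these toy logits the least likely token falls outside the 0.95 nucleus and can never be sampled, which is the intended effect of top-p filtering.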

Identity Consistency

9.0 / 10

The model demonstrates high identity consistency, correctly identifying itself as a Mistral-developed reasoning model. It maintains clear versioning (2506) and is transparent about its specific 'Think' mode capabilities and the 40k-128k context window limitations. There are no documented instances of the model claiming to be a competitor or misrepresenting its 24B scale.

Downstream

24.5 / 30

License Clarity

10.0 / 10

The model is released under the highly permissive Apache 2.0 license, which is explicitly stated on Hugging Face, the official blog, and in the technical paper. This license allows for unrestricted commercial use, modification, and distribution, providing maximum legal clarity for downstream users without conflicting proprietary terms.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented for various deployment scenarios. Official documentation specifies that the model fits on a single RTX 4090 (24GB VRAM) or a 32GB RAM Mac when quantized (4-bit). VRAM estimates for different quantization levels (Q4, Q8) are provided by the community and supported by official GGUF releases, with clear guidance on the 40k token context performance threshold.
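A back-of-the-envelope calculation shows why 4-bit quantization is what brings the model within reach of a 24 GB card. The sketch below counts weight storage only, using the 24B parameter figure from this page. It assumes 1 GB = 10^9 bytes and ignores quantization metadata (scales, zero points), the key-value cache, and activations, so real usage will be higher. `weight_gb` is an illustrative helper name.

```python
PARAMS = 24e9  # parameter count stated on this page


def weight_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), excluding all overheads."""
    return num_params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_gb(PARAMS, bits):.0f} GB")
```

At 16 bits the weights alone come to roughly 48 GB and at 8 bits roughly 24 GB, leaving no headroom on a 24 GB GPU. At 4 bits they take roughly 12 GB, which leaves room for the cache and runtime overhead.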

Versioning Drift

6.5 / 10

Mistral uses a date-based versioning scheme (2506) and maintains a clear distinction between variants (Small vs. Medium). A changelog for the 'Magistral' family is emerging, but the model is still relatively new, and long-term tracking of behavioral drift or silent updates is not yet established. Previous versions remain accessible on Hugging Face.
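For tooling that tracks releases, a date-based tag such as 2506 can be turned into a calendar month. The sketch below assumes the tag follows a YYMM layout, which matches the June 2025 release date listed on this page but is an assumption rather than a documented guarantee. `parse_yymm` is a hypothetical helper.

```python
from datetime import date


def parse_yymm(tag):
    """Interpret a 4-digit tag as YYMM and return the first day of that month."""
    if len(tag) != 4 or not tag.isdigit():
        raise ValueError(f"expected a 4-digit YYMM tag, got {tag!r}")
    year, month = 2000 + int(tag[:2]), int(tag[2:])
    # date() raises ValueError if the month is outside 1..12.
    return date(year, month, 1)


print(parse_yymm("2506"))  # 2025-06-01
```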
