Ministral-8B-2410

Parameters

8B

Context Length

128K

Modality

Text

Architecture

Dense

License

Mistral Research License

Release Date

10 Oct 2024

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

32

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

RoPE

RoPE Theta

100,000,000

Sliding Window Attention

Yes

Sliding Window Size

32,768

Normalization

RMS Normalization

Activation Function

Swish

Dimensions

Hidden Dimension Size

4,096

Number of Layers

36

FFN Intermediate Size (Dense)

12,288

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

131,072

Architecture Diagram

Input Tokens
→ Token Embedding (Position: RoPE · Hidden: 4.1k · Context: 128K · Vocab: 131.1k)
→ 36 layers, each: RMSNorm (Pre-Attention) → Grouped-Query Attention (32 Q / 8 KV heads · SW: 32.8k · Head dim: 128) → residual add → RMSNorm (Pre-FFN) → Feed-Forward Network (Swish · Intermediate: 12.3k) → residual add
→ Final RMSNorm
→ Output Logits

Ministral-8B-2410

Ministral-8B-2410 is a large language model developed by Mistral AI with approximately 8.0 billion parameters. It belongs to the "les Ministraux" model family, introduced alongside Ministral 3B, and targets local, on-device, and edge computing use cases. The family aims to provide compute-efficient, low-latency inference for applications that run in resource-constrained environments or require privacy-first local data processing. An instruct-tuned variant, Ministral-8B-Instruct-2410, is also available.

Ministral-8B-2410 is a dense Transformer with 36 layers, 32 attention heads, and a model (embedding) dimension of 4096; each feed-forward block expands to an intermediate dimension of 12288. It supports a 128,000-token context window through an interleaved sliding-window attention pattern. Grouped-Query Attention (GQA) with 8 key-value heads shrinks the key-value cache, which improves inference speed and memory efficiency. The model uses the V3-Tekken tokenizer with a vocabulary of 131,072 tokens.
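To make the memory effect of GQA concrete, the following is a minimal sketch that estimates the key-value cache size per token from the figures listed above. The 2-byte (BF16) cache precision and the 131,072-token context (128 × 1024) are assumptions for illustration, and the estimate ignores any savings from sliding-window layers.

```python
# Sketch: per-token KV-cache size under grouped-query attention (GQA).
# Layer/head figures come from the specification on this page.
N_LAYERS = 36
N_HEADS = 32          # query heads
N_KV_HEADS = 8        # key-value heads shared across query heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # assumption: cache stored in BF16
CONTEXT_TOKENS = 131_072  # assumption: "128K" read as 128 * 1024


def kv_bytes_per_token(n_kv_heads: int) -> int:
    # Keys and values (factor 2) are cached for every layer and KV head.
    return 2 * N_LAYERS * n_kv_heads * HEAD_DIM * BYTES_PER_VALUE


gqa = kv_bytes_per_token(N_KV_HEADS)   # cache with 8 KV heads
mha = kv_bytes_per_token(N_HEADS)      # hypothetical cache with 32 KV heads
full_context_gib = gqa * CONTEXT_TOKENS / 2**30

print(gqa)               # 147456 bytes per token
print(mha // gqa)        # 4 (GQA cache is 4x smaller than full multi-head)
print(full_context_gib)  # 18.0 GiB for a full-length, full-attention cache
```

Under these assumptions, sharing 8 key-value heads across 32 query heads cuts the cache to a quarter of what one KV head per query head would need.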

Ministral-8B-2410 handles a range of natural language processing tasks, including content generation, question answering, and code generation or assistance. It is noted for strong multilingual performance across 10 major languages and for built-in function calling, which lets it drive API interactions. These traits suit it to practical applications such as on-device translation, internet-independent smart assistants, local data analytics, and autonomous robotics, where low latency and efficient processing matter. The model can also act as an efficient intermediary that handles function calls within multi-step agentic workflows.

About Ministral

The Ministral model family, developed by Mistral AI, includes 3B and 8B parameter versions for on-device and edge computing. Designed for compute efficiency and low latency, these models support up to 128K context length. The 8B version incorporates an interleaved sliding-window attention pattern for efficient inference.
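As an illustration of sliding-window attention in general, here is a minimal sketch of a causal mask in which each query sees only itself and the most recent preceding tokens. The function name and the tiny sizes are hypothetical, window conventions vary between implementations, and the sketch shows a single windowed layer only; how Ministral interleaves window sizes across layers is not described on this page.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Return mask[q][k], True when query position q may attend to key k.

    A key is visible if it is not in the future (k <= q) and lies within
    the last `window` positions, counting the query position itself.
    """
    return [
        [(k <= q) and (q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


# Toy example: 6 tokens, window of 3 (the page lists 32,768 for the model).
mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x.....
# xx....
# xxx...
# .xxx..
# ..xxx.
# ...xxx
```

Each row has at most `window` visible keys, so attention cost per token stays bounded as the sequence grows.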


Evaluation Benchmarks

Rank

#131

Category | Benchmark | Score | Rank
General Knowledge | MMLU | 0.65 | 33
Web Development | WebDev Arena | 1237 | 91
General Text | Text Arena | 1237 | 94

Rankings

Overall Rank

#131

Coding Rank

#107

Model Integrity

Total Score

B-

61 / 100

Ministral-8B-2410 Model Integrity Report

Audit Note

Ministral-8B-2410 provides strong transparency regarding its technical architecture and tokenizer, offering precise parameter counts and structural details. However, it remains opaque concerning its training data and compute resources, relying on vague marketing descriptions for its dataset. While the model is accessible for research, its custom license and lack of detailed evaluation methodology hinder full reproducibility and commercial clarity.

Upstream

18.5 / 30

Architectural Provenance

7.5 / 10

The model is explicitly identified as a dense Transformer with 36 layers, 32 attention heads, and an embedding dimension of 4096. Mistral AI provides technical specifics such as the use of Grouped Query Attention (GQA) with 8 KV heads and an interleaved sliding-window attention mechanism to support its 128k context window. While the high-level architecture is well-documented in the release blog and model cards, a formal peer-reviewed technical paper detailing the specific pre-training methodology or architectural innovations beyond the summary is not publicly available.

Dataset Composition

2.5 / 10

Mistral AI provides only vague descriptions of the training data, stating it was trained on a 'large proportion of multilingual and code data' with a cutoff of June 2024. There is no public disclosure of specific data sources, no percentage breakdown of dataset components (e.g., web vs. books vs. code), and no documentation regarding data filtering, cleaning, or de-duplication processes. This falls under the 'vague marketing claims' category for data provenance.

Tokenizer Integrity

8.5 / 10

The model uses the V3-Tekken tokenizer, which is publicly accessible via the 'mistral-common' library. The vocabulary size is clearly stated as 131,072 tokens. The tokenizer's support for 10+ languages and its specific BPE merges are verifiable through the provided configuration files on Hugging Face and GitHub. Documentation on the tokenizer's training alignment is present, though minor integration issues with third-party tools like TGI have been noted by the community.

Model

24.0 / 40

Parameter Density

9.0 / 10

The total parameter count is precisely disclosed as 8,019,808,256. As a dense model, all parameters are active during inference, which is clearly stated. The architectural breakdown (layers, heads, hidden dimensions) is fully provided in the model card, allowing for a complete understanding of parameter distribution across the network.
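The disclosed total can be cross-checked from the architectural figures. The sketch below assumes a layout the page does not spell out: untied input and output embeddings, bias-free linear layers, a gated feed-forward block with three weight matrices, and two RMSNorm weight vectors per layer plus a final one. With a model dimension of 4096 and those assumptions, the count reproduces the stated figure.

```python
# Sketch: reconstruct the parameter count from the listed dimensions.
VOCAB = 131_072
DIM = 4_096        # model (embedding) dimension
N_LAYERS = 36
N_HEADS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
FFN_DIM = 12_288   # feed-forward intermediate size


def count_parameters() -> int:
    embed = VOCAB * DIM                      # input token embedding
    lm_head = VOCAB * DIM                    # assumption: untied output head
    q_proj = DIM * N_HEADS * HEAD_DIM
    k_proj = DIM * N_KV_HEADS * HEAD_DIM     # GQA: only 8 KV heads
    v_proj = DIM * N_KV_HEADS * HEAD_DIM
    o_proj = N_HEADS * HEAD_DIM * DIM
    attention = q_proj + k_proj + v_proj + o_proj
    ffn = 3 * DIM * FFN_DIM                  # assumption: gate, up, down
    norms = 2 * DIM                          # pre-attention and pre-FFN RMSNorm
    per_layer = attention + ffn + norms
    final_norm = DIM
    return embed + lm_head + N_LAYERS * per_layer + final_norm


print(f"{count_parameters():,}")  # 8,019,808,256
```

Roughly 87% of the total sits in the 36 Transformer layers, with the feed-forward blocks dominating each layer.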

Training Compute

1.0 / 10

There is virtually no public information regarding the compute resources used to train Ministral-8B. Mistral AI has not disclosed GPU/TPU hours, hardware specifications used for training, training duration, or the carbon footprint of the process. The absence of this information is a significant transparency gap, typical of proprietary-leaning 'open-weight' releases.

Benchmark Reproducibility

5.0 / 10

Mistral provides a range of benchmark results (MMLU, AGIEval, HumanEval, etc.) in their release blog and model cards. However, they do not provide the exact evaluation code, specific prompts, or few-shot configurations used to achieve these scores. While some third-party verification exists on public leaderboards, the lack of official reproduction instructions or prompt transparency limits the ability to independently verify the claimed performance delta over competitors.

Identity Consistency

9.0 / 10

The model demonstrates high identity consistency, correctly identifying itself and its version (2410) in standard deployments. It does not exhibit the identity confusion seen in some fine-tuned models that claim to be GPT-4. Version tracking is clear through the '2410' suffix, and the model's capabilities/limitations are generally well-communicated in the documentation.

Downstream

18.0 / 30

License Clarity

6.5 / 10

The model is released under the 'Mistral Research License' (MRL), which is a custom license. While the terms are clearly written and accessible, it is not a standard OSI-approved open-source license. It permits research and non-commercial use but requires a separate agreement for commercial applications. This creates a 'look but don't touch' environment for many users, and the definition of 'commercial' can be subject to interpretation, though the text is legally explicit.

Hardware Footprint

7.0 / 10

Official documentation provides clear VRAM requirements for standard inference (e.g., 24GB for a single GPU in BF16). Community-driven documentation (e.g., Bartowski's GGUF quants) provides extensive details on VRAM requirements for various quantization levels (Q4, Q8, etc.) and their associated quality tradeoffs. While Mistral's own documentation is slightly less granular on quantization, the overall ecosystem provides good visibility into hardware needs.
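As a rough sanity check on these figures, the following sketch estimates the memory taken by the weights alone at a few nominal precisions. It is only a lower bound: the bits-per-weight values are idealized assumptions (real quantization formats add per-block overhead), and the estimate excludes the KV cache, activations, and framework overhead.

```python
# Sketch: weight-only memory estimate at nominal precisions.
N_PARAMS = 8_019_808_256  # parameter count stated in the report above


def weight_gib(bits_per_weight: float) -> float:
    # bits -> bytes -> GiB
    return N_PARAMS * bits_per_weight / 8 / 2**30


# Nominal bit widths are assumptions, not measured file sizes.
for name, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_gib(bits):.1f} GiB")
# BF16: 14.9 GiB
# 8-bit: 7.5 GiB
# 4-bit: 3.7 GiB
```

The BF16 estimate of about 15 GiB leaves headroom for cache and activations on a 24 GB GPU, which is consistent with the requirement quoted above.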

Versioning Drift

4.5 / 10

Mistral uses a date-based versioning system (2410), which provides some clarity. However, they have a history of 'silent' updates to their API endpoints and have recently downscaled temperature parameters for this model family to 'unify behavior' without a major version bump. There is no comprehensive public changelog for the weights themselves, making it difficult to track if the underlying model has been modified since release.
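The section totals and the overall score can be checked directly from the sub-scores listed in this report. In the sketch below, the round-half-up rule used to reach 61 is an assumption; the page does not state how the total is rounded.

```python
import math

# Sub-scores as listed in the integrity report.
scores = {
    "Upstream": {
        "Architectural Provenance": 7.5,
        "Dataset Composition": 2.5,
        "Tokenizer Integrity": 8.5,
    },
    "Model": {
        "Parameter Density": 9.0,
        "Training Compute": 1.0,
        "Benchmark Reproducibility": 5.0,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 6.5,
        "Hardware Footprint": 7.0,
        "Versioning Drift": 4.5,
    },
}

section_totals = {name: sum(parts.values()) for name, parts in scores.items()}
total = sum(section_totals.values())
# Assumption: half-up rounding. Python's built-in round() would give 60 here.
reported = math.floor(total + 0.5)

print(section_totals)  # {'Upstream': 18.5, 'Model': 24.0, 'Downstream': 18.0}
print(total)           # 60.5
print(reported)        # 61
```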
