Mixtral-8x7B-v0.1

Total Parameters

46.7B

Active Parameters

12.9B

Context Length

32,768 tokens

Modality

Text

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

9 Dec 2023

Knowledge Cutoff

Nov 2022

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

32

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

RoPE

RoPE Theta

1,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

Swish

Dimensions

Hidden Dimension Size

4,096

Number of Layers

32

FFN Intermediate Size (Dense)

14,336

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

32,000

Mixture of Experts

Total Expert Parameters

7.0B

Number of Experts

8

Active Experts

2

Shared Experts

-

FFN Intermediate Size (per Expert)

14,336

Dense Layers Before MoE

-

Architecture Diagram

[Diagram: Input Tokens → Token Embedding (Position: RoPE · Hidden: 4.1k · Context: 32.8k · Vocab: 32k) → 32 layers of [RMSNorm (Pre-Attention) → Grouped-Query Attention (32 Q / 8 KV heads, head dim 128) → residual add → RMSNorm (Pre-FFN) → Sparse MoE FFN (2/8 experts, Swish, intermediate 14.3k) → residual add] → Final RMSNorm → Output Logits]

Mixtral-8x7B-v0.1

Mixtral-8x7B-v0.1 is a generative large language model developed by Mistral AI, built on a Sparse Mixture of Experts (SMoE) architecture. The design activates only a subset of the model's parameters for each token, which keeps inference cost well below what the total parameter count would suggest. It is intended for general-purpose text generation and language understanding.

The model is a decoder-only transformer. In each layer, the feedforward block is a Mixture-of-Experts layer containing eight distinct feedforward blocks, known as 'experts'. A router network selects two of these experts for each token and combines their outputs as a weighted sum. As a result, the model has 46.7 billion total parameters but uses only about 12.9 billion per token during inference, which gives it the capacity of a large model at a much lower compute cost per token. The architecture also uses Grouped Query Attention (GQA) and supports Flash Attention for faster inference.
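
The routing step described above can be sketched in plain Python. This is a toy illustration, not Mistral AI's implementation: the names `top2_route` and `moe_layer` are hypothetical, the experts are stand-in callables, and the sketch assumes the gate weights come from a softmax over the two selected router logits.

```python
import math

def top2_route(router_logits):
    """Return [(expert_index, gate_weight), ...] for the two highest logits."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:2]
    # Assumption: softmax restricted to the two selected logits,
    # so the two gate weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(token, router_logits, experts):
    """Weighted sum of the outputs of the two selected experts."""
    out = [0.0] * len(token)
    for idx, gate in top2_route(router_logits):
        expert_out = experts[idx](token)  # only 2 of the 8 experts ever run
        out = [o + gate * y for o, y in zip(out, expert_out)]
    return out

# Toy setup: 8 stand-in "experts", each scaling the token by a different factor.
experts = [lambda x, k=k: [k * v for v in x] for k in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

routes = top2_route(logits)
result = moe_layer([1.0, 2.0], logits, experts)
print(routes)
print(result)
```

Here experts 1 and 4 have the highest (equal) logits, so each receives a gate weight of 0.5 and the remaining six experts are never evaluated. That skipped work is the source of the gap between total and active parameters.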

Mixtral-8x7B-v0.1 supports a context length of 32,768 tokens (32K), allowing it to process and generate responses based on long textual inputs. The model handles multilingual tasks in English, French, Italian, German, and Spanish, and performs strongly on code generation. It can also be fine-tuned for instruction following, which makes it a suitable foundation for interactive applications that must adhere closely to user commands.
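
To give a rough sense of what a full context window costs in memory, the key-value cache can be estimated from the attention figures in the specifications (32 layers, 8 key-value heads, head dimension 128). This is a back-of-the-envelope sketch: the helper `kv_cache_bytes` is hypothetical, and it assumes a 16-bit cache, a single sequence, and no framework overhead.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # The leading factor 2 accounts for storing both a key and a value.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

GIB = 1024 ** 3
gqa_cache = kv_cache_bytes(32_768)               # 8 KV heads (GQA)
mha_cache = kv_cache_bytes(32_768, kv_heads=32)  # hypothetical: one KV head per query head

print(f"GQA cache at 32,768 tokens: {gqa_cache / GIB:.0f} GiB")  # 4 GiB
print(f"MHA cache at 32,768 tokens: {mha_cache / GIB:.0f} GiB")  # 16 GiB
```

Under these assumptions, sharing 8 key-value heads across 32 query heads cuts the cache to a quarter of what full multi-head attention would need at the same context length.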

About Mixtral

The Mixtral model family, developed by Mistral AI, employs a sparse Mixture-of-Experts (SMoE) architecture. This design utilizes multiple expert networks per layer, where a router selects a subset to process each token. This enables large total parameter counts while maintaining computational efficiency by activating only a fraction of parameters per forward pass.


Evaluation Benchmarks

Rank

#146

Benchmark | Score | Rank
WebDev Arena (Web Development) | 1197 | 82

Rankings

Overall Rank

#146

Coding Rank

#101

Model Integrity

Total Score

B-

63 / 100

Mixtral-8x7B-v0.1 Model Integrity Report

Audit Note

Mixtral-8x7B-v0.1 exhibits a bifurcated transparency profile, excelling in architectural and licensing clarity while remaining almost entirely opaque regarding its training data and compute resources. The model sets a high standard for MoE parameter disclosure but fails to provide the necessary evidence to verify its data provenance or environmental impact. Users can trust the technical implementation and legal framework, but the 'black box' nature of its training remains a significant concern.

Upstream

18.0 / 30

Architectural Provenance

7.5 / 10

The model's Sparse Mixture of Experts (SMoE) architecture is well-documented in the official technical report and blog posts, which explicitly detail the use of 8 experts per layer with a router selecting 2 per token, and the use of Grouped Query Attention (GQA). The base architecture is clearly a modification of the Mistral 7B design, and the pretraining approach and architectural modifications are described in enough technical detail for implementation, although some proprietary training recipes remain undisclosed.
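
As an illustration of the GQA layout documented here, the sketch below maps the 32 query heads onto the 8 key-value heads listed in the specifications. It assumes that consecutive query heads share a key-value head, which is a common convention but is not stated on this page; `kv_head_for` is a hypothetical helper.

```python
N_Q_HEADS, N_KV_HEADS, HEAD_DIM, HIDDEN = 32, 8, 128, 4096

# Number of query heads that share a single key-value head.
group_size = N_Q_HEADS // N_KV_HEADS

def kv_head_for(q_head):
    """Index of the key-value head used by a given query head (assumed consecutive grouping)."""
    return q_head // group_size

mapping = {q: kv_head_for(q) for q in range(N_Q_HEADS)}

# Projection shapes implied by the head counts.
q_proj_shape = (HIDDEN, N_Q_HEADS * HEAD_DIM)    # (4096, 4096)
kv_proj_shape = (HIDDEN, N_KV_HEADS * HEAD_DIM)  # (4096, 1024)

print(group_size, mapping[0], mapping[3], mapping[4], mapping[31])
```

With these figures, the key and value projections are each a quarter the width of the query projection.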

Dataset Composition

2.0 / 10

Mistral AI provides almost no transparency regarding the training data. Official documentation only states the model was 'pre-trained on internet-scale data' without disclosing specific sources, proportions, or filtering methodologies. There is no breakdown of data categories (e.g., web, code, books) or information on the data collection timeline, which is a major transparency gap.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly accessible via the 'mistral-common' and 'mistral-src' GitHub repositories. It uses a Byte-fallback BPE approach with a documented vocabulary size of 32,000 tokens. The instruction format and special tokens (BOS/EOS) are clearly defined in the model card, and the tokenizer's behavior is verifiable through open-source implementations.

Model

23.0 / 40

Parameter Density

9.0 / 10

Mistral AI is highly transparent about the model's parameter density. They explicitly state the total parameter count (46.7B) and the active parameter count per token (12.9B). This distinction is crucial for MoE models and prevents the common industry practice of inflating effective parameter counts without disclosing inference costs.
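
The two figures can be cross-checked against the dimensions in the technical specifications. The sketch below is an independent estimate, not an official breakdown. It assumes details that this page does not state: each expert is a gated feedforward block with three weight matrices, the input and output embeddings are untied, and there are no bias terms.

```python
HIDDEN, LAYERS, FFN, VOCAB = 4096, 32, 14336, 32000
N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
N_EXPERTS, ACTIVE_EXPERTS = 8, 2

def count_params(experts_used):
    """Parameter count when `experts_used` experts per layer are counted."""
    # Query and output projections, plus the narrower key and value projections.
    attn = 2 * HIDDEN * N_Q_HEADS * HEAD_DIM + 2 * HIDDEN * N_KV_HEADS * HEAD_DIM
    expert = 3 * HIDDEN * FFN        # assumption: gate, up, and down matrices
    router = HIDDEN * N_EXPERTS      # the router always scores all 8 experts
    norms = 2 * HIDDEN               # pre-attention and pre-FFN RMSNorm weights
    per_layer = attn + experts_used * expert + router + norms
    embeddings = 2 * VOCAB * HIDDEN  # assumption: untied input and output embeddings
    final_norm = HIDDEN
    return LAYERS * per_layer + embeddings + final_norm

total = count_params(N_EXPERTS)
active = count_params(ACTIVE_EXPERTS)
print(f"total  = {total:,} ({total / 1e9:.1f}B)")    # 46,702,792,704 (46.7B)
print(f"active = {active:,} ({active / 1e9:.1f}B)")  # 12,879,925,248 (12.9B)
```

Under these assumptions the estimate reproduces both disclosed numbers. It also shows that nearly all of the difference comes from the six experts per layer that a given token does not use.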

Training Compute

1.0 / 10

There is virtually no public information regarding the compute resources used to train Mixtral-8x7B. The technical report does not disclose GPU/TPU hours, hardware specifications, training duration, or the carbon footprint. This lack of disclosure makes it impossible to verify the environmental impact or the scale of the training run.

Benchmark Reproducibility

4.0 / 10

Mistral AI reports scores on standard benchmarks (MMLU, GSM8K, etc.) and compares them to Llama 2, but the evaluation code and exact prompts behind these results are not fully public. Third-party evaluations (e.g., LMSYS Chatbot Arena) provide some external verification, but the lack of a reproducible evaluation suite from the provider limits transparency. A penalty was applied due to documented evidence of GSM8K benchmark contamination in the training data.

Identity Consistency

9.0 / 10

The model demonstrates high identity consistency, correctly identifying itself as a Mistral AI model in most deployments. It maintains a clear versioning scheme (v0.1) and does not suffer from the identity confusion seen in some other open-weight models that claim to be GPT-4 or other competitors.

Downstream

22.0 / 30

License Clarity

10.0 / 10

The model is released under the Apache 2.0 license, which is a highly permissive and well-understood open-source license. There are no conflicting terms or hidden commercial restrictions in the official weights release, providing exemplary clarity for both commercial and non-commercial users.

Hardware Footprint

7.0 / 10

Hardware requirements are well-documented by both the provider and the community. Official docs provide VRAM estimates for different precisions (FP16, Int8, Int4), and the impact of the MoE architecture on memory vs. compute is clearly explained. While some specific context-length memory scaling details are missing, the general footprint is highly verifiable.
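
A rough weight-only estimate for the three precisions mentioned above can be derived from the 46.7B total parameter count. The figures below are an approximation and not the provider's official numbers. They exclude the KV cache, activations, and any quantization metadata such as scales, and `weight_gb` is a hypothetical helper.

```python
TOTAL_PARAMS = 46.7e9  # all experts must be resident, even though only 2 of 8 run per token

BITS_PER_WEIGHT = {"FP16": 16, "Int8": 8, "Int4": 4}

def weight_gb(params, bits):
    """Approximate weight storage in decimal gigabytes."""
    return params * bits / 8 / 1e9

estimates = {name: weight_gb(TOTAL_PARAMS, bits)
             for name, bits in BITS_PER_WEIGHT.items()}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB")
```

The memory footprint follows the total parameter count, while compute per token follows the 12.9B active count. This is the memory-versus-compute trade-off of the MoE design.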

Versioning Drift

5.0 / 10

The model uses semantic versioning (v0.1), but the changelog and update history are sparse. While the initial release was well-documented, there is limited information on how subsequent minor updates or weight adjustments are tracked, and no formal deprecation path for older versions is established.
