| Specification | Value |
|---|---|
| Total Parameters | 46.7B |
| Active Parameters | 12.9B |
| Context Length | 32,768 tokens |
| Modality | Text |
| Architecture | Mixture of Experts (MoE) |
| License | Apache 2.0 |
| Release Date | 9 Dec 2023 |
| Knowledge Cutoff | Nov 2022 |

Attention

| Specification | Value |
|---|---|
| Attention Structure | Grouped-Query Attention |
| Attention Heads | 32 |
| Key-Value Heads | 8 |
| Attention Head Dimension | 128 |
| Position Embedding | RoPE |
| RoPE Theta | 1,000,000 |
| Sliding Window Attention | No |
| Sliding Window Size | - |
| Normalization | RMS Normalization |
| Activation Function | Swish |

Dimensions

| Specification | Value |
|---|---|
| Hidden Dimension Size | 4,096 |
| Number of Layers | 32 |
| FFN Intermediate Size (Dense) | 14,336 |
| Multi-Token Prediction Heads | - |

Tokenizer

| Specification | Value |
|---|---|
| Vocabulary Size | 32,000 |

Mixture of Experts

| Specification | Value |
|---|---|
| Total Expert Parameters | 7.0B |
| Number of Experts | 8 |
| Active Experts | 2 |
| Shared Experts | - |
| FFN Intermediate Size (per Expert) | 14,336 |
| Dense Layers Before MoE | - |

Mixtral-8x7B-v0.1 is a generative large language model developed by Mistral AI and built on a Sparse Mixture of Experts (SMoE) architecture. The design activates only a subset of the model's parameters for each token, which keeps inference cost well below what the total parameter count would suggest. The model targets general-purpose text generation and language understanding.
The model is a decoder-only transformer in which each layer's feedforward block is replaced by a Mixture-of-Experts layer containing eight feedforward blocks, known as 'experts'. For each token, a router network selects two of these experts and combines their outputs as a weighted sum, with the weights produced by the router. As a result, the model holds 46.7 billion parameters in total but uses only 12.9 billion per token during inference, so per-token compute stays well below what the total count suggests. The architecture also uses Grouped-Query Attention (GQA) and supports Flash Attention.
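The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not Mistral AI's implementation: the function names `top2_route` and `moe_forward`, the stand-in experts, and the example logits are all hypothetical, and the choice to apply a softmax over only the two selected router logits is an assumption about how the gate weights are normalized.

```python
import math

def top2_route(router_logits):
    """Return [(expert_index, weight), (expert_index, weight)] for one token.

    Assumption: gate weights come from a softmax over the two selected logits.
    """
    # Indices of the two highest-scoring experts, best first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:2]
    # Numerically stable softmax restricted to the selected logits.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, router_logits, experts):
    """Weighted sum of the outputs of the two selected experts."""
    out = [0.0] * len(x)
    for idx, weight in top2_route(router_logits):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy stand-ins for the 8 experts: expert k scales its input by (k + 1).
experts = [lambda v, k=k: [(k + 1) * vi for vi in v] for k in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.7]  # hypothetical router output

routes = top2_route(logits)
y = moe_forward([1.0, -2.0], logits, experts)
print(routes)
print(y)
```

The six experts that are not selected are never called, which is where the gap between total and active parameters comes from.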
Mixtral-8x7B-v0.1 supports a context length of 32,768 tokens (32K), allowing it to process and generate responses based on long textual inputs. The model handles multilingual tasks in English, French, Italian, German, and Spanish, and performs strongly on code generation. It can also be fine-tuned for instruction following, which makes it a suitable base for interactive applications that must adhere closely to user commands.
The Mixtral model family, developed by Mistral AI, employs a sparse Mixture-of-Experts (SMoE) architecture. This design utilizes multiple expert networks per layer, where a router selects a subset to process each token. This enables large total parameter counts while maintaining computational efficiency by activating only a fraction of parameters per forward pass.
Rank
#146
| Benchmark | Score | Rank |
|---|---|---|
| Web Development (WebDev Arena) | 1197 | 82 |
Overall Rank
#146
Coding Rank
#101
Total Score
63 / 100
Mixtral-8x7B-v0.1 exhibits a bifurcated transparency profile, excelling in architectural and licensing clarity while remaining almost entirely opaque regarding its training data and compute resources. The model sets a high standard for MoE parameter disclosure but fails to provide the necessary evidence to verify its data provenance or environmental impact. Users can trust the technical implementation and legal framework, but the 'black box' nature of its training remains a significant concern.
Architectural Provenance
The model's Sparse Mixture of Experts (SMoE) architecture is well documented in the official technical report and blog posts, which specify 8 experts per layer, a router that selects 2 experts per token, and Grouped-Query Attention (GQA). The base architecture is a modification of the Mistral 7B design, and the changes are described in enough technical detail to reimplement, although some proprietary training recipes remain undisclosed.
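To make the Grouped-Query Attention figures concrete, the sketch below maps the 32 query heads onto the 8 key-value heads listed in the specification. The contiguous grouping (query heads 0 to 3 share KV head 0, and so on) is an assumption about head layout, and the helper name `kv_head_for` is hypothetical.

```python
N_Q_HEADS = 32    # attention heads (from the spec table)
N_KV_HEADS = 8    # key-value heads (from the spec table)
HEAD_DIM = 128    # attention head dimension (from the spec table)

GROUP_SIZE = N_Q_HEADS // N_KV_HEADS  # query heads sharing one KV head

def kv_head_for(q_head):
    """KV head used by a query head, assuming contiguous groups."""
    return q_head // GROUP_SIZE

# Group query heads by the KV head they read from.
groups = {}
for q in range(N_Q_HEADS):
    groups.setdefault(kv_head_for(q), []).append(q)

# Width of the K and V projections per token, with and without grouping.
kv_width_gqa = 2 * N_KV_HEADS * HEAD_DIM
kv_width_mha = 2 * N_Q_HEADS * HEAD_DIM

print(GROUP_SIZE)                      # 4
print(groups[0])                       # [0, 1, 2, 3]
print(kv_width_mha // kv_width_gqa)    # 4
```

Under these numbers, keys and values stored per token are a quarter of what full multi-head attention with 32 KV heads would need.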
Dataset Composition
Mistral AI provides almost no transparency regarding the training data. Official documentation only states the model was 'pre-trained on internet-scale data' without disclosing specific sources, proportions, or filtering methodologies. There is no breakdown of data categories (e.g., web, code, books) or information on the data collection timeline, which is a major transparency gap.
Tokenizer Integrity
The tokenizer is publicly accessible via the 'mistral-common' and 'mistral-src' GitHub repositories. It uses a Byte-fallback BPE approach with a documented vocabulary size of 32,000 tokens. The instruction format and special tokens (BOS/EOS) are clearly defined in the model card, and the tokenizer's behavior is verifiable through open-source implementations.
Parameter Density
Mistral AI is highly transparent about the model's parameter density. They explicitly state the total parameter count (46.7B) and the active parameter count per token (12.9B). This distinction is crucial for MoE models and prevents the common industry practice of inflating effective parameter counts without disclosing inference costs.
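Both figures can be approximately reproduced from the dimensions in the specification table. The sketch below is a back-of-the-envelope count, not an official breakdown. It assumes three weight matrices per expert (a gated feedforward block), no bias terms, a linear router per layer, two RMSNorm weight vectors per layer plus a final one, and separate input and output embedding matrices.

```python
HIDDEN = 4096
LAYERS = 32
FFN = 14336
VOCAB = 32000
Q_HEADS, KV_HEADS, HEAD_DIM = 32, 8, 128
EXPERTS, ACTIVE_EXPERTS = 8, 2

# Attention: Q and O projections are HIDDEN x (Q_HEADS * HEAD_DIM);
# K and V projections are HIDDEN x (KV_HEADS * HEAD_DIM) because of GQA.
attn = 2 * HIDDEN * Q_HEADS * HEAD_DIM + 2 * HIDDEN * KV_HEADS * HEAD_DIM

expert = 3 * HIDDEN * FFN   # assumed gated FFN: gate, up, down matrices
router = HIDDEN * EXPERTS   # assumed linear router, no bias
norms = 2 * HIDDEN          # assumed two RMSNorm weight vectors per layer

def params(experts_counted):
    """Parameter count when `experts_counted` experts per layer are included."""
    per_layer = attn + experts_counted * expert + router + norms
    embeddings = 2 * VOCAB * HIDDEN + HIDDEN  # input + output embeddings, final norm
    return LAYERS * per_layer + embeddings

total = params(EXPERTS)
active = params(ACTIVE_EXPERTS)
print(f"total:  {total:,} (~{total / 1e9:.1f}B)")
print(f"active: {active:,} (~{active / 1e9:.1f}B)")
```

Under these assumptions the counts round to 46.7B and 12.9B, matching the stated figures. Note that the total is well short of 8 x 7B = 56B, because the attention and embedding weights are shared across experts rather than duplicated.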
Training Compute
There is virtually no public information regarding the compute resources used to train Mixtral-8x7B. The technical report does not disclose GPU/TPU hours, hardware specifications, training duration, or the carbon footprint. This lack of disclosure makes it impossible to verify the environmental impact or the scale of the training run.
Benchmark Reproducibility
While the model provides scores on standard benchmarks (MMLU, GSM8K, etc.) and compares them to Llama 2, the evaluation code and exact prompts used for these results are not fully public. Third-party evaluations (e.g., LMSYS Chatbot Arena) provide some external verification, but the lack of a reproducible evaluation suite from the provider limits transparency. A penalty was applied due to documented evidence of benchmark contamination in the training data for GSM8K.
Identity Consistency
The model demonstrates high identity consistency, correctly identifying itself as a Mistral AI model in most deployments. It maintains a clear versioning scheme (v0.1) and does not suffer from the identity confusion seen in some other open-weight models that claim to be GPT-4 or other competitors.
License Clarity
The model is released under the Apache 2.0 license, which is a highly permissive and well-understood open-source license. There are no conflicting terms or hidden commercial restrictions in the official weights release, providing exemplary clarity for both commercial and non-commercial users.
Hardware Footprint
Hardware requirements are well-documented by both the provider and the community. Official docs provide VRAM estimates for different precisions (FP16, Int8, Int4), and the impact of the MoE architecture on memory vs. compute is clearly explained. While some specific context-length memory scaling details are missing, the general footprint is highly verifiable.
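A rough memory estimate can be derived from the specification table. The sketch below covers only weights and the key-value cache, ignoring activations and framework overhead. The function names are hypothetical, and the 16-bit cache precision is an assumption. All experts count toward weight memory, since every expert must be resident even though only two run per token.

```python
TOTAL_PARAMS = 46_700_000_000  # stated total parameter count

def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to hold the weights at a given precision."""
    return n_params * bits_per_weight // 8

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Bytes of K and V cached for one sequence (assumes 16-bit values)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens

for name, bits in (("FP16", 16), ("Int8", 8), ("Int4", 4)):
    print(f"{name}: {weight_bytes(TOTAL_PARAMS, bits) / 1e9:.1f} GB of weights")

for tokens in (1_024, 32_768):
    print(f"{tokens} tokens: {kv_cache_bytes(tokens) / 2**30:.3f} GiB of KV cache")
```

With these inputs, FP16 weights come to about 93.4 GB, and the cache grows linearly at 128 KiB per token, reaching 4 GiB for a single sequence at the full 32,768-token context.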
Versioning Drift
The model uses semantic versioning (v0.1), but the changelog and update history are sparse. While the initial release was well-documented, there is limited information on how subsequent minor updates or weight adjustments are tracked, and no formal deprecation path for older versions is established.