Parameters
7.3B
Context Length
8,192
Modality
Text
Architecture
Dense
License
Apache 2.0
Release Date
27 Sept 2023
Knowledge Cutoff
Aug 2021
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
32
Key-Value Heads
8
Attention Head Dimension
-
Position Embedding
RoPE
RoPE Theta
10,000
Sliding Window Attention
Yes
Sliding Window Size
4,096
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
4,096
Number of Layers
32
FFN Intermediate Size (Dense)
14,336
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
32,000
Mistral-7B-v0.1 is a 7.3-billion-parameter large language model developed by Mistral AI. It is a decoder-only transformer designed for efficient inference, which makes it practical to deploy across a range of natural language processing applications.
Mistral 7B combines Grouped-Query Attention with Sliding Window Attention to process long sequences efficiently. A Rolling Buffer Cache limits key-value memory use during generation.
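The interaction between Sliding Window Attention and the Rolling Buffer Cache can be sketched in a few lines. This is a minimal illustration, not Mistral's reference code: the function names are hypothetical, and it assumes the window counts the current token, so each query sees at most 4,096 positions. Implementations differ on that off-by-one boundary.

```python
WINDOW = 4096  # sliding window size from the specification table


def can_attend(query_pos: int, key_pos: int, window: int = WINDOW) -> bool:
    """Causal sliding-window rule (assumed): a query sees itself and the
    window - 1 tokens immediately before it."""
    return 0 <= query_pos - key_pos < window


def cache_slot(position: int, window: int = WINDOW) -> int:
    """Rolling buffer: the entry for an absolute position overwrites
    slot position % window, so the cache never grows past the window."""
    return position % window


def cached_positions(current_pos: int, window: int = WINDOW) -> list[int]:
    """Absolute positions still held in the buffer after writing current_pos."""
    start = max(0, current_pos - window + 1)
    return list(range(start, current_pos + 1))
```

Because every position still inside the window maps to a distinct slot, the cache holds exactly the keys and values that the attention rule can reach, and older entries are overwritten in place.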
No evaluation benchmarks are available for Mistral-7B-v0.1.
Overall Rank
-
Coding Rank
-
Total Score
66 / 100
Mistral 7B v0.1 demonstrates strong transparency in its architectural design and licensing, providing a clear blueprint for its technical innovations and permissive usage. However, it remains highly opaque regarding its upstream data sources and the environmental/computational costs of its training. While the model's identity and technical specifications are well-defined, the lack of a reproducible evaluation harness limits independent verification of its benchmark claims.
Architectural Provenance
Mistral 7B v0.1 is a dense decoder-only transformer with well-documented architectural innovations, specifically Grouped-Query Attention (GQA) and Sliding Window Attention (SWA). The official technical report and blog post provide clear specifications: 32 layers, 4096 hidden dimension, 14336 intermediate dimension, and 32 heads. While the model is described as 'trained from scratch,' the specific initialization and optimization hyperparameters are not documented in detail.
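To make the GQA layout concrete, the sketch below maps each of the 32 query heads to one of the 8 key-value heads listed in the specification table. The helper name is hypothetical, and the contiguous grouping (heads 0 to 3 share key-value head 0, and so on) is an assumption about ordering rather than something this page states.

```python
N_HEADS = 32     # query heads, from the specification table
N_KV_HEADS = 8   # key-value heads, from the specification table
GROUP_SIZE = N_HEADS // N_KV_HEADS  # query heads sharing one key-value head


def kv_head_for(query_head: int) -> int:
    """Return the key-value head used by a query head.

    Assumes contiguous grouping of query heads (illustrative only).
    """
    if not 0 <= query_head < N_HEADS:
        raise ValueError("query head index out of range")
    return query_head // GROUP_SIZE
```

With these numbers each key-value head serves four query heads, so the key and value projections and the key-value cache are a quarter of the size they would be with one key-value head per query head.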
Dataset Composition
Mistral AI provides almost no transparency regarding the pretraining data. The official documentation and paper state only that it was trained on 'publicly available' data and mention English, French, and code. There is no disclosure of specific sources, no percentage breakdown of the 8 trillion tokens claimed by some third-party sources, and no detailed filtering or cleaning methodology. This is a significant gap in transparency.
Tokenizer Integrity
The model uses a Byte-fallback BPE tokenizer with a vocabulary size of 32,000 tokens. The tokenizer is publicly accessible via the official GitHub repository and Hugging Face, allowing for full inspection. It is well-documented as being based on the Llama tokenizer but with minor modifications for efficiency. The byte-fallback mechanism ensures no out-of-vocabulary issues, and its implementation is verifiable through the provided reference code.
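The byte-fallback step can be illustrated with a toy encoder. This is not the Mistral tokenizer: it skips BPE merges entirely, works character by character, and uses a hypothetical lowercase vocabulary. It only shows how a character missing from the vocabulary is replaced by one piece per UTF-8 byte, written here in the `<0xNN>` form used by SentencePiece-style tokenizers.

```python
def encode_with_byte_fallback(text: str, vocab: set[str]) -> list[str]:
    """Toy character-level encoder: known characters map to themselves,
    unknown ones fall back to one <0xNN> piece per UTF-8 byte."""
    pieces: list[str] = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            # Fallback: emit the character's UTF-8 bytes as byte pieces.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


toy_vocab = set("abcdefghijklmnopqrstuvwxyz ")  # hypothetical vocabulary
pieces = encode_with_byte_fallback("caf\u00e9", toy_vocab)
```

Since every string has a UTF-8 encoding and all 256 byte values can be given their own pieces, no input is ever unrepresentable, which is the property the paragraph above describes.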
Parameter Density
The model is explicitly stated to have 7.3 billion parameters. As a dense model, all parameters are active during inference, which is clearly communicated. The architectural breakdown (layers, heads, dimensions) is fully provided in the technical report, allowing for precise calculation of parameter distribution across attention and FFN blocks.
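As a rough check, the parameter count can be reconstructed from the figures in the specification table. The sketch below rests on several assumptions that this page does not state: a head dimension of 128 (hidden size divided by the number of heads), no bias terms, three FFN matrices per layer (the usual SwiGLU layout), two RMSNorm weight vectors per layer plus a final one, and separate input and output embedding matrices.

```python
HIDDEN = 4096      # hidden dimension size
LAYERS = 32        # number of layers
N_HEADS = 32       # attention (query) heads
N_KV_HEADS = 8     # key-value heads
FFN = 14336        # FFN intermediate size
VOCAB = 32000      # vocabulary size
HEAD_DIM = HIDDEN // N_HEADS  # assumption: 128, not listed in the table


def attention_params() -> int:
    """Query, key, value, and output projections for one layer (no biases)."""
    q = HIDDEN * N_HEADS * HEAD_DIM
    kv = 2 * HIDDEN * N_KV_HEADS * HEAD_DIM
    o = N_HEADS * HEAD_DIM * HIDDEN
    return q + kv + o


def ffn_params() -> int:
    """Gate, up, and down projections for one layer (assumed SwiGLU layout)."""
    return 3 * HIDDEN * FFN


def total_params() -> int:
    """All layers, untied input/output embeddings, and the final norm."""
    per_layer = attention_params() + ffn_params() + 2 * HIDDEN  # two norms
    embeddings = 2 * VOCAB * HIDDEN
    return LAYERS * per_layer + embeddings + HIDDEN
```

Under these assumptions the total comes to about 7.24 billion parameters, close to the 7.3B figure quoted above, with the FFN blocks accounting for the large majority of the weights.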
Training Compute
Mistral AI has not disclosed the specific compute resources used for training Mistral 7B v0.1. There is no public information regarding GPU/TPU hours, hardware counts, training duration, or the carbon footprint of the pretraining phase. Third-party audits (e.g., Stanford CRFM) confirm this lack of disclosure. The only verifiable detail is that it was trained on a CoreWeave cluster.
Benchmark Reproducibility
While Mistral provides a comprehensive list of benchmark results (MMLU, GSM8K, etc.) and specifies the few-shot settings used, they do not release the exact evaluation code or the specific prompts/examples used to achieve those scores. This makes exact reproduction difficult for independent researchers. Third-party leaderboards often show variance from official claims, and the lack of a public evaluation harness is a notable transparency deficit.
Identity Consistency
The model consistently identifies as a Mistral AI product and maintains a clear versioning identity (v0.1). It does not exhibit the identity confusion seen in some fine-tuned models that claim to be GPT-4. Documentation clearly distinguishes between the base model and the 'Instruct' variant, and the model's behavior is generally aligned with its stated capabilities as a base foundation model.
License Clarity
The model weights and reference code are released under the Apache 2.0 license, which is a standard, highly permissive open-source license. There are no conflicting terms or 'open-weights' restrictions that limit commercial use. The licensing is prominently displayed on the official website, GitHub, and Hugging Face repository, providing maximum clarity for downstream users.
Hardware Footprint
Hardware requirements are well-understood due to the model's popularity and the availability of reference implementations. While the official documentation provides some guidance on VRAM (approx. 15GB for FP16), it lacks a detailed official breakdown of quantization tradeoffs (e.g., perplexity loss at 4-bit). However, the community and third-party documentation extensively fill this gap with verifiable data for various quantization formats (bitsandbytes, GGUF).
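The FP16 figure can be sanity-checked with simple arithmetic. The sketch below is an estimate only: the function names are hypothetical, the head dimension of 128 is an assumption (hidden size divided by the number of heads), and the weight estimate ignores activations, framework overhead, and the extra scale and zero-point storage that real quantization formats add.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone at a given precision."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Key-value cache size: one key and one value vector per layer,
    per key-value head, per cached token (FP16 values by default)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


fp16_gb = weight_bytes(7.3e9, 16) / 1e9      # weights at 16 bits each
int4_gb = weight_bytes(7.3e9, 4) / 1e9       # idealized 4-bit weights
window_cache_mib = kv_cache_bytes(4096) / 2**20  # cache for one full window
```

At 16 bits per weight the 7.3B parameters alone take roughly 14.6 GB, consistent with the approximate 15GB figure above, while an idealized 4-bit format would need about a quarter of that before overhead.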
Versioning Drift
Mistral uses semantic-like versioning (v0.1, v0.2, v0.3), and a changelog is maintained on their documentation site. However, the documentation for v0.1 specifically is static, and while newer versions are released as separate entities, there is limited information on 'drift' or minor weight updates within the v0.1 lifecycle. The transition path between versions is mentioned but not deeply documented in terms of performance deltas.