ApX logoApX logo

Mistral-7B-v0.1

Parameters

7.3B

Context Length

8.192K

Modality

Text

Architecture

Dense

License

Apache 2.0

Release Date

27 Sept 2023

Knowledge Cutoff

Aug 2021

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

32

Key-Value Heads

8

Attention Head Dimension

-

Position Embedding

ROPE

RoPE Theta

10,000

Sliding Window Attention

Yes

Sliding Window Size

4,096

Normalization

RMS Normalization

Activation Function

SwigLU

Dimensions

Hidden Dimension Size

4,096

Number of Layers

32

FFN Intermediate Size (Dense)

14,336

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

32,000

Architecture Diagram

Input TokensToken EmbeddingPosition: RoPEHidden: 4.1k · Context: 8.2k · Vocab: 32kx 32 layersRMSNormPre-AttentionGrouped-Query Attention32Q / 8KV heads · SW: 4.1kHead dim: 128+RMSNormPre-FFNFeed-Forward NetworkSwiGLUIntermediate: 14.3k+Final RMSNormOutput Logits

Mistral-7B-v0.1

Mistral-7B-v0.1 is a 7.3 billion parameter large language model developed by Mistral AI, engineered for superior performance and computational efficiency in natural language processing tasks. Its design prioritizes efficient inference, making it suitable for practical deployment across various applications. The model is built upon a decoder-only transformer architecture, integrating several key innovations to optimize its operation.

About Mistral 7B

Mistral 7B, a 7.3 billion parameter model, utilizes a decoder-only transformer architecture. It features Sliding Window Attention and Grouped Query Attention for efficient long sequence processing. A Rolling Buffer Cache optimizes memory use, contributing to its design for efficient language processing.


Other Mistral 7B Models

Evaluation Benchmarks

No evaluation benchmarks for Mistral-7B-v0.1 available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B

66 / 100

Mistral-7B-v0.1 Model Integrity Report

Total Score

66

/ 100

B

Audit Note

Mistral 7B v0.1 demonstrates strong transparency in its architectural design and licensing, providing a clear blueprint for its technical innovations and permissive usage. However, it remains highly opaque regarding its upstream data sources and the environmental/computational costs of its training. While the model's identity and technical specifications are well-defined, the lack of a reproducible evaluation harness limits independent verification of its benchmark claims.

Upstream

19.0 / 30

Architectural Provenance

8.0 / 10

Mistral 7B v0.1 is a dense decoder-only transformer with well-documented architectural innovations, specifically Grouped-Query Attention (GQA) and Sliding Window Attention (SWA). The official technical report and blog post provide clear specifications: 32 layers, 4096 hidden dimension, 14336 intermediate dimension, and 32 heads. While the pretraining procedure is described as 'trained from scratch,' the specific initialization and optimization hyperparameters are less detailed than in exemplary documentation.

Dataset Composition

2.0 / 10

Mistral AI provides almost no transparency regarding the pretraining data. The official documentation and paper state only that it was trained on 'publicly available' data and mention English, French, and code. There is no disclosure of specific sources, no percentage breakdown of the 8 trillion tokens claimed by some third-party sources, and no detailed filtering or cleaning methodology. This is a significant gap in transparency.

Tokenizer Integrity

9.0 / 10

The model uses a Byte-fallback BPE tokenizer with a vocabulary size of 32,000 tokens. The tokenizer is publicly accessible via the official GitHub repository and Hugging Face, allowing for full inspection. It is well-documented as being based on the Llama tokenizer but with minor modifications for efficiency. The byte-fallback mechanism ensures no out-of-vocabulary issues, and its implementation is verifiable through the provided reference code.

Model

23.5 / 40

Parameter Density

8.5 / 10

The model is explicitly stated to have 7.3 billion parameters. As a dense model, all parameters are active during inference, which is clearly communicated. The architectural breakdown (layers, heads, dimensions) is fully provided in the technical report, allowing for precise calculation of parameter distribution across attention and FFN blocks.

Training Compute

2.0 / 10

Mistral AI has not disclosed the specific compute resources used for training Mistral 7B v0.1. There is no public information regarding GPU/TPU hours, hardware counts, training duration, or the carbon footprint of the pretraining phase. Third-party audits (e.g., Stanford CRFM) confirm this lack of disclosure. The only verifiable detail is that it was trained on a CoreWeave cluster.

Benchmark Reproducibility

4.0 / 10

While Mistral provides a comprehensive list of benchmark results (MMLU, GSM8K, etc.) and specifies the few-shot settings used, they do not release the exact evaluation code or the specific prompts/examples used to achieve those scores. This makes exact reproduction difficult for independent researchers. Third-party leaderboards often show variance from official claims, and the lack of a public evaluation harness is a notable transparency deficit.

Identity Consistency

9.0 / 10

The model consistently identifies as a Mistral AI product and maintains a clear versioning identity (v0.1). It does not exhibit the identity confusion seen in some fine-tuned models that claim to be GPT-4. Documentation clearly distinguishes between the base model and the 'Instruct' variant, and the model's behavior is generally aligned with its stated capabilities as a base foundation model.

Downstream

23.0 / 30

License Clarity

10.0 / 10

The model weights and reference code are released under the Apache 2.0 license, which is a standard, highly permissive open-source license. There are no conflicting terms or 'open-weights' restrictions that limit commercial use. The licensing is prominently displayed on the official website, GitHub, and Hugging Face repository, providing maximum clarity for downstream users.

Hardware Footprint

7.0 / 10

Hardware requirements are well-understood due to the model's popularity and the availability of reference implementations. While the official documentation provides some guidance on VRAM (approx. 15GB for FP16), it lacks a detailed official breakdown of quantization tradeoffs (e.g., perplexity loss at 4-bit). However, the community and third-party documentation extensively fill this gap with verifiable data for various quantization formats (bitsandbytes, GGUF).

Versioning Drift

6.0 / 10

Mistral uses semantic-like versioning (v0.1, v0.2, v0.3), and a changelog is maintained on their documentation site. However, the documentation for v0.1 specifically is static, and while newer versions are released as separate entities, there is limited information on 'drift' or minor weight updates within the v0.1 lifecycle. The transition path between versions is mentioned but not deeply documented in terms of performance deltas.

GPU Requirements

Full Calculator

Choose the quantization method for model weights

Context Size: 1,024 tokens

1k
4k
8k

VRAM Required:

Recommended GPUs