
Mistral-7B-Instruct-v0.1

Parameters

7.3B

Context Length

8,192 tokens (8K)

Modality

Text

Architecture

Dense

License

Apache 2.0

Release Date

27 Sept 2023

Knowledge Cutoff

-

Technical Specifications

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

4096

Number of Layers

32

Attention Heads

32

Key-Value Heads

8

Activation Function

SwiGLU

Normalization

RMS Normalization

Position Embedding

RoPE

Mistral-7B-Instruct-v0.1

Mistral-7B-Instruct-v0.1 is an instruction-tuned variant of the Mistral-7B-v0.1 generative text model, developed by Mistral AI. It is fine-tuned to interpret and follow instructional prompts, making it well suited to conversational AI and assistant tasks. The model is designed for efficiency, offering a compact yet performant option for language processing applications.

Architecturally, Mistral-7B-Instruct-v0.1 is a decoder-only transformer model. It incorporates several advancements to enhance computational efficiency and context management. These include Grouped-Query Attention (GQA) for accelerated inference and Sliding-Window Attention (SWA), which enables processing of longer input sequences more effectively by attending to a fixed window of prior hidden states. The model utilizes Rotary Position Embedding (RoPE) for positional encoding and employs RMS Normalization. Its tokenization is handled by a Byte-fallback BPE tokenizer.
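As a rough illustration of how GQA and SWA fit together, the NumPy sketch below (an illustrative toy, not the reference implementation) shares each of 8 key-value heads across a group of 4 query heads and masks attention to a fixed causal window. The head counts match the model's published specifications; the sequence length and window size are kept tiny here, whereas the real model uses a 4096-token window.

```python
import numpy as np

# Toy dimensions: head counts match Mistral 7B (32 query / 8 KV heads),
# but SEQ and WINDOW are deliberately small for illustration.
N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
SEQ, WINDOW = 16, 8
GROUP = N_Q_HEADS // N_KV_HEADS  # 4 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((N_Q_HEADS, SEQ, HEAD_DIM))
k = rng.standard_normal((N_KV_HEADS, SEQ, HEAD_DIM))
v = rng.standard_normal((N_KV_HEADS, SEQ, HEAD_DIM))

# GQA: broadcast each KV head across its group of query heads.
k_exp = np.repeat(k, GROUP, axis=0)  # (32, SEQ, HEAD_DIM)
v_exp = np.repeat(v, GROUP, axis=0)

scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)

# SWA mask: position i attends only to positions j in (i - WINDOW, i].
future = np.triu(np.ones((SEQ, SEQ), dtype=bool), k=1)
too_old = np.tril(np.ones((SEQ, SEQ), dtype=bool), k=-WINDOW)
scores[:, future | too_old] = -np.inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_exp
print(out.shape)  # (32, 16, 128)
```

Because only 8 KV heads are cached instead of 32, the key-value cache shrinks by a factor of 4, which is where GQA's inference speedup comes from.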

Regarding its capabilities, Mistral-7B-Instruct-v0.1 is applicable across various text-based scenarios. It is adept at generating coherent text, answering questions, and performing general natural language processing tasks. Specific applications include conversational AI systems, educational tools, customer support interfaces, and knowledge retrieval agents. Its design also supports real-time content generation and energy-efficient AI deployments due to its optimized architecture.

About Mistral 7B

Mistral 7B, a 7.3 billion parameter model, uses a decoder-only transformer architecture. It combines Sliding Window Attention and Grouped-Query Attention for efficient processing of long sequences, and a rolling buffer cache that caps key-value memory at the attention window size, keeping inference memory constant as the sequence grows.
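The rolling buffer cache described above can be sketched in a few lines. This is a hypothetical illustration of the indexing scheme only, with a deliberately tiny window; the real cache stores key-value tensors, not strings.

```python
# Rolling buffer sketch: with window size W, position i writes into
# slot i % W, so only the last W positions are ever kept in memory.
WINDOW = 4  # Mistral 7B uses 4096; kept tiny here for illustration

cache = [None] * WINDOW
for pos in range(10):            # stream 10 tokens through a 4-slot buffer
    cache[pos % WINDOW] = f"kv_{pos}"

# After position 9, only the entries for positions 6..9 survive.
print(sorted(cache))
```

Older entries are overwritten rather than evicted explicitly, which is what keeps the cache's memory footprint fixed regardless of sequence length.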



Evaluation Benchmarks

No evaluation benchmarks are available for Mistral-7B-Instruct-v0.1.

Rankings

Overall Rank

-

Coding Rank

-

Model Transparency

Total Score

B

64 / 100

Mistral-7B-Instruct-v0.1 Transparency Report


Audit Note

Mistral-7B-Instruct-v0.1 demonstrates strong transparency in its licensing and core architectural specifications, utilizing a standard open-source framework. However, it remains highly opaque regarding its training data and compute resources, providing virtually no verifiable information on the sources or environmental impact of its development. While the model is technically accessible, the lack of detailed evaluation protocols and data provenance limits full independent auditability.

Upstream

18.5 / 30

Architectural Provenance

7.5 / 10

The model's architecture is well-documented in the 'Mistral 7B' technical paper, which specifies a decoder-only transformer structure. It explicitly details the use of Grouped-Query Attention (GQA) and Sliding Window Attention (SWA) with a window size of 4096. While the base model's pretraining methodology is described at a high level, the specific instruction-tuning procedure for the v0.1 variant is only broadly attributed to 'publicly available conversation datasets' without a step-by-step technical breakdown of the fine-tuning recipe.

Dataset Composition

2.5 / 10

Mistral AI provides almost no transparency regarding the pretraining dataset composition, stating only that it was 'trained on a massive dataset of text' and 'carefully designed'. For the Instruct v0.1 variant, the documentation vaguely mentions the use of 'publicly available conversation datasets' without naming specific sources, providing percentage breakdowns, or detailing the filtering and cleaning methodologies used. This lack of disclosure makes it impossible to verify the data's provenance or quality.

Tokenizer Integrity

8.5 / 10

The model uses a Byte-fallback BPE tokenizer with a clearly stated vocabulary size of 32,000 tokens. The tokenizer is publicly accessible via the 'mistral-common' library and Hugging Face, allowing for direct inspection and verification. Documentation exists for its implementation, including the specific [INST] and [/INST] control tokens required for the instruction format, though some early discrepancies between the reference implementation and third-party libraries were noted by the community.
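The [INST] and [/INST] control tokens mentioned above compose into the instruction template commonly documented for this model. The helper below is a hypothetical sketch of that template (verify against the official mistral-common or Hugging Face chat template before relying on it); `<s>` and `</s>` stand in for the BOS and EOS tokens, which a tokenizer would normally add as special tokens rather than literal text.

```python
# Hypothetical sketch of the Mistral-7B-Instruct-v0.1 prompt template.
def build_prompt(turns):
    """turns: list of (user_message, assistant_reply_or_None) pairs."""
    prompt = "<s>"
    for user, assistant in turns:
        prompt += f"[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f" {assistant}</s>"  # close completed assistant turns
    return prompt

print(build_prompt([("Hello!", "Hi there."), ("Tell me a joke.", None)]))
```

The final user turn is left open after `[/INST]` so the model generates the assistant reply from that point.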

Model

23.0 / 40

Parameter Density

9.0 / 10

The model is explicitly defined as a dense architecture with 7.3 billion total parameters. Detailed architectural hyperparameters are provided in the official paper, including the number of layers (32), hidden dimension (4096), and number of heads (32). As a dense model, all parameters are active during inference, and there is no ambiguity regarding sparse or MoE components.
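These hyperparameters are enough for a back-of-envelope check of the quoted 7.3B figure, assuming a standard Llama-style layout: a SwiGLU MLP with inner dimension 14336 and untied input/output embeddings (both assumptions here, not stated in this card), with normalization parameters ignored as negligible.

```python
# Back-of-envelope parameter count from the published hyperparameters.
d_model, n_layers = 4096, 32
n_heads, n_kv_heads, head_dim = 32, 8, 128
d_ffn, vocab = 14336, 32000  # d_ffn is an assumption (Llama-style SwiGLU)

attn = (d_model * n_heads * head_dim           # Wq
        + 2 * d_model * n_kv_heads * head_dim  # Wk, Wv (GQA: only 8 KV heads)
        + n_heads * head_dim * d_model)        # Wo
mlp = 3 * d_model * d_ffn                      # gate, up, down projections
per_layer = attn + mlp
total = n_layers * per_layer + 2 * vocab * d_model  # + embeddings + lm_head
print(f"{total / 1e9:.2f}B parameters")        # 7.24B, close to the quoted 7.3B
```

The GQA term is visible in the arithmetic: the K and V projections use 8 heads instead of 32, saving roughly 0.8B parameters across the 32 layers relative to full multi-head attention.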

Training Compute

1.0 / 10

There is a near-total absence of official information regarding the compute resources used for the primary training of Mistral 7B. No data is provided regarding GPU/TPU hours, hardware clusters, training duration, or the resulting carbon footprint. Independent audits (e.g., Stanford CRFM) have confirmed a score of 0 for these transparency indicators in official Mistral communications.

Benchmark Reproducibility

4.0 / 10

While Mistral AI reports scores on standard benchmarks (MMLU, GSM8K, etc.) and compares them against Llama 2, they do not provide the exact evaluation code, specific prompts, or few-shot examples used to achieve these results. The lack of a public evaluation harness or detailed reproduction instructions forces third parties to rely on independent implementations, which may lead to inconsistent results.

Identity Consistency

9.0 / 10

The model consistently identifies itself as an AI developed by Mistral AI and maintains a clear versioning identity (v0.1). It does not exhibit significant identity confusion or claim to be a model from a different provider. It is generally transparent about its nature as a language model, though it lacks built-in moderation mechanisms which it acknowledges in its documentation.

Downstream

22.0 / 30

License Clarity

10.0 / 10

The model is released under the Apache 2.0 license, which is a highly permissive, industry-standard open-source license. The terms are clear, allowing for both commercial and non-commercial use, modification, and distribution without proprietary restrictions or conflicting usage terms. Documentation on the official website and Hugging Face consistently reflects this licensing.

Hardware Footprint

7.0 / 10

VRAM requirements for inference are well-understood and documented by the community and third-party providers (e.g., ~15GB for FP16, ~5GB for 4-bit). While Mistral's official documentation provides some guidance on memory efficiency through SWA and GQA, it lacks a comprehensive official table detailing the exact VRAM scaling across different context lengths and quantization levels (Q4, Q8, etc.).
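The community figures above follow from simple arithmetic on the parameter count. The sketch below estimates memory for the weights alone, assuming no KV cache or activation overhead; that overhead is why real-world FP16 usage lands nearer the quoted ~15 GB than the raw weight figure.

```python
# Weight-memory estimate: parameters x bytes-per-parameter.
# Excludes KV cache and activations, which add several GB in practice.
params = 7.3e9
reqs = {name: params * bytes_per_param / 1024**3
        for name, bytes_per_param in
        [("FP16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]}
for name, gb in reqs.items():
    print(f"{name}: ~{gb:.1f} GB weights")
```

The 4-bit figure also sits below the quoted ~5 GB for the same reason: quantized runtimes keep some tensors at higher precision and still need cache and activation memory on top of the weights.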

Versioning Drift

5.0 / 10

The model uses a versioning system (v0.1, v0.2, v0.3), which provides some level of tracking. However, there is no detailed, publicly maintained changelog or formal documentation of performance drift or behavioral changes between minor updates. Users must rely on community findings or major version releases to understand significant shifts in model behavior.
