
Mistral-7B-Instruct-v0.2

Parameters

7.3B

Context Length

32,768 tokens (32K)

Modality

Text

Architecture

Dense

License

Apache 2.0

Release Date

15 Jan 2024

Knowledge Cutoff

Dec 2023

Technical Specifications

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

4096

Number of Layers

32

Attention Heads

32

Key-Value Heads

8

Activation Function

SwiGLU (SiLU)

Normalization

RMSNorm

Position Embedding

RoPE

Mistral-7B-Instruct-v0.2

Mistral-7B-Instruct-v0.2 is an instruction-tuned large language model with 7.3 billion parameters. It is engineered to interpret and execute specific instructions, making it suitable for conversational AI, automated dialogue systems, and content-generation tasks such as question answering and summarization. It is a fine-tuned iteration of the Mistral-7B-v0.2 base model, distinguished by its instruction-following capability.

The architectural foundation of Mistral-7B-Instruct-v0.2 is a decoder-only transformer that integrates Grouped-Query Attention (GQA) to reduce inference cost. A key architectural change in v0.2, compared with v0.1, is the removal of Sliding-Window Attention in favor of full attention over an expanded context window of 32,768 tokens, allowing the model to process long text sequences while maintaining coherence. It incorporates Rotary Position Embeddings (RoPE) with a theta value of 1e6 and employs a byte-fallback BPE tokenizer to handle a diverse range of textual inputs.
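The head-sharing idea behind GQA can be sketched in plain NumPy: the model's 32 query heads share 8 key-value heads (4 query heads per KV head), shrinking the KV cache roughly 4x versus standard multi-head attention. The dimensions below match the model card; the code is an illustrative sketch, not Mistral's implementation.

```python
import numpy as np

n_heads, n_kv_heads, head_dim, seq = 32, 8, 128, 4
group = n_heads // n_kv_heads  # 4 query heads share each KV head

q = np.random.randn(n_heads, seq, head_dim)
k = np.random.randn(n_kv_heads, seq, head_dim)   # only 8 KV heads are stored
v = np.random.randn(n_kv_heads, seq, head_dim)

# Broadcast each KV head across its group of query heads
k_full = np.repeat(k, group, axis=0)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)        # softmax over key positions
out = weights @ v_full                           # shape: (32, seq, 128)
```

Only the 8 KV heads need to be cached during generation, which is what makes the 32K context practical on modest hardware.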

Mistral-7B-Instruct-v0.2 is designed for flexible deployment across computing environments, including local systems and cloud-based platforms. Its operational design focuses on precise performance in instruction-following scenarios. The model is distributed under the Apache 2.0 License, which enables open access, use, and integration into diverse research and development projects with minimal restrictions.

About Mistral 7B

Mistral 7B, a 7.3 billion parameter model, utilizes a decoder-only transformer architecture. It features Sliding Window Attention and Grouped Query Attention for efficient long sequence processing. A Rolling Buffer Cache optimizes memory use, contributing to its design for efficient language processing.
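The rolling buffer cache mentioned above can be sketched as a simple ring buffer: once the buffer is full, the oldest entry is overwritten, bounding KV-cache memory by the window size. This is a minimal sketch of the idea with an assumed fixed window; the real cache stores key/value tensors per layer rather than arbitrary Python objects.

```python
class RollingKVCache:
    """Ring buffer holding at most `window` cache entries (sketch)."""

    def __init__(self, window):
        self.window = window
        self.buf = [None] * window
        self.pos = 0  # total number of entries ever appended

    def append(self, entry):
        # Overwrite the oldest slot once the buffer wraps around
        self.buf[self.pos % self.window] = entry
        self.pos += 1

    def view(self):
        # Return cached entries in chronological order
        if self.pos <= self.window:
            return self.buf[:self.pos]
        start = self.pos % self.window
        return self.buf[start:] + self.buf[:start]
```

Appending positions 1 through 5 to a window-3 cache leaves only the last three positions resident, regardless of sequence length.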



Evaluation Benchmarks


Benchmark | Score | Rank
WebDev Arena (Web Development) | 1150 | #60

Rankings

Overall Rank

#97

Coding Rank

#87

Model Transparency

Total Score

B- (60 / 100)

Mistral-7B-Instruct-v0.2 Transparency Report


Audit Note

Mistral-7B-Instruct-v0.2 demonstrates strong transparency in its licensing and architectural specifications, providing a clear open-source foundation for developers. However, the model is severely limited by a lack of disclosure regarding its training data composition and compute resources. While the technical architecture is well-documented, the 'black box' nature of its upstream data and training costs remains a critical gap in its transparency profile.

Upstream

18.0 / 30

Architectural Provenance

7.5 / 10

Mistral-7B-Instruct-v0.2 is explicitly documented as an instruction-tuned version of the Mistral-7B-v0.2 base model. The architecture is a decoder-only transformer with 32 layers, 4096 hidden dimensions, and 14336 intermediate dimensions. It utilizes Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE) with a theta value of 1e6. Notably, this version (v0.2) removes the Sliding-Window Attention (SWA) found in v0.1 to support a full 32k context window. While the high-level architecture is well-documented in the 'Mistral 7B' paper and release notes, the specific fine-tuning methodology for the 'Instruct' variant is described generally as 'two-stage instruction tuning' without exhaustive procedural detail.
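The documented dimensions are enough to recompute the headline parameter count. The head dimension (128) and vocabulary size (32,000) below come from the public model configuration; everything else is stated above. This is back-of-envelope arithmetic, not an official tally.

```python
# Recompute the parameter count from the documented dimensions
vocab, d_model, n_layers = 32_000, 4096, 32
d_ffn, n_heads, n_kv_heads, head_dim = 14_336, 32, 8, 128

attn = d_model * d_model * 2                  # Q and O projections
attn += d_model * n_kv_heads * head_dim * 2   # K and V (GQA: only 8 heads)
mlp = 3 * d_model * d_ffn                     # gate, up, down (SwiGLU)
norms = 2 * d_model                           # two RMSNorm weights per layer

total = n_layers * (attn + mlp + norms)
total += 2 * vocab * d_model + d_model        # embeddings, LM head, final norm
print(f"{total / 1e9:.2f}B parameters")       # -> 7.24B
```

The result, about 7.24 billion, matches the advertised 7.3B figure once rounded.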

Dataset Composition

2.0 / 10

Data transparency is a significant weakness. Mistral AI states the model was fine-tuned on 'publicly available conversation datasets' but does not provide a specific list, proportions, or a breakdown of the pretraining or instruction data. There is no public documentation on data filtering, cleaning, or the specific mix of code, web, and book data used. The company explicitly refuses to disclose detailed data sources for competitive reasons, which falls under the 'vague marketing claims' category in the scoring guidelines.

Tokenizer Integrity

8.5 / 10

The model uses a Byte-fallback BPE tokenizer with a vocabulary size of 32,000. The tokenizer is publicly accessible via the 'mistral-common' library and Hugging Face 'transformers'. Documentation confirms it handles a wide range of inputs by falling back to bytes for out-of-vocabulary characters. The special tokens for instruction formatting ([INST], [/INST]) are clearly defined and documented for developers to ensure alignment between training and inference.
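A small helper illustrates how those special tokens assemble into the documented instruction template. The function below is a hypothetical sketch for clarity; in practice, the tokenizer's chat template in `transformers` (`apply_chat_template`) is the authoritative way to build prompts.

```python
def build_prompt(turns):
    """Format (user, assistant) turns into the Mistral instruct template:
    <s>[INST] user [/INST] reply</s>[INST] next user [/INST]
    Pass assistant=None for the turn awaiting a completion.
    This helper is illustrative; whitespace conventions are assumed.
    """
    prompt = "<s>"
    for user, assistant in turns:
        prompt += f"[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f" {assistant}</s>"
    return prompt
```

Keeping training-time and inference-time formatting identical, as the documentation stresses, is what the [INST] markers are for.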

Model

21.0 / 40

Parameter Density

7.0 / 10

The model is a dense architecture with a clearly stated total of 7.3 billion parameters. Unlike MoE models, all parameters are active during inference. Detailed architectural specifications (layer counts, head counts, and dimension sizes) are available in the model configuration files. However, there is no public documentation on the specific parameter distribution across attention vs. feed-forward networks beyond what can be inferred from the standard transformer block structure.

Training Compute

2.0 / 10

Compute transparency is nearly non-existent. While it is known that the model was trained on a CoreWeave cluster, Mistral AI has not disclosed the total GPU hours, hardware counts (e.g., number of H100s), training duration, or the carbon footprint. Third-party estimates exist, but official verifiable data is absent, leading to a low score based on the 'no compute disclosure' red flag.

Benchmark Reproducibility

4.0 / 10

Mistral AI provides benchmark results for MMLU, GSM8K, and others in their blog posts and paper, but they do not release the exact evaluation code or the specific prompts/few-shot examples used to achieve those scores. While the model is frequently tested by third parties on leaderboards like LMSYS, the lack of official reproduction instructions and prompt transparency prevents a higher score. (Score adjusted for disclosed external findings regarding benchmark performance consistency).

Identity Consistency

8.0 / 10

The model generally identifies itself correctly as an AI developed by Mistral AI and is aware of its versioning (v0.2). It does not typically claim to be a competitor's model (like GPT-4). However, early documentation for v0.2 was initially inconsistent regarding its base model (v0.1 vs v0.2), which was later corrected. It maintains a coherent identity across standard deployments.

Downstream

21.0 / 30

License Clarity

9.0 / 10

The model is released under the Apache 2.0 license, which is a highly permissive, standard open-source license. The terms are clear, allowing for commercial use, modification, and distribution without significant restrictions. There are no known conflicting terms between the weights and the reference code provided by Mistral AI.

Hardware Footprint

7.0 / 10

VRAM requirements are well documented by the community and supported by official configuration files. At FP16, the model requires ~14-15 GB of VRAM, while 4-bit quantization (Q4) reduces this to ~4-5 GB, making it accessible on consumer hardware. Memory scaling for the 32k context window is well understood, though documentation on the accuracy trade-offs of different quantization levels comes primarily from third-party contributors rather than Mistral AI directly.
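Those figures can be approximated from first principles: weight memory is parameter count times bytes per weight, and the GQA KV cache adds 2 (K and V) x layers x context x KV heads x head dim x bytes per value. The sketch below uses the architecture numbers from this page; it deliberately ignores activation and runtime buffers, so real usage runs somewhat higher.

```python
def estimate_vram_gib(n_params, bytes_per_weight, context_len,
                      n_layers=32, n_kv_heads=8, head_dim=128, kv_bytes=2):
    """Back-of-envelope VRAM estimate: weights + KV cache only.
    Activation memory and framework overhead are not modeled."""
    weights = n_params * bytes_per_weight
    kv_cache = 2 * n_layers * context_len * n_kv_heads * head_dim * kv_bytes
    return (weights + kv_cache) / 2**30

fp16 = estimate_vram_gib(7.24e9, 2.0, 1024)   # roughly 14 GiB
q4   = estimate_vram_gib(7.24e9, 0.5, 1024)   # roughly 3.5 GiB
```

At the full 32,768-token context, the KV cache alone grows to about 4 GiB at FP16, which is why long-context serving favors quantized weights.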

Versioning Drift

5.0 / 10

Mistral uses semantic-like versioning (v0.1, v0.2, v0.3), which is a positive practice. However, the transition from v0.1 to v0.2 involved silent updates to the model card and documentation regarding the base model and context window features. There is no detailed, centralized changelog or formal deprecation policy for older versions, making it difficult for developers to track subtle behavioral changes between minor releases.
