Mixtral-8x7B-v0.1: Specifications and GPU VRAM Requirements

Open Weights

Total Parameters

46.7B

Active Parameters

12.9B per token

Context Length

32,768 tokens

Modality

Text

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

9 Dec 2023

Knowledge Cutoff

Nov 2022

Technical Specifications

Total Expert Parameters

7.0B

Number of Experts

8

Active Experts

2 per token

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

4096

Number of Layers

Attention Heads

Key-Value Heads

Activation Function

Normalization

Position Embedding

RoPE (Rotary Position Embedding)

System Requirements

VRAM requirements for different quantization methods and context sizes
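
As a rough illustration of how quantization and context size drive the VRAM figure, the sketch below adds the memory for the weights at a given bit width to the memory for an fp16 key-value cache. Only the 46.7B total parameter count comes from this page. The layer count, number of key-value heads, and head dimension are assumptions taken from the published Mixtral-8x7B configuration. Activations, framework overhead, and quantization metadata are ignored, so real usage will be somewhat higher.

```python
# Rough VRAM estimate: quantized weights plus an fp16 KV cache.
# Everything except TOTAL_PARAMS is an assumption, not a value from this page.

TOTAL_PARAMS = 46.7e9   # total parameter count quoted on this page
N_LAYERS = 32           # assumed
N_KV_HEADS = 8          # assumed (grouped-query attention)
HEAD_DIM = 128          # assumed
GIB = 1024 ** 3


def weight_gib(bits_per_param: float) -> float:
    """Memory for the weights alone at a given precision."""
    return TOTAL_PARAMS * bits_per_param / 8 / GIB


def kv_cache_gib(context_tokens: int, bytes_per_value: int = 2) -> float:
    """Memory for keys and values across all layers, for one sequence."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * context_tokens / GIB


def estimate_gib(bits_per_param: float, context_tokens: int) -> float:
    """Weights plus KV cache; a lower bound on real usage."""
    return weight_gib(bits_per_param) + kv_cache_gib(context_tokens)


if __name__ == "__main__":
    for bits in (16, 8, 4):
        for ctx in (1024, 16384, 32768):
            print(f"{bits:>2}-bit weights, {ctx:>6} tokens: "
                  f"{estimate_gib(bits, ctx):6.1f} GiB")
```

The weight term uses the total parameter count, not the active count: all eight experts must be resident in memory even though only two run for any given token.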

Mixtral-8x7B-v0.1

Mixtral-8x7B-v0.1 is a pretrained generative large language model developed by Mistral AI and built on a Sparse Mixture of Experts (SMoE) architecture. The design activates only a subset of the model's parameters for each token, which lowers the compute cost per token relative to a dense model of the same total size. It is intended for text generation and language understanding tasks.

The model is a decoder-only transformer. In each layer, the feedforward block is replaced by a Mixture-of-Experts layer containing eight distinct feedforward blocks, known as 'experts'. A router network selects two of these experts for each token and combines their outputs as a weighted sum, with the weights derived from the router's scores. As a result, the model has 46.7 billion total parameters but uses only about 12.9 billion per token during inference, trading a larger memory footprint for lower per-token compute. The architecture also uses Grouped-Query Attention (GQA) and supports Flash Attention for faster inference.
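
The routing step can be sketched for a single token as follows. This is a minimal pure-Python illustration, not Mistral AI's implementation: the function name `top2_moe`, the toy experts, and the example logits are all hypothetical, and the router logits are passed in directly, whereas in the real model they come from a learned linear layer applied to the token's hidden state. The sketch assumes the two selected outputs are combined with softmax weights computed over the two chosen logits.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def top2_moe(x: Sequence[float], router_logits: Sequence[float],
             experts: Sequence[Expert]) -> Vector:
    """Route one token through its two highest-scoring experts."""
    # Indices of the two largest router logits.
    top2 = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:2]
    # Softmax over the two selected logits only, so the weights sum to 1.
    peak = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - peak) for i in top2]
    weights = [e / sum(exps) for e in exps]
    # Weighted sum of the selected experts' outputs; the other six never run.
    out = [0.0] * len(x)
    for w, i in zip(weights, top2):
        y = experts[i](x)
        out = [o + w * v for o, v in zip(out, y)]
    return out


# Toy experts: expert k scales its input by (k + 1). Real experts are
# feedforward blocks with learned weights.
experts = [lambda v, k=k: [(k + 1) * a for a in v] for k in range(8)]
logits = [0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 2.0]  # hypothetical router output
result = top2_moe([1.0, -1.0], logits, experts)
print(result)
```

Because only the selected experts are evaluated, the per-token compute scales with two experts, not eight.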

Mixtral-8x7B-v0.1 supports a context length of 32,768 tokens (32k), allowing it to process and generate responses based on long textual inputs. The model handles multilingual tasks in English, French, Italian, German, and Spanish, and performs strongly on code generation. It can be fine-tuned for instruction following, which makes it a suitable foundation for interactive applications that must adhere closely to user commands.

About Mixtral

The Mixtral model family, developed by Mistral AI, employs a sparse Mixture-of-Experts (SMoE) architecture. This design utilizes multiple expert networks per layer, where a router selects a subset to process each token. This enables large total parameter counts while maintaining computational efficiency by activating only a fraction of parameters per forward pass.
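
The gap between total and active parameters can be checked with back-of-the-envelope arithmetic. The sketch below counts weights for Mixtral-8x7B under an assumed configuration: only the hidden size of 4096 appears on this page, and the remaining values (feedforward width, layer count, head counts, vocabulary size) are assumptions taken from the published model configuration. The helper `params` is hypothetical.

```python
# Assumed architecture values; only HIDDEN = 4096 appears on this page.
HIDDEN = 4096
FFN = 14336
LAYERS = 32
HEADS = 32
KV_HEADS = 8
HEAD_DIM = 128
VOCAB = 32000
EXPERTS = 8
ACTIVE_EXPERTS = 2


def params(experts_counted: int) -> int:
    """Parameter count when `experts_counted` experts per layer are included."""
    attention = (
        HIDDEN * HEADS * HEAD_DIM            # query projection
        + 2 * HIDDEN * KV_HEADS * HEAD_DIM   # key and value projections (GQA)
        + HEADS * HEAD_DIM * HIDDEN          # output projection
    )
    expert = 3 * HIDDEN * FFN                # gate, up, and down matrices
    router = HIDDEN * EXPERTS
    norms = 2 * HIDDEN
    per_layer = attention + experts_counted * expert + router + norms
    embeddings = 2 * VOCAB * HIDDEN          # input embedding + output head
    return LAYERS * per_layer + embeddings + HIDDEN  # + final norm


total = params(EXPERTS)
active = params(ACTIVE_EXPERTS)
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

Under these assumptions the totals come to roughly 46.7B and 12.9B. The total is well below 8 x 7B = 56B because only the feedforward experts are replicated; attention, embeddings, and normalization weights are shared across experts.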

Other Mixtral Models

Mixtral-8x22B-v0.1

Evaluation Benchmarks

Ranking is for Local LLMs.

No evaluation benchmarks for Mixtral-8x7B-v0.1 available.
