Qwen2-7B: Specifications and GPU VRAM Requirements

Qwen2-7B

Open Source

Open Weights

Parameters

Context Length

131.072K

Modality

Text

Architecture

Dense

License

Apache 2.0

Release Date

7 Jun 2024

Knowledge Cutoff

Dec 2023

Technical Specifications

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

3584

Number of Layers

Attention Heads

Key-Value Heads

Activation Function

SwigLU

Normalization

RMS Normalization

Position Embedding

ROPE

System Requirements

VRAM requirements for different quantization methods and context sizes

Qwen2-7B

Qwen2-7B is a decoder-only Transformer model developed by Alibaba Cloud, forming a part of the Qwen2 series of large language models. It is specifically designed as a foundational model, intended for diverse natural language processing applications, including comprehensive language understanding and generation tasks. While the base Qwen2-7B model is suitable for further post-training procedures such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), instruction-tuned variants are also available for direct deployment in instruction-following scenarios, supporting various conversational and task-oriented applications. The model's training dataset incorporates a wide array of languages, including English, Chinese, and 27 additional languages, thereby extending its utility and enabling robust multilingual capabilities.

The architectural design of Qwen2-7B integrates several technical features aimed at optimizing performance and efficiency. It utilizes SwiGLU activation functions within its feed-forward networks and incorporates attention QKV bias. A notable innovation across the Qwen2 suite is the implementation of Group Query Attention (GQA), which is designed to enhance inference speed and reduce memory consumption. Positional encoding is managed by Rotary Position Embedding (RoPE), with techniques like Yet Another RoPE Normalization (YaRN) employed to facilitate extrapolation to longer context lengths. Normalization layers within the model architecture employ RMSNorm. Additionally, the model benefits from an enhanced tokenizer, engineered for adaptability across a spectrum of natural languages and programming codes.

Qwen2-7B demonstrates the capacity for processing substantial input sequences. The base model supports a pretraining context length of 32,000 tokens, with extrapolation capabilities extending up to 128,000 tokens. Its instruction-tuned variant supports a context length of up to 131,072 tokens, enabling the model to manage and reason over extensive texts. This model is engineered to exhibit proficient performance across various cognitive domains, including natural language understanding, general question answering, text summarization, content creation, coding assistance, and mathematical problem-solving. The 7B model is widely utilized due to its ability to run on accelerators equipped with 16GB memory using 16-bit floating points. The Qwen2 series models are released under the Apache 2.0 license, supporting open research, development, and commercial use.

About Qwen2

The Alibaba Qwen2 model family comprises large language models built upon the Transformer architecture. It includes both dense and Mixture-of-Experts (MoE) variants, designed for diverse language tasks. Technical features include Grouped Query Attention and support for extended context lengths up to 131,072 tokens, optimizing memory footprint for inference.

Other Qwen2 Models

Evaluation Benchmarks

Ranking is for Local LLMs.

Rank

#47

Benchmark	Score	Rank
Professional Knowledge MMLU Pro	0.44	24
Graduate-Level QA GPQA	0.25	28
General Knowledge MMLU	0.25	35

Rankings

Overall Rank

#47

Coding Rank

GPU Requirements

Full Calculator

Quantization

Choose the quantization method for model weights

Context Size: 1,024 tokens

64k

128k

VRAM Required:

Recommended GPUs

Resources

Official Documentation Release Notes Read the Paper Download Weights Source Code