
Gemma 2 2B

Parameters

2B

Context Length

8,192 tokens

Modality

Text

Architecture

Dense

License

Gemma License

Release Date

27 Jun 2024

Knowledge Cutoff

Jun 2024

Technical Specifications

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

2048

Number of Layers

26

Attention Heads

16

Key-Value Heads

4

Activation Function

GeGLU (approximated GELU gating)

Normalization

RMS Normalization

Position Embedding

RoPE (Rotary Position Embedding)
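For orientation, the specifications above can be gathered into a single configuration object. The sketch below is purely illustrative, assuming only the values listed in this table; the class and field names are hypothetical and do not correspond to any particular library's configuration API.

```python
from dataclasses import dataclass

@dataclass
class Gemma2_2BConfig:
    """Illustrative configuration mirroring the specification table above."""
    hidden_size: int = 2048              # hidden dimension size
    num_layers: int = 26                 # transformer decoder blocks
    num_attention_heads: int = 16        # query heads
    num_key_value_heads: int = 4         # shared K/V heads (grouped-query attention)
    max_position_embeddings: int = 8192  # context length
    activation: str = "geglu"            # GeGLU gating with approximated GELU
    norm: str = "rmsnorm"                # RMS normalization, pre- and post-block
    position_embedding: str = "rope"     # rotary position embeddings

config = Gemma2_2BConfig()
print(config)
```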

System Requirements

VRAM requirements for different quantization methods and context sizes

Gemma 2 2B

Gemma 2 2B is a compact, state-of-the-art open language model from Google, built on the same research and technology used for the Gemini model series. It is a text-to-text, decoder-only transformer, available in English, with openly released weights in both pre-trained and instruction-tuned variants. Its design prioritizes efficiency, enabling deployment across a spectrum of computing environments, from resource-constrained edge devices and consumer laptops to cloud infrastructure. This accessibility broadens participation in the development and application of advanced AI systems.
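As a quick way to try the model, the instruction-tuned variant can be loaded with the Hugging Face Transformers library. The sketch below is a minimal example under the assumption that the checkpoint is published as google/gemma-2-2b-it on the Hub and that the Gemma license has been accepted there; dtype and device settings should be adjusted to your hardware.

```python
# Minimal sketch: loading and prompting the instruction-tuned checkpoint.
# Assumes the Hugging Face model id "google/gemma-2-2b-it" and an accepted
# Gemma license on the Hub; adjust dtype/device for your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # roughly 5-6 GB of memory for the weights
    device_map="auto",           # falls back to CPU if no GPU is present
)

prompt = "Summarize why small language models are useful."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```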

The architecture of Gemma 2 2B is a decoder-only transformer that combines established and newer components. Consistent with the earlier Gemma models, it uses a context length of 8,192 tokens and Rotary Position Embeddings (RoPE) for positional information, and it employs an approximated GeGLU non-linearity as its activation function. Notable changes in Gemma 2 include a hybrid normalization scheme that applies RMSNorm both before and after each sub-layer (pre- and post-normalization) to improve training stability and overall performance. Gemma 2 2B also uses Grouped-Query Attention (GQA), in which groups of query heads share a smaller set of key-value heads, reducing the size of the KV cache and improving inference efficiency. The 2B model is additionally trained with knowledge distillation from a larger teacher model, which helps it perform above what its parameter count alone would suggest. Across its layers, the model alternates between local sliding-window attention and global attention to capture both short-range dependencies and broader context, and logit soft-capping is applied in the attention layers and the final layer to further stabilize training.
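To make two of these mechanisms concrete, the sketch below illustrates, in plain PyTorch, how grouped-query attention shares a small set of key-value heads across the query heads and how logit soft-capping bounds attention scores with a tanh. The tensor shapes follow the specification table above, and the soft-cap value is an assumption for illustration; this is not the model's actual implementation.

```python
# Illustrative sketch of grouped-query attention and logit soft-capping,
# not the exact Gemma 2 implementation. Shapes follow the spec table above:
# 16 query heads sharing 4 key-value heads.
import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 1, 8, 64
n_q_heads, n_kv_heads = 16, 4
soft_cap = 50.0  # assumed attention logit soft-cap value

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# GQA: each group of 4 query heads attends to one shared K/V head,
# so the 4 K/V heads are repeated to line up with the 16 query heads.
group_size = n_q_heads // n_kv_heads
k = k.repeat_interleave(group_size, dim=1)
v = v.repeat_interleave(group_size, dim=1)

# Scaled dot-product attention logits.
logits = q @ k.transpose(-2, -1) / head_dim**0.5

# Logit soft-capping: squash logits smoothly into (-soft_cap, soft_cap).
logits = soft_cap * torch.tanh(logits / soft_cap)

weights = F.softmax(logits, dim=-1)
out = weights @ v
print(out.shape)  # torch.Size([1, 16, 8, 64])
```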

Gemma 2 2B is designed for efficient operation, making it well suited to environments with limited computational resources. It supports a range of text generation tasks, including question answering, summarization, and reasoning, and its compact footprint makes it a practical choice for mobile AI applications and edge computing scenarios. To support responsible AI development, the model is accompanied by safety tooling: the ShieldGemma classifiers, which detect and help mitigate harmful content, and Gemma Scope, which provides greater transparency into the model's internal decision-making.
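For the instruction-tuned variant, prompts are usually wrapped in Gemma's chat turn format, which the tokenizer's chat template handles automatically. The snippet below continues the loading sketch above; the message content is arbitrary and the decoding settings are illustrative.

```python
# Continues the loading sketch above; uses the tokenizer's built-in chat
# template for the instruction-tuned checkpoint. Message content is arbitrary.
messages = [
    {"role": "user", "content": "Answer briefly: what is grouped-query attention?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # appends the model-turn marker
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, max_new_tokens=96)
# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```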

About Gemma 2

Gemma 2 is Google's family of open large language models, offering 2B, 9B, and 27B parameter sizes. Built upon the Gemma architecture, it incorporates innovations such as interleaved local and global attention, logit soft-capping for training stability, and Grouped Query Attention for inference efficiency. The smaller models leverage knowledge distillation.



Evaluation Benchmarks

Rankings are relative to other local LLMs.

No evaluation benchmarks are available for Gemma 2 2B.

Rankings

Overall Rank

-

Coding Rank

-

GPU Requirements

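As a rough guide, VRAM needs can be estimated from the parameter count, the bytes per weight implied by the quantization method, and the KV cache for the chosen context length. The sketch below is a back-of-the-envelope estimate that assumes the layer, head, and dimension values from the specification table above; actual usage depends on the runtime, framework overhead, and batch size.

```python
# Back-of-the-envelope VRAM estimate for Gemma 2 2B, assuming the values in
# the specification table above. Real usage varies by runtime and overhead.
def estimate_vram_gb(
    params_b: float = 2.6,          # parameters in billions (approximate)
    bytes_per_weight: float = 2.0,  # fp16/bf16 = 2, int8 = 1, 4-bit ~= 0.5
    context_len: int = 8192,
    n_layers: int = 26,
    n_kv_heads: int = 4,
    head_dim: int = 128,            # assumed: hidden size 2048 / 16 heads
    kv_bytes: float = 2.0,          # KV cache kept in fp16
) -> float:
    weights = params_b * 1e9 * bytes_per_weight
    # KV cache: keys + values, per layer, per KV head, per token.
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes
    overhead = 1.2  # ~20% for activations, buffers, fragmentation (rough)
    return (weights + kv_cache) * overhead / 1024**3

for name, bpw in [("bf16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
    print(f"{name}: ~{estimate_vram_gb(bytes_per_weight=bpw):.1f} GB")
```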