ApX logoApX logo

Gemma 1 2B

Parameters

2B

Context Length

8.192K

Modality

Text

Architecture

Dense

License

Gemma Terms of Use

Release Date

21 Feb 2024

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Multi-Query Attention

Attention Heads

16

Key-Value Heads

1

Attention Head Dimension

-

Position Embedding

ROPE

RoPE Theta

-

Sliding Window Attention

-

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

-

Dimensions

Hidden Dimension Size

2,048

Number of Layers

18

FFN Intermediate Size (Dense)

-

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

-

Architecture Diagram

Input TokensToken EmbeddingPosition: RoPEHidden: 2k · Context: 8.2kx 18 layersRMSNormPre-AttentionMulti-Query Attention16Q / 1KV headsHead dim: 128+RMSNormPre-FFNFeed-Forward NetworkActivation+Final RMSNormOutput Logits

Gemma 1 2B

Gemma 1 2B is a lightweight, state-of-the-art open language model developed by Google, stemming from the same research and technology that underpins the Gemini family of models. This model is designed as a text-to-text, decoder-only transformer, primarily available in English, with both pre-trained and instruction-tuned variants. Its architectural design focuses on efficiency, making it suitable for deployment in environments with limited computational resources, such as laptops, desktops, or personal cloud infrastructure.

Architecturally, Gemma 1 2B incorporates several advanced components. It utilizes Multi-Query Attention (MQA) with a single key-value head, a design choice that optimizes for faster inference by sharing key and value projections across attention heads. Positional encoding is handled through Rotary Positional Embeddings (RoPE). The model's non-linear activation function is GeGLU (Gated Linear Unit), a variant of GLU that enhances expressive power. Normalization within the network is performed using RMSNorm. These elements contribute to the model's performance while maintaining a compact footprint.

The 2B variant is well-suited for a variety of text generation applications, including question answering, summarization, and reasoning tasks. The instruction-tuned versions of Gemma 1 2B are specifically refined to follow instructions effectively and engage in multi-turn conversations, making them adaptable for interactive applications like chatbots. Its compact size ensures it can operate on consumer-grade hardware, democratizing access to advanced AI capabilities for developers and researchers.

About Gemma 1

Gemma 1 is a family of lightweight, decoder-only transformer models from Google, available in 2B and 7B parameter sizes. Designed for various text generation tasks, they incorporate rotary positional embeddings, shared input/output embeddings, GEGLU activation, and RMSNorm. The 2B model uses multi-query attention, while 7B uses multi-head attention.


Other Gemma 1 Models

Evaluation Benchmarks

No evaluation benchmarks for Gemma 1 2B available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B

65 / 100

Gemma 1 2B Model Integrity Report

Total Score

65

/ 100

B

Audit Note

Gemma 1 2B exhibits strong transparency in its architectural design and tokenizer implementation, backed by a detailed technical report. However, it suffers from significant opacity regarding its training dataset composition and the specific compute resources consumed during development. While the model is highly consistent in its identity, its custom licensing and initial issues with benchmark reproducibility present hurdles for fully transparent independent verification.

Upstream

21.5 / 30

Architectural Provenance

8.5 / 10

Gemma 1 2B is extensively documented in the official technical report ('Gemma: Open Models Based on Gemini Research and Technology'). The architecture is explicitly defined as a decoder-only transformer with 18 layers, a hidden dimension of 2048, and 8 attention heads. It uniquely utilizes Multi-Query Attention (MQA) for the 2B variant, distinct from the 7B's Multi-Head Attention. Key modifications like RoPE embeddings, GeGLU activations, and RMSNorm are clearly stated and justified. The relationship to the Gemini family is transparent, though the specific 'distillation' or training recipe details from the larger Gemini models are described at a high level rather than with full procedural reproducibility.

Dataset Composition

4.0 / 10

While Google discloses the total token count (2 trillion for the 2B model) and general categories (web documents, mathematics, and code), it fails to provide a specific percentage breakdown or name the exact datasets used. The documentation mentions filtering for CSAM, PII, and quality using model-based classifiers, but these methodologies are not public. The lack of specific data sources or a detailed composition breakdown (e.g., 'StackOverflow: 5%') prevents independent verification of the training data's diversity or bias.

Tokenizer Integrity

9.0 / 10

The tokenizer is a SentencePiece-based model with a large vocabulary of 256,128 tokens, which is publicly accessible via Hugging Face and the official GitHub repository. Documentation specifies technical details such as digit splitting, byte-level encoding for unknown tokens, and the preservation of whitespace. The vocabulary size and tokenization approach are consistent across official documentation and third-party implementations like Transformers and vLLM.

Model

24.0 / 40

Parameter Density

7.0 / 10

The model is marketed as '2B', but technical documentation reveals the actual parameter count is approximately 2.5 billion. While the 'active' vs 'total' distinction is not applicable here as it is a dense model, the discrepancy between the marketing name and the actual size is documented in the technical report (Table 1). The architectural breakdown (layers, heads, embedding dimensions) is fully transparent, allowing for precise parameter calculation by researchers.

Training Compute

3.5 / 10

Google discloses the hardware used (TPUv5e) and the scale (512 TPUv5e chips across 2 pods for the 2B model). However, it does not provide the total training duration in hours, the total energy consumption, or the carbon footprint. While the 'Pathways' approach and sharding techniques are mentioned, the lack of specific compute-time metrics or environmental impact data results in a low score for this pillar.

Benchmark Reproducibility

4.0 / 10

Google provides a wide array of benchmark results (MMLU, GSM8K, HumanEval, etc.) in the technical report. However, the exact prompts, few-shot examples, and evaluation code were not initially released in a centralized, reproducible format. Third-party audits (e.g., Unsloth) discovered significant discrepancies and bugs in the initial release's implementation of the technical report's specifications (such as BOS token requirements and RoPE precision), which hindered immediate reproducibility. The score is further adjusted due to documented evidence of benchmark contamination in the training data.

Identity Consistency

9.5 / 10

Gemma 1 2B demonstrates high identity consistency. It correctly identifies itself as a model developed by Google and is transparent about its versioning (e.g., distinguishing between 1.0 and 1.1). There are no significant reports of the model claiming to be a competitor's product (like GPT-4). It maintains a clear boundary regarding its capabilities as a text-only model compared to the multimodal Gemini models.

Downstream

19.5 / 30

License Clarity

6.5 / 10

The model is released under the 'Gemma Terms of Use,' which is a custom 'open weights' license rather than a standard OSI-approved open-source license like Apache 2.0. While it allows for commercial use and redistribution, it includes restrictive clauses regarding 'Model Derivatives' and a 'Prohibited Use Policy' that Google can enforce remotely. The terms are legally clear but create a 'viral' effect where any model trained on Gemma output must also follow these terms, leading to some community ambiguity.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented by both Google and the community. Official model cards provide VRAM estimates for different precisions (e.g., ~4.7GB for BF16), and third-party tools like the Hugging Face Model Memory Utility provide granular data for quantization (e.g., ~1.2GB for INT4). The impact of context length (8k tokens) on memory is also publicly verifiable through standard transformer memory scaling formulas.

Versioning Drift

5.0 / 10

Google maintains a release log and uses version numbers (1.0, 1.1). However, the transition from 1.0 to 1.1 involved significant 'silent' changes in behavior due to a new RLHF method and bug fixes that were not fully detailed in a technical changelog. While the previous versions remain accessible, the lack of a detailed, line-by-line changelog for weights and alignment updates makes tracking drift difficult for developers.

GPU Requirements

Full Calculator

Choose the quantization method for model weights

Context Size: 1,024 tokens

1k
4k
8k

VRAM Required:

Recommended GPUs