Gemma 3 4B

Parameters

4B

Context Length

131,072 tokens (128K)

Modality

Multimodal

Architecture

Dense

License

Gemma License

Release Date

12 Mar 2025

Knowledge Cutoff

Aug 2024

Technical Specifications

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

2048

Number of Layers

30

Attention Heads

32

Key-Value Heads

8

Activation Function

-

Normalization

RMS Normalization

Position Embedding

RoPE

Gemma 3 4B

Gemma 3 4B is a foundational vision-language model developed by Google, designed to process both text and image inputs while generating textual outputs. It is part of the Gemma 3 family of lightweight, state-of-the-art models built upon the same research and technology that powers Google's Gemini models. The 4 billion parameter variant is optimized for efficient performance across diverse hardware environments, ranging from cloud-scale deployments to on-device execution on workstations, laptops, and mobile devices.

Architecturally, Gemma 3 4B employs a decoder-only transformer design. Key innovations include an optimized attention mechanism with a 5:1 interleaving ratio of local sliding-window self-attention layers to global self-attention layers, coupled with a reduced window size for local attention. This modification decreases KV-cache memory overhead, enabling efficient processing of extended context lengths without degrading perplexity. The model uses a custom SigLIP vision encoder that transforms 896×896-pixel images into tokens for the language model, with a "Pan&Scan" algorithm handling images of other aspect ratios or higher resolutions.
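The KV-cache savings from this hybrid design can be sketched with simple arithmetic. The figures below (30 layers, a 5:1 local-to-global ratio, a 1,024-token sliding window, a 131,072-token context) come from this page and the technical report; the exact per-token byte cost cancels out, so only cached positions per layer are counted.

```python
# Sketch: relative KV-cache size of the hybrid attention stack versus a
# hypothetical fully-global stack. Local layers cache at most `window`
# positions; global layers cache the entire context.

def kv_cache_positions(context_len, n_layers=30, local_per_global=5, window=1024):
    n_global = n_layers // (local_per_global + 1)  # 5 global layers
    n_local = n_layers - n_global                  # 25 local layers
    local = n_local * min(window, context_len)     # capped at the window
    global_ = n_global * context_len               # full context cached
    return local + global_

full = 30 * 131_072                    # every layer global at 128K context
hybrid = kv_cache_positions(131_072)
print(f"hybrid cache is {hybrid / full:.1%} of a fully-global cache")
```

At the full 128K context this works out to well under a fifth of the positions a fully-global stack would cache, which is the mechanism behind the "reduced KV-cache overhead" claim.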

Gemma 3 4B is engineered for a wide array of generative AI tasks, including question answering, summarization, and complex reasoning. Its multimodal capabilities allow for comprehensive understanding and analysis of visual data, such as object identification or text extraction from images. The model supports a context window of 128,000 tokens and offers broad multilingual capabilities, handling over 140 languages. Additionally, it integrates function calling, enabling the creation of intelligent agents that can interact with external tools and application programming interfaces.

About Gemma 3

Gemma 3 is a family of open, lightweight models from Google. It introduces multimodal image and text processing, supports over 140 languages, and features extended context windows up to 128K tokens. Models are available in multiple parameter sizes for diverse applications.


Evaluation Benchmarks

Benchmark: WebDev Arena (Web Development)

Score: 1303

Rank: #42

Rankings

Overall Rank

#51

Coding Rank

#57

Model Transparency

Gemma 3 4B Transparency Report

Total Score: B (68 / 100)

Audit Note

Gemma 3 4B exhibits strong transparency in its architectural design and hardware requirements, providing deep technical insights into its hybrid attention mechanism and multimodal integration. However, it remains opaque regarding its specific training data sources and the environmental impact of its compute resources. While the model is highly accessible for local deployment, its custom license and lack of detailed training logs represent significant gaps in its open-science profile.

Upstream

21.5 / 30

Architectural Provenance

8.0 / 10

Gemma 3 4B is explicitly documented as a decoder-only transformer model with a hybrid attention mechanism. The technical report details a 5:1 interleaving ratio of local sliding window self-attention (1,024 token window) and global self-attention layers. It specifies the use of a 400M parameter SigLIP vision encoder and a 'Pan&Scan' algorithm for handling varying image aspect ratios. The training methodology, including the use of knowledge distillation from larger Gemini models, is publicly disclosed in the official technical report (arXiv:2503.19786).
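The documented 5:1 interleaving over 30 layers can be laid out explicitly. The layer count and ratio come from the report; the exact position of the global layer within each 6-layer block is an assumption for illustration only.

```python
# Sketch of the interleaved attention stack: five local sliding-window
# layers followed by one global layer, repeated across 30 layers.
# (Where the global layer sits inside each block is assumed here.)

def layer_pattern(n_layers=30, local_per_global=5):
    block = ["local"] * local_per_global + ["global"]
    return [block[i % len(block)] for i in range(n_layers)]

pattern = layer_pattern()
print(pattern.count("local"), "local /", pattern.count("global"), "global")
```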

Dataset Composition

4.5 / 10

Google provides high-level categories for the training data, including web documents (140+ languages), code, mathematics, and images. The 4B model was trained on 4 trillion tokens. However, specific dataset sources, proportions of each category, and detailed filtering/cleaning methodologies beyond general safety filtering (CSAM and sensitive data) are not disclosed. The reliance on 'proprietary' mixtures and distillation from closed models limits full transparency into the data provenance.
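The two disclosed figures do permit one simple derived ratio, which gives a rough sense of how heavily the model was trained relative to its size:

```python
# Tokens-per-parameter ratio from the disclosed figures:
# 4 trillion training tokens over 4 billion parameters.
tokens = 4_000_000_000_000
params = 4_000_000_000
print(tokens // params, "training tokens per parameter")  # 1000
```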

Tokenizer Integrity

9.0 / 10

The model uses the Gemini 2.0 tokenizer, which is a SentencePiece-based tokenizer with a vocabulary size of 262,208 tokens. It is publicly available via Hugging Face and integrated into the 'transformers' library. Documentation confirms support for over 140 languages and provides details on how the tokenizer handles multimodal inputs by reserving specific token slots for image embeddings.
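One consequence of such a large vocabulary is worth quantifying: the embedding matrix alone accounts for a sizeable share of the 4B parameter budget. The vocabulary size and hidden dimension below come from this page; tied input/output embeddings are assumed (common in Gemma-family models).

```python
# Back-of-envelope: size of the token embedding matrix relative to the
# total parameter count, using figures stated on this page.
vocab, hidden, total = 262_208, 2_048, 4_000_000_000
embed = vocab * hidden
print(f"embedding params: {embed / 1e6:.0f}M ({embed / total:.1%} of 4B)")
```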

Model

25.5 / 40

Parameter Density

7.5 / 10

The model is a dense architecture with 4.0 billion parameters. Detailed architectural specifications are available, including 30 layers, a hidden dimension of 2048, 32 attention heads, and 8 key-value heads (Grouped-Query Attention). While it is a dense model, the documentation clearly distinguishes it from the sparse or 'Matryoshka' variants (Gemma 3n) released in the same family, preventing parameter inflation confusion.
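The grouped-query attention figures above imply a fixed sharing structure, sketched below from the head counts stated on this page:

```python
# GQA sketch: 32 query heads share 8 key-value heads, so each KV head
# serves a group of 4 query heads, shrinking the KV cache 4x versus
# standard multi-head attention with one KV head per query head.
n_heads, n_kv_heads = 32, 8
group_size = n_heads // n_kv_heads
print(f"{group_size} query heads per KV head; KV cache {n_heads // n_kv_heads}x smaller")
```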

Training Compute

3.0 / 10

Information regarding training compute is minimal. While the technical report mentions the use of TPUv5e hardware for training, it does not disclose the total TPU/GPU hours, energy consumption, or carbon footprint. Cost estimates and environmental impact data are conspicuously absent from official documentation, falling into the 'low' transparency category for this pillar.

Benchmark Reproducibility

6.0 / 10

Google provides comprehensive benchmark results across standard suites (MMLU, GSM8K, HumanEval, etc.) and multimodal benchmarks (DocVQA, MMMU). However, while the technical report describes the evaluation settings (e.g., 0-shot vs few-shot), the exact prompts and full evaluation code are not consistently provided in a single reproducible repository, making third-party verification dependent on independent implementations like 'lm-evaluation-harness'.

Identity Consistency

9.0 / 10

Gemma 3 4B consistently identifies itself as a Google-developed model and maintains clear versioning between its 'PT' (Pre-trained) and 'IT' (Instruction-tuned) variants. It does not exhibit identity confusion with competitor models and is transparent about its multimodal limitations (e.g., the 1B variant being text-only while the 4B+ variants are multimodal).

Downstream

20.5 / 30

License Clarity

7.0 / 10

The model is released under the 'Gemma Terms of Use,' which is a custom 'open weights' license rather than a standard OSI-approved license like Apache 2.0. It permits commercial use but includes specific restrictions, such as prohibiting the use of the model to train other models (distillation) and requiring attribution. The terms are clear but more restrictive than true open-source licenses.

Hardware Footprint

8.5 / 10

Hardware requirements are exceptionally well-documented. Official guides provide VRAM estimates for various quantization levels (FP16, SFP8, Q4_0) and context lengths. For example, the 4B model is noted to require ~9.2 GB VRAM for text tasks in BF16 and ~3.4 GB in 4-bit quantization. The impact of the 128K context window on KV-cache memory is also detailed, showing a reduction in overhead due to the hybrid attention architecture.
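The quoted VRAM figures can be sanity-checked against a weights-only floor of params × bytes-per-parameter. This sketch counts weights only; the ~9.2 GB BF16 and ~3.4 GB 4-bit figures cited above additionally include activations, framework overhead, and KV cache, so real usage sits above these floors.

```python
# Weights-only memory floor per quantization level for a 4B-parameter
# model. Bytes-per-parameter values are nominal (Q4_0 treated as 0.5,
# ignoring per-block scale overhead).
PARAMS = 4e9
bytes_per_param = {"BF16": 2.0, "SFP8": 1.0, "Q4_0": 0.5}
for name, b in bytes_per_param.items():
    print(f"{name}: ~{PARAMS * b / 1e9:.1f} GB weights")
```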

Versioning Drift

5.0 / 10

While the model uses clear naming conventions (Gemma 3 4B IT/PT), there is no public, detailed changelog for minor weight updates or 'silent' refreshes. Users have reported configuration issues in early releases (e.g., missing top-level config fields in Hugging Face) that required manual patching until library updates were pushed, indicating some friction in version tracking and deployment stability.
