Gemma 3 4B: Specifications and GPU VRAM Requirements

Gemma 3 4B

Closed Source

Open Weights

Parameters

Context Length

131.072K

Modality

Multimodal

Architecture

Dense

License

Gemma License

Release Date

12 Mar 2025

Knowledge Cutoff

Aug 2024

Technical Specifications

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

2048

Number of Layers

Attention Heads

Key-Value Heads

Activation Function

Normalization

RMS Normalization

Position Embedding

ROPE

System Requirements

VRAM requirements for different quantization methods and context sizes

Gemma 3 4B

Gemma 3 4B is a foundational vision-language model developed by Google, designed to process both text and image inputs while generating textual outputs. It is part of the Gemma 3 family of lightweight, state-of-the-art models built upon the same research and technology that powers Google's Gemini models. The 4 billion parameter variant is optimized for efficient performance across diverse hardware environments, ranging from cloud-scale deployments to on-device execution on workstations, laptops, and mobile devices.

Architecturally, Gemma 3 4B employs a decoder-only transformer design. Key innovations include an optimized attention mechanism featuring a 5:1 interleaving ratio of local sliding window self-attention layers with global self-attention layers, coupled with a reduced window size for local attention. This architectural modification aims to decrease KV-cache memory overhead, enabling efficient processing of extended context lengths without degrading perplexity. The model utilizes a custom SigLIP vision encoder, which transforms 896x896 pixel square images into tokens for the language model, with a "Pan&Scan" algorithm employed to handle images of varying aspect ratios or higher resolutions.

Gemma 3 4B is engineered for a wide array of generative AI tasks, including question answering, summarization, and complex reasoning. Its multimodal capabilities allow for comprehensive understanding and analysis of visual data, such as object identification or text extraction from images. The model supports a context window of 128,000 tokens and offers broad multilingual capabilities, handling over 140 languages. Additionally, it integrates function calling, enabling the creation of intelligent agents that can interact with external tools and application programming interfaces.

About Gemma 3

Gemma 3 is a family of open, lightweight models from Google. It introduces multimodal image and text processing, supports over 140 languages, and features extended context windows up to 128K tokens. Models are available in multiple parameter sizes for diverse applications.

Other Gemma 3 Models

Evaluation Benchmarks

Ranking is for Local LLMs.

Rank

#50

Benchmark	Score	Rank
Professional Knowledge MMLU Pro	0.44	25
Graduate-Level QA GPQA	0.31	27
Mathematics LiveBench Mathematics	0.31	28
Reasoning LiveBench Reasoning	0.20	29
Data Analysis LiveBench Data Analysis	0.39	29
Coding LiveBench Coding	0.16	30
General Knowledge MMLU	0.31	33

Rankings

Overall Rank

#50

Coding Rank

#41

GPU Requirements

Full Calculator

Quantization

Choose the quantization method for model weights

Context Size: 1,024 tokens

64k

128k

VRAM Required:

Recommended GPUs

Resources

Official Documentation Release Notes Read the Paper Download Weights Source Code