
Gemma 4 E2B

Parameters

5.1B

Context Length

128K

Modality

Multimodal

Architecture

Dense

License

Apache 2.0

Release Date

2 Apr 2026

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

8

Key-Value Heads

1

Attention Head Dimension

256

Position Embedding

RoPE

RoPE Theta

10,000

Sliding Window Attention

Yes

Sliding Window Size

512

Normalization

RMS Normalization

Activation Function

GELU

Dimensions

Hidden Dimension Size

6,144

Number of Layers

35

FFN Intermediate Size (Dense)

6,144

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

262,144

Architecture Diagram

[Architecture diagram: Input Tokens → Token Embedding (position: RoPE; hidden: 6,144; context: 128K; vocab: 262,144) → 35 × [Pre-Attention RMSNorm → Grouped-Query Attention (8 query / 1 KV heads, sliding window 512, head dim 256) → residual add → Pre-FFN RMSNorm → Feed-Forward Network (GELU, intermediate 6,144) → residual add] → Final RMSNorm → Output Logits]
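The attention figures listed above can be turned into a rough per-layer KV-cache estimate. The following is a minimal sketch, not an official formula: it assumes a 2-byte (FP16/BF16) cache, reads "128K" as 131,072 tokens, and ignores any cache sharing between layers. The function name `kv_cache_bytes_per_layer` is illustrative.

```python
def kv_cache_bytes_per_layer(cached_tokens, kv_heads=1, head_dim=256, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single layer."""
    # The factor of 2 covers the separate key and value tensors.
    return 2 * kv_heads * head_dim * cached_tokens * bytes_per_value

CONTEXT_TOKENS = 128 * 1024  # assumption: "128K" means 131,072 tokens
SLIDING_WINDOW = 512         # sliding window size from the spec table

# A global-attention layer caches every token in the context.
global_layer = kv_cache_bytes_per_layer(CONTEXT_TOKENS)
# A sliding-window layer only needs the most recent window of tokens.
local_layer = kv_cache_bytes_per_layer(min(SLIDING_WINDOW, CONTEXT_TOKENS))

print(f"global layer:         {global_layer / 2**20:.1f} MiB")
print(f"sliding-window layer: {local_layer / 2**20:.1f} MiB")
```

Under these assumptions a full-context global layer needs 128 MiB of cache while a sliding-window layer needs 0.5 MiB, which illustrates why a single KV head and a short window matter on memory-constrained devices.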

Gemma 4 E2B

Gemma 4 E2B is a compact model with 2.3B effective parameters (5.1B including Per-Layer Embeddings), designed for mobile and IoT devices. It accepts text, image, and audio input with a 128K-token context window and targets low-latency, offline operation on edge devices. It also features a built-in reasoning mode and native function calling for agentic workflows.

About Gemma 4

Gemma 4 is Google DeepMind's most advanced open model family, built from Gemini 3 research and technology. The family includes both Dense and Mixture-of-Experts (MoE) architectures. The models are multimodal, handling text and images (plus audio on the smaller variants), with context windows of up to 256K tokens. Gemma 4 targets reasoning, coding, and agentic workflows on hardware ranging from mobile devices to enterprise servers, and is released under the Apache 2.0 license.


Evaluation Benchmarks

No evaluation benchmarks are available for Gemma 4 E2B.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B

66 / 100

Gemma 4 E2B Model Integrity Report

Audit Note

Gemma 4 E2B exhibits strong transparency in licensing and architectural specifications, particularly regarding its unique 'effective parameter' design. However, it remains opaque concerning its training data composition and the specific compute resources utilized during development. While the model is highly accessible for edge deployment, the lack of data provenance and reproducible evaluation scripts limits its overall transparency profile.

Upstream

19.0 / 30

Architectural Provenance

7.0 / 10

Gemma 4 E2B is explicitly documented as a decoder-only Transformer with several specific modifications: Per-Layer Embeddings (PLE), Alternating Attention (sliding-window vs. global), and Shared KV Cache. The model card and technical blog posts detail the layer count (35) and the specific 4:1 ratio of local to global attention layers. While the training methodology (knowledge distillation from Gemini 3) is mentioned, the exact pre-training procedure and data mixtures remain high-level rather than fully reproducible.
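The 4:1 local-to-global ratio across 35 layers can be sketched as a simple layer schedule. This is an illustration only: the report gives the ratio and the layer count, but not where the global layers sit, so the placement below (every fifth layer is global) is an assumption, and `layer_schedule` is a hypothetical helper.

```python
NUM_LAYERS = 35
LOCAL_PER_GLOBAL = 4  # 4:1 ratio of local (sliding-window) to global layers

def layer_schedule(num_layers=NUM_LAYERS, local_per_global=LOCAL_PER_GLOBAL):
    """Return a list of 'local' / 'global' labels, one per layer."""
    period = local_per_global + 1
    # Assumption: the last layer in each block of `period` layers is global.
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]

schedule = layer_schedule()
print(schedule[:5])
print("local:", schedule.count("local"), "global:", schedule.count("global"))
```

With this placement the 35 layers split into 28 sliding-window layers and 7 global layers, so only one layer in five has to attend over the full context.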

Dataset Composition

3.0 / 10

Information regarding the training data is extremely limited. Official documentation states the model was trained on a 'diverse collection of datasets' including web text, code, and multimodal data (images/audio), but provides no specific percentage breakdowns, source names, or detailed filtering/cleaning methodologies. The lack of data provenance is a significant transparency gap common to the Gemma family.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly available via the 'gemma' GitHub repository and Hugging Face. It uses a SentencePiece-based approach with a large vocabulary of 262,144 tokens, supporting over 140 languages. Documentation clearly specifies the tokenization of multimodal inputs (e.g., 16x16 patches for images and mel-spectrograms for audio) and how they are interleaved into the text sequence.
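To give a feel for the 16x16 patching mentioned above, the sketch below counts how many patches an image of a given size would produce. It assumes the image is padded up to a multiple of the patch size, and it counts patches only; whether each patch becomes exactly one token in the text sequence is not stated here, so the result should not be read as a token count. The helper `count_patches` and the example image sizes are hypothetical.

```python
import math

PATCH_SIZE = 16  # 16x16 patches, as stated in the tokenizer notes

def count_patches(height, width, patch=PATCH_SIZE):
    """Number of patch x patch tiles needed to cover a height x width image."""
    # Assumption: partial tiles at the edges are padded to a full patch.
    return math.ceil(height / patch) * math.ceil(width / patch)

print(count_patches(224, 224))  # 14 x 14 tiles
print(count_patches(250, 100))  # 16 x 7 tiles after padding
```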

Model

24.0 / 40

Parameter Density

8.0 / 10

Google provides clear distinctions between 'effective' and 'total' parameters. The E2B variant has 2.3B active parameters during inference but 5.1B total parameters due to the Per-Layer Embeddings (PLE) architecture. This breakdown is consistently reported across official model cards and third-party technical reviews, clarifying the memory vs. compute trade-offs.

Training Compute

2.0 / 10

There is almost no public disclosure regarding the specific compute resources used to train Gemma 4. While it is known to be trained on Google's TPU infrastructure, the total TPU/GPU hours, carbon footprint, and specific hardware counts are not provided in the release documentation or model cards.

Benchmark Reproducibility

5.0 / 10

While Google provides scores for standard benchmarks (MMLU Pro, AIME 2026, LiveCodeBench), the exact evaluation prompts and few-shot examples are not fully disclosed in a reproducible harness. Third-party verification from platforms like Artificial Analysis and LMSys Chatbot Arena exists, but the internal 'thinking mode' benchmarks lack public validation scripts.

Identity Consistency

9.0 / 10

The model demonstrates high identity consistency, correctly identifying itself as a Google-developed model within the Gemma 4 family. It is aware of its versioning and its specific multimodal capabilities (text, image, audio). There are no documented cases of the model claiming to be a competitor's product or misrepresenting its underlying architecture.

Downstream

23.0 / 30

License Clarity

10.0 / 10

Gemma 4 marks a significant shift to the Apache 2.0 license, which is a standard, permissive open-source license. The terms are unambiguous, allowing commercial use, modification, and distribution subject to the license's attribution and notice conditions. This is a major improvement over the previous custom 'Gemma Terms of Use' and provides strong legal clarity for developers.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented for various deployment scenarios. Documentation specifies that the E2B variant can run in under 1.5GB of RAM with 4-bit quantization and provides specific VRAM targets for FP16 (10GB) and 8-bit (5-8GB). Performance metrics for edge devices like Raspberry Pi 5 and mobile NPUs are also publicly available.
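The memory figures above can be sanity-checked with the usual weights-only estimate of parameters times bits per parameter. This is a rough sketch: it ignores activations, the KV cache, and quantization overhead, and it uses decimal gigabytes. The helper `weight_gb` is illustrative.

```python
def weight_gb(num_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 5.1e9      # total, including Per-Layer Embeddings
EFFECTIVE_PARAMS = 2.3e9  # effective parameters

for label, params in [("total", TOTAL_PARAMS), ("effective", EFFECTIVE_PARAMS)]:
    for bits in (16, 8, 4):
        print(f"{label:9s} {bits:2d}-bit: {weight_gb(params, bits):5.2f} GB")
```

One reading consistent with the documented targets is that the FP16 (10GB) and 8-bit figures count all 5.1B parameters, while the sub-1.5GB 4-bit figure counts only the 2.3B effective parameters (about 1.15 GB). That interpretation is an inference from the arithmetic, not something the documentation states.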

Versioning Drift

5.0 / 10

The model uses clear naming conventions (E2B, E4B, IT variants), but a formal semantic versioning changelog for weight updates is not yet established. While the initial release is well-documented, there is no public framework for tracking silent updates or performance drift over time beyond the initial model card.
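As a quick check, the sub-scores in this report can be summed to confirm the category totals and the overall score. The sketch assumes plain addition with no weighting, which matches the published numbers; how the letter grade is derived from the total is not documented here and is left out.

```python
scores = {
    "Upstream": {
        "Architectural Provenance": 7.0,
        "Dataset Composition": 3.0,
        "Tokenizer Integrity": 9.0,
    },
    "Model": {
        "Parameter Density": 8.0,
        "Training Compute": 2.0,
        "Benchmark Reproducibility": 5.0,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 10.0,
        "Hardware Footprint": 8.0,
        "Versioning Drift": 5.0,
    },
}

# Assumption: each category total is the unweighted sum of its sub-scores.
category_totals = {name: sum(parts.values()) for name, parts in scores.items()}
total = sum(category_totals.values())

print(category_totals)
print(f"total: {total:.0f} / 100")
```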
