Parameters: 5.1B
Context Length: 128K
Modality: Multimodal
Architecture: Dense
License: Apache 2.0
Release Date: 2 Apr 2026
Knowledge Cutoff: -

Attention
Attention Structure: Grouped-Query Attention
Attention Heads: 8
Key-Value Heads: 1
Attention Head Dimension: 256
Position Embedding: RoPE
RoPE Theta: 10,000
Sliding Window Attention: Yes
Sliding Window Size: 512
Normalization: RMS Normalization
Activation Function: GELU

Dimensions
Hidden Dimension Size: 6,144
Number of Layers: 35
FFN Intermediate Size (Dense): 6,144
Multi-Token Prediction Heads: -

Tokenizer
Vocabulary Size: 262,144
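The attention values listed above can be turned into concrete projection widths and rotary frequencies. The following is a minimal sketch, not the model's actual implementation: it assumes the standard RoPE formulation (one frequency per pair of dimensions, theta^(-2i/d)) and that a single theta applies to every layer, neither of which the page confirms. The function names are hypothetical. With one key-value head shared by all eight query heads, the configuration is the multi-query special case of grouped-query attention.

```python
# Values copied from the attention specification listed above.
NUM_Q_HEADS = 8
NUM_KV_HEADS = 1
HEAD_DIM = 256
ROPE_THETA = 10_000.0


def attention_widths(q_heads, kv_heads, head_dim):
    """Return (query width, key/value width, query heads per KV head)."""
    return q_heads * head_dim, kv_heads * head_dim, q_heads // kv_heads


def rope_inverse_frequencies(head_dim, theta):
    """Standard RoPE (assumed): one frequency per dimension pair, theta^(-2i/d)."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


q_width, kv_width, group_size = attention_widths(NUM_Q_HEADS, NUM_KV_HEADS, HEAD_DIM)
inv_freq = rope_inverse_frequencies(HEAD_DIM, ROPE_THETA)

print(f"query width: {q_width}, key/value width: {kv_width}")
print(f"query heads sharing each KV head: {group_size}")
print(f"rotary frequencies per head: {len(inv_freq)}")
```

The key/value projection is one eighth the width of the query projection, which is what keeps the KV cache small on memory-constrained devices.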
Gemma 4 E2B is an efficiency-focused model with 2.3B effective parameters (5.1B including Per-Layer Embeddings), designed for mobile and IoT devices. It accepts text, image, and audio input with a 128K-token context window and targets low-latency, offline operation on edge devices. It also includes a built-in reasoning mode and native function calling for agentic workflows.
Gemma 4 is Google DeepMind's most advanced open model family, built from Gemini 3 research and technology. The family includes both Dense and Mixture-of-Experts (MoE) architectures. Its multimodal models handle text and images, with audio supported on the smaller variants, and offer context windows of up to 256K tokens. Gemma 4 targets reasoning, coding, and agentic workflows, with an emphasis on capability per parameter across deployments ranging from mobile devices to enterprise servers. The family is released under the Apache 2.0 license.
No evaluation benchmarks are available for Gemma 4 E2B.
Overall Rank: -
Coding Rank: -
Total Score: 66 / 100
Gemma 4 E2B exhibits strong transparency in licensing and architectural specifications, particularly regarding its unique 'effective parameter' design. However, it remains opaque concerning its training data composition and the specific compute resources utilized during development. While the model is highly accessible for edge deployment, the lack of data provenance and reproducible evaluation scripts limits its overall transparency profile.
Architectural Provenance
Gemma 4 E2B is explicitly documented as a decoder-only Transformer with several specific modifications: Per-Layer Embeddings (PLE), Alternating Attention (sliding-window and global layers), and a Shared KV Cache. The model card and technical blog posts detail the layer count (35) and the 4:1 ratio of local to global attention layers. The training methodology (knowledge distillation from Gemini 3) is mentioned, but the pre-training procedure and data mixtures are described only at a high level and are not fully reproducible.
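To make the alternating-attention description concrete, the sketch below lays out 35 layers at a 4:1 local-to-global ratio and estimates the resulting KV cache size. Several things here are assumptions rather than documented facts: the placement of the global layers (every fifth layer), reading "128K" as 131,072 tokens, 16-bit cache values, and ignoring the Shared KV Cache entirely, which makes the result an upper bound. The function names are hypothetical.

```python
NUM_LAYERS = 35
LOCAL_PER_GLOBAL = 4      # 4:1 ratio of local to global layers
SLIDING_WINDOW = 512
NUM_KV_HEADS = 1
HEAD_DIM = 256
CONTEXT_LEN = 128 * 1024  # assumption: "128K" means 131,072 tokens


def layer_pattern(num_layers, local_per_global):
    """Assumed layout: each run of local layers is followed by one global layer."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]


def kv_cache_bytes(pattern, context_len, window, kv_heads, head_dim, bytes_per_value=2):
    """Upper-bound KV cache size in bytes; ignores any cross-layer KV sharing."""
    per_token = 2 * kv_heads * head_dim * bytes_per_value  # keys plus values
    total = 0
    for kind in pattern:
        # Local layers only need to keep the most recent `window` tokens.
        cached_tokens = context_len if kind == "global" else min(context_len, window)
        total += cached_tokens * per_token
    return total


pattern = layer_pattern(NUM_LAYERS, LOCAL_PER_GLOBAL)
num_global = pattern.count("global")
num_local = pattern.count("local")
cache_bytes = kv_cache_bytes(pattern, CONTEXT_LEN, SLIDING_WINDOW, NUM_KV_HEADS, HEAD_DIM)

print(f"{num_local} local layers, {num_global} global layers")
print(f"KV cache at full context: {cache_bytes / 2**20:.0f} MiB")
```

Under these assumptions the global layers account for nearly all of the cache, because each local layer stores at most 512 tokens regardless of context length.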
Dataset Composition
Information regarding the training data is extremely limited. Official documentation states the model was trained on a 'diverse collection of datasets' including web text, code, and multimodal data (images/audio), but provides no specific percentage breakdowns, source names, or detailed filtering/cleaning methodologies. The lack of data provenance is a significant transparency gap common to the Gemma family.
Tokenizer Integrity
The tokenizer is publicly available via the 'gemma' GitHub repository and Hugging Face. It uses a SentencePiece-based approach with a large vocabulary of 262,144 tokens, supporting over 140 languages. Documentation clearly specifies the tokenization of multimodal inputs (e.g., 16x16 patches for images and mel-spectrograms for audio) and how they are interleaved into the text sequence.
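As a rough illustration of the 16x16 patching mentioned above, the sketch below counts the non-overlapping patches in an image. The 768x768 input size is a hypothetical example, and the function name is invented for illustration. The number of tokens the model actually consumes per image may be smaller if the vision encoder pools or merges patches, which this page does not specify.

```python
PATCH_SIZE = 16  # patch side length in pixels, from the description above


def patch_grid(height, width, patch=PATCH_SIZE):
    """Return (rows, cols) of non-overlapping patches; sides must divide evenly."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return height // patch, width // patch


# Hypothetical input resolution, chosen only for illustration.
rows, cols = patch_grid(768, 768)
num_patches = rows * cols
print(f"{rows} x {cols} grid = {num_patches} patches")
```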
Parameter Density
Google provides clear distinctions between 'effective' and 'total' parameters. The E2B variant has 2.3B effective parameters during inference but 5.1B total parameters due to the Per-Layer Embeddings (PLE) architecture. This breakdown is consistently reported across official model cards and third-party technical reviews, clarifying the memory vs. compute trade-offs.
Training Compute
There is almost no public disclosure regarding the specific compute resources used to train Gemma 4. While it is known to be trained on Google's TPU infrastructure, the total TPU/GPU hours, carbon footprint, and specific hardware counts are not provided in the release documentation or model cards.
Benchmark Reproducibility
While Google provides scores for standard benchmarks (MMLU Pro, AIME 2026, LiveCodeBench), the exact evaluation prompts and few-shot examples are not fully disclosed in a reproducible harness. Third-party verification from platforms like Artificial Analysis and LMSys Chatbot Arena exists, but the internal 'thinking mode' benchmarks lack public validation scripts.
Identity Consistency
The model demonstrates high identity consistency, correctly identifying itself as a Google-developed model within the Gemma 4 family. It is aware of its versioning and its specific multimodal capabilities (text, image, audio). There are no documented cases of the model claiming to be a competitor's product or misrepresenting its underlying architecture.
License Clarity
Gemma 4 marks a significant shift to the Apache 2.0 license, which is a standard, permissive open-source license. The terms are unambiguous and allow commercial use, modification, and distribution. This is a major improvement over the previous custom 'Gemma Terms of Use' and gives developers much greater legal clarity.
Hardware Footprint
Hardware requirements are well-documented for various deployment scenarios. Documentation specifies that the E2B variant can run in under 1.5GB of RAM with 4-bit quantization and gives VRAM targets of 10GB for FP16 and 5-8GB for 8-bit. Performance metrics for edge devices such as the Raspberry Pi 5 and mobile NPUs are also publicly available.
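The memory figures above can be sanity-checked with back-of-the-envelope arithmetic on weight storage alone. The sketch below is an estimate, not a measurement: it ignores activations, the KV cache, and quantization overhead such as scales and zero points. Which parameter count applies to each figure is an assumption. One plausible reading is that the FP16 and 8-bit targets cover all 5.1B parameters, while the sub-1.5GB 4-bit figure covers only the 2.3B effective parameters, with the Per-Layer Embeddings kept outside fast memory.

```python
TOTAL_PARAMS = 5.1e9      # including Per-Layer Embeddings
EFFECTIVE_PARAMS = 2.3e9  # effective parameter count


def weight_gb(num_params, bits_per_weight):
    """Weight storage in decimal gigabytes, excluding all runtime overhead."""
    return num_params * bits_per_weight / 8 / 1e9


fp16_total = weight_gb(TOTAL_PARAMS, 16)
int8_total = weight_gb(TOTAL_PARAMS, 8)
int4_total = weight_gb(TOTAL_PARAMS, 4)
int4_effective = weight_gb(EFFECTIVE_PARAMS, 4)

print(f"FP16, all parameters:        {fp16_total:.2f} GB")
print(f"8-bit, all parameters:       {int8_total:.2f} GB")
print(f"4-bit, all parameters:       {int4_total:.2f} GB")
print(f"4-bit, effective parameters: {int4_effective:.2f} GB")
```

On this reading, the estimates line up with the documented targets: roughly 10GB at FP16, about 5GB at 8-bit before overhead, and a little over 1GB at 4-bit for the effective parameters.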
Versioning Drift
The model uses clear naming conventions (E2B, E4B, IT variants), but a formal semantic versioning changelog for weight updates is not yet established. While the initial release is well-documented, there is no public framework for tracking silent updates or performance drift over time beyond the initial model card.