
Gemma 3n E2B IT

Parameters: ~5B total (~2B effective)

Context Length: 32,768 tokens (32K)

Modality: Text, image, video, and audio input; text output

Architecture: Matryoshka Transformer (MatFormer)

License: Google Gemma License

Release Date: 20 May 2025

Knowledge Cutoff: Jun 2024

Technical Specifications

Effective (Active) Parameters: 2.0B

Number of Experts: -

Active Experts: -

Attention Structure: Grouped-Query Attention (GQA)

Hidden Dimension Size: 2560

Number of Layers: 30

Attention Heads: -

Key-Value Heads: -

Activation Function: -

Normalization: RMS Normalization (RMSNorm)

Position Embedding: Rotary Position Embedding (RoPE)


Gemma 3n E2B IT

Gemma 3n E2B IT is a member of the Google Gemma 3n model family, designed for efficient deployment on resource-constrained devices such as mobile phones, laptops, and workstations, where it enables capable real-time inference directly at the edge. The E2B variant is instruction-tuned (IT) for dialogue and instruction-following across diverse applications.
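A minimal sketch of running the instruction-tuned model with Hugging Face Transformers is shown below; the checkpoint id "google/gemma-3n-E2B-it" and the pipeline task are assumptions based on common Transformers conventions, not details taken from this page.

```python
# A sketch of running Gemma 3n E2B IT with Hugging Face Transformers.
# Assumptions: the checkpoint id "google/gemma-3n-E2B-it" and a recent
# transformers release with Gemma 3n support; verify both before use.
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",           # Gemma 3n is multimodal-in, text-out
    model="google/gemma-3n-E2B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Text-only chat still goes through the multimodal message format.
messages = [
    {
        "role": "user",
        "content": [{"type": "text",
                     "text": "Explain on-device inference in two sentences."}],
    }
]

out = pipe(text=messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])
```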

The architectural foundation of Gemma 3n E2B IT is the Matryoshka Transformer, or MatFormer. Its central innovation is selective parameter activation: the model runs with an effective memory footprint of approximately 2 billion parameters, even though roughly 5 billion parameters are loaded during standard execution. This flexible parameter management lets performance be tuned dynamically against available computational resources.

The model is also multimodal, accepting text, images, video, and audio as input and generating text as output. For visual data it employs a SigLIP vision encoder whose "Pan & Scan" algorithm robustly handles varying image resolutions and aspect ratios.

The attention mechanism is interleaved, alternating five local layers, each restricted to a 1024-token sliding window, with one global layer. This design keeps the Key-Value (KV) cache compact, which is essential for efficient long-context processing. Positional information is encoded with Rotary Position Embeddings (RoPE), and the model combines Grouped-Query Attention (GQA) with RMSNorm normalization.
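The interleaved local/global design can be made concrete with a short sketch that builds the per-layer causal masks described above. The 5:1 layer ratio and 1024-token window come from this description; the code itself is an illustration, not the model's actual implementation.

```python
import numpy as np

def attention_mask(seq_len: int, layer_idx: int,
                   window: int = 1024, pattern: int = 6) -> np.ndarray:
    """Causal attention mask for one layer of a 5-local:1-global stack.

    Layers 0..4 of each group of 6 use a sliding window of `window`
    tokens; layer 5 attends globally. Illustrative sketch only.
    """
    q = np.arange(seq_len)[:, None]   # query positions
    k = np.arange(seq_len)[None, :]   # key positions
    causal = k <= q                   # no attending to future tokens
    if (layer_idx % pattern) < pattern - 1:
        # Local layer: keys must fall inside the sliding window.
        return causal & (q - k < window)
    return causal                     # global layer: full causal attention

# At position 2000, a local layer sees only the last 1024 keys (so its
# KV cache stays bounded), while a global layer sees all 2001 keys.
print(attention_mask(2048, layer_idx=0)[2000].sum())  # -> 1024
print(attention_mask(2048, layer_idx=5)[2000].sum())  # -> 2001
```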

Gemma 3n E2B IT supports a context length of 32,768 tokens. It is broadly multilingual, trained on data covering over 140 languages with a tokenizer optimized for wide language coverage. Typical generative tasks include question answering, summarization, and reasoning, and the efficient architecture suits low-resource deployments such as content analysis tools, automated documentation systems, and interactive multimodal assistants. The model also supports function calling, enabling natural language interfaces for programmatic control.
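As an illustration of the function-calling workflow, the sketch below parses a model reply into a tool invocation. The JSON convention and the get_weather helper are hypothetical, introduced here for illustration; they are not an official Gemma schema.

```python
import json

# Hypothetical tool the model may call; name and schema are assumptions.
def get_weather(city: str) -> str:
    return f"22°C and clear in {city}"

TOOLS = {"get_weather": get_weather}

SYSTEM_PROMPT = (
    "You can call tools by replying with JSON of the form "
    '{"tool": "<name>", "args": {...}}. Available: '
    "get_weather(city: str)."
)

def handle_model_reply(reply: str) -> str:
    """Execute a tool call if the model emitted one, else pass text through."""
    try:
        call = json.loads(reply)
        fn = TOOLS[call["tool"]]
        return fn(**call["args"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return reply  # plain text answer, no tool call

# Stubbed model output for illustration; in practice this string would
# come from the model given SYSTEM_PROMPT plus the user's request.
reply = '{"tool": "get_weather", "args": {"city": "Zurich"}}'
print(handle_model_reply(reply))  # -> 22°C and clear in Zurich
```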

About Gemma 3

Gemma 3 is a family of open, lightweight models from Google. It introduces multimodal image and text processing, supports over 140 languages, and features extended context windows up to 128K tokens. Models are available in multiple parameter sizes for diverse applications.



Evaluation Benchmarks

Rankings are relative to other local LLMs.

Rank: #51

Category                 Benchmark           Score   Rank
Agentic Coding           LiveBench Agentic   0.02    19
Professional Knowledge   MMLU Pro            0.41    26
Professional Knowledge   -                   0.16    29
Professional Knowledge   -                   0.26    29
Graduate-Level QA        GPQA                0.25    29
Graduate-Level QA        -                   0.20    30
General Knowledge        MMLU                0.25    37

Rankings

Overall Rank: #51
Coding Rank: #39

GPU Requirements

VRAM requirements vary with the weight quantization method and the context size (from 1K up to 32K tokens).
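For a rough sense of scale, the sketch below estimates weight memory from the parameter count listed above under common quantization widths. The bytes-per-weight figures are standard assumptions, and the totals exclude KV cache, activations, and runtime overhead; note that selective parameter activation can shrink the effective footprint toward the 2B figure cited earlier.

```python
# Back-of-the-envelope weight-memory estimate: parameter count times
# bytes per weight. A sketch only; real usage adds KV cache,
# activations, and runtime overhead. The ~5B parameter count comes
# from the specifications above.
PARAMS = 5.0e9

BYTES_PER_WEIGHT = {
    "fp16/bf16": 2.0,  # 16-bit floating point
    "int8": 1.0,       # 8-bit quantization
    "int4": 0.5,       # 4-bit quantization
}

for name, width in BYTES_PER_WEIGHT.items():
    gib = PARAMS * width / 2**30
    print(f"{name:>9}: ~{gib:.1f} GiB of weights (KV cache extra)")
```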