Gemma 3 12B

Parameters

12B

Context Length

128K

Modality

Multimodal

Architecture

Dense

License

Gemma Terms of Use

Release Date

12 Mar 2025

Knowledge Cutoff

Aug 2024

System Requirements

VRAM requirements for different quantization methods and context sizes

1,024 tokens

26.98 GB VRAM

Consumer

2x RTX 4090

24GB VRAM

Datacenter

1x NVIDIA A100

80GB VRAM

Apple Silicon

1x Apple M3 Max

128GB VRAM

128,000 tokens

61.38 GB VRAM

Consumer

3x RTX 4090

24GB VRAM

Datacenter

1x NVIDIA A100

80GB VRAM

Apple Silicon

1x Apple M3 Max

128GB VRAM
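
The context-dependent part of these figures comes largely from the KV cache. As a rough sketch, a full-attention upper bound can be estimated from the layer count, KV head count, and head dimension listed elsewhere on this page (42, 12, and 64). The 2-byte element size is an assumption (BF16), and the function name is illustrative. The estimate ignores weights, activations, and framework overhead, so it is not expected to reproduce the totals above.

```python
def kv_cache_bytes(tokens: int, layers: int = 42, kv_heads: int = 12,
                   head_dim: int = 64, bytes_per_elem: int = 2) -> int:
    """Upper-bound KV-cache size when every layer caches every token."""
    # Factor of 2: one key vector and one value vector per token, head, and layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


# Context sizes shown in the System Requirements section.
for n in (1_024, 128_000):
    print(f"{n:>7} tokens: {kv_cache_bytes(n) / 1e9:.2f} GB (full-attention bound)")
```

Because most layers use sliding-window attention, the actual cache should be smaller than this bound.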

Architecture Diagram

[Diagram: Input Tokens → Token Embedding (position: RoPE; hidden: 3.1k; context: 128K) → 42 layers, each: RMSNorm (pre-attention) → Grouped-Query Attention (48 Q / 12 KV heads, head dim 64) → residual add → RMSNorm (pre-FFN) → Feed-Forward Network (activation) → residual add → Final RMSNorm → Output Logits]

Evaluation Benchmarks

Rank

#85

Benchmark (category): score, rank

WebDev Arena (Web Development): score 1342, rank 60

Text Arena (General Text): score 1341, rank 71

Rankings

Overall Rank

#85

Coding Rank

#70

About Gemma 3 12B

Gemma 3 12B is a 12-billion-parameter multimodal model from Google that accepts text and image inputs and generates text. It belongs to the Gemma family, which builds on the research and technology behind the Gemini models. The architecture is a decoder-only transformer with Grouped-Query Attention (GQA) in which five local sliding-window self-attention layers are interleaved with one global self-attention layer. Because local layers cache keys and values only for tokens inside their window, this pattern reduces KV-cache memory, which matters most for long sequences. Positions are encoded with Rotary Position Embeddings (RoPE), using a higher base frequency to support the extended context window.
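
The 5:1 interleaving can be sketched as a simple layer schedule. The snippet below is illustrative only: it uses the 42-layer count listed on this page, assumes each group of six layers ends with the global layer, and assumes a 1,024-token sliding window (the page does not state the window size). It counts how many token positions the KV cache must hold across all layers.

```python
def layer_pattern(num_layers: int, local_per_global: int = 5) -> list[str]:
    """Return 'local'/'global' per layer: N local layers, then one global."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local"
            for i in range(num_layers)]


def cached_positions(pattern: list[str], context_len: int, window: int) -> int:
    """Token positions cached across all layers for one sequence."""
    # Global layers keep the whole context; local layers keep at most `window`.
    return sum(context_len if kind == "global" else min(context_len, window)
               for kind in pattern)


pattern = layer_pattern(42)
context_len, window = 128_000, 1_024  # window size is an assumption
mixed = cached_positions(pattern, context_len, window)
full = len(pattern) * context_len
print(pattern.count("local"), "local /", pattern.count("global"), "global layers")
print(f"cache relative to all-global attention: {mixed / full:.1%}")
```

Under these assumptions, nearly all of the remaining cache belongs to the global layers, which is why the ratio of local to global layers matters at long context lengths.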

Gemma 3 12B is intended for deployment across a range of hardware, including single-GPU systems, workstations, laptops, and mobile devices. Its multimodal capability comes from a tailored SigLIP vision encoder, which converts images into a sequence of soft tokens. The model supports a context length of 128,000 tokens, enough to hold long documents and multiple images in a single prompt. It also supports over 140 languages.

Typical use cases for Gemma 3 12B include natural language understanding and generation tasks such as question answering, summarization, and reasoning. Its multimodal capabilities cover image interpretation, object identification, and extraction of text from images, which suits it to vision-language applications. The model also supports function calling, which allows natural language interfaces to be built on top of programmatic APIs.

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

48

Key-Value Heads

12

Attention Head Dimension

-

Position Embedding

RoPE

RoPE Theta

1M (global layers), 10k (local layers)

Sliding Window Attention

Yes (five local layers per global layer)

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

-

Dimensions

Hidden Dimension Size

3,072

Number of Layers

42

FFN Intermediate Size (Dense)

-

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

262,144

Model Integrity

Total Score

B

69 / 100

Gemma 3 12B Model Integrity Report

Audit Note

Gemma 3 12B exhibits strong transparency in its architectural design and hardware requirements, supported by a detailed technical report. However, it remains opaque regarding the specific composition of its 12-trillion-token training set and the total compute resources consumed during training. The use of a custom license and limited disclosure of evaluation prompts represent moderate barriers to full open-source reproducibility.

Upstream

21.5 / 30

Architectural Provenance

8.0 / 10

Google provides a detailed technical report (arXiv:2503.19786) and model cards that explicitly describe the 12B variant's architecture. It is a decoder-only transformer using Grouped-Query Attention (GQA) and a 5:1 interleaving of local sliding-window and global attention layers to reduce KV-cache memory. The multimodal integration uses a 400M SigLIP vision encoder with a 'Pan & Scan' strategy for flexible resolutions. The report details the use of Rotary Position Embeddings (RoPE) with base frequencies of 1M for global layers and 10k for local layers, and QK-norm instead of soft-capping. While the high-level methodology is clear, the exact layer-by-layer configuration and specific distillation teacher details remain partially proprietary.
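
To illustrate why the two RoPE bases differ, the sketch below computes the standard RoPE inverse frequencies for each base. It is a generic RoPE calculation, not Google's implementation. The head dimension of 64 is taken from the architecture diagram on this page, and the function names are illustrative.

```python
import math


def rope_inv_freq(head_dim: int, base: float) -> list[float]:
    """Standard RoPE inverse frequencies: base^(-2i/d) for each dimension pair."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Token distance after which the slowest-rotating pair completes a turn."""
    return 2 * math.pi / rope_inv_freq(head_dim, base)[-1]


head_dim = 64  # value shown in the diagram on this page
local = longest_wavelength(head_dim, 10_000)       # local (sliding-window) layers
global_ = longest_wavelength(head_dim, 1_000_000)  # global layers
print(f"local: ~{local:,.0f} tokens, global: ~{global_:,.0f} tokens")
```

A larger base stretches the slowest rotations over far more tokens, which is consistent with reserving the 1M base for the layers that attend over the full context.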

Dataset Composition

4.5 / 10

The model was trained on 12 trillion tokens of text and image data. Documentation identifies broad categories: web documents (140+ languages), code, mathematics, and image-text pairs. It mentions rigorous filtering for CSAM and sensitive personal information. However, it lacks a granular percentage breakdown of the mixture (e.g., exact ratio of code vs. web) and does not provide specific names or versions of the datasets used, citing general 'diverse internet data' and 'internal datasets' for vision tuning.

Tokenizer Integrity

9.0 / 10

The tokenizer is a SentencePiece model with a vocabulary of 262,144 tokens, shared with the Gemini 2.0 series. It is publicly available via the Hugging Face 'transformers' library and Google's official repositories. Documentation specifies technical details such as split digits, preserved whitespace, and byte-level encodings. Its performance across 140+ languages is documented, and the vocabulary is reported to be more balanced for non-English text than in previous versions.

Model

27.5 / 40

Parameter Density

8.5 / 10

The model is explicitly defined as a 12.2 billion parameter dense model. The technical report clarifies that the vision encoder is a 400M parameter SigLIP variant, which is shared across the 4B, 12B, and 27B models. This distinction between the language backbone and the vision component provides high transparency regarding active parameter counts during multimodal tasks.

Training Compute

3.5 / 10

Google discloses the hardware used (TPUv4p, TPUv5p, and TPUv5e) but fails to provide the total number of GPU/TPU hours or the specific duration of the training run. There is no calculated carbon footprint or energy consumption data provided in the technical report or model card, which are key requirements for high scores in this category.

Benchmark Reproducibility

6.0 / 10

The technical report includes results for standard benchmarks like MMLU, MATH, HumanEval, and MMMU. It specifies shot counts (e.g., 4-shot for MATH) and compares results against previous Gemma versions and competitors. However, the exact evaluation code and the specific prompts/few-shot examples used for all benchmarks are not fully public, making exact third-party reproduction difficult without significant effort.

Identity Consistency

9.5 / 10

Gemma 3 12B demonstrates high identity consistency. It is trained to identify as 'Gemma' and correctly acknowledges its developer (Google). It does not claim to be a competitor's model (like GPT-4) and is transparent about its multimodal nature and versioning in its system prompts and documentation.

Downstream

20.0 / 30

License Clarity

7.0 / 10

The model is released under the 'Gemma Terms of Use,' which is a custom open-weights license. It allows for commercial use and redistribution but imposes specific restrictions (e.g., attribution requirements, prohibited use policy, and non-sublicensable terms). While the terms are legally clear, they are more restrictive than standard OSI-approved licenses like Apache 2.0, leading to documented compatibility issues with other open-source software.

Hardware Footprint

8.0 / 10

Google provides comprehensive VRAM estimates for various precisions (BF16, INT8, INT4) and specific hardware recommendations (e.g., 27.6 GB for BF16, 6.6 GB for INT4). Documentation also covers the memory impact of the 128K context window and the KV-cache optimization. Quantization-Aware Training (QAT) checkpoints are provided with documented accuracy-efficiency trade-offs.
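
As a back-of-the-envelope check on these figures, weight memory can be approximated as parameter count times bytes per parameter. The sketch below uses the 12.2 billion parameter count quoted in this report. The bytes-per-parameter values are the usual nominal sizes and ignore quantization scales, the KV cache, and runtime overhead, which is presumably why the published estimates are somewhat higher.

```python
PARAMS = 12.2e9  # parameter count quoted in this report

# Nominal storage per parameter; real quantized formats add scale/zero-point data.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(precision: str, params: float = PARAMS) -> float:
    """Approximate weight-only memory in GB (10^9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_gb(precision):.1f} GB (weights only)")
```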

Versioning Drift

5.0 / 10

The model uses a clear naming convention (Gemma 3 12B) and distinguishes between base (PT) and instruction-tuned (IT) variants. However, Google does not maintain a public, detailed changelog for minor weight updates or provide a formal policy regarding model drift or deprecation schedules for specific checkpoints.
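
The section and total scores above appear to be plain sums of the per-criterion scores. The short check below reproduces that arithmetic. The dictionary layout is illustrative, and simple summation is an assumption about how the report aggregates.

```python
# Per-criterion scores as listed in the report, grouped by section.
SCORES = {
    "Upstream": {"Architectural Provenance": 8.0, "Dataset Composition": 4.5,
                 "Tokenizer Integrity": 9.0},
    "Model": {"Parameter Density": 8.5, "Training Compute": 3.5,
              "Benchmark Reproducibility": 6.0, "Identity Consistency": 9.5},
    "Downstream": {"License Clarity": 7.0, "Hardware Footprint": 8.0,
                   "Versioning Drift": 5.0},
}

# Assumed aggregation: each section is the sum of its criteria.
section_totals = {name: sum(items.values()) for name, items in SCORES.items()}
total = sum(section_totals.values())

for name, value in section_totals.items():
    print(f"{name}: {value} / {10 * len(SCORES[name])}")
print(f"Total: {total} / 100")
```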

About Gemma 3

Gemma 3 is a family of open, lightweight models from Google. It introduces multimodal image and text processing, supports over 140 languages, and features extended context windows up to 128K tokens. Models are available in multiple parameter sizes for diverse applications.

