Parameters
12B
Context Length
128K
Modality
Multimodal
Architecture
Dense
License
Gemma Terms of Use
Release Date
12 Mar 2025
Knowledge Cutoff
Aug 2024
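VRAM requirements for different quantization methods and context sizes
| Context Size | Hardware Class | Configuration | VRAM |
|---|---|---|---|
| 1,024 tokens | Consumer | 2x RTX 4090 | 24GB |
| 1,024 tokens | Datacenter | 1x NVIDIA A100 | 80GB |
| 1,024 tokens | Apple Silicon | 1x Apple M3 Max | 128GB |
| 128,000 tokens | Consumer | 3x RTX 4090 | 24GB |
| 128,000 tokens | Datacenter | 1x NVIDIA A100 | 80GB |
| 128,000 tokens | Apple Silicon | 1x Apple M3 Max | 128GB |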
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| Web Development | WebDev Arena | 1342 | 60 |
| General Text | Text Arena | 1341 | 71 |
Overall Rank
#85
Coding Rank
#70
Gemma 3 12B is a 12-billion-parameter multimodal model developed by Google that accepts text and image inputs and generates text outputs. It belongs to the Gemma family, which builds on the research and technology behind the Gemini series of models. The architecture is a decoder-only transformer with Grouped-Query Attention (GQA) in which every five local sliding-window self-attention layers are followed by one global self-attention layer. Because local layers only need to cache the tokens inside their window, this layout reduces KV-cache memory, particularly for long sequences. Positions are encoded with Rotary Position Embeddings (RoPE), using an increased base frequency to support the extended context window.
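The effect of the 5:1 local-to-global layout on KV-cache size can be illustrated with a short sketch. The layer count (12) and the sliding window size (1,024 tokens) below are illustrative assumptions, not values taken from this page, and the function names are hypothetical. The ratio depends only on the pattern, the window, and the context length, so the toy layer count does not affect it.

```python
def layer_pattern(num_layers: int, local_per_global: int = 5) -> list[str]:
    """Label each layer 'local' or 'global'; every sixth layer is global by default."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]


def cached_positions(pattern: list[str], context_len: int, window: int) -> int:
    """Token positions held in the KV cache, summed over all layers.

    A global layer caches the full context; a local layer caches at most
    `window` positions.
    """
    return sum(
        context_len if kind == "global" else min(context_len, window)
        for kind in pattern
    )


# Assumed values for illustration only.
NUM_LAYERS = 12
WINDOW = 1024
CONTEXT = 128_000

pattern = layer_pattern(NUM_LAYERS)
hybrid = cached_positions(pattern, CONTEXT, WINDOW)
all_global = cached_positions(["global"] * NUM_LAYERS, CONTEXT, WINDOW)
ratio = hybrid / all_global

print(pattern[:6])
print(f"hybrid cache is {ratio:.1%} of an all-global cache")
```

Under these assumptions the hybrid layout caches roughly 17% of the positions an all-global stack would hold at a 128,000-token context. The absolute memory also scales with the number of KV heads, the head dimension, and the cache precision, none of which change the ratio.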
Gemma 3 12B is optimized for deployment across a range of hardware, from single-GPU systems and workstations to laptops and mobile devices. Its multimodal capability comes from a tailored SigLIP vision encoder, which converts each image into a sequence of soft tokens that the language model processes alongside text. The model supports a context length of 128,000 tokens, so a single prompt can hold long documents and multiple images. It also offers multilingual support for over 140 languages.
Typical use cases for Gemma 3 12B include natural language understanding and generation tasks such as question answering, summarization, and reasoning. Its multimodal capabilities cover image interpretation, object identification, and extraction of text from images, making it suitable for a range of vision-language applications. The model also supports function calling, which allows natural language interfaces to be built on top of programmatic APIs.
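As a rough illustration of how function calling is typically wired up on the application side, the sketch below parses a structured call emitted by a model and dispatches it to a local function. The JSON shape, the `get_weather` tool, and the `dispatch` helper are all hypothetical; the actual output format depends on how the prompt defines the available tools.

```python
import json


def get_weather(city: str) -> str:
    # Hypothetical tool; a real implementation would query a weather service.
    return f"Sunny in {city}"


# Registry mapping tool names the model may emit to local callables.
TOOLS = {"get_weather": get_weather}


def dispatch(model_output: str) -> str:
    """Parse a JSON function call and run the matching tool.

    Assumes the model was prompted to answer with
    {"name": ..., "arguments": {...}}; this format is an assumption.
    """
    call = json.loads(model_output)
    func = TOOLS[call["name"]]
    return func(**call["arguments"])


# Stand-in for text generated by the model.
result = dispatch('{"name": "get_weather", "arguments": {"city": "Zurich"}}')
print(result)
```

In practice the tool result would be passed back to the model in a follow-up turn so it can compose a natural language answer.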
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
48
Key-Value Heads
12
Attention Head Dimension
-
Position Embedding
RoPE
RoPE Theta
-
Sliding Window Attention
-
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
-
Dimensions
Hidden Dimension Size
3,072
Number of Layers
42
FFN Intermediate Size (Dense)
-
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
-
Total Score
69 / 100
Gemma 3 12B exhibits strong transparency in its architectural design and hardware requirements, supported by a detailed technical report. However, it remains opaque regarding the specific composition of its 12-trillion-token training set and the total compute resources consumed during training. The use of a custom license and limited disclosure of evaluation prompts represent moderate barriers to full open-source reproducibility.
Architectural Provenance
Google provides a detailed technical report (arXiv:2503.19786) and model cards that explicitly describe the 12B variant's architecture. It is a decoder-only transformer using Grouped-Query Attention (GQA), with local sliding-window and global attention layers interleaved at a 5:1 ratio to reduce KV-cache memory. The multimodal integration uses a 400M-parameter SigLIP vision encoder with a 'Pan & Scan' strategy for flexible resolutions. The report details the use of Rotary Position Embeddings (RoPE) with different base frequencies (1M for global layers, 10k for local layers) and QK-norm in place of soft-capping. While the high-level methodology is clear, the exact layer-by-layer configuration and the details of the distillation teacher are not fully disclosed.
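To make the two RoPE base frequencies concrete, the sketch below computes the standard RoPE inverse frequencies for each base and compares the longest rotation wavelength they produce. The head dimension of 128 is an assumption chosen for illustration (this page does not list one), and the helper names are hypothetical.

```python
import math


def rope_inv_freq(head_dim: int, base: float) -> list[float]:
    """Standard RoPE inverse frequencies: base^(-2i / head_dim) for each dimension pair."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Number of positions before the slowest-rotating pair completes one full turn."""
    return 2 * math.pi / rope_inv_freq(head_dim, base)[-1]


HEAD_DIM = 128  # assumed for illustration only

local_wavelength = longest_wavelength(HEAD_DIM, 10_000)      # base quoted for local layers
global_wavelength = longest_wavelength(HEAD_DIM, 1_000_000)  # base quoted for global layers

print(f"local  (10k base): {local_wavelength:,.0f} positions")
print(f"global (1M base):  {global_wavelength:,.0f} positions")
```

A larger base stretches the slowest rotations over many more positions, which is why the global layers, which attend over the full context, use the higher value, while local layers only need to distinguish positions inside their window.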
Dataset Composition
The model was trained on 12 trillion tokens of text and image data. Documentation identifies broad categories: web documents (140+ languages), code, mathematics, and image-text pairs. It mentions rigorous filtering for CSAM and sensitive personal information. However, it lacks a granular percentage breakdown of the mixture (e.g., exact ratio of code vs. web) and does not provide specific names or versions of the datasets used, citing general 'diverse internet data' and 'internal datasets' for vision tuning.
Tokenizer Integrity
The tokenizer is a SentencePiece model with a 262,144 vocabulary size, shared with the Gemini 2.0 series. It is publicly available via the Hugging Face 'transformers' library and Google's official repositories. Documentation specifies technical details such as split digits, preserved whitespace, and byte-level encodings. Its performance across 140+ languages is documented, and the vocabulary is verified to be more balanced for non-English text compared to previous versions.
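The "split digits" and "preserved whitespace" properties can be illustrated with a toy pre-tokenization step. This is not the SentencePiece implementation; it is a minimal sketch, with a hypothetical `split_digits` helper, of what it means for every digit to become its own piece while whitespace is kept intact.

```python
import re


def split_digits(text: str) -> list[str]:
    """Split text so each digit is a separate piece.

    Runs of non-digit characters, including spaces, are left untouched,
    so joining the pieces reproduces the input exactly.
    """
    return re.findall(r"\d|\D+", text)


original = "Context: 128000 tokens"
pieces = split_digits(original)
print(pieces)
```

Splitting numbers digit by digit means the number 128000 is always represented the same way, rather than depending on which multi-digit chunks happen to be in the vocabulary.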
Parameter Density
The model is explicitly defined as a 12.2 billion parameter dense model. The technical report clarifies that the vision encoder is a 400M parameter SigLIP variant, which is shared across the 4B, 12B, and 27B models. This distinction between the language backbone and the vision component provides high transparency regarding active parameter counts during multimodal tasks.
Training Compute
Google discloses the hardware used (TPUv4p, TPUv5p, and TPUv5e) but fails to provide the total number of GPU/TPU hours or the specific duration of the training run. There is no calculated carbon footprint or energy consumption data provided in the technical report or model card, which are key requirements for high scores in this category.
Benchmark Reproducibility
The technical report includes results for standard benchmarks like MMLU, MATH, HumanEval, and MMMU. It specifies shot counts (e.g., 4-shot for MATH) and compares results against previous Gemma versions and competitors. However, the exact evaluation code and the specific prompts/few-shot examples used for all benchmarks are not fully public, making exact third-party reproduction difficult without significant effort.
Identity Consistency
Gemma 3 12B demonstrates high identity consistency. It is trained to identify as 'Gemma' and correctly acknowledges its developer (Google). It does not claim to be a competitor's model (like GPT-4) and is transparent about its multimodal nature and versioning in its system prompts and documentation.
License Clarity
The model is released under the 'Gemma Terms of Use,' which is a custom open-weights license. It allows for commercial use and redistribution but imposes specific restrictions (e.g., attribution requirements, prohibited use policy, and non-sublicensable terms). While the terms are legally clear, they are more restrictive than standard OSI-approved licenses like Apache 2.0, leading to documented compatibility issues with other open-source software.
Hardware Footprint
Google provides comprehensive VRAM estimates for various precisions (BF16, INT8, INT4) and specific hardware recommendations (e.g., 27.6 GB for BF16, 6.6 GB for INT4). Documentation also covers the memory impact of the 128K context window and the KV-cache optimization. Quantization-Aware Training (QAT) checkpoints are provided with documented accuracy-efficiency trade-offs.
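A lower bound on these figures can be sketched from the parameter count alone: weights occupy roughly the number of parameters times the bytes per weight. The calculation below uses the 12.2 billion parameter count quoted on this page and ignores activations, KV cache, and runtime overhead, so real requirements will be higher. The `weight_memory_gb` helper is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 12.2e9  # parameter count quoted in this document

estimates = {
    name: weight_memory_gb(NUM_PARAMS, bits)
    for name, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]
}

for name, gb in estimates.items():
    print(f"{name}: {gb:.1f} GB (weights only)")
```

The weights-only estimates come out below the documented figures, which is the expected direction: the published numbers presumably also account for memory that the model needs beyond its weights.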
Versioning Drift
The model uses a clear naming convention (Gemma 3 12B) and distinguishes between base (PT) and instruction-tuned (IT) variants. However, Google does not maintain a public, detailed changelog for minor weight updates or provide a formal policy regarding model drift or deprecation schedules for specific checkpoints.
Gemma 3 is a family of open, lightweight models from Google. It introduces multimodal image and text processing, supports over 140 languages, and features extended context windows up to 128K tokens. Models are available in multiple parameter sizes for diverse applications.