Parameters
27B
Context Length
8,192 tokens
Modality
Text
Architecture
Dense
License
Gemma License
Release Date
27 Jun 2024
Knowledge Cutoff
-
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
32
Key-Value Heads
16
Attention Head Dimension
-
Position Embedding
RoPE
RoPE Theta
-
Sliding Window Attention
Yes (interleaved with global attention layers)
Sliding Window Size
4,096
Normalization
RMS Normalization
Activation Function
GELU
Dimensions
Hidden Dimension Size
4,608
Number of Layers
46
FFN Intermediate Size (Dense)
-
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
256,128
Gemma 2 is a family of advanced, open models developed by Google DeepMind, stemming from the same research that informed the Gemini models. This model family aims to provide robust capabilities for a range of text generation tasks, including but not limited to question answering, summarization, and reasoning. The 27B variant is engineered for efficient inference, facilitating deployment across various hardware environments, from high-performance workstations to more constrained consumer devices.
The architecture of Gemma 2 refines the standard Transformer design in several ways. It adopts Grouped-Query Attention (GQA) and interleaves local (sliding-window) and global attention layers, which contributes to better performance and inference efficiency, particularly on longer contexts. The model also uses logit soft-capping for training stability and Rotary Position Embeddings (RoPE) for positional encoding. The smaller 2B and 9B models in the Gemma 2 family were trained using knowledge distillation from a larger teacher model.
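To make the Grouped-Query Attention idea concrete, the sketch below maps each query head to the key-value head it shares, using the head counts listed on this page (32 query heads, 16 key-value heads). The function names are hypothetical, and the contiguous grouping (heads 0 and 1 share KV head 0, and so on) is an assumption based on the common convention, not something this page specifies.

```python
def kv_head_for_query_head(query_head: int,
                           num_query_heads: int = 32,
                           num_kv_heads: int = 16) -> int:
    """Return the index of the key-value head used by a given query head.

    Assumes contiguous grouping: consecutive query heads share one KV head.
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    if not 0 <= query_head < num_query_heads:
        raise ValueError("query head index out of range")
    group_size = num_query_heads // num_kv_heads  # 2 for the counts above
    return query_head // group_size


def kv_cache_fraction(num_query_heads: int = 32, num_kv_heads: int = 16) -> float:
    """KV-cache size relative to full multi-head attention (one KV head per query head)."""
    return num_kv_heads / num_query_heads


if __name__ == "__main__":
    print([kv_head_for_query_head(h) for h in range(6)])
    print(kv_cache_fraction())
```

With these counts each key-value head serves two query heads, so the key-value cache is half the size it would be with one KV head per query head.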
The Gemma 2 27B model is designed to achieve a high level of performance within its parameter class while prioritizing computational efficiency. This efficiency enables cost-effective deployment, as the model supports full-precision inference on a single high-performance GPU or TPU. The model is applicable to tasks requiring sophisticated natural language understanding and generation, making it suitable for content creation, conversational AI systems, and natural language processing research.
Gemma 2 is Google's family of open large language models, offering 2B, 9B, and 27B parameter sizes. Built upon the Gemma architecture, it incorporates innovations such as interleaved local and global attention, logit soft-capping for training stability, and Grouped Query Attention for inference efficiency. The smaller models leverage knowledge distillation.
Rank
#135
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| QA Assistant | ProLLM QA Assistant | 0.804 | 19 |
| Summarization | ProLLM Summarization | 0.59 | 25 |
| General Knowledge | MMLU | 0.752 | 27 |
| Reasoning | LiveBench Reasoning | 0.59 | 37 |
| Web Development | WebDev Arena | 1288 | 81 |
Overall Rank
#135
Coding Rank
#90
Total Score
65 / 100
Gemma 2 27B demonstrates strong transparency in its architectural design and tokenizer implementation, providing researchers with clear technical specifications of its unique Transformer modifications. However, the model remains opaque regarding its training data sources and the total environmental impact of its development. While it is a highly capable open-weights model, the reliance on a custom license and the lack of reproducible evaluation artifacts limit its overall transparency profile.
Architectural Provenance
The model architecture is extensively documented in the official technical report ('Gemma 2: Improving Open Language Models at a Practical Size'). It details a decoder-only Transformer with specific innovations: Grouped-Query Attention (GQA), interleaved local sliding-window attention (4096 tokens) and global attention (8192 tokens), and logit soft-capping (capped at 50.0 for attention logits and 30.0 for final logits). Unlike the 2B and 9B variants, which used distillation, the 27B model is explicitly stated to be trained from scratch. The report provides a clear table of hyperparameters including layer counts (46), head counts, and embedding dimensions.
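The difference between the local and global layers can be illustrated with boolean attention masks. The sketch below is a minimal illustration, not the reference implementation: the helper names are hypothetical, the rule that even-indexed layers are local and odd-indexed layers are global is an assumption (the page only says the two kinds are interleaved), and so is the convention that a window of `w` covers the current token plus the `w - 1` tokens before it.

```python
from typing import List, Optional


def attention_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """Build a causal mask; mask[i][j] is True if query i may attend to key j.

    With window=None the layer is global (plain causal attention).
    With window=w each query sees itself and at most w - 1 earlier tokens.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            in_window = window is None or (i - j) < window
            row.append(causal and in_window)
        mask.append(row)
    return mask


def layer_window(layer_index: int, window: int = 4096) -> Optional[int]:
    """Assumed alternation: even layers use the sliding window, odd layers are global."""
    return window if layer_index % 2 == 0 else None


if __name__ == "__main__":
    # Tiny example with a window of 2 so the pattern is visible.
    for row in attention_mask(5, window=2):
        print("".join("x" if allowed else "." for allowed in row))
```

In a local layer, attention cost and key-value storage per query are bounded by the window rather than by the full sequence length, which is where the efficiency gain on long contexts comes from.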
Dataset Composition
Transparency regarding training data is minimal. Google discloses the total token count (13 trillion) and general categories (web documents, code, science articles), but provides no specific percentage breakdown, source names, or sampling proportions. The methodology for 'data mixture' is vaguely attributed to 'ablations similar to Gemini 1.0' without further detail. While filtering for CSAM and PII is mentioned, the lack of source-level transparency or a dataset card makes the claims unverifiable.
Tokenizer Integrity
The tokenizer is publicly available on Hugging Face and documented as a SentencePiece-based subword tokenizer with a large vocabulary of 256,128 tokens. It uses byte-level encoding, digit splitting, and preserves whitespace. The vocabulary size and implementation details are consistent across official documentation and third-party libraries (transformers, llama.cpp), allowing for full verification of tokenization behavior.
Parameter Density
The model's parameter count is clearly stated as 27B, and it is a dense architecture. Detailed architectural breakdowns are provided in the technical report, specifying the number of layers (46), hidden size (4608), and the specific configuration of GQA heads. There is no ambiguity regarding active vs. total parameters as it is not an MoE model.
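As a quick sanity check on the stated dimensions, the size of the token embedding table follows directly from the vocabulary size (256,128) and the hidden size (4608) quoted on this page. The sketch below is simple arithmetic under those two figures; the 27e9 total is the nominal "27B" label, not an exact parameter count, so the resulting share is only approximate.

```python
VOCAB_SIZE = 256_128        # tokenizer vocabulary size quoted on this page
HIDDEN_SIZE = 4_608         # hidden size quoted from the technical report
NOMINAL_TOTAL = 27e9        # nominal "27B" label, used only for a rough ratio


def embedding_parameters(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a vocab_size x hidden_size embedding table."""
    return vocab_size * hidden_size


embedding_params = embedding_parameters(VOCAB_SIZE, HIDDEN_SIZE)
embedding_share = embedding_params / NOMINAL_TOTAL

if __name__ == "__main__":
    print(f"embedding parameters: {embedding_params:,}")
    print(f"approximate share of nominal total: {embedding_share:.1%}")
```

The large vocabulary means the embedding table alone accounts for roughly 1.18 billion weights, a few percent of the nominal total.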
Training Compute
Google discloses the hardware used (TPUv5p) and the software stack (JAX and ML Pathways), but fails to provide the total compute budget in terms of TPU-hours or GPU-equivalent hours. No specific carbon footprint calculation or energy consumption data for the 27B training run is provided in the technical report, which only offers high-level '4M' efficiency marketing claims rather than model-specific environmental data.
Benchmark Reproducibility
While the technical report lists scores for standard benchmarks (MMLU, GSM8K, HumanEval), it lacks the specific evaluation code, exact prompts, or few-shot examples required for precise reproduction. Third-party reports have noted significant performance discrepancies when logit soft-capping is not correctly implemented in inference engines, highlighting a lack of detailed reproduction instructions for the model's unique architectural features.
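For readers checking an inference engine, soft-capping is usually written as `cap * tanh(logit / cap)`, which leaves small logits almost unchanged and smoothly bounds large ones to the interval (-cap, cap). The sketch below applies that formula to single values using the caps quoted on this page (50.0 for attention logits, 30.0 for final logits). The function name is hypothetical, and a real implementation would apply this elementwise to tensors.

```python
import math

ATTENTION_LOGIT_CAP = 50.0  # cap for attention logits, as quoted on this page
FINAL_LOGIT_CAP = 30.0      # cap for final (output) logits, as quoted on this page


def soft_cap(logit: float, cap: float) -> float:
    """Smoothly squash a logit into (-cap, cap) using a scaled tanh."""
    return cap * math.tanh(logit / cap)


if __name__ == "__main__":
    for raw in (0.0, 1.0, 25.0, 100.0, -1000.0):
        print(raw, soft_cap(raw, ATTENTION_LOGIT_CAP), soft_cap(raw, FINAL_LOGIT_CAP))
```

An engine that skips this step feeds unbounded logits into the softmax, which is one plausible source of the score discrepancies described above.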
Identity Consistency
The model consistently identifies as Gemma 2 and maintains a clear versioning identity. It does not exhibit the identity confusion seen in some other open-weight models that claim to be GPT-4. It is transparent about its nature as an AI developed by Google, and its capabilities generally align with its documented performance profile.
License Clarity
The model is released under the 'Gemma Terms of Use,' which is a custom license rather than a standard OSI-approved license like Apache 2.0. While it allows for commercial use and redistribution, it includes specific 'Use Restrictions' and a 'Responsible Use Policy.' The terms are clearly written and publicly accessible, but the custom nature and restrictive clauses prevent a higher score.
Hardware Footprint
Hardware requirements are well-documented by both Google and the community. Official documentation notes that the 27B model can run on a single A100 (80GB) or H100 at full precision. Community documentation on Hugging Face and Ollama provides detailed VRAM requirements for various quantization levels (4-bit, 8-bit) and context lengths, though Google's own documentation focuses primarily on high-end datacenter hardware.
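A rough estimate of weight memory at different precisions shows why a single 80GB accelerator is enough. The sketch below assumes a nominal 27e9 parameters and counts weights only: the key-value cache, activations, and runtime overhead are excluded, and real quantization formats add some per-block metadata. Reading "full precision" as 16-bit weights is an assumption, since the page does not state the storage format.

```python
NOMINAL_PARAMS = 27e9  # nominal "27B" label; the exact count differs slightly


def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for bits in (16, 8, 4):
        gb = weight_memory_gb(NOMINAL_PARAMS, bits)
        print(f"{bits}-bit weights: about {gb:.1f} GB")
```

Under these assumptions 16-bit weights need about 54 GB, leaving headroom on an 80GB device for the cache and activations, while 8-bit and 4-bit weights need about 27 GB and 13.5 GB respectively.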
Versioning Drift
Google uses version numbers (e.g., Gemma 2) and maintains a model card on Hugging Face. However, there is no detailed, public-facing changelog for minor weight updates or iterative 'safety' tuning. Users have reported 'silent' changes in behavior, as well as performance issues related to library compatibility (e.g., Flash Attention vs. logit soft-capping), that were not proactively documented in a centralized version history.