Total Parameters
25.2B
Context Length
256K
Modality
Multimodal
Architecture
Mixture of Experts (MoE)
License
Apache 2.0
Release Date
2 Apr 2026
Knowledge Cutoff
-
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
16
Key-Value Heads
8
Attention Head Dimension
256
Position Embedding
RoPE
RoPE Theta
10,000
Sliding Window Attention
Yes
Sliding Window Size
1,024
Normalization
RMS Normalization
Activation Function
GELU
Dimensions
Hidden Dimension Size
2,112
Number of Layers
30
FFN Intermediate Size (Dense)
704
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
262,144
Mixture of Experts
Total Expert Parameters
3.8B
Number of Experts
128
Active Experts
8
Shared Experts
1
FFN Intermediate Size (per Expert)
704
Dense Layers Before MoE
-
Gemma 4 26B A4B is a Mixture-of-Experts model with 25.2B total parameters, of which only 3.8B are active per token, giving it roughly the speed of a 4B model with near-31B performance. It has 128 experts (8 active per token) and a 256K-token context window, and it accepts text and image input. It is optimized for fast inference on consumer GPUs while targeting frontier-level reasoning and coding capabilities.
Gemma 4 is Google DeepMind's most advanced open model family, built from Gemini 3 research and technology. The family includes both Dense and Mixture-of-Experts (MoE) architectures. These multimodal models handle text, images, and (on smaller variants) audio, with context windows of up to 256K tokens. Designed for reasoning, coding, and agentic workflows, Gemma 4 targets high intelligence-per-parameter on hardware ranging from mobile devices to enterprise servers. It is released under the Apache 2.0 license.
Rank
#73
No evaluation benchmarks are available for Gemma 4 26B A4B.
Overall Rank
#73
Coding Rank
-
Total Score
70 / 100
Gemma 4 26B A4B exhibits strong transparency in its licensing and architectural specifications, particularly regarding its Mixture-of-Experts structure and hardware requirements. However, it suffers from significant opacity in training data provenance and compute resources, lacking a formal technical paper to verify its underlying methodology. The transition to a standard Apache 2.0 license is a commendable step toward industry-leading transparency for open-weight models.
Architectural Provenance
Gemma 4 26B A4B is explicitly documented as a Mixture-of-Experts (MoE) model derived from the Gemini 3 research lineage. Technical documentation details a hybrid attention mechanism that alternates between local sliding-window layers (1,024 tokens) and global full-context layers. The model uses 128 experts with a routing policy that activates 8 experts plus 1 shared expert per token. While the high-level architecture is well described across official blog posts and model cards, a formal peer-reviewed technical paper with full ablation studies is currently absent, which prevents a higher score.
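The routing and local-attention ideas described above can be illustrated with a small sketch. It is not the model's actual implementation: the function names `route_token` and `sliding_window_mask` are hypothetical, the softmax over only the selected logits is an assumed (but common) normalization choice, and the always-on shared expert is left out because it does not take part in routing. Only the expert count and the number of active experts come from the source.

```python
import math
import random

NUM_EXPERTS = 128  # from the specification table
TOP_K = 8          # routed experts activated per token


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    ASSUMPTION: weights are a softmax over the selected logits only,
    so they sum to 1. The source does not specify the normalization.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    max_logit = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def sliding_window_mask(seq_len, window):
    """Causal mask: query position q may attend to keys k with q - window < k <= q."""
    return [[(k <= q) and (q - k < window) for k in range(seq_len)]
            for q in range(seq_len)]


# Toy router scores for a single token (random, for illustration only).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)

# A tiny mask with a window of 2 instead of the model's 1,024 tokens.
mask = sliding_window_mask(seq_len=5, window=2)
```

In this sketch only 8 of the 128 expert feed-forward blocks would run for the token, which is why compute tracks the active parameter count rather than the total.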
Dataset Composition
Disclosure regarding training data is limited to vague marketing claims. Documentation states the model was trained on a 'diverse' dataset supporting over 140 languages and interleaved multimodal inputs (text and images). However, there is no public breakdown of data sources (e.g., percentages of web, code, or academic data), no detailed filtering/cleaning methodology, and no disclosure of the specific proportions of synthetic vs. organic data used. The lack of a technical report leaves these critical details unverifiable.
Tokenizer Integrity
The tokenizer is publicly accessible via the official GitHub repository and Hugging Face collections. It supports a large vocabulary consistent with the claimed 140+ language support. Technical specifications for tokenization of multimodal inputs (variable resolution image tokens) are documented, with supported visual token budgets (70, 140, 280, 560, 1120) clearly stated. Integration with standard libraries like Transformers and vLLM allows for independent verification of tokenization behavior.
Parameter Density
Google provides exemplary transparency regarding parameter density for this variant. The model is clearly labeled '26B A4B', explicitly denoting 25.2B total parameters with 3.8B active parameters per inference. Documentation further clarifies the expert structure (128 total experts, 8 active per token) and the use of a shared expert. This level of detail prevents the common MoE 'parameter inflation' confusion and provides clear expectations for both memory (total params) and compute (active params).
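As a rough illustration of the memory-versus-compute distinction, the sketch below estimates weight storage from the total parameter count and the per-token compute share from the active count. The helper name `weight_memory_gb` is hypothetical, and the figures cover weights only: they ignore the KV cache, activations, and quantization metadata, so real VRAM needs will be higher.

```python
TOTAL_PARAMS = 25.2e9   # total parameters, from the model card
ACTIVE_PARAMS = 3.8e9   # active parameters, from the model card


def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Share of the parameters that participate in each forward step.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB of weights")
print(f"active fraction: {active_fraction:.1%}")
```

All 25.2B parameters must be resident in memory even though only about 15% of them are used for any one token, which is the expectation the '26B A4B' label is meant to set.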
Training Compute
Information regarding training compute is almost entirely absent. While documentation mentions the model can be fine-tuned on TPUs and H100s, there is no disclosure of the total GPU/TPU hours required for the initial pre-training, no hardware cluster specifications used for the primary run, and no carbon footprint or environmental impact calculations. This represents a significant transparency gap typical of proprietary-derived models.
Benchmark Reproducibility
Official benchmarks (MMLU-Pro: 82.4%, GSM8K: 94.1%) are provided with some methodological notes, such as the use of fixed temperature (0.1) and top-p (0.95) sampling. However, the evaluation code itself is not fully centralized in a reproducible repository, and specific few-shot prompts used for all frontier benchmarks are not exhaustively disclosed. Third-party verification on leaderboards like Open LLM Leaderboard and Arena AI provides some external validation, but the lack of a technical paper limits full reproducibility.
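For readers who want to reproduce the stated decoding settings, the following is a minimal sketch of temperature scaling followed by nucleus (top-p) sampling, using the temperature of 0.1 and top-p of 0.95 quoted above as defaults. The function `sample_top_p` and the toy logits are hypothetical; this is a generic version of the technique, not the evaluation harness used for the official numbers.

```python
import math
import random


def sample_top_p(logits, temperature=0.1, top_p=0.95, rng=random):
    """Sample a token index with temperature scaling and nucleus filtering."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the nucleus in proportion to the surviving probabilities.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]


# Toy logits for a four-token vocabulary (illustrative values only).
toy_logits = [2.0, 1.0, 0.5, -1.0]
picks = [sample_top_p(toy_logits, rng=random.Random(seed)) for seed in range(100)]
```

At a temperature as low as 0.1 the distribution is sharply peaked, so sampling behaves almost like greedy decoding; this reduces, but does not remove, run-to-run variance in benchmark scores.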
Identity Consistency
The model demonstrates high identity consistency, correctly identifying itself as a member of the Gemma 4 family and acknowledging its MoE architecture in system-level interactions. It is transparent about its versioning and its relationship to the Gemini research line. There are no documented instances of the model claiming to be a competitor's product or misrepresenting its parameter count in self-identification tasks.
License Clarity
Gemma 4 marks a significant shift to a standard, OSI-approved Apache 2.0 license. This is a major transparency improvement over previous 'Gemma Terms of Use' licenses. The terms are clear, publicly accessible, and allow for unrestricted commercial use, modification, and redistribution without revenue caps or usage restrictions. The license is consistently applied across weights, code, and documentation.
Hardware Footprint
Hardware requirements are extensively documented for various quantization levels (FP16, Q8, Q4). Official and community documentation (e.g., Unsloth, vLLM) provides specific VRAM targets: ~18 GB for 4-bit and ~60 GB for BF16. Crucially, the documentation includes memory scaling data for the 256K context window, noting that the hybrid attention mechanism allows more efficient VRAM usage at long contexts than standard dense models. Quantization trade-offs (e.g., <2.8% loss at 4-bit) are also disclosed.
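The long-context claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below takes the layer count, key-value head count, head dimension, and window size from the specification table. Everything else is an assumption made for illustration: the source does not say how many layers use global attention, so `global_layers=5` is a hypothetical split, cache entries are assumed to be 16-bit, and 256K is taken to mean 262,144 tokens.

```python
NUM_LAYERS = 30   # specification table
KV_HEADS = 8      # specification table
HEAD_DIM = 256    # specification table
WINDOW = 1024     # sliding-window size, specification table


def kv_cache_gb(context_len, global_layers, bytes_per_value=2):
    """Approximate KV-cache size for one sequence, in decimal GB.

    ASSUMPTION: global_layers is not given in the source; callers choose it.
    Sliding-window layers only keep the most recent WINDOW tokens.
    """
    local_layers = NUM_LAYERS - global_layers
    # Keys plus values for one token in one layer.
    per_token_per_layer = 2 * KV_HEADS * HEAD_DIM * bytes_per_value
    global_bytes = global_layers * context_len * per_token_per_layer
    local_bytes = local_layers * min(context_len, WINDOW) * per_token_per_layer
    return (global_bytes + local_bytes) / 1e9


CONTEXT = 256 * 1024  # assumed meaning of "256K"

# Baseline: every layer attends over the full context.
all_global = kv_cache_gb(CONTEXT, global_layers=NUM_LAYERS)

# Hypothetical hybrid split: 5 global layers, 25 sliding-window layers.
hybrid = kv_cache_gb(CONTEXT, global_layers=5)

print(f"all layers global: {all_global:.2f} GB")
print(f"hypothetical hybrid: {hybrid:.2f} GB")
```

Under these assumptions the cache shrinks from roughly 64 GB to roughly 11 GB at full context, because the sliding-window layers stop growing after 1,024 tokens. The actual saving depends on the real ratio of global to local layers.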
Versioning Drift
The model uses basic versioning (Gemma 4 26B A4B), but a comprehensive, public changelog for weight updates or 'silent' fine-tuning refreshes is not maintained. While the release date is clear, there is no formal infrastructure for tracking behavioral drift over time or accessing specific 'checkpoint' versions beyond the initial release. This makes it difficult for developers to ensure long-term stability in production environments.