Parameters
9.2B
Context Length
8,192 tokens
Modality
Text
Architecture
Dense
License
Gemma-Community
Release Date
14 Nov 2024
Knowledge Cutoff
-
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
16
Key-Value Heads
8
Attention Head Dimension
256
Position Embedding
Rotary Position Embedding (RoPE)
RoPE Theta
10,000
Sliding Window Attention
Yes
Sliding Window Size
4,096
Normalization
RMS Normalization
Activation Function
Gated GELU
Dimensions
Hidden Dimension Size
3,584
Number of Layers
42
FFN Intermediate Size (Dense)
14,336
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
256,000
Sahabat-AI-Gemma2-9B is a large language model specialized for the languages of the Indonesian archipelago, including Bahasa Indonesia and regional languages such as Javanese and Sundanese. Developed through a collaboration between GoTo and Indosat Ooredoo Hutchison, with technical support from AI Singapore and NVIDIA, the model is built on the Gemma 2 9B architecture. It underwent continued pre-training (CPT) on approximately 50 billion tokens of Indonesian-centric data. This localized training is intended to capture cultural context and grammatical nuances that general-purpose multilingual models often miss.
The technical architecture follows the dense decoder-only transformer design of Gemma 2, incorporating significant optimizations for inference efficiency and training stability. It utilizes Grouped-Query Attention (GQA) with 16 query heads and 8 key-value heads, effectively reducing memory bandwidth requirements during generation. A hallmark of this architecture is the interleaving of global and local sliding window attention layers, which balances long-range dependency modeling with computational performance. The model employs the GeGLU activation function and implements a hybrid normalization scheme using RMSNorm in both pre-norm and post-norm configurations to maintain signal integrity across its 42 layers.
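As a rough illustration of the attention layout described above, the sketch below maps the 16 query heads onto the 8 shared key-value heads and checks whether one position may attend to another in a local (sliding window) or global layer. The function names are hypothetical, and two details are assumptions rather than facts from this page: that even-indexed layers are the local ones, and that the window covers exactly the most recent 4,096 positions including the current one.

```python
NUM_QUERY_HEADS = 16    # from the spec listing
NUM_KV_HEADS = 8        # from the spec listing
SLIDING_WINDOW = 4096   # from the spec listing


def kv_head_for_query_head(query_head: int) -> int:
    """Return the key-value head shared by this query head under GQA."""
    group_size = NUM_QUERY_HEADS // NUM_KV_HEADS  # 2 query heads per KV head
    return query_head // group_size


def layer_is_local(layer_index: int) -> bool:
    """Assumption: local and global layers alternate, starting with local."""
    return layer_index % 2 == 0


def can_attend(query_pos: int, key_pos: int, local: bool) -> bool:
    """Causal mask, optionally restricted to a sliding window."""
    if key_pos > query_pos:
        return False  # never attend to future tokens
    if local:
        # Assumption: the window spans the last SLIDING_WINDOW positions.
        return query_pos - key_pos < SLIDING_WINDOW
    return True


if __name__ == "__main__":
    print([kv_head_for_query_head(h) for h in range(NUM_QUERY_HEADS)])
    print(can_attend(5000, 0, local=True), can_attend(5000, 0, local=False))
```

Because pairs of query heads share one key-value head, the key-value cache is half the size it would be with one key-value head per query head.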
Positioned for deployment in diverse Indonesian applications, Sahabat-AI-Gemma2-9B is intended for tasks such as multilingual question answering, sentiment analysis, and translation. It uses Rotary Position Embeddings (RoPE) and applies logit soft-capping, which bounds the magnitude of logits to help keep training stable and improve generation quality. As an open-weights release under the Gemma Community License, it gives developers a foundation for building localized AI services, ranging from enterprise-grade virtual assistants to educational tools for Indonesia's digital landscape.
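Logit soft-capping is commonly written as a scaled tanh, which leaves small values nearly unchanged while keeping every value strictly within a fixed bound. The sketch below shows that formula in plain Python. The cap values of 50.0 for attention logits and 30.0 for final logits are taken from the public Gemma 2 configuration and are assumed, not confirmed by this page, to carry over to this model; the function name is hypothetical.

```python
import math

ATTN_LOGIT_CAP = 50.0   # assumption: Gemma 2 attention soft-cap value
FINAL_LOGIT_CAP = 30.0  # assumption: Gemma 2 final-logit soft-cap value


def soft_cap(logit: float, cap: float) -> float:
    """Smoothly squash a logit so its magnitude never exceeds cap."""
    return cap * math.tanh(logit / cap)


if __name__ == "__main__":
    for raw in (0.5, 10.0, 100.0, 1000.0):
        print(raw, soft_cap(raw, FINAL_LOGIT_CAP))
```

Unlike hard clipping, the tanh form stays differentiable everywhere, so large logits still receive a (small) gradient.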
Sahabat-AI is an Indonesian language model family co-initiated by GoTo and Indosat Ooredoo Hutchison. Developed with AI Singapore and NVIDIA, it is a collection of models (based on Gemma 2 and Llama 3) specifically optimized for Bahasa Indonesia and regional languages like Javanese and Sundanese.
No evaluation benchmarks are available for Sahabat-AI-Gemma2-9B.
Overall Rank
-
Coding Rank
-
Total Score
73 / 100
Sahabat-AI-Gemma2-9B exhibits strong transparency regarding its architectural foundation and hardware requirements, providing clear specifications for deployment. While it offers good disclosure of training hardware and general data sources, it lacks a detailed percentage breakdown of its pre-training corpus and public access to its full evaluation suite. The model's identity and licensing are well-defined, though long-term versioning and environmental impact reporting remain areas for improvement.
Architectural Provenance
The model is explicitly identified as a continued pre-training (CPT) and instruction-tuned variant of Google's Gemma 2 9B. The architecture is well-documented as a dense decoder-only transformer with 42 layers, using Grouped-Query Attention (GQA), GeGLU activation, and interleaved global/local sliding window attention. The transition from the base SEA-LION v3 model to Sahabat-AI is clearly stated in the official model cards on Hugging Face and in NVIDIA's documentation.
Dataset Composition
The training data volume is disclosed as approximately 50B tokens for the CPT phase and specific counts for instruction tuning (448k Indonesian, 96k Javanese, 98k Sundanese, and 129k English pairs). While the general sources (web, curated instructions, synthetic data) and collaborators (AI Singapore, Indonesian universities, media groups) are named, a precise percentage breakdown of the 50B token pre-training corpus is missing, and the raw data is not public.
Tokenizer Integrity
The model uses the standard Gemma 2 tokenizer, which is publicly accessible and well-documented, with a vocabulary size of 256,000 tokens. The documentation explicitly confirms the use of the default tokenizer to maintain compatibility with the Gemma ecosystem, and its performance across the target Indonesian languages is verified through the SEA HELM benchmark results.
Parameter Density
The model's parameter count is precisely stated as 9.24 billion. As a dense architecture, all parameters are active during inference. Detailed architectural specifications including the number of layers (42), hidden dimension size, and attention head configuration (16 query, 8 KV) are available through the base Gemma 2 documentation and inherited by this variant.
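The stated parameter count can be cross-checked against the listed dimensions. The sketch below is a back-of-the-envelope tally, not an official accounting: it assumes bias-free linear layers, input and output embeddings that share one weight matrix, a three-matrix gated feed-forward block, and four RMSNorm weight vectors per layer plus one final norm. All names are hypothetical.

```python
HIDDEN = 3584
LAYERS = 42
QUERY_HEADS = 16
KV_HEADS = 8
HEAD_DIM = 256
FFN = 14336
VOCAB = 256_000


def attention_params() -> int:
    q = HIDDEN * QUERY_HEADS * HEAD_DIM   # query projection
    k = HIDDEN * KV_HEADS * HEAD_DIM      # key projection
    v = HIDDEN * KV_HEADS * HEAD_DIM      # value projection
    o = QUERY_HEADS * HEAD_DIM * HIDDEN   # output projection
    return q + k + v + o


def ffn_params() -> int:
    # Gated feed-forward: gate, up, and down projections.
    return 3 * HIDDEN * FFN


def total_params() -> int:
    norms_per_layer = 4 * HIDDEN          # assumption: 4 RMSNorms per layer
    per_layer = attention_params() + ffn_params() + norms_per_layer
    embedding = VOCAB * HIDDEN            # assumption: tied with output head
    final_norm = HIDDEN
    return LAYERS * per_layer + embedding + final_norm


if __name__ == "__main__":
    print(f"{total_params():,}")  # 9,241,705,984
```

Under these assumptions the tally comes to about 9.24 billion, in line with the figure quoted above.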
Training Compute
Compute details are provided for both the CPT and fine-tuning phases. The CPT phase utilized 32x NVIDIA H100 80GB GPUs for 7 days using MosaicML Composer. The fine-tuning and alignment phases took 4 and 2 hours respectively on 8x H100 GPUs. While hardware and duration are clear, specific energy consumption or carbon footprint calculations were not provided in the official documentation.
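The reported hardware and durations translate directly into GPU-hours. The short calculation below uses only the figures quoted above; the throughput number is a derived average that assumes the full 50 billion tokens were processed once over exactly 7 days, which the documentation does not state.

```python
CPT_GPUS, CPT_DAYS = 32, 7
FINETUNE_GPUS, FINETUNE_HOURS = 8, 4
ALIGN_GPUS, ALIGN_HOURS = 8, 2
CPT_TOKENS = 50_000_000_000  # "approximately 50B tokens"

cpt_gpu_hours = CPT_GPUS * CPT_DAYS * 24
finetune_gpu_hours = FINETUNE_GPUS * FINETUNE_HOURS
align_gpu_hours = ALIGN_GPUS * ALIGN_HOURS
total_gpu_hours = cpt_gpu_hours + finetune_gpu_hours + align_gpu_hours

# Assumption: one pass over the corpus, spread evenly over the 7 days.
tokens_per_gpu_second = CPT_TOKENS / (cpt_gpu_hours * 3600)

if __name__ == "__main__":
    print(cpt_gpu_hours, finetune_gpu_hours, align_gpu_hours, total_gpu_hours)
    print(round(tokens_per_gpu_second))
```

Continued pre-training dominates the budget: roughly 5,376 of about 5,424 total H100 GPU-hours.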
Benchmark Reproducibility
Evaluation results are provided for the SEA HELM (BHASA) benchmark and standard English benchmarks (IFEval, MMLU-Pro). The documentation specifies the use of native prompts and zero-shot/five-shot settings. However, the exact prompt templates and full evaluation code are not publicly hosted in a dedicated repository, and the team notes discrepancies with official leaderboards due to inference platform differences (vLLM vs. Transformers).
Identity Consistency
The model consistently identifies as part of the Sahabat-AI family and acknowledges its Gemma 2 9B foundation. It is transparent about its specialization for Indonesian regional languages. There are no documented instances of the model claiming to be a different architecture or misrepresenting its origins in official testing or documentation.
License Clarity
The model is released under the Gemma Community License, an open-weights license that allows commercial use subject to Google's specific terms. The license itself is clear, but some third-party repositories (such as Ollama) occasionally label the model as Apache 2.0; official sources consistently cite the Gemma Community License.
Hardware Footprint
Detailed VRAM requirements are available for various configurations, including FP16 (approx. 21.34 GB) and quantized versions (e.g., Q4_K_M at 6.6 GB). Documentation provides specific GPU recommendations (RTX 3090/A6000) and notes the impact of context length (8192 tokens) on memory usage, offering high transparency for deployment.
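To see how context length feeds into memory use, the sketch below estimates weight memory and key-value cache size from the listed dimensions. It is a simplified estimate, not the method behind the figures quoted above: it assumes 2 bytes per value, a cache that stores the full context in every layer (ignoring any savings from sliding window layers), and no activation or framework overhead. Function names are hypothetical.

```python
PARAMS = 9_240_000_000  # 9.24 billion, as stated on this page
LAYERS = 42
KV_HEADS = 8
HEAD_DIM = 256


def weight_bytes(bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone (2 bytes per parameter for FP16)."""
    return PARAMS * bytes_per_param


def kv_cache_bytes(context_tokens: int, bytes_per_value: int = 2) -> int:
    """Keys and values for every layer, KV head, and cached position."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * context_tokens


if __name__ == "__main__":
    gb = 1e9
    print(f"weights (FP16): {weight_bytes() / gb:.2f} GB")
    for ctx in (1024, 4096, 8192):
        print(f"KV cache @ {ctx}: {kv_cache_bytes(ctx) / gb:.2f} GB")
```

Under these assumptions the FP16 weights take about 18.5 GB and the cache grows linearly with context, reaching roughly 2.8 GB at 8,192 tokens.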
Versioning Drift
The model uses basic versioning (v1), and the transition from 'v1.0' to 'v1' is documented in Hugging Face commit histories. However, there is no formal semantic versioning system or public changelog detailing specific weight updates or performance drift over time. The project is relatively new, so long-term tracking is not yet established.