Parameters: 9.2B
Context Length: 8,192 tokens
Modality: Text
Architecture: Dense
License: Gemma-Community
Release Date: 14 Nov 2024
Knowledge Cutoff: Mar 2024

Attention
Attention Structure: Grouped-Query Attention
Attention Heads: 16
Key-Value Heads: 8
Attention Head Dimension: 256
Position Embedding: Rotary Position Embedding (RoPE)
RoPE Theta: 10,000
Sliding Window Attention: Yes
Sliding Window Size: 4,096
Normalization: RMS Normalization
Activation Function: Gated GELU

Dimensions
Hidden Dimension Size: 3,584
Number of Layers: 42
FFN Intermediate Size (Dense): 14,336
Multi-Token Prediction Heads: -

Tokenizer
Vocabulary Size: 256,000

Sahabat-AI-Gemma2-9B-Instruct is a large language model developed jointly by GoTo Group, Indosat Ooredoo Hutchison, and AI Singapore. Built on the Google Gemma 2 architecture, this variant is the result of continued pre-training (CPT) followed by instruction tuning targeted at Indonesian-language use. It is designed for conversational use not only in standard Bahasa Indonesia but also in major regional languages, including Javanese and Sundanese, and to reflect the cultural and linguistic nuances of the Indonesian archipelago.
The underlying architecture is a decoder-only transformer with several refinements for efficiency and stability. It uses Grouped-Query Attention (GQA), in which several query heads share each key-value head; this shrinks the key-value cache and reduces memory bandwidth during inference, which matters most at longer context lengths. For training stability, the model applies RMSNorm both before and after each sub-layer and uses logit soft-capping to keep logits bounded. Instruction tuning consisted of supervised fine-tuning on a localized dataset of over 600,000 instruction-completion pairs, followed by on-policy alignment and model merging to refine response quality and adherence to complex prompts.
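As a minimal sketch of the logit soft-capping mentioned above: each logit is divided by a cap, passed through `tanh`, and scaled back, so the result stays strictly inside the range from `-cap` to `cap` while small values are left almost unchanged. The function name `soft_cap`, the sample logits, and the cap value of 30.0 are illustrative assumptions, not settings taken from this page.

```python
import math


def soft_cap(logits, cap):
    """Squash each logit smoothly so its magnitude never exceeds `cap`."""
    return [cap * math.tanh(x / cap) for x in logits]


# Hypothetical raw logits, including two extreme values.
raw = [-500.0, -2.0, 0.0, 2.0, 500.0]

# cap=30.0 is an assumed, illustrative value.
capped = soft_cap(raw, cap=30.0)
print([round(v, 3) for v in capped])
```

Small logits pass through nearly untouched, while the extreme values are pulled to the cap, which is how the technique limits runaway logits without a hard clip.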
The model targets a range of natural language processing tasks, including sentiment analysis, toxicity detection, causal reasoning, and abstractive summarization in Southeast Asian contexts. Because it starts from the base Gemma 2 9B weights, it inherits that model's general world knowledge while specializing in regional idioms and cultural contexts that are often underrepresented in global models. This makes it a candidate for developers building localized digital assistants, automated customer service interfaces, and educational tools for the Indonesian market.
Sahabat-AI is an Indonesian language model family co-initiated by GoTo and Indosat Ooredoo Hutchison. Developed with AI Singapore and NVIDIA, it is a collection of models (based on Gemma 2 and Llama 3) specifically optimized for Bahasa Indonesia and regional languages like Javanese and Sundanese.
No evaluation benchmarks for Sahabat-AI-Gemma2-9B-Instruct available.
Overall Rank: -
Coding Rank: -
Total Score: 73 / 100
Sahabat-AI-Gemma2-9B-Instruct exhibits a high degree of transparency regarding its architectural lineage and the specific hardware used for its final training stages. It provides clear documentation on its instruction-tuning dataset sizes and licensing terms. The primary areas for improvement include more granular disclosure of the continued pre-training data sources and the publication of full evaluation code to ensure complete benchmark reproducibility.
Architectural Provenance
The model is explicitly identified as a continued pre-training (CPT) and instruction-tuned variant of Google's Gemma 2 9B architecture. The documentation details the progression from the base Gemma 2 model to the CPT version (SEA-LIONv3 base) and finally to Sahabat-AI-Gemma2-9B-Instruct. Technical refinements such as Grouped-Query Attention (GQA), RMSNorm, and logit soft-capping are well documented. The training methodology, including full-parameter fine-tuning, on-policy alignment, and model merging, is publicly disclosed in the model card and associated technical descriptions.
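To illustrate what Grouped-Query Attention means for the head counts listed for this model (16 attention heads, 8 key-value heads), the sketch below maps each query head to the key-value head it would share. The contiguous grouping and the helper name `kv_head_for` are assumptions made for illustration; the page does not describe the exact head layout.

```python
NUM_QUERY_HEADS = 16  # "Attention Heads" as listed for this model
NUM_KV_HEADS = 8      # "Key-Value Heads" as listed for this model


def kv_head_for(query_head: int) -> int:
    """Return the key-value head shared by a query head (assumed contiguous groups)."""
    group_size = NUM_QUERY_HEADS // NUM_KV_HEADS
    return query_head // group_size


# Map every query head to its key-value head.
mapping = {q: kv_head_for(q) for q in range(NUM_QUERY_HEADS)}
print(mapping)
```

With these counts each key-value head serves two query heads, so the key-value cache holds half as many heads as it would under plain multi-head attention.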
Dataset Composition
The model card gives specific counts for its instruction-tuning datasets: 448,000 Indonesian, 96,000 Javanese, 98,000 Sundanese, and 129,000 English instruction-completion pairs. It also notes that the continued pre-training phase involved approximately 50 billion tokens. However, while data categories (synthetic, hand-curated, publicly available) are mentioned, the specific raw data sources and the exact mixture of the 50B-token CPT corpus are not fully itemized, which prevents a higher score.
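The per-language counts above can be totalled with a few lines of arithmetic. This sketch only restates the four figures quoted in this section; the variable names are illustrative.

```python
# Instruction-completion pairs per language, as quoted in this section.
pairs = {
    "Indonesian": 448_000,
    "Javanese": 96_000,
    "Sundanese": 98_000,
    "English": 129_000,
}

total = sum(pairs.values())
local = total - pairs["English"]  # Indonesian + Javanese + Sundanese
shares = {lang: round(100 * n / total, 1) for lang, n in pairs.items()}

print(f"total pairs: {total:,}")
print(f"non-English pairs: {local:,}")
print(shares)
```

The four counts sum to 771,000 pairs, of which 642,000 are in Indonesian, Javanese, or Sundanese. That non-English subtotal may be what the "over 600,000" localized figure in the overview refers to, though the page does not say so explicitly.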
Tokenizer Integrity
The model uses the standard Gemma 2 tokenizer, which has a well-documented vocabulary of roughly 256,000 tokens. This tokenizer is publicly accessible via the Hugging Face repository. The documentation confirms the use of this default tokenizer and its alignment with the model's architectural requirements, though the public documentation does not analyze in detail how efficiently it encodes Javanese and Sundanese compared with standard Indonesian.
Parameter Density
The model's parameter count is clearly stated as 9.24 billion. As a dense model, all parameters are active during inference. The architectural breakdown is inherited from the well-documented Gemma 2 9B framework, including details on attention mechanisms and layer normalization. The '9B' in the name is consistent with its total parameter count.
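The 9.24 billion figure can be roughly cross-checked against the dimensions listed for this model (hidden size 3,584, 42 layers, FFN size 14,336, 16 attention heads, 8 key-value heads, head dimension 256, vocabulary 256,000). The sketch below rests on assumptions that this page does not state: the input embedding is shared with the output projection, there are no bias terms, and each layer carries four RMSNorm weight vectors plus one final norm.

```python
HIDDEN = 3_584
LAYERS = 42
Q_HEADS, KV_HEADS, HEAD_DIM = 16, 8, 256
FFN = 14_336
VOCAB = 256_000


def count_parameters() -> int:
    """Estimate the total parameter count under the stated assumptions."""
    embedding = VOCAB * HIDDEN  # assumed shared with the output projection
    q_and_o = 2 * HIDDEN * Q_HEADS * HEAD_DIM   # query and output projections
    k_and_v = 2 * HIDDEN * KV_HEADS * HEAD_DIM  # key and value projections
    mlp = 3 * HIDDEN * FFN  # gate, up, and down projections of the gated FFN
    norms = 4 * HIDDEN      # assumed: pre/post norms around attention and FFN
    per_layer = q_and_o + k_and_v + mlp + norms
    return embedding + LAYERS * per_layer + HIDDEN  # last term: final norm


total = count_parameters()
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the estimate comes to about 9.24 billion, in line with the stated count; roughly 0.92 billion of that is the embedding table, a consequence of the large vocabulary.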
Training Compute
The documentation provides specific hardware and duration details: fine-tuning took approximately 4 hours and alignment took 2 hours, both conducted on a cluster of 8x NVIDIA H100-80GB GPUs. While the specific compute for the 50B token continued pre-training phase is less granularly detailed than the instruction-tuning phase, the disclosure of the final tuning stages is exemplary for transparency.
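The disclosed durations translate into GPU-hours with simple arithmetic. This sketch assumes all 8 GPUs were in use for the full wall-clock duration of each stage, which the documentation does not explicitly confirm.

```python
GPUS = 8  # 8x NVIDIA H100-80GB, as stated above

# Approximate wall-clock hours per stage, as stated above.
stage_hours = {"fine-tuning": 4, "alignment": 2}

# Assumption: every GPU is busy for the whole stage.
gpu_hours = {stage: GPUS * hours for stage, hours in stage_hours.items()}
total_gpu_hours = sum(gpu_hours.values())

print(gpu_hours)
print(f"total: {total_gpu_hours} GPU-hours")
```

This covers only the final tuning stages; the compute for the 50B-token continued pre-training phase is not included because it is not disclosed at the same level of detail.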
Benchmark Reproducibility
Evaluation results are provided for the SEA HELM (BHASA) benchmark and IFEval, with specific scores for tasks like sentiment analysis and toxicity detection. The documentation specifies the use of zero-shot and few-shot (5-shot) prompting with native prompts. However, the exact evaluation code and the specific localized versions of the IFEval prompts are not fully public, and the model card notes discrepancies between internal vLLM-based testing and Hugging Face Leaderboard results due to context window limitations.
Identity Consistency
The model consistently identifies itself as part of the Sahabat-AI family and correctly attributes its lineage to the Gemma 2 architecture. There is no evidence of identity confusion or claims of being a competitor's model. The documentation and model card maintain a coherent identity across platforms (Hugging Face, NVIDIA NGC, and GitHub).
License Clarity
The model is released under the Gemma Community License, which is a standard, publicly available license allowing for both commercial and non-commercial use with specific attribution and usage restrictions. The license terms are clearly stated on the Hugging Face repository and the NVIDIA NGC catalog. There are no conflicting terms between the weights and the provided code.
Hardware Footprint
VRAM requirements are well documented across different platforms, with specific estimates for FP16 (approx. 18-21GB) and various quantization levels (e.g., Q4_0 requiring ~5.4GB). Recommended hardware (e.g., an RTX 3090 with 24GB of VRAM) is provided. The documentation also notes the context length limit (8,192 tokens) and its impact on memory, though detailed scaling curves for context length are not provided.
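A back-of-the-envelope estimate shows where figures of this size come from. The sketch below is a rough model under stated assumptions, not the calculator used by this page: weights take `bits / 8` bytes per parameter, Q4_0 is taken as about 4.5 bits per weight, and the key-value cache is stored in 16-bit precision for every layer at full length (ignoring any savings from sliding-window layers and all activation or framework overhead).

```python
PARAMS = 9.24e9  # stated parameter count
LAYERS, KV_HEADS, HEAD_DIM = 42, 8, 256  # as listed for this model


def weight_gb(bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return PARAMS * bits_per_weight / 8 / 1e9


def kv_cache_gb(tokens: int, bytes_per_value: int = 2) -> float:
    """Key plus value cache for `tokens` positions, assuming a full cache per layer."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * tokens / 1e9


fp16 = weight_gb(16)
q4 = weight_gb(4.5)  # assumed effective bits per weight for Q4_0
cache_full = kv_cache_gb(8_192)

print(f"FP16 weights: {fp16:.2f} GB")
print(f"Q4_0 weights: {q4:.2f} GB")
print(f"KV cache at 8,192 tokens: {cache_full:.2f} GB")
```

The FP16 weights alone land near the lower end of the 18-21GB range quoted above, and adding a full-length cache moves the total toward the upper end. The Q4_0 estimate comes out slightly below the quoted ~5.4GB, which would be consistent with runtime overhead or some tensors being kept at higher precision, though the page does not break this down.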
Versioning Drift
The model uses a 'v1' designation, and the release date is clearly tracked. However, there is no formal semantic versioning system or a detailed public changelog for minor updates or weight adjustments. While the repository is maintained, the infrastructure for tracking behavioral drift or accessing historical versions of the CPT weights is limited.