Sahabat-AI-Gemma2-9B-Instruct

Parameters

9.2B

Context Length

8,192 tokens

Modality

Text

Architecture

Dense

License

Gemma-Community

Release Date

14 Nov 2024

Knowledge Cutoff

Mar 2024

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

16

Key-Value Heads

8

Attention Head Dimension

256

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

10,000

Sliding Window Attention

Yes

Sliding Window Size

4,096

Normalization

RMS Normalization

Activation Function

Gated GELU

Dimensions

Hidden Dimension Size

3,584

Number of Layers

42

FFN Intermediate Size (Dense)

14,336

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

256,000

Architecture Diagram

Input Tokens → Token Embedding (position: RoPE; hidden: 3,584; context: 8,192; vocab: 256,000) → 42 × [Pre-Attention RMSNorm → Grouped-Query Attention (16 query / 8 KV heads, head dim 256, sliding window 4,096) → residual add → Pre-FFN RMSNorm → Feed-Forward Network (Gated GELU, intermediate 14,336) → residual add] → Final RMSNorm → Output Logits

Sahabat-AI-Gemma2-9B-Instruct

Sahabat-AI-Gemma2-9B-Instruct is a large language model developed through a collaboration between GoTo Group, Indosat Ooredoo Hutchison, and AI Singapore. Built on the Google Gemma 2 architecture, it is the result of continued pre-training (CPT) and instruction tuning tailored to Indonesian. It is designed for conversational use not only in standard Bahasa Indonesia but also in major regional languages, including Javanese and Sundanese, and aims to capture the cultural and linguistic nuances of the Indonesian archipelago.

The underlying architecture is a decoder-only transformer. It uses Grouped-Query Attention (GQA), in which several query heads share each key-value head; this shrinks the key-value cache and reduces memory bandwidth during inference, which matters most at longer context lengths. For training stability, the model applies RMSNorm for both pre- and post-normalization in each layer and uses logit soft-capping to keep logits bounded. Instruction tuning consisted of supervised fine-tuning on a localized dataset of over 600,000 instruction-completion pairs, followed by on-policy alignment and model merging to improve response quality and adherence to complex prompts.
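As a rough illustration of why fewer key-value heads matter, the sketch below estimates the key-value cache size from the figures in the specification table (42 layers, 8 key-value heads, head dimension 256). The 2-byte element size (FP16/BF16) and the decision to ignore sliding-window layers, which would cache fewer positions, are assumptions, so the result is an upper bound and not a measured value.

```python
# Rough KV-cache estimate from the specification table.
# Assumptions (not from the model card): 2 bytes per element (FP16/BF16)
# and every layer caching the full context (sliding-window layers ignored).

NUM_LAYERS = 42
HEAD_DIM = 256
BYTES_PER_ELEMENT = 2


def kv_cache_bytes(context_tokens: int, kv_heads: int) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for storing both keys and values.
    per_token = 2 * NUM_LAYERS * kv_heads * HEAD_DIM * BYTES_PER_ELEMENT
    return per_token * context_tokens


gqa = kv_cache_bytes(8192, kv_heads=8)   # 8 KV heads, as listed in the table
mha = kv_cache_bytes(8192, kv_heads=16)  # hypothetical: one KV head per query head

print(f"8 KV heads:  {gqa / 2**30:.3f} GiB")
print(f"16 KV heads: {mha / 2**30:.3f} GiB")
```

Under these assumptions the cache for a full 8,192-token context is about 2.6 GiB with 8 key-value heads, half of what 16 key-value heads would need.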

The model targets a range of natural language processing tasks in Southeast Asian contexts, including sentiment analysis, toxicity detection, causal reasoning, and abstractive summarization. Because it starts from the Gemma 2 9B base weights, it retains that model's general world knowledge while specializing in regional idioms and cultural contexts that are often underrepresented in global models. This makes it a candidate for developers building localized digital assistants, automated customer service interfaces, and educational tools for the Indonesian market.

About Sahabat-AI

Sahabat-AI is a family of Indonesian language models co-initiated by GoTo and Indosat Ooredoo Hutchison. Developed with AI Singapore and NVIDIA, the collection (based on Gemma 2 and Llama 3) is optimized for Bahasa Indonesia and regional languages such as Javanese and Sundanese.


Evaluation Benchmarks

No evaluation benchmarks for Sahabat-AI-Gemma2-9B-Instruct available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B+

73 / 100

Sahabat-AI-Gemma2-9B-Instruct Model Integrity Report

Audit Note

Sahabat-AI-Gemma2-9B-Instruct exhibits a high degree of transparency regarding its architectural lineage and the specific hardware used for its final training stages. It provides clear documentation on its instruction-tuning dataset sizes and licensing terms. The primary areas for improvement include more granular disclosure of the continued pre-training data sources and the publication of full evaluation code to ensure complete benchmark reproducibility.

Upstream

23.0 / 30

Architectural Provenance

8.5 / 10

The model is explicitly identified as a continued pre-training (CPT) and instruction-tuned variant of Google's Gemma 2 9B architecture. Documentation clearly details the transition from the base Gemma 2 to the CPT version (SEA-LIONv3 base) and finally to Sahabat-AI-Gemma2-9B-Instruct. Technical refinements such as Grouped-Query Attention (GQA), RMSNorm, and logit soft-capping are well documented. The training methodology, including full-parameter fine-tuning, on-policy alignment, and model merging, is publicly disclosed in the model card and associated technical descriptions.
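Logit soft-capping, mentioned above, is commonly implemented as a scaled tanh. The sketch below shows that general form in plain Python. The cap values of 50.0 and 30.0 are taken from the publicly released Gemma 2 configuration, not from this report, and should be treated as assumptions for this model.

```python
import math

# Assumed cap values, from the public Gemma 2 configuration (not from this report).
ATTENTION_LOGIT_CAP = 50.0
FINAL_LOGIT_CAP = 30.0


def soft_cap(logit: float, cap: float) -> float:
    """Smoothly squash a logit into the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)


for raw in (0.5, 10.0, 100.0, -1000.0):
    print(f"{raw:>8.1f} -> {soft_cap(raw, FINAL_LOGIT_CAP):8.3f}")
```

Small logits pass through almost unchanged, while large ones saturate near the cap instead of growing without bound.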

Dataset Composition

6.5 / 10

The model provides specific counts for its instruction-tuning datasets: 448,000 Indonesian, 96,000 Javanese, 98,000 Sundanese, and 129,000 English instruction-completion pairs. It also notes that the continued pre-training phase involved approximately 50 billion tokens. However, while categories (synthetic, hand-curated, publicly available) are mentioned, the specific raw data sources and the exact mixture of the 50B token CPT corpus are not fully itemized, preventing a higher score.
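To put the per-language counts in proportion, the following sketch sums them and prints each language's share. The counts come from the paragraph above; the variable names are illustrative only.

```python
# Instruction-completion pair counts as quoted in the report.
pairs = {
    "Indonesian": 448_000,
    "Javanese": 96_000,
    "Sundanese": 98_000,
    "English": 129_000,
}

total = sum(pairs.values())
# Pairs in Indonesian and the two regional languages, excluding English.
localized = total - pairs["English"]
shares = {language: count / total for language, count in pairs.items()}

print(f"Total pairs: {total:,}")
print(f"Non-English pairs: {localized:,}")
for language, share in shares.items():
    print(f"{language:<11}{share:6.1%}")
```

The non-English subtotal of 642,000 is in line with the "over 600,000" localized pairs cited in the model description.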

Tokenizer Integrity

8.0 / 10

The model utilizes the standard Gemma 2 tokenizer, which has a well-documented vocabulary size of 256,128 tokens. This tokenizer is publicly accessible via the Hugging Face repository. The documentation confirms the use of this default tokenizer and its alignment with the model's architectural requirements, though specific analysis of its efficiency on regional Indonesian dialects (Javanese/Sundanese) compared to standard Indonesian is not deeply detailed in the public documentation.

Model

29.5 / 40

Parameter Density

7.5 / 10

The model's parameter count is clearly stated as 9.24 billion. As a dense model, all parameters are active during inference. The architectural breakdown is inherited from the well-documented Gemma 2 9B framework, including details on attention mechanisms and layer normalization. The '9B' in the name accurately reflects the total parameter count.
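The 9.24 billion figure can be cross-checked against the dimensions in the specification table. The sketch below is a back-of-the-envelope count that assumes tied input and output embeddings, no bias terms, a gated feed-forward block with three weight matrices, and four RMSNorm weight vectors per layer; these assumptions follow the public Gemma 2 implementation and are not stated on this page.

```python
# Dimensions from the specification table.
VOCAB = 256_000
HIDDEN = 3_584
LAYERS = 42
Q_HEADS = 16
KV_HEADS = 8
HEAD_DIM = 256
FFN = 14_336

# Assumptions (not from this page): tied embeddings, no biases,
# gated FFN with three matrices, four RMSNorm vectors per layer.
embedding = VOCAB * HIDDEN

q_proj = HIDDEN * Q_HEADS * HEAD_DIM
k_proj = HIDDEN * KV_HEADS * HEAD_DIM
v_proj = HIDDEN * KV_HEADS * HEAD_DIM
o_proj = Q_HEADS * HEAD_DIM * HIDDEN
attention = q_proj + k_proj + v_proj + o_proj

feed_forward = 3 * HIDDEN * FFN  # gate, up, and down projections
norms = 4 * HIDDEN

per_layer = attention + feed_forward + norms
total_params = embedding + LAYERS * per_layer + HIDDEN  # + final RMSNorm

print(f"Per layer: {per_layer:,}")
print(f"Total:     {total_params:,} (~{total_params / 1e9:.2f}B)")
```

Under these assumptions the total rounds to 9.24 billion, matching the stated count.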

Training Compute

7.0 / 10

The documentation provides specific hardware and duration details: fine-tuning took approximately 4 hours and alignment took 2 hours, both conducted on a cluster of 8x NVIDIA H100-80GB GPUs. While the specific compute for the 50B token continued pre-training phase is less granularly detailed than the instruction-tuning phase, the disclosure of the final tuning stages is exemplary for transparency.
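For comparison with other models, the disclosed durations can be converted to GPU-hours. This is simple arithmetic on the figures quoted above and assumes all 8 GPUs were in use for the full duration of each stage.

```python
GPUS = 8  # 8x NVIDIA H100-80GB, as quoted above

# Approximate wall-clock hours per stage, as quoted above.
stage_hours = {"fine-tuning": 4, "alignment": 2}

# Assumption: every GPU is busy for the whole stage.
gpu_hours = {stage: GPUS * hours for stage, hours in stage_hours.items()}
total_gpu_hours = sum(gpu_hours.values())

for stage, value in gpu_hours.items():
    print(f"{stage}: {value} GPU-hours")
print(f"total: {total_gpu_hours} GPU-hours")
```

This covers only the instruction-tuning and alignment stages; the continued pre-training compute is not included because it is not disclosed in comparable detail.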

Benchmark Reproducibility

6.0 / 10

Evaluation results are provided for the SEA HELM (BHASA) benchmark and IFEval, with specific scores for tasks like sentiment analysis and toxicity detection. The documentation specifies the use of zero-shot and few-shot (5-shot) prompting with native prompts. However, the exact evaluation code and the specific localized versions of the IFEval prompts are not fully public, and the model card notes discrepancies between internal vLLM-based testing and Hugging Face Leaderboard results due to context window limitations.

Identity Consistency

9.0 / 10

The model consistently identifies itself as part of the Sahabat-AI family and correctly attributes its lineage to the Gemma 2 architecture. There is no evidence of identity confusion or claims of being a competitor's model. The documentation and model card maintain a coherent identity across platforms (Hugging Face, NVIDIA NGC, and GitHub).

Downstream

20.5 / 30

License Clarity

8.5 / 10

The model is released under the Gemma Community License, which is a standard, publicly available license allowing for both commercial and non-commercial use with specific attribution and usage restrictions. The license terms are clearly stated on the Hugging Face repository and the NVIDIA NGC catalog. There are no conflicting terms between the weights and the provided code.

Hardware Footprint

7.0 / 10

VRAM requirements are well-documented across different platforms, with specific estimates for FP16 (approx. 18-21GB) and various quantization levels (e.g., Q4_0 requiring ~5.4GB). Recommended hardware (e.g., RTX 3090 for 24GB total VRAM) is provided. The documentation also notes the context length limit (8192 tokens) and its impact on memory, though detailed scaling curves for context length are not provided.
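The quoted figures can be sanity-checked with a weights-only estimate. The sketch below multiplies the 9.24 billion parameter count by the bits stored per weight. The 4.5 bits-per-weight value for Q4_0 assumes the usual GGUF block layout (32 four-bit weights plus one FP16 scale per block); activations, KV cache, and runtime overhead are deliberately left out.

```python
PARAMS = 9.24e9  # parameter count quoted in the report

# Bits stored per weight. The Q4_0 value assumes blocks of 32 four-bit
# weights plus one 16-bit scale: (32 * 4 + 16) / 32 = 4.5.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q4_0": 4.5,
}


def weight_gb(bits_per_weight: float) -> float:
    """Weights-only memory in decimal gigabytes."""
    return PARAMS * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_gb(bits):.2f} GB")
```

The FP16 estimate of roughly 18.5GB sits at the low end of the quoted 18-21GB range, which is what one would expect if the higher figures include runtime overhead. The Q4_0 estimate of roughly 5.2GB comes in slightly under the quoted ~5.4GB, plausibly because some tensors are kept at higher precision in practice.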

Versioning Drift

5.0 / 10

The model uses a 'v1' designation, and the release date is clearly tracked. However, there is no formal semantic versioning system or a detailed public changelog for minor updates or weight adjustments. While the repository is maintained, the infrastructure for tracking behavioral drift or accessing historical versions of the CPT weights is limited.
