Parameters
8B
Context Length
8,192
Modality
Text
Architecture
Dense
License
Llama-3.1-Community
Release Date
14 Nov 2024
Knowledge Cutoff
Mar 2023
Attention Structure
Grouped Query Attention (GQA)
Hidden Dimension Size
4096
Number of Layers
32
Attention Heads
32
Key-Value Heads
8
Activation Function
SwiGLU
Normalization
RMS Normalization
Position Embedding
Rotary Position Embedding (RoPE)
Sahabat-AI-Llama3-8B-Instruct is a specialized large language model developed through a collaboration between GoTo Group and Indosat Ooredoo Hutchison. The model is constructed using a continued pre-training (CPT) approach on the Meta Llama 3 architecture, specifically optimized to reflect the linguistic patterns and cultural context of Indonesia. By incorporating a significant corpus of text in Indonesian and in regional languages such as Javanese and Sundanese, the model provides localized language-processing capabilities that account for regional idioms and social contexts.
The technical framework is a dense, decoder-only Transformer architecture comprising 32 layers and a hidden dimension of 4096. It employs Grouped Query Attention (GQA) with 32 query heads and 8 key-value heads to improve inference efficiency. The model utilizes Rotary Positional Embeddings (RoPE) for sequence modeling and SwiGLU activation functions within its feed-forward layers. Training was facilitated by the NVIDIA NeMo framework, allowing the weights to be refined on a dataset of approximately 50 billion tokens, followed by supervised fine-tuning on hundreds of thousands of instruction-completion pairs.
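For concreteness, the documented hyperparameters map onto a standard Hugging Face `LlamaConfig` as sketched below. Note that the feed-forward width, RoPE base, norm epsilon, and exact vocabulary size are not stated in the documentation and are assumed here from the stock Llama 3 8B configuration; the released weights may differ.

```python
from transformers import LlamaConfig

# Sketch of the documented architecture. Values not given in the model card
# (intermediate_size, rope_theta, rms_norm_eps, exact vocab_size) are assumed
# from the stock Llama 3 8B configuration.
config = LlamaConfig(
    vocab_size=128_256,           # documentation rounds this to "128,000"
    hidden_size=4096,             # hidden dimension
    num_hidden_layers=32,         # decoder layers
    num_attention_heads=32,       # query heads
    num_key_value_heads=8,        # GQA key-value heads
    intermediate_size=14_336,     # SwiGLU feed-forward width (assumed)
    hidden_act="silu",            # SwiGLU gate activation
    max_position_embeddings=8192, # documented context length
    rope_theta=500_000.0,         # RoPE base (assumed from Llama 3)
    rms_norm_eps=1e-5,            # RMSNorm epsilon (assumed)
)
```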
This instruction-tuned variant is designed for high-quality interactions in both formal and informal Indonesian. It addresses specific cultural sensitivities and linguistic variations that are often missing in general-purpose global models. Primary applications include automated customer support for the Indonesian market, localized content synthesis, and technical assistance within the regional digital ecosystem. The model is compatible with the Transformers library and optimized for deployment on standardized accelerated computing infrastructure.
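A minimal inference sketch with the Transformers library follows; the repository id is a hypothetical placeholder for the official Sahabat-AI release, and FP16 loading assumes roughly 16 GB of accelerator memory.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository id -- substitute the official Sahabat-AI repo.
model_id = "GoToCompany/llama3-8b-cpt-sahabatai-v1-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # ~16 GB of VRAM at FP16
    device_map="auto",
)

# Llama 3 instruct models expect the chat template applied by the tokenizer.
messages = [{"role": "user", "content": "Apa ibu kota Indonesia?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```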
Sahabat-AI is an Indonesian language model family co-initiated by GoTo and Indosat Ooredoo Hutchison. Developed with AI Singapore and NVIDIA, it is a collection of models (based on Gemma 2 and Llama 3) specifically optimized for Bahasa Indonesia and regional languages like Javanese and Sundanese.
No evaluation benchmark results are available for Sahabat-AI-Llama3-8B-Instruct.
Overall Rank
-
Coding Rank
-
Total Score
67 / 100
Sahabat-AI demonstrates good transparency regarding its architectural foundations and the specific volume of instruction-tuning data used for regional language optimization. While it provides clear hardware specifications for its fine-tuning phase and maintains a consistent identity, it lacks detailed disclosure on the specific sources of its continued pre-training data and comprehensive benchmark reproduction assets. The model is honest about its lack of safety alignment, placing the responsibility for downstream filtering on the user.
Architectural Provenance
The model is explicitly identified as a continued pre-training (CPT) variant of the Meta Llama 3 architecture. Documentation specifies a dense, decoder-only Transformer with 32 layers, a hidden dimension of 4096, and Grouped Query Attention (GQA). The training methodology is described as a combination of full-parameter fine-tuning, on-policy alignment, and model merging using the NVIDIA NeMo framework. While the base architecture is well documented, the CPT phase is described only at a high level, with no technical paper detailing what layer-wise changes, if any, were made.
Dataset Composition
The model card provides specific token counts for the CPT phase (approximately 50 billion tokens) and instruction tuning (448,000 Indonesian, 96,000 Javanese, 98,000 Sundanese, and 129,000 English pairs). However, the specific sources of the 50B CPT tokens are not disclosed beyond 'publicly available online data' and 'synthetic instructions'. There is no detailed breakdown of the web-to-code ratio or specific filtering/cleaning methodologies provided, which are critical for high-tier transparency.
Tokenizer Integrity
The model utilizes the default Llama 3 tokenizer with a vocabulary of roughly 128,000 tokens. The tokenizer is publicly accessible, so its behavior on the target languages (Indonesian, Javanese, and Sundanese) can be verified directly against the released weights. The documentation confirms a context length of 8,192 tokens, though some inference engines such as vLLM may cap it at 4,096. The alignment between the tokenizer and the claimed language support is well documented.
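One lightweight way to perform that verification is to measure token fertility (tokens per whitespace-separated word) on short samples of each target language. A sketch, again using a hypothetical repository id and illustrative sample sentences:

```python
from transformers import AutoTokenizer

# Hypothetical repo id; the tokenizer is the stock Llama 3 tokenizer.
tok = AutoTokenizer.from_pretrained("GoToCompany/llama3-8b-cpt-sahabatai-v1-instruct")

# Illustrative greetings in each supported language.
samples = {
    "Indonesian": "Selamat pagi, apa kabar hari ini?",
    "Javanese":   "Sugeng enjing, kabare piye dina iki?",
    "Sundanese":  "Wilujeng enjing, kumaha damang dinten ieu?",
}

for lang, text in samples.items():
    n_tokens = len(tok(text)["input_ids"])
    n_words = len(text.split())
    print(f"{lang}: {n_tokens} tokens / {n_words} words = {n_tokens / n_words:.2f} fertility")
```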
Parameter Density
The model is clearly stated to have 8.03 billion total parameters. As a dense architecture, all parameters are active during inference. The architectural breakdown (layers, hidden dims, attention heads) is provided in the technical specifications. There is no ambiguity regarding MoE active vs. total parameters, and the parameter count is consistent across official sources.
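The 8.03B figure can be reproduced from the listed hyperparameters with back-of-the-envelope arithmetic; the feed-forward width of 14,336 and untied input/output embeddings below are assumptions carried over from the stock Llama 3 8B design.

```python
# Back-of-the-envelope parameter count from the documented hyperparameters.
vocab, d, layers, kv_ratio, ffn = 128_256, 4096, 32, 4, 14_336  # ffn width assumed

attn = d * d * 2 + 2 * d * (d // kv_ratio)  # Q/O projections + GQA K/V projections
mlp = 3 * d * ffn                           # SwiGLU gate, up, and down projections
norms = 2 * d                               # two RMSNorm weight vectors per layer
embeddings = 2 * vocab * d                  # untied input embedding + LM head (assumed)

total = layers * (attn + mlp + norms) + embeddings + d  # + final RMSNorm
print(f"{total / 1e9:.2f}B parameters")     # ~8.03B, matching the stated count
```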
Training Compute
The documentation provides specific hardware details (8x H100-80GB GPUs) and durations for the fine-tuning (4 hours) and alignment (2 hours) phases. However, the compute resources used for the 50B token continued pre-training phase are not explicitly detailed in terms of total GPU hours or hardware specifications. Carbon footprint and total estimated cost for the entire development cycle are missing.
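For the documented post-training phases, the disclosed compute reduces to simple GPU-hour arithmetic, as in the sketch below; the CPT phase cannot be estimated this way because its hardware and duration are not broken out.

```python
# GPU-hours for the documented post-training phases only.
gpus = 8  # H100-80GB, per the model card
phases = {"fine-tuning": 4, "alignment": 2}  # wall-clock hours

for name, hours in phases.items():
    print(f"{name}: {gpus * hours} GPU-hours")
# Total: 48 GPU-hours -- the 50B-token CPT phase is undisclosed and dominates.
```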
Benchmark Reproducibility
Evaluation results are provided for the SEA HELM (BHASA) benchmark and standard English benchmarks (IFEval, MMLU-Pro). While the benchmarks are named and some methodology is described (zero-shot with native prompts), the exact evaluation code and full prompt sets are not publicly linked in the repository. The documentation notes discrepancies with official leaderboards due to inference engine differences (vLLM vs. Transformers), which adds a layer of complexity to independent verification.
Identity Consistency
The model consistently identifies itself as Sahabat-AI and is transparent about its origins as a fine-tuned Llama 3 variant. It does not claim to be a different model (like GPT-4) and provides clear versioning (v1). The model card explicitly lists its intended use cases and limitations, including a lack of safety alignment, which demonstrates high honesty regarding its identity and capabilities.
License Clarity
The model is released under the Llama 3.1 Community License. This license is publicly available and outlines terms for commercial use (up to 700M monthly active users) and redistribution. However, there is a slight discrepancy in documentation where some files refer to the 'Llama 3 Community License' while others mention 'Llama 3.1', which could lead to minor legal ambiguity regarding specific derivative work terms.
Hardware Footprint
Basic VRAM requirements are provided through third-party and community documentation (e.g., ~16GB for FP16, ~5GB for 4-bit quantization). The official documentation mentions the use of vLLM and Hugging Face Transformers but lacks a comprehensive official table of VRAM vs. context length scaling or specific quantization accuracy trade-off data provided directly by the developers.
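Those community figures are consistent with simple bytes-per-parameter arithmetic. The sketch below estimates weight-only memory and deliberately excludes KV-cache and activation overhead, which is why real usage runs somewhat higher.

```python
# Rough weight-only VRAM estimate: parameters x bytes per parameter.
params = 8.03e9
for scheme, bytes_per_param in [("FP16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    gib = params * bytes_per_param / 2**30
    print(f"{scheme}: ~{gib:.1f} GiB for weights alone")
# FP16 -> ~15 GiB, 4-bit -> ~3.7 GiB; the quoted "~16 GB" and "~5 GB" figures
# additionally include runtime overhead such as the KV cache.
```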
Versioning Drift
The model uses a versioning string (v1), but there is no public changelog or detailed version history tracking changes between internal iterations or checkpoints. There is no formal mechanism described for tracking or notifying users of behavioral drift, and previous versions of the weights are not easily accessible in a structured historical archive.