
SEA-LION-7B-Instruct

Parameters

7.1B

Context Length

2,048

Modality

Text

Architecture

Dense

License

Apache-2.0

Release Date

1 Feb 2024

Knowledge Cutoff

Sep 2023

Technical Specifications

Attention Structure

Multi-Head Attention

Hidden Dimension Size

4096

Number of Layers

32

Attention Heads

32

Key-Value Heads

32

Activation Function

SwiGLU

Normalization

RMS Normalization

Position Embedding

Absolute Position Embedding

SEA-LION-7B-Instruct

SEA-LION-7B-Instruct is a large language model designed specifically for the Southeast Asian (SEA) region, providing instruction-following capabilities across a diverse range of regional languages. Developed by AI Singapore, the model is built on the MosaicML Pretrained Transformer (MPT) architecture, a decoder-only framework engineered for efficient training and inference. It uses a custom SEABPETokenizer with a vocabulary of 256,000 tokens, tailored to the linguistic structures and character sets of Southeast Asian languages, which reduces tokenization overhead and improves semantic representation compared to generic tokenizers.
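A minimal sketch of loading the custom tokenizer through the Hugging Face transformers library. The repository id aisingapore/sea-lion-7b-instruct, the Malay sample sentence, and the need for trust_remote_code are assumptions based on the description above rather than details confirmed by this page.

```python
# Hedged sketch: loading the SEABPETokenizer via transformers.
# Repo id and trust_remote_code requirement are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "aisingapore/sea-lion-7b-instruct",  # assumed Hugging Face repo id
    trust_remote_code=True,              # custom tokenizer code assumed to ship with the repo
)

print(tokenizer.vocab_size)                            # expected: 256000 per the model card
print(tokenizer.tokenize("Selamat pagi, apa khabar?"))  # Malay example sentence
```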

Technically, the architecture is a dense transformer that incorporates key optimizations such as Grouped Query Attention (GQA) for improved memory efficiency and performance during inference. It employs Rotary Positional Embeddings (RoPE) to facilitate better handling of long-range dependencies within its context window. The instruction-tuning phase involved training on a rigorously curated dataset of English and Indonesian instruction-completion pairs, along with smaller sets for other ASEAN languages like Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, and Lao. This tuning process was performed using parameter-efficient fine-tuning (PEFT) techniques, specifically Low-Rank Adaptation (LoRA), ensuring the model maintains its foundational knowledge while specializing in task-oriented responses.
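To make the LoRA-based tuning step concrete, the sketch below shows a parameter-efficient fine-tuning configuration of the kind described above, using the Hugging Face peft library. The rank, alpha, dropout, and target module names are illustrative assumptions, not the values AI Singapore actually used.

```python
# Hedged sketch: a LoRA setup in the spirit of the instruction-tuning phase described above.
# All hyperparameters and target module names below are illustrative assumptions.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "aisingapore/sea-lion-7b-instruct",  # assumed repo id
    trust_remote_code=True,
)

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # low-rank dimension (illustrative)
    lora_alpha=32,                         # scaling factor (illustrative)
    lora_dropout=0.05,
    target_modules=["Wqkv", "out_proj"],   # MPT-style attention projections (assumed)
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```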

Performance characteristics of the model center on its ability to perform natural language understanding (NLU), generation (NLG), and reasoning (NLR) tasks within a Southeast Asian cultural and linguistic context. It is particularly effective for use cases such as regional question-answering, localized sentiment analysis, and translation between English and SEA languages. By prioritizing commercially permissive and high-quality training data, the model serves as a reliable foundation for developers building AI applications that require cultural nuance and linguistic accuracy for the Singaporean and broader ASEAN markets.
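The following sketch illustrates one of the use cases above, English-to-Indonesian translation, using greedy decoding. The "### USER:" / "### RESPONSE:" prompt template is an assumption about the instruction format and should be checked against the official model card before use.

```python
# Hedged sketch: prompting the instruction-tuned model for a translation task.
# The prompt template and repo id are assumptions; verify against the official model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/sea-lion-7b-instruct"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

prompt = "### USER:\nTranslate to Indonesian: The weather is lovely today.\n\n### RESPONSE:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```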

About SEA-LION

Southeast Asian Languages In One Network (SEA-LION) is a family of language models developed by AI Singapore for Southeast Asian languages. The models support English, Indonesian, Malay, Thai, Vietnamese, Tagalog, Burmese, Khmer, Lao, Tamil, and Chinese. The family focuses on regional linguistic patterns and is available in base and instruction-tuned variants.



Evaluation Benchmarks

No evaluation benchmarks for SEA-LION-7B-Instruct available.

Rankings

Overall Rank

-

Coding Rank

-

Model Transparency


SEA-LION-7B-Instruct Transparency Report

Total Score: 75 / 100 (B+)

Audit Note

SEA-LION-7B-Instruct exhibits a high level of transparency regarding its architectural origins and tokenizer design, providing specific hardware and duration details for its training. The project excels in disclosing the linguistic composition of its training data, which is critical for its regional focus. Areas for improvement include more granular documentation of version drift and more accessible hardware requirement matrices for various quantization levels.

Upstream

24.5 / 30

Architectural Provenance

8.0 / 10

The model is explicitly identified as being built on the MosaicML Pretrained Transformer (MPT) architecture, a decoder-only framework. Technical documentation specifies the use of Grouped Query Attention (GQA) and Rotary Positional Embeddings (RoPE). Detailed layer configurations (32 layers, d_model 4096) are publicly available in the official documentation and model cards. The training methodology, including the use of MosaicML Composer for pre-training and LoRA for instruction tuning, is well-documented.

Dataset Composition

7.5 / 10

AI Singapore provides a detailed breakdown of the 980B token pre-training dataset, including specific percentages for English (58.2%), Chinese (9.3%), and various Southeast Asian languages (e.g., Vietnamese 6.5%, Indonesian 1.5%). Sources like RefinedWeb, mC4, and The Stack are named. For the instruction-tuning phase, the documentation specifies the use of thousands of English and Indonesian pairs, though the exact public availability of the full instruction dataset is limited to descriptions of curation and native-speaker verification.
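A quick back-of-the-envelope check converts the disclosed language shares of the roughly 980B-token pre-training corpus into absolute token counts; only the percentages quoted above are used, and the results are approximations.

```python
# Hedged arithmetic: absolute token counts implied by the disclosed language percentages.
total_tokens = 980e9
shares = {"English": 0.582, "Chinese": 0.093, "Vietnamese": 0.065, "Indonesian": 0.015}

for lang, share in shares.items():
    print(f"{lang}: ~{total_tokens * share / 1e9:.0f}B tokens")
# English ~570B, Chinese ~91B, Vietnamese ~64B, Indonesian ~15B (approximate)
```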

Tokenizer Integrity

9.0 / 10

The model uses a custom SEABPETokenizer with a clearly stated vocabulary size of 256,000 tokens. The tokenizer is publicly accessible on Hugging Face, and its development (trained on 20M lines from a cleaned mC4 corpus) is documented. The use of SentencePiece and the BPE approach are verified, and the tokenizer is specifically designed to reduce tokenization overhead for Southeast Asian scripts, which is a verifiable technical claim.
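The claimed reduction in tokenization overhead can be probed empirically by comparing token counts on a Southeast Asian script against a generic English-centric BPE tokenizer. The sketch below assumes the aisingapore/sea-lion-7b-instruct repo id and uses gpt2 purely as a convenient generic baseline; exact counts will vary by input.

```python
# Hedged sketch: comparing tokenization overhead on a Thai sentence.
# Repo ids are assumptions; gpt2 stands in for "a generic tokenizer".
from transformers import AutoTokenizer

sea_tok = AutoTokenizer.from_pretrained("aisingapore/sea-lion-7b-instruct", trust_remote_code=True)
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

text = "สวัสดีครับ วันนี้อากาศดีมาก"  # "Hello, the weather is very nice today" in Thai
print(len(sea_tok.tokenize(text)), "tokens with SEABPETokenizer")
print(len(gpt2_tok.tokenize(text)), "tokens with a generic BPE tokenizer")
# Fewer tokens per sentence means more usable context window and lower inference cost.
```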

Model

30.5 / 40

Parameter Density

8.5 / 10

The model is a dense transformer with 7.1 billion total parameters. Detailed architectural parameters are provided in official documentation (e.g., 32 layers, 32 attention heads, d_model 4096). As a dense model, active parameters equal total parameters, and there is no ambiguity regarding MoE structures. The impact of the large vocabulary on the embedding layer's parameter share is also implicitly verifiable through the provided architecture specs.
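That embedding-layer share can be verified directly from the figures above; the calculation below assumes a single (tied) embedding matrix, which is not stated on this page.

```python
# Hedged arithmetic: embedding-layer share of the 7.1B parameters,
# from the stated 256,000-token vocabulary and d_model of 4096.
vocab_size, d_model, total_params = 256_000, 4096, 7.1e9

embedding_params = vocab_size * d_model
print(f"{embedding_params / 1e9:.2f}B embedding parameters")       # ~1.05B
print(f"~{embedding_params / total_params:.0%} of the 7.1B total")  # ~15%; roughly doubles
# if input and output embeddings are untied (an assumption, not stated in the card)
```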

Training Compute

7.0 / 10

The documentation provides specific hardware details: 32 AWS EC2 p4d.24xlarge instances utilizing 256 NVIDIA A100 40GB GPUs. The training duration is explicitly stated as 22 days for the 7B variant. While a specific carbon footprint calculation is not provided in the primary model card, the disclosure of hardware type, count, and duration allows for independent estimation of energy consumption.
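The disclosed hardware and duration translate directly into total accelerator time, as the short calculation below shows; it uses only the figures quoted above.

```python
# Hedged estimate: accelerator time implied by the disclosed hardware and duration.
gpus = 256   # 32 x p4d.24xlarge instances, 8 A100 40GB GPUs each
days = 22    # stated training duration for the 7B variant

gpu_hours = gpus * days * 24
print(f"~{gpu_hours:,} A100-hours")  # ~135,168 GPU-hours
# An energy or carbon estimate would additionally require per-GPU power draw and
# datacenter PUE, which the model card does not disclose.
```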

Benchmark Reproducibility

6.0 / 10

The model was evaluated using the BHASA benchmark, which has associated public documentation and GitHub repositories. Zero-shot evaluation settings and sample sizes (100-1000 instances) are disclosed. However, while the benchmark framework is public, the specific prompt templates used for all 11 languages are not fully centralized in the model card, requiring users to reference external papers (BHASA) for full reproduction details.

Identity Consistency

9.0 / 10

The model consistently identifies as SEA-LION and is transparent about its versioning (v1) and its specific focus on Southeast Asian languages. It does not attempt to mimic other models like GPT-4 in its official documentation or system prompts. It clearly states its purpose as a regional specialist and acknowledges its limitations regarding safety alignment in the base version.

Downstream

19.5 / 30

License Clarity

8.0 / 10

The model is released under the Apache-2.0 license, which is a standard, permissive open-source license. Documentation clearly distinguishes between the commercially permissive version (SEA-LION-7B-Instruct) and the research-only version (SEA-LION-7B-Instruct-Research, which used CC BY-NC-SA 4.0). This distinction prevents licensing ambiguity for commercial developers.

Hardware Footprint

6.5 / 10

Official documentation provides VRAM estimates, noting that approximately 30GB of VRAM is required to load the model in full precision, which exceeds standard consumer hardware. While it mentions the availability of GGUF versions for quantization, it lacks a detailed matrix of VRAM requirements across different quantization levels (Q4, Q8, etc.) and their specific accuracy trade-offs.
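In the absence of an official matrix, a standard rule of thumb gives rough weight-memory requirements at different precisions; the sketch below covers weights only (KV cache and activation overhead are extra) and is an estimate, not a vendor-provided figure.

```python
# Hedged rule-of-thumb: weight memory at different precisions for a 7.1B dense model.
params = 7.1e9
bytes_per_param = {"FP32": 4, "FP16/BF16": 2, "Q8 (8-bit)": 1, "Q4 (4-bit)": 0.5}

for name, b in bytes_per_param.items():
    print(f"{name}: ~{params * b / 2**30:.1f} GiB of weights")
# FP32 comes to roughly 26 GiB of weights alone, consistent with the ~30GB
# full-precision figure once runtime overhead is included.
```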

Versioning Drift

5.0 / 10

The model uses a versioning system (v1), and AI Singapore maintains a roadmap for future versions (v2, v3). However, a detailed, granular changelog for minor weight updates or specific drift assessments between sub-versions is not prominently featured. The transition from 'Research' to 'Instruct' variants is documented, but semantic versioning for the weights themselves is less rigorous than for the software libraries.
