SEA-LION-7B

Parameters

7.1B

Context Length

2,048 tokens

Modality

Text

Architecture

Dense

License

MIT

Release Date

1 Dec 2023

Knowledge Cutoff

Sep 2023

Technical Specifications

Attention

Attention Structure

Multi-Head Attention

Attention Heads

32

Key-Value Heads

32

Attention Head Dimension

128

Position Embedding

Absolute Position Embedding

RoPE Theta

-

Sliding Window Attention

No

Sliding Window Size

-

Normalization

Layer Normalization

Activation Function

GELU

Dimensions

Hidden Dimension Size

4,096

Number of Layers

32

FFN Intermediate Size (Dense)

16,384

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

256,000

Architecture Diagram

[Diagram: Input Tokens → Token Embedding (absolute positions; hidden 4,096, context 2,048, vocab 256,000) → 32 × [LayerNorm (pre-attention) → Multi-Head Attention (32 Q / 32 KV heads, head dim 128) → residual add → LayerNorm (pre-FFN) → Feed-Forward Network (GELU, intermediate 16,384) → residual add] → Final LayerNorm → Output Logits]

SEA-LION-7B

SEA-LION-7B (Southeast Asian Languages In One Network) is a 7.1 billion parameter decoder-only transformer model developed by AI Singapore to address the linguistic and cultural specificities of Southeast Asia. Built on the MosaicML Pretrained Transformer (MPT) architecture, the model was trained from scratch on a 980 billion token corpus. The corpus covers 11 languages: nine regional languages (Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, and Lao) alongside English and Chinese, so that the model captures regional nuances often overlooked by Western-centric LLMs.

Technically, SEA-LION-7B diverges from the standard MPT configuration by using learned absolute positional embeddings rather than ALiBi, a choice described as providing a stable foundation for its 2,048-token context window. The architecture consists of 32 transformer layers with a hidden dimension of 4,096 and 32 attention heads, which gives a head dimension of 128. It uses Low-Precision LayerNorm for normalization and the GELU (Gaussian Error Linear Unit) activation function. A key component is the SEABPETokenizer, a custom Byte-Pair Encoding tokenizer with a 256,000-token vocabulary optimized to reduce the token-to-word ratio for Southeast Asian scripts. Fewer tokens per word lets more text fit in the context window and reduces the number of decoding steps per sentence.
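The figures quoted above determine a few derived quantities. The following is a minimal sketch, not AI Singapore's code: the class and function names are hypothetical, and only the numeric values come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeaLionConfig:
    """Architecture values quoted on this page (the class name is hypothetical)."""
    hidden_size: int = 4096
    num_layers: int = 32
    num_heads: int = 32
    ffn_size: int = 16384
    context_length: int = 2048
    vocab_size: int = 256_000


def head_dim(cfg):
    # Each attention head receives an equal slice of the hidden dimension.
    assert cfg.hidden_size % cfg.num_heads == 0
    return cfg.hidden_size // cfg.num_heads


def position_embedding_params(cfg):
    # A learned absolute embedding stores one hidden-size vector per position,
    # so the table size is tied to the maximum context length.
    return cfg.context_length * cfg.hidden_size


def token_embedding_params(cfg):
    # One hidden-size vector per vocabulary entry.
    return cfg.vocab_size * cfg.hidden_size


cfg = SeaLionConfig()
print(head_dim(cfg))                            # 128
print(cfg.ffn_size // cfg.hidden_size)          # 4 (FFN expansion factor)
print(f"{position_embedding_params(cfg):,}")    # 8,388,608
print(f"{token_embedding_params(cfg):,}")       # 1,048,576,000
```

Under these numbers the token embedding table alone holds roughly one billion weights, which illustrates the cost side of a 256,000-token vocabulary.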

Designed for research and regional application deployment, SEA-LION-7B serves as a base model for specialized natural language understanding and generation tasks. It targets multilingual translation, sentiment analysis, and culturally aware text generation within the ASEAN context. The open-weights release under the MIT license permits community-driven fine-tuning and adaptation for regional industrial use cases, and keeps the model transparent and accessible to researchers and developers.

About SEA-LION

Southeast Asian Languages In One Network (SEA-LION) is a family of language models developed by AI Singapore for Southeast Asian languages. The models support English, Indonesian, Malay, Thai, Vietnamese, Tagalog, Burmese, Khmer, Lao, Tamil, and Chinese. The family focuses on regional linguistic patterns and is available in base and instruction-tuned variants.


Evaluation Benchmarks

No evaluation benchmarks for SEA-LION-7B available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B+

75 / 100

SEA-LION-7B Model Integrity Report

Audit Note

SEA-LION-7B exhibits a strong transparency profile, particularly regarding its architectural modifications and the linguistic composition of its training data. The project provides exemplary documentation for its custom tokenizer and maintains a clear, consistent identity focused on Southeast Asian representation. While compute-related environmental data and granular versioning history are currently lacking, the model's open-weights approach and detailed data sourcing set a high standard for regional AI initiatives.

Upstream

24.5 / 30

Architectural Provenance

8.0 / 10

SEA-LION-7B is explicitly documented as being built on the MosaicML Pretrained Transformer (MPT) architecture. AI Singapore provides clear technical specifications including 32 layers, a hidden dimension of 4096, and 32 attention heads. They disclose a significant architectural modification: the use of absolute learned positional embeddings instead of the standard ALiBi used in MPT, which was chosen to provide a stable foundation for its 2,048-token context window. The model was trained from scratch, and the training methodology using MosaicML Composer is publicly stated.

Dataset Composition

7.5 / 10

The training data composition is disclosed with a high degree of granularity. The 980B token corpus is broken down by source and language: English (58.2% via RefinedWeb), Chinese (9.3% via mC4), and specific percentages for Southeast Asian languages (e.g., Indonesian 1.5%, Vietnamese 6.46%, Malay 0.29%). The 'SEA-LION-PILE' dataset is publicly named, and the methodology for filtering and cleaning (including the use of fasttext for language classification) is documented. However, access to the full raw dataset is restricted, preventing complete independent verification.
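The percentages above can be turned into approximate token counts. This is a rough sketch using only the shares quoted in this section; the function name is hypothetical, and the results are estimates rather than officially published counts.

```python
TOTAL_TOKENS = 980e9  # corpus size quoted on this page

# Shares quoted in this section, in percent. Languages not listed here
# make up the remainder of the corpus.
SHARES_PERCENT = {
    "English": 58.2,
    "Chinese": 9.3,
    "Vietnamese": 6.46,
    "Indonesian": 1.5,
    "Malay": 0.29,
}


def tokens_for_share(percent, total=TOTAL_TOKENS):
    """Convert a percentage share of the corpus into a token count."""
    return total * percent / 100


for language, percent in SHARES_PERCENT.items():
    billions = tokens_for_share(percent) / 1e9
    print(f"{language:<11}{billions:8.1f}B tokens")

listed = sum(SHARES_PERCENT.values())
print(f"Listed shares cover {listed:.2f}% of the corpus")
```

On these figures English accounts for roughly 570 billion tokens, while Malay accounts for under 3 billion.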

Tokenizer Integrity

9.0 / 10

The model uses a custom SEABPETokenizer with a 256,000-token vocabulary, which is significantly larger than the vocabularies of typical English-centric models, to better represent Southeast Asian scripts. The tokenizer is publicly available on Hugging Face, and its training methodology (sampling 20M lines from the training data using SentencePiece) is clearly documented. The efficiency gains (token-to-word ratio) for regional languages are explicitly claimed and verifiable through the provided tokenizer files.
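The token-to-word ratio mentioned above is simple to measure once a tokenizer is loaded. The sketch below is illustrative only: `toy_tokenize` is a hypothetical stand-in, not the SEABPETokenizer, and whitespace word splitting is an assumption that does not hold for scripts written without spaces between words, such as Thai, Lao, Khmer, and Burmese.

```python
def token_to_word_ratio(texts, tokenize, split_words=str.split):
    """Average number of tokens produced per word over a list of texts."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(split_words(text)) for text in texts)
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


def toy_tokenize(text, chunk=4):
    # Hypothetical stand-in tokenizer: cuts every word into pieces of at
    # most `chunk` characters. Replace with a real tokenizer's encode call.
    return [
        word[i:i + chunk]
        for word in text.split()
        for i in range(0, len(word), chunk)
    ]


sample = ["Selamat pagi, apa kabar?"]
print(token_to_word_ratio(sample, toy_tokenize))  # 1.75
```

A lower ratio on the same text indicates a tokenizer that represents the language more compactly.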

Model

30.0 / 40

Parameter Density

8.5 / 10

The model is a dense decoder-only transformer with 7.1 billion parameters. Detailed architectural parameters are provided in the model card and technical documentation, including layer counts, head dimensions, and vocabulary size. There is no ambiguity regarding active vs. total parameters as it is not an MoE architecture. The impact of quantization is also addressed through the official release of GGUF versions (Q2_K to Q8_0), providing transparency on density vs. performance trade-offs.

Training Compute

6.0 / 10

AI Singapore discloses the hardware used (32 instances of NVIDIA A100 40GB GPUs) and the training duration (22 days). While this allows for a rough estimate of compute resources, they do not provide an official carbon footprint calculation or a detailed energy consumption report. The hardware specifications are clear, but the environmental impact data is missing, which is a key requirement for a high score in this pillar.
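The rough compute estimate this section alludes to can be sketched as follows. Two things here are assumptions not stated on this page: the number of GPUs per instance (8 is used purely for illustration) and the common 6 × parameters × tokens rule of thumb for dense transformer training FLOPs. Function names are hypothetical.

```python
def gpu_hours(instances, gpus_per_instance, days):
    """Total GPU-hours for a run that keeps every GPU busy for `days` days."""
    return instances * gpus_per_instance * days * 24


def approx_training_flops(params, tokens):
    # Rule-of-thumb estimate (assumption): about 6 FLOPs per parameter
    # per training token for a dense transformer.
    return 6 * params * tokens


# 32 instances and 22 days come from this page; 8 GPUs per instance is assumed.
hours = gpu_hours(instances=32, gpus_per_instance=8, days=22)
flops = approx_training_flops(params=7.1e9, tokens=980e9)

print(f"{hours:,} GPU-hours")      # 135,168 GPU-hours under the assumption above
print(f"{flops:.2e} FLOPs")        # about 4.17e+22
```

Neither number is an official figure. Converting them into energy use or emissions would need power draw and utilization data, which this section notes is not disclosed.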

Benchmark Reproducibility

6.5 / 10

The model was evaluated using the BHASA benchmark and the SEA-HELM framework, both of which have associated papers and GitHub repositories. Evaluation results for standard benchmarks (ARC, MMLU, HellaSwag) are provided. However, while the benchmarks are named and some methodology is described (e.g., zero-shot Indonesian prompts), the exact evaluation code and full prompt sets used for the internal BHASA tests are not as easily accessible as the model weights themselves, creating a moderate gap in full reproducibility.

Identity Consistency

9.0 / 10

SEA-LION-7B demonstrates high identity consistency. It is clearly branded as a Southeast Asian-centric model and does not attempt to mimic competitors like GPT-4 or Llama. The model card and technical reports maintain a consistent narrative regarding its purpose, versioning (v1), and limitations (e.g., 2048 context window). There are no documented cases of the model misrepresenting its origin or capabilities.

Downstream

20.5 / 30

License Clarity

8.5 / 10

The base model is released under the MIT license, which is a highly permissive and clear open-source license. The instruction-tuned 'Research' variant uses a more restrictive CC BY-NC-SA 4.0 license, but this distinction is clearly labeled on the respective Hugging Face repositories. The terms for commercial use are explicitly stated for the base model, and there is no evidence of conflicting terms between the code and the weights.

Hardware Footprint

7.0 / 10

VRAM requirements are well-documented for various precisions. The documentation notes that FP16 requires ~14-15GB, while 4-bit quantization (GGUF) fits within ~4.5GB. The availability of multiple quantized versions (Q2 through Q8) on the official repository provides users with clear paths for deployment on consumer hardware. However, detailed memory scaling charts for context length vs. batch size are not explicitly provided in the core documentation.
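The quoted footprints can be sanity-checked with back-of-the-envelope arithmetic. This is a minimal sketch under simple assumptions: weights only (no activations or framework overhead), a uniform bit width for every weight, and a 16-bit key-value cache. Function names are hypothetical.

```python
def weight_memory_gb(params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


def kv_cache_gib(layers, kv_heads, head_dim, context, batch=1, bytes_per_value=2):
    # Keys and values (factor 2) are stored for every layer, KV head,
    # position, and sequence in the batch.
    values = 2 * layers * kv_heads * head_dim * context * batch
    return values * bytes_per_value / 2**30


PARAMS = 7.1e9  # parameter count quoted on this page

print(weight_memory_gb(PARAMS, 16))       # FP16 weights: about 14.2 GB
print(weight_memory_gb(PARAMS, 4))        # plain 4-bit weights: about 3.55 GB
print(kv_cache_gib(32, 32, 128, 2048))    # full 2,048-token context: 1.0 GiB
print(kv_cache_gib(32, 32, 128, 2048, batch=4))  # scales linearly with batch: 4.0 GiB
```

The FP16 estimate lines up with the ~14-15GB figure above. The plain 4-bit estimate falls below the quoted ~4.5GB, and the gap presumably reflects quantization metadata and runtime overhead that this sketch ignores.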

Versioning Drift

5.0 / 10

While the model uses a versioning system (v1, v2, v3), the documentation for 'v1' specifically lacks a detailed, granular changelog for minor updates or weight refreshes. The transition between versions is documented at a high level (e.g., moving from MPT to Llama/Gemma bases in later versions), but the tracking of silent drift or performance changes within the v1 lifecycle is minimal. This makes it difficult for developers to track specific behavioral changes over time.
