Parameters
3B
Context Length
4,096
Modality
Text
Architecture
Dense
License
Apache-2.0
Release Date
15 Jan 2024
Knowledge Cutoff
Jan 2024
Attention Structure
Grouped-Query Attention (GQA)
Hidden Dimension Size
-
Number of Layers
-
Attention Heads
-
Key-Value Heads
-
Activation Function
SwiGLU
Normalization
RMS Normalization
Position Embedding
Rotary Position Embedding (RoPE)
MaLLaM-3B (Malaysia Large Language Model) is a foundational 3 billion parameter dense model engineered specifically for the Malaysian linguistic context. Developed from scratch by Malaysia AI and Mesolitica, the model addresses the scarcity of high-quality local language representations by leveraging a curated dataset of 90 billion tokens. This training corpus comprises 349GB of diverse Malaysian digital artifacts, including government documents, local news, literature from the Dewan Bahasa dan Pustaka, and colloquial social media exchanges. By utilizing a custom-trained Byte Pair Encoding (BPE) tokenizer, the model captures unique Malaysian idioms, slang, and cultural references that are often diluted in English-centric foundational models.
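At its core, BPE builds a vocabulary by repeatedly merging the most frequent adjacent symbol pair in the corpus, which is why frequent local words end up as single tokens. A toy, self-contained sketch of the merge-learning loop (the actual MaLLaM tokenizer is trained with a full tokenizer library on the 349GB corpus; the word frequencies below are illustrative):

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Learn BPE merge rules from a word -> frequency dict (toy illustration)."""
    # Start from character-level symbols.
    corpus = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] = freq
        corpus = new_corpus
    return merges

# Frequent Malay character pairs are merged first, so common words
# like "makan" compress into few tokens.
merges = bpe_merges({"makan": 10, "minum": 8, "makanan": 5}, 4)
```

Training on Malaysian text rather than English text is what shifts which merges win, so Malay morphology (e.g. the "-an" suffix) is learned early instead of being fragmented.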
Technically, MaLLaM-3B adopts the Mistral transformer-based decoder-only architecture, which facilitates efficient inference and high performance relative to its parameter count. The model utilizes Grouped-Query Attention (GQA) to optimize the KV cache, thereby reducing memory overhead during sequence generation. It implements the SwiGLU activation function and RMSNorm for stable and accelerated convergence during pre-training. For position encoding, the model employs Rotary Position Embeddings (RoPE), enabling it to maintain precise token relationships within its standard 4096-token context window.
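Both RMSNorm and SwiGLU are simple to state. A minimal NumPy sketch of the two components (the weights and dimensions below are illustrative; MaLLaM-3B's actual hidden dimension is not published):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the root mean square of activations (no mean subtraction)."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: SiLU(x @ W_gate) elementwise-gates (x @ W_up)."""
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU (swish) activation
    return (silu * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d, d_ff = 8, 16  # toy sizes, not the model's real dimensions
x = rng.normal(size=(2, d))
y = rms_norm(x, np.ones(d))
z = swiglu_ffn(y, rng.normal(size=(d, d_ff)),
                  rng.normal(size=(d, d_ff)),
                  rng.normal(size=(d_ff, d)))
```

RMSNorm drops the mean-centering and bias of LayerNorm, which removes one reduction per normalization and is one reason it is favored for pre-training stability at lower cost.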
Designed primarily for edge deployment and localized applications, MaLLaM-3B is optimized for environments where low-latency text generation and bilingual proficiency in Bahasa Malaysia and English are required. Its compact architecture makes it suitable for integration into mobile applications, localized chatbots, and on-premise document processing systems. Released under the Apache 2.0 license, the model provides an open-weights foundation for researchers and developers to build downstream applications such as sentiment analysis, summarization, and instruction-following assistants tailored for the Malaysian demographic.
Malaysian Large Language Model (MaLLaM) is an open-source language model family developed to support Bahasa Malaysia and English. The model is trained on Malaysian text data including local news, literature, and digital content. It is designed to process Malaysian linguistic nuances and cultural context, available in multiple parameter sizes for different hardware deployments.
No evaluation benchmarks are available for MaLLaM-3B.
Overall Rank
-
Coding Rank
-
Total Score
73
/ 100
MaLLaM-3B demonstrates strong transparency regarding its architectural origins and the localized nature of its training data. It provides a clear open-source path through its Apache 2.0 license and custom tokenizer documentation. The primary areas for improvement include more granular reporting of training compute metrics and more rigorous, reproducible benchmark disclosures to validate its performance claims.
Architectural Provenance
MaLLaM-3B is explicitly documented as a dense decoder-only transformer model based on the Mistral architecture. Technical details are provided in the official GitHub repository and an arXiv technical report, confirming the use of Grouped-Query Attention (GQA), SwiGLU activation, RMSNorm, and Rotary Position Embeddings (RoPE). The model was trained from scratch rather than being a fine-tuned version of an existing model, which is clearly stated and supported by the training methodology documentation.
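The KV-cache saving from GQA is straightforward arithmetic: cache size scales with the number of key-value heads, not query heads. Since MaLLaM-3B's layer and head counts are not published, the figures below are hypothetical, using a Mistral-7B-style ratio of 32 query heads to 8 KV heads purely to illustrate the effect:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_val=2):
    """Bytes for the K and V caches across all layers (FP16 = 2 bytes/value).
    The leading 2 accounts for storing both K and V."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val

# Hypothetical configuration for illustration only.
layers, head_dim, seq_len = 26, 128, 4096
mha = kv_cache_bytes(layers, 32, head_dim, seq_len)  # full multi-head KV
gqa = kv_cache_bytes(layers, 8, head_dim, seq_len)   # grouped-query KV
print(f"MHA: {mha / 2**20:.0f} MiB, GQA: {gqa / 2**20:.0f} MiB "
      f"({mha // gqa}x smaller)")
```

With 8 KV heads serving 32 query heads, the cache shrinks 4x at the full 4,096-token context, which is the memory-overhead reduction the description above refers to.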
Dataset Composition
The training corpus is described as a 349GB (90 billion token) dataset specifically curated for the Malaysian context. Documentation identifies five primary categories: Dedup text, Extra dedup (research papers), Filtered StarCoder, Instruction data, and MADLAD-400 MS. Specific sources like Lowyat, Cari, and government documents are named. While the general composition is clear, a precise percentage breakdown of each category within the final 90B tokens is not explicitly tabulated in a granular format.
Tokenizer Integrity
The model uses a custom-trained Byte Pair Encoding (BPE) tokenizer with a vocabulary size of 32,000. The tokenizer is publicly available on Hugging Face and was specifically trained on a multilingual corpus including Malay, English, Mandarin, Tamil, Jawi, and Arabic to capture local linguistic nuances. Documentation provides clear instructions on its development and intended language support.
Parameter Density
The model is a dense architecture with 3 billion parameters. Unlike Mixture-of-Experts (MoE) models where active parameters might be obscured, the dense nature makes the parameter count straightforward. The architectural configuration (Mistral-based) is standard and verifiable through the provided configuration files on Hugging Face.
Training Compute
Training was conducted using a distributed cluster of 40 GPUs (NVIDIA A100s) managed via Kubernetes and DeepSpeed Zero3. The use of spot instances and AWS infrastructure is disclosed. However, while the hardware type and cluster size are provided, the exact total GPU-hours and the specific carbon footprint calculations are not detailed in the available technical reports.
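The exact DeepSpeed configuration is not published; a representative ZeRO Stage 3 config fragment of the kind used for multi-node training on such a cluster (all values below are illustrative, not MaLLaM's actual settings) might look like:

```json
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```

ZeRO Stage 3 partitions optimizer states, gradients, and the parameters themselves across GPUs, which is what makes from-scratch pre-training feasible on a modest 40-GPU spot-instance cluster.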
Benchmark Reproducibility
The technical report mentions competitive performance against ChatGPT-3.5 and Malaysian Mistral on instruction-following tasks. However, the specific evaluation code, exact prompt templates, and comprehensive results across standard benchmarks like MMLU or GSM8K are less detailed compared to major foundational releases. There is a lack of third-party verification for the reported internal benchmarks.
Identity Consistency
The model is consistently branded as MaLLaM (Malaysia Large Language Model) across all official documentation, GitHub, and Hugging Face. It distinguishes itself clearly from English-centric models and maintains a coherent identity as a localized foundational model. There are no reports of the model misidentifying itself as a competitor's product.
License Clarity
The model and its weights are released under the Apache 2.0 license, which is a standard, permissive open-source license. This is explicitly stated on the Hugging Face model card and the official GitHub repository, providing clear terms for commercial and non-commercial use.
Hardware Footprint
The model is designed for edge deployment, and its 3B parameter size implies a baseline VRAM requirement of approximately 6GB for FP16. While basic VRAM estimates are available through community calculators and the model's compact nature, official documentation could be more explicit regarding the specific accuracy-performance tradeoffs for various quantization levels (Q4, Q8).
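The VRAM figures follow from a simple rule of thumb: parameter count times bytes per parameter for the weights, plus some margin for runtime buffers. A sketch (the 10% overhead factor is a common heuristic, not an official figure):

```python
def weight_vram_gb(params_billions, bits_per_param, overhead=1.10):
    """Rough VRAM for model weights: params * bits/8, plus ~10% for
    runtime buffers. Excludes KV cache, which grows with context length."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9 * overhead

for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{name}: ~{weight_vram_gb(3, bits):.1f} GB")
```

For the 3B model this gives roughly 6 GB at FP16 before overhead, ~3 GB at Q8, and ~1.5 GB at Q4, which is what brings mobile and edge deployment within reach; the accuracy cost of each quantization level is the tradeoff the official documentation leaves unquantified.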
Versioning Drift
While the model is part of a family (1.1B, 3B, 5B), there is limited evidence of a formal semantic versioning system or a detailed public changelog for weight updates. The repository tracks development, but clear markers for 'v1.0' vs 'v1.1' with associated drift analysis are not prominent.