Parameters
7B
Context Length
4,096 tokens
Modality
Text
Architecture
Dense
License
Apache-2.0
Release Date
15 Jan 2024
Knowledge Cutoff
Jan 2024
Attention Structure
Grouped-Query Attention (GQA)
Hidden Dimension Size
4096
Number of Layers
32
Attention Heads
32
Key-Value Heads
8
Activation Function
SwiGLU
Normalization
RMS Normalization
Position Embedding
Rotary Position Embedding (RoPE)
MaLLaM-7B (Malaysian Large Language Model) is a dense decoder-only transformer designed to process and generate text with high fidelity to the linguistic patterns of the Malaysian region. Developed by Mesolitica, the model is pre-trained from scratch on a specialized dataset comprising approximately 90 billion tokens, derived from a diverse range of Malaysian sources including government documents, local news, and social media forums. This extensive exposure to localized content allows the model to handle regional dialects, slang, and cultural nuances that are frequently underrepresented in more generalized global models.
The architecture of MaLLaM-7B follows the Mistral-7B design pattern, using a standard decoder-only transformer structure optimized for efficient training and inference. It employs a Byte Pair Encoding (BPE) tokenizer with a 32,000-token vocabulary, trained specifically on Malaysian multilingual data spanning Malay, English, Mandarin, Tamil, and Jawi script. The model incorporates modern architectural refinements such as Rotary Positional Embeddings (RoPE) and Grouped Query Attention (GQA), which improve the handling of sequence dependencies and reduce the computational cost of generation.
Technically, MaLLaM-7B is configured with a hidden dimension of 4096 and consists of 32 transformer layers. It is trained with a context window of 4096 tokens, making it suitable for tasks such as multi-turn dialogue, document summarization, and localized text completion. The model is released under the Apache 2.0 license, promoting transparency and accessibility for researchers and developers working within the Southeast Asian NLP ecosystem. It serves as a foundational component for building applications that require deep alignment with Malaysian linguistic identity and idiomatic expressions.
Malaysian Large Language Model (MaLLaM) is an open-source language model family developed to support Bahasa Malaysia and English. The models are trained on Malaysian text data including local news, literature, and digital content. The family is designed to capture Malaysian linguistic nuances and cultural context, and is available in multiple parameter sizes to suit different hardware deployments.
No evaluation benchmarks are currently available for MaLLaM-7B.
Overall Rank
-
Coding Rank
-
Total Score
73 / 100
MaLLaM-7B demonstrates strong transparency regarding its architectural origins and localized dataset composition, providing verifiable evidence of its training sources and tokenizer design. While it excels in licensing and identity consistency, it lacks detailed compute metrics and a formal versioning changelog. The model's commitment to open weights and public documentation of its Malaysian-centric training data sets a high standard for regional LLM development.
Architectural Provenance
The model is explicitly identified as a dense decoder-only transformer following the Mistral-7B design pattern. Documentation confirms the use of standard refinements including Rotary Positional Embeddings (RoPE) and Grouped Query Attention (GQA). Technical specifications are detailed, including 32 layers, a hidden dimension of 4096, and a 4096-token context window. While the model is described as 'trained from scratch,' the reliance on the Mistral architecture is well-documented in the official repository and technical reports.
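Because the card documents a Mistral-style layout, the published figures can be expressed as a Hugging Face `MistralConfig`. The sketch below is illustrative only: the `intermediate_size`, `rms_norm_eps`, and `rope_theta` values are assumptions carried over from Mistral-7B defaults, not numbers confirmed by the MaLLaM documentation.

```python
from transformers import MistralConfig

# Mistral-style configuration mirroring the documented MaLLaM-7B specs.
config = MistralConfig(
    vocab_size=32000,              # custom Malaysian BPE tokenizer
    hidden_size=4096,              # documented hidden dimension
    num_hidden_layers=32,          # documented layer count
    num_attention_heads=32,        # query heads
    num_key_value_heads=8,         # GQA: 8 KV heads shared across 32 query heads
    intermediate_size=14336,       # ASSUMED (Mistral-7B default)
    hidden_act="silu",             # SwiGLU gating uses SiLU activations
    max_position_embeddings=4096,  # documented context window
    rms_norm_eps=1e-5,             # RMSNorm epsilon (ASSUMED)
    rope_theta=10000.0,            # RoPE base frequency (ASSUMED)
)
```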
Dataset Composition
The training data is disclosed as a 349GB JSONL dataset (approx. 90 billion tokens) derived from 197 specialized Malaysian sources. Specific categories are named, including government documents (e.g., parliament transcripts), local news (Bernama, Star), and social media (Lowyat, Cari). The developers provide a list of scraped websites and reproduction notebooks for data collection. However, a precise percentage breakdown of the final training mixture (e.g., web vs. code vs. instructions) is not explicitly quantified in a single comprehensive table.
Tokenizer Integrity
The model uses a custom Byte Pair Encoding (BPE) tokenizer with a 32,000-token vocabulary, trained specifically on a multilingual Malaysian corpus. It supports Malay, English, Mandarin, Tamil, and Jawi script. The tokenizer is publicly available on Hugging Face, and its alignment with the claimed language support is verifiable through the provided configuration files and training methodology documentation.
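A quick way to verify the claimed script coverage is a round-trip encode/decode across each language. The repository id below is a placeholder for illustration; substitute the actual tokenizer path from the Mesolitica Hugging Face page.

```python
from transformers import AutoTokenizer

# Placeholder repo id; use the actual MaLLaM tokenizer path.
tokenizer = AutoTokenizer.from_pretrained("mesolitica/mallam-7b")

# Round-trip check across the claimed scripts.
samples = [
    "Apa khabar semua?",      # Malay
    "The weather is fine.",   # English
    "你好，世界",              # Mandarin
    "வணக்கம் உலகம்",           # Tamil
    "سلام دنيا",              # Jawi script
]
for text in samples:
    ids = tokenizer.encode(text)
    print(len(ids), tokenizer.decode(ids, skip_special_tokens=True))
```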
Parameter Density
The model is clearly defined as a 7B parameter dense architecture. Unlike Mixture-of-Experts (MoE) models, all parameters are active during inference. The architectural configuration (32 layers, 4096 hidden size) is standard for this class and consistently reported across the official Hugging Face model card and GitHub repository.
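A back-of-envelope count shows the documented configuration is consistent with the 7B label. The SwiGLU intermediate size is an assumption borrowed from Mistral-7B, so treat the result as approximate.

```python
# Rough parameter count for a Mistral-style dense model.
V, H, L = 32_000, 4_096, 32   # vocab size, hidden size, layers
KV = 8 * (H // 32)            # KV projection width: 8 heads x 128 dims = 1024
I = 14_336                    # ASSUMED SwiGLU intermediate size (Mistral-7B default)

attn = H * H + 2 * (H * KV) + H * H   # q, k, v, o projections
mlp = 3 * H * I                       # gate, up, down projections
total = L * (attn + mlp) + 2 * V * H  # layers + input/output embeddings

print(f"~{total / 1e9:.2f}B parameters")  # ~7.24B, consistent with "7B"
```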
Training Compute
Hardware specifications are partially disclosed; the developers mention using a Ray cluster with 5 nodes of 4x A100 80GB GPUs. While the hardware type is clear, the total GPU hours, training duration, and carbon footprint are not explicitly stated in the primary documentation. Some cost-efficiency claims (87% savings using AWS Trainium) are mentioned in press releases but lack the raw compute metrics required for a higher score.
Benchmark Reproducibility
The model has been evaluated on localized benchmarks like MalayMMLU and compared against ChatGPT 3.5. While some evaluation scripts are available in the repository, the exact prompts and few-shot examples used for the official reported scores are not fully documented in a centralized, reproducible format. Third-party verification is available through the MalayMMLU leaderboard, but the internal evaluation methodology remains partially opaque.
Identity Consistency
MaLLaM-7B consistently identifies itself as a Malaysian Large Language Model developed by Mesolitica. It does not exhibit identity confusion with other major models like GPT-4 or Llama. The versioning (e.g., v1.1, v2.5) is clearly tracked in the Hugging Face collections, and the model's limitations regarding its specialized regional focus are transparently discussed.
License Clarity
The model is released under the Apache 2.0 license, which is a standard, permissive open-source license. The terms for commercial use, modification, and redistribution are clear and consistent across the GitHub repository and Hugging Face model card. There are no conflicting proprietary restrictions mentioned in the official documentation.
Hardware Footprint
VRAM requirements are documented for standard 16-bit (approx. 14-16GB) and 4-bit (approx. 4-5GB) inference. The model card provides sample code for loading with BitsAndBytes 4-bit quantization. While it lacks a detailed context-scaling memory table, the baseline requirements for consumer and enterprise hardware are well-defined.
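For reference, a minimal 4-bit loading sketch in the spirit of the model card's BitsAndBytes sample is shown below, roughly matching the documented ~4-5GB footprint. The repo id is a placeholder, and the NF4/bfloat16 settings are common defaults rather than values taken from the MaLLaM documentation.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization; settings are common defaults, not card-confirmed.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mesolitica/mallam-7b",   # placeholder repo id; use the actual path
    quantization_config=bnb_config,
    device_map="auto",
)
```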
Versioning Drift
The project uses version numbers (e.g., MaLLaM-7B, MaLLaM-v2), but it lacks a formal, detailed changelog or semantic versioning history that tracks specific weight updates or performance drift over time. Users must rely on separate Hugging Face model entries to track progress, which makes monitoring silent updates difficult.