Total Parameters
106B
Context Length
128K
Modality
Text
Architecture
Mixture of Experts (MoE)
License
Apache 2.0
Release Date
6 Mar 2026
Knowledge Cutoff
-
Active Parameters
10.3B
Number of Experts
128
Active Experts
8
Attention Structure
Multi-head Latent Attention (MLA)
Hidden Dimension Size
4096
Number of Layers
32
Attention Heads
-
Key-Value Heads
-
Activation Function
SwiGLU
Normalization
RMS Normalization
Position Embedding
RoPE
Sarvam-105B is an advanced Mixture-of-Experts (MoE) model with 106B total parameters and 10.3B active parameters per token, designed for strong performance on complex tasks. It was released on March 6, 2026 under the Apache 2.0 license. The model uses a Multi-head Latent Attention (MLA) style attention stack with decoupled QK head dimensions (q_head_dim=192, v_head_dim=128), a large head_dim of 576, and 128 experts with top-8 routing. It offers a 128K native context window (extensible via YaRN scaling with a factor of 40) and delivers strong results on agentic tasks, mathematics, and coding. It consistently matches or surpasses major closed-source models, with state-of-the-art results across 22 Indian languages while maintaining competitive performance on global benchmarks.
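The figures above can be gathered in one place for reference. The summary below is illustrative only: the key names are hypothetical and do not correspond to the model's actual configuration file; the values are taken from this card.

SARVAM_105B_SPEC = {
    "total_params": 106_000_000_000,        # 106B total parameters
    "active_params": 10_300_000_000,        # 10.3B parameters active per token
    "num_layers": 32,                       # 1 dense + 31 MoE layers
    "hidden_size": 4096,
    "num_experts": 128,
    "experts_per_token": 8,                 # top-8 routing
    "shared_experts": 1,                    # one always-on shared expert
    "attention": "Multi-head Latent Attention (MLA)",
    "q_head_dim": 192,                      # decoupled QK head dimensions
    "v_head_dim": 128,
    "head_dim": 576,                        # large head_dim cited above
    "activation": "SwiGLU",
    "normalization": "RMSNorm",
    "position_embedding": "RoPE",
    "native_context": "128K tokens",        # extensible via YaRN, scaling factor 40
    "yarn_scaling_factor": 40,
    "license": "Apache-2.0",
}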
Sarvam AI's sovereign foundation models are built for India's languages, culture, and context. Released in March 2026, these advanced Mixture-of-Experts (MoE) models offer state-of-the-art performance across 22 Indian languages while maintaining competitive results on global benchmarks. They are designed with a focus on reasoning, coding, multilingual capabilities, and agentic tasks, are open-sourced under the Apache 2.0 license, and are optimized for practical deployment from resource-constrained environments to high-performance applications.
No evaluation benchmark results are available for Sarvam-105B.
Overall Rank
-
Coding Rank
-
Total Score
68 / 100
Sarvam-105B demonstrates strong transparency in its architectural design and licensing, providing deep technical details on its MoE structure and permissive Apache 2.0 terms. However, it suffers from significant gaps in data provenance and independent benchmark verification, relying heavily on self-reported metrics without disclosing specific data sources or evaluation code. The model's transparency profile is currently characterized by high-quality architectural disclosure paired with opaque upstream data and downstream reproducibility practices.
Architectural Provenance
The model's architecture is extensively documented in official blog posts and Hugging Face model cards. It is a Mixture-of-Experts (MoE) transformer built from scratch using the NVIDIA NeMo framework and Megatron-LM. Specific technical details are provided, including the use of Multi-head Latent Attention (MLA) to compress the KV cache, a 32-layer depth (1 dense + 31 MoE), and a decoupled QK head dimension (q_head_dim=192, v_head_dim=128). The use of YaRN scaling for its 128K context window is also explicitly stated.
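Because the model card and configuration files are public, the documented architecture fields can in principle be checked directly. A minimal sketch, assuming a placeholder repository id and attribute names that the real config may or may not expose:

from transformers import AutoConfig

# Placeholder repository id; substitute the actual Sarvam AI Hugging Face repo.
config = AutoConfig.from_pretrained("sarvamai/sarvam-105b", trust_remote_code=True)

# Print whichever documented fields the config actually exposes; the attribute
# names below are guesses and may differ in the published configuration.
for field in ("num_hidden_layers", "hidden_size", "num_experts_per_tok",
              "q_head_dim", "v_head_dim", "rope_scaling"):
    print(field, getattr(config, field, "<not present>"))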
Dataset Composition
While the provider discloses the total token count (12 trillion) and the general categories of data (web, code, math, and multilingual content across 22 Indian languages), it lacks a detailed percentage breakdown of these sources. There is no public documentation on specific data filtering or cleaning methodologies, and no sample data or specific source names (e.g., specific web crawls or datasets) are provided beyond vague 'curated in-house' claims.
Tokenizer Integrity
The tokenizer is publicly available via the Hugging Face repository. Technical documentation specifies its optimization for Indic languages, achieving fertility rates of 1.4 to 2.1 tokens per word, versus the roughly 4 to 8 seen with standard multilingual tokenizers on the same text. The vocabulary size and support for 22 Indian languages are clearly stated and verifiable through the provided configuration files.
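As a sanity check, fertility can be measured directly once the tokenizer is downloaded. The sketch below assumes a placeholder repository id (not confirmed by this card) and an arbitrary Hindi sample sentence; fertility here is simply tokens produced per whitespace-delimited word.

from transformers import AutoTokenizer

# Placeholder repository id; substitute the actual Sarvam AI Hugging Face repo.
tok = AutoTokenizer.from_pretrained("sarvamai/sarvam-105b")

sample = "भारत एक विविधताओं से भरा देश है।"   # arbitrary Hindi sentence
tokens = tok.tokenize(sample)
words = sample.split()

# Fertility = tokens per whitespace-delimited word (lower is better for Indic text).
print(f"{len(tokens)} tokens / {len(words)} words -> fertility {len(tokens) / len(words):.2f}")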
Parameter Density
The model provides exemplary transparency regarding its parameter density. It explicitly distinguishes between its 106B total parameters and its 10.3B active parameters per token. The MoE configuration is detailed, specifying 128 experts with a top-8 routing strategy and one shared expert. The architectural breakdown of the backbone (equivalent to a 10-13B dense transformer) is also publicly disclosed.
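To illustrate what "128 experts with top-8 routing plus one shared expert" means at the dataflow level, here is a generic, unoptimized PyTorch sketch. It is not Sarvam's implementation; production systems use fused grouped matrix multiplications and load-balancing losses rather than Python loops.

import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, shared_expert, top_k=8):
    """Generic top-k MoE routing with one always-on shared expert (sketch only)."""
    logits = router(x)                                # (tokens, num_experts)
    weights, idx = torch.topk(logits, top_k, dim=-1)  # pick top-k experts per token
    weights = F.softmax(weights, dim=-1)              # normalize the selected gates

    out = shared_expert(x)                            # shared expert sees every token
    for slot in range(top_k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e                  # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(x[mask])
    return out

# Toy dimensions only; Sarvam-105B reportedly uses hidden_size 4096 and 128 experts.
hidden, n_experts = 64, 128
router = torch.nn.Linear(hidden, n_experts, bias=False)
experts = [torch.nn.Linear(hidden, hidden) for _ in range(n_experts)]
shared = torch.nn.Linear(hidden, hidden)
y = moe_forward(torch.randn(10, hidden), router, experts, shared, top_k=8)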
Training Compute
The provider identifies the hardware used (over 1,000 NVIDIA H100 GPUs at Yotta's Shakti cluster) and the framework (NVIDIA NeMo). However, it fails to disclose the total GPU hours, the duration of the training run, the estimated cost, or the carbon footprint associated with the training process.
Benchmark Reproducibility
While Sarvam provides scores for several standard benchmarks (MMLU, Math500, LiveCodeBench v6) and a custom Indic benchmark (IndiVibe), it does not provide the evaluation code or the specific prompts used for these results. Furthermore, the model has not yet appeared on major independent leaderboards like the Hugging Face Open LLM Leaderboard or LMSYS Chatbot Arena, making the self-reported results difficult to verify independently.
Identity Consistency
The model consistently identifies itself as Sarvam-105B and is transparent about its versioning and capabilities. It does not exhibit identity confusion or claim to be a model from a different provider. Documentation clearly outlines its intended use cases in agentic tasks and multilingual reasoning.
License Clarity
The model is released under the Apache 2.0 license, which is a standard, permissive open-source license. The terms are clear, allowing for both commercial and non-commercial use, modification, and distribution without conflicting proprietary restrictions.
Hardware Footprint
Basic guidance on hardware requirements is available, such as the need for approximately 64GB of VRAM for non-quantized inference. The documentation mentions optimizations for NVIDIA Blackwell and the use of NVFP4 quantization, but detailed VRAM scaling for different context lengths and specific accuracy tradeoffs for various quantization levels (Q4, Q8) are not comprehensively documented for the end-user.
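For rough orientation, a naive weight-only memory estimate can be derived from the parameter counts alone. The sketch below ignores KV cache, activations, and deployment strategies such as expert offloading or partial loading, so it will not match provider guidance exactly; the quantization labels are the common shorthand, not the provider's published presets.

def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Naive weight-only estimate: parameters x bits / 8, in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 106e9    # all experts resident in memory
ACTIVE_PARAMS = 10.3e9  # parameters actually used per token

for label, bits in [("BF16", 16), ("Q8", 8), ("Q4 / NVFP4", 4)]:
    print(f"{label:>11}: all weights ~{weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB, "
          f"active subset ~{weight_memory_gb(ACTIVE_PARAMS, bits):.1f} GB")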
Versioning Drift
The model uses versioning (Sarvam-105B) and has a clear release date. However, there is no public changelog or established mechanism for tracking silent updates or performance drift. As a newly released model, a history of versioning and deprecation notices has not yet been established.