Parameters
6B
Context Length
4,096
Modality
Text
Architecture
Dense
License
Apache 2.0
Release Date
2 Nov 2023
Knowledge Cutoff
Jun 2023
Attention Structure
Grouped-Query Attention
Hidden Dimension Size
4096
Number of Layers
32
Attention Heads
32
Key-Value Heads
4
Activation Function
SwiGLU
Normalization
RMS Normalization
Position Embedding
Rotary Position Embedding (RoPE)
The Yi-6B model, developed by 01.AI, is a 6-billion-parameter large language model engineered for efficient, accessible language processing. As a core member of the Yi model family, it is designed to deliver strong performance at moderate resource requirements, making it suitable for both personal and academic use. The model is distinguished by its bilingual capability: it was trained on a 3.1-trillion-token multilingual corpus, giving it proficiency in both English and Chinese language understanding and generation.
Architecturally, Yi-6B is built on a dense transformer framework. Its attention mechanism uses Grouped-Query Attention (GQA), applied to both the 6B and 34B Yi models; compared with traditional Multi-Head Attention, GQA reduces training and inference costs without compromising performance, even at the 6B scale. The model employs SwiGLU as its activation function and RMSNorm for normalization, drawing architectural parallels with models such as Llama, and uses the Rotary Positional Embedding (RoPE) scheme for position encoding. Yi-6B has a hidden dimension of 4096, comprises 32 layers, and pairs 32 attention query heads with 4 key-value heads.
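The practical benefit of GQA can be seen in the per-token KV-cache footprint implied by the dimensions above. A minimal sketch, assuming FP16 (2 bytes per element) storage for the cache:

```python
# KV-cache size per token: GQA (4 KV heads) vs. full MHA (32 heads),
# using the Yi-6B dimensions quoted above. FP16 storage (2 bytes per
# element) is an assumption for illustration.
HIDDEN = 4096
N_LAYERS = 32
N_QUERY_HEADS = 32
N_KV_HEADS = 4
HEAD_DIM = HIDDEN // N_QUERY_HEADS  # 128
BYTES_FP16 = 2

def kv_cache_bytes_per_token(n_kv_heads: int) -> int:
    # One key vector and one value vector per KV head, per layer.
    return 2 * n_kv_heads * HEAD_DIM * BYTES_FP16 * N_LAYERS

mha = kv_cache_bytes_per_token(N_QUERY_HEADS)  # 2*32*128*2*32 = 524,288 B
gqa = kv_cache_bytes_per_token(N_KV_HEADS)     # 2*4*128*2*32  =  65,536 B
print(f"MHA: {mha} B/token, GQA: {gqa} B/token, reduction: {mha // gqa}x")
```

With 4 KV heads instead of 32, the cache shrinks 8x, which is where most of GQA's inference savings come from at long context lengths.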
The Yi-6B model is engineered for robust performance across a spectrum of natural language processing tasks, including language understanding, commonsense reasoning, and reading comprehension. Its efficient design and open-weight release under the Apache 2.0 license contribute to its applicability in various scenarios, from rapid prototyping in real-time applications to fine-tuning for specific domains. The model features a default context window of 4,096 tokens, with variants offering extended context lengths up to 200,000 tokens for handling more extensive textual inputs.
No evaluation benchmarks are available for Yi-6B.
Overall Rank
-
Coding Rank
-
Total Score
60
/ 100
The Yi-6B model exhibits strong transparency in its architectural specifications and hardware requirements, supported by a formal technical report. However, it suffers from significant opacity regarding its training compute resources and the granular composition of its 3.1T token dataset. While the use of an Apache 2.0 license is a positive step, conflicting commercial application requirements and early-release naming inconsistencies have historically clouded its transparency profile.
Architectural Provenance
The Yi-6B model is well-documented in an official technical report ('Yi: Open Foundation Models') which details its 'modified' Transformer architecture. It explicitly identifies the use of Grouped-Query Attention (GQA), SwiGLU activation, and Rotary Positional Embeddings (RoPE). While it acknowledges being based on the Llama implementation, it clarifies that it was trained from scratch. However, the 'proprietary' nature of the specific training infrastructure and some methodology details are not fully public, preventing a higher score.
Dataset Composition
01.AI discloses that the model was trained on a 3.1 trillion token multilingual corpus (primarily English and Chinese). While the technical report describes a 'cascaded data deduplication and quality filtering pipeline' involving heuristic and learned filters, it lacks a detailed percentage breakdown of data sources (e.g., specific web crawls, books, or code proportions). The data itself is not public, and the description remains at a high level of 'highly-engineered' data without granular source transparency.
Tokenizer Integrity
The tokenizer is publicly accessible via the Hugging Face repository and the official GitHub. It uses a SentencePiece BPE implementation with a vocabulary size of 64,000 tokens. Documentation explains the choice to avoid dummy prefixes for better bilingual (English/Chinese) performance and the use of an 'identity tokenizer' for punctuation. The vocabulary is well-aligned with the claimed bilingual support.
Parameter Density
The model's parameter count is clearly stated as 6 billion. As a dense model, all parameters are active during inference. The technical report provides a specific architectural breakdown: 32 layers, a hidden size of 4096, 32 query heads, and 4 KV heads. This level of detail is exemplary for a dense architecture.
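The stated 6-billion figure can be sanity-checked against the architectural breakdown. The sketch below assumes a SwiGLU intermediate size of 11,008 and untied input/output embeddings, neither of which is stated in this card:

```python
# Back-of-the-envelope parameter count for Yi-6B from the figures above.
# Assumptions (not stated in this card): SwiGLU intermediate size of
# 11,008 and untied input/output embedding matrices.
VOCAB = 64_000
HIDDEN = 4096
LAYERS = 32
N_HEADS = 32
N_KV_HEADS = 4
HEAD_DIM = HIDDEN // N_HEADS   # 128
INTERMEDIATE = 11_008          # assumed

embed = 2 * VOCAB * HIDDEN               # input + output embeddings
attn = (HIDDEN * HIDDEN                  # Q projection
        + 2 * HIDDEN * N_KV_HEADS * HEAD_DIM  # K and V (grouped)
        + HIDDEN * HIDDEN)               # output projection
mlp = 3 * HIDDEN * INTERMEDIATE          # gate, up, down (SwiGLU)
total = embed + LAYERS * (attn + mlp)
print(f"{total / 1e9:.2f}B parameters")  # roughly 6B
```

Under these assumptions the total lands close to the advertised 6B, which supports the card's figures.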
Training Compute
Information regarding the specific compute resources used for training Yi-6B is extremely limited. While the report mentions a 'scalable super-computing infrastructure,' it does not disclose the total GPU/TPU hours, the specific hardware count used for the 6B variant, the training duration, or the carbon footprint. This is a significant transparency gap.
Benchmark Reproducibility
01.AI provides benchmark results on standard sets like MMLU, C-Eval, and AlpacaEval in their technical report. They mention following Llama 2's evaluation methodology and using greedy decoding. However, the exact evaluation code and full prompt sets used for all internal benchmarks are not fully public in a single reproducible repository, and independent verification has noted sensitivity to prompt formatting.
Identity Consistency
The model generally identifies as an AI developed by 01.AI in its chat variants. However, research indicates that the base model (Yi-6B) lacks inherent self-identity without fine-tuning, and there have been documented instances of identity confusion in the broader Yi family where models might misidentify their origin or version when prompted in specific languages or contexts.
License Clarity
The model weights and code are released under the Apache 2.0 license, which is highly transparent. However, 01.AI's official website and some documentation include a requirement to 'apply for a commercial license for free' for certain use cases, creating a conflict with the standard 'unrestricted' nature of Apache 2.0. This ambiguity in the commercial terms reduces the score.
Hardware Footprint
Hardware requirements are well-documented on the Hugging Face model card and in the GitHub README. It provides specific VRAM estimates for inference (approx. 12GB for FP16) and training (approx. 45GB with Adam). Furthermore, it details requirements for 4-bit and 8-bit quantized versions (AWQ/GPTQ), making it highly accessible for users to plan deployment.
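The card's ~12 GB FP16 inference figure is straightforward weights-only arithmetic; a sketch, noting that real usage is higher once activations, the KV cache, and framework overhead are included:

```python
# Arithmetic behind the VRAM figures above: weights-only memory for
# 6B parameters at several precisions. Real usage is higher
# (activations, KV cache, framework overhead), so treat these as
# lower bounds.
PARAMS = 6e9

def weight_gb(bits_per_param: float) -> float:
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4 (AWQ/GPTQ)", 4)]:
    print(f"{name:>16}: {weight_gb(bits):.1f} GB")
```

At 16 bits per parameter this gives 12 GB of weights, matching the documented FP16 estimate, with 8-bit and 4-bit quantization halving and quartering that.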
Versioning Drift
The model follows a basic versioning scheme (e.g., Yi-6B, Yi-1.5-6B), and 01.AI maintains a changelog on GitHub and Hugging Face. However, the versioning does not strictly follow semantic versioning for weights, and some updates (like the 200K context extension) were released as separate variants rather than versioned iterations of the base, making tracking of 'drift' in the original model difficult.