Parameters
2.7B
Context Length
2,048 tokens
Modality
Text
Architecture
Dense
License
MIT License
Release Date
12 Oct 2023
Knowledge Cutoff
-
Attention
Attention Structure
Multi-Head Attention
Attention Heads
32
Key-Value Heads
32
Attention Head Dimension
-
Position Embedding
RoPE
RoPE Theta
10,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
Layer Normalization
Activation Function
GELU
Dimensions
Hidden Dimension Size
2,560
Number of Layers
32
FFN Intermediate Size (Dense)
10,240
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
51,200
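As a rough cross-check of the 2.7B figure, the parameter count can be estimated from the dimensions listed above. The sketch below assumes a conventional decoder block (four attention projections and a two-matrix MLP, all with biases, plus one LayerNorm per block), untied input and output embeddings, and a hidden size of 2,560 as in the published Hugging Face configuration. These are assumptions made for illustration, and the function name is hypothetical.

```python
def estimate_dense_params(hidden, layers, ffn, vocab, tied_embeddings=False):
    """Approximate parameter count of a dense decoder-only Transformer."""
    # Attention: Q, K, V and output projections, each hidden x hidden plus a bias.
    attention = 4 * (hidden * hidden + hidden)
    # MLP: up-projection (hidden -> ffn) and down-projection (ffn -> hidden), with biases.
    mlp = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # One LayerNorm per block (scale and shift); an assumption about the block layout.
    norm = 2 * hidden
    per_layer = attention + mlp + norm
    # Token embedding table, plus a separate output head unless weights are tied.
    embeddings = vocab * hidden * (1 if tied_embeddings else 2)
    return layers * per_layer + embeddings


total = estimate_dense_params(hidden=2560, layers=32, ffn=10240, vocab=51200)
print(f"{total / 1e9:.2f}B parameters")
```

Under these assumptions the estimate comes to about 2.78 billion parameters, in line with the nominal 2.7B.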
Microsoft Phi-2 is a small language model (SLM) with 2.7 billion parameters that continues Microsoft Research's work on capable models at compact scale. It is intended for research into language understanding and reasoning, with an emphasis on efficiency and accessibility. A core goal of the release is to give the research community an unrestricted small model for studying safety challenges, including toxicity mitigation and the analysis of societal bias in AI systems.
Phi-2 is a Transformer-based model trained with a next-word prediction objective. Its training emphasizes data quality: the model saw 1.4 trillion tokens drawn from synthetic data and filtered web data. The synthetic component, generated with models such as GPT-3.5 and GPT-4, consists of "textbook-quality" content intended to teach common-sense reasoning, general knowledge, and domain topics such as science. The web data was filtered for educational value and content quality. Training took 14 days on a cluster of 96 A100 GPUs and used techniques such as Flash Attention. Phi-2 is a base model: it has not been aligned through reinforcement learning from human feedback (RLHF) or instruction fine-tuned. Microsoft reports that it nonetheless behaves favorably with respect to toxicity and bias.
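The next-word prediction objective can be illustrated with a minimal sketch: the logits at each position are scored against the token that actually follows, and the loss is the mean negative log-likelihood. The function name and toy values are hypothetical, and this is plain Python for illustration, not Microsoft's training code.

```python
import math


def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from the logits at position t.

    logits: list of per-position score lists, one score per vocabulary entry.
    token_ids: the input token ids, same length as logits.
    """
    losses = []
    # The logits at position t are scored against the token at position t + 1,
    # so the final position has no target and is skipped.
    for scores, target in zip(logits[:-1], token_ids[1:]):
        # Numerically stable log-sum-exp for the softmax normaliser.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        losses.append(log_norm - scores[target])
    return sum(losses) / len(losses)


# Toy example: vocabulary of 3 tokens, sequence of 3 tokens.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
toy_tokens = [2, 0, 1]
print(round(next_token_loss(toy_logits, toy_tokens), 4))
```

With uniform logits the loss equals the logarithm of the vocabulary size, which is a convenient sanity check for any implementation of this objective.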
Phi-2 is suited to natural language processing tasks such as question answering, conversational AI, and code generation. Its small parameter count allows efficient inference on consumer-grade GPUs. The model shows strong reasoning and language understanding and, on some benchmarks, matches or outperforms considerably larger models. It is also a practical subject for mechanistic interpretability work and fine-tuning experiments, which makes it useful to researchers and developers working on resource-efficient language models.
Microsoft's Phi-2 is a 2.7 billion parameter Transformer-based model, developed for efficient language understanding and reasoning. Its technical innovations include training on "textbook-quality" synthetic and filtered web data, alongside scaled knowledge transfer from its predecessor, Phi-1.5, facilitating emergent capabilities within a compact architecture.
No evaluation benchmarks are available for Phi-2.
Overall Rank
-
Coding Rank
-
Total Score
70 / 100
Phi-2 is highly transparent about its architecture and licensing, with a permissive MIT license and clear hardware requirements. It is weaker on dataset transparency and on the environmental impact of training: the 'textbook-quality' data is described only in general terms, without its full composition or generation methodology. The model is a highly accessible research tool, though benchmark reproducibility is hampered by the lack of public evaluation code.
Architectural Provenance
Microsoft provides a clear description of Phi-2 as a decoder-only Transformer model with 2.7 billion parameters. The training methodology is detailed in official blog posts and the model card, highlighting a next-word prediction objective and a unique 'scaled knowledge transfer' from its predecessor, Phi-1.5. While the specific architectural modifications (like the use of MixFormer and Flash Attention) are mentioned, a full peer-reviewed technical paper with exhaustive architectural diagrams is absent, though the Hugging Face implementation provides high transparency into the code structure.
Dataset Composition
The model was trained on 1.4 trillion tokens, and Microsoft discloses the general composition: a mixture of 'textbook-quality' synthetic data (generated by GPT-3.5/4) and filtered web data from sources like Falcon RefinedWeb and SlimPajama. However, the exact percentage breakdown between synthetic and web data is not explicitly provided, and the specific filtering heuristics or the 'textbook' generation prompts remain proprietary, limiting full reproducibility of the dataset.
Tokenizer Integrity
The tokenizer is publicly accessible via the Hugging Face 'transformers' library. It has a known vocabulary size of 51,200 (with 50,295 active tokens), and its behavior is well-documented in community discussions and official model cards. The use of a standard BPE-based approach is verifiable through the provided 'tokenizer.json' and 'vocab.json' files in the repository.
Parameter Density
Phi-2 is a dense model with a clearly stated 2.7 billion parameters. Unlike MoE models, there is no ambiguity regarding active vs. total parameters. The architectural breakdown (layers, heads, embedding dimensions) is fully transparent through the configuration files on Hugging Face, and the impact of its compact size on performance is the central theme of its documentation.
Training Compute
Microsoft discloses that the model was trained for 14 days using 96 NVIDIA A100-80G GPUs. While this provides a clear hardware and duration metric, there is no official disclosure of the total carbon footprint, energy consumption in MWh, or the specific cost of the training run, which are key requirements for high scores in this category.
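The disclosed figures are enough for a back-of-the-envelope compute estimate. The sketch below converts 96 GPUs over 14 days into GPU-hours and applies the common rule of thumb of about 6 FLOPs per parameter per training token for dense Transformers. That rule of thumb is an assumption introduced here, not a figure published by Microsoft, and the function names are hypothetical.

```python
def gpu_hours(num_gpus, days):
    """Total accelerator-hours for a run that keeps every GPU busy throughout."""
    return num_gpus * days * 24


def approx_training_flops(params, tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6 * params * tokens


hours = gpu_hours(num_gpus=96, days=14)
flops = approx_training_flops(params=2.7e9, tokens=1.4e12)
print(f"{hours:,} GPU-hours, ~{flops:.2e} training FLOPs")
```

This gives 32,256 GPU-hours and roughly 2.3 x 10^22 FLOPs. Energy or carbon figures cannot be derived from these numbers alone, since they depend on power draw and data-center characteristics that have not been disclosed.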
Benchmark Reproducibility
While Microsoft provides extensive benchmark results (MMLU, GSM8K, HumanEval) and specifies the few-shot settings (e.g., 5-shot for MMLU), the exact evaluation code and full prompt sets used for these internal evaluations are not publicly released in a single reproducible repository. Third-party evaluations on the Open LLM Leaderboard provide some verification, but scores differ across benchmark versions.
Identity Consistency
Phi-2 demonstrates high identity consistency. It is a base model without instruction tuning, yet it does not typically hallucinate being a competitor's model (like GPT-4) in standard completions. It is clearly versioned within the Phi family, and its limitations as a non-aligned base model are explicitly stated in the 'Intended Uses' and 'Limitations' sections of its documentation.
License Clarity
The model is released under the MIT License, a permissive, standard open-source license that allows commercial use, modification, and distribution. Phi-2 was initially released under a restricted research license, so the move to MIT was a significant improvement; the current terms are clear, public, and unambiguous.
Hardware Footprint
Hardware requirements are well-documented by both Microsoft and the community. VRAM requirements for FP16 (approx. 5.2 GB) and various quantization levels (e.g., 4-bit requiring ~1.8 GB) are widely available. Documentation includes guidance on using Flash Attention to optimize memory and performance, and the model's suitability for consumer-grade GPUs is a verified claim.
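The quoted VRAM figures can be sanity-checked with simple arithmetic: weight memory is the parameter count times bytes per parameter, and the key-value cache grows with the number of layers, the hidden size, and the context length. The sketch below is an estimate only. It ignores activations and framework or quantization overhead, which is one reason practical figures such as the ~1.8 GB quoted for 4-bit are higher than the raw weight size. The hidden size of 2,560 used for the cache is taken from the published Hugging Face configuration and is an assumption here, as are the function names.

```python
GIB = 1024 ** 3


def weight_memory_gib(params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return params * bits_per_param / 8 / GIB


def kv_cache_gib(layers, hidden, context, bytes_per_value=2):
    """Key-value cache size for one sequence with full multi-head attention."""
    # Each layer stores one key and one value vector of width `hidden` per token.
    return 2 * layers * hidden * context * bytes_per_value / GIB


for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gib(2.7e9, bits):.2f} GiB")
print(f"FP16 KV cache at 2,048 tokens: {kv_cache_gib(32, 2560, 2048):.3f} GiB")
```

With the nominal 2.7 billion parameters, FP16 weights come to about 5.0 GiB (5.4 GB), in the same range as the quoted figure, and a full 2,048-token FP16 cache adds roughly 0.6 GiB per sequence.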
Versioning Drift
Phi-2 follows a clear naming convention within the Phi family (Phi-1 -> 1.5 -> 2). However, it lacks a formal, granular changelog for weight updates or minor iterations. While the Hugging Face repository tracks file changes, there is no structured semantic versioning for the model weights themselves, making it difficult to track subtle 'silent' updates if they occur.
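Absent formal versioning of the weights, one way a user could detect a silent update is to record a checksum of each downloaded weight file and compare it after a later download. The helpers below are a hypothetical sketch using only the Python standard library; the file name in the usage comment is illustrative.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, recorded_digest):
    """True if the file's current digest differs from a previously recorded one."""
    return sha256_of_file(path) != recorded_digest


# Example (hypothetical local path to a downloaded weight shard):
# recorded = sha256_of_file("model-00001-of-00002.safetensors")
# ... later, after re-downloading ...
# print(has_changed("model-00001-of-00002.safetensors", recorded))
```

Pinning a specific repository commit when downloading serves a similar purpose, since a pinned commit refers to a fixed set of files.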