
Phi-2

Parameters

2.7B

Context Length

2,048 tokens

Modality

Text

Architecture

Dense

License

MIT License

Release Date

12 Dec 2023

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Multi-Head Attention

Attention Heads

32

Key-Value Heads

32

Attention Head Dimension

-

Position Embedding

RoPE

RoPE Theta

10,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

Layer Normalization

Activation Function

GELU

Dimensions

Hidden Dimension Size

2,560

Number of Layers

32

FFN Intermediate Size (Dense)

10,240

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

51,200

Architecture Diagram

Input Tokens → Token Embedding (Position: RoPE)
Hidden: 2.56k · Context: 2k · Vocab: 51.2k
× 32 layers: LayerNorm (Pre-Attention) → Multi-Head Attention (32 Q / 32 KV heads, head dim: 80) → + (residual add) → LayerNorm (Pre-FFN) → Feed-Forward Network (GELU, intermediate: 10.2k) → + (residual add)
Final LayerNorm → Output Logits
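As a rough sanity check on the 2.7B figure, the parameter count can be estimated from the dimensions listed above. The sketch below counts only weight matrices (attention projections, feed-forward layers, the input embedding, and an output projection) and ignores biases and LayerNorm parameters. The hidden size of 2,560 and the untied output projection are assumptions taken from the published Hugging Face configuration, and the function name is hypothetical.

```python
def estimate_dense_params(hidden, layers, ffn, vocab, tied_embeddings=False):
    """Approximate weight count of a dense decoder-only Transformer.

    Counts weight matrices only; biases and LayerNorm parameters are ignored.
    """
    attention = 4 * hidden * hidden      # Q, K, V and output projections
    feed_forward = 2 * hidden * ffn      # up- and down-projection
    embedding = vocab * hidden           # input token embedding
    # Assumption: the output projection is a separate (untied) matrix.
    output_head = 0 if tied_embeddings else vocab * hidden
    return layers * (attention + feed_forward) + embedding + output_head


# Assumed hidden size of 2,560; other values are the ones listed on this page.
PHI2_ESTIMATE = estimate_dense_params(hidden=2560, layers=32, ffn=10240, vocab=51200)
print(f"{PHI2_ESTIMATE:,} weights (~{PHI2_ESTIMATE / 1e9:.2f}B)")
```

Under these assumptions the estimate lands at about 2.78B, in line with the stated 2.7B. The same formula with a hidden size of 2,048 gives about 2.09B, which is one way to see that 2,560 is the consistent value.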

Phi-2

Microsoft Phi-2 is a small language model (SLM) with 2.7 billion parameters, continuing Microsoft Research's work on capable models at compact scale. It is intended for research into language understanding and reasoning, with an emphasis on efficiency and accessibility. A core goal of the release is to give the research community an unconstrained small model for studying safety problems such as reducing toxicity and analyzing societal bias in AI systems.

Phi-2 is a Transformer-based model trained with a next-word prediction objective. Training prioritized data quality: the model saw 1.4 trillion tokens drawn from synthetic data and filtered web data. The synthetic component, generated with models such as GPT-3.5 and GPT-4, consists of "textbook-quality" content meant to teach common-sense reasoning, general knowledge, and domain topics such as science. Web data was filtered for educational value and content quality. Training took 14 days on a cluster of 96 A100 GPUs and used techniques such as Flash Attention. Phi-2 is a base model: it has not been aligned with reinforcement learning from human feedback (RLHF) or instruction fine-tuned, yet Microsoft reports favorable behavior with respect to toxicity and bias.

Phi-2 can be applied to natural language processing tasks such as question answering, conversational AI, and code generation. Its small parameter count allows inference on consumer-grade GPUs. On specific benchmarks it matches or outperforms significantly larger models in reasoning and language understanding. It is also well suited to mechanistic interpretability studies and fine-tuning experiments, which makes it useful to researchers and developers working with resource-efficient language models.
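Because Phi-2 is a base model, question answering is usually done by framing the request as text to be completed. The sketch below is one way to do this with the Hugging Face `transformers` library. The repository name `microsoft/phi-2` and the "Instruct: ... Output:" prompt format come from the public model card, not from this page, and the helper names are hypothetical. The heavy imports sit inside the function so the prompt helper can be used without downloading the weights.

```python
def build_qa_prompt(question: str) -> str:
    """Wrap a question in the QA-style completion format (assumed from the model card)."""
    return f"Instruct: {question}\nOutput:"


def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Run greedy generation with Phi-2; downloads the weights on first use."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Use half precision on a GPU, full precision on CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
    model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=dtype)
    model.to(device)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate(build_qa_prompt("Explain why the sky is blue in two sentences.")))
```

Since the model is not instruction-tuned, it may keep generating text after the answer, so callers typically cap `max_new_tokens` or trim the output themselves.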

About Phi-2

Microsoft's Phi-2 is a 2.7 billion parameter Transformer-based model built for efficient language understanding and reasoning. It was trained on "textbook-quality" synthetic data and filtered web data, with scaled knowledge transfer from its predecessor, Phi-1.5, enabling strong capabilities in a compact architecture.


Other Phi-2 Models
  • No related models available

Evaluation Benchmarks

No evaluation benchmarks for Phi-2 available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B+

70 / 100

Phi-2 Model Integrity Report

Audit Note

Phi-2 is transparent about its architecture and licensing, with a permissive MIT license and clear hardware requirements. It falls short on dataset transparency and on disclosure of the environmental impact of training compute: the 'textbook-quality' data is described only in general terms, without the full composition or generation methodology. The model is a highly accessible research tool, though benchmark reproducibility is hampered by the lack of public evaluation code.

Upstream

21.0 / 30

Architectural Provenance

7.5 / 10

Microsoft provides a clear description of Phi-2 as a decoder-only Transformer model with 2.7 billion parameters. The training methodology is detailed in official blog posts and the model card, highlighting a next-word prediction objective and a unique 'scaled knowledge transfer' from its predecessor, Phi-1.5. While the specific architectural modifications (like the use of MixFormer and Flash Attention) are mentioned, a full peer-reviewed technical paper with exhaustive architectural diagrams is absent, though the Hugging Face implementation provides high transparency into the code structure.

Dataset Composition

5.0 / 10

The model was trained on 1.4 trillion tokens, and Microsoft discloses the general composition: a mixture of 'textbook-quality' synthetic data (generated by GPT-3.5/4) and filtered web data from sources like Falcon RefinedWeb and SlimPajama. However, the exact percentage breakdown between synthetic and web data is not explicitly provided, and the specific filtering heuristics or the 'textbook' generation prompts remain proprietary, limiting full reproducibility of the dataset.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly accessible via the Hugging Face 'transformers' library. It has a known vocabulary size of 51,200 (with 50,295 active tokens), and its behavior is well-documented in community discussions and official model cards. The use of a standard BPE-based approach is verifiable through the provided 'tokenizer.json' and 'vocab.json' files in the repository.
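The gap between the 51,200-row vocabulary and the 50,295 active tokens means some embedding rows are never produced by the tokenizer. A minimal sketch of that arithmetic follows, assuming token IDs are assigned contiguously from 0; the constant and function names are hypothetical.

```python
EMBEDDING_ROWS = 51_200      # vocabulary size listed in the model configuration
TOKENIZER_ENTRIES = 50_295   # active tokens reported for the tokenizer

# Rows of the embedding matrix that no tokenizer output can select.
unused_rows = EMBEDDING_ROWS - TOKENIZER_ENTRIES
unused_fraction = unused_rows / EMBEDDING_ROWS


def is_active_token_id(token_id: int) -> bool:
    """True if the ID can be produced by the tokenizer (assumes contiguous IDs from 0)."""
    return 0 <= token_id < TOKENIZER_ENTRIES


print(f"{unused_rows} unused rows ({unused_fraction:.2%} of the embedding matrix)")
```

A check like `is_active_token_id` can be useful when sampling from raw logits, since the model still emits a score for every one of the 51,200 rows.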

Model

26.5 / 40

Parameter Density

9.0 / 10

Phi-2 is a dense model with a clearly stated 2.7 billion parameters. Unlike MoE models, there is no ambiguity regarding active vs. total parameters. The architectural breakdown (layers, heads, embedding dimensions) is fully transparent through the configuration files on Hugging Face, and the impact of its compact size on performance is the central theme of its documentation.

Training Compute

4.0 / 10

Microsoft discloses that the model was trained for 14 days using 96 NVIDIA A100-80G GPUs. While this provides a clear hardware and duration metric, there is no official disclosure of the total carbon footprint, energy consumption in MWh, or the specific cost of the training run, which are key requirements for high scores in this category.
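The disclosed hardware and duration are enough for a back-of-the-envelope estimate of the figures that are missing. The sketch below derives GPU-hours from the stated numbers, applies the common 6 × parameters × tokens approximation for training FLOPs, and gives a rough GPU-only energy figure. The 400 W per-GPU power draw is an assumption (roughly the board power of an A100-80G SXM module), and the result ignores cooling, networking, host systems, and any idle time, so it is not an official value.

```python
GPUS = 96                # disclosed GPU count
DAYS = 14                # disclosed training duration
PARAMS = 2.7e9           # stated parameter count
TOKENS = 1.4e12          # stated training tokens

# Assumption: each GPU draws about 400 W for the whole run.
ASSUMED_GPU_POWER_KW = 0.4

gpu_hours = GPUS * DAYS * 24
# Rule-of-thumb estimate: ~6 FLOPs per parameter per training token.
train_flops = 6 * PARAMS * TOKENS
# GPU-only energy in MWh under the assumed power draw.
energy_mwh = gpu_hours * ASSUMED_GPU_POWER_KW / 1000

print(f"GPU-hours:      {gpu_hours:,}")
print(f"Training FLOPs: {train_flops:.3e}")
print(f"GPU energy:     {energy_mwh:.1f} MWh (assumed {ASSUMED_GPU_POWER_KW} kW per GPU)")
```

These numbers only bound the order of magnitude; a carbon footprint would additionally need the data center's power usage effectiveness and grid carbon intensity, neither of which is disclosed.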

Benchmark Reproducibility

4.5 / 10

While Microsoft provides extensive benchmark results (MMLU, GSM8K, HumanEval) and specifies the few-shot settings (e.g., 5-shot for MMLU), the exact evaluation code and full prompt sets used for these internal evaluations are not publicly released in a single reproducible repository. Third-party evaluations on the Open LLM Leaderboard provide some verification, but discrepancies in scoring across different versions of benchmarks are noted.

Identity Consistency

9.0 / 10

Phi-2 demonstrates high identity consistency. It is a base model without instruction tuning, yet it does not typically hallucinate being a competitor's model (like GPT-4) in standard completions. It is clearly versioned within the Phi family, and its limitations as a non-aligned base model are explicitly stated in the 'Intended Uses' and 'Limitations' sections of its documentation.

Downstream

22.5 / 30

License Clarity

9.5 / 10

The model is released under the MIT License, which is a highly permissive, standard open-source license allowing for commercial use, modification, and distribution. This was a significant upgrade from its initial restricted research license, and the current terms are clear, public, and unambiguous.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented by both Microsoft and the community. VRAM requirements for FP16 (approx. 5.2 GB) and various quantization levels (e.g., 4-bit requiring ~1.8 GB) are widely available. Documentation includes guidance on using Flash Attention to optimize memory and performance, and the model's suitability for consumer-grade GPUs is a verified claim.
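The weight-only part of these VRAM figures follows directly from the parameter count and the bytes stored per parameter. A minimal sketch, assuming 2.7 billion parameters and idealized storage with no quantization metadata; the table and function names are hypothetical.

```python
# Idealized storage cost per parameter, in bytes.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


PARAMS = 2.7e9  # stated parameter count
for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(PARAMS, precision):.2f} GB")
```

Real deployments need more than this: quantized formats store scales and zero-points alongside the weights, and inference adds activations and the key-value cache, which is why a reported 4-bit footprint sits above the idealized 1.35 GB.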

Versioning Drift

5.0 / 10

Phi-2 follows a clear naming convention within the Phi family (Phi-1 → Phi-1.5 → Phi-2). However, it lacks a formal, granular changelog for weight updates or minor iterations. While the Hugging Face repository tracks file changes, there is no structured semantic versioning for the model weights themselves, making it difficult to track subtle 'silent' updates if they occur.
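The section and total scores in this report can be reproduced from the individual category scores. The sketch below assumes the totals are plain unweighted sums, which matches the published numbers; the dictionary layout is illustrative only.

```python
# Category scores as listed in the report, grouped by section.
SCORES = {
    "Upstream": {
        "Architectural Provenance": 7.5,
        "Dataset Composition": 5.0,
        "Tokenizer Integrity": 8.5,
    },
    "Model": {
        "Parameter Density": 9.0,
        "Training Compute": 4.0,
        "Benchmark Reproducibility": 4.5,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 9.5,
        "Hardware Footprint": 8.0,
        "Versioning Drift": 5.0,
    },
}

# Assumption: each section total is an unweighted sum of its categories.
section_totals = {name: sum(items.values()) for name, items in SCORES.items()}
total_score = sum(section_totals.values())

for name, subtotal in section_totals.items():
    print(f"{name}: {subtotal:.1f}")
print(f"Total: {total_score:.1f} / 100")
```

The sums come out to 21.0, 26.5, and 22.5, for a total of 70 out of 100, matching the reported section and overall scores.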
