Phi-1

Parameters

1.3B

Context Length

2K

Modality

Text

Architecture

Dense

License

MIT

Release Date

15 Jun 2023

Knowledge Cutoff

-

System Requirements

VRAM requirements for different quantization methods and context sizes

1,024 tokens

4.65 GB VRAM

Consumer

1x RTX 4090

24GB VRAM

Datacenter

1x NVIDIA A100

80GB VRAM

Apple Silicon

1x Apple M3 Max

128GB VRAM

2,048 tokens

5.08 GB VRAM

Consumer

1x RTX 4090

24GB VRAM

Datacenter

1x NVIDIA A100

80GB VRAM

Apple Silicon

1x Apple M3 Max

128GB VRAM

Architecture Diagram

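[Architecture diagram: input tokens → token embedding (position: RoPE; hidden size 2,048; context 2K; vocabulary 51.2K) → 24 layers, each with pre-attention LayerNorm, multi-head attention (32 query / 32 key-value heads, head dimension 64), a residual addition, pre-FFN LayerNorm, a GELU feed-forward network (intermediate size 8,192), and a second residual addition → final LayerNorm → output logits]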

Evaluation Benchmarks

No evaluation benchmarks for Phi-1 available.

Rankings

Overall Rank

-

Coding Rank

-

About Phi-1

Microsoft's Phi-1 is a compact Transformer-based language model built for Python code generation. It was developed to show that high-quality, curated training data can matter more than data volume or model scale, the principle set out in the "Textbooks Are All You Need" paper. Training combined filtered code-language data from public repositories with synthetic Python textbooks and exercises generated by large language models such as GPT-3.5. The goal was to give the model "textbook-quality" coverage of programming concepts and practices so that it could learn effectively despite its modest size.

Phi-1 is a decoder-only Transformer with 24 layers, a hidden dimension of 2,048, and 32 attention heads. It uses Rotary Position Embedding (RoPE) to encode token positions and FlashAttention to speed up attention computation during training. The model was trained with a next-token prediction objective, which lets it generate coherent, syntactically correct Python code.
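As a rough cross-check of the listed size, the dimensions on this page can be turned into an approximate parameter count. This is a minimal sketch, not an official breakdown: it ignores biases and LayerNorm parameters, and whether the output projection shares weights with the token embedding is an assumption exposed as the hypothetical `tied_output` flag.

```python
# Back-of-the-envelope parameter count from the dimensions listed on this page.
# Biases and LayerNorm parameters are ignored, so the totals are approximate.

HIDDEN = 2048    # hidden dimension size
LAYERS = 24      # number of Transformer blocks
HEADS = 32       # attention heads
FFN = 8192       # feed-forward intermediate size
VOCAB = 51_200   # vocabulary size

HEAD_DIM = HIDDEN // HEADS  # 2048 / 32 = 64


def estimate_params(tied_output: bool) -> int:
    """Approximate weight count; `tied_output` is an assumption, not a page fact."""
    attention = 4 * HIDDEN * HEADS * HEAD_DIM  # Q, K, V and output projections
    feed_forward = 2 * HIDDEN * FFN            # up and down projections
    embedding = VOCAB * HIDDEN
    output_head = 0 if tied_output else VOCAB * HIDDEN
    return LAYERS * (attention + feed_forward) + embedding + output_head


if __name__ == "__main__":
    print(f"head dim: {HEAD_DIM}")
    print(f"tied output:   {estimate_params(True) / 1e9:.2f}B")   # 1.31B
    print(f"untied output: {estimate_params(False) / 1e9:.2f}B")  # 1.42B
```

Under these assumptions the estimate lands between about 1.31B and 1.42B weights, the same order as the 1.3B listed above.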

Phi-1 is designed primarily for generating simple Python functions from docstrings. On Python coding benchmarks such as HumanEval and MBPP, it achieves results comparable to those of significantly larger models, which underscores the impact of its data curation. Although it is specialized for Python, it illustrates what small language models can achieve in targeted domains.
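To make the docstring-to-function task concrete, the sketch below shows what such a prompt might look like and how a completion could be checked for functional correctness. The prompt, the function name `running_max`, the test cases, and the helper `passes` are all hypothetical, and the completion is hand-written for illustration; it is not actual Phi-1 output.

```python
# Hypothetical docstring-style prompt: signature plus docstring, body left open.
PROMPT = '''def running_max(values):
    """Return a list whose i-th element is the largest of values[:i + 1]."""
'''

# Hand-written stand-in for a model completion (NOT actual Phi-1 output).
COMPLETION = '''    result = []
    best = None
    for v in values:
        best = v if best is None or v > best else best
        result.append(best)
    return result
'''


def passes(prompt, completion, func_name, test_cases):
    """Return True if prompt + completion defines func_name and passes every case.

    exec() runs arbitrary code; a real evaluation harness would sandbox this step.
    """
    namespace = {}
    try:
        exec(prompt + completion, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        return False


TESTS = [(([1, 3, 2, 5, 4],), [1, 3, 3, 5, 5]), (([],), [])]

if __name__ == "__main__":
    print(passes(PROMPT, COMPLETION, "running_max", TESTS))  # True
```

A completion counts as correct only if every test case passes, which is the functional-correctness idea behind benchmarks such as HumanEval and MBPP.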

Technical Specifications

Attention

Attention Structure

Multi-Head Attention

Attention Heads

32

Key-Value Heads

32

Attention Head Dimension

64

Position Embedding

RoPE

RoPE Theta

10,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

Layer Normalization

Activation Function

GELU

Dimensions

Hidden Dimension Size

2,048

Number of Layers

24

FFN Intermediate Size (Dense)

8,192

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

51,200

Model Integrity

Total Score

B+

75 / 100

Phi-1 Model Integrity Report

Audit Note

Phi-1 exhibits a high level of transparency regarding its architectural design and the specific composition of its training data, particularly for a model of its era. Its use of a standard MIT license and clear disclosure of training hardware and time sets a positive precedent for open-weights research. However, the model's transparency is hampered by limited reproducibility of its benchmark results and a lack of public access to the synthetic datasets used during training.

Upstream

24.0 / 30

Architectural Provenance

8.5 / 10

The model's architecture is extensively documented in the 'Textbooks Are All You Need' paper and official model cards. It is a decoder-only Transformer with 1.3 billion parameters, 24 layers, a hidden dimension of 2048, and 32 attention heads. Specific technical choices like Rotary Position Embedding (RoPE) and FlashAttention are explicitly disclosed. The training methodology, including the two-stage process (pretraining on 'CodeTextbook' and finetuning on 'CodeExercises'), is clearly described with step counts and learning rate schedules.

Dataset Composition

7.5 / 10

Microsoft provides a detailed breakdown of the training data: 6 billion tokens of filtered web code (from The Stack and StackOverflow), 1 billion tokens of synthetic 'textbook' data generated by GPT-3.5, and 180 million tokens of synthetic exercises. While the exact filtering classifier and the full synthetic dataset are not public, the proportions and sources are disclosed with high specificity compared to industry standards.

Tokenizer Integrity

8.0 / 10

Phi-1 uses the same tokenizer as the CodeGen-350M-mono model, which is publicly accessible. The vocabulary size is stated as 51,200 (padded for GPU efficiency from a base of ~50,257). Documentation on the tokenizer's integration within the Hugging Face 'transformers' library is comprehensive, allowing for direct inspection of tokenization behavior and vocabulary mapping.
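The padding arithmetic can be illustrated with a short sketch. The page does not say which multiple the vocabulary was padded to; 1,024 is an assumption chosen here only because it reproduces the listed figure, and `pad_to_multiple` is a hypothetical helper.

```python
def pad_to_multiple(size: int, multiple: int) -> int:
    """Round size up to the nearest multiple (ceiling division, integers only)."""
    return -(-size // multiple) * multiple


BASE_VOCAB = 50_257  # approximate base vocabulary quoted above
ASSUMED_MULTIPLE = 1024  # assumption: not stated in the source

padded = pad_to_multiple(BASE_VOCAB, ASSUMED_MULTIPLE)

if __name__ == "__main__":
    print(padded)               # 51200
    print(padded - BASE_VOCAB)  # 943 embedding rows with no corresponding token
```

The extra rows exist in the embedding matrix but are never produced by the tokenizer, which is why the model's vocabulary size can exceed the tokenizer's.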

Model

29.5 / 40

Parameter Density

9.0 / 10

The model is a dense architecture with a clearly stated 1.3 billion total parameters. Detailed architectural specifications, including the MLP-inner dimension (8192) and attention head dimensions (64), are provided in the technical paper. There is no ambiguity regarding active vs. total parameters as it is not an MoE model.

Training Compute

7.0 / 10

Training compute is well documented: the model was trained on 8 NVIDIA A100 GPUs. Pretraining took approximately 4 days (roughly 770 GPU hours), and finetuning took an additional 7 hours. While specific carbon footprint calculations or total dollar costs are not in the primary paper, the hardware and time metrics allow for reliable third-party estimation.
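The GPU-hour figure follows from simple arithmetic, sketched below. It assumes the 7 hours of finetuning is wall-clock time on the same 8 GPUs, which this page does not state explicitly; `gpu_hours` is a hypothetical helper.

```python
GPUS = 8  # NVIDIA A100 GPUs, as listed above


def gpu_hours(wall_clock_hours: float, gpus: int = GPUS) -> float:
    """GPU hours = wall-clock hours multiplied by the number of GPUs in use."""
    return wall_clock_hours * gpus


pretraining = gpu_hours(4 * 24)  # about 4 days of wall-clock time
finetuning = gpu_hours(7)        # assumption: 7 wall-clock hours on the same 8 GPUs

if __name__ == "__main__":
    print(pretraining)               # 768, consistent with the ~770 quoted above
    print(finetuning)                # 56
    print(pretraining + finetuning)  # 824
```

A third-party cost or energy estimate would multiply these totals by a per-GPU-hour price or power draw, neither of which is given here.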

Benchmark Reproducibility

4.5 / 10

While the paper reports clear scores on HumanEval (50.6%) and MBPP (55.5%), it lacks the full release of the evaluation code and exact prompt templates used for these specific results. Independent researchers have noted challenges in reproducing these exact figures due to sensitivity to prompt formatting and the lack of a standardized evaluation harness at the time of release. (Score adjusted for known issues).
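For context on how such scores are usually computed, the sketch below implements the unbiased pass@k estimator that was introduced alongside HumanEval. Whether the Phi-1 evaluation used exactly this procedure, and with how many samples per problem, is not stated on this page, so treat the example values as illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples is correct.

    n: samples generated for a problem
    c: how many of those samples passed all tests
    k: evaluation budget (k <= n)
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative numbers, not results from the paper.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

The benchmark score is the mean of this value over all problems. For k = 1 it reduces to the fraction of samples that pass, so small differences in prompt formatting or sampling settings change the result directly.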

Identity Consistency

9.0 / 10

Phi-1 consistently identifies as a research model specialized for Python. It does not exhibit identity confusion with larger models like GPT-4, despite using GPT-3.5 for synthetic data generation. The model card explicitly defines its scope as a 'text-to-code' model and warns against its use for general conversation or production coding.

Downstream

21.5 / 30

License Clarity

9.5 / 10

The model is released under the MIT License, which is a highly permissive, standard open-source license. This allows for commercial use, modification, and distribution with minimal restrictions. The licensing terms are clear and consistent across the GitHub repository and Hugging Face model card.

Hardware Footprint

7.0 / 10

VRAM requirements are well-understood due to the model's small size (approx. 2.6GB for weights in FP16). Documentation and community testing provide clear guidance on running the model on consumer hardware (e.g., RTX 3060). However, official documentation on quantization-specific accuracy tradeoffs (e.g., 4-bit vs 8-bit) is less detailed than the architectural specs.
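The weight figure can be reproduced with a short calculation, and the same approach gives a rough key-value cache size. This is a sketch under stated assumptions: it uses the rounded 1.3B parameter count, decimal gigabytes, and the layer and head dimensions listed on this page, and it ignores activations and framework overhead, so it will not match the VRAM totals shown under System Requirements.

```python
PARAMS = 1.3e9   # rounded parameter count from this page
LAYERS = 24
KV_HEADS = 32
HEAD_DIM = 64

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(dtype: str) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return PARAMS * BYTES_PER_PARAM[dtype] / 1e9


def kv_cache_bytes(tokens: int, bytes_per_value: int = 2) -> int:
    """Keys plus values for one sequence across all layers (FP16 by default)."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * tokens * bytes_per_value


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_gb(dtype):.2f} GB")
    print(f"KV cache at 2,048 tokens: {kv_cache_bytes(2048) / 1e9:.2f} GB")  # 0.40 GB
```

The quantized rows assume every parameter is stored at the reduced width; real quantization formats keep some tensors and scaling factors at higher precision, so actual files are somewhat larger.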

Versioning Drift

5.0 / 10

The model follows a basic versioning structure (Phi-1, Phi-1.5, etc.), but lacks a detailed, granular changelog for weight updates or minor revisions. While the initial release is stable, there is limited infrastructure for tracking silent updates or behavioral drift over time within the same version identifier.

About Phi-1

Phi-1 is Microsoft's 1.3-billion-parameter Transformer-based small language model, specialized for Python code generation. Its core idea is training on carefully curated, "textbook-quality" data, demonstrating that high-quality data can produce capable models without extensive scale.

