Parameters
1.3B
Context Length
2K
Modality
Text
Architecture
Dense
License
MIT
Release Date
15 Jun 2023
Knowledge Cutoff
-
VRAM requirements for different quantization methods and context sizes
1,024 tokens
Consumer: 1x RTX 4090 (24GB VRAM)
Datacenter: 1x NVIDIA A100 (80GB VRAM)
Apple Silicon: 1x Apple M3 Max (128GB VRAM)
2,048 tokens
Consumer: 1x RTX 4090 (24GB VRAM)
Datacenter: 1x NVIDIA A100 (80GB VRAM)
Apple Silicon: 1x Apple M3 Max (128GB VRAM)
No evaluation benchmarks for Phi-1 available.
Overall Rank
-
Coding Rank
-
Microsoft's Phi-1 is a compact, Transformer-based language model built for Python code generation. Its development prioritizes high-quality, curated training data over data volume or model scale, the principle set out in the paper "Textbooks Are All You Need". The training data combines filtered code-language data from public repositories with synthetic Python textbooks and exercises generated by large language models such as GPT-3.5. The aim was to give the model a "textbook-quality" grounding in programming concepts and practices despite its modest size.
Phi-1 is a decoder-only Transformer with 24 layers, a hidden dimension of 2,048, and 32 attention heads. To improve training efficiency it adopts two existing techniques: Rotary Position Embedding (RoPE) to encode token positions and FlashAttention to accelerate attention computation. The model was trained with a next-token prediction objective, which is what lets it generate coherent, syntactically correct Python code.
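To make the RoPE idea concrete, here is a minimal pure-Python sketch that rotates consecutive pairs of a head vector by position-dependent angles, using the theta of 10,000 and head dimension of 64 listed in this document. The function names, the interleaved pairing, and the choice to rotate every dimension are assumptions made for illustration; they are not taken from Phi-1's implementation, which may pair dimensions differently or rotate only part of each head.

```python
import math


def rope_rotate(vec, position, theta=10_000.0):
    """Rotate consecutive pairs of `vec` by RoPE angles for `position`."""
    dim = len(vec)
    if dim % 2:
        raise ValueError("head dimension must be even")
    out = list(vec)
    for i in range(dim // 2):
        # Pair i rotates at frequency theta^(-2i/dim):
        # early pairs spin quickly, later pairs slowly.
        angle = position * theta ** (-2.0 * i / dim)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * math.cos(angle) - y * math.sin(angle)
        out[2 * i + 1] = x * math.sin(angle) + y * math.cos(angle)
    return out


def dot(a, b):
    """Plain dot product, standing in for one attention score."""
    return sum(x * y for x, y in zip(a, b))


# Illustrative query/key vectors with head dimension 64.
q = [math.sin(0.1 * j) for j in range(64)]
k = [math.cos(0.2 * j) for j in range(64)]

# The score depends only on the relative offset (2 in both cases).
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_far = dot(rope_rotate(q, 12), rope_rotate(k, 10))
```

Because each pair is rotated by an angle proportional to its position, the dot product between a rotated query and a rotated key depends only on the distance between the two positions, which is why `score_near` and `score_far` agree up to floating-point error.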
Phi-1 is primarily designed to generate simple Python functions from docstrings. On Python coding benchmarks such as HumanEval and MBPP, the paper reports scores of 50.6% and 55.5% respectively, results comparable to those of significantly larger models and attributed to the quality of its training data. Although the model is specialized for Python, these results illustrate what small language models can achieve in a targeted domain.
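The docstring-to-function workflow described above can be sketched with the Hugging Face `transformers` library. This is an illustrative sketch, not an official recipe: the helper names `build_prompt` and `complete` are hypothetical, the model identifier `microsoft/phi-1` is assumed to be the published checkpoint, and the generation settings are arbitrary.

```python
def build_prompt(signature: str, docstring: str) -> str:
    """Format a function signature plus docstring as a completion prompt."""
    return f'{signature}\n    """{docstring}"""\n'


def complete(prompt: str, model_id: str = "microsoft/phi-1",
             max_new_tokens: int = 128) -> str:
    """Ask the model to continue the prompt (downloads weights on first use)."""
    # Imported lazily so build_prompt works without the library installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


prompt = build_prompt(
    "def is_prime(n: int) -> bool:",
    "Return True if n is a prime number.",
)
```

Calling `complete(prompt)` would return the prompt followed by the model's proposed function body. Output quality is sensitive to how the prompt is formatted, so the layout used here is only one reasonable choice.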
Attention
Attention Structure
Multi-Head Attention
Attention Heads
32
Key-Value Heads
32
Attention Head Dimension
64
Position Embedding
RoPE
RoPE Theta
10,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
Layer Normalization
Activation Function
GELU
Dimensions
Hidden Dimension Size
2,048
Number of Layers
24
FFN Intermediate Size (Dense)
8,192
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
51,200
Total Score
75 / 100
Phi-1 exhibits a high level of transparency regarding its architectural design and the specific composition of its training data, particularly for a model of its era. Its use of a standard MIT license and clear disclosure of training hardware and time sets a positive precedent for open-weights research. However, the model's transparency is hampered by limited reproducibility of its benchmark results and a lack of public access to the synthetic datasets used during training.
Architectural Provenance
The model's architecture is extensively documented in the 'Textbooks Are All You Need' paper and official model cards. It is a decoder-only Transformer with 1.3 billion parameters, 24 layers, a hidden dimension of 2048, and 32 attention heads. Specific technical choices like Rotary Position Embedding (RoPE) and FlashAttention are explicitly disclosed. The training methodology, including the two-stage process (pretraining on 'CodeTextbook' and finetuning on 'CodeExercises'), is clearly described with step counts and learning rate schedules.
Dataset Composition
Microsoft provides a detailed breakdown of the training data: 6 billion tokens of filtered web code (from The Stack and StackOverflow), 1 billion tokens of synthetic 'textbook' data generated by GPT-3.5, and 180 million tokens of synthetic exercises. While the exact filtering classifier and the full synthetic dataset are not public, the proportions and sources are disclosed with high specificity compared to industry standards.
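As a quick worked example, the token counts quoted above can be turned into shares of the unique training corpus. The counts come from this section; the dictionary keys are just labels. Note that these are shares of unique tokens, not of tokens seen during training, since the exercises are used in a separate finetuning stage.

```python
# Unique token counts as quoted in the Dataset Composition paragraph.
TOKENS = {
    "filtered web code": 6_000_000_000,
    "synthetic textbooks": 1_000_000_000,
    "synthetic exercises": 180_000_000,
}

total_tokens = sum(TOKENS.values())
shares = {name: count / total_tokens for name, count in TOKENS.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Filtered web code makes up roughly five sixths of the unique tokens, and the synthetic exercises only a few percent.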
Tokenizer Integrity
Phi-1 uses the same tokenizer as the CodeGen-350M-mono model, which is publicly accessible. The vocabulary size is stated as 51,200 (padded for GPU efficiency from a base of ~50,257). Documentation on the tokenizer's integration within the Hugging Face 'transformers' library is comprehensive, allowing for direct inspection of tokenization behavior and vocabulary mapping.
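The relationship between the two vocabulary figures can be checked with a little arithmetic. The sketch below treats the base size as exactly 50,257 (the text gives it only approximately) and shows that rounding up to the next multiple of 1,024 yields 51,200. That the padding rule was "next multiple of 1,024" is an observation consistent with the numbers, not something this document states.

```python
def round_up(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


BASE_VOCAB = 50_257      # approximate base size quoted above
PADDED_VOCAB = 51_200    # vocabulary size listed for Phi-1

padded_guess = round_up(BASE_VOCAB, 1024)   # assumed padding granularity
unused_rows = PADDED_VOCAB - BASE_VOCAB     # embedding rows with no token
```

Under this assumption a few hundred embedding rows have no corresponding token; they occupy memory but are never produced by the tokenizer.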
Parameter Density
The model uses a dense architecture with a clearly stated total of 1.3 billion parameters. Detailed architectural specifications, including the MLP inner dimension (8,192) and the attention head dimension (64), are provided in the technical paper. There is no ambiguity between active and total parameters because it is not an MoE model.
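The stated dimensions can be sanity-checked against the 1.3 billion figure with a rough parameter count. The sketch below counts only the large weight matrices and ignores biases and normalization parameters; the function name is hypothetical, and whether the output head shares weights with the input embedding is not stated here, so both cases are computed.

```python
def estimate_params(layers: int, hidden: int, ffn: int, vocab: int,
                    tied_embeddings: bool) -> int:
    """Rough weight count for a dense decoder-only Transformer."""
    attention = 4 * hidden * hidden   # Q, K, V and output projections
    mlp = 2 * hidden * ffn            # up- and down-projection
    embedding = vocab * hidden        # token embedding table
    output_head = 0 if tied_embeddings else vocab * hidden
    return layers * (attention + mlp) + embedding + output_head


head_dim = 2048 // 32  # hidden size divided by number of heads

tied = estimate_params(24, 2048, 8192, 51_200, tied_embeddings=True)
untied = estimate_params(24, 2048, 8192, 51_200, tied_embeddings=False)
print(f"tied: {tied / 1e9:.2f}B, untied: {untied / 1e9:.2f}B")
```

Both estimates land between 1.3 and 1.4 billion (after rounding, 1.42 billion in the untied case), which is consistent with the headline parameter count, and the derived head dimension matches the stated 64.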
Training Compute
Training compute is well-documented: the model was trained on 8 Nvidia A100 GPUs. Pretraining took approximately 4 days (770 GPU hours), and finetuning took an additional 7 hours. While specific carbon footprint calculations or total dollar costs are not in the primary paper, the hardware and time metrics allow for reliable third-party estimation.
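The kind of third-party estimate mentioned above can be sketched in a few lines. The GPU count and durations come from this section; treating the 7 finetuning hours as wall-clock time on all 8 GPUs is an assumption, and `estimate_cost` is a hypothetical helper whose hourly rate must be supplied by the reader.

```python
GPUS = 8
PRETRAIN_DAYS = 4     # "approximately 4 days"
FINETUNE_HOURS = 7    # assumed to be wall-clock time on all 8 GPUs

pretrain_gpu_hours = GPUS * PRETRAIN_DAYS * 24
finetune_gpu_hours = GPUS * FINETUNE_HOURS
total_gpu_hours = pretrain_gpu_hours + finetune_gpu_hours


def estimate_cost(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Dollar estimate for a caller-supplied (hypothetical) hourly rate."""
    return gpu_hours * usd_per_gpu_hour
```

Eight GPUs for four days gives 768 GPU hours, which agrees with the quoted figure of about 770 and suggests the two numbers describe the same run.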
Benchmark Reproducibility
The paper reports clear scores on HumanEval (50.6%) and MBPP (55.5%), but the evaluation code and the exact prompt templates used to obtain them were not fully released. Independent researchers have noted difficulty reproducing these exact figures, owing to sensitivity to prompt formatting and the lack of a standardized evaluation harness at the time of release. (Score adjusted for known issues).
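Scores on HumanEval and MBPP are conventionally reported as pass@k, the probability that at least one of k sampled completions passes the unit tests. Assuming the figures above are pass@1 (this section does not say), a reproduction attempt would typically use the standard unbiased estimator sketched below, where `n` is the number of samples drawn per problem and `c` the number that pass. The function name and argument checks are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Per-problem estimates are averaged over the benchmark to get the score.
per_problem = [pass_at_k(10, 5, 1), pass_at_k(10, 0, 1), pass_at_k(10, 10, 1)]
benchmark_score = sum(per_problem) / len(per_problem)
```

Even with the estimator fixed, results can shift with prompt layout, sampling temperature, and stopping rules, which is one reason reproduced figures may differ from reported ones.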
Identity Consistency
Phi-1 consistently identifies as a research model specialized for Python. It does not exhibit identity confusion with larger models like GPT-4, despite using GPT-3.5 for synthetic data generation. The model card explicitly defines its scope as a 'text-to-code' model and warns against its use for general conversation or production coding.
License Clarity
The model is released under the MIT License, which is a highly permissive, standard open-source license. This allows for commercial use, modification, and distribution with minimal restrictions. The licensing terms are clear and consistent across the GitHub repository and Hugging Face model card.
Hardware Footprint
VRAM requirements are well-understood due to the model's small size (approx. 2.6GB for weights in FP16). Documentation and community testing provide clear guidance on running the model on consumer hardware (e.g., RTX 3060). However, official documentation on quantization-specific accuracy tradeoffs (e.g., 4-bit vs 8-bit) is less detailed than the architectural specs.
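The 2.6GB figure follows directly from the parameter count, and the same arithmetic extends to other precisions and to the key-value cache. The sketch below uses the dimensions listed in this document (24 layers, 32 key-value heads, head dimension 64); the function names are hypothetical, and the estimate ignores activations, framework overhead, and any quantization metadata.

```python
def weight_bytes(params: int, bits_per_param: int) -> float:
    """Memory needed to hold the weights at a given precision."""
    return params * bits_per_param / 8


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2) -> int:
    """Keys and values cached for every layer, head and token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


PARAMS = 1_300_000_000

fp16_gb = weight_bytes(PARAMS, 16) / 1e9
int8_gb = weight_bytes(PARAMS, 8) / 1e9
int4_gb = weight_bytes(PARAMS, 4) / 1e9

# FP16 cache at the full 2,048-token context, batch size 1.
cache_bytes = kv_cache_bytes(24, 32, 64, tokens=2048)
```

At full context the FP16 cache adds roughly 0.4GB on top of the weights, so the model fits comfortably within the memory of the consumer GPUs mentioned above.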
Versioning Drift
The model follows a basic versioning structure (Phi-1, Phi-1.5, etc.), but lacks a detailed, granular changelog for weight updates or minor revisions. While the initial release is stable, there is limited infrastructure for tracking silent updates or behavioral drift over time within the same version identifier.
Phi-1 is Microsoft's foundational 1.3 billion-parameter, Transformer-based small language model, specialized for Python code generation. Its core innovation is training on meticulously curated, "textbook-quality" data, demonstrating that high-quality data can produce capable models without extensive scale.