Parameters: 1.3B
Context Length: 2K
Modality: Text
Architecture: Dense
License: MIT
Release Date: 10 Sept 2023
Knowledge Cutoff: -

Attention
Attention Structure: Multi-Head Attention
Attention Heads: 32
Key-Value Heads: 32
Attention Head Dimension: 64
Position Embedding: RoPE
RoPE Theta: 10,000
Sliding Window Attention: No
Sliding Window Size: -
Normalization: Layer Normalization
Activation Function: GELU

Dimensions
Hidden Dimension Size: 2,048
Number of Layers: 24
FFN Intermediate Size (Dense): 8,192
Multi-Token Prediction Heads: -

Tokenizer
Vocabulary Size: 51,200
Microsoft's Phi-1.5 is a Transformer-based language model containing 1.3 billion parameters. It was developed to continue the investigation into the capabilities of smaller language models, specifically focusing on common sense reasoning and general knowledge in natural language contexts. The model's design aims to provide the research community with a non-restricted, accessible model to explore challenges associated with large language models, such as reducing toxicity and enhancing controllability.
Phi-1.5 shares the architecture of its predecessor, Phi-1: a decoder-only Transformer with 24 layers and 32 attention heads, each of dimension 64. It uses Rotary Position Embeddings (RoPE) with a rotary dimension of 32, so only half of each head's dimensions are rotated, and it uses Flash Attention to improve training speed and memory efficiency.
The main innovation in Phi-1.5 is its training methodology, which relies predominantly on high-quality, synthetic "textbook-like" data. The dataset, reported at roughly 30 billion tokens, combines about 7 billion tokens from Phi-1's training data with approximately 20 billion newly generated synthetic tokens intended to impart common sense reasoning and broad general knowledge.
Phi-1.5 handles a range of natural language processing tasks, including text generation, question answering, and Python code generation. It is a base model that has not been fine-tuned for instruction following or trained with reinforcement learning from human feedback, yet it can produce relevant responses in QA and chat formats. Despite its compact size, its specialized training regimen lets it handle multi-step reasoning tasks, which makes it a practical tool for research on in-context learning and on mitigating model limitations.
Microsoft's Phi-1.5 is a 1.3-billion-parameter Transformer model and the successor to Phi-1. It was trained on a curated, synthetic "textbook-quality" dataset aimed at common sense reasoning. The architecture comprises 24 layers and 32 attention heads and uses rotary position embeddings.
No evaluation benchmarks for Phi-1.5 available.
Overall Rank: -
Coding Rank: -
Total Score: 73 / 100
Phi-1.5 exhibits a bifurcated transparency profile, offering excellent clarity on its physical architecture and licensing while remaining opaque regarding its training data. The use of a standard MIT license and clear hardware requirements makes it highly accessible for deployment. However, the reliance on unreleased synthetic datasets and the lack of reproducible evaluation scripts for key benchmarks represent significant hurdles for independent verification.
Architectural Provenance
The model's architecture is explicitly documented as a decoder-only Transformer with 24 layers, 32 attention heads (head dimension 64), and an MLP inner dimension of 8192. It utilizes Rotary Position Embeddings (RoPE) with a rotary dimension of 32 and Flash Attention. The technical report 'Textbooks Are All You Need II' provides a clear lineage from the previous Phi-1 model, confirming it is a dense model trained from scratch using a next-word prediction objective. While the high-level architecture is well-defined, specific implementation details for the 'mixformer' variant mentioned in some technical discussions are less comprehensively detailed in the primary paper.
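The partial rotary scheme described above (a rotary dimension of 32 inside a 64-dimensional head, with the RoPE base of 10,000 from the spec table) can be illustrated with a minimal pure-Python sketch. The function names are hypothetical, and the pairing of dimension i with dimension i + 16 (the "rotate half" convention) is an assumption; other implementations interleave the pairs instead. This is not taken from Microsoft's code.

```python
import math

HEAD_DIM = 64          # per-head dimension stated in the text
ROTARY_DIM = 32        # rotary dimension stated in the text
ROPE_THETA = 10_000.0  # RoPE base from the spec table


def rotary_angles(position, rotary_dim=ROTARY_DIM, theta=ROPE_THETA):
    """Return one rotation angle per rotated pair of dimensions."""
    return [position * theta ** (-2.0 * i / rotary_dim)
            for i in range(rotary_dim // 2)]


def apply_partial_rope(vec, position, rotary_dim=ROTARY_DIM):
    """Rotate the first `rotary_dim` entries of `vec`; pass the rest through.

    Assumption: dimension i is paired with dimension i + rotary_dim // 2.
    """
    half = rotary_dim // 2
    out = list(vec)
    for i, angle in enumerate(rotary_angles(position, rotary_dim)):
        x, y = vec[i], vec[i + half]
        out[i] = x * math.cos(angle) - y * math.sin(angle)
        out[i + half] = x * math.sin(angle) + y * math.cos(angle)
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))
```

Under these assumptions, the last 32 entries of a head vector are left unchanged, and the dot product between a rotated query and a rotated key depends only on the distance between their positions, which is the property that makes RoPE a relative position encoding.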
Dataset Composition
Microsoft discloses that the training set consists of 30 billion tokens, with a breakdown of 7B tokens from Phi-1 (6B code, 1B synthetic) and 20B new synthetic tokens generated by GPT-3.5. However, the specific 20,000 topics used to seed the synthetic data are not public, and the synthetic dataset itself is not released for audit. The filtering methodology for the code subset (The Stack and StackOverflow) is described but lacks the granularity required for full reproducibility. The reliance on undisclosed synthetic data from a proprietary teacher model (GPT-3.5) creates a significant transparency gap regarding the exact nature of the training distribution.
Tokenizer Integrity
The model uses the CodeGenTokenizer (specifically from codegen-mono), which is publicly accessible. The vocabulary size is documented as 51,200, though there is a known technical discrepancy where the tokenizer's internal vocab is 50,257 while the model's embedding layer is padded to 51,200 for GPU efficiency (multiples of 64). This mismatch is documented in community discussions and official config files, allowing for verification. The tokenizer's alignment with the model's coding and natural language focus is well-supported by its origin in the CodeGen family.
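The gap between the two vocabulary figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers from the text; the helper names are hypothetical and nothing here calls the real tokenizer.

```python
TOKENIZER_VOCAB = 50_257  # tokenizer vocabulary size quoted in the text
EMBEDDING_ROWS = 51_200   # embedding-layer size quoted in the text
ALIGNMENT = 64            # multiple cited in the text for GPU efficiency


def unused_embedding_rows(tokenizer_vocab, embedding_rows):
    """Rows in the embedding matrix that no tokenizer id maps to."""
    if embedding_rows < tokenizer_vocab:
        raise ValueError("embedding matrix is smaller than the tokenizer vocabulary")
    return embedding_rows - tokenizer_vocab


def is_valid_token_id(token_id, tokenizer_vocab=TOKENIZER_VOCAB):
    """True if `token_id` falls inside the tokenizer's vocabulary."""
    return 0 <= token_id < tokenizer_vocab


print(EMBEDDING_ROWS % ALIGNMENT == 0)                         # True: 51,200 = 800 * 64
print(unused_embedding_rows(TOKENIZER_VOCAB, EMBEDDING_ROWS))  # 943
```

Note that 51,200 is a multiple of 64 but not the smallest one above 50,257 (that would be 50,304), so alignment alone does not determine the padded size. A guard such as `is_valid_token_id` is one way to catch sampled ids that land in the padded range.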
Parameter Density
The parameter count is precisely stated as 1.3 billion. As a dense architecture, all parameters are active during inference, and there is no ambiguity regarding sparse or MoE components. The architectural breakdown (layers, heads, dimensions) is clearly provided in the technical report and verifiable via the public configuration files on Hugging Face.
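As a rough cross-check of the stated dimensions, the weight matrices can be counted directly. This is a back-of-the-envelope sketch that ignores biases and normalization weights, and whether the output head shares weights with the input embedding is an assumption exposed as a flag; the function name is hypothetical.

```python
HIDDEN = 2_048     # hidden dimension size from the spec table
LAYERS = 24        # number of layers
FFN_INNER = 8_192  # MLP inner dimension
VOCAB = 51_200     # embedding rows


def estimate_parameters(hidden, layers, ffn_inner, vocab, tied_output_head):
    """Count weight-matrix parameters only (biases and norm weights ignored)."""
    embedding = vocab * hidden
    attention = 4 * hidden * hidden  # Q, K, V and output projections (full multi-head)
    mlp = 2 * hidden * ffn_inner     # up and down projections
    output_head = 0 if tied_output_head else vocab * hidden
    return embedding + layers * (attention + mlp) + output_head


print(estimate_parameters(HIDDEN, LAYERS, FFN_INNER, VOCAB, True))   # 1312817152
print(estimate_parameters(HIDDEN, LAYERS, FFN_INNER, VOCAB, False))  # 1417674752
```

With a single embedding matrix the estimate lands at about 1.31 billion, in line with the nominal 1.3B figure; counting a separate output head raises it to about 1.42 billion.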
Training Compute
The technical report and model card provide specific hardware details: training was conducted on 32 NVIDIA A100-40G GPUs over a period of 8 days. This allows for a direct calculation of 6,144 GPU hours (32 GPUs × 8 days × 24 hours). While the official report is somewhat brief on environmental impact, third-party research and the provided hardware/time metrics allow for reasonable estimation of the carbon footprint (estimated at ~90kg CO2e by independent researchers).
Benchmark Reproducibility
While standard benchmarks (WinoGrande, ARC, GSM8K, HumanEval) are reported with specific scores, the exact evaluation prompts and few-shot examples are not fully disclosed in the technical report. There are documented difficulties in the research community regarding the reproduction of GSM8K results, with users noting a lack of clarity on the specific evaluation scripts used by Microsoft. The score is further adjusted due to significant concerns regarding benchmark contamination in the synthetic training data.
Identity Consistency
Phi-1.5 generally maintains a consistent identity as a research model from Microsoft. It does not suffer from the 'identity crisis' seen in some fine-tuned models that claim to be GPT-4. However, as a base model without instruction tuning, it can occasionally drift into generating text that mimics its training data (textbooks) rather than maintaining a conversational persona, which is a known and documented limitation of its 'base' nature.
License Clarity
The model is released under the MIT License, which is a highly permissive, standard open-source license. This was a notable change from earlier, more restrictive research-only terms, and it is now clearly stated on the official Hugging Face repository and in Microsoft's communications. There are no conflicting terms between the weights and the code.
Hardware Footprint
Memory requirements are well-documented by both Microsoft and the community. The model requires approximately 2.6 GB of VRAM for FP16 inference, and detailed requirements for 4-bit quantization (~670 MB) are available. Scaling behavior for context length (up to 2048 tokens) and its impact on VRAM are understood, and the model is widely verified to run on consumer-grade hardware as claimed.
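The figures above follow from simple arithmetic on the parameter count and the attention dimensions. The sketch below is an estimate only: it uses the nominal 1.3B parameter count, assumes an FP16 key-value cache and a batch size of 1, and ignores activations, framework overhead, and any quantization metadata. The function names are hypothetical.

```python
PARAMS = 1.3e9  # nominal parameter count
LAYERS = 24
KV_HEADS = 32
HEAD_DIM = 64


def weight_bytes(params, bits_per_weight):
    """Memory needed to store the weights at a given precision."""
    return params * bits_per_weight / 8


def kv_cache_bytes(seq_len, layers=LAYERS, kv_heads=KV_HEADS,
                   head_dim=HEAD_DIM, bytes_per_value=2, batch=1):
    """Key-value cache size; the leading 2 covers keys and values."""
    return 2 * batch * layers * kv_heads * head_dim * seq_len * bytes_per_value


print(weight_bytes(PARAMS, 16) / 1e9)  # 2.6 (GB, FP16)
print(weight_bytes(PARAMS, 4) / 1e6)   # 650.0 (MB, 4-bit)
print(kv_cache_bytes(2048))            # 402653184 bytes at the full 2,048-token context
```

The raw 4-bit estimate of 650 MB sits slightly below the ~670 MB quoted above, which this sketch does not attempt to account for. Under the stated assumptions the cache grows linearly with context length, reaching roughly 0.4 GB at 2,048 tokens.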
Versioning Drift
The model uses basic versioning (Phi-1.5), but there is no formal semantic versioning or detailed changelog for minor weight updates. While the initial release was well-documented, subsequent minor adjustments or the existence of variants like 'phi-1.5-web' (which was not released) create some confusion. There is no established mechanism for tracking silent updates to the weights on the Hugging Face Hub.