
Phi-1.5

Parameters: 1.3B
Context Length: 2K
Modality: Text
Architecture: Dense
License: MIT
Release Date: 10 Sept 2023
Knowledge Cutoff: -

Technical Specifications

Attention

Attention Structure: Multi-Head Attention
Attention Heads: 32
Key-Value Heads: 32
Attention Head Dimension: 64
Position Embedding: RoPE
RoPE Theta: 10,000
Sliding Window Attention: No
Sliding Window Size: -

Normalization: RMS Normalization
Activation Function: GELU

Dimensions

Hidden Dimension Size: 2,048
Number of Layers: 24
FFN Intermediate Size (Dense): 8,192
Multi-Token Prediction Heads: -

Tokenizer

Vocabulary Size: 51,200

Architecture Diagram

[Architecture diagram: Input Tokens -> Token Embedding (Position: RoPE; Hidden: 2k, Context: 2K, Vocab: 51.2k) -> 24 x layer block (RMSNorm, pre-attention -> Multi-Head Attention, 32 Q / 32 KV heads, head dim 64 -> residual add -> RMSNorm, pre-FFN -> Feed-Forward Network, GELU, intermediate 8.2k -> residual add) -> Final RMSNorm -> Output Logits]

Phi-1.5

Microsoft's Phi-1.5 is a Transformer-based language model containing 1.3 billion parameters. It was developed to continue the investigation into the capabilities of smaller language models, specifically focusing on common sense reasoning and general knowledge in natural language contexts. The model's design aims to provide the research community with a non-restricted, accessible model to explore challenges associated with large language models, such as reducing toxicity and enhancing controllability.

Phi-1.5 keeps the architecture of its predecessor, Phi-1: a decoder-only Transformer with 24 layers and 32 attention heads, each of dimension 64. It uses Rotary Position Embeddings (RoPE) with a rotary dimension of 32, and Flash Attention to improve training speed and memory efficiency. The main change in Phi-1.5 lies in its training methodology, which relies predominantly on a high-quality, synthetic "textbook-like" dataset. This dataset is stated as 30 billion tokens in total, comprising 7 billion tokens from Phi-1's training data and approximately 20 billion newly generated synthetic tokens intended to teach common sense reasoning and broad knowledge.
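A rotary dimension of 32 against a head dimension of 64 means that only part of each head vector is rotated, while the remaining entries pass through unchanged. The following is a minimal sketch of that idea in plain Python, not the released implementation. The function name `apply_partial_rope` is hypothetical, and the half-split pairing (entry i paired with entry i + 16) is an assumption; an implementation could equally pair adjacent entries.

```python
import math

HEAD_DIM = 64          # per-head dimension stated above
ROTARY_DIM = 32        # rotary dimension stated above
ROPE_THETA = 10_000.0  # base from the specification list


def apply_partial_rope(vec, position, rotary_dim=ROTARY_DIM, theta=ROPE_THETA):
    """Rotate the first `rotary_dim` entries of one head vector; copy the rest.

    Assumption: half-split pairing, i.e. entry i is rotated together with
    entry i + rotary_dim // 2.
    """
    half = rotary_dim // 2
    out = list(vec)  # entries beyond rotary_dim stay as they are
    for i in range(half):
        # Each pair gets its own frequency, decreasing with i.
        angle = position * theta ** (-2.0 * i / rotary_dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + half]
        out[i] = x * cos_a - y * sin_a
        out[i + half] = x * sin_a + y * cos_a
    return out
```

Because each pair is rotated by an angle proportional to the position, the dot product between a rotated query and a rotated key depends only on the distance between their positions, which is the property RoPE is designed to provide.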

Phi-1.5 handles a range of natural language processing tasks, including text generation, question answering, and Python code generation. It is a base model: it has not been fine-tuned for instruction following or with reinforcement learning from human feedback, yet it can produce relevant responses when prompted in QA or chat formats. Despite its compact size, its specialized training regimen lets it perform complex reasoning tasks, which positions it as a tool for research in areas such as in-context learning and addressing model limitations.

About Phi-1.5

Microsoft's Phi-1.5 is a 1.3 billion parameter Transformer model and the successor to Phi-1. It was trained on a curated, synthetic "textbook-quality" dataset aimed at common sense reasoning. The architecture comprises 24 layers and 32 attention heads, and uses rotary position embeddings.



Evaluation Benchmarks

No evaluation benchmarks for Phi-1.5 available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B+

73 / 100

Phi-1.5 Model Integrity Report


Audit Note

Phi-1.5 exhibits a bifurcated transparency profile, offering excellent clarity on its physical architecture and licensing while remaining opaque regarding its training data. The use of a standard MIT license and clear hardware requirements makes it highly accessible for deployment. However, the reliance on unreleased synthetic datasets and the lack of reproducible evaluation scripts for key benchmarks represent significant hurdles for independent verification.

Upstream

21.0 / 30

Architectural Provenance

8.0 / 10

The model's architecture is explicitly documented as a decoder-only Transformer with 24 layers, 32 attention heads (head dimension 64), and an MLP inner dimension of 8192. It utilizes Rotary Position Embeddings (RoPE) with a rotary dimension of 32 and Flash Attention. The technical report 'Textbooks Are All You Need II' provides a clear lineage from the previous Phi-1 model, confirming it is a dense model trained from scratch using a next-word prediction objective. While the high-level architecture is well-defined, specific implementation details for the 'mixformer' variant mentioned in some technical discussions are less comprehensively detailed in the primary paper.

Dataset Composition

4.5 / 10

Microsoft discloses that the training set consists of 30 billion tokens, with a breakdown of 7B tokens from Phi-1 (6B code, 1B synthetic) and 20B new synthetic tokens generated by GPT-3.5. However, the specific 20,000 topics used to seed the synthetic data are not public, and the synthetic dataset itself is not released for audit. The filtering methodology for the code subset (The Stack and StackOverflow) is described but lacks the granularity required for full reproducibility. The reliance on undisclosed synthetic data from a proprietary teacher model (GPT-3.5) creates a significant transparency gap regarding the exact nature of the training distribution.

Tokenizer Integrity

8.5 / 10

The model uses the CodeGenTokenizer (specifically from codegen-mono), which is publicly accessible. The vocabulary size is documented as 51,200, though there is a known technical discrepancy where the tokenizer's internal vocab is 50,257 while the model's embedding layer is padded to 51,200 for GPU efficiency (multiples of 64). This mismatch is documented in community discussions and official config files, allowing for verification. The tokenizer's alignment with the model's coding and natural language focus is well-supported by its origin in the CodeGen family.
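The relationship between the two vocabulary figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the helper names are hypothetical and nothing here calls the real tokenizer.

```python
TOKENIZER_VOCAB = 50_257  # tokenizer vocabulary size quoted above
EMBEDDING_ROWS = 51_200   # embedding table size quoted above
ALIGNMENT = 64            # alignment quoted above


def round_up(n, multiple):
    """Smallest multiple of `multiple` that is greater than or equal to n."""
    return -(-n // multiple) * multiple


def unused_rows(tokenizer_vocab=TOKENIZER_VOCAB, embedding_rows=EMBEDDING_ROWS):
    """Embedding rows that no tokenizer id can index."""
    return embedding_rows - tokenizer_vocab


def is_valid_token_id(token_id, tokenizer_vocab=TOKENIZER_VOCAB):
    """Ids the tokenizer can emit lie in the range [0, tokenizer_vocab)."""
    return 0 <= token_id < tokenizer_vocab
```

With these figures, 51,200 is a multiple of 64 but not the nearest one: `round_up(50_257, 64)` gives 50,304, so the table carries 943 rows that correspond to no token. Decoding code that takes an argmax over all 51,200 logits may therefore want to check ids with something like `is_valid_token_id` before passing them to the tokenizer.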

Model

29.0 / 40

Parameter Density

9.0 / 10

The parameter count is precisely stated as 1.3 billion. As a dense architecture, all parameters are active during inference, and there is no ambiguity regarding sparse or MoE components. The architectural breakdown (layers, heads, dimensions) is clearly provided in the technical report and verifiable via the public configuration files on Hugging Face.
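As a rough cross-check, the headline count can be approximated from the published dimensions. The sketch below assumes a conventional Transformer block (four attention projection matrices and a two-matrix feed-forward network) and counts weight matrices only, ignoring biases and normalization weights. Whether the output head shares weights with the input embedding is not stated here, so it is exposed as a hypothetical `tied_output_head` flag.

```python
def estimate_parameters(hidden=2048, layers=24, ffn=8192, vocab=51_200,
                        tied_output_head=False):
    """Count weight-matrix parameters only (biases and norm weights ignored)."""
    attention = 4 * hidden * hidden   # Q, K, V and output projections
    feed_forward = 2 * hidden * ffn   # up and down projections
    per_layer = attention + feed_forward
    embedding = vocab * hidden        # input token embedding
    output_head = 0 if tied_output_head else vocab * hidden
    return layers * per_layer + embedding + output_head


if __name__ == "__main__":
    print(estimate_parameters(tied_output_head=True))   # 1312817152
    print(estimate_parameters(tied_output_head=False))  # 1417674752
```

Under these assumptions the 24 blocks account for about 1.21 billion parameters, and the total comes to roughly 1.31 billion with a shared output head or 1.42 billion with a separate one.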

Training Compute

7.0 / 10

The technical report and model card provide specific hardware details: training was conducted on 32 NVIDIA A100-40G GPUs over a period of 8 days. This works out to 32 × 8 × 24 = 6,144 GPU hours. While the official report is somewhat brief on environmental impact, third-party research and the provided hardware and time metrics allow for a reasonable estimate of the carbon footprint (estimated at ~90kg CO2e by independent researchers).

Benchmark Reproducibility

4.0 / 10

While standard benchmarks (WinoGrande, ARC, GSM8K, HumanEval) are reported with specific scores, the exact evaluation prompts and few-shot examples are not fully disclosed in the technical report. There are documented difficulties in the research community regarding the reproduction of GSM8K results, with users noting a lack of clarity on the specific evaluation scripts used by Microsoft. The score is further adjusted due to significant concerns regarding benchmark contamination in the synthetic training data.

Identity Consistency

9.0 / 10

Phi-1.5 generally maintains a consistent identity as a research model from Microsoft. It does not suffer from the 'identity crisis' seen in some fine-tuned models that claim to be GPT-4. However, as a base model without instruction tuning, it can occasionally drift into generating text that mimics its training data (textbooks) rather than maintaining a conversational persona, which is a known and documented limitation of its 'base' nature.

Downstream

23.0 / 30

License Clarity

10.0 / 10

The model is released under the MIT License, which is a highly permissive, standard open-source license. This was a notable change from earlier, more restrictive research-only terms, and it is now clearly stated on the official Hugging Face repository and in Microsoft's communications. There are no conflicting terms between the weights and the code.

Hardware Footprint

8.0 / 10

Memory requirements are well-documented by both Microsoft and the community. The model requires approximately 2.6 GB of VRAM for FP16 inference, and detailed requirements for 4-bit quantization (~670 MB) are available. Scaling behavior for context length (up to 2048 tokens) and its impact on VRAM are understood, and the model is widely verified to run on consumer-grade hardware as claimed.
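The figures above follow from simple arithmetic on the parameter count and the attention dimensions. The sketch below is a back-of-the-envelope estimate, assuming decimal gigabytes, 1.3 billion parameters, and an FP16 key-value cache; the function names are hypothetical, and real usage will be higher once activations and framework overhead are included.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def kv_cache_bytes(seq_len, layers=24, kv_heads=32, head_dim=64,
                   bytes_per_value=2, batch=1):
    """Keys and values cached for every layer, head and position."""
    return 2 * batch * layers * kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    print(weight_memory_gb(1.3e9, 16))   # FP16 weights, about 2.6 GB
    print(weight_memory_gb(1.3e9, 4))    # 4-bit weights, about 0.65 GB
    print(kv_cache_bytes(2048) / 2**20)  # 384.0 MiB at the full 2K context
```

The 4-bit estimate of about 650 MB is slightly below the quoted ~670 MB; the difference is presumably quantization metadata such as per-group scales. The cache grows linearly with context length, so halving the context to 1,024 tokens halves the cache to 192 MiB.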

Versioning Drift

5.0 / 10

The model uses basic versioning (Phi-1.5), but there is no formal semantic versioning or detailed changelog for minor weight updates. While the initial release was well-documented, subsequent minor adjustments or the existence of variants like 'phi-1.5-web' (which was not released) create some confusion. There is no established mechanism for tracking silent updates to the weights on the Hugging Face Hub.
