Phi-3-small

Parameters

7B

Context Length

8,192 tokens

Modality

Text

Architecture

Dense

License

MIT License

Release Date

22 Apr 2024

Knowledge Cutoff

Oct 2023

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

32

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

RoPE

RoPE Theta

1,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

-

Activation Function

Gated GELU

Dimensions

Hidden Dimension Size

4,096

Number of Layers

32

FFN Intermediate Size (Dense)

14,336

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

100,352

Architecture Diagram

Input Tokens -> Token Embedding (Position: RoPE; Hidden: 4.1k; Context: 8.2k; Vocab: 100.4k)
-> 32 layers of: Pre-Attention Norm -> Grouped-Query Attention (32 Q / 8 KV heads; head dim: 128) -> + (residual add) -> Pre-FFN Norm -> Feed-Forward Network (Gated GELU; intermediate: 14.3k) -> + (residual add)
-> Final Norm -> Output Logits

Phi-3-small

Microsoft's Phi-3-small is a member of the Phi family of small language models (SLMs), engineered to deliver high performance within a compact computational footprint. This model variant, with 7 billion parameters, is positioned for broad commercial and research applications where resource efficiency and responsiveness are critical. It addresses scenarios demanding robust language understanding, logical reasoning, and efficient processing on constrained hardware environments, including on-device deployments.

Phi-3-small is a dense, decoder-only Transformer with several design choices aimed at memory efficiency. It uses Grouped-Query Attention (GQA), in which four query heads share a single key-value head (32 query heads, 8 key-value heads), reducing the KV cache footprint. It also alternates dense and blocksparse attention layers, which further reduces KV cache memory while preserving long-context retrieval. Post-training consists of Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), intended to align the model with human preferences and safety guidelines.
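To make the KV cache saving from GQA concrete, here is a minimal sketch that compares the cache size at the full 8,192-token context against a hypothetical baseline with one key-value head per query head. It assumes FP16 values (2 bytes each) and the head dimension of 128 shown in the architecture diagram, and it ignores the additional savings from the blocksparse layers, so treat the numbers as an upper bound rather than a measured figure.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3

gqa = kv_cache_bytes(8192)                   # 8 KV heads, as listed in the specs
mha = kv_cache_bytes(8192, num_kv_heads=32)  # hypothetical: 1 KV head per query head

print(f"GQA: {gqa / GIB:.2f} GiB")           # GQA: 1.00 GiB
print(f"MHA baseline: {mha / GIB:.2f} GiB")  # MHA baseline: 4.00 GiB
print(f"Reduction: {mha // gqa}x")           # Reduction: 4x
```

The 4x reduction follows directly from the 32:8 ratio of query heads to key-value heads.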

Phi-3-small has a default context length of 8,192 tokens (8K); an extended variant supports up to 128,000 tokens using LongRoPE. The model was trained on 4.8 trillion tokens drawn from rigorously filtered public documents, high-quality educational content, and synthetically generated data, with an emphasis on data quality and reasoning density. It targets tasks such as language understanding, mathematical problem-solving, and code generation, and is intended for deployment on hardware ranging from cloud inference servers to edge and mobile devices.

About Phi-3

Microsoft's Phi-3 models are small language models designed for efficient operation on resource-constrained devices. They use a Transformer decoder architecture and are trained on heavily filtered, high-quality data, including synthetic data. This approach yields a compact yet capable model family.


Evaluation Benchmarks

Rank

#149

Category          Benchmark      Score   Rank
Web Development   WebDev Arena   1171    84

Rankings

Overall Rank

#149

Coding Rank

#106

Model Integrity

Total Score

B

66 / 100

Phi-3-small Model Integrity Report

Audit Note

Phi-3-small demonstrates strong transparency in its architectural design and licensing, providing a detailed technical report and a permissive MIT license. However, it remains opaque regarding the specific composition of its 4.8T token training set and relies on internal, non-public evaluation tools that limit benchmark reproducibility. The model's transparency profile is that of a 'weights-available' corporate product rather than a fully open-science project.

Upstream

20.0 / 30

Architectural Provenance

7.5 / 10

Microsoft provides a detailed technical report (arXiv:2404.14219) specifying that Phi-3-small is a dense decoder-only Transformer with 32 layers and a hidden size of 4096. It explicitly documents the use of Grouped Query Attention (GQA) with 4 queries per key and a unique alternating pattern of dense and blocksparse attention layers to optimize the KV cache. While the base architecture is well-documented, the specific 'blocksparse' implementation details are described at a high level without full source code for the custom kernels used in training.

Dataset Composition

4.0 / 10

The model was trained on 4.8 trillion tokens. Documentation mentions three main categories: 1) filtered public web data, 2) high-quality educational/code data, and 3) synthetic 'textbook-like' data. However, Microsoft does not provide a specific percentage breakdown of these sources (beyond a mention of 10% multilingual data) or name the specific datasets used. The 'synthetic data' generation process is described conceptually but lacks the transparency required for verification or reproduction of the data mix.

Tokenizer Integrity

8.5 / 10

Phi-3-small uses a tiktoken-based tokenizer with a vocabulary size of 100,352, a significant departure from the Llama-based tokenizer used in Phi-3-mini. The tokenizer is publicly accessible via Hugging Face, and the vocabulary size is clearly stated in the technical report. It is documented as being optimized for multilingual support, though the relationship between the tokenizer's training data and the model's 4.8T token corpus is not fully disclosed.

Model

24.5 / 40

Parameter Density

7.0 / 10

The model is clearly identified as a 7B parameter dense model. Microsoft provides a structural breakdown (32 layers, 32 heads, 4096 hidden dimension). While it uses blocksparse attention, it is not a Mixture-of-Experts (MoE) model, so the distinction between total and active parameters is not applicable here. The documentation is clear, though it lacks a precise parameter count beyond the '7B' marketing label (e.g., 7.39B).
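As a sanity check on the '7B' label, the listed dimensions can be turned into an approximate parameter count. This is a sketch, not Microsoft's accounting: it assumes tied input and output embeddings, a gated feed-forward block with three weight matrices (gate, up, down), and a head dimension of hidden size / query heads, and it ignores normalization weights and biases.

```python
# Dimensions taken from the specification table above.
VOCAB, HIDDEN, LAYERS = 100_352, 4_096, 32
Q_HEADS, KV_HEADS, FFN = 32, 8, 14_336
HEAD_DIM = HIDDEN // Q_HEADS  # 128


def estimate_parameters():
    """Approximate weight count; norms and biases are ignored."""
    embedding = VOCAB * HIDDEN  # assumed tied with the output projection
    attention = (
        2 * HIDDEN * Q_HEADS * HEAD_DIM     # query and output projections
        + 2 * HIDDEN * KV_HEADS * HEAD_DIM  # key and value projections (GQA)
    )
    ffn = 3 * HIDDEN * FFN  # assumed gate, up, and down matrices
    return embedding + LAYERS * (attention + ffn)


total = estimate_parameters()
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
# 7,390,363,648 parameters (~7.39B)
```

Under these assumptions the estimate lands close to the 7.39B figure mentioned above.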

Training Compute

5.0 / 10

Microsoft discloses the hardware used (1024 H100-80G GPUs) and the training duration (18 days), which allows a rough estimate of compute resources. However, Microsoft does not provide a calculated carbon footprint or energy efficiency metrics for the cluster. The information is better than for most proprietary models but lacks the environmental transparency seen in exemplary open-science projects.
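The rough estimate these disclosures allow might look like the sketch below. GPU-hours follow directly from the disclosed numbers. The FLOP figure uses the common 6 x parameters x tokens rule of thumb for dense Transformers, which is an assumption introduced here and not a figure published by Microsoft; the nominal 7B parameter count is used.

```python
GPUS = 1024      # H100-80G GPUs, as disclosed
DAYS = 18        # training duration, as disclosed
PARAMS = 7e9     # nominal parameter count
TOKENS = 4.8e12  # training tokens

# Total accelerator time if every GPU ran for the whole period.
gpu_hours = GPUS * DAYS * 24

# Rule-of-thumb training cost (assumption): about 6 FLOPs per parameter per token.
train_flops = 6 * PARAMS * TOKENS

print(f"{gpu_hours:,} GPU-hours")
print(f"~{train_flops:.2e} training FLOPs")
```

Neither number says anything about energy use or emissions, which would need power draw and grid data that are not disclosed.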

Benchmark Reproducibility

3.5 / 10

While Microsoft reports scores on standard benchmarks (MMLU, GSM8K, etc.) in the technical report, they explicitly state that the prompts and few-shot examples are part of an 'internal tool' and are not fully public. This significantly hinders third-party reproduction. Furthermore, independent research has highlighted significant performance gaps when using different evaluation pipelines, suggesting the reported numbers are highly sensitive to the undisclosed internal settings.

Identity Consistency

9.0 / 10

The model consistently identifies itself as a Microsoft Phi-3 model in system prompts and documentation. It maintains a clear versioning identity within the Phi-3 family (Small vs. Mini vs. Medium). There are no documented cases of the model claiming to be a competitor's product (like GPT-4) or denying its nature as an AI developed by Microsoft.

Downstream

21.5 / 30

License Clarity

9.5 / 10

Phi-3-small is released under the highly permissive MIT License, which is clearly stated on the official Hugging Face repository and Microsoft's blog. The license allows for commercial use, modification, and distribution with minimal restrictions. There are no conflicting 'non-commercial' clauses in the primary license text for the weights.

Hardware Footprint

7.0 / 10

Microsoft and partners (like NVIDIA) provide VRAM requirements for various deployment scenarios. Documentation exists for FP16 and INT4 (via ONNX/DirectML) requirements. The impact of LongRoPE for 128K context on memory is discussed, though detailed scaling tables for VRAM vs. context length are primarily provided by third-party community benchmarks rather than a single comprehensive official source.
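For a first-order sense of how precision affects the footprint, the weight memory alone can be estimated from the parameter count. This sketch assumes the nominal 7B parameters and idealized storage of 2, 1, and 0.5 bytes per weight; real quantized formats add scales and other metadata, and activations and the KV cache come on top, so official or measured figures will be higher.

```python
# Idealized storage cost per weight (assumption: no quantization metadata).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, precision):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"{precision}: ~{weight_memory_gb(7e9, precision):.1f} GB")
# fp16: ~14.0 GB
# int8: ~7.0 GB
# int4: ~3.5 GB
```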

Versioning Drift

5.0 / 10

Microsoft uses a naming convention (e.g., Phi-3-small-8k-instruct) but lacks a strict semantic versioning system for weight updates. While they released a 'June 2024 Update' with a changelog on Hugging Face, updates are often delivered as new model cards rather than tracked versions of a single artifact. This makes it difficult for developers to track silent changes or roll back to specific sub-versions without manual commit tracking.
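One way to do the manual commit tracking described above is to pin a specific repository commit when loading the model. The sketch below assumes the Hugging Face `transformers` library and its `revision` argument to `from_pretrained`. The helper names `pinned_kwargs` and `load_pinned` are hypothetical, and the commit hash must be copied from the repository's own history; none is supplied here.

```python
import re

MODEL_ID = "microsoft/Phi-3-small-8k-instruct"


def pinned_kwargs(commit_sha):
    """Build from_pretrained keyword arguments that pin one exact commit."""
    # Require a full hash so a moving branch name such as "main" is rejected.
    if not re.fullmatch(r"[0-9a-f]{40}", commit_sha):
        raise ValueError("expected a full 40-character commit hash")
    return {"revision": commit_sha, "trust_remote_code": True}


def load_pinned(commit_sha):
    """Load tokenizer and model at one fixed commit (downloads the weights)."""
    # Imported lazily so pinned_kwargs works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    kwargs = pinned_kwargs(commit_sha)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, **kwargs)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, **kwargs)
    return tokenizer, model
```

Recording the pinned hash alongside application code gives a rollback point even when the model card changes.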

Phi-3-small: Specifications and GPU VRAM Requirements