Phi-3-mini

Parameters

3.8B

Context Length

4,096 tokens

Modality

Text

Architecture

Dense

License

MIT

Release Date

22 Apr 2024

Knowledge Cutoff

Oct 2023

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

32

Key-Value Heads

8

Attention Head Dimension

96

Position Embedding

RoPE

RoPE Theta

10,000

Sliding Window Attention

Yes

Sliding Window Size

2,047

Normalization

RMS Normalization

Activation Function

Swish

Dimensions

Hidden Dimension Size

3,072

Number of Layers

32

FFN Intermediate Size (Dense)

8,192

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

32,064

Architecture Diagram

Input Tokens → Token Embedding (Position: RoPE)
Hidden: 3.1k · Context: 4.1k · Vocab: 32.1k
× 32 layers:
  RMSNorm (Pre-Attention) → Grouped-Query Attention (32 Q / 8 KV heads · SW: 2k · Head dim: 96) → + (residual add)
  RMSNorm (Pre-FFN) → Feed-Forward Network (Swish · Intermediate: 8.2k) → + (residual add)
Final RMSNorm → Output Logits

Phi-3-mini

Microsoft's Phi-3-mini is a lightweight small language model (SLM) designed to perform well in resource-constrained environments, including mobile and edge devices. It is a foundational member of the Phi-3 model family and targets scenarios where computational efficiency and low operating cost matter most, which makes capable models usable on a wider range of hardware.

Architecturally, Phi-3-mini is a dense decoder-only Transformer. Its training data is a scaled-up version of the dataset used for Phi-2: heavily filtered publicly available web data plus synthetic "textbook-quality" data, chosen to foster reasoning and knowledge acquisition. Post-training combines Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to improve instruction following, robustness, and safety alignment. The model has a hidden dimension of 3,072, 32 layers, and 32 attention heads, and uses grouped-query attention (GQA) with 8 key-value heads.
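
As a rough illustration of what these dimensions imply for inference memory, the sketch below derives the per-head dimension and a key-value cache estimate from the figures quoted above. It assumes 16-bit cache entries and ignores the sliding window. The function names are hypothetical, and the head counts are passed as parameters so the estimate can be redone against a checkpoint's own configuration.

```python
def head_dim(hidden_size: int, num_heads: int) -> int:
    """Per-head dimension, assuming the hidden size splits evenly across heads."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden_size // num_heads


def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, dim_per_head, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens positions.

    The leading factor 2 covers one key and one value vector
    per key-value head per layer.
    """
    return 2 * num_layers * num_kv_heads * dim_per_head * bytes_per_value * num_tokens


# Figures as listed on this page (assumed accurate for this sketch).
HIDDEN, LAYERS, Q_HEADS, KV_HEADS = 3072, 32, 32, 8

d = head_dim(HIDDEN, Q_HEADS)                         # 96
per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, d)    # 98,304 bytes (96 KiB)
full_ctx = kv_cache_bytes(4096, LAYERS, KV_HEADS, d)  # cache for a full 4K context
print(d, per_token, full_ctx / 2**20)                 # 96 98304 384.0
```

Calling `kv_cache_bytes` with 32 key-value heads instead of 8 returns four times as much, which is the saving grouped-query attention is meant to provide.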

Phi-3-mini is primarily intended for broad commercial and research applications that require strong reasoning abilities, particularly in mathematics and logic. Its compact size suits latency-bound scenarios and hardware with limited memory and compute, such as mobile phones and IoT devices. The model is available in two context-length variants: a default 4K-token version and a 128K-token version (Phi-3-mini-128K), which uses LongRoPE to extend the context window. These characteristics make it suitable for use cases ranging from general-purpose AI systems to specialized applications that require efficient local inference.

About Phi-3

Microsoft's Phi-3 models are small language models designed for efficient operation on resource-constrained devices. They use a Transformer decoder architecture and are trained on extensively filtered, high-quality data, including synthetic data. This approach yields a compact yet capable model family.


Evaluation Benchmarks

Rank

#152

Benchmark | Score | Rank
WebDev Arena (Web Development) | 1143 | 87

Rankings

Overall Rank

#152

Coding Rank

#112

Model Integrity

Total Score

B+

72 / 100

Phi-3-mini Model Integrity Report

Audit Note

Phi-3-mini exhibits strong transparency regarding its physical architecture, licensing, and hardware requirements, making it highly accessible for local deployment. However, it maintains significant opacity concerning its training data composition and the specific methodologies used for its benchmark evaluations. While the use of the MIT license is exemplary, the reliance on proprietary internal tools for performance claims and the lack of detail on synthetic data generation limit its overall transparency profile.

Upstream

20.5 / 30

Architectural Provenance

8.0 / 10

The Phi-3-mini architecture is explicitly documented in the official technical report as a dense decoder-only Transformer with a hidden dimension of 3,072, 32 layers, and 32 attention heads. Notably, it adopts the Llama-2 block structure to ensure compatibility with existing community tools. The report details the use of Grouped-Query Attention (GQA) with 4 queries sharing 1 key, and the 128K variant's use of LongRoPE for context extension is well documented. While the high-level methodology is clear, specific hyperparameter tuning details for the pretraining phase are less exhaustive than for the 7B variant.

Dataset Composition

4.0 / 10

Microsoft discloses that the model was trained on 3.3 trillion tokens (later updated to 4.9T for some variants) consisting of 'textbook-quality' synthetic data and heavily filtered web data. However, the specific proportions of these sources are not provided. The filtering criteria ('educational level') are described conceptually but the exact implementation, specific web domains, or the identity of the models used to generate the synthetic data remain proprietary. No sample data or detailed breakdown of the 3.3T tokens is publicly available.

Tokenizer Integrity

8.5 / 10

The model uses the Llama-2 tokenizer with a vocabulary size of 32,064 tokens, which is publicly accessible and well-documented. The technical report explicitly mentions the removal of BoS tokens and the addition of specific chat template tokens (<|system|>, <|user|>, <|end|>, <|assistant|>). The tokenizer is available via the Hugging Face transformers library, allowing for full inspection and verification of tokenization behavior.
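
To make the role of these tokens concrete, here is a minimal sketch that assembles a prompt string from a list of messages. Only the four token strings come from the text above. The `build_prompt` helper, the newline placement, and the trailing `<|assistant|>` generation marker are assumptions for illustration; the chat template bundled with the tokenizer remains the authority on exact formatting.

```python
# Token strings quoted in the audit text above.
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_TOKEN = "<|end|>"


def build_prompt(messages):
    """Join chat messages into one prompt string (hypothetical layout)."""
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ROLE_TOKENS:
            raise ValueError(f"unknown role: {role!r}")
        # Assumed layout: role token, newline, content, end token, newline.
        parts.append(f"{ROLE_TOKENS[role]}\n{message['content']}{END_TOKEN}\n")
    # Assumed generation marker: open an assistant turn for the model to fill.
    parts.append(f"{ROLE_TOKENS['assistant']}\n")
    return "".join(parts)


prompt = build_prompt([
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "What is 2 + 2?"},
])
print(prompt)
```

Comparing the result against the output of the tokenizer's own chat template is a quick way to check whether the assumed layout matches a given checkpoint.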

Model

28.5 / 40

Parameter Density

9.0 / 10

The parameter count is precisely stated as 3.8 billion. As a dense model, all parameters are active during inference, and this is clearly communicated. The architectural breakdown (layers, heads, hidden dimensions) is fully disclosed in the technical report, providing a clear map of parameter distribution across the model's components.

Training Compute

7.0 / 10

Microsoft provides specific hardware and duration details: the model was trained on 512 NVIDIA H100-80G GPUs over a period of 10 days. This allows for a reasonable estimation of total compute (approx. 122,880 GPU hours). However, the report lacks a formal carbon footprint calculation or a detailed breakdown of energy consumption, which are required for a perfect score in this category.
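
The GPU-hour figure follows directly from the stated hardware and duration. The short sketch below reproduces the arithmetic, assuming all 512 GPUs were in use around the clock for the full 10 days; the `gpu_hours` helper is hypothetical.

```python
def gpu_hours(num_gpus: int, days: float, hours_per_day: float = 24.0) -> float:
    """Total GPU hours, assuming every GPU runs for the whole period."""
    return num_gpus * days * hours_per_day


total = gpu_hours(512, 10)
print(f"{total:,.0f} GPU hours")  # 122,880 GPU hours
```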

Benchmark Reproducibility

3.5 / 10

While the technical report lists scores across numerous standard benchmarks (MMLU, GSM8K, HumanEval), it admits that the evaluations were conducted using a 'Microsoft internal tool.' The exact prompts and few-shot examples are not fully disclosed, and the evaluation code is not public. Independent researchers have noted significant performance discrepancies when using standard evaluation harnesses like LM Eval Harness compared to the reported figures.

Identity Consistency

9.0 / 10

The model consistently identifies itself as a Microsoft-developed AI and is transparent about its versioning (e.g., distinguishing between 4K and 128K variants). It generally acknowledges its limitations as a small language model, particularly regarding its knowledge cutoff (October 2023) and its primary optimization for English and reasoning tasks.

Downstream

23.0 / 30

License Clarity

10.0 / 10

The model weights and associated code are released under the highly permissive MIT License. This is explicitly stated on the official Hugging Face repository and in the technical report. There are no conflicting commercial restrictions or 'open-ish' terms; it is a standard, legally clear open-source license that allows for broad commercial and research use.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented for various deployment scenarios. Microsoft and third-party partners (like ONNX Runtime) provide VRAM estimates for FP16 (~8GB) and 4-bit quantization (~1.8GB to 2GB). The impact of quantization (AWQ vs RTN) on accuracy is discussed in technical blogs, and the model's ability to run on specific mobile hardware (iPhone 14) is verified with performance metrics (12 tokens/sec).
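
These VRAM estimates are consistent with a simple weights-only calculation, sketched below. It assumes 3.8 billion parameters and counts only weight storage, so activations, the key-value cache, and quantization metadata such as scales are excluded. The `weight_memory_gb` helper is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 3.8e9  # parameter count stated on this page

fp16 = weight_memory_gb(PARAMS, 16)
int4 = weight_memory_gb(PARAMS, 4)
print(f"FP16: {fp16:.1f} GB, 4-bit: {int4:.1f} GB")  # FP16: 7.6 GB, 4-bit: 1.9 GB
```

The FP16 result sits just under the quoted ~8GB, and the 4-bit result falls inside the quoted 1.8GB to 2GB range.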

Versioning Drift

5.0 / 10

Microsoft has released multiple versions (initial April release, June update, and the subsequent Phi-3.5 series), but they do not consistently use strict semantic versioning. While 'Release Notes' are provided on Hugging Face, they often lack a detailed changelog of specific weight adjustments or data recipe changes. Users have reported 'silent' updates where model behavior changed without a corresponding version increment in the metadata.
