Phi-4-Mini

Parameters

3.8B

Context Length

128K

Modality

Text

Architecture

Dense

License

MIT

Release Date

27 Feb 2025

Knowledge Cutoff

Jun 2024

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

24

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

RoPE

RoPE Theta

10,000

Sliding Window Attention

Yes

Sliding Window Size

262,144

Normalization

RMS Normalization

Activation Function

Swish

Dimensions

Hidden Dimension Size

3,072

Number of Layers

32

FFN Intermediate Size (Dense)

8,192

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

200,064

Architecture Diagram

[Figure: architecture diagram. Input Tokens → Token Embedding (position: RoPE; hidden: 3,072; context: 128K; vocab: 200,064) → 32 × [RMSNorm (pre-attention) → Grouped-Query Attention (24 Q / 8 KV heads, sliding window 262,144, head dim 128) → residual add → RMSNorm (pre-FFN) → Feed-Forward Network (Swish, intermediate 8,192) → residual add] → Final RMSNorm → Output Logits]

Phi-4-Mini

Microsoft Phi-4-Mini is a lightweight, open model from the Phi-4 family, engineered to run efficiently in resource-constrained environments. It is trained on a combination of high-quality synthetic data and filtered public web content, with particular emphasis on reasoning-dense data. Its core architecture is a dense, decoder-only Transformer that uses grouped-query attention (GQA) to speed up inference and LongRoPE positional encoding to handle extended context lengths. The model has an expanded vocabulary of 200,064 tokens, which supports broad multilingual use.
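To make the grouped-query layout concrete, the sketch below maps each of the 24 query heads to one of the 8 key/value heads listed in the specifications. The contiguous grouping (query heads 0 to 2 share KV head 0, and so on) is an assumption based on the common convention, and the names `kv_head_for` and `GROUP_SIZE` are illustrative rather than part of any Phi-4 API.

```python
# Head counts taken from the specification table on this page.
N_QUERY_HEADS = 24
N_KV_HEADS = 8

# Number of query heads that share a single key/value head.
GROUP_SIZE = N_QUERY_HEADS // N_KV_HEADS


def kv_head_for(query_head: int) -> int:
    """Return the KV head a query head reads from.

    Assumes contiguous grouping, which is a convention and not
    something this page confirms for Phi-4-Mini.
    """
    if not 0 <= query_head < N_QUERY_HEADS:
        raise ValueError("query head index out of range")
    return query_head // GROUP_SIZE


# Build the full mapping: KV head -> list of query heads that share it.
groups = {
    kv: [q for q in range(N_QUERY_HEADS) if kv_head_for(q) == kv]
    for kv in range(N_KV_HEADS)
}

if __name__ == "__main__":
    for kv, queries in groups.items():
        print(f"KV head {kv}: query heads {queries}")
```

With these head counts, three query heads share each key/value head, so only a third as many key and value tensors need to be cached as with one KV head per query head.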

Phi-4-Mini's post-training combines supervised fine-tuning (SFT) and direct preference optimization (DPO) with reinforcement learning from human feedback (RLHF) to improve instruction adherence and safety. This methodology gives the model strong reasoning capabilities, particularly in mathematical and logical tasks, and support for function calling. The design prioritizes computational efficiency and low latency, making it suitable for deployment where memory and processing power are limited.

The intended use cases for Phi-4-Mini span general-purpose AI systems and applications that require strong reasoning in memory- or compute-constrained environments, or that have latency-bound requirements. It is designed to accelerate research in language models and to serve as a building block for generative AI features. The model's compact size and optimized architecture allow deployment on edge devices, including various mobile operating systems, using tools such as Microsoft Olive and the ONNX GenAI Runtime.

About Phi-4

The Microsoft Phi-4 model family comprises small language models prioritizing efficient, high-capability reasoning. Its development emphasizes robust data quality and sophisticated synthetic data integration. This approach enables enhanced performance and on-device deployment capabilities.


Evaluation Benchmarks

Rank

#121

Category | Benchmark | Score | Rank
General Knowledge | MMLU | 0.673 | 32

Rankings

Overall Rank

#121

Coding Rank

-

Model Integrity

Total Score

B+

75 / 100

Phi-4-Mini Model Integrity Report

Audit Note

Phi-4-Mini is highly transparent about its architecture and licensing: it uses a standard MIT license and discloses specific training hardware details. While it offers a clear technical breakdown of its Transformer structure and tokenizer, it is less transparent about the composition and sources of its 5-trillion-token training mixture. The model's documentation is evidence-based and avoids most marketing vagueness, though it relies on proprietary synthetic data processes that limit full upstream auditability.

Upstream

22.0 / 30

Architectural Provenance

8.0 / 10

The model is explicitly documented as a dense decoder-only Transformer with 3.8 billion parameters. Microsoft provides a technical report detailing specific architectural choices, including the use of 32 Transformer layers, a hidden state size of 3,072, and Grouped-Query Attention (GQA) with 24 query heads and 8 key/value heads. It also documents the use of LongRoPE for context extension and tied input/output embeddings. The training methodology, including supervised fine-tuning (SFT) and Direct Preference Optimization (DPO), is clearly stated in the official documentation.

Dataset Composition

5.0 / 10

Microsoft discloses that the model was trained on 5 trillion tokens from a mix of filtered public web data and synthetic data. While they provide general categories (educational data, code, synthetic 'textbook-like' data) and mention that synthetic data is a primary focus for reasoning, they do not provide a precise percentage breakdown of the 5T tokens or specific source names for the 'acquired academic books.' The methodology for data filtering and decontamination is described at a high level, but the exact datasets remain proprietary.

Tokenizer Integrity

9.0 / 10

The model uses the 'o200k_base' tiktoken tokenizer with a clearly stated vocabulary size of 200,064 tokens. The tokenizer is publicly available on Hugging Face, and its support for 24 languages is documented and verifiable through the provided configuration files. The transition from the Phi-3.5 tokenizer to this larger vocabulary for better multilingual support is explicitly justified in technical communications.

Model

30.0 / 40

Parameter Density

9.0 / 10

The model is clearly identified as a dense architecture with 3.8B total parameters. There is no ambiguity regarding active vs. total parameters as seen in MoE models. Detailed architectural specifications, such as the number of layers and head configurations, are provided in the technical report and model cards, allowing for a complete understanding of parameter distribution.
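As a rough cross-check of the 3.8B figure, the published dimensions can be turned into a parameter estimate. The sketch below is a back-of-the-envelope count, not Microsoft's own accounting: it assumes bias-free linear layers, a gated feed-forward block with three weight matrices, and tied input/output embeddings counted once (the tying is noted under Architectural Provenance). The function name `count_parameters` is illustrative.

```python
# Dimensions from the specification table on this page.
VOCAB = 200_064
HIDDEN = 3_072
LAYERS = 32
Q_HEADS = 24
KV_HEADS = 8
HEAD_DIM = HIDDEN // Q_HEADS  # 128
FFN = 8_192


def count_parameters() -> int:
    """Estimate total parameters under the assumptions stated above."""
    # Tied input/output embeddings: counted once.
    embedding = VOCAB * HIDDEN

    # Attention: query and output projections are HIDDEN x HIDDEN;
    # key and value projections are smaller because of GQA.
    q_proj = HIDDEN * (Q_HEADS * HEAD_DIM)
    kv_proj = 2 * HIDDEN * (KV_HEADS * HEAD_DIM)
    o_proj = (Q_HEADS * HEAD_DIM) * HIDDEN
    attention = q_proj + kv_proj + o_proj

    # Assumed gated feed-forward: gate, up, and down projections.
    feed_forward = 3 * HIDDEN * FFN

    # Two RMSNorm weight vectors per layer, plus the final RMSNorm.
    layer_norms = 2 * HIDDEN
    final_norm = HIDDEN

    return embedding + LAYERS * (attention + feed_forward + layer_norms) + final_norm


if __name__ == "__main__":
    total = count_parameters()
    print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the estimate comes to 3,836,021,760 parameters, consistent with the stated 3.8B. The embedding table alone accounts for roughly 0.61B of that total, a consequence of the large vocabulary.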

Training Compute

7.0 / 10

Microsoft provides specific hardware and duration details for the training process: 512 A100-80G GPUs for 21 days. This allows for a reasonable estimation of total compute (approx. 258,000 GPU hours). While they do not provide a direct carbon footprint calculation or exact dollar cost, the disclosure of hardware type, count, and duration is significantly more transparent than most industry peers.
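The GPU-hour estimate follows directly from the disclosed hardware count and duration. The short calculation below reproduces it, assuming all 512 GPUs ran continuously for the full 21 days; actual utilization is not disclosed.

```python
# Figures disclosed for the training run.
GPUS = 512
DAYS = 21
HOURS_PER_DAY = 24

# Assumes every GPU was busy for the whole period.
gpu_hours = GPUS * DAYS * HOURS_PER_DAY

if __name__ == "__main__":
    print(f"{gpu_hours:,} GPU hours")
```

This gives 258,048 GPU hours, which rounds to the approximate figure quoted above.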

Benchmark Reproducibility

6.0 / 10

The model is evaluated using OpenAI's SimpleEval framework, which is a public and reproducible standard. Microsoft specifies the versions and settings (e.g., 0-shot, 5-shot, CoT) for major benchmarks like MMLU, GSM8K, and MATH. However, they also reference 'internal benchmarks' for certain capabilities, and the full evaluation code/prompts for all reported metrics are not consolidated in a single public repository for one-click reproduction.

Identity Consistency

8.0 / 10

The model generally identifies itself correctly as a Microsoft-developed AI. Technical documentation acknowledges that earlier versions had minor identity confusion issues (claiming to be from other companies) and states that ad-hoc training data was used to correct this in the Phi-4 release. It provides clear versioning (Phi-4-Mini-Instruct) and is transparent about its limitations regarding factual knowledge due to its small size.

Downstream

23.0 / 30

License Clarity

10.0 / 10

The model weights and associated code are released under the highly permissive MIT License. This is explicitly stated on Hugging Face, Azure, and in official blog posts. There are no conflicting commercial restrictions or 'open-ish' custom licenses; it is a standard, legally clear open-source license that allows for commercial use and derivative works without ambiguity.

Hardware Footprint

8.0 / 10

VRAM requirements are well-documented across multiple sources, including official Microsoft blogs and third-party implementation guides. Requirements for FP16 (approx. 9GB) and various quantization levels (e.g., 4-bit GGUF at ~2.5GB) are publicly available. The impact of the 128K context window on KV cache memory is also addressed through the documentation of GQA and its efficiency gains.
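The memory figures can be sanity-checked with a simple estimate of weight storage and KV cache size. The sketch below is illustrative only: it takes 3.8B as the parameter count, assumes a head dimension of 128 (hidden size 3,072 divided by 24 heads), stores the cache in 16-bit precision, treats 128K as 131,072 tokens, and ignores activations, runtime overhead, and quantization metadata. The helper names `weight_bytes` and `kv_cache_bytes` are hypothetical.

```python
# Values from the specification table; HEAD_DIM is derived (3,072 / 24).
PARAMS = 3.8e9
LAYERS = 32
Q_HEADS = 24
KV_HEADS = 8
HEAD_DIM = 128

GB = 1e9       # decimal gigabyte
GIB = 2 ** 30  # binary gibibyte


def weight_bytes(bits_per_param: float) -> float:
    """Raw storage for the weights at a given precision."""
    return PARAMS * bits_per_param / 8


def kv_cache_bytes(tokens: int, n_kv_heads: int = KV_HEADS,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: keys and values for every layer, head, and token."""
    return 2 * LAYERS * n_kv_heads * HEAD_DIM * bytes_per_value * tokens


if __name__ == "__main__":
    print(f"FP16 weights:  {weight_bytes(16) / GB:.1f} GB")
    print(f"4-bit weights: {weight_bytes(4) / GB:.1f} GB")

    context = 131_072  # 128K tokens, assumed
    gqa = kv_cache_bytes(context)
    mha = kv_cache_bytes(context, n_kv_heads=Q_HEADS)
    print(f"KV cache at 128K, GQA (8 KV heads):  {gqa / GIB:.0f} GiB")
    print(f"KV cache at 128K, one KV head per query head: {mha / GIB:.0f} GiB")
```

Raw weights come to about 7.6 GB in FP16 and 1.9 GB at 4 bits, so the roughly 9 GB and 2.5 GB figures above presumably include overhead beyond the weights themselves. At full context the 16-bit KV cache is about 16 GiB with 8 KV heads, one third of what 24 KV heads would require.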

Versioning Drift

5.0 / 10

The model uses basic versioning (v1.0) and maintains a release date (February 2025). While there is a 'Release Notes' section on the model card, it lacks a detailed, granular changelog for minor weight updates or specific data mixture adjustments. There is no formal system for tracking performance drift over time, although the model is described as a 'static' release.
