
Phi-3-medium

Parameters

14B

Context Length

128K

Modality

Text

Architecture

Dense

License

MIT

Release Date

22 Apr 2024

Knowledge Cutoff

Oct 2023

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

40

Key-Value Heads

10

Attention Head Dimension

128

Position Embedding

RoPE

RoPE Theta

10,000

Sliding Window Attention

Yes

Sliding Window Size

2,047

Normalization

RMS Normalization

Activation Function

Swish

Dimensions

Hidden Dimension Size

5,120

Number of Layers

40

FFN Intermediate Size (Dense)

17,920

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

32,064

Architecture Diagram

Input Tokens → Token Embedding (Position: RoPE; Hidden: 5.1k · Context: 128k · Vocab: 32.1k) → 40 layers × [Pre-Attention RMSNorm → Grouped-Query Attention (40 Q / 10 KV heads · SW: 2k · Head dim: 128) → residual add → Pre-FFN RMSNorm → Feed-Forward Network (Swish · Intermediate: 17.9k) → residual add] → Final RMSNorm → Output Logits

Phi-3-medium

Phi-3-medium is a 14-billion-parameter language model developed by Microsoft as part of the Phi-3 family. It is designed for commercial and research applications, particularly memory- or compute-constrained environments and latency-sensitive scenarios. The model targets strong reasoning in mathematics, logic, and code generation, and is intended as a building block for generative AI features.

Phi-3-medium is trained on a high-quality, reasoning-dense dataset, a refined and scaled-up version of the data used for its predecessor, Phi-2. The dataset combines heavily filtered publicly available web content with synthetically generated data. Post-training includes supervised fine-tuning (SFT) and direct preference optimization (DPO) to improve instruction following and reinforce safety measures.

The model uses a dense decoder-only Transformer architecture, the standard structure for autoregressive language modeling. It combines Grouped-Query Attention (GQA), which reduces key-value cache memory by sharing each key-value head across several query heads, Root Mean Square (RMS) normalization for stable training, and Rotary Positional Embeddings (RoPE) to encode token positions. A RoPE variant, LongRoPE, extends the usable context length to 128,000 tokens. Phi-3-medium is optimized for deployment on graphics processing units (GPUs), central processing units (CPUs), and mobile devices, often through ONNX Runtime and DirectML for cross-platform inference.
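To make the memory effect of GQA concrete, here is a rough sketch of key-value cache sizing using the figures from the specification table (40 layers, 10 key-value heads, head dimension 128). It assumes 16-bit cache values and ignores the sliding window, so it is an upper-bound estimate rather than a measurement; the function name `kv_cache_bytes` is illustrative only.

```python
def kv_cache_bytes(tokens, layers=40, kv_heads=10, head_dim=128, bytes_per_value=2):
    """Approximate KV-cache size in bytes for a single sequence.

    The leading factor 2 accounts for one key and one value tensor per layer.
    bytes_per_value=2 assumes FP16/BF16 cache entries (an assumption).
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3

per_token = kv_cache_bytes(1)                      # bytes added per generated token
gqa_full = kv_cache_bytes(131_072)                 # full 128K context, 10 KV heads
mha_full = kv_cache_bytes(131_072, kv_heads=40)    # hypothetical: one KV head per query head

print(f"per token:       {per_token / 1024:.0f} KiB")
print(f"128K ctx (GQA):  {gqa_full / GIB:.1f} GiB")
print(f"128K ctx (MHA):  {mha_full / GIB:.1f} GiB")
```

With 40 query heads sharing 10 key-value heads, the cache is one quarter of what standard multi-head attention with the same dimensions would need.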

About Phi-3

Microsoft's Phi-3 models are small language models designed for efficient operation on resource-constrained devices. They utilize a transformer decoder architecture and are trained on extensively filtered, high-quality data, including synthetic compositions. This approach enables a compact yet capable model family.


Evaluation Benchmarks

Rank

#145

Benchmark | Score | Rank

Web Development: WebDev Arena | 1198 | 81

Rankings

Overall Rank

#145

Coding Rank

#100

Model Integrity

Total Score

B+

71 / 100

Phi-3-medium Model Integrity Report

Audit Note

Phi-3-medium demonstrates strong transparency in its licensing and architectural specifications, providing clear hardware requirements and a permissive MIT license. However, the model's reliance on undisclosed synthetic data mixtures and internal evaluation tools creates significant gaps in verifying its training provenance and benchmark claims. While it is a highly accessible model for deployment, the 'black box' nature of its high-quality data recipe remains a primary transparency hurdle.

Upstream

20.0 / 30

Architectural Provenance

7.5 / 10

Microsoft provides a technical report and model cards that explicitly define Phi-3-medium as a 14B parameter dense decoder-only Transformer. They specify 40 layers, 40 attention heads, and an embedding dimension of 5120. The architecture is described as a scaled-up version of Phi-3-mini, using Grouped-Query Attention (GQA) and Rotary Positional Embeddings (RoPE, extended with LongRoPE) for context lengths up to 128k. While the high-level methodology (SFT and DPO) is described, the specific hyperparameters for the pre-training phase are less detailed than those for the smaller variants.

Dataset Composition

4.0 / 10

The model was trained on 4.8 trillion tokens. Microsoft discloses that the data is a mixture of 'heavily filtered' web data and synthetic data designed to mimic 'textbook-quality' reasoning. However, there is no specific percentage breakdown between web and synthetic sources, nor is there a detailed list of the specific web domains or datasets used. The filtering criteria are described in general terms ('quality-dense') without providing the actual code or comprehensive methodology for the curation process.

Tokenizer Integrity

8.5 / 10

The model uses the same tokenizer as Phi-3-mini, which is a version of the Llama tokenizer with a vocabulary size of 32,064 tokens. The tokenizer files are publicly available on Hugging Face and integrated into the standard 'transformers' library. The vocabulary size and special tokens (e.g., <|user|>, <|assistant|>, <|end|>) are clearly documented in the model card and technical report, allowing for easy verification and local testing.
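As an illustration of how the documented special tokens delimit a conversation, the sketch below assembles a prompt string by hand. The exact whitespace between tokens is an assumption, and the helper `format_phi3_prompt` is hypothetical; the chat template shipped with the tokenizer files is the authoritative source and should be preferred in practice.

```python
def format_phi3_prompt(messages):
    """Build a Phi-3 style prompt from a list of {"role", "content"} dicts.

    Each turn is wrapped as <|role|>\\n...<|end|>\\n, and the prompt ends with
    an open <|assistant|> tag so the model generates the next reply.
    Newline placement is an assumption, not taken from the source page.
    """
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = format_phi3_prompt([
    {"role": "user", "content": "What is 2 + 2?"},
])
print(prompt)
```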

Model

28.5 / 40

Parameter Density

9.0 / 10

Phi-3-medium is explicitly stated to be a dense model with 14 billion parameters. Unlike MoE models where active parameters can be obscured, the 14B figure represents the full active parameter count. The architectural breakdown (layers, heads, embedding dimensions) is clearly provided in the technical report, and the model weights on Hugging Face confirm these specifications through the configuration files.
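The 14B figure can be sanity-checked against the published dimensions. The sketch below counts weights from the values in the specification table, under three assumptions that are not stated on this page: a gated feed-forward block with three projection matrices, no bias terms, and separate (untied) input embedding and output projection matrices.

```python
def estimate_parameters(
    vocab=32_064,
    hidden=5_120,
    layers=40,
    q_heads=40,
    kv_heads=10,
    head_dim=128,
    ffn=17_920,
):
    """Rough parameter count for a dense decoder-only Transformer with GQA."""
    # Attention: query and output projections are full width,
    # key and value projections are reduced to kv_heads * head_dim.
    attn = 2 * hidden * (q_heads * head_dim) + 2 * hidden * (kv_heads * head_dim)
    # Gated feed-forward: gate, up, and down projections (assumed layout).
    mlp = 3 * hidden * ffn
    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden
    per_layer = attn + mlp + norms
    # Input embedding plus untied output head (assumption), plus the final norm.
    embeddings = 2 * vocab * hidden
    return layers * per_layer + embeddings + hidden


total = estimate_parameters()
print(f"estimated parameters: {total / 1e9:.2f}B")
```

Under these assumptions the estimate lands close to the advertised 14B, which is consistent with the dense (non-MoE) description.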

Training Compute

6.5 / 10

Microsoft disclosed that the model was trained using 512 H100-80G GPUs over a period of 42 days. This provides a clear hardware specification and duration, allowing for a rough estimate of total compute. However, the official documentation lacks a specific carbon footprint calculation or a detailed breakdown of the total cost and energy consumption associated with the training run.
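The "rough estimate of total compute" can be sketched as follows. GPU-hours follow directly from the disclosed hardware and duration. The FLOP estimate uses the common 6 × parameters × tokens rule of thumb together with the 4.8 trillion token figure reported for this model. The peak throughput per GPU is a placeholder assumption for illustration, not a figure from Microsoft's documentation.

```python
GPUS = 512
DAYS = 42
PARAMS = 14e9          # parameter count
TOKENS = 4.8e12        # training tokens reported for Phi-3-medium

# Total accelerator time implied by the disclosure.
gpu_hours = GPUS * DAYS * 24

# Rule-of-thumb training cost: ~6 FLOPs per parameter per token
# (forward plus backward pass); an approximation, not a measured value.
train_flops = 6 * PARAMS * TOKENS

# Assumed peak dense BF16 throughput per GPU, in FLOP/s (placeholder value).
ASSUMED_PEAK_FLOPS = 989e12

# Fraction of the assumed peak that the run would need to sustain.
implied_utilization = train_flops / (gpu_hours * 3600 * ASSUMED_PEAK_FLOPS)

print(f"GPU-hours:           {gpu_hours:,}")
print(f"training FLOPs:      {train_flops:.3e}")
print(f"implied utilization: {implied_utilization:.1%}")
```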

Benchmark Reproducibility

4.0 / 10

While Microsoft reports scores on standard benchmarks (MMLU, GSM8K, HumanEval), the evaluation is conducted using an 'internal tool' (BabelBench) with prompts that are not fully public. The technical report mentions that they do not optimize prompts for Phi-3, but the lack of a public evaluation repository or the exact few-shot examples used makes independent reproduction difficult. There is also limited disclosure regarding the specific versions of benchmarks used.

Identity Consistency

9.0 / 10

The model consistently identifies itself as a Microsoft-developed AI and is aware of its versioning within the Phi-3 family. It does not exhibit the identity confusion seen in some smaller fine-tuned models that claim to be GPT-4. The model card clearly outlines its intended use cases and limitations, and the model's behavior in chat mode generally aligns with these disclosures.

Downstream

22.5 / 30

License Clarity

10.0 / 10

The model is released under the MIT License, which is a highly permissive, standard open-source license. This allows for broad commercial and research use, modification, and distribution without the restrictive 'acceptable use' policies or revenue-based triggers found in other 'open' models. The licensing terms are unambiguous and prominently displayed on the official repository.

Hardware Footprint

7.5 / 10

Microsoft and third-party sources provide clear guidance on VRAM requirements. For example, it is documented that the model requires approximately 28GB of VRAM in FP16, and can be run on consumer hardware (like 2x RTX 4090 or a single A6000) when quantized. Official ONNX and GGUF versions are available with documented performance/memory trade-offs, though detailed context-length scaling memory charts are not provided in the primary technical report.
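The 28GB FP16 figure follows from simple arithmetic on the parameter count. The sketch below estimates weight memory at a few precisions; it covers weights only and deliberately excludes the KV cache, activations, and framework overhead, so real usage will be higher. The function name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(params, bits_per_weight):
    """Memory needed to hold the model weights, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 14_000_000_000

for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(PARAMS, bits):.1f} GB")
```

At 4 bits per weight the weights alone drop to roughly a quarter of the FP16 size, which is what makes single-GPU consumer deployment plausible.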

Versioning Drift

5.0 / 10

The model uses a basic naming convention (Phi-3-medium-4k/128k-instruct) and has seen updates (e.g., the transition from preview to official release). However, there is no formal semantic versioning system or detailed public changelog that tracks minor weight updates or safety alignment drift. Users must rely on Hugging Face commit histories to track changes, which are less transparent than a formal versioning policy.
