Parameters
14B
Context Length
128K
Modality
Text
Architecture
Dense
License
MIT
Release Date
22 Apr 2024
Knowledge Cutoff
Oct 2023
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
40
Key-Value Heads
10
Attention Head Dimension
-
Position Embedding
RoPE
RoPE Theta
10,000
Sliding Window Attention
Yes
Sliding Window Size
2,047
Normalization
RMS Normalization
Activation Function
Swish
Dimensions
Hidden Dimension Size
5,120
Number of Layers
40
FFN Intermediate Size (Dense)
17,920
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
32,064
Phi-3-medium is a compact, high-performance large language model developed by Microsoft as part of the Phi-3 family. With 14 billion parameters, it is designed for a broad range of commercial and research applications, particularly memory- or compute-constrained environments and latency-sensitive scenarios. The model aims to provide strong reasoning capabilities, notably in mathematics, logic, and code generation, positioning it as a building block for generative AI features.
The training methodology for Phi-3-medium leverages a high-quality, reasoning-dense dataset, which is a refined and scaled version of the data utilized for its predecessor, Phi-2. This dataset incorporates both meticulously filtered publicly available web content and synthetically generated data, ensuring a robust and instruction-adherent model. The training process includes supervised fine-tuning (SFT) and direct preference optimization (DPO) to enhance its ability to follow instructions precisely and to reinforce safety measures.
The model employs a dense decoder-only Transformer architecture, a common and effective structure for autoregressive language modeling. Its internal mechanisms include Grouped-Query Attention (GQA) for efficient memory use, Root Mean Square (RMS) normalization for stable training, and Rotary Positional Embeddings (RoPE) to encode positional information within sequences. A variant of RoPE known as LongRoPE extends the model's context length to 128,000 tokens. Phi-3-medium is optimized for deployment across diverse hardware, including graphics processing units (GPUs), central processing units (CPUs), and mobile devices, often using ONNX Runtime and DirectML for cross-platform compatibility and efficient inference.
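To make the attention description concrete, the following is a minimal sketch of two of the mechanisms named above: how grouped-query attention shares key-value heads across query heads, and how standard RoPE rotates a head vector by position. It uses the values from the specification list; the head dimension of 128 is an assumption (hidden size divided by attention heads), the pairing of element `i` with `i + d/2` is one common convention rather than a confirmed detail of this model, and the LongRoPE rescaling is not shown.

```python
import math

# Values from the specification list; HEAD_DIM is an assumption
# (hidden size / attention heads), since the page lists it as "-".
HIDDEN_SIZE = 5120
NUM_HEADS = 40
NUM_KV_HEADS = 10
ROPE_THETA = 10_000.0
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS  # assumed: 128


def kv_head_for(query_head: int) -> int:
    """Return the KV head that a query head shares under grouped-query attention."""
    group_size = NUM_HEADS // NUM_KV_HEADS  # 4 query heads per KV head
    return query_head // group_size


def rope_rotate(vec: list[float], position: int, theta: float = ROPE_THETA) -> list[float]:
    """Apply plain RoPE to one head vector, rotating element pairs (i, i + d/2)."""
    d = len(vec)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        # Each pair rotates at its own frequency theta^(-2i/d).
        angle = position * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x1 * s + x2 * c
    return out


if __name__ == "__main__":
    print([kv_head_for(h) for h in range(8)])  # first eight query heads
    print(rope_rotate([1.0, 0.0, 0.0, 1.0], position=3))
```

Because each pair is rotated rather than scaled, the vector's length is unchanged, which is why RoPE can be applied to queries and keys without disturbing their magnitudes.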
Microsoft's Phi-3 models are small language models designed for efficient operation on resource-constrained devices. They utilize a transformer decoder architecture and are trained on extensively filtered, high-quality data, including synthetic compositions. This approach enables a compact yet capable model family.
Rank
#145
| Benchmark | Score | Rank |
|---|---|---|
| Web Development WebDev Arena | 1198 | 81 |
Overall Rank
#145
Coding Rank
#100
Total Score
71 / 100
Phi-3-medium demonstrates strong transparency in its licensing and architectural specifications, providing clear hardware requirements and a permissive MIT license. However, the model's reliance on undisclosed synthetic data mixtures and internal evaluation tools creates significant gaps in verifying its training provenance and benchmark claims. While it is a highly accessible model for deployment, the 'black box' nature of its high-quality data recipe remains a primary transparency hurdle.
Architectural Provenance
Microsoft provides a technical report and model cards that explicitly define Phi-3-medium as a 14B-parameter dense decoder-only Transformer. They specify 40 layers, 40 attention heads, and an embedding dimension of 5,120. The architecture is described as a scaled-up version of Phi-3-mini, using Grouped-Query Attention (GQA) and Rotary Positional Embeddings (RoPE, with LongRoPE for context extension up to 128K). While the high-level post-training methodology (SFT and DPO) is described, the pre-training hyperparameters are documented in less detail than those of the smaller variants.
Dataset Composition
The model was trained on 4.8 trillion tokens. Microsoft discloses that the data is a mixture of 'heavily filtered' web data and synthetic data designed to mimic 'textbook-quality' reasoning. However, there is no specific percentage breakdown between web and synthetic sources, nor is there a detailed list of the specific web domains or datasets used. The filtering criteria are described in general terms ('quality-dense') without providing the actual code or comprehensive methodology for the curation process.
Tokenizer Integrity
The model uses the same tokenizer as Phi-3-mini, which is a version of the Llama tokenizer with a vocabulary size of 32,064 tokens. The tokenizer files are publicly available on Hugging Face and integrated into the standard 'transformers' library. The vocabulary size and special tokens (e.g., <|user|>, <|assistant|>, <|end|>) are clearly documented in the model card and technical report, allowing for easy verification and local testing.
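As an illustration of how the special tokens mentioned above fit together, here is a minimal sketch that assembles a chat prompt by hand. The function name `build_prompt` is hypothetical, and the exact newline placement is an assumption; in practice the chat template shipped with the tokenizer files should be treated as the source of truth.

```python
# Hypothetical helper: the layout (role tag, newline, content, <|end|>, newline)
# is an assumption and should be checked against the tokenizer's chat template.
ALLOWED_ROLES = {"user", "assistant"}


def build_prompt(messages: list[dict[str, str]]) -> str:
    """Join chat turns with Phi-3-style role tags and leave the assistant turn open."""
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<|{role}|>\n{message['content']}<|end|>\n")
    # The trailing assistant tag asks the model to generate the next reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


if __name__ == "__main__":
    print(build_prompt([{"role": "user", "content": "What is 2 + 2?"}]))
```

Building the string manually is mainly useful for inspecting or testing the format locally; it makes the role boundaries visible in a way the library call does not.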
Parameter Density
Phi-3-medium is explicitly stated to be a dense model with 14 billion parameters. Unlike MoE models where active parameters can be obscured, the 14B figure represents the full active parameter count. The architectural breakdown (layers, heads, embedding dimensions) is clearly provided in the technical report, and the model weights on Hugging Face confirm these specifications through the configuration files.
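The 14B figure can be sanity-checked against the listed dimensions. The sketch below is a back-of-the-envelope count, not an official breakdown: it assumes a head dimension of 128 (hidden size divided by heads), a gated feed-forward block with three bias-free projections, two RMSNorm weight vectors per layer, and untied input and output embeddings.

```python
# Figures from the specification list; structural choices are assumptions.
VOCAB = 32_064
HIDDEN = 5_120
LAYERS = 40
HEADS = 40
KV_HEADS = 10
FFN = 17_920
HEAD_DIM = HIDDEN // HEADS  # assumed 128; the page lists this field as "-"


def count_parameters() -> int:
    # Attention: Q and output projections are HIDDEN x HIDDEN; K and V are
    # narrower because only KV_HEADS heads are stored (grouped-query attention).
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_HEADS * HEAD_DIM
    # Gated feed-forward (assumed): gate, up, and down projections, no biases.
    mlp = 3 * HIDDEN * FFN
    # Two RMSNorm weight vectors per layer.
    norms = 2 * HIDDEN
    per_layer = attn + mlp + norms
    # Assumed untied input embedding and output head, plus a final RMSNorm.
    embeddings = 2 * VOCAB * HIDDEN
    return LAYERS * per_layer + embeddings + HIDDEN


total = count_parameters()
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the count comes to 13,960,238,080, or about 13.96B, which is consistent with the rounded 14B figure.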
Training Compute
Microsoft disclosed that the model was trained using 512 H100-80G GPUs over a period of 42 days. This provides a clear hardware specification and duration, allowing for a rough estimate of total compute. However, the official documentation lacks a specific carbon footprint calculation or a detailed breakdown of the total cost and energy consumption associated with the training run.
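The rough compute estimate mentioned above can be worked through as follows. This is a sketch under stated assumptions: it uses the common approximation of about 6 FLOPs per parameter per training token (a rule of thumb, not a Microsoft figure), the nominal 14B parameter count, and the 4.8 trillion training tokens reported for the model, and it assumes all 512 GPUs ran for the full 42 days.

```python
GPUS = 512
DAYS = 42
TOKENS = 4.8e12  # reported training tokens
PARAMS = 14e9    # nominal parameter count

# Total accelerator time, assuming every GPU was busy for the whole run.
gpu_hours = GPUS * DAYS * 24

# Rule of thumb (an assumption): ~6 FLOPs per parameter per token
# covers the forward and backward passes of a dense Transformer.
train_flops = 6 * PARAMS * TOKENS

# Average throughput each GPU would need to sustain to finish in that time.
flops_per_gpu_second = train_flops / (gpu_hours * 3600)

print(f"GPU-hours: {gpu_hours:,}")
print(f"Estimated training compute: {train_flops:.3e} FLOPs")
print(f"Implied sustained throughput: {flops_per_gpu_second / 1e12:.0f} TFLOP/s per GPU")
```

This yields 516,096 GPU-hours and roughly 4.0e23 FLOPs. Energy use and cost cannot be derived from these numbers alone, which is the disclosure gap noted above.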
Benchmark Reproducibility
While Microsoft reports scores on standard benchmarks (MMLU, GSM8K, HumanEval), the evaluation is conducted using an 'internal tool' (BabelBench) with prompts that are not fully public. The technical report mentions that they do not optimize prompts for Phi-3, but the lack of a public evaluation repository or the exact few-shot examples used makes independent reproduction difficult. There is also limited disclosure regarding the specific versions of benchmarks used.
Identity Consistency
The model consistently identifies itself as a Microsoft-developed AI and is aware of its versioning within the Phi-3 family. It does not exhibit the identity confusion seen in some smaller fine-tuned models that claim to be GPT-4. The model card clearly outlines its intended use cases and limitations, and the model's behavior in chat mode generally aligns with these disclosures.
License Clarity
The model is released under the MIT License, which is a highly permissive, standard open-source license. This allows for broad commercial and research use, modification, and distribution without the restrictive 'acceptable use' policies or revenue-based triggers found in other 'open' models. The licensing terms are unambiguous and prominently displayed on the official repository.
Hardware Footprint
Microsoft and third-party sources provide clear guidance on VRAM requirements. For example, the model is documented as requiring approximately 28 GB of VRAM in FP16, and it can run on consumer or workstation hardware (such as 2x RTX 4090 or a single A6000) when quantized. Official ONNX and GGUF versions are available with documented performance and memory trade-offs, though the primary technical report does not include detailed charts of how memory scales with context length.
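Since context-length memory scaling is not charted in the report, the following sketch shows how one might estimate it from the listed dimensions. It is an approximation only: it assumes a head dimension of 128, an FP16 key-value cache, and full attention over the whole context (any sliding-window limit is ignored), and it leaves out activations and runtime overhead.

```python
PARAMS = 14e9    # nominal parameter count
LAYERS = 40
KV_HEADS = 10
HEAD_DIM = 128   # assumed: 5,120 hidden size / 40 heads


def weight_bytes(bits_per_weight: float) -> float:
    """Memory for the weights alone at a given precision."""
    return PARAMS * bits_per_weight / 8


def kv_cache_bytes(context_tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache for one sequence: keys and values, every layer, every KV head."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value * context_tokens


if __name__ == "__main__":
    for name, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
        print(f"{name:>5}: {weight_bytes(bits) / 1e9:.0f} GB of weights")
    for ctx in (4_096, 32_768, 131_072):
        print(f"{ctx:>7} tokens: {kv_cache_bytes(ctx) / 2**30:.2f} GiB of KV cache")
```

Under these assumptions the cache costs 204,800 bytes per token, so a full 131,072-token context adds about 25 GiB on top of the weights. This is why long-context use can dominate the memory budget even for a quantized model.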
Versioning Drift
The model uses a basic naming convention (Phi-3-medium-4k/128k-instruct) and has seen updates (e.g., the transition from preview to official release). However, there is no formal semantic versioning system or detailed public changelog that tracks minor weight updates or safety alignment drift. Users must rely on Hugging Face commit histories to track changes, an approach that lacks the transparency of a formal versioning policy.