Parameters
7B
Context Length
8,192 tokens (8K)
Modality
Text
Architecture
Dense
License
MIT License
Release Date
22 Apr 2024
Knowledge Cutoff
Oct 2023
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
32
Key-Value Heads
8
Attention Head Dimension
-
Position Embedding
RoPE
RoPE Theta
1,000,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
-
Activation Function
Gated GELU
Dimensions
Hidden Dimension Size
4,096
Number of Layers
32
FFN Intermediate Size (Dense)
14,336
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
100,352
Microsoft's Phi-3-small is a member of the Phi family of small language models (SLMs), engineered to deliver high performance within a compact computational footprint. This model variant, with 7 billion parameters, is positioned for broad commercial and research applications where resource efficiency and responsiveness are critical. It addresses scenarios demanding robust language understanding, logical reasoning, and efficient processing on constrained hardware environments, including on-device deployments.
Phi-3-small is a dense, decoder-only Transformer. Several design choices target performance and memory efficiency. It uses Grouped-Query Attention (GQA), in which four query heads share a single key-value head, reducing the KV cache footprint. It also alternates dense and blocksparse attention layers, which further reduces KV cache memory while preserving long-context retrieval capabilities. Post-training consists of Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), intended to align the model with human preferences and safety guidelines.
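To make the KV cache saving from GQA concrete, the following is a rough sketch that compares the cache size with 8 key-value heads against a hypothetical multi-head baseline with 32. It assumes a head dimension of 128 (hidden size divided by the number of attention heads, which the specification table does not state), 16-bit cache values, and dense attention in every layer, so it does not model the additional saving from the blocksparse layers. The function name `kv_cache_bytes` is illustrative only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence (dense attention)."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Values taken from the specification table.
HIDDEN_SIZE = 4096
NUM_QUERY_HEADS = 32
NUM_KV_HEADS = 8
NUM_LAYERS = 32
CONTEXT_LENGTH = 8192

# Assumption: head dimension = hidden size / attention heads (not stated in the table).
HEAD_DIM = HIDDEN_SIZE // NUM_QUERY_HEADS

gqa = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, CONTEXT_LENGTH)
mha = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, CONTEXT_LENGTH)

print(f"GQA (8 KV heads):  {gqa / 2**30:.2f} GiB")   # 1.00 GiB
print(f"MHA (32 KV heads): {mha / 2**30:.2f} GiB")   # 4.00 GiB
print(f"Reduction: {mha // gqa}x")                   # 4x
```

Under these assumptions, a full 8K-token sequence needs about 1 GiB of cache instead of 4 GiB, which matches the 4:1 ratio of query heads to key-value heads.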
Phi-3-small has a default context length of 8,192 tokens (8K); an extended variant supports up to 128,000 tokens using LongRoPE. The model was trained on 4.8 trillion tokens drawn from rigorously filtered public documents, high-quality educational content, and synthetically generated data, with an emphasis on data quality and reasoning density. It targets tasks such as complex language understanding, mathematical problem-solving, and code generation, and is suitable for deployment across hardware platforms ranging from cloud-based inference to edge and mobile devices.
Microsoft's Phi-3 models are small language models designed for efficient operation on resource-constrained devices. They utilize a transformer decoder architecture and are trained on extensively filtered, high-quality data, including synthetic compositions. This approach enables a compact yet capable model family.
Rank
#149
| Benchmark | Score | Rank |
|---|---|---|
| Web Development (WebDev Arena) | 1171 | 84 |
Overall Rank
#149
Coding Rank
#106
Total Score
66 / 100
Phi-3-small demonstrates strong transparency in its architectural design and licensing, providing a detailed technical report and a permissive MIT license. However, it remains opaque regarding the specific composition of its 4.8T token training set and relies on internal, non-public evaluation tools that limit benchmark reproducibility. The model's transparency profile is that of a 'weights-available' corporate product rather than a fully open-science project.
Architectural Provenance
Microsoft provides a detailed technical report (arXiv:2404.14219) specifying that Phi-3-small is a dense decoder-only Transformer with 32 layers and a hidden size of 4096. It explicitly documents the use of Grouped Query Attention (GQA) with 4 queries per key and a unique alternating pattern of dense and blocksparse attention layers to optimize the KV cache. While the base architecture is well-documented, the specific 'blocksparse' implementation details are described at a high level without full source code for the custom kernels used in training.
Dataset Composition
The model was trained on 4.8 trillion tokens. Documentation mentions three main categories: 1) filtered public web data, 2) high-quality educational/code data, and 3) synthetic 'textbook-like' data. However, Microsoft does not provide a specific percentage breakdown of these sources (beyond a mention of 10% multilingual data) or name the specific datasets used. The 'synthetic data' generation process is described conceptually but lacks the transparency required for verification or reproduction of the data mix.
Tokenizer Integrity
Phi-3-small uses the tiktoken-based tokenizer with a vocabulary size of 100,352, which is a significant departure from the Llama-based tokenizer used in Phi-3-mini. The tokenizer is publicly accessible via Hugging Face and the vocabulary size is clearly stated in the technical report. It is well-documented as being optimized for multilingual support, though detailed alignment between the tokenizer's training data and the model's 4.8T token corpus is not fully disclosed.
Parameter Density
The model is clearly identified as a 7B-parameter dense model. Microsoft provides a structural breakdown (32 layers, 32 heads, 4096 hidden dimension). Although it uses blocksparse attention, it is not a Mixture-of-Experts (MoE) model, so the distinction between total and active parameters does not apply. The documentation is clear, though it gives only the rounded '7B' label rather than a precise parameter count (e.g., 7.39B).
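A back-of-the-envelope count from the listed dimensions shows where a figure like 7.39B could come from. This is a sketch under stated assumptions, not Microsoft's accounting: it assumes tied input and output embeddings, a gated feed-forward block with three weight matrices, key and value projections sized for 8 KV heads of dimension 128, and it ignores biases and normalization parameters. The function name `estimate_params` is hypothetical.

```python
def estimate_params(vocab_size, hidden, layers, q_heads, kv_heads, ffn_size):
    """Rough weight count for a dense decoder-only Transformer with GQA."""
    head_dim = hidden // q_heads            # assumption: 4096 / 32 = 128
    kv_dim = kv_heads * head_dim            # width of the key and value projections

    # Attention: query and output projections are hidden x hidden,
    # key and value projections are hidden x kv_dim.
    attention = hidden * (hidden + kv_dim + kv_dim + hidden)

    # Assumption: gated FFN with gate, up, and down matrices.
    ffn = 3 * hidden * ffn_size

    # Assumption: embeddings are tied, so they are counted once.
    embeddings = vocab_size * hidden

    return layers * (attention + ffn) + embeddings


total = estimate_params(
    vocab_size=100_352,
    hidden=4_096,
    layers=32,
    q_heads=32,
    kv_heads=8,
    ffn_size=14_336,
)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")  # 7,390,363,648 parameters (~7.39B)
```

The result depends on the assumptions above; untied embeddings or bias terms would change the total.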
Training Compute
Microsoft discloses the hardware used (1024 H100-80G GPUs) and the training duration (18 days). This allows for a rough estimate of compute resources. However, it fails to provide a calculated carbon footprint or the specific energy efficiency metrics of the cluster. The information is better than most proprietary models but lacks the environmental transparency seen in exemplary open-science projects.
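The rough compute estimate mentioned above can be worked out as follows. This is an illustrative sketch: the GPU count and duration come from the text, the nominal 7B parameter count and 4.8T tokens come from elsewhere in this page, and the FLOP figure uses the common 6 x parameters x tokens rule of thumb, which is an assumption rather than a number Microsoft reports.

```python
# Figures stated in the text.
NUM_GPUS = 1024
TRAINING_DAYS = 18

# Nominal values from elsewhere on this page.
NUM_PARAMS = 7e9        # "7B" label, not an exact count
NUM_TOKENS = 4.8e12     # 4.8 trillion training tokens

# Total accelerator time, assuming all GPUs ran for the full duration.
gpu_hours = NUM_GPUS * TRAINING_DAYS * 24

# Assumption: ~6 FLOPs per parameter per token (forward plus backward pass).
approx_flops = 6 * NUM_PARAMS * NUM_TOKENS

print(f"GPU-hours: {gpu_hours:,}")                    # 442,368
print(f"Approx. training FLOPs: {approx_flops:.3e}")
```

Converting these figures into energy use or a carbon footprint would need power draw and utilization data that the documentation does not provide.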
Benchmark Reproducibility
While Microsoft reports scores on standard benchmarks (MMLU, GSM8K, etc.) in the technical report, it explicitly states that the prompts and few-shot examples are part of an 'internal tool' and are not fully public. This significantly hinders third-party reproduction. Furthermore, independent research has highlighted significant performance gaps when using different evaluation pipelines, suggesting the reported numbers are highly sensitive to the undisclosed internal settings.
Identity Consistency
The model consistently identifies itself as a Microsoft Phi-3 model in system prompts and documentation. It maintains a clear versioning identity within the Phi-3 family (Small vs. Mini vs. Medium). There are no documented cases of the model claiming to be a competitor's product (like GPT-4) or denying its nature as an AI developed by Microsoft.
License Clarity
Phi-3-small is released under the highly permissive MIT License, which is clearly stated on the official Hugging Face repository and Microsoft's blog. The license allows for commercial use, modification, and distribution with minimal restrictions. There are no conflicting 'non-commercial' clauses in the primary license text for the weights.
Hardware Footprint
Microsoft and partners (like NVIDIA) provide VRAM requirements for various deployment scenarios. Documentation exists for FP16 and INT4 (via ONNX/DirectML) requirements. The impact of LongRoPE for 128K context on memory is discussed, though detailed scaling tables for VRAM vs. context length are primarily provided by third-party community benchmarks rather than a single comprehensive official source.
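As a first-order check on such requirements, weight memory scales with the number of bits per parameter. The sketch below uses the nominal 7B parameter count and covers weights only; it is not an official figure, and it leaves out the KV cache, activations, quantization metadata, and runtime overhead. The function name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for model weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7_000_000_000  # nominal "7B" label, not an exact count

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"FP16 weights: {fp16_gb:.1f} GB")  # 14.0 GB
print(f"INT4 weights: {int4_gb:.1f} GB")  # 3.5 GB
```

Actual VRAM use will be higher than these numbers, and it grows with context length because of the KV cache.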
Versioning Drift
Microsoft uses a naming convention (e.g., Phi-3-small-8k-instruct) but lacks a strict semantic versioning system for weight updates. While they released a 'June 2024 Update' with a changelog on Hugging Face, updates are often delivered as new model cards rather than tracked versions of a single artifact. This makes it difficult for developers to track silent changes or roll back to specific sub-versions without manual commit tracking.