Parameters
14B
Context Length
32K (32,768 tokens)
Modality
Text
Architecture
Dense
License
MIT
Release Date
30 Apr 2025
Knowledge Cutoff
Mar 2025
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
40
Key-Value Heads
10
Attention Head Dimension
-
Position Embedding
Rotary Position Embedding (RoPE)
RoPE Theta
500,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
5,120
Number of Layers
40
FFN Intermediate Size (Dense)
17,920
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
100,352
Phi-4 Reasoning Plus is a 14-billion-parameter language model from Microsoft, built for chain-of-thought reasoning and multi-step logical inference. A reasoning-focused variant in the Phi-4 family, it targets problem-solving in mathematics, science, and code generation. Its outputs consist of an explicit reasoning trace followed by a final solution, which makes the path to each answer inspectable. The design favors output quality and depth over response speed, for tasks where thoroughness matters more than latency.
The model is a dense, decoder-only Transformer. Its 40 query heads share 10 key-value heads, an arrangement usually called grouped-query attention (GQA) rather than standard multi-head attention. It uses Rotary Position Embeddings (RoPE) and an expanded context window of 32,768 tokens, which helps it stay coherent over the long sequences that multi-step reasoning often requires. Training is data-centric: supervised fine-tuning (SFT) on over 1.4 million chain-of-thought traces, followed by reinforcement learning with the Group Relative Policy Optimization (GRPO) algorithm. The RL phase targets verifiable mathematical and logical problems and refines the model's ability to self-correct and explore alternative solutions.
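The group-relative step that gives GRPO its name can be sketched as follows. This is a minimal illustration, not Microsoft's training code: the function name, the binary reward values, and the choice of population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled answer relative to the other answers in its group.

    GRPO replaces a learned value baseline with the group's own mean reward,
    then scales by the group's spread. `eps` (an assumed constant) avoids
    division by zero when every answer receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one math problem,
# rewarded 1.0 when a verifier accepts the final answer and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every answer gets the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Answers that beat the group average get a positive advantage and the rest a negative one, which is why this setup suits problems with verifiable answers.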
Phi-4 Reasoning Plus generates noticeably more tokens than its sibling Phi-4-reasoning: the 'plus' variant produces on average about 50% more tokens, giving more exhaustive explanations. This raises latency, but it lets the model rival much larger systems on specialized reasoning benchmarks. The model is released under the MIT license with open weights, making it suitable for local deployment, including on consumer-grade hardware where compute is constrained but careful reasoning is required.
The Microsoft Phi-4 model family comprises small language models prioritizing efficient, high-capability reasoning. Its development emphasizes robust data quality and sophisticated synthetic data integration. This approach enables enhanced performance and on-device deployment capabilities.
Rank
#149
| Benchmark | Score | Rank |
|---|---|---|
| Professional Knowledge: MMLU Pro | 0.76 | 60 |
Overall Rank
#149
Coding Rank
-
Total Score
80 / 100
Phi-4 Reasoning Plus demonstrates a high level of transparency for an open-weight model, particularly regarding its architectural lineage, compute resources, and licensing. While specific pre-training data ratios remain somewhat generalized, the disclosure of synthetic data generation methods and 'teacher' models is exemplary. The model's clear identity and well-documented hardware requirements make it a highly verifiable system for researchers and developers.
Architectural Provenance
Microsoft is highly transparent about the model's lineage, explicitly identifying it as a fine-tuned variant of the Phi-4 base model. The architecture is documented as a dense, decoder-only Transformer with 14 billion parameters, using grouped-query attention (40 query heads, 10 key-value heads) and Rotary Position Embeddings (RoPE). The technical report details the transition from the base Phi-4 to the 'Reasoning' and 'Reasoning Plus' variants, including the addition of <think> and </think> tokens and the expansion of the context window to 32,768 tokens. The use of Group Relative Policy Optimization (GRPO) for the reinforcement learning phase is also clearly stated.
Dataset Composition
The training methodology is described as 'data-centric,' with specific disclosures about the SFT phase using 1.4 million chain-of-thought traces generated via o3-mini and the RL phase using ~6,000 high-quality math problems. While the report mentions a blend of synthetic data and filtered public domain data (web, code, STEM), it lacks a precise percentage breakdown of the pre-training data composition (e.g., exact ratios of web vs. books vs. code). However, the disclosure of the specific 'teacher' models (o3-mini, DeepSeek-R1 for the mini variant) and the data-cleansing methodology (decontamination in Appendix B) provides better-than-average transparency.
Tokenizer Integrity
The model uses the tiktoken tokenizer with a stated vocabulary size of 100,352 tokens. Documentation confirms the inclusion of specific reserved tokens for reasoning traces (<think>, </think>). The tokenizer is publicly accessible via the Hugging Face repository and is integrated into the standard 'transformers' library (version 4.51.3+), allowing for direct verification of tokenization behavior and alignment with the claimed language support (primarily English).
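Because the reasoning trace is delimited by the <think> and </think> tokens, downstream code can separate it from the final answer. The sketch below assumes the decoded output contains the literal tags followed by the answer text; the helper name `split_reasoning` and the sample string are hypothetical.

```python
import re

# Non-greedy match so only the first complete <think>...</think> block is taken.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer); reasoning is None if no complete block exists."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


# Hypothetical decoded output, for illustration only.
sample = "<think>2 + 2 equals 4.</think>\nThe answer is 4."
reasoning, answer = split_reasoning(sample)
```

Returning `None` for a missing or unterminated block lets callers detect outputs that were cut off before the trace finished, which can happen when generation hits a token limit.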
Parameter Density
The parameter count is consistently stated as 14 billion. As a dense architecture, the active parameters equal the total parameters, which is explicitly confirmed in technical documentation to avoid MoE-related confusion. Detailed architectural specs, including the context window (32K) and the specific modifications from the Phi-3/Phi-4 base, are well-documented in the technical report.
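The stated 14 billion figure can be sanity-checked against the dimensions listed in the specification table. The estimate below is a rough sketch under stated assumptions: a head dimension of hidden size divided by head count (the table leaves this field blank), a three-matrix gated feed-forward block, untied input and output embeddings, and no bias or normalization weights.

```python
def estimate_parameters(hidden, layers, heads, kv_heads, ffn, vocab,
                        tied_embeddings=False):
    """Approximate weight count of a dense decoder-only Transformer."""
    head_dim = hidden // heads                    # assumption: not listed in the table
    q_and_o = 2 * hidden * heads * head_dim       # query and output projections
    k_and_v = 2 * hidden * kv_heads * head_dim    # key and value projections
    mlp = 3 * hidden * ffn                        # gate, up, and down projections
    per_layer = q_and_o + k_and_v + mlp
    embeddings = vocab * hidden * (1 if tied_embeddings else 2)
    return layers * per_layer + embeddings


# Values taken from the specification table above.
total = estimate_parameters(hidden=5120, layers=40, heads=40, kv_heads=10,
                            ffn=17920, vocab=100352)
print(f"{total / 1e9:.2f}B parameters")
```

Under these assumptions the estimate lands near 14.7 billion, in line with the "14B" label.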
Training Compute
Microsoft provides specific hardware and compute metrics for the reasoning variants: training used 32 H100-80G GPUs for 2.5 days. The model card does not state a carbon footprint, but the disclosed GPU count, training duration, and hardware type allow independent estimation. This level of detail is well above the norm for 'open-weight' models.
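The independent estimation mentioned above starts with simple arithmetic. In the sketch below, only the GPU count and duration come from the disclosure; the average power draw per GPU is an assumed placeholder, and datacenter overhead and grid carbon intensity are left out.

```python
GPUS = 32            # from the disclosure
DAYS = 2.5           # from the disclosure
ASSUMED_KW_PER_GPU = 0.7  # assumption: average draw near an H100's rated board power

gpu_hours = GPUS * DAYS * 24
energy_kwh = gpu_hours * ASSUMED_KW_PER_GPU

print(f"{gpu_hours:.0f} GPU-hours, about {energy_kwh:.0f} kWh at the assumed draw")
```

Multiplying `energy_kwh` by a facility overhead factor and a regional grid intensity would give a carbon estimate, but both depend on details the model card does not provide.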
Benchmark Reproducibility
The technical report includes comprehensive evaluations on standard benchmarks (AIME, GPQA, MATH, LiveCodeBench) and compares results against both open and proprietary models. Microsoft specifies the use of the 'simple-evals' framework for reproducibility and provides details on the prompting strategy (temperature=0.8, top_p=0.95). However, while the methodology is described, the full evaluation code and exact prompt sets for every benchmark are not always provided in a single, turn-key repository.
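To make the quoted sampling settings concrete, the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy logit vector. It is a generic illustration of the technique, not the evaluation harness itself; the function name and the logits are hypothetical.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token_index: probability} for the nucleus after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize over the kept tokens


# Hypothetical logits for a four-token vocabulary.
nucleus = top_p_filter([2.0, 1.0, 0.0, -5.0])
```

A sampler would then draw the next token from `nucleus`. Because sampling is stochastic at these settings, reproducing reported scores generally requires averaging over several runs.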
Identity Consistency
The model is designed with a specific 'reasoning' identity, producing structured outputs with clear reasoning traces. It identifies as a Microsoft Phi-family model and maintains version consistency across its different variants (Mini, Reasoning, Reasoning Plus). There are no documented instances of the model claiming to be a competitor's system (e.g., GPT-4), and its limitations regarding English-only support and math-specialization are clearly disclosed.
License Clarity
The model is released under the MIT license, which is a highly permissive, standard open-source license. The terms are clear, allowing for both commercial and non-commercial use, and there are no conflicting proprietary 'community licenses' often seen with other large providers. The weights are freely available for download on Hugging Face.
Hardware Footprint
VRAM requirements are well documented, with specific guidance for different hardware (e.g., roughly 28 GB for 16-bit weights, which fits on 2x RTX 4090 or 1x A100). Microsoft also provides optimized ONNX versions with 4-bit quantization (RTN) and documents the resulting performance/latency trade-offs. The impact of the 'Plus' variant's longer reasoning traces on latency is explicitly mentioned.
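The 28 GB figure follows from the parameter count and the bytes stored per weight. The sketch below reproduces it and adds a key-value cache estimate. The head dimension of 128 is an assumption (hidden size divided by head count), the cache is assumed to hold 16-bit values, and activations and framework overhead are ignored.

```python
PARAMS = 14e9

# Bytes stored per weight at each precision.
BYTES_PER_WEIGHT = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params, precision):
    """Memory for the weights alone, in decimal gigabytes."""
    return params * BYTES_PER_WEIGHT[precision] / 1e9


def kv_cache_gb(tokens, layers=40, kv_heads=10, head_dim=128, bytes_per_value=2):
    """Memory for cached keys and values; head_dim=128 is an assumption."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # keys + values
    return per_token * tokens / 1e9


for precision in BYTES_PER_WEIGHT:
    print(f"{precision}: {weight_memory_gb(PARAMS, precision):.1f} GB weights")

print(f"KV cache at 32,768 tokens: {kv_cache_gb(32768):.2f} GB")
```

The cache term grows linearly with sequence length, so the long traces this model produces add memory pressure on top of the weights.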
Versioning Drift
Microsoft uses clear naming conventions (Phi-4-reasoning vs. Phi-4-reasoning-plus) and maintains a changelog on the Hugging Face repository. However, as a relatively new release (April 2025), there is limited long-term data on how Microsoft handles silent updates or model drift over time. The model is currently described as 'static,' which aids transparency but requires monitoring for future 'silent' revisions.