Phi-4 Reasoning Plus

Parameters

14B

Context Length

33K

Modality

Text

Architecture

Dense

License

MIT

Release Date

30 Apr 2025

Knowledge Cutoff

Mar 2025

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

40

Key-Value Heads

10

Attention Head Dimension

128

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

500,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

5,120

Number of Layers

40

FFN Intermediate Size (Dense)

17,920

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

100,352

Architecture Diagram

Input Tokens → Token Embedding (Hidden: 5.1k · Context: 33K · Vocab: 100.4k · Position: RoPE) → 40 layers × [RMSNorm (Pre-Attention) → Grouped-Query Attention (40 Q / 10 KV heads, Head dim: 128) → residual add → RMSNorm (Pre-FFN) → Feed-Forward Network (SwiGLU, Intermediate: 17.9k) → residual add] → Final RMSNorm → Output Logits

Phi-4 Reasoning Plus

Phi-4 Reasoning Plus is a 14-billion parameter language model engineered by Microsoft to provide advanced chain-of-thought processing and high-precision logical inference. As an enhanced variant in the Phi-4 family, it is designed to handle sophisticated problem-solving across domains such as mathematics, scientific inquiry, and complex code generation. The model produces structured outputs that include an explicit reasoning trace followed by a final solution, facilitating transparency in its decision-making process. This design prioritizes output quality and depth for tasks where thoroughness is more critical than immediate response speed.
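The "reasoning trace followed by a final solution" structure can be separated in post-processing. The following is a minimal sketch, assuming the trace is wrapped in the `<think>` and `</think>` tokens named in the integrity report on this page; the function name `split_reasoning` and the sample string are hypothetical, not part of any Microsoft tooling.

```python
import re

# Assumes the reasoning trace is delimited by <think> ... </think>.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer) for one completion."""
    match = THINK_BLOCK.search(text)
    if match is None:
        # No delimiters found: treat the whole completion as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


# Hypothetical completion, for illustration only.
sample = "<think>2 + 2 is 4, then double it.</think>\nThe answer is 8."
trace, answer = split_reasoning(sample)
print("trace: ", trace)
print("answer:", answer)
```

Keeping the trace separate lets an application log or hide the reasoning while showing only the final solution.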

Technically, the model uses a dense, decoder-only Transformer architecture with grouped-query attention (40 query heads sharing 10 key-value heads). It incorporates Rotary Position Embeddings (RoPE) and an expanded context window of 32,768 tokens, allowing it to maintain coherence over the lengthy sequences often required for multi-step reasoning. Training follows a data-centric recipe: supervised fine-tuning (SFT) on over 1.4 million chain-of-thought traces, followed by reinforcement learning with the Group Relative Policy Optimization (GRPO) algorithm. The RL phase targets verifiable mathematical and logical problems, refining the model's ability to self-correct and explore alternative solutions.
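The central idea of GRPO is to score each sampled answer relative to the other answers drawn for the same prompt, rather than against a learned value function. The sketch below shows only that group-relative advantage step, assuming binary rewards from a verifier; it is an illustration of the general technique, not Microsoft's training code, and the function name, `eps` value, and reward list are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four completions for one math problem,
# rewarded 1.0 if the verifier accepts the final answer, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct completions receive positive advantages and incorrect ones negative, so the policy update favors answers that beat their own group's average.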

Phi-4 Reasoning Plus generates noticeably longer outputs than its siblings: the 'plus' variant produces roughly 50% more tokens on average than Phi-4-reasoning in order to give more exhaustive explanations. While this results in higher latency, it enables the model to rival the performance of much larger systems in specialized benchmarks. The model is released under the MIT license with open weights, making it accessible for deployment on consumer-grade hardware and local environments where computational resources are constrained but high-fidelity reasoning is required.

About Phi-4

The Microsoft Phi-4 model family comprises small language models prioritizing efficient, high-capability reasoning. Its development emphasizes robust data quality and sophisticated synthetic data integration. This approach enables enhanced performance and on-device deployment capabilities.


Evaluation Benchmarks

Rank

#149

Benchmark | Category | Score | Rank
MMLU Pro | Professional Knowledge | 0.76 | 60

Rankings

Overall Rank

#149

Coding Rank

-

Model Integrity

Total Score

B+

80 / 100

Phi-4 Reasoning Plus Model Integrity Report

Audit Note

Phi-4 Reasoning Plus demonstrates a high level of transparency for an open-weight model, particularly regarding its architectural lineage, compute resources, and licensing. While specific pre-training data ratios remain somewhat generalized, the disclosure of synthetic data generation methods and 'teacher' models is exemplary. The model's clear identity and well-documented hardware requirements make it a highly verifiable system for researchers and developers.

Upstream

23.5 / 30

Architectural Provenance

8.0 / 10

Microsoft provides high transparency regarding the model's lineage, explicitly identifying it as a fine-tuned variant of the Phi-4 base model. The architecture is documented as a dense, decoder-only Transformer with 14 billion parameters, using grouped-query attention (40 query heads, 10 key-value heads) and Rotary Position Embeddings (RoPE). The technical report details the transition from the base Phi-4 to the 'Reasoning' and 'Reasoning Plus' variants, including the specific addition of <think> and </think> tokens and the expansion of the context window to 32,768 tokens. The use of Group Relative Policy Optimization (GRPO) for the reinforcement learning phase is also clearly stated.

Dataset Composition

6.5 / 10

The training methodology is described as 'data-centric,' with specific disclosures about the SFT phase using 1.4 million chain-of-thought traces generated via o3-mini and the RL phase using ~6,000 high-quality math problems. While the report mentions a blend of synthetic data and filtered public domain data (web, code, STEM), it lacks a precise percentage breakdown of the pre-training data composition (e.g., exact ratios of web vs. books vs. code). However, the disclosure of the specific 'teacher' models (o3-mini, DeepSeek-R1 for the mini variant) and the data-cleansing methodology (decontamination in Appendix B) provides better-than-average transparency.

Tokenizer Integrity

9.0 / 10

The model uses the tiktoken tokenizer with a stated vocabulary size of 100,352 tokens. Documentation confirms the inclusion of specific reserved tokens for reasoning traces (<think>, </think>). The tokenizer is publicly accessible via the Hugging Face repository and is integrated into the standard 'transformers' library (version 4.51.3+), allowing for direct verification of tokenization behavior and alignment with the claimed language support (primarily English).

Model

32.0 / 40

Parameter Density

8.5 / 10

The parameter count is consistently stated as 14 billion. As a dense architecture, the active parameters equal the total parameters, which is explicitly confirmed in technical documentation to avoid MoE-related confusion. Detailed architectural specs, including the context window (32K) and the specific modifications from the Phi-3/Phi-4 base, are well-documented in the technical report.
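The 14-billion figure can be cross-checked against the dimensions listed in the Technical Specifications above. This is a rough sketch under stated assumptions: no bias terms, separate (untied) input and output embedding matrices, and a three-matrix SwiGLU feed-forward block. Those assumptions are not stated on this page.

```python
# Values taken from the Technical Specifications on this page.
hidden = 5_120
layers = 40
q_heads = 40
kv_heads = 10
ffn = 17_920
vocab = 100_352

head_dim = hidden // q_heads        # 128
kv_width = kv_heads * head_dim      # 1,280: K and V are narrower than Q

# Attention: Q and O projections are hidden x hidden, K and V are hidden x kv_width.
attn = hidden * (hidden + 2 * kv_width + hidden)
# SwiGLU feed-forward: gate, up, and down projections (assumed, no biases).
mlp = 3 * hidden * ffn
# Two RMSNorm weight vectors per layer.
norms = 2 * hidden
per_layer = attn + mlp + norms

# ASSUMPTION: input embedding and output head are separate matrices.
embeddings = 2 * vocab * hidden
final_norm = hidden

total = layers * per_layer + embeddings + final_norm
print(f"per layer: {per_layer:,}")
print(f"total:     {total:,} (~{total / 1e9:.2f}B)")
```

Under these assumptions the estimate comes to about 14.66 billion, consistent with the rounded 14B figure.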

Training Compute

7.5 / 10

Microsoft provides specific hardware and compute metrics for the reasoning variants: the training used 32 H100-80G GPUs over a duration of 2.5 days. While the carbon footprint is not explicitly calculated in the model card, the disclosure of GPU count, hardware type, and training duration allows for independent estimation. This level of detail is significantly higher than the industry standard for 'open-weight' models.
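As an illustration of the independent estimation mentioned above, the disclosed figures convert directly into GPU-hours. The energy line is only a rough upper-bound sketch: the 700 W per-GPU draw is an assumption (the rated maximum of an H100 SXM board), not a number from the model card, and it ignores host, networking, and cooling overhead.

```python
# Disclosed in the model card, as quoted on this page.
gpus = 32
days = 2.5

gpu_hours = gpus * days * 24
print(f"GPU-hours: {gpu_hours:,.0f}")

# ASSUMPTION: every GPU draws its full rated 700 W for the whole run.
assumed_watts_per_gpu = 700
energy_kwh = gpu_hours * assumed_watts_per_gpu / 1000
print(f"GPU energy (upper-bound sketch): {energy_kwh:,.0f} kWh")
```

Converting the energy figure to a carbon estimate would additionally require a grid carbon-intensity value, which the page does not provide.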

Benchmark Reproducibility

7.0 / 10

The technical report includes comprehensive evaluations on standard benchmarks (AIME, GPQA, MATH, LiveCodeBench) and compares results against both open and proprietary models. Microsoft specifies the use of the 'simple-evals' framework for reproducibility and provides details on the prompting strategy (temperature=0.8, top_p=0.95). However, while the methodology is described, the full evaluation code and exact prompt sets for every benchmark are not always provided in a single, turn-key repository.
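To make the reported decoding settings concrete, the sketch below applies temperature scaling and top-p (nucleus) filtering to a toy list of logits using only the standard library. It illustrates what `temperature=0.8` and `top_p=0.95` do to a next-token distribution; it is not the evaluation harness itself, and the function name and logits are hypothetical.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=None):
    """Sample one token index using temperature scaling and nucleus filtering."""
    rng = rng or random.Random()
    # Temperature below 1 sharpens the distribution before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    norm = sum(exps)
    probs = [e / norm for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample among the kept tokens in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Hypothetical logits for a five-token vocabulary.
toy_logits = [2.0, 1.5, 0.5, -1.0, -3.0]
print(sample_top_p(toy_logits, rng=random.Random(0)))
```

Because sampling at these settings is stochastic, reproducing reported scores generally requires averaging over several runs rather than a single pass.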

Identity Consistency

9.0 / 10

The model is designed with a specific 'reasoning' identity, producing structured outputs with clear reasoning traces. It identifies as a Microsoft Phi-family model and maintains version consistency across its different variants (Mini, Reasoning, Reasoning Plus). There are no documented instances of the model claiming to be a competitor's system (e.g., GPT-4), and its limitations regarding English-only support and math-specialization are clearly disclosed.

Downstream

24.0 / 30

License Clarity

10.0 / 10

The model is released under the MIT license, which is a highly permissive, standard open-source license. The terms are clear, allowing for both commercial and non-commercial use, and there are no conflicting proprietary 'community licenses' often seen with other large providers. The weights are freely available for download on Hugging Face.

Hardware Footprint

8.0 / 10

VRAM requirements are well-documented, with specific guidance provided for different hardware (e.g., about 28 GB for unquantized 16-bit weights, which fits on 2x RTX 4090 or 1x A100). Microsoft also provides optimized ONNX versions with 4-bit quantization (RTN) and documents the resulting performance/latency trade-offs. The impact of the 'Plus' variant's longer reasoning traces on latency is explicitly mentioned.
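The 28 GB figure is consistent with 14 billion parameters stored at 2 bytes each. A rough sketch of how weight precision and context length drive memory follows, using the layer count, key-value head count, and head dimension listed on this page. It assumes a 16-bit KV cache and ignores activations, framework overhead, and quantization metadata, so real usage will be higher.

```python
# Nominal parameter count and attention shape from this page.
params = 14e9
layers, kv_heads, head_dim = 40, 10, 128

# Bytes per weight for a few common storage precisions.
bytes_per_weight = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def weight_gb(precision: str) -> float:
    """Approximate weight memory in decimal gigabytes."""
    return params * bytes_per_weight[precision] / 1e9


def kv_cache_bytes(tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size: one K and one V vector per layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


for precision in bytes_per_weight:
    print(f"weights at {precision}: {weight_gb(precision):.1f} GB")

for tokens in (1_024, 16_384, 32_768):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"KV cache at {tokens:>6} tokens: {gib:.2f} GiB")
```

Because only 10 key-value heads are cached instead of 40, the cache is a quarter of the size it would be if every query head had its own key-value head, which matters for the long reasoning traces this model produces.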

Versioning Drift

6.0 / 10

Microsoft uses clear naming conventions (Phi-4-reasoning vs. Phi-4-reasoning-plus) and maintains a changelog on the Hugging Face repository. However, as a relatively new release (April 2025), there is limited long-term data on how Microsoft handles silent updates or model drift over time. The model is currently described as 'static,' which aids transparency but requires monitoring for future 'silent' revisions.

Phi-4 Reasoning Plus: Specifications and GPU VRAM Requirements