
GPT-OSS 120B

Total Parameters

117B

Context Length

128K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

5 Aug 2025

Knowledge Cutoff

Jun 2024

Technical Specifications

Active Parameters (per token)

5.1B

Number of Experts

128

Active Experts

4

Attention Structure

Grouped-Query Attention (GQA)

Hidden Dimension Size

2880

Number of Layers

36

Attention Heads

64

Key-Value Heads

8

Activation Function

SwiGLU

Normalization

RMS Normalization

Position Embedding

Rotary Position Embedding (RoPE)

GPT-OSS 120B

GPT-OSS 120B is a large open-weight model from OpenAI, designed to run in data centers as well as on high-end desktops and laptops. It targets advanced reasoning, agentic tasks, and diverse developer use cases, and is text-only in both input and output.

About GPT-OSS

Open-weight language models from OpenAI.



Evaluation Benchmarks

Rank

#78

Category                 Benchmark           Score   Rank
—                        —                   0.98    🥇 1
General Knowledge        MMLU                0.90    🥈 2
—                        —                   0.42    6
Professional Knowledge   MMLU Pro            0.81    11
Graduate-Level QA        GPQA                0.80    17
—                        —                   0.69    26
Web Development          WebDev Arena        1354    28
Agentic Coding           LiveBench Agentic   0.17    34
—                        —                   0.39    37
—                        —                   0.60    41
—                        —                   0.57    43

Rankings

Overall Rank

#78

Coding Rank

#78

Model Transparency

GPT-OSS 120B Transparency Report

Total Score: 67 / 100 (Grade: B)

Audit Note

GPT-OSS 120B demonstrates a significant shift toward transparency for its provider, offering a permissive license and clear architectural specifications regarding its Mixture-of-Experts design. While hardware requirements and tokenization are exceptionally well-documented, the model suffers from a total lack of transparency regarding its training data composition and compute resources. It represents a high-quality open-weight release that remains opaque in its upstream development process.

Upstream

19.0 / 30

Architectural Provenance

7.0 / 10

The model is explicitly identified as a Transformer-based Mixture-of-Experts (MoE) architecture with 36 layers and 128 experts. Documentation specifies the use of SwiGLU activations, Rotary Positional Embeddings (RoPE), and alternating dense and locally banded sparse attention patterns. While the high-level methodology is described in the official model card and technical reports, the specific pretraining procedure and exact architectural modifications are not fully detailed to the level of a peer-reviewed paper.
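The documented SwiGLU activation gates the feed-forward block with a SiLU-weighted projection. A minimal NumPy sketch of that gated unit, with toy dimensions (the real model's hidden size is 2880; its FFN width is not assumed here):

```python
import numpy as np

def silu(x):
    # SiLU (swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward: down-project silu(x @ W_gate) * (x @ W_up)."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16  # illustrative sizes only
x = rng.normal(size=(1, d_model))
W_gate = rng.normal(size=(d_model, d_ff))
W_up = rng.normal(size=(d_model, d_ff))
W_down = rng.normal(size=(d_ff, d_model))
y = swiglu_ffn(x, W_gate, W_up, W_down)  # shape (1, d_model)
```

The gate halves the usable FFN width per parameter but empirically improves quality, which is why SwiGLU has displaced plain GELU/ReLU FFNs in recent transformers.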

Dataset Composition

3.0 / 10

OpenAI has been notably circumspect regarding the training data. Official documentation only mentions a 'diverse corpus' of publicly available texts, including books, academic articles, and websites, with an emphasis on STEM and coding. There is no public breakdown of dataset proportions (e.g., web vs. code percentages), no disclosure of specific data sources, and no detailed filtering or cleaning methodology provided.

Tokenizer Integrity

9.0 / 10

The model uses the 'o200k_base' (or 'o200k_harmony') tokenizer, which is publicly available via the tiktoken library. The vocabulary size is precisely stated as 201,088 tokens. It is documented as a fast BPE implementation optimized for multilingual text and code, and its performance has been independently verified by third-party benchmarks showing high token efficiency.

Model

24.0 / 40

Parameter Density

8.0 / 10

The model's parameter counts are clearly disclosed: 117 billion total parameters with 5.1 billion active parameters per token. The MoE structure is well-documented, specifying 128 experts with a Top-4 routing mechanism. The impact of MXFP4 quantization on the MoE layers is also explicitly stated, explaining how the model fits into 80GB of VRAM.
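The documented Top-4-of-128 routing can be sketched as follows; renormalizing the selected experts' softmax weights is an assumption (common MoE practice), not something the model card confirms:

```python
import numpy as np

def top_k_route(router_logits, k=4):
    """Pick the k highest-scoring experts for one token and renormalize
    their softmax weights to sum to 1 (renormalization is assumed)."""
    top = np.argsort(router_logits)[-k:][::-1]   # k largest logits, descending
    w = np.exp(router_logits[top] - router_logits[top].max())
    return top, w / w.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=128)        # one token's router scores over 128 experts
experts, weights = top_k_route(logits, k=4)
```

Because only 4 of 128 experts run per token, the per-token compute tracks the 5.1B active parameters rather than the 117B total.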

Training Compute

2.0 / 10

There is almost no transparency regarding the compute resources used for training. While the model is optimized for inference on H100 GPUs, OpenAI has not disclosed the total GPU/TPU hours, hardware specifications of the training cluster, training duration, or the carbon footprint associated with the model's development.

Benchmark Reproducibility

5.0 / 10

OpenAI provides results for standard benchmarks (MMLU, GPQA, HumanEval) and specifies the use of 'high' reasoning effort for these tests. However, the exact evaluation code and full prompt sets used for internal scoring are not public. While third-party verification exists on platforms like OpenRouter and Hugging Face, the lack of official reproduction scripts and detailed few-shot examples limits full reproducibility.
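To make concrete what a reproduction script would have to pin down, here is a hypothetical sketch of MMLU-style multiple-choice scoring; `ask_model` stands in for any inference call and is not a real API, and the prompt format and first-character answer-extraction rule are illustrative choices, exactly the kind of detail official scripts would need to fix:

```python
def score_multiple_choice(items, ask_model):
    """items: list of (question, choices, correct_letter). Returns accuracy."""
    correct = 0
    for question, choices, answer in items:
        prompt = question + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in zip("ABCD", choices)
        ) + "\nAnswer:"
        reply = ask_model(prompt).strip()
        if reply[:1].upper() == answer:  # extraction rule: first character only
            correct += 1
    return correct / len(items)

# Toy check with a stub "model" that always answers B.
items = [("2 + 2 = ?", ["3", "4", "5", "6"], "B"),
         ("Capital of France?", ["Paris", "Rome", "Bonn", "Oslo"], "A")]
acc = score_multiple_choice(items, lambda prompt: "B")  # 1 of 2 correct
```

Small differences in any of these choices (choice ordering, few-shot examples, extraction regex) can shift reported scores, which is why published numbers without the harness are hard to reproduce exactly.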

Identity Consistency

9.0 / 10

The model consistently identifies itself as GPT-OSS 120B and maintains a clear versioning identity. It is transparent about its nature as an open-weight reasoning model and its relationship to the 'harmony' prompt format. There are no documented cases of the model claiming to be a competitor's product or misrepresenting its core capabilities.

Downstream

23.5 / 30

License Clarity

10.0 / 10

The model is released under the Apache 2.0 license, which is a standard, highly permissive open-source license. The terms are clear, allowing for commercial use, modification, and distribution without conflicting proprietary restrictions. This is verified across the official blog, GitHub repository, and Hugging Face model card.

Hardware Footprint

8.5 / 10

Hardware requirements are exceptionally well-documented. OpenAI and third parties provide specific VRAM targets: 80GB for the MXFP4 quantized version (fitting a single H100) and significantly higher for FP16. Documentation includes guidance for various quantization levels and multi-GPU setups, and these claims have been verified by community deployment reports.
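The 80 GB figure follows from simple arithmetic. A rough sketch, assuming MXFP4 costs about 4.25 bits per parameter (4-bit values plus per-block shared scales) and, simplifying, applying that rate to all 117B parameters even though in practice only the MoE weights are quantized:

```python
def weight_vram_gb(n_params, bits_per_param):
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

total_params = 117e9
# MXFP4: 4-bit mantissas plus shared block scales, roughly 4.25 bits/param.
mxfp4 = weight_vram_gb(total_params, 4.25)  # ~62 GB: fits one 80 GB H100
bf16 = weight_vram_gb(total_params, 16)     # ~234 GB: needs multiple GPUs
```

The ~18 GB of headroom on an 80 GB card is what the KV cache and activations must fit into at long context lengths.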

Versioning Drift

5.0 / 10

While the model has a clear initial version and the weights are hosted on Hugging Face with commit history, there is no established long-term changelog or formal semantic versioning policy for future updates. The model is relatively new, and while the 'harmony' format provides some structure, a robust system for tracking and notifying users of behavioral drift is not yet evident.

GPU Requirements

The page's interactive calculator estimates VRAM required for a chosen weight quantization and context size (1K to 125K tokens) and suggests matching GPUs.

GPT-OSS 120B: Specifications and GPU VRAM Requirements