| Specification | Value |
|---|---|
| Total Parameters | 117B |
| Active Parameters | 5.1B |
| Context Length | 128K |
| Modality | Text |
| Architecture | Mixture of Experts (MoE) |
| License | Apache 2.0 |
| Release Date | 5 Aug 2025 |
| Knowledge Cutoff | Jun 2024 |
| Number of Experts | 128 |
| Active Experts | 4 |
| Attention Structure | Grouped Query Attention |
| Hidden Dimension Size | 2880 |
| Number of Layers | 36 |
| Attention Heads | - |
| Key-Value Heads | - |
| Activation Function | SwiGLU |
| Normalization | RMS Normalization |
| Position Embedding | Rotary Position Embedding (RoPE) |
GPT-OSS 120B is a large open-weight model from OpenAI, designed to run in data centers and on high-end desktops and laptops. It supports advanced reasoning, agentic tasks, and diverse developer use cases, and it is text-only in both input and output.
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| Summarization | ProLLM Summarization | 0.98 | 🥇 1 |
| General Knowledge | MMLU | 0.90 | 🥈 2 |
| Coding | Aider Coding | 0.42 | 6 |
| Professional Knowledge | MMLU Pro | 0.81 | 11 |
| Graduate-Level QA | GPQA | 0.80 | 17 |
| Mathematics | LiveBench Mathematics | 0.69 | 26 |
| Web Development | WebDev Arena | 1354 | 28 |
| Agentic Coding | LiveBench Agentic | 0.17 | 34 |
| Reasoning | LiveBench Reasoning | 0.39 | 37 |
| Coding | LiveBench Coding | 0.60 | 41 |
| Data Analysis | LiveBench Data Analysis | 0.57 | 43 |
- Overall Rank: #78
- Coding Rank: #78
- Total Score: 67/100
GPT-OSS 120B demonstrates a significant shift toward transparency for its provider, offering a permissive license and clear architectural specifications for its Mixture-of-Experts design. While hardware requirements and tokenization are exceptionally well documented, the model discloses almost nothing about its training data composition and compute resources. It is a high-quality open-weight release that remains opaque in its upstream development process.
Architectural Provenance
The model is explicitly identified as a Transformer-based Mixture-of-Experts (MoE) architecture with 36 layers and 128 experts. Documentation specifies the use of SwiGLU activations, Rotary Positional Embeddings (RoPE), and alternating dense and locally banded sparse attention patterns. While the high-level methodology is described in the official model card and technical reports, the specific pretraining procedure and exact architectural modifications are not fully detailed to the level of a peer-reviewed paper.
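The routing scheme described above (128 experts, 4 active per token) can be sketched as a softmax gate that keeps only the top-4 expert scores per token. This is a simplified illustration of top-k MoE routing, not OpenAI's implementation; the gating details are assumptions, and only the layer width (2880) and expert counts come from the spec table.

```python
import numpy as np

def moe_route(x, gate_w, num_active=4):
    """Top-k MoE routing sketch: pick the highest-scoring experts
    and renormalize their softmax weights. Illustrative only."""
    logits = x @ gate_w                      # (num_experts,) router scores
    top = np.argsort(logits)[-num_active:]   # indices of the 4 best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                             # weights over the active experts
    return top, w

rng = np.random.default_rng(0)
hidden, experts = 2880, 128                  # sizes from the spec table
x = rng.standard_normal(hidden)
gate_w = rng.standard_normal((hidden, experts))
idx, w = moe_route(x, gate_w)
print(len(idx), round(float(w.sum()), 6))    # 4 active experts, weights sum to 1
```

Each token's output is then the weighted sum of the 4 selected experts' outputs, which is why only ~5.1B of the 117B parameters are exercised per token.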
Dataset Composition
OpenAI has been notably circumspect regarding the training data. Official documentation only mentions a 'diverse corpus' of publicly available texts, including books, academic articles, and websites, with an emphasis on STEM and coding. There is no public breakdown of dataset proportions (e.g., web vs. code percentages), no disclosure of specific data sources, and no detailed filtering or cleaning methodology provided.
Tokenizer Integrity
The model uses the 'o200k_base' (or 'o200k_harmony') tokenizer, which is publicly available via the tiktoken library. The vocabulary size is precisely stated as 201,088 tokens. It is documented as a fast BPE implementation optimized for multilingual text and code, and its performance has been independently verified by third-party benchmarks showing high token efficiency.
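The o200k_base vocabulary itself ships with the tiktoken library, but the principle behind it, byte-pair encoding, can be shown with a toy merge loop. The corpus and merge count below are made up for illustration; real tokenizers learn on the order of 200k merges over raw bytes.

```python
from collections import Counter

def bpe_train(word, num_merges):
    """Toy byte-pair encoding: repeatedly merge the most frequent
    adjacent symbol pair into a single token. One-word illustration
    of the scheme behind BPE tokenizers such as o200k_base."""
    symbols = list(word)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)   # replace the pair with its merged token
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols, merges

toks, merges = bpe_train("aaabdaaabac", 3)
print(toks)  # fewer, longer tokens covering the same string
```

Frequent sequences collapse into single tokens, which is what makes a 201,088-entry vocabulary token-efficient on multilingual text and code.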
Parameter Density
The model's parameter counts are clearly disclosed: 117 billion total parameters with 5.1 billion active parameters per token. The MoE structure is well-documented, specifying 128 experts with a Top-4 routing mechanism. The impact of MXFP4 quantization on the MoE layers is also explicitly stated, explaining how the model fits into 80GB of VRAM.
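The 80GB claim can be sanity-checked with rough arithmetic. The ~4.25 bits/parameter figure for MXFP4 (4-bit values plus shared block scales) and the assumption that essentially all weights are quantized are approximations for illustration, not published numbers.

```python
def weight_gb(params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

TOTAL = 117e9                      # total parameters from the spec table
fp16_gb = weight_gb(TOTAL, 16)     # ~234 GB: needs a multi-GPU setup
mxfp4_gb = weight_gb(TOTAL, 4.25)  # ~62 GB: fits a single 80 GB H100
print(round(fp16_gb), round(mxfp4_gb))  # → 234 62
```

The ~62GB estimate leaves headroom on an 80GB card for the KV cache and activations, consistent with the single-H100 deployment target.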
Training Compute
There is almost no transparency regarding the compute resources used for training. While the model is optimized for inference on H100 GPUs, OpenAI has not disclosed the total GPU/TPU hours, hardware specifications of the training cluster, training duration, or the carbon footprint associated with the model's development.
Benchmark Reproducibility
OpenAI provides results for standard benchmarks (MMLU, GPQA, HumanEval) and specifies the use of 'high' reasoning effort for these tests. However, the exact evaluation code and full prompt sets used for internal scoring are not public. While third-party verification exists on platforms like OpenRouter and Hugging Face, the lack of official reproduction scripts and detailed few-shot examples limits full reproducibility.
Identity Consistency
The model consistently identifies itself as GPT-OSS 120B and maintains a clear versioning identity. It is transparent about its nature as an open-weight reasoning model and its relationship to the 'harmony' prompt format. There are no documented cases of the model claiming to be a competitor's product or misrepresenting its core capabilities.
License Clarity
The model is released under the Apache 2.0 license, which is a standard, highly permissive open-source license. The terms are clear, allowing for commercial use, modification, and distribution without conflicting proprietary restrictions. This is verified across the official blog, GitHub repository, and Hugging Face model card.
Hardware Footprint
Hardware requirements are exceptionally well-documented. OpenAI and third parties provide specific VRAM targets: 80GB for the MXFP4 quantized version (fitting a single H100) and significantly higher for FP16. Documentation includes guidance for various quantization levels and multi-GPU setups, and these claims have been verified by community deployment reports.
Versioning Drift
While the model has a clear initial version and the weights are hosted on Hugging Face with commit history, there is no established long-term changelog or formal semantic versioning policy for future updates. The model is relatively new, and while the 'harmony' format provides some structure, a robust system for tracking and notifying users of behavioral drift is not yet evident.