
GPT-OSS 20B

Total Parameters

21B

Context Length

128K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

5 Aug 2025

Knowledge Cutoff

Jun 2024

Technical Specifications

Active Parameters per Token

3.6B

Number of Experts

32

Active Experts

4

Attention Structure

Grouped-Query Attention (GQA)

Hidden Dimension Size

2880

Number of Layers

24

Attention Heads

64

Key-Value Heads

8

Activation Function

SwiGLU

Normalization

RMS Normalization

Position Embedding

Rotary Position Embedding (RoPE)

GPT-OSS 20B

GPT-OSS 20B is a text-based language model developed by OpenAI, specifically engineered to deliver high-performance reasoning on consumer-grade hardware. As part of the GPT-OSS family, this model balances computational efficiency with complex task execution, utilizing a sparse architecture to maintain a low memory footprint. It is designed to function as a flexible component in local and enterprise environments, where data privacy and low-latency response times are critical requirements.

The model utilizes a Mixture-of-Experts (MoE) transformer architecture consisting of 24 layers. While the total parameter count is 21 billion, the system only activates 3.6 billion parameters per token during the forward pass. This sparsity is achieved through a routing mechanism that selects four active experts from a pool of 32 for each token. The architecture incorporates several modern optimizations, including SwiGLU activation functions, Root Mean Square (RMS) normalization, and Grouped-Query Attention (GQA) with eight key-value heads to optimize memory throughput. It also supports a native context window of 128,000 tokens using Rotary Positional Embeddings (RoPE).
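The top-4-of-32 routing described above can be sketched in a few lines. This is a minimal, dependency-free illustration (toy random weights, softmax gating over the selected experts), not OpenAI's implementation:

```python
import math, random

def moe_forward(x, router, experts, k=4):
    """Route one token's hidden state through the top-k of len(experts) experts.

    x: hidden-state vector (list of floats); router: one weight vector per
    expert; experts: callables mapping a vector to a vector.
    """
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    logits = [dot(w, x) for w in router]                    # router score per expert
    top = sorted(range(len(experts)), key=lambda i: logits[i])[-k:]
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    gates = [e / sum(exps) for e in exps]                   # softmax over the k selected
    outs = [experts[i](x) for i in top]                     # only k experts ever run
    return [sum(g * o[j] for g, o in zip(gates, outs)) for j in range(len(x))]

# Toy usage: 32 experts, only 4 run per token.
random.seed(0)
d, n = 8, 32
mats = [[[random.gauss(0, 1) / d for _ in range(d)] for _ in range(d)] for _ in range(n)]
experts = [lambda v, M=M: [sum(m * vi for m, vi in zip(row, v)) for row in M] for M in mats]
router = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
out = moe_forward([random.gauss(0, 1) for _ in range(d)], router, experts)
```

The compute saving is the point of the design: the router's cost is negligible, so per-token FLOPs scale with the 4 selected experts rather than all 32, which is how 21B total parameters yield only 3.6B active per token.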

Functionally, GPT-OSS 20B is optimized for agentic workflows and complex reasoning tasks. It supports features such as native tool use, function calling, and a configurable reasoning effort system that allows developers to adjust the model's processing depth based on the specific latency needs of the application. The model is trained using a specialized response format to facilitate consistent structured outputs and long-form chain-of-thought reasoning, making it suitable for scientific analysis, code generation, and specialized technical assistance on local devices.
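The reasoning-effort knob is typically surfaced through the system message when the model is served behind a chat API. The sketch below builds such a request; the model name, the "Reasoning: <level>" system-message convention, and the serving setup are assumptions for illustration, not a definitive client:

```python
import json

def build_chat_request(prompt: str, effort: str = "medium") -> dict:
    """Build an OpenAI-compatible chat request for a locally served gpt-oss-20b.

    The reasoning-effort setting ("low" | "medium" | "high") is passed via the
    system message; adjust to match your serving stack's conventions.
    """
    assert effort in {"low", "medium", "high"}
    return {
        "model": "gpt-oss-20b",  # hypothetical local model name
        "messages": [
            {"role": "system", "content": f"Reasoning: {effort}"},
            {"role": "user", "content": prompt},
        ],
    }

payload = build_chat_request("Summarize RMS normalization.", effort="high")
body = json.dumps(payload)  # ready to POST to a /v1/chat/completions endpoint
```

Lower effort trades chain-of-thought depth for latency, which is why the setting is worth exposing as a per-request parameter rather than a deployment-wide constant.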

About GPT-OSS

Open-weight language models from OpenAI.



Evaluation Benchmarks

Rank

#70

Benchmark (Category)              Score   Rank
(unnamed)                          0.86      6
MMLU (General Knowledge)           0.85     11
WebDev Arena (Web Development)     1317     38

Rankings

Overall Rank

#70

Coding Rank

#51

Model Transparency

Total Score: 67 / 100 (Grade B)

GPT-OSS 20B Transparency Report

Audit Note

GPT-OSS 20B exhibits a bifurcated transparency profile, offering industry-leading clarity on architecture, licensing, and hardware requirements while remaining almost entirely opaque regarding its training data and compute resources. The model's technical documentation for inference and quantization is exemplary, yet the lack of dataset provenance and training metrics prevents a full understanding of its developmental lifecycle. Its commitment to a permissive Apache 2.0 license and open weights marks a significant shift in transparency for the provider, though evaluation reproducibility remains hampered by undisclosed prompting strategies.

Upstream

20.0 / 30

Architectural Provenance

8.0 / 10

The model architecture is extensively documented in the official model card and technical reports. It is a Mixture-of-Experts (MoE) Transformer with 24 layers, utilizing 32 experts with 4 active per token. Specific technical optimizations are disclosed, including SwiGLU activation functions (with clamping and residual connections), Grouped-Query Attention (GQA) with 8 key-value heads, and Rotary Positional Embeddings (RoPE). The model supports a native context window of 128k tokens. While the training methodology is described as a mix of reinforcement learning and advanced pre-training, the exact 'from scratch' vs. 'fine-tuned' lineage of the base weights is slightly obscured by references to 'internal frontier systems,' preventing a perfect score.

Dataset Composition

3.0 / 10

Transparency regarding the training data is minimal. Official documentation states the model was trained on a 'mostly English, text-only dataset' with a focus on STEM, coding, and general knowledge. However, there is no public breakdown of data sources, no specific proportions (e.g., web vs. books vs. code), and no detailed disclosure of filtering or cleaning methodologies. Third-party reports explicitly list training data collection and labeling as 'undisclosed.'

Tokenizer Integrity

9.0 / 10

The model uses the 'o200k_harmony' tokenizer, which is a BPE-style tokenizer with a vocabulary size of approximately 200,000 tokens. This tokenizer is publicly available via the 'tiktoken' library and is documented as a superset of the tokenizer used in GPT-4o. It is well-integrated into the 'openai-harmony' package and reference implementations, allowing for full verification of tokenization behavior and language support alignment.

Model

23.0 / 40

Parameter Density

9.0 / 10

OpenAI provides precise figures for both total and active parameters. The model has 21.1 billion total parameters, with 3.6 billion active parameters per token. The MoE structure is clearly defined (32 experts, top-4 routing). Additionally, the use of MXFP4 quantization for MoE weights is explicitly documented, including how weights are packed and scaled, which provides high transparency into the model's density and memory efficiency.
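A back-of-envelope calculation shows why MXFP4 matters for the memory budget. The sketch below assumes the standard microscaling block layout (32 4-bit elements sharing one 8-bit scale, about 4.25 bits per parameter) and applies it to the full 21.1B parameters as a rough upper bound:

```python
# Rough weight-memory estimate: MXFP4 (~4.25 bits/param) vs. bf16 (16 bits/param).
# Block size of 32 with an 8-bit shared scale is an assumption of this sketch.
TOTAL_PARAMS = 21.1e9
BITS_PER_PARAM_MXFP4 = 4 + 8 / 32      # 4-bit element + amortized 8-bit block scale
BITS_PER_PARAM_BF16 = 16

gib = lambda bits: bits / 8 / 2**30    # bits -> GiB
mxfp4_gb = gib(TOTAL_PARAMS * BITS_PER_PARAM_MXFP4)
bf16_gb = gib(TOTAL_PARAMS * BITS_PER_PARAM_BF16)
print(f"MXFP4 ~ {mxfp4_gb:.1f} GiB vs bf16 ~ {bf16_gb:.1f} GiB")
```

Roughly 10 GiB of quantized weights versus nearly 40 GiB in bf16, which is consistent with the documented 16 GB VRAM target once activations and KV cache are added.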

Training Compute

1.0 / 10

There is almost no verifiable information regarding the compute resources used for training. While the hardware requirements for inference are well-documented, the training duration, GPU/TPU hours, hardware specifications used during training, and the resulting carbon footprint are entirely absent from public documentation. Claims of 'advanced pre-training' serve as marketing language rather than technical disclosure.

Benchmark Reproducibility

4.0 / 10

While OpenAI provides results for several benchmarks (MMLU, HumanEval, Tau-Bench, HealthBench), the evaluation methodology lacks full transparency. Evaluation code is not fully public, and exact prompts or few-shot examples used for official scores are not consistently disclosed. Third-party audits (e.g., arXiv:2508.17525) have noted discrepancies where the 20B model outperforms the 120B variant, suggesting non-monotonic scaling or evaluation inconsistencies that are not addressed in official docs. A -2 penalty was applied due to industry-wide concerns regarding training data overlap with common coding benchmarks like SWE-bench.

Identity Consistency

9.0 / 10

The model demonstrates strong identity consistency, correctly identifying itself as part of the GPT-OSS family in standard deployments. It is designed to use the 'Harmony' response format, which includes specific roles (System, Developer, User, Assistant, Tool) that help maintain its persona and operational boundaries. There are no documented cases of the model claiming to be a competitor's product or denying its AI nature.
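A simplified rendering of the Harmony role structure can make the format concrete. The special-token spelling below follows the published format, but this is a stripped-down sketch; the real openai-harmony package also handles channels, tool calls, and tokenization:

```python
# Minimal sketch of a Harmony-style conversation string (simplified; the
# openai-harmony package is the authoritative renderer).
def render_harmony(messages):
    return "".join(
        f"<|start|>{role}<|message|>{content}<|end|>" for role, content in messages
    )

convo = render_harmony([
    ("system", "Reasoning: medium"),
    ("user", "What is GQA?"),
])
```

Keeping distinct System, Developer, User, Assistant, and Tool roles in the rendered sequence is what lets the model maintain its persona and operational boundaries across turns.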

Downstream

24.0 / 30

License Clarity

10.0 / 10

The model is released under the Apache 2.0 license, which is a standard, highly permissive open-source license. This allows for commercial use, modification, and distribution without the 'copyleft' restrictions found in other licenses. The terms are clear, publicly accessible on GitHub and Hugging Face, and do not conflict with the model's stated 'open-weight' status.

Hardware Footprint

9.0 / 10

Hardware requirements are exceptionally well-documented. OpenAI and third-party partners (NVIDIA, Unsloth) provide specific VRAM targets: 16GB for the 20B model using native MXFP4 quantization. Performance metrics for different hardware (Mac, H100, consumer GPUs) and token-per-second estimates are widely available. The impact of the 'reasoning effort' setting on latency is also clearly explained.

Versioning Drift

5.0 / 10

The model supports 'snapshots' to lock in specific versions, and a basic versioning system is in place. However, there is no comprehensive public changelog or detailed history of weight updates since the initial release. While the 'openai-harmony' package is versioned, the underlying model weights lack the granular semantic versioning required for a higher score, making it difficult to track silent drift or minor optimizations.

GPU Requirements

The 20B model runs within 16 GB of VRAM using its native MXFP4 quantization; actual VRAM use grows with context length, up to the model's 128K-token window.