OLMo 3 32B Base: Specifications and GPU VRAM Requirements

OLMo 3 32B Base

Open Source

Open Weights

Parameters

32B

Context Length

65,536 tokens

Modality

Text

Architecture

Dense

License

Apache 2.0

Release Date

25 Nov 2025

Knowledge Cutoff

Dec 2024

Technical Specifications

Attention Structure

Grouped-Query Attention

Hidden Dimension Size

5120

Number of Layers

64

Attention Heads

40

Key-Value Heads

8

Activation Function

SwiGLU

Normalization

RMS Normalization

Position Embedding

Rotary Position Embedding (RoPE)

OLMo 3 32B Base

The OLMo 3 32B Base model, developed by the Allen Institute for AI (Ai2), is a foundational large language model designed to advance transparency and reproducibility in AI research. This variant, with 32 billion parameters, serves as the base for more specialized models within the OLMo 3 family, including Instruct and Think variants. Its primary purpose is to provide a robust, openly accessible, and auditable platform for further pretraining, fine-tuning, and experimentation in language model development. The model's complete lifecycle, encompassing training data, code, checkpoints, logs, and evaluation methodologies, is made publicly available to foster a deeper understanding of model behavior and facilitate scientific inquiry.

Architecturally, OLMo 3 32B Base is a dense, decoder-only transformer with 64 layers and a hidden dimension of 5120. Attention uses grouped-query attention (GQA) with 40 query heads and 8 key-value heads, which shrinks the KV cache relative to standard multi-head attention. The model also uses a hybrid attention pattern: sliding-window attention in most layers and full-sequence attention in every fourth layer, balancing local and global context processing. Rotary position embeddings (RoPE) with YaRN-style scaling extend the model's effective context length to 65,536 tokens. Normalization uses RMSNorm, and the MLP blocks use a SwiGLU activation. Training uses Flash Attention for computational efficiency.
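The hybrid attention layout described above can be sketched in a few lines. This is an illustration only: the class name `AttentionLayout` is hypothetical, and the choice of which layers are the full-attention ones (here, 1-based layers 4, 8, 12, and so on) is an assumption, since the text only says "every fourth layer".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionLayout:
    """Hypothetical container for the attention figures quoted in the text."""

    n_layers: int = 64
    n_heads: int = 40      # query heads
    n_kv_heads: int = 8    # shared key-value heads (GQA)
    full_every: int = 4    # one full-attention layer per group of four

    def layer_types(self) -> list[str]:
        # ASSUMPTION: 1-based layers 4, 8, 12, ... use full attention.
        return [
            "full" if (i + 1) % self.full_every == 0 else "sliding"
            for i in range(self.n_layers)
        ]

    def queries_per_kv_head(self) -> int:
        # In GQA, several query heads share one key-value head.
        return self.n_heads // self.n_kv_heads


layout = AttentionLayout()
types = layout.layer_types()
print(types[:8])
print("full:", types.count("full"), "sliding:", types.count("sliding"))
print("query heads per KV head:", layout.queries_per_kv_head())
```

With the stated figures, 16 of the 64 layers attend over the full sequence and 48 use the sliding window, and each key-value head is shared by 5 query heads.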

Pretrained on approximately 5.9 trillion tokens from the Dolma 3 dataset, OLMo 3 32B Base undergoes a staged training regimen that includes general pretraining, mid-training on targeted data, and a context extension phase. This methodical approach establishes a strong foundation for its capabilities in areas such as programming, reading comprehension, and mathematical problem-solving. The model maintains its performance across extended context lengths, providing a versatile base for developing specialized downstream applications. The comprehensive openness of its development artifacts allows researchers and developers to inspect, audit, and extend the model, supporting diverse applications from continued pretraining to targeted fine-tuning and reinforcement learning setups.

About OLMo 3

OLMo (Open Language Model) is a series of fully open language models designed to enable the science of language models. Released by the Allen Institute for AI (Ai2), OLMo 3 provides complete access to training data (Dolma 3), code, checkpoints, logs, and evaluation methodologies. The family includes Base models for pretraining research, Instruct variants for chat and tool use, and Think variants with chain-of-thought reasoning capabilities. All models are trained with a staged approach that includes pretraining, mid-training, and long-context phases.

Evaluation Benchmarks

No evaluation benchmarks for OLMo 3 32B Base available.

Model Transparency

Total Score

B+

88 / 100

Upstream

27.0 / 30

Model

34.5 / 40

Downstream

26.0 / 30

OLMo 3 32B Base Transparency Report

Audit Note

OLMo 3 32B Base sets a high standard for transparency by providing not just the model weights, but the entire 'model flow,' including the full training data and code. The documentation of its architectural choices and staged training process is exceptionally detailed and verifiable. While compute and environmental metrics are present, they are slightly less centralized than the comprehensive data and architectural disclosures.

Upstream

27.0 / 30

Architectural Provenance

9.0 / 10

The model's architecture is extensively documented in the official technical report and Hugging Face model card. It is a dense, decoder-only transformer with 64 layers and a hidden dimension of 5120. Specific details provided include the use of Grouped-Query Attention (GQA) with 40 attention heads and 8 KV heads, RMSNorm, and SwiGLU activations. The long-context mechanism is thoroughly described as a hybrid approach using sliding-window attention (4K window) in most layers and full-sequence attention in every fourth layer, combined with YaRN-style scaling for Rotary Position Embeddings (RoPE) to reach 65,536 tokens. The training methodology is disclosed as a three-stage process: initial pretraining, mid-training, and context extension.
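To see how GQA and the 4K sliding window affect memory, the following rough sketch estimates KV cache size for a single sequence. Several inputs are assumptions not stated on this page: a head dimension of 128 (taken as 5120 / 40), a 16-bit cache, and an inference engine that keeps only the last 4,096 tokens for sliding-window layers. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(
    context_len: int,
    n_layers: int = 64,
    n_kv_heads: int = 8,
    head_dim: int = 128,        # ASSUMPTION: hidden size 5120 / 40 heads
    window: int = 4096,         # 4K sliding window from the text
    full_every: int = 4,        # every fourth layer uses full attention
    bytes_per_value: int = 2,   # ASSUMPTION: FP16/BF16 cache
) -> int:
    """Estimate KV cache bytes for one sequence (batch size 1)."""
    # Each cached token stores one key and one value vector per KV head.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_value
    n_full = n_layers // full_every
    n_sliding = n_layers - n_full
    # ASSUMPTION: sliding layers retain at most `window` tokens.
    sliding_tokens = min(context_len, window)
    cached_tokens = n_full * context_len + n_sliding * sliding_tokens
    return per_token_per_layer * cached_tokens


GIB = 1024 ** 3
hybrid = kv_cache_bytes(65_536)
# Setting the window to the full context models "no sliding window".
all_full = kv_cache_bytes(65_536, window=65_536)
print(f"hybrid layout:   {hybrid / GIB:.2f} GiB")
print(f"all-full layout: {all_full / GIB:.2f} GiB")
```

Under these assumptions the hybrid layout needs about 4.75 GiB at 65,536 tokens, against 16 GiB if every layer cached the full sequence. Real engines may allocate the cache differently, so treat the numbers as an order-of-magnitude guide.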

Dataset Composition

9.5 / 10

The Allen Institute for AI provides exemplary transparency regarding the training data. The model was trained on the Dolma 3 dataset, which is publicly available. The composition is broken down into specific stages: Stage 1 used the 5.5T token 'dolma3_mix-1125'; Stage 2 (Mid-training) used the 100B token 'dolma3-dolmino-mix' (math, code, science, and reasoning traces); and Stage 3 (Long Context) used the 100B token 'dolma3-longmino-mix' (34% long-context PDFs, 66% mid-training data). The data collection methodology, including the use of olmOCR for scientific PDFs and gzip-based quality filtering, is fully documented. De-contamination procedures against benchmark test sets are explicitly stated.
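The stage budgets quoted above can be tallied with simple arithmetic. The sketch below uses only the figures in this paragraph; the dictionary keys are illustrative labels, and the sum need not match headline totals quoted elsewhere, which may be rounded or counted differently.

```python
# Token budgets per training stage, as quoted in the paragraph above.
STAGE_TOKENS = {
    "stage1_pretraining": 5_500_000_000_000,   # dolma3_mix-1125
    "stage2_midtraining": 100_000_000_000,     # dolma3-dolmino-mix
    "stage3_long_context": 100_000_000_000,    # dolma3-longmino-mix
}

total_tokens = sum(STAGE_TOKENS.values())

# Stage 3 split: 34% long-context PDFs, 66% mid-training data.
stage3 = STAGE_TOKENS["stage3_long_context"]
long_pdf_tokens = stage3 * 34 // 100
replayed_mid_tokens = stage3 - long_pdf_tokens

print(f"total across stages: {total_tokens / 1e12:.1f}T tokens")
print(f"stage 3 long-context PDFs: {long_pdf_tokens / 1e9:.0f}B tokens")
print(f"stage 3 mid-training data: {replayed_mid_tokens / 1e9:.0f}B tokens")
```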

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly accessible via the Hugging Face repository and is integrated with the standard 'transformers' library. While the exact vocabulary size and training data alignment are implied by its release alongside the model and its use in the Dolma 3 pipeline, the technical report confirms it is a BPE-based tokenizer designed for the specific data mix. The tokenizer's behavior is verifiable through the provided inference code snippets.

Model

34.5 / 40

Parameter Density

10.0 / 10

The parameter count is clearly stated as 32 billion. As a dense model, all parameters are active during inference, and the provider explicitly confirms there is no 'MoE trickery.' A complete architectural breakdown is provided, including the number of layers (64), hidden size (5120), and attention head configuration (40 Q, 8 KV), allowing for precise verification of the parameter density.
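As a sanity check on the stated breakdown, the parameter count can be approximated from the layer count, hidden size, and head configuration. This page does not give the MLP width, the vocabulary size, or whether input and output embeddings are tied, so the defaults below for `mlp_hidden`, `vocab_size`, and `tied_embeddings` are assumptions chosen for illustration. The function name is hypothetical, and small terms such as normalization weights are ignored.

```python
def approx_param_count(
    n_layers: int = 64,
    hidden: int = 5120,
    n_heads: int = 40,
    n_kv_heads: int = 8,
    mlp_hidden: int = 27_648,       # ASSUMPTION: not stated on this page
    vocab_size: int = 100_278,      # ASSUMPTION: not stated on this page
    tied_embeddings: bool = False,  # ASSUMPTION: separate input/output tables
) -> dict[str, int]:
    """Rough parameter tally for a dense GQA transformer with a SwiGLU MLP."""
    head_dim = hidden // n_heads
    kv_dim = n_kv_heads * head_dim
    # Q and output projections are hidden x hidden; K and V are hidden x kv_dim.
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim
    # SwiGLU uses three matrices: gate, up, and down projections.
    mlp = 3 * hidden * mlp_hidden
    embed = vocab_size * hidden * (1 if tied_embeddings else 2)
    return {
        "attention_per_layer": attn,
        "mlp_per_layer": mlp,
        "embeddings": embed,
        "total": n_layers * (attn + mlp) + embed,
    }


counts = approx_param_count()
print(f"attention per layer: {counts['attention_per_layer']:,}")
print(f"MLP per layer:       {counts['mlp_per_layer']:,}")
print(f"total:               {counts['total'] / 1e9:.1f}B")
```

With these assumed values the total comes out at about 32.2 billion, consistent with the stated 32B figure. Different MLP widths or vocabulary sizes would shift the result.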

Training Compute

7.0 / 10

The provider discloses significant details about the training hardware and duration. For example, the 32B Think variant's RL extension was trained for 21 days on 224 GPUs. The technical report mentions the use of the 'Augusta' cluster and specific environment variables (NCCL) required for the run. However, a consolidated total GPU-hour figure for the entire three-stage pretraining of the base model and a specific carbon footprint calculation are less prominently featured than the architectural and data details.

Benchmark Reproducibility

8.5 / 10

The Allen Institute provides high transparency for evaluations through the 'OLMo-Eval' GitHub repository, which contains the code and configurations used for benchmarking. The technical report specifies the benchmarks used (e.g., MMLU, GSM8K, RULER, HELMET) and provides detailed results across various checkpoints. The release of intermediate checkpoints (stage1-stepXXX, etc.) further enables researchers to reproduce and verify the model's learning trajectory.

Identity Consistency

9.0 / 10

The model and its variants (Base, Instruct, Think, RL Zero) are clearly labeled with semantic versioning (e.g., OLMo 3, OLMo 3.1). The model card provides a clear identity and intended use cases. There are no reported issues of the model misidentifying itself as a competitor's product, and the 'Think' variant's reasoning traces provide additional transparency into its internal processing identity.

Downstream

26.0 / 30

License Clarity

10.0 / 10

The model, weights, and code are all released under the highly permissive Apache 2.0 license. There are no conflicting terms or 'open-ish' restrictions on commercial use. The license is clearly stated on the Hugging Face model card, the GitHub repository, and the official blog posts.

Hardware Footprint

8.0 / 10

VRAM requirements are well-documented by both the provider and the community. Official documentation notes that FP16 inference requires approximately 64GB of VRAM, making the model suitable for a single A100 (80GB) or a multi-GPU consumer setup (e.g., 4x RTX 4090). Quantization support (INT8/INT4) is documented, with community guides providing specific VRAM targets (sub-20GB for 4-bit). The impact of context length on memory (KV cache efficiency via GQA) is also detailed.
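The VRAM figures above follow from multiplying the parameter count by the bytes stored per weight. The sketch below is a back-of-the-envelope estimate for weights only: it assumes exactly 32 billion parameters, ignores the KV cache, activations, and framework overhead, and ignores the extra scale and zero-point data that real quantization formats store. The names `BYTES_PER_PARAM` and `weight_memory_gb` are hypothetical.

```python
# Nominal storage cost per weight at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float = 32e9, precision: str = "fp16") -> float:
    """Weights-only memory in GB (10^9 bytes); excludes cache and overhead."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(precision=precision):.0f} GB")
```

This gives 64 GB at FP16, 32 GB at INT8, and 16 GB at INT4, which is consistent with the 64GB and sub-20GB figures quoted above. Actual usage will be higher once the KV cache and runtime buffers are included.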

Versioning Drift

8.0 / 10

The project maintains a detailed CHANGELOG.md in the 'OLMo-core' repository, tracking updates to the training code, configs, and model releases. The transition from OLMo 3 to OLMo 3.1 is documented with specific performance gains and training changes (e.g., additional RL training days). The availability of intermediate checkpoints allows users to pin specific versions to avoid silent drift.
