OLMo 3.1 32B Think

Parameters

32B

Context Length

65,536 tokens

Modality

Text

Architecture

Dense

License

Apache 2.0

Release Date

12 Dec 2025

Knowledge Cutoff

Dec 2024

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

40

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

500,000

Sliding Window Attention

Yes

Sliding Window Size

4,096

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

5,120

Number of Layers

64

FFN Intermediate Size (Dense)

27,648

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

100,278

Architecture Diagram

[Diagram: Input Tokens → Token Embedding (position: RoPE; hidden: 5,120; context: 65,536; vocab: 100,278) → 64 layers of [RMSNorm (pre-attention) → Grouped-Query Attention (40 query / 8 KV heads, head dim 128, sliding window 4,096) → residual add → RMSNorm (pre-FFN) → Feed-Forward Network (SwiGLU, intermediate 27,648) → residual add] → Final RMSNorm → Output Logits]

OLMo 3.1 32B Think

OLMo 3.1 32B Think is an autoregressive language model developed by the Allen Institute for AI (Ai2) for complex, multi-step reasoning. As part of the OLMo 3.1 series, it continues the project's open-science approach: the release includes model weights, training code, and the underlying data. The model is optimized for tasks that require extended chains of thought, particularly mathematics and programming, and its post-training teaches it to write out detailed, checkable reasoning steps before giving a final answer.

Built on a decoder-only Transformer architecture, OLMo 3.1 32B Think utilizes 64 layers with a hidden dimension of 5120, incorporating architectural refinements to balance high performance with computational efficiency. It employs Grouped-Query Attention (GQA) with 40 query heads and 8 key-value heads, a configuration that significantly reduces the memory footprint of the key-value cache and enables efficient inference. The model utilizes SwiGLU activation functions and RMSNorm for stable training dynamics. For positional encoding, it implements Rotary Position Embeddings (RoPE) with YaRN-style scaling, supporting a substantial context window of 65,536 tokens.
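The memory saving from GQA can be estimated directly from the figures above. The following is a rough sketch, not Ai2's own accounting: it assumes 2-byte (FP16/BF16) cache entries and full attention in every layer, so it is an upper bound (layers that use the 4,096-token sliding window would cache less). The function name `kv_cache_bytes` is made up for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    Assumes every layer keeps the full sequence (no sliding window).
    """
    # Factor of 2: one tensor for keys, one for values.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


GIB = 1024 ** 3

# Figures from the spec sheet: 64 layers, head dim 128, 65,536-token context.
gqa = kv_cache_bytes(num_layers=64, num_kv_heads=8, head_dim=128, seq_len=65_536)
# Hypothetical comparison: the same model with one KV head per query head.
mha = kv_cache_bytes(num_layers=64, num_kv_heads=40, head_dim=128, seq_len=65_536)

print(f"GQA (8 KV heads):  {gqa / GIB:.0f} GiB")   # 16 GiB
print(f"MHA (40 KV heads): {mha / GIB:.0f} GiB")   # 80 GiB
```

Under these assumptions, sharing each key-value head across five query heads cuts the full-context cache by a factor of five.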

Training proceeds in stages. Pretraining uses a 5.9-trillion-token mix drawn from the 9.3-trillion-token Dolma 3 corpus, followed by mid-training on higher-quality reasoning data and a long-context phase. The Think variant is then refined through supervised fine-tuning, DPO, and Reinforcement Learning with Verifiable Rewards (RLVR) using the Dolci-Think-RL dataset. The reinforcement learning stage is intended to cultivate sustained internal reasoning, so that the model can work through intricate problems by exploring multiple logical paths. Because the model is released under the Apache 2.0 license with full access to the training recipes and data provenance tools, it serves as a transparent foundation for researchers and developers building auditable AI systems.

About OLMo 3

OLMo (Open Language Model) is a series of fully open language models designed to enable the science of language models. Released by the Allen Institute for AI (Ai2), OLMo 3 provides complete access to training data (Dolma 3), code, checkpoints, logs, and evaluation methodologies. The family includes Base models for pretraining research, Instruct variants for chat and tool use, and Think variants with chain-of-thought reasoning capabilities. All models are trained with a staged approach that includes pretraining, mid-training, and long-context phases.


Evaluation Benchmarks

Rank

#123

Category | Benchmark | Score | Rank
Web Development | WebDev Arena | 1285 | 82
General Text | Text Arena | 1285 | 88

Rankings

Overall Rank

#123

Coding Rank

#91

Model Integrity

Total Score

B+

86 / 100

OLMo 3.1 32B Think Model Integrity Report

Audit Note

OLMo 3.1 32B Think represents a gold standard in AI transparency, providing an end-to-end open pipeline that includes not just weights, but also the full training data, code, and compute metrics. Its commitment to open science is evidenced by the disclosure of environmental impacts and the use of permissive licensing. The model's clear architectural documentation and verifiable training stages make it a highly auditable foundation for research and development.

Upstream

27.0 / 30

Architectural Provenance

9.5 / 10

The model's architecture is extensively documented in the official technical report and Hugging Face model card. It is a dense decoder-only Transformer with 64 layers, a hidden dimension of 5120, and 40 query heads with 8 KV heads (GQA). It utilizes SwiGLU activations, RMSNorm, and RoPE with YaRN-style scaling for a 65,536 token context window. The training methodology is fully described as a multi-stage process (pretraining, mid-training, long-context, SFT, DPO, and RLVR), and the training code is publicly available in the OLMo-core and open-instruct repositories.

Dataset Composition

9.0 / 10

OLMo 3.1 32B Think is trained on the Dolma 3 dataset, which is publicly released and documented. The pretraining mix (5.9T tokens from a 9.3T crawl) and the specialized Dolci-Think-RL datasets for post-training are disclosed. Documentation includes details on data filtering, quality-aware upsampling (top 5% upsampled 7x), and de-contamination procedures against benchmark test sets. The full 'model flow' including data points is part of the release.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly available via the Hugging Face repository and is compatible with standard libraries like Transformers and vLLM. While the exact vocabulary size is not explicitly highlighted in all marketing summaries, it is verifiable through the provided configuration files and code. The tokenizer is aligned with the training data and supports the claimed 65k context window.

Model

33.5 / 40

Parameter Density

9.0 / 10

The model is explicitly stated to be a dense 32B parameter model, avoiding the ambiguity often found in Mixture-of-Experts (MoE) models. The architectural breakdown (layers, hidden size, GQA configuration) is clearly provided, allowing for precise calculation of active parameters and memory requirements.
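As a sanity check on the 32B figure, the parameter count can be approximated from the listed dimensions. This is a back-of-the-envelope sketch with stated assumptions: no bias terms, normalization weights ignored, and separate (untied) input and output embedding matrices. None of these is confirmed by this page, and `dense_param_count` is a hypothetical helper.

```python
def dense_param_count(vocab, hidden, layers, q_heads, kv_heads, head_dim, ffn,
                      tied_embeddings=False):
    """Approximate weight count for a dense decoder-only Transformer.

    Ignores biases and normalization weights (assumed negligible).
    """
    attn = hidden * q_heads * head_dim          # query projection
    attn += 2 * hidden * kv_heads * head_dim    # key and value projections (GQA)
    attn += q_heads * head_dim * hidden         # output projection
    mlp = 3 * hidden * ffn                      # SwiGLU: gate, up, and down matrices
    embed = vocab * hidden
    return layers * (attn + mlp) + embed * (1 if tied_embeddings else 2)


# Figures from the spec sheet.
total = dense_param_count(vocab=100_278, hidden=5_120, layers=64,
                          q_heads=40, kv_heads=8, head_dim=128, ffn=27_648)
print(f"{total:,} parameters (~{total / 1e9:.1f}B)")  # 32,232,468,480 parameters (~32.2B)
```

The estimate lands close to the advertised 32B, which suggests the published dimensions are internally consistent.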

Training Compute

8.0 / 10

Ai2 provides significant detail regarding compute resources. The 32B model pretraining required approximately 1.05 million H100 GPU hours, with an additional 21 days on 224 GPUs for the 3.1 RLVR stage. Power consumption was measured at ~649W per GPU during pretraining, totaling ~681MWh for the 32B model. This level of environmental and resource disclosure is exemplary.
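The energy total follows from the other two figures, and the RLVR stage can be converted to GPU-hours the same way. The arithmetic below is a simple check using only the numbers quoted above. It assumes the 21 days ran continuously on all 224 GPUs, which the text does not state explicitly.

```python
# Pretraining figures quoted in the report.
pretrain_gpu_hours = 1_050_000      # ~1.05 million H100 GPU-hours
watts_per_gpu = 649                 # measured average draw per GPU

# Energy = GPU-hours x watts, converted from Wh to MWh.
energy_mwh = pretrain_gpu_hours * watts_per_gpu / 1e6
print(f"Pretraining energy: {energy_mwh:.2f} MWh")   # 681.45 MWh

# RLVR stage: 21 days on 224 GPUs (assumed continuous use).
rlvr_gpu_hours = 21 * 24 * 224
print(f"RLVR stage: {rlvr_gpu_hours:,} GPU-hours")   # 112,896 GPU-hours
```

The computed 681.45 MWh matches the reported ~681 MWh, and under the continuous-use assumption the additional RL stage amounts to roughly a tenth of the pretraining GPU-hours.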

Benchmark Reproducibility

7.5 / 10

Evaluation is conducted using the open-source OLMo-Eval framework. Results are provided for standard benchmarks (AIME, ZebraLogic, IFEval, GSM8K, etc.) with specific versioning. The technical report describes the methodology at a high level, and the full evaluation code and exact prompt configurations are accessible through the linked GitHub repositories. Third-party verification of the 3.1 variant is still maturing.

Identity Consistency

9.0 / 10

The model consistently identifies as part of the OLMo 3.1 family and is transparent about its 'Think' variant nature, which uses explicit <think> tags for chain-of-thought reasoning. It does not attempt to mimic competitor identities and maintains clear versioning (e.g., 3.1 vs 3.0) in its system prompts and documentation.
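Downstream code usually needs to separate the reasoning trace from the final answer. The sketch below shows one way to do that, assuming the output contains a single `<think>...</think>` block followed by the answer. The exact output format is an assumption here, and `split_reasoning` is a hypothetical helper, not part of any Ai2 library.

```python
import re

# Non-greedy match so only the first <think> block is captured;
# DOTALL lets the reasoning span multiple lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from a model response.

    If no <think> block is present, reasoning is an empty string.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>2 + 2 = 4</think>The answer is 4.")
print(reasoning)  # 2 + 2 = 4
print(answer)     # The answer is 4.
```

A truncated response with an unclosed `<think>` tag falls through to the no-match branch, so callers that care about that case would need to handle it separately.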

Downstream

25.0 / 30

License Clarity

10.0 / 10

The model, weights, and training code are all released under the Apache 2.0 license, which is a standard, permissive open-source license. There are no conflicting proprietary terms or 'open-weights-only' restrictions that limit commercial use or derivative works, making it one of the most legally transparent models available.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented, with specific VRAM estimates for different precisions (FP16 requires ~64GB, while 4-bit/8-bit quantizations are noted for 24GB-32GB cards). Context length memory scaling for the 65k window is also addressed, providing clear guidance for deployment on consumer vs. enterprise hardware.
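The weight-memory figures can be reproduced with a one-line estimate. This sketch counts weights only, using a nominal 32 billion parameters and decimal gigabytes. It ignores the KV cache, activations, and quantization overhead such as scales and zero points, so real deployments need headroom beyond these numbers.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for model weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 32_000_000_000  # nominal 32B, as stated on the page

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
# 16-bit weights: ~64 GB
#  8-bit weights: ~32 GB
#  4-bit weights: ~16 GB
```

These values line up with the ~64GB FP16 estimate and explain why quantized variants are the ones suggested for 24GB-32GB cards.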

Versioning Drift

7.0 / 10

The model uses clear semantic versioning (3.1) to distinguish it from the initial 3.0 release. Ai2 maintains a public 'model flow' that tracks checkpoints and updates. There is no single, software-style changelog, but the transition from 3.0 to 3.1 is documented in blog posts and technical updates, including the number of additional RL training days.
