Parameters
7B
Context Length
65,536 tokens
Modality
Text
Architecture
Dense
License
Apache 2.0
Release Date
25 Oct 2025
Knowledge Cutoff
Dec 2024
Attention Structure
Multi-Head Attention
Hidden Dimension Size
4096
Number of Layers
32
Attention Heads
32
Key-Value Heads
32
Activation Function
SwiGLU
Normalization
-
Position Embedding
Rotary Position Embedding (RoPE)
OLMo 3 7B Base represents a foundational component within the Allen Institute for AI's (AI2) OLMo 3 family of language models, designed to advance the scientific understanding and development of large language models. This variant features 7 billion parameters and is trained on 5.93 trillion tokens sourced from the Dolma 3 dataset. A key characteristic of the OLMo 3 project is its commitment to full transparency, offering public access to not only the model weights but also the comprehensive training data, code, intermediate checkpoints, logs, and evaluation methodologies. This approach facilitates reproducibility and supports detailed research into model behavior and development processes.
Architecturally, the OLMo 3 7B Base model is a dense, decoder-only transformer. Its training employs a staged approach, encompassing distinct pretraining, mid-training, and long-context phases to optimize for diverse linguistic capabilities and extended input handling. The model incorporates 32 layers, a hidden dimension size of 4096, and utilizes multi-head attention with 32 query heads and 32 key-value heads. Rotary Positional Embeddings (RoPE) are integrated, with scaling mechanisms implemented to support a substantial context length of 65,536 tokens.
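As a rough illustration of what these dimensions imply for memory at full context, the sketch below estimates the key-value cache size from the figures above. It assumes the per-head dimension equals the hidden size divided by the number of query heads (4096 / 32 = 128) and that keys and values are stored in 16-bit precision. Neither assumption is stated in the source, and the function name is illustrative only.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=32,
                   hidden_size=4096, num_query_heads=32, bytes_per_value=2):
    """Estimate the KV-cache size in bytes for a single sequence."""
    # Assumption: head_dim = hidden_size / num_query_heads (128 here).
    head_dim = hidden_size // num_query_heads
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(65_536)
print(f"{per_token / 2**10:.0f} KiB per token")
print(f"{full_context / 2**30:.0f} GiB at 65,536 tokens")
```

Under these assumptions the cache grows linearly with sequence length, so with 32 key-value heads (no grouped-query sharing) long contexts can need more memory for the cache than for the weights themselves.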
As a base model, OLMo 3 7B is intended primarily for pretraining research and serves as a robust starting point for subsequent fine-tuning across various downstream tasks. Its design prioritizes general capabilities, laying the groundwork for specialized applications in areas such as reasoning, tool use, and instruction following through further post-training. The model's open licensing under Apache 2.0 permits broad usage, including commercial applications, fostering community collaboration and innovation in the AI ecosystem.
OLMo (Open Language Model) is a series of fully open language models designed to enable the science of language models. Released by the Allen Institute for AI (Ai2), OLMo 3 provides complete access to training data (Dolma 3), code, checkpoints, logs, and evaluation methodologies. The family includes Base models for pretraining research, Instruct variants for chat and tool use, and Think variants with chain-of-thought reasoning capabilities. All models are trained with a staged approach that includes pretraining, mid-training, and long-context phases.
No evaluation benchmarks are available for OLMo 3 7B Base.
Overall Rank
-
Coding Rank
-
Total Score
93 / 100
OLMo 3 7B Base sets a benchmark for transparency in the AI industry by providing public access to its full training data, code, and intermediate checkpoints. The model's documentation is exceptionally detailed, covering everything from specific dataset percentages to precise GPU power consumption and training hours. This comprehensive disclosure enables a level of scientific auditability and reproducibility that is virtually unmatched by contemporary models.
Architectural Provenance
OLMo 3 7B Base provides exemplary documentation of its architectural lineage. It is a dense, decoder-only transformer with 32 layers, a hidden dimension of 4096, and 32 attention heads. The training methodology is explicitly detailed as a three-stage process: initial pretraining (5.93T tokens), mid-training (100B tokens), and long-context extension (50B tokens). Unlike most models, the full training code is available in the 'OLMo-core' GitHub repository, and the technical report provides exhaustive details on the staged curriculum and on architectural choices such as the RoPE scaling used for its 65,536-token context window.
Dataset Composition
The model's training data, Dolma 3, is fully disclosed with precise percentage breakdowns for each stage. The 5.93T token pretraining mix is documented as 76.07% Common Crawl, 13.57% scientific PDFs, 6.89% code, and 2.56% math. Mid-training and long-context mixes are similarly detailed (e.g., 20% code, 19% math for mid-training). AI2 provides the 'Dolma' toolkit for data processing and has released the actual dataset on Hugging Face alongside intermediate model checkpoints, which is a rare level of transparency.
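To make the pretraining percentages concrete, the sketch below converts each quoted share into an approximate token count. The totals and percentages are the ones quoted above; the variable and function names are illustrative, and the results are simple arithmetic, not figures taken from the technical report.

```python
PRETRAINING_TOKENS = 5.93e12  # 5.93T tokens, as quoted above

# Percentage shares of the pretraining mix, as quoted above.
SHARES_PERCENT = {
    "Common Crawl": 76.07,
    "scientific PDFs": 13.57,
    "code": 6.89,
    "math": 2.56,
}


def tokens_per_source(total_tokens, shares_percent):
    """Convert percentage shares into approximate token counts."""
    return {name: total_tokens * pct / 100
            for name, pct in shares_percent.items()}


listed = tokens_per_source(PRETRAINING_TOKENS, SHARES_PERCENT)
for name, count in listed.items():
    print(f"{name}: {count / 1e12:.2f}T tokens")

# Whatever the four quoted shares do not cover.
not_itemized = 100 - sum(SHARES_PERCENT.values())
print(f"not itemized here: {not_itemized:.2f}%")
```

The four quoted shares sum to 99.09%, so roughly 0.91% of the pretraining mix is not itemized in the figures given here.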
Tokenizer Integrity
The tokenizer is publicly accessible via the Hugging Face repository and the OLMo-core library. It has a stated vocabulary size of 50,304 tokens. Documentation covers the tokenization approach, and the tokenizer is integrated into standard 'transformers' workflows, allowing for immediate verification of token counts and language support alignment.
Parameter Density
The model is explicitly defined as a dense architecture with 7 billion total parameters. There is no ambiguity regarding active vs. total parameters as seen in MoE models. The architectural breakdown (layers, heads, dimensions) is clearly stated in the technical report and model card, and the parameter count is consistent across all official documentation and third-party implementations.
Training Compute
AI2 provides high-granularity compute data, disclosing that the 7B model required approximately 234,000 H100 GPU hours for pretraining. They also provide power consumption metrics (~621W average during pretraining) and total energy draw (~146 MWh). This level of detail allows for precise carbon footprint and cost estimation, far exceeding industry standards.
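The three figures above can be cross-checked against one another. The sketch below multiplies GPU hours by average power draw, assuming the ~621W figure is a per-GPU average over the whole pretraining run (the source does not say so explicitly). The names are illustrative.

```python
GPU_HOURS = 234_000      # H100 GPU hours for pretraining, as quoted above
AVG_POWER_WATTS = 621    # assumed to be the average draw per GPU


def energy_mwh(gpu_hours, avg_power_watts):
    """Energy in megawatt-hours from GPU hours and average watts per GPU."""
    watt_hours = gpu_hours * avg_power_watts
    return watt_hours / 1e6  # 1 MWh = 1,000,000 Wh


estimate = energy_mwh(GPU_HOURS, AVG_POWER_WATTS)
print(f"{estimate:.1f} MWh")
```

Under that assumption the estimate comes to about 145.3 MWh, which agrees with the reported ~146 MWh to within the rounding of the inputs.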
Benchmark Reproducibility
Evaluation is highly transparent through the 'OLMo-Eval' repository and 'OLMES' suite. AI2 discloses exact benchmarks, versions, and results across a wide array of tasks (MMLU, GSM8K, HumanEval, etc.). While they provide the code and prompts used, a minor deduction is made because the technical report notes that some scientific PDFs in the released dataset were redacted post-training for legal reasons, which may slightly impact exact bit-for-bit reproduction of the training run.
Identity Consistency
The model demonstrates high identity consistency, correctly identifying itself as an AI2 OLMo model in testing and documentation. There are no reports of the model claiming to be a competitor's product (e.g., GPT-4). Versioning is strictly maintained (OLMo 3 1025-7B), and the model's capabilities and limitations as a base model are clearly articulated.
License Clarity
The model, weights, and code are all released under the Apache 2.0 license, which is a standard, permissive open-source license. There are no conflicting commercial restrictions or 'open-weights-but-not-open-source' ambiguities. The terms for derivative works and commercial use are clear and legally standard.
Hardware Footprint
Hardware requirements are well-documented, with VRAM estimates provided for various precisions (FP16 requires ~16 GB). Third-party documentation and community testing (e.g., via Ollama and LM Studio) provide additional verification for quantization (Q4/Q8) and context scaling impacts. AI2's own documentation is sufficient for baseline requirements, but most of the detailed quantization trade-off data comes from the community.
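A back-of-the-envelope estimate of weight memory at different precisions can be sketched as follows. It assumes exactly 7 billion parameters and nominal bit widths of 16, 8, and 4 bits for FP16, Q8, and Q4. Real quantization formats store extra metadata such as scales, so actual file sizes will differ, and the names here are illustrative.

```python
PARAMETERS = 7e9  # nominal parameter count, as quoted above

# Nominal bits per weight (an assumption; real formats add overhead).
BITS_PER_WEIGHT = {"FP16": 16, "Q8": 8, "Q4": 4}


def weight_memory_gb(num_parameters, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_parameters * bits_per_weight / 8 / 1e9


estimates = {name: weight_memory_gb(PARAMETERS, bits)
             for name, bits in BITS_PER_WEIGHT.items()}
for name, gb in estimates.items():
    print(f"{name}: {gb:.1f} GB for weights")
```

The FP16 weights alone come to about 14 GB in this estimate. The ~16 GB figure quoted above is higher, presumably because it also allows for activations, the key-value cache, and runtime overhead.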
Versioning Drift
AI2 uses clear semantic versioning and maintains a detailed changelog in the OLMo-core repository. They provide access to intermediate checkpoints (not just the final weights), allowing researchers to track the model's evolution throughout the training process. This 'model flow' approach is the gold standard for tracking drift and behavioral changes.