Claude Sonnet 4.5 Thinking

Closed Source

Closed Weights

Parameters

Context Length

200K

Modality

Text

Architecture

Dense

License

Proprietary

Release Date

29 Sept 2025

Knowledge Cutoff

Jul 2025

Evaluation Benchmarks

Rank

#31

Benchmark	Score	Rank
Coding LiveBench Coding	0.80	⭐ 5
StackEval ProLLM Stack Eval	0.97	5
Professional Knowledge MMLU Pro	0.87	⭐ 7
Agentic Coding LiveBench Agentic	0.53	13
Coding Aider Coding	0.61	13
Reasoning LiveBench Reasoning	0.78	19
Mathematics LiveBench Mathematics	0.79	24
General Text Text Arena	1452	24
Data Analysis LiveBench Data Analysis	0.57	26
Web Development WebDev Arena	1388	41

Rankings

Overall Rank

#31

Coding Rank

#30

About Claude Sonnet 4.5 Thinking

Claude Sonnet 4.5 Thinking is a frontier-class hybrid reasoning model developed by Anthropic, engineered to provide a sophisticated balance between low-latency execution and high-fidelity cognitive processing. The model architecture introduces a dual-mode inference framework, allowing users to select between a standard response path and an extended thinking mode. In the latter, the model utilizes an internal scratchpad to perform multi-step planning, reflection, and self-correction before generating a final output. This transparent reasoning process is exposed to the user as a visible thought block, facilitating a more explainable and verifiable interaction for complex technical tasks.

Technically, the model is built upon an advanced transformer-based architecture optimized for agentic autonomy and long-horizon execution. It supports a standardized 200,000-token context window, with beta support for up to 1 million tokens, specifically designed to handle massive codebases and extensive document sets. Innovations in parallel tool execution and an improved attention mechanism enable the model to manage complex computer-use tasks, such as navigating file systems, executing shell commands, and coordinating multi-part software projects autonomously for periods exceeding 30 hours.

The system is primarily utilized in high-stakes environments where precision and sustained focus are mandatory. Its design excels in production-level software engineering, rigorous financial analysis, and the orchestration of autonomous agents. By integrating advanced memory management and checkpointing capabilities, the model allows for iterative development workflows where progress can be saved and referenced across long-duration sessions. This makes it a primary choice for developers building persistent AI agents that require both deep technical knowledge and the ability to reason through ambiguous, multi-step instructions.

Technical Specifications

Attention

Attention Structure

Multi-Head Attention

Attention Heads

Key-Value Heads

Attention Head Dimension

Position Embedding

Absolute Position Embedding

RoPE Theta

Sliding Window Attention

Sliding Window Size

Sliding Window Ratio

Linear Attention

Linear Attention Ratio

Normalization

Activation Function

Dimensions

Hidden Dimension Size

Number of Layers

FFN Intermediate Size (Dense)

Multi-Token Prediction Heads

Tokenizer

Vocabulary Size

Model Integrity

Total Score

51 / 100

Upstream

16.0 / 30

Model

19.0 / 40

Downstream

16.0 / 30

Claude Sonnet 4.5 Thinking Model Integrity Report

Total Score

/ 100

Audit Note

Claude 4.5 Sonnet Thinking exhibits a transparency profile typical of frontier proprietary models, characterized by excellent API documentation and version tracking but near-total opacity regarding training data and compute resources. While the model's reasoning processes are made visible to users through 'thinking blocks,' the underlying architectural innovations and dataset composition remain undisclosed corporate secrets. Its transparency is strongest in its functional identity and developer-facing specifications, yet it fails to meet basic evidence-based standards for architectural or environmental disclosure.

Upstream

16.0 / 30

Architectural Provenance

5.0 / 10

Anthropic identifies Claude 4.5 Sonnet as a 'hybrid reasoning model' built on a transformer-based architecture. While the dual-mode inference framework (standard vs. extended thinking) is well-documented in the system card and API guides, the underlying architectural modifications that enable 30+ hour autonomous execution and 'interleaved thinking' remain proprietary. No peer-reviewed paper or detailed technical report disclosing the specific model architecture or pre-training methodology has been released beyond high-level marketing descriptions.

Dataset Composition

3.0 / 10

Information regarding the training data is extremely limited. The system card mentions the use of 'crowd workers' for alignment and a knowledge cutoff of July 2025 (or January 2025 in some documentation), but provides no breakdown of data sources, proportions (e.g., web vs. code), or specific filtering methodologies. The claim of being 'the most aligned frontier model' is not supported by public disclosure of the data used to achieve this alignment.

Tokenizer Integrity

8.0 / 10

The tokenizer is accessible via the Anthropic API and supported through official SDKs (e.g., 'anthropic-sdk-python'). While the exact vocabulary size for the 4.5 family isn't explicitly stated in a single technical specification, the API provides 'count_tokens' functionality and 'logit_bias' support, allowing for empirical verification of token IDs and behavior. The tokenizer's alignment with the 200k/1M context window is well-documented for developers.

Model

19.0 / 40

Parameter Density

2.0 / 10

Anthropic does not disclose parameter counts for its proprietary models. While third-party analysis and competitor comparisons (like GLM-4.5) suggest it is a 'mid-sized' model within their 4.5 family, there is no official confirmation of total or active parameters. The 'dense' vs. 'sparse' nature of the architecture is not publicly verified, though it is marketed as a 'frontier-class' model.

Training Compute

2.0 / 10

There is no public disclosure of the hardware (GPU/TPU hours), training duration, or total compute cost. While third-party estimates for carbon footprint exist (e.g., ~40.1 kg CO2 per 36,500 queries), these are not official Anthropic disclosures. The company provides no data on the environmental impact or the specific resources required to train the 4.5 generation.

Benchmark Reproducibility

6.0 / 10

Anthropic provides detailed benchmark results for SWE-bench Verified (77.2%), OSWorld (61.4%), and AIME 2025. The system card includes some methodology notes (e.g., sampling temperature 1.0, max steps for OSWorld). However, the full evaluation code and exact prompts used for all internal benchmarks are not public, and some scores (like the 82% SWE-bench result) rely on 'parallel test-time compute' which is not fully transparently documented for independent reproduction.

Identity Consistency

9.0 / 10

The model demonstrates high identity consistency, correctly identifying itself as Claude 4.5 Sonnet and maintaining version awareness (e.g., 20250929). The system card and API documentation emphasize its role as a reasoning-focused agentic model. There are no documented widespread reports of the model claiming to be a competitor or misrepresenting its core capabilities during standard operation.

Downstream

16.0 / 30

License Clarity

4.0 / 10

The model is governed by a restrictive proprietary license. While the terms for commercial and consumer use are clearly linked in the documentation, they are standard 'black-box' terms that provide no rights to the weights or underlying code. The license is 'clear' in its restrictions but scores low on the transparency scale due to the lack of open-source or open-weights access.

Hardware Footprint

5.0 / 10

As an API-based model, local VRAM requirements are not applicable. However, Anthropic provides good documentation on context window memory scaling (200k to 1M tokens) and the impact of 'thinking tokens' on the total token budget. Documentation on 'context rot' and 'smart context management' provides some guidance on performance trade-offs, though specific quantization impacts for the hosted model are not disclosed.

Versioning Drift

7.0 / 10

Anthropic uses date-based versioning (e.g., claude-sonnet-4-5-20250929) and maintains a changelog for its associated tools like Claude Code. The API documentation explicitly details the retirement dates for model versions (e.g., not sooner than Sept 2026) and provides migration paths for 'extended thinking' features. While behavior drift is a known risk in LLMs, the versioning system is robust and publicly trackable.

Resources

Official Documentation

About Claude 4.5

Enhanced Claude models with further improvements in reasoning, coding, and agentic capabilities. Features advanced thinking modes with adjustable effort levels (high, medium, standard) for optimal performance-latency tradeoffs. Excels at complex analysis, software development, web development, and long-context understanding. Includes thinking variants that expose reasoning process for improved transparency.

Claude Sonnet 4.5 Thinking

Evaluation Benchmarks

Rankings

About Claude Sonnet 4.5 Thinking

Technical Specifications

Model Integrity

Claude Sonnet 4.5 Thinking Model Integrity Report

Audit Note

Upstream

Model

Downstream

Resources

About Claude 4.5

Other Claude 4.5 Models