DeepSeek-R1 70B

Parameters

70B

Context Length

32,768 tokens (32K)

Modality

Text

Architecture

Dense

License

MIT License

Release Date

27 Dec 2024

Knowledge Cutoff

-

Technical Specifications

Attention Structure

Multi-Head Attention

Hidden Dimension Size

8192

Number of Layers

80

Attention Heads

112

Key-Value Heads

112

Activation Function

-

Normalization

-

Position Embedding

RoPE

System Requirements

VRAM requirements for different quantization methods and context sizes

DeepSeek-R1 70B

DeepSeek-R1 is a family of advanced large language models developed by DeepSeek, designed with a primary focus on enhancing reasoning capabilities. The DeepSeek-R1-Distill-Llama-70B variant is a product of knowledge distillation, leveraging the reasoning strengths of the larger DeepSeek-R1 model and transferring them to a Llama-3.3-70B-Instruct base architecture. This distillation process aims to create a highly capable model that maintains the efficiency and operational characteristics of its base while inheriting sophisticated reasoning patterns.

Architecturally, DeepSeek-R1-Distill-Llama-70B is a dense transformer, in contrast to the Mixture-of-Experts (MoE) architecture of the original DeepSeek-R1. It uses a multi-head attention mechanism with 112 attention heads, Rotary Position Embeddings (RoPE) to encode positional information, and Flash Attention for computational efficiency. This configuration lets the model handle long input contexts in support of complex, multi-step problem-solving.
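
To make the RoPE mechanism concrete, the sketch below applies rotary embeddings to the queries of a single attention head in NumPy. It is an illustration only, not the model's actual implementation: the head dimension of 128 is an assumed example value, and in practice the rotation is fused into optimized attention kernels such as Flash Attention.

import numpy as np

def rope(x, base=10000.0):
    # x: (seq_len, head_dim) queries or keys for one attention head
    seq_len, head_dim = x.shape
    half = head_dim // 2
    # Rotation frequencies theta_i = base^(-2i / head_dim), one per channel pair
    freqs = base ** (-2.0 * np.arange(half) / head_dim)
    angles = np.outer(np.arange(seq_len), freqs)        # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) channel pair by a position-dependent angle
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Illustrative example: 16 tokens, assumed head dimension of 128
q = np.random.randn(16, 128)
q_rotated = rope(q)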

This model is engineered for general text generation, code generation, and problem-solving in domains that require logical inference and multi-step reasoning. Its design prioritizes efficient deployment, making it suitable for applications where computational resources are constrained, including consumer-grade hardware. It is particularly strong at tasks that demand structured thought, such as mathematical problem-solving and generating coherent code, which extends its utility across technical and research applications.
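
As a usage illustration, the sketch below loads the model for text generation with the Hugging Face transformers library. The repository id, dtype, and generation settings are assumptions for demonstration rather than recommended settings, and a 70B model in bf16 still needs roughly 140 GB of GPU memory spread across devices (or further quantization to fit on less).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights; quantize further to shrink VRAM
    device_map="auto",           # shard the 70B weights across available GPUs
)

prompt = "Solve step by step: what is the sum of the first 100 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))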

About DeepSeek-R1

DeepSeek-R1 is a model family developed for logical reasoning tasks. It incorporates a Mixture-of-Experts architecture for computational efficiency and scalability. The family utilizes Multi-Head Latent Attention and employs reinforcement learning in its training, with some variants integrating cold-start data.
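
To illustrate what Mixture-of-Experts routing means in practice, here is a generic top-k routing sketch in NumPy. It is a simplified illustration under assumed toy shapes and does not reproduce DeepSeek-R1's actual routing design; note also that the distilled Llama-70B variant described on this page is dense and does not use MoE.

import numpy as np

def moe_forward(x, experts, router_weights, k=2):
    # x: (d_model,) single-token representation; experts: list of callables;
    # router_weights: (num_experts, d_model) linear router.
    logits = router_weights @ x
    top = np.argsort(logits)[-k:]                 # indices of the k best-scoring experts
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                            # softmax over the selected experts only
    # Only the k selected experts run, which is where MoE saves compute
    return sum(g * experts[i](x) for g, i in zip(gate, top))

# Illustrative toy setup: 8 small linear experts over a 16-dim representation
d_model, num_experts = 16, 8
experts = [(lambda x, W=np.random.randn(d_model, d_model) * 0.1: W @ x)
           for _ in range(num_experts)]
router_weights = np.random.randn(num_experts, d_model)
y = moe_forward(np.random.randn(d_model), experts, router_weights, k=2)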


Evaluation Benchmarks

Rankings are relative to local LLMs.

Rank

#24

Benchmark                            Score   Rank
-                                    0.60    11
-                                    0.61    12
LiveBench Agentic (Agentic Coding)   0.07    13
-                                    0.59    16
-                                    0.47    21

Rankings

Overall Rank

#24

Coding Rank

#28

GPU Requirements

VRAM requirements depend on the chosen weight quantization and the context size (1k, 16k, or 32k tokens); see the Full Calculator for exact figures and recommended GPUs. A rough estimation sketch follows below.
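
As a back-of-the-envelope complement to the calculator, the sketch below estimates VRAM as quantized weight size plus KV cache, scaled by a fixed overhead factor. The KV-cache parameters (8 key-value heads, head dimension 128) and the 1.2x overhead are assumed illustrative values rather than figures from the specification table above, and real requirements also depend on the runtime, batch size, and activation memory.

def estimate_vram_gb(params_billion=70.0, bits_per_weight=4, context_tokens=1024,
                     n_layers=80, n_kv_heads=8, head_dim=128, kv_bytes=2,
                     overhead=1.2):
    # Quantized weight footprint in GB
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    # KV cache: 2 tensors (K and V) per layer, per token
    kv_cache_gb = 2 * n_layers * n_kv_heads * head_dim * context_tokens * kv_bytes / 1e9
    return (weights_gb + kv_cache_gb) * overhead

# Example: 4-bit weights with a 32k-token context (illustrative assumptions)
print(round(estimate_vram_gb(bits_per_weight=4, context_tokens=32768), 1))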