By Ryan A. on Jan 6, 2025
DeepSeek V3 is a state-of-the-art Mixture-of-Experts (MoE) model designed for scalable and efficient inference, featuring 671 billion parameters. Its performance benchmarks place it at the forefront of open-source AI solutions, especially in tasks requiring advanced reasoning, code generation, and multilingual support. This guide provides a detailed walkthrough for deploying DeepSeek V3 on high-end hardware.
DeepSeek V3 sets itself apart with features such as its large-scale Mixture-of-Experts architecture with 256 routed experts, Multi-Token Prediction (MTP), and FP8 inference support.
Running DeepSeek V3 locally gives you full control over the model’s performance and allows you to leverage your hardware investments efficiently.
Before diving into the setup, ensure your system meets the hardware requirements: the commands in this guide assume a high-end, two-node deployment with 8 GPUs per node (16-way model parallelism).
Start by cloning the official DeepSeek V3 repository from GitHub:
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3
Navigate to the inference folder and install the required Python libraries:
cd inference
pip install -r requirements.txt
Download the model weights from Hugging Face and save them to a dedicated directory:
mkdir -p /path/to/DeepSeek-V3-Demo
# Download the weights into the above directory.
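One way to fetch the checkpoint is with the Hugging Face CLI. This is a sketch, assuming the huggingface_hub CLI is installed and that the raw weights land in the path used as --hf-ckpt-path in the conversion step below:

pip install -U "huggingface_hub[cli]"
huggingface-cli download deepseek-ai/DeepSeek-V3 --local-dir /path/to/DeepSeek-V3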
If your setup requires converting the Hugging Face weights into the format expected by the inference code, run the conversion script:
python convert.py --hf-ckpt-path /path/to/DeepSeek-V3 --save-path /path/to/DeepSeek-V3-Demo --n-experts 256 --model-parallel 16
For interactive inference, use the following command:
torchrun --nnodes 2 --nproc-per-node 8 --node-rank $RANK --master-addr $ADDR generate.py --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200
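The $RANK and $ADDR variables must be exported on each node before launching torchrun; a minimal sketch for this two-node setup, assuming the master (rank-0) node is reachable at 10.0.0.1:

export ADDR=10.0.0.1   # address of the master (rank-0) node
export RANK=0          # set RANK=1 when launching on the second node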
For batch inference using a file:
torchrun --nnodes 2 --nproc-per-node 8 --node-rank $RANK --master-addr $ADDR generate.py --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --input-file $FILE
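The $FILE variable should point to your prompt file; the plain-text, one-prompt-per-line format shown here is an assumption, so check generate.py for the exact format it expects:

cat > prompts.txt <<'EOF'
Summarize the benefits of Mixture-of-Experts models.
Write a Python function that checks whether a string is a palindrome.
EOF
export FILE=prompts.txt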
SGLang is optimized for DeepSeek V3 with support for FP8 precision, distributed parallelism, and Multi-Token Prediction (MTP). It is compatible with both NVIDIA and AMD GPUs.
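As a sketch, SGLang can serve the model through its OpenAI-compatible server; the flags below (tensor parallelism of 8, port 30000, weights pulled from the Hugging Face repo) are assumptions to adjust to your hardware:

python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000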
LMDeploy is a versatile framework for efficient inference, offering FP8 and BF16 precision options.
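A sketch of serving the model through LMDeploy's OpenAI-compatible API server; the tensor-parallel degree here is an assumption:

lmdeploy serve api_server deepseek-ai/DeepSeek-V3 --tp 8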
NVIDIA TensorRT-LLM supports BF16 and INT4/INT8 quantization. FP8 support is in progress.
Monitoring tools such as nvidia-smi can help you confirm that utilization stays balanced across GPUs.
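For continuous monitoring during a long run, for example:

watch -n 1 nvidia-smi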