How to Generate Videos Using Wan2.1 Text-to-Video on Ubuntu

By Ryan A. on Mar 12, 2025

Guest Author

Wan2.1 is a state-of-the-art open-source suite for video generation by Alibaba Cloud, offering cutting-edge performance while remaining accessible to users with consumer-grade GPUs. Developed to push the boundaries of AI-powered video creation, Wan2.1 supports Text-to-Video (T2V), Image-to-Video, Video Editing, and even Video-to-Audio generation.

One of its major advantages is its modest hardware footprint: the 1.3B model can run on GPUs with under 10GB of VRAM, while the larger 14B model demands far more memory and compute. This guide focuses on setting up and running text-to-video generation using Wan2.1 on Ubuntu.

System Requirements

Before proceeding, ensure your system meets the following minimum specifications:

1.3B Model (Recommended for Most Users)

  • RAM: 32GB (minimum, more is recommended)
  • VRAM: 10GB with CPU offloading, or 18GB without (these are minimums; more is recommended for both)

14B Model

  • RAM: 256GB+
  • VRAM: 80GB (single GPU) or 120GB combined (multi-GPU); more is better

If your GPU has less than the required VRAM, you can offload parts of the model to the CPU, though generation will be slower. More on that below.
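
Before downloading anything, it's worth checking how much VRAM and system RAM you actually have. Assuming the NVIDIA driver is already installed, the following commands report GPU memory and system RAM:

nvidia-smi --query-gpu=name,memory.total --format=csv
free -h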

Step 1: Install Prerequisites

Ensure you have Git and Python Pip installed. If not, install them using:

sudo apt update && sudo apt install git python3-pip -y

Clone the Wan2.1 repository and navigate into the project directory:

git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1
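
If you want to confirm the tools installed correctly, a quick version check works:

git --version
python3 --version
pip3 --version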

Step 2: Install Dependencies

Wan2.1 requires PyTorch 2.4.0 or later. Install the required dependencies:

pip install -r requirements.txt
pip install "huggingface_hub[cli]"

Step 3: Download the Model

Choose the appropriate model based on your hardware:

1.3B Model (Recommended for most users)

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B

14B Model

huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B
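
The checkpoints are large (multiple gigabytes), so it's worth confirming they landed in the target directory before moving on, for example:

du -sh ./Wan2.1-T2V-1.3B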

Step 4: Run Text-to-Video Generation

Wan2.1 supports two resolutions (480P and 720P) and two model sizes (1.3B and 14B). The 1.3B model supports only 480P generation, while the 14B model can generate videos in both 480P and 720P.

Running the 1.3B Model (Best for Limited VRAM)

This command runs the 1.3B model with CPU offloading to reduce memory usage. If your GPU has less than about 18-24GB of VRAM, this is the best option:

python generate.py  --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B  --offload_model True --t5_cpu --sample_shift 8 --sample_guide_scale 6  --prompt "A capybara relaxing in a futuristic cyberpunk city, neon lights reflecting on water."

This requires only about 10-12GB of VRAM to run.
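
To see how much VRAM the run actually consumes on your hardware, you can watch GPU memory usage from a second terminal:

watch -n 1 nvidia-smi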

Understanding the Offloading Flags:

  • --t5_cpu: Moves the T5 model to the CPU, reducing VRAM usage.
  • --offload_model True: Offloads the model to the CPU after each step, significantly reducing GPU memory consumption.

If you have sufficient VRAM, you can omit these flags for better performance.
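
For example, on a GPU with roughly 18GB or more of free VRAM, the same generation can run without the offload flags. This is simply the earlier command with --offload_model and --t5_cpu removed, not a separately documented recipe:

python generate.py --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --sample_shift 8 --sample_guide_scale 6 --prompt "A capybara relaxing in a futuristic cyberpunk city, neon lights reflecting on water."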

Running the 14B Model (Requires High-End GPUs)

If you have an 80GB+ VRAM GPU, you can run the 14B model at 720P resolution using:

python generate.py  --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B  --prompt "A capybara relaxing in a futuristic cyberpunk city, neon lights reflecting on water."

The offload flags are not included in the command above, but it's recommended to run with them first and drop them only once you've confirmed your GPU has enough headroom.
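
For example, reusing the same offload flags described in the 1.3B section (expect slower generation in exchange for lower peak VRAM):

python generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --offload_model True --t5_cpu --prompt "A capybara relaxing in a futuristic cyberpunk city, neon lights reflecting on water."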

Note: Running this model on hardware with insufficient VRAM will lead to slow performance or crashes due to memory overflow.

Step 5: Multi-GPU Setup for Faster Inference

For users with multiple GPUs, FSDP (Fully Sharded Data Parallel) and xDiT can distribute workloads across GPUs to accelerate generation. First, install the necessary package:

pip install "xfuser>=0.4.1"

Then, run the multi-GPU command:

torchrun --nproc_per_node=8 generate.py --task t2v-14B --size 1280*720  --ckpt_dir ./Wan2.1-T2V-14B --dit_fsdp --t5_fsdp --ulysses_size 8  --prompt "A capybara relaxing in a futuristic cyberpunk city, neon lights reflecting on water."
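
This example assumes 8 GPUs; adjust --nproc_per_node and --ulysses_size to match the number of GPUs on your machine (the two values should agree), which you can check with:

nvidia-smi --list-gpus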

Conclusion

Wan2.1 is a powerful yet accessible video generation model that enables high-quality AI-generated videos on a range of hardware configurations. While the 1.3B model is suitable for users with consumer GPUs, the 14B model delivers superior quality but requires high-end workstations.

If you're new to AI video generation, start with the 1.3B model and use CPU offloading to accommodate lower VRAM. As you scale up, explore multi-GPU inference for faster performance.

For more details and advanced usage, check the official Wan2.1 GitHub repository.
