By Ryan A. on Mar 12, 2025
Wan2.1 is a state-of-the-art open-source suite for video generation by Alibaba Cloud, offering cutting-edge performance while remaining accessible to users with consumer-grade GPUs. Developed to push the boundaries of AI-powered video creation, Wan2.1 supports Text-to-Video (T2V), Image-to-Video, Video Editing, and even Video-to-Audio generation.
One of its major advantages is its modest hardware requirements. The 1.3B model can run on GPUs with under 10GB of VRAM, while the larger 14B model demands significantly more memory. This guide focuses on setting up and running text-to-video generation with Wan2.1 on Ubuntu.
Before proceeding, ensure your system meets the following minimum specifications:
If your GPU has less than the required VRAM, offloading to the CPU is possible, though generation will be slower. More on that later in this guide.
Ensure you have Git and Python Pip installed. If not, install them using:
sudo apt update && sudo apt install git python3-pip -y
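To confirm both tools are available, you can check their versions:
git --version
python3 -m pip --version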
Clone the Wan2.1 repository and navigate into the project directory:
git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1
Wan2.1 requires PyTorch 2.4.0 or later. Install the required dependencies:
pip install -r requirements.txt
pip install "huggingface_hub[cli]"
Choose the appropriate model for your hardware (the 1.3B model for consumer GPUs, the 14B model for high-end cards) and download it with the Hugging Face CLI:
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B
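Once the download finishes, you can sanity-check that the checkpoint files are present (shown here for the 1.3B model; use the other directory for the 14B model):
ls -lh ./Wan2.1-T2V-1.3B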
Wan2.1 supports two resolutions (480P and 720P) and two model sizes (1.3B and 14B). The 1.3B model supports only 480P generation, while the 14B model can generate videos in both 480P and 720P.
This command runs the 1.3B model with CPU offloading to optimize memory usage. If your GPU has less than 18-24GB VRAM, this is the best option:
python generate.py --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --offload_model True --t5_cpu --sample_shift 8 --sample_guide_scale 6 --prompt "A capybara relaxing in a futuristic cyberpunk city, neon lights reflecting on water."
This configuration requires only about 10-12GB of VRAM to run.
--t5_cpu: Moves the T5 model to the CPU, reducing VRAM usage.
--offload_model True: Offloads the model to the CPU after each step, significantly reducing GPU memory consumption.
If you have sufficient VRAM, you can omit these flags for better performance.
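For instance, on a GPU with enough memory, the same 1.3B generation with the offload flags omitted looks like this:
python generate.py --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --sample_shift 8 --sample_guide_scale 6 --prompt "A capybara relaxing in a futuristic cyberpunk city, neon lights reflecting on water."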
If you have an 80GB+ VRAM GPU, you can run the 14B model at 720P resolution using:
python generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --prompt "A capybara relaxing in a futuristic cyberpunk city, neon lights reflecting on water."
The offload flags are not included in the command above, but if you are unsure whether your GPU has enough memory, it's recommended to run with them enabled first.
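For example, the 14B run with the same offload flags used earlier would look like this:
python generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --offload_model True --t5_cpu --prompt "A capybara relaxing in a futuristic cyberpunk city, neon lights reflecting on water."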
Note: Running this model on hardware with insufficient VRAM will lead to slow performance or crashes due to memory overflow.
For users with multiple GPUs, FSDP (Fully Sharded Data Parallel) and xDiT can distribute workloads across GPUs to accelerate generation. First, install the necessary package:
pip install "xfuser>=0.4.1"
Then, run the multi-GPU command:
torchrun --nproc_per_node=8 generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "A capybara relaxing in a futuristic cyberpunk city, neon lights reflecting on water."
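The command above assumes an 8-GPU node. If you have a different number of GPUs, adjust both --nproc_per_node and --ulysses_size to match, for example with 4 GPUs:
torchrun --nproc_per_node=4 generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --dit_fsdp --t5_fsdp --ulysses_size 4 --prompt "A capybara relaxing in a futuristic cyberpunk city, neon lights reflecting on water."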
Wan2.1 is a powerful yet accessible video generation model that enables high-quality AI-generated videos on a range of hardware configurations. While the 1.3B model is suitable for users with consumer GPUs, the 14B model delivers superior quality but requires high-end workstations.
If you're new to AI video generation, start with the 1.3B model and use CPU offloading to accommodate lower VRAM. As you scale up, explore multi-GPU inference for faster performance.
For more details and advanced usage, check the official Wan2.1 GitHub repository.