Wan-Fun
Wan-Fun is a video generation model that creates videos from text or image inputs. It supports multiple resolutions and a variety of control conditions, offering a powerful solution for video creation.
Welcome!
Quick Start
1. Cloud Usage: AliyunDSW/Docker
a. Through Alibaba Cloud DSW
DSW offers free GPU hours that users can apply for once; they remain valid for 3 months after application.
Alibaba Cloud provides free GPU hours on Freetier. Obtain them and use them in Alibaba Cloud PAI-DSW to start CogVideoX-Fun within 5 minutes.
b. Through ComfyUI
For details on our ComfyUI interface, check the ComfyUI README.
c. Through Docker
If using Docker, ensure that the graphics card driver and CUDA environment are correctly installed on the machine, and then execute the following commands in sequence:
```sh
# pull image
docker pull mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun

# enter image
docker run -it -p 7860:7860 --network host --gpus all --security-opt seccomp:unconfined --shm-size 200g mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun

# clone code
git clone https://github.com/aigc-apps/CogVideoX-Fun.git

# enter CogVideoX-Fun's dir
cd CogVideoX-Fun

# download weights
mkdir models/Diffusion_Transformer
mkdir models/Personalized_Model

# Please use the huggingface link or modelscope link to download the models.
# CogVideoX-Fun
# https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-InP
# https://modelscope.cn/models/PAI/CogVideoX-Fun-V1.1-5b-InP
# Wan
# https://huggingface.co/alibaba-pai/Wan2.1-Fun-14B-InP
# https://modelscope.cn/models/PAI/Wan2.1-Fun-14B-InP
```
2. Local Installation: Environment Check/Download/Installation
a. Environment Check
We have verified that this library can run in the following environments:
Details for Windows:
- Operating System: Windows 10
- Python: python3.10 & python3.11
- PyTorch: torch2.2.0
- CUDA: 11.8 & 12.1
- CUDNN: 8+
- GPU: Nvidia-3060 12G & Nvidia-3090 24G
Details for Linux:
- Operating System: Ubuntu 20.04, CentOS
- Python: python3.10 & python3.11
- PyTorch: torch2.2.0
- CUDA: 11.8 & 12.1
- CUDNN: 8+
- GPU: Nvidia-V100 16G & Nvidia-A10 24G & Nvidia-A100 40G & Nvidia-A100 80G
Approximately 60 GB of free disk space is required; please check this before proceeding.
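As a quick sanity check, the snippet below prints the relevant versions and free disk space (a minimal sketch; the values in the comments are the configurations verified above, and newer versions may also work):

```python
# Quick environment sanity check for the verified configurations listed above.
import shutil
import torch

print("PyTorch:", torch.__version__)              # verified with 2.2.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA runtime:", torch.version.cuda)    # verified with 11.8 and 12.1
    print("GPU:", torch.cuda.get_device_name(0))

free_gb = shutil.disk_usage(".").free / 1024**3
print(f"Free disk space: {free_gb:.0f} GB (roughly 60 GB needed)")
```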
b. Weight Placement
It is recommended to place the weights in the following paths:
```
models/
├── Diffusion_Transformer/
│   ├── CogVideoX-Fun-V1.1-2b-InP/
│   ├── CogVideoX-Fun-V1.1-5b-InP/
│   ├── Wan2.1-Fun-14B-InP/
│   └── Wan2.1-Fun-1.3B-InP/
└── Personalized_Model/
    └── your trained transformer model / your trained lora model (for UI load)
```
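For example, a checkpoint can be pulled into this layout with huggingface_hub (a minimal sketch; it assumes huggingface_hub is installed, and the ModelScope links given earlier work just as well):

```python
# Download one of the checkpoints referenced above into the recommended folder.
# Assumes `pip install huggingface_hub`; swap repo_id/local_dir for other models.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="alibaba-pai/Wan2.1-Fun-14B-InP",
    local_dir="models/Diffusion_Transformer/Wan2.1-Fun-14B-InP",
)
```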
Features
Model Address
V1.0:
| Name | Storage Space | Hugging Face | ModelScope | Description |
|---|---|---|---|---|
| Wan2.1-Fun-1.3B-InP | 19.0 GB | Link | Link | Wan2.1-Fun-1.3B text-to-video weights, trained at multiple resolutions, supporting first- and last-frame prediction. |
| Wan2.1-Fun-14B-InP | 47.0 GB | Link | Link | Wan2.1-Fun-14B text-to-video weights, trained at multiple resolutions, supporting first- and last-frame prediction. |
| Wan2.1-Fun-1.3B-Control | 19.0 GB | Link | Link | Wan2.1-Fun-1.3B video control weights, supporting control conditions such as Canny, Depth, Pose, and MLSD, as well as trajectory control. Supports video prediction at multiple resolutions (512, 768, 1024), trained with 81 frames at 16 frames per second, with multi-language prediction support. |
| Wan2.1-Fun-14B-Control | 47.0 GB | Link | Link | Wan2.1-Fun-14B video control weights, supporting control conditions such as Canny, Depth, Pose, and MLSD, as well as trajectory control. Supports video prediction at multiple resolutions (512, 768, 1024), trained with 81 frames at 16 frames per second, with multi-language prediction support. |
Video Works
- Wan2.1-Fun-14B-InP and Wan2.1-Fun-1.3B-InP
- Wan2.1-Fun-14B-Control and Wan2.1-Fun-1.3B-Control
Usage Examples
1. Generation
a. Memory Saving Scheme
Due to the large number of parameters in Wan2.1, we need to consider a memory-saving scheme to adapt to consumer-grade graphics cards. We provide a GPU_memory_mode for each prediction file, which can be selected from model_cpu_offload, model_cpu_offload_and_qfloat8, and sequential_cpu_offload. This scheme also applies to the generation of CogVideoX-Fun.
- model_cpu_offload means the entire model will be moved to the CPU after use, saving some video memory.
- model_cpu_offload_and_qfloat8 means the entire model will be moved to the CPU after use, and the transformer model is quantized to float8, saving more video memory.
- sequential_cpu_offload means each layer of the model will be moved to the CPU after use. It is slower but saves a large amount of video memory.
qfloat8 partially reduces the model's performance but saves more video memory. If video memory is sufficient, model_cpu_offload is recommended.
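As an illustration only, the three modes map roughly onto the standard diffusers offloading calls. The sketch below assumes a diffusers-compatible pipeline and a hypothetical local checkpoint path; it is not the exact code of the prediction scripts, which use the repo's own pipeline classes and qfloat8 helper:

```python
# Illustrative mapping of GPU_memory_mode onto diffusers offloading helpers.
import torch
from diffusers import DiffusionPipeline  # assumes a diffusers-compatible pipeline

GPU_memory_mode = "model_cpu_offload"  # or "model_cpu_offload_and_qfloat8", "sequential_cpu_offload"

pipe = DiffusionPipeline.from_pretrained(
    "models/Diffusion_Transformer/Wan2.1-Fun-1.3B-InP",  # hypothetical local path
    torch_dtype=torch.bfloat16,
)

if GPU_memory_mode == "sequential_cpu_offload":
    # Offload layer by layer: lowest VRAM use, slowest generation.
    pipe.enable_sequential_cpu_offload()
else:
    # Offload whole sub-models after use; the *_qfloat8 variant additionally
    # quantizes the transformer to float8 using the repo's own helper (not shown).
    pipe.enable_model_cpu_offload()
```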
b. Through ComfyUI
For details, check the ComfyUI README.
c. Running Python Files
- Step 1: Download the corresponding weights and place them in the models folder.
- Step 2: Use different prediction files depending on the weights and the prediction target. Currently, this library supports CogVideoX-Fun, Wan2.1, and Wan2.1-Fun, distinguished by folder names under the examples folder. Different models support different functions, so check which functions the model you use supports. CogVideoX-Fun is used as the example below.
- Text-to-Video:
- Modify prompt, neg_prompt, guidance_scale, and seed in the examples/cogvideox_fun/predict_t2v.py file.
- Then run the examples/cogvideox_fun/predict_t2v.py file and wait for the generation result. The result will be saved in the samples/cogvideox-fun-videos folder.
- Image-to-Video:
- Modify validation_image_start, validation_image_end, prompt, neg_prompt, guidance_scale, and seed in the examples/cogvideox_fun/predict_i2v.py file.
- validation_image_start is the starting image of the video, and validation_image_end is the ending image of the video.
- Then run the examples/cogvideox_fun/predict_i2v.py file and wait for the generation result. The result will be saved in the samples/cogvideox-fun-videos_i2v folder.
- Video-to-Video:
- Modify validation_video, validation_image_end, prompt, neg_prompt, guidance_scale, and seed in the examples/cogvideox_fun/predict_v2v.py file.
- validation_video is the reference video for video-to-video generation. You can use the following video for the demo: Demo Video
- Then run the examples/cogvideox_fun/predict_v2v.py file and wait for the generation result. The result will be saved in the samples/cogvideox-fun-videos_v2v folder.
- Normal Control-to-Video (Canny, Pose, Depth, etc.):
- Modify control_video, validation_image_end, prompt, neg_prompt, guidance_scale, and seed in the examples/cogvideox_fun/predict_v2v_control.py file.
- control_video is the control video for control-to-video generation, which is a video extracted by operators such as Canny, Pose, and Depth. You can use the following video for the demo: Demo Video
- Then run the examples/cogvideox_fun/predict_v2v_control.py file and wait for the generation result. The result will be saved in the samples/cogvideox-fun-videos_v2v_control folder.
- Step 3: If you want to combine other backbones or LoRA models you trained yourself, modify lora_path in examples/{model_name}/predict_t2v.py and examples/{model_name}/predict_i2v.py as needed (see the parameter sketch below).
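As a purely illustrative sketch of the kind of values these prediction scripts expose (the variable names and layout in the real examples/cogvideox_fun/predict_t2v.py may differ):

```python
# Hypothetical excerpt of the settings typically edited before running
# examples/cogvideox_fun/predict_t2v.py; names and defaults are illustrative.
prompt         = "A panda plays a guitar on a snowy mountain, cinematic lighting."
neg_prompt     = "blurry, low quality, distorted, watermark"
guidance_scale = 6.0
seed           = 43
lora_path      = None  # e.g. "models/Personalized_Model/your_lora.safetensors" for a trained LoRA

# After editing, run the script; results are saved under samples/cogvideox-fun-videos:
#   python examples/cogvideox_fun/predict_t2v.py
```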
d. Through the UI Interface
The web UI supports text-to-video, image-to-video, video-to-video, and normal control-to-video (Canny, Pose, Depth, etc.). Currently, this library supports CogVideoX-Fun, Wan2.1, and Wan2.1-Fun, distinguished by folder names under the examples folder. Different models support different functions, so check which functions the model you use supports. CogVideoX-Fun is used as the example below.
- Step 1: Download the corresponding weights and place them in the models folder.
- Step 2: Run the examples/cogvideox_fun/app.py file and enter the Gradio page.
- Step 3: Select the generation model on the page, fill in fields such as prompt, neg_prompt, guidance_scale, and seed, click Generate, and wait for the result, which will be saved in the sample folder.
Documentation
References
- CogVideo: https://github.com/THUDM/CogVideo/
- EasyAnimate: https://github.com/aigc-apps/EasyAnimate
- Wan2.1: https://github.com/Wan-Video/Wan2.1/
License
This project is licensed under the Apache License (Version 2.0).

