Wan-Fun
Welcome! This project focuses on text-to-video generation, offering a powerful solution for creating videos from text descriptions.
Quick Start
1. Cloud Usage: AliyunDSW/Docker
a. Through Alibaba Cloud DSW
DSW offers free GPU hours. Users can apply for it once, and it will be valid for 3 months after application.
Alibaba Cloud provides free GPU hours on Freetier. Obtain and use it in Alibaba Cloud PAI-DSW, and you can start CogVideoX-Fun within 5 minutes.
b. Through ComfyUI
For details on our ComfyUI interface, check the ComfyUI README.
c. Through Docker
If you use Docker, make sure that the graphics card driver and CUDA environment are correctly installed on your machine, and then execute the following commands in sequence:
```
# pull image
docker pull mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun

# enter image
docker run -it -p 7860:7860 --network host --gpus all --security-opt seccomp:unconfined --shm-size 200g mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun

# clone code
git clone https://github.com/aigc-apps/CogVideoX-Fun.git

# enter CogVideoX-Fun's dir
cd CogVideoX-Fun

# download weights
mkdir models/Diffusion_Transformer
mkdir models/Personalized_Model

# Please use the huggingface link or modelscope link to download the model.
# CogVideoX-Fun
# https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-InP
# https://modelscope.cn/models/PAI/CogVideoX-Fun-V1.1-5b-InP
# Wan
# https://huggingface.co/alibaba-pai/Wan2.1-Fun-14B-InP
# https://modelscope.cn/models/PAI/Wan2.1-Fun-14B-InP
```
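If you prefer to script the download instead of fetching the weights manually from the links above, a minimal sketch using the `huggingface_hub` package (an assumption; it is not shipped with this repository) could look like this. The repo IDs come from the Hugging Face links above; download only the models you plan to use:

```python
# Illustrative download script using huggingface_hub (pip install huggingface_hub).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="alibaba-pai/CogVideoX-Fun-V1.1-5b-InP",
    local_dir="models/Diffusion_Transformer/CogVideoX-Fun-V1.1-5b-InP",
)
snapshot_download(
    repo_id="alibaba-pai/Wan2.1-Fun-14B-InP",
    local_dir="models/Diffusion_Transformer/Wan2.1-Fun-14B-InP",
)
```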
2. Local Installation: Environment Check/Download/Installation
a. Environment Check
We have verified that this library can be executed in the following environments:
Details for Windows:
- Operating System: Windows 10
- Python: python3.10 & python3.11
- PyTorch: torch2.2.0
- CUDA: 11.8 & 12.1
- CUDNN: 8+
- GPU: Nvidia-3060 12G & Nvidia-3090 24G
Details for Linux:
- Operating System: Ubuntu 20.04, CentOS
- Python: python3.10 & python3.11
- PyTorch: torch2.2.0
- CUDA: 11.8 & 12.1
- CUDNN: 8+
- GPU: Nvidia-V100 16G & Nvidia-A10 24G & Nvidia-A100 40G & Nvidia-A100 80G
Approximately 60 GB of free disk space is required; please check before downloading the weights.
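A quick way to confirm that your local setup roughly matches the environments above is a short Python check (illustrative only; it merely prints versions and free disk space):

```python
# Sanity-check the local environment against the requirements listed above.
import shutil
import torch

print("PyTorch:", torch.__version__)             # expect 2.2.0 or newer
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA version:", torch.version.cuda)   # expect 11.8 or 12.1
    print("GPU:", torch.cuda.get_device_name(0))

free_gb = shutil.disk_usage(".").free / 1024**3
print(f"Free disk space: {free_gb:.0f} GB")      # roughly 60 GB is needed for the weights
```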
b. Weight Placement
It is recommended to place the weights according to the following layout:
```
models/
├── Diffusion_Transformer/
│   ├── CogVideoX-Fun-V1.1-2b-InP/
│   ├── CogVideoX-Fun-V1.1-5b-InP/
│   ├── Wan2.1-Fun-14B-InP/
│   └── Wan2.1-Fun-1.3B-InP/
└── Personalized_Model/
    └── your trained transformer model / your trained lora model (for UI load)
```
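Before launching the UI or the prediction scripts, you can verify that the weights sit where the layout above expects. This is just a convenience check run from the repository root; only the models you actually downloaded need to be present:

```python
# Check that the downloaded weights follow the layout shown above.
import os

expected = [
    "models/Diffusion_Transformer/CogVideoX-Fun-V1.1-5b-InP",
    "models/Diffusion_Transformer/Wan2.1-Fun-14B-InP",
    "models/Personalized_Model",
]
for path in expected:
    print(("found   " if os.path.isdir(path) else "MISSING ") + path)
```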
Features
This project supports text-to-video, image-to-video, and video-to-video generation, as well as video generation with common controls (Canny, Pose, Depth, etc.). It also provides a user-friendly web UI for easy operation.
Installation
The installation methods include cloud usage (AliyunDSW/Docker) and local installation. For detailed steps, please refer to the Quick Start section.
Usage Examples
Basic Usage
1. Generation
a. Memory Saving Scheme
Since Wan2.1 has a very large number of parameters, a memory-saving scheme is needed to fit consumer-grade GPUs. Each prediction file provides a `GPU_memory_mode` option, which can be set to `model_cpu_offload`, `model_cpu_offload_and_qfloat8`, or `sequential_cpu_offload`. The same scheme also applies to CogVideoX-Fun generation.
- `model_cpu_offload`: the entire model is moved to the CPU after use, saving some GPU memory.
- `model_cpu_offload_and_qfloat8`: the entire model is moved to the CPU after use and the transformer is quantized to float8, saving more GPU memory.
- `sequential_cpu_offload`: each layer of the model is moved to the CPU after use; this is slower but saves the most GPU memory.
`qfloat8` slightly reduces model quality but saves more GPU memory. If GPU memory is sufficient, `model_cpu_offload` is recommended. A conceptual sketch of these modes is shown below.
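Conceptually, the three modes correspond to the standard CPU-offloading hooks in Hugging Face `diffusers`. The sketch below only illustrates that mapping, assuming the downloaded weights folder is a diffusers-format pipeline; the prediction scripts in this repository wire up `GPU_memory_mode` themselves and may do it differently:

```python
# Illustration of the memory-saving modes with diffusers' generic offloading API.
# This is not the repository's own implementation, just the underlying idea.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "models/Diffusion_Transformer/CogVideoX-Fun-V1.1-5b-InP",  # assumed diffusers-format weights
    torch_dtype=torch.bfloat16,
)

GPU_memory_mode = "model_cpu_offload"  # or "model_cpu_offload_and_qfloat8" / "sequential_cpu_offload"

if GPU_memory_mode == "sequential_cpu_offload":
    # Offloads layer by layer: slowest, but lowest GPU memory usage.
    pipe.enable_sequential_cpu_offload()
else:
    # Offloads whole sub-models after each use; the qfloat8 variant additionally
    # quantizes the transformer to float8 before offloading (handled by the repo's scripts).
    pipe.enable_model_cpu_offload()
```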
b. Through ComfyUI
For details, check ComfyUI README.
c. Running Python Files
- Step 1: Download the corresponding weights and place them in the `models` folder.
- Step 2: Use different files for prediction depending on the weights and the prediction target. This library currently supports CogVideoX-Fun, Wan2.1, and Wan2.1-Fun, distinguished by the folder names in the `examples` folder. Different models support different functions, so choose accordingly. Taking CogVideoX-Fun as an example:
  - Text-to-video:
    - Modify `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the `examples/cogvideox_fun/predict_t2v.py` file.
    - Run the `examples/cogvideox_fun/predict_t2v.py` file and wait for the generation result. The result is saved in the `samples/cogvideox-fun-videos` folder.
  - Image-to-video:
    - Modify `validation_image_start`, `validation_image_end`, `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the `examples/cogvideox_fun/predict_i2v.py` file. `validation_image_start` is the start image of the video and `validation_image_end` is the end image of the video.
    - Run the `examples/cogvideox_fun/predict_i2v.py` file and wait for the generation result. The result is saved in the `samples/cogvideox-fun-videos_i2v` folder.
  - Video-to-video:
    - Modify `validation_video`, `validation_image_end`, `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the `examples/cogvideox_fun/predict_v2v.py` file. `validation_video` is the reference video for video-to-video generation. You can use the following video for demonstration: Demo Video.
    - Run the `examples/cogvideox_fun/predict_v2v.py` file and wait for the generation result. The result is saved in the `samples/cogvideox-fun-videos_v2v` folder.
  - Video generation with common controls (Canny, Pose, Depth, etc.):
    - Modify `control_video`, `validation_image_end`, `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the `examples/cogvideox_fun/predict_v2v_control.py` file. `control_video` is the control video for controlled generation, i.e. a video extracted with operators such as Canny, Pose, or Depth. You can use the following video for demonstration: Demo Video.
    - Run the `examples/cogvideox_fun/predict_v2v_control.py` file and wait for the generation result. The result is saved in the `samples/cogvideox-fun-videos_v2v_control` folder.
- Step 3: If you want to combine other backbones and LoRA models you have trained, modify `lora_path` in `examples/{model_name}/predict_t2v.py` and `examples/{model_name}/predict_i2v.py` as needed (see the illustrative parameter block after this list).
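As referenced in Step 3, here is an illustrative parameter block showing the kind of edits Steps 2 and 3 describe. The variable names are the ones listed above; the image and LoRA file paths are hypothetical placeholders, and the surrounding code in the actual prediction files may differ:

```python
# Illustrative settings for examples/cogvideox_fun/predict_t2v.py (text-to-video).
prompt         = "A panda playing a guitar on a wooden stage, cinematic lighting."
neg_prompt     = "blurry, low quality, distorted, watermark"
guidance_scale = 6.0   # strength of classifier-free guidance
seed           = 43    # fix the seed to make results reproducible

# For image-to-video (predict_i2v.py), additionally set the start/end frames
# (hypothetical example paths):
# validation_image_start = "asset/start_frame.png"
# validation_image_end   = "asset/end_frame.png"

# Step 3: to load a LoRA you trained yourself (hypothetical file name):
# lora_path = "models/Personalized_Model/your_trained_lora.safetensors"
```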
d. Through the UI Interface
The web UI supports text-to-video, image-to-video, and video-to-video generation, as well as video generation with common controls (Canny, Pose, Depth, etc.). This library currently supports CogVideoX-Fun, Wan2.1, and Wan2.1-Fun, distinguished by the folder names in the `examples` folder. Different models support different functions, so choose accordingly. Taking CogVideoX-Fun as an example:
- Step 1: Download the corresponding weights and place them in the `models` folder.
- Step 2: Run the `examples/cogvideox_fun/app.py` file to open the Gradio page.
- Step 3: Select the generation model on the page, fill in `prompt`, `neg_prompt`, `guidance_scale`, `seed`, etc., click "Generate", and wait for the generation result. The result is saved in the `sample` folder.
Documentation
Model Address
V1.0:
| Name | Storage Space | Hugging Face | Model Scope | Description |
|---|---|---|---|---|
| Wan2.1-Fun-1.3B-InP | 19.0 GB | Link | Link | The text-to-video weights of Wan2.1-Fun-1.3B, trained at multiple resolutions, with support for start- and end-frame prediction. |
| Wan2.1-Fun-14B-InP | 47.0 GB | Link | Link | The text-to-video weights of Wan2.1-Fun-14B, trained at multiple resolutions, with support for start- and end-frame prediction. |
| Wan2.1-Fun-1.3B-Control | 19.0 GB | Link | Link | The video control weights of Wan2.1-Fun-1.3B. Supports control conditions such as Canny, Depth, Pose, and MLSD, as well as trajectory control. Supports video prediction at multiple resolutions (512, 768, 1024), trained with 81 frames at 16 frames per second, and supports multi-language prediction. |
| Wan2.1-Fun-14B-Control | 47.0 GB | Link | Link | The video control weights of Wan2.1-Fun-14B. Supports control conditions such as Canny, Depth, Pose, and MLSD, as well as trajectory control. Supports video prediction at multiple resolutions (512, 768, 1024), trained with 81 frames at 16 frames per second, and supports multi-language prediction. |
Video Works
Wan2.1-Fun-14B-InP && Wan2.1-Fun-1.3B-InP
Wan2.1-Fun-14B-Control && Wan2.1-Fun-1.3B-Control
License
This project is licensed under the Apache License, Version 2.0.
References
- CogVideo: https://github.com/THUDM/CogVideo/
- EasyAnimate: https://github.com/aigc-apps/EasyAnimate
- Wan2.1: https://github.com/Wan-Video/Wan2.1/

