Wan-Fun
Welcome! This project is designed for text-to-video generation, offering a powerful solution for creating high-quality videos from text inputs.
Quick Start
1. Cloud Usage: AliyunDSW/Docker
a. Via Alibaba Cloud DSW
DSW offers free GPU time, which users can apply for once; the free quota is valid for 3 months after application.
Alibaba Cloud provides this free GPU time through its free tier. Claim it and use it in Alibaba Cloud PAI-DSW; CogVideoX-Fun can be launched within 5 minutes.
b. Via ComfyUI
For details on using ComfyUI, see the ComfyUI README.
c. Via Docker
If you use Docker, make sure that the graphics card driver and CUDA environment are correctly installed on your machine. Then execute the following commands in sequence:
# pull image
docker pull mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun
# enter image
docker run -it -p 7860:7860 --network host --gpus all --security-opt seccomp:unconfined --shm-size 200g mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun
# clone code
git clone https://github.com/aigc-apps/CogVideoX-Fun.git
# enter CogVideoX-Fun's dir
cd CogVideoX-Fun
# download weights
mkdir -p models/Diffusion_Transformer
mkdir -p models/Personalized_Model
# Please use the Hugging Face link or ModelScope link to download the models.
# CogVideoX-Fun
# https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-InP
# https://modelscope.cn/models/PAI/CogVideoX-Fun-V1.1-5b-InP
# Wan
# https://huggingface.co/alibaba-pai/Wan2.1-Fun-14B-InP
# https://modelscope.cn/models/PAI/Wan2.1-Fun-14B-InP
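If you prefer to script the download, a minimal sketch with `huggingface_hub` is shown below; the repository ID is one of the Hugging Face links listed above, and the target folder follows the layout described in the Weight Placement section (ModelScope offers an equivalent `snapshot_download` in its own SDK).

```python
# Sketch: download the Wan2.1-Fun-14B-InP weights from Hugging Face into the
# expected models/ layout. Assumes `pip install huggingface_hub`.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="alibaba-pai/Wan2.1-Fun-14B-InP",
    local_dir="models/Diffusion_Transformer/Wan2.1-Fun-14B-InP",
)
```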
2. Local Installation: Environment Check/Download/Installation
a. Environment Check
We have verified that this library can be executed in the following environments:
Details for Windows:
- Operating System: Windows 10
- Python: python3.10 & python3.11
- PyTorch: torch2.2.0
- CUDA: 11.8 & 12.1
- CUDNN: 8+
- GPU: Nvidia 3060 12G & Nvidia 3090 24G
Details for Linux:
- Operating System: Ubuntu 20.04, CentOS
- Python: python3.10 & python3.11
- PyTorch: torch2.2.0
- CUDA: 11.8 & 12.1
- CUDNN: 8+
- GPU: Nvidia V100 16G & Nvidia A10 24G & Nvidia A100 40G & Nvidia A100 80G
You need approximately 60 GB of available disk space; please check before proceeding.
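If you want to sanity-check this list programmatically, a short script along these lines reports the PyTorch/CUDA versions, the detected GPU, and the free disk space (illustrative only):

```python
# Sketch: roughly verify the environment against the checklist above.
import shutil

import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA runtime:", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0))

# Around 60 GB of free disk space is needed for the weights.
free_gb = shutil.disk_usage(".").free / 1024**3
print(f"Free disk space: {free_gb:.1f} GB")
```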
b. Weight Placement
It is recommended to place the weights in the specified paths:
📦 models/
├── 📂 Diffusion_Transformer/
│   ├── 📂 CogVideoX-Fun-V1.1-2b-InP/
│   ├── 📂 CogVideoX-Fun-V1.1-5b-InP/
│   ├── 📂 Wan2.1-Fun-14B-InP/
│   └── 📂 Wan2.1-Fun-1.3B-InP/
└── 📂 Personalized_Model/
    └── your trained transformer model / your trained lora model (for UI load)
✨ Features
Model Address
V1.0:
| Name | Storage Space | Hugging Face | ModelScope | Description |
|---|---|---|---|---|
| Wan2.1-Fun-1.3B-InP | 19.0 GB | 🤗 Link | 😄 Link | Wan2.1-Fun-1.3B text-to-video weights, trained at multiple resolutions, supporting prediction of the first and last frames. |
| Wan2.1-Fun-14B-InP | 47.0 GB | 🤗 Link | 😄 Link | Wan2.1-Fun-14B text-to-video weights, trained at multiple resolutions, supporting prediction of the first and last frames. |
| Wan2.1-Fun-1.3B-Control | 19.0 GB | 🤗 Link | 😄 Link | Wan2.1-Fun-1.3B video control weights, supporting control conditions such as Canny, Depth, Pose, and MLSD, as well as trajectory control. Supports video prediction at multiple resolutions (512, 768, 1024), trained with 81 frames at 16 frames per second, with multilingual prediction. |
| Wan2.1-Fun-14B-Control | 47.0 GB | 🤗 Link | 😄 Link | Wan2.1-Fun-14B video control weights, supporting control conditions such as Canny, Depth, Pose, and MLSD, as well as trajectory control. Supports video prediction at multiple resolutions (512, 768, 1024), trained with 81 frames at 16 frames per second, with multilingual prediction. |
Video Results
Wan2.1-Fun-14B-InP && Wan2.1-Fun-1.3B-InP
Wan2.1-Fun-14B-Control && Wan2.1-Fun-1.3B-Control
💻 Usage Examples
1. Generation
a. Memory-Saving Solution

Since Wan2.1 has a very large number of parameters, we need a memory-saving strategy to fit consumer-grade graphics cards. Each prediction file provides a `GPU_memory_mode` option, which can be set to `model_cpu_offload`, `model_cpu_offload_and_qfloat8`, or `sequential_cpu_offload`. This solution also applies to CogVideoX-Fun generation.

- `model_cpu_offload`: the entire model is moved to the CPU after use, which saves some GPU memory.
- `model_cpu_offload_and_qfloat8`: the entire model is moved to the CPU after use, and the transformer is quantized to float8, which saves more GPU memory.
- `sequential_cpu_offload`: each layer of the model is moved to the CPU after use. It is slower but saves a large amount of GPU memory.

`qfloat8` reduces model quality somewhat but saves more GPU memory. If GPU memory is sufficient, `model_cpu_offload` is recommended.
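The sketch below shows one way the three modes could map onto the standard offload helpers of a diffusers-style pipeline; it is a simplified illustration under that assumption, not the project's actual implementation, and the float8 quantization step is only indicated by a comment.

```python
# Sketch: map a GPU_memory_mode string onto offloading calls of a
# diffusers-style pipeline object. Illustrative only.
def apply_gpu_memory_mode(pipeline, gpu_memory_mode: str) -> None:
    if gpu_memory_mode == "sequential_cpu_offload":
        # Offload layer by layer after use: slowest, lowest GPU memory usage.
        pipeline.enable_sequential_cpu_offload()
    elif gpu_memory_mode == "model_cpu_offload_and_qfloat8":
        # In the real scripts the transformer is additionally quantized to
        # float8 before offloading (project-specific step, not shown here).
        pipeline.enable_model_cpu_offload()
    else:  # "model_cpu_offload"
        # Offload whole sub-models to the CPU after each use.
        pipeline.enable_model_cpu_offload()
```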
b. Via ComfyUI
For details, see the ComfyUI README.
c. Run Python Files

- Step 1: Download the corresponding weights and place them in the `models` folder.
- Step 2: Use different prediction files depending on the weights and the prediction target. Currently, this library supports CogVideoX-Fun, Wan2.1, and Wan2.1-Fun, which are distinguished by the folder names under the `examples` folder. Different models support different functions, so choose according to your case. The following takes CogVideoX-Fun as an example; a sketch of the parameters involved appears after this list.
  - Text-to-video:
    - Modify `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the `examples/cogvideox_fun/predict_t2v.py` file.
    - Then run the `examples/cogvideox_fun/predict_t2v.py` file and wait for the result. The result is saved in the `samples/cogvideox-fun-videos` folder.
  - Image-to-video:
    - Modify `validation_image_start`, `validation_image_end`, `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the `examples/cogvideox_fun/predict_i2v.py` file.
    - `validation_image_start` is the starting image of the video, and `validation_image_end` is the ending image of the video.
    - Then run the `examples/cogvideox_fun/predict_i2v.py` file and wait for the result. The result is saved in the `samples/cogvideox-fun-videos_i2v` folder.
  - Video-to-video:
    - Modify `validation_video`, `validation_image_end`, `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the `examples/cogvideox_fun/predict_v2v.py` file.
    - `validation_video` is the reference video for video-to-video generation. You can use the following video for demonstration: Demo Video
    - Then run the `examples/cogvideox_fun/predict_v2v.py` file and wait for the result. The result is saved in the `samples/cogvideox-fun-videos_v2v` folder.
  - Normal control-to-video (Canny, Pose, Depth, etc.):
    - Modify `control_video`, `validation_image_end`, `prompt`, `neg_prompt`, `guidance_scale`, and `seed` in the `examples/cogvideox_fun/predict_v2v_control.py` file.
    - `control_video` is the control video for control-to-video generation, extracted with operators such as Canny, Pose, or Depth. You can use the following video for demonstration: Demo Video
    - Then run the `examples/cogvideox_fun/predict_v2v_control.py` file and wait for the result. The result is saved in the `samples/cogvideox-fun-videos_v2v_control` folder.
- Step 3: If you want to combine other backbones and LoRA models you have trained, modify `lora_path` in `examples/{model_name}/predict_t2v.py` and `examples/{model_name}/predict_i2v.py` as needed.
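As a concrete illustration of the parameters mentioned in Step 2, the snippet below shows the kind of values edited near the top of the text-to-video script; the variable names are the ones listed above, while the concrete values are placeholders rather than the file's actual contents.

```python
# Sketch: values edited near the top of examples/cogvideox_fun/predict_t2v.py
# before running it. Placeholders only; check the file itself for the full
# set of options.
prompt = "A panda playing a guitar on a grassy hilltop, cinematic lighting."
neg_prompt = "blurry, low quality, watermark, distorted"
guidance_scale = 6.0  # higher values follow the prompt more strictly
seed = 43             # fix the seed for reproducible results

# After editing, run the script; the output appears under
# samples/cogvideox-fun-videos:
#   python examples/cogvideox_fun/predict_t2v.py
```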
d. Via UI Interface

The webui supports text-to-video, image-to-video, video-to-video, and normal control-to-video (Canny, Pose, Depth, etc.). Currently, this library supports CogVideoX-Fun, Wan2.1, and Wan2.1-Fun, which are distinguished by the folder names under the `examples` folder. Different models support different functions, so choose according to your case. The following takes CogVideoX-Fun as an example.

- Step 1: Download the corresponding weights and place them in the `models` folder.
- Step 2: Run the `examples/cogvideox_fun/app.py` file to open the Gradio page (a minimal sketch of such a UI follows after this list).
- Step 3: Select the generation model on the page, fill in `prompt`, `neg_prompt`, `guidance_scale`, `seed`, etc., click "Generate", and wait for the result. The result is saved in the `sample` folder.
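For orientation, the rough shape of such a webui is sketched below with Gradio; this is not the project's `app.py`, and `generate_video` is a hypothetical placeholder for the actual model call.

```python
# Sketch: a minimal Gradio layout mirroring the webui flow described above.
# `generate_video` is a hypothetical stand-in for the real inference code.
import gradio as gr

def generate_video(prompt, neg_prompt, guidance_scale, seed):
    # Placeholder: the real webui runs the selected model here and returns
    # the path of the generated video inside the `sample` folder.
    return None

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="prompt")
    neg_prompt = gr.Textbox(label="neg_prompt")
    guidance_scale = gr.Slider(1.0, 20.0, value=6.0, label="guidance_scale")
    seed = gr.Number(value=43, label="seed")
    button = gr.Button("Generate")
    output = gr.Video(label="result")
    button.click(generate_video, [prompt, neg_prompt, guidance_scale, seed], output)

demo.launch(server_name="0.0.0.0", server_port=7860)
```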
Documentation
References
- CogVideo: https://github.com/THUDM/CogVideo/
- EasyAnimate: https://github.com/aigc-apps/EasyAnimate
- Wan2.1: https://github.com/Wan-Video/Wan2.1/
License
This project is licensed under the Apache License (Version 2.0).

