Wan-Fun
Welcome! This project is designed for text-to-video generation, offering a variety of models and features to meet different needs.
Quick Start
1. Cloud Usage: AliyunDSW/Docker
a. Via Alibaba Cloud DSW
DSW provides free GPU time. Users can apply for it once, and it remains valid for 3 months after application. Obtain the free GPU time from the Alibaba Cloud Free Tier, use it in Alibaba Cloud PAI-DSW, and you can start CogVideoX-Fun within 5 minutes.
b. Via ComfyUI
Our ComfyUI interface is shown below. For details, check the ComfyUI README.
c. Via Docker
If you use Docker, make sure the GPU driver and CUDA environment are correctly installed on your machine, then execute the following commands in sequence:
```shell
# pull image
docker pull mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun

# enter image
docker run -it -p 7860:7860 --network host --gpus all --security-opt seccomp:unconfined --shm-size 200g mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun

# clone code
git clone https://github.com/aigc-apps/VideoX-Fun.git

# enter VideoX-Fun's dir
cd VideoX-Fun

# create weight folders
mkdir -p models/Diffusion_Transformer
mkdir -p models/Personalized_Model

# Please use the Hugging Face or ModelScope links to download the models.
# CogVideoX-Fun:
#   https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-InP
#   https://modelscope.cn/models/PAI/CogVideoX-Fun-V1.1-5b-InP
# Wan:
#   https://huggingface.co/alibaba-pai/Wan2.1-Fun-V1.1-14B-InP
#   https://modelscope.cn/models/PAI/Wan2.1-Fun-V1.1-14B-InP
```
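Hugging Face models can also be fetched programmatically with `huggingface_hub.snapshot_download`. The small helper below is an illustrative sketch, not part of this repo: it maps a repo id onto the recommended local weight folder. The download call itself is commented out because it requires network access and `pip install huggingface_hub`.

```python
from pathlib import Path


def target_dir(repo_id: str, root: str = "models/Diffusion_Transformer") -> Path:
    """Map a Hub repo id, e.g. 'alibaba-pai/CogVideoX-Fun-V1.1-5b-InP',
    onto the recommended local weight folder."""
    return Path(root) / repo_id.split("/")[-1]


# Example (requires network access and huggingface_hub):
# from huggingface_hub import snapshot_download
# repo = "alibaba-pai/CogVideoX-Fun-V1.1-5b-InP"
# snapshot_download(repo_id=repo, local_dir=target_dir(repo))

print(target_dir("alibaba-pai/CogVideoX-Fun-V1.1-5b-InP"))
```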
2. Local Installation: Environment Check/Download/Installation
a. Environment Check
We have verified that this library runs in the following environments:

Details for Windows:
- Operating System: Windows 10
- Python: 3.10 & 3.11
- PyTorch: 2.2.0
- CUDA: 11.8 & 12.1
- cuDNN: 8+
- GPU: NVIDIA 3060 12 GB & NVIDIA 3090 24 GB

Details for Linux:
- Operating System: Ubuntu 20.04, CentOS
- Python: 3.10 & 3.11
- PyTorch: 2.2.0
- CUDA: 11.8 & 12.1
- cuDNN: 8+
- GPU: NVIDIA V100 16 GB & NVIDIA A10 24 GB & NVIDIA A100 40 GB & NVIDIA A100 80 GB

Approximately 60 GB of free disk space is required; please check before downloading the weights.
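Since the weights alone occupy roughly 60 GB, a quick standard-library check (a sketch, not part of the project) can confirm you have enough free space before downloading:

```python
import shutil

REQUIRED_GB = 60  # approximate space needed for the downloaded weights

# shutil.disk_usage reports total/used/free bytes for the given path.
free_gb = shutil.disk_usage(".").free / 1024**3
print(f"Free disk space: {free_gb:.1f} GB")
if free_gb < REQUIRED_GB:
    print(f"Warning: ~{REQUIRED_GB} GB needed, only {free_gb:.1f} GB available.")
```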
b. Weight Placement
It is recommended to place the weights according to the specified paths.

Via ComfyUI, place the models in the weight folder ComfyUI/models/Fun_Models/:
```
ComfyUI/
└── models/
    └── Fun_Models/
        ├── CogVideoX-Fun-V1.1-2b-InP/
        ├── CogVideoX-Fun-V1.1-5b-InP/
        ├── Wan2.1-Fun-V1.1-14B-InP/
        └── Wan2.1-Fun-V1.1-1.3B-InP/
```
When running your own Python files or UI interface:
```
models/
├── Diffusion_Transformer/
│   ├── CogVideoX-Fun-V1.1-2b-InP/
│   ├── CogVideoX-Fun-V1.1-5b-InP/
│   ├── Wan2.1-Fun-V1.1-14B-InP/
│   └── Wan2.1-Fun-V1.1-1.3B-InP/
└── Personalized_Model/
    └── your trained transformer model / your trained LoRA model (for UI load)
```
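The local layout above can be created in one step. A minimal sketch using only the standard library (the folder names are taken from the tree above):

```python
from pathlib import Path

# Create the recommended weight folders; exist_ok makes this safe to re-run.
for sub in ("Diffusion_Transformer", "Personalized_Model"):
    Path("models", sub).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in Path("models").iterdir()))
```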
Features
Video Generation
a. Memory - Saving Scheme
Due to the large number of parameters in Wan2.1, we need to consider memory - saving schemes to adapt to consumer - grade graphics cards. We provide a GPU_memory_mode
for each prediction file, which can be selected from model_cpu_offload
, model_cpu_offload_and_qfloat8
, and sequential_cpu_offload
. This scheme also applies to the generation of CogVideoX - Fun.
model_cpu_offload
means that the entire model will be moved to the CPU after use, which can save some video memory.model_cpu_offload_and_qfloat8
means that the entire model will be moved to the CPU after use, and the Transformer model is quantized to float8, which can save more video memory.sequential_cpu_offload
means that each layer of the model will be moved to the CPU after use. It is slower but saves a large amount of video memory.
qfloat8
will partially reduce the performance of the model but can save more video memory. If you have enough video memory, it is recommended to use model_cpu_offload
.
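The three modes can be pictured as a simple dispatcher. This is an illustrative sketch only: `DemoPipeline` and `apply_gpu_memory_mode` are hypothetical names, not the project's real API (the method names mirror the diffusers-style `enable_model_cpu_offload` / `enable_sequential_cpu_offload` calls); the project's own predict files define the actual logic.

```python
class DemoPipeline:
    """Minimal stand-in for a video pipeline; records which mode was applied."""

    def __init__(self):
        self.applied = None

    def enable_model_cpu_offload(self):
        self.applied = "model_cpu_offload"

    def enable_sequential_cpu_offload(self):
        self.applied = "sequential_cpu_offload"

    def quantize_transformer_float8(self):
        self.applied = "model_cpu_offload_and_qfloat8"


def apply_gpu_memory_mode(pipe, mode):
    if mode == "model_cpu_offload":
        pipe.enable_model_cpu_offload()       # whole model to CPU after use
    elif mode == "model_cpu_offload_and_qfloat8":
        pipe.quantize_transformer_float8()    # float8 Transformer + CPU offload
    elif mode == "sequential_cpu_offload":
        pipe.enable_sequential_cpu_offload()  # per-layer offload: slowest, least VRAM
    else:
        raise ValueError(f"Unknown GPU_memory_mode: {mode!r}")


pipe = DemoPipeline()
apply_gpu_memory_mode(pipe, "model_cpu_offload")
print(pipe.applied)  # model_cpu_offload
```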
b. Via ComfyUI
Check ComfyUI README for details.
c. Running Python Files
- Step 1: Download the corresponding weights and place them as described above.
Model Address
V1.1
| Name | Storage Space | Hugging Face | ModelScope | Description |
|---|---|---|---|---|
| Wan2.1-Fun-V1.1-1.3B-InP | 19.0 GB | Link | Link | Wan2.1-Fun-V1.1-1.3B text-to-video weights, trained at multiple resolutions, with first- and last-frame prediction support. |
| Wan2.1-Fun-V1.1-14B-InP | 47.0 GB | Link | Link | Wan2.1-Fun-V1.1-14B text-to-video weights, trained at multiple resolutions, with first- and last-frame prediction support. |
| Wan2.1-Fun-V1.1-1.3B-Control | 19.0 GB | Link | Link | Wan2.1-Fun-V1.1-1.3B video-control weights, supporting control conditions such as Canny, Depth, Pose, and MLSD, as well as trajectory control. Supports video prediction at multiple resolutions (512, 768, 1024); trained on 81 frames at 16 frames per second, with multi-language prediction support. |
| Wan2.1-Fun-V1.1-14B-Control | 47.0 GB | Link | Link | Wan2.1-Fun-V1.1-14B video-control weights, supporting control conditions such as Canny, Depth, Pose, and MLSD, as well as trajectory control. Supports video prediction at multiple resolutions (512, 768, 1024); trained on 81 frames at 16 frames per second, with multi-language prediction support. |
| Wan2.1-Fun-V1.1-1.3B-Control-Camera | 19.0 GB | Link | Link | Wan2.1-Fun-V1.1-1.3B camera-lens control weights. Supports video prediction at multiple resolutions (512, 768, 1024); trained on 81 frames at 16 frames per second, with multi-language prediction support. |
| Wan2.1-Fun-V1.1-14B-Control-Camera | 47.0 GB | Link | Link | Wan2.1-Fun-V1.1-14B camera-lens control weights. Supports video prediction at multiple resolutions (512, 768, 1024); trained on 81 frames at 16 frames per second, with multi-language prediction support. |
V1.0
| Name | Storage Space | Hugging Face | ModelScope | Description |
|---|---|---|---|---|
| Wan2.1-Fun-1.3B-InP | 19.0 GB | Link | Link | Wan2.1-Fun-1.3B text-to-video weights, trained at multiple resolutions, with first- and last-frame prediction support. |
| Wan2.1-Fun-14B-InP | 47.0 GB | Link | Link | Wan2.1-Fun-14B text-to-video weights, trained at multiple resolutions, with first- and last-frame prediction support. |
| Wan2.1-Fun-1.3B-Control | 19.0 GB | Link | Link | Wan2.1-Fun-1.3B video-control weights, supporting control conditions such as Canny, Depth, Pose, and MLSD, as well as trajectory control. Supports video prediction at multiple resolutions (512, 768, 1024); trained on 81 frames at 16 frames per second, with multi-language prediction support. |
| Wan2.1-Fun-14B-Control | 47.0 GB | Link | Link | Wan2.1-Fun-14B video-control weights, supporting control conditions such as Canny, Depth, Pose, and MLSD, as well as trajectory control. Supports video prediction at multiple resolutions (512, 768, 1024); trained on 81 frames at 16 frames per second, with multi-language prediction support. |
Video Works
Wan2.1-Fun-V1.1-14B-InP && Wan2.1-Fun-V1.1-1.3B-InP
Wan2.1-Fun-V1.1-14B-Control && Wan2.1-Fun-V1.1-1.3B-Control
Generic Control Video + Reference Image: demo grid (reference image, control video, and the outputs of Wan2.1-Fun-V1.1-14B-Control and Wan2.1-Fun-V1.1-1.3B-Control).

Generic Control Video (Canny, Pose, Depth, etc.) and Trajectory Control: demo videos.
Wan2.1-Fun-V1.1-14B-Control-Camera && Wan2.1-Fun-V1.1-1.3B-Control-Camera
| Pan Up | Pan Left | Pan Right |
|---|---|---|
| Pan Down | Pan Up + Pan Left | Pan Up + Pan Right |
License
This project is licensed under the Apache-2.0 license.