# Wan-Fun
😊 Welcome! This project is a text-to-video solution that generates high-quality videos from text inputs. It offers model versions with different capabilities and can be used through several methods.
# 🚀 Quick Start
## 1. Cloud Usage: AliyunDSW/Docker
### a. Through Alibaba Cloud DSW
DSW provides free GPU hours; you can apply once, and the hours remain valid for 3 months after application.

Alibaba Cloud offers free GPU hours on Freetier. Obtain them and use them in Alibaba Cloud PAI-DSW, and you can start CogVideoX-Fun within 5 minutes.
### b. Through ComfyUI
Our ComfyUI interface is shown below. For details, check the ComfyUI README.
### c. Through Docker
If you use Docker, make sure the graphics card driver and the CUDA environment are correctly installed on your machine, then execute the following commands in sequence:
```sh
# pull image
docker pull mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun

# enter image
docker run -it -p 7860:7860 --network host --gpus all --security-opt seccomp:unconfined --shm-size 200g mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:cogvideox_fun

# clone code
git clone https://github.com/aigc-apps/VideoX-Fun.git

# enter VideoX-Fun's dir
cd VideoX-Fun

# download weights
mkdir -p models/Diffusion_Transformer
mkdir -p models/Personalized_Model

# Please use the huggingface link or modelscope link to download the model.
# CogVideoX-Fun
# https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-InP
# https://modelscope.cn/models/PAI/CogVideoX-Fun-V1.1-5b-InP

# Wan
# https://huggingface.co/alibaba-pai/Wan2.1-Fun-V1.1-14B-InP
# https://modelscope.cn/models/PAI/Wan2.1-Fun-V1.1-14B-InP
```
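If you prefer a scripted download from Hugging Face, here is a minimal sketch using `huggingface_hub`; the repo ID comes from the links above, and the target path matches the weight layout described below:

```python
# Download a model repo into the folder the scripts expect.
# Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="alibaba-pai/Wan2.1-Fun-V1.1-14B-InP",
    local_dir="models/Diffusion_Transformer/Wan2.1-Fun-V1.1-14B-InP",
)
```

Swap `repo_id` and `local_dir` for whichever model from the list above you want.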
## 2. Local Installation: Environment Check/Download/Installation
### a. Environment Check
We have verified that this library can be executed in the following environments:
Details for Windows:
- Operating System: Windows 10
- Python: python3.10 & python3.11
- PyTorch: torch2.2.0
- CUDA: 11.8 & 12.1
- CUDNN: 8+
- GPU: Nvidia 3060 12G & Nvidia 3090 24G
Details for Linux:
- Operating System: Ubuntu 20.04, CentOS
- Python: python3.10 & python3.11
- PyTorch: torch2.2.0
- CUDA: 11.8 & 12.1
- CUDNN: 8+
- GPU: Nvidia V100 16G & Nvidia A10 24G & Nvidia A100 40G & Nvidia A100 80G
You need approximately 60 GB of free disk space; please check before installing.
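A quick way to confirm your environment matches the verified versions above (a small sketch; the expected values in the comments come from the lists above):

```python
# Print the environment details relevant to the checks above.
import sys
import torch

print("Python:", sys.version.split()[0])            # expect 3.10.x or 3.11.x
print("PyTorch:", torch.__version__)                # expect 2.2.0
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)          # expect 11.8 or 12.1
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()
    print(f"VRAM: {total / 1024 ** 3:.1f} GB total, {free / 1024 ** 3:.1f} GB free")
```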
### b. Weight Placement
It is recommended to place the weights in the paths below:
Through ComfyUI: put the models into ComfyUI's weights folder `ComfyUI/models/Fun_Models/`:
```
📦 ComfyUI/
├── 📂 models/
│   └── 📂 Fun_Models/
│       ├── 📂 CogVideoX-Fun-V1.1-2b-InP/
│       ├── 📂 CogVideoX-Fun-V1.1-5b-InP/
│       ├── 📂 Wan2.1-Fun-V1.1-14B-InP/
│       └── 📂 Wan2.1-Fun-V1.1-1.3B-InP/
```
When running your own Python files or UI interface:
```
📦 models/
├── 📂 Diffusion_Transformer/
│   ├── 📂 CogVideoX-Fun-V1.1-2b-InP/
│   ├── 📂 CogVideoX-Fun-V1.1-5b-InP/
│   ├── 📂 Wan2.1-Fun-V1.1-14B-InP/
│   └── 📂 Wan2.1-Fun-V1.1-1.3B-InP/
└── 📂 Personalized_Model/
    └── your trained transformer model / your trained lora model (for UI load)
```
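To sanity-check that the weights ended up in this layout, a small illustrative helper (the directory names mirror the tree above; adjust the list for whichever models you actually downloaded):

```python
# Verify the expected weight folders exist before launching the UI or scripts.
from pathlib import Path

expected = [
    Path("models/Diffusion_Transformer/Wan2.1-Fun-V1.1-1.3B-InP"),
    Path("models/Diffusion_Transformer/Wan2.1-Fun-V1.1-14B-InP"),
    Path("models/Personalized_Model"),
]
for path in expected:
    print(("ok     " if path.is_dir() else "MISSING"), path)
```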
# ✨ Features
## Model Address
V1.1:
| Name | Storage Space | Hugging Face | ModelScope | Description |
|---|---|---|---|---|
| Wan2.1-Fun-V1.1-1.3B-InP | 19.0 GB | 🤗 Link | 😄 Link | Text-to-video weights of Wan2.1-Fun-V1.1-1.3B, trained at multiple resolutions, supporting prediction of the first and last frames. |
| Wan2.1-Fun-V1.1-14B-InP | 47.0 GB | 🤗 Link | 😄 Link | Text-to-video weights of Wan2.1-Fun-V1.1-14B, trained at multiple resolutions, supporting prediction of the first and last frames. |
| Wan2.1-Fun-V1.1-1.3B-Control | 19.0 GB | 🤗 Link | 😄 Link | Video control weights of Wan2.1-Fun-V1.1-1.3B, supporting control conditions such as Canny, Depth, Pose, and MLSD, control with reference image + control conditions, and trajectory control. Supports video prediction at multiple resolutions (512, 768, 1024), trained with 81 frames at 16 frames per second, with multi-language prediction. |
| Wan2.1-Fun-V1.1-14B-Control | 47.0 GB | 🤗 Link | 😄 Link | Video control weights of Wan2.1-Fun-V1.1-14B, supporting control conditions such as Canny, Depth, Pose, and MLSD, control with reference image + control conditions, and trajectory control. Supports video prediction at multiple resolutions (512, 768, 1024), trained with 81 frames at 16 frames per second, with multi-language prediction. |
| Wan2.1-Fun-V1.1-1.3B-Control-Camera | 19.0 GB | 🤗 Link | 😄 Link | Camera-lens control weights of Wan2.1-Fun-V1.1-1.3B. Supports video prediction at multiple resolutions (512, 768, 1024), trained with 81 frames at 16 frames per second, with multi-language prediction. |
| Wan2.1-Fun-V1.1-14B-Control-Camera | 47.0 GB | 🤗 Link | 😄 Link | Camera-lens control weights of Wan2.1-Fun-V1.1-14B. Supports video prediction at multiple resolutions (512, 768, 1024), trained with 81 frames at 16 frames per second, with multi-language prediction. |
V1.0:
| Name | Storage Space | Hugging Face | ModelScope | Description |
|---|---|---|---|---|
| Wan2.1-Fun-1.3B-InP | 19.0 GB | 🤗 Link | 😄 Link | Text-to-video weights of Wan2.1-Fun-1.3B, trained at multiple resolutions, supporting prediction of the first and last frames. |
| Wan2.1-Fun-14B-InP | 47.0 GB | 🤗 Link | 😄 Link | Text-to-video weights of Wan2.1-Fun-14B, trained at multiple resolutions, supporting prediction of the first and last frames. |
| Wan2.1-Fun-1.3B-Control | 19.0 GB | 🤗 Link | 😄 Link | Video control weights of Wan2.1-Fun-1.3B, supporting control conditions such as Canny, Depth, Pose, and MLSD, as well as trajectory control. Supports video prediction at multiple resolutions (512, 768, 1024), trained with 81 frames at 16 frames per second, with multi-language prediction. |
| Wan2.1-Fun-14B-Control | 47.0 GB | 🤗 Link | 😄 Link | Video control weights of Wan2.1-Fun-14B, supporting control conditions such as Canny, Depth, Pose, and MLSD, as well as trajectory control. Supports video prediction at multiple resolutions (512, 768, 1024), trained with 81 frames at 16 frames per second, with multi-language prediction. |
## Video Works
### Wan2.1-Fun-V1.1-14B-InP && Wan2.1-Fun-V1.1-1.3B-InP
### Wan2.1-Fun-V1.1-14B-Control && Wan2.1-Fun-V1.1-1.3B-Control
Generic Control Video + Reference Image:
*(Demo gallery: Reference Image, Control Video, Wan2.1-Fun-V1.1-14B-Control, Wan2.1-Fun-V1.1-1.3B-Control; videos not reproduced here.)*
Generic Control Video (Canny, Pose, Depth, etc.) and Trajectory Control:
### Wan2.1-Fun-V1.1-14B-Control-Camera && Wan2.1-Fun-V1.1-1.3B-Control-Camera
*(Demo gallery: Pan Up, Pan Left, Pan Right, Pan Down, Pan Up + Pan Left, Pan Up + Pan Right; videos not reproduced here.)*
# Documentation
## How to Use
### 1. Generation
#### a. Video Memory Saving Scheme
Since the parameters of Wan2.1 are very large, a GPU-memory-saving scheme is needed to reduce VRAM usage and adapt to consumer-grade graphics cards. We provide a `GPU_memory_mode` option in each prediction file, which can be set to `model_cpu_offload`, `model_cpu_offload_and_qfloat8`, or `sequential_cpu_offload`. This scheme also applies to CogVideoX-Fun generation.

- `model_cpu_offload`: the entire model is moved to the CPU after use, saving some GPU memory.
- `model_cpu_offload_and_qfloat8`: the entire model is moved to the CPU after use, and the transformer is additionally quantized to float8, saving more GPU memory.
- `sequential_cpu_offload`: each layer of the model is moved to the CPU after use. It is slower but saves a large amount of GPU memory.

`qfloat8` slightly reduces model quality but saves more GPU memory. If GPU memory is sufficient, `model_cpu_offload` is recommended.
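As a rough illustration of how these modes typically map onto offloading calls, here is a minimal sketch using diffusers-style pipeline methods; the actual prediction files in this repo may wire this up differently, and the float8 quantization helper is omitted:

```python
# Minimal sketch (assumes a diffusers-style pipeline object); the repo's own
# predict files may differ in names and details.
from diffusers import DiffusionPipeline  # stand-in for the repo's Fun pipelines


def apply_gpu_memory_mode(pipeline: DiffusionPipeline, GPU_memory_mode: str) -> None:
    if GPU_memory_mode == "sequential_cpu_offload":
        # Move each submodule to the GPU only while it executes:
        # slowest option, but the smallest VRAM footprint.
        pipeline.enable_sequential_cpu_offload()
    elif GPU_memory_mode == "model_cpu_offload_and_qfloat8":
        # In addition to whole-model offload, the transformer weights would be
        # quantized to float8 (e.g. torch.float8_e4m3fn) by a helper not shown here.
        pipeline.enable_model_cpu_offload()
    elif GPU_memory_mode == "model_cpu_offload":
        # Move whole models back to the CPU after each use.
        pipeline.enable_model_cpu_offload()
    else:
        raise ValueError(f"Unknown GPU_memory_mode: {GPU_memory_mode}")
```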
#### b. Through ComfyUI
For details, check the ComfyUI README.
#### c. Running Python Files
- Step 1: Download the corresponding [weights]
# License
This project is licensed under the Apache-2.0 License.