🚀 SkyReels V2: Infinite-Length Film Generative Model
SkyReels V2 is an infinite-length film generative model. It is the first open-source video generative model using an AutoRegressive Diffusion-Forcing architecture, achieving SOTA performance among publicly available models.
📑 Technical Report · 👋 Playground · 💬 Discord · 🤗 Hugging Face · 🤖 ModelScope · 🌐 GitHub
✨ Features
- First open-source video generative model with an AutoRegressive Diffusion-Forcing architecture.
- Achieves SOTA performance among publicly available models.
- Supports multiple tasks, including text-to-video, image-to-video, and long-video generation.
🔥🔥🔥 News!!
- Apr 24, 2025: 🔥 We release the 720P models, SkyReels-V2-DF-14B-720P and SkyReels-V2-I2V-14B-720P. The former facilitates infinite-length autoregressive video generation, and the latter focuses on Image2Video synthesis.
- Apr 21, 2025: 👋 We release the inference code and model weights of the SkyReels-V2 series models and the video captioning model SkyCaptioner-V1.
- Apr 3, 2025: 🔥 We also release SkyReels-A2, an open-source controllable video generation framework capable of assembling arbitrary visual elements.
- Feb 18, 2025: 🔥 We released SkyReels-A1, an open-source and effective framework for portrait image animation.
- Feb 18, 2025: 🔥 We released SkyReels-V1, the first and most advanced open-source human-centric video foundation model.
🎥 Demos
📑 TODO List
- [x] Technical Report
- [x] Checkpoints of the 14B and 1.3B Models Series
- [x] Single-GPU & Multi-GPU Inference Code
- [x] SkyCaptioner-V1: A Video Captioning Model
- [x] Prompt Enhancer
- [ ] Diffusers integration
- [ ] Checkpoints of the 5B Models Series
- [ ] Checkpoints of the Camera Director Models
- [ ] Checkpoints of the Step & Guidance Distill Model
🚀 Quick Start
📦 Installation
```shell
# Clone the repository.
git clone https://github.com/SkyworkAI/SkyReels-V2
cd SkyReels-V2

# Install dependencies. The test environment uses Python 3.10.12.
pip install -r requirements.txt
```
💾 Model Download
You can download our models from Hugging Face:
| Type | Model Variant | Recommended Height/Width/Frame | Link |
|---|---|---|---|
| Diffusion Forcing | 1.3B-540P | 544 * 960 * 97f | 🤗 Huggingface 🤖 ModelScope |
| Diffusion Forcing | 5B-540P | 544 * 960 * 97f | Coming Soon |
| Diffusion Forcing | 5B-720P | 720 * 1280 * 121f | Coming Soon |
| Diffusion Forcing | 14B-540P | 544 * 960 * 97f | 🤗 Huggingface 🤖 ModelScope |
| Diffusion Forcing | 14B-720P | 720 * 1280 * 121f | 🤗 Huggingface 🤖 ModelScope |
| Text-to-Video | 1.3B-540P | 544 * 960 * 97f | Coming Soon |
| Text-to-Video | 5B-540P | 544 * 960 * 97f | Coming Soon |
| Text-to-Video | 5B-720P | 720 * 1280 * 121f | Coming Soon |
| Text-to-Video | 14B-540P | 544 * 960 * 97f | 🤗 Huggingface 🤖 ModelScope |
| Text-to-Video | 14B-720P | 720 * 1280 * 121f | 🤗 Huggingface 🤖 ModelScope |
| Image-to-Video | 1.3B-540P | 544 * 960 * 97f | 🤗 Huggingface 🤖 ModelScope |
| Image-to-Video | 5B-540P | 544 * 960 * 97f | Coming Soon |
| Image-to-Video | 5B-720P | 720 * 1280 * 121f | Coming Soon |
| Image-to-Video | 14B-540P | 544 * 960 * 97f | 🤗 Huggingface 🤖 ModelScope |
| Image-to-Video | 14B-720P | 720 * 1280 * 121f | 🤗 Huggingface 🤖 ModelScope |
| Camera Director | 5B-540P | 544 * 960 * 97f | Coming Soon |
| Camera Director | 5B-720P | 720 * 1280 * 121f | Coming Soon |
| Camera Director | 14B-720P | 720 * 1280 * 121f | Coming Soon |
After downloading, set the model path in your generation commands.
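The checkpoints can also be fetched programmatically. Below is a minimal sketch of a hypothetical helper (not part of this repo) that builds a Hugging Face repo id; it assumes the `Skywork/SkyReels-V2-<TYPE>-<SIZE>-<RES>` naming pattern, which matches the ids used in the generation commands later in this README (e.g. `Skywork/SkyReels-V2-DF-14B-540P`).

```python
# Hypothetical helper for constructing repo ids of the released checkpoints.
# Assumption: all variants follow the "Skywork/SkyReels-V2-<TYPE>-<SIZE>-<RES>"
# pattern seen in the example commands (DF-14B-540P, T2V-14B-540P, ...).

def skyreels_repo_id(model_type: str, size: str, resolution: str) -> str:
    # model_type: "DF" (Diffusion Forcing), "T2V", or "I2V".
    return f"Skywork/SkyReels-V2-{model_type}-{size}-{resolution}"

print(skyreels_repo_id("DF", "14B", "540P"))  # Skywork/SkyReels-V2-DF-14B-540P

# The weights (tens of GB) can then be downloaded, e.g. with huggingface_hub:
#   from huggingface_hub import snapshot_download
#   local_dir = snapshot_download(skyreels_repo_id("DF", "14B", "540P"))
```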
💻 Usage Examples
Single GPU Inference
Diffusion Forcing for Long Video Generation
The Diffusion Forcing model allows us to generate infinite-length videos. It supports both text-to-video (T2V) and image-to-video (I2V) tasks, and it can perform inference in both synchronous and asynchronous modes. Here we demonstrate two example scripts for long video generation. If you want to adjust the inference parameters, e.g., video duration or inference mode, read the Note below first.
Synchronous generation for 10s video
```shell
model_id=Skywork/SkyReels-V2-DF-14B-540P
# synchronous inference
python3 generate_video_df.py \
  --model_id ${model_id} \
  --resolution 540P \
  --ar_step 0 \
  --base_num_frames 97 \
  --num_frames 257 \
  --overlap_history 17 \
  --prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
  --addnoise_condition 20 \
  --offload \
  --teacache \
  --use_ret_steps \
  --teacache_thresh 0.3
```
Asynchronous generation for 30s video
```shell
model_id=Skywork/SkyReels-V2-DF-14B-540P
# asynchronous inference
python3 generate_video_df.py \
  --model_id ${model_id} \
  --resolution 540P \
  --ar_step 5 \
  --causal_block_size 5 \
  --base_num_frames 97 \
  --num_frames 737 \
  --overlap_history 17 \
  --prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
  --addnoise_condition 20 \
  --offload
```
⚠️ Important Note
- If you want to run the image-to-video (I2V) task, add `--image ${image_path}` to your command. It is also better to use a text-to-video (T2V)-style prompt that includes some description of the first-frame image.
- For long video generation, you can simply adjust `--num_frames`, e.g., `--num_frames 257` for a 10s video, `--num_frames 377` for 15s, `--num_frames 737` for 30s, or `--num_frames 1457` for 60s. These numbers are not strictly aligned with the logical frame count for the specified duration, but they are aligned with some training parameters, which means they may perform better. When you use asynchronous inference with `causal_block_size > 1`, `--num_frames` should be set carefully.
- You can use `--ar_step 5` to enable asynchronous inference. For asynchronous inference, `--causal_block_size 5` is recommended, while it should not be set for synchronous generation. REMEMBER that the frame latent number fed into the model at every iteration, e.g., the base frame latent number ((97 - 1)//4 + 1 = 25 for base_num_frames = 97) and the last-iteration latent number ((237 - 97 - (97 - 17)x1 + 17 - 1)//4 + 1 = 20 for base_num_frames = 97, num_frames = 237, overlap_history = 17), MUST be divisible by causal_block_size. If you find it too hard to calculate proper values, just use our recommended settings above :). Asynchronous inference takes more steps to diffuse the whole sequence, which means it is SLOWER than synchronous mode. In our experiments, asynchronous inference may improve instruction following and visual consistency.
- To reduce peak VRAM, lower `--base_num_frames`, e.g., to 77 or 57, while keeping the same total `--num_frames` you want to generate. This may slightly reduce video quality, and it should not be set too small.
- `--addnoise_condition` is used to help smooth long video generation by adding some noise to the clean condition. Too much noise can also cause inconsistency; 20 is the recommended value, and while you may try larger values, it is recommended not to exceed 50.
- Generating a 540P video using the 1.3B model requires approximately 14.7GB of peak VRAM, while the same resolution with the 14B model demands around 51.2GB of peak VRAM.
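The divisibility constraint in the note above can be checked before launching a long run. This is a minimal sketch (these helpers are not part of the repo) that reproduces the latent-frame arithmetic from the note, assuming the stated formulas generalize with a variable number of intermediate iterations:

```python
# Reproduce the latent-frame arithmetic from the note above, so a
# num_frames / causal_block_size combination can be sanity-checked.

def base_latent_frames(base_num_frames: int) -> int:
    # Frames are compressed 4x temporally into latents: (n - 1) // 4 + 1.
    return (base_num_frames - 1) // 4 + 1

def last_iter_latent_frames(num_frames: int, base_num_frames: int,
                            overlap_history: int, iterations: int = 1) -> int:
    # Frames left for the final iteration: total minus the base window and
    # `iterations` windows of (base - overlap) new frames, plus the overlap
    # carried over. Matches the note's formula for iterations = 1.
    remaining = (num_frames - base_num_frames
                 - (base_num_frames - overlap_history) * iterations
                 + overlap_history)
    return (remaining - 1) // 4 + 1

# Values from the note: base_num_frames=97 -> 25 latent frames, and the last
# iteration for num_frames=237, overlap_history=17 -> 20 latent frames.
print(base_latent_frames(97))                # 25
print(last_iter_latent_frames(237, 97, 17))  # 20

# Both must be divisible by causal_block_size (e.g., 5) for async mode:
block = 5
print(base_latent_frames(97) % block == 0,
      last_iter_latent_frames(237, 97, 17) % block == 0)  # True True
```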
Text To Video & Image To Video
```shell
# run Text-to-Video Generation
model_id=Skywork/SkyReels-V2-T2V-14B-540P
python3 generate_video.py \
  --model_id ${model_id} \
  --resolution 540P \
  --num_frames 97 \
  --guidance_scale 6.0 \
  --shift 8.0 \
  --fps 24 \
  --prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface." \
  --offload \
  --teacache \
  --use_ret_steps \
  --teacache_thresh 0.3
```
⚠️ Important Note
- When using an image-to-video (I2V) model, you must provide an input image via the `--image ${image_path}` parameter. `--guidance_scale 5.0` and `--shift 3.0` are recommended for the I2V model.
- Generating a 540P video using the 1.3B model requires approximately 14.7GB of peak VRAM, while the same resolution with the 14B model demands around 43.4GB of peak VRAM.
Prompt Enhancer
The prompt enhancer is implemented based on Qwen2.5-32B-Instruct and is enabled via the `--prompt_enhancer` parameter. It works well for short prompts; for long prompts, it might generate an excessively lengthy result that can lead to over-saturation in the generated video. Note that peak GPU memory is 64GB+ when using `--prompt_enhancer`. If you want to obtain the enhanced prompt separately, you can also run the prompt_enhancer script on its own for testing. The steps are as follows:

```shell
cd skyreels_v2_infer/p
```
📄 License
This project is licensed under the skywork-license.