🚀 SkyReels V2: Infinite-Length Film Generative Model
SkyReels V2 is an infinite-length film generative model. It is the first open-source video generative model using the AutoRegressive Diffusion-Forcing architecture, achieving SOTA performance among publicly available models.
📑 Technical Report · 👋 Playground · 💬 Discord · 🤗 Hugging Face · 🤖 ModelScope · 🌐 GitHub
Welcome to the SkyReels V2 repository! Here, you'll find the model weights for our infinite-length film generative models. To the best of our knowledge, this is the first open-source video generative model to employ an AutoRegressive Diffusion-Forcing architecture while achieving SOTA performance among publicly available models.
🔥🔥🔥 News!!
- Apr 24, 2025: 🔥 We release the 720P models, SkyReels-V2-DF-14B-720P and SkyReels-V2-I2V-14B-720P. The former facilitates infinite-length autoregressive video generation, and the latter focuses on Image2Video synthesis.
- Apr 21, 2025: 👋 We release the inference code and model weights of the SkyReels-V2 series models and the video captioning model SkyCaptioner-V1.
- Apr 3, 2025: 🔥 We also release SkyReels-A2, an open-source controllable video generation framework capable of assembling arbitrary visual elements.
- Feb 18, 2025: 🔥 We released SkyReels-A1, an open-source and effective framework for portrait image animation.
- Feb 18, 2025: 🔥 We released SkyReels-V1, the first and most advanced open-source human-centric video foundation model.
🎥 Demos
📑 TODO List
- [x] Technical Report
- [x] Checkpoints of the 14B and 1.3B Models Series
- [x] Single-GPU & Multi-GPU Inference Code
- [x] SkyCaptioner-V1: A Video Captioning Model
- [x] Prompt Enhancer
- [ ] Diffusers integration
- [ ] Checkpoints of the 5B Models Series
- [ ] Checkpoints of the Camera Director Models
- [ ] Checkpoints of the Step & Guidance Distill Model
🚀 Quick Start
📦 Installation
# clone the repository.
git clone https://github.com/SkyworkAI/SkyReels-V2
cd SkyReels-V2
# Install dependencies. Test environment uses Python 3.10.12.
pip install -r requirements.txt
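Optionally, you can install the dependencies into an isolated virtual environment first. This is a minimal sketch, assuming a local Python 3.10 interpreter; the environment name is arbitrary:
# optional: create and activate a virtual environment, then install dependencies
python3 -m venv skyreels-env
source skyreels-env/bin/activate
pip install -r requirements.txt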
Model Download
You can download our models from Hugging Face:
Property | Details |
---|---|
Model Type | You can download different types of models, including Diffusion Forcing, Text-to-Video, Image-to-Video, and Camera Director models. |
Download Link | 🤗 Huggingface 🤖 ModelScope |
Type | Model Variant | Recommended Height/Width/Frame | Link |
---|---|---|---|
Diffusion Forcing | 1.3B-540P | 544 * 960 * 97f | 🤗 Huggingface 🤖 ModelScope
Diffusion Forcing | 5B-540P | 544 * 960 * 97f | Coming Soon
Diffusion Forcing | 5B-720P | 720 * 1280 * 121f | Coming Soon
Diffusion Forcing | 14B-540P | 544 * 960 * 97f | 🤗 Huggingface 🤖 ModelScope
Diffusion Forcing | 14B-720P | 720 * 1280 * 121f | 🤗 Huggingface 🤖 ModelScope
Text-to-Video | 1.3B-540P | 544 * 960 * 97f | Coming Soon
Text-to-Video | 5B-540P | 544 * 960 * 97f | Coming Soon
Text-to-Video | 5B-720P | 720 * 1280 * 121f | Coming Soon
Text-to-Video | 14B-540P | 544 * 960 * 97f | 🤗 Huggingface 🤖 ModelScope
Text-to-Video | 14B-720P | 720 * 1280 * 121f | 🤗 Huggingface 🤖 ModelScope
Image-to-Video | 1.3B-540P | 544 * 960 * 97f | 🤗 Huggingface 🤖 ModelScope
Image-to-Video | 5B-540P | 544 * 960 * 97f | Coming Soon
Image-to-Video | 5B-720P | 720 * 1280 * 121f | Coming Soon
Image-to-Video | 14B-540P | 544 * 960 * 97f | 🤗 Huggingface 🤖 ModelScope
Image-to-Video | 14B-720P | 720 * 1280 * 121f | 🤗 Huggingface 🤖 ModelScope
Camera Director | 5B-540P | 544 * 960 * 97f | Coming Soon
Camera Director | 5B-720P | 720 * 1280 * 121f | Coming Soon
Camera Director | 14B-720P | 720 * 1280 * 121f | Coming Soon
After downloading, set the model path in your generation commands.
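For example, one way to fetch a checkpoint is with the huggingface-cli tool that ships with huggingface_hub; this sketch uses the 540P Diffusion Forcing model from the table above, and the local directory name is just an illustrative choice:
# download a checkpoint from Hugging Face (local directory name is arbitrary)
pip install -U "huggingface_hub[cli]"
huggingface-cli download Skywork/SkyReels-V2-DF-14B-540P --local-dir ./SkyReels-V2-DF-14B-540P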
💻 Usage Examples
Single GPU Inference
Basic Usage - Diffusion Forcing for Long Video Generation
The Diffusion Forcing version of the model allows us to generate infinite-length videos. It supports both text-to-video (T2V) and image-to-video (I2V) tasks, and it can perform inference in both synchronous and asynchronous modes. Here we demonstrate two example scripts for long video generation. If you want to adjust the inference parameters, e.g., the video duration or the inference mode, read the Note below first.
# synchronous generation for 10s video
model_id=Skywork/SkyReels-V2-DF-14B-540P
# synchronous inference
python3 generate_video_df.py \
--model_id ${model_id} \
--resolution 540P \
--ar_step 0 \
--base_num_frames 97 \
--num_frames 257 \
--overlap_history 17 \
--prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
--addnoise_condition 20 \
--offload \
--teacache \
--use_ret_steps \
--teacache_thresh 0.3
# asynchronous generation for 30s video
model_id=Skywork/SkyReels-V2-DF-14B-540P
# asynchronous inference
python3 generate_video_df.py \
--model_id ${model_id} \
--resolution 540P \
--ar_step 5 \
--causal_block_size 5 \
--base_num_frames 97 \
--num_frames 737 \
--overlap_history 17 \
--prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
--addnoise_condition 20 \
--offload
⚠️ Important Note
- If you want to run the image-to-video (I2V) task, add --image ${image_path} to your command. It is also better to use a text-to-video (T2V)-style prompt that includes some description of the first-frame image.
- For long video generation, you can simply switch --num_frames, e.g., --num_frames 257 for a 10s video, --num_frames 377 for 15s, --num_frames 737 for 30s, --num_frames 1457 for 60s. The number is not strictly aligned with the logical frame count for the specified duration, but it is aligned with some training parameters, which means it may perform better. When you use asynchronous inference with causal_block_size > 1, --num_frames should be set carefully.
- You can use --ar_step 5 to enable asynchronous inference. For asynchronous inference, --causal_block_size 5 is recommended, while it should not be set for synchronous generation. REMEMBER that the number of frame latents fed into the model in every iteration, e.g., the base frame latent count ((97 - 1)//4 + 1 = 25 for base_num_frames = 97) and the last-iteration count ((237 - 97 - (97 - 17) × 1 + 17 - 1)//4 + 1 = 20 for base_num_frames = 97, num_frames = 237, overlap_history = 17), MUST be divisible by causal_block_size. If you find it too hard to calculate and set proper values, just use our recommended settings above :). A sketch of this arithmetic is shown after this note. Asynchronous inference takes more steps to diffuse the whole sequence, which means it is SLOWER than synchronous mode. In our experiments, asynchronous inference may improve instruction following and visual consistency.
- To reduce peak VRAM, lower --base_num_frames, e.g., to 77 or 57, while keeping the same total --num_frames you want to generate. This may slightly reduce video quality, and it should not be set too small.
- --addnoise_condition is used to help smooth long video generation by adding some noise to the clean condition. Too much noise can also cause inconsistency; 20 is a recommended value, and you may try larger ones, but it is recommended not to exceed 50.
- Generating a 540P video using the 1.3B model requires approximately 14.7GB peak VRAM, while the same resolution video using the 14B model demands around 51.2GB peak VRAM.
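For reference, here is a minimal sketch of the latent-count check described in the note above; the variable names are illustrative, and the formulas simply restate the example arithmetic (base_num_frames = 97, num_frames = 237, overlap_history = 17, one intermediate iteration):
# sketch: verify that per-iteration latent counts are divisible by causal_block_size
base_num_frames=97
num_frames=237
overlap_history=17
causal_block_size=5
# base frame latent count: (97 - 1)//4 + 1 = 25
base_latents=$(( (base_num_frames - 1) / 4 + 1 ))
# last-iteration latent count: (237 - 97 - (97 - 17)*1 + 17 - 1)//4 + 1 = 20
last_latents=$(( (num_frames - base_num_frames - (base_num_frames - overlap_history) * 1 + overlap_history - 1) / 4 + 1 ))
echo "base latents: ${base_latents}, last-iteration latents: ${last_latents}"
if (( base_latents % causal_block_size == 0 && last_latents % causal_block_size == 0 )); then
  echo "OK: both counts are divisible by causal_block_size"
else
  echo "Adjust num_frames / base_num_frames / overlap_history"
fi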
# run Text-to-Video Generation
model_id=Skywork/SkyReels-V2-T2V-14B-540P
python3 generate_video.py \
--model_id ${model_id} \
--resolution 540P \
--num_frames 97 \
--guidance_scale 6.0 \
--shift 8.0 \
--fps 24 \
--prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface." \
--offload \
--teacache \
--use_ret_steps \
--teacache_thresh 0.3
⚠️ Important Note
- When using an image-to-video (I2V) model, you must provide an input image using the --image ${image_path} parameter. --guidance_scale 5.0 and --shift 3.0 are recommended for the I2V model; a sketch of an I2V command follows this note.
- Generating a 540P video using the 1.3B model requires approximately 14.7GB peak VRAM, while the same resolution video using the 14B model demands around 43.4GB peak VRAM.
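For clarity, here is a sketch of an I2V invocation that follows the recommendations above. The checkpoint name follows the naming pattern used elsewhere in this README, and the image path and prompt are placeholders to replace with your own:
# image-to-video example (placeholder image path and prompt)
model_id=Skywork/SkyReels-V2-I2V-14B-540P
python3 generate_video.py \
--model_id ${model_id} \
--resolution 540P \
--num_frames 97 \
--image ${image_path} \
--guidance_scale 5.0 \
--shift 3.0 \
--fps 24 \
--prompt "A graceful white swan glides across a misty lake at dawn, matching the scene in the first-frame image." \
--offload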
Advanced Usage - Prompt Enhancer
The prompt enhancer is implemented based on Qwen2.5-32B-Instruct and is utilized via the --prompt_enhancer parameter. It works ideally for short prompts; for long prompts, it might generate an excessively lengthy result that can lead to over-saturation in the generated video. Note that peak GPU memory is 64GB+ if you use --prompt_enhancer. If you want to obtain the enhanced prompt separately, you can also run the prompt_enhancer script on its own for testing. The steps are as follows:
cd skyreels_v2_infer/pipelines
python3 prompt_enhancer.py --prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface."
⚠️ Important Note
- --prompt_enhancer is not allowed when using --use_usp. We recommend running the skyreels_v2_infer/pipelines/prompt_enhancer.py script first to generate an enhanced prompt before enabling the --use_usp parameter.
Advanced Configuration Options
Below are the key parameters you can customize for video generation:
Parameter | Recommended Value | Description |
---|---|---|
--prompt | | Text description for generating your video |
--image | | Path to the input image for image-to-video generation |
--resolution | 540P or 720P | Output video resolution (select based on model type) |
--num_frames | 97 or 121 | Total frames to generate (97 for 540P models, 121 for 720P models) |
--inference_steps | 50 | Number of denoising steps |
--fps | 24 | Frames per second in the output video |
--shift | 8.0 or 5.0 | Flow matching scheduler parameter (8.0 for T2V, 5.0 for I2V) |
--guidance_scale | 6.0 or 5.0 | Controls text adherence strength (6.0 for T2V, 5.0 for I2V) |
--seed | | Fixed seed for reproducible results (omit for random generation) |
--offload | True | Offloads model components to the CPU to reduce VRAM usage (recommended) |
--use_usp | True | Enables multi-GPU acceleration with xDiT USP |
--outdir | ./video_out | Directory where generated videos will be saved |
--prompt_enhancer | True | Expands the prompt into a more detailed description |
--teacache | False | Enables teacache for faster inference |
--teacache_thresh | 0.2 | Higher values give more speedup at the cost of quality |
--use_ret_steps | False | Retention steps for teacache |
Diffusion Forcing Additional Parameters
Parameter | Recommended Value | Description |
---|---|---|
--ar_step | 0 | Controls asynchronous inference (0 for synchronous mode) |
--base_num_frames | 97 or 121 | Base frame count (97 for 540P, 121 for 720P) |
--overlap_history | 17 | Number of frames to overlap for smooth transitions in long videos |
--addnoise_condition | 20 | Improves consistency in long video generation |
--causal_block_size | 5 | Recommended when using asynchronous inference (--ar_step > 0) |
Multi-GPU inference using xDiT USP
We use xDiT USP to accelerate inference. For example, to generate a video with 2 GPUs, you can use the following command:
# diffusion forcing synchronous inference
model_id=Skywork/SkyReels-V2-DF-14B-540P
torchrun --nproc_per_node=2 generate_video_df.py \
--model_id ${model_id} \
--resolution 540P \
--ar_step 0 \
--base_num_frames 97 \
--num_frames 257 \
--overlap_history 17 \
--prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
--addnoise_condition 20 \
--use_usp \
--offload \
--seed 42
# run Text-to-Video Generation
model_id=Skywork/SkyReels-V2-T2V-14B-540P
torchrun --nproc_per_node=2 generate_video.py \
--model_id ${model_id} \
--resolution 540P \
--num_frames 97 \
--guidance_scale 6.0 \
--shift 8.0 \
--fps 24 \
--offload \
--prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface." \
--use_usp \
--seed 42
⚠️ Important Note
- When using an image-to-video (I2V) model, you must provide an input image using the --image ${image_path} parameter. --guidance_scale 5.0 and --shift 3.0 are recommended for the I2V model.
📚 Documentation
Abstract
Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation.
To address these limitations, we introduce SkyReels-V2, the world's first infinite-length film generative model using a Diffusion Forcing framework. Our approach synergizes Multi-modal Large Language Models (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing techniques to achieve comprehensive optimization. Beyond its technical innovations, SkyReels-V2 enables multiple practical applications, including Story Generation, Image-to-Video Synthesis, Camera Director functionality, and multi-subject consistent video generation through our SkyReels-A2 system.
Methodology of SkyReels-V2
The SkyReels-V2 methodology consists of several interconnected components. It starts with a comprehensive data processing pipeline that prepares training data of varying quality. At its core is the Video Captioner architecture, which provides detailed annotations for video content. The system employs a multi-task pretraining strategy to build fundamental video generation capabilities. Post-training optimization includes Reinforcement Learning to enhance motion quality, Diffusion Forcing Training for generating extended videos, and High-quality Supervised Fine-Tuning (SFT) stages for visual refinement. The model runs on optimized computational infrastructure for efficient training and inference. SkyReels-V2 supports multiple applications, including Story Generation, Image-to-Video Synthesis, Camera Director functionality, and Elements-to-Video Generation.
Key Contributions of SkyReels-V2
Video Captioner
SkyCaptioner-V1 serves as our video captioner...
Performance
Acknowledgements
Citation
📄 License
The project uses the skywork-license.

