Wan2.1-VACE-14B open-source video model: supports a range of video generation and editing tasks

Wan2.1 VACE 14B

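Developed by Wan-AI

Wan2.1 is a comprehensive, open suite of video foundation models that aims to push the boundaries of video generation and supports a range of video generation and editing tasks.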

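Task: text-to-video. Languages: multilingual. License: Apache-2.0. Tags: #multi-task video generation, #consumer-grade GPU support, #Chinese and English text generation

Downloads: 8,797

Released: May 13, 2025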

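Model overview

Wan2.1 is a suite of advanced video generation models that supports multiple tasks: text-to-video, image-to-video, video editing, text-to-image, and video-to-audio.

Model features

SOTA performance

Consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks.

Consumer-grade GPU support

The T2V-1.3B model needs only 8.19 GB of VRAM and is compatible with almost all consumer-grade GPUs.

Multi-task support

Performs strongly on text-to-video, image-to-video, video editing, text-to-image, and video-to-audio tasks.

Visual text generation

The first video model able to generate both Chinese and English text, with robust text generation capability.

Efficient video VAE

Wan-VAE preserves temporal information while encoding and decoding 1080P videos of any length.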

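Model capabilities

Text-to-video generation

Image-to-video generation

Video editing

Text-to-image generation

Video-to-audio generation

Chinese and English text generation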

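Use cases

Video creation

Short video generation

Generate short video content from a text description.

Generating a 5-second 480P video takes about 4 minutes on an RTX 4090.

Video editing

Video style transfer

Modify the style of a video based on a reference image or text.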

🚀 Wan2.1

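Wan2.1 is a comprehensive, open suite of video foundation models that pushes the boundaries of video generation. It offers SOTA performance, runs on consumer-grade GPUs, handles multiple tasks, can generate visual text, and includes a powerful video VAE.

🚀 Quick start

Installation

Clone the repository: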

git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1

Install the dependencies:

# Make sure torch >= 2.4.0
pip install -r requirements.txt

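Model download

Model	Download links	Notes
T2V-14B	🤗 Huggingface 🤖 ModelScope	Supports 480P and 720P
I2V-14B-720P	🤗 Huggingface 🤖 ModelScope	Supports 720P
I2V-14B-480P	🤗 Huggingface 🤖 ModelScope	Supports 480P
T2V-1.3B	🤗 Huggingface 🤖 ModelScope	Supports 480P
FLF2V-14B	🤗 Huggingface 🤖 ModelScope	Supports 720P
VACE-1.3B	🤗 Huggingface 🤖 ModelScope	Supports 480P
VACE-14B	🤗 Huggingface 🤖 ModelScope	Supports 480P and 720P

⚠️ Important notes

The 1.3B models can generate video at 720P resolution. However, because they received limited training at that resolution, results are generally less stable than at 480P. For best performance, use 480P.

For first-last-frame-to-video generation, the model was trained mainly on Chinese text-video pairs, so Chinese prompts are recommended for better results.

Download the models with huggingface-cli: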

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B

Download the models with modelscope-cli:

pip install modelscope
modelscope download Wan-AI/Wan2.1-T2V-14B --local_dir ./Wan2.1-T2V-14B
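The two download commands differ in one easy-to-miss detail: huggingface-cli spells the target option `--local-dir`, while modelscope spells it `--local_dir`. The following is a minimal sketch that only assembles the argument lists shown above. The helper name `build_download_command` is hypothetical and not part of the repository.

```python
from typing import List


def build_download_command(tool: str, repo_id: str, local_dir: str) -> List[str]:
    """Return the argv list for downloading `repo_id` into `local_dir`."""
    if tool == "huggingface-cli":
        # huggingface-cli spells the option with a hyphen.
        return ["huggingface-cli", "download", repo_id, "--local-dir", local_dir]
    if tool == "modelscope":
        # modelscope spells the same option with an underscore.
        return ["modelscope", "download", repo_id, "--local_dir", local_dir]
    raise ValueError(f"unknown download tool: {tool!r}")


cmd = build_download_command(
    "huggingface-cli", "Wan-AI/Wan2.1-T2V-14B", "./Wan2.1-T2V-14B"
)
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the corresponding CLI is installed.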

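Running text-to-video generation

This repository supports two text-to-video models (1.3B and 14B) and two resolutions (480P and 720P). Their parameters and configurations are as follows:

Task	480P	720P	Model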
t2v-14B	✔️	✔️	Wan2.1-T2V-14B
t2v-1.3B	✔️	❌	Wan2.1-T2V-1.3B
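As a quick guard before launching a long run, the table can be encoded as a lookup. This is a minimal sketch, not part of the repository: the name `is_supported` is hypothetical, and the size strings are the ones used in this document's example commands (`832*480` and `1280*720`).

```python
# Supported resolutions per text-to-video task, copied from the table above.
T2V_SUPPORT = {
    "t2v-14B": {"480P": True, "720P": True},
    "t2v-1.3B": {"480P": True, "720P": False},
}

# Size strings as they appear in this document's example commands.
SIZE_TO_RESOLUTION = {"832*480": "480P", "1280*720": "720P"}


def is_supported(task: str, size: str) -> bool:
    """Return True if the table lists `size` as supported for `task`."""
    resolution = SIZE_TO_RESOLUTION.get(size)
    if resolution is None or task not in T2V_SUPPORT:
        return False
    return T2V_SUPPORT[task][resolution]


print(is_supported("t2v-14B", "1280*720"))   # True
print(is_supported("t2v-1.3B", "1280*720"))  # False
```

The sketch follows the table strictly. As noted earlier, the 1.3B model can produce 720P output, but results are less stable than at 480P.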

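(1) Without prompt extension

To keep things simple, we start with a basic inference run that skips the prompt extension step.

Single-GPU inference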

python generate.py  --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."

If you hit an OOM (out-of-memory) error, use the --offload_model True and --t5_cpu options to reduce GPU memory usage. For example, on an RTX 4090 GPU:

python generate.py  --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --offload_model True --t5_cpu --sample_shift 8 --sample_guide_scale 6 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."

💡 Usage tip

With the T2V-1.3B model, we recommend setting --sample_guide_scale to 6. The --sample_shift parameter can be tuned within the range 8 to 12 depending on the results.
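The recommendations above can be folded into a small argument builder. This is a sketch under stated assumptions: the function name `t2v_1_3b_args` and its `low_memory` switch are hypothetical, while every flag it emits appears in the commands in this section.

```python
from typing import List


def t2v_1_3b_args(prompt: str, sample_shift: int = 8, low_memory: bool = True) -> List[str]:
    """Build a generate.py argv list for T2V-1.3B with the recommended settings."""
    # The tip above suggests tuning sample_shift within 8 to 12.
    if not 8 <= sample_shift <= 12:
        raise ValueError("sample_shift is expected to be in the range 8 to 12")
    args = [
        "python", "generate.py",
        "--task", "t2v-1.3B",
        "--size", "832*480",
        "--ckpt_dir", "./Wan2.1-T2V-1.3B",
        "--sample_shift", str(sample_shift),
        "--sample_guide_scale", "6",  # recommended value for the 1.3B model
        "--prompt", prompt,
    ]
    if low_memory:
        # Options this document suggests for avoiding OOM errors.
        args += ["--offload_model", "True", "--t5_cpu"]
    return args


print(" ".join(t2v_1_3b_args("Two cats boxing on a stage", sample_shift=10)))
```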

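Multi-GPU inference with FSDP + xDiT USP

We use FSDP and xDiT USP to accelerate inference.
- Ulysses strategy: to use the Ulysses strategy, set --ulysses_size $GPU_NUMS. Note that num_heads must be divisible by ulysses_size. For the 1.3B model, num_heads is 12, which is not divisible by 8 (most multi-GPU machines have 8 GPUs), so the Ring strategy is recommended in that case.
- Ring strategy: to use the Ring strategy, set --ring_size $GPU_NUMS. Note that the sequence length must be divisible by ring_size.

The Ulysses and Ring strategies can also be combined.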

pip install "xfuser>=0.4.1"
torchrun --nproc_per_node=8 generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
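The divisibility rule for the Ulysses strategy can be checked before launching `torchrun`. The sketch below is illustrative only: `recommend_strategy` is a hypothetical helper, and the head counts (12 for 1.3B, 40 for 14B) are the ones given in this document. It does not check the Ring requirement that the sequence length be divisible by ring_size.

```python
from typing import Dict


def recommend_strategy(num_heads: int, num_gpus: int) -> Dict[str, int]:
    """Pick Ulysses when num_heads is divisible by the GPU count, else Ring."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_heads % num_gpus == 0:
        return {"--ulysses_size": num_gpus}
    return {"--ring_size": num_gpus}


print(recommend_strategy(40, 8))  # 14B model: {'--ulysses_size': 8}
print(recommend_strategy(12, 8))  # 1.3B model: {'--ring_size': 8}
```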

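(2) With prompt extension

Extending the prompt enriches the detail in the generated video and further improves video quality, so we recommend enabling it. Two prompt extension methods are provided:

Extension via the Dashscope API
- Apply for a dashscope.api_key in advance (EN | CN).
- Set the environment variable DASH_API_KEY to your Dashscope API key. Users of Alibaba Cloud's international site must also set the environment variable DASH_API_URL to 'https://dashscope-intl.aliyuncs.com/api/v1'. See the dashscope documentation for more detailed instructions.
- The qwen-plus model is used for text-to-video tasks, and qwen-vl-max for image-to-video tasks.
- The model used for extension can be changed with the --prompt_extend_model parameter. For example: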

DASH_API_KEY=your_key python generate.py  --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage" --use_prompt_extend --prompt_extend_method 'dashscope' --prompt_extend_target_lang 'zh'
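When launching generate.py from a script rather than a shell, the same environment variables can be prepared in Python. This is a minimal sketch: `dashscope_env` is a hypothetical helper, and only the variable names and the international URL come from the text above.

```python
import os
from typing import Dict, Optional

# URL given above for users of Alibaba Cloud's international site.
INTL_URL = "https://dashscope-intl.aliyuncs.com/api/v1"


def dashscope_env(api_key: str, international: bool = False,
                  base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the Dashscope variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["DASH_API_KEY"] = api_key
    if international:
        env["DASH_API_URL"] = INTL_URL
    return env


env = dashscope_env("your_key", international=True, base_env={})
print(sorted(env))  # ['DASH_API_KEY', 'DASH_API_URL']
```

The result can be passed as the `env` argument of `subprocess.run` when invoking generate.py with `--use_prompt_extend --prompt_extend_method 'dashscope'`.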

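Extension with a local model
- By default, a Qwen model from HuggingFace is used for extension. Choose a Qwen model or another model according to the GPU memory available.
- For text-to-video tasks, you can use models such as Qwen/Qwen2.5-14B-Instruct, Qwen/Qwen2.5-7B-Instruct, and Qwen/Qwen2.5-3B-Instruct.
- For image-to-video or first-last-frame-to-video tasks, you can use models such as Qwen/Qwen2.5-VL-7B-Instruct and Qwen/Qwen2.5-VL-3B-Instruct.
- Larger models generally give better extension results but need more GPU memory.
- The model used for extension can be changed with the --prompt_extend_model parameter, which accepts either a local model path or a Hugging Face model name. For example: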

python generate.py  --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage" --use_prompt_extend --prompt_extend_method 'local_qwen' --prompt_extend_target_lang 'zh'

(3) Running with Diffusers

Wan2.1-T2V inference can be run with Diffusers using the following code:

import torch
from diffusers.utils import export_to_video
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

# Available models: Wan-AI/Wan2.1-T2V-14B-Diffusers, Wan-AI/Wan2.1-T2V-1.3B-Diffusers
model_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
flow_shift = 5.0  # 5.0 for 720P, 3.0 for 480P
scheduler = UniPCMultistepScheduler(prediction_type='flow_prediction', use_flow_sigmas=True, num_train_timesteps=1000, flow_shift=flow_shift)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.scheduler = scheduler
pipe.to("cuda")

prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=720,
    width=1280,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)

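💡 Usage tip

Note that this example does not include prompt extension or distributed inference. A Diffusers version with prompt extension and multi-GPU support will be added as soon as possible.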
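Two numbers in the Diffusers example are worth making explicit: the flow shift depends on the output resolution (5.0 for 720P, 3.0 for 480P, per the code comment), and 81 frames exported at 16 fps gives a clip of roughly five seconds. A small sketch with hypothetical helper names:

```python
def flow_shift_for(height: int) -> float:
    """Return the flow_shift value noted in the example for each resolution."""
    if height == 720:
        return 5.0
    if height == 480:
        return 3.0
    raise ValueError(f"no flow_shift noted for height {height}")


def clip_seconds(num_frames: int = 81, fps: int = 16) -> float:
    """Duration of the exported video in seconds."""
    return num_frames / fps


print(flow_shift_for(480))  # 3.0
print(clip_seconds())       # 5.0625
```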

(4) Running a local Gradio demo

cd gradio
# To use the Dashscope API for prompt extension
DASH_API_KEY=your_key python t2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir ./Wan2.1-T2V-14B

# To use a local model for prompt extension
python t2v_14B_singleGPU.py --prompt_extend_method 'local_qwen' --ckpt_dir ./Wan2.1-T2V-14B

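Running image-to-video generation

As with text-to-video, image-to-video can be run with or without the prompt extension step. The parameters and their settings are as follows:

Task	480P	720P	Model
i2v-14B	❌	✔️	Wan2.1-I2V-14B-720P
i2v-14B	✔️	❌	Wan2.1-I2V-14B-480P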
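Because the i2v-14B task uses a different checkpoint per resolution, it helps to derive the checkpoint directory from the requested size. This is a sketch, assuming the local directory names used in the Gradio commands in this document; `i2v_ckpt_dir` is a hypothetical helper.

```python
# Local checkpoint directories for each supported resolution (assumed to match
# the --local-dir used when downloading).
I2V_CHECKPOINTS = {
    "480P": "./Wan2.1-I2V-14B-480P",
    "720P": "./Wan2.1-I2V-14B-720P",
}


def i2v_ckpt_dir(size: str) -> str:
    """Map a size string such as '1280*720' to the matching checkpoint directory."""
    width, height = (int(value) for value in size.split("*"))
    # The shorter side names the resolution class, e.g. 720 -> '720P'.
    resolution = f"{min(width, height)}P"
    if resolution not in I2V_CHECKPOINTS:
        raise ValueError(f"no I2V checkpoint listed for size {size!r}")
    return I2V_CHECKPOINTS[resolution]


print(i2v_ckpt_dir("1280*720"))  # ./Wan2.1-I2V-14B-720P
print(i2v_ckpt_dir("832*480"))   # ./Wan2.1-I2V-14B-480P
```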

(1) Without prompt extension

Single-GPU inference

python generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

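💡 Usage tip

For image-to-video tasks, the size parameter specifies the area of the generated video; the aspect ratio follows that of the original input image.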
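To make the "area, not exact dimensions" rule concrete, the sketch below derives an output width and height from a target area and the input image's aspect ratio. It mirrors the arithmetic of the Diffusers image-to-video example in this document. The function name `fit_to_area` is hypothetical, and `mod_value=16` is an assumption standing in for the value the Diffusers example reads from the pipeline; generate.py may round differently.

```python
import math
from typing import Tuple


def fit_to_area(image_width: int, image_height: int,
                max_area: int = 1280 * 720, mod_value: int = 16) -> Tuple[int, int]:
    """Return (width, height) with roughly `max_area` pixels and the input's aspect ratio."""
    aspect_ratio = image_height / image_width
    # Solve height * width = max_area with height / width = aspect_ratio,
    # then round each side down to a multiple of mod_value.
    height = round(math.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(math.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    return width, height


print(fit_to_area(1920, 1080))  # (1280, 720)
print(fit_to_area(1000, 1000))  # (960, 960)
```

A square input therefore yields a square video of about the same pixel count as 1280*720, rather than a stretched 16:9 frame.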

Multi-GPU inference with FSDP + xDiT USP

pip install "xfuser>=0.4.1"
torchrun --nproc_per_node=8 generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

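(2) With prompt extension

The prompt extension procedure is the same as described for text-to-video above.

Run with local prompt extension using Qwen/Qwen2.5-VL-7B-Instruct: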

python generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --use_prompt_extend --prompt_extend_model Qwen/Qwen2.5-VL-7B-Instruct --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

Run with remote prompt extension using dashscope:

DASH_API_KEY=your_key python generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --use_prompt_extend --prompt_extend_method 'dashscope' --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

(3) Running with Diffusers

Wan2.1-I2V inference can be run with Diffusers using the following code:

import torch
import numpy as np
from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel

# Available models: Wan-AI/Wan2.1-I2V-14B-480P-Diffusers, Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
model_id = "Wan-AI/Wan2.1-I2V-14B-720P-Diffusers"
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)
max_area = 720 * 1280
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
image = image.resize((width, height))
prompt = (
    "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in "
    "the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."
)
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height, width=width,
    num_frames=81,
    guidance_scale=5.0
).frames[0]
export_to_video(output, "output.mp4", fps=16)

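💡 Usage tip

Note that this example does not include prompt extension or distributed inference. A Diffusers version with prompt extension and multi-GPU support will be added as soon as possible.

(4) Running a local Gradio demo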

cd gradio
# To use only the 480P model in Gradio
DASH_API_KEY=your_key python i2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_480p ./Wan2.1-I2V-14B-480P

# To use only the 720P model in Gradio
DASH_API_KEY=your_key python i2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_720p ./Wan2.1-I2V-14B-720P

# To use both the 480P and 720P models in Gradio
DASH_API_KEY=your_key python i2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_480p ./Wan2.1-I2V-14B-480P --ckpt_dir_720p ./Wan2.1-I2V-14B-720P

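Running first-last-frame-to-video generation

First-last-frame-to-video can also be run with or without the prompt extension step. Only 720P is currently supported. The parameters and their settings are as follows:

Task	480P	720P	Model
flf2v-14B	❌	✔️	Wan2.1-FLF2V-14B-720P

(1) Without prompt extension

Single-GPU inference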

python generate.py --task flf2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-FLF2V-14B-720P --first_frame examples/flf2v_input_first_frame.png --last_frame examples/flf2v_input_last_frame.png --prompt "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird’s feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."

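💡 Usage tip

As with image-to-video, the size parameter specifies the area of the generated video; the aspect ratio follows that of the original input image.

Multi-GPU inference with FSDP + xDiT USP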

pip install "xfuser>=0.4.1"
torchrun --nproc_per_node=8 generate.py --task flf2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-FLF2V-14B-720P --first_frame examples/flf2v_input_first_frame.png --last_frame examples/flf2v_input_last_frame.png --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird’s feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."

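(2) With prompt extension

The prompt extension procedure is the same as described for text-to-video above.

Run with local prompt extension using Qwen/Qwen2.5-VL-7B-Instruct: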

python generate.py --task flf2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-FLF2V-14B-720P --first_frame examples/flf2v_input_first_frame.png --last_frame examples/flf2v_input_last_frame.png --use_prompt_extend --prompt_extend_model Qwen/Qwen2.5-VL-7B-Instruct --prompt "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird’s feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."

Run with remote prompt extension using dashscope:

DASH_API_KEY=your_key python generate.py --task flf2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-FLF2V-14B-720P --first_frame examples/flf2v_input_first_frame.png --last_frame examples/flf2v_input_last_frame.png --use_prompt_extend --prompt_extend_method 'dashscope' --prompt "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird’s feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."

(3) Running a local Gradio demo

cd gradio
# Use the 720P model in Gradio
DASH_API_KEY=your_key python flf2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_720p ./Wan2.1-FLF2V-14B-720P

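Running VACE

VACE now supports two models (1.3B and 14B) and two main resolutions (480P and 720P). Inputs of any resolution are accepted, but for best results the video size should fall within a specific range. The parameters and configurations of these models are as follows:

Task	480P (~81x480x832)	720P (~81x720x1280)	Model
VACE	✔️	✔️	Wan2.1-VACE-14B
VACE	✔️	❌	Wan2.1-VACE-1.3B

In VACE, users supply a text prompt plus an optional video, mask, and images for video generation or editing. Detailed instructions for using VACE can be found in the user guide. The workflow is as follows:

(1) Preprocessing

User-collected material must be preprocessed into inputs that VACE recognizes: src_video, src_mask, src_ref_images, and prompt. For R2V (reference-to-video generation), this preprocessing can be skipped. For V2V (video-to-video editing) and MV2V (masked video-to-video editing) tasks, additional preprocessing is needed to obtain a video with conditions such as depth, pose, or masked regions. For more details, see vace_preproccess.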
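The four inputs named above can be gathered in a small container that also records which task types need the extra preprocessing step. This is a sketch under stated assumptions: the class `VaceInputs` is hypothetical, `--src_ref_images` and `--prompt` appear in this document's commands, and the flag names `--src_video` and `--src_mask` are assumed by analogy and should be checked against generate.py.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Task types named in the text and whether they need the extra preprocessing.
NEEDS_PREPROCESSING = {"R2V": False, "V2V": True, "MV2V": True}


@dataclass
class VaceInputs:
    prompt: str
    src_video: Optional[str] = None
    src_mask: Optional[str] = None
    src_ref_images: List[str] = field(default_factory=list)

    def to_args(self) -> List[str]:
        """Render the inputs as command-line arguments for generate.py."""
        args = ["--prompt", self.prompt]
        if self.src_video:
            args += ["--src_video", self.src_video]  # assumed flag name
        if self.src_mask:
            args += ["--src_mask", self.src_mask]  # assumed flag name
        if self.src_ref_images:
            # Reference images are passed as one comma-separated value.
            args += ["--src_ref_images", ",".join(self.src_ref_images)]
        return args


r2v = VaceInputs(prompt="a girl playing with a cartoon snake",
                 src_ref_images=["examples/girl.png", "examples/snake.png"])
print(r2v.to_args())
```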

(2) Command-line inference

Single-GPU inference

python generate.py --task vace-1.3B --size 832*480 --ckpt_dir ./Wan2.1-VACE-1.3B --src_ref_images examples/girl.png,examples/snake.png --prompt "在一个欢乐而充满节日气氛的场景中，穿着鲜艳红色春服的小女孩正与她的可爱卡通蛇嬉戏。她的春服上绣着金色吉祥图案，散发着喜庆的气息，脸上洋溢着灿烂的笑容。蛇身呈现出亮眼的绿色，形状圆润，宽大的眼睛让它显得既友善又幽默。小女孩欢快地用手轻轻抚摸着蛇的头部，共同享受着这温馨的时刻。周围五彩斑斓的灯笼和彩带装饰着环境，阳光透过洒在她们身上，营造出一个充满友爱与幸福的新年氛围。"

Multi-GPU inference with FSDP + xDiT USP

torchrun --nproc_per_node=8 generate.py --task vace-14B --size 1280*720 --ckpt_dir ./Wan2.1-VACE-14B --dit_fsdp --t5_fsdp --ulysses_size 8 --src_ref_images examples/girl.png,examples/snake.png --prompt "在一个欢乐而充满节日气氛的场景中，穿着鲜艳红色春服的小女孩正与她的可爱卡通蛇嬉戏。她的春服上绣着金色吉祥图案，散发着喜庆的气息，脸上洋溢着灿烂的笑容。蛇身呈现出亮眼的绿色，形状圆润，宽大的眼睛让它显得既友善又幽默。小女孩欢快地用手轻轻抚摸着蛇的头部，共同享受着这温馨的时刻。周围五彩斑斓的灯笼和彩带装饰着环境，阳光透过洒在她们身上，营造出一个充满友爱与幸福的新年氛围。"

(3) Running a local Gradio demo

Single-GPU inference

python gradio/vace.py --ckpt_dir ./Wan2.1-VACE-1.3B

Multi-GPU inference with FSDP + xDiT USP

python gradio/vace.py --mp --ulysses_size 8 --ckpt_dir ./Wan2.1-VACE-14B/

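Running text-to-image generation

Wan2.1 is a unified model for both image and video generation. Because it was trained on both types of data, it can also generate images. The command for generating images is similar to that for video generation:

(1) Without prompt extension

Single-GPU inference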

python generate.py --task t2i-14B --size 1024*1024 --ckpt_dir ./Wan2.1-T2V-14B  --prompt '一个朴素端庄的美人'

Multi-GPU inference with FSDP + xDiT USP

torchrun --nproc_per_node=8 generate.py --dit_fsdp --t5_fsdp --ulysses_size 8 --base_seed 0 --frame_num 1 --task t2i-14B  --size 1024*1024 --prompt '一个朴素端庄的美人' --ckpt_dir ./Wan2.1-T2V-14B

(2) With prompt extension

Single-GPU inference

python generate.py --task t2i-14B --size 1024*1024 --ckpt_dir ./Wan2.1-T2V-14B  --prompt '一个朴素端庄的美人' --use_prompt_extend

Multi-GPU inference with FSDP + xDiT USP

torchrun --nproc_per_node=8 generate.py --dit_fsdp --t5_fsdp --ulysses_size 8 --base_seed 0 --frame_num 1 --task t2i-14B  --size 1024*1024 --ckpt_dir ./Wan2.1-T2V-14B --prompt '一个朴素端庄的美人' --use_prompt_extend

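✨ Key features

👍 SOTA performance: Wan2.1 consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks.
👍 Consumer-grade GPU support: the T2V-1.3B model requires only 8.19 GB of VRAM, making it compatible with almost all consumer-grade GPUs. It can generate a 5-second 480P video on an RTX 4090 in about 4 minutes (without optimization techniques such as quantization). Its performance is even comparable to some closed-source models.
👍 Multiple tasks: Wan2.1 performs well on text-to-video, image-to-video, video editing, text-to-image, and video-to-audio tasks.
👍 Visual text generation: Wan2.1 is the first video model able to generate both Chinese and English text, with robust text generation that broadens its practical applications.
👍 Powerful video VAE: Wan-VAE offers strong efficiency and performance, encoding and decoding 1080P videos of any length while preserving temporal information, which makes it an ideal foundation for video and image generation.

📚 Documentation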

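Community works

If your work has improved Wan2.1 and you would like more people to see it, please let us know.

Phantom has developed a unified video generation framework for single-subject and multi-subject references based on Wan2.1-T2V-1.3B. See their examples.
UniAnimate-DiT has trained a human image animation model based on Wan2.1-14B-I2V and has open-sourced the inference and training code.
CFG-Zero enhances Wan2.1 (covering both T2V and I2V models) from the perspective of CFG.
TeaCache now supports Wan2.1 acceleration and can increase speed by about 2x.
DiffSynth-Studio provides additional support for Wan2.1, including video-to-video, FP8 quantization, VRAM optimization, LoRA training, and more. See their examples.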

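To-do list

Wan2.1 Text-to-Video
- [x] Multi-GPU inference code for the 14B and 1.3B models
- [x] Checkpoints for the 14B and 1.3B models
- [x] Gradio demo
- [x] ComfyUI integration
- [x] Diffusers integration
- [ ] Diffusers + multi-GPU inference
Wan2.1 Image-to-Video
- [x] Multi-GPU inference code for the 14B model
- [x] Checkpoints for the 14B model
- [x] Gradio demo
- [x] ComfyUI integration
- [x] Diffusers integration
- [ ] Diffusers + multi-GPU inference
Wan2.1 First-Last-Frame-to-Video
- [x] Multi-GPU inference code for the 14B model
- [x] Checkpoints for the 14B model
- [x] Gradio demo
- [ ] ComfyUI integration
- [ ] Diffusers integration
- [ ] Diffusers + multi-GPU inference
Wan2.1 VACE
- [x] Multi-GPU inference code for the 14B and 1.3B models
- [x] Checkpoints for the 14B and 1.3B models
- [x] Gradio demo
- [x] ComfyUI integration
- [ ] Diffusers integration
- [ ] Diffusers + multi-GPU inference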

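Human evaluation

(1) Text-to-video evaluation

In human evaluation, results generated after prompt extension are rated above those of both closed-source and open-source models.

(2) Image-to-video evaluation

We also ran an extensive human evaluation of the image-to-video model; the results are shown in the table below. They clearly indicate that Wan2.1 outperforms both closed-source and open-source models.

Computational efficiency on different GPUs

We tested the computational efficiency of the different Wan2.1 models on different GPUs; the results are shown in the table below, in the format total time (s) / peak GPU memory (GB).

The parameter settings for the tests in this table are: (1) for the 1.3B model on 8 GPUs, --ring_size 8 and --ulysses_size 1; (2) for the 14B model on 1 GPU, --offload_model True; (3) for the 1.3B model on a single 4090 GPU, --offload_model True --t5_cpu; (4) for all tests, no prompt extension was applied, that is, --use_prompt_extend was not enabled.

💡 Usage tip

T2V-14B is slower than I2V-14B because the former samples for 50 steps while the latter uses 40.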

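Introduction to Wan2.1

Wan2.1 is designed on the mainstream diffusion transformer paradigm and achieves significant advances in generative capability through a series of innovations. These include our novel spatio-temporal variational autoencoder (VAE), scalable training strategies, large-scale data construction, and automated evaluation metrics. Together, these contributions improve the model's performance and versatility.

(1) 3D variational autoencoder

We propose a novel 3D causal VAE architecture, called Wan-VAE, designed specifically for video generation. By combining multiple strategies, we improve spatio-temporal compression, reduce memory usage, and ensure temporal causality. Compared with other open-source VAEs, Wan-VAE shows significant advantages in performance efficiency. In addition, Wan-VAE can encode and decode 1080P videos of unlimited length without losing historical temporal information, which makes it particularly well suited to video generation tasks.

(2) Video diffusion DiT

Wan2.1 is designed using the flow matching framework within the mainstream diffusion transformer paradigm. The architecture uses a T5 encoder to encode multilingual text input, and cross-attention in each transformer block injects the text embeddings into the model. In addition, an MLP with a linear layer and a SiLU layer processes the input time embeddings and predicts six modulation parameters individually. This MLP is shared across all transformer blocks, with each block learning its own distinct set of biases. Our experiments show that, at the same parameter scale, this approach significantly improves performance.

Model	Dimension	Input dimension	Output dimension	Feedforward dimension	Frequency dimension	Number of heads	Number of layers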
1.3B	1536	16	16	8960	256	12	30
14B	5120	16	16	13824	256	40	40
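Two quantities follow directly from the table: the per-head dimension, and a rough estimate of how many parameters are saved by sharing the modulation MLP across blocks. The sketch below is back-of-the-envelope arithmetic, not the model's actual implementation. It assumes the modulation projection is a single linear layer mapping `dim` to `6 * dim` with a bias, and that each block's own bias set has `6 * dim` entries.

```python
# Values copied from the table above.
CONFIGS = {
    "1.3B": {"dim": 1536, "num_heads": 12, "num_layers": 30},
    "14B": {"dim": 5120, "num_heads": 40, "num_layers": 40},
}


def head_dim(cfg: dict) -> int:
    """Dimension of each attention head."""
    return cfg["dim"] // cfg["num_heads"]


def modulation_params(cfg: dict, shared: bool) -> int:
    """Parameter count of the modulation projection(s) under the stated assumptions."""
    dim, layers = cfg["dim"], cfg["num_layers"]
    projection = dim * 6 * dim + 6 * dim  # weights + bias of one dim -> 6*dim layer
    if shared:
        # One shared projection plus a separate 6*dim bias set per block.
        return projection + layers * 6 * dim
    # Hypothetical alternative: every block owns a full projection.
    return layers * projection


for name, cfg in CONFIGS.items():
    print(name, head_dim(cfg),
          modulation_params(cfg, shared=True),
          modulation_params(cfg, shared=False))
```

Under these assumptions both models use 128-dimensional heads, and sharing the projection in the 14B configuration costs about 0.16 billion parameters instead of about 6.3 billion.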

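Data

We curated and deduplicated a candidate dataset containing a large amount of image and video data. During curation, we designed a four-step data cleaning process focused on basic dimensions, visual quality, and motion quality. With this robust data processing pipeline, we can readily obtain high-quality, diverse, large-scale training sets of images and videos.

Comparison with SOTA

We compared Wan2.1 with leading open-source and closed-source models to evaluate its performance. Using a carefully designed set of 1,035 internal prompts, we tested across 14 major dimensions and 26 sub-dimensions. We then computed a total score as a weighted combination of the per-dimension scores, with weights derived from human preferences in the matching process. Detailed results are shown in the table below. They indicate that our model performs better than both open-source and closed-source models.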
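The total score described above is a weighted combination of per-dimension scores. The sketch below shows one common way to compute such a score, a weighted average normalized by the sum of the weights. Whether the original evaluation normalizes this way is an assumption, and the dimension names, scores, and weights are hypothetical placeholders, not results from the evaluation.

```python
from typing import Dict


def weighted_total(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension scores, normalized by total weight."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical example values for illustration only.
example_scores = {"motion_quality": 0.5, "visual_quality": 1.0}
example_weights = {"motion_quality": 3.0, "visual_quality": 1.0}
print(weighted_total(example_scores, example_weights))  # 0.625
```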

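🔧 Technical details

3D variational autoencoder

We propose a novel 3D causal VAE architecture, called Wan-VAE, designed specifically for video generation. By combining multiple strategies, we improve spatio-temporal compression, reduce memory usage, and ensure temporal causality. Compared with other open-source VAEs, Wan-VAE shows significant advantages in performance efficiency, and it can encode and decode 1080P videos of unlimited length without losing historical temporal information, which makes it particularly well suited to video generation tasks.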

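📄 License

The models in this repository are licensed under the Apache 2.0 License. We claim no rights over the content you generate and grant you the freedom to use it, provided that your use complies with the terms of this license. You are fully responsible for your use of the models, which must not involve sharing any content that violates applicable laws, causes harm to individuals or groups, disseminates personal information intended to cause harm, spreads misinformation, or targets vulnerable groups. For the complete list of restrictions and details of your rights, see the full text of the license.

Acknowledgements

We would like to thank the contributors to the SD3, Qwen, umt5-xxl, diffusers, and HuggingFace repositories for their open research.

Contact us

If you would like to leave a message for our research or product teams, feel free to join our Discord or WeChat groups.

Citation

If you find our work helpful, please cite us: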

@article{wan2025,
      title={Wan: Open and Advanced Large-Scale Video Generative Models}, 
      author={Ang Wang and Baole Ai and Bin Wen and Chaojie Mao and Chen-Wei Xie and Di Chen and Feiwu Yu and Haiming Zhao and Jianxiao Yang and Jianyuan Zeng and Jiayu Wang and Jingfeng Zhang and Jingren Zhou and Jinkai Wang and Jixuan Chen and Kai Zhu and Kang Zhao and Keyu Yan and Lianghua Huang and Mengyang Feng and Ningyi Zhang and Pandeng Li and Pingyu Wu and Ruihang Chu and Ruili Feng and Shiwei Zhang and Siyang Sun and Tao Fang and Tianxing Wang and Tianyi Gui and Tingyu Weng and Tong Shen and Wei Lin and Wei Wang and Wei Wang and Wenmeng Zhou and Wente Wang and Wenting Shen and Wenyuan Yu and Xianzhong Shi and Xiaoming Huang and Xin Xu and Yan Kou and Yangyu Lv and Yifei Li and Yijing Liu and Yiming Wang and Yingya Zhang and Yitong Huang and Yong Li and You Wu and Yu Liu and Yulin Pan and Yun Zheng and Yuntao Hong and Yupeng Shi and Yutong Feng and Zeyinzi Jiang and Zhen Han and Zhi-Fan Wu and Ziyu Liu},
      journal = {arXiv preprint arXiv:2503.20314},
      year={2025}
}