Wan2.1-VACE-14B Open-Source Video Model: Supports Multiple Video Generation and Editing Tasks

Wan2.1 VACE 14B

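Developed by Wan-AI

Wan2.1 is a comprehensive and open suite of video foundation models that aims to push the boundaries of video generation and supports a range of video generation and editing tasks.

Text-to-Video · Supports multiple languages · License: Apache-2.0 · #multi-task video generation #consumer GPU support #Chinese and English text generation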

Downloads: 8,797

Release date: 5/13/2025

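Model Overview

Wan2.1 is an advanced suite of video generation models that supports text-to-video, image-to-video, video editing, text-to-image, and video-to-audio tasks, advancing the field of video generation.

Model Features

SOTA performance

Consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks.

Supports consumer-grade GPUs

The T2V-1.3B model requires only 8.19 GB of VRAM, making it compatible with almost all consumer-grade GPUs.

Multi-task support

Performs strongly on text-to-video, image-to-video, video editing, text-to-image, and video-to-audio tasks.

Visual text generation

The first video model capable of generating both Chinese and English text, with robust text generation capability.

Efficient video VAE

Wan-VAE preserves temporal information while encoding and decoding 1080P videos of any length.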

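Model Capabilities

Text-to-video generation

Image-to-video generation

Video editing

Text-to-image generation

Video-to-audio generation

Chinese and English text generation

Use Cases

Video creation

Short video generation

Generate short video content from a text description.

With the T2V-1.3B model, generating a 5-second 480P video takes about 4 minutes on an RTX 4090.

Video editing

Video style transfer

Modify the style of a video based on a reference image or text.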

🚀 Wan2.1

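Wan2.1 is a comprehensive and open suite of video foundation models that pushes the boundaries of video generation. It offers SOTA performance, runs on consumer-grade GPUs, handles multiple tasks, can generate visual text, and includes a powerful video VAE.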

🚀 Quick Start

Installation

Clone the repository:

git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1

Install the dependencies:

# Ensure torch >= 2.4.0
pip install -r requirements.txt

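Model Download

Model	Download Links	Notes
T2V-14B	🤗 Huggingface 🤖 ModelScope	Supports both 480P and 720P
I2V-14B-720P	🤗 Huggingface 🤖 ModelScope	Supports 720P
I2V-14B-480P	🤗 Huggingface 🤖 ModelScope	Supports 480P
T2V-1.3B	🤗 Huggingface 🤖 ModelScope	Supports 480P
FLF2V-14B	🤗 Huggingface 🤖 ModelScope	Supports 720P
VACE-1.3B	🤗 Huggingface 🤖 ModelScope	Supports 480P
VACE-14B	🤗 Huggingface 🤖 ModelScope	Supports both 480P and 720P

⚠️ Important Note

The 1.3B models can generate video at 720P resolution. However, because training at this resolution was limited, the results are generally less stable than at 480P. For best performance, we recommend using 480P.

For first-last-frame-to-video generation, the model was trained primarily on Chinese text-video pairs, so Chinese prompts are recommended for better results.

Download the models using huggingface-cli: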

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B

Download the models using modelscope-cli:

pip install modelscope
modelscope download Wan-AI/Wan2.1-T2V-14B --local_dir ./Wan2.1-T2V-14B

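Run Text-to-Video Generation

This repository supports two text-to-video models (1.3B and 14B) and two resolutions (480P and 720P). Their parameters and configurations are as follows:

Task	480P	720P	Model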
t2v-14B	✔️	✔️	Wan2.1-T2V-14B
t2v-1.3B	✔️	❌	Wan2.1-T2V-1.3B
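
As a minimal sketch, the support matrix above can be turned into a lookup that rejects unsupported task/size pairs before a long run starts. The names `SUPPORTED_SIZES` and `is_supported` are hypothetical, and the table only lists the two landscape sizes used in this document's commands; the actual CLI may accept other sizes.

```python
# Support matrix transcribed from the table above.
# Assumption: only the two landscape sizes that appear in this document's
# commands are listed; the real generate.py may accept additional sizes.
SUPPORTED_SIZES = {
    "t2v-14B": {"832*480", "1280*720"},
    "t2v-1.3B": {"832*480"},
}


def is_supported(task: str, size: str) -> bool:
    """Return True if the task/size pair is marked as supported in the table."""
    return size in SUPPORTED_SIZES.get(task, set())


if __name__ == "__main__":
    for task, size in [("t2v-14B", "1280*720"), ("t2v-1.3B", "1280*720")]:
        print(task, size, "supported" if is_supported(task, size) else "not supported")
```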

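(1) Without prompt extension

To keep things simple, we start with the basic inference process that skips the prompt extension step.

Single-GPU inference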

python generate.py  --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."

If you run into OOM (out-of-memory) issues, you can use the --offload_model True and --t5_cpu options to reduce GPU memory usage. For example, on an RTX 4090 GPU:

python generate.py  --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --offload_model True --t5_cpu --sample_shift 8 --sample_guide_scale 6 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
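
When these commands are launched from a script, the low-memory options can be added conditionally. The sketch below only assembles the argument list and uses only flags that appear in the commands above; the helper name `build_generate_cmd` and its `low_vram` switch are hypothetical.

```python
import shlex


def build_generate_cmd(task, size, ckpt_dir, prompt, low_vram=False):
    """Assemble a generate.py command line as a list of arguments."""
    cmd = ["python", "generate.py", "--task", task, "--size", size,
           "--ckpt_dir", ckpt_dir]
    if low_vram:
        # Options the text above suggests for reducing GPU memory usage.
        cmd += ["--offload_model", "True", "--t5_cpu"]
    cmd += ["--prompt", prompt]
    return cmd


if __name__ == "__main__":
    cmd = build_generate_cmd(
        "t2v-1.3B", "832*480", "./Wan2.1-T2V-1.3B",
        "Two anthropomorphic cats fight on a spotlighted stage.",
        low_vram=True,
    )
    # shlex.join quotes the prompt so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run`, which avoids shell quoting problems with long prompts.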

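💡 Usage Tip

When using the T2V-1.3B model, we recommend setting --sample_guide_scale to 6. The --sample_shift parameter can be adjusted within the range of 8 to 12 based on the performance.

Multi-GPU inference using FSDP + xDiT USP

We use FSDP and xDiT USP to accelerate inference.
- Ulysses strategy: to use the Ulysses strategy, set --ulysses_size $GPU_NUMS. Note that num_heads must be divisible by ulysses_size. For the 1.3B model, num_heads is 12, which is not divisible by 8 (most multi-GPU machines have 8 GPUs), so the Ring strategy is recommended instead.
- Ring strategy: to use the Ring strategy, set --ring_size $GPU_NUMS. Note that the sequence length must be divisible by ring_size.

The Ulysses and Ring strategies can also be combined.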

pip install "xfuser>=0.4.1"
torchrun --nproc_per_node=8 generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
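
The divisibility rule for the Ulysses strategy can be checked ahead of time. The sketch below enumerates (ulysses_size, ring_size) pairs for a given head count and GPU count. It assumes that the product of the two sizes must equal the number of GPUs, which this document does not state explicitly, and it does not check the sequence-length constraint of the Ring strategy. The function name is hypothetical; the head counts (12 and 40) come from the architecture table in this document.

```python
def valid_parallel_configs(num_heads: int, num_gpus: int):
    """List (ulysses_size, ring_size) pairs that satisfy the head-count rule.

    Assumption: ulysses_size * ring_size == num_gpus.
    """
    configs = []
    for ulysses_size in range(1, num_gpus + 1):
        if num_gpus % ulysses_size != 0:
            continue  # the two sizes must multiply to the GPU count
        if num_heads % ulysses_size != 0:
            continue  # num_heads must be divisible by ulysses_size
        configs.append((ulysses_size, num_gpus // ulysses_size))
    return configs


if __name__ == "__main__":
    print("1.3B (12 heads), 8 GPUs:", valid_parallel_configs(12, 8))
    print("14B (40 heads), 8 GPUs:", valid_parallel_configs(40, 8))
```

For the 1.3B model on 8 GPUs the pure-Ulysses pair (8, 1) is absent, which matches the recommendation to use the Ring strategy there.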

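(2) With prompt extension

Extending the prompt effectively enriches the details in the generated video and further improves video quality, so we recommend enabling prompt extension. Two methods are provided:

Use the Dashscope API for extension
- Apply for a dashscope.api_key in advance (EN | CN).
- Set the environment variable DASH_API_KEY to your Dashscope API key. Users of Alibaba Cloud's international site must also set the environment variable DASH_API_URL to 'https://dashscope-intl.aliyuncs.com/api/v1'. See the dashscope documentation for more details.
- The qwen-plus model is used for text-to-video tasks and qwen-vl-max for image-to-video tasks.
- The model used for extension can be changed with the --prompt_extend_model parameter. For example: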

DASH_API_KEY=your_key python generate.py  --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage" --use_prompt_extend --prompt_extend_method 'dashscope' --prompt_extend_target_lang 'zh'
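
If generation is launched from Python instead of a shell, the same two environment variables can be set on a copy of the environment. This is a minimal sketch: the helper name `dashscope_env` and its `international` flag are hypothetical, while the variable names and the URL are the ones given in the list above.

```python
import os

# URL for Alibaba Cloud international-site users, as given above.
INTL_URL = "https://dashscope-intl.aliyuncs.com/api/v1"


def dashscope_env(api_key: str, international: bool = False) -> dict:
    """Return a copy of the current environment with the Dashscope variables set."""
    env = dict(os.environ)
    env["DASH_API_KEY"] = api_key
    if international:
        env["DASH_API_URL"] = INTL_URL
    return env


if __name__ == "__main__":
    env = dashscope_env("your_key", international=True)
    print(env["DASH_API_KEY"], env["DASH_API_URL"])
```

The returned dictionary can be passed as `env=` to `subprocess.run` so the key does not leak into the parent process environment.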

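Use a local model for extension
- By default, the Qwen model on HuggingFace is used for extension. You can choose a Qwen model or another model according to the available GPU memory.
- For text-to-video tasks, you can use models such as Qwen/Qwen2.5-14B-Instruct, Qwen/Qwen2.5-7B-Instruct, and Qwen/Qwen2.5-3B-Instruct.
- For image-to-video or first-last-frame-to-video tasks, you can use models such as Qwen/Qwen2.5-VL-7B-Instruct and Qwen/Qwen2.5-VL-3B-Instruct.
- Larger models generally give better extension results but require more GPU memory.
- The model used for extension can be changed with the --prompt_extend_model parameter, which accepts either a local model path or a Hugging Face model. For example: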

python generate.py  --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage" --use_prompt_extend --prompt_extend_method 'local_qwen' --prompt_extend_target_lang 'zh'

(3) Run with Diffusers

Wan2.1-T2V inference can be run with Diffusers using the following code:

import torch
from diffusers.utils import export_to_video
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

# Available models: Wan-AI/Wan2.1-T2V-14B-Diffusers, Wan-AI/Wan2.1-T2V-1.3B-Diffusers
model_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
flow_shift = 5.0  # 5.0 for 720P, 3.0 for 480P
scheduler = UniPCMultistepScheduler(prediction_type='flow_prediction', use_flow_sigmas=True, num_train_timesteps=1000, flow_shift=flow_shift)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.scheduler = scheduler
pipe.to("cuda")

prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=720,
    width=1280,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)

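💡 Usage Tip

Note that this example does not integrate prompt extension or distributed inference. We will update it with a Diffusers version that integrates prompt extension and multi-GPU inference as soon as possible.

(4) Run local gradio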

cd gradio
# If using the dashscope API for prompt extension
DASH_API_KEY=your_key python t2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir ./Wan2.1-T2V-14B

# If using a local model for prompt extension
python t2v_14B_singleGPU.py --prompt_extend_method 'local_qwen' --ckpt_dir ./Wan2.1-T2V-14B

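Run Image-to-Video Generation

As with text-to-video, image-to-video generation can be run with or without the prompt extension step. The parameters and their corresponding settings are as follows:

Task	480P	720P	Model
i2v-14B	❌	✔️	Wan2.1-I2V-14B-720P
i2v-14B	✔️	❌	Wan2.1-I2V-14B-480P

(1) Without prompt extension

Single-GPU inference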

python generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

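💡 Usage Tip

For image-to-video tasks, the size parameter specifies the area of the generated video, while the aspect ratio follows that of the original input image.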
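
A minimal sketch of what treating size as an area means in practice: the output height and width are derived from the target pixel area and the input image's aspect ratio, then rounded down to a multiple of a model-dependent value. The arithmetic mirrors the Diffusers image-to-video example in this document; the function name `fit_to_area` is hypothetical, and the default `mod_value=16` is an assumption (the Diffusers example derives it from the pipeline instead of hard-coding it).

```python
import math


def fit_to_area(image_width, image_height, max_area=1280 * 720, mod_value=16):
    """Return (width, height) with roughly max_area pixels and the input's aspect ratio.

    mod_value=16 is an assumed default; both sides are rounded down to a
    multiple of it.
    """
    aspect_ratio = image_height / image_width
    height = round(math.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(math.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    return width, height


if __name__ == "__main__":
    print(fit_to_area(1920, 1080))  # 16:9 input
    print(fit_to_area(1000, 1000))  # square input
```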

Multi-GPU inference using FSDP + xDiT USP

pip install "xfuser>=0.4.1"
torchrun --nproc_per_node=8 generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

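(2) With prompt extension

For the prompt extension process, refer to the prompt extension steps in the text-to-video section.

Run with local prompt extension using Qwen/Qwen2.5-VL-7B-Instruct: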

python generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --use_prompt_extend --prompt_extend_model Qwen/Qwen2.5-VL-7B-Instruct --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

Run with remote prompt extension using dashscope:

DASH_API_KEY=your_key python generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --use_prompt_extend --prompt_extend_method 'dashscope' --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

(3) Run with Diffusers

Wan2.1-I2V inference can be run with Diffusers using the following code:

import torch
import numpy as np
from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel

# Available models: Wan-AI/Wan2.1-I2V-14B-480P-Diffusers, Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
model_id = "Wan-AI/Wan2.1-I2V-14B-720P-Diffusers"
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)
max_area = 720 * 1280
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
image = image.resize((width, height))
prompt = (
    "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in "
    "the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."
)
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height, width=width,
    num_frames=81,
    guidance_scale=5.0
).frames[0]
export_to_video(output, "output.mp4", fps=16)

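💡 Usage Tip

Note that this example does not integrate prompt extension or distributed inference. We will update it with a Diffusers version that integrates prompt extension and multi-GPU inference as soon as possible.

(4) Run local gradio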

cd gradio
# If using only the 480P model in gradio
DASH_API_KEY=your_key python i2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_480p ./Wan2.1-I2V-14B-480P

# If using only the 720P model in gradio
DASH_API_KEY=your_key python i2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_720p ./Wan2.1-I2V-14B-720P

# If using both the 480P and 720P models in gradio
DASH_API_KEY=your_key python i2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_480p ./Wan2.1-I2V-14B-480P --ckpt_dir_720p ./Wan2.1-I2V-14B-720P

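Run First-Last-Frame-to-Video Generation

First-last-frame-to-video generation can also be run with or without the prompt extension step. Only 720P is currently supported. The parameters and corresponding settings are as follows:

Task	480P	720P	Model
flf2v-14B	❌	✔️	Wan2.1-FLF2V-14B-720P

(1) Without prompt extension

Single-GPU inference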

python generate.py --task flf2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-FLF2V-14B-720P --first_frame examples/flf2v_input_first_frame.png --last_frame examples/flf2v_input_last_frame.png --prompt "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird’s feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."

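💡 Usage Tip

As with image-to-video, the size parameter specifies the area of the generated video, while the aspect ratio follows that of the original input image.

Multi-GPU inference using FSDP + xDiT USP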

pip install "xfuser>=0.4.1"
torchrun --nproc_per_node=8 generate.py --task flf2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-FLF2V-14B-720P --first_frame examples/flf2v_input_first_frame.png --last_frame examples/flf2v_input_last_frame.png --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird’s feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."

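(2) With prompt extension

For the prompt extension process, refer to the prompt extension steps in the text-to-video section.

Run with local prompt extension using Qwen/Qwen2.5-VL-7B-Instruct: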

python generate.py --task flf2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-FLF2V-14B-720P --first_frame examples/flf2v_input_first_frame.png --last_frame examples/flf2v_input_last_frame.png --use_prompt_extend --prompt_extend_model Qwen/Qwen2.5-VL-7B-Instruct --prompt "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird’s feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."

Run with remote prompt extension using dashscope:

DASH_API_KEY=your_key python generate.py --task flf2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-FLF2V-14B-720P --first_frame examples/flf2v_input_first_frame.png --last_frame examples/flf2v_input_last_frame.png --use_prompt_extend --prompt_extend_method 'dashscope' --prompt "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird’s feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."

(3) Run local gradio

cd gradio
# Use the 720P model in gradio
DASH_API_KEY=your_key python flf2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_720p ./Wan2.1-FLF2V-14B-720P

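Run VACE

VACE now supports two models (1.3B and 14B) and two main resolutions (480P and 720P). Inputs of any resolution are accepted, but for best results the video size should fall within a specific range. The parameters and configurations of these models are as follows:

Task	480P(~81x480x832)	720P(~81x720x1280)	Model
VACE	✔️	✔️	Wan2.1-VACE-14B
VACE	✔️	❌	Wan2.1-VACE-1.3B

In VACE, users can provide a text prompt together with optional video, mask, and image inputs for video generation or editing. Detailed instructions for using VACE are available in the User Guide. The workflow is as follows:

(1) Preprocessing

User-collected materials need to be preprocessed into inputs that VACE can recognize, including src_video, src_mask, src_ref_images, and prompt. For R2V (reference-to-video generation), this preprocessing can be skipped. For V2V (video-to-video editing) and MV2V (masked video-to-video editing) tasks, additional preprocessing is required to obtain videos with conditions such as depth, pose, or masked regions. For more details, see vace_preproccess.

(2) Command-line inference

Single-GPU inference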

python generate.py --task vace-1.3B --size 832*480 --ckpt_dir ./Wan2.1-VACE-1.3B --src_ref_images examples/girl.png,examples/snake.png --prompt "在一個歡樂而充滿節日氣氛的場景中，穿著鮮豔紅色春服的小女孩正與她的可愛卡通蛇嬉戲。她的春服上繡著金色吉祥圖案，散發著喜慶的氣息，臉上洋溢著燦爛的笑容。蛇身呈現出亮眼的綠色，形狀圓潤，寬大的眼睛讓它顯得既友善又幽默。小女孩歡快地用手輕輕撫摸著蛇的頭部，共同享受著這溫馨的時刻。周圍五彩斑斕的燈籠和綵帶裝飾著環境，陽光透過灑在她們身上，營造出一個充滿友愛與幸福的新年氛圍。"

Multi-GPU inference using FSDP + xDiT USP

torchrun --nproc_per_node=8 generate.py --task vace-14B --size 1280*720 --ckpt_dir ./Wan2.1-VACE-14B --dit_fsdp --t5_fsdp --ulysses_size 8 --src_ref_images examples/girl.png,examples/snake.png --prompt "在一個歡樂而充滿節日氣氛的場景中，穿著鮮豔紅色春服的小女孩正與她的可愛卡通蛇嬉戲。她的春服上繡著金色吉祥圖案，散發著喜慶的氣息，臉上洋溢著燦爛的笑容。蛇身呈現出亮眼的綠色，形狀圓潤，寬大的眼睛讓它顯得既友善又幽默。小女孩歡快地用手輕輕撫摸著蛇的頭部，共同享受著這溫馨的時刻。周圍五彩斑斕的燈籠和綵帶裝飾著環境，陽光透過灑在她們身上，營造出一個充滿友愛與幸福的新年氛圍。"

(3) Run local gradio

Single-GPU inference

python gradio/vace.py --ckpt_dir ./Wan2.1-VACE-1.3B

Multi-GPU inference using FSDP + xDiT USP

python gradio/vace.py --mp --ulysses_size 8 --ckpt_dir ./Wan2.1-VACE-14B/

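Run Text-to-Image Generation

Wan2.1 is a unified model for both image and video generation. Because it was trained on both types of data, it can also generate images. The command for generating images is similar to that for video generation:

(1) Without prompt extension

Single-GPU inference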

python generate.py --task t2i-14B --size 1024*1024 --ckpt_dir ./Wan2.1-T2V-14B  --prompt '一個樸素端莊的美人'

Multi-GPU inference using FSDP + xDiT USP

torchrun --nproc_per_node=8 generate.py --dit_fsdp --t5_fsdp --ulysses_size 8 --base_seed 0 --frame_num 1 --task t2i-14B  --size 1024*1024 --prompt '一個樸素端莊的美人' --ckpt_dir ./Wan2.1-T2V-14B

(2) With prompt extension

Single-GPU inference

python generate.py --task t2i-14B --size 1024*1024 --ckpt_dir ./Wan2.1-T2V-14B  --prompt '一個樸素端莊的美人' --use_prompt_extend

Multi-GPU inference using FSDP + xDiT USP

torchrun --nproc_per_node=8 generate.py --dit_fsdp --t5_fsdp --ulysses_size 8 --base_seed 0 --frame_num 1 --task t2i-14B  --size 1024*1024 --ckpt_dir ./Wan2.1-T2V-14B --prompt '一個樸素端莊的美人' --use_prompt_extend

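✨ Key Features

👍 SOTA performance: Wan2.1 consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks.
👍 Supports consumer-grade GPUs: the T2V-1.3B model requires only 8.19 GB of VRAM, making it compatible with almost all consumer-grade GPUs. It can generate a 5-second 480P video on an RTX 4090 in about 4 minutes (without optimization techniques such as quantization). Its performance is even comparable to some closed-source models.
👍 Multiple tasks: Wan2.1 performs well on text-to-video, image-to-video, video editing, text-to-image, and video-to-audio tasks, advancing the field of video generation.
👍 Visual text generation: Wan2.1 is the first video model capable of generating both Chinese and English text, with robust text generation that enhances its practical value.
👍 Powerful video VAE: Wan-VAE delivers exceptional efficiency and performance, encoding and decoding 1080P videos of any length while preserving temporal information, which makes it an ideal foundation for video and image generation.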

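📚 Detailed Documentation

Video Demos

Community Works

If your work has improved Wan2.1 and you would like more people to see it, please let us know.

Phantom has developed a unified video generation framework for single-subject and multi-subject references based on Wan2.1-T2V-1.3B. Please refer to their examples.
UniAnimate-DiT has trained a human image animation model based on Wan2.1-14B-I2V and has open-sourced the inference and training code. Feel free to try it!
CFG-Zero enhances Wan2.1 (covering both T2V and I2V models) from the perspective of CFG.
TeaCache now supports Wan2.1 acceleration and can increase speed by approximately 2x. Feel free to give it a try!
DiffSynth-Studio provides more support for Wan2.1, including video-to-video, FP8 quantization, VRAM optimization, LoRA training, and more. Please refer to their examples.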

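Todo List

Wan2.1 Text-to-Video
- [x] Multi-GPU inference code for the 14B and 1.3B models
- [x] Checkpoints of the 14B and 1.3B models
- [x] Gradio demo
- [x] ComfyUI integration
- [x] Diffusers integration
- [ ] Diffusers + multi-GPU inference
Wan2.1 Image-to-Video
- [x] Multi-GPU inference code for the 14B model
- [x] Checkpoints of the 14B model
- [x] Gradio demo
- [x] ComfyUI integration
- [x] Diffusers integration
- [ ] Diffusers + multi-GPU inference
Wan2.1 First-Last-Frame-to-Video
- [x] Multi-GPU inference code for the 14B model
- [x] Checkpoints of the 14B model
- [x] Gradio demo
- [ ] ComfyUI integration
- [ ] Diffusers integration
- [ ] Diffusers + multi-GPU inference
Wan2.1 VACE
- [x] Multi-GPU inference code for the 14B and 1.3B models
- [x] Checkpoints of the 14B and 1.3B models
- [x] Gradio demo
- [x] ComfyUI integration
- [ ] Diffusers integration
- [ ] Diffusers + multi-GPU inference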

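Human Evaluation

(1) Text-to-video evaluation

In human evaluation, the results generated after prompt extension are superior to those of both closed-source and open-source models.

(2) Image-to-video evaluation

We also conducted extensive human evaluation of the image-to-video model, with the results shown in the table below. They clearly indicate that Wan2.1 outperforms both closed-source and open-source models.

Computational Efficiency on Different GPUs

We tested the computational efficiency of the different Wan2.1 models on different GPUs; the results are shown in the table below, in the format total time (s) / peak GPU memory (GB).

The parameter settings for the tests in this table are as follows: (1) for the 1.3B model on 8 GPUs, --ring_size 8 and --ulysses_size 1 are set; (2) for the 14B model on 1 GPU, --offload_model True is used; (3) for the 1.3B model on a single 4090 GPU, --offload_model True --t5_cpu is set; (4) for all tests, no prompt extension was applied, meaning --use_prompt_extend was not enabled.

💡 Usage Tip

T2V-14B is slower than I2V-14B because the former samples for 50 steps while the latter uses 40 steps.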

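Introduction to Wan2.1

Wan2.1 is designed on the mainstream diffusion transformer paradigm and achieves significant advances in generative capability through a series of innovations. These include our novel spatio-temporal variational autoencoder (VAE), scalable training strategies, large-scale data construction, and automated evaluation metrics. Together, these contributions improve the model's performance and versatility.

(1) 3D Variational Autoencoder

We propose a novel 3D causal VAE architecture, termed Wan-VAE, designed specifically for video generation. By combining multiple strategies, we improve spatio-temporal compression, reduce memory usage, and ensure temporal causality. Compared with other open-source VAEs, Wan-VAE shows significant advantages in performance efficiency. Furthermore, Wan-VAE can encode and decode 1080P videos of unlimited length without losing historical temporal information, making it particularly well suited to video generation tasks.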
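
To make the term temporal causality concrete, the sketch below applies a one-dimensional convolution along the time axis with padding on the past side only, so the output for frame t depends only on frames up to t. This is a generic illustration of causal convolution on a toy per-frame scalar sequence, not the Wan-VAE implementation; the function name and the kernel values are hypothetical.

```python
def causal_conv_time(frames, kernel):
    """Convolve a per-frame scalar sequence using left-only (causal) padding.

    Output t is a weighted sum of frames t-(k-1) .. t, never of later frames.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)  # pad the past, not the future
    return [
        sum(w * x for w, x in zip(kernel, padded[t:t + k]))
        for t in range(len(frames))
    ]


if __name__ == "__main__":
    kernel = [0.25, 0.25, 0.5]
    original = causal_conv_time([1.0, 2.0, 3.0, 4.0, 5.0], kernel)
    # Changing the last two frames leaves the first three outputs unchanged.
    modified = causal_conv_time([1.0, 2.0, 3.0, 99.0, 99.0], kernel)
    print(original[:3] == modified[:3])
```

Because earlier outputs never depend on later frames, a causal encoder can in principle process a long video chunk by chunk while carrying forward only a small amount of past context.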

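(2) Video Diffusion DiT

Wan2.1 is designed using the Flow Matching framework within the mainstream diffusion transformer paradigm. The architecture uses a T5 encoder to encode multilingual text input, and cross-attention in each transformer block embeds the text into the model structure. In addition, an MLP with a linear layer and a SiLU layer processes the input time embeddings and predicts six modulation parameters individually. This MLP is shared across all transformer blocks, with each block learning a distinct set of biases. Our experiments show that this approach significantly improves performance at the same parameter scale.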
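
The shared-modulation idea can be sketched in a few lines of NumPy: one SiLU + linear projection maps the time embedding to six modulation vectors, and each block adds its own learned bias before using them. This is an illustrative sketch under stated assumptions, not the released model code; the class name, the initialization, and the layer order (SiLU before the linear layer) are assumptions.

```python
import numpy as np


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


class SharedModulation:
    """One projection shared by all blocks, plus a per-block bias (hypothetical layout)."""

    def __init__(self, dim, num_blocks, seed=0):
        rng = np.random.default_rng(seed)
        self.dim = dim
        # Shared weights: time embedding (dim) -> six modulation vectors (6 * dim).
        self.weight = rng.standard_normal((dim, 6 * dim)) / np.sqrt(dim)
        self.bias = np.zeros(6 * dim)
        # Each block owns only a small (6, dim) bias instead of a full MLP.
        self.block_bias = rng.standard_normal((num_blocks, 6, dim)) / np.sqrt(dim)

    def shared(self, t_emb):
        """Computed once per step: (batch, dim) -> (batch, 6, dim)."""
        out = silu(t_emb) @ self.weight + self.bias
        return out.reshape(-1, 6, self.dim)

    def for_block(self, shared_out, block_idx):
        """Add the block's own bias and split into six (batch, dim) parameters."""
        mod = shared_out + self.block_bias[block_idx]
        return [mod[:, i] for i in range(6)]


if __name__ == "__main__":
    mod = SharedModulation(dim=8, num_blocks=3)
    shared_out = mod.shared(np.ones((2, 8)))
    params = mod.for_block(shared_out, block_idx=0)
    print(len(params), params[0].shape)
```

Sharing the projection means the parameter cost of modulation grows by only 6 × dim values per block rather than by a full MLP per block.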

Model	Dimension	Input Dimension	Output Dimension	Feedforward Dimension	Frequency Dimension	Number of Heads	Number of Layers
1.3B	1536	16	16	8960	256	12	30
14B	5120	16	16	13824	256	40	40

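Data

We curated and deduplicated a candidate dataset comprising a vast amount of image and video data. During data curation, we designed a four-step data cleaning process focusing on fundamental dimensions, visual quality, and motion quality. Through this robust data processing pipeline, we can readily obtain high-quality, diverse, and large-scale training sets of images and videos.

Comparisons to SOTA

We compared Wan2.1 with leading open-source and closed-source models to evaluate its performance. Using our carefully designed set of 1,035 internal prompts, we tested across 14 major dimensions and 26 sub-dimensions. We then computed the total score as a weighted combination of the per-dimension scores, with weights derived from human preferences in the matching process. The detailed results are shown in the table below. These results demonstrate that our model performs better than both open-source and closed-source models.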
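
The weighted total can be sketched as follows. The dimension names, scores, and weights in the example are hypothetical placeholders, not values from the evaluation, and normalizing by the sum of the weights is an assumption about how the total is scaled.

```python
def weighted_total(scores: dict, weights: dict) -> float:
    """Combine per-dimension scores into one total using per-dimension weights.

    Assumption: the total is normalized by the sum of the weights.
    """
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    return sum(scores[d] * weights[d] for d in scores) / total_weight


if __name__ == "__main__":
    # Hypothetical dimensions and values, for illustration only.
    scores = {"motion_quality": 0.8, "visual_quality": 0.6}
    weights = {"motion_quality": 3.0, "visual_quality": 1.0}
    print(weighted_total(scores, weights))
```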

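🔧 Technical Details

3D Variational Autoencoder

We propose a novel 3D causal VAE architecture, termed Wan-VAE, designed specifically for video generation. By combining multiple strategies, we improve spatio-temporal compression, reduce memory usage, and ensure temporal causality. Wan-VAE shows significant advantages in performance efficiency compared with other open-source VAEs, and it can encode and decode 1080P videos of unlimited length without losing historical temporal information, making it particularly well suited to video generation tasks.

Video Diffusion DiT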

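📄 License

The models in this repository are licensed under the Apache 2.0 License. We claim no rights over the content you generate and grant you the freedom to use it, provided that your usage complies with the provisions of this license. You are fully accountable for your use of the models, which must not involve sharing any content that violates applicable laws, causes harm to individuals or groups, disseminates personal information intended for harm, spreads misinformation, or targets vulnerable populations. For the complete list of restrictions and details of your rights, please refer to the full text of the license.

Acknowledgements

We would like to thank the contributors to the SD3, Qwen, umt5-xxl, diffusers and HuggingFace repositories for their open research.

Contact Us

If you would like to leave a message for our research or product teams, feel free to join our Discord or WeChat groups!

Citation

If you find our work helpful, please cite us: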

@article{wan2025,
      title={Wan: Open and Advanced Large-Scale Video Generative Models}, 
      author={Ang Wang and Baole Ai and Bin Wen and Chaojie Mao and Chen-Wei Xie and Di Chen and Feiwu Yu and Haiming Zhao and Jianxiao Yang and Jianyuan Zeng and Jiayu Wang and Jingfeng Zhang and Jingren Zhou and Jinkai Wang and Jixuan Chen and Kai Zhu and Kang Zhao and Keyu Yan and Lianghua Huang and Mengyang Feng and Ningyi Zhang and Pandeng Li and Pingyu Wu and Ruihang Chu and Ruili Feng and Shiwei Zhang and Siyang Sun and Tao Fang and Tianxing Wang and Tianyi Gui and Tingyu Weng and Tong Shen and Wei Lin and Wei Wang and Wei Wang and Wenmeng Zhou and Wente Wang and Wenting Shen and Wenyuan Yu and Xianzhong Shi and Xiaoming Huang and Xin Xu and Yan Kou and Yangyu Lv and Yifei Li and Yijing Liu and Yiming Wang and Yingya Zhang and Yitong Huang and Yong Li and You Wu and Yu Liu and Yulin Pan and Yun Zheng and Yuntao Hong and Yupeng Shi and Yutong Feng and Zeyinzi Jiang and Zhen Han and Zhi-Fan Wu and Ziyu Liu},
      journal = {arXiv preprint arXiv:2503.20314},
      year={2025}
}