Wan2.1-I2V-14B-720P-Diffusers: Open-Source Video Model with Consumer-GPU Support and Visual Text Generation

Wan2.1 I2V 14B 720P Diffusers

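Developed by grnr9730

Wan2.1 is a comprehensive, open suite of video foundation models offering state-of-the-art performance, consumer-GPU support, multi-task support, visual text generation, and an efficient video VAE.

Category: video processing · Multilingual · License: Apache-2.0 · Tags: #HD video generation #Multilingual text support #Low VRAM requirement

Downloads: 96

Released: 4/2/2025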

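Model Overview

Wan2.1 is an open and advanced large-scale video generation model. It supports image-to-video and several other tasks, and performs strongly across multiple benchmarks.

Model Features

State-of-the-Art Performance

Consistently outperforms existing open-source models and commercial solutions across multiple benchmarks.

Consumer-GPU Support

The T2V-1.3B model needs only 8.19 GB of VRAM, making it compatible with almost all consumer-grade GPUs.

Multi-Task Support

Performs well on text-to-video, image-to-video, video editing, text-to-image, and video-to-audio tasks.

Visual Text Generation

The first video model able to generate both Chinese and English text, with robust text generation capabilities.

Efficient Video VAE

Wan-VAE encodes and decodes 1080P videos of any length while preserving temporal information.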

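Model Capabilities

Image-to-Video

Text-to-Video

Video Editing

Text-to-Image

Video-to-Audio

Use Cases

Creative Content Generation

Advertising Video Generation

Generate dynamic advertising videos from static images and text descriptions.

Produces high-quality, engaging advertising content.

Social Media Content

Convert user-uploaded images into short videos.

Increases user engagement and content diversity.

Education and Training

Instructional Video Generation

Convert static diagrams in teaching materials into animated demonstration videos.

Makes teaching materials more interactive and easier to understand.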

🚀 Wan2.1

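Wan2.1 is a comprehensive and open suite of video foundation models that pushes the boundaries of video generation. It offers strong performance, runs on consumer-grade GPUs, handles multiple tasks, generates visual text, and includes a powerful video VAE.

Wan: Open and Advanced Large-Scale Video Generative Models

In this repository, we present Wan2.1, a comprehensive and open suite of video foundation models that pushes the boundaries of video generation. Wan2.1 offers these key features:

👍 SOTA performance: Wan2.1 consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks.
👍 Supports consumer-grade GPUs: The T2V-1.3B model requires only 8.19 GB of VRAM, making it compatible with almost all consumer-grade GPUs. On an RTX 4090 it can generate a 5-second 480P video in about 4 minutes (without optimization techniques such as quantization). Its performance is even comparable to some closed-source models.
👍 Multiple tasks: Wan2.1 performs well on text-to-video, image-to-video, video editing, text-to-image, and video-to-audio tasks, advancing the field of video generation.
👍 Visual text generation: Wan2.1 is the first video model capable of generating both Chinese and English text. Its robust text generation enhances its practical applications.
👍 Powerful video VAE: Wan-VAE delivers exceptional efficiency and performance. It encodes and decodes 1080P videos of any length while preserving temporal information, making it an ideal foundation for video and image generation.

This repository contains our I2V-14B model, which can generate 720P high-definition videos. After thousands of rounds of human evaluation, this model outperformed both closed-source and open-source alternatives, achieving state-of-the-art performance.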

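🎥 Video Demos

🔥 Latest News!

February 25, 2025: 👋 We released the inference code and weights of Wan2.1.

📑 Todo List

Wan2.1 Text-to-Video
- [x] Multi-GPU inference code for the 14B and 1.3B models
- [x] Checkpoints for the 14B and 1.3B models
- [x] Gradio demo
- [x] Diffusers integration
- [ ] ComfyUI integration
Wan2.1 Image-to-Video
- [x] Multi-GPU inference code for the 14B model
- [x] Checkpoints for the 14B model
- [x] Gradio demo
- [x] Diffusers integration
- [ ] ComfyUI integration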

🚀 Quick Start

📦 Installation

Clone the repository:

git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1

Install the dependencies:

# Ensure torch >= 2.4.0
pip install -r requirements.txt

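📥 Model Download

Model	Download Link	Notes
T2V-14B	🤗 Huggingface 🤖 ModelScope	Supports 480P and 720P
I2V-14B-720P	🤗 Huggingface 🤖 ModelScope	Supports 720P
I2V-14B-480P	🤗 Huggingface 🤖 ModelScope	Supports 480P
T2V-1.3B	🤗 Huggingface 🤖 ModelScope	Supports 480P

💡 Note: The 1.3B model can generate video at 720P resolution. However, because it received limited training at this resolution, the results are generally less stable than at 480P. For best performance, we recommend 480P.

Download the model with 🤗 huggingface-cli: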

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P-Diffusers --local-dir ./Wan2.1-I2V-14B-720P-Diffusers

Download the model with 🤖 modelscope-cli:

pip install modelscope
modelscope download Wan-AI/Wan2.1-I2V-14B-720P-Diffusers --local_dir ./Wan2.1-I2V-14B-720P-Diffusers
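
The same download can also be scripted from Python. The sketch below assumes the `huggingface_hub` package (installed by the pip command above) and its `snapshot_download` function. The helper names `local_dir_for` and `download_checkpoint` are hypothetical and exist only for this illustration.

```python
from pathlib import Path

REPO_ID = "Wan-AI/Wan2.1-I2V-14B-720P-Diffusers"


def local_dir_for(repo_id: str, root: str = ".") -> Path:
    """Mirror the CLI examples: store the files in a folder named after the repo."""
    return Path(root) / repo_id.split("/")[-1]


def download_checkpoint(repo_id: str = REPO_ID, root: str = ".") -> Path:
    """Download every file in the repo (a large download) and return the folder."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


# Example call (starts the download):
#   checkpoint_dir = download_checkpoint()
```

Calling `download_checkpoint()` with the defaults should leave the files in `./Wan2.1-I2V-14B-720P-Diffusers`, the same location the CLI commands use.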

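💻 Run Image-to-Video Generation

As with text-to-video, image-to-video generation can be run with or without the prompt extension step. The specific parameters and their corresponding settings are as follows: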

Task	480P	720P	Model
i2v-14B	❌	✔️	Wan2.1-I2V-14B-720P
i2v-14B	✔️	❌	Wan2.1-I2V-14B-480P

(1) Without prompt extension

Single-GPU inference

python generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

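💡 For the image-to-video task, the size parameter specifies the area of the generated video; the aspect ratio follows that of the original input image.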
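
To make the "area, not exact size" rule concrete, here is a minimal sketch of how a target area and an input aspect ratio could be turned into output dimensions. It mirrors the arithmetic in the Diffusers example in this document. The function name `target_size` and the default `multiple=16` are assumptions for illustration; the real rounding granularity depends on the model configuration.

```python
import math


def target_size(image_width: int, image_height: int,
                max_area: int = 1280 * 720, multiple: int = 16) -> tuple[int, int]:
    """Return (width, height) with roughly max_area pixels and the input's aspect ratio.

    Both sides are rounded down to a multiple of `multiple` (assumed value).
    """
    aspect_ratio = image_height / image_width
    height = round(math.sqrt(max_area * aspect_ratio)) // multiple * multiple
    width = round(math.sqrt(max_area / aspect_ratio)) // multiple * multiple
    return width, height


print(target_size(1920, 1080))  # 16:9 input  -> (1280, 720)
print(target_size(1024, 1024))  # square input -> (960, 960)
```

A square input therefore does not come out as 1280×720; it comes out as a square with about the same pixel count.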

Multi-GPU inference using FSDP + xDiT USP

pip install "xfuser>=0.4.1"
torchrun --nproc_per_node=8 generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

Wan can also be run directly with 🤗 Diffusers!

import torch
import numpy as np
from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel

# Available models: Wan-AI/Wan2.1-I2V-14B-480P-Diffusers, Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
model_id = "Wan-AI/Wan2.1-I2V-14B-720P-Diffusers"
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)
max_area = 720 * 1280
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
image = image.resize((width, height))
prompt = (
    "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in "
    "the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."
)
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

output = pipe(
    image=image, prompt=prompt, negative_prompt=negative_prompt, height=height, width=width, num_frames=81, guidance_scale=5.0
).frames[0]
export_to_video(output, "output.mp4", fps=16)
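
In the call above, `num_frames=81` exported at `fps=16` gives a clip of about five seconds (81 / 16 ≈ 5.06 s). The sketch below shows one way to pick a frame count for a different duration. It assumes the frame count should have the form 4k + 1, which is consistent with 81 = 4 × 20 + 1. The stride of 4 is an assumption that this document does not state, so check it against the pipeline documentation before relying on it. Both helper names are hypothetical.

```python
def frames_for_duration(seconds: float, fps: int = 16, temporal_stride: int = 4) -> int:
    """Return a frame count of the form stride * k + 1 close to seconds * fps."""
    raw = seconds * fps
    k = max(round((raw - 1) / temporal_stride), 0)
    return temporal_stride * k + 1


def duration_seconds(num_frames: int, fps: int = 16) -> float:
    """Playback length of a clip exported at the given frame rate."""
    return num_frames / fps


print(frames_for_duration(5))   # 81, the value used in the example above
print(duration_seconds(81))     # 5.0625
```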

(2) With prompt extension

Run with local prompt extension using Qwen/Qwen2.5-VL-7B-Instruct:

python generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --use_prompt_extend --prompt_extend_model Qwen/Qwen2.5-VL-7B-Instruct --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

Run with remote prompt extension using dashscope:

DASH_API_KEY=your_key python generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --use_prompt_extend --prompt_extend_method 'dashscope' --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

(3) Run the local Gradio demo

cd gradio
# If you use only the 480P model in Gradio
DASH_API_KEY=your_key python i2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_480p ./Wan2.1-I2V-14B-480P

# If you use only the 720P model in Gradio
DASH_API_KEY=your_key python i2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_720p ./Wan2.1-I2V-14B-720P

# If you use both the 480P and 720P models in Gradio
DASH_API_KEY=your_key python i2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_480p ./Wan2.1-I2V-14B-480P --ckpt_dir_720p ./Wan2.1-I2V-14B-720P

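👨‍⚖️ Human Evaluation

We conducted extensive human evaluation to assess the performance of the image-to-video model; the results are presented in the table below. They clearly indicate that Wan2.1 outperforms both closed-source and open-source models.

💪 Computational Efficiency on Different GPUs

We tested the computational efficiency of the different Wan2.1 models on different GPUs; the results are presented in the table below, in the format: total time (s) / peak GPU memory (GB).

The tests in this table used the following parameter settings: (1) for the 1.3B model on 8 GPUs, --ring_size 8 and --ulysses_size 1; (2) for the 14B model on 1 GPU, --offload_model True; (3) for the 1.3B model on a single 4090 GPU, --offload_model True --t5_cpu; (4) for all tests, no prompt extension was applied, that is, --use_prompt_extend was not enabled.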
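
One plausible reading of setting (1): Ulysses-style sequence parallelism splits attention heads across GPUs, so `--ulysses_size` would have to divide the head count, and the product of `--ulysses_size` and `--ring_size` would have to equal the number of GPUs. With 12 heads, the 1.3B model cannot use `--ulysses_size 8`, while the 14B model with 40 heads can. The sketch below encodes these two rules. Both rules are assumptions about how the parallel flags behave, not something this document states, and `check_parallel_config` is a hypothetical helper.

```python
# Head counts copied from the architecture table later in this document.
NUM_HEADS = {"1.3B": 12, "14B": 40}


def check_parallel_config(model: str, num_gpus: int,
                          ulysses_size: int, ring_size: int) -> list[str]:
    """Return a list of problems; an empty list means the layout looks valid."""
    problems = []
    if ulysses_size * ring_size != num_gpus:
        problems.append("ulysses_size * ring_size must equal the number of GPUs")
    if ulysses_size > 1 and NUM_HEADS[model] % ulysses_size != 0:
        problems.append("number of heads must be divisible by ulysses_size")
    return problems


print(check_parallel_config("14B", 8, ulysses_size=8, ring_size=1))   # []
print(check_parallel_config("1.3B", 8, ulysses_size=1, ring_size=8))  # []
print(check_parallel_config("1.3B", 8, ulysses_size=8, ring_size=1))  # one problem
```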

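📚 Introduction to Wan2.1

Wan2.1 is built on the mainstream diffusion transformer paradigm and achieves significant advances in generative capability through a series of innovations: our novel spatio-temporal variational autoencoder (VAE), scalable training strategies, large-scale data construction, and automated evaluation metrics. Together, these contributions improve the model's performance and versatility.

(1) 3D Variational Autoencoder

We propose a novel 3D causal VAE architecture, called Wan-VAE, designed specifically for video generation. By combining multiple strategies, we improve spatio-temporal compression, reduce memory usage, and ensure temporal causality. Compared with other open-source VAEs, Wan-VAE shows significant advantages in performance efficiency. In addition, Wan-VAE can encode and decode 1080P videos of unlimited length without losing historical temporal information, which makes it particularly well suited to video generation.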
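
To give a feel for what spatio-temporal compression means in practice, the sketch below computes the latent shape for one clip. It assumes a temporal stride of 4 (with the first frame encoded on its own, as a causal VAE would do), a spatial stride of 8, 16 latent channels (matching the input dimension of 16 in the architecture table), and a (1, 2, 2) patch size in the transformer. These values are illustrative assumptions rather than figures stated in this document, and the function names are hypothetical.

```python
def latent_shape(num_frames: int, height: int, width: int, channels: int = 16,
                 t_stride: int = 4, s_stride: int = 8) -> tuple[int, int, int, int]:
    """Return (channels, frames, height, width) of the latent under the assumed strides."""
    if (num_frames - 1) % t_stride or height % s_stride or width % s_stride:
        raise ValueError("input shape is not aligned with the assumed strides")
    return (channels, 1 + (num_frames - 1) // t_stride,
            height // s_stride, width // s_stride)


def token_count(shape: tuple[int, int, int, int],
                patch: tuple[int, int, int] = (1, 2, 2)) -> int:
    """Number of transformer tokens after patchifying the latent."""
    _, frames, height, width = shape
    return (frames // patch[0]) * (height // patch[1]) * (width // patch[2])


shape = latent_shape(81, 720, 1280)
print(shape)               # (16, 21, 90, 160)
print(token_count(shape))  # 75600
```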

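(2) Video Diffusion DiT

Wan2.1 is designed using the Flow Matching framework within the mainstream diffusion transformer paradigm. The architecture uses a T5 encoder to encode multilingual text input, and cross-attention in each transformer block embeds the text into the model structure. In addition, an MLP consisting of a linear layer and a SiLU layer processes the input time embeddings and predicts six modulation parameters individually. This MLP is shared across all transformer blocks, with each block learning its own distinct set of biases. Our experiments show that, at the same parameter scale, this approach yields a significant performance improvement.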
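
The shared-MLP idea can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the actual Wan2.1 implementation: the class names are hypothetical, the layer order (SiLU followed by a linear layer) is a guess, and the text does not specify how the six outputs are used (in similar designs they act as shift, scale, and gate for the attention and feed-forward sub-layers).

```python
import torch
from torch import nn


class SharedTimeProjection(nn.Module):
    """One MLP shared by all blocks: time embedding -> six modulation vectors."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, t_emb: torch.Tensor) -> torch.Tensor:
        # (batch, dim) -> (batch, 6, dim)
        return self.net(t_emb).unflatten(1, (6, -1))


class BlockModulation(nn.Module):
    """Per-block learned bias added to the shared modulation."""

    def __init__(self, dim: int):
        super().__init__()
        self.bias = nn.Parameter(torch.randn(1, 6, dim) / dim ** 0.5)

    def forward(self, shared: torch.Tensor):
        # Returns six tensors of shape (batch, dim).
        return (shared + self.bias).unbind(dim=1)


torch.manual_seed(0)
dim, num_blocks = 8, 3
shared_mlp = SharedTimeProjection(dim)
blocks = [BlockModulation(dim) for _ in range(num_blocks)]

shared = shared_mlp(torch.randn(2, dim))          # computed once per denoising step
per_block = [block(shared) for block in blocks]   # cheap per-block adjustment
```

The design point is that the expensive projection runs once per step, while each block only adds a small bias tensor of its own.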

Model	Dimension	Input Dimension	Output Dimension	Feedforward Dimension	Frequency Dimension	Number of Heads	Number of Layers
1.3B	1536	16	16	8960	256	12	30
14B	5120	16	16	13824	256	40	40
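
As a sanity check on the table, the sketch below estimates the transformer parameter count from these dimensions. It assumes each block holds a self-attention and a cross-attention layer with four dim × dim projections each, plus a two-layer feed-forward network, and it ignores biases, norms, embeddings, and the modulation MLP. It is a rough back-of-the-envelope estimate under those assumptions, not an official figure.

```python
# Values copied from the table above.
CONFIGS = {
    "1.3B": {"dim": 1536, "ffn_dim": 8960, "heads": 12, "layers": 30},
    "14B": {"dim": 5120, "ffn_dim": 13824, "heads": 40, "layers": 40},
}


def estimate_block_params(dim: int, ffn_dim: int) -> int:
    """Weights in one block: two attention layers (q, k, v, o each) plus the FFN."""
    attention = 2 * 4 * dim * dim
    feed_forward = 2 * dim * ffn_dim
    return attention + feed_forward


def estimate_total_params(cfg: dict) -> int:
    return cfg["layers"] * estimate_block_params(cfg["dim"], cfg["ffn_dim"])


for name, cfg in CONFIGS.items():
    total = estimate_total_params(cfg)
    head_dim = cfg["dim"] // cfg["heads"]
    print(f"{name}: ~{total / 1e9:.2f}B block parameters, head dim {head_dim}")
# 1.3B: ~1.39B block parameters, head dim 128
# 14B: ~14.05B block parameters, head dim 128
```

Under these assumptions both estimates land close to the nominal model sizes, and both models use the same per-head dimension of 128.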

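Data

We curated and deduplicated a candidate dataset comprising a vast amount of image and video data. During curation, we designed a four-step data cleaning process focused on fundamental dimensions, visual quality, and motion quality. This robust data processing pipeline lets us readily obtain high-quality, diverse, and large-scale training sets of images and videos.

Comparison with SOTA

We compared Wan2.1 with leading open-source and closed-source models to evaluate its performance. Using a carefully designed set of 1,035 internal prompts, we tested across 14 major dimensions and 26 sub-dimensions. We then computed a total score as a weighted combination of the per-dimension scores, with weights derived from human preferences in the matching process. The detailed results are shown in the table below. These results demonstrate our model's superior performance compared with both open-source and closed-source models.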
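
The weighted total described above can be sketched as a normalized weighted average. Dividing by the sum of the weights is an assumption (the text only says the scores are weighted), the function name is hypothetical, and the dimension names and numbers in the usage lines are placeholders rather than evaluation results.

```python
def weighted_total(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension scores into one total using the given weights."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Placeholder values for illustration only.
example_scores = {"dimension_a": 0.8, "dimension_b": 0.6}
example_weights = {"dimension_a": 3.0, "dimension_b": 1.0}
print(round(weighted_total(example_scores, example_weights), 4))  # 0.75
```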

📝 Citation

If you find our work helpful, please cite us:

@article{wan2.1,
    title   = {Wan: Open and Advanced Large-Scale Video Generative Models},
    author  = {Wan Team},
    journal = {},
    year    = {2025}
}

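📄 License

The models in this repository are licensed under the Apache 2.0 License. We claim no rights over the content you generate and grant you the freedom to use it, provided that your use complies with the provisions of this license. You are fully responsible for your use of the models, which must not involve sharing any content that violates applicable laws, causes harm to individuals or groups, disseminates personal information intended to cause harm, spreads misinformation, or targets vulnerable populations. For the complete list of restrictions and details of your rights, please refer to the full text of the license.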