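LTX-Video: an open-source video generation model that produces high-quality video in real time and supports both text-to-video and image+text-to-video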

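LTX Video

Developed by Lightricks

The first DiT-based video generation model, capable of generating high-quality video in real time. It supports two modes: text-to-video and image+text-to-video.

Text-to-video | English | License: other | #high-resolution video generation #real-time rendering #DiT architecture

Downloads: 165.42k

Released: 10/31/2024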

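Model overview

LTX-Video is the first DiT-based video generation model. It generates high-quality 30 FPS video at 1216×704 resolution. The model is trained on a large-scale dataset of diverse videos and produces high-resolution videos with realistic and varied content.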

Model features

Real-time video generation

Generates high-resolution 30 FPS video faster than the video can be watched.
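"Faster than it can be watched" means the time spent generating a clip is shorter than the clip's playback time. The sketch below makes that arithmetic explicit. The function names and the 2.0 s generation time are hypothetical, chosen only for illustration; they are not measurements from the model.

```python
def playback_seconds(num_frames: int, fps: float) -> float:
    """Length of the clip, in seconds, when played back at `fps`."""
    return num_frames / fps


def realtime_factor(num_frames: int, fps: float, generation_seconds: float) -> float:
    """Playback time divided by generation time; above 1 means faster than real time."""
    return playback_seconds(num_frames, fps) / generation_seconds


# 121 frames at 30 FPS is a little over 4 seconds of video.
clip = playback_seconds(121, 30)

# Hypothetical timing: if generating those frames took 2.0 s,
# the clip was produced roughly twice as fast as it plays back.
factor = realtime_factor(121, 30, generation_seconds=2.0)
print(f"{clip:.2f} s of video, real-time factor {factor:.2f}")
```

Actual generation time depends on the model variant, resolution, number of inference steps, and hardware, so the factor should be measured rather than assumed.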

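High-quality output

Generates high-quality video at 1216×704 resolution with realistic and varied content.

Multiple modes

Supports both text-to-video and image+text-to-video.

Diverse training data

Trained on a large-scale dataset of diverse videos, which allows it to generate varied video content.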

Model capabilities

Text-to-video

Image+text-to-video

High-resolution video generation

Real-time video generation

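Use cases

Film and television production

Film clip generation

Generate film- or TV-style video clips from a script description.

Produces cinematic clips, such as the prison guard scene and the scene of a woman with a sad expression in the examples.

Advertising

Advertising video generation

Generate advertising videos from a product description.

Produces high-quality showcase videos, such as the cityscape and river scenes in the examples.

Education

Instructional video generation

Generate educational videos from teaching material.

Produces clear, vivid instructional videos, such as the natural landscape and cityscape scenes in the examples.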

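🚀 LTX-Video Model Card

LTX-Video is the first DiT-based video generation model capable of generating high-quality video in real time. It produces 30 FPS video at 1216×704 resolution faster than the video can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content. We provide models for both text-to-video and image+text-to-video use cases. The codebase is available at https://github.com/Lightricks/LTX-Video.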

Sample generations from the model


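A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the face of the woman with brown hair. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage.

A woman gets out of a white Jeep parked on a city street at night, then walks up a staircase and knocks on a door. The woman, wearing a dark jacket and jeans, steps out of the Jeep parked on the left side of the street with her back to the camera; she walks at a steady pace, her arms swinging slightly at her sides; the street is dimly lit, with streetlights casting pools of light on the wet pavement; a man in a dark jacket and jeans walks past the Jeep in the opposite direction; the camera follows the woman from behind as she climbs a set of stairs toward a building with a green door; she reaches the top of the stairs, turns left, and continues toward the building; she reaches the door and knocks with her right hand; the camera remains stationary, focused on the doorway; the scene is real-life footage.

A woman with a blonde bun, wearing a black sequined dress and pearl earrings, looks down with a sad expression on her face. The camera remains stationary, focused on the woman's face. The lighting is dim, casting soft shadows on her face. The scene appears to be from a movie or TV show.

The camera pans over a snow-covered mountain range, revealing a vast expanse of snowy peaks and valleys. The mountains are covered in a thick layer of snow, with some areas appearing almost white and others slightly gray. The peaks are jagged and irregular, some rising sharply into the sky and others more rounded. The valleys are deep and narrow, with steep slopes that are also covered in snow. The trees in the foreground are mostly bare, with only a few leaves remaining on their branches. The sky is overcast, with thick clouds obscuring the sun. The overall impression is one of peace and tranquility, with the snow-covered mountains standing as a testament to the power and beauty of nature.

A woman with light skin, wearing a blue jacket and a black hat with a veil, looks down and to her right, then raises her head as she speaks. She has brown hair styled in a bun, light brown eyebrows, and a white collared shirt under her jacket; the camera stays on her face as she speaks; the background is slightly out of focus but shows trees and people in period clothing; the scene is real-life footage.

A man in a dimly lit room talks on a vintage telephone, hangs up, and looks down with a sad expression. He holds the black rotary phone to his right ear with his right hand, while his left hand holds a rocks glass of amber liquid. He wears a brown suit jacket over a white shirt and a gold ring on his left ring finger. His short hair is neatly combed, his skin is light, and he has visible wrinkles around his eyes. The camera remains stationary, focused on his face and upper body. The room is dark, lit only by a warm light source off-screen to the left, which casts shadows on the wall behind him. The scene appears to be from a movie.

A prison guard opens a cell door to find a young man and a woman sitting at a table. The guard, wearing a dark blue uniform with a badge on his left chest, unlocks the cell door with a key held in his right hand and pulls it open; he has short brown hair, light skin, and a neutral expression. The young man, wearing a black and white striped shirt, sits at a table covered with a white tablecloth, facing the woman; he has short brown hair, light skin, and a neutral expression. The woman, wearing a dark blue shirt, sits opposite the young man, her face turned toward him; she has short blonde hair and light skin. The camera remains stationary, capturing the scene from a medium distance, positioned slightly to the right. The room is dimly lit, with a single light fixture illuminating the table and the two figures. The walls are made of large gray concrete blocks, and a metal door is visible in the background. The scene is real-life footage.

A woman with blood on her face, wearing a white tank top, looks down and to her right, then raises her head as she speaks. She has dark hair pulled back and light skin, and her face and chest are covered in blood. The camera angle is a close-up, focused on the woman's face and upper body. The lighting is dim and blue-toned, creating a somber and tense atmosphere. The scene appears to be from a movie or TV show.

A man with graying hair, a beard, and a gray shirt looks down and to his right, then turns his head to the left. The camera angle is a close-up, focused on the man's face. The lighting is dim, with a greenish tint. The scene appears to be real-life footage.

A clear, turquoise river flows through a rocky canyon, cascading over a small waterfall and forming a pool at the bottom. The river is the main focus of the scene, its clear water reflecting the surrounding trees and rocks. The canyon walls are steep and rocky, with some vegetation growing on them. The trees are mostly pines, their green needles contrasting with the brown and gray rocks. The overall tone of the scene is one of peace and tranquility.

A man in a suit enters a room and speaks to two women sitting on a couch. The man, wearing a dark suit with a gold tie, enters from the left and walks toward the center of the frame. He has short gray hair, light skin, and a serious expression. He places his right hand on the back of a chair as he approaches the couch. In the background, two women sit on a light-colored couch. The woman on the left wears a light blue sweater and has short blonde hair. The woman on the right wears a white sweater and has short blonde hair. The camera remains stationary, focusing on the man as he enters the room. The room is brightly lit, with warm tones reflecting off the walls and furniture. The scene appears to be from a movie or TV show.

The waves crash against the dark, jagged rocks of the shoreline, sending white foam spraying into the air. The rocks are dark gray, with sharp edges and deep crevices. The water is a clear blue-green, with white foam where the waves break against the rocks. The sky is light gray, with a few white clouds dotting the horizon.

The camera pans from left to right across a cityscape with a circular building, showing the tops of the buildings and the circular building at the center. The buildings are various shades of gray and white, and the circular building has a green roof. The camera angle is high, looking down at the city. The lighting is bright, with the sun shining from the upper left and the buildings casting shadows. The scene is computer-generated imagery.

A man walks toward a window, looks out, and then turns around. He has short dark hair and dark skin, and wears a brown coat over a red and gray scarf. He walks from left to right toward the window, his gaze fixed on something outside. The camera follows him from behind at a medium distance. The room is brightly lit, with white walls and a large window covered by a white curtain. As he approaches the window, he turns his head slightly to the left, then back to the right. He then turns his entire body to the right, facing the window. The camera remains stationary as he stands in front of the window. The scene is real-life footage.

Two police officers in dark blue uniforms and matching hats enter a dimly lit room through a doorway on the left side of the frame. The first officer, with short brown hair and a mustache, steps inside first, followed by his partner, who has a shaved head and a goatee. Both officers have serious expressions and walk at a steady pace deeper into the room. The camera remains stationary, capturing them from a slightly low angle as they enter. The room has exposed brick walls and a corrugated metal ceiling, with a barred window visible in the background. The lighting is low, casting shadows on the officers' faces and emphasizing the grim atmosphere. The scene appears to be from a movie or TV show.

A woman with short brown hair, wearing a maroon sleeveless top and a silver necklace, walks through a room while talking, then a woman with pink hair and a white shirt appears in the doorway and yells. The first woman walks from left to right, her expression serious; she has light skin and her eyebrows are slightly furrowed. The second woman stands in the doorway, her mouth open in a yell; she has light skin and her eyes are wide. The room is dimly lit, with a bookshelf visible in the background. The camera follows the first woman as she walks, then cuts to a close-up of the second woman's face. The scene is real-life footage.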

🚀 Quick Start

Models and workflows

Name	Notes	inference.py config	ComfyUI workflow (recommended)
ltxv-13b-0.9.7-dev	Highest quality, requires more VRAM	ltxv-13b-0.9.7-dev.yaml	ltxv-13b-i2v-base.json
ltxv-13b-0.9.7-mix	Mixes ltxv-13b-dev and ltxv-13b-distilled in the same multi-scale rendering workflow to balance speed and quality	N/A	ltxv-13b-i2v-mixed-multiscale.json
ltxv-13b-0.9.7-distilled	Faster, uses less VRAM, slight quality reduction compared to 13b. Suited to rapid iteration	ltxv-13b-0.9.7-distilled.yaml	ltxv-13b-dist-i2v-base.json
ltxv-13b-0.9.7-distilled-lora128	LoRA that makes ltxv-13b-dev behave like the distilled model	N/A	N/A
ltxv-13b-0.9.7-fp8	Quantized version of ltxv-13b	Coming soon	ltxv-13b-i2v-base-fp8.json
ltxv-13b-0.9.7-distilled-fp8	Quantized version of ltxv-13b-distilled	Coming soon	ltxv-13b-dist-i2v-base-fp8.json
ltxv-2b-0.9.6	Good quality, requires less VRAM than ltxv-13b	ltxv-2b-0.9.6-dev.yaml	ltxvideo-i2v.json
ltxv-2b-0.9.6-distilled	15× faster, real-time capable, needs fewer steps, no STG/CFG required	ltxv-2b-0.9.6-distilled.yaml	ltxvideo-i2v-distilled.json
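When scripting against several variants, the table's inference.py column can be kept as a small lookup. This is an illustrative sketch only: `INFERENCE_CONFIGS` and `pipeline_config_path` are hypothetical names, the entries are transcribed from the table, and the `configs/` directory is assumed from the inference commands shown later on this page.

```python
from typing import Optional

# Transcribed from the table; None marks variants listed as "N/A" or "Coming soon".
INFERENCE_CONFIGS: dict[str, Optional[str]] = {
    "ltxv-13b-0.9.7-dev": "ltxv-13b-0.9.7-dev.yaml",
    "ltxv-13b-0.9.7-mix": None,
    "ltxv-13b-0.9.7-distilled": "ltxv-13b-0.9.7-distilled.yaml",
    "ltxv-13b-0.9.7-distilled-lora128": None,
    "ltxv-13b-0.9.7-fp8": None,
    "ltxv-13b-0.9.7-distilled-fp8": None,
    "ltxv-2b-0.9.6": "ltxv-2b-0.9.6-dev.yaml",
    "ltxv-2b-0.9.6-distilled": "ltxv-2b-0.9.6-distilled.yaml",
}


def pipeline_config_path(model: str, config_dir: str = "configs") -> str:
    """Return the config path for `model`, or raise if the table lists none."""
    if model not in INFERENCE_CONFIGS:
        raise KeyError(f"unknown model variant: {model}")
    config = INFERENCE_CONFIGS[model]
    if config is None:
        raise ValueError(f"no inference.py config is listed for {model}")
    return f"{config_dir}/{config}"


print(pipeline_config_path("ltxv-13b-0.9.7-distilled"))
```

Raising for the variants without a config keeps a script from silently falling back to a different model.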

Model details

Property	Details
Developed by	Lightricks
Model type	Diffusion-based text-to-video and image-to-video generation model
Language(s)	English

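Usage

Direct use

You can use the model for purposes permitted under its license:

2B version 0.9: license
2B version 0.9.1: license
2B version 0.9.5: license
2B version 0.9.6-dev: license
2B version 0.9.6-distilled: license
13B version 0.9.7-dev: license
13B version 0.9.7-dev-fp8: license
13B version 0.9.7-distilled: license
13B version 0.9.7-distilled-fp8: license
13B version 0.9.7-distilled-lora128: license
Temporal upscaler version 0.9.7: license
Spatial upscaler version 0.9.7: license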

General tips

⚠️ Important

The model works on resolutions divisible by 32 and on frame counts of the form 8k + 1 (for example, 257). If the resolution is not divisible by 32 or the frame count is not of the form 8k + 1, the input is padded with -1 and then cropped to the requested resolution and number of frames.

The model works best at resolutions under 720 x 1280 and with fewer than 257 frames.

Prompts should be in English, and the more detailed the better. For example: The turquoise waves crash against the dark, jagged rocks of the shore, sending white foam spraying into the air. The scene is dominated by the stark contrast between the bright blue water and the dark, almost black rocks. The water is a clear, turquoise color, and the waves are capped with white foam. The rocks are dark and jagged, and they are covered in patches of green moss. The shore is lined with lush green vegetation, including trees and bushes. In the background, there are rolling hills covered in dense forest. The sky is cloudy, and the light is dim.
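The size constraints in the first tip (resolution divisible by 32, frame count of the form 8k + 1) can be checked before launching a run. The helpers below are a minimal sketch with hypothetical names. They round down to the nearest valid value, which is one possible policy and not what the model's own scripts do (those pad and crop, as described above).

```python
def snap_resolution(value: int, multiple: int = 32) -> int:
    """Round `value` down to a multiple of `multiple` (never below one multiple)."""
    return max(multiple, value - value % multiple)


def snap_num_frames(num_frames: int, stride: int = 8) -> int:
    """Round `num_frames` down to the nearest count of the form stride * k + 1."""
    return max(1, (num_frames - 1) // stride * stride + 1)


def is_valid_request(height: int, width: int, num_frames: int) -> bool:
    """True when the request already satisfies both constraints."""
    return height % 32 == 0 and width % 32 == 0 and num_frames % 8 == 1


# 720 is not divisible by 32, so it snaps down to 704; 1280 is already valid.
print(snap_resolution(720), snap_resolution(1280))
# 100 frames snaps down to 97 (8 * 12 + 1); 257 is already valid.
print(snap_num_frames(100), snap_num_frames(257))
```

Snapping up front makes the effective resolution and clip length explicit instead of leaving them to padding and cropping.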

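Online demo

The model is available to try right away via the following links:

ComfyUI

To use our model with ComfyUI, follow the instructions in the ComfyUI repository.

Run locally

Installation

The codebase was tested with Python 3.10.5 and CUDA 12.2, and supports PyTorch >= 2.1.2.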

git clone https://github.com/Lightricks/LTX-Video.git
cd LTX-Video

# Create a virtual environment
python -m venv env
source env/bin/activate
python -m pip install -e .\[inference-script\]

Inference

To use our model, refer to the inference code in inference.py:

Text-to-video generation:

python inference.py --prompt "PROMPT" --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config configs/ltxv-13b-0.9.7-distilled.yaml

Image-to-video generation:

python inference.py --prompt "PROMPT" --input_image_path IMAGE_PATH --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config configs/ltxv-13b-0.9.7-distilled.yaml
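The two commands differ only in the `--input_image_path` flag, so both can be assembled from one helper when launching runs from a script. The sketch below only builds the argument list; `build_inference_command` is a hypothetical helper, and it uses only the flags that appear in the commands above.

```python
import shlex
from typing import Optional


def build_inference_command(
    prompt: str,
    height: int,
    width: int,
    num_frames: int,
    seed: int,
    pipeline_config: str,
    input_image_path: Optional[str] = None,
) -> list[str]:
    """Assemble the argv list for inference.py; adds the image flag only when given."""
    cmd = ["python", "inference.py", "--prompt", prompt]
    if input_image_path is not None:
        cmd += ["--input_image_path", input_image_path]
    cmd += [
        "--height", str(height),
        "--width", str(width),
        "--num_frames", str(num_frames),
        "--seed", str(seed),
        "--pipeline_config", pipeline_config,
    ]
    return cmd


cmd = build_inference_command(
    prompt="A clear, turquoise river flows through a rocky canyon",
    height=704,
    width=1216,
    num_frames=121,
    seed=0,
    pipeline_config="configs/ltxv-13b-0.9.7-distilled.yaml",
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Passing a list rather than a shell string avoids quoting problems with long prompts.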

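Diffusers 🧨

LTX Video is compatible with the Diffusers Python library and supports both text-to-video and image-to-video generation. Make sure diffusers is installed before trying the examples below: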

pip install -U git+https://github.com/huggingface/diffusers

💻 Usage examples

Basic usage

Text-to-video:

import torch
from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
from diffusers.utils import export_to_video

pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")
pipe_upsample.to("cuda")
pipe.vae.enable_tiling()

prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
expected_height, expected_width = 704, 512
downscale_factor = 2 / 3
num_frames = 121

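def round_to_nearest_resolution_acceptable_by_vae(height, width):
    # Round down to a multiple of the VAE's spatial compression ratio
    height = height - (height % pipe.vae_spatial_compression_ratio)
    width = width - (width % pipe.vae_spatial_compression_ratio)
    return height, width

# Part 1. Generate video at smaller resolution
downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)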
latents = pipe(
    conditions=None,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=downscaled_width,
    height=downscaled_height,
    num_frames=num_frames,
    num_inference_steps=30,
    generator=torch.Generator().manual_seed(0),
    output_type="latent",
).frames

# Part 2. Upscale generated video using latent upsampler with fewer inference steps
# The available latent upsampler upscales the height/width by 2x
upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
upscaled_latents = pipe_upsample(
    latents=latents,
    output_type="latent"
).frames

# Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=upscaled_width,
    height=upscaled_height,
    num_frames=num_frames,
    denoise_strength=0.4,  # Effectively, 4 inference steps out of 10
    num_inference_steps=10,
    latents=upscaled_latents,
    decode_timestep=0.05,
    image_cond_noise_scale=0.025,
    generator=torch.Generator().manual_seed(0),
    output_type="pil",
).frames[0]

# Part 4. Downscale the video to the expected resolution
video = [frame.resize((expected_width, expected_height)) for frame in video]

export_to_video(video, "output.mp4", fps=24)
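The resolution bookkeeping in the example above (downscale by 2/3, upsample latents by 2x, resize to the requested size) can be traced without a GPU. The sketch below is illustrative: `trace_resolutions` and `effective_steps` are hypothetical helpers, and rounding the downscaled size down to a multiple of 32 is an assumption based on the divisibility rule in the general tips. The step calculation mirrors the inline comment in the example, not the pipeline's internal rule.

```python
def trace_resolutions(expected_height, expected_width,
                      downscale_factor=2 / 3, multiple=32, upsample=2):
    """Return the (height, width) used at each stage of the multi-scale example."""
    down_h = int(expected_height * downscale_factor)
    down_w = int(expected_width * downscale_factor)
    # Assumption: round down so both sides are divisible by `multiple`.
    down_h -= down_h % multiple
    down_w -= down_w % multiple
    return {
        "downscaled": (down_h, down_w),
        "upscaled": (down_h * upsample, down_w * upsample),
        "final": (expected_height, expected_width),
    }


def effective_steps(num_inference_steps: int, denoise_strength: float) -> int:
    """Approximate number of denoising steps actually run in the refinement pass."""
    return round(num_inference_steps * denoise_strength)


stages = trace_resolutions(704, 512)
print(stages)
print(effective_steps(10, 0.4))
```

For a 704 x 512 request, the first pass runs at 448 x 320, the upsampled latents correspond to 896 x 640, and the final frames are resized back to 704 x 512.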

Image-to-video:

import torch
from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
from diffusers.utils import export_to_video, load_image

pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")
pipe_upsample.to("cuda")
pipe.vae.enable_tiling()

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png")
video = [image]
condition1 = LTXVideoCondition(video=video, frame_index=0)

prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
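expected_height, expected_width = 832, 480
downscale_factor = 2 / 3
# Note: 96 is not of the form 8k + 1 recommended in the general tips; 97 would satisfy it
num_frames = 96

def round_to_nearest_resolution_acceptable_by_vae(height, width):
    # Round down to a multiple of the VAE's spatial compression ratio
    height = height - (height % pipe.vae_spatial_compression_ratio)
    width = width - (width % pipe.vae_spatial_compression_ratio)
    return height, width

# Part 1. Generate video at smaller resolution
downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)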
latents = pipe(
    conditions=[condition1],
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=downscaled_width,
    height=downscaled_height,
    num_frames=num_frames,
    num_inference_steps=30,
    generator=torch.Generator().manual_seed(0),
    output_type="latent",
).frames

# Part 2. Upscale generated video using latent upsampler with fewer inference steps
# The available latent upsampler upscales the height/width by 2x
upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
upscaled_latents = pipe_upsample(
    latents=latents,
    output_type="latent"
).frames

# Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
video = pipe(
    conditions=[condition1],
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=upscaled_width,
    height=upscaled_height,
    num_frames=num_frames,
    denoise_strength=0.4,  # Effectively, 4 inference steps out of 10
    num_inference_steps=10,
    latents=upscaled_latents,
    decode_timestep=0.05,
    image_cond_noise_scale=0.025,
    generator=torch.Generator().manual_seed(0),
    output_type="pil",
).frames[0]

# Part 4. Downscale the video to the expected resolution
video = [frame.resize((expected_width, expected_height)) for frame in video]

export_to_video(video, "output.mp4", fps=24)

Video-to-video:

import torch
from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
from diffusers.utils import export_to_video, load_video

pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")
pipe_upsample.to("cuda")
pipe.vae.enable_tiling()

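def round_to_nearest_resolution_acceptable_by_vae(height, width):
    # Height and width are spatial dimensions, so round down to a multiple
    # of the VAE's spatial (not temporal) compression ratio
    height = height - (height % pipe.vae_spatial_compression_ratio)
    width = width - (width % pipe.vae_spatial_compression_ratio)
    return height, width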

video = load_video(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
)[:21]  # Use only the first 21 frames as conditioning
condition1 = LTXVideoCondition(video=video, frame_index=0)

prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
expected_height, expected_width = 768, 1152
downscale_factor = 2 / 3
num_frames = 161

# Part 1. Generate video at smaller resolution
downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
latents = pipe(
    conditions=[condition1],
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=downscaled_width,
    height=downscaled_height,
    num_frames=num_frames,
    num_inference_steps=30,
    generator=torch.Generator().manual_seed(0),
    output_type="latent",
).frames

# Part 2. Upscale generated video using latent upsampler with fewer inference steps
# The available latent upsampler upscales the height/width by 2x
upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
upscaled_latents = pipe_upsample(
    latents=latents,
    output_type="latent"
).frames

# Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
video = pipe(
    conditions=[condition1],
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=upscaled_width,
    height=upscaled_height,
    num_frames=num_frames,
    denoise_strength=0.4,  # Effectively, 4 inference steps out of 10
    num_inference_steps=10,
    latents=upscaled_latents,
    decode_timestep=0.05,
    image_cond_noise_scale=0.025,
    generator=torch.Generator().manual_seed(0),
    output_type="pil",
).frames[0]

# Part 4. Downscale the video to the expected resolution
video = [frame.resize((expected_width, expected_height)) for frame in video]

export_to_video(video, "output.mp4", fps=24)