LTX-Video open-source video generation model - real-time, high-quality video generation for text-to-video and image+text-to-video

LTX Video

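Developed by Lightricks

The first DiT-based video generation model that can generate high-quality videos in real time. It supports two scenarios: text-to-video and image+text-to-video.

Task: text-to-video · Language: English · License: other · #high-resolution-video-generation #real-time-rendering #DiT-architecture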

Downloads: 165.42k

Released: 10/31/2024

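Model overview

LTX-Video is the first DiT-based video generation model. It produces high-quality 30 FPS videos at a 1216×704 resolution. Trained on a large-scale dataset of diverse videos, it generates high-resolution videos with realistic and varied content.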

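Model features

Real-time video generation

Generates high-resolution 30 FPS videos faster than it takes to watch them.

High-quality output

Produces high-quality 1216×704 videos with realistic and varied content.

Multiple scenarios

Supports both text-to-video and image+text-to-video use cases.

Diverse training data

Trained on a large-scale dataset of diverse videos, so it can generate a wide range of video content.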

Model capabilities

Text-to-video

Image+text-to-video

High-resolution video generation

Real-time video generation

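Use cases

Film and TV production

Movie clip generation

Generate movie- or TV-style video clips from script descriptions.

Produces cinematic clips, such as the prison guard scene and the scene of the woman with a sad expression in the examples.

Advertising

Ad video generation

Generate advertising videos from product descriptions.

Produces high-quality showcase videos, such as the cityscape and river scenes in the examples.

Education

Instructional video generation

Generate educational videos from teaching material.

Produces clear, vivid instructional videos, such as the natural landscape and cityscape scenes in the examples.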

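🚀 LTX-Video model card

LTX-Video is the first DiT-based video generation model capable of generating high-quality videos in real time. It produces 30 FPS videos at a 1216×704 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content. We provide models for both text-to-video and image+text-to-video use cases. The codebase is available at https://github.com/Lightricks/LTX-Video.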

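Example generations

Prompts used for the model's sample generations:

A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the brown-haired woman's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage.
A woman gets out of a white Jeep parked on a city street at night, then walks up a flight of stairs and knocks on a door. The woman, wearing a dark jacket and jeans, gets out of the Jeep parked on the left side of the street with her back to the camera; she walks at a steady pace, her arms swinging slightly at her sides; the street is dimly lit, with streetlights casting pools of light on the wet pavement; a man in a dark jacket and jeans walks past the Jeep in the opposite direction; the camera follows the woman from behind as she climbs a set of stairs toward a building with a green door; she reaches the top of the stairs and turns left, continuing toward the building; she reaches the door and knocks with her right hand; the camera remains stationary, focused on the doorway; the scene is real-life footage.
A woman with a blonde bun, wearing a black sequined dress and pearl earrings, looks down with a sad expression on her face. The camera remains stationary, focused on the woman's face. The lighting is dim, casting soft shadows on her face. The scene appears to be from a movie or TV show.
The camera pans over a snow-covered mountain range, revealing a vast expanse of snowy peaks and valleys. The mountains are covered in a thick layer of snow, with some areas appearing almost white while others have a slightly grayish tint. The peaks are jagged and irregular, with some rising sharply into the sky while others are more rounded. The valleys are deep and narrow, with steep slopes that are also covered in snow. The trees in the foreground are mostly bare, with only a few leaves remaining on their branches. The sky is overcast, with thick clouds obscuring the sun. The overall impression is one of peace and tranquility, with the snow-covered mountains standing as a testament to the power and beauty of nature.
A woman with light skin, wearing a blue jacket and a black hat with a veil, looks down and to her right, then looks back up as she speaks. She has brown hair styled in a bun, light brown eyebrows, and wears a white collared shirt under her jacket; the camera stays on her face as she speaks; the background is out of focus, but shows trees and people in period clothing; the scene is real-life footage.
A man in a dimly lit room talks on a vintage telephone, hangs up, and looks down with a sad expression. He holds the black rotary phone to his right ear with his right hand, while his left hand holds a rocks glass with amber liquid. He wears a brown suit jacket over a white shirt, and a gold ring on his left ring finger. His short hair is neatly combed, and he has light skin with visible wrinkles around his eyes. The camera remains stationary, focused on his face and upper body. The room is dark, lit only by a warm light source off-screen to the left, casting shadows on the wall behind him. The scene appears to be from a movie.
A prison guard opens a cell door to reveal a young man and a woman sitting at a table. The guard, wearing a dark blue uniform with a badge on his left chest, unlocks the cell door with a key held in his right hand and pulls it open; he has short brown hair, light skin, and a neutral expression. The young man, wearing a black and white striped shirt, sits at a table covered with a white tablecloth, facing the woman; he has short brown hair, light skin, and a neutral expression. The woman, wearing a dark blue shirt, sits across from the young man, her face turned toward him; she has short blonde hair and light skin. The camera remains stationary, capturing the scene from a medium distance, positioned slightly to the right. The room is dimly lit, with a single light fixture illuminating the table and the two figures. The walls are made of large, gray concrete blocks, and a metal door is visible in the background. The scene is real-life footage.
A woman with blood on her face and a white tank top looks down and to her right, then back up as she speaks. She has dark hair pulled back, light skin, and her face and chest are covered in blood. The camera angle is a close-up, focused on the woman's face and upper body. The lighting is dim and blue-toned, creating a somber and tense atmosphere. The scene appears to be from a movie or TV show.
A man with graying hair, a beard, and a gray shirt looks down and to his right, then turns his head to the left. The camera angle is a close-up, focused on the man's face. The lighting is dim, with a greenish tint. The scene appears to be real-life footage.
A clear, turquoise river flows through a rocky canyon, cascading over a small waterfall and forming a pool of water at the bottom. The river is the main focus of the scene, with its clear water reflecting the surrounding trees and rocks. The canyon walls are steep and rocky, with some vegetation growing on them. The trees are mostly pines, with their green needles contrasting with the brown and gray rocks. The overall tone of the scene is one of peace and tranquility.
A man in a suit enters a room and speaks to two women sitting on a couch. The man, wearing a dark suit with a gold tie, enters the room from the left and walks toward the center of the frame. He has short gray hair, light skin, and a serious expression. He places his right hand on the back of a chair as he approaches the couch. Two women are seated on a light-colored couch in the background. The woman on the left wears a light blue sweater and has short blonde hair. The woman on the right wears a white sweater and has short blonde hair. The camera remains stationary, focusing on the man as he enters the room. The room is brightly lit, with warm tones reflecting off the walls and furniture. The scene appears to be from a movie or TV show.
The waves crash against the dark, jagged rocks of the shoreline, sending white foam spraying into the air. The rocks are dark gray, with sharp edges and deep crevices. The water is a clear turquoise, with white foam where the waves break against the rocks. The sky is light gray, with a few white clouds dotting the horizon.
The camera pans from left to right across a cityscape with a circular building, showing the tops of the buildings and the circular building in the center. The buildings are various shades of gray and white, and the circular building has a green roof. The camera angle is high, looking down at the city. The lighting is bright, with the sun shining from the upper left and the buildings casting shadows. The scene is computer-generated imagery.
A man walks toward a window, looks out, and then turns around. He has short, dark hair, dark skin, and is wearing a brown coat with a red and gray scarf. He walks from left to right toward the window, his gaze fixed on something outside. The camera follows him from behind at a medium distance. The room is brightly lit, with white walls and a large window covered by a white curtain. As he approaches the window, he turns his head slightly to the left, then back to the right. He then turns his whole body to the right, facing the window. The camera remains stationary as he stands in front of the window. The scene is real-life footage.
Two police officers in dark blue uniforms and matching hats enter a dimly lit room through a doorway on the left side of the frame. The first officer, with short brown hair and a mustache, steps inside first, followed by his partner, who has a shaved head and a goatee. Both officers have serious expressions and walk at a steady pace deeper into the room. The camera remains stationary, capturing them from a slightly low angle as they enter. The room has exposed brick walls and a corrugated metal ceiling, with a barred window visible in the background. The lighting is low, casting shadows on the officers' faces and emphasizing the grim atmosphere. The scene appears to be from a movie or TV show.
A woman with short brown hair, wearing a maroon sleeveless top and a silver necklace, walks through a room while talking, then a woman with pink hair and a white shirt appears in the doorway and yells. The first woman walks from left to right, her expression serious; she has light skin and her eyebrows are slightly furrowed. The second woman stands in the doorway, her mouth open in a yell; she has light skin and her eyes are wide. The room is dimly lit, with a bookshelf visible in the background. The camera follows the first woman as she walks, then cuts to a close-up of the second woman's face. The scene is real-life footage.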

🚀 Quick start

Models and workflows

Name	Notes	inference.py config	ComfyUI workflow (recommended)
ltxv-13b-0.9.7-dev	Highest quality, requires more VRAM	ltxv-13b-0.9.7-dev.yaml	ltxv-13b-i2v-base.json
ltxv-13b-0.9.7-mix	Mixes ltxv-13b-dev and ltxv-13b-distilled in the same multi-scale rendering workflow to balance speed and quality	N/A	ltxv-13b-i2v-mixed-multiscale.json
ltxv-13b-0.9.7-distilled	Faster, uses less VRAM, with a slight quality reduction compared to 13b. Suited to rapid iteration	ltxv-13b-0.9.7-distilled.yaml	ltxv-13b-dist-i2v-base.json
ltxv-13b-0.9.7-distilled-lora128	LoRA that makes ltxv-13b-dev behave like the distilled model	N/A	N/A
ltxv-13b-0.9.7-fp8	Quantized version of ltxv-13b	Coming soon	ltxv-13b-i2v-base-fp8.json
ltxv-13b-0.9.7-distilled-fp8	Quantized version of ltxv-13b-distilled	Coming soon	ltxv-13b-dist-i2v-base-fp8.json
ltxv-2b-0.9.6	Good quality, requires less VRAM than ltxv-13b	ltxv-2b-0.9.6-dev.yaml	ltxvideo-i2v.json
ltxv-2b-0.9.6-distilled	15× faster, capable of real-time generation, needs fewer steps, no STG/CFG required	ltxv-2b-0.9.6-distilled.yaml	ltxvideo-i2v-distilled.json

Model details

Attribute	Details
Developed by	Lightricks
Model type	Diffusion-based text-to-video and image-to-video generation model
Language(s)	English

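Usage

Direct use

You can use the model for purposes permitted under its license:

2B version 0.9: license
2B version 0.9.1: license
2B version 0.9.5: license
2B version 0.9.6-dev: license
2B version 0.9.6-distilled: license
13B version 0.9.7-dev: license
13B version 0.9.7-dev-fp8: license
13B version 0.9.7-distilled: license
13B version 0.9.7-distilled-fp8: license
13B version 0.9.7-distilled-lora128: license
Temporal upscaler version 0.9.7: license
Spatial upscaler version 0.9.7: license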

General tips

⚠️ Important

The model works on resolutions divisible by 32 and on frame counts of the form 8n + 1 (for example, 257). If the resolution is not divisible by 32, or the frame count is not a multiple of 8 plus 1, the input is padded with -1 and then cropped to the requested resolution and number of frames.

The model works best at resolutions under 720 x 1280 and with fewer than 257 frames.

Prompts should be in English, and the more detailed the better. For example: The turquoise waves crash against the dark, jagged rocks of the shore, sending white foam spraying into the air. The scene is dominated by the stark contrast between the bright blue water and the dark, almost black rocks. The water is a clear, turquoise color, and the waves are capped with white foam. The rocks are dark and jagged, and they are covered in patches of green moss. The shore is lined with lush green vegetation, including trees and bushes. In the background, there are rolling hills covered in dense forest. The sky is cloudy, and the light is dim.
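
The resolution and frame-count constraints in the tips above can be checked before launching a run. The following is a minimal sketch, assuming that padding rounds each value up to the next valid size; `padded_size` is a hypothetical helper written for illustration, not part of the LTX-Video codebase, and it does not perform the -1 fill itself.

```python
def padded_size(height, width, num_frames):
    """Return the smallest valid (height, width, num_frames) >= the inputs.

    Assumption: height and width must be multiples of 32 and the
    frame count must have the form 8n + 1, as stated in the tips.
    """
    def round_up(value, multiple):
        return ((value + multiple - 1) // multiple) * multiple

    padded_height = round_up(height, 32)
    padded_width = round_up(width, 32)
    # Subtract the extra frame, round up to a multiple of 8, add it back.
    padded_frames = round_up(num_frames - 1, 8) + 1
    return padded_height, padded_width, padded_frames


print(padded_size(704, 1216, 257))  # already valid, returned unchanged
print(padded_size(480, 700, 100))   # width and frame count get padded
```

If the returned values differ from the requested ones, the run will involve padding and cropping; requesting a valid size directly avoids that extra work.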

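Online demo

The model is accessible right away via the following links:

ComfyUI

To use our model with ComfyUI, follow the instructions in the ComfyUI repository.

Run locally

Installation

The codebase was tested with Python 3.10.5 and CUDA 12.2, and supports PyTorch >= 2.1.2.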

git clone https://github.com/Lightricks/LTX-Video.git
cd LTX-Video

# create and activate a virtual environment
python -m venv env
source env/bin/activate
python -m pip install -e .\[inference-script\]

Inference

To use our model, follow the inference code in inference.py:

Text-to-video generation:

python inference.py --prompt "PROMPT" --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config configs/ltxv-13b-0.9.7-distilled.yaml

Image-to-video generation:

python inference.py --prompt "PROMPT" --input_image_path IMAGE_PATH --height HEIGHT --width WIDTH --num_frames NUM_FRAMES --seed SEED --pipeline_config configs/ltxv-13b-0.9.7-distilled.yaml
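
For batch runs, the two commands above can be assembled programmatically. This is a minimal sketch, assuming it is executed from the root of the LTX-Video checkout; `build_inference_command` is a hypothetical helper, and it uses only the flags that appear in the commands above.

```python
import shlex


def build_inference_command(prompt, height, width, num_frames, seed,
                            pipeline_config="configs/ltxv-13b-0.9.7-distilled.yaml",
                            input_image_path=None):
    """Build the argument list for one inference.py run (hypothetical helper)."""
    cmd = ["python", "inference.py", "--prompt", prompt]
    # Passing an image switches the run from text-to-video to image-to-video.
    if input_image_path is not None:
        cmd += ["--input_image_path", input_image_path]
    cmd += [
        "--height", str(height),
        "--width", str(width),
        "--num_frames", str(num_frames),
        "--seed", str(seed),
        "--pipeline_config", pipeline_config,
    ]
    return cmd


cmd = build_inference_command("A river flows through a rocky canyon", 704, 1216, 121, seed=42)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The list can be passed to `subprocess.run(cmd, check=True)` to launch the run; building a list instead of a string avoids shell-quoting problems with long prompts.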

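Diffusers 🧨

LTX Video is compatible with the Diffusers Python library, which supports both text-to-video and image-to-video generation. Make sure diffusers is installed before trying the examples below: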

pip install -U git+https://github.com/huggingface/diffusers

💻 Usage examples

Basic usage

Text-to-video:

import torch
from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
from diffusers.utils import export_to_video

pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")
pipe_upsample.to("cuda")
pipe.vae.enable_tiling()

prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
expected_height, expected_width = 704, 512
downscale_factor = 2 / 3
num_frames = 121

# Part 1. Generate video at smaller resolution
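downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
# Round down to a multiple of the VAE spatial compression ratio so the pipeline accepts the size
downscaled_height -= downscaled_height % pipe.vae_spatial_compression_ratio
downscaled_width -= downscaled_width % pipe.vae_spatial_compression_ratio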
latents = pipe(
    conditions=None,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=downscaled_width,
    height=downscaled_height,
    num_frames=num_frames,
    num_inference_steps=30,
    generator=torch.Generator().manual_seed(0),
    output_type="latent",
).frames

# Part 2. Upscale generated video using latent upsampler with fewer inference steps
# The available latent upsampler upscales the height/width by 2x
upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
upscaled_latents = pipe_upsample(
    latents=latents,
    output_type="latent"
).frames

# Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=upscaled_width,
    height=upscaled_height,
    num_frames=num_frames,
    denoise_strength=0.4,  # Effectively, 4 inference steps out of 10
    num_inference_steps=10,
    latents=upscaled_latents,
    decode_timestep=0.05,
    image_cond_noise_scale=0.025,
    generator=torch.Generator().manual_seed(0),
    output_type="pil",
).frames[0]

# Part 4. Downscale the video to the expected resolution
video = [frame.resize((expected_width, expected_height)) for frame in video]

export_to_video(video, "output.mp4", fps=24)
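
The resolution arithmetic behind the four parts above can be worked through without loading any model. This is a minimal sketch, assuming the low-resolution pass must use dimensions divisible by 32 (the constraint from the general tips) and that the latent upsampler doubles height and width, as the code comment states; `multiscale_plan` is a hypothetical helper for illustration only.

```python
def multiscale_plan(expected_height, expected_width,
                    downscale_factor=2 / 3, multiple=32, upscale=2):
    """Return the (height, width) used at each stage of the multi-scale run."""
    def round_down(value):
        return value - (value % multiple)

    # Part 1: low-resolution pass, rounded down to an accepted size.
    low_height = round_down(int(expected_height * downscale_factor))
    low_width = round_down(int(expected_width * downscale_factor))
    return {
        "first_pass": (low_height, low_width),
        # Parts 2 and 3: latent upsampling, then the short denoising pass.
        "upscaled": (low_height * upscale, low_width * upscale),
        # Part 4: frames are resized to the requested output size.
        "final": (expected_height, expected_width),
    }


print(multiscale_plan(704, 512))  # sizes for the text-to-video example
print(multiscale_plan(832, 480))
```

For the 704 x 512 request, the first pass runs at 448 x 320 and the refinement pass at 896 x 640, so the final resize in Part 4 shrinks the frames to the requested size.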

Image-to-video:

import torch
from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
from diffusers.utils import export_to_video, load_image

pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")
pipe_upsample.to("cuda")
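pipe.vae.enable_tiling()

def round_to_nearest_resolution_acceptable_by_vae(height, width):
    # Round each dimension down to a multiple of the VAE spatial compression ratio
    height = height - (height % pipe.vae_spatial_compression_ratio)
    width = width - (width % pipe.vae_spatial_compression_ratio)
    return height, width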

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png")
video = [image]
condition1 = LTXVideoCondition(video=video, frame_index=0)

prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
expected_height, expected_width = 832, 480
downscale_factor = 2 / 3
num_frames = 96

# Part 1. Generate video at smaller resolution
downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
latents = pipe(
    conditions=[condition1],
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=downscaled_width,
    height=downscaled_height,
    num_frames=num_frames,
    num_inference_steps=30,
    generator=torch.Generator().manual_seed(0),
    output_type="latent",
).frames

# Part 2. Upscale generated video using latent upsampler with fewer inference steps
# The available latent upsampler upscales the height/width by 2x
upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
upscaled_latents = pipe_upsample(
    latents=latents,
    output_type="latent"
).frames

# Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
video = pipe(
    conditions=[condition1],
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=upscaled_width,
    height=upscaled_height,
    num_frames=num_frames,
    denoise_strength=0.4,  # Effectively, 4 inference steps out of 10
    num_inference_steps=10,
    latents=upscaled_latents,
    decode_timestep=0.05,
    image_cond_noise_scale=0.025,
    generator=torch.Generator().manual_seed(0),
    output_type="pil",
).frames[0]

# Part 4. Downscale the video to the expected resolution
video = [frame.resize((expected_width, expected_height)) for frame in video]

export_to_video(video, "output.mp4", fps=24)

Video-to-video:

import torch
from diffusers import LTXConditionPipeline, LTXLatentUpsamplePipeline
from diffusers.pipelines.ltx.pipeline_ltx_condition import LTXVideoCondition
from diffusers.utils import export_to_video, load_video

pipe = LTXConditionPipeline.from_pretrained("Lightricks/LTX-Video-0.9.7-dev", torch_dtype=torch.bfloat16)
pipe_upsample = LTXLatentUpsamplePipeline.from_pretrained("Lightricks/ltxv-spatial-upscaler-0.9.7", vae=pipe.vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")
pipe_upsample.to("cuda")
pipe.vae.enable_tiling()

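def round_to_nearest_resolution_acceptable_by_vae(height, width):
    # Height and width are spatial dimensions, so round them down to a
    # multiple of the VAE spatial (not temporal) compression ratio
    height = height - (height % pipe.vae_spatial_compression_ratio)
    width = width - (width % pipe.vae_spatial_compression_ratio)
    return height, width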

video = load_video(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
)[:21]  # Use only the first 21 frames as conditioning
condition1 = LTXVideoCondition(video=video, frame_index=0)

prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
expected_height, expected_width = 768, 1152
downscale_factor = 2 / 3
num_frames = 161

# Part 1. Generate video at smaller resolution
downscaled_height, downscaled_width = int(expected_height * downscale_factor), int(expected_width * downscale_factor)
downscaled_height, downscaled_width = round_to_nearest_resolution_acceptable_by_vae(downscaled_height, downscaled_width)
latents = pipe(
    conditions=[condition1],
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=downscaled_width,
    height=downscaled_height,
    num_frames=num_frames,
    num_inference_steps=30,
    generator=torch.Generator().manual_seed(0),
    output_type="latent",
).frames

# Part 2. Upscale generated video using latent upsampler with fewer inference steps
# The available latent upsampler upscales the height/width by 2x
upscaled_height, upscaled_width = downscaled_height * 2, downscaled_width * 2
upscaled_latents = pipe_upsample(
    latents=latents,
    output_type="latent"
).frames

# Part 3. Denoise the upscaled video with few steps to improve texture (optional, but recommended)
video = pipe(
    conditions=[condition1],
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=upscaled_width,
    height=upscaled_height,
    num_frames=num_frames,
    denoise_strength=0.4,  # Effectively, 4 inference steps out of 10
    num_inference_steps=10,
    latents=upscaled_latents,
    decode_timestep=0.05,
    image_cond_noise_scale=0.025,
    generator=torch.Generator().manual_seed(0),
    output_type="pil",
).frames[0]

# Part 4. Downscale the video to the expected resolution
video = [frame.resize((expected_width, expected_height)) for frame in video]

export_to_video(video, "output.mp4", fps=24)