TDM_CogVideoX-2B_LoRA: Open-Source Video Generation Model That Generates High-Quality Videos 25× Faster with 4-Step Inference


TDM CogVideoX 2B LoRA

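Developed by Luo-Yihong

TDM is a model that achieves efficient few-step diffusion through Trajectory Distribution Matching. It generates high-quality videos with only 4 inference steps, a 25× speedup over the original model without loss of performance.

Text-to-Video · Open Source · License: Apache-2.0 · #FewStepDiffusion #TextToVideo #EfficientDistillation

Downloads: 49

Release date: 3/16/2025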

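Model Overview

TDM uses a novel Trajectory Distribution Matching technique to distill knowledge from a teacher model (e.g., CogVideoX-2B), enabling high-quality text-to-video generation in very few steps (4 steps).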

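Model Features

Ultra-Fast Inference

Generates high-quality videos in only 4 inference steps, a 25× speedup over the original model

No Performance Degradation

Greatly reduces the number of inference steps while maintaining generation quality; in a user study (on PixArt-α images), TDM's outputs were hard to distinguish from the teacher model's

Efficient Training

Distillation completes in only 500 training iterations and 2 A800 hours (figures reported for distillation from PixArt-α)

Broad Adaptability

Provides LoRA adapters for several base models, including SD3 and Dreamshaper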
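As a quick sanity check on the training-cost figure, the reported budget of 500 iterations in 2 A800 hours works out to roughly 14.4 GPU-seconds per iteration. The sketch below only does this arithmetic; it assumes "2 A800 hours" means 2 GPU-hours in total and says nothing about batch size or how the time is split across devices.

```python
# Back-of-the-envelope training cost.
# Assumption: "2 A800 hours" means 2 GPU-hours in total.
GPU_HOURS = 2.0
ITERATIONS = 500


def seconds_per_iteration(gpu_hours: float, iterations: int) -> float:
    """Average GPU-seconds spent per training iteration."""
    return gpu_hours * 3600.0 / iterations


per_iter = seconds_per_iteration(GPU_HOURS, ITERATIONS)
print(f"{per_iter:.1f} GPU-seconds per iteration")
```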

Model Capabilities

Text-to-video generation

High-quality image generation

Fast few-step inference

Model distillation

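Use Cases

Content Creation

Short Video Generation

Quickly generate short video content for social media

Generates a 49-frame video with 4 inference steps

Creative Visualization

Quickly turn text descriptions into visual content

Greatly increases generation speed while preserving artistic style

Education & Entertainment

Interactive Storytelling

Generate animations of story scenes in near real time

Enables a near-real-time interactive experience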
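For the short-video use case, the CogVideoX example further down produces 49 frames and exports them at 8 fps. A small sketch of what that means for clip length, plus the number of latent frames the denoiser works on. The latent count assumes CogVideoX's VAE keeps the first frame and compresses the remaining frames 4× in time; that is our understanding of the CogVideoX-2B VAE, not something this page states.

```python
def clip_seconds(num_frames: int, fps: int) -> float:
    """Playback length of an exported clip."""
    return num_frames / fps


def latent_frames(num_frames: int, temporal_compression: int = 4) -> int:
    """Latent frame count under an assumed CogVideoX-style causal VAE:
    the first frame is kept and the rest are compressed in time."""
    if (num_frames - 1) % temporal_compression != 0:
        raise ValueError("num_frames - 1 must be divisible by the temporal compression ratio")
    return (num_frames - 1) // temporal_compression + 1


# 49 frames at fps=8, as in the CogVideoX example.
print(clip_seconds(49, 8), "seconds,", latent_frames(49), "latent frames")
```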

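🚀 TDM: Learning Few-Step Diffusion Models by Trajectory Distribution Matching

This is the official repository for "Learning Few-Step Diffusion Models by Trajectory Distribution Matching" by Yihong Luo, Tianyang Hu, Jiacheng Sun, Yujun Cai, and Jing Tang. By distilling knowledge from PixArt-α in a data-free manner, this work builds TDM (4 NFE) with only 500 training iterations and 2 A800 hours, enabling fast image and video generation.

GitHub repository: https://github.com/Luo-Yihong/TDM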

🚀 Quick Start

User Study

Which do you think is better? Some of the images were generated by PixArt-α (50 NFE). The others were generated by TDM (4 NFE), distilled from PixArt-α in a data-free manner with only 500 training iterations and 2 A800 hours.


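Answer, TDM's position (from left to right): bottom, bottom, top, bottom, top.

Fast Text-to-Video Generation

Our proposed TDM can also be easily extended to text-to-video generation.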

[Video comparison: Teacher vs. Student]

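The teacher video above was generated by CogVideoX-2B (100 NFE). In the same amount of time, TDM (4 NFE) can generate 25 videos, achieving a remarkable 25× speedup without loss of performance. (Note: the noise in the GIFs is due to compression.)

💻 Usage Examples

TDM-SD3-LoRA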
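The 25× figure is a ratio of network function evaluations (NFE). A minimal sketch of the accounting, assuming that classifier-free guidance (CFG) costs two evaluations per step and that diffusers pipelines apply CFG only when guidance_scale > 1, which is why the TDM examples pass guidance_scale=1. Splitting the teacher's 100 NFE into 50 steps with CFG is our assumption; the page only gives the total. The same rule reproduces the 56 NFE quoted for the SD3 teacher (28 steps, guidance_scale=7). The ratio counts denoiser calls only; text encoding and VAE decoding are unchanged, so wall-clock speedups are likely somewhat lower.

```python
def nfe(num_steps: int, guidance_scale: float) -> int:
    """Denoiser evaluations for one sample.

    Assumes CFG is active only when guidance_scale > 1 and then costs two
    evaluations per step (conditional + unconditional), even if batched.
    """
    evals_per_step = 2 if guidance_scale > 1.0 else 1
    return num_steps * evals_per_step


# Teacher: CogVideoX-2B, assumed 50 steps with CFG.
teacher_nfe = nfe(50, 6.0)
# Student: TDM-LoRA, 4 steps with guidance_scale=1 as in the example code.
student_nfe = nfe(4, 1.0)
speedup = teacher_nfe / student_nfe

print(teacher_nfe, student_nfe, speedup)
print(nfe(28, 7.0))  # SD3 teacher setting from the SD3 example
```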

import torch
from diffusers import StableDiffusion3Pipeline, AutoencoderTiny, DPMSolverMultistepScheduler
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from diffusers.utils import make_image_grid
pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16).to("cuda")
pipe.load_lora_weights("Luo-Yihong/TDM_sd3_lora", adapter_name="tdm")  # Load TDM-LoRA
pipe.set_adapters(["tdm"], [0.125])  # IMPORTANT. Please set LoRA scale to 0.125.
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd3", torch_dtype=torch.float16)  # Tiny VAE to save GPU memory.
pipe.vae.config.shift_factor = 0.0
pipe = pipe.to("cuda")
# Borrow the flow-matching DPM-Solver scheduler config from Sana.
pipe.scheduler = DPMSolverMultistepScheduler.from_pretrained("Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers", subfolder="scheduler")
# The flow_shift can be changed from 1 to 6; pass it as an override instead of mutating the config.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=6)
generator = torch.manual_seed(8888)
image = pipe(
    prompt="A cute panda holding a sign says TDM SOTA!",
    negative_prompt="",
    num_inference_steps=4,
    height=1024,
    width=1024,
    num_images_per_prompt = 1,
    guidance_scale=1.,
    generator = generator,
).images[0]

pipe.scheduler = DPMSolverMultistepScheduler.from_pretrained("Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers", subfolder="scheduler")  # Reload without the flow_shift override.
pipe.set_adapters(["tdm"], [0.])  # Disable the LoRA (weight 0) to sample from the SD3 teacher.
generator = torch.manual_seed(8888)
teacher_image = pipe(
    prompt="A cute panda holding a sign says TDM SOTA!",
    negative_prompt="",
    num_inference_steps=28,
    height=1024,
    width=1024,
    num_images_per_prompt = 1,
    guidance_scale=7.,
    generator = generator,
).images[0]
make_image_grid([image,teacher_image],1,2)

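The image on the right is a sample generated by SD3 with 56 NFE (28 steps with classifier-free guidance), and the image on the left is a sample generated by TDM with 4 NFE. Which one do you think is better?

TDM-Dreamshaper-v7-LoRA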
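The SD3 example sets flow_shift (anywhere from 1 to 6) on a DPM-Solver scheduler borrowed from Sana. For flow-matching schedulers, a shift is commonly applied to the noise levels with the SD3-style rule sigma' = shift * sigma / (1 + (shift - 1) * sigma); we believe the flow-sigma path of diffusers' DPMSolverMultistepScheduler uses this form, but treat that as an assumption. The sketch applies the rule to a hypothetical uniform 4-step grid to show the effect: larger shifts keep more of the few available steps at high noise levels.

```python
def shift_sigma(sigma: float, shift: float) -> float:
    """SD3-style time shift for flow-matching noise levels in [0, 1]."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


# Hypothetical uniform 4-step grid (illustration only, not the scheduler's exact grid).
base_sigmas = [1.0, 0.75, 0.5, 0.25]

for shift in (1.0, 3.0, 6.0):
    shifted = [round(shift_sigma(s, shift), 3) for s in base_sigmas]
    print(f"flow_shift={shift}: {shifted}")
```

A shift of 1 leaves the grid unchanged, and sigma = 1 (pure noise) stays fixed for every shift.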

import torch
from diffusers import DiffusionPipeline, UNet2DConditionModel, DPMSolverMultistepScheduler
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
repo_name = "Luo-Yihong/TDM_dreamshaper_LoRA"
ckpt_name = "tdm_dreamshaper.pt"
pipe = DiffusionPipeline.from_pretrained('lykon/dreamshaper-7', torch_dtype=torch.float16).to("cuda")
pipe.load_lora_weights(hf_hub_download(repo_name, ckpt_name))  # Download and load the TDM-LoRA checkpoint.
pipe.scheduler = DPMSolverMultistepScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")  # DPM-Solver for 4-step sampling.
generator = torch.manual_seed(317)
image = pipe(
    prompt="A close-up photo of an Asian lady with sunglasses",
    negative_prompt="",
    num_inference_steps=4,
    num_images_per_prompt = 1,
    generator = generator,
    guidance_scale=1.,
).images[0]
image

TDM-CogVideoX-2B-LoRA
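The SD3 example loads TDM as a named adapter, sets its weight to 0.125, and later sets it to 0 to sample from the teacher; the Dreamshaper example keeps the default weight. Conceptually, an adapter weight s scales the low-rank update: W_eff = W + s * (B @ A). The pure-Python sketch below illustrates that rule on hypothetical toy matrices; it ignores PEFT's additional alpha/r factor and is not how diffusers applies adapters internally.

```python
Matrix = list[list[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain matrix product x @ y."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def effective_weight(w: Matrix, lora_b: Matrix, lora_a: Matrix, scale: float) -> Matrix:
    """W + scale * (B @ A); scale plays the role of the adapter weight."""
    delta = matmul(lora_b, lora_a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)] for w_row, d_row in zip(w, delta)]


# Toy 2x2 base weight with a rank-1 LoRA update (hypothetical numbers).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[2.0], [4.0]]  # d_out x r
A = [[1.0, 3.0]]    # r x d_in

print(effective_weight(W, B, A, 0.125))  # TDM-style adapter weight
print(effective_weight(W, B, A, 0.0))    # adapter disabled: base weight
```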

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.vae.enable_slicing() # Save memory
pipe.vae.enable_tiling() # Save memory
pipe.load_lora_weights("Luo-Yihong/TDM_CogVideoX-2B_LoRA")
pipe.to("cuda")
prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The "
    "panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance"
)
# We train the generator on timesteps [999, 856, 665, 399].
# The official CogVideoX scheduler uses uniformly spaced timesteps, which may cause inferior results,
# but TDM-LoRA still works well with 4 NFE.
# We will update TDM-CogVideoX-LoRA soon for better performance!
generator = torch.manual_seed(8888)
frames = pipe(prompt, guidance_scale=1, 
              num_inference_steps=4, 
              num_frames=49,
              generator = generator).frames[0]
export_to_video(frames, "output-TDM.mp4", fps=8)
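The comments in the CogVideoX example note that TDM-LoRA was trained on timesteps [999, 856, 665, 399], while the stock scheduler spaces 4 steps uniformly. A sketch of how far apart the two schedules are, assuming "trailing" spacing over 1000 training timesteps (our reading of the CogVideoX-2B scheduler config; the page itself only says "uniform spacing"):

```python
TDM_TRAINED_TIMESTEPS = [999, 856, 665, 399]  # from the example's comments


def trailing_timesteps(num_inference_steps: int, num_train_timesteps: int = 1000) -> list[int]:
    """Uniform 'trailing' spacing: round(arange(T, 0, -T / n)) - 1."""
    step_ratio = num_train_timesteps / num_inference_steps
    return [round(num_train_timesteps - i * step_ratio) - 1 for i in range(num_inference_steps)]


uniform = trailing_timesteps(4)
gaps = [trained - u for trained, u in zip(TDM_TRAINED_TIMESTEPS, uniform)]
print("uniform:", uniform)
print("trained minus uniform:", gaps)
```

Every uniform step after the first lands at a lower timestep (less noise) than the one TDM was trained on, which is consistent with the note that the stock schedule may give somewhat inferior results.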

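🔥 Pretrained Models

We release the following TDM-LoRA models. Feel free to try them!

🔥 TODO

Release the training code

Contact

If you have any questions about this work, please contact Yihong Luo (yluocg@connect.ust.hk).

📄 License

This project is released under the Apache 2.0 license.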

BibTeX

@misc{luo2025tdm,
      title={Learning Few-Step Diffusion Models by Trajectory Distribution Matching}, 
      author={Yihong Luo and Tianyang Hu and Jiacheng Sun and Yujun Cai and Jing Tang},
      year={2025},
      eprint={2503.06674},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.06674}, 
}