LLM-grounded Video Diffusion Models
This project presents LLM-grounded Video Diffusion Models, which enhance video generation by leveraging large language models and bounding-box conditioning.
Quick Start
This model is based on ModelScope but adds bounding-box conditioning in a GLIGEN fashion. Similar to [LLM-grounded Diffusion (LMD)](https://llm-grounded-diffusion.github.io/), LLM-grounded Video Diffusion (LVD)'s boxes-to-video stage allows cross-attention-based bounding box conditioning, which uses ModelScope off the shelf.
This Hugging Face model offers an alternative: we train a GLIGEN model (i.e., transformer adapters) with ModelScope's weights, without the temporal transformer blocks, on SA-1B, treating ModelScope as an SD v2.1 model that has been fine-tuned to 256x256 resolution. We then merge the adapters into ModelScope to offer conditioning. The resulting model is provided in this Hugging Face repository. It can be used with cross-attention-based conditioning or on its own, similar to [LMD+](https://github.com/TonyLianLong/LLM-groundedDiffusion). It can also be used with the LLM-based text-to-dynamic-scene-layout generator in LVD, or on its own as a video version of GLIGEN.
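As a rough illustration of the kind of input the boxes-to-video stage consumes, the sketch below assembles a per-frame bounding-box layout (normalized xyxy boxes paired with grounded phrases), of the sort LVD's LLM-based layout generator produces. The commented-out `load_lvd_pipeline` / `generate_video` calls are hypothetical placeholders rather than a released API; the actual entry points live in the LVD codebase.

```python
# A minimal sketch, assuming normalized (x_min, y_min, x_max, y_max) boxes
# and one box per grounded phrase per frame. The calls at the bottom are
# hypothetical placeholders, not a released API.
num_frames = 16

def interpolate_box(start, end, t):
    """Linearly interpolate each box coordinate between start and end."""
    return tuple(s + (e - s) * t for s, e in zip(start, end))

layout = {
    "prompt": "a brown bear walking from left to right in a forest",
    "phrases": ["a brown bear"],
    # frame_boxes[i] holds one box per phrase for frame i; here the single
    # box drifts rightward to describe the dynamic scene layout.
    "frame_boxes": [
        [interpolate_box((0.05, 0.35, 0.35, 0.85),
                         (0.60, 0.35, 0.90, 0.85),
                         i / (num_frames - 1))]
        for i in range(num_frames)
    ],
}

# pipeline = load_lvd_pipeline("path/to/merged-modelscope-gligen-weights")  # hypothetical
# video = pipeline.generate_video(layout, height=256, width=256)            # hypothetical
```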
Features
- Bounding Box Conditioning: Allows cross-attention-based bounding box conditioning for video generation.
- Alternative Training Approach: Trains GLIGEN adapters with ModelScope's weights on the SA-1B dataset.
- Versatile Usage: Can be used with cross-attention-based conditioning, with the LLM-based text-to-dynamic-scene-layout generator in LVD, or on its own as a video version of GLIGEN (see the download sketch after this list).
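To fetch the merged weights locally before wiring them into the LVD code, a snapshot download via `huggingface_hub` is usually enough. The `repo_id` below is a placeholder assumption; substitute the ID shown at the top of this model page.

```python
# A minimal sketch: download this repository's files with huggingface_hub.
# The repo_id below is a placeholder -- replace it with this model's actual
# repository ID.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="your-org/llm-grounded-video-diffusion")  # placeholder
print(f"Weights downloaded to: {local_dir}")
```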
Documentation
Authors
Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li at UC Berkeley/UCSF. ICLR 2024.
Links
[Project Page](https://llm-grounded-video-diffusion.github.io/) | [Related Project: LMD](https://llm-grounded-diffusion.github.io/) | [Citation](https://llm-grounded-video-diffusion.github.io/#citation)
License
ModelScope follows the CC-BY-NC 4.0 license. The GLIGEN adapters are trained on SA-1B, which follows the SA-1B license.
Citation
Citation (LVD)
If you use our work, our model, or our implementation in this repo, or find them helpful, please consider citing the following.
@article{lian2023llmgroundedvideo,
  title={LLM-grounded Video Diffusion Models},
  author={Lian, Long and Shi, Baifeng and Yala, Adam and Darrell, Trevor and Li, Boyi},
  journal={arXiv preprint arXiv:2309.17444},
  year={2023},
}

@article{lian2023llmgrounded,
  title={LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models},
  author={Lian, Long and Li, Boyi and Yala, Adam and Darrell, Trevor},
  journal={arXiv preprint arXiv:2305.13655},
  year={2023}
}
Citation (GLIGEN)
The adapters in this model are trained in a manner similar to training GLIGEN adapters.
@article{li2023gligen,
  title={GLIGEN: Open-Set Grounded Text-to-Image Generation},
  author={Li, Yuheng and Liu, Haotian and Wu, Qingyang and Mu, Fangzhou and Yang, Jianwei and Gao, Jianfeng and Li, Chunyuan and Lee, Yong Jae},
  journal={CVPR},
  year={2023}
}
Citation (ModelScope)
ModelScope is LVD's base model.
@article{wang2023modelscope,
  title={Modelscope text-to-video technical report},
  author={Wang, Jiuniu and Yuan, Hangjie and Chen, Dayou and Zhang, Yingya and Wang, Xiang and Zhang, Shiwei},
  journal={arXiv preprint arXiv:2308.06571},
  year={2023}
}
@InProceedings{VideoFusion,
author = {Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
title = {VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023}
}