LLM-grounded Video Diffusion Models
This project presents LLM-grounded Video Diffusion Models, which enhance video generation by leveraging large language models and bounding-box conditioning.
Quick Start
This model is based on ModelScope but adds bounding-box conditioning in a GLIGEN fashion. Similar to [LLM-grounded Diffusion (LMD)](https://llm-grounded-diffusion.github.io/), LLM-grounded Video Diffusion (LVD)'s boxes-to-video stage allows cross-attention-based bounding box conditioning, which uses ModelScope off the shelf.
This Hugging Face model offers an alternative: we train a GLIGEN model (i.e., transformer adapters) with ModelScope's weights, without the temporal transformer blocks, on SA-1B, treating ModelScope as an SD v2.1 model that has been fine-tuned to 256x256 resolution. We then merge the adapters into ModelScope to offer conditioning. The resulting model is provided in this Hugging Face repository. It can be used with cross-attention-based conditioning or on its own, similar to [LMD+](https://github.com/TonyLianLong/LLM-groundedDiffusion). It can also be used with the LLM-based text-to-dynamic-scene-layout generator in LVD, or on its own as a video version of GLIGEN.
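As a rough illustration of the kind of input the boxes-to-video stage consumes, the sketch below assembles a per-frame bounding-box layout (normalized xyxy boxes paired with grounded phrases), of the sort LVD's LLM-based layout generator produces. The commented-out `load_lvd_pipeline` / `generate_video` calls are hypothetical placeholders rather than a released API; the actual entry points live in the LVD codebase.

```python
# A minimal sketch, assuming normalized (x_min, y_min, x_max, y_max) boxes
# and one box per grounded phrase per frame. The calls at the bottom are
# hypothetical placeholders, not a released API.
num_frames = 16

def interpolate_box(start, end, t):
    """Linearly interpolate each box coordinate between start and end."""
    return tuple(s + (e - s) * t for s, e in zip(start, end))

layout = {
    "prompt": "a brown bear walking from left to right in a forest",
    "phrases": ["a brown bear"],
    # frame_boxes[i] holds one box per phrase for frame i; here the single
    # box drifts rightward to describe the dynamic scene layout.
    "frame_boxes": [
        [interpolate_box((0.05, 0.35, 0.35, 0.85),
                         (0.60, 0.35, 0.90, 0.85),
                         i / (num_frames - 1))]
        for i in range(num_frames)
    ],
}

# pipeline = load_lvd_pipeline("path/to/merged-modelscope-gligen-weights")  # hypothetical
# video = pipeline.generate_video(layout, height=256, width=256)            # hypothetical
```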
Features
- Bounding Box Conditioning: Allows cross-attention-based bounding box conditioning for video generation.
- Alternative Training Approach: Trains GLIGEN adapters with ModelScope's weights on the SA-1B dataset.
- Versatile Usage: Can be used with cross-attention-based conditioning, with the LLM-based text-to-dynamic-scene-layout generator in LVD, or on its own as a video version of GLIGEN (see the download sketch after this list).
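To fetch the merged weights locally before wiring them into the LVD code, a snapshot download via `huggingface_hub` is usually enough. The `repo_id` below is a placeholder assumption; substitute the ID shown at the top of this model page.

```python
# A minimal sketch: download this repository's files with huggingface_hub.
# The repo_id below is a placeholder -- replace it with this model's actual
# repository ID.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="your-org/llm-grounded-video-diffusion")  # placeholder
print(f"Weights downloaded to: {local_dir}")
```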
Documentation
Authors
Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li at UC Berkeley/UCSF. ICLR 2024.
Links
[Project Page](https://llm-grounded-video-diffusion.github.io/) | [Related Project: LMD](https://llm-grounded-diffusion.github.io/) | [Citation](https://llm-grounded-video-diffusion.github.io/#citation)
License
ModelScope follows the CC-BY-NC 4.0 license. The GLIGEN adapters are trained on SA-1B, which follows the SA-1B license.
Citation
Citation (LVD)
If you use our work, our model, or our implementation in this repo, or find them helpful, please consider citing the following.
@article{lian2023llmgroundedvideo,
  title={LLM-grounded Video Diffusion Models},
  author={Lian, Long and Shi, Baifeng and Yala, Adam and Darrell, Trevor and Li, Boyi},
  journal={arXiv preprint arXiv:2309.17444},
  year={2023},
}

@article{lian2023llmgrounded,
  title={LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models},
  author={Lian, Long and Li, Boyi and Yala, Adam and Darrell, Trevor},
  journal={arXiv preprint arXiv:2305.13655},
  year={2023}
}
Citation (GLIGEN)
The adapters in this model are trained in a manner similar to training GLIGEN adapters.
@article{li2023gligen,
  title={GLIGEN: Open-Set Grounded Text-to-Image Generation},
  author={Li, Yuheng and Liu, Haotian and Wu, Qingyang and Mu, Fangzhou and Yang, Jianwei and Gao, Jianfeng and Li, Chunyuan and Lee, Yong Jae},
  journal={CVPR},
  year={2023}
}
Citation (ModelScope)
ModelScope is LVD's base model.
@article{wang2023modelscope,
  title={Modelscope text-to-video technical report},
  author={Wang, Jiuniu and Yuan, Hangjie and Chen, Dayou and Zhang, Yingya and Wang, Xiang and Zhang, Shiwei},
  journal={arXiv preprint arXiv:2308.06571},
  year={2023}
}
@InProceedings{VideoFusion,
author = {Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
title = {VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023}
}