🚀 VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
VideoLLaMA 2 is a multimodal large language model that performs strongly on visual question answering tasks such as video question answering and video captioning, improving the spatial-temporal modeling and audio understanding capabilities of video large language models.
🚀 Quick Start
You can use the following code example to run VideoLLaMA 2 inference on a video (an image-inference variant is shown under Usage Examples below):
```python
import sys
sys.path.append('./')

from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # Video inference: describe what happens in a short clip.
    modal = 'video'
    modal_path = 'assets/cat_and_chicken.mp4'
    instruct = 'What animals are in the video, what are they doing, and how does the video feel?'

    # Load the model, the modality-specific processors, and the tokenizer.
    model_path = 'DAMO-NLP-SG/VideoLLaMA2-8x7B'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)


if __name__ == "__main__":
    inference()
```
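In this example, `processor[modal]` selects the preprocessor for the chosen modality and turns the file at `modal_path` into model inputs, while `do_sample=False` disables sampling so that the generated answer is deterministic.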
✨ Key Features
- Multimodal large language model: accepts multimodal inputs such as video and images and handles visual question answering tasks.
- Advanced spatial-temporal modeling and audio understanding: improves spatial-temporal modeling and audio understanding when processing video.
- A range of model choices: multiple model sizes and variants (e.g., Base and Chat) are available to suit different needs; see the sketch after this list.
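As a minimal sketch of how a different variant is selected, only the `model_path` passed to `model_init` needs to change; the rest of the Quick Start script stays the same. Note that the Base checkpoint name below is an assumption based on the project's naming scheme (only the Chat checkpoint `DAMO-NLP-SG/VideoLLaMA2-8x7B` appears in this card), so verify it against the model zoo before use.

```python
from videollama2 import model_init

# Chat checkpoint used in the inference examples in this card.
chat_model_path = 'DAMO-NLP-SG/VideoLLaMA2-8x7B'

# Assumed name of the corresponding Base checkpoint; verify it in the model zoo.
base_model_path = 'DAMO-NLP-SG/VideoLLaMA2-8x7B-Base'

# Only the path changes; loading and inference work the same way.
model, processor, tokenizer = model_init(chat_model_path)
```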
📦 Model Information

| Property | Details |
| --- | --- |
| Model type | Multimodal large language model, large video-language model |
| Training datasets | OpenGVLab/VideoChat2-IT, Lin-Chen/ShareGPT4V, liuhaotian/LLaVA-Instruct-150K |
| Evaluation metric | Accuracy |
| Library name | transformers |
| Task type | Visual question answering |
💻 Usage Examples
Basic usage
The script below mirrors the Quick Start example, but runs inference on an image instead of a video:
```python
import sys
sys.path.append('./')

from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # Image inference: describe the content and mood of a single image.
    modal = 'image'
    modal_path = 'assets/sora.png'
    instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'

    # Load the model, the modality-specific processors, and the tokenizer.
    model_path = 'DAMO-NLP-SG/VideoLLaMA2-8x7B'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)


if __name__ == "__main__":
    inference()
```
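Both scripts assume they are run from the root of the VideoLLaMA2 code repository: `sys.path.append('./')` only makes the `videollama2` package importable if it sits in the current working directory, and the example media files are expected under `assets/`.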
🚀 Main Results
Multiple-choice video QA and video captioning (results figure)

Open-ended video QA (results figure)
📚 Model Zoo
📄 License
This project is released under the Apache-2.0 license.
Citation
If you find VideoLLaMA useful for your research and applications, please cite it using the following BibTeX:
```bibtex
@article{damonlpsg2024videollama2,
  title   = {VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author  = {Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal = {arXiv preprint arXiv:2406.07476},
  year    = {2024},
  url     = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title   = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author  = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year    = {2023},
  url     = {https://arxiv.org/abs/2306.02858}
}
```