🚀 VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
VideoLLaMA 2 is a multimodal large language model that strengthens spatial-temporal modeling and audio understanding in Video-LLMs, and performs strongly on visual question answering tasks such as video question answering and video captioning.
🚀 Quick Start
You can get started with VideoLLaMA 2 inference using the following code example:
```python
import sys
sys.path.append('./')

from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # Video inference
    modal = 'video'
    modal_path = 'assets/cat_and_chicken.mp4'
    instruct = 'What animals are in the video, what are they doing, and how does the video feel?'

    # Image inference (uncomment to run the image example instead of the video one)
    # modal = 'image'
    # modal_path = 'assets/sora.png'
    # instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'

    model_path = 'DAMO-NLP-SG/VideoLLaMA2-8x7B'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)


if __name__ == "__main__":
    inference()
```
✨ Key Features
- Multimodal large language model: accepts multimodal inputs such as video and images, and handles visual question answering tasks.
- Advanced spatial-temporal modeling and audio understanding: improved spatial-temporal modeling and audio understanding for video inputs.
- Rich model selection: multiple model sizes and variants (e.g., Base and Chat) are available to suit different needs; see the sketch after this list.
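As a rough illustration of the Base/Chat choice, switching variants only changes the checkpoint path passed to `model_init`. This is a minimal sketch, not from the official documentation; the `-Base` suffix for the pretrained variant is an assumption based on the released checkpoint naming.

```python
from videollama2 import model_init

# Chat-tuned checkpoint (the one used in the Quick Start example above).
chat_path = 'DAMO-NLP-SG/VideoLLaMA2-8x7B'
# Pretrained Base variant; the '-Base' suffix is an assumed naming convention.
base_path = 'DAMO-NLP-SG/VideoLLaMA2-8x7B-Base'

# Load whichever variant you need; the rest of the inference code is unchanged.
model, processor, tokenizer = model_init(chat_path)
```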
📦 Model Information

| Attribute | Details |
| --- | --- |
| Model type | Multimodal large language model, large video-language model |
| Training datasets | OpenGVLab/VideoChat2-IT, Lin-Chen/ShareGPT4V, liuhaotian/LLaVA-Instruct-150K |
| Evaluation metric | Accuracy |
| Library | transformers |
| Task type | Visual question answering |
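Since the checkpoints are hosted on the Hugging Face Hub, the weights can be pre-fetched to a local cache with `huggingface_hub`. This is a minimal sketch; the repo id is the 8x7B checkpoint used in the examples above.

```python
from huggingface_hub import snapshot_download

# Pre-download the checkpoint used in the examples to the local Hugging Face cache.
local_dir = snapshot_download(repo_id='DAMO-NLP-SG/VideoLLaMA2-8x7B')
print(f'Checkpoint downloaded to: {local_dir}')
```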
💻 Usage Examples
Basic Usage
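The basic inference flow is identical to the Quick Start example above. As a small extension (a sketch using only the `model_init` / `mm_infer` API shown there, not an official script), the video and image examples can be run in a single pass:

```python
import sys
sys.path.append('./')

from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def run_all():
    disable_torch_init()
    model, processor, tokenizer = model_init('DAMO-NLP-SG/VideoLLaMA2-8x7B')

    # (modal, path, instruction) triples taken from the Quick Start example above.
    requests = [
        ('video', 'assets/cat_and_chicken.mp4',
         'What animals are in the video, what are they doing, and how does the video feel?'),
        ('image', 'assets/sora.png',
         'What is the woman wearing, what is she doing, and how does the image feel?'),
    ]

    for modal, modal_path, instruct in requests:
        output = mm_infer(processor[modal](modal_path), instruct,
                          model=model, tokenizer=tokenizer, do_sample=False, modal=modal)
        print(f'[{modal}] {output}')


if __name__ == '__main__':
    run_all()
```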
🚀 Main Results
Multiple-choice video QA and video captioning

Open-ended video QA

📚 Model Zoo
📄 License
This project is released under the Apache-2.0 license.
Citation
If you find VideoLLaMA useful for your research and applications, please cite it using the following BibTeX:
```bibtex
@article{damonlpsg2024videollama2,
  title   = {VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author  = {Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal = {arXiv preprint arXiv:2406.07476},
  year    = {2024},
  url     = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title   = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author  = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year    = {2023},
  url     = {https://arxiv.org/abs/2306.02858}
}
```