Open-Source VideoLLaMA2-72B Multimodal Model: Visual Question Answering and Dialogue over Video and Image Input

Videollama2 72B

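Developed by DAMO-NLP-SG

VideoLLaMA 2 is a multimodal large language model focused on video understanding and spatial-temporal modeling. It accepts video and image input and performs visual question answering and dialogue tasks.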

Text-to-Video

Transformers

Language: English	License: Apache-2.0	#Multimodal video understanding #Enhanced spatial-temporal modeling #Audio-visual fusion

Release date: 8/13/2024

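Model Overview

VideoLLaMA 2 is a multimodal large language model focused on video understanding and spatial-temporal modeling. It combines a visual encoder with a language decoder to process video and image input and to perform tasks such as visual question answering and video description.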

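Model Features

Multimodal understanding

Handles both video and image input, interprets the visual content, and interacts in natural language.

Spatial-temporal modeling

Specifically optimized to understand and process the spatial and temporal information in video.

Large parameter count

Built on a 72B-parameter language model (Qwen2-72B-Instruct) for semantic understanding and text generation.

Instruction following

Instruction-tuned to understand and carry out a range of vision-related user instructions.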

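Model Capabilities

Video question answering

Image question answering

Video content description

Image content description

Multimodal dialogue

Spatial-temporal relationship understanding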

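Use Cases

Video understanding

Video content question answering

Answers questions about video content, such as identifying objects, analyzing actions, and understanding scenes.

Example: identifies the animals in a video and what they are doing, and describes the overall mood of the video.

Video summary generation

Automatically generates text descriptions and summaries of video content.

Image understanding

Image content question answering

Answers questions about image content, such as identifying objects, analyzing scenes, and interpreting emotion.

Example: describes what a person in an image is wearing and doing, and analyzes the emotional tone of the image.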

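🚀 VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

VideoLLaMA 2 is a multimodal large language model for video understanding. It handles tasks such as video question answering and video captioning, with improved spatial-temporal modeling and audio understanding.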

📄 License

This project is released under the Apache-2.0 license.

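🚀 Quick Start

Model Information

Property	Details
Model type	Multimodal large language model, large video-language model
Training data	OpenGVLab/VideoChat2-IT, Lin-Chen/ShareGPT4V, liuhaotian/LLaVA-Instruct-150K
Evaluation metric	Accuracy
Library	transformers
Task type	Visual question answering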

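📰 News

[2024.06.12] Released the model weights and the first version of the technical report for VideoLLaMA 2.
[2024.06.03] Released the training, evaluation, and serving code for VideoLLaMA 2.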

🌎 Model Zoo

Model name	Type	Visual encoder	Language decoder	Training frames
VideoLLaMA2-7B-Base	Base	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	8
VideoLLaMA2-7B	Chat	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	8
VideoLLaMA2-7B-16F-Base	Base	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	16
VideoLLaMA2-7B-16F	Chat	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	16
VideoLLaMA2-8x7B-Base	Base	clip-vit-large-patch14-336	Mixtral-8x7B-Instruct-v0.1	8
VideoLLaMA2-8x7B	Chat	clip-vit-large-patch14-336	Mixtral-8x7B-Instruct-v0.1	8
VideoLLaMA2-72B-Base	Base	clip-vit-large-patch14-336	Qwen2-72B-Instruct	8
VideoLLaMA2-72B (this checkpoint)	Chat	clip-vit-large-patch14-336	Qwen2-72B-Instruct	8
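
The rows of the model zoo can be encoded as data so that a script can select a checkpoint by type, decoder, or frame count. This is a minimal sketch, not part of the videollama2 package: `Checkpoint`, `MODEL_ZOO`, and `find_checkpoints` are hypothetical names, and the `DAMO-NLP-SG/` repository prefix is an assumption for every entry except `DAMO-NLP-SG/VideoLLaMA2-72B`, which is the only identifier the usage example confirms.

```python
from dataclasses import dataclass
from typing import List, Optional

VISION_ENCODER = "clip-vit-large-patch14-336"  # shared by every row of the table


@dataclass(frozen=True)
class Checkpoint:
    name: str
    variant: str  # "Base" or "Chat"
    decoder: str
    frames: int   # number of frames used during training

    @property
    def repo_id(self) -> str:
        # Assumption: every checkpoint is published under the DAMO-NLP-SG namespace.
        return f"DAMO-NLP-SG/{self.name}"


def _pair(prefix: str, decoder: str, frames: int) -> List[Checkpoint]:
    """Build the Base/Chat pair that the table lists for one configuration."""
    return [
        Checkpoint(f"{prefix}-Base", "Base", decoder, frames),
        Checkpoint(prefix, "Chat", decoder, frames),
    ]


MODEL_ZOO: List[Checkpoint] = (
    _pair("VideoLLaMA2-7B", "Mistral-7B-Instruct-v0.2", 8)
    + _pair("VideoLLaMA2-7B-16F", "Mistral-7B-Instruct-v0.2", 16)
    + _pair("VideoLLaMA2-8x7B", "Mixtral-8x7B-Instruct-v0.1", 8)
    + _pair("VideoLLaMA2-72B", "Qwen2-72B-Instruct", 8)
)


def find_checkpoints(
    variant: Optional[str] = None,
    frames: Optional[int] = None,
    decoder: Optional[str] = None,
) -> List[Checkpoint]:
    """Return the checkpoints matching every filter that is not None."""
    return [
        c for c in MODEL_ZOO
        if (variant is None or c.variant == variant)
        and (frames is None or c.frames == frames)
        and (decoder is None or c.decoder == decoder)
    ]


if __name__ == "__main__":
    for ckpt in find_checkpoints(variant="Chat"):
        print(ckpt.repo_id, ckpt.decoder, ckpt.frames)
```

Calling `find_checkpoints(variant="Chat", frames=16)` narrows the list to the single 16-frame chat model in the table.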

🚀 Main Results

[Figure: Multiple-choice video QA and video captioning results]

[Figure: Open-ended video QA results]

💻 Usage Example

Basic usage

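import sys
sys.path.append('./')  # assumes the script runs from a directory containing the videollama2 package
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference(modal='video'):
    disable_torch_init()

    # Example input and instruction for each modality; pass modal='image' for image inference
    examples = {
        'video': ('assets/cat_and_chicken.mp4',
                  'What animals are in the video, what are they doing, and how does the video feel?'),
        'image': ('assets/sora.png',
                  'What is the woman wearing, what is she doing, and how does the picture feel?'),
    }
    modal_path, instruct = examples[modal]

    # Load the model, the per-modality processors, and the tokenizer
    model_path = 'DAMO-NLP-SG/VideoLLaMA2-72B'
    model, processor, tokenizer = model_init(model_path)

    # Preprocess the input with the processor for the chosen modality, then run greedy decoding
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)

if __name__ == "__main__":
    inference()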
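
When several questions are put to the same model, it helps to initialize the weights once and reuse them for every request. The sketch below separates the list of requests from the inference call so the flow can be run without the model. `Request`, `run_requests`, and `echo_infer` are hypothetical names introduced for illustration. The commented lines show how a real callback could wrap the `mm_infer` call from the example above, assuming `model_init` behaves as it does there.

```python
from dataclasses import dataclass
from typing import Callable, List

SUPPORTED_MODALS = ("video", "image")  # the two modalities used in the example above


@dataclass
class Request:
    modal: str     # "video" or "image"
    path: str      # path to the media file
    instruct: str  # question or instruction for the model


def run_requests(
    requests: List[Request],
    infer: Callable[[str, str, str], str],
) -> List[str]:
    """Validate each request, then pass it to infer(modal, path, instruct)."""
    answers = []
    for req in requests:
        if req.modal not in SUPPORTED_MODALS:
            raise ValueError(f"unsupported modal: {req.modal!r}")
        answers.append(infer(req.modal, req.path, req.instruct))
    return answers


# With the real model, the callback could wrap the call from the example above:
#   model, processor, tokenizer = model_init('DAMO-NLP-SG/VideoLLaMA2-72B')
#   def infer(modal, path, instruct):
#       return mm_infer(processor[modal](path), instruct, model=model,
#                       tokenizer=tokenizer, do_sample=False, modal=modal)


def echo_infer(modal: str, path: str, instruct: str) -> str:
    """Stand-in callback so the sketch runs without model weights."""
    return f"[{modal}] {path}: {instruct}"


if __name__ == "__main__":
    demo = [
        Request("video", "assets/cat_and_chicken.mp4", "What animals are in the video?"),
        Request("image", "assets/sora.png", "What is the woman wearing?"),
    ]
    for answer in run_requests(demo, echo_infer):
        print(answer)
```

Passing the inference function as a parameter keeps model loading outside the loop and lets the request handling be checked with a stub.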

📚 Citation

If you find VideoLLaMA useful for your research and applications, please cite it using the following BibTeX:

@article{damonlpsg2024videollama2,
  title = {VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author = {Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal = {arXiv preprint arXiv:2406.07476},
  year = {2024},
  url = {https://arxiv.org/abs/2406.07476}
}
@article{damonlpsg2023videollama,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year = {2023},
  url = {https://arxiv.org/abs/2306.02858}
}