🚀 VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
VideoLLaMA 2 is a multimodal large language model that strengthens spatial-temporal modeling and audio understanding in Video-LLMs, and performs strongly on visual question answering tasks such as video question answering and video captioning.
🚀 Quick Start
You can get started with VideoLLaMA 2 inference using the following code example:
```python
import sys
sys.path.append('./')

from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # Video inference
    modal = 'video'
    modal_path = 'assets/cat_and_chicken.mp4'
    instruct = 'What animals are in the video, what are they doing, and how does the video feel?'

    # Image inference (uncomment to run the image example instead of the video one)
    # modal = 'image'
    # modal_path = 'assets/sora.png'
    # instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'

    model_path = 'DAMO-NLP-SG/VideoLLaMA2-8x7B'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)


if __name__ == "__main__":
    inference()
```
✨ Key Features
- Multimodal large language model: accepts multimodal inputs such as video and images, and handles visual question answering tasks.
- Advanced spatial-temporal modeling and audio understanding: improved spatial-temporal modeling and audio understanding for video inputs.
- Rich model selection: multiple model sizes and variants (e.g., Base and Chat) are available to suit different needs; see the sketch after this list.
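As a rough illustration of the Base/Chat choice, switching variants only changes the checkpoint path passed to `model_init`. This is a minimal sketch, not from the official documentation; the `-Base` suffix for the pretrained variant is an assumption based on the released checkpoint naming.

```python
from videollama2 import model_init

# Chat-tuned checkpoint (the one used in the Quick Start example above).
chat_path = 'DAMO-NLP-SG/VideoLLaMA2-8x7B'
# Pretrained Base variant; the '-Base' suffix is an assumed naming convention.
base_path = 'DAMO-NLP-SG/VideoLLaMA2-8x7B-Base'

# Load whichever variant you need; the rest of the inference code is unchanged.
model, processor, tokenizer = model_init(chat_path)
```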
📦 Model Information

| Attribute | Details |
| --- | --- |
| Model type | Multimodal large language model, large video-language model |
| Training datasets | OpenGVLab/VideoChat2-IT, Lin-Chen/ShareGPT4V, liuhaotian/LLaVA-Instruct-150K |
| Evaluation metric | Accuracy |
| Library | transformers |
| Task type | Visual question answering |
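Since the checkpoints are hosted on the Hugging Face Hub, the weights can be pre-fetched to a local cache with `huggingface_hub`. This is a minimal sketch; the repo id is the 8x7B checkpoint used in the examples above.

```python
from huggingface_hub import snapshot_download

# Pre-download the checkpoint used in the examples to the local Hugging Face cache.
local_dir = snapshot_download(repo_id='DAMO-NLP-SG/VideoLLaMA2-8x7B')
print(f'Checkpoint downloaded to: {local_dir}')
```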
💻 Usage Examples
Basic Usage
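The basic inference flow is identical to the Quick Start example above. As a small extension (a sketch using only the `model_init` / `mm_infer` API shown there, not an official script), the video and image examples can be run in a single pass:

```python
import sys
sys.path.append('./')

from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def run_all():
    disable_torch_init()
    model, processor, tokenizer = model_init('DAMO-NLP-SG/VideoLLaMA2-8x7B')

    # (modal, path, instruction) triples taken from the Quick Start example above.
    requests = [
        ('video', 'assets/cat_and_chicken.mp4',
         'What animals are in the video, what are they doing, and how does the video feel?'),
        ('image', 'assets/sora.png',
         'What is the woman wearing, what is she doing, and how does the image feel?'),
    ]

    for modal, modal_path, instruct in requests:
        output = mm_infer(processor[modal](modal_path), instruct,
                          model=model, tokenizer=tokenizer, do_sample=False, modal=modal)
        print(f'[{modal}] {output}')


if __name__ == '__main__':
    run_all()
```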
🚀 Main Results
Multiple-choice video QA and video captioning

Open-ended video QA

📚 Model Zoo
📄 License
This project is released under the Apache-2.0 license.
Citation
If you find VideoLLaMA useful for your research and applications, please cite it using the following BibTeX:
```bibtex
@article{damonlpsg2024videollama2,
  title   = {VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author  = {Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal = {arXiv preprint arXiv:2406.07476},
  year    = {2024},
  url     = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title   = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author  = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year    = {2023},
  url     = {https://arxiv.org/abs/2306.02858}
}
```