🚀 VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
VideoLLaMA 2 is a multimodal large language model that performs strongly on visual question answering tasks such as video question answering and video captioning, improving the spatial-temporal modeling and audio understanding capabilities of video large language models.
🚀 Quick Start
You can use the following code example to run VideoLLaMA 2 inference on a video (an image-inference variant is shown under Usage Examples below):
```python
import sys
sys.path.append('./')

from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # Video inference: describe what happens in a short clip.
    modal = 'video'
    modal_path = 'assets/cat_and_chicken.mp4'
    instruct = 'What animals are in the video, what are they doing, and how does the video feel?'

    # Load the model, the modality-specific processors, and the tokenizer.
    model_path = 'DAMO-NLP-SG/VideoLLaMA2-8x7B'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)


if __name__ == "__main__":
    inference()
```
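In this example, `processor[modal]` selects the preprocessor for the chosen modality and turns the file at `modal_path` into model inputs, while `do_sample=False` disables sampling so that the generated answer is deterministic.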
✨ Key Features
- Multimodal large language model: accepts multimodal inputs such as video and images and handles visual question answering tasks.
- Advanced spatial-temporal modeling and audio understanding: improves spatial-temporal modeling and audio understanding when processing video.
- A range of model choices: multiple model sizes and variants (e.g., Base and Chat) are available to suit different needs; see the sketch after this list.
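As a minimal sketch of how a different variant is selected, only the `model_path` passed to `model_init` needs to change; the rest of the Quick Start script stays the same. Note that the Base checkpoint name below is an assumption based on the project's naming scheme (only the Chat checkpoint `DAMO-NLP-SG/VideoLLaMA2-8x7B` appears in this card), so verify it against the model zoo before use.

```python
from videollama2 import model_init

# Chat checkpoint used in the inference examples in this card.
chat_model_path = 'DAMO-NLP-SG/VideoLLaMA2-8x7B'

# Assumed name of the corresponding Base checkpoint; verify it in the model zoo.
base_model_path = 'DAMO-NLP-SG/VideoLLaMA2-8x7B-Base'

# Only the path changes; loading and inference work the same way.
model, processor, tokenizer = model_init(chat_model_path)
```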
📦 Model Information

| Property | Details |
| --- | --- |
| Model type | Multimodal large language model, large video-language model |
| Training datasets | OpenGVLab/VideoChat2-IT, Lin-Chen/ShareGPT4V, liuhaotian/LLaVA-Instruct-150K |
| Evaluation metric | Accuracy |
| Library name | transformers |
| Task type | Visual question answering |
💻 Usage Examples
Basic usage
The script below mirrors the Quick Start example, but runs inference on an image instead of a video:
```python
import sys
sys.path.append('./')

from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # Image inference: describe the content and mood of a single image.
    modal = 'image'
    modal_path = 'assets/sora.png'
    instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'

    # Load the model, the modality-specific processors, and the tokenizer.
    model_path = 'DAMO-NLP-SG/VideoLLaMA2-8x7B'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)


if __name__ == "__main__":
    inference()
```
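Both scripts assume they are run from the root of the VideoLLaMA2 code repository: `sys.path.append('./')` only makes the `videollama2` package importable if it sits in the current working directory, and the example media files are expected under `assets/`.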
🚀 Main Results
Multiple-choice video QA and video captioning (results figure)

Open-ended video QA (results figure)
📚 Model Zoo
📄 License
This project is released under the Apache-2.0 license.
Citation
If you find VideoLLaMA useful for your research and applications, please cite it using the following BibTeX:
```bibtex
@article{damonlpsg2024videollama2,
  title   = {VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author  = {Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal = {arXiv preprint arXiv:2406.07476},
  year    = {2024},
  url     = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title   = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author  = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year    = {2023},
  url     = {https://arxiv.org/abs/2306.02858}
}
```