VideoLLaMA2.1-7B-16F-Base开源视频大模型 - 升级时空建模与音频理解能力

首页

Videollama2.1 7B 16F Base

由 DAMO-NLP-SG 开发

VideoLLaMA2.1是基于VideoLLaMA2的升级版本，专注于提升视频大语言模型中的时空建模与音频理解能力。

视频生成文本

Transformers

英语开源协议:Apache-2.0 #多模态视频理解 #时空建模增强 #音频视觉融合

下载量 179

发布时间 : 10/14/2024

模型简介

VideoLLaMA2.1是一个多模态大语言模型，专注于视频理解和视觉问答任务，支持对视频内容进行时空建模和音频理解。

模型特点

时空建模能力

增强了对视频中时空信息的理解和建模能力。

音频理解

提升了对视频中音频内容的理解能力。

多模态处理

能够同时处理视频和图像内容，并进行多模态推理。

模型能力

视频问答

图像问答

视频内容描述

多模态推理

使用案例

视频理解

视频内容问答

回答关于视频内容的复杂问题

在MLVU和VideoMME榜单中位列7B规模视频大模型榜首

视频内容描述

生成对视频内容的详细描述

图像理解

图像问答

回答关于图像内容的复杂问题

🚀 VideoLLaMA 2：推进视频大语言模型中的时空建模与音频理解

VideoLLaMA 2是一款多模态大语言模型，专注于视频领域，在时空建模和音频理解方面取得了显著进展，能有效处理视频问答、视频字幕生成等任务。

🚀 快速开始

本项目为视频大语言模型VideoLLaMA 2，如果你喜欢我们的项目，请在 Github 上给我们点个星 ⭐ 以获取最新更新。

📰 新闻动态

[2024.10.15] 发布 VideoLLaMA2.1-7B-16F-Base 和 VideoLLaMA2.1-7B-16F 的检查点。
[2024.08.14] 发布 VideoLLaMA2-72B-Base 和 VideoLLaMA2-72B 的检查点。
[2024.07.30] 发布 VideoLLaMA2-8x7B-Base 和 VideoLLaMA2-8x7B 的检查点。
[2024.06.25] 🔥🔥 截至6月25日，我们的 VideoLLaMA2-7B-16F 在 MLVU排行榜上的约70亿参数规模视频大语言模型中排名第一。
[2024.06.18] 🔥🔥 截至6月18日，我们的 VideoLLaMA2-7B-16F 在 VideoMME排行榜上的约70亿参数规模视频大语言模型中排名第一。
[2024.06.17] 👋👋 更新技术报告，包含最新结果和缺失的参考文献。如果您有与VideoLLaMA 2密切相关但未在论文中提及的工作，请随时告知我们。
[2024.06.14] 🔥🔥 在线演示已上线。
[2024.06.03] 发布VideoLLaMA 2的训练、评估和服务代码。

🌎 模型库

模型名称	类型	视觉编码器	语言解码器	训练帧数
VideoLLaMA2-7B-Base	基础版	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	8
VideoLLaMA2-7B	对话版	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	8
VideoLLaMA2-7B-16F-Base	基础版	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	16
VideoLLaMA2-7B-16F	对话版	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	16
VideoLLaMA2-8x7B-Base	基础版	clip-vit-large-patch14-336	Mixtral-8x7B-Instruct-v0.1	8
VideoLLaMA2-8x7B	对话版	clip-vit-large-patch14-336	Mixtral-8x7B-Instruct-v0.1	8
VideoLLaMA2-72B-Base	基础版	clip-vit-large-patch14-336	Qwen2-72B-Instruct	8
VideoLLaMA2-72B	对话版	clip-vit-large-patch14-336	Qwen2-72B-Instruct	8
VideoLLaMA2.1-7B-16F-Base (此检查点)	基础版	siglip-so400m-patch14-384	Qwen2-7B-Instruct	16
VideoLLaMA2.1-7B-16F	对话版	siglip-so400m-patch14-384	Qwen2-7B-Instruct	16

🚀 主要成果

多项选择视频问答与视频字幕生成

开放式视频问答

💻 使用示例

基础用法

import sys
sys.path.append('./')
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # 视频推理
    modal = 'video'
    modal_path = 'assets/cat_and_chicken.mp4' 
    instruct = '视频中有哪些动物，它们在做什么，视频给人的感觉如何？'
   
    # 图像推理
    modal = 'image'
    modal_path = 'assets/sora.png'
    instruct = '图片中的女人穿着什么，她在做什么，图片给人的感觉如何？'
    
    model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B-16F'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)

if __name__ == "__main__":
    inference()

📄 许可证

本项目采用Apache-2.0许可证。

引用信息

如果您发现VideoLLaMA对您的研究和应用有帮助，请使用以下BibTeX进行引用：

@article{damonlpsg2024videollama2,
  title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal={arXiv preprint arXiv:2406.07476},
  year={2024},
  url = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year = {2023},
  url = {https://arxiv.org/abs/2306.02858}
}