VideoLLaMA2-8x7B-Base: an open-source video large language model that supports video question answering and captioning, with improved audio-visual understanding

VideoLLaMA2-8x7B-Base

Developed by DAMO-NLP-SG

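VideoLLaMA 2 is a next-generation video large language model that focuses on improved spatial-temporal modeling and audio understanding, and supports multimodal video question answering and captioning tasks.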

Text-to-Video

Transformers

English

Open Source License: Apache-2.0

#Multimodal video understanding #Spatial-temporal modeling optimization #Audio-enhanced analysis

Downloads 20

Release Time: 6/11/2024

Model Overview

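VideoLLaMA 2 is a multimodal large language model designed specifically for video content. It can understand and analyze both the spatial-temporal information and the audio in a video.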

Model Features

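Enhanced spatial-temporal modeling

Improved ability to understand and process spatial-temporal information in videos

Audio understanding

Adds the ability to understand and analyze audio content in videos

Multi-frame processing

Processes 8 or 16 video frames at a time

Multimodal fusion

Effectively fuses visual, audio, and text information for comprehensive understanding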
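
The multi-frame processing feature above implies that a clip is reduced to a fixed number of frames (8 or 16) before it reaches the model. A minimal sketch of one common way to pick those frames, evenly spaced indices, is shown below. The function name `sample_frame_indices` is hypothetical, and uniform sampling is an assumption: the source does not state which sampling strategy VideoLLaMA 2 actually uses.

```python
def sample_frame_indices(total_frames: int, num_frames: int = 8) -> list[int]:
    """Return up to `num_frames` evenly spaced frame indices in [0, total_frames).

    Illustrative only: uniform sampling is an assumption, not the
    documented behavior of VideoLLaMA 2.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if total_frames <= num_frames:
        # Short clip: use every available frame.
        return list(range(total_frames))
    # Split the clip into `num_frames` equal segments and take each midpoint.
    step = total_frames / num_frames
    return [int(step * i + step / 2) for i in range(num_frames)]
```

For an 80-frame clip and the default of 8 frames, this selects the midpoint of each 10-frame segment. Passing `num_frames=16` would correspond to the 16F checkpoints.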

Model Capabilities

Video question answering

Video caption generation

Multimodal understanding

Spatial-temporal information analysis

Audio content understanding

Use Cases

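Video content understanding

Video question answering system

Answers a wide range of questions about video content

Performs strongly on multiple video question answering benchmarks

Automatic video captioning

Generates detailed text descriptions of videos

Can accurately describe key events and scenes in a video

Multimodal analysis

Video content analysis

Jointly analyzes the visual and audio information in a video

Can understand complex multimodal video content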

🚀 VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

VideoLLaMA 2 is a multimodal large language model with enhanced spatial-temporal modeling and audio understanding for video. It performs strongly on tasks such as video question answering and captioning.

If you like our project, please give us a star ⭐ on GitHub to receive the latest updates.

📰 News

[2024.06.12] Released the model weights and the first version of the technical report for VideoLLaMA 2.
[2024.06.03] Released the training, evaluation, and serving code for VideoLLaMA 2.

🌎 Model Zoo

Model Name	Type	Visual Encoder	Language Decoder	# Training Frames
VideoLLaMA2-7B-Base	Base	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	8
VideoLLaMA2-7B	Chat	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	8
VideoLLaMA2-7B-16F-Base	Base	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	16
VideoLLaMA2-7B-16F	Chat	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	16
VideoLLaMA2-8x7B-Base (this checkpoint)	Base	clip-vit-large-patch14-336	Mixtral-8x7B-Instruct-v0.1	8
VideoLLaMA2-8x7B	Chat	clip-vit-large-patch14-336	Mixtral-8x7B-Instruct-v0.1	8
VideoLLaMA2-72B-Base	Base	clip-vit-large-patch14-336	Qwen2-72B-Instruct	8
VideoLLaMA2-72B	Chat	clip-vit-large-patch14-336	Qwen2-72B-Instruct	8
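
When choosing among the variants in the table above, it can help to have the rows in a form that code can filter. The sketch below simply transcribes the table into a dictionary. The names `Checkpoint`, `MODEL_ZOO`, and `find_checkpoints` are hypothetical helpers for illustration and are not part of the `videollama2` package.

```python
from typing import NamedTuple, Optional


class Checkpoint(NamedTuple):
    kind: str              # "Base" or "Chat", as in the Type column
    language_decoder: str
    training_frames: int


# Every row in the table uses the same visual encoder.
VISUAL_ENCODER = "clip-vit-large-patch14-336"

# Transcribed from the model zoo table; keys are the model names.
MODEL_ZOO = {
    "VideoLLaMA2-7B-Base": Checkpoint("Base", "Mistral-7B-Instruct-v0.2", 8),
    "VideoLLaMA2-7B": Checkpoint("Chat", "Mistral-7B-Instruct-v0.2", 8),
    "VideoLLaMA2-7B-16F-Base": Checkpoint("Base", "Mistral-7B-Instruct-v0.2", 16),
    "VideoLLaMA2-7B-16F": Checkpoint("Chat", "Mistral-7B-Instruct-v0.2", 16),
    "VideoLLaMA2-8x7B-Base": Checkpoint("Base", "Mixtral-8x7B-Instruct-v0.1", 8),
    "VideoLLaMA2-8x7B": Checkpoint("Chat", "Mixtral-8x7B-Instruct-v0.1", 8),
    "VideoLLaMA2-72B-Base": Checkpoint("Base", "Qwen2-72B-Instruct", 8),
    "VideoLLaMA2-72B": Checkpoint("Chat", "Qwen2-72B-Instruct", 8),
}


def find_checkpoints(kind: Optional[str] = None,
                     training_frames: Optional[int] = None) -> list[str]:
    """Return model names matching the given filters (None means 'any')."""
    return [
        name
        for name, ckpt in MODEL_ZOO.items()
        if (kind is None or ckpt.kind == kind)
        and (training_frames is None or ckpt.training_frames == training_frames)
    ]
```

For example, `find_checkpoints(kind="Base", training_frames=16)` narrows the table to the single 16-frame base model, `VideoLLaMA2-7B-16F-Base`.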

🚀 Main Results

Multi-choice video QA and video captioning

Open-ended video QA

💻 Usage Example

Basic usage

import sys
sys.path.append('./')
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # Video inference (default)
    modal = 'video'
    modal_path = 'assets/cat_and_chicken.mp4'
    instruct = 'What animals are in the video, what are they doing, and how does the video feel?'

    # Image inference: uncomment the three lines below to run on an image instead.
    # Left active, they would silently override the video settings above.
    # modal = 'image'
    # modal_path = 'assets/sora.png'
    # instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'

    model_path = 'DAMO-NLP-SG/VideoLLaMA2-8x7B-Base'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)

if __name__ == "__main__":
    inference()
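
In the example above, `modal` has to agree with the file passed to `processor[modal]`. One way to avoid setting it by hand is to derive it from the file extension, as in the sketch below. The helper `guess_modal` and the two extension sets are hypothetical and illustrative. They are not a list of formats that `videollama2` is documented to support.

```python
from pathlib import Path

# Illustrative extension sets (assumptions, not taken from the videollama2 package).
VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv", ".webm"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}


def guess_modal(path: str) -> str:
    """Return 'video' or 'image' based on the file extension of `path`."""
    suffix = Path(path).suffix.lower()
    if suffix in VIDEO_EXTS:
        return "video"
    if suffix in IMAGE_EXTS:
        return "image"
    raise ValueError(f"Cannot infer modality from extension: {suffix!r}")
```

With this helper, the call site would become `modal = guess_modal(modal_path)` followed by the same `mm_infer(processor[modal](modal_path), ...)` call as before.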

Citation

If you find VideoLLaMA useful for your research and applications, please cite it using the following BibTeX entries.

@article{damonlpsg2024videollama2,
  title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal={arXiv preprint arXiv:2406.07476},
  year={2024},
  url = {https://arxiv.org/abs/2406.07476}
}
@article{damonlpsg2023videollama,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year = {2023},
  url = {https://arxiv.org/abs/2306.02858}
}