VideoLLaMA2-8x7B: a free, open-source multimodal model for video understanding, audio processing, and image-text dialogue

VideoLLaMA2-8x7B

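Developed by DAMO-NLP-SG

VideoLLaMA 2 is a multimodal large language model specialized for video understanding and audio processing. It takes video or image input and generates natural-language responses.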

Video-to-Text

Transformers

Language: English

Open-source license: Apache-2.0

Tags: #Multimodal video understanding #Enhanced spatial-temporal modeling #Audio-visual fusion

Downloads: 21

Release date: 6/11/2024

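Model Overview

VideoLLaMA 2 is an advanced multimodal large language model focused on video understanding tasks. It combines a visual encoder with a language decoder to process video or image input and generate relevant natural-language responses. It shows notable improvements in spatial-temporal modeling and audio understanding.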

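Model Features

Spatial-temporal modeling

Improved understanding of spatial and temporal relationships within a video

Audio understanding

Stronger processing of the audio track in a video

Multimodal fusion

Integrates visual and language information effectively for reasoning

Multi-frame processing

Supports 8-frame or 16-frame video input, which strengthens its grasp of temporal continuity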
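To make the fixed 8-frame or 16-frame input concrete, here is a minimal sketch of how a clip of arbitrary length could be reduced to a fixed number of frames. It assumes uniform sampling (the centre of each equal-length segment); the page does not state which sampling strategy VideoLLaMA 2 uses, and `sample_frame_indices` is a hypothetical helper, not part of the `videollama2` package.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_frames: int = 8) -> List[int]:
    """Pick num_frames indices spread evenly over a clip of total_frames frames.

    Assumption: uniform sampling, shown for illustration only.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if total_frames <= num_frames:
        # Short clip: keep every frame instead of repeating any.
        return list(range(total_frames))
    # Split the clip into num_frames equal segments and take each segment's centre.
    segment = total_frames / num_frames
    return [int(segment * i + segment / 2) for i in range(num_frames)]


if __name__ == "__main__":
    # A 240-frame clip reduced to 8 frames.
    print(sample_frame_indices(240, 8))   # [15, 45, 75, 105, 135, 165, 195, 225]
    # The same clip reduced to 16 frames gives denser temporal coverage.
    print(sample_frame_indices(240, 16))
```

Sampling 16 frames instead of 8 halves the gap between sampled frames, which is one plausible reason a 16-frame variant would capture short actions more reliably.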

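Model Capabilities

Video question answering

Image question answering

Video description generation

Multimodal reasoning

Spatial-temporal relationship understanding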

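Use Cases

Video understanding

Video question answering

Answers a wide range of questions about the content of a video

Can accurately identify objects, actions, and scenes in the video

Video summarization

Generates a text description of the video content

Can produce coherent, accurate video descriptions

Image understanding

Image question answering

Answers a wide range of questions about the content of an image

Can accurately describe objects, scenes, and emotions in the image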

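🚀 VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video LLMs

VideoLLaMA 2 is a multimodal large language model with improved spatial-temporal modeling and audio understanding for video. It performs strongly across a variety of video-related tasks.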

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

If you like our project, please give us a star ⭐ on GitHub to keep up with the latest updates.

📰 News

[2024.06.12] Released the model weights and the first version of the technical report for VideoLLaMA 2.
[2024.06.03] Released the training, evaluation, and serving code for VideoLLaMA 2.

🌎 Model Zoo

Model Name	Type	Visual Encoder	Language Decoder	# Training Frames
VideoLLaMA2-7B-Base	Base	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	8
VideoLLaMA2-7B	Chat	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	8
VideoLLaMA2-7B-16F-Base	Base	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	16
VideoLLaMA2-7B-16F	Chat	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	16
VideoLLaMA2-8x7B-Base	Base	clip-vit-large-patch14-336	Mixtral-8x7B-Instruct-v0.1	8
VideoLLaMA2-8x7B (this checkpoint)	Chat	clip-vit-large-patch14-336	Mixtral-8x7B-Instruct-v0.1	8
VideoLLaMA2-72B-Base	Base	clip-vit-large-patch14-336	Qwen2-72B-Instruct	8
VideoLLaMA2-72B	Chat	clip-vit-large-patch14-336	Qwen2-72B-Instruct	8
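When choosing a checkpoint programmatically, the table above can be treated as a small lookup structure. The sketch below copies its rows into Python and filters by decoder, frame count, and type. `Checkpoint`, `find_checkpoint`, and `hub_id` are hypothetical names introduced for illustration. Only the ID `DAMO-NLP-SG/VideoLLaMA2-8x7B` appears on this page; the assumption that the other checkpoints follow the same `DAMO-NLP-SG/<model name>` pattern is not confirmed here.

```python
from typing import List, NamedTuple, Optional


class Checkpoint(NamedTuple):
    name: str
    kind: str      # "Base" or "Chat"
    decoder: str   # language decoder
    frames: int    # number of training frames


# Rows copied from the Model Zoo table; every row uses clip-vit-large-patch14-336.
MODEL_ZOO: List[Checkpoint] = [
    Checkpoint("VideoLLaMA2-7B-Base", "Base", "Mistral-7B-Instruct-v0.2", 8),
    Checkpoint("VideoLLaMA2-7B", "Chat", "Mistral-7B-Instruct-v0.2", 8),
    Checkpoint("VideoLLaMA2-7B-16F-Base", "Base", "Mistral-7B-Instruct-v0.2", 16),
    Checkpoint("VideoLLaMA2-7B-16F", "Chat", "Mistral-7B-Instruct-v0.2", 16),
    Checkpoint("VideoLLaMA2-8x7B-Base", "Base", "Mixtral-8x7B-Instruct-v0.1", 8),
    Checkpoint("VideoLLaMA2-8x7B", "Chat", "Mixtral-8x7B-Instruct-v0.1", 8),
    Checkpoint("VideoLLaMA2-72B-Base", "Base", "Qwen2-72B-Instruct", 8),
    Checkpoint("VideoLLaMA2-72B", "Chat", "Qwen2-72B-Instruct", 8),
]


def find_checkpoint(decoder_keyword: str, frames: int = 8,
                    kind: str = "Chat") -> Optional[Checkpoint]:
    """Return the first row matching the decoder keyword, frame count, and type."""
    keyword = decoder_keyword.lower()
    for ckpt in MODEL_ZOO:
        if keyword in ckpt.decoder.lower() and ckpt.frames == frames and ckpt.kind == kind:
            return ckpt
    return None


def hub_id(ckpt: Checkpoint, org: str = "DAMO-NLP-SG") -> str:
    # Assumption: repository IDs follow "<org>/<model name>", as the 8x7B chat model does.
    return f"{org}/{ckpt.name}"


if __name__ == "__main__":
    chat_8x7b = find_checkpoint("Mixtral")
    print(hub_id(chat_8x7b))  # DAMO-NLP-SG/VideoLLaMA2-8x7B
```

A combination with no row in the table, such as a 16-frame Qwen2 model, returns `None` instead of guessing a name.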

🚀 Main Results

Multiple-choice video QA and video captioning

Open-ended video QA

💻 Usage Examples

Basic usage

import sys
# Make the local videollama2 package importable when run from the repository root.
sys.path.append('./')
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference(modal='video'):
    """Run one inference. Pass modal='video' or modal='image'."""
    disable_torch_init()

    if modal == 'video':
        # Video inference
        modal_path = 'assets/cat_and_chicken.mp4'
        instruct = 'What animals are in the video, what are they doing, and how does the video feel?'
    else:
        # Image inference
        modal_path = 'assets/sora.png'
        instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'

    model_path = 'DAMO-NLP-SG/VideoLLaMA2-8x7B'
    model, processor, tokenizer = model_init(model_path)
    # processor is indexed by modality; do_sample=False selects greedy decoding.
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)

if __name__ == "__main__":
    # In the original listing the image settings silently overwrote the video ones;
    # choose the modality explicitly instead.
    inference('video')
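Because the script above needs a `modal` string that matches the input file, a small helper can derive it from the file extension. This is a minimal sketch: `guess_modal` and the two extension sets are hypothetical and chosen for illustration, not taken from the `videollama2` package, and the formats the processors actually accept may differ.

```python
from pathlib import Path

# Assumed extension lists, for illustration only.
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv", ".webm"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}


def guess_modal(path: str) -> str:
    """Return 'video' or 'image' based on the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    raise ValueError(f"Cannot infer modality from extension {suffix!r}")


if __name__ == "__main__":
    for sample in ("assets/cat_and_chicken.mp4", "assets/sora.png"):
        print(sample, "->", guess_modal(sample))
```

The returned string can then be passed as `modal` and used as the key into `processor`, as the script above does with `processor[modal]`.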

Citation

If you find VideoLLaMA useful for your research or applications, please cite it using the following BibTeX:

@article{damonlpsg2024videollama2,
  title = {VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author = {Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal = {arXiv preprint arXiv:2406.07476},
  year = {2024},
  url = {https://arxiv.org/abs/2406.07476}
}
@article{damonlpsg2023videollama,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year = {2023},
  url = {https://arxiv.org/abs/2306.02858}
}