VideoLLaMA 2, a free open-source multimodal large model with improved spatial-temporal video modeling and audio understanding

VideoLLaMA2-7B-16F-Base

Developed by DAMO-NLP-SG

VideoLLaMA 2 is a multimodal large language model focused on improving spatial-temporal modeling and audio understanding for video.

Text generation from video

Transformers

English | Open Source License: Apache-2.0 | #Multimodal video question answering #Enhanced spatial-temporal modeling #Audio-visual fusion

Release Time : 6/11/2024

Model Overview

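VideoLLaMA 2 is a multimodal large language model built on the Mistral-7B-Instruct-v0.2 language decoder and the CLIP-ViT-Large visual encoder. It supports understanding and question answering for both videos and images.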

Model Features

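Spatial-temporal modeling

An improved architecture strengthens the model's understanding of spatial and temporal information in video.

Audio understanding

Supports understanding and analysis of the audio track in a video.

Multimodal support

Supports understanding and question answering for both videos and images.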

Model Capabilities

Video question answering

Image question answering

Multimodal understanding

Spatial-temporal information analysis

Use Cases

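Video understanding

Video content question answering

Answers questions about video content and identifies the objects, actions, and emotions in the video.

Can accurately identify objects and actions in a video and describe its emotional atmosphere.

Image understanding

Image content question answering

Answers questions about image content and identifies the objects, actions, and emotions in the image.

Can accurately identify objects and actions in an image and describe its emotional atmosphere.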

🚀 VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

VideoLLaMA 2 is a multimodal large language model with improved spatial-temporal modeling and audio understanding for video. It handles tasks such as video question answering and video captioning.

If you like this project, please give it a star ⭐ on GitHub to follow the latest updates.

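🚀 Quick Start

Running inference

The following code runs inference with VideoLLaMA 2. Because it adds the current directory to the import path and reads files under assets/, it is presumably meant to be run from the root of the VideoLLaMA 2 repository.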

import sys
sys.path.append('./')
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # Load the checkpoint once and reuse it for every input.
    model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B-16F-Base'
    model, processor, tokenizer = model_init(model_path)

    # Each example is (modality, input path, instruction); both are run in turn.
    examples = [
        # Video inference
        ('video', 'assets/cat_and_chicken.mp4',
         'What animals are in the video, what are they doing, and how does the video feel?'),
        # Image inference
        ('image', 'assets/sora.png',
         'What is the woman wearing, what is she doing, and how does the image feel?'),
    ]

    for modal, modal_path, instruct in examples:
        # processor[modal] preprocesses the file for the chosen modality.
        output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)
        print(output)


if __name__ == "__main__":
    inference()
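
When one script has to handle both videos and images, the `modal` value can be derived from the input path instead of being set by hand. The following is a minimal sketch of that idea and is not part of the VideoLLaMA 2 codebase: `guess_modal` is a hypothetical helper, and the extension lists are assumptions that should be adjusted to the formats your setup can actually decode.

```python
from pathlib import Path

# Assumed extension lists (not taken from VideoLLaMA 2); adjust as needed.
VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv", ".webm"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}


def guess_modal(path: str) -> str:
    """Hypothetical helper: return 'video' or 'image' from the file extension."""
    ext = Path(path).suffix.lower()
    if ext in VIDEO_EXTS:
        return "video"
    if ext in IMAGE_EXTS:
        return "image"
    # Fail loudly rather than silently picking a modality.
    raise ValueError(f"Cannot infer modality from extension {ext!r}: {path}")


if __name__ == "__main__":
    for p in ("assets/cat_and_chicken.mp4", "assets/sora.png"):
        print(p, "->", guess_modal(p))
```

The returned string can be used as the `modal` value in the inference code above, for example `modal = guess_modal(modal_path)`.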

✨ Key Features

Handles video tasks such as question answering and captioning
Also supports image question answering

📚 Documentation

📰 News

[2024.06.12] Released the model weights and the first version of the technical report for VideoLLaMA 2.
[2024.06.03] Released the training, evaluation, and serving code for VideoLLaMA 2.

🌎 Model Zoo

Model Name	Type	Visual Encoder	Language Decoder	Training Frames
VideoLLaMA2-7B-Base	Base	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	8
VideoLLaMA2-7B	Chat	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	8
VideoLLaMA2-7B-16F-Base (this checkpoint)	Base	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	16
VideoLLaMA2-7B-16F	Chat	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	16
VideoLLaMA2-8x7B-Base	Base	clip-vit-large-patch14-336	Mixtral-8x7B-Instruct-v0.1	8
VideoLLaMA2-8x7B	Chat	clip-vit-large-patch14-336	Mixtral-8x7B-Instruct-v0.1	8
VideoLLaMA2-72B-Base	Base	clip-vit-large-patch14-336	Qwen2-72B-Instruct	8
VideoLLaMA2-72B	Chat	clip-vit-large-patch14-336	Qwen2-72B-Instruct	8
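
The table lists the number of training frames per checkpoint (8 or 16). The source does not say how those frames are selected from a video, so the sketch below assumes uniform sampling, a common choice, purely to illustrate what a budget of 8 versus 16 frames means for temporal coverage. `sample_frame_indices` is a hypothetical helper and is not part of the VideoLLaMA 2 codebase.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Hypothetical helper: pick num_frames indices spread evenly over a clip.

    Assumes uniform sampling that includes the first and last frame.
    If the clip has fewer frames than requested, some indices repeat.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames == 1:
        return [0]
    # Evenly spaced positions from 0 to total_frames - 1, rounded to integers.
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


if __name__ == "__main__":
    # Compare an 8-frame and a 16-frame budget on the same 301-frame clip.
    print(sample_frame_indices(301, 8))
    print(sample_frame_indices(301, 16))
```

Under this assumption, a 301-frame clip sampled at 16 frames yields one index every 20 frames (0, 20, 40, and so on up to 300), while 8 frames over the same clip are spaced more than twice as far apart.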

🚀 Main Results

Multi-choice video QA and video captioning

Open-ended video QA

📄 License

This project is released under the Apache-2.0 license.

Citation

If you find VideoLLaMA useful for your research or applications, please cite it using the following BibTeX entries.

@article{damonlpsg2024videollama2,
  title = {VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author = {Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal = {arXiv preprint arXiv:2406.07476},
  year = {2024},
  url = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year = {2023},
  url = {https://arxiv.org/abs/2306.02858}
}