VideoLLaMA2-72B: an open-source multimodal model for visual question answering and dialogue over video and image input


VideoLLaMA2-72B

Developed by DAMO-NLP-SG

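VideoLLaMA 2 is a multimodal large language model specialized in video understanding and spatial-temporal modeling. It accepts video and image inputs and supports visual question answering and dialogue tasks.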

Video-to-Text

Transformers

Language: English

Open Source License: Apache-2.0

Tags: multimodal video understanding, enhanced spatial-temporal modeling, audio-visual fusion


Release Time: 8/13/2024

Model Overview

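VideoLLaMA 2 is a multimodal large language model focused on video understanding and spatial-temporal modeling. It combines a visual encoder with a language decoder to process video and image inputs and to perform tasks such as visual question answering and video description.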

Model Features

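Multimodal understanding

Processes both video and image inputs, understands their visual content, and converses about it in natural language

Spatial-temporal modeling

Specifically optimized to understand and process spatial-temporal information in videos

Large parameter scale

Built on a 72B-parameter language model for deep semantic understanding and generation

Instruction following

Instruction-tuned to understand and carry out a wide range of vision-related user instructions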
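
To put the 72B parameter count above in practical terms, the following back-of-the-envelope sketch estimates the memory needed just to hold the language model's weights at several numeric precisions. It does not come from the model authors: the helper name `weight_memory_gb` is hypothetical, the parameter count is rounded to exactly 72 billion, and the estimate ignores the visual encoder, activations, and the KV cache, so real requirements will be higher.

```python
# Rough weight-memory estimate for a 72B-parameter decoder.
# Assumption: exactly 72e9 parameters; weights only, no activations or KV cache.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the size of the weights alone in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gb(72e9, dtype):.0f} GB")
```

At 2 bytes per parameter (bf16 or fp16), the weights alone come to about 144 GB, which is why a model of this size is typically sharded across several accelerators or quantized.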

Model Capabilities

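Video question answering

Image question answering

Video content description

Image content description

Multimodal dialogue

Spatial-temporal relationship understanding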

Use Cases

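Video understanding

Video question answering

Answers questions about video content, including object recognition, action analysis, and scene understanding

Example: identifies the animals in a video and what they are doing, and describes the overall mood of the video

Video summarization

Automatically generates text descriptions and summaries of video content

Image understanding

Image question answering

Answers questions about image content, including object recognition, scene analysis, and emotion understanding

Example: describes what a person in an image is wearing and doing, and analyzes the emotional tone of the image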

🚀 VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

VideoLLaMA 2 is a multimodal large language model with improved spatial-temporal modeling and audio understanding for video. It supports tasks such as video question answering and video captioning.

If you like our project, please give us a star ⭐ on GitHub to receive the latest updates.

📰 News

[2024.06.12] Released the model weights and the first version of the technical report for VideoLLaMA 2.
[2024.06.03] Released the training, evaluation, and serving code for VideoLLaMA 2.

🌎 Model Zoo

Model Name	Type	Visual Encoder	Language Decoder	# Training Frames
VideoLLaMA2-7B-Base	Base	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	8
VideoLLaMA2-7B	Chat	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	8
VideoLLaMA2-7B-16F-Base	Base	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	16
VideoLLaMA2-7B-16F	Chat	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	16
VideoLLaMA2-8x7B-Base	Base	clip-vit-large-patch14-336	Mixtral-8x7B-Instruct-v0.1	8
VideoLLaMA2-8x7B	Chat	clip-vit-large-patch14-336	Mixtral-8x7B-Instruct-v0.1	8
VideoLLaMA2-72B-Base	Base	clip-vit-large-patch14-336	Qwen2-72B-Instruct	8
VideoLLaMA2-72B (this checkpoint)	Chat	clip-vit-large-patch14-336	Qwen2-72B-Instruct	8
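
The table above can also be expressed as a small lookup structure, which may be convenient when a script needs to pick a checkpoint by type or by number of training frames. This is an illustrative sketch only: `Checkpoint`, `MODEL_ZOO`, and `find_checkpoints` are hypothetical names that are not part of the `videollama2` package, and the entries simply mirror the table rows (with "Base" and "Chat" as the two type values).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Checkpoint:
    name: str
    kind: str               # "Base" or "Chat", mirroring the type column
    visual_encoder: str
    language_decoder: str
    train_frames: int       # number of frames used during training


_CLIP = "clip-vit-large-patch14-336"

# Hypothetical registry; the values are copied from the table, nothing more.
MODEL_ZOO = [
    Checkpoint("VideoLLaMA2-7B-Base", "Base", _CLIP, "Mistral-7B-Instruct-v0.2", 8),
    Checkpoint("VideoLLaMA2-7B", "Chat", _CLIP, "Mistral-7B-Instruct-v0.2", 8),
    Checkpoint("VideoLLaMA2-7B-16F-Base", "Base", _CLIP, "Mistral-7B-Instruct-v0.2", 16),
    Checkpoint("VideoLLaMA2-7B-16F", "Chat", _CLIP, "Mistral-7B-Instruct-v0.2", 16),
    Checkpoint("VideoLLaMA2-8x7B-Base", "Base", _CLIP, "Mixtral-8x7B-Instruct-v0.1", 8),
    Checkpoint("VideoLLaMA2-8x7B", "Chat", _CLIP, "Mixtral-8x7B-Instruct-v0.1", 8),
    Checkpoint("VideoLLaMA2-72B-Base", "Base", _CLIP, "Qwen2-72B-Instruct", 8),
    Checkpoint("VideoLLaMA2-72B", "Chat", _CLIP, "Qwen2-72B-Instruct", 8),
]


def find_checkpoints(kind: Optional[str] = None,
                     train_frames: Optional[int] = None) -> List[Checkpoint]:
    """Filter the registry; a filter left as None matches every row."""
    return [
        c for c in MODEL_ZOO
        if (kind is None or c.kind == kind)
        and (train_frames is None or c.train_frames == train_frames)
    ]
```

For instance, `find_checkpoints(kind="Chat", train_frames=16)` returns only the `VideoLLaMA2-7B-16F` entry, since it is the single chat model trained with 16 frames.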

🚀 Main Results

[Figure: Multiple-choice video QA and video captioning results]

[Figure: Open-ended video QA results]

💻 Usage Example

Basic usage

import sys
sys.path.append('./')  # assumes the script runs from the root of the VideoLLaMA2 repository
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference(modal='video'):
    disable_torch_init()

    if modal == 'video':
        # Video inference
        modal_path = 'assets/cat_and_chicken.mp4'
        instruct = 'What animals are in the video, what are they doing, and how does the video feel?'
    elif modal == 'image':
        # Image inference
        modal_path = 'assets/sora.png'
        instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'
    else:
        raise ValueError(f"modal must be 'video' or 'image', got {modal!r}")

    model_path = 'DAMO-NLP-SG/VideoLLaMA2-72B'
    model, processor, tokenizer = model_init(model_path)
    # Preprocess the input with the processor for the chosen modality, then generate with sampling disabled.
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)

if __name__ == "__main__":
    inference('video')  # pass 'image' to run the image example instead
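
The example above sets `modal` by hand. When inputs come from user-supplied paths, one option is to infer the modality from the file extension before indexing `processor[modal]`. The helper below is a hypothetical sketch, not part of the `videollama2` package, and the extension sets are assumptions that may not match every format the processors can actually decode.

```python
from pathlib import Path

# Assumed extension sets; adjust them to the formats your decoding backend supports.
VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv", ".webm"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}


def guess_modal(path: str) -> str:
    """Return 'video' or 'image' based on the file extension (case-insensitive)."""
    ext = Path(path).suffix.lower()
    if ext in VIDEO_EXTS:
        return "video"
    if ext in IMAGE_EXTS:
        return "image"
    raise ValueError(f"Cannot infer modality from extension {ext!r}: {path}")
```

With the two sample assets used above, `guess_modal('assets/cat_and_chicken.mp4')` returns `'video'` and `guess_modal('assets/sora.png')` returns `'image'`. Raising on unknown extensions, rather than defaulting to one modality, keeps a mistyped path from being silently sent to the wrong processor.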

Citation

If you find VideoLLaMA useful for your research and applications, please cite it using the following BibTeX:

@article{damonlpsg2024videollama2,
  title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal={arXiv preprint arXiv:2406.07476},
  year={2024},
  url = {https://arxiv.org/abs/2406.07476}
}
@article{damonlpsg2023videollama,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year = {2023},
  url = {https://arxiv.org/abs/2306.02858}
}