InternVideo2-Chat-8B: a free, open-source video understanding model for semantic video understanding and human-machine interaction

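Developed by OpenGVLab

InternVideo2-Chat-8B is a video understanding model that couples a large language model (LLM) with a video BLIP module. It is built with a progressive learning scheme and supports semantic video understanding and interaction with humans.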

Video-to-Text

Transformers

English

Open-source license: MIT

#Video semantic understanding #Progressive learning #Multimodal interaction

Downloads: 492

Release date: 8/1/2024

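Model overview

The model uses InternVideo2 as its video encoder and pairs it with a large language model such as Mistral-7B. The resulting VideoLLM is fine-tuned so that it captures the semantics of a video better and is easier for people to interact with.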

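Model features

Progressive learning scheme

Follows the progressive learning scheme of VideoChat: a video BLIP module is trained to communicate with an open-source LLM, and the video encoder is also updated during training.

Strong video understanding

Scores 60.3 on MVBench and 41.9 on VideoMME (without subtitles); see the benchmark table below.

Multimodal interaction

Takes video and text input together and supports multimodal tasks such as video description and video question answering.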

Model capabilities

Video content understanding

Video question answering

Video description

Multimodal interaction

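Use cases

Video analysis

Video description

Describes the content of a video in detail, including the actions performed and the scene.

Example output: In the video, a woman practices yoga on a rooftop overlooking the mountains. She starts on her hands and knees, moves into a downward dog position, and finishes in a standing position.

Video question answering

Answers specific questions about the video, such as what a person is wearing or the details of an action.

Example output: The woman in the video is wearing a black tank top and grey yoga pants.

Human interaction

Natural-language interaction

Interact with the model in natural language to obtain detailed information about the video content.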

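🚀 InternVideo2-Chat-8B

To further enrich the semantics embedded in InternVideo2 and make it easier to use in communication with humans, InternVideo2 is integrated into a VideoLLM and tuned together with an LLM and a video BLIP. Following the progressive learning scheme of VideoChat, InternVideo2 serves as the video encoder, and a video BLIP is trained to communicate with an open-source LLM. The video encoder is also updated during training. The detailed training recipe is described in VideoChat.

The base LLM of this model is Mistral-7B. Before using the model, make sure you have been granted access to Mistral-7B. If you have not, request access on the Mistral-7B page and add your token to the environment as HF_TOKEN.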

[📂 GitHub] [📜 Tech Report] [🗨️ Chat Demo]

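🚀 Quick start

How to use the model

1. Request access to this project and to the base LLM.

2. Set your HF user access token as an environment variable.

export HF_TOKEN=hf_....

If you do not know how to obtain a token starting with "hf_", see How to Get HF User access Token.

3. Make sure that transformers >= 4.39.0 and peft==0.5.0 are installed.

pip install transformers==4.39.1
pip install peft==0.5.0
pip install timm easydict einops

Install the remaining required Python packages from pip_requirements. The inference code below also imports decord and torchvision.

4. Run inference with a video input.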

import os

import torch
from transformers import AutoTokenizer, AutoModel

# Raises KeyError early if HF_TOKEN has not been exported (see step 2).
token = os.environ['HF_TOKEN']

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVideo2-Chat-8B',
    trust_remote_code=True,
    use_fast=False)

model = AutoModel.from_pretrained(
    'OpenGVLab/InternVideo2-Chat-8B',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).cuda()

import decord
import numpy as np
from decord import VideoReader, cpu
from torchvision import transforms

# Have decord return torch tensors instead of its native array type.
decord.bridge.set_bridge("torch")

def get_index(num_frames, num_segments):
    seg_size = float(num_frames - 1) / num_segments
    start = int(seg_size / 2)
    offsets = np.array([
        start + int(np.round(seg_size * idx)) for idx in range(num_segments)
    ])
    return offsets


def load_video(video_path, num_segments=8, return_msg=False, resolution=224, hd_num=4, padding=False):
    # hd_num and padding are accepted but not used by this function.
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    num_frames = len(vr)
    frame_indices = get_index(num_frames, num_segments)

    # ImageNet channel statistics used for normalization.
    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)

    transform = transforms.Compose([
        transforms.Lambda(lambda x: x.float().div(255.0)),
        transforms.Resize(resolution, interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.CenterCrop(resolution),
        transforms.Normalize(mean, std)
    ])

    # decord yields (T, H, W, C); the transforms expect (T, C, H, W).
    frames = vr.get_batch(frame_indices)
    frames = frames.permute(0, 3, 1, 2)
    frames = transform(frames)

    if return_msg:
        fps = float(vr.get_avg_fps())
        sec = ", ".join([str(round(f / fps, 1)) for f in frame_indices])
        # A space (" ") should be added at the start and end of the message.
        msg = f"The video contains {len(frame_indices)} frames sampled at {sec} seconds."
        return frames, msg
    return frames

video_path = "yoga.mp4"
# Sample 8 frames uniformly from the video.
video_tensor = load_video(video_path, num_segments=8, return_msg=False)
video_tensor = video_tensor.to(model.device)

chat_history = []
response, chat_history = model.chat(
    tokenizer, '', 'describe the action step by step.',
    media_type='video', media_tensor=video_tensor,
    chat_history=chat_history, return_history=True,
    generation_config={'do_sample': False})
print(response)
# The video shows a woman performing yoga on a rooftop with a beautiful view of the mountains in the background. She starts by standing on her hands and knees, then moves into a downward dog position, and finally ends with a standing position. Throughout the video, she maintains a steady and fluid movement, focusing on her breath and alignment. The video is a great example of how yoga can be practiced in different environments and how it can be a great way to connect with nature and find inner peace.

response, chat_history = model.chat(
    tokenizer, '', 'What is she wearing?',
    media_type='video', media_tensor=video_tensor,
    chat_history=chat_history, return_history=True,
    generation_config={'do_sample': False})
print(response)
# The woman in the video is wearing a black tank top and grey yoga pants.
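
To make the frame sampling in get_index concrete, here is a small self-contained sketch that reproduces the same arithmetic without NumPy or a video file. The helper names sample_indices and timestamps, the 100-frame clip length, and the 25 fps rate are illustrative assumptions and are not part of the model's API. Python's built-in round stands in for np.round; both round exact halves to the nearest even integer.

```python
def sample_indices(num_frames, num_segments):
    """Pick one frame index per segment, mirroring get_index above."""
    seg_size = float(num_frames - 1) / num_segments
    start = int(seg_size / 2)
    return [start + int(round(seg_size * idx)) for idx in range(num_segments)]


def timestamps(indices, fps):
    """Convert frame indices to seconds, rounded to one decimal place."""
    return [round(i / fps, 1) for i in indices]


# Hypothetical clip: 100 frames at 25 fps, sampled into 8 segments.
indices = sample_indices(100, 8)
print(indices)                  # [6, 18, 31, 43, 56, 68, 80, 93]
print(timestamps(indices, 25))  # [0.2, 0.7, 1.2, 1.7, 2.2, 2.7, 3.2, 3.7]
```

The first sampled frame sits about half a segment into the clip and the rest follow at a roughly constant stride, so the frames cover the whole video rather than only its beginning.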

✨ Main features

The model is built on InternVideo2 and strengthens conversational video understanding. It can answer natural-language questions about a video and describe what the video shows.

📦 Installation

Install the required packages by following the steps in the "Quick start" section above.
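
Before loading the model, it can help to confirm that the prerequisites from the quick start are in place. The following is a minimal sketch that uses only the standard library. The function names check_environment, installed_versions, and parse_version are hypothetical helpers written for this illustration; the thresholds (transformers >= 4.39.0, peft==0.5.0, a token starting with "hf_") are the ones stated in the quick start.

```python
import os
from importlib import metadata


def parse_version(text):
    """Turn '4.39.1' into (4, 39, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    while len(parts) < 3:  # pad so that '4.39' compares like '4.39.0'
        parts.append(0)
    return tuple(parts)


def check_environment(env, versions):
    """Return a list of problems; an empty list means every check passed.

    env      -- mapping of environment variables (for example os.environ)
    versions -- mapping of package name to installed version string,
                or None when the package is missing
    """
    problems = []
    if not env.get("HF_TOKEN", "").startswith("hf_"):
        problems.append("HF_TOKEN is missing or does not start with 'hf_'")
    tf_version = versions.get("transformers")
    if tf_version is None or parse_version(tf_version) < (4, 39, 0):
        problems.append("transformers >= 4.39.0 is required")
    if versions.get("peft") != "0.5.0":
        problems.append("peft==0.5.0 is required")
    return problems


def installed_versions(names):
    """Look up installed versions with importlib.metadata (None if absent)."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    issues = check_environment(os.environ, installed_versions(["transformers", "peft"]))
    print("\n".join(issues) if issues else "Environment looks ready.")
```

Keeping check_environment free of direct lookups makes it easy to exercise with plain dictionaries; only the last two lines touch the real environment.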

💻 Usage examples

Basic usage

See the code in the "Quick start" section above. It feeds a video to the model, asks it to describe the content, and then asks what the person in the video is wearing.
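
The quick start code repeats the same model.chat call for every question and passes the returned chat_history into the next turn. One way to package that pattern is sketched below. The ask helper is hypothetical, and EchoModel is a stand-in invented here so the sketch runs without a GPU or model weights; its parameter names and its list-of-tuples history format are assumptions, not the real model's internals. Only the call shape inside ask is taken from the quick start code.

```python
def ask(model, tokenizer, video_tensor, question, chat_history=None):
    """Send one question about the video; return (answer, updated history)."""
    if chat_history is None:
        chat_history = []
    # Same arguments as the model.chat calls in the quick start code.
    return model.chat(
        tokenizer, '', question,
        media_type='video', media_tensor=video_tensor,
        chat_history=chat_history, return_history=True,
        generation_config={'do_sample': False})


class EchoModel:
    """Hypothetical stand-in for the real model (assumed signature)."""

    def chat(self, tokenizer, second_arg, question, media_type, media_tensor,
             chat_history, return_history, generation_config):
        answer = f"(answer to: {question})"
        # Assumed history format: a list of (question, answer) pairs.
        return answer, chat_history + [(question, answer)]


history = None
for question in ['describe the action step by step.', 'What is she wearing?']:
    answer, history = ask(EchoModel(), None, None, question, history)
    print(answer)
print(len(history))  # 2
```

With the real model, pass the model, tokenizer, and video_tensor from the quick start in place of EchoModel() and the two None placeholders.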

📈 Performance

Model	MVBench	VideoMME (w/o sub)
InternVideo2-Chat-8B	60.3	41.9
InternVideo2-Chat-8B-HD	65.4	46.1
InternVideo2-Chat-8B-HD-F16	67.5	49.4
InternVideo2-Chat-8B-InternLM	61.9	49.1

✏️ Citation

If this work is helpful for your research, please consider citing InternVideo2 and VideoChat.

@article{wang2024internvideo2,
  title={Internvideo2: Scaling video foundation models for multimodal video understanding},
  author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Wang, Chenting and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
  journal={arXiv preprint arXiv:2403.15377},
  year={2024}
}

@article{li2023videochat,
  title={Videochat: Chat-centric video understanding},
  author={Li, KunChang and He, Yinan and Wang, Yi and Li, Yizhuo and Wang, Wenhai and Luo, Ping and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2305.06355},
  year={2023}
}