LLaVA-NeXT-Video-34B-hf: an open-source multimodal chatbot with strong video understanding, free to deploy

Llava NeXT Video 34B Hf

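Developed by llava-hf

LLaVA-NeXT-Video is an open-source multimodal chatbot that gains strong video understanding from training on a mix of video and image data.

Video-Text-to-Text

Transformers

English #Multimodal video understanding #Instruction-following dialogue #Video question answering

Downloads: 2,232

Release date: 6/6/2024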

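Model overview

A video understanding model built on LLaVA-NeXT and tuned on a mix of video and image data. It shows leading performance on the VideoMME benchmark.

Model features

Video understanding

Processes video content by uniformly sampling 32 frames per clip.

Multimodal instruction following

Understands and carries out multimodal instructions grounded in videos and images.

Leading open-source model

Currently the top-ranked open-source model on the VideoMME benchmark.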

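Model capabilities

Video content understanding

Multimodal dialogue

Video question answering

Video content description

Use cases

Video content analysis

Video question answering system

Answers user questions based on video content.

Performs strongly on the VideoMME benchmark.

Video content summarization

Generates written descriptions and summaries of video content.

Educational applications

Educational video analysis

Helps students understand educational videos and answer questions about them.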

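🚀 LLaVA-NeXT-Video Model Card

Also check out the Google Colab demo, which runs Llava on a free-tier Google Colab instance.

Disclaimer: The team that released LLaVa-NeXT-Video did not write a model card for this model, so this model card was written by the Hugging Face team.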

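📄 Model details

Attribute	Details
Model type	LLaVA-Next-Video is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. The model builds on LLaVa-NeXT and is tuned on a mix of video and image data to achieve better video understanding. Videos were uniformly sampled to 32 frames per clip. The model is the current SOTA among open-source models on the VideoMME benchmark. Base LLM: lmsys/vicuna-7b-v1.5.
Model date	LLaVA-Next-Video-7B was trained in April 2024.
Paper or resources for more information	https://github.com/LLaVA-VL/LLaVA-NeXT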

[Figure: LLaVA-NeXT-Video architecture]

📚 Training dataset

Image

558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
158K GPT-generated multimodal instruction-following data.
500K academic-task-oriented VQA data mixture.
50K GPT-4V data mixture.
40K ShareGPT data.

Video

100K VideoChatGPT-Instruct.

📊 Evaluation dataset

A collection of 4 benchmarks: 3 academic VQA benchmarks and 1 captioning benchmark.

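🚀 How to use the model

First, make sure transformers >= 4.42.0 is installed. The model supports multi-visual and multi-prompt generation, meaning you can pass multiple images/videos in a single prompt. Also make sure to follow the correct prompt template (USER: xxx\nASSISTANT:) and add the token <image> or <video> at the position where you want to query the image/video.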
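
To make the template concrete, here is a minimal sketch that assembles a prompt string by hand and counts the visual placeholders in it. The helper names `build_prompt` and `count_placeholders` are hypothetical and not part of transformers. The sketch follows only the template quoted above; the chat template shipped with the checkpoint remains authoritative, which is why the scripts in this card call `processor.apply_chat_template` instead.

```python
from typing import Dict


def build_prompt(user_content: str) -> str:
    """Wrap the user turn in the template quoted above (hypothetical helper)."""
    return f"USER: {user_content}\nASSISTANT:"


def count_placeholders(prompt: str) -> Dict[str, int]:
    """Count <image> and <video> tokens so they can be checked against the inputs."""
    return {"image": prompt.count("<image>"), "video": prompt.count("<video>")}


# Place the token where the video should be queried, then add the question.
prompt = build_prompt("<video>Why is this video funny?")
print(repr(prompt))
print(count_placeholders(prompt))
```

Checking the placeholder counts against the number of images and videos you pass to the processor is a cheap way to catch a mismatch before running generation.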

Below is an example script that runs generation in float16 precision on a GPU device.

import av
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-34B-hf"

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True, 
).to(0)

processor = LlavaNextVideoProcessor.from_pretrained(model_id)

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


# define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image", "video") 
conversation = [
    {

        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
            ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)

# sample uniformly 8 frames from the video, can sample more for longer videos
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to(model.device)

output = model.generate(**inputs_video, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
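
The script above computes frame indices with a floating-point step passed to `np.arange`. As an alternative, here is a minimal sketch of the same uniform sampling in integer arithmetic, which always returns exactly the requested number of indices and caps the request at the video length. The helper name `uniform_frame_indices` is hypothetical, and the frame count of 240 is made up for illustration.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Return evenly spaced frame indices in the range [0, total_frames)."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Never request more frames than the video contains.
    num_frames = min(num_frames, total_frames)
    # Integer floor division keeps every index below total_frames.
    return [i * total_frames // num_frames for i in range(num_frames)]


# Illustrative values only: a 240-frame clip sampled at 8 frames.
indices = uniform_frame_indices(total_frames=240, num_frames=8)
print(indices)
```

The returned list can be passed to `read_video_pyav` in place of the NumPy array, since that function only indexes into it and tests membership.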

Inference with images as inputs

After loading the model as shown above, use the following code to generate from an image.

import requests
from PIL import Image

conversation = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "What are these?"},
          {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs_image = processor(text=prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs_image, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))

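Inference with images and videos as inputs

After loading the model as shown above, use the following code to generate from an image and a video in a single batch.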

conversation_1 = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "What's the content of the image?"},
          {"type": "image"},
        ],
    }
]
conversation_2 = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "Why is this video funny?"},
          {"type": "video"},
        ],
    },
]
prompt_1 = processor.apply_chat_template(conversation_1, add_generation_prompt=True)
prompt_2 = processor.apply_chat_template(conversation_2, add_generation_prompt=True)

# Reuses `raw_image` and `clip` from the snippets above
inputs = processor(text=[prompt_1, prompt_2], images=raw_image, videos=clip, padding=True, return_tensors="pt").to(model.device)

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=100)
out = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(out)
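
Each decoded string in `out` still contains the prompt text in front of the model's answer. Here is a minimal sketch for keeping only the answer, assuming the `ASSISTANT:` marker from the prompt template quoted earlier. The helper name `extract_reply` is hypothetical, the checkpoint's own chat template may use a different role marker, and the sample string is made up for illustration rather than taken from real model output.

```python
def extract_reply(decoded: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last role marker, or the whole text if it is absent."""
    # rpartition yields ('', '', decoded) when the marker is missing,
    # so the third element is the right answer in both cases.
    return decoded.rpartition(marker)[2].strip()


# Made-up decoded text, used only to show the string handling.
sample = "USER: \nWhy is this video funny?\nASSISTANT: A placeholder answer."
print(extract_reply(sample))
```

Applied to the batch above, this would be `[extract_reply(text) for text in out]`.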

Starting with transformers >= v4.48, you can also pass an image/video URL or local path in the conversation history and let the chat template handle the rest. For videos, you also need to specify how many num_frames to sample from the video; otherwise the whole video is loaded. The chat template loads the image/video and returns the inputs as torch.Tensor, which you can pass directly to model.generate().

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "video", "path": "my_video.mp4"},
            {"type": "text", "text": "What is shown in this image and video?"},
        ],
    },
]

inputs = processor.apply_chat_template(messages, num_frames=8, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)

Model optimization

4-bit quantization with the `bitsandbytes` library

First, install bitsandbytes (pip install bitsandbytes) and make sure you have access to a CUDA-compatible GPU device. Then change the snippet above as follows.

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True,
+   load_in_4bit=True
)
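
To see why 4-bit loading matters for a model of this size, here is a rough back-of-the-envelope sketch of the memory taken by the weights alone. It assumes a nominal 34 billion parameters, taking the "34B" in the model name at face value, and it ignores activations, the KV cache, and quantization overhead, so real usage will be higher. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: "34B" in the model name taken at face value.
NOMINAL_PARAMS = 34_000_000_000

print(weight_memory_gb(NOMINAL_PARAMS, 16))  # float16 weights
print(weight_memory_gb(NOMINAL_PARAMS, 4))   # 4-bit weights
```

Under these assumptions the weights drop from roughly 68 GB in float16 to roughly 17 GB in 4-bit.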

Use Flash-Attention 2 to speed up generation further

First, install flash-attn. See the original Flash Attention repository for installation instructions. Then change the snippet above as follows.

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True,
+   use_flash_attention_2=True
).to(0)

🔒 License

✏️ Citation

If you find our paper or code useful in your research, please cite as follows.

@misc{zhang2024llavanextvideo,
  title={LLaVA-NeXT: A Strong Zero-shot Video Understanding Model},
  url={https://llava-vl.github.io/blog/2024-04-30-llava-next-video/},
  author={Zhang, Yuanhan and Li, Bo and Liu, Haotian and Lee, Yong Jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan},
  month={April},
  year={2024}
}

@misc{liu2024llavanext,
    title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
    url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
    author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
    month={January},
    year={2024}
}