LLaVA-Next-Inst-It-Vicuna-7B open-source model: stronger multimodal instance understanding for better performance in real-world applications

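Developed by Inst-IT

LLaVA-Next-Inst-It-Vicuna-7B is a multimodal model that excels at instance-level understanding. It strengthens multimodal instance understanding through explicit visual prompt instruction tuning.

Safetensors

Language: English | Open-source license: Apache-2.0 | #Instance-level visual understanding #Multimodal instruction tuning #Fine-grained video frame analysis

Downloads: 14

Release date: 12/5/2024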

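Model overview

Built on the LLaVA-NeXT architecture and paired with the Vicuna-7B language model, this model specializes in multimodal instance-level understanding tasks and supports fine-grained analysis of images and videos.

Model features

Multimodal instance-level understanding

Explicit visual prompt instruction tuning strengthens fine-grained understanding of individual instances in images and videos.

Set-of-Marks visual prompt support

Set-of-Marks visual prompts enable more precise instance referencing and analysis.

Video frame timestamp references

Timestamps can be used to refer to specific frames in a video, enabling temporally aware multimodal understanding.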
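The quick-start examples in this card write instance references as a bracketed ID such as `[8]` and frame timestamps as an angle-bracketed number such as `<1>`, both as plain text inside the question. As a rough sketch of how such questions might be composed and inspected, the helpers below use only the Python standard library. The names `instance_ref`, `timestamp_ref`, and `find_references` are hypothetical and are not part of the Inst-IT or LLaVA-NeXT code.

```python
import re


def instance_ref(instance_id: int) -> str:
    """Format a Set-of-Marks instance ID the way the examples do, e.g. [8]."""
    return f"[{instance_id}]"


def timestamp_ref(t: int) -> str:
    """Format a frame timestamp the way the examples do, e.g. <1>."""
    return f"<{t}>"


def find_references(question: str) -> dict:
    """Collect the instance IDs and timestamps mentioned in a question."""
    return {
        "instances": [int(m) for m in re.findall(r"\[(\d+)\]", question)],
        "timestamps": [int(m) for m in re.findall(r"<(\d+)>", question)],
    }


question = f"Is {instance_ref(3)} visible at {timestamp_ref(1)}?"
print(question)                   # Is [3] visible at <1>?
print(find_references(question))  # {'instances': [3], 'timestamps': [1]}
```

Because the timestamp pattern only matches digits, a non-numeric token such as `<image>` in the prompt is not mistaken for a timestamp.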

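Model capabilities

Instance-level image description

Temporal video analysis

Multimodal question answering

Fine-grained visual understanding

Open-ended text generation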

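Use cases

Image understanding

Image instance description

Describes specific instances in an image in detail and supports referring to them by instance ID.

Achieves 68.6% accuracy on the Inst-IT-Bench-I-OE dataset.

Video understanding

Temporal video analysis

Analyzes how content changes at specific points in a video and supports timestamp references.

Achieves 49.3% accuracy on the Inst-IT-Bench-V-OE dataset.

Multimodal question answering

Image question answering

Answers complex questions about image content, including instance-level details.

Achieves 65.9% accuracy on the GQA dataset.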

🚀 LLaVA-Next-Inst-It-Vicuna-7B

LLaVA-Next-Inst-It-Vicuna-7B is a multimodal model that excels at instance-level understanding. It is introduced in the paper Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning.

Homepage | Code | Paper | arXiv

🚀 Quick start

Installation

Our code is based on LLaVA-NeXT. Before running it, install LLaVA-NeXT to prepare the environment.

```shell
pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
```
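Before loading the model, it can help to confirm that the package is importable. The following is a minimal sketch, assuming the installation exposes a top-level module named `llava` (the name used by the import statements in the loading code that follows). The helper name `is_importable` is hypothetical.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if not is_importable("llava"):
    print("llava not found; install LLaVA-NeXT before loading the model.")
```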

Loading the model

```python
from llava.model.builder import load_pretrained_model
from llava.constants import (
    DEFAULT_IMAGE_TOKEN,
    IMAGE_TOKEN_INDEX,
)
from llava.mm_utils import (
    KeywordsStoppingCriteria,
    get_model_name_from_path,
    tokenizer_image_token,
    process_images,
)
from llava.conversation import SeparatorStyle, conv_templates

overwrite_config = {}
overwrite_config["mm_spatial_pool_stride"] = 2
overwrite_config["mm_spatial_pool_mode"] = 'bilinear'
overwrite_config["mm_pooling_position"] = 'after'
overwrite_config["mm_newline_position"] = 'no_token'

model_path = "Inst-IT/LLaVA-Next-Inst-It-Vicuna-7B"
model_name = get_model_name_from_path(model_path)

tokenizer, model, image_processor, max_length = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=model_name,
    device_map="auto",
    torch_dtype='bfloat16',
    overwrite_config=overwrite_config,
    attn_implementation='sdpa',
)
```

Image inference

<details>
<summary>Inference without SoMs</summary>
Our model can run image inference without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts. In this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).
```python
import torch
import requests
from PIL import Image

img_url = "https://github.com/inst-it/inst-it/blob/main/assets/demo/image.jpg?raw=true"
image = Image.open(requests.get(img_url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config).bfloat16()
image_sizes = [image.size]

question = "Describe this image."
question = DEFAULT_IMAGE_TOKEN + "\n" + question

conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()

pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()

stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=image_tensor,
        attention_mask=attention_masks,
        modalities="image",
        image_sizes=image_sizes,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```

</details>

<details>
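<summary>Inference with SoMs</summary>
Our model achieves finer-grained understanding when [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts are provided. You can use IDs to refer to the instances you are interested in. Compared with the previous inference code, the code below is unchanged except that the input image is visually prompted with Set-of-Marks. To learn how to generate SoMs for an image, see [this link](https://github.com/microsoft/SoM).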
```python
import torch
import requests
from PIL import Image

img_url = "https://github.com/inst-it/inst-it/blob/main/assets/demo/image_som.jpg?raw=true"
image = Image.open(requests.get(img_url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config).bfloat16()
image_sizes = [image.size]

# You can use [id] to refer to the instances that you are interested in
question = "Describe [8] in detail."
question = DEFAULT_IMAGE_TOKEN + "\n" + question

conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()

pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()

stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=image_tensor,
        attention_mask=attention_masks,
        modalities="image",
        image_sizes=image_sizes,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```

</details>

Video inference

For videos, we organize the frames into a list. To refer to a specific timestamp, use the format `<t>` (for example, `<1>`).
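The video demos in this card pass eight frames. For a longer clip, one common approach is to pick evenly spaced frames before building the list. The sketch below is an illustrative assumption, not the Inst-IT authors' sampling procedure: `sample_frame_indices` is a hypothetical helper, the default of 8 frames simply mirrors the demos, and the 1-based mapping from list position to timestamp is inferred from the `<1>` example.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_frames: int = 8) -> List[int]:
    """Pick up to num_frames evenly spaced, 0-based frame indices."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if total_frames <= num_frames:
        return list(range(total_frames))
    if num_frames == 1:
        return [0]
    # Spread the picks from the first frame to the last frame inclusive.
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


indices = sample_frame_indices(total_frames=80)
print(indices)  # [0, 11, 23, 34, 45, 56, 68, 79]

# Assumed convention: the t-th frame in the list is referenced as <t>.
for t, source_index in enumerate(indices, start=1):
    print(f"<{t}> -> source frame {source_index}")
```

The selected frames would then be loaded as PIL images and passed to the preprocessing step in the same way as the demo frames.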

<details>
<summary>Inference without SoMs</summary>
Our model can run video inference without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts. In this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).
```python
import torch
import requests
from PIL import Image

frame_urls = [
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_1.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_2.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_3.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_4.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_5.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_6.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_7.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_8.jpg?raw=true"
]
video = [Image.open(requests.get(frame_url, stream=True).raw) for frame_url in frame_urls]
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda()
video = video.bfloat16()
videos = [video]

# Choose one question; the second assignment below overrides the first.
question = "Describe the video."  # overall video caption
question = "What happens at frame <1>?"  # caption a specific moment
question = DEFAULT_IMAGE_TOKEN + "\n" + question

conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()

pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()

stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=videos,
        attention_mask=attention_masks,
        modalities="video",
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```

</details>

<details>
<summary>Inference with SoMs</summary>
Our model achieves finer-grained understanding when [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts are provided. You can use IDs to refer to the instances you are interested in. Compared with the previous inference code, the code below is unchanged except that the input video is visually prompted with Set-of-Marks. To learn how to generate SoMs for a video, see [SAM2](https://github.com/facebookresearch/sam2) and [SoM](https://github.com/microsoft/SoM).
```python
import torch
import requests
from PIL import Image

frame_urls = [
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_1.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_2.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_3.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_4.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_5.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_6.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_7.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_8.jpg?raw=true"
]
video = [Image.open(requests.get(frame_url, stream=True).raw) for frame_url in frame_urls]
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda()
video = video.bfloat16()
videos = [video]

# You can use [id] to refer to the instances that you are interested in
question = "Is [3] visible at <1>?"
question = DEFAULT_IMAGE_TOKEN + "\n" + question

conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()

pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()

stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=videos,
        attention_mask=attention_masks,
        modalities="video",
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```

</details>

🔧 Technical details

Architecture: clip-vit-large-patch14-336 + Vicuna-7B
Initialization model: LLaVA-NeXT
Data: LLaVA-NeXT-Data / Inst-IT-Dataset
Precision: bfloat16
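As a back-of-the-envelope check on what bfloat16 precision implies for hardware, the memory needed to hold the weights can be estimated from the parameter count. The figures here are assumptions for illustration only: roughly 7 billion parameters for the language model (read off the "7B" in the name, not an exact count) and 2 bytes per bfloat16 value. The vision encoder, activations, and the generation cache all add to this, so treat the result as a lower bound. The helper name `weight_memory_gib` is hypothetical.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits


def weight_memory_gib(num_params: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Estimate the memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / 2**30


approx_params = 7e9  # assumed from the "7B" in the model name
print(f"{weight_memory_gib(approx_params):.1f} GiB")  # 13.0 GiB
```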

📄 License

This project is released under the Apache-2.0 license.

Contact

Feel free to contact us if you have any questions or suggestions.

Email (Wujian Peng): wjpeng24@m.fudan.edu.cn
Email (Lingchen Meng): lcmeng20@fudan.edu.cn

Citation

@article{peng2024inst,
  title={Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning},
  author={Peng, Wujian and Meng, Lingchen and Chen, Yitong and Xie, Yiweng and Liu, Yang and Gui, Tao and Xu, Hang and Qiu, Xipeng and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2412.03565},
  year={2024}
}
