🚀 LLaVA-Next-Inst-It-Vicuna-7B
LLaVA-Next-Inst-It-Vicuna-7B is a multimodal model that excels at instance-level understanding. It was introduced in the paper Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning. Beyond general image and video question answering, it can answer questions about individual instances referenced by their Set-of-Marks IDs.
Homepage | Code | Paper | arXiv
✨ Features
- Architecture: clip-vit-large-patch14-336 + Vicuna-7B
- Initialized Model: LLaVA-NeXT
- Data: LLaVA-NeXT-Data / Inst-IT-Dataset
- Precision: bfloat16
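The weights are stored in bfloat16, so you may want to confirm that your GPU supports it before loading; this is just a quick check, assuming a CUDA build of PyTorch is installed:

import torch

# bfloat16 needs an Ampere-class GPU or newer; otherwise consider loading in float16
if torch.cuda.is_available():
    print("bf16 supported:", torch.cuda.is_bf16_supported())
else:
    print("no CUDA device found")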
📦 Installation
Our code is based on LLaVA-NeXT. Before running, please install LLaVA-NeXT to prepare the environment:
pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
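To verify that the environment is ready, a quick import check can be run; this only confirms that the llava package and the model builder used below are importable:

python -c "from llava.model.builder import load_pretrained_model; print('LLaVA-NeXT is ready')"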
💻 Usage Examples
Basic Usage
Load Model
from llava.model.builder import load_pretrained_model
from llava.constants import (
    DEFAULT_IMAGE_TOKEN,
    IMAGE_TOKEN_INDEX,
)
from llava.mm_utils import (
    KeywordsStoppingCriteria,
    get_model_name_from_path,
    tokenizer_image_token,
    process_images,
)
from llava.conversation import SeparatorStyle, conv_templates
overwrite_config = {}
overwrite_config["mm_spatial_pool_stride"] = 2
overwrite_config["mm_spatial_pool_mode"] = 'bilinear'
overwrite_config["mm_pooling_position"] = 'after'
overwrite_config["mm_newline_position"] = 'no_token'
model_path = "Inst-IT/LLaVA-Next-Inst-It-Vicuna-7B"
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, max_length = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=model_name,
    device_map="auto",
    torch_dtype='bfloat16',
    overwrite_config=overwrite_config,
    attn_implementation='sdpa')
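The image and video examples below all repeat the same prompt-construction and decoding steps. If you prefer, they can be wrapped in a small convenience helper; this is only a sketch based on those examples (the name generate_answer is ours, not part of the LLaVA-NeXT API), and every example below also works standalone without it.

import torch

def generate_answer(question, visuals, modality, image_sizes=None, conv_template='vicuna_v1'):
    # build the vicuna_v1 conversation prompt with the image placeholder token
    conv = conv_templates[conv_template].copy()
    conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\n" + question)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
    pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
    attention_masks = input_ids.ne(pad_token_ids).long().cuda()

    # stop generation at the conversation separator, as in the examples below
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)

    gen_kwargs = dict(
        inputs=input_ids,
        images=visuals,
        attention_mask=attention_masks,
        modalities=modality,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096,
    )
    if image_sizes is not None:  # the video examples below do not pass image_sizes
        gen_kwargs["image_sizes"] = image_sizes

    with torch.inference_mode():
        output_ids = model.generate(**gen_kwargs)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()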
Advanced Usage
Image Inference
Inference without SoMs
Our model can perform inference on images without Set-of-Marks visual prompts. In this case, it can be used in the same way as its base model, LLaVA-NeXT.
import torch
import requests
from PIL import Image
img_url = "https://github.com/inst-it/inst-it/blob/main/assets/demo/image.jpg?raw=true"
image = Image.open(requests.get(img_url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config).bfloat16()
image_sizes = [image.size]
question = "Describe this image."
question = DEFAULT_IMAGE_TOKEN + "\n" + question
conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=image_tensor,
        attention_mask=attention_masks,
        modalities="image",
        image_sizes=image_sizes,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
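If you defined the optional generate_answer helper above, the example collapses to a single call (the helper is a sketch of ours, not part of the official API):

print(generate_answer("Describe this image.", image_tensor, "image", image_sizes=image_sizes))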
Inference with SoMs
Our model achieves more fine-grained understanding when Set-of-Marks visual prompts are provided: you can refer to the instances you are interested in by their numeric IDs.
Compared with the previous example, the code below is unchanged except for the input image, which is annotated with Set-of-Marks visual prompts.
Refer to this link to learn how to generate SoMs for an image.
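The linked tooling produces the actual Set-of-Marks overlays (segmentation masks plus numeric tags). Purely as an illustration of the idea, numeric IDs can be drawn over an image with PIL once you have instance boxes from any detector or segmenter; the boxes below are placeholders, not output of the official SoM pipeline.

from PIL import Image, ImageDraw

def draw_id_marks(image, boxes):
    # boxes maps instance id -> (x0, y0, x1, y1) in pixels
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for inst_id, (x0, y0, x1, y1) in boxes.items():
        draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(inst_id), fill="red")
    return marked

# placeholder boxes for illustration only; real SoMs should come from the tools linked above
# marked_image = draw_id_marks(image, {1: (50, 60, 200, 240), 2: (220, 80, 380, 300)})

The inference itself then runs exactly as before: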
import torch
import requests
from PIL import Image
img_url = "https://github.com/inst-it/inst-it/blob/main/assets/demo/image_som.jpg?raw=true"
image = Image.open(requests.get(img_url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config).bfloat16()
image_sizes = [image.size]
question = "Describe [8] in detail."
question = DEFAULT_IMAGE_TOKEN + "\n" + question
conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=image_tensor,
        attention_mask=attention_masks,
        modalities="image",
        image_sizes=image_sizes,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
Video Inference
Inference without SoMs
Our model can perform inference on videos without Set-of-Marks visual prompts. In this case, it can be used in the same way as its base model, LLaVA-NeXT.
import torch
import requests
from PIL import Image
frame_urls = [
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_1.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_2.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_3.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_4.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_5.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_6.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_7.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_8.jpg?raw=true"
]
video = [Image.open(requests.get(frame_url, stream=True).raw) for frame_url in frame_urls]
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda()
video = video.bfloat16()
videos = [video]
question = "Describe the video."
question = "What happens at frame <1>?"
question = DEFAULT_IMAGE_TOKEN + "\n" + question
conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=videos,
        attention_mask=attention_masks,
        modalities="video",
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
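The example above uses pre-extracted demo frames. To start from a local video file instead, frames can be sampled uniformly with OpenCV; this is only a sketch (my_video.mp4 is a placeholder path, and 8 frames simply matches the demo):

import cv2
from PIL import Image

def sample_frames(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes to BGR; the image processor expects RGB PIL images
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

# video = sample_frames("my_video.mp4")  # then preprocess exactly as in the example above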
Inference with SoMs
Our model achieves more fine-grained understanding when Set-of-Marks visual prompts are provided: you can refer to the instances you are interested in by their numeric IDs.
Compared with the previous example, the code below is unchanged except for the input video, whose frames are annotated with Set-of-Marks visual prompts.
Refer to SAM2 and SoM to learn how to generate SoMs for a video.
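Those tools handle the full pipeline (tracking instances across frames and tagging them consistently). As a loose illustration of the final overlay step only, IDs can be stamped at mask centroids with NumPy and PIL, assuming you already have per-instance binary masks for each frame from SAM2 or a similar tracker:

import numpy as np
from PIL import Image, ImageDraw

def tag_frame(frame, masks):
    # masks maps instance id -> boolean HxW array; each ID is drawn at its mask centroid
    tagged = frame.copy()
    draw = ImageDraw.Draw(tagged)
    for inst_id, mask in masks.items():
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:
            continue  # instance not visible in this frame
        draw.text((float(xs.mean()), float(ys.mean())), str(inst_id), fill="red")
    return tagged

The inference itself then runs exactly as before: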
import torch
import requests
from PIL import Image
frame_urls = [
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_1.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_2.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_3.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_4.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_5.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_6.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_7.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_8.jpg?raw=true"
]
video = [Image.open(requests.get(frame_url, stream=True).raw) for frame_url in frame_urls]
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda()
video = video.bfloat16()
videos = [video]
question = "Is [3] visible at <1>?"
question = DEFAULT_IMAGE_TOKEN + "\n" + question
conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=videos,
        attention_mask=attention_masks,
        modalities="video",
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | multimodal |
| Training Data | Inst-IT/Inst-IT-Dataset, lmms-lab/LLaVA-NeXT-Data |
| Base Model | liuhaotian/llava-v1.6-vicuna-7b |
| Pipeline Tag | video-text-to-text |
| Tags | multimodal, fine-grained, instance-understanding |
Results
The model has been evaluated on multiple datasets, and the accuracy metrics are as follows:
| Task Type | Dataset Name | Accuracy |
|-----------|--------------|----------|
| multimodal | Inst-IT-Bench-I-OE | 68.6 |
| multimodal | Inst-IT-Bench-I-MC | 63 |
| multimodal | AI2D | 71 |
| multimodal | MMMU | 37.4 |
| multimodal | POPE | 87.2 |
| multimodal | GQA | 65.9 |
| multimodal | MM-Vet | 38.1 |
| multimodal | Inst-IT-Bench-V-OE | 49.3 |
| multimodal | Inst-IT-Bench-V-MC | 42.1 |
| multimodal | ActNet-QA | 53.7 |
| multimodal | EgoSchema | 57.8 |
| multimodal | NextQA | 70.2 |
| multimodal | VideoMME | 44.3 |
| multimodal | TempoCompass | 59.8 |
📄 License
This project is licensed under the Apache-2.0 license.