🚀 LLaVA-Next-Inst-It-Vicuna-7B
LLaVA-Next-Inst-It-Vicuna-7B is a multimodal model that excels at instance-level understanding. It was introduced in the paper Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning. Beyond general image and video question answering, it can answer questions about individual instances referenced by their Set-of-Marks IDs.
Homepage | Code | Paper | arXiv
✨ Features
- Architecture: clip-vit-large-patch14-336 + Vicuna-7B
- Initialized Model: LLaVA-NeXT
- Data: LLaVA-NeXT-Data / Inst-IT-Dataset
- Precision: bfloat16
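The weights are stored in bfloat16, so you may want to confirm that your GPU supports it before loading; this is just a quick check, assuming a CUDA build of PyTorch is installed:

import torch

# bfloat16 needs an Ampere-class GPU or newer; otherwise consider loading in float16
if torch.cuda.is_available():
    print("bf16 supported:", torch.cuda.is_bf16_supported())
else:
    print("no CUDA device found")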
📦 Installation
Our code is based on LLaVA-NeXT. Before running, please install LLaVA-NeXT to prepare the environment:
pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
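To verify that the environment is ready, a quick import check can be run; this only confirms that the llava package and the model builder used below are importable:

python -c "from llava.model.builder import load_pretrained_model; print('LLaVA-NeXT is ready')"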
💻 Usage Examples
Basic Usage
Load Model
from llava.model.builder import load_pretrained_model
from llava.constants import (
    DEFAULT_IMAGE_TOKEN,
    IMAGE_TOKEN_INDEX,
)
from llava.mm_utils import (
    KeywordsStoppingCriteria,
    get_model_name_from_path,
    tokenizer_image_token,
    process_images,
)
from llava.conversation import SeparatorStyle, conv_templates
overwrite_config = {}
overwrite_config["mm_spatial_pool_stride"] = 2
overwrite_config["mm_spatial_pool_mode"] = 'bilinear'
overwrite_config["mm_pooling_position"] = 'after'
overwrite_config["mm_newline_position"] = 'no_token'
model_path = "Inst-IT/LLaVA-Next-Inst-It-Vicuna-7B"
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, max_length = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=model_name,
    device_map="auto",
    torch_dtype='bfloat16',
    overwrite_config=overwrite_config,
    attn_implementation='sdpa')
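The image and video examples below all repeat the same prompt-construction and decoding steps. If you prefer, they can be wrapped in a small convenience helper; this is only a sketch based on those examples (the name generate_answer is ours, not part of the LLaVA-NeXT API), and every example below also works standalone without it.

import torch

def generate_answer(question, visuals, modality, image_sizes=None, conv_template='vicuna_v1'):
    # build the vicuna_v1 conversation prompt with the image placeholder token
    conv = conv_templates[conv_template].copy()
    conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\n" + question)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
    pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
    attention_masks = input_ids.ne(pad_token_ids).long().cuda()

    # stop generation at the conversation separator, as in the examples below
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)

    gen_kwargs = dict(
        inputs=input_ids,
        images=visuals,
        attention_mask=attention_masks,
        modalities=modality,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096,
    )
    if image_sizes is not None:  # the video examples below do not pass image_sizes
        gen_kwargs["image_sizes"] = image_sizes

    with torch.inference_mode():
        output_ids = model.generate(**gen_kwargs)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()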
Advanced Usage
Image Inference
Inference without SoMs
Our model can perform inference on images without Set-of-Marks visual prompts. In this case, it can be used in the same way as its base model, LLaVA-NeXT.
import torch
import requests
from PIL import Image
img_url = "https://github.com/inst-it/inst-it/blob/main/assets/demo/image.jpg?raw=true"
image = Image.open(requests.get(img_url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config).bfloat16()
image_sizes = [image.size]
question = "Describe this image."
question = DEFAULT_IMAGE_TOKEN + "\n" + question
conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=image_tensor,
        attention_mask=attention_masks,
        modalities="image",
        image_sizes=image_sizes,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
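If you defined the optional generate_answer helper above, the example collapses to a single call (the helper is a sketch of ours, not part of the official API):

print(generate_answer("Describe this image.", image_tensor, "image", image_sizes=image_sizes))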
Inference with SoMs
Our model achieves more fine-grained understanding when Set-of-Marks visual prompts are provided: you can refer to the instances you are interested in by their numeric IDs.
Compared with the previous example, the code below is unchanged except for the input image, which is annotated with Set-of-Marks visual prompts.
Refer to this link to learn how to generate SoMs for an image.
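The linked tooling produces the actual Set-of-Marks overlays (segmentation masks plus numeric tags). Purely as an illustration of the idea, numeric IDs can be drawn over an image with PIL once you have instance boxes from any detector or segmenter; the boxes below are placeholders, not output of the official SoM pipeline.

from PIL import Image, ImageDraw

def draw_id_marks(image, boxes):
    # boxes maps instance id -> (x0, y0, x1, y1) in pixels
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for inst_id, (x0, y0, x1, y1) in boxes.items():
        draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(inst_id), fill="red")
    return marked

# placeholder boxes for illustration only; real SoMs should come from the tools linked above
# marked_image = draw_id_marks(image, {1: (50, 60, 200, 240), 2: (220, 80, 380, 300)})

The inference itself then runs exactly as before: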
import torch
import requests
from PIL import Image
img_url = "https://github.com/inst-it/inst-it/blob/main/assets/demo/image_som.jpg?raw=true"
image = Image.open(requests.get(img_url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config).bfloat16()
image_sizes = [image.size]
question = "Describe [8] in detail."
question = DEFAULT_IMAGE_TOKEN + "\n" + question
conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=image_tensor,
        attention_mask=attention_masks,
        modalities="image",
        image_sizes=image_sizes,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
Video Inference
Inference without SoMs
Our model can perform inference on videos without Set-of-Marks visual prompts. In this case, it can be used in the same way as its base model, LLaVA-NeXT.
import torch
import requests
from PIL import Image
frame_urls = [
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_1.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_2.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_3.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_4.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_5.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_6.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_7.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_8.jpg?raw=true"
]
video = [Image.open(requests.get(frame_url, stream=True).raw) for frame_url in frame_urls]
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda()
video = video.bfloat16()
videos = [video]
question = "Describe the video."
question = "What happens at frame <1>?"
question = DEFAULT_IMAGE_TOKEN + "\n" + question
conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=videos,
        attention_mask=attention_masks,
        modalities="video",
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
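The example above uses pre-extracted demo frames. To start from a local video file instead, frames can be sampled uniformly with OpenCV; this is only a sketch (my_video.mp4 is a placeholder path, and 8 frames simply matches the demo):

import cv2
from PIL import Image

def sample_frames(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes to BGR; the image processor expects RGB PIL images
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

# video = sample_frames("my_video.mp4")  # then preprocess exactly as in the example above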
Inference with SoMs
Our model achieves more fine-grained understanding when Set-of-Marks visual prompts are provided: you can refer to the instances you are interested in by their numeric IDs.
Compared with the previous example, the code below is unchanged except for the input video, whose frames are annotated with Set-of-Marks visual prompts.
Refer to SAM2 and SoM to learn how to generate SoMs for a video.
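Those tools handle the full pipeline (tracking instances across frames and tagging them consistently). As a loose illustration of the final overlay step only, IDs can be stamped at mask centroids with NumPy and PIL, assuming you already have per-instance binary masks for each frame from SAM2 or a similar tracker:

import numpy as np
from PIL import Image, ImageDraw

def tag_frame(frame, masks):
    # masks maps instance id -> boolean HxW array; each ID is drawn at its mask centroid
    tagged = frame.copy()
    draw = ImageDraw.Draw(tagged)
    for inst_id, mask in masks.items():
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:
            continue  # instance not visible in this frame
        draw.text((float(xs.mean()), float(ys.mean())), str(inst_id), fill="red")
    return tagged

The inference itself then runs exactly as before: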
import torch
import requests
from PIL import Image
frame_urls = [
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_1.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_2.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_3.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_4.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_5.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_6.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_7.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_8.jpg?raw=true"
]
video = [Image.open(requests.get(frame_url, stream=True).raw) for frame_url in frame_urls]
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda()
video = video.bfloat16()
videos = [video]
question = "Is [3] visible at <1>?"
question = DEFAULT_IMAGE_TOKEN + "\n" + question
conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=videos,
        attention_mask=attention_masks,
        modalities="video",
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | multimodal |
| Training Data | Inst-IT/Inst-IT-Dataset, lmms-lab/LLaVA-NeXT-Data |
| Base Model | liuhaotian/llava-v1.6-vicuna-7b |
| Pipeline Tag | video-text-to-text |
| Tags | multimodal, fine-grained, instance-understanding |
Results
The model has been evaluated on multiple datasets, and the accuracy metrics are as follows:
| Task Type | Dataset Name | Accuracy |
|-----------|--------------|----------|
| multimodal | Inst-IT-Bench-I-OE | 68.6 |
| multimodal | Inst-IT-Bench-I-MC | 63 |
| multimodal | AI2D | 71 |
| multimodal | MMMU | 37.4 |
| multimodal | POPE | 87.2 |
| multimodal | GQA | 65.9 |
| multimodal | MM-Vet | 38.1 |
| multimodal | Inst-IT-Bench-V-OE | 49.3 |
| multimodal | Inst-IT-Bench-V-MC | 42.1 |
| multimodal | ActNet-QA | 53.7 |
| multimodal | EgoSchema | 57.8 |
| multimodal | NextQA | 70.2 |
| multimodal | VideoMME | 44.3 |
| multimodal | TempoCompass | 59.8 |
📄 License
This project is licensed under the Apache-2.0 license.