LLaVA-Next-Inst-It-Vicuna-7B
LLaVA-Next-Inst-It-Vicuna-7B is a multimodal model that excels at instance-level understanding, enhanced via explicit visual prompt instruction tuning.
Downloads: 14
Released: 12/5/2024
Model Overview
Built on the LLaVA-NeXT architecture with the Vicuna-7B language model, this model focuses on multimodal instance-level understanding and supports fine-grained analysis of both images and videos.
Model Features
- **Multimodal instance-level understanding**: explicit visual prompt instruction tuning strengthens fine-grained understanding of individual instances in images and videos.
- **Set-of-Marks visual prompt support**: Set-of-Marks visual prompts can be used for more precise instance referring and analysis.
- **Video frame timestamp referring**: specific frames in a video can be referenced by timestamp, enabling temporally aware multimodal understanding.
Model Capabilities
- Instance-level image description
- Temporal video analysis
- Multimodal question answering
- Fine-grained visual understanding
- Open-ended text generation
Use Cases
- **Image understanding / instance-level image description**: describe specific instances in an image in detail, with instances referenced by ID. Reaches 68.6% accuracy on the Inst-IT-Bench-I-OE benchmark.
- **Video understanding / temporal video analysis**: analyze how content changes at specific moments in a video, with timestamp references. Reaches 49.3% accuracy on the Inst-IT-Bench-V-OE benchmark.
- **Multimodal question answering / image QA**: answer complex questions about image content, including instance-level details. Reaches 65.9% accuracy on the GQA benchmark.
🚀 LLaVA-Next-Inst-It-Vicuna-7B
LLaVA-Next-Inst-It-Vicuna-7B is a multimodal model that excels at instance-level understanding. It was introduced in the paper Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning.
| Property | Details |
|---|---|
| Model type | clip-vit-large-patch14-336 + Vicuna-7B |
| Initialized from | LLaVA-NeXT |
| Training data | LLaVA-NeXT-Data / Inst-IT-Dataset |
| Precision | bfloat16 |
🚀 Quick Start
📦 Installation
Our code is built on LLaVA-NeXT. Before running it, please install LLaVA-NeXT to set up the environment:
```bash
pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
```
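Optionally, you can verify that the environment is ready before downloading any weights. The snippet below is only a quick import check and assumes the repository installs a `llava` package, as the examples in this card rely on:
```python
# Optional sanity check: these imports should succeed if LLaVA-NeXT and its
# dependencies (torch, transformers, etc.) were installed correctly.
import torch
from llava.model.builder import load_pretrained_model

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("LLaVA-NeXT import OK")
```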
💻 Usage Examples
Loading the model
```python
from llava.model.builder import load_pretrained_model
from llava.constants import (
    DEFAULT_IMAGE_TOKEN,
    IMAGE_TOKEN_INDEX,
)
from llava.mm_utils import (
    KeywordsStoppingCriteria,
    get_model_name_from_path,
    tokenizer_image_token,
    process_images
)
from llava.conversation import SeparatorStyle, conv_templates

overwrite_config = {}
overwrite_config["mm_spatial_pool_stride"] = 2
overwrite_config["mm_spatial_pool_mode"] = 'bilinear'
overwrite_config["mm_pooling_position"] = 'after'
overwrite_config["mm_newline_position"] = 'no_token'

model_path = "Inst-IT/LLaVA-Next-Inst-It-Vicuna-7B"
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, max_length = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=model_name,
    device_map="auto",
    torch_dtype='bfloat16',
    overwrite_config=overwrite_config,
    attn_implementation='sdpa')
```
Image inference
Inference without Set-of-Marks
Our model can run inference on images without Set-of-Marks visual prompts; in that case, it can be used just like its base model, LLaVA-NeXT.
```python
import torch
import requests
from PIL import Image

img_url = "https://github.com/inst-it/inst-it/blob/main/assets/demo/image.jpg?raw=true"
image = Image.open(requests.get(img_url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config).bfloat16()
image_sizes = [image.size]

question = "Describe this image."
question = DEFAULT_IMAGE_TOKEN + "\n" + question

conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()

stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=image_tensor,
        attention_mask=attention_masks,
        modalities="image",
        image_sizes=image_sizes,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```
Inference with Set-of-Marks
When Set-of-Marks visual prompts are provided, our model can perform more fine-grained understanding: you can refer to the instances you are interested in by their IDs. Compared with the previous inference code, the only change below is that the input image is annotated with Set-of-Marks visual prompts. Please refer to this link to learn how to generate Set-of-Marks for an image; a minimal, illustrative sketch of drawing numbered markers is also shown after the code block below.
```python
import torch
import requests
from PIL import Image

img_url = "https://github.com/inst-it/inst-it/blob/main/assets/demo/image_som.jpg?raw=true"
image = Image.open(requests.get(img_url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config).bfloat16()
image_sizes = [image.size]

# You can use [id] to refer to the instances that you are interested in
question = "Describe [8] in detail."
question = DEFAULT_IMAGE_TOKEN + "\n" + question

conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()

stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=image_tensor,
        attention_mask=attention_masks,
        modalities="image",
        image_sizes=image_sizes,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```
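If you only want a feel for what a Set-of-Marks input looks like, the sketch below overlays numbered ID markers on an image with PIL. This is purely illustrative: the marker positions are made-up assumptions, and the actual Inst-IT demo images were produced with a proper segmentation-based SoM pipeline (see the link above), not with this helper.
```python
from PIL import Image, ImageDraw, ImageFont

def draw_som_markers(image, centers):
    """Overlay simple numbered markers (1, 2, ...) at the given (x, y) centers.

    `centers` is a hypothetical list of instance center points; in practice they
    would come from an instance segmentation / Set-of-Marks pipeline.
    """
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    font = ImageFont.load_default()
    radius = 14
    for idx, (x, y) in enumerate(centers, start=1):
        draw.ellipse(
            [(x - radius, y - radius), (x + radius, y + radius)],
            fill=(255, 255, 255), outline=(0, 0, 0), width=2,
        )
        draw.text((x - 4, y - 6), str(idx), fill=(0, 0, 0), font=font)
    return annotated

# Hypothetical instance centers; replace with the output of a real SoM pipeline.
demo = Image.new("RGB", (336, 336), (120, 160, 200))
som_demo = draw_som_markers(demo, centers=[(80, 120), (220, 200)])
som_demo.save("som_demo.jpg")
```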
Video inference
For videos, we organize the frames into a list. You can use the format <t> to refer to a specific timestamp (e.g. <1>).
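The demos below use pre-extracted frames fetched by URL. If you start from a local video file instead, you first need to turn it into such a list of frames yourself; the sketch below shows one possible way to sample frames uniformly with OpenCV (the `opencv-python` dependency and the file name are assumptions, not part of this model card).
```python
import cv2
from PIL import Image

def sample_frames(video_path, num_frames=8):
    """Uniformly sample `num_frames` frames from a local video as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(num_frames - 1, 1)
    indices = [int(i * (total - 1) / step) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV decodes frames as BGR; convert to RGB before wrapping in PIL.
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

# video = sample_frames("my_video.mp4", num_frames=8)  # hypothetical local file
```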
Inference without Set-of-Marks
Our model can run inference on videos without Set-of-Marks visual prompts; in that case, it can be used just like its base model, LLaVA-NeXT.
```python
import torch
import requests
from PIL import Image

frame_urls = [
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_1.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_2.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_3.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_4.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_5.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_6.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_7.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_8.jpg?raw=true"
]
video = [Image.open(requests.get(frame_url, stream=True).raw) for frame_url in frame_urls]
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda()
video = video.bfloat16()
videos = [video]

question = "Describe the video."  # overall video caption
question = "What happens at frame <1>?"  # caption a specific moment
question = DEFAULT_IMAGE_TOKEN + "\n" + question

conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()

stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=videos,
        attention_mask=attention_masks,
        modalities="video",
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```
Inference with Set-of-Marks
When Set-of-Marks visual prompts are provided, our model can perform more fine-grained understanding: you can refer to the instances you are interested in by their IDs. Compared with the previous inference code, the only change below is that the input video is annotated with Set-of-Marks visual prompts. Please refer to SAM2 and SoM to learn how to generate Set-of-Marks for a video.
```python
import torch
import requests
from PIL import Image

frame_urls = [
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_1.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_2.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_3.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_4.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_5.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_6.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_7.jpg?raw=true",
    "https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_8.jpg?raw=true"
]
video = [Image.open(requests.get(frame_url, stream=True).raw) for frame_url in frame_urls]
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda()
video = video.bfloat16()
videos = [video]

# You can use [id] to refer to the instances that you are interested in
question = "Is [3] visible at <1>?"
question = DEFAULT_IMAGE_TOKEN + "\n" + question

conv_template = 'vicuna_v1'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()

stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=videos,
        attention_mask=attention_masks,
        modalities="video",
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```
📞 Contact
Feel free to contact us if you have any questions or suggestions:
- Email (Wujian Peng): wjpeng24@m.fudan.edu.cn
- Email (Lingchen Meng): lcmeng20@fudan.edu.cn
📄 License
This project is released under the Apache-2.0 license.
📚 Citation
```bibtex
@article{peng2024inst,
  title={Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning},
  author={Peng, Wujian and Meng, Lingchen and Chen, Yitong and Xie, Yiweng and Liu, Yang and Gui, Tao and Xu, Hang and Qiu, Xipeng and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2412.03565},
  year={2024}
}
```