InstructBLIP Open-Source AI Model - Fuse Vision and Language, Describe and Answer Questions According to Image-Text Instructions!

Instructblip Flan T5 Xxl 8bit Nf4

Developed by Mediocreatmybest

InstructBLIP is the vision-instruction-tuned version of BLIP-2, combining vision and language models to generate descriptions or answer questions based on images and text instructions.

Image-to-Text

Transformers

English

Open Source License: MIT

#Visual Instruction Tuning #Image Caption Generation #Multimodal Interaction

Downloads 22

Release Time: 8/21/2023

Model Overview

This model uses Flan-T5-xxl as its language model and is instruction-tuned to handle general vision-language tasks.

Model Features

Visual Instruction Tuning

Enables the model to understand and execute complex image-based instructions through instruction tuning.

Multimodal Processing

Simultaneously processes visual and language inputs to achieve cross-modal understanding.

8-bit Quantization Support

Supports 8-bit and NF4 (4-bit) quantization through bitsandbytes, which reduces the memory needed to load the model.
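To give a rough sense of what quantization saves, the sketch below estimates the memory taken by the weights alone at different precisions. The parameter count is an assumption (about 11 billion for the Flan-T5-xxl language model; the vision encoder and Q-Former add more), and the helper name `weight_memory_gib` is hypothetical. Activations and quantization metadata are ignored, so real usage will be somewhat higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: roughly 11 billion parameters for the language model.
ASSUMED_PARAMS = 11e9

for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label:>8}: ~{weight_memory_gib(ASSUMED_PARAMS, bits):.1f} GiB")
```

Each halving of the bits per parameter halves the weight footprint, which is why the 8-bit and NF4 variants can fit on GPUs that cannot hold the full-precision model.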

Model Capabilities

Image Caption Generation

Visual Question Answering

Cross-modal Understanding

Instruction Following

Use Cases

Image Understanding

Image Anomaly Detection

Identify and describe unusual elements in images

Accurately points out anomalous elements in images

Assistive Technology

Visual Assistance

Describe image content for visually impaired individuals

Generates detailed and accurate image descriptions

🚀 InstructBLIP model

The InstructBLIP model uses Flan-T5-xxl as its language model, aiming to solve vision-language tasks and provide high-quality image-text interaction capabilities.

🚀 Quick Start

Quantization with bitsandbytes
8-bit / nf4 / Safetensors
-Mediocre 🥱

This is the InstructBLIP model using Flan-T5-xxl as its language model. InstructBLIP was introduced in the paper InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning by Dai et al.

Disclaimer: The team releasing InstructBLIP did not write a model card for this model, so this model card has been written by the Hugging Face team.
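As a rough illustration of the bitsandbytes quantization mentioned above, the sketch below quantizes the model while loading it, using `BitsAndBytesConfig` from Transformers. Several things here are assumptions rather than the author's setup: the repository id is the Salesforce checkpoint used in the usage example, the float16 compute dtype and `device_map="auto"` are illustrative choices, and the helper names `quantization_kwargs` and `load_quantized` are hypothetical. It also assumes `bitsandbytes` and `accelerate` are installed and a CUDA GPU is available.

```python
def quantization_kwargs(mode: str) -> dict:
    """Return keyword arguments for transformers.BitsAndBytesConfig."""
    if mode == "8bit":
        return {"load_in_8bit": True}
    if mode == "nf4":
        return {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",
            # Assumption: float16 compute; BitsAndBytesConfig accepts a dtype name.
            "bnb_4bit_compute_dtype": "float16",
        }
    raise ValueError(f"unknown quantization mode: {mode!r}")


def load_quantized(model_id: str = "Salesforce/instructblip-flan-t5-xxl",
                   mode: str = "nf4"):
    """Load the model quantized on the fly, plus its processor."""
    # Imported lazily so quantization_kwargs works without the heavy dependencies.
    from transformers import (
        BitsAndBytesConfig,
        InstructBlipForConditionalGeneration,
        InstructBlipProcessor,
    )

    config = BitsAndBytesConfig(**quantization_kwargs(mode))
    model = InstructBlipForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=config,
        device_map="auto",  # let accelerate place the quantized weights
    )
    processor = InstructBlipProcessor.from_pretrained(model_id)
    return model, processor
```

A model loaded this way is already placed on the GPU, so `model.to(device)` should not be called on it, and it is usually safest to move the processor outputs to the same device with a float16 dtype before calling `generate`.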

✨ Features

InstructBLIP is a visual instruction tuned version of [BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2). Refer to the paper for details.

Figure: InstructBLIP architecture

💻 Usage Examples

Basic Usage

from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
import torch
from PIL import Image
import requests

# Load the model and its processor (tokenizer plus image preprocessing).
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-flan-t5-xxl")
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-flan-t5-xxl")

# Run on the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Download the example image and pair it with a text instruction.
url = "https://raw.githubusercontent.com/salesforce/LAVIS/main/docs/_static/Confusing-Pictures.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = "What is unusual about this image?"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

# Deterministic beam search (do_sample=False), so sampling options
# such as top_p and temperature are not needed.
outputs = model.generate(
    **inputs,
    do_sample=False,
    num_beams=5,
    max_length=256,
    min_length=1,
    repetition_penalty=1.5,
    length_penalty=1.0,
)

# Decode the generated token ids back into text.
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)
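Because the model follows free-form instructions, the same image can be captioned and queried by changing only the prompt. The helper below is a minimal sketch of that pattern; the name `run_instructions` and its default generation settings are hypothetical, and it expects a `model`, `processor`, and `image` like the ones created in the basic example.

```python
def run_instructions(model, processor, image, prompts, device="cpu", **generate_kwargs):
    """Run each instruction against the same image and return {prompt: answer}."""
    # Beam-search defaults taken from the basic example; callers may override them.
    settings = {"do_sample": False, "num_beams": 5, "max_length": 256}
    settings.update(generate_kwargs)

    answers = {}
    for prompt in prompts:
        inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
        outputs = model.generate(**inputs, **settings)
        text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
        answers[prompt] = text.strip()
    return answers
```

For example, `run_instructions(model, processor, image, ["Describe this image in detail.", "What is unusual about this image?"], device=device)` would return one answer per instruction. The first prompt is an illustrative captioning instruction, not one taken from the model card.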

Advanced Usage

For code examples, we refer to the documentation.

📄 License

This model is released under the MIT license.
