BLIP-2, Flan T5-xxl, pre-trained only
The BLIP-2 model leverages the Flan T5-xxl large language model. It can be used for tasks like image captioning, visual question answering, etc.
🚀 Quick Start
You can use the raw model for conditional text generation given an image and optional text. Check the model hub to find fine-tuned versions for tasks that interest you.
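If you just want to try the checkpoint, it can also be driven through the transformers pipeline API. The following is a minimal sketch, assuming a recent transformers version whose image-to-text pipeline supports BLIP-2; note that this checkpoint is very large, so loading it without a GPU will be slow. Per-precision examples that call the model classes directly are given in the Usage Examples section below.

from transformers import pipeline

# Minimal captioning sketch (assumption: your transformers version routes
# "image-to-text" to Blip2ForConditionalGeneration for this checkpoint).
captioner = pipeline("image-to-text", model="Salesforce/blip2-flan-t5-xxl")
print(captioner("https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"))
# -> [{'generated_text': '...'}]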
✨ Features
- Multi-task Capability: Can be used for image captioning, visual question answering (VQA), and chat-like conversations.
- Bridge between Image and Language: The Querying Transformer (Q-Former) bridges the gap between the image encoder and the large language model.
📚 Documentation
Model description
BLIP-2 consists of 3 models: a CLIP-like image encoder, a Querying Transformer (Q-Former) and a large language model.
The authors initialize the weights of the image encoder and large language model from pre-trained checkpoints and keep them frozen while training the Querying Transformer. The Querying Transformer, a BERT-like Transformer encoder, maps a set of "query tokens" to query embeddings, which bridge the gap between the embedding space of the image encoder and the large language model.
The goal for the model is simply to predict the next text token, given the query embeddings and the previous text.

This allows the model to be used for tasks like:
- image captioning
- visual question answering (VQA)
- chat-like conversations by feeding the image and the previous conversation as prompt to the model
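To make the three-part structure described above concrete, the sketch below loads the model and prints its sub-modules and the number of learned query tokens. The attribute names (vision_model, qformer, language_model, num_query_tokens) are those used by the transformers BLIP-2 implementation; this is an illustrative probe, not part of the official examples.

from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl")
print(type(model.vision_model).__name__)    # CLIP-like image encoder (kept frozen during training)
print(type(model.qformer).__name__)         # Querying Transformer (Q-Former) bridging vision and language
print(type(model.language_model).__name__)  # Flan T5-xxl language model (kept frozen during training)
print(model.config.num_query_tokens)        # number of learned query tokens fed to the Q-Former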
Direct Use and Downstream Use
You can use the raw model for conditional text generation given an image and optional text. See the model hub to look for fine-tuned versions on a task that interests you.
Bias, Risks, Limitations, and Ethical Considerations
BLIP2-FlanT5 uses off-the-shelf Flan-T5 as the language model. It inherits the same risks and limitations from Flan-T5:
Language models, including Flan-T5, can potentially be used for language generation in a harmful way, according to Rae et al. (2021). Flan-T5 should not be used directly in any application without a prior assessment of safety and fairness concerns specific to the application.
BLIP2 is fine-tuned on image-text datasets (e.g. [LAION](https://laion.ai/blog/laion-400-open-dataset/)) collected from the internet. As a result, the model itself is potentially vulnerable to generating equivalently inappropriate content or replicating inherent biases in the underlying data.
BLIP2 has not been tested in real-world applications. It should not be directly deployed in any application. Researchers should first carefully assess the safety and fairness of the model in relation to the specific context in which it will be deployed.
💻 Usage Examples
Basic Usage
For code examples, refer to the documentation, or use one of the snippets below depending on your use case:
Running the model on CPU
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
Advanced Usage
Running the model on GPU
In full precision
# pip install accelerate
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", device_map="auto")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
In half precision (float16)
# pip install accelerate
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16, device_map="auto")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
In 8-bit precision (int8)
# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", load_in_8bit=True, device_map="auto")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
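The snippets above all show single-turn visual question answering. For the chat-like usage mentioned in the model description, previous turns can simply be concatenated into the text prompt. The sketch below reuses processor, model, raw_image, and torch from one of the GPU examples above; the "Question: ... Answer: ..." prompt format is an assumption based on common BLIP-2 usage, not something this card prescribes.

# Hypothetical multi-turn sketch: the previous exchange is prepended to the new question.
# Reuses `processor`, `model`, and `raw_image` from a GPU example above.
history = "Question: how many dogs are in the picture? Answer: 1."
prompt = history + " Question: what is the dog doing? Answer:"

inputs = processor(raw_image, prompt, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))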
📄 License
This model is released under the MIT license.