
BLIP VQA Base

Developed by Salesforce
BLIP is a unified vision-language pretraining framework that excels at visual question answering: joint training on images and text gives it both multimodal understanding and generation capabilities.
Downloads: 1.9M
Release Date: 12/12/2022

Model Overview

A visual question answering model built on the ViT architecture. It understands image content and answers questions about it, and also supports both conditional and unconditional image caption generation.
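A minimal usage sketch with the Hugging Face transformers library (the image URL and question follow the model's published example; exact output may vary):

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Load the processor and the VQA model from the Hugging Face Hub
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# Fetch an example image; any RGB image works here
img_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

# Encode the image together with a natural-language question
question = "How many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

# Generate and decode the answer tokens
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))  # expected: "1"
```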

Model Features

Unified Understanding and Generation
Handles both vision-language understanding and generation tasks in one model, overcoming the single-capability limitation of earlier designs.
Caption Bootstrapping Mechanism (CapFilt)
Improves training-data quality by using a captioner to synthesize descriptive texts and a filter to discard noisy image-text pairs; a sketch of the idea follows this list.
Zero-shot Transfer Capability
Generalizes well to new domains, such as video-language tasks, without task-specific fine-tuning.
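The bootstrapping idea can be approximated with off-the-shelf BLIP checkpoints: a captioning model proposes a synthetic caption, and an image-text matching (ITM) head scores it, keeping only high-confidence pairs. This is an illustrative sketch of the CapFilt concept, not BLIP's actual training pipeline; the checkpoint names are real Hub IDs, but the 0.5 keep-threshold is an assumption:

```python
import requests
import torch
from PIL import Image
from transformers import (
    BlipProcessor,
    BlipForConditionalGeneration,
    BlipForImageTextRetrieval,
)

# Captioner proposes synthetic captions; the ITM filter scores image-text agreement
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
itm_processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
itm_filter = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")

def bootstrap_caption(image, keep_threshold=0.5):  # threshold is an assumed value
    """Generate a caption, then keep it only if the ITM head judges it a match."""
    with torch.no_grad():
        cap_inputs = cap_processor(image, return_tensors="pt")
        caption = cap_processor.decode(
            captioner.generate(**cap_inputs)[0], skip_special_tokens=True
        )
        itm_inputs = itm_processor(image, caption, return_tensors="pt")
        itm_logits = itm_filter(**itm_inputs).itm_score  # shape (1, 2): [no match, match]
        match_prob = torch.softmax(itm_logits, dim=1)[0, 1].item()
    return caption if match_prob >= keep_threshold else None

img_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
print(bootstrap_caption(image))
```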

Model Capabilities

Image Content Understanding
Visual Question Answering
Image Caption Generation
Multimodal Reasoning

Use Cases

Intelligent Assistance
Assistance for the Visually Impaired
Describes image content to visually impaired users through a question-and-answer format
Accurately counts objects in an image (e.g., correctly answering that the example image contains 1 dog)
Content Moderation
Image Content Review
Automatically analyzes image content and answers targeted review questions; both use cases are sketched below
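Both use cases reduce to posing different questions to the same checkpoint. A minimal sketch (the question list and image URL are illustrative assumptions):

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

img_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

# Accessibility-style and moderation-style questions against the same image
questions = [
    "What is in the picture?",                      # scene description for a screen-reader flow
    "How many dogs are in the picture?",            # object counting
    "Is anything dangerous shown in the picture?",  # simple moderation check
]
for question in questions:
    inputs = processor(image, question, return_tensors="pt")
    answer = processor.decode(model.generate(**inputs)[0], skip_special_tokens=True)
    print(f"{question} -> {answer}")
```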