BLIP Open-Source Vision-Language Model - Free Support for Image Captioning and Visual Question Answering Tasks

Zcabnzh Bp

Developed by nanxiz

BLIP is a unified vision-language pretraining framework that performs well on tasks such as image captioning and visual question answering. A caption bootstrapping step that filters noisy web data improves the quality of its training data.

Image-to-Text

Transformers

Open Source License: BSD-3-Clause

Tags: #Multimodal understanding and generation #Vision-language pretraining #Noisy data filtering

Downloads 19

Release Time: 7/8/2024

Model Overview

An image captioning model pretrained on the COCO dataset. It uses a ViT large backbone and supports both conditional and unconditional caption generation.

Model Features

Unified Vision-Language Framework

Supports both vision-language understanding and generation tasks, enabling unified modeling for multiple tasks

Efficient Data Filtering

Automatically cleans noisy web data through a 'caption generation-filtering' mechanism, improving training data quality

Zero-shot Transfer Capability

Demonstrates strong generalization when transferred directly to video-language tasks in a zero-shot manner

Model Capabilities

Image caption generation

Visual question answering

Image-text retrieval

Multimodal understanding

Use Cases

Content Generation

Automatic Image Tagging

Automatically generates descriptive text for social media images

The BLIP paper reports a 2.8% CIDEr improvement for image captioning on the COCO dataset

Assistive Technology

Assistance for Visually Impaired

Converts visual content into textual descriptions

🚀 BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Model card for an image captioning model pretrained on the COCO dataset, using the base architecture with a ViT large backbone.


[Figure: BLIP overview, pulled from the official BLIP repository]

🚀 Quick Start

This model can be used for conditional and unconditional image captioning. The usage examples below show how to run it on CPU and GPU.

✨ Features

Flexible Transfer: BLIP is a new VLP framework that can flexibly transfer to both vision-language understanding and generation tasks.
Effective Utilization of Noisy Data: It effectively utilizes noisy web data by bootstrapping captions, with a captioner generating synthetic captions and a filter removing the noisy ones.
State-of-the-Art Results: Achieves state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval, image captioning, and VQA.
Strong Generalization Ability: Demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
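
The captioner-and-filter bootstrapping mentioned in the list above can be illustrated with a toy sketch. This is not the BLIP training code: `toy_captioner` and `toy_filter` are hypothetical stand-ins for the learned captioner and filter, an "image" is reduced to the set of objects it contains, and the sketch assumes the filter is applied to both the original web caption and the synthetic one.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

# Illustration only: an "image" is just the set of object names it contains.
Image = FrozenSet[str]
Pair = Tuple[Image, str]


def toy_captioner(image: Image) -> str:
    """Hypothetical captioner: describe the image by listing its objects."""
    return "a photo of " + " and ".join(sorted(image))


def toy_filter(image: Image, caption: str) -> bool:
    """Hypothetical filter: keep a caption that mentions at least one object in the image."""
    words = set(caption.lower().split())
    return len(words & image) > 0


def bootstrap_captions(
    web_pairs: Iterable[Pair],
    captioner: Callable[[Image], str] = toy_captioner,
    keep: Callable[[Image, str], bool] = toy_filter,
) -> List[Pair]:
    """Add a synthetic caption per image, then drop every caption the filter rejects."""
    cleaned: List[Pair] = []
    for image, web_caption in web_pairs:
        synthetic = captioner(image)
        for caption in (web_caption, synthetic):
            if keep(image, caption):
                cleaned.append((image, caption))
    return cleaned


web_pairs = [
    (frozenset({"dog", "beach"}), "buy cheap sunglasses online"),  # noisy web text
    (frozenset({"cat", "sofa"}), "my cat asleep on the sofa"),     # usable web text
]
cleaned = bootstrap_captions(web_pairs)
for image, caption in cleaned:
    print(sorted(image), "->", caption)
```

In this toy run the unrelated web text is discarded while the synthetic caption for the same image is kept, which is the effect the bootstrapping step is meant to have on real web data.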

📚 Documentation

TL;DR

The authors write in the paper's abstract:

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.
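
For readers unfamiliar with the retrieval metric quoted in the abstract, the sketch below shows one common way to compute recall@k from a similarity matrix. It is a simplified illustration, not the paper's evaluation code: the scores are made up, and it assumes each image has exactly one matching caption stored at the same index. The abstract does not spell out how the reported average is taken, so the sketch makes no attempt to reproduce it.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries whose matching item (same index) is among the k best-scoring candidates."""
    hits = 0
    for query_index, scores in enumerate(similarity):
        # Candidate indices sorted from highest to lowest similarity.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Made-up scores: row i is image i, column j is caption j.
similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],  # image 1 ranks caption 2 first, so it misses at k=1
    [0.1, 0.2, 0.7],
]
print(recall_at_k(similarity, k=1))
print(recall_at_k(similarity, k=2))
```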

💻 Usage Examples

Using the PyTorch model

Running the model on CPU

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
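
The conditional and unconditional calls above differ only in whether a text prefix is passed to the processor. A small wrapper can make that explicit. The helper below is a sketch, not part of the Transformers API: `caption_image` is a hypothetical name, and it assumes the processor accepts `images=` and `text=` keyword arguments and that the model exposes `device` and `dtype`, which should be checked against the installed Transformers version.

```python
from typing import Any, Optional


def caption_image(processor: Any, model: Any, image: Any,
                  prompt: Optional[str] = None, **generate_kwargs: Any) -> str:
    """Return one caption for `image`; pass `prompt` for conditional captioning."""
    # With text=None only the image is encoded (unconditional captioning);
    # a string is used as the prefix the caption continues from.
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    # Match the model's device and dtype so the helper works on CPU, GPU and float16.
    inputs = inputs.to(model.device, model.dtype)
    output_ids = model.generate(**inputs, **generate_kwargs)
    return processor.decode(output_ids[0], skip_special_tokens=True)


# Usage with the processor, model and raw_image created in the snippet above:
# print(caption_image(processor, model, raw_image, prompt="a photography of"))
# print(caption_image(processor, model, raw_image, max_new_tokens=30))
```

Extra keyword arguments are forwarded unchanged to `model.generate`, so generation settings such as `max_new_tokens` can be tried without editing the helper.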

Running the model on GPU

In full precision

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

In half precision (`float16`)

import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with her dog
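
As a rough illustration of what half precision buys, the sketch below compares the memory needed to store weights in `float32` and `float16`, and shows the precision lost when a value is stored in 16 bits. The parameter count is a hypothetical round number, not a figure from this model card, and the estimate covers weights only (activations and framework overhead are ignored).

```python
import struct

# Storage size of one value in each format, taken from the standard struct module.
BYTES_FLOAT32 = struct.calcsize("f")
BYTES_FLOAT16 = struct.calcsize("e")


def weight_memory_mib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in MiB."""
    return num_params * bytes_per_param / (1024 ** 2)


# Hypothetical parameter count, for illustration only. For a loaded model, count with:
#   sum(p.numel() for p in model.parameters())
num_params = 400_000_000

print(f"float32: {weight_memory_mib(num_params, BYTES_FLOAT32):.0f} MiB")
print(f"float16: {weight_memory_mib(num_params, BYTES_FLOAT16):.0f} MiB")

# float16 keeps fewer significant digits: round-trip a value through 16 bits.
value = 0.1
as_half = struct.unpack("e", struct.pack("e", value))[0]
print(value, "->", as_half)
```

Halving the bytes per parameter halves the weight memory, at the cost of the small rounding error visible in the last line.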

📄 License

The model is released under the BSD-3-Clause license.

BibTeX and citation info

@misc{https://doi.org/10.48550/arxiv.2201.12086,
  doi = {10.48550/ARXIV.2201.12086},
  url = {https://arxiv.org/abs/2201.12086},
  author = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
