# BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
This is the model card for the BLIP image-captioning model pretrained on the COCO dataset, using the base architecture with a ViT large backbone.
*Figure: BLIP model overview (pull figure from the official BLIP repository).*
## Quick Start
The BLIP model is designed for unified vision-language understanding and generation tasks, and it can be used effectively for image captioning.
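For a quick try-out, the checkpoint can also be wrapped in the high-level `pipeline` API from `transformers`. The snippet below is an illustrative addition (not part of the original card), assuming your installed `transformers` version provides the `image-to-text` pipeline and that the model and demo image can be downloaded:

```python
from transformers import pipeline

# Image-to-text pipeline backed by the BLIP captioning checkpoint.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")

# The pipeline accepts an image URL (or a PIL image / local path) and
# returns a list of dicts with a "generated_text" field.
result = captioner("https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg")
print(result[0]["generated_text"])
```

The full processor/model examples further below give finer control (prompted vs. unprompted captioning, device placement, precision).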
## ✨ Features
- Flexible Transfer: Transfers flexibly to both vision-language understanding and generation tasks.
- Effective Use of Noisy Data: Utilizes noisy web data by bootstrapping captions, with a captioner generating synthetic captions and a filter removing the noisy ones.
- State-of-the-Art Results: Achieves state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval, image captioning, and VQA.
- Strong Generalization: Demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
## Installation
The model can be used through the `transformers` library. Install it via pip if it is not already installed:

```bash
pip install transformers
```
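The usage examples below also rely on Pillow, `requests`, and PyTorch. If they are not already present in your environment (an assumption, since environments differ), they can be installed the same way:

```bash
pip install pillow requests torch
```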
## Usage Examples
### Basic Usage
This example shows how to use the model for conditional and unconditional image captioning on CPU.
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Conditional image captioning: the caption is generated as a continuation of the text prompt.
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# Unconditional image captioning: no text prompt is provided.
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
### Advanced Usage
#### Running the model on GPU in full precision
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Conditional image captioning: move the inputs to the same GPU as the model.
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# Unconditional image captioning: no text prompt is provided.
inputs = processor(raw_image, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
#### Running the model on GPU in half precision (`float16`)
```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Conditional image captioning: cast the inputs to float16 to match the model weights.
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# Unconditional image captioning: no text prompt is provided.
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
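The `generate()` calls in the examples above accept the standard `transformers` generation arguments. The sketch below is an additional, illustrative example (not from the original card): it picks the device automatically and passes `max_new_tokens` and `num_beams` as example parameters; the specific values are assumptions, not tuned recommendations.

```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Pick a GPU if one is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large").to(device)

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to(device)

# max_new_tokens caps the caption length; num_beams enables beam search.
# Both values here are illustrative defaults, not recommendations from the paper.
out = model.generate(**inputs, max_new_tokens=30, num_beams=5)
print(processor.decode(out[0], skip_special_tokens=True))
```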
## Documentation
### TL;DR
The authors write in the paper's abstract:

> Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.
## Technical Details
This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
## License
This model is released under the BSD 3-Clause license.
## BibTeX and citation info
```bibtex
@misc{https://doi.org/10.48550/arxiv.2201.12086,
  doi       = {10.48550/ARXIV.2201.12086},
  url       = {https://arxiv.org/abs/2201.12086},
  author    = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  keywords  = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title     = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  publisher = {arXiv},
  year      = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```