
Blip Image Captioning Large

Developed by Salesforce
BLIP is a unified vision-language pretraining framework from Salesforce that excels at image captioning, supporting both conditional and unconditional caption generation.
Downloads 2.5M
Released: 12/13/2022

Model Overview

An image captioning model pretrained on the COCO dataset with a large ViT backbone, capable of generating natural-language descriptions for input images.

Model Features

Unified vision-language framework
Supports both vision-language understanding and generation tasks, with flexible transfer to downstream tasks
Bootstrapped captioning technique
Makes effective use of noisy web data: a captioner generates synthetic captions and a filter removes the noisy ones
Multi-task adaptation
Applicable to various tasks including image-text retrieval, image caption generation, and visual question answering

Model Capabilities

Image caption generation
Conditional image captioning
Unconditional image captioning
Vision-language understanding
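
The two captioning modes listed above can be sketched with the Hugging Face transformers API. This is a minimal illustration, not the card's official usage snippet; the checkpoint name `Salesforce/blip-image-captioning-large` comes from this card, while the helper function, the prompt text, and the file name `photo.jpg` are illustrative assumptions:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Checkpoint named on this card
MODEL_ID = "Salesforce/blip-image-captioning-large"

def caption_image(image, prompt=None):
    """Generate a caption for a PIL image.

    Pass a text prompt for conditional captioning (the model completes
    the prompt), or None for unconditional captioning.
    """
    processor = BlipProcessor.from_pretrained(MODEL_ID)
    model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)
    if prompt:
        inputs = processor(image, prompt, return_tensors="pt")  # conditional
    else:
        inputs = processor(image, return_tensors="pt")          # unconditional
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

if __name__ == "__main__":
    img = Image.open("photo.jpg").convert("RGB")   # illustrative file name
    print(caption_image(img))                       # unconditional caption
    print(caption_image(img, "a photography of"))   # conditional: completes the prompt
```

In conditional mode the prompt is prepended to the decoder input, so the generated text continues it; in unconditional mode the decoder starts from its begin-of-sequence token alone.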

Use Cases

Content generation
Automatic image tagging
Automatically generates descriptive text for images in photo libraries
Improves image retrieval efficiency and accessibility
Assistive technology
Visual impairment assistance
Describes image content for visually impaired users
Enhances accessibility of digital content