
Image Captioning With BLIP

Developed by Vidensogende
BLIP is a unified vision-language pretraining framework that excels at tasks such as image caption generation, supporting both conditional (prompt-guided) and unconditional text generation
Downloads: 16
Release date: 12/7/2023

Model Overview

A vision-language model pretrained on the COCO dataset with a ViT-Large backbone, capable of generating natural-language descriptions for input images
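
A minimal usage sketch follows, based on the Hugging Face transformers BLIP classes. The checkpoint id "Salesforce/blip-image-captioning-large" is the public ViT-L COCO-trained BLIP captioning model and is an assumption here; substitute this repository's own checkpoint id if it differs. The sketch shows both conditional (prompt-guided) and unconditional captioning.

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed checkpoint: the public ViT-L BLIP captioning model; swap in this
# repository's checkpoint id if it differs.
ckpt = "Salesforce/blip-image-captioning-large"
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForConditionalGeneration.from_pretrained(ckpt)

# Any RGB image works; a COCO validation image is used for illustration.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Conditional captioning: the text prompt steers the generated caption.
inputs = processor(image, "a photography of", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))

# Unconditional captioning: no prompt, the model describes the image freely.
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```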

Model Features

Unified Vision-Language Framework
Supports both vision-language understanding and generation tasks, with flexible transfer to downstream applications
Caption Bootstrapping (CapFilt)
Exploits noisy web data effectively: a captioner generates synthetic captions and a filter removes noisy ones, improving training-data quality
Multi-task Adaptability
Applicable to a variety of vision-language tasks, such as image-text retrieval and visual question answering

Model Capabilities

Image Caption Generation
Conditional Text Generation
Vision-Language Understanding
Zero-shot Transfer Learning

Use Cases

Content Generation
Automatic Image Tagging
Automatically generates descriptive text for social media images
Enhances content accessibility and search efficiency
Assisting Visually Impaired Users
Converts visual content into spoken descriptions (a caption-to-speech sketch follows this list)
Improves digital content accessibility
Multimodal Applications
Visual Question Answering System
Answers user questions based on image content (see the VQA sketch below)
The BLIP paper reports a 1.6% improvement in VQA score over the previous state of the art
Cross-modal Retrieval
Enables bidirectional retrieval between images and text (see the retrieval sketch below)
The BLIP paper reports a 2.7% gain in average Recall@1 on image-text retrieval over the previous state of the art
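
The accessibility use case amounts to a caption-then-speak pipeline. A minimal sketch, assuming the captioning setup from the overview example above and using pyttsx3 as one illustrative offline text-to-speech engine (any TTS service could be substituted):

```python
import pyttsx3  # offline text-to-speech engine, chosen here for illustration

def speak_caption(caption: str) -> None:
    """Read a generated image caption aloud."""
    engine = pyttsx3.init()
    engine.say(caption)
    engine.runAndWait()

# The caption string would come from the BLIP captioning example shown earlier.
speak_caption("two cats sleeping on a couch")
```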
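For visual question answering, transformers ships a dedicated BLIP head. A sketch assuming the separate public VQA checkpoint "Salesforce/blip-vqa-base", which is distinct from the captioning checkpoint described on this page:

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Assumed checkpoint: the public BLIP VQA model, not this captioning model.
ckpt = "Salesforce/blip-vqa-base"
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForQuestionAnswering.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# The processor packs the image and the question into one input batch.
inputs = processor(image, "how many cats are in the picture?", return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))  # e.g. "2"
```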
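Cross-modal retrieval builds on BLIP's image-text matching (ITM) head. A sketch assuming the public retrieval checkpoint "Salesforce/blip-itm-base-coco"; scoring each candidate caption against an image, as done here, is the basic operation behind bidirectional retrieval:

```python
import requests
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

# Assumed checkpoint: the public BLIP image-text matching model.
ckpt = "Salesforce/blip-itm-base-coco"
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForImageTextRetrieval.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Rank candidate captions by their image-text matching probability.
candidates = ["two cats sleeping on a couch", "a plane flying in the sky"]
for text in candidates:
    inputs = processor(image, text, return_tensors="pt")
    itm_logits = model(**inputs).itm_score  # shape (1, 2): [no-match, match]
    match_prob = torch.softmax(itm_logits, dim=1)[0, 1].item()
    print(f"{match_prob:.3f}  {text}")
```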