BLIP (image-caption-large-copy) Open-Source Vision-Language Model - Generate Precise Image Descriptions for Free

Image Caption Large Copy

Developed by Sof22

BLIP is a vision-language pre-training model that excels at image captioning. It makes effective use of noisy web data by bootstrapping captions: a captioner generates synthetic captions and a filter removes the noisy ones.

Image-to-Text

Transformers

Open Source License: BSD-3-Clause

#Multimodal Image-Text Understanding #Zero-shot Transfer #High-quality Description Generation

Downloads 1,042

Release Time : 9/19/2023

Model Overview

This image captioning model is pre-trained on the COCO dataset and uses a ViT large backbone. It supports both conditional (prompted) and unconditional caption generation.

Model Features

Unified Vision-Language Framework

Flexibly transferable to vision-language understanding and generation tasks

Caption Bootstrapping

A captioner generates synthetic captions and a filter removes low-quality ones, making effective use of noisy web data

Multi-task Support

Supports various tasks including vision-language retrieval, image captioning, and visual question answering

Model Capabilities

Image Captioning

Vision-Language Understanding

Multimodal Task Processing

Use Cases

Content Generation

Automatic Image Tagging

Automatically generates descriptions for images in social media or content management systems

Improves content accessibility and search engine optimization

Assistive Technology

Visual Impairment Assistance

Generates textual descriptions of images for visually impaired users

Enhances digital content accessibility
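As a rough illustration of the accessibility and tagging use cases above, the sketch below turns a generated caption into an HTML-safe `alt` attribute. The helper names `caption_to_alt_text` and `img_tag`, and the 125-character limit, are assumptions made for this example and are not part of the model or of any library.

```python
import html


def caption_to_alt_text(caption: str, prompt: str = "", max_chars: int = 125) -> str:
    """Turn a raw model caption into short, HTML-safe alt text (hypothetical helper)."""
    text = " ".join(caption.split())  # collapse runs of whitespace
    # Conditional captions start with the prompt text; drop it if present.
    if prompt and text.lower().startswith(prompt.lower()):
        text = text[len(prompt):].strip()
    if len(text) > max_chars:
        # Cut at the last space inside the limit and mark the truncation.
        text = text[:max_chars].rsplit(" ", 1)[0] + "..."
    if text:
        text = text[0].upper() + text[1:]
    return html.escape(text, quote=True)


def img_tag(src: str, caption: str, prompt: str = "") -> str:
    """Build an <img> tag whose alt attribute carries the caption."""
    alt = caption_to_alt_text(caption, prompt)
    return f'<img src="{html.escape(src, quote=True)}" alt="{alt}">'


print(img_tag("beach.jpg", "a photography of a woman and her dog", prompt="a photography of"))
```

Generated captions can be wrong, so a content management system would likely still want a human review step before publishing them as alt text.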

🚀 BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

This is an image captioning model pre-trained on the COCO dataset, using a ViT large backbone. It can be used for both conditional and unconditional image captioning.


Figure pulled from the official BLIP repository.

📚 Documentation

TL;DR

The authors write in the paper's abstract:

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.
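The bootstrapping idea in the abstract (a captioner proposes synthetic captions, a filter removes noisy ones) can be sketched in a few lines. This is a toy illustration, not the authors' implementation: `bootstrap_captions`, the `captioner` and `match_score` callables, and the `0.5` threshold are all hypothetical, and the sketch assumes the filter is applied to both the original web caption and the synthetic one.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image identifier, caption text)


def bootstrap_captions(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],
    match_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Pair]:
    """Keep the web and synthetic captions that the filter judges to match the image."""
    kept: List[Pair] = []
    for image, web_caption in web_pairs:
        synthetic_caption = captioner(image)  # captioner proposes a new caption
        for caption in (web_caption, synthetic_caption):
            if match_score(image, caption) >= threshold:  # filter drops noisy text
                kept.append((image, caption))
    return kept


# Toy stand-ins so the sketch runs without any model.
toy_reference = {"img1": "a dog on a beach", "img2": "a red car"}


def toy_captioner(image: str) -> str:
    return toy_reference[image]


def toy_score(image: str, caption: str) -> float:
    # Fraction of caption words that also appear in the toy reference caption.
    reference_words = set(toy_reference[image].split())
    caption_words = set(caption.split())
    return len(reference_words & caption_words) / max(len(caption_words), 1)


web = [("img1", "best holiday deals 2021"), ("img2", "a red car parked outside")]
clean = bootstrap_captions(web, toy_captioner, toy_score)
print(clean)
```

In this toy run the unrelated web text for `img1` is dropped, while the synthetic captions and the reasonable web caption for `img2` are kept.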

🚀 Quick Start

You can use this model for conditional and unconditional image captioning.

💻 Usage Examples

Basic Usage

Using the PyTorch model

Running the model on CPU

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

Advanced Usage

Running the model on GPU

In full precision

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

In half precision (`float16`)

import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with her dog

📄 License

This model is released under the BSD-3-Clause license.

Citation

@misc{https://doi.org/10.48550/arxiv.2201.12086,
  doi = {10.48550/ARXIV.2201.12086},
  url = {https://arxiv.org/abs/2201.12086},
  author = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

Property	Details
Model Type	Image captioning model
Training Data	COCO dataset

⚠️ Important Note

This is the Salesforce BLIP large image captioning model with small adjustments to the back-end parameters for testing. In particular, the maximum reply length has been increased.
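The note does not say which parameters were changed. One common way to allow longer captions with the Transformers API is to pass `max_new_tokens` to `generate`. The helper below is a minimal sketch under that assumption: `generate_caption` is a hypothetical name, the default of 50 tokens is illustrative and not the value used by this copy, and the code assumes the model and inputs are on the same device (for example, both on CPU).

```python
from typing import Any, Optional


def generate_caption(
    model: Any,
    processor: Any,
    image: Any,
    prompt: Optional[str] = None,
    max_new_tokens: int = 50,
    **generate_kwargs: Any,
) -> str:
    """Caption one image, with an explicit cap on the number of generated tokens.

    `model` and `processor` are expected to behave like BlipForConditionalGeneration
    and BlipProcessor; `prompt=None` gives unconditional captioning.
    """
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, **generate_kwargs)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

With the `model`, `processor`, and `raw_image` objects from the usage examples, a call such as `generate_caption(model, processor, raw_image, prompt="a photography of", max_new_tokens=60)` would permit up to 60 new tokens. Other generation settings, such as `num_beams`, can be forwarded through `generate_kwargs`.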
