blip-image-captioning-large Open Source Model - Free Image Caption Generation and Understanding

Blip Image Captioning Large

Developed by movementso

BLIP is a unified vision-language pretraining framework, excelling in image caption generation and understanding tasks, efficiently utilizing web data through guided annotation strategies

Image-to-Text

Transformers

Open Source License:Bsd-3-clause #Multimodal Image Captioning #Zero-shot Transfer #Visual Language Generation

Downloads 18

Release Time : 6/25/2023

Model Overview

A vision-language model pretrained on the COCO dataset, capable of generating natural language descriptions for images, supporting both conditional and unconditional image caption generation

Model Features

Unified Vision-Language Framework

Supports both vision-language understanding and generation tasks with flexible transfer capabilities

Guided Annotation Strategy

Generates synthetic captions through annotators and filters out low-quality samples to effectively utilize noisy web data

Multi-task Adaptability

Applicable to various tasks including image-text retrieval, image caption generation, and visual question answering

Model Capabilities

Image Caption Generation

Vision-Language Understanding

Conditional Image Captioning

Unconditional Image Captioning

Use Cases

Content Generation

Automatic Image Tagging

Automatically generates descriptive text for images

Achieves a 2.8% improvement in CIDEr metric on the COCO dataset

Assistive Technology

Visual Impairment Assistance

Describes image content for visually impaired users

🚀 BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Model card for image captioning pretrained on COCO dataset - base architecture (with ViT large backbone).


Pull figure from BLIP official repo

🚀 Quick Start

This is a model card for image captioning pretrained on the COCO dataset, with a base architecture using a ViT large backbone. It can be used for both conditional and unconditional image captioning tasks.

✨ Features

Flexible Transfer: BLIP is a new Vision-Language Pre-training (VLP) framework that can flexibly transfer to both vision-language understanding and generation tasks.
Effective Use of Noisy Data: It effectively utilizes noisy web data by bootstrapping captions, with a captioner generating synthetic captions and a filter removing the noisy ones.
State-of-the-Art Results: Achieves state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval, image captioning, and VQA.
Strong Generalization: Demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.

📚 Documentation

TL;DR

Authors from the paper write in the abstract:

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to videolanguage tasks in a zero-shot manner. Code, models, and datasets are released.

💻 Usage Examples

Basic Usage

You can use this model for conditional and un-conditional image captioning.

Using the Pytorch model

Running the model on CPU

Click to expand

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

Running the model on GPU

In full precision

Click to expand

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

In half precision (`float16`)

Click to expand

import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
>>> a woman sitting on the beach with her dog

📄 License

This model is released under the bsd-3-clause license.

📚 BibTex and citation info

@misc{https://doi.org/10.48550/arxiv.2201.12086,
  doi = {10.48550/ARXIV.2201.12086},
  
  url = {https://arxiv.org/abs/2201.12086},
  
  author = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
  
  title = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  
  publisher = {arXiv},
  
  year = {2022},
  
  copyright = {Creative Commons Attribution 4.0 International}
}

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご

Blip Image Captioning Large

Model Overview

Model Features

Model Capabilities

Use Cases

🚀 BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

🚀 Quick Start

✨ Features

📚 Documentation

TL;DR

💻 Usage Examples

Basic Usage

Using the Pytorch model

Running the model on CPU

Running the model on GPU

In full precision

In half precision (float16)

📄 License

📚 BibTex and citation info

In half precision (`float16`)