# BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
This is the model card for the BLIP image-captioning model pretrained on the COCO dataset, using the base architecture with a ViT large backbone.
*Figure: BLIP model overview (pull figure from the official BLIP repository).*
## Quick Start
The BLIP model is designed for unified vision-language understanding and generation tasks, and it can be used effectively for image captioning.
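For a quick try-out, the checkpoint can also be wrapped in the high-level `pipeline` API from `transformers`. The snippet below is an illustrative addition (not part of the original card), assuming your installed `transformers` version provides the `image-to-text` pipeline and that the model and demo image can be downloaded:

```python
from transformers import pipeline

# Image-to-text pipeline backed by the BLIP captioning checkpoint.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")

# The pipeline accepts an image URL (or a PIL image / local path) and
# returns a list of dicts with a "generated_text" field.
result = captioner("https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg")
print(result[0]["generated_text"])
```

The full processor/model examples further below give finer control (prompted vs. unprompted captioning, device placement, precision).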
## ✨ Features
- Flexible Transfer: Transfers flexibly to both vision-language understanding and generation tasks.
- Effective Use of Noisy Data: Utilizes noisy web data by bootstrapping captions, with a captioner generating synthetic captions and a filter removing the noisy ones.
- State-of-the-Art Results: Achieves state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval, image captioning, and VQA.
- Strong Generalization: Demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
## Installation
The model can be used through the `transformers` library. Install it via pip if it is not already installed:

```bash
pip install transformers
```
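The usage examples below also rely on Pillow, `requests`, and PyTorch. If they are not already present in your environment (an assumption, since environments differ), they can be installed the same way:

```bash
pip install pillow requests torch
```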
## Usage Examples
### Basic Usage
This example shows how to use the model for conditional and unconditional image captioning on CPU.
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Conditional image captioning: the caption is generated as a continuation of the text prompt.
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# Unconditional image captioning: no text prompt is provided.
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
### Advanced Usage
#### Running the model on GPU in full precision
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Conditional image captioning: move the inputs to the same GPU as the model.
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# Unconditional image captioning: no text prompt is provided.
inputs = processor(raw_image, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
#### Running the model on GPU in half precision (`float16`)
```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Conditional image captioning: cast the inputs to float16 to match the model weights.
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# Unconditional image captioning: no text prompt is provided.
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
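The `generate()` calls in the examples above accept the standard `transformers` generation arguments. The sketch below is an additional, illustrative example (not from the original card): it picks the device automatically and passes `max_new_tokens` and `num_beams` as example parameters; the specific values are assumptions, not tuned recommendations.

```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Pick a GPU if one is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large").to(device)

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to(device)

# max_new_tokens caps the caption length; num_beams enables beam search.
# Both values here are illustrative defaults, not recommendations from the paper.
out = model.generate(**inputs, max_new_tokens=30, num_beams=5)
print(processor.decode(out[0], skip_special_tokens=True))
```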
## Documentation
### TL;DR
The authors write in the paper's abstract:

> Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.
## Technical Details
This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
## License
This model is released under the BSD 3-Clause license.
## BibTeX and citation info
```bibtex
@misc{https://doi.org/10.48550/arxiv.2201.12086,
  doi       = {10.48550/ARXIV.2201.12086},
  url       = {https://arxiv.org/abs/2201.12086},
  author    = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  keywords  = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title     = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  publisher = {arXiv},
  year      = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```