Open-source Pic2Story model - Implements image description generation and understanding based on BLIP, free and easy to use

Pic2story

Developed by abhijit2111

BLIP is a unified vision-language pretraining framework that excels at image captioning and understanding tasks and makes effective use of noisy web data by bootstrapping captions.

Image-to-Text

Transformers

Open Source License: BSD-3-Clause #Multimodal understanding and generation #Zero-shot video transfer #Noisy data filtering

Downloads 140

Release Time : 4/9/2024

Model Overview

This is an image captioning model pretrained on the COCO dataset. It uses a ViT-large backbone and supports both conditional and unconditional image caption generation.

Model Features

Unified vision-language framework

Flexibly transferable to vision-language understanding and generation tasks

Caption bootstrapping

Makes effective use of noisy web data: a captioner generates synthetic captions and a filter removes the noisy ones

Multi-task adaptation

Supports various tasks including image captioning, image-text retrieval, and visual question answering

Model Capabilities

Image captioning

Vision-language understanding

Conditional text generation

Unconditional text generation

Use Cases

Content generation

Automatic image tagging

Generate descriptive text for images

The BLIP paper reports a 2.8% improvement in CIDEr score on the COCO dataset

Information retrieval

Image-text retrieval

Match relevant images based on text queries

The BLIP paper reports a 2.7% improvement in average recall@1

Intelligent Q&A

Visual question answering

Answer questions about image content

The BLIP paper reports a 1.6% improvement in VQA score
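To make the retrieval metric above concrete, here is a minimal sketch of how recall@1 can be computed from a similarity matrix. The toy scores, the helper names, and the choice to average the text-to-image and image-to-text directions are assumptions for illustration; the BLIP paper's exact evaluation protocol is not described on this page.

```python
def recall_at_1(similarity):
    """Fraction of queries whose top-scoring candidate is the correct one.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i sits at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=lambda j: row[j])
        if best == i:
            hits += 1
    return hits / len(similarity)


def transpose(matrix):
    """Swap rows and columns so images become the queries."""
    return [list(col) for col in zip(*matrix)]


# Toy scores (made up): rows are 3 text queries, columns are 3 images.
text_to_image = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],  # text query 1 wrongly prefers image 2
    [0.1, 0.2, 0.7],
]

t2i = recall_at_1(text_to_image)
i2t = recall_at_1(transpose(text_to_image))
average_r1 = (t2i + i2t) / 2
print(f"text-to-image R@1={t2i:.3f}, image-to-text R@1={i2t:.3f}, average={average_r1:.3f}")
```

In a real evaluation the similarity matrix would come from the model's image and text embeddings rather than hand-written numbers.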

🚀 BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

BLIP is a model for image captioning, which can be flexibly transferred to both vision-language understanding and generation tasks, achieving state-of-the-art results on a wide range of vision-language tasks.

🚀 Quick Start

This is the Salesforce BLIP large image captioning model with small back-end adjustments to its parameters for testing; notably, the length of the reply is increased. The underlying checkpoint was pretrained for image captioning on the COCO dataset and uses the base architecture with a ViT-large backbone.
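This page does not say which parameters were changed to lengthen the reply. As a hedged sketch, one common way to allow longer captions with Transformers is to pass `max_new_tokens` to `generate`. The helper name `generate_caption` and the default of 60 tokens are hypothetical; `model` and `processor` are assumed to be a `BlipForConditionalGeneration` and a `BlipProcessor` already loaded on CPU.

```python
def generate_caption(model, processor, image, prompt=None, max_new_tokens=60):
    """Caption a PIL image, optionally continuing from a text prompt.

    max_new_tokens caps how many tokens are generated; the default here is an
    arbitrary illustrative value, not the setting used by Pic2Story.
    """
    if prompt is None:
        # Unconditional captioning: the model sees only the image.
        inputs = processor(image, return_tensors="pt")
    else:
        # Conditional captioning: the caption continues the given prompt.
        inputs = processor(image, prompt, return_tensors="pt")

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

With a loaded model this would be called as `generate_caption(model, processor, raw_image, prompt="a photography of", max_new_tokens=80)`. On GPU the processor output would also need to be moved to the model's device before calling `generate`.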


Figure: taken from the official BLIP repository.

✨ Features

The authors of the paper write in the abstract:

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.
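The bootstrapping idea in the abstract (a captioner proposes synthetic captions, a filter drops noisy ones) can be illustrated with a toy sketch. Everything below is a stand-in: `IMAGE_TAGS`, `toy_captioner`, `toy_match_score`, and the 0.5 threshold are hypothetical, and the real BLIP captioner and filter are learned models rather than keyword lookups. The sketch assumes the filter is applied to both the original web caption and the synthetic one.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    image_id: str
    caption: str
    source: str  # "web" or "synthetic"


# Hypothetical ground truth standing in for what each image actually shows.
IMAGE_TAGS = {
    "img1": {"woman", "dog", "beach"},
    "img2": {"cat", "sofa"},
}


def toy_captioner(image_id):
    """Stand-in captioner: emits the image's tags as a caption."""
    return " ".join(sorted(IMAGE_TAGS[image_id]))


def toy_match_score(image_id, caption):
    """Stand-in filter: fraction of caption words that describe the image."""
    words = caption.lower().split()
    if not words:
        return 0.0
    return sum(word in IMAGE_TAGS[image_id] for word in words) / len(words)


def bootstrap_captions(web_pairs, captioner, match_score, threshold=0.5):
    """Add a synthetic caption per image, then keep only captions that pass the filter."""
    kept = []
    for pair in web_pairs:
        synthetic = Pair(pair.image_id, captioner(pair.image_id), "synthetic")
        for candidate in (pair, synthetic):
            if match_score(candidate.image_id, candidate.caption) >= threshold:
                kept.append(candidate)
    return kept


web_pairs = [
    Pair("img1", "buy cheap sunglasses online", "web"),  # noisy alt-text
    Pair("img2", "cat on sofa", "web"),                  # reasonable alt-text
]
cleaned = bootstrap_captions(web_pairs, toy_captioner, toy_match_score)
for pair in cleaned:
    print(pair)
```

In this toy run the unrelated web caption for `img1` is filtered out, while its synthetic caption and both captions for `img2` are kept.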

💻 Usage Examples

Basic Usage

You can use this model for conditional and unconditional image captioning.

Using the PyTorch model

Running the model on CPU


import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

Running the model on GPU

In full precision


import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

In half precision (`float16`)


import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with her dog

📚 Documentation

BibTex and citation info

@misc{https://doi.org/10.48550/arxiv.2201.12086,
  doi = {10.48550/ARXIV.2201.12086},
  url = {https://arxiv.org/abs/2201.12086},
  author = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

📄 License

This model is released under the BSD-3-Clause license.
