BLIP-Large Fine-tuned Open-source Model: Reducing Caption Hallucinations for More Accurate Image Captions

Blip Image Captioning Large Mocha

Developed by moranyanuka

This is the official fine-tuned version of the BLIP-Large model, optimized on the MS-COCO dataset with the MOCHa reinforcement learning framework to mitigate open-vocabulary caption hallucinations.

Image-to-Text

Transformers

Open Source License: MIT

Tags: #Anti-hallucination Image Captioning, #Open-vocabulary Generation, #Reinforcement Learning Fine-tuning

Downloads: 188

Release Time: 12/19/2023

Model Overview

An image captioning model based on the BLIP-Large architecture that supports both conditional and unconditional caption generation.

Model Features

MOCHa Fine-tuning

Fine-tuned on the MS-COCO dataset using the MOCHa reinforcement learning framework

Mitigating Description Hallucination

Optimized to reduce open-vocabulary hallucinations, that is, captions that mention objects not present in the image

Dual-mode Generation

Supports both conditional captioning (the caption continues a text prompt) and unconditional captioning (no prompt)

Model Capabilities

Image Caption Generation

Conditional Text Generation

Vision-Language Understanding

Use Cases

Image Understanding

Automatic Image Tagging

Generates accurate descriptive text for images

Produces natural language descriptions that match image content

Assisting Visually Impaired Users

Converts visual content into textual descriptions

Helps visually impaired individuals understand image content

Content Creation

Social Media Content Generation

Automatically generates captions for uploaded images

Improves content creation efficiency

🚀 Mocha Checkpoint for BLIP-Large Model

This is the official MOCHa checkpoint for the BLIP-Large model, fine-tuned on MS-COCO using the MOCHa RL framework. The method is introduced in the paper Mitigating Open-Vocabulary Caption Hallucinations.

Project Page

🚀 Quick Start

You can use this model for both conditional and unconditional image captioning.

💻 Usage Examples

Basic Usage

This example shows how to run the model on CPU:

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
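
The two calls above differ only in whether a text prompt is passed to the processor. As a minimal sketch, both modes could be wrapped in one helper. The name `generate_caption` and the `max_new_tokens=30` default are illustrative assumptions, not part of the model or of Transformers. The helper expects the `processor` and `model` objects created above.

```python
def generate_caption(model, processor, image, prompt=None, max_new_tokens=30):
    """Caption one image (hypothetical helper, not part of transformers).

    With `prompt` set, the caption continues that text (conditional mode);
    with `prompt=None`, no text is passed (unconditional mode).
    """
    if prompt is None:
        inputs = processor(images=image, return_tensors="pt")
    else:
        inputs = processor(images=image, text=prompt, return_tensors="pt")
    # Move the inputs to wherever the model lives (CPU in this section).
    inputs = inputs.to(model.device)
    # max_new_tokens is a standard generate() argument; 30 is an arbitrary choice.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()
```

For example, `generate_caption(model, processor, raw_image, prompt="a photography of")` would run the conditional case, and omitting `prompt` would run the unconditional one.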

Advanced Usage

Running the model on GPU in full precision

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-large-mocha").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
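
On a GPU it is usually more efficient to caption several images per `generate` call. The sketch below shows one way this might be done. The helper name `caption_batch` and the `batch_size=8` default are assumptions for illustration, and it assumes a model loaded in full precision as in this section. It relies on the processor accepting a list of images and on `processor.batch_decode`.

```python
def caption_batch(model, processor, images, prompt=None, batch_size=8):
    """Caption a list of PIL images in mini-batches (hypothetical helper)."""
    captions = []
    for start in range(0, len(images), batch_size):
        chunk = images[start:start + batch_size]
        if prompt is None:
            inputs = processor(images=chunk, return_tensors="pt")
        else:
            # Repeat the prompt so every image in the chunk gets the same prefix.
            inputs = processor(
                images=chunk, text=[prompt] * len(chunk), return_tensors="pt"
            )
        # Move the inputs to the same device as the model.
        inputs = inputs.to(model.device)
        output_ids = model.generate(**inputs)
        captions.extend(processor.batch_decode(output_ids, skip_special_tokens=True))
    return captions
```

The returned list keeps the order of `images`, so captions can be paired back with file names or IDs. A suitable `batch_size` depends on available GPU memory.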

Running the model on GPU in half precision (`float16`)

import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-large-mocha", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and a dog on the beach

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> there is a woman and a dog on the beach at sunset
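
The CPU, full-precision GPU and half-precision GPU examples above differ only in the target device and dtype. A small sketch of choosing both at runtime follows. `choose_runtime` is a hypothetical helper, not part of PyTorch or Transformers, and it returns the dtype as a string so the function itself has no dependencies.

```python
def choose_runtime(cuda_available, prefer_half=True):
    """Return (device, dtype_name) for loading the model (hypothetical helper).

    Half precision is only selected on GPU here; float16 inference on CPU
    can be slow or unsupported for some operations.
    """
    if cuda_available:
        return "cuda", ("float16" if prefer_half else "float32")
    return "cpu", "float32"


# Usage sketch (requires torch and transformers; downloads the checkpoint):
# import torch
# device, dtype_name = choose_runtime(torch.cuda.is_available())
# dtype = getattr(torch, dtype_name)
# model = BlipForConditionalGeneration.from_pretrained(
#     "moranyanuka/blip-image-captioning-large-mocha", torch_dtype=dtype
# ).to(device)
# inputs = processor(raw_image, return_tensors="pt").to(device, dtype)
```

Keeping the decision in one place means the same script can run unchanged on machines with and without a GPU.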

📄 License

This project is licensed under the MIT license.

📚 Documentation

BibTeX

@misc{benkish2024mitigating,
      title={Mitigating Open-Vocabulary Caption Hallucinations}, 
      author={Assaf Ben-Kish and Moran Yanuka and Morris Alper and Raja Giryes and Hadar Averbuch-Elor},
      year={2024},
      eprint={2312.03631},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
