vlrm-blip2-opt-2.7b: Open-Source Image Captioning Model for Long, Comprehensive Image Descriptions

Vlrm Blip2 Opt 2.7b

Developed by sashakunitsyn

A BLIP-2 OPT-2.7B model fine-tuned with reinforcement learning, capable of generating long and comprehensive image descriptions

Image-to-Text

Transformers

English

Open Source License: MIT

#Reinforcement Learning Fine-tuning #Long-text Image Captioning #Zero-shot Generation

Downloads 398

Release Time: 4/2/2024

Model Overview

This model is a vision-language model based on the BLIP-2 OPT-2.7B architecture, fine-tuned with reinforcement learning methods, focusing on image caption generation tasks. Compared to the original model, it can generate more detailed and comprehensive descriptions.

Model Features

Reinforcement Learning Fine-tuning

Optimized with reinforcement learning methods, enabling the model to generate longer and more comprehensive image descriptions

No Additional Computational Overhead

The RL-tuned model generates longer and more comprehensive descriptions without any additional computational cost compared to the original model

Modular Loading

The fine-tuned layer weights can be loaded on their own and applied on top of the original BLIP-2 OPT-2.7B model

Model Capabilities

Image Caption Generation

Vision-Language Understanding

Multimodal Processing

Use Cases

Image Understanding

Automatic Image Tagging

Generate detailed descriptions for images, useful for content management systems

Generates more comprehensive and longer descriptions compared to the original model

Assisting Visually Impaired Users

Provide detailed image descriptions for visually impaired users

Offers richer scene information

Content Creation

Social Media Content Generation

Automatically generate engaging descriptions for social media images

Generates more attractive long descriptions

🚀 VLRM

This repository provides the weights of the BLIP-2 OPT-2.7B model fine-tuned using the reinforcement learning method presented in the paper "VLRM: Vision-Language Models act as Reward Models for Image Captioning". The RL-tuned model can generate longer and more comprehensive descriptions without any additional computational cost compared to the original model.

🚀 Quick Start

There are two ways to run the VLRM model: load the full fine-tuned model from this repository, or load the original model and then apply the RL-tuned weights.

💻 Usage Examples

Option 1: Load the whole model from this repo

You can load the entire fine-tuned model from this repository.

import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor and the full RL-tuned model in half precision.
# device_map="auto" requires the accelerate package.
processor = Blip2Processor.from_pretrained("sashakunitsyn/vlrm-blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("sashakunitsyn/vlrm-blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto")

# Download the demo image and convert it to RGB.
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Preprocess the image and move the tensors to the GPU in float16.
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

# Generate up to 60 new tokens and decode them into a caption.
out = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(out[0], skip_special_tokens=True).strip())
# Output:
# a woman in a plaid shirt shaking hands with a yellow labrador retriever sitting on the ground at sunset on a beach in florida

Option 2: Load the original model first and then the RL-tuned weights

Since the fine-tuned layers account for a small part of the whole model, you can load the original model first and then the RL-tuned weights.
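The effect of applying a partial checkpoint on top of a full model can be sketched on parameter names alone. The helper below, `summarize_partial_load`, is a hypothetical illustration and not part of transformers or this repository, and the key names are toy placeholders rather than the real BLIP-2 parameter names. It only mirrors the bookkeeping that a non-strict state-dict load performs.

```python
def summarize_partial_load(model_keys, checkpoint_keys):
    """Report how a partial checkpoint overlaps with a model's state dict.

    Hypothetical helper: compares key names only, no tensors are touched.
    """
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    return {
        # Parameters that the checkpoint would overwrite.
        "updated": sorted(model_keys & checkpoint_keys),
        # Parameters that keep their original values.
        "untouched": sorted(model_keys - checkpoint_keys),
        # Checkpoint entries with no matching parameter in the model.
        "unexpected": sorted(checkpoint_keys - model_keys),
    }


# Toy key names for illustration only.
summary = summarize_partial_load(
    model_keys=["block_a.weight", "block_b.weight", "block_c.weight"],
    checkpoint_keys=["block_b.weight"],
)
print(summary["updated"])     # ['block_b.weight']
print(summary["untouched"])   # ['block_a.weight', 'block_c.weight']
print(summary["unexpected"])  # []
```

With the real objects, the same check can be run on `model.state_dict().keys()` and the keys of the loaded checkpoint. PyTorch's `load_state_dict` also returns an object with `missing_keys` and `unexpected_keys` fields; a non-empty `unexpected_keys` list would suggest that the checkpoint does not match the base model.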

Step 1. Load the original model

import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor and the original BLIP-2 OPT-2.7B model in half precision.
# device_map="auto" requires the accelerate package.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto")

# Download the demo image and convert it to RGB.
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Preprocess the image and move the tensors to the GPU in float16.
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

# Caption the image with the original (not yet RL-tuned) weights.
out = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(out[0], skip_special_tokens=True).strip())
# Output:
# a woman sitting on the beach with a dog

Step 2. Load the RL-tuned weights

Available checkpoints:

vlrm-blip2-opt-2.7b.pt (VLRM in the paper)
vlrm-rs-blip2-opt-2.7b.pt (VLRM-RS in the paper)

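from huggingface_hub import hf_hub_download

# Download the checkpoint that holds only the fine-tuned layers.
checkpoint_path = hf_hub_download(repo_id="sashakunitsyn/vlrm-blip2-opt-2.7b", filename="vlrm-blip2-opt-2.7b.pt")
finetuned_weights_state_dict = torch.load(checkpoint_path, map_location="cpu")
# strict=False because the checkpoint covers only part of the model's parameters.
model.load_state_dict(finetuned_weights_state_dict, strict=False)

# Reuse the inputs from Step 1 and caption the image again.
out = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(out[0], skip_special_tokens=True).strip())
# Output:
# a woman in a plaid shirt shaking hands with a yellow labrador retriever sitting on the ground at sunset on a beach in florida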
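For a rough sense of the difference, the two captions quoted above can be compared by word count. This is only a crude proxy chosen for illustration, not a metric taken from the paper, and `caption_stats` is a hypothetical helper name.

```python
def caption_stats(caption):
    """Return simple length statistics for a caption string."""
    words = caption.split()
    return {"words": len(words), "unique_words": len(set(words))}


# Captions copied from the example outputs in this section.
original = "a woman sitting on the beach with a dog"
rl_tuned = (
    "a woman in a plaid shirt shaking hands with a yellow labrador "
    "retriever sitting on the ground at sunset on a beach in florida"
)

print(caption_stats(original)["words"])  # 9
print(caption_stats(rl_tuned)["words"])  # 24
```

On this single demo image, the RL-tuned caption is more than twice as long as the original one. One example says nothing about average behavior across a dataset.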

📚 Documentation

You can find other details in the GitHub Repository (to be done).

📄 License

This project is licensed under the MIT license.
