🚀 VLRM
This repository contains the weights of BLIP-2 OPT-2.7B fine-tuned with the reinforcement-learning method introduced in the paper VLRM: Vision-Language Models act as Reward Models for Image Captioning. Compared to the original model, the RL-tuned model generates longer and more comprehensive captions at zero extra computational cost.
Further details can be found in the GitHub repository (TBD).
🚀 Quick Start
💻 Usage Examples
Basic Usage
You can load the entire model directly from this repository:
```python
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("sashakunitsyn/vlrm-blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "sashakunitsyn/vlrm-blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto",
)

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=60)
processor.decode(out[0], skip_special_tokens=True).strip()
>>> 'a woman in a plaid shirt shaking hands with a yellow labrador retriever sitting on the ground at sunset on a beach in florida'
```
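BLIP-2 also supports prompted generation through the same processor, which can be used to steer the caption or ask a question about the image. A minimal sketch reusing the objects defined above (the prompt text is only illustrative):

```python
# Pass a text prompt together with the image; the model continues the prompt.
prompt = "Question: what is the dog doing? Answer:"
inputs = processor(raw_image, prompt, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```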
Advanced Usage
Since the fine-tuned layers account for only a small fraction of the full model, you can first load the original model and then apply the RL-tuned weights on top of it.
Step 1. Load the original model
```python
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto",
)

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=60)
processor.decode(out[0], skip_special_tokens=True).strip()
>>> 'a woman sitting on the beach with a dog'
```
Step 2. Load the RL-tuned weights
Available checkpoints:
- `vlrm-blip2-opt-2.7b.pt` (VLRM in the paper)
- `vlrm-rs-blip2-opt-2.7b.pt` (VLRM-RS in the paper)
```python
from huggingface_hub import hf_hub_download

finetuned_weights_state_dict = torch.load(
    hf_hub_download(repo_id="sashakunitsyn/vlrm-blip2-opt-2.7b", filename="vlrm-blip2-opt-2.7b.pt")
)
model.load_state_dict(finetuned_weights_state_dict, strict=False)

out = model.generate(**inputs, max_new_tokens=60)
processor.decode(out[0], skip_special_tokens=True).strip()
>>> 'a woman in a plaid shirt shaking hands with a yellow labrador retriever sitting on the ground at sunset on a beach in florida'
```
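To use the VLRM-RS variant instead, download the other checkpoint listed above and apply it the same way. A minimal sketch (the generated caption will differ from the one above); `strict=False` is needed because the checkpoint intentionally covers only the fine-tuned subset of layers, and the value returned by `load_state_dict` shows which parameters it leaves untouched:

```python
from huggingface_hub import hf_hub_download

# Download and apply the VLRM-RS checkpoint (filename from the list above).
rs_state_dict = torch.load(
    hf_hub_download(repo_id="sashakunitsyn/vlrm-blip2-opt-2.7b", filename="vlrm-rs-blip2-opt-2.7b.pt")
)
result = model.load_state_dict(rs_state_dict, strict=False)

# Only a small subset of layers was fine-tuned, so a long missing_keys list is expected,
# while unexpected_keys should be empty.
print(len(result.missing_keys), len(result.unexpected_keys))

out = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```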
📄 License
This project is released under the MIT license.