BLIP-Large Fine-Tuned Open-Source Model - Accurate Image Captioning with Fewer Caption Hallucinations

Blip Image Captioning Large Mocha

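Developed by moranyanuka

This is the official fine-tuned version of the BLIP-Large model. It was fine-tuned on the MS-COCO dataset with the MOCHa reinforcement learning framework to mitigate open-vocabulary caption hallucinations.

Image-to-Text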

Transformers

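License: MIT #hallucination-resistant-captioning #open-vocabulary-generation #reinforcement-learning-fine-tuning

Downloads: 188

Release date: 12/19/2023

Model Overview

An image captioning model based on the BLIP-Large architecture that supports both conditional and unconditional caption generation.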

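Model Features

MOCHa fine-tuning

Fine-tuned on the MS-COCO dataset with the MOCHa reinforcement learning framework.

Reduced caption hallucination

Optimized specifically to mitigate open-vocabulary caption hallucinations.

Two generation modes

Supports both conditional (text-prompted) and unconditional image captioning.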

Model Capabilities

Image caption generation

Conditional text generation

Vision-language understanding

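Use Cases

Image understanding

Automatic image annotation

Generates accurate descriptive text for images.

Produces natural-language descriptions that match the image content.

Assisting visually impaired users

Converts visual content into text descriptions.

Helps visually impaired users understand what an image shows.

Content creation

Social media content generation

Automatically generates captions for uploaded images.

Speeds up content creation.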
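As a rough illustration of the annotation and accessibility use cases above, the sketch below maps image paths to captions and renders a caption as HTML alt text. The helper names `annotate_images` and `to_img_tag` are hypothetical and not part of the model or of Transformers. The captioner is passed in as a plain function, so a stand-in is used here in place of the real model.

```python
import html
import json
from pathlib import Path
from typing import Callable, Dict, Iterable


def annotate_images(paths: Iterable[str], captioner: Callable[[str], str]) -> Dict[str, str]:
    """Map each image path to the (whitespace-trimmed) caption from `captioner`."""
    return {str(p): captioner(str(p)).strip() for p in paths}


def to_img_tag(path: str, caption: str) -> str:
    """Render an HTML <img> tag whose alt text is the escaped caption."""
    return '<img src="{}" alt="{}">'.format(
        html.escape(path, quote=True), html.escape(caption, quote=True)
    )


if __name__ == "__main__":
    # Stand-in captioner (an assumption for this sketch) so it runs
    # without downloading the model; swap in a real captioning function.
    def fake_captioner(path: str) -> str:
        return "a caption for " + Path(path).name

    labels = annotate_images(["photos/beach.jpg"], fake_captioner)
    print(json.dumps(labels, indent=2))
    print(to_img_tag("photos/beach.jpg", labels["photos/beach.jpg"]))
```

Keeping the captioner as a parameter means the same annotation loop can be exercised with a stub and later pointed at the actual model.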

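🚀 MOCHa Checkpoint for the BLIP-Large Model

This project is the official checkpoint of the BLIP-Large model, fine-tuned on the MS-COCO dataset with the MOCHa reinforcement learning framework. The work is described in the paper Mitigating Open-Vocabulary Caption Hallucinations.

Project page

🚀 Quick Start

You can use this model for both conditional and unconditional image captioning.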
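The two modes differ only in whether a text prefix is passed to the processor. A minimal sketch of a single entry point covering both is shown below. The names `build_processor_inputs` and `caption` are hypothetical helpers introduced for illustration, and `max_new_tokens=30` is an arbitrary choice rather than a setting recommended by the authors.

```python
from typing import Optional

MODEL_ID = "moranyanuka/blip-image-captioning-large-mocha"


def build_processor_inputs(image, prompt: Optional[str] = None) -> dict:
    """Return keyword arguments for the BLIP processor.

    With a prompt, the model continues the given text (conditional mode);
    without one, it captions the image from scratch (unconditional mode).
    """
    kwargs = {"images": image, "return_tensors": "pt"}
    if prompt:  # None or "" both mean unconditional captioning
        kwargs["text"] = prompt
    return kwargs


def caption(image, prompt: Optional[str] = None, max_new_tokens: int = 30) -> str:
    """Caption a PIL image, optionally conditioned on a text prefix.

    Loads the checkpoint on every call for simplicity; a real application
    would load the processor and model once and reuse them.
    """
    # Imported lazily so build_processor_inputs works without transformers.
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(MODEL_ID)
    model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)
    inputs = processor(**build_processor_inputs(image, prompt))
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(out[0], skip_special_tokens=True)
```

Calling `caption(raw_image, "a photography of")` would run conditional captioning, and `caption(raw_image)` would run unconditional captioning, assuming `raw_image` is an RGB PIL image.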

💻 Usage Examples

Basic usage

Using the PyTorch model

Running the model on CPU

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

Advanced usage

Running the model on GPU

Full precision

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-large-mocha").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

Half precision (`float16`)

import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-large-mocha", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and a dog on the beach

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> there is a woman and a dog on the beach at sunset
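The CPU, full-precision GPU, and half-precision GPU snippets above differ only in the target device and dtype. One way to fold that choice into a single loader is sketched below. `select_runtime` and `load_model` are hypothetical helpers, and falling back to `float32` on CPU is an assumption of this sketch (half precision is generally aimed at GPU inference).

```python
from typing import Optional, Tuple


def select_runtime(cuda_available: Optional[bool] = None) -> Tuple[str, str]:
    """Return (device, dtype name): float16 on GPU, float32 on CPU."""
    if cuda_available is None:
        import torch  # imported lazily; only needed for auto-detection

        cuda_available = torch.cuda.is_available()
    return ("cuda", "float16") if cuda_available else ("cpu", "float32")


def load_model(model_id: str = "moranyanuka/blip-image-captioning-large-mocha"):
    """Load the processor and model on the selected device (downloads weights)."""
    import torch
    from transformers import BlipForConditionalGeneration, BlipProcessor

    device, dtype_name = select_runtime()
    dtype = getattr(torch, dtype_name)  # e.g. torch.float16
    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=dtype
    ).to(device)
    return processor, model, device, dtype
```

With the returned values, inputs would be prepared as `processor(raw_image, return_tensors="pt").to(device, dtype)`, mirroring the half-precision example.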

📚 Detailed Documentation

BibTeX citation

@misc{benkish2024mitigating,
      title={Mitigating Open-Vocabulary Caption Hallucinations}, 
      author={Assaf Ben-Kish and Moran Yanuka and Morris Alper and Raja Giryes and Hadar Averbuch-Elor},
      year={2024},
      eprint={2312.03631},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}