Fine-tuned open-source BLIP-Large model: fewer caption hallucinations and more accurate image captioning

Blip Image Captioning Large Mocha

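Developed by moranyanuka

This is the official fine-tuned version of the BLIP-Large model. It was fine-tuned on the MS-COCO dataset with the MOCHa reinforcement learning framework to mitigate hallucinations in open-vocabulary captions.

Image-to-Text

Transformers

Open-source license: MIT #hallucination-mitigated image captioning #open-vocabulary generation #reinforcement learning fine-tuning

Downloads: 188

Release date: 12/19/2023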

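Model overview

An image captioning model based on the BLIP-Large architecture that supports both conditional and unconditional caption generation.

Model features

MOCHa fine-tuning

Fine-tuned on the MS-COCO dataset with the MOCHa reinforcement learning framework.

Caption hallucination mitigation

Optimized specifically to reduce hallucinations in open-vocabulary captions.

Dual-mode generation

Supports two captioning modes: conditional (prompted with a text prefix) and unconditional (image only).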

Model capabilities

Image captioning

Conditional text generation

Vision-language understanding

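Use cases

Image understanding

Automatic image tagging

Generates accurate descriptive text for images.

Produces natural-language descriptions that match the image content.

Accessibility for visually impaired users

Converts visual content into text descriptions.

Helps visually impaired users understand what an image shows.

Content creation

Social media content generation

Automatically generates captions for uploaded images.

Improves the efficiency of content creation.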

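🚀 MOCHa checkpoint for the BLIP-Large model

The official checkpoint of the BLIP-Large model fine-tuned on MS-COCO with the MOCHa RL framework, introduced in Mitigating Open-Vocabulary Caption Hallucinations.

Project page

🚀 Quick start

This model can be used for both conditional and unconditional image captioning.

💻 Usage examples

Basic usage

Running the model on CPU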

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
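
In conditional mode the decoded caption begins with the prompt itself (the sample output in the half-precision example reads `a photography of a woman and a dog on the beach`). If only the continuation is wanted, a small post-processing step can drop the prefix. The sketch below is an illustration, not part of the model or of Transformers: `strip_prompt` is a hypothetical helper name, and the case-insensitive match is an assumption about how the decoded text compares with the prompt.

```python
def strip_prompt(caption: str, prompt: str) -> str:
    """Remove a leading prompt from a decoded caption, if present.

    Hypothetical helper for illustration; the match is case-insensitive
    because decoded text may be lowercased relative to the prompt.
    """
    caption = caption.strip()
    prompt = prompt.strip()
    if prompt and caption.lower().startswith(prompt.lower()):
        # Drop the prompt and any whitespace that followed it.
        return caption[len(prompt):].lstrip()
    # Unconditional captions (or captions without the prefix) pass through.
    return caption


if __name__ == "__main__":
    print(strip_prompt("a photography of a woman and a dog on the beach",
                       "a photography of"))
```

Captions produced in unconditional mode are returned unchanged, so the same helper can be applied to both modes.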

Advanced usage

Running the model on GPU

Running in full precision

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-large-mocha").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
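
The examples above caption one image per call. For several images, one possible approach is to pass them to the processor in small batches and decode all outputs together. The following is a minimal sketch under stated assumptions: `chunked` and `caption_images` are hypothetical helper names, the batch size of 4 is arbitrary, and the model is assumed to be loaded in full precision as in the example above (for `float16`, the inputs would also need the dtype cast shown in the half-precision example).

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def caption_images(images, processor, model, device="cpu", batch_size=4):
    """Unconditional captions for a list of PIL images, batch by batch.

    `processor` and `model` are the BlipProcessor and
    BlipForConditionalGeneration objects created as in the examples above.
    """
    captions = []
    for batch in chunked(images, batch_size):
        # Passing a list of images yields one batched pixel_values tensor.
        inputs = processor(images=batch, return_tensors="pt").to(device)
        out = model.generate(**inputs)
        captions.extend(processor.batch_decode(out, skip_special_tokens=True))
    return captions
```

With the objects from the full-precision example, a call could look like `caption_images([raw_image, raw_image], processor, model, device="cuda")`.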

Running in half precision (`float16`)

import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-large-mocha", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and a dog on the beach

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> there is a woman and a dog on the beach at sunset
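
The CPU, full-precision GPU, and half-precision GPU examples differ only in the target device and the dtype. One way to fold them into a single code path is sketched below. `choose_runtime` and `load_captioner` are hypothetical names introduced for illustration, and the policy (use `float16` only when CUDA is available) is an assumption rather than a recommendation from the model authors.

```python
from typing import Tuple

MODEL_ID = "moranyanuka/blip-image-captioning-large-mocha"


def choose_runtime(cuda_available: bool, prefer_half: bool = True) -> Tuple[str, str]:
    """Return (device, dtype name) for the three setups shown above."""
    if not cuda_available:
        # The CPU example uses the default full precision.
        return "cpu", "float32"
    return "cuda", ("float16" if prefer_half else "float32")


def load_captioner(model_id: str = MODEL_ID, prefer_half: bool = True):
    """Load processor and model on the chosen device (downloads weights)."""
    # Heavy imports are kept local so choose_runtime stays dependency-free.
    import torch
    from transformers import BlipProcessor, BlipForConditionalGeneration

    device, dtype_name = choose_runtime(torch.cuda.is_available(), prefer_half)
    dtype = getattr(torch, dtype_name)
    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=dtype
    ).to(device)
    return processor, model, device, dtype
```

Inputs then need the same placement as in the examples, for instance `processor(raw_image, return_tensors="pt").to(device, dtype)`.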

📄 License

This project is released under the MIT License.

Citation

@misc{benkish2024mitigating,
      title={Mitigating Open-Vocabulary Caption Hallucinations}, 
      author={Assaf Ben-Kish and Moran Yanuka and Morris Alper and Raja Giryes and Hadar Averbuch-Elor},
      year={2024},
      eprint={2312.03631},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}