blip-image-captioning-base-mocha: an open-source image captioning model that mitigates hallucinations for more accurate image descriptions

Blip Image Captioning Base Mocha

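Developed by moranyanuka

Official checkpoint of the BLIP base model, fine-tuned on the MS-COCO dataset with the MOCHa reinforcement learning framework to mitigate hallucinations in open-vocabulary image captioning.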

Image-to-Text

Transformers

Open-source license: MIT #hallucination-mitigated-captioning #rl-fine-tuning #open-vocabulary-generation

Downloads: 88

Release date: December 19, 2023

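Model Overview

An image-to-text generation model based on the BLIP architecture and specialized for image captioning. Fine-tuning with the MOCHa reinforcement learning framework reduces hallucinated content in the generated captions.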

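Model Features

MOCHa reinforcement learning fine-tuning

Fine-tuned with the MOCHa framework to mitigate hallucinations in open-vocabulary captioning.

Dual-mode generation

Supports both conditional (prompted with a text prefix) and unconditional caption generation.

Multi-precision support

Runs on CPU and GPU, with full-precision and half-precision (float16) modes.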

Model Capabilities

Image captioning

Conditional text generation

Unconditional text generation

Multilingual image understanding

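Use Cases

Content generation

Automatic image tagging

Automatically generates descriptive captions for images in social media and content management systems.

Produces accurate captions with fewer hallucinated details.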
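
For tagging many images, the captioning call can be wrapped in a small loop. The following is a minimal sketch, assuming `caption_fn` is any callable that maps an image path to a caption string (for example, a function built around this model). The name `caption_files` and the set of file extensions are illustrative choices, not part of the model or of Transformers.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

# Assumed set of image extensions; extend as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def caption_files(paths: Iterable[Path],
                  caption_fn: Callable[[Path], str]) -> Dict[str, str]:
    """Map each image file name to its caption, skipping non-image files.

    `caption_files` is a hypothetical helper; `caption_fn` is supplied
    by the caller and is expected to run the captioning model.
    """
    captions: Dict[str, str] = {}
    for path in paths:
        path = Path(path)
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # not an image we know how to handle
        captions[path.name] = caption_fn(path).strip()
    return captions
```

Keeping the model call behind `caption_fn` means the loop can be exercised with a stub before any weights are downloaded.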

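Assistance for visually impaired users

Provides text descriptions of image content for visually impaired users.

Improves accessibility and helps users understand visual content.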

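Computer vision research

Vision-language model research

Serves as a baseline or comparison model for vision-language tasks.

Provides a MOCHa-optimized reference point for benchmarking.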
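
When comparing captioners on hallucination, one common idea is to measure the share of mentioned objects that are not actually in the image, in the spirit of the CHAIR metric. The sketch below is a deliberately simplified version that uses exact word matching; it is not the official CHAIR implementation and not the evaluation used by the MOCHa authors. The object vocabulary and the ground-truth object set are assumed inputs, and `hallucination_rate` is a hypothetical name.

```python
import re
from typing import Iterable, Set


def hallucination_rate(caption: str,
                       vocabulary: Iterable[str],
                       present: Iterable[str]) -> float:
    """Fraction of vocabulary objects mentioned in `caption` that are not in `present`.

    Simplified on purpose: exact lowercase word matching, unique objects only,
    no handling of synonyms, plurals, or multi-word object names.
    Returns 0.0 when the caption mentions no vocabulary object.
    """
    words: Set[str] = set(re.findall(r"[a-z]+", caption.lower()))
    mentioned = words & {v.lower() for v in vocabulary}
    if not mentioned:
        return 0.0
    hallucinated = mentioned - {p.lower() for p in present}
    return len(hallucinated) / len(mentioned)
```

A lower value means fewer mentioned objects are missing from the ground truth; averaging it over a dataset gives a rough comparison between two captioning models.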

🚀 MOCHa Checkpoint for the BLIP-Base Model

The official checkpoint of the BLIP-Base model, fine-tuned on MS-COCO with the MOCHa RL framework introduced in Mitigating Open-Vocabulary Caption Hallucinations.

Project page

🚀 Quick Start

This model can be used for conditional and unconditional image captioning.
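
Both modes go through the same `generate` call; the only difference is whether a text prefix is passed to the processor along with the image. A minimal sketch of a wrapper, assuming `model` and `processor` are a loaded `BlipForConditionalGeneration` and `BlipProcessor`. The function name `caption_image` and its `prompt` parameter are illustrative and not part of the Transformers API.

```python
from typing import Any, Optional


def caption_image(model: Any, processor: Any, image: Any,
                  prompt: Optional[str] = None, **generate_kwargs: Any) -> str:
    """Caption one image; conditional if `prompt` is given, else unconditional.

    `caption_image` is a hypothetical helper, not a library function.
    """
    if prompt is None:
        # Unconditional: the model writes the whole caption.
        inputs = processor(image, return_tensors="pt")
    else:
        # Conditional: the caption continues the given text prefix.
        inputs = processor(image, prompt, return_tensors="pt")
    # Match the model's device and dtype so the same call works on CPU and GPU.
    inputs = inputs.to(model.device, model.dtype)
    # Extra keyword arguments (e.g. max_new_tokens) are forwarded to generate().
    out = model.generate(**inputs, **generate_kwargs)
    return processor.decode(out[0], skip_special_tokens=True)
```

Calling `caption_image(model, processor, raw_image)` gives an unconditional caption, while `caption_image(model, processor, raw_image, "a photography of")` continues the prefix.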

💻 Usage Examples

Basic Usage

Running the model on CPU

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-base-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-base-mocha")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

Running the model on GPU

In full precision

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-base-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-base-mocha").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

In half precision (`float16`)

import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-base-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-base-mocha", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog on the beach

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with a dog
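
The three variants above differ only in the target device and dtype. The sketch below picks both at runtime, assuming PyTorch and Transformers are installed. `pick_runtime` and `load_captioner` are hypothetical helper names, and restricting `float16` to GPU is a choice made here because the examples above only show half precision on CUDA.

```python
from typing import Tuple


def pick_runtime(cuda_available: bool, prefer_half: bool = True) -> Tuple[str, str]:
    """Return (device, dtype name). Half precision is only chosen on GPU."""
    if cuda_available:
        return ("cuda", "float16" if prefer_half else "float32")
    return ("cpu", "float32")


def load_captioner(model_id: str = "moranyanuka/blip-image-captioning-base-mocha",
                   prefer_half: bool = True):
    """Load the processor and model on the best available device."""
    # Imported lazily so pick_runtime() stays usable without PyTorch installed.
    import torch
    from transformers import BlipProcessor, BlipForConditionalGeneration

    device, dtype_name = pick_runtime(torch.cuda.is_available(), prefer_half)
    dtype = getattr(torch, dtype_name)  # e.g. torch.float16
    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=dtype
    ).to(device)
    return processor, model, device, dtype
```

With the returned values, inputs are moved the same way as in the examples above: `processor(raw_image, return_tensors="pt").to(device, dtype)`.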

📄 License

This project is released under the MIT License.

📚 Citation

@misc{benkish2024mitigating,
      title={Mitigating Open-Vocabulary Caption Hallucinations}, 
      author={Assaf Ben-Kish and Moran Yanuka and Morris Alper and Raja Giryes and Hadar Averbuch-Elor},
      year={2024},
      eprint={2312.03631},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}