blip-image-captioning-base-football-finetuned open-source model - deploy for free and generate accurate descriptions of football images

Blip Image Captioning Base Football Finetuned

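Developed by ybelkada

A vision-language model pretrained on COCO and fine-tuned on a football dataset, intended for image captioning.

Image-to-Text

Transformers

Open-source license: BSD-3-Clause #vision-language-pretraining #image-captioning #unified-multitask-framework

Downloads: 71

Release date: 1/17/2023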

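Model overview

BLIP is a unified vision-language pretraining framework that handles both image understanding and caption generation tasks. This version is an image captioning model fine-tuned on a football dataset.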

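Model features

Unified vision-language framework

Supports both visual understanding and language generation tasks.

Bootstrapped captioning strategy

Makes effective use of noisy data by generating synthetic captions and filtering out the noisy ones.

Football scene optimization

Fine-tuned on a football dataset so that descriptions of football scenes are more accurate.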

Model capabilities

Image captioning

Conditional text generation

Vision-language understanding

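Use cases

Sports media

Automatic annotation of football match images

Generates descriptive text for match images in sports news.

Improves the efficiency of sports content production.

Accessibility technology

Visual assistance applications

Describes image content for visually impaired users.

Improves the accessibility of digital content.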
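For the automatic annotation use case above, a set of match images can be captioned in small batches rather than one at a time. The following is a minimal sketch, not part of the original model card: `caption_batch` and its parameters are hypothetical names, and it assumes `processor` and `model` are a `BlipProcessor` and `BlipForConditionalGeneration` loaded as in the usage examples on this page, with `images` given as PIL images.

```python
from typing import Any, Iterable, List


def caption_batch(
    images: Iterable[Any],
    processor: Any,
    model: Any,
    batch_size: int = 8,
    max_new_tokens: int = 30,
) -> List[str]:
    """Caption images in chunks of `batch_size` (unconditional captioning only)."""
    images = list(images)
    captions: List[str] = []
    for start in range(0, len(images), batch_size):
        chunk = images[start:start + batch_size]
        # The processor accepts a list of images and stacks them into one batch.
        inputs = processor(images=chunk, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        decoded = processor.batch_decode(output_ids, skip_special_tokens=True)
        captions.extend(text.strip() for text in decoded)
    return captions
```

The batch size is a memory trade-off; 8 is an arbitrary default and should be tuned to the available hardware.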

🚀 BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

This model is pretrained on the COCO dataset and fine-tuned on a football dataset for image captioning. It uses the base architecture, with a ViT base backbone.

Google Colab notebook for fine-tuning: https://colab.research.google.com/drive/1lbqiSiA0sDF7JDWPeS0tccrM85LloVha?usp=sharing


Image from the official BLIP repository

📚 Summary

The authors of the paper write the following in the abstract:

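Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.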
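The bootstrapping step described in the abstract (a captioner proposes synthetic captions and a filter discards noisy ones) can be pictured as a simple data flow. The sketch below is a toy illustration only, not the authors' implementation: `bootstrap_captions`, `captioner`, and `is_match` are hypothetical names, the two callables stand in for trained models, and applying the filter to the original web text as well as the synthetic caption is an assumption of this sketch that the abstract does not spell out.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image identifier, caption text)


def bootstrap_captions(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],
    is_match: Callable[[str, str], bool],
) -> List[Pair]:
    """Build a cleaner caption set from noisy (image, web text) pairs."""
    kept: List[Pair] = []
    for image_id, web_text in web_pairs:
        # Candidate captions: the original web text plus one synthetic caption.
        candidates = [web_text, captioner(image_id)]
        for text in candidates:
            # The filter keeps only captions it judges to match the image.
            if is_match(image_id, text):
                kept.append((image_id, text))
    return kept
```

In this picture, the surviving pairs form the cleaned dataset used for further pretraining; the quality of the result depends entirely on how good the captioner and the filter are.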

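📦 Installation

This model can be used for conditional and unconditional image captioning. The usage examples on this page import the transformers, torch, Pillow (PIL), and requests Python packages, so these need to be installed first.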

💻 Usage examples

Basic usage

Running the model on CPU

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("ybelkada/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("ybelkada/blip-image-captioning-base")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Conditional image captioning: the caption continues the given prompt
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog

# Unconditional image captioning: no text prompt is given
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with her dog
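The two modes shown above can be wrapped in a single helper. The sketch below is a hedged illustration rather than code from the model card: `load_blip` and `caption` are hypothetical names, and `MODEL_ID` assumes the fine-tuned weights are hosted on the Hugging Face Hub under the model name given on this page, which is not confirmed by the examples here (they load a base checkpoint).

```python
from typing import Any, Optional

# Assumption: Hub id inferred from the model name on this page.
MODEL_ID = "ybelkada/blip-image-captioning-base-football-finetuned"


def load_blip(model_id: str = MODEL_ID):
    """Load the processor and model (downloads weights on first use)."""
    # Imported here so the helpers can be defined without transformers installed.
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id)
    return processor, model


def caption(
    image: Any,
    processor: Any,
    model: Any,
    prompt: Optional[str] = None,
    max_new_tokens: int = 30,
) -> str:
    """Return a caption; if `prompt` is given, the caption continues it."""
    if prompt is None:
        # Unconditional mode: image only.
        inputs = processor(images=image, return_tensors="pt")
    else:
        # Conditional mode: image plus a text prefix.
        inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

A typical call would be `processor, model = load_blip()` followed by `caption(raw_image, processor, model)` or `caption(raw_image, processor, model, prompt="a photography of")`. Captions from the fine-tuned weights will generally differ from the base-model outputs quoted in the examples.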

Running the model on GPU

In full precision

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Conditional image captioning: the caption continues the given prompt
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog

# Unconditional image captioning: no text prompt is given
inputs = processor(raw_image, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with her dog

In half precision (`float16`)

import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Conditional image captioning: the caption continues the given prompt
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog

# Unconditional image captioning: no text prompt is given
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with her dog
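The CPU, full-precision GPU, and half-precision GPU variants above differ only in the device and dtype. One way to avoid maintaining three copies is to choose both at runtime. This is a minimal sketch under stated assumptions: `choose_device_and_dtype` and `load_for_runtime` are hypothetical helper names, and falling back to `float32` on CPU is a conservative choice of this sketch, not something the model card prescribes.

```python
from typing import Tuple


def choose_device_and_dtype(cuda_available: bool, prefer_half: bool = True) -> Tuple[str, str]:
    """Return (device, dtype name); half precision is only chosen on GPU."""
    if cuda_available:
        return "cuda", ("float16" if prefer_half else "float32")
    return "cpu", "float32"


def load_for_runtime(
    model_id: str = "Salesforce/blip-image-captioning-base",
    prefer_half: bool = True,
):
    """Load processor and model on the best available device."""
    # Imported here so the selection logic above works without torch installed.
    import torch
    from transformers import BlipForConditionalGeneration, BlipProcessor

    device, dtype_name = choose_device_and_dtype(torch.cuda.is_available(), prefer_half)
    dtype = getattr(torch, dtype_name)  # e.g. torch.float16
    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id, torch_dtype=dtype).to(device)
    return processor, model, device, dtype
```

With the returned values, inputs are moved the same way as in the examples above: `processor(raw_image, return_tensors="pt").to(device, dtype)`.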

📄 License

This model is released under the BSD 3-Clause license.

📚 BibTeX and citation info

@misc{https://doi.org/10.48550/arxiv.2201.12086,
  doi = {10.48550/ARXIV.2201.12086},
  url = {https://arxiv.org/abs/2201.12086},
  author = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}