BLIP-2 Open-Source Vision-Language Model: Free Image-to-Text Generation

Blip2 Test

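Developed by advaitadasein

BLIP-2 is a vision-language model built on OPT-2.7b. It keeps the image encoder and the large language model frozen and trains a Querying Transformer to generate text from images.

Image-to-Text

Transformers

English

Open-source license: MIT

#image-captioning #visual-question-answering #multimodal-pre-training

Downloads: 18

Release date: 9/15/2023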

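Model Overview

BLIP-2 is a vision-language model that can perform tasks such as image captioning and visual question answering. A Querying Transformer connects the image encoder to the large language model, enabling efficient cross-modal understanding.

Model Features

Frozen pre-trained models

Only the lightweight Querying Transformer is trained while the image encoder and the large language model stay frozen, which improves training efficiency.

Cross-modal understanding

The Querying Transformer bridges the vision and language modalities, enabling high-quality image-to-text conversion.

Versatile applications

Supports a range of tasks, including image captioning, visual question answering, and chat-style interaction.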

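Model Capabilities

Image captioning

Visual question answering (VQA)

Image-grounded dialogue

Cross-modal understanding

Use Cases

Content generation

Automatic image tagging

Generates detailed text descriptions of images

Can be used to assist visually impaired users and in content management systems

Intelligent interaction

Visual question answering system

Answers natural-language questions about image content

Can serve as an intelligent assistant in scenarios such as education and retail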

🚀 BLIP-2, OPT-2.7b, pre-trained only

This BLIP-2 model uses OPT-2.7b, a large language model with 2.7 billion parameters. It was introduced in the paper BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models by Li et al. and first released in this repository.

Disclaimer: The team releasing BLIP-2 did not write a model card for this model, so this model card has been written by the Hugging Face team.

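🚀 Quick Start

You can use this model for conditional text generation given an image and optional text. Check the model hub for versions fine-tuned on a task that interests you.

✨ Key Features

Image captioning
Visual question answering (VQA)
Chat-like conversations, by feeding the image and the previous conversation to the model as the prompt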
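
The chat-like usage in the last item comes down to folding earlier turns into a single text prompt. The sketch below shows one way to do that. The "Question: ... Answer: ..." template and the helper name `build_chat_prompt` are assumptions made for illustration; this card does not prescribe a prompt format.

```python
def build_chat_prompt(history, question):
    """Fold earlier (question, answer) turns and the new question into one prompt.

    The "Question: ... Answer: ..." template is an assumed format,
    not something specified by this model card.
    """
    turns = [f"Question: {q} Answer: {a}." for q, a in history]
    # Leave the final answer open so the model continues from here.
    turns.append(f"Question: {question} Answer:")
    return " ".join(turns)


history = [("which city is this?", "singapore")]
prompt = build_chat_prompt(history, "why?")
print(prompt)
# Question: which city is this? Answer: singapore. Question: why? Answer:
```

The resulting string would be passed to the processor as the text input alongside the image, in place of a single question.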

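📚 Documentation

Model description

BLIP-2 consists of three models: a CLIP-like image encoder, a Querying Transformer (Q-Former), and a large language model.

The authors initialize the weights of the image encoder and the large language model from pre-trained checkpoints and keep them frozen while training the Querying Transformer. The Querying Transformer is a BERT-like Transformer encoder that maps a set of "query tokens" to query embeddings, which bridge the gap between the embedding space of the image encoder and that of the large language model.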
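
To make the training setup concrete, here is a heavily simplified PyTorch sketch under stated assumptions. `TinyQFormer`, the `freeze` helper, and the two `nn.Linear` stand-ins for the image encoder and language model are all hypothetical; the real Q-Former is a multi-layer BERT-like encoder, not a single cross-attention layer. The sketch only shows the two ideas from the paragraph above: a fixed set of learnable queries reads from image features, and only the Q-Former's parameters receive gradients.

```python
import torch
from torch import nn


class TinyQFormer(nn.Module):
    """Toy stand-in for the Q-Former: learnable queries cross-attend to image features."""

    def __init__(self, num_queries=4, dim=16, lm_dim=32):
        super().__init__()
        # Learnable query tokens, shared across all images.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=2, batch_first=True)
        # Projection into the language model's embedding size.
        self.to_lm = nn.Linear(dim, lm_dim)

    def forward(self, image_feats):
        q = self.queries.expand(image_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, image_feats, image_feats)
        return self.to_lm(attended)


def freeze(module):
    """Disable gradients for every parameter and switch to eval mode."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()


image_encoder = freeze(nn.Linear(8, 16))     # hypothetical frozen image encoder
language_model = freeze(nn.Linear(32, 100))  # hypothetical frozen language model
q_former = TinyQFormer()                     # the only trainable part

patches = torch.randn(2, 10, 8)              # 2 images, 10 patches each
with torch.no_grad():
    image_feats = image_encoder(patches)
query_embeds = q_former(image_feats)         # shape: (2, 4, 32)

# Only Q-Former parameters would be handed to the optimizer.
trainable = [name for name, p in q_former.named_parameters() if p.requires_grad]
print(query_embeds.shape, len(trainable))
```

Note that the number of query embeddings stays fixed (4 here) regardless of how many patches the image encoder produces, which is what lets a small module sit between two large frozen models.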

The model's objective is simply to predict the next text token, given the query embeddings and the previous text.
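
As a rough illustration of that objective, the sketch below lists the (context, target) pairs a next-token predictor would see when the query embeddings act as a prefix to the text. The function name and the `<query_i>` placeholders are hypothetical; in the real model the prefix consists of embedding vectors, not strings.

```python
def next_token_examples(num_queries, text_tokens):
    """List the (context, target) pairs for next-token prediction with a visual prefix."""
    visual_prefix = [f"<query_{i}>" for i in range(num_queries)]
    examples = []
    for i, target in enumerate(text_tokens):
        # Context = all query embeddings plus the text seen so far.
        context = visual_prefix + text_tokens[:i]
        examples.append((context, target))
    return examples


examples = next_token_examples(2, ["a", "dog", "running"])
for context, target in examples:
    print(context, "->", target)
```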

Direct Use and Downstream Use

You can use the raw model for conditional text generation given an image and optional text. Check the model hub for versions fine-tuned on a task that interests you.

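Bias, Risks, Limitations, and Ethical Considerations

BLIP2-OPT uses off-the-shelf OPT as its language model. It inherits the same risks and limitations described in Meta's model card:

Like other large language models for which the diversity (or lack thereof) of training data induces downstream impact on the quality of the model, OPT-175B has limitations in terms of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern large language models.

BLIP2 is fine-tuned on image-text datasets (e.g. LAION) collected from the internet. As a result, the model itself is potentially vulnerable to generating equivalently inappropriate content or replicating inherent biases in the underlying data.

BLIP2 has not been tested in real-world applications and should not be directly deployed in any application. Researchers should first carefully assess the safety and fairness of the model in relation to the specific context in which it will be deployed.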

💻 Usage Examples

Basic usage

For code examples, refer to the documentation.
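
As a minimal sketch of the basic flow, the helper below wraps the processor and model calls used throughout this page. `caption_image` is a hypothetical name, and the `max_new_tokens` default is an arbitrary choice. The helper assumes `processor` is a `Blip2Processor` and `model` is a `Blip2ForConditionalGeneration`. When no prompt is given, only the image is passed, so the model generates a caption.

```python
def caption_image(image, processor, model, prompt=None, max_new_tokens=30):
    """Generate text for one image; with no prompt the model produces a caption."""
    if prompt is None:
        inputs = processor(images=image, return_tensors="pt")
    else:
        inputs = processor(images=image, text=prompt, return_tensors="pt")
    # Move inputs to the model's device and match its floating-point dtype.
    inputs = inputs.to(model.device, model.dtype)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```

Given `processor`, `model`, and `raw_image` built as in the sections that follow, `caption_image(raw_image, processor, model)` would caption the demo image, and passing `prompt="how many dogs are in the picture?"` would ask a question about it.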

Running the model on CPU

import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor (image preprocessing + tokenizer) and the model weights.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Download the demo image and convert it to RGB.
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Encode the image together with the question, which serves as the text prompt.
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

# Generate token ids and decode them back to text.
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())

Running the model on GPU

In full precision

# pip install accelerate
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
# device_map="auto" lets accelerate place the weights on the available device(s).
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())

In half precision (`float16`)

# pip install accelerate
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
# Load the weights as float16; the inputs below are cast to float16 to match.
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())

In 8-bit precision (`int8`)

# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
# load_in_8bit=True quantizes the weights to 8 bits via bitsandbytes.
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
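
The three GPU variants above mainly differ in how many bytes each weight occupies. As a back-of-the-envelope sketch, the snippet below estimates weight memory for the 2.7 billion parameters of the OPT language model alone. It is an approximation under stated assumptions: the image encoder, the Q-Former, activations, and any layers that 8-bit loading may keep in higher precision are all ignored, so real usage will be higher.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

# Language model only, taken from the "2.7 billion parameters" stated above.
OPT_PARAMS = 2_700_000_000


def weight_memory_gb(num_params, dtype):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(OPT_PARAMS, dtype):.1f} GB")
# float32: ~10.8 GB
# float16: ~5.4 GB
# int8: ~2.7 GB
```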