instructblip-flan-t5-xxl_8bit: an open-source vision-language model for generating image descriptions and answering visual questions


Instructblip Flan T5 Xxl 8bit

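Developed by Mediocreatmybest

BLIP-2 is a vision-language model built on Flan T5-xxl. It is pre-trained with the image encoder and the large language model kept frozen, and it supports tasks such as image captioning and visual question answering.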

Image-to-Text

Transformers

English · Open-source license: MIT · #image-captioning #visual-question-answering #multimodal-fusion

Downloads: 18

Release date: 8/8/2023

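Model Overview

The BLIP-2 model consists of a CLIP-like image encoder, a Querying Transformer (Q-Former), and a large language model (Flan T5-xxl). Training only the Querying Transformer bridges the gap between the vision and language modalities, which lets the model generate text from images.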

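Model Features

Multimodal pre-training

Combines a vision encoder with a large language model for cross-modal understanding and generation.

Parameter efficiency

Only the Querying Transformer (Q-Former) is trained; the parameters of the image encoder and the language model stay frozen.

Zero-shot capability

The pre-trained model can be applied directly to downstream tasks (e.g. VQA) without fine-tuning.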
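To make the parameter-efficiency point concrete, the following sketch computes the share of parameters that would be trained when only the Q-Former is updated. The parameter counts used here are rough, order-of-magnitude assumptions chosen for illustration, not figures taken from this model card, and `trainable_fraction` is a hypothetical helper.

```python
def trainable_fraction(trainable_params: float, frozen_params: float) -> float:
    """Return the share of all parameters that receive gradient updates."""
    total = trainable_params + frozen_params
    if total <= 0:
        raise ValueError("total parameter count must be positive")
    return trainable_params / total


# Illustrative, assumed sizes in billions of parameters (not measured values).
ASSUMED_QFORMER_B = 0.2        # trainable Q-Former
ASSUMED_IMAGE_ENCODER_B = 1.0  # frozen image encoder
ASSUMED_LLM_B = 11.0           # frozen Flan T5-xxl

fraction = trainable_fraction(
    ASSUMED_QFORMER_B, ASSUMED_IMAGE_ENCODER_B + ASSUMED_LLM_B
)
print(f"Trainable share under these assumptions: {fraction:.1%}")
```

Under assumptions like these, only a small percentage of the weights would need gradients and optimizer state, which is the practical benefit of freezing the two large components.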

Model Capabilities

Image captioning

Visual question answering (VQA)

Image-grounded dialogue generation

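Use Cases

Content generation

Automatic image tagging

Generates natural-language descriptions for images.

Can produce text descriptions that match the image content.

Intelligent interaction

Visual question answering system

Answers natural-language questions about image content.

Can answer questions such as 'How many dogs are in the picture?'.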

🚀 BLIP-2, Flan T5-xxl, pre-trained only

This BLIP-2 model uses Flan T5-xxl, a large language model. It was introduced by Li et al. in the paper BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models and was first released in this repository.

Disclaimer: The team that released BLIP-2 did not write a model card for this model, so this model card was written by the Hugging Face team.

🚀 Quick Start

This model can be used for conditional text generation given an image and an optional text prompt. See the sections below for details.

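✨ Key Features

Model overview

BLIP-2 consists of three models: a CLIP-like image encoder, a Querying Transformer (Q-Former), and a large language model.

The authors initialize the weights of the image encoder and the large language model from pre-trained checkpoints and keep them frozen while training the Querying Transformer. The Querying Transformer is a BERT-like Transformer encoder that maps a set of "query tokens" to query embeddings, which bridge the gap between the embedding space of the image encoder and that of the large language model.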
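The data flow described above can be sketched by tracking tensor shapes alone. The default sizes below (a 224-pixel image with 14-pixel patches, a vision width of 1408, 32 queries of width 768, and a language-model width of 4096) are assumptions for illustration, and `Blip2Shapes` is a hypothetical helper rather than part of any library. The final projection from the Q-Former width to the language-model width is likewise assumed.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Blip2Shapes:
    """Shape bookkeeping for one image; all defaults are assumed values."""
    image_size: int = 224     # square input resolution in pixels
    patch_size: int = 14      # side length of one image patch
    vision_width: int = 1408  # feature width of the frozen image encoder
    num_queries: int = 32     # learned query tokens in the Q-Former
    qformer_width: int = 768  # feature width of the Q-Former
    llm_width: int = 4096     # embedding width of the frozen language model

    def image_features(self) -> Tuple[int, int]:
        # One feature per patch plus one class token from the image encoder.
        patches = (self.image_size // self.patch_size) ** 2
        return (patches + 1, self.vision_width)

    def query_embeddings(self) -> Tuple[int, int]:
        # The Q-Former returns one embedding per query token,
        # independent of how many image features it attends to.
        return (self.num_queries, self.qformer_width)

    def llm_prefix(self) -> Tuple[int, int]:
        # Assumed projection of the query embeddings into the LLM's space.
        return (self.num_queries, self.llm_width)


shapes = Blip2Shapes()
print("image features:  ", shapes.image_features())
print("query embeddings:", shapes.query_embeddings())
print("LLM prefix:      ", shapes.llm_prefix())
```

The point of the sketch is that the language model always receives the same small number of query embeddings, no matter how many patch features the image encoder produces.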

The model's objective is to predict the next text token given the query embeddings and the previous text.


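This allows the model to be used for tasks such as:

Image captioning
Visual question answering (VQA)
Chat-like conversations, by feeding the image and the previous conversation to the model as the prompt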
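A minimal sketch of how the three tasks above might map to text prompts. The `Question: ... Answer:` template and the `build_prompt` helper are assumptions for illustration; the usage examples in this card pass the raw question string directly, and the most effective template may differ between checkpoints.

```python
from typing import Optional, Sequence, Tuple


def build_prompt(
    task: str,
    question: Optional[str] = None,
    history: Sequence[Tuple[str, str]] = (),
) -> Optional[str]:
    """Build a text prompt for captioning, VQA, or chat-like use.

    Returns None for captioning, meaning the model is given the image only.
    """
    if task == "caption":
        return None
    if task not in ("vqa", "chat"):
        raise ValueError(f"unknown task: {task!r}")
    if not question:
        raise ValueError("a question is required for this task")
    if task == "vqa":
        return f"Question: {question} Answer:"
    # Chat: replay earlier (question, answer) turns, then ask the new question.
    turns = [f"Question: {q} Answer: {a}." for q, a in history]
    turns.append(f"Question: {question} Answer:")
    return " ".join(turns)


print(build_prompt("vqa", "how many dogs are in the picture?"))
print(build_prompt("chat", "what color is it?", [("is there a dog?", "yes")]))
```

The returned string would be passed to the processor together with the image; for captioning, the text argument would simply be omitted.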

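Direct Use and Downstream Use

The raw model can be used for conditional text generation given an image and an optional text prompt. See the model hub for fine-tuned versions on a task that interests you.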

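Bias, Risks, Limitations, and Ethical Considerations

BLIP2-FlanT5 uses an off-the-shelf Flan-T5 as its language model, so it inherits the same risks and limitations as Flan-T5.

According to Rae et al. (2021), language models, including Flan-T5, can potentially be used to generate text in harmful ways. Flan-T5 should not be used directly in any application without a prior assessment of the safety and fairness concerns specific to that application.

BLIP2 is fine-tuned on image-text datasets collected from the internet (e.g. LAION). As a result, the model itself is potentially vulnerable to generating similarly inappropriate content or replicating biases inherent in the underlying data.

BLIP2 has not been tested in real-world applications and should not be deployed directly in any application. Researchers should first carefully assess the safety and fairness of the model in relation to the specific context in which it will be deployed.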

💻 Usage Examples

Basic usage

Running the model on CPU


import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the BLIP-2 processor and model (weights stay on the CPU by default)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl")

# Download the demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Encode the image together with the question
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

# Generate an answer and decode it to text
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

Advanced usage

Running the model on GPU

In full precision


# pip install accelerate
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# device_map="auto" lets accelerate place the weights on the available devices
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", device_map="auto")

# Download the demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Encode the inputs and move them to the GPU
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

# Generate an answer and decode it to text
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

In half precision (`float16`)


# pip install accelerate
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the weights in float16 to roughly halve the memory needed for them
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16, device_map="auto")

# Download the demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Move the inputs to the GPU; floating-point tensors are cast to float16
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

# Generate an answer and decode it to text
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

In 8-bit precision (`int8`)


# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# load_in_8bit=True quantizes the weights to 8-bit via bitsandbytes
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", load_in_8bit=True, device_map="auto")

# Download the demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Move the inputs to the GPU; floating-point tensors are cast to float16
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

# Generate an answer and decode it to text
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
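
The three precision settings above differ mainly in how many bytes each weight occupies. The sketch below estimates weight memory from a parameter count. The 12-billion figure is a round-number assumption for illustration, `estimate_weight_memory_gb` is a hypothetical helper, and the estimate ignores activations, framework overhead, and any modules that a quantization backend keeps in higher precision.

```python
# Bytes used to store a single weight at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def estimate_weight_memory_gb(num_params: int, dtype: str) -> float:
    """Estimate the memory taken by the weights alone, in gigabytes (1e9 bytes)."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unsupported dtype: {dtype!r}")
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumed round-number parameter count, used only for illustration.
ASSUMED_NUM_PARAMS = 12_000_000_000

for dtype in BYTES_PER_PARAM:
    gb = estimate_weight_memory_gb(ASSUMED_NUM_PARAMS, dtype)
    print(f"{dtype}: about {gb:.0f} GB of weights")
```

An estimate like this is only a lower bound on what a device needs, but it helps when deciding which of the three loading options is worth trying on given hardware.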