blip-large-long-cap: an open-source image caption generator, free to use for text-to-image prompts and dataset annotation

Blip Large Long Cap

Developed by unography

A long-form image caption generator fine-tuned from BLIP, suited to text-to-image prompts and image dataset annotation.

Image-to-Text

Transformers

Open-source license: BSD-3-Clause #long-form image captioning #text-to-image prompt generation #image dataset annotation

Downloads: 26.87k

Release date: 4/16/2024

Model overview

This is an image captioning model fine-tuned from the BLIP architecture. It is optimized for generating long captions and is suited to producing text-to-image prompts and annotating image datasets.

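Model features

Long caption generation

Optimized for long image captions, with support for lengths of up to 300 tokens

Works across varied scenes

Can caption a wide range of scenes, such as natural landscapes and human activities

Conditional and unconditional generation

Supports both conditional and unconditional image captioning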

Model capabilities

Image-to-text conversion

Long caption generation

Image content analysis

Image understanding across varied scenes

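Use cases

Text-to-image generation

Prompt generation for AI art

Supplies text-to-image systems with detailed, descriptive prompts

Produces detailed prompt text that AI art systems can use directly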
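
As a rough illustration of this use case, the sketch below turns a generated caption into a text-to-image prompt by normalizing whitespace, dropping the trailing period, and appending optional style tags. The function name `caption_to_prompt`, the example tags, and the `max_chars` limit are hypothetical and are not part of the model or its card; adapt them to whatever your image generator expects.

```python
from typing import Iterable, Optional


def caption_to_prompt(
    caption: str,
    style_tags: Iterable[str] = (),
    max_chars: Optional[int] = None,
) -> str:
    """Build a comma-separated prompt from a caption plus optional style tags."""
    # Collapse runs of whitespace and remove trailing spaces and periods.
    text = " ".join(caption.split()).rstrip(" .")
    parts = [text] if text else []
    # Append non-empty style tags after the caption.
    parts.extend(tag.strip() for tag in style_tags if tag.strip())
    prompt = ", ".join(parts)
    if max_chars is not None and len(prompt) > max_chars:
        # Cut at the last space inside the limit so no word is split.
        prompt = prompt[:max_chars].rsplit(" ", 1)[0].rstrip(" ,.")
    return prompt


example = caption_to_prompt(
    "a woman sitting on the beach.  the sea is calm. ",
    style_tags=["golden hour", "35mm photo"],
)
print(example)
```

The character limit is only a crude stand-in for a token limit; a generator with a strict token budget would need its own tokenizer to truncate accurately.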

Image dataset annotation

Automatic image annotation

Generates detailed, descriptive annotations for image datasets

Reduces manual annotation work and makes dataset annotation more efficient
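
A minimal sketch of what such an annotation pass could look like: it walks a folder of images and writes one JSON line per image. The helper names (`iter_images`, `annotate_folder`), the set of file suffixes, and the `file_name`/`text` record layout are assumptions made for illustration. The captioning step is passed in as `caption_fn`, so any function that maps an image path to a string (for example, one wrapping this model) can be plugged in.

```python
import json
from pathlib import Path
from typing import Callable, Iterator

# Assumed set of image extensions; extend as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def iter_images(folder: Path) -> Iterator[Path]:
    """Yield image files directly inside `folder`, in sorted order."""
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            yield path


def annotate_folder(folder, caption_fn: Callable[[Path], str], out_path) -> int:
    """Write one {"file_name", "text"} JSON object per image; return the count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for image_path in iter_images(Path(folder)):
            record = {"file_name": image_path.name, "text": caption_fn(image_path)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Keeping the model call behind `caption_fn` lets the file handling be checked with a dummy function before spending time on real inference.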

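🚀 LongCap: a BLIP model fine-tuned for long image captions, suited to text-to-image prompts and to captioning text-to-image datasets.

LongCap is a model fine-tuned to generate long captions for images. It is useful for producing prompts for text-to-image generation and for captioning text-to-image datasets.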

🚀 Quick start

This model can be used for both conditional and unconditional image captioning.
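
The usage examples in this card pass only an image, which is the unconditional mode. For the conditional mode, a text prefix is passed to the processor together with the image, as in the standard BLIP captioning API in Transformers. The sketch below shows both paths under stated assumptions: the helper names `build_inputs` and `caption` are hypothetical, and the card does not say which prefixes, if any, the fine-tuned model handles well.

```python
def build_inputs(processor, image, prompt=None):
    """Prepare model inputs; a non-empty prompt selects conditional captioning."""
    if prompt:
        # Conditional: the caption is generated as a continuation of `prompt`.
        return processor(images=image, text=prompt, return_tensors="pt")
    # Unconditional: only pixel values are produced.
    return processor(images=image, return_tensors="pt")


def caption(image, prompt=None, max_length=250):
    """Load the model and caption a PIL image (reloads on every call; fine for a demo)."""
    # Imported lazily so build_inputs can be used without transformers installed.
    from transformers import BlipProcessor, BlipForConditionalGeneration

    model_id = "unography/blip-large-long-cap"
    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id)
    inputs = build_inputs(processor, image, prompt)
    out = model.generate(**inputs, max_length=max_length)
    return processor.decode(out[0], skip_special_tokens=True)
```

With a PIL image `raw_image`, `caption(raw_image)` would run unconditionally, while `caption(raw_image, prompt="a photograph of")` would condition on that prefix; the prefix here is only an illustrative choice.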

💻 Usage examples

Basic usage

Running the model on CPU

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-large-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-large-long-cap")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt")
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach, wearing a checkered shirt and a dog collar. the woman is interacting with the dog, which is positioned towards the left side of the image. the setting is a beachfront with a calm sea and a golden hue.

Running the model on GPU

In full precision

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-large-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-large-long-cap").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to("cuda")
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach, wearing a checkered shirt and a dog collar. the woman is interacting with the dog, which is positioned towards the left side of the image. the setting is a beachfront with a calm sea and a golden hue.

In half precision (`float16`)

import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-large-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-large-long-cap", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach, wearing a checkered shirt and a dog collar. the woman is interacting with the dog, which is positioned towards the left side of the image. the setting is a beachfront with a calm sea and a golden hue.
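
The three setups above differ only in the device and the dtype. One way to fold them into a single code path is sketched here; `pick_runtime` and `load_model` are hypothetical helper names, and defaulting to `float16` whenever CUDA is available is an assumption, not a recommendation from the card.

```python
def pick_runtime(cuda_available: bool, prefer_half: bool = True):
    """Return (device, dtype_name) for one of the CPU / GPU fp32 / GPU fp16 setups."""
    if not cuda_available:
        # CPU runs use full precision, as in the CPU example.
        return ("cpu", "float32")
    return ("cuda", "float16" if prefer_half else "float32")


def load_model(model_id="unography/blip-large-long-cap", prefer_half=True):
    """Load processor and model on the selected device and dtype."""
    # Imported lazily so pick_runtime can be used without torch installed.
    import torch
    from transformers import BlipProcessor, BlipForConditionalGeneration

    device, dtype_name = pick_runtime(torch.cuda.is_available(), prefer_half)
    dtype = getattr(torch, dtype_name)  # e.g. torch.float16
    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=dtype
    ).to(device)
    return processor, model, device, dtype
```

Inputs would then be moved the same way as in the half-precision example, with `processor(raw_image, return_tensors="pt").to(device, dtype)`, before calling `model.generate`.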

📄 License

This model is released under the BSD 3-Clause License.

📚 Documentation

Model information

Attribute	Details
Pipeline tag	Image-to-text
Tags	Image captioning
Language	English
Dataset	unography/laion-14k-GPT4V-LIVIS-Captions
Inference parameters	Max length: 300