blip-long-cap: a free, open-source image captioning model that generates detailed long-form captions for text-to-image generation and dataset annotation

Blip Long Cap

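Developed by unography

An image captioning model fine-tuned from the BLIP architecture. It excels at generating detailed long-form descriptions and is suited to producing text-to-image prompts and annotating image datasets.

Image-to-Text

Transformers

Open-source license: BSD-3-Clause #long-form image captioning #text-to-image prompts #fine-detail recognition

Downloads: 704

Release date: April 29, 2024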

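Model Overview

This image-to-text model is fine-tuned from the BLIP architecture and specializes in generating detailed, accurate long-form image descriptions. It is suited to producing rich text descriptions of images, particularly as a prompt source for text-to-image models and for automatic annotation of image datasets.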

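Model Features

Long-form caption generation

Generates detailed image captions of up to 250 tokens (the `max_length` used in the usage examples), well beyond the output length of standard image captioning models

High-quality training data

Fine-tuned on a LAION-14K image dataset with captions generated by GPT-4V, which gives high-quality descriptions

Broad scene coverage

Applicable to a wide range of images, from simple objects to complex scenes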

Model Capabilities

Image caption generation

Prompt generation for text-to-image models

Automatic annotation of image datasets

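Use Cases

Content creation

Prompt generation for text-to-image models

Generates detailed, accurate prompts for text-to-image models such as Stable Diffusion

Produces detailed prompts that match the image content, which can improve the output quality of text-to-image models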
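Long captions can exceed what a text-to-image model's text encoder reads; the CLIP text encoder used by Stable Diffusion 1.x, for instance, has a 77-token context. The following is a minimal sketch, not part of the model's own tooling, of trimming a caption to whole sentences under a word budget. The function name `trim_caption` and the `max_words` default are hypothetical, and a word count is only a rough stand-in for a token count.

```python
def trim_caption(caption: str, max_words: int = 50) -> str:
    """Keep leading sentences of `caption` while the total stays within `max_words`.

    Hypothetical helper: word count is a rough proxy for the encoder's token count.
    The first sentence is always kept, even if it alone exceeds the budget.
    """
    # Split on sentence boundaries and drop empty fragments.
    sentences = [s.strip().rstrip(".") for s in caption.split(". ")]
    sentences = [s for s in sentences if s]

    kept, used = [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if kept and used + n_words > max_words:
            break  # adding this sentence would exceed the budget
        kept.append(sentence)
        used += n_words

    return ". ".join(kept) + "." if kept else ""
```

Cutting at sentence boundaries keeps each remaining statement intact, at the cost of dropping later details such as lighting or background.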

Data annotation

Automatic annotation of image datasets

Automatically generates detailed descriptions for large image datasets

Reduces manual annotation cost and speeds up annotation
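As a rough illustration of the dataset-annotation use case, the sketch below writes one JSON line per image. It is an assumption-laden example rather than an official workflow: `annotate_images` is a hypothetical helper, `caption_fn` stands for any function that returns a caption for an image path (for example, a wrapper around this model), and the `file_name` and `text` field names are illustrative choices.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def annotate_images(
    image_paths: Iterable[Path],
    caption_fn: Callable[[Path], str],
    out_path: Path,
) -> int:
    """Write one JSON record per image to `out_path` and return the record count."""
    count = 0
    with Path(out_path).open("w", encoding="utf-8") as out_file:
        for image_path in image_paths:
            # Field names are illustrative assumptions, not a required schema.
            record = {
                "file_name": Path(image_path).name,
                "text": caption_fn(image_path),
            }
            out_file.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Passing the captioner in as a function keeps the file handling separate from model loading, so the loop can be checked with a stub before running it over a large dataset.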

🚀 LongCap: BLIP fine-tuned for long image captions

A model optimized for generating long image captions, suited to text-to-image prompts and to captioning text-to-image datasets.

🚀 Quick Start

This model can be used for both conditional and unconditional image captioning.
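The difference between the two modes can be sketched as follows, assuming a `processor` and `model` loaded with `from_pretrained` as in the usage examples in this document. The helper name `caption_image` and the `prefix` parameter are hypothetical. Conditional captioning passes a text prefix to the processor alongside the image; how well this fine-tuned checkpoint follows a prefix is not stated in the source.

```python
from typing import Any, Optional


def caption_image(processor: Any, model: Any, image: Any,
                  prefix: Optional[str] = None, **generate_kwargs: Any) -> str:
    """Caption `image`; if `prefix` is given, the caption is conditioned on it."""
    if prefix is None:
        # Unconditional: the processor only prepares pixel values.
        inputs = processor(images=image, return_tensors="pt")
    else:
        # Conditional: the text prefix is tokenized and passed with the image.
        inputs = processor(images=image, text=prefix, return_tensors="pt")
    output_ids = model.generate(**inputs, **generate_kwargs)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

A call might look like `caption_image(processor, model, raw_image, prefix="a photograph of", max_length=250, num_beams=3, repetition_penalty=2.5)`; omitting `prefix` gives the unconditional behavior.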

💻 Usage Examples

Basic usage

Running the model on CPU

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the processor and model from the Hugging Face Hub (the model stays on CPU)
processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap")

# Download the demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Preprocess the image, then generate a caption with beam search
inputs = processor(raw_image, return_tensors="pt")
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
print(processor.decode(out[0], skip_special_tokens=True))
# Example output:
# a woman sitting on the sand, interacting with a dog wearing a blue and white checkered collar. the dog is positioned to the left of the woman, who is holding something in their hand. the background features a serene beach setting with waves crashing onto the shore. there are no other animals or people visible in the image. the time of day appears to be either early morning or late afternoon, based on the lighting and shadows.

Advanced usage

Running the model on GPU in full precision

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the processor and model, then move the model to the GPU
processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap").to("cuda")

# Download the demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Preprocess the image and move the tensors to the same device as the model
inputs = processor(raw_image, return_tensors="pt").to("cuda")
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
print(processor.decode(out[0], skip_special_tokens=True))
# Example output:
# a woman sitting on the sand, interacting with a dog wearing a blue and white checkered collar. the dog is positioned to the left of the woman, who is holding something in their hand. the background features a serene beach setting with waves crashing onto the shore. there are no other animals or people visible in the image. the time of day appears to be either early morning or late afternoon, based on the lighting and shadows.

Running the model on GPU in half precision (`float16`)

import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the weights in float16 and move the model to the GPU
processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap", torch_dtype=torch.float16).to("cuda")

# Download the demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Preprocess the image and cast the tensors to float16 on the GPU to match the model
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
print(processor.decode(out[0], skip_special_tokens=True))
# Example output:
# a woman sitting on the sand, interacting with a dog wearing a blue and white checkered collar. the dog is positioned to the left of the woman, who is holding something in their hand. the background features a serene beach setting with waves crashing onto the shore. there are no other animals or people visible in the image. the time of day appears to be either early morning or late afternoon, based on the lighting and shadows.
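The three examples differ only in the target device and the weight precision. A minimal sketch of choosing between them at run time is shown here; `select_runtime` is a hypothetical helper, and it returns plain strings so that it can be checked without PyTorch installed.

```python
from typing import Tuple


def select_runtime(cuda_available: bool, prefer_half: bool = True) -> Tuple[str, str]:
    """Return (device, dtype_name) matching one of the three setups shown above."""
    if not cuda_available:
        # The CPU example keeps the default full-precision weights.
        return "cpu", "float32"
    # On GPU, choose between the half-precision and full-precision examples.
    return ("cuda", "float16") if prefer_half else ("cuda", "float32")
```

With PyTorch available, this could be used as `device, dtype_name = select_runtime(torch.cuda.is_available())`, then `torch_dtype=getattr(torch, dtype_name)` in `from_pretrained` and `.to(device)` on the model and inputs.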

📄 License

This model is released under the BSD 3-Clause License.

Property	Details
Pipeline tag	image-to-text
Language	English
Dataset	unography/laion-14k-GPT4V-LIVIS-Captions
Inference parameters	max_length: 250, num_beams: 3, repetition_penalty: 2.5
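To give a sense of what the `repetition_penalty` value of 2.5 does, here is a simplified, plain-Python sketch of the usual rescaling rule: logits of tokens that have already been generated are divided by the penalty when positive and multiplied by it when negative. It is intended to mirror the behavior of the `repetition_penalty` option in Transformers, but it is an illustration only; `apply_repetition_penalty` is a hypothetical name, and the library operates on tensors over the full vocabulary.

```python
from typing import List, Sequence


def apply_repetition_penalty(logits: Sequence[float],
                             seen_token_ids: Sequence[int],
                             penalty: float = 2.5) -> List[float]:
    """Make already-generated tokens less likely by rescaling their logits."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Positive scores shrink; negative scores become more negative.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted
```

For example, with logits `[2.0, -1.0, 0.5]` and tokens 0 and 1 already generated, the adjusted scores are `[0.8, -2.5, 0.5]`, so the unseen token 2 gains ground relative to the repeated ones.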