blip-long-cap: Open-Source Image Captioning Model - Free Detailed Long Captions for Text-to-Image Prompts and Dataset Annotation

Blip Long Cap

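Developed by unography

An image captioning model fine-tuned from the BLIP architecture. It specializes in generating detailed, long-form descriptions, suitable for text-to-image prompts and image dataset annotation.

Image-to-Text

Transformers

License: BSD-3-Clause #long-image-captioning #text-to-image-prompt-generation #fine-grained-detail-recognition

Downloads: 704

Released: 4/29/2024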

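Model Overview

This is an image-to-text model fine-tuned from the BLIP architecture, focused on generating detailed, accurate long descriptions of images. It produces rich text descriptions that work well as prompts for text-to-image models or as automatic annotations for image datasets.

Model Features

Long caption generation

Generates detailed image descriptions of up to 250 tokens (the `max_length` used at inference), well beyond the output length of standard image captioning models

High-quality training data

Fine-tuned on a LAION-14K dataset whose captions were generated by GPT-4V, which yields high-quality descriptions

Broad applicability

Suitable for captioning a wide range of images, from simple objects to complex scenes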

Model Capabilities

Image caption generation

Text-to-image prompt generation

Automatic image dataset annotation

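Use Cases

Content creation

Text-to-image prompt generation

Generate detailed, accurate prompts for text-to-image models such as Stable Diffusion

Prompts that describe the image content more closely can improve the output quality of text-to-image models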
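LongCap captions can run to several sentences, while the CLIP text encoder used by Stable Diffusion 1.x reads at most 77 tokens. The sketch below shows one way to shorten a caption by keeping only whole sentences within a word budget. The function name `trim_caption` and the `max_words` default are hypothetical, and word count is only a rough stand-in for the tokenizer's real token count.

```python
def trim_caption(caption: str, max_words: int = 50) -> str:
    """Keep leading sentences of a caption until a word budget is reached.

    Hypothetical helper: `max_words` approximates a token limit and is an
    assumption, not a value taken from the model card.
    """
    # Naive sentence split on periods; enough for LongCap-style captions.
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    kept = []
    used = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Always keep the first sentence, then stop once the budget is exceeded.
        if kept and used + n_words > max_words:
            break
        kept.append(sentence)
        used += n_words
    return (". ".join(kept) + ".") if kept else ""
```

Cutting at sentence boundaries keeps the prompt grammatical; for an exact limit, count tokens with the tokenizer of the target text-to-image model instead of words.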

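Data annotation

Automatic image dataset annotation

Automatically generate detailed descriptions for large-scale image datasets

Can substantially reduce manual annotation cost and speed up labeling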
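As a rough illustration of the annotation workflow, the sketch below walks a folder of images and writes one JSON record per image to a JSON Lines file. The helper `annotate_folder`, the `file_name` and `text` field names, and the set of accepted file extensions are all assumptions made for this example; `caption_fn` stands for any function that turns an image path into a caption, for instance a wrapper around the LongCap model.

```python
import json
from pathlib import Path
from typing import Callable

# Assumed set of image extensions; adjust to the dataset at hand.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def annotate_folder(image_dir, output_path, caption_fn: Callable[[Path], str]) -> int:
    """Write one {"file_name", "text"} record per image and return the image count."""
    image_dir = Path(image_dir)
    # Collect the images first, in a stable order, before opening the output file.
    paths = sorted(p for p in image_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES)
    with open(output_path, "w", encoding="utf-8") as f:
        for path in paths:
            record = {"file_name": path.name, "text": caption_fn(path)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(paths)
```

Passing the captioner in as a callable keeps the file handling separate from the model, so the same loop can be tried with a placeholder function before running the real model over a large dataset.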

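🚀 LongCap: a fine-tuned BLIP for generating long image captions, for text-to-image prompts and image dataset captioning

LongCap is fine-tuned from the BLIP model and generates detailed, long descriptions of images. These descriptions can serve as prompts for text-to-image generation or as captions for image datasets.

🚀 Quick Start

You can use this model for both conditional and unconditional image captioning.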
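The difference between the two modes can be sketched as follows. In unconditional captioning only the image is given to the processor; in conditional captioning a text prefix is passed as well and the model continues it. The helper names `load_longcap` and `generate_caption` are hypothetical, the generation settings are the inference parameters listed on this page, and how well this fine-tuned checkpoint follows a text prefix is not documented here, so treat the conditional path as an experiment.

```python
def load_longcap(model_id="unography/blip-long-cap"):
    """Load the model and processor (imports kept local so the helper below stays light)."""
    from transformers import BlipProcessor, BlipForConditionalGeneration

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id)
    return model, processor


def generate_caption(model, processor, image, prompt=None):
    """Caption `image`; if `prompt` is given, generate conditionally on that text prefix."""
    if prompt is None:
        # Unconditional: the processor only produces pixel_values.
        inputs = processor(image, return_tensors="pt")
    else:
        # Conditional: the processor also tokenizes the prefix into input_ids.
        inputs = processor(image, prompt, return_tensors="pt")
    out = model.generate(**inputs, max_length=250, num_beams=3, repetition_penalty=2.5)
    return processor.decode(out[0], skip_special_tokens=True)
```

A call such as `generate_caption(model, processor, raw_image, prompt="a photograph of")` would exercise the conditional path, where `raw_image` is a PIL image and the prefix text is only an example.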

💻 Usage Examples

Basic Usage

Running the model on CPU

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the processor and the fine-tuned captioning model from the Hugging Face Hub
processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap")

# Download the demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Unconditional captioning: only the image is passed to the model
inputs = processor(raw_image, return_tensors="pt")
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
print(processor.decode(out[0], skip_special_tokens=True))
# Example output:
# a woman sitting on the sand, interacting with a dog wearing a blue and white checkered collar. the dog is positioned to the left of the woman, who is holding something in their hand. the background features a serene beach setting with waves crashing onto the shore. there are no other animals or people visible in the image. the time of day appears to be either early morning or late afternoon, based on the lighting and shadows.

Advanced Usage

Running the model on GPU in full precision

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the processor, then load the model and move it to the GPU
processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap").to("cuda")

# Download the demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# The inputs must live on the same device as the model
inputs = processor(raw_image, return_tensors="pt").to("cuda")
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
print(processor.decode(out[0], skip_special_tokens=True))
# Example output:
# a woman sitting on the sand, interacting with a dog wearing a blue and white checkered collar. the dog is positioned to the left of the woman, who is holding something in their hand. the background features a serene beach setting with waves crashing onto the shore. there are no other animals or people visible in the image. the time of day appears to be either early morning or late afternoon, based on the lighting and shadows.

Running the model on GPU in half precision (`float16`)

import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the weights in float16 and move the model to the GPU
processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap", torch_dtype=torch.float16).to("cuda")

# Download the demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Cast the inputs to float16 as well so they match the model's dtype
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
print(processor.decode(out[0], skip_special_tokens=True))
# Example output:
# a woman sitting on the sand, interacting with a dog wearing a blue and white checkered collar. the dog is positioned to the left of the woman, who is holding something in their hand. the background features a serene beach setting with waves crashing onto the shore. there are no other animals or people visible in the image. the time of day appears to be either early morning or late afternoon, based on the lighting and shadows.
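The three variants above differ only in the device and the dtype. A small helper like the following could pick both at run time. The function `choose_runtime` is hypothetical; it takes the CUDA check as a plain boolean so the choice can be reasoned about without a GPU, and it returns the dtype as a name to be resolved against `torch`.

```python
def choose_runtime(cuda_available: bool, prefer_half: bool = True):
    """Return (device, dtype_name) following the three examples on this page."""
    if cuda_available:
        # On GPU, float16 roughly halves memory use; float32 is the full-precision path.
        return "cuda", ("float16" if prefer_half else "float32")
    # On CPU the page's example uses the default full precision.
    return "cpu", "float32"


# Sketch of how the result might be used (requires torch and transformers):
#   import torch
#   device, dtype_name = choose_runtime(torch.cuda.is_available())
#   model = BlipForConditionalGeneration.from_pretrained(
#       "unography/blip-long-cap", torch_dtype=getattr(torch, dtype_name)
#   ).to(device)
#   inputs = processor(raw_image, return_tensors="pt").to(device, getattr(torch, dtype_name))
```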

📄 License

This project is licensed under the BSD 3-Clause License.

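📋 Model Information

Property	Details
Model type	Image captioning model
Training data	unography/laion-14k-GPT4V-LIVIS-Captions
Inference parameters	Max length: 250; number of beams: 3; repetition penalty: 2.5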
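To give a feel for what a repetition penalty of 2.5 does during decoding, here is a simplified, pure-Python illustration of the commonly used rule: the score of every token that has already been generated is divided by the penalty when it is positive and multiplied by it when it is negative, so repeated tokens become less likely either way. This is a sketch of the idea on a plain list of scores, not the Transformers implementation, and the function name is hypothetical.

```python
def apply_repetition_penalty(scores, generated_ids, penalty=2.5):
    """Return a copy of `scores` with already-generated token ids made less likely.

    `scores` is a list of raw (pre-softmax) scores indexed by token id.
    """
    adjusted = list(scores)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Negative scores are pushed further down, positive ones are shrunk.
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


# Tokens 0 and 1 were generated before; token 2 is left untouched.
print(apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1]))
```

A penalty of 1.0 leaves the scores unchanged; larger values discourage repetition more strongly, which suits long captions where phrases could otherwise loop.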