🚀 LongCap: Fine-tuned BLIP for generating long image captions, suitable as prompts for text-to-image generation and for captioning image datasets
LongCap is fine-tuned from the BLIP model to generate detailed, long captions for images. These captions can be used as prompts for text-to-image generation, or as descriptions when annotating image datasets.
🚀 Quick Start
You can use this model for both conditional and unconditional image captioning.
💻 Usage Examples
Basic Usage
Running the model on CPU
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt")
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
print(processor.decode(out[0], skip_special_tokens=True))
>>> a woman sitting on the sand, interacting with a dog wearing a blue and white checkered collar. the dog is positioned to the left of the woman, who is holding something in their hand. the background features a serene beach setting with waves crashing onto the shore. there are no other animals or people visible in the image. the time of day appears to be either early morning or late afternoon, based on the lighting and shadows.
```
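The quick-start note above also mentions conditional captioning. A minimal sketch, assuming this fine-tuned checkpoint accepts a text prefix the same way base BLIP does (the prompt string here is illustrative, not from the model card):

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Conditional captioning: pass a text prefix alongside the image;
# the model continues the caption from this prompt.
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")
out = model.generate(**inputs, max_length=250, num_beams=3, repetition_penalty=2.5)
print(processor.decode(out[0], skip_special_tokens=True))
```

The decoded output includes the prompt followed by the model's continuation.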
Advanced Usage
Running the model on GPU in full precision
```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to("cuda")
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
print(processor.decode(out[0], skip_special_tokens=True))
>>> a woman sitting on the sand, interacting with a dog wearing a blue and white checkered collar. the dog is positioned to the left of the woman, who is holding something in their hand. the background features a serene beach setting with waves crashing onto the shore. there are no other animals or people visible in the image. the time of day appears to be either early morning or late afternoon, based on the lighting and shadows.
```
Running the model on GPU in half precision (float16)
```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
print(processor.decode(out[0], skip_special_tokens=True))
>>> a woman sitting on the sand, interacting with a dog wearing a blue and white checkered collar. the dog is positioned to the left of the woman, who is holding something in their hand. the background features a serene beach setting with waves crashing onto the shore. there are no other animals or people visible in the image. the time of day appears to be either early morning or late afternoon, based on the lighting and shadows.
```
📄 License
This project is released under the BSD 3-Clause License.
📋 Model Information

| Property | Details |
| --- | --- |
| Model type | Image captioning model |
| Training data | unography/laion-14k-GPT4V-LIVIS-Captions |
| Inference parameters | max length: 250; num beams: 3; repetition penalty: 2.5 |
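The inference parameters listed above can be bundled into a reusable `GenerationConfig` instead of being repeated in every `generate` call. A minimal sketch (the variable name is illustrative):

```python
from transformers import GenerationConfig

# Mirror the inference parameters from the model information table.
long_cap_generation = GenerationConfig(
    max_length=250,          # cap on the total generated sequence length
    num_beams=3,             # beam search width
    repetition_penalty=2.5,  # discourage repeated phrases in long captions
)

# Usage with any of the examples above:
# out = model.generate(pixel_values=pixel_values, generation_config=long_cap_generation)
print(long_cap_generation.num_beams)
```

This keeps the decoding settings in one place if you caption many images or switch between CPU and GPU scripts.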