BLIP-Long Cap Open-Source Image Captioning Model: Free Generation of Detailed Long Texts for Text-to

Blip Long Cap

Developed by unography

An image captioning model fine-tuned from BLIP that generates detailed, long-text descriptions, suitable for text-to-image prompts and image dataset annotation.

Image-to-Text

Transformers

Open Source License: BSD-3-Clause #Long-text image description #Text-to-image prompt generation #Multi-detail recognition

Downloads 704

Release Time: 4/29/2024

Model Overview

This is a vision-to-text model fine-tuned from BLIP that specializes in generating detailed, accurate long image descriptions. It is well suited to producing rich textual descriptions of images, particularly as a source of prompts for text-to-image models or for automatic annotation of image datasets.

Model Features

Long description generation

Generates detailed image descriptions of up to 250 tokens (the max_length used in the usage examples), far exceeding the output length of standard image captioning models

High-quality training data

Fine-tuned on LAION-14K images with GPT4V-generated captions (the unography/laion-14k-GPT4V-LIVIS-Captions dataset), chosen for detailed, high-quality descriptions

Multi-scenario applicability

Suitable for description generation across various image scenarios, from simple objects to complex scenes

Model Capabilities

Image caption generation

Text-to-image prompt generation

Automatic image dataset annotation

Use Cases

Content creation

Text-to-image prompt generation

Generates detailed and accurate prompts for text-to-image models (e.g., Stable Diffusion)

Produces more detailed prompts that better match image content, improving output quality of text-to-image models

Data annotation

Automatic image dataset annotation

Automatically generates detailed descriptions for large-scale image datasets

Reduces manual annotation effort by producing detailed draft captions that annotators can review instead of writing from scratch
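A minimal sketch of how automatic annotation could be organized, assuming the images sit in one local folder and the captions are stored as JSON Lines. The function name `annotate_folder`, the `file_name` and `text` keys, and the suffix list are hypothetical choices for illustration, not part of the model or its card. `caption_fn` stands in for any callable that wraps the processor and `model.generate` calls from the usage examples.

```python
import json
from pathlib import Path
from typing import Callable, Union

# Assumption: which file extensions count as images in this sketch.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def annotate_folder(
    image_dir: Union[str, Path],
    output_path: Union[str, Path],
    caption_fn: Callable[[Path], str],
) -> int:
    """Write one JSON line per image and return the number of images captioned."""
    image_dir = Path(image_dir)
    # Sort so that the output order is stable across runs.
    paths = sorted(
        p for p in image_dir.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    with open(output_path, "w", encoding="utf-8") as f:
        for path in paths:
            record = {"file_name": path.name, "text": caption_fn(path)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(paths)
```

Passing the captioner as a callable keeps the file handling separate from the model, so the same loop works on CPU or GPU and can be dry-run with a placeholder function.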

🚀 LongCap: Finetuned BLIP for generating long captions of images, suitable for prompts for text-to-image generation and captioning text-to-image datasets

LongCap is a finetuned model based on BLIP, designed to generate long captions for images. It's well-suited for providing prompts in text-to-image generation and captioning text-to-image datasets.

🚀 Quick Start

This model can be used for both conditional and unconditional image captioning.
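The difference between the two modes can be sketched as follows: in conditional captioning a text prefix is passed to the processor together with the image, while in unconditional captioning only the image is passed. `caption_image` is a hypothetical helper written for this illustration. It expects an already loaded `BlipProcessor` and `BlipForConditionalGeneration`, and any prefix such as "a photograph of" is an assumption, not something the card prescribes.

```python
from typing import Any, Optional


def caption_image(
    image: Any,
    processor: Any,
    model: Any,
    prompt: Optional[str] = None,
    **generate_kwargs: Any,
) -> str:
    """Caption `image`; when `prompt` is given, the caption continues that text."""
    if prompt is None:
        # Unconditional: the processor receives only the image.
        inputs = processor(images=image, return_tensors="pt")
    else:
        # Conditional: the text prefix is tokenized alongside the image.
        inputs = processor(images=image, text=prompt, return_tensors="pt")
    # `inputs` is dict-like, so it can be unpacked into generate().
    out = model.generate(**inputs, **generate_kwargs)
    return processor.decode(out[0], skip_special_tokens=True)
```

With the processor and model loaded as in the usage examples, `caption_image(raw_image, processor, model, max_length=250, num_beams=3, repetition_penalty=2.5)` gives the unconditional case, and adding `prompt="a photograph of"` gives the conditional one.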

💻 Usage Examples

Basic Usage

Running the model on CPU

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt")
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
print(processor.decode(out[0], skip_special_tokens=True))
# Example output:
# a woman sitting on the sand, interacting with a dog wearing a blue and white checkered collar. the dog is positioned to the left of the woman, who is holding something in their hand. the background features a serene beach setting with waves crashing onto the shore. there are no other animals or people visible in the image. the time of day appears to be either early morning or late afternoon, based on the lighting and shadows.

Advanced Usage

Running the model on GPU

In full precision

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to("cuda")
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
print(processor.decode(out[0], skip_special_tokens=True))
# Example output:
# a woman sitting on the sand, interacting with a dog wearing a blue and white checkered collar. the dog is positioned to the left of the woman, who is holding something in their hand. the background features a serene beach setting with waves crashing onto the shore. there are no other animals or people visible in the image. the time of day appears to be either early morning or late afternoon, based on the lighting and shadows.

In half precision (`float16`)

import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("unography/blip-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-long-cap", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250, num_beams=3, repetition_penalty=2.5)
print(processor.decode(out[0], skip_special_tokens=True))
# Example output:
# a woman sitting on the sand, interacting with a dog wearing a blue and white checkered collar. the dog is positioned to the left of the woman, who is holding something in their hand. the background features a serene beach setting with waves crashing onto the shore. there are no other animals or people visible in the image. the time of day appears to be either early morning or late afternoon, based on the lighting and shadows.
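The CPU, full-precision GPU, and half-precision GPU examples differ only in the target device and dtype. A small sketch of choosing between them at run time follows. `select_runtime` is a hypothetical helper, and it takes the result of the CUDA check as an argument so the decision logic stays readable without importing torch. Falling back to `float32` on CPU is an assumption made here because half precision is typically used on GPU.

```python
from typing import Tuple


def select_runtime(cuda_available: bool, prefer_half: bool = True) -> Tuple[str, str]:
    """Return (device, dtype_name) matching one of the three usage examples."""
    if cuda_available:
        # GPU: half precision if requested, otherwise full precision.
        return ("cuda", "float16" if prefer_half else "float32")
    # CPU: stay in full precision (assumption of this sketch).
    return ("cpu", "float32")
```

With PyTorch installed, `device, dtype_name = select_runtime(torch.cuda.is_available())` followed by `getattr(torch, dtype_name)` yields the values to pass as `torch_dtype=` and to `.to(...)` in the examples.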

📄 License

This project is licensed under the BSD 3-Clause License.

📋 Model Information

Property	Details
Model Type	Image-to-text
Training Data	unography/laion-14k-GPT4V-LIVIS-Captions

🧪 Inference Parameters

Parameter	Value
max_length	250
num_beams	3
repetition_penalty	2.5
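The values in the table are the keyword arguments passed to `model.generate` in the usage examples. A short sketch of keeping them in one place and overriding them per call is given below. `GENERATION_DEFAULTS` and `build_generate_kwargs` are hypothetical names introduced for illustration, and the comments summarize the general meaning of these parameters in the Transformers generation API.

```python
from typing import Any, Dict

GENERATION_DEFAULTS: Dict[str, Any] = {
    "max_length": 250,          # maximum length of the generated sequence, in tokens
    "num_beams": 3,             # number of beams used by beam search
    "repetition_penalty": 2.5,  # values above 1.0 penalize already generated tokens
}


def build_generate_kwargs(**overrides: Any) -> Dict[str, Any]:
    """Return a fresh copy of the defaults with any per-call overrides applied."""
    return {**GENERATION_DEFAULTS, **overrides}
```

For example, `model.generate(pixel_values=pixel_values, **build_generate_kwargs(max_length=120))` would shorten the caption budget while keeping the other two settings from the table.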
