blip-large-long-cap Open Source Image Caption Generator - Free for Text-to-Image Prompts and Dataset Annotation

Blip Large Long Cap

Developed by unography

A long-caption image description generator fine-tuned from BLIP, suited to text-to-image prompts and image dataset annotation

Image-to-Text

Transformers

Open Source License: BSD-3-Clause #Long text image description #Text-to-image prompt generation #Image dataset annotation

Downloads 26.87k

Release Time: 4/16/2024

Model Overview

This is an image captioning model fine-tuned from the BLIP architecture and optimized for generating long descriptions. It is suited to producing prompts for text-to-image generation and to annotating image datasets.

Model Features

Long-text description generation

Optimized for generating long image descriptions of up to 300 tokens

Multi-scenario application

Suitable for image description generation in various scenarios, including natural scenes, human activities, etc.

Conditional and unconditional generation

Supports both conditional and unconditional image description generation modes

Model Capabilities

Image-to-text

Long-text description generation

Image content analysis

Multi-scenario image understanding

Use Cases

Text-to-image generation

AI painting prompt generation

Provides detailed descriptive prompts for text-to-image generation systems

Generates detailed prompt texts usable for AI painting systems

Image dataset annotation

Automatic image annotation

Generates detailed descriptive annotations for image datasets

Reduces manual annotation workload and improves dataset annotation efficiency
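The annotation workflow described above can be sketched as a small loop that walks a folder and writes one caption per image to a JSON Lines file. This is a minimal sketch, not part of the model card: `annotate_folder`, `IMAGE_SUFFIXES`, and the `file_name`/`text` record keys are hypothetical choices, and `caption_fn` is assumed to be any callable that takes an image path and returns a caption string (for example, a wrapper around this model).

```python
import json
from pathlib import Path

# Hypothetical set of file extensions treated as images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def annotate_folder(image_dir, caption_fn, out_path):
    """Write one JSON line per image found directly inside image_dir.

    caption_fn is assumed to map an image path to a caption string.
    Returns the number of images annotated.
    """
    image_paths = sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    with open(out_path, "w", encoding="utf-8") as f:
        for path in image_paths:
            # Record layout is an assumption; adapt the keys to your dataset format.
            record = {"file_name": path.name, "text": caption_fn(path)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(image_paths)
```

Passing the captioner in as a function keeps the file handling independent of the model, so the loop can be tried with a stub before running the full model over a large dataset.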

🚀 LongCap: Finetuned BLIP for Long-Caption Image Generation

LongCap is a finetuned version of BLIP designed to generate long captions for images. It is well-suited for generating prompts for text-to-image generation and captioning text-to-image datasets.

🚀 Quick Start

LongCap can be used for both conditional and unconditional image captioning.
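The difference between the two modes comes down to whether a text prefix is passed along with the image. The helper below is a minimal sketch under stated assumptions: `generate_caption` and its parameters are hypothetical names, `model` and `processor` are assumed to be a `BlipForConditionalGeneration` and `BlipProcessor` that the caller has already loaded, and the model is assumed to sit on the same `device` as the inputs in full precision.

```python
def generate_caption(model, processor, image, prompt=None, max_length=250, device="cpu"):
    """Caption an image, optionally continuing a text prompt.

    prompt=None  -> unconditional captioning (image only).
    prompt="..." -> conditional captioning (the caption continues the prompt).
    """
    # With text=None the processor returns only pixel values; with a prompt
    # it also returns the tokenized prefix (input_ids, attention_mask).
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
    out = model.generate(**inputs, max_length=max_length)
    return processor.decode(out[0], skip_special_tokens=True)
```

With a loaded model and processor, `generate_caption(model, processor, raw_image)` produces an unconditional caption, while `generate_caption(model, processor, raw_image, prompt="a photograph of")` asks the model to continue the given prefix. The prompt string here is only an illustration.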

✨ Features

Pipeline Tag: Image-to-text
Tags: Image-captioning
Supported Languages: English
License: BSD-3-Clause
Datasets: unography/laion-14k-GPT4V-LIVIS-Captions
Inference Parameters: `max_length = 300`

💻 Usage Examples

Basic Usage

Running the model on CPU

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the processor (image preprocessing and tokenizer) and the captioning model
processor = BlipProcessor.from_pretrained("unography/blip-large-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-large-long-cap")

# Download the demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Unconditional captioning: only the pixel values are passed to generate()
inputs = processor(raw_image, return_tensors="pt")
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach, wearing a checkered shirt and a dog collar. the woman is interacting with the dog, which is positioned towards the left side of the image. the setting is a beachfront with a calm sea and a golden hue.

Advanced Usage

Running the model on GPU

In full precision

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the processor, then load the model and move it to the GPU
processor = BlipProcessor.from_pretrained("unography/blip-large-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-large-long-cap").to("cuda")

# Download the demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# The inputs must be on the same device as the model
inputs = processor(raw_image, return_tensors="pt").to("cuda")
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach, wearing a checkered shirt and a dog collar. the woman is interacting with the dog, which is positioned towards the left side of the image. the setting is a beachfront with a calm sea and a golden hue.

In half precision (`float16`)

import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the model weights in float16 and move the model to the GPU
processor = BlipProcessor.from_pretrained("unography/blip-large-long-cap")
model = BlipForConditionalGeneration.from_pretrained("unography/blip-large-long-cap", torch_dtype=torch.float16).to("cuda")

# Download the demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Move the inputs to the GPU and cast them to float16 to match the model
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
pixel_values = inputs.pixel_values
out = model.generate(pixel_values=pixel_values, max_length=250)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach, wearing a checkered shirt and a dog collar. the woman is interacting with the dog, which is positioned towards the left side of the image. the setting is a beachfront with a calm sea and a golden hue.

📄 License

This project is licensed under the BSD-3-Clause license.
