Florence-2-SD3-Captioner Open-Source Image Captioning Model - Generate High-Quality Image Text Descriptions for Free

Florence 2 SD3 Captioner

Developed by gokaygokay

Florence-2-SD3-Captioner is an image captioning model built on the Florence-2 architecture and designed to generate high-quality image captions.

Image-to-Text

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Image Caption Generation #Multimodal Understanding #Art Content Analysis

Downloads 80.06k

Release Time: 6/24/2024

Model Overview

This model combines visual and language processing capabilities to generate detailed and accurate descriptive text from input images, suitable for scenarios such as artistic creation and content generation.

Model Features

High-quality Image Captioning

Capable of generating detailed and accurate image captions, suitable for artistic creation and content generation.

Multi-task Support

Supports various task prompts, such as detailed descriptions, keyword extraction, etc.

Efficient Inference

Speeds up inference with optimizations such as flash_attn.
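As a rough illustration of how flash_attn could be made optional, the sketch below picks an attention backend name from two flags. The helper names `flash_attn_available` and `choose_attention` are hypothetical. The strings `"flash_attention_2"` and `"sdpa"` are values accepted by the `attn_implementation` argument of `from_pretrained` in recent transformers releases, but whether this model's remote code honors that argument is an assumption, not something this page states.

```python
import importlib.util


def flash_attn_available() -> bool:
    """Return True if the flash_attn module can be located (hypothetical helper)."""
    try:
        return importlib.util.find_spec("flash_attn") is not None
    except (ImportError, ValueError):
        return False


def choose_attention(has_flash_attn: bool, has_cuda: bool) -> str:
    """Pick an attention backend name.

    Assumption: flash attention is only worth requesting when the package
    is installed AND a CUDA GPU is present; otherwise fall back to "sdpa".
    """
    if has_flash_attn and has_cuda:
        return "flash_attention_2"
    return "sdpa"


# Example: a CPU-only machine falls back to "sdpa" even if flash_attn is installed.
print(choose_attention(flash_attn_available(), has_cuda=False))
```

The returned string could then be passed as `attn_implementation=...` when loading the model, if the model code supports it.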

Model Capabilities

Image Caption Generation

Multi-task Processing

High-quality Text Output

Use Cases

Artistic Creation

Artwork Description Generation

Generates detailed descriptive text for artworks, facilitating archiving and display.

Produces natural and accurate descriptive text.

Content Generation

Social Media Content Generation

Generates engaging captions for social media images.

Enhances content appeal and readability.

🚀 Florence-2-SD3-Captioner

A model for image-text-to-text tasks, built on the transformers library.

🚀 Quick Start

This project is designed for image-text-to-text tasks. You can start using it by following the steps below.

📦 Installation

First, you need to install the necessary dependencies. Run the following command:

pip install -q datasets flash_attn timm einops
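Before loading the model, it can help to confirm that the packages are importable. The sketch below is one way to do that with the standard library only. The module list is an assumption pieced together from the install command above and the imports in the usage example (Pillow is imported as `PIL`), and `missing_modules` is a hypothetical helper name.

```python
import importlib.util

# Assumed import names: taken from the pip command and the usage example.
REQUIRED_MODULES = [
    "transformers", "torch", "PIL", "requests",
    "datasets", "flash_attn", "timm", "einops",
]


def missing_modules(names):
    """Return the subset of module names that cannot be found."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

Any name printed here would need to be installed before the example in the next section can run.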

💻 Usage Examples

Basic Usage

The following is a basic example of using the model to describe an image:

from transformers import AutoModelForCausalLM, AutoProcessor
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModelForCausalLM.from_pretrained("gokaygokay/Florence-2-SD3-Captioner", trust_remote_code=True).to(device).eval()
processor = AutoProcessor.from_pretrained("gokaygokay/Florence-2-SD3-Captioner", trust_remote_code=True)

# Function to run the model on an example
def run_example(task_prompt, text_input, image):
    prompt = task_prompt + text_input

    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer

from PIL import Image
import requests

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
answer = run_example("<DESCRIPTION>", "Describe this image in great detail.", image)
final_answer = answer['<DESCRIPTION>']
print(final_answer)

# 'Captured at eye-level on a sunny day, a light blue Volkswagen Beetle is parked on a cobblestone street. The beetle is parked in front of a yellow building with two brown doors. The door on the right side of the frame is white, while the left side is a darker shade of blue. The car is facing the camera, and the car is positioned in the middle of the street.'
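Two details of the example are easy to miss: the prompt is the task token concatenated directly with the instruction text, and the parsed result is a dictionary keyed by that same task token. The sketch below isolates both steps so they can be checked without loading the model. `build_prompt` and `extract_caption` are hypothetical helper names, and the dictionary shape is assumed from the `answer['<DESCRIPTION>']` lookup above.

```python
def build_prompt(task_prompt: str, text_input: str = "") -> str:
    """Join the task token and instruction exactly as run_example does."""
    return task_prompt + text_input


def extract_caption(parsed_answer: dict, task_prompt: str) -> str:
    """Pull the caption string out of the parsed result.

    Assumption: the result maps the task token to a caption string,
    as in answer['<DESCRIPTION>'] in the example above.
    """
    caption = parsed_answer.get(task_prompt)
    if not isinstance(caption, str):
        raise KeyError(f"No caption found for task {task_prompt!r}")
    return caption.strip()


prompt = build_prompt("<DESCRIPTION>", "Describe this image in great detail.")
print(prompt)

# A stand-in for the dictionary returned by run_example.
fake_answer = {"<DESCRIPTION>": "  A light blue car parked on a street.  "}
print(extract_caption(fake_answer, "<DESCRIPTION>"))
```

Raising on a missing key makes a mismatched task token fail loudly instead of silently yielding an empty caption.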

📄 License

This project is licensed under the Apache 2.0 license.

📚 Documentation

Dataset Information

Property	Details
Training Data	google/docci, google/imageinwords, ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions
Library Name	transformers
Pipeline Tag	image-text-to-text
Tags	art
