Florence-2-Flux-Large Open-source Vision-Language Model - Achieve image understanding and text generation for free.

Florence 2 Flux Large

Developed by gokaygokay

A vision-language model based on Microsoft Florence-2-large, excelling in image understanding and text generation tasks

Supports Multiple Languages

Open Source License: Apache-2.0

#Image-text generation #Multimodal understanding #High-precision description

Downloads 14.96k

Release Time: 8/25/2024

Model Overview

This is a multimodal model based on the Florence-2 architecture, capable of processing image and text inputs to generate high-quality text descriptions and responses.

Model Features

Multimodal understanding

Capable of processing both image and text inputs, understanding visual content and generating relevant text

High-quality description generation

Can generate detailed and accurate image descriptions

Strong task adaptability

Can adapt to different vision-language tasks through task prompts

Model Capabilities

Image understanding

Text generation

Image caption generation

Visual question answering

Use Cases

Content understanding and generation

Image caption generation

Generate detailed and accurate textual descriptions for images

Produces natural language descriptions that match the image content

Visual question answering

Answer natural language questions about image content

Provides accurate and relevant answers

Assistive tools

Visual content analysis

Analyze image content and extract key information

Structured output of important elements and relationships in the image

🚀 Florence-2-Flux-Large

An image-text-to-text model based on microsoft/Florence-2-large, enabling various art-related tasks.

🚀 Quick Start

To quickly get started with the Florence-2-Flux-Large model, you need to install the necessary dependencies and run the example code.

📦 Installation

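First, install the required libraries using the following command. The usage example also imports transformers, torch, Pillow (as PIL), and requests, so these must be available in your environment as well: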

pip install -q datasets flash_attn timm einops
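Before loading the model, it can help to confirm that the modules are actually importable. The following is a minimal sketch, not part of the original model card: the helper name `missing_modules` and the list `REQUIRED_MODULES` are illustrative, and the import names are assumed to match the pip packages above plus the modules the usage example imports.

```python
from importlib.util import find_spec

# Import names assumed to correspond to the pip packages listed above, plus
# the modules imported by the usage example (Pillow is imported as "PIL").
REQUIRED_MODULES = [
    "datasets", "flash_attn", "timm", "einops",
    "transformers", "torch", "PIL", "requests",
]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

`find_spec` only locates a top-level module without importing it, so the check stays fast even for heavy packages such as torch.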

💻 Usage Examples

Basic Usage

The following is a basic example demonstrating how to use the Florence-2-Flux-Large model for image description tasks:

from transformers import AutoModelForCausalLM, AutoProcessor
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModelForCausalLM.from_pretrained("gokaygokay/Florence-2-Flux-Large", trust_remote_code=True).to(device).eval()
processor = AutoProcessor.from_pretrained("gokaygokay/Florence-2-Flux-Large", trust_remote_code=True)

# Function to run the model on an example
def run_example(task_prompt, text_input, image):
    prompt = task_prompt + text_input

    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        repetition_penalty=1.10,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer

from PIL import Image
import requests

# Download a sample image and ask the model for a detailed description
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
answer = run_example("<DESCRIPTION>", "Describe this image in great detail.", image)

# The parsed result is a dict keyed by the task prompt
final_answer = answer["<DESCRIPTION>"]
print(final_answer)
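The example relies on two small conventions: the prompt is the task token followed directly by the instruction text, and the parsed result is looked up by that same task token. The sketch below isolates both steps so they can be tried without loading the model. The helper names `build_prompt` and `extract_answer` are hypothetical, and the sample dictionary is a stand-in with a made-up caption, not real model output.

```python
def build_prompt(task_prompt: str, text_input: str = "") -> str:
    """Mirror run_example: the task token is followed directly by the instruction."""
    return task_prompt + text_input


def extract_answer(parsed: dict, task_prompt: str) -> str:
    """Return the text stored under the task token, with surrounding whitespace removed."""
    if task_prompt not in parsed:
        raise KeyError(f"No entry for task prompt {task_prompt!r} in parsed output")
    return str(parsed[task_prompt]).strip()


# Stand-in for the dict returned by run_example; the caption text is invented.
sample_output = {"<DESCRIPTION>": "  A placeholder caption.  "}

print(build_prompt("<DESCRIPTION>", "Describe this image in great detail."))
print(extract_answer(sample_output, "<DESCRIPTION>"))
```

Raising a `KeyError` with the task token in the message makes a mismatch between the prompt used for generation and the key used for lookup easier to spot.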

📚 Documentation

Model Information

Property	Details
Library Name	transformers
Pipeline Tag	image-text-to-text
Tags	art
Base Model	microsoft/Florence-2-large
Datasets	kadirnar/fluxdev_controlnet_16k

📄 License

This project is licensed under the Apache-2.0 License.
