CogFlorence-2-Large-Freeze: An Open-Source Model for Free, Precise Image-to-Text Generation

Cogflorence 2 Large Freeze

Developed by thwri

This is a fine-tuned version of the microsoft/Florence-2-large model for image-to-text tasks. It was trained on a 38,000-image subset of the Ejafa/ye-pop dataset, using annotations generated by CogVLM2.

Image-to-Text

Transformers

Supports Multiple Languages

Open Source License: MIT

#Image Fine Annotation #Multimodal Understanding #Artistic Image Analysis

Downloads 419

Release Time: 7/4/2024

Model Overview

This model is a vision-language model that generates detailed textual descriptions from input images. It is fine-tuned from microsoft/Florence-2-large to improve its image annotation capabilities.

Model Features

High-Quality Image Annotation

Capable of generating detailed and accurate image descriptions, capturing key elements and details in the image.

Large-Scale Data Fine-Tuning

Fine-tuned on a 38,000-image subset of Ejafa/ye-pop, which covers a wide variety of subjects, to improve the model's generalization ability.

Frozen Visual Encoder

Keeps visual encoder parameters unchanged during training, focusing on optimizing text generation capabilities.
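
A minimal sketch of what freezing an encoder means in PyTorch, using a toy two-layer model rather than Florence-2 itself. The class and attribute names (`ToyCaptioner`, `vision_encoder`, `text_decoder`) are hypothetical, and this is not the author's training script; it only illustrates that frozen parameters receive no gradients and are left out of the optimizer.

```python
import torch
from torch import nn


class ToyCaptioner(nn.Module):
    """Toy stand-in for a vision-language model (hypothetical names)."""

    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(16, 8)  # plays the role of the image encoder
        self.text_decoder = nn.Linear(8, 4)     # plays the role of the text generator

    def forward(self, x):
        return self.text_decoder(self.vision_encoder(x))


def freeze(module: nn.Module) -> None:
    """Disable gradient tracking for every parameter of the module."""
    for param in module.parameters():
        param.requires_grad = False


model = ToyCaptioner()
freeze(model.vision_encoder)

# Hand only the still-trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=4.2667e-5)

# One dummy training step on random data.
before = model.vision_encoder.weight.detach().clone()
loss = model(torch.randn(2, 16)).pow(2).mean()
loss.backward()
optimizer.step()

# The frozen encoder weights are bit-for-bit identical after the step.
encoder_unchanged = torch.equal(before, model.vision_encoder.weight)
print(encoder_unchanged)
```

Filtering the parameter list before building the optimizer is optional, since parameters without gradients are skipped anyway, but it keeps optimizer state from being allocated for weights that never change.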

Model Capabilities

Image Understanding

Detailed Image Description Generation

Multi-Element Scene Analysis

Use Cases

Content Generation

Automatic Image Annotation

Automatically generates detailed descriptions for images in a library.

Improves image retrieval efficiency and accessibility.
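
As a rough illustration of the library-annotation use case, the sketch below walks a folder, captions each image, stores the captions in a JSON index, and runs a keyword search over them. The function names, the suffix list, and the index layout are all assumptions made for this example; `caption_fn` is a placeholder that in practice could wrap the `run_example` function from the usage example further down.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

# Assumed set of image extensions; adjust to the library at hand.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def annotate_library(
    image_dir: Path,
    caption_fn: Callable[[Path], str],
    index_path: Path,
) -> Dict[str, str]:
    """Caption every image in image_dir and save {filename: caption} as JSON."""
    captions = {}
    for path in sorted(image_dir.iterdir()):
        if path.suffix.lower() in IMAGE_SUFFIXES:
            captions[path.name] = caption_fn(path)
    index_path.write_text(json.dumps(captions, indent=2), encoding="utf-8")
    return captions


def search(captions: Dict[str, str], keyword: str) -> List[str]:
    """Return filenames whose caption contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [name for name, text in captions.items() if keyword in text.lower()]
```

Passing the captioner in as a function keeps the indexing logic independent of the model, so it can be exercised with a stub before any weights are downloaded.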

Assistive Technology

Visual Assistance

Generates detailed text descriptions of image content, which screen readers or text-to-speech tools can read aloud for visually impaired individuals.

Enhances accessibility of digital content.

🚀 microsoft/Florence-2-large tuned on Ejafa/ye-pop captioned with CogVLM2

This repository holds a fine-tuned version of the microsoft/Florence-2-large model. It has been tuned on a 38,000-image subset of the Ejafa/ye-pop dataset. The captions for this tuning were generated using THUDM/cogvlm2-llama3-chat-19B, enhancing the model's image-to-text capabilities.

✨ Features

The model is a fine-tuned variant of microsoft/Florence-2-large, optimized for image captioning.
It has been trained on a diverse image dataset, Ejafa/ye-pop, to improve its generalization ability.

📦 Installation

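The original model card gives no installation steps. The usage example below imports transformers, torch, PIL (Pillow), and requests, so those packages must be available in the Python environment.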

💻 Usage Examples

Basic Usage

from transformers import AutoModelForCausalLM, AutoProcessor
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModelForCausalLM.from_pretrained("thwri/CogFlorence-2-Large-Freeze", trust_remote_code=True).to(device).eval()
processor = AutoProcessor.from_pretrained("thwri/CogFlorence-2-Large-Freeze", trust_remote_code=True)

# Function to run the model on an example
def run_example(task_prompt, image):
    prompt = task_prompt

    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=True
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer

from PIL import Image
import requests

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
result = run_example("<MORE_DETAILED_CAPTION>", image)
print(result)

# {'<MORE_DETAILED_CAPTION>': 'a turquoise volkswagen beetle parked on a cobblestone street in front of a yellow wall with two wooden doors. the car's body is painted in a vibrant shade of teal, with a glossy finish that reflects the sunlight, and the wheels are polished with a silver hubcap. the building behind the car has a weathered, aged appearance, with visible cracks and peeling paint. the sky above is clear and blue, suggesting a sunny day.'}
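
The printed result is a dictionary keyed by the task prompt, so downstream code usually needs to pull the caption string out of it. The helper below is a small sketch of that step; `extract_caption` and `split_sentences` are hypothetical names, and `sample_result` is an abbreviated copy of the output shown above rather than a fresh model run.

```python
import re
from typing import Dict, List

TASK = "<MORE_DETAILED_CAPTION>"


def extract_caption(result: Dict[str, str], task: str = TASK) -> str:
    """Return the caption stored under the task key, or '' if it is missing."""
    return result.get(task, "").strip()


def split_sentences(caption: str) -> List[str]:
    """Split a caption on whitespace that follows ., ! or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", caption) if s]


# Abbreviated stand-in for the dictionary printed by the usage example.
sample_result = {
    TASK: "a turquoise volkswagen beetle parked on a cobblestone street "
          "in front of a yellow wall with two wooden doors. "
          "the sky above is clear and blue, suggesting a sunny day."
}

caption = extract_caption(sample_result)
sentences = split_sentences(caption)
print(len(sentences))
```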

📚 Documentation

Training Details

Vision Encoder: The vision encoder was frozen during training.
Batch Size: 32
Gradient Accumulation Steps: 8
Learning Rate: 4.2667e-5
Optimizer: AdamW
Scheduler: linear
Epochs: 7
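
To put these numbers in context, the arithmetic below works out the effective batch size, the number of optimizer steps, and a linear learning-rate decay. It is a sketch under stated assumptions that the model card does not confirm: a single device, a batch size of 32 per forward pass, the final partial batch kept, no warmup, and a decay that reaches zero at the last step.

```python
import math

NUM_IMAGES = 38_000        # size of the ye-pop subset
BATCH_SIZE = 32            # assumed to be per forward pass on one device
GRAD_ACCUM_STEPS = 8
EPOCHS = 7
PEAK_LR = 4.2667e-5

# Samples contributing to each optimizer update.
effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS

# Optimizer updates per epoch, keeping the final partial batch (assumption).
steps_per_epoch = math.ceil(NUM_IMAGES / effective_batch)
total_steps = steps_per_epoch * EPOCHS


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step: linear decay to zero, no warmup."""
    remaining = max(0, total_steps - step)
    return PEAK_LR * (remaining / total_steps)


print(effective_batch)   # 256
print(steps_per_epoch)   # 149
print(total_steps)       # 1043
```

Under these assumptions each optimizer update averages gradients over 256 images, and the whole run amounts to roughly a thousand updates.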

Dataset

The fine-tuning process used a 38,000-image subset from the Ejafa/ye-pop dataset. This dataset contains a wide variety of images with different subjects, offering a solid training foundation for enhancing the model's captioning capabilities.

Captioning

The captions were generated using THUDM/cogvlm2-llama3-chat-19B.

📄 License

This project is licensed under the MIT license.

Property	Details
Model Type	Fine-tuned version of `microsoft/Florence-2-large`
Training Data	38,000-image subset of `Ejafa/ye-pop`
Caption Generation Model	`THUDM/cogvlm2-llama3-chat-19B`
License	MIT
