CogFlorence-2.1-Large: An Open-Source Image-to-Text Model

Developed by thwri

This model is a fine-tuned version of microsoft/Florence-2-large, trained on a subset of 40,000 images from the Ejafa/ye-pop dataset, with annotations generated by THUDM/cogvlm2-llama3-chat-19B, focusing on image-to-text tasks.

Image-to-Text

Transformers

Supports Multiple Languages
Open Source License: MIT
#Fine-grained Image Annotation #Multimodal Generation #Art Scene Understanding

Downloads 2,541

Release Time: 7/27/2024

Model Overview

This model is primarily used for image-to-text tasks and generates detailed image descriptions. It was fine-tuned on 40,000 captioned images to improve its captioning ability.

Model Features

High-Quality Image Annotation

Capable of generating detailed and accurate image descriptions, suitable for images of various themes.

Large-Scale Dataset Training

Fine-tuned on a 40,000-image subset of the Ejafa/ye-pop dataset, which covers a wide range of subjects.

Frozen Visual Encoder

The visual encoder was frozen during training, preserving the original model's visual feature extraction capabilities.
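
Freezing a submodule in PyTorch usually means disabling gradients for its parameters so that the optimizer leaves them unchanged. The sketch below illustrates the idea on a toy model. `ToyCaptioner`, its `vision_tower` and `decoder` attributes, and `freeze_module` are hypothetical names chosen for illustration; they are not taken from the actual training script, which the card does not publish.

```python
import torch
from torch import nn


class ToyCaptioner(nn.Module):
    """Hypothetical stand-in for a vision-language model (not Florence-2 itself)."""

    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(8, 4)  # plays the role of the visual encoder
        self.decoder = nn.Linear(4, 2)       # plays the role of the text decoder

    def forward(self, x):
        return self.decoder(self.vision_tower(x))


def freeze_module(module: nn.Module) -> None:
    """Disable gradient tracking for every parameter of `module`."""
    for param in module.parameters():
        param.requires_grad = False


model = ToyCaptioner()
freeze_module(model.vision_tower)

# Only parameters that still require gradients are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=5.12e-05)

# One training step on random data: the frozen weights must not change.
before = model.vision_tower.weight.clone()
loss = model(torch.randn(3, 8)).sum()
loss.backward()
optimizer.step()
frozen_unchanged = torch.equal(before, model.vision_tower.weight)
```

With the encoder frozen, only the remaining parameters receive updates, which is one common way to keep pretrained visual features intact during fine-tuning.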

Model Capabilities

Image Description Generation

Multi-theme Image Analysis

High-Quality Text Output

Use Cases

Image Annotation

Detailed Image Description

Generates detailed textual descriptions for images, suitable for content management and retrieval.

Produces descriptive text including details such as colors, shapes, backgrounds, etc.

Content Management

Automated Image Tagging

Automatically generates tags for large volumes of images, improving content management efficiency.

Quickly generates accurate image tags, reducing manual annotation workload.
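
As a rough illustration of the content-management use case, the sketch below captions every image file in a folder and stores the results as JSON. It is a minimal sketch under stated assumptions: `caption_folder`, `save_captions`, `IMAGE_SUFFIXES`, and the `caption_fn` callback are hypothetical names, and the captioning itself is delegated to whatever function the caller supplies.

```python
import json
from pathlib import Path
from typing import Callable, Dict

# Assumed set of file extensions to treat as images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def caption_folder(folder: str, caption_fn: Callable[[Path], str]) -> Dict[str, str]:
    """Map each image file name in `folder` to the caption returned by `caption_fn`."""
    captions = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            captions[path.name] = caption_fn(path)
    return captions


def save_captions(captions: Dict[str, str], out_file: str) -> None:
    """Write the captions to a JSON file so they can be indexed or searched later."""
    text = json.dumps(captions, indent=2, ensure_ascii=False)
    Path(out_file).write_text(text, encoding="utf-8")
```

With this model, `caption_fn` could open the file with PIL and return `run_example("<MORE_DETAILED_CAPTION>", image)["<MORE_DETAILED_CAPTION>"]`, using the `run_example` function from the usage example in this card.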

🚀 microsoft/Florence-2-large tuned on Ejafa/ye-pop captioned with CogVLM2

This repository holds a fine-tuned version of the microsoft/Florence-2-large model. It has been tuned on a 40,000-image subset of the Ejafa/ye-pop dataset, with captions generated using THUDM/cogvlm2-llama3-chat-19B.

✨ Features

This fine-tuned model improves captioning through training on a diverse image dataset.
It can be loaded directly from the Hugging Face Model Hub.

📦 Installation

The original model card does not list installation steps. The usage example below relies on the transformers, torch, Pillow (PIL), and requests packages.

💻 Usage Examples

Basic Usage

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained("thwri/CogFlorence-2.1-Large", trust_remote_code=True).to(device).eval()
processor = AutoProcessor.from_pretrained("thwri/CogFlorence-2.1-Large", trust_remote_code=True)

# Run the model on one image for the given task prompt
def run_example(task_prompt, image):
    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")
    inputs = processor(text=task_prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=True
    )
    # Keep special tokens so the processor can parse the task output
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    return parsed_answer

# Download a sample image and caption it
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
result = run_example("<MORE_DETAILED_CAPTION>", image)
print(result)
# {'<MORE_DETAILED_CAPTION>': 'A vivid, close-up photograph of a classic car, specifically a Volkswagen Beetle, parked on a cobblestone street. The car is painted in a striking shade of turquoise, with a glossy finish that reflects the surrounding environment. The vehicle's rounded shape is accentuated by its rounded tires and chrome detailing. The background reveals a weathered yellow wall with a rustic wooden door, adding to the rustic charm of the scene. The sky above is clear, suggesting a sunny day. The overall style of the image is candid, capturing a moment in time without any posed or staged elements.'}
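
The example above calls `model.generate` with `num_beams=3` and `do_sample=True`. In transformers this combination selects beam search with multinomial sampling, so repeated runs on the same image may produce different captions. The sketch below shows one way to switch between that setting and plain beam search; `generation_kwargs` and its `deterministic` flag are hypothetical names, not part of the model card or of transformers.

```python
def generation_kwargs(deterministic: bool, max_new_tokens: int = 1024, num_beams: int = 3) -> dict:
    """Build keyword arguments for `model.generate`.

    deterministic=True  -> plain beam search (no sampling)
    deterministic=False -> beam search with sampling, as in the example above
    """
    return {
        "max_new_tokens": max_new_tokens,
        "num_beams": num_beams,
        "do_sample": not deterministic,
    }


# Intended use inside run_example (not executed here):
# generated_ids = model.generate(
#     input_ids=inputs["input_ids"],
#     pixel_values=inputs["pixel_values"],
#     **generation_kwargs(deterministic=True),
# )
```

Plain beam search is usually the better choice when captions need to be reproducible, for example when regenerating annotations for a dataset.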

📚 Documentation

Training Details

Vision Encoder: The vision encoder was frozen during training.
Batch Size: 64
Gradient Accumulation Steps: 16
Learning Rate: 5.12e-05
Optimizer: AdamW
Scheduler: polynomial
Epochs: 7.37
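
The card names a polynomial scheduler but does not state its power, end learning rate, warmup, or total step count. The sketch below shows the general shape of polynomial decay starting from the listed learning rate of 5.12e-05. The `polynomial_lr` function, a power of 1.0, an end rate of 0.0, no warmup, and `TOTAL_STEPS = 1000` are all assumptions for illustration.

```python
def polynomial_lr(step: int, total_steps: int, lr_init: float = 5.12e-05,
                  lr_end: float = 0.0, power: float = 1.0) -> float:
    """Learning rate at `step` under polynomial decay from lr_init to lr_end."""
    if step >= total_steps:
        return lr_end
    remaining = 1.0 - step / total_steps  # fraction of training still to go
    return (lr_init - lr_end) * remaining ** power + lr_end


TOTAL_STEPS = 1000  # hypothetical; the card does not state the number of optimizer steps
for step in (0, 250, 500, 1000):
    print(step, polynomial_lr(step, TOTAL_STEPS))
```

With a power of 1.0 the schedule is a straight line from the initial rate down to the end rate; larger powers make the rate drop faster early in training.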

Dataset

The fine-tuning process used a 40,000-image subset of the Ejafa/ye-pop dataset. This dataset contains a wide array of images with varying subjects, providing a robust training ground for improving the model's captioning abilities.

Captioning

The captions were generated using THUDM/cogvlm2-llama3-chat-19B.

📄 License

This project is licensed under the MIT license.
